First time with Common Workflow Language (CWL)

[ meta  prose  cwl  methodology  ngs  ]

If Perl is Glorified Shell, Shell Scripts Are Dead. Long Live Shell Scripts

For those of you who don't know me, I'm your typical beginner bioinformatician. I can script in a few languages but strongly prefer 1990s-2000s dynamically-typed languages like Ruby and Python, and I write algorithms when nothing else is available. A little bit of webdev, databases, cloud, stats and stuff sprinkled in.

When I first started, my friends E and R mentored me in the basics of Illumina NGS data: quality control and mapping. My graduate school was fortunate enough to have a good Beowulf cluster, so the bulk of my analyses needed to be submitted to a PBS/Torque grid. And thus began a quest for knowledge of Linux, shell scripting tips, and an eventual understanding of which tool or language to use for which tasks.

The shell is all some people ever need... but if you've ever seen how simple or complex CLIs built with Python's argparse can be, you've probably gotten frustrated with the shell too. Sure, yaourt and git are mostly bash/C, but in a world where tools, pipelines, and best practices are changing rapidly, it's pretty difficult to give most pipelines a similar level of UI/UX at the CLI in a way that is meaningful or stable. In the world of bioinformatic pipelines, shell scripts are the low-effort way to provide cohesive workflows, yet they are difficult or ugly to test and maintain.
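To show what the argparse comparison means in practice, here is a minimal sketch of a pipeline front end. The program name `align` and the options `--reference` and `--threads` are hypothetical and chosen only for illustration; they do not belong to any real tool.

```python
import argparse


def build_parser():
    """Build the CLI for a hypothetical alignment wrapper (illustrative names only)."""
    parser = argparse.ArgumentParser(
        prog="align",
        description="Toy wrapper around an alignment pipeline",
    )
    # One or more positional FASTQ files.
    parser.add_argument("reads", nargs="+", help="one or more FASTQ files")
    # A required option; argparse reports a usage error if it is missing.
    parser.add_argument("--reference", required=True, help="reference FASTA")
    # Type conversion and defaults come for free.
    parser.add_argument("--threads", type=int, default=1, help="worker threads")
    return parser
```

Calling `build_parser().parse_args()` gives validated, typed arguments plus a generated `--help` screen, which is the part that takes real effort to reproduce in a plain shell script.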

Yet Another Workflow Language?

The truth is, workflow languages are just a symptom of programmer frustration with the instability and ugliness of our craft. It's hard to even say that the CWL v1.0 spec meets most user needs or provides reproducibility in an 'excellent' way. The 'direction' of workflow languages is positive, but for now CWL provides little benefit over traditional make workflows and makes no strong effort toward longevity. CWL is yet another language or syntax to master, doesn't provide a solid ROI for learning, and is more of a burden on developers than a convenience. At least for now.

That said, I'm a novice and a naysayer here, with little bandwidth to participate and make the pull requests necessary to add the features I think CWL needs. I'd really like certain companies to put more effort into CWL than they already have as far as creative features go. Of course Toil and other similarly minded OSS tools compete with some companies for business, but those companies already make a considerable amount of money, given the cost of a contract for what is essentially a clone of Galaxy.

What CWL Gets Right

Coming from the world of Galaxy, I felt that Galaxy freed me from building web applications and from underestimating or micromanaging other capable scientists, and empowered me as a bioinformatics expert to develop and refine my tools/pipelines. The Cheetah language is powerful but woefully under-documented and in horrible taste. I've spent hours of company time learning a system that was eventually abandoned, not because Galaxy was sub-par, but because there weren't enough tool developers and/or UI/UX discussions and compromises to satisfy the project's users. Galaxy is actually a remarkably well-built and well-designed system, even if Cheetah and XML tool configurations are not fun to develop. CWL makes development a much more enjoyable process, although I believe the CWL community would benefit dramatically from investing in cwl-mode or similar development environment tools and/or templates for developers.

With CWL, I feel empowered to specify and iterate on single tools without managing inputs/outputs. Portability is built into the spec, and I *should* be able to run the tool as long as my PATH variable is properly set or I'm working with Dockerized tools. Presentation is a separate concern from pipeline development, so impatient colleagues who just want something to play with on the command line aren't starved for UI or documentation. Usability for web application users and Linux novices is likewise abstracted away, so that the 'correctness' of my pipeline or tools is protected by the spec from improper scrutiny.

Some people live for presentation, and Bioconductor or Galaxy might be a better environment for presenting results, but that's not what CWL addresses or hopes to offer. I can work on a summary R script or Rmd to provide enough visualization to end users without micromanaging data connectivity/flow, documentation, or CLI appeal. Speaking from firsthand experience, it's remarkable that there are positions available to create pipelines of subprocess commands because either Python APIs/SDKs are un(der)developed or shell scripts are seen as ugly, undocumentable, and unmaintainable.

Constructive Criticism for CWL Developers

The spirit of parallelization is still nascent and should be advertised as such. On the more lighthearted side, why say 'scatter' when you mean 'parallel'? Maybe there's a technical reason in the implementation, maybe the nomenclature is over-technical, or maybe it's just developer stubbornness. More oddly, for jobs that are not 'scattered', or even for different tools whose outputs are independent inputs to a collation tool, there is no built-in parallelism in execution, unlike GNU make, which can process components separately. While the DAG treatment in this tool's development has been good, I'd think more than one person has asked some common questions that are treated more completely by the Galaxy community:

  • What if I wanted to make the 3rd input of this scattered array the input for only one tool?
  • Can I add an additional scattered trimming step for each sample that failed pre-alignment QC with exit code > 0?
  • How can I split a length-N array into N/2 pairs with some glob and then use the pairs as named inputs for a subworkflow scattered over the new array (N/2)?
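To make the last question concrete, here is a rough sketch in plain Python (not CWL) of the pairing step described above. It assumes a hypothetical `<sample>_R1.fastq[.gz]` / `<sample>_R2.fastq[.gz]` naming convention; the function name `pair_reads` and the keys `forward` and `reverse` are illustrative choices, not anything defined by the CWL spec.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1.fastq(.gz) and <sample>_R2.fastq(.gz).
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def pair_reads(paths):
    """Group a flat list of N FASTQ paths into N/2 named pairs."""
    mates = defaultdict(dict)
    for path in paths:
        name = path.rsplit("/", 1)[-1]  # match on the file name only
        match = READ_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unrecognized file name: {path}")
        mates[match["sample"]][match["mate"]] = path

    pairs = []
    for sample in sorted(mates):
        found = mates[sample]
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample} is missing a mate")
        # Each pair carries named inputs, ready to hand to a per-sample step.
        pairs.append({"sample": sample, "forward": found["1"], "reverse": found["2"]})
    return pairs
```

The resulting list has N/2 entries, each with named members, which is the shape a subworkflow scattered over pairs would want to receive.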

Many of my opinions about lists/pairs originate from the Galaxy universe. Some might not matter once you've proceeded to the alignment stage and are working with single arrays of alignment files. Some of these questions *seem* dumb (like the 1st one) because the developers haven't anticipated the arbitrary nature of users' workflows, given the flexibility of workflow/tool configuration. The GUIs might help to some degree; I haven't tested them yet. But I don't normally develop or iterate in a graphical environment, and I want to see documentation.

CWL anticipates platform independence, but not entirely. Docker is a nice platform, but it can make debugging difficult. I was left wondering whether my issues came from the under-developed and sensitive 'glob' mechanism for output detection or from my mount points and container spec. I've spent only 10 hours on CWL porting a DNA-seq pipeline from a shell script that took only 2 hours, and I don't have half the tools specified in CWL yet. To be fair, I know how to write shell scripts and anticipate issues with inputs. More than half of the 10 hours was just searching and reading documentation. But at least 2 of the 10 hours were spent debugging the output recognition glob idiosyncrasies, inline JavaScript issues, the size of my local system's temporary directory (at least $TMPDIR is recognized), the fact that temporary subdirectories pollute $TMPDIR by default (I don't know if these are working directories or not), and more.
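One way to separate glob problems from mount-point problems is to check, outside the workflow runner, what a pattern matches in the output directory. The sketch below uses Python's `pathlib` globbing, which is only an approximation: it is an assumption that its matching rules are close enough to the runner's for a sanity check, and the helper name `preview_glob` is made up for illustration.

```python
from pathlib import Path


def preview_glob(outdir, pattern):
    """Return the sorted paths (relative to outdir) that match a glob pattern.

    Uses pathlib's glob semantics, which may differ from a CWL runner's;
    treat the result as a hint, not as proof of what the runner will collect.
    """
    root = Path(outdir)
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
```

If the expected file shows up here but the runner still reports a missing output, the pattern is probably fine and the container's working directory or mounts are the more likely suspects.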

Finally, Concrete Suggestions:

  1. optional support for (de)compression for inputs/outputs
  2. ftp/S3/ssh transfer for inputs/outputs

I know I'm dreaming here, but this could really change cost profiles on storage- or quota-constrained hardware, or in cloud environments where ephemeral storage might outstrip NFS shares. Sure, it might seem like this doesn't fit the use cases of most developers, but imagine users who can afford to store and process their own samples but can't afford to download and/or (permanently) store large public datasets for reanalysis and comparison. That describes most students and enterprise end users who rely on increasingly bigly public datasets or large Illumina screens, even though it doesn't match (my opinion of) the development workflow of tool or workflow developers, who often use small test datasets and/or small sets of edge cases.
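As a sketch of what the first suggestion could look like from a tool's point of view, the helper below opens an input as text whether or not it is gzip-compressed, so the uncompressed copy never has to hit disk. It is plain Python, not a CWL feature; the name `open_maybe_gzip` is hypothetical, and detection relies on the two-byte gzip magic number rather than the file extension.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open path for text reading, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")
```

A step written this way can read `reads.fastq` and `reads.fastq.gz` through the same code path, which is the kind of convenience the suggestion asks the workflow layer to provide.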

Conclusions

Learning CWL is an educational way to understand the state of the art in workflow languages and specifications. If you're looking for production workflows or rapid development speed, GNU make or shell scripts remain the simplest option. Testing options are constrained by the shell environment, but again, it's an OSS domain-specific language specification, not a BDD pipeline environment. Come on Matt, this isn't as cookie-cutter as Ruby development is...

Thanks for reading!