This page serves as a wishlist for FireCloud features needed by the Broad GDAC to match the productivity of FireHose.
- More sophisticated complex expressions
GDAC data is notoriously messy, and certain algorithms are generalized across a variety of data subtypes. For this reason, task configurations in FireHose had MVEL expressions that would, among other things, allow "choosing" between two possible inputs, and FireHose would figure out how to map the correct inputs to the workflow language. In FireCloud, we only have simple expressions: literals, attributes, or attributes of members in a set (e.g. "Value", "this.value", "this.samples.value"). Having only these features necessitates ugly (and hard to maintain) hacks, such as spinning up a VM to run a task whose only job is to choose between input files, or adding extra optional inputs to the WDL and parsing them as the first step in a task.
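As a rough illustration of the "choosing" behavior described above, here is a minimal Python sketch. It is not MVEL and not FireCloud expression syntax; the function `choose_input` and the attribute names `maf_file_oncotated` and `maf_file_raw` are hypothetical, invented only to show the fallback logic we currently have to push into a task.

```python
from typing import Mapping, Optional


def choose_input(attributes: Mapping[str, Optional[str]], *candidates: str) -> Optional[str]:
    """Return the value of the first candidate attribute that is set and non-empty."""
    for name in candidates:
        value = attributes.get(name)
        if value:
            return value
    return None


# Hypothetical sample attributes: the preferred file is missing, so we fall back.
sample = {
    "maf_file_oncotated": None,
    "maf_file_raw": "gs://bucket/sample1.raw.maf",
}
chosen = choose_input(sample, "maf_file_oncotated", "maf_file_raw")
```

If the expression language could express this kind of fallback directly, no VM would need to be started just to make the choice.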
- Built-in attribute/expression evaluator
(I think Chet entered a story/ticket for this.) We need a way to see the parameters a given workflow or config in any space would be given if launched, but without launching it. This was described in 2012 for FireHose in the CGA 2012_09_28 presentation, p. 41-46: fiss task_eval <space> <iset> <task_name>
"Map" data type
A common use case in FireHose is to select input files from samples in a sample set, and pass these files to an analysis via a two-column tsv file that maps sample ids to the data files. An analogous method exists in FireCloud, allowing you to accept as input an array of sample ids and an array of the data files. The problem arises when the data is sparse: the two arrays are no longer parallel, and the mapping is broken. From a task author's perspective, this could be solved if there were a Map data type in WDL. In FireCloud, you could pass the input as "this.samples.name->this.samples.data", and not require any sort of Null sentinel value in the bucket.
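The sparse-data problem can be sketched in a few lines of Python. The sample records and bucket paths below are made up for illustration, and the sketch only assumes that a missing attribute is dropped from the file array rather than kept as a placeholder.

```python
# Hypothetical sample set: S2 has no data file.
samples = [
    {"name": "S1", "data": "gs://bucket/S1.seg"},
    {"name": "S2", "data": None},
    {"name": "S3", "data": "gs://bucket/S3.seg"},
]

# Two "parallel" arrays: the missing value is dropped from the second array,
# so the arrays no longer line up.
names = [s["name"] for s in samples]
files = [s["data"] for s in samples if s["data"] is not None]
broken = list(zip(names, files))  # S2 is paired with S3's file

# A map keeps each id attached to its own file and simply omits S2.
sample_to_file = {s["name"]: s["data"] for s in samples if s["data"] is not None}


def to_tsv(mapping):
    """Render the mapping as the two-column tsv the analysis expects."""
    return "".join(f"{name}\t{path}\n" for name, path in mapping.items())
```

With a map, the two-column tsv can be generated without any sentinel value for the missing sample.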
Different workflows will often share common tasks, such as a preprocessor or a report generator. Since WDLs are not composable, each workflow must independently maintain a copy of the task. A temporary solution used by gdac-firecloud is to use a script to sync task definitions within the repository, but this has limitations. First, it only works with tasks defined in the repository. It also requires manual intervention by the workflow developer, via 'make sync'. Import statements are currently part of the WDL spec, but are not implemented in FireCloud or any of the development tools (e.g. Cromwell, wdltool, etc.).
How are singleton tasks configs versioned? How does that relate to composite versioning and job avoidance? e.g. AFAIK a composite task is built by literal inclusion of its component singleton tasks. But where is the version of each singleton task "stored" in the composite WDL? And how are these versions checked so that single tasks can be avoided properly? Moreover, if I update singleton task T and want to update all of the composite tasks C1, C2 ... CN of which it's a member, how do I discern C1...CN and how can I automatically update them when T is updated?
- Intermediate task outputs pollute the Google bucket
Francois brought this one to my attention: in a multi-task workflow, files can be passed from task to task by making them outputs of the task. But often, this intermediate file is not useful once the final step in the workflow has been run. However, these files must still be output into the Google bucket and assigned a place in the data model, thereby polluting the workspace and the Google bucket. FireCloud needs a way to specify that an intermediate file does not need to be a workflow output, or to allow workflows to specify their outputs explicitly.
- Gold Standard for WDL complexity is the GDAC Analysis workflow
Complete with graph dependency evaluation and job avoidance; we need to revisit that story and break it up into pieces, if necessary.
- Built-in Debugging
If a task fails in FireCloud, it can be difficult to determine why, since all the temporary files created while running the task disappear. If a task fails, delocalizing the working directory would help immensely. I should also be able to SSH into a running task and look around to see what files have been created and how much CPU, memory, etc. is being used and which processes are running. That way, submitting a task to FireCloud is not a black box.
- Running an on-prem instance
For the reasons above, most debugging must be done locally with Cromwell. However, local runs do not faithfully recreate the conditions on FireCloud proper (see the MATLAB bug that plagued us for months). A task can succeed locally and still fail in FireCloud with identical arguments. So we need a way to debug these failures that doesn't cost us money and time, and gives us greater ability to observe what is going on.
- SSH access to running jobs, look at files in a workspace, see above
Don't let me attempt any action that is not permitted.
Notify me via email when my tasks finish (either on error or successfully), and let these notifications be configurable for each user & workspace
- Long google bucket paths are unreadable, abbreviate to file basenames since in most cases, the directory structure is irrelevant, especially from the Data Model's perspective.
- List of stories entered by Chet from Mike's UI diagrams
- Sticky preferences: which namespace I last worked in, or which namespace I most prefer to work in, ditto for workspace(s), what is my favorite "results per page" setting
- Accurate Documentation
The current API page is intended to list all available orchestration (read: user-available) API calls. But this page is either incomplete, or does not accurately represent the available API calls. On a couple of occasions, Alex has directed me to an API endpoint that is not listed on this page, but is listed on the page for the service (Rawls, Agora, etc.) at https://swagger.dsde-dev.broadinstitute.org/ . Some of these APIs are publicly accessible as pass-throughs, but the distinction has not been made clear.
- Get latest snapshot of Method or Config
Some endpoints require a snapshot_id (e.g. https://swagger.dsde-dev.broadinstitute.org/#!/Method_Repository/getMethodRepositoryConfiguration), but there isn't an easy way to figure out what the latest snapshot id is without retrieving all snapshot_ids and performing a max check. For these endpoints, there should be a 'latest' option, or the snapshot_id should be optional, in which case the latest version is retrieved.
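The "retrieve everything and take the max" workaround looks roughly like the Python sketch below. It makes no API call; it assumes the listing has already been fetched and parsed into dictionaries, and the record shape (keys `namespace`, `name`, `snapshotId`) and the helper name `latest_snapshot_id` are assumptions for illustration, not a statement of what the service returns.

```python
from typing import Iterable, Mapping, Optional


def latest_snapshot_id(
    snapshots: Iterable[Mapping[str, object]], namespace: str, name: str
) -> Optional[int]:
    """Return the highest snapshot id for namespace/name, or None if there is none."""
    ids = [
        int(s["snapshotId"])  # assumed key name
        for s in snapshots
        if s.get("namespace") == namespace and s.get("name") == name
    ]
    return max(ids, default=None)


# Hypothetical listing, as if parsed from a JSON response.
listing = [
    {"namespace": "broadgdac", "name": "gistic2", "snapshotId": 3},
    {"namespace": "broadgdac", "name": "gistic2", "snapshotId": 11},
    {"namespace": "broadgdac", "name": "mutsig", "snapshotId": 7},
]
latest = latest_snapshot_id(listing, "broadgdac", "gistic2")
```

Every client currently has to carry a helper like this, which is the boilerplate a 'latest' option would remove.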
- Need a reliable way to kill runaway workflows. There are many reasons a workflow can appear to hang (Docker issue, inadvertent infinite loop, services miscommunicating). I still have a test_gistic workspace that claims to be running, but I have no idea if that means there is a Google VM still running, or if FireCloud is just confused.
Mike Notes from calendar: not sorted yet