  1. What constitutes an AWG run?   As summarized in our AWG run checklist, the products in an AWG freeze are:

    a. A YYYY_MM_DD version stamp (e.g. 2012_10_24), denoting when the data snapshot to be frozen was obtained from the DCC
      i. This can optionally contain a runcode suffix, such as _00 or _01, to denote additions to or deletions from the base snapshot

    b. A sample list representing all of the data available in the snapshot

      A tab-delimited table (readable in either Emacs or Excel, not only Excel) containing 1 row per aliquot, with at least the following columns:
      i. TCGA identifier (preferably full aliquot, and potentially UUID in the future)
      ii. The corresponding Firehose identifier
      iii. Sample type (per TCGA standard, TP=tumor primary, NB=normal blood, etc.)
      iv. The platform on which the given aliquot data were collected (e.g. genome_wide_SNP_6 for copy number data)
      v. URL to the source file archived at the DCC in which that datatype-specific aliquot can be found
      vi. URL to the corresponding SDRF for the given DCC archive file

      A heatmap representing platform coverage (per datatype per sample) will also be provided and linked on the AWG run dashboard.

    c. The results of the corresponding stddata workflow executed upon the freeze samples

    d. The results of the corresponding Analyses workflow executed upon the freeze samples, updated with annotations from the stddata workflow

    e. A set of web-browsable Nozzle reports (descriptions, figures, tables, etc.) for (d)

      The individual Nozzle reports for each analysis task run against a given freeze sample cohort are aggregated into a single comprehensive report by the gdac_reports tool.

    f. An online dashboard providing simple, central access to all of the above
      i. Located at  /xchip/gdac_data/runs/awg_<disease_name>__YYYY_MM_DD on the internal Broad filesystem
      ii. Which corresponds to the online URL<disease_name>__YYYY_MM_DD
      iii. The gdac_status tool is used to obtain the pass/fail/running/not-run status of all tasks in (d)

    g. A set of DCC-submission-ready archives containing (c), (d), and (e), retrievable from the dashboard with firehose_get

      The gdac2public tool is used to provision these DCC archives from the internal submission tree (into which they are packaged by Firehose) to the online dashboard location mentioned in (f).
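
    The version-stamp and run-directory naming conventions described above can be checked mechanically. What follows is only a minimal Python sketch, not part of the GDAC toolset: the names parse_freeze_stamp, awg_run_dir and FreezeStamp are hypothetical, and it assumes that a runcode is always a two-digit suffix (as in the _00 and _01 examples) and that any runcode stays attached to the stamp in the directory name.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumption: a runcode, when present, is a two-digit suffix such as _00 or _01.
STAMP_RE = re.compile(r"(\d{4})_(\d{2})_(\d{2})(?:_(\d{2}))?")

# Internal Broad filesystem location quoted in the text above.
RUNS_ROOT = "/xchip/gdac_data/runs"


class FreezeStamp(NamedTuple):
    snapshot_date: date        # when the snapshot was obtained from the DCC
    runcode: Optional[str]     # e.g. "01", or None for the base snapshot


def parse_freeze_stamp(stamp: str) -> FreezeStamp:
    """Split a YYYY_MM_DD[_NN] stamp into its date and optional runcode."""
    match = STAMP_RE.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a YYYY_MM_DD[_NN] version stamp: {stamp!r}")
    year, month, day, runcode = match.groups()
    # date() itself raises ValueError for impossible dates such as 2012_13_40.
    return FreezeStamp(date(int(year), int(month), int(day)), runcode)


def awg_run_dir(disease_name: str, stamp: str) -> str:
    """Build the awg_<disease_name>__YYYY_MM_DD run directory path."""
    parse_freeze_stamp(stamp)  # validate the stamp before using it in a path
    return f"{RUNS_ROOT}/awg_{disease_name}__{stamp}"


if __name__ == "__main__":
    print(parse_freeze_stamp("2012_10_24_01"))
    print(awg_run_dir("SKCM", "2012_10_24"))
```

    Validating the stamp before building the path keeps a mistyped date from silently pointing at a run directory that does not exist.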

  2. How do we generate these freeze products?
  3. Again, in the ideal case we'd simply use the canonical stddata & analyses runs produced on a monthly basis.  But conflicting delivery dates and algorithm versions mean that is not always possible, so for PANCAN8 Mike prototyped FISS and Python scripts that were also used for SKCM and THCA, and have now been generalized by Dan into a new Python tool.
    1. The gdac_freeze tool, in the same GDAC bin location as fiss et al.  (you will need to log in to Confluence to see the gdac_freeze page)
    2. Can be run by anyone

  4. Who generates these freeze products?

    After the AWG chair appoints an analysis champion (AC) and data coordinator (DC), the AC and DC work in tandem to guide the freeze products through the process.  The DC is responsible for curating the sample list and communicating it to the AWG at large, while the AC is responsible for seeing that the analyses from the freeze list are sufficiently vetted, and for communicating their availability to the AWG at large.  The GDAC engineering team will provide appropriate support as required.

  5. How can I cross-reference the samples in the Firehose AWG run I'm shepherding with those in the freeze list maintained by my AWG?

    1. Interactively: by inspection of the latest samples report.  If your AWG run has a different YYYY_MM_DD version stamp than the latest dicing, then look at the list of all sample reports.
    2. Programmatically:  use the gdac_counts and/or gdac_data CLI tools (use -h or -help options for usage instructions);  additional details can also be obtained with 'fiss sample_list ...'  or 'fiss annot_get ...'
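
    For a quick side-by-side check, the same comparison can also be made directly on two tab-delimited sample lists. The sketch below is a generic illustration in Python and does not reproduce what gdac_counts, gdac_data or fiss do: the helper names read_ids and cross_reference, the column name tcga_id, and the identifiers in the usage example are all made-up placeholders.

```python
import csv
import io
from typing import Dict, Set, TextIO


def read_ids(handle: TextIO, column: str) -> Set[str]:
    """Collect the non-empty values of one column from a tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header {reader.fieldnames}")
    # Short rows yield None for missing cells; those are skipped here.
    return {row[column].strip() for row in reader if row[column]}


def cross_reference(run_ids: Set[str], freeze_ids: Set[str]) -> Dict[str, Set[str]]:
    """Partition identifiers into shared, run-only and freeze-only groups."""
    return {
        "in_both": run_ids & freeze_ids,
        "only_in_run": run_ids - freeze_ids,
        "only_in_freeze": freeze_ids - run_ids,
    }


if __name__ == "__main__":
    # Placeholder tables; real lists would be opened with open(path, newline="").
    run_table = io.StringIO("tcga_id\tsample_type\nALIQUOT-A\tTP\nALIQUOT-B\tNB\n")
    freeze_table = io.StringIO("tcga_id\nALIQUOT-B\nALIQUOT-C\n")
    report = cross_reference(
        read_ids(run_table, "tcga_id"), read_ids(freeze_table, "tcga_id")
    )
    for label, ids in report.items():
        print(label, sorted(ids))
```

    Anything reported as only_in_run or only_in_freeze is a candidate for follow-up with the data coordinator before the analyses are vetted.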

  6. What do we do when something more is needed?

    Inevitably one of the analyses will be incorrect or will use an older version of the code, or some samples will need to be added and/or deleted.  Or, after seeing an analysis result, one would like to rerun parts of the workflow on a newly identified subtype or subset of the data (see below).  The need to address both of these flexibly is the strongest argument for NOT using canonical GDAC monthly runs, in favor of AWG-specific workspaces as created by the gdac_freeze tool.

  7. What if I want to do subtype-based analyses?

    All of the same machinery is used: create a fresh sample set that lists the samples in your subtype (e.g. as output from a clustering algorithm).  For example, consider this TSV file, which defines a small set of 8 samples (note that multiple subtypes may also be specified within a single TSV file).  The sample set will be named LGG-astrocytoma in Firehose (per the sample_set_id column), and can be loaded via this fiss command at the Unix prompt