The Broad Institute GDAC will gladly coordinate with TCGA analysis working groups (AWGs) to provide custom Firehose runs tailored to their specific needs.  This extends Firehose beyond its original mission of monthly runs, intended for archival storage at the DCC and wide public consumption beyond TCGA, by providing in-depth support to ongoing analysis efforts within TCGA.  These custom runs offer several "realtime value-added benefits" to AWGs:

  • Currency: pipelines can be run on the latest daily snapshot of data from the DCC, avoiding the time and sample lag of monthly runs
  • Flexibility: additional runs can easily be performed on AWG-curated disease subtypes, and can even include custom analyses
  • Speed: custom AWG runs can be executed in only a few days' time (excluding computationally intensive algorithms that may take >1 week to run)
  • Familiarity: runs use the same internal Firehose machinery and the same external-facing dashboards, Nozzle reports, etc., already known to the community
  • Scope: custom runs are a stepping stone to an open-access Firehose that can be manipulated directly by TCGA researchers, instead of having runs curated only at the Broad Institute

Our custom AWG data runs can also be used to define a baseline AWG data freeze. These freeze products are ideally suited for sharing data across the various centers participating in a given AWG. Furthermore, all of the output archives produced by our custom AWG runs are easily obtained with firehose_get, in the same manner as the monthly runs. For a more in-depth look at what we provide, see our Feb 2013 presentation to the Lung Adenocarcinoma AWG.  Please contact us at or visit for more details. Our AWG runs are also reflected on the TCGA wiki.

See this TCGA Wiki page for ProgressSummary: TumorStatusReport Spreadsheet (more up-to-date and comprehensive than our internal chart below)

Internal Broad Staffing Prioritization Spreadsheet

 TCGA calendar for AWG telecons






| Tumor Type | Chairs | Activity Status (liable to be out of date; see TCGA progress report spreadsheet above) | GDAC/Broad Science Lead | GDAC Data Lead | |
| BRCA (ductal, lobular) | Chuck Perou | ductal published, lobular active | Hailei | | |
| GBM-LGG comparison | Roel Verhaak, Antonio Iavarone | active | Hailei | | |
| (STAD-)ESCA | Peter Laird, Ilya, Adam Bass | active | Sam Meier | | |
| LIHC | Lewis Roberts (Mayo), David Wheeler | active | Jaegil | Tim/Juok | |
| | Chris Sander, Max Loda | active | Elie / Manaswi | David Heiman | |
| PAAD | Stacey Gabriel, Ralph Hruban | active | Mike L / Louis | Juok | |
| MESO | Peter Campbell, Marc Ladanyi, Bruce Robinson | active, but young | Jaegil (mutation) | David | |
| | Todd Spellman, Marston Linehan | active | Gordon | | |
| | Tom Giordano, Roel Verhaak | submitted | Harindra | | |
| | Doug Levine, Rehan Akbani | active, but young | Harindra | | |
| CESC | Gordon Mills, Janet Rader | active | Hailei | | |
| | Karel Pacak (NIH), Kate Nathanson (UPenn), Matt Wilkerson (UNC) | | | | |
| SARC | Li Ding, Marc Ladanyi, Alex Lazar, Carl Morrison, Sam Singer, Brian Van Tine | active | Jaegil (mutation) | David H | |
| | Lynda Chin, Jeff Gershenwald | done/pub | Lihua | Dan / Mike | Semin Lee |
| | Peter Laird, Ilya, Adam Bass | | | | |
| | Gad Getz, Tom Giordano | | | | |
| | Neil Hayes, Jennifer Grandis | done/pub | Juok | Semin Lee | |
| | Dan Brat, Al Yung | | | | |
| LUAD | Matthew Meyerson | submitted | Hailei | Dan | |
| LAML | | done/pub | Juok | | |
| KIRC | | done/pub | Hailei | | |
| COADREAD | | done/pub | | | |
| UCEC | | submitted | Semin Lee | Mike | |
| LUSC | | done/pub | | | |
| GBM | | done/pub | done | Lihua | Hailei |
| | Chad Creighton, Kim Rathmell | | | | |
| | No representation, KICH declined | | | | |
| | Seth Lerner, John Weinstein | | | | |
| | Josh Stuart, Chris Sander, Ilya | submitted, n/a | Mike | n/a | n/a |





  1. What constitutes an AWG run?   As summarized in our AWG run checklist, the products in an AWG freeze are

    a. A YYYY_MM_DD version stamp (e.g. 2012_10_24), denoting when a data snapshot (to be frozen) was obtained from the DCC
      • This can optionally contain a runcode suffix, such as _00 or _01, to denote additions/deletions to the base snapshot

    b. A sample list representing all of the data available in the snapshot

      A tab-delimited table (readable in either Emacs or Excel, not only Excel) containing 1 row per aliquot, with at least the following columns:
      • TCGA identifier (preferably full aliquot, and potentially UUID in the future)
      • The corresponding Firehose identifier
      • Sample type (per TCGA standard, TP=tumor primary, NB=normal blood, etc)
      • The platform of origination on which the given aliquot data were collected (e.g. genome_wide_SNP_6 for copy number data)
      • URL to the source file archived at the DCC in which that datatype-specific aliquot can be found
      • URL to the corresponding SDRF for the given DCC archive file

      A heatmap representing platform coverage (per datatype per sample) will also be provided and linked on the AWG run dashboard.

    c. The results of the corresponding stddata workflow executed upon the freeze samples

    d. The results of the corresponding Analyses workflow executed upon the freeze samples, updated with annotations from the stddata workflow

    e. A set of web-browsable Nozzle reports (descriptions, figures, tables, etc) for (d)

      The individual Nozzle reports for each analysis task run against a given freeze sample cohort are aggregated into a single comprehensive report by the gdac_reports tool.

    f. An online dashboard providing simple, central access to all of the above
      • Located at /xchip/gdac_data/runs/awg_<disease_name>__YYYY_MM_DD on the internal Broad filesystem
      • Which corresponds to the online URL<disease_name>__YYYY_MM_DD
      • The gdac_status tool is used to obtain the pass/fail/running/not-run status of all tasks in (d)

    g. A set of DCC-submission-ready archives containing (c), (d), and (e), retrievable from the dashboard with firehose_get

      The gdac2public tool is used to provision these DCC archives from the internal submission tree (into which they are packaged by Firehose) to the online dashboard location mentioned in (f)
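
The version stamp, internal run location, and sample-list columns described above lend themselves to a mechanical sanity check. The following is a minimal Python sketch, not part of the GDAC tooling: the function names and the header labels in REQUIRED_COLUMNS are hypothetical (the checklist names the fields but not their column labels), and the sketch assumes the run directory name uses only the base YYYY_MM_DD date, without the runcode suffix.

```python
import csv
import io
import re
from datetime import date
from typing import List, NamedTuple, Optional

# YYYY_MM_DD with an optional two-digit runcode suffix such as _00 or _01.
STAMP_RE = re.compile(r"^(\d{4})_(\d{2})_(\d{2})(?:_(\d{2}))?$")

# Hypothetical header labels for the required sample-list fields.
REQUIRED_COLUMNS = [
    "tcga_id", "firehose_id", "sample_type",
    "platform", "source_url", "sdrf_url",
]


class FreezeStamp(NamedTuple):
    snapshot: date          # day the data snapshot was obtained from the DCC
    runcode: Optional[str]  # e.g. "01", or None when no suffix is present


def parse_stamp(stamp: str) -> FreezeStamp:
    """Split a version stamp into its snapshot date and optional runcode."""
    match = STAMP_RE.match(stamp)
    if match is None:
        raise ValueError(f"not a YYYY_MM_DD[_NN] version stamp: {stamp!r}")
    year, month, day, runcode = match.groups()
    # date() raises ValueError for impossible dates such as 2012_13_40.
    return FreezeStamp(date(int(year), int(month), int(day)), runcode)


def run_directory(disease_name: str, stamp: FreezeStamp) -> str:
    """Build the awg_<disease_name>__YYYY_MM_DD path (assumption: no runcode)."""
    day = stamp.snapshot.strftime("%Y_%m_%d")
    return f"/xchip/gdac_data/runs/awg_{disease_name}__{day}"


def missing_columns(tsv_text: str) -> List[str]:
    """Return the required columns absent from a tab-delimited sample list."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]
```

Parsing the stamp before building the path means a malformed stamp fails early with a clear error rather than producing a directory name that silently points nowhere.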

  2. How do we generate these freeze products?

    Again, in the ideal case we'd simply use the canonical stddata & analyses runs produced on a monthly basis.  But conflicting delivery dates and algorithm versions mean that is not always possible, so for PANCAN8 Mike prototyped FISS and Python scripts that were also used for SKCM and THCA, and that have now been generalized by Dan into a new Python tool:
    1. The gdac_freeze tool lives in the same GDAC bin location as fiss et al  (you will need to log in to Confluence to see the gdac_freeze page)
    2. It can be run by anyone

  4. Who generates these freeze products?

    After the AWG chair appoints an analysis champion (AC) and data coordinator (DC), the AC and DC work in tandem to guide the freeze products through the process.  The DC is responsible for curating the sample list and communicating it to the AWG at large, while the AC is responsible for seeing that the analyses from the freeze list are sufficiently vetted, and for communicating their availability to the AWG at large.  The GDAC engineering team will provide appropriate support as required.

  5. How can I cross-reference the samples in the Firehose AWG run I'm shepherding with those in the freeze list maintained by my AWG?

    1. Interactively: by inspection of the latest samples report.  If your AWG run has a different YYYY_MM_DD version stamp than the latest dicing, then look at the list of all sample reports.
    2. Programmatically:  use the gdac_counts and/or gdac_data CLI tools (use -h or -help options for usage instructions);  additional details can also be obtained with 'fiss sample_list ...'  or 'fiss annot_get ...'
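
For a quick comparison outside of the gdac_counts and gdac_data tools, set arithmetic on the identifier columns of the two lists is often enough. The following is a minimal Python sketch, assuming both the Firehose run sample list and the AWG freeze list are tab-delimited with a header row; the column name tcga_id, the function names, and the barcodes in the example are made up for illustration.

```python
import csv
import io
from typing import Dict, List, Set


def read_ids(tsv_text: str, column: str = "tcga_id") -> Set[str]:
    """Collect the non-empty identifiers found in one column of a TSV table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    # Short rows yield None for missing cells, so skip empty values.
    return {row[column].strip() for row in reader if row.get(column)}


def cross_reference(run_ids: Set[str], freeze_ids: Set[str]) -> Dict[str, List[str]]:
    """Report which identifiers are shared and which appear in only one list."""
    return {
        "only_in_run": sorted(run_ids - freeze_ids),
        "only_in_freeze": sorted(freeze_ids - run_ids),
        "shared": sorted(run_ids & freeze_ids),
    }


if __name__ == "__main__":
    # Illustrative, made-up identifiers; real lists would be read from files.
    run_list = "tcga_id\tsample_type\nTCGA-XX-0001-01A\tTP\nTCGA-XX-0002-01A\tTP\n"
    freeze_list = "tcga_id\nTCGA-XX-0002-01A\nTCGA-XX-0003-01A\n"
    report = cross_reference(read_ids(run_list), read_ids(freeze_list))
    for label, ids in report.items():
        print(label, ids)
```

Reporting both directions of the difference matters here: samples only in the run suggest the freeze list was pruned after the snapshot, while samples only in the freeze list suggest data that was not yet available at the DCC when the snapshot was taken.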

  6. What do we do when something more is needed?

    Inevitably one of the analyses will be incorrect, or will be using an older version of code, or some samples will need to be added and/or deleted.  Or, after seeing an analysis result, one would like to rerun parts of the workflow on a newly identified subtype or subset of the data (see below).  The need to address both of these situations flexibly is the strongest argument for NOT using canonical GDAC monthly runs, in favor of AWG-specific workspaces as created by the gdac_freeze tool.

  7. What if I want to do subtype-based analyses?

    All of the same machinery is used: create a fresh sample set that lists the samples in your subtype (e.g. as output from a clustering algorithm);  for example, consider this TSV file which defines a small set of 8 samples (note that multiple subtypes may also be specified within a single TSV file).  The sample set will be named LGG-astrocytoma in Firehose (per the sample_set_id column), and can be loaded via this fiss command at the Unix prompt