is now over and the DCC data portal is offline, where do I find legacy information about TCGA?

A: TCGA data processing formally ended in July of 2016, with content from its data portal and CGHub having been migrated to the NCI Genomic Data Commons data portal. We suggest that all questions about TCGA data, policies, practices and procedures be directed to the TCGA program office, the NCI Center for Cancer Genomics, or the Genomic Data Commons.


Q: Where is your documentation?

A: Though the data and software in our pipeline are constantly evolving, we believe that process clarity and operational transparency streamline efforts and ultimately improve science. We therefore endeavor to provide a reasonable level of documentation on background data processing and algorithms, given our time, resource, and priority constraints. In addition, we generate hundreds of analysis reports per month, each containing detailed summaries, figures, and tables, as well as literature references and links to other documentation on the algorithmic codes in our pipeline. For each run we also provide a summary report of samples ingested, along with analysis notes and data notes. Our pipeline nomenclature is described below, and further description of the TCGA data formats is available here. Finally, the analysis tasks in the latest run are shown below as a directed graph, which you may click to expand, and then click upon any enabled nodes to view the Nozzle report generated for that analysis result.

Q: How or where can I access the inputs and/or results of a run?

A: In one of several ways, all of which are governed by TCGA data usage policy (note that only the TCGA DCC requires password access; all Firehose and FireBrowse mechanisms are completely open for public use):

           from which you may simply navigate to the tumor type and run date of interest. More information on the nomenclature and content of these files is given below. Microsoft Windows-based users can use the WinRAR utility to unpack the archive files, while Unix and Apple Mac OS/X users can use the gzip and/or tar utilities.
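For readers who would rather script the unpacking step than run WinRAR, gzip, or tar by hand, the following is a minimal Python sketch using only the standard library. It assumes the downloaded archive is a gzip-compressed tar file; the function name `unpack_archive` and the file names in the usage comment are hypothetical placeholders, not actual Firehose names.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive; return the member names it contained."""
    dest_root = Path(dest_dir)
    dest_root.mkdir(parents=True, exist_ok=True)
    dest_root = dest_root.resolve()

    with tarfile.open(archive_path, mode="r:gz") as tar:
        members = tar.getmembers()
        # Refuse members whose paths would land outside the destination directory.
        for member in members:
            target = (dest_root / member.name).resolve()
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=dest_root)

    return [member.name for member in members]


# Hypothetical usage (placeholder file names):
# names = unpack_archive("downloaded_archive.tar.gz", "unpacked")
```

The path check is a precaution that is reasonable for any archive fetched over a network; it is not something the Firehose archives are known to require.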


Q: How do I use a graphic from FireBrowse in my paper?

A: The FireBrowse visualization widgets (viewGene, iCoMut) do not explicitly provide a screen-capture feature, but if you use your browser's Print feature and then save the result to a PDF, you'll have a vector-graphics image that can be scaled without loss of fidelity.

Q: There are many acronyms used in Firehose and TCGA, for example to identify disease cohorts. Where can I find what these acronyms mean?

A: Consult the TCGA Encyclopedia for general questions. There are several ways one can map cohort abbreviations to full disease names, including:

Note that our portals list 38 cohorts while the TCGA page shows 34; the difference is the aggregate cohorts that the Firehose GDAC constructed for the convenience of TCGA and the research community. As of December 2016 these aggregate cohorts are:

COADREAD: colorectal, combines COAD + READ
GBMLGG: glioma, combines GBM + LGG
KIPAN: pan-kidney, combines KICH + KIRC + KIRP
STES: stomach-esophageal, combines STAD + ESCA
PANGI: not publicly available yet, but combines STES + COADREAD
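The aggregate cohorts above can be captured in a small lookup table. The sketch below is illustrative only: the dictionary mirrors the list above, while the names `AGGREGATE_COHORTS` and `expand_cohort` are our own and not part of any Firehose API. Because PANGI is built from other aggregates, the expansion is recursive.

```python
# Aggregate cohorts and their direct components, as listed above.
AGGREGATE_COHORTS = {
    "COADREAD": ["COAD", "READ"],
    "GBMLGG": ["GBM", "LGG"],
    "KIPAN": ["KICH", "KIRC", "KIRP"],
    "STES": ["STAD", "ESCA"],
    "PANGI": ["STES", "COADREAD"],
}


def expand_cohort(name):
    """Recursively expand a cohort into its non-aggregate constituent cohorts."""
    components = AGGREGATE_COHORTS.get(name)
    if components is None:
        return [name]  # not an aggregate: return it unchanged
    expanded = []
    for component in components:
        expanded.extend(expand_cohort(component))
    return expanded


print(expand_cohort("KIPAN"))  # ['KICH', 'KIRC', 'KIRP']
print(expand_cohort("PANGI"))  # ['STAD', 'ESCA', 'COAD', 'READ']
print(expand_cohort("BRCA"))   # ['BRCA']
```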


Q: What reference genome build are you using?

A: We match the reference genome used in our analyses to the reference used to generate the data, as appropriate. Our understanding is that TCGA standards stipulate that OV, COAD/READ, and LAML data are hg18, and all else is hg19. One caveat: SNP6 copy number data is available in both hg18 and hg19 for all tumor cohorts, so we use hg19 for copy number analyses in all cases.
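The rule just described can be written as a short lookup. This is a sketch of the stated convention rather than code from our pipeline: the function name `reference_build` and the `"copy_number"` data-type label are hypothetical, and treating the aggregate COADREAD cohort like COAD and READ is our assumption.

```python
# Cohorts whose data are hg18 per the understanding stated above.
# Including the aggregate "COADREAD" here is an assumption.
HG18_COHORTS = {"OV", "COAD", "READ", "COADREAD", "LAML"}


def reference_build(cohort, data_type):
    """Return the genome build used for a given cohort and data type."""
    # Copy number analyses use hg19 for every cohort.
    if data_type == "copy_number":
        return "hg19"
    return "hg18" if cohort.upper() in HG18_COHORTS else "hg19"


print(reference_build("OV", "mutation"))     # hg18
print(reference_build("OV", "copy_number"))  # hg19
print(reference_build("BRCA", "mutation"))   # hg19
```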

Q: How are the copy number data generated, and what do their file names mean?

A: This is discussed in the application note posted here. Note that the 'minus_germline' or 'nocnv' labels in segment file names refer to whether the steps in section 2.3 are applied. The steps in section 2.4 are applied regardless.

Q: What centers are responsible for sequencing XYZ tumor?

A: Internally at the Broad we maintain this list. If you are outside the Broad, please consult the TCGA site for more information.


Q: But where do I get Firehose data to test my module?

A: This is described above.

Q: Your results archives have long and complicated names; what do they mean?

A: Our result archive and sample set nomenclature is described here. An older version, pertaining to results submitted before January 2013, is given here.

Each pipeline we execute results in a set of 6 archive files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive; tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives.
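Since each archive is paired with an MD5 checksum file, a download can be checked before use. The sketch below assumes the checksum file follows the common md5sum layout, with the hex digest as the first whitespace-separated field; that layout, the function names, and the file names in the usage comment are assumptions for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(archive_path, md5_path):
    """Return True if the archive's MD5 matches the digest in its checksum file."""
    with open(md5_path, "r") as handle:
        fields = handle.read().split()
    if not fields:
        raise ValueError(f"no checksum found in {md5_path}")
    expected = fields[0].lower()  # assumed: digest is the first field
    return md5_of_file(archive_path) == expected


# Hypothetical usage (placeholder file names):
# ok = verify_archive("results.tar.gz", "results.tar.gz.md5")
```

Reading in chunks keeps memory use flat for large archives. MD5 is adequate here for detecting corrupted downloads, though it should not be treated as a security guarantee.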