Q: TCGA is now over and the DCC data portal is offline. Where do I find legacy information about TCGA?

A: TCGA data processing formally ended in July 2016, and content from its data portal and CGHub has been migrated to the NCI Genomic Data Commons (GDC) data portal. We suggest that all questions about TCGA data, policies, practices and procedures be directed to the TCGA program office, the NCI Center for Cancer Genomics, or the GDC.

Q: I understand that TCGA data has migrated to the GDC, but why do I see discrepancies between GDC and FireBrowse?

A: Note that the GDC serves both HG38 and HG19 data. The HG19 data are considered “legacy” and represent the original calls as made by each of the sequencing centers in TCGA; they ARE NOT the default data served by the GDC, and instead are served from the (slightly hidden) legacy archive section of the GDC portal. By default the public GDC interface serves HG38 data; these are newly generated at the GDC itself, with the intent to smooth over differences across the entire set of TCGA samples by “harmonizing” them with common variant callers and reference data. It is important to understand that these HG38 data are not the original HG19 legacy data that are discussed in most of the current TCGA publications. Lastly, note that the public GDAC Firehose & FireBrowse portals ONLY serve HG19 data; we’ve been reluctant to release HG38 data (and analyses of them) to the general public until they have gone through more in-depth QC/vetting. This QC has not yet been fully completed, but is an active area of investigation (with an analysis working group, or AWG) within the nascent GDAN. We are aiming to have a first release of HG38 GDAC pipelines in FireBrowse by Q1 of 2018, after the QC group completes its assessment to the satisfaction of the NCI.

Q: What is the best way for us to contact you?

A: To help us respond to you faster and more completely, it is best to use one of these methods to contact us as a group, instead of privately emailing individuals on our team.

Q: Where is your documentation?

A: Though the data and software in our pipeline are constantly evolving, we believe that process clarity & operational transparency streamline efforts and ultimately improve science. We therefore endeavor to provide a reasonable level of background data processing and algorithm documentation, given our time, resource, and priority constraints. In addition, we generate hundreds of analysis reports per month, each containing detailed summaries, figures, and tables, as well as literature references and links to other documentation on the algorithmic codes in our pipeline. For each run we also provide a summary report of samples ingested, along with analysis notes and data notes. Our pipeline nomenclature is described below, and further description of the TCGA data formats is available here. Finally, the analysis tasks in the latest run are shown below as a directed graph, which you may click to expand; you may then click any enabled node to view the Nozzle report generated for that analysis result.


Q: How or where can I access the inputs and/or results of a run?

A: In one of several ways, all of which are governed by TCGA data usage policy (note that only the TCGA DCC requires password access; all Firehose and FireBrowse mechanisms are completely open for public use):

           from which you may simply navigate to the tumor type and run date of interest. More information on the nomenclature and content of these files is given below. Microsoft Windows-based users can use the WinRAR utility to unpack the archive files, while Unix and Apple Mac OS X users can use the gzip and/or tar utilities.
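For users who would rather script the unpacking step than run gzip/tar by hand, the following is a minimal sketch using Python's standard tarfile module. It assumes the downloaded archive is a gzip-compressed tar file; the function name and any paths passed to it are hypothetical, not part of Firehose.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest_root = Path(dest_dir).resolve()
    dest_root.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Refuse members that would be written outside dest_dir.
        for member in archive.getmembers():
            target = (dest_root / member.name).resolve()
            if target != dest_root and dest_root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        names = archive.getnames()
        archive.extractall(dest_root)
    return names
```

Calling `unpack_archive("some_archive.tar.gz", "unpacked")` (both names are placeholders) extracts the files under `unpacked/` and returns their names, which is roughly equivalent to `tar -xzf` on Unix or Mac systems.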

...

Q: Where did my mutations go?

A: MAFs processed by MutSig may have mutations removed for one of several reasons:

  •  They are on contigs other than chr1-22/X/Y.
  •  They are malformed in some way, e.g. purport to be a “mutation from C to C”, or are an unknown Variant_Classification (e.g. Targeted_Region).
  •  They are noncoding mutations or in genes absent from MutSig's list of gene definitions, or in poorly covered genes (by historic coverage metrics).
  •  They are in the “mutation blacklist,” a static list of positions observed in the past to harbor common artifacts or germline variation.
  •  They occur in patients MutSig deems to be duplicates.
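
To check which of these reasons might apply to a given MAF record, the list above can be approximated in a few lines of Python. This is only an illustrative sketch and not MutSig's actual code: the column names follow the standard MAF format, while the function name, the sets passed in (known classifications, gene definitions, blacklist, duplicate patients) and the use of the first 12 barcode characters as a patient identifier are assumptions made for the example. Noncoding and coverage-based filtering are omitted because they depend on data internal to MutSig.

```python
CANONICAL_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y"}


def normalize_contig(name):
    """Strip an optional 'chr' prefix so 'chr7' and '7' compare equal."""
    return name[3:] if name.lower().startswith("chr") else name


def removal_reason(record, known_classes, known_genes, blacklist, duplicate_patients):
    """Return a short reason string if the record would be dropped, else None.

    record is a dict keyed by standard MAF column names; the other arguments
    are hypothetical stand-ins for MutSig's internal reference data.
    """
    contig = normalize_contig(record["Chromosome"])
    if contig not in CANONICAL_CONTIGS:
        return "non-canonical contig"
    if record["Reference_Allele"] == record["Tumor_Seq_Allele2"]:
        return "malformed: reference equals alternate allele"
    if record["Variant_Classification"] not in known_classes:
        return "unknown Variant_Classification"
    if record["Hugo_Symbol"] not in known_genes:
        return "gene absent from gene definitions"
    if (contig, int(record["Start_Position"])) in blacklist:
        return "blacklisted position"
    # Assumption: the first 12 characters of a TCGA barcode identify the patient.
    if record["Tumor_Sample_Barcode"][:12] in duplicate_patients:
        return "duplicate patient"
    return None
```

Running each row of a MAF through `removal_reason` and tallying the returned strings gives a rough picture of why rows might disappear, although the real filters may differ in order and detail.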
Q: Where can I find a copy of the blacklist used in MutSig runs?

A: The blacklist is used to filter out recurrent mutation sites that the MutSig development team found to cause issues with the determination of significance. Because these sites by nature include germline mutations that may not have been part of available databases when the MAF was originally generated, we are not permitted to release the blacklist to the public.

Q: Why does your table of ingested data show that disease type XYZ has N methylation samples?

A: We ingest and support both of the major methylation platforms (Infinium HumanMethylation450 and HumanMethylation27); the entries in our data table therefore give the sum of both. However, as noted in our June 2012 release notes, Firehose does not yet include the statistical algorithms used by TCGA AWGs to merge both of these methylation platforms into a single bolus; until those are shared we prefer meth450 over meth27 when available for a given disease type, as it gives not only greater sample counts but also higher resolution data.
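
The difference between the count shown in the table and the samples actually analyzed can be illustrated with a small Python sketch. The function names and the example counts are hypothetical; only the rule itself (report the sum, analyze meth450 when available, otherwise meth27) comes from the answer above.

```python
def ingested_total(sample_counts):
    """Count shown in the ingested-data table: the sum over both platforms."""
    return sample_counts.get("meth450", 0) + sample_counts.get("meth27", 0)


def preferred_platform(sample_counts):
    """Platform used for analysis: meth450 if any samples exist, else meth27."""
    for platform in ("meth450", "meth27"):
        if sample_counts.get(platform, 0) > 0:
            return platform
    return None  # no methylation data for this disease type


# Hypothetical counts for one disease type.
counts = {"meth450": 300, "meth27": 120}
print(ingested_total(counts))      # 420
print(preferred_platform(counts))  # meth450
```

In this made-up case the table would list 420 methylation samples, while downstream analyses would draw only on the 300 meth450 samples.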

...

Q: But where do I get Firehose data to test my module?

A: This is described above.

Q: Your results archives have long and complicated names. What do they mean?

A: Our result archive and sample set nomenclature is described here. An older version, pertaining to results submitted before January 2013, is given here.

Each pipeline we execute results in a set of 6 archive files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive; tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives.
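
As a rough illustration of working with these six files, the Python sketch below sorts file names into the roles described above and checks an archive against its MD5 file. It rests on two assumptions that are not confirmed here: that the role tokens (Level_, aux, mage-tab) appear as dot-delimited parts of the file name, and that each MD5 file begins with the hexadecimal digest, as md5sum output does. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def archive_kind(filename):
    """Classify a file by role, assuming dot-delimited role tokens in its name."""
    name = filename.lower()
    if name.endswith(".md5"):
        return "checksum"
    if ".mage-tab." in name:
        return "mage-tab"
    if ".aux." in name:
        return "aux"
    if ".level_" in name:
        return "primary"
    return "unknown"


def md5_matches(archive_path, md5_path):
    """Compare an archive's MD5 digest with the first token of its .md5 file."""
    expected = Path(md5_path).read_text().split()[0].lower()
    digest = hashlib.md5()
    with open(archive_path, "rb") as handle:
        # Read in 1 MiB chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

A typical use would be to keep only the files for which `archive_kind` returns `"primary"`, and to confirm `md5_matches` is true for each before unpacking.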

...