Q: Where is your documentation?

A: Though the data and software in our pipeline are constantly evolving, we believe that process clarity and operational transparency streamline efforts and ultimately improve science. We therefore endeavor to provide a reasonable level of documentation on background data processing and algorithms, given our time, resource, and priority constraints. In addition, we generate hundreds of analysis reports per month, each containing detailed summaries, figures, and tables, as well as literature references and links to other documentation on the algorithmic codes in our pipeline. For each run we also provide a summary report of samples ingested, along with analysis notes and data notes. Our pipeline nomenclature is described below, and further description of the TCGA data formats is available here. Finally, the analysis tasks in the latest run are shown below as a directed graph, which you may click to expand; you can then click any enabled node to view the Nozzle report generated for that analysis result.


Q: How or where can I access the inputs and/or results of a run?

A: In one of several ways, all of which are governed by TCGA data usage policy. Note that only the TCGA DCC requires password access; all Firehose and FireBrowse mechanisms are completely open for public use:

           from which you may simply navigate to the tumor type and run date of interest. More information on the nomenclature and content of these files is given below. Microsoft Windows-based users can use the WinRAR utility to unpack the archive files, while Unix and Apple Mac OS/X users can use the gzip and/or tar utilities.


Q: Where can I find the mutation rates calculated during Firehose analyses?

A: Mutation rates are calculated by MutSig and can be found in the patient_counts_and_rates.txt file bundled within the MutSig result archives. You can retrieve these archives with firehose_get or through the user interface (e.g. here is a link for MutSig2CV analysis results for adrenocortical carcinoma, or ACC). In addition, we plan to add mutation rates to the FireBrowse API in the near future.

Q: What are the differences between MutSig 1.5, 2.0, CV, and

A: MutSig relies on several sources of evidence in the data to estimate the amount of positive selection a gene underwent during tumorigenesis. The three main sources are:

  1. Abundance of mutations relative to the background mutation rate (BMR)
  2. Clustering of mutations in hotspots within the gene
  3. Conservation of the mutated positions (i.e. did the mutation happen at a position that is conserved across vertebrates?)

The first line of evidence, Abundance, goes into the core significance calculation performed in all versions of MutSig. In MutSig1.0, this is simply called "p". MutSig1.0 assumes a constant BMR across all genes in the genome and all patients in the patient cohort. In MutSig1.5, this is also called "p", but MutSig1.5 uses information from synonymous mutations to roughly estimate gene-specific BMRs. Later versions of MutSig (MutSigS2N and MutSigCV) have increasingly sophisticated procedures for treating the heterogeneity in per-gene, per-patient, and per-context BMRs, but they are all answering essentially the same question about Abundance of mutations above the background level.

The other lines of evidence, Conservation and Clustering, are examined by a separate part of MutSig that performs many permutations, comparing the observed distribution of mutations to the null distribution obtained from these permutations. The output of this permutation procedure is a set of additional p-values: p_clust is the significance of the amount of clustering in hotspots within the gene; p_cons is the significance of the enrichment of mutations in evolutionarily conserved positions of the gene; and p_joint is the joint significance of these two signals (Conservation and Clustering), calculated according to their joint distribution. The reason for calculating p_joint is to ensure there is no double-counting of the significance due, for example, to clustering in a conserved hotspot.

MutSig2CV combines all three lines of evidence. To take a full accounting of the signals of positive selection in a given gene, it uses the Fisher method of combining p-values. The two p-values combined are the "p" (or "p_classic") from the analysis of mutation Abundance and the p_joint from the analysis of Conservation and Clustering in MutSig2.0. More information on MutSig is available in its entry on the CGA software page, in the 2013 and 2014 MutSig publications, in dozens of TCGA-related papers, and in the respective reports.
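As an illustration of the Fisher method mentioned above, here is a minimal sketch in Python for the two-p-value case. It is not taken from the MutSig code base, and the function name fisher_combine_two is hypothetical. It relies on the standard result that, for two independent p-values, the statistic -2*(ln p_a + ln p_b) follows a chi-square distribution with 4 degrees of freedom, whose upper tail has a simple closed form.

```python
import math


def fisher_combine_two(p_a, p_b):
    """Combine two independent p-values with Fisher's method.

    Hypothetical helper for illustration only; the production MutSig
    implementation may differ (e.g. in how it handles extreme values).
    """
    for p in (p_a, p_b):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    # Work with the log of the product to avoid underflow for tiny p-values.
    log_product = math.log(p_a) + math.log(p_b)
    # Fisher statistic: X = -2 * log_product, chi-square with 4 d.o.f.
    # Its survival function is exp(-X/2) * (1 + X/2),
    # which simplifies to product * (1 - ln(product)).
    return math.exp(log_product) * (1.0 - log_product)


if __name__ == "__main__":
    # Example: an abundance p-value and a joint clustering/conservation p-value.
    print(fisher_combine_two(0.05, 0.05))
```

For more than two p-values, or to avoid maintaining this by hand, a general chi-square survival function from a statistics library would typically be used instead.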

Q: What do the different fields for significantly mutated genes mean?

A: Many of these fields depend on what version of MutSig was used. The following table covers the majority of them:

MutSig 1.5 | MutSig 2.0 | MutSigCV | MutSig2CV | Description
-----------|------------|----------|-----------|------------
gene | gene | gene | gene | HUGO Symbol
description | description | | longname | Full description/name of the gene
N | N | | | number of sequenced bases in this gene across the individual set
n | n | | | number of (nonsilent) mutations in this gene across the individual set
 | | nnon | nnon | number of nonsense mutations
npat | npat | npat | npat | number of patients (individuals) with at least one nonsilent mutation
nsite | nsite | nsite | nsite | number of unique sites having a nonsilent mutation
nsil | nsil | nsil | nsil | number of silent mutations in this gene across the individual set
n1 | n1 | | | number of nonsilent mutations of type "*CpG->T"
n2 | n2 | | | number of nonsilent mutations of type "*Cp(A/C/T)->T*"
n3 | n3 | | | number of nonsilent mutations of type "A->G"
n4 | n4 | | | number of nonsilent mutations of type "transver"
n5 | n5 | | | number of nonsilent mutations of type "indel+null"
n6 | n6 | | | number of nonsilent mutations of type "double_null"
p_ns_s | p_ns_s | | | p-value for the observed nonsilent/silent ratio being elevated in this gene
p | p | p | p | p-value (overall)
q | q | q | q | q-value, False Discovery Rate (Benjamini-Hochberg procedure)
 | p_classic | | | p-value for the observed amount of nonsilent mutations being elevated in this gene
 | p_clust | | pCL | p-value for clustering. Probability that recurrently mutated loci in this gene have more mutations than expected by chance. While pCV assesses the gene's overall mutation burden, pCL assesses the burden of specific sites within the gene. This allows MutSig to differentiate between genes with uniformly distributed mutations and genes with localized hotspots.
 | p_cons | | pFN | p-value for conservation. Probability that mutations within this gene occur disproportionately at evolutionarily conserved sites. Sites highly conserved across vertebrates are assumed to have greater functional impact than weakly conserved sites.
 | p_joint | | | p-value for joint model of clustering and conservation
 | | | pCV | p-value from covariates (abundance). Probability that the gene's overall nonsilent mutation rate exceeds its inferred background mutation rate (BMR), which is computed based on the gene's own silent mutation rate plus silent mutation rates of genes with similar covariates. BMR calculations are normalized with respect to patient-specific and sequence context-specific mutation rates.
 | | | codelen | the gene's coding length
 | | | nncd | number of noncoding mutations
 | | | nmis | number of missense mutations
 | | | nstp | number of readthrough mutations
 | | | nspl | number of splice site mutations
 | | | nind | number of indels
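The q field in the table is described as a False Discovery Rate from the Benjamini-Hochberg procedure. The following is a minimal Python sketch of that procedure, offered only as an illustration; the function name bh_qvalues is hypothetical and this is not the MutSig implementation. Each p-value is scaled by (number of tests / rank), and the results are then made monotone from the largest p-value downward.

```python
def bh_qvalues(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input.

    Hypothetical helper for illustration; not taken from MutSig.
    """
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running
    # minimum so that q-values never increase as p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    # Made-up per-gene p-values, purely for demonstration.
    print(bh_qvalues([0.01, 0.04, 0.03, 0.005]))
```

Genes are then usually called significant when q falls below a chosen threshold; the threshold itself is an analysis choice and is not specified here.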


Q: But where do I get Firehose data to test my module?

A: This is described above.

Q: Your results archives have long and complicated names, what do they mean?

A: Our result archive and sample set nomenclature is described here. An older version, pertaining to results submitted before January 2013, is given here.

Each pipeline we execute results in a set of 6 archive files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive; tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives.
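Since each archive is accompanied by an MD5 checksum file, a downloaded archive can be checked before unpacking. The sketch below is one way to do this in Python. It assumes the checksum file follows the common md5sum layout, in which the hexadecimal digest is the first whitespace-separated token; that layout is an assumption, not something stated above, and the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches_checksum(archive_path, md5_path):
    """Compare an archive against its checksum file.

    Assumes (hypothetically) an md5sum-style file whose first
    whitespace-separated token is the expected hex digest.
    """
    with open(md5_path, "r") as handle:
        tokens = handle.read().split()
    if not tokens:
        return False
    return md5_of_file(archive_path) == tokens[0].lower()
```

A call such as archive_matches_checksum("results.tar.gz", "results.tar.gz.md5") returns True when the digests agree; both file names here are placeholders rather than real archive names.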