Downloading Firehose Data and Analysis Results
Searching our Analysis and Stddata Run Results
Adding Custom Data to Firehose: from external sources like other TCGA centers
Output Archive Nomenclature
Analyses Workflow in Firehose
- Directed graph of GDAC Firehose analysis tasks: the name on each node corresponds to the pipeline task(s) executed at that step, and is also reflected in the output archive for that task (see Nomenclature above). Here is a live version of the same graph, in which clicking on a graph node will bring you to the Nozzle report for the respective analysis task.
Expression Microarray processing
- Raw data (level 1), probe-level data (level 2), and gene-level data (level 3) of mRNA and miRNA expression data were downloaded from the DCC.
- The data processing is described in the TCGA OV paper.
Clinical Data Processing
- Preprocessor for TCGA Broad GDAC input data
- For the input MAF file, the preprocessor generates a sample-by-gene aberration matrix and filters out genes with low mutation rates.
- For the input expression data generated by the mRNAseq preprocessor in the stddata run, the preprocessor filters out low-variance genes and generates a sample-by-gene matrix.
- For the input copy number seg file, the preprocessor filters out duplicate regions and generates a sample-by-gene matrix.
- For each sample-by-gene matrix above, it then generates a matrix restricted to the samples shared across all platforms.
- Further details are available in the Nozzle report.
- Reproducibility of the result
- According to the author of the R package, reproducing exactly the same percentEV plot is not guaranteed due to the randomness in MCMC-EM simulation.
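The variance filter and sample intersection performed by the preprocessor described above can be sketched in a few lines of Python. This is an illustration only, not the Firehose implementation: the nested-dictionary layout (sample -> gene -> value), the function names, the toy gene and sample labels, and the `min_variance` cutoff are all assumptions, since the actual cutoffs are documented in the Nozzle report rather than here.

```python
from statistics import pvariance


def filter_low_variance_genes(matrix, min_variance):
    """Drop genes whose variance across samples is below min_variance.

    matrix maps sample -> {gene: value}; min_variance is a hypothetical cutoff.
    """
    genes = set()
    for row in matrix.values():
        genes.update(row)
    keep = [
        g for g in sorted(genes)
        if pvariance([row[g] for row in matrix.values() if g in row]) >= min_variance
    ]
    return {s: {g: row[g] for g in keep if g in row} for s, row in matrix.items()}


def intersect_samples(*matrices):
    """Restrict every matrix to the samples that appear in all of them."""
    shared = set(matrices[0])
    for m in matrices[1:]:
        shared &= set(m)
    return [{s: m[s] for s in sorted(shared)} for m in matrices]


# Toy data: GENE_A barely varies, GENE_B varies a lot.
expression = {
    "S1": {"GENE_A": 1.0, "GENE_B": 5.0},
    "S2": {"GENE_A": 1.0, "GENE_B": 9.0},
    "S3": {"GENE_A": 1.1, "GENE_B": 1.0},
}
mutation = {"S1": {"GENE_B": 1}, "S3": {"GENE_B": 0}, "S4": {"GENE_B": 1}}

filtered = filter_low_variance_genes(expression, min_variance=1.0)
expr_shared, mut_shared = intersect_samples(filtered, mutation)
print(sorted(expr_shared), sorted(mut_shared))
```

Filtering before intersecting keeps the variance estimate based on every available expression sample; whether the real pipeline applies the steps in that order is not stated here.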
- mRNAseq_preprocessor: Picks the "normalized_count" (quantile-normalized RSEM) value from the Illumina HiSeq/GA2 mRNAseq level_3 (v2) data set and builds a log2-transformed mRNAseq matrix for downstream analysis. To maximize sample counts we include both HiSeq and GA2 aliquots in each cohort dataset, but if a given patient has both HiSeq and GA2 aliquots the HiSeq aliquot takes precedence (to avoid double-counting a patient during analysis). The pipeline also creates a log2-transformed RPKM matrix from the HiSeq/GA2 mRNAseq level 3 (v1) data set.
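The HiSeq-over-GA2 precedence rule and the log2 transform can be illustrated with the short sketch below. It is a simplified reading of the description above, not the pipeline's code: the function names, the `(barcode, platform)` tuple layout, and the example barcodes are hypothetical, and mapping non-positive values to `None` is an assumption, since the handling of zeros is not described here. The patient is identified by the first 12 characters of the TCGA barcode (TCGA-XX-XXXX).

```python
import math


def pick_one_aliquot_per_patient(aliquots):
    """Keep one aliquot per patient, preferring HiSeq over GA2.

    aliquots is a list of (barcode, platform) tuples, with platform
    given as "hiseq" or "ga2" (labels chosen for this sketch).
    """
    chosen = {}
    for barcode, platform in aliquots:
        patient = barcode[:12]  # TCGA-XX-XXXX identifies the patient
        current = chosen.get(patient)
        if current is None or (platform == "hiseq" and current[1] != "hiseq"):
            chosen[patient] = (barcode, platform)
    return chosen


def log2_transform(values):
    """log2-transform expression values; non-positive or missing -> None (assumption)."""
    return [math.log2(v) if v is not None and v > 0 else None for v in values]


# Made-up barcodes for illustration only.
aliquots = [
    ("TCGA-AA-0001-01A-01R-0000-07", "ga2"),
    ("TCGA-AA-0001-01A-02R-0000-07", "hiseq"),
    ("TCGA-AA-0002-01A-01R-0000-07", "ga2"),
]
chosen = pick_one_aliquot_per_patient(aliquots)
print(chosen["TCGA-AA-0001"][1], chosen["TCGA-AA-0002"][1])
print(log2_transform([8.0, 0.0, 1.0]))
```

Patients with only a GA2 aliquot are retained, which is what lets the cohort include both platforms without counting any patient twice.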
- Z score calculation of RSEM/RPKM data:
Z = [(expression in single tumor sample) - (mean expression in all tumor samples)] / (standard deviation of expression in all tumor samples)
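As a worked example of the Z score above, the following sketch computes it for one gene across a set of tumor samples. The function name is hypothetical, and the choice of the sample standard deviation (`statistics.stdev`, n - 1 denominator) is an assumption; the text does not say whether the sample or population form is used.

```python
from statistics import mean, stdev


def z_scores(tumor_expression):
    """Z score of each tumor sample for a single gene.

    tumor_expression holds one gene's expression value in every tumor sample
    (at least two samples are needed for the standard deviation).
    """
    mu = mean(tumor_expression)
    sd = stdev(tumor_expression)  # sample SD: an assumption of this sketch
    if sd == 0:
        raise ValueError("expression is constant across samples; Z is undefined")
    return [(x - mu) / sd for x in tumor_expression]


z = z_scores([1.0, 2.0, 3.0])
print(z)
```

Note that the subtraction is carried out before the division, so each value is centered on the cohort mean and then scaled.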
- miRseq_preprocess: Picks the "RPM" (reads per million miRNA precursor reads) value from the Illumina HiSeq/GA miRNAseq Level_3 data set and builds a log2-transformed matrix. The preprocessor removes all records with NA values, which may lower the number of miRs used and reported during pipeline execution.
- miRseq_mature_preprocess: Generates a matrix of mature-strand values ("reads per million miRNA mature reads") from the Illumina HiSeq/GA miRNAseq Level_3 data set. Mature strands carry a MIMAT accession in the annotation; for each mature strand, all of its isoforms are collected via the annotation and their RPM values are summed (one sum per mature strand per sample). The per-sample results are then merged into one table and log2-transformed.
- mRNA_Preprocess_Median: Picks the matrix for the platform (Affymetrix HG U133, Affymetrix Exon Array, or Agilent gene expression) with the largest number of samples and writes it out.
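The isoform summation described for miRseq_mature_preprocess, together with the NA filtering described for miRseq_preprocess, might look roughly like the sketch below. It is not the pipeline's code: the `(sample, accession, rpm)` row layout, the function names, and the accession strings are placeholders for illustration, and treating a mature strand with no value in some sample as an NA record is an assumption.

```python
import math
from collections import defaultdict


def sum_mature_rpm(isoform_rows):
    """Sum isoform RPM per mature strand and sample, then log2-transform.

    isoform_rows is an iterable of (sample, mimat_accession, rpm) tuples.
    Returns {mimat_accession: {sample: log2(summed RPM)}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for sample, mimat, rpm in isoform_rows:
        totals[mimat][sample] += rpm  # one sum per mature strand per sample
    return {
        mimat: {s: (math.log2(v) if v > 0 else None) for s, v in per_sample.items()}
        for mimat, per_sample in totals.items()
    }


def drop_na_records(table, samples):
    """Remove every record that lacks a value for any of the given samples."""
    return {
        mimat: row for mimat, row in table.items()
        if all(row.get(s) is not None for s in samples)
    }


# Placeholder accessions and values, for illustration only.
rows = [
    ("S1", "MIMAT_X", 3.0),
    ("S1", "MIMAT_X", 5.0),  # second isoform of the same mature strand
    ("S2", "MIMAT_X", 2.0),
    ("S1", "MIMAT_Y", 4.0),  # no S2 value, so this record has an NA
]
table = sum_mature_rpm(rows)
complete = drop_na_records(table, ["S1", "S2"])
print(table)
print(sorted(complete))
```

Dropping incomplete records is what reduces the number of miRs reported downstream, as noted above.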
- Preprocessor (includes recent recommendations for improvement)
- Oncotator is used to substantially improve the consistency and utility of TCGA mutation annotation files (MAFs):
- Mutation Significance
- Mutation_CoOccurrence: This pipeline generates the input file for the iCoMut figure, taking its own input from the Aggregate_AnalysisFeatures pipeline. It uses the following copy number thresholds: arm.gain=2.25, arm.amplification=3, arm.loss=1.75, arm.deletion=1.5, focal.gain=3, focal.amplification=5, focal.loss=1.5, focal.deletion=1. Copy number amplification, gain, loss, deletion, and all other values are then coded as 4, 3, 2, 1, and 0 respectively, with missing values coded as 5. For the mRNA expression data, each gene is median-centered across all samples.
- CHASM (Karchin Lab, Johns Hopkins University)
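The copy number coding and median centering described for Mutation_CoOccurrence above can be sketched as follows. The threshold values and the 4/3/2/1/0/5 codes are taken from the text; everything else is an assumption of this sketch, in particular the function names and the use of inclusive comparisons (`>=` for gains, `<=` for losses), which the text does not specify.

```python
from statistics import median

# Thresholds as listed for the Mutation_CoOccurrence pipeline.
THRESHOLDS = {
    "arm": {"gain": 2.25, "amplification": 3.0, "loss": 1.75, "deletion": 1.5},
    "focal": {"gain": 3.0, "amplification": 5.0, "loss": 1.5, "deletion": 1.0},
}


def encode_copy_number(value, level):
    """Map a copy number value to 4/3/2/1/0, or 5 when it is missing.

    level is "arm" or "focal". Inclusive comparisons are an assumption.
    """
    if value is None:
        return 5  # missing value
    t = THRESHOLDS[level]
    if value >= t["amplification"]:
        return 4
    if value >= t["gain"]:
        return 3
    if value <= t["deletion"]:
        return 1
    if value <= t["loss"]:
        return 2
    return 0  # no call


def median_center(gene_values):
    """Subtract one gene's median across all samples from each of its values."""
    m = median(gene_values)
    return [v - m for v in gene_values]


print([encode_copy_number(v, "arm") for v in (3.2, 2.5, 2.0, 1.6, 1.2, None)])
print(median_center([1.0, 2.0, 6.0]))
```

The same value can receive different codes at the arm and focal levels: 3.2 counts as an amplification for an arm but only as a gain for a focal event.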
Copy Number Pipelines
- GISTIC 2.0
- CNMFclustering: two variants, one based on real copy number data and the other based on thresholded values
- Gene By Sample
Feature Table: Aggregate_AnalysisFeatures
The purpose of this pipeline is to aggregate the most important findings across ALL pipelines in the GDAC Firehose analysis workflow, into a single feature table. At present the feature table represents the samples by selected significant events (copy number alterations, somatic mutations, marker genes in each mRNAseq clustering subtype, clinical features and clustering results). The first column of the table is the sample id, with the remaining columns representing the analysis features as described here:
- Clinical features: start with “CLI_” followed by the feature name. The clinical file (*.merged.txt) comes from the Append_CustomClinical pipeline.
- Clustering results: start with “CLUS_” followed by platform_method (e.g. CLUS_mRNAseq_cHierarchical). The cluster file (*.mergedcluster.txt) comes from the Aggregate_Molecular_Subtype_Clusters pipeline.
- Somatic mutation genes: start with “SMG_” followed by version number (mutsig2.0, cv, 2cv)_gene name (e.g. SMG_mutsig.2CV_FAM47C). The genes are taken from the significant gene list (*.sig_genes.txt) produced by Mutsig2CV. The numbers in each row of a given SMG column indicate the type of mutation, with 0 denoting that no mutation was detected.
- Somatic mutation gene expression: start with “SMG_” followed by gene name_mRNA (e.g. SMG_KRT3_mRNA). The mRNA expression (*.uncv2.mRNAseq_RSEM_normalized_log2.txt) comes from the mRNAseq_preprocessor pipeline.
- Mutation rate: rate_non (nonsynonymous) and rate_sil (synonymous). The mutation rate (patient_counts_and_rates.txt) comes from Mutsig2CV.
- Marker genes in each mRNAseq clustering subtype: start with “mRNA_” followed by CNMF_gene name_difference_cluster number (e.g. mRNA_CNMF_FAM66E_.0.6_2). In each cluster, the top 5 up-regulated and top 5 down-regulated genes were selected.
- Significant copy number alterations as reported by GISTIC:
- Focal amplification: start with “Amp_” followed by cytoband (e.g. Amp_1q32.1)
- Focal deletion: start with “Del_” followed by cytoband (e.g. Del_1p36.32)
- Arm level amplification: start with “CN_” followed by arm_Amp (e.g. CN_10p_Amp)
- Arm level deletion: start with “CN_” followed by arm_Del (e.g. CN_10p_Del)
- Copy number altered genes with expression: start with “Amp/Del_” followed by gene name_cytoband_mRNA (e.g. Amp_SOX2_3q26.32_mRNA and Del_PARK2_6q24.3_mRNA)
- Supplemented with copy number altered genes from our master list, built from the pan-cancer CNVs in Zack et al. 2013
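Because every column in the feature table encodes its category in its name, the columns can be sorted into groups by prefix and suffix alone. The sketch below shows one way to do that using the naming conventions listed above; the function name and the category labels it returns are hypothetical and are not part of the Aggregate_AnalysisFeatures output.

```python
def classify_feature(column):
    """Return a category label for a feature-table column name.

    The labels are invented for this sketch; only the prefixes and
    suffixes come from the naming conventions of the feature table.
    """
    if column.startswith("CLI_"):
        return "clinical"
    if column.startswith("CLUS_"):
        return "clustering"
    if column.startswith("SMG_"):
        return "smg_expression" if column.endswith("_mRNA") else "smg_mutation"
    if column in ("rate_non", "rate_sil"):
        return "mutation_rate"
    if column.startswith("mRNA_"):
        return "marker_gene"
    if column.startswith("CN_"):
        if column.endswith("_Amp"):
            return "arm_amplification"
        if column.endswith("_Del"):
            return "arm_deletion"
        return "unknown"
    if column.startswith(("Amp_", "Del_")):
        if column.endswith("_mRNA"):
            return "cn_gene_expression"
        return "focal_amplification" if column.startswith("Amp_") else "focal_deletion"
    return "unknown"


# Column names taken from the examples in the list above.
for name in ("CLUS_mRNAseq_cHierarchical", "SMG_mutsig.2CV_FAM47C",
             "SMG_KRT3_mRNA", "Amp_1q32.1", "Del_PARK2_6q24.3_mRNA",
             "CN_10p_Del", "mRNA_CNMF_FAM66E_.0.6_2"):
    print(name, "->", classify_feature(name))
```

The suffix check matters for the “SMG_” and “Amp_/Del_” prefixes, each of which is shared between an event column and its matching expression column.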
Adding New Code to Firehose