Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Panel
Anchor
documentation
Q: Where is your documentation?

A: Though the data and software in our pipeline is constantly evolving, we believe that process clarity & operational transparency streamlines efforts and ultimately improves science. We therefore endeavor to provide a reasonable level of background data processing and algorithm documentation, given our time, resource, and priority constraints. In addition, we generate hundreds of analysis reports per month, each containing detailed summaries, figures, and tables, as well as literature references and links to other documentation on the algorithmic codes in our pipeline. For each run we also provide a summary report of samples ingested, and analysis notes and data notes. Our pipeline nomenclature is described below, and further description of the TCGA data formats is available here. Finally, the analysis tasks in the latest run are shown below as a directed graph, which you may click to expand, and then click upon any enabled nodes to view the Nozzle report generated for that analysis result.


...

Panel
Anchor
accessing
Q: How or where can I access the inputs and/or results of a run?

A: In one of several ways, all of which are governed by TCGA data usage policy (and note that only the TCGA DCC requires password access, all Firehose and FireBrowse mechanisms are completely open for public use):

           from which you may simply navigate to the tumor type and run date of interest. More information on the nomenclature and content of these files is given below. Microsoft Windows-based users can use the WinRAR utility to unpack the archive files, while Unix and Apple Mac OS/X users can use the gzip and/or tar utilities.

...

Panel
Anchor
sampleTypes
Q: What TCGA sample types are Firehose pipelines executed upon?

A:  Since inception Firehose analyses have been executed upon tumor samples and then correlated with clinical data.   Nearly all analyses utilize primary solid tumor samples (numeric code 01, short letter code "TP" as given in the TCGA sample type codes table), with two exceptions:

  1. melanoma (SKCM), where we analyze metastatic tumor samples (numeric code 06, short letter code "TM")
  2. acute myeloid leukemia (LAML), which we analyze using primary blood-derived samples (numeric code 03, short letter code "TB")
Also note that each stddata run dashboard contains a samples summary report, which explains why – even though our GDAC mirrors ALL data from the DCC on a daily basis – not every sample is ingested into Firehose*.
*Specifically, we filter out ALL samples listed as Redacted in the TCGA Annotations Manager, and FFPE samples are only available in standard data archives, not analyses.

Programmatically, the FireBrowse patients api will give you a list of all patients in each cohort, either in bulk (all cohorts) or any subset you chose. It doesn’t give the complete aliquot barcode yet, but will in the very near future.  In addition to playing with this API interactively through the FireBrowse UI, there are also  Python, Unix, and R bindings, and even a pip-installable package.

Panel
Anchor
sampleTypesDiscovery
Q: How do I analyze samples that aren't included in your Firehose runs (e.g. Blood Normals, Solid Tissue Normals, etc.)?

A: All analysis-ready patient samples are available in our stddata archives; this will include matched normals, where available (but note that the so-called TCGA control samples are not included). Normals can be identified by inspection of the barcode schema below, in conjunction with the TCGA code tables. You can obtain the stddata archives using our firehose_get utility or by traversing the FireBrowse user interface or stddata API. Each sample in the archive is identified by a TCGA Barcode that contains the sample type. As shown below, the Sample portion of the barcode can be looked up in the sample type code table available here (as can the tissue source site, aka TSS, et cetera). In addition, FireBrowse makes much of this information available programmatically in its metadata API.

TCGA Barcode Description: As described here, a batch is uniquely determined by the first shipment of a group of analytes (or plate) from the Biospecimen Core Resource. So, in most cases the plate number of a sample is effectively synonymous with the batch id of the sample; an exception to this is when additional analytes from a participant are subsequently shipped the batch id will remain fixed at the first plate number.

...

Panel
Q: But where do I get Firehose data to test my module?

A: This is described above.

Panel
Anchor
nomenclature
Q: Your results archives have long and complicated names, what do they mean?

A: Our result archive and sample set nomenclature is described here.   An older version, pertaining to results submitted before January 2013, is given here.

Each pipeline we execute results in a set of 6 archive files being submitted to the DCC: primary results in the Level_* archive; auxiliary data (e.g. debugging information) in the aux archive, tracking information in the mage-tab archive; and an MD5 checksum file for each. In most cases you will only need the primary results in the Level_* archives.

...

Panel
Q: What can you tell me about the RPPA data?

A: RPPA stands for "reverse phase protein array," which are data generated by the M.D. Anderson Cancer Center as described here. The MDACC also hosts the TCPA portal, which serves clean, batch-corrected RPPA data that may be preferred for your analysis over the uncorrected data deposited directly to the TCGA data coordination center.

...