
ContEst Overview

ContEst is a tool (and method) for estimating the amount of cross-sample contamination in next-generation sequencing data.  Using a Bayesian framework, contamination levels are estimated from array-based genotypes and sequencing reads.

More information about the algorithm is presented in our Bioinformatics paper, available HERE.

For help with ContEst, please email contest-help@broadinstitute.org

Downloading ContEst

In order to run ContEst, you will need the main bundle (contest-*-bin.zip) as well as a population frequency file for your genome reference (HG18 or HG19/GRCh37).  The example data bundle is only needed if you want to run the example command on this page.

File                                               Link
contest-1.0.24530-bin.zip                          download
hg18_population_stratified_af_hapmap_3.3.vcf.gz    download
hg19_population_stratified_af_hapmap_3.3.vcf.gz    download
contest-1.0.24530-src.zip                          download
ContEst_example_data.zip                           download

In addition, if you'd like to build from source, you will need the source bundle (contest-*-src.zip).  Instructions for building from source can be found under "Building ContEst from source".

The exact binary used to create the data in the publication is available here.  The revisions above contain minor bug fixes.

Running ContEst

The tool is run using Java; the command to execute the tool looks like:

      java -jar ContEst.jar -T Contamination

The required command-line arguments for the tool:

  • -T Contamination - ContEst is based on the GATK; this tells the GATK to run the contamination tool
  • -B:genotypes,vcf <your.genotypes.vcf> - Your genotypes file (as a VCF), taken from array data. See below for information about how to convert Birdseed output into VCF.
  • -B:pop,vcf <population_AF_vcf.vcf> - The population allele frequencies for each SNP in HapMap
  • -BTI genotypes - drive the tool by the known genotypes for this sample
  • -I <your_bam.bam> - Your input BAM, containing the reads for the sample
  • -R <your_copy_of_hg19.fasta> - the FASTA file for the appropriate genome build

Common, optional parameters include:

  • -o <your_output_file.txt> - write the output to this file
  • -pc <precision> - the precision you wish to run the tool with (for example, 0.1 indicates you'd like to estimate contamination with 0.1 precision)
  • -sn <your_sample_name> - your sample name, as known in the genotypes VCF (optional if only one sample in the genotypes VCF)
  • -llc ?LANE_ - report estimates for each read group in the BAM
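
When ContEst is run over many samples, it can help to assemble the arguments listed above programmatically. The following is a minimal sketch, assuming Python 3; the function name build_contest_command and every file name in it are hypothetical placeholders, and only flags described in the lists above are used.

```python
import shlex


def build_contest_command(jar, bam, reference, genotypes_vcf, pop_vcf,
                          output=None, precision=None, sample_name=None):
    """Assemble a ContEst argument list from the flags documented above."""
    cmd = [
        "java", "-jar", jar,
        "-T", "Contamination",              # run the contamination tool
        "-I", bam,                          # sequencing reads for the sample
        "-R", reference,                    # FASTA for the matching genome build
        "-B:genotypes,vcf", genotypes_vcf,  # array genotypes as VCF
        "-B:pop,vcf", pop_vcf,              # population allele frequencies
        "-BTI", "genotypes",                # drive the tool by the known genotypes
    ]
    # Optional parameters are appended only when a value is supplied.
    if output is not None:
        cmd += ["-o", output]
    if precision is not None:
        cmd += ["-pc", str(precision)]
    if sample_name is not None:
        cmd += ["-sn", sample_name]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_contest_command(
    jar="ContEst.jar",
    bam="sample.bam",
    reference="hg19.fasta",
    genotypes_vcf="sample.genotypes.vcf",
    pop_vcf="hg19_population_stratified_af_hapmap_3.3.vcf",
    output="contamination_results.txt",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the placeholder paths point at real files.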

ContEst is a Java tool based on the Genome Analysis Toolkit (GATK), and many of its inputs are processed through the GATK's engine.  To get more information on how to run the tool, you can run the following command:

which produces the following output (along with help using the GATK in general):

Example ContEst Command

We've prepared an example data package (see the downloads section).  You'll also need to download the 1000 Genomes b37 reference file and the associated fai file:

To run the example, you'll need to have downloaded the ContEst binary zip file and hg19_population_stratified_af_hapmap_3.3.vcf.gz to your system.  The example data is based on two 1000 Genomes samples with low contamination levels, mixed together.  The command to run is:

java -Xmx2g -jar <ContEst_JAR_Location>/ContEst.jar \
-I <example_data_location>/chr20_sites.bam \
-R <reference_location>/human_g1k_v37.fasta \
-B:pop,vcf <hg19_population_stratified_af_hapmap_3.3_location>/hg19_population_stratified_af_hapmap_3.3.vcf \
-T Contamination \
-B:genotypes,vcf <example_data_location>/hg00142.vcf \
-BTI genotypes \
-o contamination_results_chr20.txt

This example will produce an output file which should look like the following:

name    population      population_fit  contamination   confidence_interval_95_width    confidence_interval_95_low      confidence_interval_95_high     sites
META    CEU     n/a     8.2     0.9     7.7     8.6     733

Here we can see that ContEst found that the file was approximately 8.2% contaminated, with a 95% confidence interval from 7.7% to 8.6%.
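
To read these values in a script, the output file can be parsed into one record per row. This is a minimal sketch, assuming the columns are separated by whitespace and that no value contains a space; the function name parse_contest_output is hypothetical, and the column names are taken from the example output above.

```python
def parse_contest_output(text):
    """Parse ContEst output text into a list of dicts keyed by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    # Pair each value with its column name; values stay as strings here.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# The example output from this page, reproduced as a string.
example = (
    "name    population      population_fit  contamination   "
    "confidence_interval_95_width    confidence_interval_95_low      "
    "confidence_interval_95_high     sites\n"
    "META    CEU     n/a     8.2     0.9     7.7     8.6     733\n"
)

rows = parse_contest_output(example)
meta = rows[0]
contamination = float(meta["contamination"])
low = float(meta["confidence_interval_95_low"])
high = float(meta["confidence_interval_95_high"])
sites = int(meta["sites"])
print(f"{meta['name']}: {contamination}% contaminated "
      f"(95% CI {low} to {high}, {sites} sites)")
```

For a real run, replace the example string with the contents of the file given to -o.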

Preparing Input Files

ContEst takes a series of input files representing the sequencing data, the population structure, and the array-based genotype calls.

Sequencing data

Sequencing data must be in the BAM file format (the binary version of the SAM format), which is standard across most modern sequencing technologies and sequencing centers; unfortunately, plain SAM input is not accepted.  More information about this file format can be found in this PDF, or on the Samtools site.  Since ContEst is built as a Genome Analysis Toolkit (GATK) tool, input sequencing data must also conform to the specifications set forward by the GATK.
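
Since a SAM file will not be accepted, a quick pre-flight check can catch the mistake before a long run. The sketch below assumes Python 3 and relies on two facts from the SAM/BAM specification: a BAM file is gzip-compatible (BGZF) compressed, and its decompressed content starts with the four bytes "BAM\1". The function name looks_like_bam is hypothetical, the demonstration files are tiny stand-ins rather than real alignments, and the check does not verify GATK-specific requirements.

```python
import gzip
import os
import tempfile


def looks_like_bam(path):
    """Return True if the file is gzip/BGZF data starting with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip data (for example a text SAM file), unreadable, or truncated.
        return False


# Demonstration with two stand-in files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_bam = os.path.join(tmp, "reads.bam")
    with gzip.open(fake_bam, "wb") as out:
        out.write(b"BAM\x01")
    fake_sam = os.path.join(tmp, "reads.sam")
    with open(fake_sam, "w") as out:
        out.write("@HD\tVN:1.0\n")
    bam_ok = looks_like_bam(fake_bam)
    sam_ok = looks_like_bam(fake_sam)

print("BAM stand-in accepted:", bam_ok)
print("SAM stand-in accepted:", sam_ok)
```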

Reference information

The reference file must be in the FASTA format and must meet these additional constraints.  That said, the common human reference sequences in FASTA format (e.g. HG18, HG19, b36, b37) will work as long as they match the sequence information encoded in your BAM file and array files.

Array based genotype data

Conversion of BirdSeed array input files to Variant Call Format (VCF) files

ContEst requires its array genotype input files to be in the Variant Call Format (VCF).  This format was developed for the 1000 Genomes project, and has become a standard in the genetics community for encoding information about variant calls, site identity, and other genomically positioned data (including structural variant information).  More information about the specification can be found on the 1000 Genomes website:

The VCF Specification

Many users of the ContEst tool will have their array-based calls in the Birdseed formats (after running the Birdseed suite on various array platforms; see the Birdseed website for more information).  A two-step conversion takes Birdseed files to the VCF calls that ContEst expects.

A new alternative pathway that converts from Birdseed call files to a VCF is available here.

Convert Birdseed files to GELI intermediate

The first step is to convert the Birdseed files into an intermediate GELI file; this is the precursor to the final VCF file required for ContEst.  The tool BirdseedSNPToGeli converts Birdseed files into GELI files.  The command to run this tool is:

Where:

  • birdseed.file is the input Birdseed file
  • sample.id is the name of the sample
  • sequence.dictionary is the sequence dictionary file (.dict file), available from the Picard toolset.  See directions here.
  • fasta is the FASTA file for the appropriate genome build
  • snp60.definition is the SNP 6.0 definition file, available for hg18 and hg19 here

When this is completed, you should have a GELI file as output.  This is fed into the next step, converting the GELI file into VCF.

Converting GELI files to a VCF input file

To convert a GELI input file to a VCF file, download the following tool, GeliToVCF.jar.zip.  The command to run it looks like:

Where:

  • tmp.dir is the location of temporary space on your hard drive
  • sample.id is the name of your sample
  • geli.file is the location of the input GELI file

Creating Population Frequency VCF file

HapMap population frequencies, mapped to both HG18 and HG19, are available in the downloads section.  However, if you would like to build your own population frequency file, construct a VCF with your frequencies represented in the INFO field in the following format:

For example:

This is a population with the name "CEU", where "A" is the reference base with a population frequency of 0.13030 and "G" is the non-reference base with a frequency of 0.86970 in this population.
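
In the CEU example above, the reference and non-reference frequencies add up to 1. If you build your own frequency file, a sanity check along these lines may catch data-entry errors before they reach ContEst. This is only a sketch: the dictionary layout and the function name frequencies_are_consistent are illustrative assumptions, not the INFO field format, and the sum-to-one rule is a check suggested here rather than a documented ContEst requirement.

```python
import math

# Allele frequencies from the CEU example above (reference base A, non-reference base G).
ceu = {"A": 0.13030, "G": 0.86970}


def frequencies_are_consistent(freqs, tol=1e-6):
    """Check that each frequency lies in [0, 1] and that together they sum to 1."""
    in_range = all(0.0 <= value <= 1.0 for value in freqs.values())
    sums_to_one = math.isclose(sum(freqs.values()), 1.0, abs_tol=tol)
    return in_range and sums_to_one


ceu_ok = frequencies_are_consistent(ceu)
print("CEU frequencies consistent:", ceu_ok)
```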

Building ContEst from source

ContEst is implemented as an external walker for the GATK.  In order to build ContEst, you should first be able to build the GATK source.  Instructions for building the GATK can be found here.

In addition, it is necessary to download the Apache BCEL Library and put it in either your system classpath, or your Ant installation lib directory.

Next, download the ContEst source bundle.  For this example, it is assumed that you have unpacked the bundle into /home/contest and that the GATK source is in /home/gatk.

From the GATK build directory, run:

ant -Dexternal.dir=/home/contest -Dexecutable=contest package

After a successful build, ContEst.jar can be found under /home/gatk/dist/packages/ContEst*/ContEst.jar.

Licensing

ContEst is distributed under a BSD-style license, which is also included in the distributions in the file LICENSE.TXT.

Copyright (c) 2011, The Broad Institute

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following
conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
