ContEst is a tool (and method) for estimating the amount of cross-sample contamination in next-generation sequencing data. Using a Bayesian framework, it estimates contamination levels from array-based genotypes and sequencing reads.
More information about the algorithm is presented in our Bioinformatics paper, available HERE.
For help with ContEst, please email firstname.lastname@example.org
In order to run ContEst, you will need the main bundle (contest-*-bin.zip) as well as a population frequency file for your genome reference (HG18 or HG19/GRCh37). An example data bundle is also provided in case you want to run the example command on this page.
In addition, if you'd like to build from source you will need the source bundle (contest-*-src.zip). Instructions for building from source can be found under "Building ContEst from source".
The exact binary used to create the data in the publication is available here. The revisions above contain minor bug fixes.
The tool is run using Java; the command to execute the tool looks like:
java -jar ContEst.jar -T Contamination
The required command-line arguments for the tool:
- -T Contamination - ContEst is based on the GATK; this is telling the GATK to run the contamination tool
- -B:genotypes,vcf <your.genotypes.vcf> - Your genotypes file (as a VCF), taken from array data. See below for information about how to convert Birdseed output into VCF.
- -B:pop,vcf <population_AF_vcf.vcf> - The population allele frequencies for each SNP in HapMap
- -BTI genotypes - drive the tool by the known genotypes for this sample
- -I <your_bam.bam> - Your input BAM, containing the reads for the sample
- -R <your_copy_of_hg19.fasta> - the FASTA file for the appropriate genome build
Common, optional parameters include:
- -o <your_output_file.txt> - write the output to this file
- -pc <precision> - the precision you wish to run the tool with (for example, 0.1 indicates you'd like to estimate contamination with 0.1 precision)
- -sn <your_sample_name> - your sample name, as known in the genotypes VCF (optional if only one sample in the genotypes VCF)
- -llc ?LANE_ - report estimates for each read group in the BAM
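The arguments listed above can be assembled programmatically, which helps when running the tool over many samples. The following is a minimal Python sketch, not part of the ContEst distribution: the function name build_contest_command and its parameter names are hypothetical, and only the flags documented above are used.

```python
from typing import List, Optional


def build_contest_command(
    genotypes_vcf: str,
    population_vcf: str,
    bam: str,
    reference_fasta: str,
    output: Optional[str] = None,
    precision: Optional[float] = None,
    sample_name: Optional[str] = None,
    jar: str = "ContEst.jar",
) -> List[str]:
    """Assemble a ContEst argument list from the flags described above."""
    # Required arguments, in the order they are documented.
    cmd = [
        "java", "-jar", jar,
        "-T", "Contamination",
        "-B:genotypes,vcf", genotypes_vcf,
        "-B:pop,vcf", population_vcf,
        "-BTI", "genotypes",
        "-I", bam,
        "-R", reference_fasta,
    ]
    # Optional arguments are appended only when a value is supplied.
    if output is not None:
        cmd += ["-o", output]
    if precision is not None:
        cmd += ["-pc", str(precision)]
    if sample_name is not None:
        cmd += ["-sn", sample_name]
    return cmd


if __name__ == "__main__":
    # Placeholder file names taken from the argument descriptions; use your own paths.
    args = build_contest_command(
        "your.genotypes.vcf", "population_AF_vcf.vcf", "your_bam.bam",
        "your_copy_of_hg19.fasta", output="your_output_file.txt", precision=0.1,
    )
    print(" ".join(args))
```

The resulting list can be passed directly to subprocess.run, which avoids shell quoting problems with file paths.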
ContEst is a Java tool based on the Genome Analysis Toolkit (GATK), and many of its inputs are processed through the GATK's engine. To get more information on how to run the tool, you can run the following command:
which produces the following output (along with help using the GATK in general):
Example ContEst Command
We've prepared an example data package (see the downloads section). You'll also need to download the 1000 Genomes b37 reference file and the associated fai file:
To run the example, you'll need to have downloaded the ContEst binary zip file and hg19_population_stratified_af_hapmap_3.3.vcf.gz to your system. The example data is based on two 1000 Genomes samples with low contamination levels, mixed together. The command to run is:
This example will produce an output file which should look like the following:
Here we can see that ContEst found that the file was approximately 8.2 percent contaminated, with a 95% confidence interval from 7.7 to 8.6.
Preparing Input Files
ContEst takes a series of input files representing the sequencing data, the population allele frequencies, and the array-based genotype calls.
Sequencing data must be in the BAM file format (the binary version of the SAM format; unfortunately, plain SAM files are not accepted). BAM is standard across most modern sequencing technologies and sequencing centers. More information about this file format can be found in this PDF, or on the Samtools site. Since ContEst is built as a Genome Analysis Toolkit (GATK) tool, input sequencing data must also conform to the specifications set forward by the GATK.
The reference file must be in the FASTA format and must also meet these additional constraints. That said, the common human reference sequences in FASTA format (e.g. HG18, HG19, b36, b37) will work as long as they match the sequence information encoded in your BAM file and array files.
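Because SAM files are not accepted, it can be useful to confirm that an input really is a BAM before launching a run. The sketch below relies on two facts from the SAM/BAM specification: BAM files are BGZF-compressed (readable with Python's gzip module), and the decompressed stream begins with the four magic bytes "BAM" followed by the byte 1. The function name looks_like_bam is hypothetical, and the check does not validate anything beyond the magic bytes.

```python
import gzip


def looks_like_bam(path: str) -> bool:
    """Return True if the file starts with the BAM magic bytes after decompression."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip/BGZF-compressed (for example a plain-text SAM file), or truncated.
        return False
```

A file that fails this check would need to be converted to BAM, for example with Samtools, before being passed to -I.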
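One way to check that a reference matches a BAM file is to compare contig names and lengths. The sketch below is an illustration, not a ContEst feature: it assumes you have the text of the FASTA index (.fai, tab-separated with the contig name and length in the first two columns) and the text of the BAM header (for example from samtools view -H), whose @SQ lines carry SN (name) and LN (length) tags. All function names are hypothetical.

```python
from typing import Dict, List


def contigs_from_fai(fai_text: str) -> Dict[str, int]:
    """Parse a FASTA index: contig name in column 1, length in column 2."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def contigs_from_sam_header(header_text: str) -> Dict[str, int]:
    """Parse @SQ lines (SN = name, LN = length) from SAM-style header text."""
    contigs = {}
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            # Each field looks like TAG:value; split only on the first colon.
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def mismatched_contigs(reference: Dict[str, int], bam_header: Dict[str, int]) -> List[str]:
    """List BAM contigs that are absent from the reference or differ in length."""
    return [name for name, length in bam_header.items() if reference.get(name) != length]


if __name__ == "__main__":
    # Toy values for illustration only; real contigs are far longer.
    fai = "1\t1000\t52\t60\t61\n2\t800\t1100\t60\t61\n"
    header = "@HD\tVN:1.0\n@SQ\tSN:1\tLN:1000\n@SQ\tSN:chr2\tLN:800\n"
    print(mismatched_contigs(contigs_from_fai(fai), contigs_from_sam_header(header)))
```

An empty list suggests the BAM was aligned to a reference with the same contig names and lengths. A typical mismatch is "chr1"-style names in one file and "1"-style names in the other.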
Array based genotype data
Conversion of BirdSeed array input files to Variant Call Format (VCF) files
ContEst requires its genotype and population frequency input files to be in the Variant Call Format (VCF). This format was developed for the 1000 Genomes Project and has become a standard in the genetics community for encoding information about variant calls, site identity, and other genomically positioned data (including structural variant information). More information about the specification can be found on the 1000 Genomes website:
Many users of the ContEst tool will have their array-based calls in the Birdseed format (after running the Birdseed suite on various array platforms; see the Birdseed website for more information). A two-step conversion takes Birdseed files to the VCF format calls ContEst is looking for.
A new alternate pathway that converts from Birdseed call files to a VCF is available here.
Convert Birdseed files to GELI intermediate
The first step is to convert the Birdseed files into an intermediate GELI file; this is the precursor to the final VCF file required for ContEst. The tool BirdseedSNPToGeli converts Birdseed files into GELI files. The command to run this tool is:
- birdseed.file is the input Birdseed file.
- sample.id is the name of the sample.
- sequence.dictionary is the sequence dictionary file (.dict file), which can be created with the Picard toolset. See directions here.
- fasta is the FASTA file for the appropriate genome build.
- snp60.definition is the SNP 6.0 definition file; available for hg18 and hg19 here.
When this is completed, you should have a GELI file as output. This is fed into the next step, converting the GELI file into VCF.
Converting GELI files to a VCF input file
To convert a GELI input file to a VCF file, download the following tool, GeliToVCF.jar.zip. The command to run it looks like:
- tmp.dir is the location of temporary space on your hard drive
- sample.id is the name of your sample
- geli.file is the location of the input GELI file
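After conversion, it may be worth confirming which sample names the resulting VCF contains, since -sn is only optional when the genotypes VCF holds a single sample. The sketch below is a hypothetical helper (vcf_sample_names is not part of ContEst). It relies only on the VCF convention that the "#CHROM" header line has nine fixed, tab-separated columns followed by one column per sample.

```python
from typing import List


def vcf_sample_names(vcf_text: str) -> List[str]:
    """Return the sample names from the #CHROM header line of VCF text."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            # Columns 1-9 are CHROM .. FORMAT; anything after is a sample name.
            return line.split("\t")[9:]
    raise ValueError("no #CHROM header line found")


if __name__ == "__main__":
    # A made-up single-sample VCF; the sample name is a placeholder.
    example = (
        "##fileformat=VCFv4.0\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmy_sample\n"
        "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\n"
    )
    print(vcf_sample_names(example))
```

If more than one name is returned, pass the one matching your BAM to ContEst with -sn.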
Creating Population Frequency VCF file
HapMap population frequencies are available in the download sections, mapped to both HG18 and HG19. However, if you would like to build your own population frequency file, simply construct a VCF with your frequencies represented in the INFO field with the following format:
This entry describes a population named "CEU" in which "A" is the reference base with a population frequency of 0.13030 and "G" is the non-reference base with a population frequency of 0.86970.
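When building your own frequency file, a simple sanity check is that the allele frequencies for each population lie between 0 and 1 and sum to one, as the CEU values do (0.13030 + 0.86970 = 1.0). The sketch below takes frequencies that have already been parsed into a dictionary and does not assume any particular INFO field syntax. The function name and tolerance are hypothetical, and whether ContEst itself enforces this constraint is not stated here.

```python
import math
from typing import Dict


def frequencies_are_consistent(allele_freqs: Dict[str, float], tol: float = 1e-6) -> bool:
    """Check that each frequency is in [0, 1] and that they sum to 1 within tol."""
    in_range = all(0.0 <= freq <= 1.0 for freq in allele_freqs.values())
    return in_range and math.isclose(sum(allele_freqs.values()), 1.0, abs_tol=tol)


if __name__ == "__main__":
    # The CEU example: reference base A and non-reference base G.
    print(frequencies_are_consistent({"A": 0.13030, "G": 0.86970}))
```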
Building ContEst from source
ContEst is implemented as an external walker for the GATK. In order to build ContEst, you should first be able to build the GATK source. Instructions for building the GATK can be found here.
In addition, it is necessary to download the Apache BCEL Library and put it in either your system classpath, or your Ant installation lib directory.
Next, download the ContEst source bundle. For this example, it is assumed that you have unpacked the bundle into /home/contest and that the GATK source is in /home/gatk.
From the GATK build directory, run:
After a successful build, ContEst.jar can be found under /home/gatk/dist/packages/ContEst*/ContEst.jar.
ContEst is distributed under a BSD-style license, which is also included in the distributions in the file LICENSE.TXT.