Two major challenges in somatic mutation detection in cancer are the heterogeneous nature of tumors and the potential for cross-contamination between tumor and normal samples. These challenges also mean that while a somatic mutation detection algorithm may work well for one tumor, it might work poorly for another.
SomaticSeq, an open-source, accurate somatic mutation detection pipeline, integrates five state-of-the-art somatic mutation callers: MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict.6 SomaticSeq combines their output into a high-confidence somatic mutation call set for both single nucleotide variants and small indels.
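As a rough illustration of what integrating several callers can look like, the sketch below takes the union of the variants reported by each caller and records, for every variant, which callers reported it. This is only a minimal sketch and not SomaticSeq's actual code: the function name `merge_calls`, the tuple layout of a variant, and the example calls are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]

# The five callers named in the text, in a fixed order.
CALLERS = ["MuTect", "SomaticSniper", "VarScan2", "JointSNVMix2", "VarDict"]


def merge_calls(calls_by_caller: Dict[str, Set[Variant]]) -> Dict[Variant, List[int]]:
    """For every variant in the union of all call sets, return one 0/1 flag per caller."""
    union: Set[Variant] = set()
    for caller in CALLERS:
        union |= calls_by_caller.get(caller, set())
    return {
        variant: [int(variant in calls_by_caller.get(caller, set())) for caller in CALLERS]
        for variant in sorted(union)
    }


# Made-up calls, for illustration only (callers with no calls are simply absent).
example_calls = {
    "MuTect": {("chr1", 100, "A", "T"), ("chr2", 500, "G", "C")},
    "VarScan2": {("chr1", 100, "A", "T")},
    "VarDict": {("chr1", 100, "A", "T"), ("chr3", 42, "C", "CA")},
}

merged = merge_calls(example_calls)
for variant, flags in merged.items():
    print(variant, dict(zip(CALLERS, flags)))
```

The per-caller flags produced this way are the simplest kind of feature a downstream classifier could use; a real pipeline would add many more per-call measurements.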
SomaticSeq implements a stochastic boosting machine learning algorithm. By combining the somatic mutation calls from these five tools, we can generate up to 72 genomic and sequencing features for each call. We use the SomaticSeq trainer to create a trained model based on these features. The trained model can then be used to call high-confidence somatic mutations in test data sets with similar characteristics, such as comparable sequencing depth and identical data preprocessing protocols.
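The train-then-apply workflow described above can be sketched with a toy boosted ensemble in plain Python. This is not the SomaticSeq trainer, and it is not claimed to match the algorithm SomaticSeq uses: it is an AdaBoost-style ensemble of decision stumps in which each round fits its stump on a random subsample of the training calls, standing in for the "stochastic" element. The two features (number of supporting callers, tumor variant allele fraction), the training data, and the names `train` and `predict` are assumptions made for illustration.

```python
import math
import random


def stump_predict(row, feature, threshold, polarity):
    """A decision stump: vote `polarity` if the feature exceeds the threshold."""
    return polarity if row[feature] > threshold else -polarity


def fit_stump(X, y, w, rows):
    """Pick the stump with the lowest weighted error on the sampled rows."""
    total = sum(w[i] for i in rows)
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({X[i][feature] for i in rows}):
            for polarity in (1, -1):
                wrong = sum(w[i] for i in rows
                            if stump_predict(X[i], feature, threshold, polarity) != y[i])
                err = wrong / total
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best


def train(X, y, rounds=10, subsample=0.7, seed=0):
    """Boost stumps; each round sees only a random subsample of the calls."""
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        rows = rng.sample(range(n), max(1, int(subsample * n)))
        err, feature, threshold, polarity = fit_stump(X, y, w, rows)
        err = min(max(err, 1e-6), 1 - 1e-6)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((feature, threshold, polarity, alpha))
        # Up-weight calls this stump got wrong, then renormalize.
        w = [w[i] * math.exp(-alpha * y[i] * stump_predict(X[i], feature, threshold, polarity))
             for i in range(n)]
        norm = sum(w)
        w = [v / norm for v in w]
    return model


def predict(model, row):
    """Weighted vote of all stumps: 1 = somatic, -1 = not somatic."""
    score = sum(alpha * stump_predict(row, f, t, p) for f, t, p, alpha in model)
    return 1 if score > 0 else -1


# Made-up training calls: [number of supporting callers, tumor allele fraction].
train_X = [[5, 0.30], [4, 0.25], [3, 0.12], [4, 0.40],
           [1, 0.03], [2, 0.05], [1, 0.20], [2, 0.02]]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]

model = train(train_X, train_y)

# Apply the trained model to new calls with the same feature layout.
print(predict(model, [5, 0.35]))
print(predict(model, [1, 0.02]))
```

The model is only meaningful for test calls whose features were produced the same way as the training features, which is why the text restricts its use to data sets with similar characteristics.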
The SomaticSeq pipeline has been validated with the DREAM Challenge data (Figure 4), in silico titration of two genomes, as well as real tumor data.