Spinella JF, Mehanna P, Vidal R, Saillour V, Cassart P, Richer C, Ouimet M, Healy J, Sinnett D. SNooPer: a machine learning-based method for somatic variant identification from low-pass next-generation sequencing. BMC Genomics. 2016 Nov 14;17(1):912. PubMed PMID: 27842494; PubMed Central PMCID: PMC5109690.
-Can I apply my trained model to a different dataset ?
Yes, once trained, a model can be applied to any sequencing dataset that has comparable technical characteristics (same capture kit, same sequencing design, same sequencing technology, same mapper, etc).
-What if several aligners have been used in my dataset ?
For optimal results, you should use the same mapping and cleaning process for all your samples.
-What if the sequencing depth of my data varies across my samples ?
Variations in coverage between samples from the same sequencing experiment, or from different sequencing experiments with the same design, should not influence somatic variant calling. Rather than absolute values, most of SNooPer's features are normalized by the corresponding median value, calculated from randomly extracted subsets of variants (from the mpileup files). This reduces, for example, the influence of sequencing depth variability.
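The effect of median normalization can be illustrated with a minimal Python sketch. It is not SNooPer's own code: the function name, the subset size and the depth values are assumptions made for the example, and the exact set of features SNooPer normalizes is described in the manual.

```python
import random
import statistics

def median_normalize(values, subset_size=1000, seed=0):
    """Divide each value by the median of a randomly drawn subset of the values.

    subset_size and seed are illustrative assumptions, not SNooPer settings.
    """
    rng = random.Random(seed)
    k = min(subset_size, len(values))
    subset = rng.sample(values, k)
    med = statistics.median(subset)
    if med == 0:
        raise ValueError("median of the subset is zero; cannot normalize")
    return [v / med for v in values]

# Two hypothetical samples with the same depth profile, sequenced at
# roughly 30x and 90x. After normalization the features are identical.
low_depth = [24, 30, 36, 30, 27]
high_depth = [72, 90, 108, 90, 81]
print(median_normalize(low_depth))
print(median_normalize(high_depth))
```

Because both samples are expressed relative to their own median, a model trained on one of them sees the same feature values in the other.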
-Are trained models readily available in SNooPer if validation sets cannot be acquired ?
Some models are already available at
Please read the associated README.txt file to ensure compatibility with your dataset. If the models fit the dataset at hand, SNooPer’s training phase can be bypassed and the readily available models applied directly to call somatic variants.
-Can I use germline polymorphisms as a validation dataset for somatic calling ?
No, a subset of validated somatic variants is ideal. Low frequency tumor alleles that arise in subclonal tumor cell populations may be missed during the calling phase if only germline variants are used to train the model.
-What input file(s) should I provide to SNooPer to train a model ?
For the training phase, the user has to provide 2 types of files:
1. pileup files (.pu) with characteristics similar to those of the test dataset to which the trained model will be applied.
2. validation files (.vcf), ideally containing orthogonal validations of the positions contained in the pileup files.
If you are running a somatic analysis, you must also provide the matched normal sequencing data (mpileup format) as input. Furthermore, validation files should only contain somatic variations.
Note that each position contained in the pileup files has to be tested, so that its class (actual variant or error) is known by comparison with the .vcf files. If a variant is present in the corresponding validation file, it will be considered an actual variant. If the variant is absent from the validation file, it will be considered an error.
*Please read SNooPer's manual for more information regarding the input files.
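The labelling rule described above (present in the validation file means actual variant, absent means error) can be sketched in Python as follows. This is only an illustration of the rule, not SNooPer's implementation: the function names and the example records are hypothetical, and only the first five standard VCF columns (CHROM, POS, ID, REF, ALT) are read.

```python
def vcf_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from the lines of a VCF file."""
    keys = set()
    for line in lines:
        # Skip blank lines and header lines (both "##" and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # A record may list several comma-separated alternate alleles.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys

def label_candidates(candidates, validated):
    """Label a candidate 'variant' if it was validated, otherwise 'error'."""
    validated = set(validated)
    return {c: ("variant" if c in validated else "error") for c in candidates}

# Hypothetical validation file content and candidate positions.
validation_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\t.\t.",
]
candidates = [("chr1", 12345, "A", "G"), ("chr1", 22222, "C", "T")]
labels = label_candidates(candidates, vcf_keys(validation_lines))
print(labels)
```

This also shows why every position in the pileup files has to be tested: an untested true variant would be labelled as an error and would mislead the training.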
-What input file(s) should I provide to SNooPer to call my variants ?
For the classification phase or to evaluate a model, the user only has to provide the path to the model to be applied and pileup files from the test dataset.
-Do I need the related normal data for a somatic analysis ?
Yes, for optimal results, you should use the related normal sequencing data (mpileup format) as input. Note that regions covered in the somatic file but not in the germline file will not be provided in the output file.
You can also apply a supplementary germline filter if you provide a .bed file containing common variations such as data from 1000 Genomes.
Please read SNooPer's manual for more information regarding this option.
-Can I perform a germline analysis using SNooPer ?
Yes, even if SNooPer was initially developed to call somatic variations, you can run a germline analysis (see manual).
-How do I know that my model performs well ?
To evaluate the training phase, SNooPer outputs general statistics, including the Kappa statistic and a confusion matrix obtained from 10-fold cross-validation. SNooPer also provides receiver operating characteristic (ROC) and precision-recall (PR) curves and the related areas under the curve (AUCs). We also strongly recommend that you randomly exclude from the training phase a sequencing dataset for which you have validation data (i.e. the class of somatic variants is known, either true variation or sequencing error) and then use the "evaluate" option within SNooPer on this independent dataset.
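To help read these statistics, the following Python sketch computes Cohen's Kappa from a confusion matrix using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). The matrix values are invented for the example and the function is not part of SNooPer.

```python
def cohen_kappa(matrix):
    """Cohen's Kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (n * n)
    if expected == 1:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical confusion matrix: [[TP, FN], [FP, TN]].
confusion = [[45, 5], [10, 40]]
kappa = cohen_kappa(confusion)
print(round(kappa, 2))
```

For this matrix the observed agreement is 0.85 and the chance agreement is 0.5, which gives a Kappa of about 0.7. A value near 0 means the model does no better than chance, and a value near 1 means near-perfect agreement with the validation data.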
-What filters can I apply to the output file ?
The most basic filtering step would be to simply retain variants presenting the "PASS" flag in the output .vcf file based on the algorithm’s categorical classification. Note that this may be overly restrictive.
For a more advanced filtering process, SNooPer reports each identified variation weighted by its class probability, from 0.5 to 1, allowing the user to adjust numerical filters with more flexibility than categorical predictions allow. The closer the class probability is to 1, the higher the confidence in the call.
Note that for a somatic analysis, a somatic p-value is also provided for each somatic variant identified.
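As an illustration of numerical filtering, the Python sketch below keeps the calls whose class probability reaches a user-chosen threshold. It works on already-parsed records: the "class_probability" key, the record identifiers and the default threshold are assumptions for the example, and the manual should be consulted for where the probability is stored in SNooPer's output .vcf.

```python
def filter_by_probability(calls, min_probability=0.9):
    """Keep calls whose class probability is at least min_probability."""
    if not 0.5 <= min_probability <= 1.0:
        raise ValueError("min_probability must lie between 0.5 and 1")
    return [c for c in calls if c["class_probability"] >= min_probability]

# Hypothetical parsed calls; the key name is an assumption.
calls = [
    {"id": "v1", "class_probability": 0.55},
    {"id": "v2", "class_probability": 0.80},
    {"id": "v3", "class_probability": 0.97},
]

# Sweeping the threshold shows the trade-off between the number of
# retained calls and the confidence in each of them.
for threshold in (0.5, 0.75, 0.9):
    kept = filter_by_probability(calls, threshold)
    print(threshold, [c["id"] for c in kept])
```

Lowering the threshold retains more candidate variants at the cost of more false positives, so the value is best chosen using a dataset for which validation data are available.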
-Can SNooPer cope with genome, exome, and transcriptome data ?
SNooPer can cope with any type of sequencing data as long as the input is in mpileup format. However, the training dataset should be of the same type as the test dataset.
-Can SNooPer cope with data from different technologies (e.g. SOLiD, Illumina) ?
Yes, you just have to specify it in your command line if the technology you used is not Illumina-1.8 or higher (see manual).
-How many FP and TP should my training file contain ?
In our original work, we showed that SNooPer can accommodate reduced training datasets, such as one consisting of only 250 false positives (FP) and true positives (TP), and still outperform other benchmarked methods.
Note that for optimal results, the balance of FP and TP in the training dataset should be representative of the balance you obtained from the validation while maintaining enough samples of the rarer class.
-How does SNooPer cope with unbalanced test and balanced training datasets ?
To classify an unbalanced test dataset with a model trained with a balanced training dataset, the user can weight the training instances (e.g. stronger cost on false positives) using SNooPer's cost sensitive training option.
-Can I call indels using SNooPer ?
Yes, an indel calling option is now available and has the same requirements as the SNV calling option.
-What mpileup settings should I use for SNooPer input?
You should use the following mpileup options: -O (to output base positions on reads) and -s (to output mapping quality).
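A minimal Python sketch for generating such a pileup file with samtools is given below. The -O and -s options are the ones named above. The optional reference FASTA (-f), the file names and the helper functions are assumptions made for the example, and samtools must be installed for write_pileup to run.

```python
import subprocess

def build_mpileup_command(bam_path, reference=None):
    """Build a samtools mpileup command with the -O and -s options."""
    cmd = ["samtools", "mpileup", "-O", "-s"]
    if reference is not None:
        # Assumption: a reference FASTA is supplied so that the pileup
        # reports reference bases; check the manual for what SNooPer expects.
        cmd += ["-f", reference]
    cmd.append(bam_path)
    return cmd

def write_pileup(bam_path, out_path, reference=None):
    """Run samtools mpileup and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(
            build_mpileup_command(bam_path, reference), stdout=out, check=True
        )

# Show the command that would be run for hypothetical file names.
print(" ".join(build_mpileup_command("tumor.bam", reference="ref.fa")))
```

For a somatic analysis, the same command would be run on the matched normal sample so that both pileup files are produced with identical settings.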