Boenn Markus
1 Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle/Saale, Germany.
2 Department of Soil Ecology, UFZ - Helmholtz Centre for Environmental Research, Halle/Saale, Germany.
J Comput Biol. 2018 Jun;25(6):613-622. doi: 10.1089/cmb.2018.0007. Epub 2018 Apr 16.
Genomic variations are a focus of research aimed at uncovering the mechanisms of host-pathogen interactions and of diseases such as cancer. Next-generation sequencing (NGS) data are now analyzed with dedicated pipelines to detect these variations. Surrogate NGS data combined with genomic variations help to evaluate pipelines and validate their outcomes, supporting the selection of appropriate tools for a given scientific question. I describe how existing approaches for simulating NGS data together with genomic variations fail to model local enrichments of single nucleotide polymorphisms (SNPs), so-called SNP clusters. Two distributions for count data are applied to publicly available collections of genomic variations. The results suggest that SNP cluster sizes should be modeled with overdispersion-aware distributions.
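The comparison of two count distributions described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two distributions, so the choice of a Poisson model (no overdispersion) and a negative binomial model (overdispersion-aware) is an assumption made here for illustration. The cluster sizes are made-up toy values, not data from the study; the function names are hypothetical; and the negative binomial parameters are estimated by the method of moments rather than by the author's procedure. The sketch also ignores that cluster sizes cannot be zero.

```python
import math
from statistics import mean, variance


def poisson_loglik(counts, lam):
    """Log-likelihood of counts under a Poisson model with rate lam."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)


def negbin_loglik(counts, r, p):
    """Log-likelihood under a negative binomial model.

    pmf(k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k,
    with mean r(1 - p)/p and variance r(1 - p)/p**2.
    """
    return sum(
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1 - p)
        for k in counts
    )


def compare_fits(counts):
    """Fit both models and report the dispersion index and AIC values."""
    m = mean(counts)
    v = variance(counts)  # sample variance
    result = {"mean": m, "variance": v, "dispersion_index": v / m}
    # Poisson: one parameter, maximum-likelihood estimate is the sample mean.
    result["aic_poisson"] = 2 * 1 - 2 * poisson_loglik(counts, m)
    if v > m:
        # Negative binomial: two parameters, method-of-moments estimates.
        p = m / v
        r = m * m / (v - m)
        result["aic_negbin"] = 2 * 2 - 2 * negbin_loglik(counts, r, p)
    else:
        # No overdispersion: moment estimates are undefined.
        result["aic_negbin"] = None
    return result


# Toy SNP cluster sizes (invented for illustration only).
cluster_sizes = [1, 1, 1, 2, 1, 3, 1, 1, 8, 1, 2, 1, 15, 1, 1, 4, 1, 2, 1, 22]
fit = compare_fits(cluster_sizes)
for name, value in fit.items():
    print(name, value)
```

A dispersion index (variance divided by mean) well above 1 signals overdispersion, and in that case the lower AIC of the negative binomial fit indicates that the Poisson model describes the counts poorly.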