Integer programming framework for pangenome-based genome inference.

Authors

Chandra Ghanshyam, Hossen Md Helal, Scholz Stephan, Dilthey Alexander T, Gibney Daniel, Jain Chirag

Affiliations

Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA 560012, India.

Department of Computer Science, The University of Texas at Dallas, TX 75080, USA.

Publication information

bioRxiv. 2024 Oct 29:2024.10.27.620212. doi: 10.1101/2024.10.27.620212.

Abstract

Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, read alignments to a reference genome are unreliable in highly polymorphic and repetitive regions, which further degrades genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.
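
To make the objective in the abstract concrete, the following is a minimal sketch of a heavily simplified variant, not the authors' integer program. It assumes the pangenome is a plain chain of aligned blocks with one sequence per haplotype in each block, scores each block independently by how many of its distinct k-mers occur in the reads, and subtracts a fixed penalty per haplotype switch. Under those assumptions a dynamic program suffices, whereas the general problem is stated above to be NP-Hard. All names (`blocks`, `read_kmers`, `switch_penalty`, `infer_path`) are hypothetical and chosen only for illustration.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def infer_path(blocks, read_kmers, k, switch_penalty):
    """Pick one haplotype per block on a toy chain-shaped pangenome.

    blocks[i][h] is the sequence of haplotype h in block i (assumed layout).
    Score = sum of per-block k-mer matches - switch_penalty * (#switches).
    Simplification: k-mers are counted per block, so k-mers shared between
    blocks count more than once and k-mers spanning junctions are ignored.
    """
    n_hap = len(blocks[0])
    # gain[i][h]: distinct k-mers of haplotype h in block i that occur in reads
    gain = [[len(kmers(s, k) & read_kmers) for s in block] for block in blocks]

    best = gain[0][:]   # best[h]: best score of a path ending on haplotype h
    back = []           # back[i-1][h]: haplotype used in block i-1
    for i in range(1, len(blocks)):
        top = max(range(n_hap), key=lambda h: best[h])
        new_best, ptr = [], []
        for h in range(n_hap):
            stay = best[h]
            switch = best[top] - switch_penalty
            if stay >= switch:
                new_best.append(stay + gain[i][h])
                ptr.append(h)
            else:
                new_best.append(switch + gain[i][h])
                ptr.append(top)
        best = new_best
        back.append(ptr)

    # Trace back the chosen haplotype for every block.
    h = max(range(n_hap), key=lambda x: best[x])
    score = best[h]
    choice = [h]
    for ptr in reversed(back):
        h = ptr[h]
        choice.append(h)
    choice.reverse()
    return choice, score


# Toy data: two haplotypes, three blocks; the sample is a recombinant
# that follows haplotype 0 in block 0 and haplotype 1 afterwards.
blocks = [["ACGTAC", "ACCTAC"], ["GGATCC", "GGTTCC"], ["TTGACA", "TTCACA"]]
reads = kmers("ACGTAC", 3) | kmers("GGTTCC", 3) | kmers("TTCACA", 3)

print(infer_path(blocks, reads, k=3, switch_penalty=1))   # low penalty
print(infer_path(blocks, reads, k=3, switch_penalty=10))  # high penalty
```

With a low penalty the sketch accepts one switch to recover the recombinant path; with a high penalty it stays on a single haplotype and gives up the k-mer matches in the first block. This mirrors the trade-off described in the abstract between maximizing matches and minimizing recombination events.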

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e652/11565907/a2e2d0805223/nihpp-2024.10.27.620212v1-f0006.jpg
