A Method for Localizing Non-Reference Sequences to the Human Genome.

Author information

Departments of Bioengineering, Stanford University, Stanford, CA 94305, USA,

Publication information

Pac Symp Biocomput. 2022;27:313-324.

Abstract

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step toward ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100-base-pair (bp) sequences extracted from short-read sequencing, which can ultimately be used to identify the regions of the human genome to which non-reference sequences belong. We extract reads that do not align to the reference genome and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
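The pipeline described in the abstract (collect k-mers from unmapped reads, then pick the genomic region whose sibling-sharing pattern best explains where a k-mer appears) can be illustrated with a toy sketch. This is not the authors' implementation. The function names (`count_kmers`, `kmer_presence`, `localize`), the region labels, and the parameters `eps` and `p_concordant` are all hypothetical. The sketch also assumes that, for each region, we already know whether each sibling pair inherited identical haplotypes there (for example, as the output of a separate Hidden Markov Model step), and it uses a deliberately simple error model invented for illustration.

```python
from collections import Counter
from math import log


def count_kmers(reads, k):
    """Count every length-k substring across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_presence(unmapped_reads_by_sample, kmer):
    """Map each sample to True if the k-mer occurs in its unmapped reads."""
    k = len(kmer)
    return {
        sample: count_kmers(reads, k)[kmer] > 0
        for sample, reads in unmapped_reads_by_sample.items()
    }


def localize(presence, sib_pairs, identical_by_region, eps=0.01, p_concordant=0.5):
    """Return the (region, log-likelihood) that best explains the presence pattern.

    Assumed toy model: if two siblings inherited identical haplotypes at the
    true region, they should agree on k-mer presence except with error
    probability eps; otherwise they agree with probability p_concordant.
    """
    best_region, best_ll = None, float("-inf")
    for region, identical in identical_by_region.items():
        ll = 0.0
        for pair in sib_pairs:
            a, b = pair
            concordant = presence[a] == presence[b]
            if identical[pair]:
                p = 1 - eps if concordant else eps
            else:
                p = p_concordant if concordant else 1 - p_concordant
            ll += log(p)
        if ll > best_ll:
            best_region, best_ll = region, ll
    return best_region, best_ll


# Toy data: k = 4 instead of 100 so the example stays readable.
unmapped = {
    "s1": ["TTACGTAA"],
    "s2": ["GGACGTCC"],
    "s3": ["ACGTACGT"],
    "s4": ["TTTTTTTT"],
}
sib_pairs = [("s1", "s2"), ("s3", "s4")]
# Hypothetical sharing states per region (assumed given, e.g. from an HMM).
identical_by_region = {
    "regionA": {("s1", "s2"): True, ("s3", "s4"): False},
    "regionB": {("s1", "s2"): False, ("s3", "s4"): True},
}

presence = kmer_presence(unmapped, "ACGT")
best_region, best_ll = localize(presence, sib_pairs, identical_by_region)
```

In this toy data the second sibling pair disagrees on whether the k-mer is present, which counts strongly against any region where that pair inherited identical haplotypes, so the first region scores higher. A realistic version would need real read data, many more families, and a sharing model that accounts for sequencing depth and partial sharing, none of which is specified in the abstract.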
