Departments of Bioengineering, Stanford University, Stanford, CA 94305, USA
Pac Symp Biocomput. 2022;27:313-324.
As the past decade of human genomics research begins to bear fruit in the form of advances in precision medicine, it is important to ensure that the resulting improvements in human health are distributed globally and equitably. An important step toward health equity is to improve the human reference genome so that it captures global diversity by including a wide variety of alternative haplotypes, sequences that are not currently represented in the reference genome. We present a method that localizes 100 base-pair (bp) sequences extracted from short-read sequencing data, which can ultimately be used to identify the regions of the human genome to which non-reference sequences belong. We extract reads that do not align to the reference genome and compute the population-wide distribution of the 100-mers found within these unmapped reads. We use genetic data from families to identify genetic material shared between siblings, and we match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of each k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and a median resolution of less than 1 Mb. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
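The two steps named in the abstract, counting k-mers in unmapped reads and choosing the most likely region by maximum likelihood, can be illustrated with a deliberately simplified sketch. It is not the authors' implementation. The function names `count_kmers` and `localize`, the region labels, and the concordance probabilities in `P_CONCORDANT` are all hypothetical, and the sketch assumes that the number of haplotypes each sibling pair shares in each region has already been inferred (for example, by a Hidden Markov Model, which is not shown here).

```python
import math
from collections import Counter


def count_kmers(reads, k=100):
    """Count every length-k substring across a collection of read sequences."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# ASSUMPTION: illustrative probabilities that two siblings agree on carrying
# a k-mer, given how many haplotypes (0, 1, or 2) they share in a region.
# These values are made up for the example, not taken from the paper.
P_CONCORDANT = {0: 0.5, 1: 0.75, 2: 0.99}


def localize(kmer_presence, sharing_by_region):
    """Return (region, log_likelihood) for the region that best explains a k-mer.

    kmer_presence:     {sibling_pair: (in_sibling_a, in_sibling_b)}
    sharing_by_region: {region: {sibling_pair: shared haplotypes (0, 1, or 2)}}
    """
    best_region, best_ll = None, -math.inf
    for region, sharing in sharing_by_region.items():
        ll = 0.0
        for pair, (in_a, in_b) in kmer_presence.items():
            p = P_CONCORDANT[sharing[pair]]
            # Agreement between siblings is scored with p, disagreement with 1 - p.
            ll += math.log(p if in_a == in_b else 1.0 - p)
        if ll > best_ll:
            best_region, best_ll = region, ll
    return best_region, best_ll


# Toy data: three sibling pairs and two candidate regions (hypothetical labels).
presence = {"fam1": (True, True), "fam2": (True, False), "fam3": (False, False)}
sharing = {
    "regionA": {"fam1": 2, "fam2": 0, "fam3": 2},
    "regionB": {"fam1": 0, "fam2": 2, "fam3": 1},
}
best, log_lik = localize(presence, sharing)
```

In this toy input, `regionA` is selected because the one discordant sibling pair (`fam2`) shares no haplotypes there, whereas `regionB` would require a discordance in a region where that pair shares both haplotypes. A realistic model would also need to account for k-mer frequency in the population and for sequencing error, which this sketch ignores.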