
A Method for Localizing Non-Reference Sequences to the Human Genome.

Affiliations

Departments of Bioengineering, Stanford University, Stanford, CA 94305, USA,

Publication information

Pac Symp Biocomput. 2022;27:313-324.

PMID: 34890159
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8730539/
Abstract

As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 base pair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to. We extract reads that don't align to the reference genome, and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1 Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
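The k-mer counting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmers`, `kmer_counts`, `population_distribution`), the choice to skip k-mers containing `N`, and the toy reads are all assumptions made for illustration, and the toy uses k = 4 rather than 100 to stay readable. It assumes the unmapped reads have already been extracted from each sample's alignment file.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def kmers(read: str, k: int) -> Iterator[str]:
    """Yield every length-k substring of a read, skipping any that contain N."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if "N" not in kmer:  # assumption: ambiguous bases are discarded
            yield kmer


def kmer_counts(reads: Iterable[str], k: int) -> Counter:
    """Count k-mer occurrences across one sample's unmapped reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def population_distribution(
    samples: Dict[str, Iterable[str]], k: int
) -> Dict[str, Dict[str, int]]:
    """Map each k-mer to {sample_id: count} over the samples that carry it."""
    dist: Dict[str, Dict[str, int]] = {}
    for sample_id, reads in samples.items():
        for kmer, n in kmer_counts(reads, k).items():
            dist.setdefault(kmer, {})[sample_id] = n
    return dist


# Hypothetical toy input: two siblings, a handful of short "unmapped" reads.
toy = {
    "sib1": ["ACGTAC", "TTNACG"],
    "sib2": ["ACGTTT"],
}
dist = population_distribution(toy, k=4)
print(dist["ACGT"])  # k-mer seen in both siblings
```

A per-k-mer table of which family members carry the sequence is the kind of input that could then be compared against sibling sharing patterns along the genome; that comparison (the Hidden Markov Model and Maximum Likelihood Estimator described in the abstract) is not sketched here.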


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e6ee/8730539/238d06515ab1/nihms-1760618-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e6ee/8730539/7d1bf2ef1ac2/nihms-1760618-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e6ee/8730539/c0039270150a/nihms-1760618-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e6ee/8730539/f36b112d303e/nihms-1760618-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e6ee/8730539/6090501444fe/nihms-1760618-f0005.jpg

Similar articles

1
A Method for Localizing Non-Reference Sequences to the Human Genome.
Pac Symp Biocomput. 2022;27:313-324.
2
Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity.
Genome Res. 2023 Oct;33(10):1734-1746. doi: 10.1101/gr.277175.122. Epub 2023 Oct 25.
3
SAKE: Strobemer-assisted k-mer extraction.
PLoS One. 2023 Nov 29;18(11):e0294415. doi: 10.1371/journal.pone.0294415. eCollection 2023.
4
Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper.
PLoS Comput Biol. 2021 Mar 22;17(3):e1008815. doi: 10.1371/journal.pcbi.1008815. eCollection 2021 Mar.
5
Towards a reference genome that captures global genetic diversity.
Nat Commun. 2020 Oct 30;11(1):5482. doi: 10.1038/s41467-020-19311-w.
6
Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.
BMC Bioinformatics. 2014 Jan 3;15:2. doi: 10.1186/1471-2105-15-2.
7
The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes.
Gigascience. 2016 Oct 11;5(1):42. doi: 10.1186/s13742-016-0148-z.
8
A hybrid and scalable error correction algorithm for indel and substitution errors of long reads.
BMC Genomics. 2019 Dec 20;20(Suppl 11):948. doi: 10.1186/s12864-019-6286-9.
9
Kmer2SNP: Reference-Free Heterozygous SNP Calling Using k-mer Frequency Distributions.
Methods Mol Biol. 2022;2493:257-265. doi: 10.1007/978-1-0716-2293-3_16.
10
Localized assembly for long reads enables genome-wide analysis of repetitive regions at single-base resolution in human genomes.
Hum Genomics. 2023 Mar 9;17(1):21. doi: 10.1186/s40246-023-00467-7.

Cited by

1
Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity.
Genome Res. 2023 Oct;33(10):1734-1746. doi: 10.1101/gr.277175.122. Epub 2023 Oct 25.
2
Transmission dynamics of human herpesvirus 6A, 6B and 7 from whole genome sequences of families.
Virol J. 2022 Dec 24;19(1):225. doi: 10.1186/s12985-022-01941-9.
3
Including diverse and admixed populations in genetic epidemiology research.
Genet Epidemiol. 2022 Oct;46(7):347-371. doi: 10.1002/gepi.22492. Epub 2022 Jul 16.

References

1
Towards population-scale long-read sequencing.
Nat Rev Genet. 2021 Sep;22(9):572-587. doi: 10.1038/s41576-021-00367-3. Epub 2021 May 28.
2
Estimating sequencing error rates using families.
BioData Min. 2021 Apr 23;14(1):27. doi: 10.1186/s13040-021-00259-6.
3
Sequence three million genomes across Africa.
Nature. 2021 Feb;590(7845):209-211. doi: 10.1038/d41586-021-00313-7.
4
Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.
Nat Biotechnol. 2021 Mar;39(3):302-308. doi: 10.1038/s41587-020-0719-5. Epub 2020 Dec 7.
5
Towards a reference genome that captures global genetic diversity.
Nat Commun. 2020 Oct 30;11(1):5482. doi: 10.1038/s41467-020-19311-w.
6
High-depth African genomes inform human migration and health.
Nature. 2020 Oct;586(7831):741-748. doi: 10.1038/s41586-020-2859-7. Epub 2020 Oct 28.
7
Telomere-to-telomere assembly of a complete human X chromosome.
Nature. 2020 Sep;585(7823):79-84. doi: 10.1038/s41586-020-2547-7. Epub 2020 Jul 14.
8
Pan-genomics in the human genome era.
Nat Rev Genet. 2020 Apr;21(4):243-254. doi: 10.1038/s41576-020-0210-7. Epub 2020 Feb 7.
9
A pneumonia outbreak associated with a new coronavirus of probable bat origin.
Nature. 2020 Mar;579(7798):270-273. doi: 10.1038/s41586-020-2012-7. Epub 2020 Feb 3.
10
The "All of Us" Research Program.
N Engl J Med. 2019 Aug 15;381(7):668-676. doi: 10.1056/NEJMsr1809937.