Leveraging long-read assemblies and machine learning to enhance short-read transposable element detection and genotyping.

Authors

Daigle Austin, Whitehouse Logan S, Zhao Roy, Emerson J J, Schrider Daniel R

Affiliations

Department of Genetics, University of North Carolina, Chapel Hill, NC 27599.

Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, NC 27599.

Publication information

bioRxiv. 2025 Feb 16:2025.02.11.637720. doi: 10.1101/2025.02.11.637720.

DOI:10.1101/2025.02.11.637720
PMID:39990489
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11844559/
Abstract

Transposable elements (TEs) are parasitic genomic elements that are ubiquitous across the tree of life and play a crucial role in genome evolution. Advances in long-read sequencing have allowed highly accurate TE detection, though at a higher cost than short-read sequencing. Recent studies using long reads have shown that existing short-read TE detection methods perform inadequately when applied to real data. In this study, we use a machine learning approach (called TEforest) to discover and genotype TE insertions and deletions with short-read data by using TEs detected from long-read genome assemblies as training data. Our method first uses a highly sensitive algorithm to discover potential TE insertion or deletion sites in the genome, extracting relevant features from short-read alignments. To discriminate between true and false TE insertions, we train a random forest model with a labeled ground-truth dataset for which we have calculated the same set of short-read features. We conduct a comprehensive benchmark of TEforest and traditional TE detection methods using real data, finding that TEforest identifies more true positives and fewer false positives across datasets with different read lengths and coverages, while also accurately inferring genotypes and the precise breakpoints of insertions. By learning short-read signatures of TEs previously only discoverable using long reads, our approach bridges the gap between large-scale population genetic studies and the accuracy of long-read assemblies. This work provides a user-friendly tool to study the prevalence and phenotypic effects of TE insertions across the genome.
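
The classification step described in the abstract can be illustrated with a small, self-contained sketch. This is not the TEforest implementation: the three features (split reads, discordant pairs, soft-clipped reads), the toy counts, and every function name below are hypothetical and chosen only for illustration. The forest is reduced to its simplest form, a bagged ensemble of one-split trees that each use one randomly chosen feature, whereas a practical random forest grows deeper trees and would normally come from a library such as scikit-learn.

```python
import random
from collections import Counter

# Toy candidates (hypothetical features, not the TEforest feature set):
# (split_reads, discordant_pairs, soft_clipped_reads).
# Labels: 1 = insertion confirmed by a long-read assembly, 0 = false candidate.
TRAIN_X = [(12, 9, 14), (10, 11, 12), (15, 8, 10), (11, 12, 13), (13, 10, 11),
           (1, 0, 2), (0, 1, 1), (2, 1, 0), (1, 2, 3), (0, 0, 1)]
TRAIN_Y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, feature):
    """Find the best single threshold on one feature.

    Returns (feature, threshold, label_if_at_or_below, label_if_above).
    """
    fallback = majority(y)
    # Start from "no split": every row falls at or below an infinite threshold.
    best_score, best = gini(y), (feature, float("inf"), fallback, fallback)
    values = sorted({row[feature] for row in X})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        left = [label for row, label in zip(X, y) if row[feature] <= threshold]
        right = [label for row, label in zip(X, y) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_score = score
            best = (feature, threshold, majority(left), majority(right))
    return best


def fit_forest(X, y, n_trees=25, seed=0):
    """Train one-split trees on bootstrap samples, one random feature each."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feature = rng.randrange(len(X[0]))        # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict_proba(forest, row):
    """Fraction of trees that vote 'true insertion' for one candidate."""
    votes = sum((left if row[f] <= thr else right) for f, thr, left, right in forest)
    return votes / len(forest)


if __name__ == "__main__":
    forest = fit_forest(TRAIN_X, TRAIN_Y)
    print(predict_proba(forest, (14, 10, 12)))  # candidate with strong support
    print(predict_proba(forest, (1, 1, 2)))     # candidate with weak support
```

The vote fraction returned by `predict_proba` can be thresholded to trade sensitivity against false positives, which is the trade-off the benchmark in the abstract evaluates.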

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/96503b9dc756/nihpp-2025.02.11.637720v2-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/5b6746025547/nihpp-2025.02.11.637720v2-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/15178a25fd3e/nihpp-2025.02.11.637720v2-f0003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/1178cda80703/nihpp-2025.02.11.637720v2-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/4d2c78e6f4e5/nihpp-2025.02.11.637720v2-f0005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4981/11844559/6f7b0571b99f/nihpp-2025.02.11.637720v2-f0006.jpg

Similar articles

1
T-lex3: an accurate tool to genotype and estimate population frequencies of transposable elements using the latest short-read whole genome sequencing data.
Bioinformatics. 2020 Feb 15;36(4):1191-1197. doi: 10.1093/bioinformatics/btz727.
2
McClintock: An Integrated Pipeline for Detecting Transposable Element Insertions in Whole-Genome Shotgun Sequencing Data.
G3 (Bethesda). 2017 Aug 7;7(8):2763-2778. doi: 10.1534/g3.117.043893.
3
TrEMOLO: accurate transposable element allele frequency estimation using long-read sequencing data combining assembly and mapping-based approaches.
Genome Biol. 2023 Apr 3;24(1):63. doi: 10.1186/s13059-023-02911-2.
4
Transposable element finder (TEF): finding active transposable elements from next generation sequencing data.
BMC Bioinformatics. 2022 Nov 22;23(1):500. doi: 10.1186/s12859-022-05011-3.
5
Reproducible evaluation of transposable element detectors with McClintock 2 guides accurate inference of Ty insertion patterns in yeast.
Mob DNA. 2023 Jul 14;14(1):8. doi: 10.1186/s13100-023-00296-4.
6
Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements.
PLoS One. 2014 Sep 4;9(9):e106689. doi: 10.1371/journal.pone.0106689. eCollection 2014.
7
Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data.
Elife. 2024 Oct 10;13:RP98300. doi: 10.7554/eLife.98300.
8
Detecting transposable elements in long-read genomes using sTELLeR.
Bioinformatics. 2024 Nov 1;40(11). doi: 10.1093/bioinformatics/btae686.
9
LoRTE: Detecting transposon-induced genomic variants using low coverage PacBio long read sequences.
Mob DNA. 2017 Apr 8;8:5. doi: 10.1186/s13100-017-0088-x. eCollection 2017.
