Daigle Austin, Whitehouse Logan S, Zhao Roy, Emerson J J, Schrider Daniel R
Department of Genetics, University of North Carolina, Chapel Hill, NC 27599.
Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, NC 27599.
bioRxiv. 2025 Feb 16:2025.02.11.637720. doi: 10.1101/2025.02.11.637720.
Transposable elements (TEs) are parasitic genomic elements that are ubiquitous across the tree of life and play a crucial role in genome evolution. Advances in long-read sequencing have allowed highly accurate TE detection, though at a higher cost than short-read sequencing. Recent studies using long reads have shown that existing short-read TE detection methods perform inadequately when applied to real data. In this study, we use a machine learning approach (called TEforest) to discover and genotype TE insertions and deletions with short-read data by using TEs detected from long-read genome assemblies as training data. Our method first uses a highly sensitive algorithm to discover potential TE insertion or deletion sites in the genome, extracting relevant features from short-read alignments. To discriminate between true and false TE insertions, we train a random forest model with a labeled ground-truth dataset for which we have calculated the same set of short-read features. We conduct a comprehensive benchmark of TEforest and traditional TE detection methods using real data, finding that TEforest identifies more true positives and fewer false positives across datasets with different read lengths and coverages, while also accurately inferring genotypes and the precise breakpoints of insertions. By learning short-read signatures of TEs previously only discoverable using long reads, our approach bridges the gap between large-scale population genetic studies and the accuracy of long-read assemblies. This work provides a user-friendly tool to study the prevalence and phenotypic effects of TE insertions across the genome.
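The classification step described above (candidate sites, short-read features, a forest trained on long-read-derived labels) can be illustrated with a minimal sketch. This is not the TEforest implementation. The two features (split-read and discordant-pair counts), the toy values, and all function names are assumptions invented for illustration, and each "tree" is reduced to a single-threshold decision stump trained on a bootstrap sample with one randomly chosen feature, a drastically simplified stand-in for a full random forest.

```python
import random

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature with the fewest training errors."""
    best = None  # (errors, threshold, label_if_at_or_above)
    for threshold in sorted({r[feature] for r in rows}):
        for label_above in (0, 1):
            errors = sum(
                (label_above if r[feature] >= threshold else 1 - label_above) != y
                for r, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, label_above)
    _, threshold, label_above = best
    return feature, threshold, label_above

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each with one random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_proba(forest, row):
    """Fraction of stumps voting that the candidate is a true insertion."""
    votes = sum(
        label_above if row[feature] >= threshold else 1 - label_above
        for feature, threshold, label_above in forest
    )
    return votes / len(forest)

# Hypothetical features per candidate site: (split_reads, discordant_pairs).
# Labels (1 = true insertion, 0 = false positive) would come from long-read
# assemblies in the real method; here they are made up for illustration.
X = [(12, 9), (10, 11), (15, 8), (9, 12), (1, 0), (0, 2), (2, 1), (1, 1)]
y = [1, 1, 1, 1, 0, 0, 0, 0]

forest = fit_forest(X, y)
print(predict_proba(forest, (20, 15)))  # candidate with strong read support
print(predict_proba(forest, (0, 0)))    # candidate with no read support
```

In practice a library random forest with full-depth trees and many more features would replace the stumps. The sketch only shows how bootstrap sampling, random feature choice, and vote averaging combine to score a candidate site.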