Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning.

Authors

Horvath Jakub, Jedlicka Pavel, Kratka Marie, Kubat Zdenek, Kejnovsky Eduard, Lexa Matej

Affiliations

Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic.

Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic.

Publication information

BioData Min. 2024 Dec 18;17(1):57. doi: 10.1186/s13040-024-00410-z.

DOI:10.1186/s13040-024-00410-z
PMID:39696434
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11656987/
Abstract

BACKGROUND

Long terminal repeats (LTRs) represent important parts of LTR retrotransposons and retroviruses found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key sequence modules, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. Furthermore, this approach can be utilized for the classification and prediction of LTRs.

RESULTS

We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three machine learning models using (i) traditional model ensembles (Gradient Boosting), (ii) hybrid convolutional/long short-term memory (LSTM) network models, and (iii) a DNA pre-trained transformer-based model using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in this data, as well as providing valuable insights into LTR sequence composition. The best F1 score achieved for LTR detection was 0.85, obtained with the hybrid network model. The most accurate classification task was superfamily classification (F1=0.89) while the least accurate was family classification (F1=0.74). The trained models were subjected to explainability analysis. Positional analysis identified a mixture of interesting features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box regulatory sequence, and TG..CA nucleotide patterns around both LTR edges.
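As a rough illustration of two ideas mentioned above, the following sketch shows how a DNA sequence can be split into overlapping k-mers and how an F1 score is derived from prediction counts. It is not the authors' pipeline: the function names, the stride of 1, and the default `k=6` are assumptions chosen for illustration, since the abstract does not state them.

```python
from collections import Counter


def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    k=6 is an illustrative default, not a value taken from the abstract.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq: str, k: int = 6) -> Counter:
    """Count k-mer occurrences, e.g. as features for a tree ensemble."""
    return Counter(kmer_tokens(seq, k))


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(kmer_tokens("TGTTAC", k=3))
    print(kmer_counts("ATATAT", k=2))
    print(round(f1_score(tp=85, fp=15, fn=15), 2))
```

A k-mer count vector of this kind could feed a gradient boosting classifier, while a token list is closer to what a k-mer-based transformer consumes. The counts passed to `f1_score` here are made-up numbers, not results from the paper.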

CONCLUSIONS

Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of 5'- and 3'-edges in LTR identity and revealed the need to analyze more than just the dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
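The edge and positional features described above can be made concrete with a small sketch. It checks for the TG..CA dinucleotide pattern at the two ends of a candidate LTR, extracts wider edge windows (in the spirit of looking beyond the dinucleotides), and reports where a TATA-like motif sits relative to the sequence length. The helper names, the window width, and the literal motif `TATAAA` are hypothetical simplifications, not details taken from the paper.

```python
def has_tg_ca_edges(ltr: str) -> bool:
    """True if the sequence starts with TG and ends with CA."""
    ltr = ltr.upper()
    return len(ltr) >= 4 and ltr.startswith("TG") and ltr.endswith("CA")


def edge_windows(ltr: str, width: int = 10) -> tuple[str, str]:
    """Return the first and last `width` bases (width is an assumed value)."""
    ltr = ltr.upper()
    return ltr[:width], ltr[-width:]


def relative_motif_positions(ltr: str, motif: str = "TATAAA") -> list[float]:
    """Start positions of `motif`, scaled to the range [0, 1) by LTR length.

    "TATAAA" is an assumed stand-in for a TATA-box; real promoters vary.
    """
    ltr = ltr.upper()
    positions = []
    start = ltr.find(motif)
    while start != -1:
        positions.append(start / len(ltr))
        start = ltr.find(motif, start + 1)  # allow overlapping matches
    return positions


if __name__ == "__main__":
    toy_ltr = "TGTTGGCCTATAAAGGCCAACA"  # made-up sequence, not real data
    print(has_tg_ca_edges(toy_ltr))
    print(edge_windows(toy_ltr, width=4))
    print(relative_motif_positions(toy_ltr))
```

Scaling positions by length makes LTRs of different sizes comparable, which is one simple way to ask whether a motif tends to sit near the center or near an edge.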

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e806/11656987/1623ed9590db/13040_2024_410_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e806/11656987/a25789d25c31/13040_2024_410_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e806/11656987/7cf9362eaac5/13040_2024_410_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e806/11656987/cc5d9dc32395/13040_2024_410_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e806/11656987/4080a18355cc/13040_2024_410_Fig5_HTML.jpg

Cited by

1
Correction: Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning.
BioData Min. 2024 Dec 30;17(1):62. doi: 10.1186/s13040-024-00417-6.