Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models.

Affiliations

School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China.

Shanghai Institute of Stem Cell Research and Clinical Translation, Shanghai, 200120, China.

Publication information

J Transl Med. 2024 Aug 12;22(1):756. doi: 10.1186/s12967-024-05567-z.

DOI:10.1186/s12967-024-05567-z
PMID:39135093
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11318145/
Abstract

BACKGROUND

Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Recently developed artificial intelligence methods can therefore be used to interpret the functions of those DNA sequences.

METHODS

This study explores the use of deep learning, particularly pre-trained genomic models such as DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. We first constructed multiple datasets linking genotypes and phenotypes and used them to fine-tune those models for DNA sequence classification. We then evaluated the influence of sequence length on classification results and, using the human endogenous retrovirus (HERV) dataset, analyzed the impact of feature extraction in the hidden layers of our model. To better understand the phenotype-specific patterns recognized by the model, we performed enrichment, pathogenicity and conservation analyses of specific motifs in HERV sequences with high average local representation weight (ALRW) scores.
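As an illustration of the preprocessing such models typically need, the sketch below splits a DNA sequence into overlapping 6-mers, the tokenization scheme generally associated with DNABERT-style models such as DNA_bert_6, and cuts a long sequence into fixed-length windows of the kind one might use when studying the effect of sequence length. The function names `kmer_tokens` and `chunk_sequence`, the stride of 1, and the windowing strategy are assumptions made for illustration; the abstract does not describe the authors' actual pipeline.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Hypothetical helper: assumes the model expects k-mer tokens.
    """
    seq = sequence.upper()
    if k <= 0:
        raise ValueError("k must be positive")
    if len(seq) < k:
        return []  # sequence too short to yield a single k-mer
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def chunk_sequence(sequence: str, max_len: int) -> list[str]:
    """Cut a sequence into consecutive, non-overlapping windows.

    Hypothetical helper: the last window may be shorter than max_len.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


if __name__ == "__main__":
    print(kmer_tokens("ACGTACGT"))          # three overlapping 6-mers
    print(chunk_sequence("ACGTACGTAC", 4))  # windows of length 4, 4, 2
```

Varying `max_len` before tokenization is one simple way to produce inputs of different lengths for a length-sensitivity comparison.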

RESULTS

We constructed multiple genotype-phenotype datasets on which the fine-tuned models achieved good classification performance against random genomic sequences. Performance was strongest on the HERV dataset, where binary and multi-class classification achieved accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, fine-tuning on the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs sheds light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens.
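For readers who want to see how the two reported metrics are computed, the following is a minimal sketch of accuracy and F1 for a sequence classifier. It assumes macro-averaged F1 (the abstract does not state which averaging was used), and the labels in the usage example are invented toy data, not results from the study.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only
    truth = ["HERV", "HERV", "random", "random"]
    preds = ["HERV", "random", "random", "random"]
    print(accuracy(truth, preds), macro_f1(truth, preds))
```

The same functions cover both the binary and the multi-class case, since the per-class F1 is computed one-vs-rest for every label present.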

CONCLUSIONS

These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/90b2677a619e/12967_2024_5567_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/a7920cf2c2cd/12967_2024_5567_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/d94834d915e8/12967_2024_5567_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/77da48ecbebc/12967_2024_5567_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/04e5df0c0578/12967_2024_5567_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/809a03292e7a/12967_2024_5567_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/96ae/11318145/b91fa6a0c2ed/12967_2024_5567_Fig7_HTML.jpg
