Efficient evidence-based genome annotation with EviAnn.

Authors

Zimin Aleksey V, Puiu Daniela, Pertea Mihaela, Yorke James A, Salzberg Steven L

Affiliations

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD.

Center for Computational Biology, Johns Hopkins University, Baltimore, MD.

Publication

bioRxiv. 2025 May 12:2025.05.07.652745. doi: 10.1101/2025.05.07.652745.

DOI: 10.1101/2025.05.07.652745
PMID: 40463080
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12132231/

Abstract

For many years, machine learning-based gene finding approaches have been the central components of eukaryotic genome annotation pipelines, and they remain so today. The reliance on these approaches was originally sustained by the high cost and low availability of gene expression data, a primary source of evidence for gene annotation along with protein homology. However, innovations in modern sequencing technologies have revolutionized the acquisition of abundant gene expression data, allowing us to rely more heavily on this class of evidence. In addition to gene expression data, proteins found in a multitude of well-annotated genomes represent another invaluable resource for gene annotation. Existing annotation packages often underutilize these data sources, which prompted us to develop EviAnn (Evidence-based Annotation), a novel evidence-based eukaryotic gene annotation system. EviAnn takes a strongly data-driven approach, building the exon-intron structure of genes from transcript alignments or protein-sequence homology rather than from purely gene finding techniques. We show that when provided with the same input data, EviAnn consistently outperforms current state-of-the-art packages including BRAKER3, MAKER2, and FINDER, while utilizing considerably less computer time. Annotation of a mammalian genome can be completed in less than an hour on a single multi-core server. EviAnn is freely available under an open-source license from https://github.com/alekseyzimin/EviAnn_release.
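
To make the idea of building exon-intron structure from alignments concrete, here is a minimal sketch of how introns can be read off the exon blocks of a spliced transcript alignment. It is an illustration only and is not taken from EviAnn: the function names, the 1-based inclusive coordinates (the GFF convention), and the example intervals are all assumptions made for this sketch.

```python
from typing import List, Tuple

# Assumption: intervals are 1-based and inclusive (start, end), as in GFF.
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by the exon blocks of one spliced alignment."""
    ordered = sorted(exons)
    introns: List[Interval] = []
    # Each gap between consecutive exons is one intron.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end + 1:
            raise ValueError("exons overlap or abut; no intron between them")
        introns.append((prev_end + 1, next_start - 1))
    return introns


def intron_chain(exons: List[Interval]) -> Tuple[Interval, ...]:
    """Hashable intron chain, useful for comparing transcript structures."""
    return tuple(introns_from_exons(exons))


# Hypothetical exon blocks for two aligned transcripts (not real data).
transcript_a = [(100, 200), (301, 400), (501, 650)]
transcript_b = [(150, 200), (301, 400), (501, 600)]

print(introns_from_exons(transcript_a))
print(intron_chain(transcript_a) == intron_chain(transcript_b))
```

In this sketch the two hypothetical transcripts differ only at their outer ends, so they share one intron chain. Comparing intron chains rather than full exon coordinates is a common way to decide whether two pieces of evidence support the same splicing structure.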
