DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool.

Authors

Motion Graham B, Howden Andrew J M, Huitema Edgar, Jones Susan

Affiliations

Division of Plant Sciences, University of Dundee at the James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK; Cell and Molecular Sciences, James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK.

Division of Plant Sciences, University of Dundee at the James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK.

Publication information

Nucleic Acids Res. 2015 Dec 15;43(22):e158. doi: 10.1093/nar/gkv805. Epub 2015 Aug 24.

DOI:10.1093/nar/gkv805

PMID:26304539

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4678848/

Abstract

There are currently 151 plants with draft genomes available, but levels of functional annotation for putative protein products are low. Therefore, accurate computational predictions are essential to annotate genomes in the first instance, and to provide focus for the more costly and time-consuming functional assays that follow. DNA-binding proteins are an important class of proteins that require annotation, but current computational methods are not applicable for genome-wide predictions in plant species. Here, we explore the use of species-specific and lineage-specific models for the prediction of DNA-binding proteins in plants. We show that a species-specific support vector machine model based on Arabidopsis sequence data is more accurate (accuracy 81%) than a generic model (74%), and based on this we develop a plant-specific model for predicting DNA-binding proteins. We apply this model to the tomato proteome and demonstrate its ability to perform accurate high-throughput prediction of DNA-binding proteins. In doing so, we have annotated 36 currently uncharacterised proteins by assigning a putative DNA-binding function. Our model is publicly available and we propose it be used in combination with existing tools to help increase annotation levels of DNA-binding proteins encoded in plant genomes.
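The accuracy figures quoted in the abstract are the fraction of proteins assigned to the correct class. As a minimal sketch of that metric only, the following compares two sets of predictions against known labels. The label lists and the names `species_specific_preds` and `generic_preds` are hypothetical toy data chosen for illustration, not results from the paper; the abstract does not describe the features or evaluation protocol the authors used.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy labels: 1 = DNA-binding, 0 = non-binding.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Hypothetical predictions from two models (assumed, not from the paper).
species_specific_preds = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
generic_preds = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

species_specific_accuracy = accuracy(true_labels, species_specific_preds)
generic_accuracy = accuracy(true_labels, generic_preds)
```

On real data the same comparison would be made on a held-out test set, so that both models are scored on proteins neither has seen during training.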

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/a450/4678848/33d59cbb720c/gkv805fig1.jpg

Similar articles

Prediction of plant pre-microRNAs and their microRNAs in genome-scale sequences using structure-sequence features and support vector machine.

BMC Bioinformatics. 2014 Dec 30;15(1):423. doi: 10.1186/s12859-014-0423-x.

AtbHLH29 of Arabidopsis thaliana is a functional ortholog of tomato FER involved in controlling iron acquisition in strategy I plants.

Cell Res. 2005 Aug;15(8):613-21. doi: 10.1038/sj.cr.7290331.

Genome-wide survey of DNA-binding proteins in Arabidopsis thaliana: analysis of distribution and functions.

Nucleic Acids Res. 2013 Aug;41(15):7212-9. doi: 10.1093/nar/gkt505. Epub 2013 Jun 17.

Genome-wide analysis of WRKY transcription factors in Solanum lycopersicum.

Mol Genet Genomics. 2012 Jun;287(6):495-513. doi: 10.1007/s00438-012-0696-6. Epub 2012 May 9.

Genome-wide identification, sequence characterization, and protein-protein interaction properties of DDB1 (damaged DNA binding protein-1)-binding WD40-repeat family members in Solanum lycopersicum.

Planta. 2015 Jun;241(6):1337-50. doi: 10.1007/s00425-015-2258-8. Epub 2015 Feb 14.

Defining the full tomato NB-LRR resistance gene repertoire using genomic and cDNA RenSeq.

BMC Plant Biol. 2014 May 5;14:120. doi: 10.1186/1471-2229-14-120.

A manually annotated Actinidia chinensis var. chinensis (kiwifruit) genome highlights the challenges associated with draft genomes and gene prediction in plants.

BMC Genomics. 2018 Apr 16;19(1):257. doi: 10.1186/s12864-018-4656-3.

Tomato heat stress transcription factor HsfB1 represents a novel type of general transcription coactivator with a histone-like motif interacting with the plant CREB binding protein ortholog HAC1.

Plant Cell. 2004 Jun;16(6):1521-35. doi: 10.1105/tpc.019927. Epub 2004 May 6.

Cited by

PLM-DBPs: enhancing plant DNA-binding protein prediction by integrating sequence-based and structure-aware protein language models.

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf245.

Accurate prediction of nucleic acid binding proteins using protein language model.

Bioinform Adv. 2025 Jan 20;5(1):vbaf008. doi: 10.1093/bioadv/vbaf008. eCollection 2025.

Improved prediction of DNA and RNA binding proteins with deep learning models.

Brief Bioinform. 2024 May 23;25(4). doi: 10.1093/bib/bbae285.

ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins.

Protein Sci. 2024 Jun;33(6):e5015. doi: 10.1002/pro.5015.

RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes.

Comput Struct Biotechnol J. 2024 Apr 15;23:1631-1640. doi: 10.1016/j.csbj.2024.04.034. eCollection 2024 Dec.

Single-Stranded DNA Binding Proteins and Their Identification Using Machine Learning-Based Approaches.

Biomolecules. 2022 Aug 26;12(9):1187. doi: 10.3390/biom12091187.

PredDBP-Stack: Prediction of DNA-Binding Proteins from HMM Profiles using a Stacked Ensemble Method.

Biomed Res Int. 2020 Apr 13;2020:7297631. doi: 10.1155/2020/7297631. eCollection 2020.

HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection.

Comput Math Methods Med. 2020 Mar 28;2020:1384749. doi: 10.1155/2020/1384749. eCollection 2020.

An improved deep learning method for predicting DNA-binding proteins based on contextual features in amino acid sequences.

PLoS One. 2019 Nov 14;14(11):e0225317. doi: 10.1371/journal.pone.0225317. eCollection 2019.

Genomic insights into HSFs as candidate genes for high-temperature stress adaptation and gene editing with minimal off-target effects in flax.

Sci Rep. 2019 Apr 3;9(1):5581. doi: 10.1038/s41598-019-41936-1.
