Classification of genomic components and prediction of genes of Begomovirus based on subsequence natural vector and support vector machine.

Authors

Pei Shaojun, Dong Rui, Bao Yiming, He Rong Lucy, Yau Stephen S-T

Affiliations

Department of Mathematical Sciences, Tsinghua University, Beijing, China.

National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, China.

Publication information

PeerJ. 2020 Aug 3;8:e9625. doi: 10.7717/peerj.9625. eCollection 2020.

Abstract

BACKGROUND

Begomoviruses are widely distributed and cause devastating diseases in many crops. Depending on the number of genomic components, a begomovirus is classified as either monopartite or bipartite. Both monopartite and bipartite begomoviruses have a DNA-A component, which encodes all proteins essential for virus function, while bipartite begomoviruses additionally contain a DNA-B component. Satellite molecules, known as betasatellites, alphasatellites, or deltasatellites, are sometimes associated with begomoviruses. The genomic components of begomoviruses are therefore complex and varied, and different components have different gene structures and functions. Classifying the components of begomoviruses is important for studying the origin of the viruses and their pathogenic mechanisms.

METHODS

We propose a model that combines the Subsequence Natural Vector (SNV) method with the Support Vector Machine (SVM) algorithm to classify the genomic components of begomoviruses and predict their genes. First, each genome sequence is represented numerically as a vector by the SNV method. Then, SVM is applied to the datasets to build the classification model. Finally, recursive feature elimination (RFE) is used to select the essential features of the subsequence natural vectors based on feature importance.
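
The first step, turning a sequence into a numeric vector, can be illustrated with a minimal sketch. It assumes the standard natural vector definition (for each of A, C, G, T: the count, the mean position, and the normalized second central moment of the positions) and assumes that the subsequence natural vector is obtained by splitting the sequence into roughly equal segments and concatenating the natural vector of each segment. The function names, the `num_segments` parameter, the 1-based positions within each segment, and the handling of the remainder are illustrative assumptions, not details taken from the abstract.

```python
from typing import Dict, List

NUCLEOTIDES = "ACGT"


def natural_vector(seq: str) -> List[float]:
    """Return a 12-dimensional natural vector for one sequence.

    Layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T].
    Positions are 1-based; characters other than A/C/G/T are ignored
    when collecting positions but still count toward the length n.
    """
    seq = seq.upper()
    n = len(seq)
    positions: Dict[str, List[int]] = {k: [] for k in NUCLEOTIDES}
    for i, base in enumerate(seq, start=1):
        if base in positions:
            positions[base].append(i)

    counts: List[float] = []
    means: List[float] = []
    moments: List[float] = []
    for k in NUCLEOTIDES:
        pos = positions[k]
        n_k = len(pos)
        if n_k == 0:
            # Nucleotide absent: all three descriptors are set to zero.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(pos) / n_k
        # Normalized second central moment: sum (p - mu)^2 / (n_k * n).
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def subsequence_natural_vector(seq: str, num_segments: int) -> List[float]:
    """Concatenate the natural vectors of num_segments contiguous segments.

    Assumption: segments are as equal in length as integer division allows.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    n = len(seq)
    vector: List[float] = []
    for j in range(num_segments):
        start = j * n // num_segments
        end = (j + 1) * n // num_segments
        vector.extend(natural_vector(seq[start:end]))
    return vector
```

Each sequence then maps to a vector of length 12 × `num_segments`, which is the kind of fixed-length input an SVM classifier and a feature selection step such as RFE require.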

RESULTS

DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate it, we compare it with the homology-based method BLAST and two machine learning algorithms, Random Forest and Naive Bayes. The results show that our model classifies DNA-A, DNA-B, and the different satellites with high accuracy. In particular, it can distinguish whether a DNA-A component comes from a monopartite or a bipartite begomovirus. Based on the classification results, we can also predict the genes of the different genomic components. From the selected features, we find that the content of the four nucleotides in the second and tenth segments (approximately 150-350 bp and 1,450-1,650 bp) differs most between the DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c2f9/7409808/450b1407d474/peerj-08-9625-g001.jpg
