Silva José Cleydson F, Carvalho Thales F M, Fontes Elizabeth P B, Cerqueira Fabio R
Department of Informatics, Universidade Federal de Viçosa, Viçosa, Minas Gerais, 36570-900, Brazil.
Department of Biochemistry and Molecular Biology, Universidade Federal de Viçosa, Campus Universitário, Viçosa, Minas Gerais, 36570-900, Brazil.
BMC Bioinformatics. 2017 Sep 30;18(1):431. doi: 10.1186/s12859-017-1839-x.
Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of the species diversity, taxonomy, evolutionary mechanisms, geographic distribution, and host interaction mechanisms of these pathogens have greatly increased in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, naming geminiviruses and classifying them taxonomically have become complex tasks. In addition, the gene responsible for viral replication may be spliced (particularly in viruses belonging to the genus Mastrevirus), because the virus uses the transcriptional/splicing machinery of the host cell. However, current tools are limited in their ability to identify introns.
This study proposes a new method, designated Fangorn Forest (F2), that uses machine learning (ML) to classify genera ab initio, i.e., from the genomic sequence alone, and to predict and classify genes in the family Geminiviridae. Nine genera of the family Geminiviridae and their related satellite DNAs were selected for this investigation. We built two training sets: one for genus classification, containing attributes extracted from complete geminivirus genomes, and one for gene classification, containing attributes extracted from ORFs taken from those same genomes. Three ML algorithms were applied to these datasets to build the predictive models: support vector machines trained with sequential minimal optimization, random forest (RF), and multilayer perceptron. RF demonstrated very high predictive power, achieving precision, recall, and area under the curve (AUC) of 0.966, 0.964, and 0.995, respectively, for genus classification. For gene classification, RF reached precision, recall, and AUC of 0.983, 0.983, and 0.998, respectively.
Fangorn Forest therefore proved to be an efficient method for classifying genera of the family Geminiviridae with high precision, and for predicting and classifying their genes. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp.
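To make the ORF-based step mentioned in the abstract more concrete, the following is a minimal Python sketch of how candidate ORFs might be extracted from a genome and turned into a numeric attribute. It is not the Fangorn Forest implementation: the function names (`find_orfs`, `gc_content`), the `min_len` threshold, the use of GC content as an attribute, and the treatment of the genome as a linear sequence are all assumptions made for illustration, since the abstract does not specify which attributes were used.

```python
# Hypothetical sketch, NOT the Fangorn Forest code: a plain six-frame ORF scan
# plus one illustrative attribute (GC content). All names are assumptions.

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Return (strand, start, end, orf) tuples for ATG...stop ORFs in six frames.

    Coordinates are 0-based, end-exclusive, and relative to the strand that was
    scanned. The genome is treated as linear (an assumption: geminivirus
    genomes are circular, so ORFs spanning the origin are missed).
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


def gc_content(seq):
    """Fraction of G and C bases; one example of a sequence-derived attribute."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


if __name__ == "__main__":
    toy = "CCATGAAATTTGGGTAACC"  # toy sequence, not real viral data
    for strand, start, end, orf in find_orfs(toy, min_len=9):
        print(strand, start, end, orf, round(gc_content(orf), 3))
```

A contiguous scan like this cannot recover spliced genes, which is the limitation the abstract raises for the replication gene of Mastrevirus. Attribute vectors built this way could in principle be passed to any of the classifiers named above.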