Mao Jin, Moore Lisa R, Blank Carrine E, Wu Elvis Hsin-Hui, Ackerman Marcia, Ranade Sonali, Cui Hong
School of Information, University of Arizona, Tucson, 85721, AZ, USA.
Department of Biological Sciences, University of Southern Maine, Portland, 04103, ME, USA.
BMC Bioinformatics. 2016 Dec 13;17(1):528. doi: 10.1186/s12859-016-1396-8.
The large-scale analysis of phenomic data (i.e., the full set of phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system was needed to extract phenotypic characters from large numbers of taxonomic descriptions so that they can be used as input to existing phylogenetic analysis software packages.
We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students.
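As a rough illustration of the output format described above, the following minimal sketch builds a small taxon-by-character matrix by matching groups of known terms against description text. It is not MicroPIE's implementation: the SVM sentence-classification stage is omitted entirely, and the term groups, character names, function names, and example descriptions are all hypothetical, chosen only to show the idea.

```python
import re

# Hypothetical term groups: character name -> known character-state terms.
# These are illustrative only and are not taken from MicroPIE.
TERM_GROUPS = {
    "cell shape": ["rod-shaped", "rods", "cocci", "spiral"],
    "gram stain": ["gram-negative", "gram-positive"],
    "motility": ["non-motile", "motile"],
}


def extract_states(description):
    """Return {character: sorted list of matched state terms} for one description."""
    text = description.lower()
    found = {}
    for character, terms in TERM_GROUPS.items():
        hits = set()
        for term in terms:
            # Match the term only as a whole token; the lookbehind stops
            # "motile" from matching inside "non-motile".
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, text):
                hits.add(term)
        found[character] = sorted(hits)
    return found


def build_matrix(descriptions):
    """Build (header, rows): taxa in the rows, characters in the columns."""
    characters = list(TERM_GROUPS)
    rows = []
    for taxon, description in descriptions.items():
        states = extract_states(description)
        rows.append([taxon] + ["; ".join(states[c]) for c in characters])
    return ["taxon"] + characters, rows


# Hypothetical input descriptions (clean text), one per taxon.
descriptions = {
    "Taxon A": "Cells are Gram-negative, non-motile rods.",
    "Taxon B": "Gram-positive cocci. Motile.",
}
header, rows = build_matrix(descriptions)
```

Each row of `rows` corresponds to one taxon and each column to one character, which is the general shape of input that phylogenetic analysis packages expect. A real system would need far richer term lists and linguistic rules than this whole-term matching.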
MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.