Dececchi T Alexander, Balhoff James P, Lapp Hilmar, Mabee Paula M
Department of Biology, University of South Dakota, Vermillion, SD 57069, USA;
National Evolutionary Synthesis Center, Durham, NC 27705, USA; University of North Carolina, Chapel Hill, NC 27599, USA.
Syst Biol. 2015 Nov;64(6):936-52. doi: 10.1093/sysbio/syv031. Epub 2015 May 26.
Ever-larger molecular databases and the need to integrate data scalably pose a major challenge for the use of phenotypic data. Morphology is currently described primarily in discrete publications, locked in text that is not computer-readable, and integrating it across large numbers of taxa and studies requires enormous investments of time and resources. Here we present a new methodology that uses ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org) to automatically integrate large numbers of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable-character subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports on the provenance of conflicting data can be generated automatically. Further, reasoning enables quantification and new visualizations of the data; here, for example, it allows identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here for computing synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, provided the data are semantically annotated. Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships.
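To make the idea of inferred presence/absence states concrete, the following is a minimal Python sketch, not the Phenoscape implementation. It assumes a simplified rule set (presence of a part implies presence of every structure it is part of; absence of a structure implies absence of all of its parts), and the toy `PART_OF` hierarchy, the taxon labels, and the function names are all hypothetical illustrations rather than data or APIs from the KB.

```python
# Hypothetical toy anatomy: each key is part_of its value.
PART_OF = {
    "digit": "autopod",
    "autopod": "limb",
    "humerus": "limb",
}
STRUCTURES = ["limb", "humerus", "autopod", "digit"]


def ancestors(term):
    """Yield every structure that `term` is (transitively) part of."""
    while term in PART_OF:
        term = PART_OF[term]
        yield term


def descendants(term):
    """Return every structure that is (transitively) part of `term`."""
    return {t for t in PART_OF if term in ancestors(t)}


def infer(asserted):
    """Expand asserted states into a dict mapping each cell to a set of states.

    Assumed rules: 'present' propagates up to wholes,
    'absent' propagates down to parts.
    """
    cells = {}
    for (taxon, structure), state in asserted.items():
        cells.setdefault((taxon, structure), set()).add(state)
        if state == "present":
            implied = ancestors(structure)
        else:
            implied = descendants(structure)
        for other in implied:
            cells.setdefault((taxon, other), set()).add(state)
    return cells


def missing_fraction(cells, taxa, structures):
    """Fraction of matrix cells with no state at all."""
    return 1 - len(cells) / (len(taxa) * len(structures))


# Hypothetical asserted (author-scored) states for three placeholder taxa.
asserted = {
    ("taxon_A", "digit"): "present",
    ("taxon_B", "limb"): "absent",
    ("taxon_C", "autopod"): "absent",
    ("taxon_C", "digit"): "present",
}
taxa = sorted({taxon for taxon, _ in asserted})

cells = infer(asserted)

# A conflict is a cell where both presence and absence are indicated.
conflicts = sorted(cell for cell, states in cells.items() if len(states) > 1)

asserted_cells = {cell: {state} for cell, state in asserted.items()}
missing_before = missing_fraction(asserted_cells, taxa, STRUCTURES)
missing_after = missing_fraction(cells, taxa, STRUCTURES)

print("conflicts:", conflicts)
print(f"missing before inference: {missing_before:.1%}")
print(f"missing after inference:  {missing_after:.1%}")
```

In this toy matrix, inference fills most of the empty cells and flags the two `taxon_C` cells where an asserted presence collides with an inferred absence (and vice versa). The same bookkeeping is consistent with the figures reported above: 639 characters by 1051 taxa gives 671,589 cells, and 78.2% missing data leaves roughly 146,000 populated cells, in line with the stated "over 145,000".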