Department of Psychiatry, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
Bioinformatics. 2012 Nov 15;28(22):2963-70. doi: 10.1093/bioinformatics/bts542. Epub 2012 Sep 6.
Automated annotation of neuroanatomical connectivity statements from the neuroscience literature would enable accessible, large-scale connectivity resources. Unfortunately, connectivity findings are not formally encoded and appear only as natural language text, which hinders aggregation, indexing, searching and integration of the reports. To facilitate automated extraction of connectivity relationships from the neuroscience literature, we annotated a set of 1377 abstracts for connectivity relations. We tested several baseline measures based on co-occurrence and lexical rules, and compared results from seven machine learning methods adapted from the protein interaction extraction domain that employ part-of-speech, dependency and syntax features.
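As an illustration of what a sentence-level co-occurrence baseline can look like, the following minimal sketch treats every pair of distinct brain regions mentioned in the same sentence as a predicted connection. It is not the authors' implementation: the function name `cooccurrence_pairs` and the example region list are hypothetical, and it assumes brain region mentions have already been identified.

```python
from itertools import combinations


def cooccurrence_pairs(region_mentions):
    """Return every unordered pair of distinct regions mentioned in one sentence.

    Hypothetical helper: assumes region mentions were already recognized
    and normalized to plain strings.
    """
    unique_regions = sorted(set(region_mentions))
    return list(combinations(unique_regions, 2))


# Assumed example input: regions found in a single sentence.
sentence_regions = ["amygdala", "hippocampus", "thalamus", "amygdala"]
pairs = cooccurrence_pairs(sentence_regions)
for first, second in pairs:
    print(f"predicted connection: {first} <-> {second}")
```

Because every co-occurring pair is predicted as connected, a baseline of this kind tends to recover most true statements while also producing many false positives.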
Co-occurrence-based methods provided high recall with weak precision. The shallow linguistic kernel recalled 70.1% of the sentence-level connectivity statements at 50.3% precision. Owing to its speed and simplicity, we applied the shallow linguistic kernel to a large set of new abstracts. To evaluate the results, we compared 2688 extracted connections with the Brain Architecture Management System (an existing database of rat connectivity). Of the extracted connections, 63.5% were reported as connected in the Brain Architecture Management System, compared with 51.1% for co-occurring brain region pairs. We found that precision increases with the recency and frequency of the extracted relationships.
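The reported precision and recall can be combined into a single score for comparison across methods. The sketch below computes the balanced F-measure (harmonic mean) from the two figures quoted above. The F-measure itself is not stated in this abstract; it is derived here only for illustration, and the function name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for the shallow linguistic kernel.
precision = 0.503
recall = 0.701

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # derived value, not reported in the abstract
```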
The source code, evaluations, documentation and other supplementary materials are available at http://www.chibi.ubc.ca/WhiteText.
Supplementary data are available at Bioinformatics Online.