Richardet Renaud, Chappelier Jean-Cédric, Telefont Martin, Hill Sean
Blue Brain Project, Brain Mind Institute and School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland.
Bioinformatics. 2015 May 15;31(10):1640-7. doi: 10.1093/bioinformatics/btv025. Epub 2015 Jan 20.
In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles. One challenge for modern neuroinformatics is finding methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and the integration of such data into computational models. A key example of this is metascale brain connectivity, where results are not reported in a normalized repository. Instead, these experimental results are published in natural language, scattered among individual scientific publications. This lack of normalization and centralization hinders the large-scale integration of brain connectivity results. In this article, we present text-mining models to extract and aggregate brain connectivity results from 13.2 million PubMed abstracts and 630 216 full-text publications related to neuroscience. The brain regions are identified with three different named entity recognizers (NERs) and then normalized against two atlases: the Allen Brain Atlas (ABA) and the atlas from the Brain Architecture Management System (BAMS). We then use three different extractors to assess inter-region connectivity.
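The recognize, normalize and extract steps described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the three NERs or the three extractors work, so the sketch below assumes a plain dictionary lookup for recognition and normalization, and sentence-level co-occurrence for connectivity. The lexicon entries and the function names `find_regions` and `extract_connections` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy lexicon: surface form -> canonical region name.
# A real system would load names and synonyms from an atlas such as ABA or BAMS.
LEXICON = {
    "hippocampus": "Hippocampus",
    "hippocampal formation": "Hippocampus",
    "entorhinal cortex": "Entorhinal cortex",
    "amygdala": "Amygdala",
}

# Try longer surface forms first so multi-word names win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_regions(sentence):
    """Return the canonical names of all lexicon regions mentioned in a sentence."""
    return [LEXICON[m.group(1).lower()] for m in _PATTERN.finditer(sentence)]


def extract_connections(text):
    """Count unordered region pairs that co-occur within the same sentence."""
    counts = Counter()
    # Naive sentence split on terminal punctuation (an assumption of this sketch).
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        regions = sorted(set(find_regions(sentence)))
        for pair in combinations(regions, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "The entorhinal cortex projects to the hippocampus. "
        "The hippocampal formation receives input from the amygdala."
    )
    for pair, n in extract_connections(sample).items():
        print(pair, n)
```

Co-occurrence alone says nothing about direction or about whether a connection is asserted or negated, so its output is best read as a set of candidate connections.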
NERs and connectivity extractors are evaluated against a manually annotated corpus. The complete in litero extraction models are also evaluated against in vivo connectivity data from ABA, with an estimated precision of 78%. The resulting database contains over 4 million brain region mentions and over 100 000 (ABA) and 122 000 (BAMS) potential brain region connections. By providing neuroscientists with a centralized repository of connectivity data, this database drastically accelerates connectivity literature review.
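For reference, precision is the fraction of extracted connections that are confirmed by the reference data. The sketch below uses only this standard definition. The abstract does not describe how extracted connections were matched against the ABA in vivo data, so the treatment of connections as unordered pairs, the function name `precision`, and the sample pairs are all assumptions made for illustration.

```python
def precision(extracted, reference):
    """Fraction of extracted region pairs that also appear in the reference set.

    Pairs are treated as unordered, which is an assumption of this sketch.
    """
    extracted = {frozenset(p) for p in extracted}
    reference = {frozenset(p) for p in reference}
    if not extracted:
        return 0.0
    return len(extracted & reference) / len(extracted)


if __name__ == "__main__":
    # Made-up pairs, for illustration only.
    extracted_pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
    reference_pairs = [("B", "A"), ("C", "B"), ("D", "C")]
    print(precision(extracted_pairs, reference_pairs))
```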