Chai Xiaokang, An Sile, Chen Simeng, Li Wenwei, Feng Zhao, Li Xiangning, Gong Hui, Luo Qingming, Li Anan
Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China.
Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China.
Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae648.
Neuroscientists have long sought to map brain connectivity, yet the complexity of brain networks often leads them to concentrate on specific regions, which hinders efforts to build a comprehensive connectivity map. Recent advances in imaging and text mining techniques have produced a vast body of literature containing valuable insights into brain connectivity, making it feasible to extract whole-brain connectivity relations from this corpus. However, brain region names and connectivity relations are expressed in diverse ways, which makes it difficult for conventional machine learning methods and dictionary-based approaches to identify all instances accurately.
We propose BioSEPBERT, a biomedical pre-trained model based on start-end position pointers and BERT. The model also integrates specialized identifiers with enhanced self-attention for the preceding and succeeding brain regions, improving named entity recognition and relation extraction in neuroscience. Our approach achieves best F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we use BioSEPBERT to extract 22.6 million standardized brain regions and 165 072 directional relations from a corpus comprising 1.3 million abstracts and 193 100 full-text articles. The results demonstrate that our model helps researchers rapidly acquire knowledge about neural circuits across various brain regions, thereby improving understanding of brain connectivity in specific regions.
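To illustrate the general idea of start-end position pointers, the following minimal sketch decodes entity spans from per-token start and end scores. It is not the BioSEPBERT implementation: the function name `decode_spans`, the threshold, the maximum span length, the nearest-end pairing rule, and the toy scores are all assumptions made for illustration.

```python
from typing import List, Tuple


def decode_spans(
    start_probs: List[float],
    end_probs: List[float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
    max_len: int = 10,       # assumed maximum span length in tokens
) -> List[Tuple[int, int]]:
    """Pair each predicted start token with the nearest predicted end token.

    Returns inclusive (start, end) token indices.
    """
    spans = []
    n = len(start_probs)
    for i in range(n):
        if start_probs[i] < threshold:
            continue
        # Search forward from the start token for the first likely end token.
        for j in range(i, min(i + max_len, n)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


# Toy example with hand-written scores (hypothetical, not model output).
tokens = ["The", "medial", "prefrontal", "cortex", "projects", "to", "the", "amygdala"]
start = [0.1, 0.9, 0.2, 0.1, 0.0, 0.0, 0.1, 0.8]
end = [0.0, 0.1, 0.3, 0.95, 0.0, 0.0, 0.0, 0.9]

spans = decode_spans(start, end)
entities = [" ".join(tokens[s : e + 1]) for s, e in spans]
print(entities)
```

In this sketch a multi-word brain region name is recovered as one span because only its first token scores highly as a start and only its last token scores highly as an end, which is one reason pointer-style decoding is often used for long or variable entity names.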
Data and source code are available at: http://atlas.brainsmatics.org/res/BioSEPBERT and https://github.com/Brainsmatics/BioSEPBERT.