Rodríguez-Penagos Carlos, Salgado Heladia, Martínez-Flores Irma, Collado-Vides Julio
Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Apdo. Postal 565-A, Avenida Universidad, Cuernavaca, Morelos, 62100, Mexico.
BMC Bioinformatics. 2007 Aug 7;8:293. doi: 10.1186/1471-2105-8-293.
Manual curation of biological databases, an expensive and labor-intensive process, is essential for high-quality integrated data. In this paper we report the implementation of a state-of-the-art natural language processing system that creates computer-readable networks of regulatory interactions directly from different collections of abstracts and full-text papers. Our major aim is to understand how automatic annotation using text-mining techniques can complement manual curation of biological databases. We implemented a rule-based system to generate networks from different sets of documents dealing with regulation in Escherichia coli K-12.
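To make the idea of rule-based extraction concrete, the following is a minimal sketch, assuming a single hand-written pattern of the form "regulator verb target". It is not the system reported here, whose rules are not described in this abstract; the pattern, the `EFFECT` mapping, the function name `extract_interactions`, and the example sentences are all hypothetical.

```python
import re

# Hypothetical rule: "<Regulator> activates|represses|regulates [the] <gene>".
# Regulator names start with an uppercase letter (e.g. CRP, LacI); gene names
# follow the common bacterial convention of three lowercase letters plus an
# optional uppercase/digit suffix (e.g. lacZ). Both are simplifying assumptions.
PATTERN = re.compile(
    r"\b(?P<regulator>[A-Z][A-Za-z0-9]+)\s+"
    r"(?P<verb>activates|represses|regulates)\s+"
    r"(?:the\s+)?(?P<target>[a-z]{3}[A-Z0-9]*)\b"
)

# Map each trigger verb to the sign of the regulatory effect ("?" = unknown).
EFFECT = {"activates": "+", "represses": "-", "regulates": "?"}


def extract_interactions(text):
    """Return (regulator, effect, target) triples matched by the toy rule."""
    return [
        (m["regulator"], EFFECT[m["verb"]], m["target"])
        for m in PATTERN.finditer(text)
    ]


# Made-up sentences for illustration only.
text = (
    "CRP activates the lacZ promoter. "
    "LacI represses lacZ in the absence of lactose."
)
edges = extract_interactions(text)
```

Each triple can be read as one directed, signed edge of a regulatory network. A realistic system would need many more rules plus handling of synonyms, negation, and passive constructions.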
Performance evaluation is based on the most comprehensive transcriptional regulation database for any organism, the manually curated RegulonDB, 45% of which we were able to recreate automatically. Our automated analysis also found some new interactions, either in papers not yet curated or in papers where the interactions were missed during manual filtering and review of the literature. We also put forward a novel Regulatory Interaction Markup Language better suited than SBML for simultaneously representing data of interest for biologists and text miners.
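The comparison against a curated reference can be sketched as a set operation: the recovered fraction is the share of curated interactions that also appear in the extracted network, and extracted interactions absent from the reference are candidates for manual review. This is a minimal sketch under stated assumptions, not the evaluation procedure used in the paper; the function names, the case-insensitive name matching, and the toy interaction lists are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Interaction = Tuple[str, str]  # (regulator, target gene)


def normalize(pairs: Iterable[Interaction]) -> Set[Interaction]:
    """Lowercase and strip names so trivially different spellings match.

    Assumption: case-insensitive matching is adequate for this toy data.
    """
    return {(r.strip().lower(), t.strip().lower()) for r, t in pairs}


def compare_networks(
    extracted: Iterable[Interaction], curated: Iterable[Interaction]
) -> Dict[str, object]:
    """Compare an extracted network against a curated reference network."""
    ext = normalize(extracted)
    cur = normalize(curated)
    recovered = ext & cur   # curated interactions that were re-created
    candidates = ext - cur  # not in the reference: new facts or errors
    recall = len(recovered) / len(cur) if cur else 0.0
    return {"recall": recall, "recovered": recovered, "candidates": candidates}


# Toy data for illustration only; not taken from RegulonDB.
curated = [("CRP", "lacZ"), ("LacI", "lacZ"), ("AraC", "araB"), ("FNR", "narG")]
extracted = [("crp", "lacZ"), ("AraC", "araB"), ("OxyR", "katG")]

result = compare_networks(extracted, curated)
```

On this toy data, two of the four reference interactions are recovered, and one extracted pair falls outside the reference. Such leftover pairs are what a curator would inspect, since they may be either genuine new interactions or extraction errors.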
Manual curation of the output of automatic processing of text is a good way to complement a more detailed review of the literature, either for validating the results of what has been already annotated, or for discovering facts and information that might have been overlooked at the triage or curation stages.