Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain.
BMC Bioinformatics. 2010 Aug 3;11:410. doi: 10.1186/1471-2105-11-410.
Primer and probe sequences are the main components of nucleic acid-based detection systems. Biologists use primers and probes for a range of tasks, some related to diagnosing infectious diseases and prescribing treatments for them. The biological literature is the main source of empirically validated primer and probe sequences, so it is becoming increasingly important for researchers to be able to navigate this information. In this paper, we present a four-phase method for extracting and annotating primer/probe sequences from the literature. The phases are: (1) convert each document into a tree of paper sections, (2) detect candidate sequences using a set of finite state machine-based recognizers, (3) refine problem sequences using a rule-based expert system, and (4) annotate the extracted sequences with their related organism/gene information.
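To make phase (2) concrete, the following is a minimal sketch of what a finite state machine-based recognizer for candidate sequences might look like. It is not the recognizer set used in the paper: the two states, the restriction to the letters A, C, G and T, the tolerated separators (spaces and hyphens), the `min_length` threshold and the function name `find_candidate_sequences` are all assumptions made for illustration.

```python
# Hypothetical two-state recognizer (OUTSIDE / IN_SEQUENCE); not the authors' implementation.
NUCLEOTIDES = set("ACGT")   # assumption: plain DNA alphabet, upper case only
SEPARATORS = set(" -")      # assumption: printed sequences may be split by spaces or hyphens


def find_candidate_sequences(text, min_length=15):
    """Return runs of nucleotide letters that are at least min_length long."""
    candidates = []
    state = "OUTSIDE"
    current = []
    # A trailing newline acts as a sentinel so a sequence ending the text is flushed.
    for ch in text + "\n":
        if state == "OUTSIDE":
            if ch in NUCLEOTIDES:
                state = "IN_SEQUENCE"
                current = [ch]
        else:  # state == "IN_SEQUENCE"
            if ch in NUCLEOTIDES:
                current.append(ch)
            elif ch in SEPARATORS:
                continue  # stay in the sequence, drop the separator
            else:
                # Any other character ends the run; keep it only if long enough.
                if len(current) >= min_length:
                    candidates.append("".join(current))
                state = "OUTSIDE"
                current = []
    return candidates


if __name__ == "__main__":
    sample = "The forward primer 5'-ACGTACGTACGTACGTAC-3' was used."
    print(find_candidate_sequences(sample))
```

A length threshold alone will still admit false positives and miss sequences broken across lines, which is presumably the kind of problem case that the rule-based refinement in phase (3) is meant to handle.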
We tested our approach using a test set composed of 297 manuscripts. The extracted sequences and their organism/gene annotations were manually evaluated by a panel of molecular biologists. The results of the evaluation show that our approach is suitable for automatically extracting DNA sequences, achieving precision/recall rates of 97.98% and 95.77%, respectively. In addition, 76.66% of the detected sequences were correctly annotated with their organism name. The system also provided correct gene-related information for 46.18% of the sequences assigned a correct organism name.
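For readers unfamiliar with the two measures, precision is the fraction of extracted sequences that are correct, and recall is the fraction of the sequences actually present in the manuscripts that were extracted. The sketch below uses the standard definitions; the function name `precision_recall` and the counts in the usage example are hypothetical and are not the counts behind the reported figures, which the abstract does not give.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from raw counts.

    true_positives:  extracted sequences judged correct
    false_positives: extracted strings that are not real sequences
    false_negatives: real sequences the extractor missed
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


if __name__ == "__main__":
    # Illustrative counts only; not taken from the evaluation described above.
    p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
    print(f"precision={p:.2%} recall={r:.2%}")
```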
We believe that the proposed method can facilitate routine tasks for biomedical researchers who use molecular methods to diagnose infectious diseases and prescribe treatments for them. In addition, the proposed method can be expanded to detect and extract other biological sequences from the literature. The extracted information can also be used to readily update available primer/probe databases or to create new databases from scratch.