Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, ROC.
Big Data Laboratories, Chunghwa Telecom Co., Taoyuan, Taiwan, ROC.
Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz030.
Detecting microRNA (miRNA) mentions in scientific literature helps researchers find relevant literature using queries formulated with miRNA information. Because most published biological studies describe signal transduction pathways or genetic regulatory information in figure captions, extracting miRNA mentions from both the main text and the figure captions of a manuscript is useful for aggregate and comparative analysis of published studies. In this study, we present a statistical principle-based miRNA recognition and normalization method that identifies miRNAs and links them to identifiers in the Rfam database. As one of the core components in the text mining pipeline of the miRTarBase database, the proposed method combines the advantages of previous approaches based on patterns, dictionaries and supervised learning, and provides an integrated solution to the problem of miRNA identification. Furthermore, the knowledge learned from the training data is organized in a human-interpretable manner, which makes it possible to understand why the system considers a span of text to be a miRNA mention and allows domain experts to complement the represented knowledge. We studied the ambiguity level of miRNA nomenclature in order to connect miRNA mentions to the Rfam database, and we evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus, the latter of which we extended with additional Rfam normalization information. Our study highlights the challenges associated with miRNA identification and normalization in scientific literature, and the research gaps that need to be explored in future studies.
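To make the recognition and normalization steps described in the abstract more concrete, the following is a minimal sketch of a purely pattern-based recognizer followed by a dictionary lookup. It is not the authors' statistical principle-based method: the regular expression, the function names `find_mirna_mentions` and `link_to_family`, the family-level key format (for example `mir-21`), and the placeholder identifiers are all assumptions made for illustration. Real Rfam accessions would have to be loaded from the Rfam database itself.

```python
import re

# Hypothetical pattern (an assumption, not the paper's rule set): an optional
# three-letter species prefix, a miR/miRNA/microRNA/let stem, a number, and
# optional letter, precursor-copy, and arm suffixes.
MIRNA_PATTERN = re.compile(
    r"\b(?:(?P<species>[a-z]{3})-)?"
    r"(?P<stem>mir|mirna|microrna|let)-?"
    r"(?P<number>\d+)"
    r"(?P<letter>[a-z])?"
    r"(?:-(?P<precursor>\d+))?"
    r"(?:-(?P<arm>[35]p))?\b",
    re.IGNORECASE,
)


def find_mirna_mentions(text):
    """Return candidate miRNA mentions with character spans and a family-level key."""
    mentions = []
    for match in MIRNA_PATTERN.finditer(text):
        # Collapse spelling variants (miR, miRNA, microRNA) onto one stem.
        stem = "let" if match.group("stem").lower() == "let" else "mir"
        mentions.append(
            {
                "text": match.group(0),
                "span": match.span(),
                "family_key": f"{stem}-{int(match.group('number'))}",
            }
        )
    return mentions


def link_to_family(mentions, family_ids):
    """Pair each mention with an identifier from a caller-supplied dictionary.

    Mentions whose family key is missing from the dictionary map to None.
    """
    return [(m["text"], family_ids.get(m["family_key"])) for m in mentions]


if __name__ == "__main__":
    caption = (
        "Figure 2. hsa-miR-21-5p and let-7a repress PTEN; "
        "microRNA-155 was unchanged."
    )
    # Placeholder identifiers only; these are NOT real Rfam accessions.
    placeholder_ids = {"mir-21": "PLACEHOLDER_A", "let-7": "PLACEHOLDER_B"}
    for surface, identifier in link_to_family(
        find_mirna_mentions(caption), placeholder_ids
    ):
        print(surface, "->", identifier)
```

A pattern alone cannot resolve the nomenclature ambiguity the abstract refers to, which is presumably why the proposed method also draws on dictionaries and supervised learning. The sketch only shows where such components would plug in: candidate generation in `find_mirna_mentions` and identifier assignment in `link_to_family`.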