Lever Jake, Altman Russ, Kim Jin-Dong
Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
Database Center for Life Science, Research Organization of Information and Systems, Kashiwa 277-0871, Japan.
Genomics Inform. 2020 Jun;18(2):e15. doi: 10.5808/GI.2020.18.2.e15. Epub 2020 Jun 15.
Named entity recognition tools identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will either miss important information or return spurious matches that frustrate users. Most tools do not capture non-contiguous entities, which are separate spans of text that together refer to one entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." Such entities are common in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To combat this problem, and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrated this new functionality with TextAE's existing editing functionality to allow easy changes to entity annotations and editing of relation annotations involving non-contiguous entities, with import from and export to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that a substantial number of non-contiguous entities appear in lists and would be missed by most text mining systems.
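To make the idea of a non-contiguous entity concrete, the following is a minimal sketch of one way to model it: an entity holds a list of character-offset spans instead of a single span, and a new subspan can be attached to an existing entity. The `Entity` class, its `add_subspan` and `surface` methods, and the `Disease` label are hypothetical names chosen for illustration; they are not TextAE's implementation or the PubAnnotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive


@dataclass
class Entity:
    """Hypothetical entity made of one or more text spans."""
    label: str
    spans: List[Span] = field(default_factory=list)

    def add_subspan(self, begin: int, end: int) -> None:
        """Attach an additional span, rejecting empty or overlapping ones."""
        if begin >= end:
            raise ValueError("span must be non-empty")
        for b, e in self.spans:
            if begin < e and b < end:
                raise ValueError("span overlaps an existing subspan")
        self.spans.append((begin, end))
        self.spans.sort()

    def surface(self, text: str) -> str:
        """Join the covered text pieces in document order."""
        return " ".join(text[b:e] for b, e in self.spans)


text = "type 1 and type 2 diabetes"

# "type 2 diabetes" is contiguous: a single span is enough.
type2 = Entity("Disease", [(11, 26)])

# "type 1 diabetes" is non-contiguous: start with "type 1",
# then add the shared head word "diabetes" as a second subspan.
type1 = Entity("Disease", [(0, 6)])
type1.add_subspan(18, 26)

print(type1.surface(text))  # type 1 diabetes
print(type2.surface(text))  # type 2 diabetes
```

A tool that stores only one span per entity can represent `type2` here but not `type1`, which is the gap the abstract describes.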