Cabot Chloé, Soualmia Lina F, Grosjean Julien, Griffon Nicolas, Darmoni Stéfan J
Normandie Univ., TIBS - LITIS EA 4108, Rouen University and Hospital, France.
Stud Health Technol Inform. 2017;235:126-130.
Extracting concepts from medical texts is key to supporting many advanced applications in medical information retrieval. Entity recognition in French texts is further challenged because many of the available resources were originally developed for English texts. This paper evaluates terminology coverage in a corpus of 50,000 French articles extracted from the bibliographic database LiSSa. The corpus was automatically indexed with 32 health terminologies, published in French or translated. The terminologies providing the best coverage of these documents were then determined. The results show that major resources such as the NCI and SNOMED CT thesauri produce the most annotations of the corpus, while specific French resources prove to be valuable assets.
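As a rough illustration of what comparing terminology coverage over a corpus can look like, the sketch below annotates a toy corpus and ranks terminologies by the share of documents they annotate. It is not the indexing tool used in the study: the naive substring matcher, the definition of coverage as "fraction of documents with at least one matched term", and the names `annotate`, `coverage_by_terminology`, `TOY_A` and `TOY_B` are all assumptions made for this example.

```python
from collections import Counter


def annotate(text, terms):
    """Return the terms found in `text` (naive case-insensitive substring match).

    Assumption: a real indexer would handle tokenization, inflection and
    ambiguity; this matcher only illustrates the idea.
    """
    lowered = text.lower()
    return {term for term in terms if term.lower() in lowered}


def coverage_by_terminology(corpus, terminologies):
    """Map each terminology name to the fraction of documents it annotates.

    Assumption: "coverage" here means the share of documents with at least
    one matched term, which may differ from the measure used in the paper.
    """
    if not corpus:
        return {name: 0.0 for name in terminologies}
    hits = Counter()
    for doc in corpus:
        for name, terms in terminologies.items():
            if annotate(doc, terms):
                hits[name] += 1
    return {name: hits[name] / len(corpus) for name in terminologies}


# Toy data (hypothetical documents and terminologies, not from LiSSa).
corpus = [
    "Le patient présente un diabète de type 2.",
    "Infarctus du myocarde traité par angioplastie.",
    "Compte rendu sans terme reconnu.",
]
terminologies = {
    "TOY_A": {"diabète", "infarctus du myocarde"},
    "TOY_B": {"angioplastie"},
}

coverage = coverage_by_terminology(corpus, terminologies)
# Terminologies ordered from best to worst coverage.
ranking = sorted(coverage, key=coverage.get, reverse=True)
```

Counting annotated documents rather than raw matches keeps a terminology with many near-synonyms from dominating the ranking; counting total annotations instead would be an equally defensible choice depending on the question asked.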