Department of Epidemiology and Biostatistics, School of Public Health, Faculty of Medicine, Imperial College London, London, UK.
Toxalim (Research Center in Food Toxicology), INRAE, ENVT, INP-PURPAN, UPS, Université de Toulouse, Toulouse, France.
Clin Exp Allergy. 2021 Sep;51(9):1185-1194. doi: 10.1111/cea.13981. Epub 2021 Jul 24.
Biomedical research increasingly relies on computational approaches to extract relevant information from large corpora of publications.
To investigate the consequences of the ambiguity between the terms "Eczema" and "Atopic Dermatitis" (AD) from an information retrieval perspective, and its impact on meta-analyses, systematic reviews and text mining.
Articles were retrieved by querying PubMed with the MeSH terms "eczema" (D004485) and "dermatitis, atopic" (D003876). We used machine learning to investigate differences between the contexts in which each term is used: a decision tree model was trained to predict whether an article would be indexed with the eczema or the AD tag. We used text-mining tools to extract biological entities associated with eczema and AD, and investigated how the retrieval of key findings differed according to the terminology used.
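The tag-prediction step can be illustrated with a deliberately simplified stand-in: a one-level decision tree (a decision stump) that picks the single term whose presence best separates the two tags by Gini impurity. This is a minimal sketch and not the authors' implementation; the toy articles, their terms, and the function names `gini` and `best_split` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(docs, labels):
    """Return (term, weighted_impurity) for the term whose presence/absence
    best separates the tags. This corresponds to the root node of a tree."""
    vocab = set().union(*docs)
    best = (None, gini(labels))  # baseline: no split at all
    for term in sorted(vocab):  # sorted for deterministic tie-breaking
        present = [y for d, y in zip(docs, labels) if term in d]
        absent = [y for d, y in zip(docs, labels) if term not in d]
        score = (len(present) * gini(present)
                 + len(absent) * gini(absent)) / len(labels)
        if score < best[1]:
            best = (term, score)
    return best


# Hypothetical toy data: each article is reduced to a set of index terms.
docs = [
    {"inflammation", "cytokines", "mice"},
    {"inflammation", "filaggrin"},
    {"asthma", "child", "prevalence"},
    {"prevalence", "questionnaires"},
]
labels = ["AD", "AD", "eczema", "eczema"]

term, impurity = best_split(docs, labels)
print(term, impurity)
```

A full decision tree would apply the same split search recursively to each branch. The terms chosen near the root are the ones that characterize the contexts in which each tag is used.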
The AD query yielded more articles related to veterinary science, biochemistry, and cellular and molecular biology; the eczema query yielded more articles related to public health, infectious disease and the respiratory system. The Medical Subject Headings (MeSH) terms associated with "AD" and "Eczema" differed, with 52% agreement between the two top-40 lists. Terms related to cellular mechanisms, especially allergies and inflammation, characterized the AD literature. The metabolites mentioned more frequently than expected in articles with the AD tag differed from those in articles indexed with eczema. Fewer enriched genes were retrieved with the eczema query than with the AD query.
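Two of the quantities above can be made concrete with a short sketch. It assumes that "agreement" means the share of terms common to both top-N lists, and that "more frequently than expected" is assessed with a one-sided hypergeometric test. The abstract states neither definition, so both are assumptions, and all lists and counts below are hypothetical.

```python
from math import comb


def top_n_agreement(ranked_a, ranked_b, n=40):
    """Share of terms common to the top-n of two ranked lists (0.0 to 1.0).
    Assumes both lists contain at least n terms."""
    return len(set(ranked_a[:n]) & set(ranked_b[:n])) / n


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric P(X >= k): the chance of at least k mentions
    among n tagged articles when K of all N articles mention the entity."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical ranked term lists (top 4 instead of top 40 for brevity).
ad_terms = ["inflammation", "cytokines", "skin", "child"]
eczema_terms = ["skin", "child", "asthma", "prevalence"]
agreement = top_n_agreement(ad_terms, eczema_terms, n=4)

# Hypothetical counts: a metabolite is mentioned in 4 of 20 articles overall,
# and in 3 of the 5 articles carrying the AD tag.
p = enrichment_p_value(k=3, n=5, K=4, N=20)
print(agreement, p)
```

A small p-value marks an entity as over-represented under one tag. Running the same test separately for each tag is one way to see how the lists of enriched metabolites or genes diverge.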
Text mining extracts considerably different bio-entities depending on whether eczema or AD is used. Our results suggest that any systematic approach (particularly when looking for metabolites or genes related to the condition) should use both terms jointly. We propose decision tree learning as a tool to spot and characterize ambiguity, and provide the source code for disambiguation at https://github.com/cfrainay/ResearchCodeBase.
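As an illustration of using both terms jointly, the sketch below builds a single PubMed query that combines the two MeSH headings with OR and wraps it in an NCBI E-utilities ESearch URL. It only constructs the request and sends nothing. The helper names `joint_mesh_query` and `esearch_url` are hypothetical, and this is not taken from the authors' code base.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def joint_mesh_query(headings):
    """Combine MeSH headings into one PubMed query using OR."""
    return " OR ".join(f'"{h}"[MeSH Terms]' for h in headings)


def esearch_url(term, retmax=100):
    """Build an ESearch URL for PubMed; the term is percent-encoded."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


query = joint_mesh_query(["eczema", "dermatitis, atopic"])
url = esearch_url(query)
print(query)
print(url)
```

Fetching `url` (for example with `urllib.request.urlopen`) would return matching PubMed identifiers as JSON. A real pipeline should also respect NCBI's usage guidelines on request rate.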