CarD-T: Interpreting Carcinomic Lexicon via Transformers.

Authors

O'Neill Jamey, Reddy Gudur Ashrith, Dhillon Nermeeta, Tripathi Osika, Alexandrov Ludmil, Katira Parag

Affiliations

Mechanical Engineering Department, San Diego State University, San Diego, CA, USA.

Department of Bioengineering, University of California San Diego, La Jolla, CA, USA.

Publication information

medRxiv. 2024 Aug 31:2024.08.13.24311948. doi: 10.1101/2024.08.13.24311948.

Abstract

The identification and classification of carcinogens is critical in cancer epidemiology, necessitating updated methodologies to manage the burgeoning biomedical literature. Current systems, like those run by the International Agency for Research on Cancer (IARC) and the National Toxicology Program (NTP), face challenges due to manual vetting and disparities in carcinogen classification spurred by the volume of emerging data. To address these issues, we introduce the Carcinogen Detection via Transformers (CarD-T) framework, a text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. CarD-T uses Named Entity Recognition (NER) trained on PubMed abstracts featuring known carcinogens from IARC groups and includes a context classifier to enhance accuracy and manage computational demands. Using this method, we analyzed journal publications from the last 25 years indexed with carcinogenicity and carcinogenesis Medical Subject Headings (MeSH) terms to identify potential carcinogens. When trained on 60% of established carcinogens (IARC Group 1 and 2A), CarD-T correctly identifies all of the remaining Group 1 and 2A carcinogens from the analyzed text. In addition, CarD-T nominates roughly 1,500 additional entities as potential carcinogens, each with at least two publications citing evidence of carcinogenicity. A comparative assessment of CarD-T against the GPT-4 model shows higher recall (0.857 vs 0.705) and F1 score (0.875 vs 0.792), and comparable precision (0.894 vs 0.903). Additionally, CarD-T highlights 554 entities with conflicting evidence for carcinogenicity. These are further analyzed using Bayesian temporal Probabilistic Carcinogenic Denomination (PCarD) to provide probabilistic evaluations of their carcinogenic status based on evolving evidence.
Our findings underscore that the CarD-T framework is not only robust and effective in identifying and nominating potential carcinogens within vast biomedical literature but also efficient on consumer GPUs. This integration of advanced NLP capabilities with vital epidemiological analysis significantly enhances the agility of public health responses to carcinogen identification, thereby setting a new benchmark for automated, scalable toxicological investigations.
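
As a quick consistency check on the comparison reported in the abstract, the F1 scores can be recomputed from the stated precision and recall using the standard definition F1 = 2PR / (P + R). The sketch below is illustrative only and is not part of the CarD-T codebase; the function name `f1_score` and the `reported` dictionary are hypothetical, and the inputs are the rounded values quoted above, so the recomputed figures are approximate.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision, recall, and F1 values as quoted in the abstract.
# The dictionary layout is an assumption made for this illustration.
reported = {
    "CarD-T": {"precision": 0.894, "recall": 0.857, "f1": 0.875},
    "GPT-4": {"precision": 0.903, "recall": 0.705, "f1": 0.792},
}

for model, m in reported.items():
    computed = f1_score(m["precision"], m["recall"])
    print(f"{model}: computed F1 = {computed:.3f}, reported F1 = {m['f1']:.3f}")
```

To three decimal places, the recomputed values agree with the reported F1 scores for both models, which suggests the three metrics in the abstract are internally consistent.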


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4d66/11370823/d962e701c511/nihpp-2024.08.13.24311948v2-f0001.jpg
