IEEE J Biomed Health Inform. 2022 Oct;26(10):5033-5041. doi: 10.1109/JBHI.2022.3193365. Epub 2022 Oct 4.
Drug-induced liver injury refers to adverse drug effects that damage the liver, and life-threatening outcomes have been reported in severe cases. Liver toxicity is therefore an important assessment for new drug candidates. Evidence of such toxicity is documented in research papers that describe preliminary in vitro and in vivo experiments. Conventionally, extracting data from publications relies on resource-demanding manual labeling, which limits the efficiency of information extraction. Advances in natural language processing now enable the automatic processing of biomedical texts. Using around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, this study benchmarked model performance on filtering liver-damage-related literature. Among five text embedding techniques, the model combining term frequency-inverse document frequency (TF-IDF) with logistic regression outperformed the others, with an accuracy of 0.957 on the validation set. In addition, an ensemble model with similar overall performance was developed by fitting a logistic regression model on the predicted probabilities produced by separate models that use different vectorization techniques. The ensemble model achieved an accuracy of 0.954 and an F1 score of 0.955 on the hold-out validation data in the challenge. Words that were important for positive and negative predictions were identified through model interpretation. Prediction reliability was quantified with conformal prediction, which gives users control over the prediction uncertainty. Overall, the ensemble model and the TF-IDF model reached satisfactory classification results and can be used by researchers to rapidly filter literature describing liver injury induced by medications.
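To make the best-performing setup concrete, the following is a minimal, dependency-free sketch of TF-IDF features feeding a logistic regression classifier. It is an illustration only, not the authors' implementation: the six toy titles, the smoothed IDF formula, the L2-normalized vectors, and the training hyperparameters (`epochs`, `lr`, `l2`) are all assumptions chosen for the example, and a real pipeline would more likely use an established machine learning library.

```python
import math
import re
from collections import Counter

# Toy titles invented for illustration; 1 = liver-injury related, 0 = unrelated.
DOCS = [
    ("acute hepatotoxicity and liver failure after drug overdose", 1),
    ("drug induced liver injury with elevated serum transaminases", 1),
    ("hepatic necrosis in patients receiving antitubercular drug therapy", 1),
    ("cardiac arrhythmia outcomes after beta blocker therapy", 0),
    ("drug delivery with polymer nanoparticles for lung cancer", 0),
    ("renal function in patients receiving diuretic therapy", 0),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(texts):
    """Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1."""
    n = len(texts)
    df = Counter(tok for text in texts for tok in set(tokenize(text)))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(text, idf):
    """Sparse, L2-normalized TF-IDF vector; unseen tokens are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=500, lr=0.5, l2=0.01):
    """Logistic regression fitted by plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for t, v in vec.items():
                w = weights.get(t, 0.0)
                weights[t] = w - lr * (err * v + l2 * w)
            bias -= lr * err
    return weights, bias


def predict_proba(text, idf, weights, bias):
    """Probability that a text is liver-injury related."""
    vec = tfidf(text, idf)
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))


texts, labels = zip(*DOCS)
idf = fit_idf(texts)
weights, bias = train_logreg([tfidf(t, idf) for t in texts], labels)

for query in ("liver injury and hepatic necrosis",
              "renal function and cardiac arrhythmia"):
    print(f"{predict_proba(query, idf, weights, bias):.3f}  {query}")
```

Because the model is linear in the TF-IDF features, each learned weight can be read directly as the contribution of a word toward a positive or negative prediction, which is one simple way to surface important words.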
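The idea of controlling prediction uncertainty with conformal prediction can be sketched as follows. This is a generic inductive (split) conformal classifier, assumed here for illustration; the abstract does not say which conformal variant the study used. The calibration probabilities and labels are hypothetical values, and `epsilon` is the user-chosen significance level (the tolerated error rate).

```python
def nonconformity(prob_positive, label):
    """One minus the probability the model assigned to `label` (0 or 1)."""
    p_label = prob_positive if label == 1 else 1.0 - prob_positive
    return 1.0 - p_label


def calibrate(cal_probs, cal_labels):
    """Nonconformity scores of held-out calibration examples with known labels."""
    return [nonconformity(p, y) for p, y in zip(cal_probs, cal_labels)]


def p_value(cal_scores, score):
    """Share of calibration scores at least as nonconforming as `score`."""
    at_least = sum(1 for s in cal_scores if s >= score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores, prob_positive, epsilon):
    """Labels that cannot be rejected at significance level `epsilon`."""
    return {
        label
        for label in (0, 1)
        if p_value(cal_scores, nonconformity(prob_positive, label)) > epsilon
    }


# Hypothetical calibration data: predicted P(positive) and true labels.
cal_probs = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.15, 0.10, 0.05]
cal_labels = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
cal_scores = calibrate(cal_probs, cal_labels)

for prob, eps in [(0.97, 0.2), (0.50, 0.2), (0.50, 0.3)]:
    print(prob, eps, prediction_set(cal_scores, prob, eps))
```

A single-label set is a confident prediction, a set containing both labels flags an ambiguous paper, and an empty set flags a paper unlike anything in the calibration data. Raising `epsilon` shrinks the sets at the cost of a higher tolerated error rate.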