Yang Tiancheng, Sucholutsky Ilia, Jen Kuang-Yu, Schonlau Matthias
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada.
Department of Pathology and Laboratory Medicine, University of California, Davis, Sacramento, CA, United States of America.
PeerJ Comput Sci. 2024 Feb 28;10:e1888. doi: 10.7717/peerj-cs.1888. eCollection 2024.
Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts are often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.
To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) "What kind of rejection does the patient show?"; (2) "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?"
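The retrieval task implied by these two questions can be framed as extractive question answering: given a question and a report, the model returns the text span that answers it. A minimal sketch of that framing, using a hypothetical report snippet and answer (both illustrative, not taken from the study):

```python
# Hypothetical renal transplant pathology report snippet (not from the paper).
report = (
    "Findings are consistent with acute T-cell mediated rejection, "
    "Banff grade IA. Mild interstitial fibrosis and tubular atrophy."
)

# The two predefined questions from the study.
questions = [
    "What kind of rejection does the patient show?",
    "What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?",
]

# An extractive QA model predicts a span inside the report; here we
# locate a gold answer span by character offsets for illustration.
answer = "acute T-cell mediated rejection"
start = report.find(answer)
span = report[start:start + len(answer)]
print(span)  # prints "acute T-cell mediated rejection"
```

Fine-tuning a BERT-style model for this task amounts to predicting the start and end offsets of such spans.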
KidneyBERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports (1.5M words). exKidneyBERT was then developed by extending Clinical BERT's tokenizer with six technical keywords, thereby enlarging the model's vocabulary, and repeating the pre-training procedure. All three models were fine-tuned with information retrieval heads.
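Why extending the tokenizer matters can be illustrated with a simplified greedy WordPiece tokenizer: a domain term absent from the vocabulary is fragmented into subword pieces, whereas adding it as a whole-word token keeps it intact. The abstract does not list the six keywords, so "tubulitis" below is a hypothetical example, and the toy vocabulary is illustrative:

```python
def wordpiece_tokenize(word, vocab):
    """Simplified greedy longest-match-first WordPiece tokenization."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:              # continuation pieces carry the "##" prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:              # no piece matches: unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary without the domain term: the word is fragmented.
base_vocab = {"tub", "##ul", "##itis", "rejection"}
print(wordpiece_tokenize("tubulitis", base_vocab))      # ['tub', '##ul', '##itis']

# After extending the vocabulary with the whole word, it stays one token.
extended_vocab = base_vocab | {"tubulitis"}
print(wordpiece_tokenize("tubulitis", extended_vocab))  # ['tubulitis']
```

In practice (e.g., with the Hugging Face library), adding tokens also requires resizing the model's embedding matrix, after which the new embeddings are learned during the repeated pre-training.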
The model with the extended vocabulary, exKidneyBERT, outperformed Clinical BERT and KidneyBERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.
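The abstract does not define its metrics. One plausible reading, sketched here purely as an assumption, takes the overlap ratio as token-level intersection over union between the predicted and reference spans, and exact match as normalized string equality:

```python
from collections import Counter

def overlap_ratio(pred, gold):
    """Assumed definition: token-level intersection over union of the
    predicted and reference spans (the paper's definition may differ)."""
    p, g = Counter(pred.split()), Counter(gold.split())
    inter = sum((p & g).values())
    union = sum((p | g).values())
    return inter / union if union else 0.0

def exact_match(pred, gold):
    """Assumed definition: case-insensitive equality after trimming."""
    return pred.strip().lower() == gold.strip().lower()

# Hypothetical predictions against reference spans (illustrative only).
print(overlap_ratio("acute antibody mediated rejection",
                    "antibody mediated rejection"))      # → 0.75
print(exact_match("Grade II", "grade ii"))               # → True
```

Under this reading, a partially correct rejection span still earns partial credit, while the IFTA grade, a short categorical answer, is scored all-or-nothing.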
ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on small, specialized domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary is essential for improving performance in specialized domains, especially when pre-training on small corpora.