Yuan Hao, Hicks Parker, Ahmadian Mansooreh, Johnson Kayla A, Valtadoros Lydia, Krishnan Arjun
Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48823, United States.
Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48823, United States.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae652.
Reusing massive collections of publicly available biomedical data can significantly advance knowledge discovery. However, these public samples and studies are typically described using unstructured plain text, which hinders the findability and further reuse of the data. To address this problem, we propose txt2onto 2.0, a general-purpose method based on natural language processing and machine learning for annotating unstructured biomedical metadata to controlled vocabularies of diseases and tissues. Compared to the previous version (txt2onto 1.0), which uses numerical embeddings as features, this new version uses words as features, resulting in improved interpretability and performance, especially when few positive training instances are available. Txt2onto 2.0 uses embeddings from a large language model during prediction to handle unseen-yet-relevant words and, because its features are words, it can report the words in the input text that relate to each predicted disease and tissue term, thereby explaining the basis of every annotation. We demonstrate the generalizability of txt2onto 2.0 by accurately predicting disease annotations for studies from independent datasets, using proteomics and clinical trials as examples. Overall, our approach can annotate biomedical text regardless of experimental type or source. Code, data, and trained models are available at https://github.com/krishnanlab/txt2onto2.0.
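The idea of scoring text with word features while using embeddings to handle unseen words can be illustrated with a minimal sketch. This is not the txt2onto 2.0 implementation: the word weights, the three-dimensional "embeddings", the similarity threshold, and the function names below are all hypothetical values chosen for illustration, and the whitespace tokenizer is deliberately naive. An unseen word borrows the weight of its most similar known word, scaled by cosine similarity, and every contributing word is recorded as evidence for the annotation.

```python
import math

# Hypothetical per-word weights for one ontology term (here, "asthma"),
# standing in for coefficients that a word-feature classifier might learn.
WORD_WEIGHTS = {"asthma": 2.0, "airway": 1.2, "bronchial": 1.0, "liver": -1.5}

# Toy 3-d vectors standing in for language-model word embeddings.
# All values are made up for this example.
EMBEDDINGS = {
    "asthma": (0.9, 0.1, 0.0),
    "airway": (0.7, 0.3, 0.0),
    "bronchial": (0.8, 0.2, 0.1),
    "liver": (0.0, 0.1, 0.9),
    "asthmatic": (0.88, 0.12, 0.02),  # has no trained weight: "unseen"
    "hepatic": (0.05, 0.1, 0.95),     # has no trained weight: "unseen"
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_known_word(word, min_sim=0.9):
    """Return (known_word, similarity) for an unseen word, or None."""
    if word not in EMBEDDINGS:
        return None
    vec = EMBEDDINGS[word]
    best = max(WORD_WEIGHTS, key=lambda k: cosine(vec, EMBEDDINGS[k]))
    sim = cosine(vec, EMBEDDINGS[best])
    return (best, sim) if sim >= min_sim else None


def annotate(text):
    """Score text for the term and list the words that drove the score."""
    score = 0.0
    evidence = []  # tuples of (input word, matched known word, contribution)
    for word in text.lower().split():
        if word in WORD_WEIGHTS:
            known, contribution = word, WORD_WEIGHTS[word]
        else:
            match = nearest_known_word(word)
            if match is None:
                continue  # no weight and no close neighbor: ignore the word
            known, sim = match
            contribution = sim * WORD_WEIGHTS[known]
        score += contribution
        evidence.append((word, known, contribution))
    return score, evidence


if __name__ == "__main__":
    score, evidence = annotate("asthmatic airway samples")
    print(round(score, 3))
    for word, known, contribution in evidence:
        print(f"{word} -> {known}: {contribution:+.3f}")
```

In this sketch the evidence list plays the role of the explanation: each annotation can be traced back to specific words in the input, including unseen words that were matched to known ones.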