Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano, Italy.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad036. Epub 2023 May 23.
BACKGROUND: Literature about SARS-CoV-2 widely discusses the effects of variations that have spread in the past 3 years. Such information is dispersed in the texts of several research articles, hindering the possibility of practically integrating it with related datasets (e.g., millions of SARS-CoV-2 sequences available to the community). We aim to fill this gap by mining literature abstracts to extract, for each variant/mutation, its related effects (in epidemiological, immunological, clinical, or viral kinetics terms) with labeled higher/lower levels in relation to the nonmutated virus.
RESULTS: The proposed framework comprises (i) the provisioning of abstracts from a COVID-19-related big data corpus (CORD-19) and (ii) the identification of mutation/variant effects in abstracts using a GPT2-based prediction model. The above techniques enable the prediction of mutations/variants with their effects and levels in 2 distinct scenarios: (i) the batch annotation of the most relevant CORD-19 abstracts and (ii) the on-demand annotation of any user-selected CORD-19 abstract through the CoVEffect web application (http://gmql.eu/coveffect), which assists expert users with semiautomated data labeling. On the interface, users can inspect the predictions and correct them; user inputs can then extend the training dataset used by the prediction model. Our prototype model was trained through a carefully designed process, using a minimal and highly diversified pool of samples.
CONCLUSIONS: The CoVEffect interface serves for the assisted annotation of abstracts, allowing the download of curated datasets for further use in data integration or analysis pipelines. The overall framework can be adapted to resolve similar unstructured-to-structured text translation tasks, which are typical of biomedical domains.
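The unstructured-to-structured translation described in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `EffectAnnotation` record, the `parse_annotations` helper, and the pipe-separated `entity | effect | level` line format are hypothetical and are not taken from CoVEffect or its GPT2-based model. The sketch only shows how free-text model output might be turned into (variant/mutation, effect, level) records of the kind the abstract describes.

```python
from dataclasses import dataclass
from typing import List

# Assumed label set: the abstract only states that levels are higher/lower
# relative to the nonmutated virus.
LEVELS = {"higher", "lower"}


@dataclass(frozen=True)
class EffectAnnotation:
    """Hypothetical structured record for one extracted statement."""
    entity: str  # variant or mutation name as written in the abstract
    effect: str  # effect category (epidemiological, immunological, ...)
    level: str   # "higher" or "lower" relative to the nonmutated virus


def parse_annotations(generated: str) -> List[EffectAnnotation]:
    """Parse lines of the assumed form 'entity | effect | level'.

    Lines that do not have exactly three non-empty fields, or whose level
    is not in LEVELS, are skipped rather than guessed at.
    """
    records = []
    for line in generated.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue  # malformed line
        entity, effect, level = parts
        level = level.lower()
        if level not in LEVELS:
            continue  # unknown level label
        records.append(EffectAnnotation(entity, effect, level))
    return records


if __name__ == "__main__":
    # Illustrative string only; this is not real output of the tool.
    sample = "D614G | infectivity | higher\nnot a valid line"
    for record in parse_annotations(sample):
        print(record)
```

Dropping malformed lines instead of repairing them fits the semiautomated workflow the abstract outlines, where an expert user inspects and corrects predictions before they are reused.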