GeMI: interactive interface for transformer-based Genomic Metadata Integration.

Affiliation

Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy.

Publication information

Database (Oxford). 2022 Jun 3;2022. doi: 10.1093/database/baac036.

Abstract

The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI's ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. Database URL http://gmql.eu/gemi/.
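As a rough illustration of how extracted key-value pairs could be indexed for structured search, the sketch below parses a model's text output into a dictionary and builds a small inverted index. It is not GeMI's implementation: the `key: value | key: value` output format, the function names `parse_pairs` and `build_index`, and the sample identifiers are all assumptions made for this example.

```python
from collections import defaultdict


def parse_pairs(generated: str) -> dict[str, str]:
    """Parse 'key: value | key: value' text into a dict.

    The pipe-separated format is a hypothetical output format,
    not one documented for GeMI.
    """
    pairs = {}
    for chunk in generated.split("|"):
        key, sep, value = chunk.partition(":")
        # Skip chunks with no colon or with an empty key or value.
        if sep and key.strip() and value.strip():
            pairs[key.strip().lower()] = value.strip()
    return pairs


def build_index(records: dict[str, dict[str, str]]) -> dict[tuple[str, str], set[str]]:
    """Map each (key, lowercased value) pair to the sample IDs that carry it."""
    index = defaultdict(set)
    for sample_id, pairs in records.items():
        for key, value in pairs.items():
            index[(key, value.lower())].add(sample_id)
    return index


# Made-up model outputs for two made-up sample IDs (illustration only).
outputs = {
    "SAMPLE_A": "sex: female | tissue: liver | disease: hepatocellular carcinoma",
    "SAMPLE_B": "sex: male | tissue: liver",
}
records = {sample_id: parse_pairs(text) for sample_id, text in outputs.items()}
index = build_index(records)

# Structured query: all samples annotated with tissue = liver.
liver_samples = sorted(index[("tissue", "liver")])
print(liver_samples)
```

Lowercasing keys and values at indexing time is one simple way to reduce the inconsistency that free-text metadata introduces. A real system would more likely map values to a controlled vocabulary.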
