GeMI: interactive interface for transformer-based Genomic Metadata Integration.

Affiliation

Department of Electronics, Information, and Bioengineering, Politecnico di Milano, Via Ponzio 34/5, Milano 20133, Italy.

Publication information

Database (Oxford). 2022 Jun 3;2022. doi: 10.1093/database/baac036.

DOI:10.1093/database/baac036

PMID:35657113

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9216561/

Abstract

The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI's ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases. Database URL http://gmql.eu/gemi/.
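The abstract notes that extracted key-value pairs can be indexed for structured search. A minimal sketch of that idea follows, assuming the pairs are already available as plain dictionaries. The sample IDs, keys and values below are hypothetical and made up for illustration; they are not GEO records, and the functions `build_index` and `search` are not part of GeMI.

```python
from collections import defaultdict

# Hypothetical extracted metadata: IDs and values are invented for illustration.
EXTRACTED = {
    "sample_A": {"sex": "female", "tissue type": "liver", "disease": "hepatocellular carcinoma"},
    "sample_B": {"sex": "male", "tissue type": "liver", "disease": "none"},
    "sample_C": {"sex": "female", "tissue type": "lung", "disease": "none"},
}


def build_index(records):
    """Map each lowercased (key, value) pair to the set of sample IDs carrying it."""
    index = defaultdict(set)
    for sample_id, pairs in records.items():
        for key, value in pairs.items():
            index[(key.lower(), value.lower())].add(sample_id)
    return index


def search(index, criteria):
    """Return the sorted sample IDs that match every key-value pair in criteria."""
    matches = [
        index.get((key.lower(), value.lower()), set())
        for key, value in criteria.items()
    ]
    if not matches:
        return []
    return sorted(set.intersection(*matches))


if __name__ == "__main__":
    idx = build_index(EXTRACTED)
    print(search(idx, {"sex": "female", "tissue type": "liver"}))  # → ['sample_A']
```

An inverted index of this kind answers conjunctive queries with set intersections, which is one simple way structured attributes could support the specific queries the abstract mentions.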

Similar articles

Discovery of perturbation gene targets via free text metadata mining in Gene Expression Omnibus.

Comput Biol Chem. 2019 Jun;80:152-158. doi: 10.1016/j.compbiolchem.2019.03.014. Epub 2019 Mar 24.

ALE: automated label extraction from GEO metadata.

BMC Bioinformatics. 2017 Dec 28;18(Suppl 14):509. doi: 10.1186/s12859-017-1888-1.

Systematic tissue annotations of genomics samples by modeling unstructured metadata.

Nat Commun. 2022 Nov 8;13(1):6736. doi: 10.1038/s41467-022-34435-x.

Automated annotation of scientific texts for ML-based keyphrase extraction and validation.

Database (Oxford). 2024 Sep 27;2024. doi: 10.1093/database/baae093.

A digital repository with an extensible data model for biobanking and genomic analysis management.

BMC Genomics. 2014;15 Suppl 3(Suppl 3):S3. doi: 10.1186/1471-2164-15-S3-S3. Epub 2014 May 6.

Explorative visual analytics on interval-based genomic data and their metadata.

BMC Bioinformatics. 2017 Dec 4;18(1):536. doi: 10.1186/s12859-017-1945-9.

"METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI's sequence read archive".

BMC Bioinformatics. 2020 Sep 3;21(1):378. doi: 10.1186/s12859-020-03694-0.

Restructured GEO: restructuring Gene Expression Omnibus metadata for genome dynamics analysis.

Database (Oxford). 2019 Jan 1;2019:bay145. doi: 10.1093/database/bay145.

Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata.

BMC Bioinformatics. 2017 Sep 18;18(1):415. doi: 10.1186/s12859-017-1832-4.

Cited by

Annotating publicly-available samples and studies using interpretable modeling of unstructured metadata.

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae652.

Identification of genomic alteration and prognosis using pathomics-based artificial intelligence in oral leukoplakia and head and neck squamous cell carcinoma: a multicenter experimental study.

Int J Surg. 2025 Jan 1;111(1):426-438. doi: 10.1097/JS9.0000000000002077.

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata.

Gigascience. 2024 Jan 2;13. doi: 10.1093/gigascience/giae033.

Big data and deep learning for RNA biology.

Exp Mol Med. 2024 Jun;56(6):1293-1321. doi: 10.1038/s12276-024-01243-w. Epub 2024 Jun 14.

PEPhub: a database, web interface, and API for editing, sharing, and validating biological sample metadata.

bioRxiv. 2024 May 11:2023.08.15.551388. doi: 10.1101/2023.08.15.551388.

Challenges to sharing sample metadata in computational genomics.

Front Genet. 2023 May 23;14:1154198. doi: 10.3389/fgene.2023.1154198. eCollection 2023.

CoVEffect: interactive system for mining the effects of SARS-CoV-2 mutations and variants based on deep learning.

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad036. Epub 2023 May 23.

Opportunities and challenges in sharing and reusing genomic interval data.

Front Genet. 2023 Mar 20;14:1155809. doi: 10.3389/fgene.2023.1155809. eCollection 2023.
