A computational framework for extracting biological insights from SRA cancer data.

Author information

Guimarães Paul Anderson Souza, Carvalho Maria Gabriela Reis, Ruiz Jeronimo Conceição

Affiliations

Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil.

Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil.

Publication information

Sci Rep. 2025 Mar 8;15(1):8117. doi: 10.1038/s41598-025-91781-8.

DOI:10.1038/s41598-025-91781-8

PMID:40057525

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11890766/

Abstract

Integrating sequenced samples and clinical data from independent yet related studies in public-domain databases, such as the Sequence Read Archive (SRA), can increase sample sizes and enhance the statistical power needed for more precise bioinformatic analysis. Data mining and sample grouping are the starting points of this process and still present several challenges, including the mix of structured and unstructured data, missing deposited data, and experimental conditions and techniques that vary across studies. The proposed methodology is designed to address the main challenges of data mining and sample grouping for biomarker research. It takes a computational approach that integrates relational database construction, text and data mining, natural language processing, network analysis, and searches of PubMed publications, and it combines the MeSH, TTD, and WordNet databases to identify groups of samples with the same characteristics. As a result, it identifies and illustrates relationships among sample collections, with the aim of discovering potential cancer biomarkers. In colorectal cancer (CRC) and acute lymphoblastic leukemia (ALL) case studies, the methodology effectively navigates SRA metadata, retrieving, extracting, and integrating data. It highlights significant connections between samples and patient clinical data, revealing important biological insights. The study grouped 2,737 (CRC) and 3,655 (ALL) samples into potential comparison groups, demonstrating the method's power in identifying relationships and aiding biomarker discovery.
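The sample-grouping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the record layout, the `SYNONYMS` table (a hypothetical stand-in for MeSH/WordNet lookups), the function names, and the accession-like IDs are all assumptions made for illustration. It normalizes free-text metadata values, groups runs that share the same normalized characteristics, and counts how many groups each pair of studies has in common, which is one simple way to obtain edges for a study network.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical synonym table standing in for MeSH/WordNet lookups.
SYNONYMS = {
    "crc": "colorectal cancer",
    "colorectal carcinoma": "colorectal cancer",
    "tumour": "tumor",
    "primary tumor": "tumor",
    "normal tissue": "normal",
    "healthy": "normal",
}


def normalize(value):
    """Lowercase, collapse whitespace, and map known synonyms to one term."""
    cleaned = " ".join(value.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def signature(sample, keys):
    """Return the normalized value tuple for `keys`, or None if any is missing."""
    attrs = sample["attributes"]
    if not all(k in attrs for k in keys):
        return None  # missing metadata is skipped, not guessed
    return tuple(normalize(attrs[k]) for k in keys)


def group_samples(samples, keys):
    """Group run IDs by their shared normalized characteristics."""
    groups = defaultdict(list)
    for s in samples:
        sig = signature(s, keys)
        if sig is not None:
            groups[sig].append(s["run"])
    return dict(groups)


def study_links(samples, keys):
    """For each pair of studies, count the group signatures they share."""
    studies_by_sig = defaultdict(set)
    for s in samples:
        sig = signature(s, keys)
        if sig is not None:
            studies_by_sig[sig].add(s["study"])
    links = defaultdict(int)
    for studies in studies_by_sig.values():
        for a, b in combinations(sorted(studies), 2):
            links[(a, b)] += 1
    return dict(links)


# Toy records with made-up IDs (not real SRA accessions).
SAMPLES = [
    {"run": "RUN001", "study": "STUDY_A",
     "attributes": {"disease": "CRC", "tissue": "Tumour"}},
    {"run": "RUN002", "study": "STUDY_B",
     "attributes": {"disease": "Colorectal  carcinoma", "tissue": "primary tumor"}},
    {"run": "RUN003", "study": "STUDY_B",
     "attributes": {"disease": "colorectal cancer", "tissue": "Healthy"}},
    {"run": "RUN004", "study": "STUDY_C",
     "attributes": {"disease": "CRC"}},  # no tissue field deposited
]

groups = group_samples(SAMPLES, ["disease", "tissue"])
links = study_links(SAMPLES, ["disease", "tissue"])
```

In this toy input, the two tumor runs from different studies fall into one group despite their different spellings, and the run without a `tissue` field is left out. A real pipeline would replace the hand-written synonym table with curated vocabularies and would need a policy for partially annotated samples.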

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f5b/11890766/9644afe65dde/41598_2025_91781_Fig1_HTML.jpg
