A computational framework for extracting biological insights from SRA cancer data.

Author information

Guimarães Paul Anderson Souza, Carvalho Maria Gabriela Reis, Ruiz Jeronimo Conceição

Affiliations

Grupo Informática de Biossistemas, Bioengenharia e Genômica, Instituto René Rachou, Fiocruz Minas, Av. Augusto de Lima, 1715, Barro Preto, Belo Horizonte, MG, Brazil.

Biologia Computacional e Sistemas (BCS), Instituto Oswaldo Cruz (IOC), Fiocruz, Rio de Janeiro, Brazil.

Publication information

Sci Rep. 2025 Mar 8;15(1):8117. doi: 10.1038/s41598-025-91781-8.

Abstract

The integration of sequenced samples and clinical data from independent yet related studies in public-domain databases, such as the Sequence Read Archive (SRA), has the potential to increase sample sizes and enhance the statistical power needed for more precise bioinformatic analysis. Data mining and sample grouping are the starting points in this process and still present several challenges, including the mix of structured and unstructured data, missing deposited data, and varying experimental conditions and techniques across studies. Designed to address the main challenges of data mining and sample grouping for biomarker research, the proposed methodology integrates relational database construction, text and data mining, natural language processing, network analysis, and PubMed publication searches, and combines the MeSH, TTD, and WordNet databases to identify groups of samples with the same characteristics. As a result, it identifies and illustrates relationships among sample collections, aiming to discover potential cancer biomarkers. In colorectal cancer (CRC) and acute lymphoblastic leukemia (ALL) case studies, this methodology effectively navigates SRA metadata, retrieving, extracting, and integrating data. It highlights significant connections between samples and patient clinical data, revealing important biological insights. The study grouped 2,737 (CRC) and 3,655 (ALL) samples into potential comparison groups, demonstrating the method's power in identifying relationships and aiding biomarker discovery.
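
The sample-grouping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SYNONYMS` table is a hypothetical stand-in for MeSH/WordNet lookups, the `normalize` and `group_samples` helpers are invented for this example, and the sample records are made up rather than taken from SRA. It only shows how free-text metadata values might be normalized so that samples with the same characteristics fall into the same candidate comparison group.

```python
from collections import defaultdict

# Hypothetical synonym table standing in for MeSH/WordNet lookups.
SYNONYMS = {
    "crc": "colorectal cancer",
    "colorectal carcinoma": "colorectal cancer",
    "tumour": "tumor",
}


def normalize(value):
    """Lowercase, collapse whitespace, and map a value to a canonical term."""
    cleaned = " ".join(value.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def group_samples(samples, keys):
    """Group sample IDs by their normalized values for the chosen keys.

    Samples missing any key are collected under None so that incomplete
    metadata is not silently merged into a comparison group.
    """
    groups = defaultdict(list)
    for sample in samples:
        attrs = sample["attributes"]
        if all(k in attrs for k in keys):
            signature = tuple(normalize(attrs[k]) for k in keys)
        else:
            signature = None
        groups[signature].append(sample["id"])
    return dict(groups)


# Invented records for illustration only; not real SRA accessions.
samples = [
    {"id": "S1", "attributes": {"disease": "CRC", "tissue": "Tumour"}},
    {"id": "S2", "attributes": {"disease": "Colorectal  carcinoma", "tissue": "tumor"}},
    {"id": "S3", "attributes": {"disease": "colorectal cancer", "tissue": "normal"}},
    {"id": "S4", "attributes": {"disease": "CRC"}},
]

groups = group_samples(samples, keys=("disease", "tissue"))
```

Here `S1` and `S2` share a signature despite differing spelling and whitespace, `S3` forms its own group, and `S4` is set aside because its tissue field is missing, which mirrors the missing-data challenge the abstract mentions.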

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4f5b/11890766/9644afe65dde/41598_2025_91781_Fig1_HTML.jpg
