Genomic Anomaly Detection with Functional Data Analysis.

Authors

Kanjilal Ria, Campelo Dos Santos Andre Luiz, Arnab Sandipan Paul, DeGiorgio Michael, Assis Raquel

Affiliations

Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA.

Institute for Human Health and Disease Intervention, Florida Atlantic University, Boca Raton, FL 33431, USA.

Publication information

Genes (Basel). 2025 Jun 15;16(6):710. doi: 10.3390/genes16060710.

Abstract

Genetic variation provides a foundation for understanding evolution. With the rise of artificial intelligence, machine learning has emerged as a powerful tool for identifying genomic footprints of evolutionary processes through simulation-based predictive modeling. However, existing approaches require prior knowledge of the factors shaping genetic variation, whereas uncovering anomalous genomic regions regardless of their causes remains an equally important and complementary endeavor. To address this problem, we introduce ANDES (ANomaly DEtection using Summary statistics), a suite of algorithms that apply statistical techniques to extract features for unsupervised anomaly detection. A key innovation of ANDES is its ability to account for autocovariation due to linkage disequilibrium by fitting curves to contiguous windows and computing their first and second derivatives, thereby capturing the "velocity" and "acceleration" of genetic variation. These features are then used to train models that flag biologically significant or artifactual regions. Application to human genomic data demonstrates that ANDES successfully detects anomalous regions that colocalize with genes under positive or balancing selection. Moreover, these analyses reveal a non-uniform distribution of anomalies, which are enriched in specific autosomes, intergenic regions, introns, and regions with low GC content, repetitive sequences, and poor mappability. ANDES thus offers a novel, model-agnostic framework for uncovering anomalous genomic regions in both model and non-model organisms.
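
The feature-extraction idea described in the abstract (fit a curve across contiguous windows, then take its first and second derivatives as "velocity" and "acceleration") can be illustrated with a minimal sketch. This is not the ANDES implementation: it assumes a local cubic polynomial fit as a stand-in for the functional data analysis step, and a simple median/MAD robust z-score as a stand-in for the trained unsupervised models. The function names, the toy statistic, the window width, and the threshold of 5 are all hypothetical choices made for illustration.

```python
import numpy as np

def local_poly_features(y, half_width=5, degree=3):
    """Fit a polynomial around each interior window and return, per window,
    the fitted value, first derivative and second derivative at its centre.
    (Assumed stand-in for curve fitting; not the method used by ANDES.)"""
    y = np.asarray(y, dtype=float)
    x = np.arange(-half_width, half_width + 1)            # offsets, centred at 0
    centers = np.arange(half_width, y.size - half_width)  # skip noisy edges
    feats = np.empty((centers.size, 3))
    for row, c in enumerate(centers):
        segment = y[c - half_width:c + half_width + 1]
        p = np.poly1d(np.polyfit(x, segment, degree))
        feats[row] = (p(0.0), p.deriv(1)(0.0), p.deriv(2)(0.0))
    return centers, feats

def robust_anomaly_scores(feats):
    """Largest absolute robust z-score (median/MAD) across the features.
    (Assumed stand-in for an unsupervised anomaly detector.)"""
    med = np.median(feats, axis=0)
    mad = np.median(np.abs(feats - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)                    # avoid division by zero
    z = 0.6745 * (feats - med) / mad
    return np.max(np.abs(z), axis=1)

def make_toy_statistic(n=200, seed=0):
    """Synthetic per-window summary statistic with a hypothetical local dip."""
    rng = np.random.default_rng(seed)
    pos = np.arange(n)
    y = 0.01 + 0.001 * np.sin(pos / 15.0) + rng.normal(0.0, 0.0001, n)
    y[100:106] -= 0.006                                   # planted anomaly
    return y

stat = make_toy_statistic()
centers, features = local_poly_features(stat)
scores = robust_anomaly_scores(features)
flagged = centers[scores > 5.0]                           # hypothetical threshold
print("flagged window indices:", flagged)
```

Because each window's features come from a fit over its neighbors, adjacent windows share information, which is one simple way to reflect the autocovariation the abstract attributes to linkage disequilibrium. In this toy setting the highest scores fall on the windows whose neighborhoods overlap the planted dip.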

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3004/12192579/e90828ea3aec/genes-16-00710-g001.jpg
