Department of Medicine, Cardiovascular Institute, Beth Israel Deaconess Medical Center, Boston, MA, 02215, United States.
Division of Sleep Medicine, Harvard Medical School, Boston, MA, 02115, United States.
Bioinformatics. 2024 Jul 1;40(7). doi: 10.1093/bioinformatics/btae410.
Genome-wide DNA methylation (DNAm) profiling is indispensable for revealing how DNAm regulates biological pathways and individual phenotypes. However, managing and analyzing the extensive DNAm data generated by large cohort studies presents computational obstacles. Apache Parquet is a columnar data file format that allows efficient data storage, retrieval, and manipulation, alleviating the computational hurdles associated with conventional row-based formats. Here we introduce MethParquet, the first R package to leverage the columnar Parquet format for efficient DNAm data analysis. It can be used for data extraction, methylation risk score calculation, epigenome-wide association studies (EWAS), and other standard post-quality-control tasks. The package flexibly implements diverse regression models. Using a public methylation dataset, we show that the package reduces running time and RAM usage in large-scale EWAS.
The MethParquet R package is publicly available on GitHub at https://github.com/ZWangTen/MethParquet. It includes a vignette and a toy dataset derived from a public resource.
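To illustrate why a columnar layout suits a task such as methylation risk score calculation, the following is a minimal sketch in Python, not MethParquet's R interface. It assumes that a methylation risk score is a weighted sum of beta values over selected CpG sites, and that each CpG is stored as its own column, which may differ from the layout MethParquet actually uses. The function name, CpG identifiers, weights, and beta values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def methylation_risk_score(
    betas_by_cpg: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Return one weighted sum of beta values per sample.

    `betas_by_cpg` mimics a columnar store: each key is a CpG ID and each
    value holds that CpG's beta values across all samples. Only the columns
    named in `weights` are read; the rest are never touched.
    """
    if not betas_by_cpg:
        return []
    n_samples = len(next(iter(betas_by_cpg.values())))
    scores = [0.0] * n_samples
    for cpg, weight in weights.items():
        column = betas_by_cpg.get(cpg)
        if column is None:
            continue  # CpG not measured in this dataset: skip it
        for i, beta in enumerate(column):
            scores[i] += weight * beta
    return scores


# Hypothetical toy data: three CpGs measured in two samples.
betas = {
    "cg0001": [0.5, 0.25],
    "cg0002": [0.75, 0.5],
    "cg0003": [0.1, 0.9],  # not in the weight set, so never read
}
# Hypothetical per-CpG weights (for example, effect sizes from a prior EWAS).
weights = {"cg0001": 2.0, "cg0002": -1.0}

scores = methylation_risk_score(betas, weights)
print(scores)
```

Because the score depends on only a small subset of CpGs, a column-oriented format can read just those columns from disk, whereas a row-based file generally has to be scanned in full.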