Fang Huaying, Huang Chengcheng, Zhao Hongyu, Deng Minghua
LMAN, School of Mathematical Sciences, Beijing International Center for Mathematical Research, Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China.
College of Global Change and Earth System Science, Beijing Normal University, Beijing 100875, China.
Bioinformatics. 2015 Oct 1;31(19):3172-80. doi: 10.1093/bioinformatics/btv349. Epub 2015 Jun 4.
Advances in high-throughput sequencing for 16S rRNA gene profiling have made direct analysis of microbial communities in the environment and the human body more convenient and reliable. Inferring correlations among the members of a microbial community is of fundamental importance for genomic survey studies. Traditional Pearson correlation analysis, which treats the observed data as absolute abundances of the microbes, may produce spurious results because the data represent only relative abundances. Such compositional data require special care and appropriate methods before correlation analysis.
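The compositional effect described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the article: it assumes three taxa with independent log-normal absolute abundances (the function names `pearson` and `simulate`, the distribution, and all parameter values are illustrative choices), normalizes each sample to relative abundances, and compares Pearson correlations before and after normalization.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_samples=5000, n_taxa=3, seed=0):
    """Draw independent absolute abundances and their relative abundances.

    Assumption for illustration only: taxa are independent and log-normal.
    """
    rng = random.Random(seed)
    absolute = [
        [rng.lognormvariate(0.0, 0.5) for _ in range(n_taxa)]
        for _ in range(n_samples)
    ]
    # Each sample is divided by its total, so every row sums to 1.
    relative = [[a / sum(row) for a in row] for row in absolute]
    return absolute, relative


def column(rows, j):
    """Extract column j from a list of rows."""
    return [row[j] for row in rows]


absolute, relative = simulate()
r_abs = pearson(column(absolute, 0), column(absolute, 1))
r_rel = pearson(column(relative, 0), column(relative, 1))
print(f"absolute abundances: r = {r_abs:+.3f}")
print(f"relative abundances: r = {r_rel:+.3f}")
```

Under these assumptions the correlation between the absolute abundances should be close to zero, while the correlation between the relative abundances should be clearly negative. The sum-to-one constraint alone forces this: for p exchangeable components the population correlation is -1/(p-1), that is, -0.5 for three taxa.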
In this article, we first discuss how correlation is defined for the latent variables underlying compositional data. We then propose a novel method, CCLasso, based on least squares with an ℓ1 penalty, to infer the correlation network of these latent variables from metagenomic data. The optimization problem is solved with an effective alternating direction algorithm derived from the augmented Lagrangian method. Simulation results show that CCLasso outperforms existing methods, e.g. SparCC, in edge recovery for compositional data. It also compares well with SparCC in estimating the correlation network of microbial species from the Human Microbiome Project.
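To make the combination of a lasso-type penalty and an alternating direction algorithm more concrete, the sketch below applies the standard alternating direction method of multipliers (ADMM) to a generic ℓ1-penalized least squares problem, min 0.5·||Ax − b||² + λ·||x||₁. This is not the CCLasso objective or the authors' implementation; the function names, the variable splitting x = z, and the parameters `rho` and `n_iter` are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def admm_lasso(A, b, lam, rho=1.0, n_iter=500):
    """Generic ADMM sketch for min 0.5*||A x - b||^2 + lam*||z||_1 s.t. x = z.

    Illustrative only; not the algorithm from the CCLasso article.
    """
    n = A.shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)  # scaled dual variable of the augmented Lagrangian
    lhs = A.T @ A + rho * np.eye(n)
    Atb = A.T @ b
    for _ in range(n_iter):
        # x-update: ridge-like least squares step
        x = np.linalg.solve(lhs, Atb + rho * (z - u))
        # z-update: soft thresholding enforces sparsity
        z = soft_threshold(x + u, lam / rho)
        # dual update: accumulate the constraint violation x - z
        u = u + x - z
    return z


# Usage: with an identity design matrix the solution is the
# soft-thresholded observation vector.
A = np.eye(3)
b = np.array([3.0, -0.5, 1.5])
x_hat = admm_lasso(A, b, lam=1.0)
print(x_hat)
```

The splitting separates the smooth least squares term from the non-smooth penalty, so each iteration reduces to a linear solve followed by an entrywise shrinkage. In the identity-design usage example the small entry is set exactly to zero, which is the mechanism by which an ℓ1 penalty yields a sparse network.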
CCLasso is open source and freely available from https://github.com/huayingfang/CCLasso under GNU LGPL v3.
Supplementary data are available at Bioinformatics online.