Automatic Spectroscopic Data Categorization by Clustering Analysis (ASCLAN): A Data-Driven Approach for Distinguishing Discriminatory Metabolites for Phenotypic Subclasses.

Affiliations

Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China.

Medway Metabonomics Research Group, Medway School of Pharmacy, Universities of Kent and Greenwich, Chatham Maritime, Kent, ME4 4TB, U.K.

Publication information

Anal Chem. 2016 Jun 7;88(11):5670-9. doi: 10.1021/acs.analchem.5b04020. Epub 2016 May 13.

We propose a novel data-driven approach that aims to reliably distinguish discriminatory metabolites from nondiscriminatory metabolites in a spectroscopic data set containing two biological phenotypic subclasses. The automatic spectroscopic data categorization by clustering analysis (ASCLAN) algorithm categorizes the spectral variables within a data set into three clusters corresponding to noise, nondiscriminatory metabolite regions, and discriminatory metabolite regions. This is achieved by clustering each spectral variable on its r² value, which represents the loading weight of that variable as extracted from an orthogonal partial least-squares discriminant analysis (OPLS-DA) model of the data set. The variables are ranked according to their r² values, and a series of principal component analysis (PCA) models is then built for subsets of the spectral data corresponding to ranges of r² values. The Q²X value of each PCA model is extracted. K-means clustering is then applied to the Q²X values to generate two clusters based on a minimum Euclidean distance criterion. The cluster with the lower Q²X values is deemed devoid of metabolic information (noise), while the cluster with the higher Q²X values is further subclustered into two groups based on the r² values. The cluster with high Q²X but low r² values was considered nondiscriminatory, and the cluster with high Q²X and high r² values was considered discriminatory. The boundaries between these three clusters of spectral variables, expressed as r² values, were taken as the cutoff values defining the noise, nondiscriminatory, and discriminatory variables. We evaluated the ASCLAN algorithm using six simulated ¹H NMR spectroscopic data sets representing small, medium, and large data sets (N = 50, 500, and 1000 samples per group, respectively), each with a reduced-resolution and a full-resolution set of variables (0.005 and 0.0005 ppm, respectively).
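
The two clustering stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the OPLS-DA and PCA steps have already been run, so that each r² range is summarized by one bin center and one Q²X value, and it uses a plain one-dimensional two-cluster k-means. The function names (`kmeans_two_clusters_1d`, `categorize_bins`, `r2_cutoffs`), the example numbers, and the rule of reporting the lowest r² bin center in each category as its cutoff are all hypothetical choices made for illustration.

```python
from statistics import mean


def kmeans_two_clusters_1d(values, max_iter=100):
    """Label each value 0 (low cluster) or 1 (high cluster) by nearest centroid."""
    if not values:
        return []
    low, high = min(values), max(values)  # start the centroids at the extremes
    labels = []
    for _ in range(max_iter):
        # In one dimension the Euclidean distance is the absolute difference.
        new_labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        low_members = [v for v, lab in zip(values, labels) if lab == 0]
        high_members = [v for v, lab in zip(values, labels) if lab == 1]
        low = mean(low_members) if low_members else low
        high = mean(high_members) if high_members else high
    return labels


def categorize_bins(r2_centers, q2x_values):
    """Assign each r2 bin to noise, nondiscriminatory, or discriminatory."""
    categories = ["noise"] * len(r2_centers)
    # Stage 1: bins in the low-Q2X cluster stay labelled as noise.
    q_labels = kmeans_two_clusters_1d(q2x_values)
    informative = [i for i, lab in enumerate(q_labels) if lab == 1]
    # Stage 2: split the high-Q2X bins into two groups by their r2 value.
    r_labels = kmeans_two_clusters_1d([r2_centers[i] for i in informative])
    for i, lab in zip(informative, r_labels):
        categories[i] = "discriminatory" if lab == 1 else "nondiscriminatory"
    return categories


def r2_cutoffs(r2_centers, categories):
    """Lowest r2 bin center per informative category (assumed boundary rule)."""
    cutoffs = {}
    for name in ("nondiscriminatory", "discriminatory"):
        members = [r for r, c in zip(r2_centers, categories) if c == name]
        if members:
            cutoffs[name] = min(members)
    return cutoffs


# Made-up bin centers and Q2X values, for illustration only.
r2_centers = [0.02, 0.05, 0.10, 0.30, 0.35, 0.40, 0.80, 0.90]
q2x_values = [-0.10, -0.05, 0.02, 0.60, 0.65, 0.70, 0.72, 0.75]
categories = categorize_bins(r2_centers, q2x_values)
print(categories)
print(r2_cutoffs(r2_centers, categories))
```

Because both cutoffs come out of the clustering, no threshold is supplied by the user, which is the property the abstract emphasizes.
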
ASCLAN correctly identified all discriminatory metabolites and produced no false positives (100% specificity and positive predictive value) in all six simulated data sets, irrespective of spectral resolution or sample size. This error rate was lower than those of existing methods for ascertaining feature significance: univariate t tests with Bonferroni correction (up to 10% false positive rate), Benjamini-Hochberg correction (up to 35% false positive rate), and the metabolome-wide significance level (MWSL, up to 0.4% false positive rate), as well as various OPLS-DA parameters: variable importance in projection (up to 15% false positive rate), loading coefficients (up to 35% false positive rate), and regression coefficients (up to 39% false positive rate). The application of ASCLAN was further exemplified using a widely investigated renal toxin, mercury(II) chloride (HgCl2), in a rat model. ASCLAN successfully identified many of the known metabolic changes related to renal toxicity, such as increased excretion of urinary creatinine and various amino acids. The ASCLAN algorithm provides a framework for reliably differentiating discriminatory from nondiscriminatory metabolites in a biological data set without the need to set an arbitrary cutoff value, as some conventional methods require. This offers significant advantages over existing methods and opens the possibility of automating high-throughput screening of "omics" data.
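
To make the comparison above concrete, the following Python sketch shows how the two multiple-testing corrections select variables and how the reported metrics can be computed when the truly discriminatory variables are known, as in a simulation. It assumes the conventional definitions (false positive rate = FP / (FP + TN), specificity = TN / (FP + TN), positive predictive value = TP / (TP + FP)); the abstract does not state which definitions the study used. The p-values and truth labels are made up, and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Select variables whose p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Select variables with the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    selected = [False] * m
    for i in order[:k_max]:
        selected[i] = True
    return selected


def rates(selected, truth):
    """False positive rate, specificity and PPV of a selection against known truth."""
    tp = sum(1 for s, t in zip(selected, truth) if s and t)
    fp = sum(1 for s, t in zip(selected, truth) if s and not t)
    tn = sum(1 for s, t in zip(selected, truth) if not s and not t)
    negatives = fp + tn
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "specificity": tn / negatives if negatives else 0.0,
        "ppv": tp / (tp + fp) if (tp + fp) else 0.0,
    }


# Made-up p-values; only the first two variables are truly discriminatory.
p_values = [0.001, 0.008, 0.012, 0.04, 0.30, 0.60, 0.75, 0.90]
truth = [True, True, False, False, False, False, False, False]
print(rates(bonferroni(p_values), truth))
print(rates(benjamini_hochberg(p_values), truth))
```

In this toy case Bonferroni selects only the first variable, so it makes no false positive calls but misses one true variable, whereas Benjamini-Hochberg selects the first three and so admits one false positive.
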
