Division of Clinical Epidemiology, First Hospital of the Jilin University, 71 Xinmin Street, Changchun, Jilin 130021, China.
BMC Bioinformatics. 2014 Apr 4;15:97. doi: 10.1186/1471-2105-15-97.
Over the last decade, metabolomics has become a mainstream approach used by laboratories worldwide. Like other "omics" data, metabolomics data typically have fewer samples than measured features, so selecting an optimal subset of features for a supervised classifier is essential. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multi-class classification of "omics" data, and propose two such extensions, both referred to as multi-TGDR. Both multi-TGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) in patients infected with hepatitis B virus (HBV) or hepatitis C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC.
We applied the two multi-TGDR frameworks to the HCC metabolomics data; they determine the TGDR thresholds either globally across classes or locally for each class. The multi-TGDR global model selected 45 metabolites, with a 0% misclassification rate (the error rate on the training data) and a 3.82% 5-fold cross-validation (CV-5) predictive error rate. The multi-TGDR local model selected 48 metabolites, with a 0% misclassification rate and a 5.34% CV-5 error rate.
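The difference between a global and a local threshold can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a softmax (multinomial logistic) model over all classes with no intercept, and the function name `multi_tgdr` and the parameters `tau`, `step`, and `n_iter` are hypothetical. At each iteration only the coefficients whose gradient magnitude reaches a fraction `tau` of the largest gradient are updated; the largest gradient is taken over the whole coefficient matrix (global) or separately within each class column (local).

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multi_tgdr(X, y, n_classes, tau=0.8, step=0.05, n_iter=200, mode="local"):
    """Thresholded gradient ascent on the multinomial log-likelihood.

    X: (n_samples, n_features) array, assumed centered/standardized.
    y: integer class labels in 0..n_classes-1.
    Returns a (n_features, n_classes) coefficient matrix; coefficients
    that never pass the threshold stay exactly zero.
    """
    if mode not in ("global", "local"):
        raise ValueError("mode must be 'global' or 'local'")
    n, p = X.shape
    B = np.zeros((p, n_classes))
    Y = np.eye(n_classes)[y]              # one-hot labels
    for _ in range(n_iter):
        P = softmax(X @ B)
        G = X.T @ (Y - P) / n             # gradient of the mean log-likelihood
        A = np.abs(G)
        if mode == "global":
            cutoff = tau * A.max()        # one threshold for all classes
        else:
            cutoff = tau * A.max(axis=0, keepdims=True)  # one per class
        B += step * G * (A >= cutoff)     # update only the selected entries
    return B
```

In this sketch, the number of iterations acts as the regularization parameter alongside `tau`; in practice both would be tuned, for example by cross-validation.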
One important advantage of multi-TGDR local is that it allows inference about which features are related specifically to which class or classes. We therefore recommend multi-TGDR local: it has similar predictive performance and requires the same computing time as multi-TGDR global, but may also provide class-specific inference.
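As a rough illustration of class-specific inference, the sketch below assumes the fitted model is available as a coefficient matrix with one row per feature and one column per class, where a nonzero entry means the feature was selected for that class. The helper `features_by_class` and the feature names in the usage note are hypothetical, not taken from the paper.

```python
import numpy as np

def features_by_class(B, names, tol=0.0):
    """Map each class index to the names of features with nonzero coefficients.

    B: (n_features, n_classes) coefficient matrix.
    names: sequence of n_features feature names.
    tol: magnitudes at or below this value are treated as zero.
    """
    B = np.asarray(B)
    return {
        k: [names[j] for j in np.flatnonzero(np.abs(B[:, k]) > tol)]
        for k in range(B.shape[1])
    }
```

For example, with three hypothetical features `["m1", "m2", "m3"]` and a two-class matrix `[[0.5, 0.0], [0.0, -0.2], [0.0, 0.0]]`, the helper associates `m1` with class 0 only, `m2` with class 1 only, and `m3` with neither.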