Diagnostic biases in translational bioinformatics.

Author information

Han Henry

Affiliations

Department of Computer and Information Science, Fordham University, New York, 10023, NY, USA.

Quantitative Proteomics Center, Columbia University, New York, NY, USA.

Publication information

BMC Med Genomics. 2015 Aug 1;8:46. doi: 10.1186/s12920-015-0116-y.

DOI:10.1186/s12920-015-0116-y

PMID:26232237

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4522082/

Abstract

BACKGROUND

With the surge of translational medicine and computational omics research, complex disease diagnosis increasingly relies on molecular signature detection driven by massive omics data. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem, despite its importance in the coming era of personalized medicine.

METHODS

In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq, and miRNA-Seq data under the framework of support vector machines (SVMs) with different model selection methods. We further categorize the diagnostic biases into different types through rigorous kernel matrix analysis and provide effective machine learning methods to overcome them.
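As a rough illustration of what inspecting a kernel matrix can reveal, the sketch below builds a Gaussian (RBF) kernel matrix for the same toy samples at two feature scales. It is not the paper's analysis: the data are random, and the names `rbf_kernel_matrix` and `max_off_diagonal`, the scaling factor, and the `gamma` heuristic are all assumptions made for this example. When feature values are large, squared distances grow and the off-diagonal entries shrink toward zero, so the matrix approaches the identity and every sample looks dissimilar to every other.

```python
import math
import random


def rbf_kernel_matrix(samples, gamma):
    """Return K with K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            K[i][j] = math.exp(-gamma * sq_dist)
    return K


def max_off_diagonal(K):
    """Largest similarity between two distinct samples."""
    n = len(K)
    return max(K[i][j] for i in range(n) for j in range(n) if i != j)


rng = random.Random(0)

# Hypothetical toy data: 6 samples with 200 features each, values in [0, 1).
raw = [[rng.uniform(0.0, 1.0) for _ in range(200)] for _ in range(6)]

# The same samples with every feature multiplied by 1000 (an assumed
# stand-in for large-magnitude, unnormalized measurements).
amplified = [[1000.0 * value for value in row] for row in raw]

gamma = 1.0 / 200  # assumed heuristic: 1 / number of features

K_raw = rbf_kernel_matrix(raw, gamma)
K_amp = rbf_kernel_matrix(amplified, gamma)

print("max off-diagonal, raw scale:      ", max_off_diagonal(K_raw))
print("max off-diagonal, amplified scale:", max_off_diagonal(K_amp))
```

On the small-scale data the off-diagonal entries stay well above zero, while on the amplified data they collapse to zero, which is the kind of near-identity kernel matrix that leaves a classifier with no usable similarity structure.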

RESULTS

We find that diagnostic biases occur for data with different distributions and for SVMs with different kernels. Moreover, we identify three types of diagnostic bias in SVM diagnostics: overfitting bias, label skewness bias, and underfitting bias, and we explain the cause of each through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and overcome because its deceptive accuracy makes it easy to mistake for a normal diagnostic case. To tackle this problem, we propose derivative component analysis based support vector machines (DCA-SVM), which overcome the label skewness bias while achieving diagnostic results that rival clinical ones.
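The deceptive accuracy mentioned above can be seen with a small worked example. The sketch below is only an illustration under assumed numbers (90 disease samples, 10 controls), and `diagnostic_metrics` is a hypothetical helper, not part of the paper: a classifier that labels every sample with the majority class still scores high accuracy, and only sensitivity and specificity expose the problem.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, and specificity for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of true positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Fraction of true negatives that were detected.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Assumed toy cohort: 90 disease samples (label 1) and 10 controls (label 0).
y_true = [1] * 90 + [0] * 10

# A skewed classifier that assigns every sample to the majority class.
y_pred = [1] * len(y_true)

m = diagnostic_metrics(y_true, y_pred)
print(m)
```

Here accuracy is 0.9 even though specificity is 0: no control sample is ever recognized. Reporting sensitivity and specificity alongside accuracy is one simple way to avoid mistaking such a result for a normal diagnostic case.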

CONCLUSIONS

Our studies demonstrate that the diagnostic biases are mainly caused by three major factors: kernel selection, the signal amplification mechanism in high-throughput profiling, and the training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for overcoming the label skewness bias, owing to the powerful feature extraction capability of derivative component analysis. Our work identifies and solves an important but rarely addressed problem in translational research. It also benefits machine learning by adding new results to kernel-based learning for omics data.
