Examining the significance of fingerprint-based classifiers.

Authors

Luke Brian T, Collins Jack R

Affiliation

Advanced Biomedical Computing Center, Advanced Technology Program, SAIC-Frederick, Inc, NCI-Frederick, Frederick, MD 21702, USA.

Publication

BMC Bioinformatics. 2008 Dec 17;9:545. doi: 10.1186/1471-2105-9-545.

DOI:10.1186/1471-2105-9-545

PMID:19091087

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2628908/

Abstract

BACKGROUND

Experimental examinations of biofluids that measure the concentrations of proteins, their fragments, or metabolites are being explored as a means of detecting disease early, distinguishing diseases with similar symptoms, and assessing drug treatment efficacy. Many studies have produced classifiers with high sensitivity and specificity, and it has been argued that accurate results necessarily imply that the classifier captures some underlying biology-based features. The simplest test of this conjecture is to apply classifiers used in many published studies to datasets designed to contain no information.

RESULTS

The classification accuracy of two fingerprint-based classifiers, a decision tree (DT) algorithm and a medoid classification algorithm (MCA), is examined. These methods are applied to 30 artificial datasets that contain random concentration levels for 300 biomolecules. Each dataset contains between 30 and 300 Cases and Controls, and because the 300 observed concentrations are randomly generated, the datasets contain no biological information by construction. A modest search of decision trees containing at most seven decision nodes finds a large number of unique decision trees with an average sensitivity and specificity above 85% for datasets containing 60 Cases and 60 Controls or fewer, and for datasets with 90 Cases and 90 Controls many DTs have an average sensitivity and specificity above 80%. Even for the largest dataset (300 Cases and 300 Controls), the MCA procedure finds several unique classifiers that have an average sensitivity and specificity above 88% using only six or seven features.
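The kind of experiment described above can be illustrated with a minimal sketch. It is not the authors' search procedure: it assumes uniformly distributed random "concentrations", uses a simple greedy tree builder (depth 3, so at most seven decision nodes), and scores the tree on the same data it was fitted to. All function names (`make_random_dataset`, `best_split`, `build_tree`, `sens_spec`) are hypothetical and chosen only for this illustration.

```python
import random

def make_random_dataset(n_cases, n_controls, n_features, seed=0):
    """Uniform random 'concentrations'; the labels carry no information."""
    rng = random.Random(seed)
    rows = [[rng.random() for _ in range(n_features)]
            for _ in range(n_cases + n_controls)]
    labels = [1] * n_cases + [0] * n_controls  # 1 = Case, 0 = Control
    return rows, labels

def majority(labels, idx):
    """Majority class among the samples in idx (ties go to Case)."""
    pos = sum(labels[i] for i in idx)
    return 1 if 2 * pos >= len(idx) else 0

def best_split(rows, labels, idx):
    """Return (errors, feature, threshold) minimizing training errors when
    each side of the split predicts its own majority class."""
    n_pos = sum(labels[i] for i in idx)
    best = (min(n_pos, len(idx) - n_pos), None, None)  # errors with no split
    for f in range(len(rows[0])):
        order = sorted(idx, key=lambda i: rows[i][f])
        left_pos = 0
        for k in range(1, len(order)):
            left_pos += labels[order[k - 1]]
            lo, hi = rows[order[k - 1]][f], rows[order[k]][f]
            if lo == hi:
                continue  # cannot place a threshold between equal values
            left_neg = k - left_pos
            right_pos = n_pos - left_pos
            right_neg = (len(order) - k) - right_pos
            errors = min(left_pos, left_neg) + min(right_pos, right_neg)
            if errors < best[0]:
                best = (errors, f, (lo + hi) / 2)
    return best

def build_tree(rows, labels, idx, depth):
    """Greedy tree; depth 3 allows at most 1 + 2 + 4 = 7 decision nodes."""
    if depth == 0:
        return majority(labels, idx)
    _, f, thr = best_split(rows, labels, idx)
    if f is None:  # no split reduces the error count: make a leaf
        return majority(labels, idx)
    left = [i for i in idx if rows[i][f] <= thr]
    right = [i for i in idx if rows[i][f] > thr]
    return (f, thr,
            build_tree(rows, labels, left, depth - 1),
            build_tree(rows, labels, right, depth - 1))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree

def sens_spec(tree, rows, labels):
    """Sensitivity and specificity on the given (here: training) data."""
    tp = sum(1 for r, y in zip(rows, labels) if y == 1 and predict(tree, r) == 1)
    tn = sum(1 for r, y in zip(rows, labels) if y == 0 and predict(tree, r) == 0)
    n_pos = sum(labels)
    return tp / n_pos, tn / (len(labels) - n_pos)

# Assumed sizes for illustration: 30 Cases, 30 Controls, 300 random features.
rows, labels = make_random_dataset(30, 30, 300, seed=1)
tree = build_tree(rows, labels, list(range(len(rows))), depth=3)
sensitivity, specificity = sens_spec(tree, rows, labels)
print(f"training sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

Because the features are pure noise, any separation this tree achieves is chance fitting. Rerunning with a larger number of Cases and Controls, or scoring the tree on a freshly generated dataset, is one way to see how quickly that apparent accuracy erodes.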

CONCLUSION

While it has been argued that accurate classification results must imply some biological basis for the separation of Cases from Controls, our results show that this is not necessarily true. The DT and MCA classifiers are flexible enough to produce good results from datasets that are specifically constructed to contain no information, so a chance fit to the data is possible. All datasets used in this investigation are available on the web.
