Hemphill Edward, Lindsay James, Lee Chih, Măndoiu Ion I, Nelson Craig E
BMC Bioinformatics. 2014;15 Suppl 13(Suppl 13):S4. doi: 10.1186/1471-2105-15-S13-S4. Epub 2014 Nov 13.
An ever-expanding range of technologies generates very large numbers of biomarkers for research and clinical applications. Choosing the most informative biomarkers from a high-dimensional data set, and identifying the most reliable and accurate classification algorithms to use with that biomarker set, can be a daunting task. Existing surveys of feature selection and classification algorithms typically focus on a single data type, such as gene expression microarrays, and rarely explore model performance across multiple biological data types.
This paper presents the results of a large-scale empirical study in which a large number of popular feature selection and classification algorithms are used to identify the tissue of origin for the NCI-60 cancer cell lines. A computational pipeline was implemented to maximize the predictive accuracy of every model across all parameter settings on the five data types available for the NCI-60 cell lines. A validation experiment was conducted using external data to demonstrate robustness.
As expected, the data type and number of biomarkers have a significant effect on the performance of the predictive models. Although no model or data type uniformly outperforms the others across the entire range of tested numbers of markers, several clear trends are visible. At low numbers of biomarkers, the gene and protein expression data types differentiate between cancer cell lines significantly better than the other three data types, namely SNP, array comparative genome hybridization (aCGH), and microRNA data.
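The kind of evaluation summarized above, in which a feature selection step is paired with a classifier and accuracy is tracked as the number of markers changes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the study's pipeline: the data are synthetic (not NCI-60 measurements), the feature ranking is a simple between-class to within-class variance ratio, the classifier is a nearest-centroid rule, and the function names (`make_data`, `rank_features`, `nearest_centroid_predict`, `loo_accuracy`) are hypothetical.

```python
import random
from statistics import mean


def make_data(n_per_class=10, n_features=50, n_informative=5, seed=0):
    """Build a synthetic marker matrix (an assumption; NOT NCI-60 data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1, 2):  # three hypothetical tissue classes
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 3.0 * label  # only the first few features carry signal
            X.append(row)
            y.append(label)
    return X, y


def rank_features(X, y):
    """Rank feature indices by between-class / within-class variance ratio."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        overall = mean(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, lab in zip(col, y) if lab == c]
            m = mean(vals)
            between += len(vals) * (m - overall) ** 2
            within += sum((v - m) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose centroid is closest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for c in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        centroid = [mean([r[j] for r in rows]) for j in features]
        dist = sum((x[j] - m) ** 2 for j, m in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = c, dist
    return best_label


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy with the top-k features re-selected in each fold."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Selecting features inside the fold keeps the held-out sample unseen.
        features = rank_features(X_tr, y_tr)[:k]
        if nearest_centroid_predict(X_tr, y_tr, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)


X, y = make_data()
results = {k: loo_accuracy(X, y, k) for k in (1, 2, 5, 10)}
for k, acc in results.items():
    print(f"top {k:>2} features: LOO accuracy = {acc:.2f}")
```

Re-ranking the features inside each cross-validation fold is a deliberate design choice in this sketch: ranking on the full data set first would let the held-out sample influence which markers are chosen and would inflate the accuracy estimate. Running the same loop once per data type and per model would give accuracy curves of the kind the results describe.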