Improved high-dimensional prediction with Random Forests by the use of co-data.

Affiliations

Department of Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands.

Department of Otolaryngology-Head and Neck Surgery, VU University Medical Center, Amsterdam, 1007 MB, The Netherlands.

Publication information

BMC Bioinformatics. 2017 Dec 28;18(1):584. doi: 10.1186/s12859-017-1993-1.

Abstract

BACKGROUND

Prediction in high-dimensional settings is difficult because the number of variables is large relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting.

RESULTS

Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data are defined here as any type of information that is available on the variables of the primary data but does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study.
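The core mechanism described above, drawing candidate variables with non-uniform probabilities instead of uniform ones, can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce how CoRF learns its probabilities from the data: the per-variable `weights` are hypothetical stand-ins for co-data-derived scores, and the function names `moderated_probabilities` and `draw_candidates`, as well as the values of `n_vars`, `mtry`, and `n_draws`, are assumptions made for illustration.

```python
import numpy as np


def moderated_probabilities(weights):
    """Normalize non-negative per-variable weights into sampling probabilities."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return w / w.sum()


def draw_candidates(probs, mtry, rng):
    """Draw mtry distinct candidate variable indices for a single split."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)
n_vars, mtry, n_draws = 1000, 30, 200

# Standard Random Forest: every variable is equally likely to be a candidate.
uniform = np.full(n_vars, 1.0 / n_vars)

# Hypothetical co-data weights: the first 50 variables are favoured tenfold.
weights = np.ones(n_vars)
weights[:50] = 10.0
moderated = moderated_probabilities(weights)


def favoured_share(probs):
    """Average fraction of candidates per draw that fall in the favoured set."""
    hits = sum(np.sum(draw_candidates(probs, mtry, rng) < 50) for _ in range(n_draws))
    return hits / (n_draws * mtry)


share_uniform = favoured_share(uniform)
share_moderated = favoured_share(moderated)
print(f"favoured share, uniform:   {share_uniform:.3f}")
print(f"favoured share, moderated: {share_moderated:.3f}")
```

Under these assumed weights, the favoured variables appear among the split candidates far more often than under uniform sampling, which is the effect the moderated probabilities are meant to achieve. How well this helps in practice depends on how informative the co-data are.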

CONCLUSION

The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ecd2/5745983/f98b4509f435/12859_2017_1993_Fig1_HTML.jpg
