Lee Sophia S F, Sun Lei, Kustra Rafal, Bull Shelley B
Department of Public Health Sciences, University of Toronto, Toronto M5T3M7, Canada.
Bioinformatics. 2008 Jul 15;24(14):1603-10. doi: 10.1093/bioinformatics/btn239. Epub 2008 May 21.
We developed an EM-random forest (EMRF) for Haseman-Elston quantitative trait linkage analysis that accounts for marker ambiguity and weights each sib-pair according to the posterior identity-by-descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index, which is used to rank markers for variable selection, is not optimal when applied to linkage data because the markers are correlated. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
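To make the weighting idea concrete, the following is a minimal sketch, not the authors' EMRF code. It assumes the classic Haseman-Elston setup, in which the squared trait difference of a sib-pair is regressed on the proportion of alleles shared IBD. Each pair is expanded into three pseudo-observations (IBD = 0, 1, 2) weighted by its posterior IBD probabilities. The function name `weighted_he_fit` and its arguments are hypothetical, and the sketch performs a single weighted fit rather than any EM iteration.

```python
def weighted_he_fit(sq_diffs, ibd_posteriors):
    """Weighted least-squares fit of squared trait difference on IBD sharing.

    sq_diffs       : squared trait differences, one value per sib-pair.
    ibd_posteriors : one (p0, p1, p2) tuple per sib-pair, the posterior
                     probabilities of sharing 0, 1 or 2 alleles IBD.
    Returns (slope, intercept). Illustrative only; a hypothetical helper,
    not part of the EMRF source.
    """
    # Proportion of alleles shared IBD for IBD = 0, 1, 2.
    pi_values = (0.0, 0.5, 1.0)

    sw = swx = swy = swxx = swxy = 0.0
    for y, posterior in zip(sq_diffs, ibd_posteriors):
        # Each pair contributes three pseudo-observations, one per IBD
        # state, weighted by that state's posterior probability.
        for x, w in zip(pi_values, posterior):
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y

    mean_x = swx / sw
    mean_y = swy / sw
    var_x = swxx / sw - mean_x ** 2
    cov_xy = swxy / sw - mean_x * mean_y

    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Toy data (made up for illustration): three pairs with unambiguous IBD.
    slope, intercept = weighted_he_fit(
        [4.0, 2.0, 0.0],
        [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    )
    print(slope, intercept)
```

Under this setup, a negative slope suggests linkage: pairs that share more alleles IBD near a trait locus tend to have more similar trait values.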
Using simulations, we find that the new VI indices in EMRF perform better than the original RF VI index and perform similarly to or better than the EM-Haseman-Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.
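The abstract does not give the definition of the new VI indices. As a purely hypothetical illustration of what "borrowing information from linked markers" could look like, the sketch below replaces each marker's raw importance with a weighted average over all markers, with weights that decay exponentially with map distance. The function name, the exponential kernel, and the `bandwidth_cm` parameter are all assumptions made for this example and should not be read as the indices defined in the paper.

```python
import math


def smoothed_importance(raw_vi, positions_cm, bandwidth_cm=10.0):
    """Distance-weighted average of raw variable-importance scores.

    raw_vi       : raw importance score for each marker.
    positions_cm : map position of each marker (same order as raw_vi).
    bandwidth_cm : hypothetical decay scale; larger values borrow more
                   from distant markers.
    Illustrative only; not the VI index defined by the authors.
    """
    smoothed = []
    for pos_i in positions_cm:
        # Nearby markers get weights close to 1, distant ones close to 0.
        weights = [
            math.exp(-abs(pos_i - pos_j) / bandwidth_cm)
            for pos_j in positions_cm
        ]
        total = sum(weights)
        smoothed.append(
            sum(w * v for w, v in zip(weights, raw_vi)) / total
        )
    return smoothed


if __name__ == "__main__":
    # Toy data (made up): a single marker with high raw importance.
    print(smoothed_importance([0.0, 0.0, 1.0, 0.0, 0.0],
                              [0.0, 10.0, 20.0, 30.0, 40.0]))
```

In this toy scheme, an isolated spike is damped and its neighbors gain some importance, so a region where several adjacent markers score moderately well can outrank a single noisy marker.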
The source code for EMRF, written in C, is available at www.infornomics.utoronto.ca/downloads/EMRF.