Department of Psychology, Yale University, New Haven, CT.
Spring Health, New York, NY.
Schizophr Bull. 2018 Aug 20;44(5):1045-1052. doi: 10.1093/schbul/sby005.
Genetic risk variants for schizophrenia have been linked to many related clinical and biological phenotypes with the hopes of delineating how individual variation across thousands of variants corresponds to the clinical and etiologic heterogeneity within schizophrenia. This has primarily been done using risk score profiling, which aggregates effects across all variants into a single predictor. While effective, this method lacks flexibility in certain domains: risk scores cannot capture nonlinear effects and do not employ any variable selection. We used random forest, an algorithm with this flexibility designed to maximize predictive power, to predict 6 cognitive endophenotypes in a combined sample of psychiatric patients and controls (N = 739) using 77 genetic variants strongly associated with schizophrenia. Tenfold cross-validation was applied to the discovery sample and models were externally validated in an independent sample of similar ancestry (N = 336). Linear approaches, including linear regression and task-specific polygenic risk scores, were employed for comparison. Random forest models for processing speed (P = .019) and visual memory (P = .036) and risk scores developed for verbal (P = .042) and working memory (P = .037) successfully generalized to an independent sample with similar predictive strength and error. As such, we suggest that both methods may be useful for mapping a limited set of predetermined, disease-associated SNPs to related phenotypes. Incorporating random forest and other more flexible algorithms into genotype-phenotype mapping inquiries could contribute to parsing heterogeneity within schizophrenia; such algorithms can perform as well as standard methods and can capture a more comprehensive set of potential relationships.
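The risk score profiling and tenfold cross-validation described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline. The genotypes and phenotype are simulated. All function names (`fit_weights`, `risk_score`, `cross_validated_prs`, `simulate`) are hypothetical. Deriving per-SNP weights from univariate regression slopes on the training folds is an assumption about how a task-specific score might be built. Only the sample size (739) and variant count (77) are taken from the abstract. The random forest comparison is omitted here.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def univariate_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def fit_weights(genotypes, phenotype):
    """One weight per SNP, estimated separately (assumed weighting scheme)."""
    n_snps = len(genotypes[0])
    return [univariate_slope([row[j] for row in genotypes], phenotype)
            for j in range(n_snps)]


def risk_score(row, weights):
    """Aggregate all variants into a single linear predictor."""
    return sum(w * g for w, g in zip(weights, row))


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_prs(genotypes, phenotype, k=10, seed=0):
    """Score each subject with weights fitted on the other k-1 folds."""
    n = len(genotypes)
    folds = kfold_indices(n, k, random.Random(seed))
    predicted = [0.0] * n
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        weights = fit_weights([genotypes[i] for i in train],
                              [phenotype[i] for i in train])
        for i in held_out:
            predicted[i] = risk_score(genotypes[i], weights)
    # Pooled out-of-fold correlation between score and phenotype.
    return pearson(predicted, phenotype), predicted


def simulate(n=739, n_snps=77, n_causal=10, seed=1):
    """Synthetic allele counts (0/1/2) and a noisy additive phenotype."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]
    effects = [0.3 if j < n_causal else 0.0 for j in range(n_snps)]
    genotypes = [[(rng.random() < f) + (rng.random() < f) for f in freqs]
                 for _ in range(n)]
    phenotype = [risk_score(row, effects) + rng.gauss(0, 1)
                 for row in genotypes]
    return genotypes, phenotype


if __name__ == "__main__":
    genotypes, phenotype = simulate()
    r, _ = cross_validated_prs(genotypes, phenotype, k=10)
    print(f"out-of-fold correlation between risk score and phenotype: {r:.3f}")
```

Because the score is a fixed weighted sum, it can only express additive, linear effects and keeps every variant in the model. These are the two limitations the abstract contrasts with random forest. External validation would correspond to fitting the weights once on the full discovery sample and scoring an independent sample with them.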