Segal Mark R, Barbour Jason D, Grant Robert M
University of California, San Francisco, USA.
Stat Appl Genet Mol Biol. 2004;3:Article2; discussion article 7, article 9. doi: 10.2202/1544-6115.1031. Epub 2004 Feb 12.
The problem of relating genotype (as represented by amino acid sequence) to phenotype is distinguished from standard regression problems by the nature of sequence data. Here we investigate an instance of such a problem in which the phenotype of interest is HIV-1 replication capacity and contiguous segments of protease and reverse transcriptase sequence constitute the genotype. A variety of data analytic methods have been proposed in this context. Shortcomings of selected techniques are contrasted with the advantages afforded by tree-structured methods. However, tree-structured methods have, in turn, been criticized for achieving only modest predictive performance. A number of ensemble approaches (bagging, boosting, random forests) have recently emerged to overcome this deficiency. We evaluate random forests as applied in this setting and detail why the prediction gains obtained in other situations are not realized here. Other approaches, including logic regression, support vector machines, and neural networks, are also applied. We interpret the results in terms of HIV-1 reverse transcriptase structure and function.