Tian Feifei, Yang Li, Lv Fenglin, Zhou Peng
College of Bioengineering, Chongqing University, Shazheng Road #174, Chongqing 400044, China.
Anal Chim Acta. 2009 Jun 30;644(1-2):10-6. doi: 10.1016/j.aca.2009.04.010. Epub 2009 Apr 14.
Three machine learning algorithms, namely least-squares support vector machine (LSSVM), random forest (RF) and Gaussian process (GP), were used to model the quantitative structure-retention relationship (QSRR) for predicting and explaining the retention behavior of proteome-wide peptides in reversed-phase liquid chromatography. Peptides were parameterized using the CODESSA approach, yielding 145 descriptors per peptide that cover diverse structural information, including constitutional, topological, geometrical and physicochemical properties. On this basis, the nonlinear LSSVM, RF and GP methods, together with a sophisticated linear method, partial least-squares (PLS) regression, were employed in QSRR model development. The stability and predictive power of the constructed models were confirmed through a series of systematic validations: internal cross-validation, an external test set and Monte Carlo cross-validation. The results show that regression models developed with the nonlinear approaches (LSSVM, RF and GP) predict better than the linear PLS models. Considering that the retention times used in this work were measured on different columns and therefore carry a relatively large uncertainty (reproducibility within 7%), the best statistics, obtained from GP modeling, are satisfactory, with coefficients of determination (R²) of 0.894 for the training set and 0.866 for the test set.
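The validation scheme described above (repeated random train/test splits scored by R², comparing a nonlinear learner against a linear baseline) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code. It assumes synthetic data in place of the CODESSA descriptors and measured retention times, uses RBF kernel ridge regression as a stand-in for LSSVM (the two are closely related), and uses ordinary least squares as a stand-in for PLS. All function names and parameter values (`gamma`, `lam`, `n_rounds`, `test_frac`) are hypothetical choices made for the example.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_predict_linear(X_tr, y_tr, X_te):
    """Linear baseline: ordinary least squares with an intercept (stand-in for PLS)."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return A_te @ coef


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_predict_kernel_ridge(X_tr, y_tr, X_te, gamma=0.5, lam=1e-2):
    """Nonlinear model: RBF kernel ridge regression (stand-in for LSSVM)."""
    y_mean = y_tr.mean()  # center the target instead of fitting a bias term
    K = rbf_kernel(X_tr, X_tr, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_tr)), y_tr - y_mean)
    return rbf_kernel(X_te, X_tr, gamma) @ alpha + y_mean


def monte_carlo_cv(model, X, y, n_rounds=50, test_frac=0.3, seed=0):
    """Monte Carlo cross-validation: repeated random splits, test-set R^2 per round."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(X)))
    scores = []
    for _ in range(n_rounds):
        perm = rng.permutation(len(X))
        te, tr = perm[:n_test], perm[n_test:]
        scores.append(r2_score(y[te], model(X[tr], y[tr], X[te])))
    return np.array(scores)


# Synthetic data (assumption): three "descriptors" with a nonlinear response plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-2.0, 2.0, size=(200, 3))
y = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(0.0, 0.1, 200)

linear_scores = monte_carlo_cv(fit_predict_linear, X, y)
kernel_scores = monte_carlo_cv(fit_predict_kernel_ridge, X, y)
print(f"linear baseline    mean R2 = {linear_scores.mean():.3f}")
print(f"RBF kernel ridge   mean R2 = {kernel_scores.mean():.3f}")
```

Both models are scored on the same random splits because the same seed is passed to each call, so the difference in mean R² reflects the models rather than the partitioning. The spread of the per-round scores gives a rough indication of model stability.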