Yeung Darien, Spicer Victor, Zahedi René P, Krokhin Oleg
Department of Biochemistry and Medical Genetics, University of Manitoba, 336 BMSB, 745 Bannatyne Avenue, Winnipeg R3E 0J9, Canada.
Manitoba Centre for Proteomics and Systems Biology, University of Manitoba, 799 JBRC, 715 McDermot Avenue, Winnipeg R3E 3P4, Canada.
Comput Struct Biotechnol J. 2023 Feb 27;21:2446-2453. doi: 10.1016/j.csbj.2023.02.047. eCollection 2023.
Peptide retention time (RT) prediction algorithms are tools for studying and identifying the physicochemical properties that drive the peptide-sorbent interaction. Traditional RT algorithms use multiple linear regression with manually curated parameters to determine the direct contribution of each parameter, so improvements in RT prediction accuracy have relied on superior feature engineering. Deep learning brought a significant increase in RT prediction accuracy and automated feature engineering by chaining multiple learning modules. However, the significance and identity of the extracted variables are not well understood, owing to the inherent complexity of interpreting the "relationships-of-relationships" found in deep learning variables. To achieve accuracy and interpretability simultaneously, we isolated the individual modules used in deep learning; these isolated modules are the shallow learners employed for RT prediction in this work. Using a shallow convolutional neural network (CNN) and a gated recurrent unit (GRU), we find that the spatial features obtained via the CNN correlate with real-world physicochemical properties, namely collision cross sections (CCS) and variations of accessible surface area (ASA). Furthermore, we determined that the discovered parameters are "micro-coefficients" that contribute to the "macro-coefficient", hydrophobicity. Manually embedding CCS and the variations of ASA into the GRU model yielded R² = 0.981 using only 525 variables, and the model can represent 88% of the ∼110,000 tryptic peptides in our dataset. This work shows that the feature discovery process of our shallow learners can outperform traditional RT models and is more interpretable than the deep learning RT algorithms found in the literature.
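As a rough illustration of the traditional approach the abstract contrasts against, the sketch below implements an additive, composition-based RT model and the R² metric used to report accuracy. The coefficient table `RESIDUE_COEFF`, the `INTERCEPT`, the peptide sequences, and the observed retention times are all hypothetical values chosen for illustration; none of them come from the paper, and a real model would fit its coefficients by multiple linear regression on measured data.

```python
from collections import Counter

# Hypothetical per-residue retention coefficients (arbitrary units).
# Illustrative values only; they are NOT taken from the paper.
RESIDUE_COEFF = {
    "L": 9.0, "F": 10.0, "A": 1.0, "G": 0.0,
    "K": -2.0, "R": -1.0, "E": 0.5,
}
INTERCEPT = 2.0  # hypothetical intercept term


def predict_rt(peptide: str) -> float:
    """Additive model: intercept plus the sum of per-residue coefficients."""
    counts = Counter(peptide)
    return INTERCEPT + sum(
        RESIDUE_COEFF.get(aa, 0.0) * n for aa, n in counts.items()
    )


def r_squared(observed, predicted) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up peptides and made-up "observed" retention times.
    peptides = ["GAK", "LAFK", "LLEFR"]
    observed = [1.5, 19.0, 30.0]
    predicted = [predict_rt(p) for p in peptides]
    print(predicted, r_squared(observed, predicted))
```

Because this model only counts residues, it assigns the same RT to any two peptides with the same composition regardless of residue order. Sequence-aware learners such as the CNN and GRU named in the abstract are not restricted in this way.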