Mardikoraem Mehrsa, Woldring Daniel
Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA.
Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
Pharmaceutics. 2023 Apr 25;15(5):1337. doi: 10.3390/pharmaceutics15051337.
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges in training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We elaborate on performance across protein fitness, protein size, and sampling techniques. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased the predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was rigorous enough in stability prediction (F1-score = 92%).
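The ranking step named in the abstract (TOPSIS with entropy weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `entropy_weights` and `topsis` are hypothetical, the score matrix holds made-up placeholder values rather than results from the paper, and all criteria are assumed to be positive "higher is better" metrics.

```python
import math


def entropy_weights(matrix):
    """Entropy weights for an (alternatives x criteria) matrix of positive scores.

    Criteria whose values vary more across alternatives receive larger weights.
    """
    n, m = len(matrix), len(matrix[0])
    if n < 2:
        raise ValueError("need at least two alternatives")
    diversities = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        props = [v / total for v in column]
        # Shannon entropy scaled to [0, 1]; terms with p == 0 contribute 0.
        entropy = -sum(p * math.log(p) for p in props if p > 0) / math.log(n)
        diversities.append(max(0.0, 1.0 - entropy))
    d_total = sum(diversities)
    if d_total <= 1e-12:
        # Every criterion is constant across alternatives: fall back to equal weights.
        return [1.0 / m] * m
    return [d / d_total for d in diversities]


def topsis(matrix, weights):
    """Closeness of each alternative to the ideal solution (1.0 = best possible).

    Assumption: every criterion is a benefit criterion (higher is better).
    """
    m = len(matrix[0])
    # Vector-normalise each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(m)]
    worst = [min(row[j] for row in weighted) for j in range(m)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(m)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(m)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


# Placeholder scores (NOT from the paper): rows are candidate methods,
# columns are three imbalance-aware metrics, all "higher is better".
example_matrix = [
    [0.90, 0.80, 0.85],
    [0.70, 0.60, 0.65],
    [0.80, 0.70, 0.80],
]
example_weights = entropy_weights(example_matrix)
example_scores = topsis(example_matrix, example_weights)
print(example_weights, example_scores)
```

Under these assumptions, a method that is best on every metric gets a closeness score of 1 and one that is worst on every metric gets 0; intermediate methods fall in between, with the entropy weights deciding how much each metric contributes.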