Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection.

Author information

Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan.

AIBioMed Research Group, Taipei Medical University, 250 Wuxing Street, 110, Taipei, Taiwan.

Publication information

Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad319.

DOI: 10.1093/bib/bbad319

PMID: 37649385

Abstract

Protein crystallization is crucial for biology, but the steps involved are complex and demanding in terms of both external factors and internal structure. To save experimental cost and time, the tendency of proteins to crystallize can first be estimated and screened by modeling. This study therefore created a new pipeline that uses protein sequence to predict crystallization propensity at the protein material production stage, the purification stage and the crystal production stage. The pipeline proposes a new feature selection method, which combines Chi-square ($\chi^{2}$) and recursive feature elimination together with the 12 selected features. This is followed by linear discriminant analysis for dimensionality reduction, and finally a support vector machine algorithm with hyperparameter tuning and 10-fold cross-validation is used to train the model and test the results. The pipeline has been tested on three different datasets, and its accuracy is higher than that of existing pipelines. In conclusion, our model provides a new solution for predicting multistage protein crystallization propensity, which is a major challenge in computational biology.
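The stages named in the abstract (Chi-square filtering, recursive feature elimination, linear discriminant analysis, then a tuned support vector machine under 10-fold cross-validation) can be sketched with scikit-learn as below. This is an illustrative sketch only, not the authors' implementation. The abstract gives only the figure of 12 selected features and the 10 folds; treating 12 as the output size of the elimination step, keeping 40 features after the Chi-square stage, the kernels, the hyperparameter grid, the min-max scaling and the synthetic data standing in for sequence-derived features are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def build_search(n_chi2=40, n_final=12, seed=0):
    """Return an unfitted grid search over a chi2 -> RFE -> LDA -> SVM pipeline.

    n_chi2 (features kept by the first level) is an assumed value;
    n_final=12 follows the feature count mentioned in the abstract.
    """
    pipe = Pipeline([
        # chi2 needs non-negative inputs, so rescale to [0, 1] first (assumption).
        ("scale", MinMaxScaler()),
        # Level 1: keep the n_chi2 features with the highest chi-square score.
        ("chi2", SelectKBest(chi2, k=n_chi2)),
        # Level 2: recursively drop the weakest features of a linear SVM.
        ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=n_final, step=2)),
        # Project the selected features onto the discriminant axis.
        ("lda", LinearDiscriminantAnalysis()),
        # Final classifier; the RBF kernel is an assumed choice.
        ("svm", SVC(kernel="rbf")),
    ])
    # Small illustrative grid; the search space used in the paper is not stated here.
    grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1]}
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    return GridSearchCV(pipe, grid, cv=cv, scoring="accuracy")


if __name__ == "__main__":
    # Synthetic placeholder data: 200 "proteins", 60 numeric features, 2 classes.
    X, y = make_classification(n_samples=200, n_features=60,
                               n_informative=8, random_state=0)
    search = build_search().fit(X, y)
    print("best parameters:", search.best_params_)
    print("mean 10-fold accuracy:", round(search.best_score_, 3))
```

Placing both selection steps inside the `Pipeline` means they are refit on the training part of every fold, so the cross-validated accuracy is not inflated by features chosen with knowledge of the held-out samples.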

Similar articles

Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features.

Brief Bioinform. 2021 May 20;22(3). doi: 10.1093/bib/bbaa076.

PredPPCrys: accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection.

PLoS One. 2014 Aug 22;9(8):e105902. doi: 10.1371/journal.pone.0105902. eCollection 2014.

Combining handcrafted features with latent variables in machine learning for prediction of radiation-induced lung damage.

Med Phys. 2019 May;46(5):2497-2511. doi: 10.1002/mp.13497. Epub 2019 Apr 8.

Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity.

Interdiscip Sci. 2021 Dec;13(4):693-702. doi: 10.1007/s12539-021-00448-1. Epub 2021 Jun 18.

CrystalM: A Multi-View Fusion Approach for Protein Crystallization Prediction.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Jan-Feb;18(1):325-335. doi: 10.1109/TCBB.2019.2912173. Epub 2021 Feb 3.

HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection.

Comput Math Methods Med. 2020 Mar 28;2020:1384749. doi: 10.1155/2020/1384749. eCollection 2020.

Robust biomarker screening from gene expression data by stable machine learning-recursive feature elimination methods.

Comput Biol Chem. 2022 Oct;100:107747. doi: 10.1016/j.compbiolchem.2022.107747. Epub 2022 Jul 29.

Score and Correlation Coefficient-Based Feature Selection for Predicting Heart Failure Diagnosis by Using Machine Learning Algorithms.

Comput Math Methods Med. 2021 Dec 20;2021:8500314. doi: 10.1155/2021/8500314. eCollection 2021.

TargetDBP: Accurate DNA-Binding Protein Prediction Via Sequence-Based Multi-View Feature Learning.

IEEE/ACM Trans Comput Biol Bioinform. 2020 Jul-Aug;17(4):1419-1429. doi: 10.1109/TCBB.2019.2893634. Epub 2019 Jan 18.

Cited by

Using preprocessed datasets to construct and interpret multiclass identification models.

Front Plant Sci. 2025 Aug 20;16:1597673. doi: 10.3389/fpls.2025.1597673. eCollection 2025.

Machine learning-based identification of diagnostic and prognostic mitotic cell cycle genes in hepatocellular carcinoma.

PLoS One. 2025 Aug 28;20(8):e0331118. doi: 10.1371/journal.pone.0331118. eCollection 2025.

Enhancing body fat prediction with WGAN-GP data augmentation and XGBoost algorithm.

Sci Prog. 2025 Jul-Sep;108(3):368504251366850. doi: 10.1177/00368504251366850. Epub 2025 Aug 6.

Integration of Multi-Scale Profiling and Machine Learning Reveals the Prognostic Role of Extracellular Matrix-Related Cancer-Associated Fibroblasts in Lung Adenocarcinoma.

Int J Med Sci. 2025 Jun 12;22(12):2956-2972. doi: 10.7150/ijms.113580. eCollection 2025.

An artificial intelligence-based approach for identifying the proteins regulating liquid-liquid phase separation.

Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf313.

Graph-RPI: predicting RNA-protein interactions via graph autoencoder and self-supervised learning strategies.

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf292.

SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets.

PLoS One. 2025 Apr 28;20(4):e0320314. doi: 10.1371/journal.pone.0320314. eCollection 2025.

A review of machine learning methods for imbalanced data challenges in chemistry.

Chem Sci. 2025 Apr 22;16(18):7637-7658. doi: 10.1039/d5sc00270b. eCollection 2025 May 7.

Deep Neural Networks Based on Sp7 Protein Sequence Prediction in Peri-Implant Bone Formation.

Int J Dent. 2025 Apr 7;2025:7583275. doi: 10.1155/ijod/7583275. eCollection 2025.

Combining multi-omics analysis with machine learning to uncover novel molecular subtypes, prognostic markers, and insights into immunotherapy for melanoma.

BMC Cancer. 2025 Apr 7;25(1):630. doi: 10.1186/s12885-025-14012-3.
