IEEE/ACM Trans Comput Biol Bioinform. 2021 Sep-Oct;18(5):1986-1995. doi: 10.1109/TCBB.2019.2954826. Epub 2021 Oct 7.
X-ray crystallography is the most widely used approach for determining protein 3D structure. However, the success rate of protein crystallization is very low (2-10 percent). To reduce the cost in time and resources, many computational methods have been developed to predict whether a protein will crystallize. Improving the accuracy of these predictions is therefore important for protein structure determination by X-ray crystallography, and many machine learning methods are currently used for this task. In this article, we propose a Fuzzy Support Vector Machine based on Linear Neighborhood Representation (FSVM-LNR) to predict the crystallization propensity of proteins. Proteins are represented by three types of features (PsePSSM, PSSM-DWT, MMI-PS), which are serially combined and fed into FSVM-LNR. FSVM-LNR can filter outliers by a membership score, which is calculated from the reconstruction residuals of the k nearest samples. To evaluate the performance of our predictive model, we test FSVM-LNR on the TRAIN3587, TEST3585 and TEST500 datasets. Our method achieves a better Matthews correlation coefficient (MCC) on TRAIN3587 (MCC: 0.56) and TEST3585 (MCC: 0.58). Although its performance on the independent test set TEST500 is not the best, FSVM-LNR still shows predictive ability (MCC: 0.70) in identifying protein crystallization. The good performance on these datasets demonstrates the effectiveness of our method, and the better performance on the large datasets further demonstrates its stability and superiority.
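The membership score described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the details here are assumptions: each sample is reconstructed as a convex combination of its k nearest neighbors (non-negative weights summing to one), the weights are found by projected gradient descent, and the residual is mapped to a score in (0, 1] by a simple decreasing function. The names `project_to_simplex`, `lnr_residual` and `lnr_membership`, and the parameters `k` and `n_iter`, are hypothetical and not taken from the paper.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def lnr_residual(x, neighbors, n_iter=200):
    """Distance from x to its best convex reconstruction from `neighbors`.

    Assumed formulation: minimise ||x - sum_j w_j n_j||^2 subject to
    w >= 0 and sum(w) = 1, solved by projected gradient descent.
    """
    diffs = neighbors - x                      # (k, d), neighbors centred on x
    gram = diffs @ diffs.T                     # objective is w^T gram w
    step = 1.0 / (2.0 * np.linalg.norm(gram, 2) + 1e-12)
    w = np.full(len(neighbors), 1.0 / len(neighbors))
    for _ in range(n_iter):
        w = project_to_simplex(w - step * 2.0 * gram @ w)
    return np.linalg.norm(x - w @ neighbors)


def lnr_membership(X, k=5, n_iter=200):
    """Membership scores in (0, 1] for the rows of X (samples of one class)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n_samples")

    # Dense pairwise squared distances; fine for a sketch, costly for large n.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sq_dist, np.inf)          # a sample is not its own neighbor

    residuals = np.array([
        lnr_residual(X[i], X[np.argsort(sq_dist[i])[:k]], n_iter)
        for i in range(n)
    ])
    # Assumed mapping: a larger residual gives a smaller membership.
    return 1.0 / (1.0 + residuals / (residuals.mean() + 1e-12))
```

Samples that lie far from their neighborhood receive low scores and therefore contribute less to training. One possible way to use such scores, again as an assumption rather than the authors' implementation, is to pass them as per-sample weights to a standard SVM, for example through the `sample_weight` argument of scikit-learn's `SVC.fit`.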