College of Artificial Intelligence, Wuxi Vocational College of Science and Technology, No. 8 Xinxi Road, Wuxi, 214028, China.
College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin, 150040, China.
BMC Bioinformatics. 2020 Oct 27;21(1):480. doi: 10.1186/s12859-020-03826-6.
Classifying proteins with specific functions is important for biological research, and the encoding approaches used to extract features from protein sequences play a central role in protein classification. Many computational methods (namely classifiers) are applied to protein sequences according to various encoding approaches. Commonly, protein sequences already carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effector or not), which makes genuine protein prediction unrealistic. Protein prediction would require that a core set of protein sequences, with labels certified by biological experiments, exist in advance. Such a set, however, is rarely available in current research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers can help evaluate the effectiveness of different encoding approaches. In addition, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.
Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Using a benchmark dataset of 1947 protein sequences (399 T4SE and 1548 non-T4SE) as a case study, we conduct experiments to identify bacterial type IV secreted effectors (T4SE) from protein sequences. Comparable, quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.
Certain variables, rather than the whole encoded feature they belong to, are effective for discriminating between different types of proteins. In addition, ensemble classifiers with automatic assignment of different base classifiers achieve better classification results.
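The idea of keeping only some variables of an encoded feature can be illustrated with a minimal sketch. This is not the selection method proposed in the paper, which the abstract does not specify; it assumes a simple Fisher-style score (squared difference of class means divided by the summed class variances) as a stand-in ranking criterion. The toy data, the function names `fisher_scores`, `select_top_k` and `make_toy_data`, and the 40/160 class split (chosen only to resemble the imbalance of the benchmark) are all hypothetical.

```python
import random
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Score each column: (difference of class means)^2 / (sum of class variances)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        spread = pvariance(p) + pvariance(n)
        # A constant column carries no discriminative information.
        scores.append((mean(p) - mean(n)) ** 2 / spread if spread > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring columns, in ascending order."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n_pos=40, n_neg=160, n_vars=20, informative=(3, 7), seed=0):
    """Synthetic stand-in for an encoded feature; NOT real PSSM data."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            if label == 1:
                # Shift only the informative columns for the positive class.
                for j in informative:
                    row[j] += 3.0
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
selected = select_top_k(X, y, k=2)
# Keep only the selected variables for any downstream classifier.
X_reduced = [[row[j] for j in selected] for row in X]
```

In practice the reduced matrix would be passed to whichever classifiers are being compared, so that results obtained with the selected variables can be set against results obtained with the full encoded feature.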