Sakar C Okan, Kursun Olcay, Seker Huseyin, Gurgen Fikret
Int J Data Min Bioinform. 2014;10(2):162-74. doi: 10.1504/ijdmb.2014.064012.
Computational annotation and prediction of protein structure are very important in the post-genome era because of the large number of distinct proteins, most of which are yet to be verified. Mutual information based feature selection methods can be used to select minimal yet predictive subsets of features for this task. However, protein features are organised into natural partitions (views). Individual feature selection ignores these views, dismantles them, and intermixes their variables with those of other views, which at best results in a complex, uninterpretable predictive system for such multi-view datasets. In this paper, instead of selecting a subset of individual features, we select whole views: each feature subset (view) is passed through a clustering step so that it is represented in discrete form by its cluster indices, which makes mutual information based methods applicable to view selection. We present experimental results on a multi-view protein dataset that is used to predict protein structure.
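The idea of discretising each view by clustering and then scoring it with mutual information can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not taken from the paper: the abstract does not name the clustering algorithm, so a tiny deterministic k-means stands in for it; the number of clusters `k`, the view names `informative` and `noisy`, the toy data, and the helper names `kmeans_labels`, `mutual_information`, and `rank_views` are all hypothetical.

```python
import math
from collections import Counter


def kmeans_labels(rows, k, iters=20):
    """Tiny deterministic k-means; returns one cluster index per row.

    Assumption: centroids start from evenly spaced rows. Requires k <= len(rows).
    """
    step = max(1, len(rows) // k)
    centroids = [list(rows[i * step]) for i in range(k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
            )
            for row in rows
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def mutual_information(a, b):
    """Mutual information in bits between two discrete label sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nab / n) * math.log2(nab * n / (pa[x] * pb[y]))
        for (x, y), nab in pab.items()
    )


def rank_views(views, target, k=2):
    """Score each view by the MI between its cluster indices and the target."""
    scores = {
        name: mutual_information(kmeans_labels(rows, k), target)
        for name, rows in views.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: two views describing the same eight samples.
target = [0, 0, 0, 0, 1, 1, 1, 1]
views = {
    # Clusters in this view line up with the class labels.
    "informative": [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [0.1, 0.1],
                    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]],
    # Clusters in this view are unrelated to the class labels.
    "noisy": [[0.0], [5.0], [0.1], [5.1], [0.2], [4.9], [0.0], [5.2]],
}

ranking = rank_views(views, target)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

Because each view is reduced to a single discrete variable (its cluster index), a whole view is kept or discarded as a unit, which is what keeps the resulting predictor interpretable at the view level. A fuller treatment would also account for redundancy between views, which this sketch does not attempt.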