Shin Hyunjin, Sheu Bryan, Joseph Maria, Markey Mia K
Department of Electrical and Computer Engineering, The University of Texas at Austin, USA.
J Biomed Inform. 2008 Feb;41(1):124-36. doi: 10.1016/j.jbi.2007.04.003. Epub 2007 Apr 14.
In recent years, proteomic profiling by mass spectrometry has opened up a new realm of methods for identifying potential biomarkers. Mass spectrometry data, like other proteomic and genomic data, are challenging to analyze because of their high dimensionality and the small number of available samples. Hence, feature selection is extremely important because it directly provides a list of potential biomarkers by choosing a subset of effective features that separate diseased samples from healthy ones. The rule of thumb for feature selection is that features must be both discriminant and independent to achieve the best separation of the two groups. In general, however, existing feature selection algorithms take into account only the discrimination ability of features. In this paper, we present a novel method for feature selection, guilt-by-association feature selection (GBA-FS). The algorithm makes it possible to select features that are independent as well as discriminant. After measuring similarities between features, the algorithm groups similar features together using a clustering algorithm and selects the best representative feature from each group. As a result, it produces a list of discriminant and independent features. The efficacy of GBA-FS was extensively tested on two real-world SELDI-TOF data sets. The experimental results demonstrate that GBA-FS selects features that are more independent than those chosen by a common filter-type feature selection method, the t-test. The results also show that GBA-FS can be used to deconvolve multiply charged states of the same protein molecules. Because GBA-FS successfully identifies feature groups with similar mass values, it can also be employed as an alternative to peak detection for preprocessing mass spectrometry data.
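The group-then-pick-a-representative procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which similarity measure, clustering algorithm, or selection criterion GBA-FS uses, so everything specific here is an assumption made for illustration: absolute Pearson correlation as the similarity, average-linkage hierarchical clustering, and the absolute Welch t-statistic as the discrimination score. The function name `gba_fs_sketch` and the parameter `n_groups` are hypothetical, and the data in the demo are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import ttest_ind


def gba_fs_sketch(X, y, n_groups):
    """Pick one representative feature per cluster of similar features.

    X: (n_samples, n_features) array of intensities.
    y: binary labels (0 = healthy, 1 = diseased).
    Returns the sorted indices of the selected features.
    """
    # Assumption: similarity is the absolute Pearson correlation
    # between feature columns, converted to a distance.
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    dist = np.clip((dist + dist.T) / 2.0, 0.0, None)  # enforce symmetry
    np.fill_diagonal(dist, 0.0)

    # Assumption: average-linkage hierarchical clustering,
    # cut so that at most n_groups clusters remain.
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")

    # Assumption: discrimination is scored by the absolute Welch t-statistic.
    t_stat, _ = ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
    score = np.abs(t_stat)

    # Keep the highest-scoring member of each cluster.
    selected = []
    for g in np.unique(labels):
        members = np.flatnonzero(labels == g)
        selected.append(int(members[np.argmax(score[members])]))
    return sorted(selected)


# Synthetic demo: 3 latent signals, each copied into 4 noisy features.
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
latent = rng.normal(size=(40, 3))
latent[:, 0] += 2.0 * y  # only the first latent signal separates the groups
X = np.hstack([latent[:, [k]] + 0.05 * rng.normal(size=(40, 4))
               for k in range(3)])
print(gba_fs_sketch(X, y, n_groups=3))
```

In this toy setting, ranking features by t-statistic alone would tend to fill the top of the list with near-duplicates of the same underlying signal, whereas the grouped selection returns one feature per correlated block. That is the independence property the abstract emphasizes, shown here only under the stated assumptions.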