Baggerly Keith A, Morris Jeffrey S, Wang Jing, Gold David, Xiao Lian-Chun, Coombes Kevin R
Department of Biostatistics, UT M.D. Anderson Cancer Center, Houston, TX 77030, USA.
Proteomics. 2003 Sep;3(9):1667-72. doi: 10.1002/pmic.200300522.
For our analysis of the data from the First Annual Proteomics Data Mining Conference, we attempted to discriminate between 24 disease spectra (group A) and 17 normal spectra (group B). First, we processed the raw spectra by (i) correcting for additive sinusoidal noise (periodic on the time scale) affecting most spectra, (ii) correcting for the overall baseline level, (iii) normalizing, (iv) recombining fractions, and (v) using variable-width windows for data reduction. We also identified a set of polymeric peaks (at multiples of 180.6 Da) that is present in several normal spectra (B1-B8). After data processing, we found the intensities at the following mass-to-charge (m/z) values to be useful discriminators: 3077, 12 886 and 74 263. Using these values, we were able to achieve an overall classification accuracy of 38/41 (92.7%). Perfect classification could be achieved by adding two more peaks, at 2476 and 6955. We identified these values by applying a genetic algorithm to a filtered list of m/z values, using the Mahalanobis distance between the group means as the fitness function.
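The fitness function named in the abstract, the Mahalanobis distance between the group means, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which covariance estimate was used or whether the distance was squared, so the sketch assumes a pooled within-group covariance and returns the squared distance. The function name `mahalanobis_fitness` and the synthetic data are hypothetical.

```python
import numpy as np


def mahalanobis_fitness(X_a, X_b, columns):
    """Squared Mahalanobis distance between the two group means,
    restricted to the selected feature columns.

    Assumption: a pooled within-group covariance is used; the abstract
    does not specify the covariance estimate.
    """
    A = np.asarray(X_a, dtype=float)[:, columns]
    B = np.asarray(X_b, dtype=float)[:, columns]
    n_a, n_b = A.shape[0], B.shape[0]

    # Difference between the group mean vectors
    diff = A.mean(axis=0) - B.mean(axis=0)

    # Per-group sample covariances (atleast_2d handles the one-column case)
    S_a = np.atleast_2d(np.cov(A, rowvar=False))
    S_b = np.atleast_2d(np.cov(B, rowvar=False))
    pooled = ((n_a - 1) * S_a + (n_b - 1) * S_b) / (n_a + n_b - 2)

    # diff^T * pooled^{-1} * diff, computed without an explicit inverse
    return float(diff @ np.linalg.solve(pooled, diff))


# Synthetic stand-in for reduced spectra: 24 "disease" and 17 "normal"
# rows, 6 hypothetical intensity features. Not real data.
rng = np.random.default_rng(0)
group_a = rng.normal(size=(24, 6))
group_b = rng.normal(size=(17, 6))
group_a[:, 0] += 3.0  # make feature 0 informative in the toy data

print(mahalanobis_fitness(group_a, group_b, [0]))
print(mahalanobis_fitness(group_a, group_b, [0, 1, 2]))
```

A genetic algorithm would call such a function on each candidate subset of m/z columns and favor subsets with larger values. With a pooled covariance, adding a column can never lower the score, so a search of this kind typically fixes the subset size or otherwise limits how many peaks are selected.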