Robinson Mark D, De Souza David P, Keen Woon Wai, Saunders Eleanor C, McConville Malcolm J, Speed Terence P, Likić Vladimir A
The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, VIC 3050, Australia.
BMC Bioinformatics. 2007 Oct 29;8:419. doi: 10.1186/1471-2105-8-419.
Gas chromatography-mass spectrometry (GC-MS) is a robust platform for the profiling of certain classes of small molecules in biological samples. When multiple samples are profiled, including replicates of the same sample and/or different sample states, one needs to account for retention time drifts between experiments. This can be achieved either by the alignment of chromatographic profiles prior to peak detection, or by matching signal peaks after they have been extracted from chromatogram data matrices. Automated retention time correction is particularly important in non-targeted profiling studies.
A new approach for matching signal peaks based on dynamic programming is presented. The proposed approach relies on both peak retention times and mass spectra. The alignment of more than two peak lists involves three steps: (1) all possible pairs of peak lists are aligned, and the similarity of each pair of peak lists is estimated; (2) a guide tree is built based on the similarity between the peak lists; (3) peak lists are progressively aligned starting with the two most similar peak lists, following the guide tree until all peak lists are exhausted. When two or more experiments are performed on different sample states and each consists of multiple replicates, peak lists within each set of replicate experiments are aligned first (within-state alignment), and subsequently the resulting alignments are aligned themselves (between-state alignment). When more than two sets of replicate experiments are present, the between-state alignment also employs the guide tree. We demonstrate the usefulness of this approach on GC-MS metabolic profiling experiments acquired on wild-type and mutant Leishmania mexicana parasites.
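The pairwise step in (1) can be illustrated with a generic global alignment by dynamic programming. The following is a minimal sketch, not the authors' implementation: the function name `align_peak_lists`, the gap penalty value, and the precomputed similarity matrix are all assumptions made for illustration, since the abstract does not specify the scoring scheme.

```python
from typing import List, Optional, Tuple

Pairing = Tuple[Optional[int], Optional[int]]


def align_peak_lists(sim: List[List[float]], gap: float = 0.3) -> List[Pairing]:
    """Globally align two peak lists given a similarity matrix.

    sim[i][j] is the similarity between peak i of list A and peak j of list B
    (hypothetical input; higher means more alike). `gap` is an assumed penalty
    for leaving a peak unmatched. Returns index pairs in retention order, with
    None marking a peak that has no partner in the other list.
    """
    n = len(sim)
    m = len(sim[0]) if n else 0

    # score[i][j]: best total score aligning the first i peaks of A
    # with the first j peaks of B.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # match peak i with peak j
                score[i - 1][j] - gap,                    # peak i of A unmatched
                score[i][j - 1] - gap,                    # peak j of B unmatched
            )

    # Trace back from the bottom-right corner to recover the pairing.
    pairs: List[Pairing] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] - gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


# Toy example: list A has three peaks, list B lacks a counterpart of A's middle peak.
toy_sim = [
    [0.9, 0.1],
    [0.2, 0.1],
    [0.1, 0.8],
]
toy_alignment = align_peak_lists(toy_sim, gap=0.3)
```

In this toy case the middle peak of list A is left unmatched because pairing it with either peak of list B would score lower than paying the gap penalty. In a progressive scheme such as the one described above, the total alignment score of each pair of lists could serve as the similarity used to build the guide tree, although the abstract does not state which measure is used.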
We propose a progressive method to match signal peaks across multiple GC-MS experiments based on dynamic programming. A sensitive peak similarity function is proposed to balance peak retention time and peak mass spectra similarities. This approach can produce the optimal alignment between an arbitrary number of peak lists, and explicitly models within-state and between-state peak alignment. The accuracy of the proposed method was close to that of manually curated peak matching, which required tens of man-hours for the analyzed data sets. The proposed approach may offer significant advantages for processing high-throughput metabolomics data, especially when large numbers of experimental replicates and multiple sample states are analyzed.
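One plausible way to balance retention time and mass spectral similarity in a single score is sketched below. The abstract does not give the functional form, so this is an assumption throughout: the spectra are compared by cosine similarity, the retention time difference is down-weighted by a Gaussian term, and the name `peak_similarity` and the parameter `rt_tolerance` are hypothetical.

```python
import math
from typing import Dict


def peak_similarity(
    rt_a: float,
    spec_a: Dict[int, float],
    rt_b: float,
    spec_b: Dict[int, float],
    rt_tolerance: float = 2.5,
) -> float:
    """Score two peaks in [0, 1] from retention time and mass spectrum.

    Spectra are dicts mapping m/z to non-negative intensity. `rt_tolerance`
    (same units as the retention times) is an assumed width controlling how
    quickly the score decays as retention times diverge.
    """
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum cannot be matched

    # Cosine similarity over the m/z values present in both spectra.
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    cosine = dot / (norm_a * norm_b)

    # Gaussian weight: 1 for identical retention times, decaying with distance.
    rt_weight = math.exp(-((rt_a - rt_b) ** 2) / (2.0 * rt_tolerance ** 2))

    return cosine * rt_weight


# Toy peaks: the same spectrum at nearby and distant retention times,
# and an unrelated spectrum at the same retention time.
spectrum = {73: 100.0, 147: 40.0, 217: 15.0}
other = {57: 80.0, 71: 60.0}
same_peak = peak_similarity(500.0, spectrum, 500.0, spectrum)
drifted = peak_similarity(500.0, spectrum, 502.0, spectrum)
far_apart = peak_similarity(500.0, spectrum, 530.0, spectrum)
unrelated = peak_similarity(500.0, spectrum, 500.0, other)
```

With a multiplicative form like this, two peaks score highly only when both their spectra and their retention times agree, so a small `rt_tolerance` makes the score stricter about drift while a large one lets the mass spectrum dominate.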