Wang Minkun, Tsai Tsung-Heng, Di Poto Cristina, Ferrarini Alessia, Yu Guoqiang, Ressom Habtom W
Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA.
Department of Electrical and Computer Engineering, Virginia Tech, 900 N Glebe Rd, Arlington, VA, USA.
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):545. doi: 10.1186/s12864-016-2796-x.
A fundamental challenge in the quantitation of biomolecules for cancer biomarker discovery stems from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry-based proteomic and metabolomic studies. Purification of mass spectrometric data is highly desirable prior to subsequent analysis, e.g., quantitative comparison of the abundance of biomolecules in biological samples.
We investigated topic models to computationally analyze mass spectrometric data considering both integrated peak intensities and scan-level features, i.e., extracted ion chromatograms (EICs). Probabilistic generative models enable flexible representation of the data structure and inference of sample-specific pure sources. Scan-level modeling helps alleviate information loss during data preprocessing. We evaluated the capability of the proposed models to capture mixture proportions of contaminants and cancer profiles on LC-MS-based serum proteomic and GC-MS-based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis, as well as on synthetic data we generated based on the serum proteomic data.
Analysis of the synthetic data demonstrated that both intensity-level and scan-level purification models can accurately infer the mixture proportions and the underlying true cancerous sources, with small average error ratios (<7%) between estimation and ground truth. By applying the topic model-based purification to mass spectrometric data, we found more proteins and metabolites with significant changes between HCC cases and cirrhotic controls. Candidate biomarkers selected after purification yielded biologically meaningful pathway analysis results and improved disease discrimination power, in terms of the area under the ROC curve, compared to the results found prior to purification.
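The idea of checking an inferred mixture proportion against a known ground truth on synthetic data can be illustrated with a minimal sketch. This is not the topic model described here: it is a much simpler stand-in that assumes a two-source linear mixture with both pure profiles already known, and estimates the proportion by least squares. The function names (`mix`, `estimate_proportion`, `error_ratio`), the 2% noise level, and the definition of the error ratio as relative absolute error are all assumptions made for illustration.

```python
import random


def mix(pure_a, pure_b, proportion):
    """Linear mixture: `proportion` of source A plus the remainder of source B."""
    return [proportion * a + (1.0 - proportion) * b for a, b in zip(pure_a, pure_b)]


def estimate_proportion(observed, pure_a, pure_b):
    """Least-squares estimate of the proportion of source A, clipped to [0, 1].

    Assumes both pure profiles are known, which the abstract's models do not.
    """
    # observed - b = p * (a - b)  =>  p = <observed - b, a - b> / <a - b, a - b>
    diff = [a - b for a, b in zip(pure_a, pure_b)]
    resid = [o - b for o, b in zip(observed, pure_b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        raise ValueError("source profiles are identical; proportion is not identifiable")
    p = sum(r * d for r, d in zip(resid, diff)) / denom
    return min(1.0, max(0.0, p))


def error_ratio(estimate, truth):
    """Relative absolute error (an assumed definition, for illustration only)."""
    return abs(estimate - truth) / abs(truth)


# Hypothetical synthetic data: 50 peak intensities per source profile.
rng = random.Random(0)
cancer = [rng.uniform(1.0, 10.0) for _ in range(50)]
contaminant = [rng.uniform(1.0, 10.0) for _ in range(50)]

true_p = 0.7
# Add 2% multiplicative Gaussian noise to the mixed intensities.
observed = [x * (1.0 + rng.gauss(0.0, 0.02)) for x in mix(cancer, contaminant, true_p)]

est_p = estimate_proportion(observed, cancer, contaminant)
print(f"estimated proportion: {est_p:.3f}, error ratio: {error_ratio(est_p, true_p):.3%}")
```

In this toy setting the proportion is recoverable because the source profiles are given; the harder problem addressed by the topic models is inferring the proportions and the pure sources jointly.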
We investigated topic model-based inference methods to computationally address the heterogeneity issue in samples analyzed by LC/GC-MS. We observed that incorporating scan-level features has the potential to yield more accurate purification results by alleviating the information loss that results from integrating peaks. We believe cancer biomarker discovery studies that use mass spectrometric analysis of human biospecimens can greatly benefit from topic model-based purification of the data prior to statistical and pathway analyses.