Abram Krzysztof Jan, McCloskey Douglas
Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Lyngby, Denmark.
Johnson & Johnson MedTech, Bregnerodvej 133, 3460 Birkerod, Denmark.
Biomolecules. 2023 Sep 4;13(9):1343. doi: 10.3390/biom13091343.
Generative modeling and representation learning of tandem mass spectrometry data aim to learn an interpretable and instrument-agnostic digital representation of metabolites directly from MS/MS spectra. Such representations would facilitate comparison of MS/MS spectra across instrument vendors and enable more accurate queries of large MS/MS spectral databases for metabolite identification. In this study, we apply generative modeling and representation learning with variational autoencoders to understand the extent to which tandem mass spectra can be disentangled into their factors of generation (e.g., collision energy, ionization mode, and instrument type) with minimal prior knowledge of the factors. We find that, with the proper choice of hyperparameters, variational autoencoders can disentangle tandem mass spectra data into meaningful latent representations aligned with known factors of variation. We also develop a two-step approach to facilitate the selection of disentangled models, which could be applied to other complex and high-dimensional data sets.
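As an illustration of the kind of hyperparameter that governs disentanglement in a variational autoencoder, the following is a minimal sketch of a KL-weighted (beta-VAE-style) training objective. It is not the authors' implementation: the abstract does not state the exact objective, so the weight `beta`, the squared-error reconstruction term, and the function names `gaussian_kl` and `beta_vae_loss` are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I).

    Summed over the latent dimensions (last axis); one value per sample.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """Batch mean of reconstruction error plus beta-weighted KL term.

    x, x_recon : arrays of shape (batch, n_bins), e.g. binned MS/MS intensities
    mu, log_var: encoder outputs of shape (batch, n_latent)
    beta       : hypothetical weight; larger values push the latent code
                 toward the factorized prior (assumed, not from the paper)
    """
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # squared error per sample
    kl = gaussian_kl(mu, log_var)
    return float(np.mean(recon + beta * kl))

# Toy usage with random stand-ins for spectra and encoder outputs.
rng = np.random.default_rng(0)
x = rng.random((8, 100))           # 8 fake spectra, 100 m/z bins
x_recon = x + 0.01 * rng.standard_normal(x.shape)
mu = 0.1 * rng.standard_normal((8, 10))
log_var = np.zeros((8, 10))
loss = beta_vae_loss(x, x_recon, mu, log_var, beta=4.0)
```

In a sketch like this, sweeping `beta` (together with the latent dimensionality) is one way to trade reconstruction quality against how strongly the latent dimensions are pushed toward independence, which is the sort of hyperparameter choice the abstract alludes to.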