Kamran Jawad, Hniopek Julian, Bocklitz Thomas
Institute of Physical Chemistry, Friedrich Schiller University Jena, Helmholtzweg 4, 07743 Jena, Germany.
Department of Photonic Data Science, Leibniz Institute of Photonic Technology, Albert-Einstein-Straße 9, 07745 Jena, Germany.
J Chem Inf Model. 2025 Jul 14;65(13):6632-6643. doi: 10.1021/acs.jcim.5c00513. Epub 2025 Jun 18.
Biophotonic technologies such as Raman spectroscopy are powerful tools for obtaining highly specific molecular information. Because it requires minimal sample preparation, Raman spectroscopy is widely used across scientific disciplines, often in combination with chemometrics, machine learning (ML), and deep learning (DL). However, the field lacks large databases of independent Raman spectra for model training, which leads to overfitting, overestimated model performance, and limited generalizability. We address this problem by generating simulated vibrational spectra with semiempirical quantum chemistry methods, which enables efficient pretraining of deep learning models on large synthetic data sets. The pretrained models are then fine-tuned on a smaller experimental Raman data set of bacterial spectra. In this real biophotonic application, transfer learning significantly reduces the computational cost while maintaining performance comparable to that of models trained from scratch. The results validate the utility of synthetic data for pretraining deep Raman models and offer a scalable framework for spectral analysis in resource-limited settings.
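As an illustration of the simulation step, a computed vibrational calculation typically yields a list of mode wavenumbers and intensities (a "stick spectrum"), which can be broadened into a continuous spectrum before it is used as training input. The sketch below assumes Lorentzian band shapes with a fixed width; the function names, the 10 cm^-1 width, the wavenumber grid, and the three example modes are hypothetical and are not taken from the article, which does not state these details in the abstract.

```python
import math


def lorentzian_spectrum(modes, axis, fwhm=10.0):
    """Broaden a stick spectrum into a continuous spectrum.

    modes: list of (wavenumber_cm1, intensity) pairs, e.g. from a
           vibrational frequency calculation (illustrative values only).
    axis:  wavenumber grid in cm^-1.
    fwhm:  full width at half maximum of each band in cm^-1 (assumed).
    """
    hwhm = fwhm / 2.0
    spectrum = []
    for x in axis:
        y = 0.0
        for center, intensity in modes:
            # Lorentzian with unit peak height, scaled by the mode intensity
            y += intensity * hwhm ** 2 / ((x - center) ** 2 + hwhm ** 2)
        spectrum.append(y)
    return spectrum


def normalize_max(spectrum):
    """Scale a spectrum so that its largest value equals 1."""
    peak = max(spectrum)
    return [y / peak for y in spectrum] if peak > 0 else list(spectrum)


# Hypothetical stick spectrum: (wavenumber in cm^-1, relative intensity)
modes = [(520.0, 1.0), (1001.0, 0.6), (1450.0, 0.3)]
axis = [float(w) for w in range(400, 1801)]  # 400 to 1800 cm^-1, 1 cm^-1 steps
spectrum = normalize_max(lorentzian_spectrum(modes, axis))
```

Each simulated molecule would contribute one such vector to a synthetic pretraining set. A Gaussian or Voigt profile, added noise, or a baseline term could be substituted to bring the synthetic spectra closer to measured ones.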