Ameta Durgesh, Kumar Surendra, Mishra Rishav, Behera Laxmidhar, Chakraborty Aniruddha, Sandhan Tushar
Indian Knowledge System and Mental Health Applications Centre, Indian Institute of Technology, Mandi, India.
Indian Knowledge System Centre, ISS, Delhi, India.
PLoS One. 2025 May 28;20(5):e0322514. doi: 10.1371/journal.pone.0322514. eCollection 2025.
This research examines olfaction, a sensory modality that remains complex and poorly understood. We aim to fill two gaps in recent studies that have used machine learning and deep learning approaches to predict human smell perception. First, molecules are usually represented with molecular fingerprints, mass spectra, or vibrational spectra, yet the influence of the chosen representation on predictive performance has rarely been documented in direct comparative studies. To fill this gap, we assembled a large, novel dataset of 2606 molecules with three kinds of features: mass spectra (MS), vibrational spectra (VS), and molecular fingerprints (FP). We evaluated their performance using four different multi-label classification models. Second, we address an inherent challenge in multi-label datasets (MLD) for odor classification, namely class imbalance, using random resampling techniques and an explainable, cost-sensitive multilayer perceptron model (CSMLP). Experimental results suggest that molecular fingerprint-based features perform significantly better than mass and vibrational spectra on the micro-averaged F1 metric. The proposed resampling techniques and cost-sensitive model outperform the results of previous studies. We also report the predictive performance of multimodal features obtained by fusing the three feature types. This comprehensive and systematic study compares the predictive performance of different features for odor classification and uses a multifaceted approach to deal with data imbalance. Our explainable model sheds further light on the relations between features and odors. The results have the potential to guide the development of the electronic nose, and our dataset will be made publicly available.
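The abstract compares feature types using the micro-averaged F1 metric. As a minimal sketch of how that metric is commonly computed for multi-label predictions (not the authors' evaluation code), the example below pools true positives, false positives, and false negatives over every molecule and every odor label before computing a single F1 score. The function name `micro_f1` and the toy label matrices are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 for binary multi-label indicator matrices.

    Rows are samples (molecules); columns are labels (odor descriptors).
    Counts are pooled across all labels, so frequent labels carry more weight.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label present and predicted
            elif t == 0 and p == 1:
                fp += 1  # label absent but predicted
            elif t == 1 and p == 0:
                fn += 1  # label present but missed
    denom = 2 * tp + fp + fn
    # Return 0.0 when there are no positives at all (a convention, assumed here).
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 molecules, 3 odor labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
score = micro_f1(y_true, y_pred)
print(f"micro-F1 = {score:.4f}")
```

Because the counts are pooled before averaging, micro-F1 is dominated by frequent labels, which is one reason class imbalance matters when interpreting it on multi-label odor datasets.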