Department of Electrical and Computer Engineering, University of Denver, Denver, CO, USA.
Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA.
Comput Biol Med. 2024 Nov;182:109199. doi: 10.1016/j.compbiomed.2024.109199. Epub 2024 Sep 26.
Mild Cognitive Impairment (MCI) is an early stage of memory or other cognitive decline in individuals who retain the ability to independently perform most activities of daily living. It is considered a transitional stage between normal cognition and more severe cognitive decline such as dementia or Alzheimer's disease. According to reports from the National Institute on Aging (NIA), people with MCI are at greater risk of developing dementia, so detecting MCI as early as possible is critical to mitigating its progression to Alzheimer's disease and dementia. Recent studies have harnessed Artificial Intelligence (AI) to develop automated methods for predicting and detecting MCI. The majority of existing research is based on unimodal data (e.g., only speech or prosody), but recent studies have shown that multimodal approaches lead to more accurate MCI prediction. However, effectively exploiting different modalities remains a major challenge due to the lack of efficient fusion methods. This study proposes a robust fusion architecture that performs embedding-level fusion via a co-attention mechanism to leverage multimodal data for MCI prediction. This approach addresses the limitations of early and late fusion methods, which often fail to preserve inter-modal relationships. Our embedding-level fusion aims to capture complementary information across modalities, enhancing predictive accuracy. We used the I-CONECT dataset, which contains a large number of recorded semi-structured internet/webcam conversations between interviewers and participants aged 75 years and older. We introduce a multimodal speech-language-vision Deep Learning-based method to differentiate MCI from Normal Cognition (NC). Our proposed architecture includes co-attention blocks that fuse the three modalities at the embedding level, capturing potential interactions between speech (audio), language (transcribed speech), and vision (facial videos) within the cross-Transformer layer. Experimental results demonstrate that our fusion method achieves an average AUC of 85.3% in detecting MCI versus NC, significantly outperforming unimodal (60.9%) and bimodal (76.3%) baseline models. This superior performance highlights the effectiveness of our model in capturing and utilizing complementary information from multiple modalities, offering a more accurate and reliable approach to MCI prediction.
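The following is a minimal sketch of what embedding-level co-attention fusion across three modalities might look like, loosely following the cross-Transformer idea described above. The abstract does not specify layer sizes, depths, encoders, or training details, so every dimension, module name, and design choice here (PyTorch, 256-d embeddings, four attention heads, pairwise cross-attention, mean pooling) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch of embedding-level co-attention fusion for three
# modalities (speech, language, vision). All dimensions, names, and the
# fusion topology are assumptions; the paper's architecture may differ.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Cross-attention: queries from one modality, keys/values from another."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        attended, _ = self.attn(query_tokens, context_tokens, context_tokens)
        x = self.norm1(query_tokens + attended)   # residual + norm
        return self.norm2(x + self.ff(x))         # feed-forward sublayer

class TriModalFusion(nn.Module):
    """Fuse speech/language/vision token embeddings, classify MCI vs. NC."""
    def __init__(self, dim: int = 256):
        super().__init__()
        modalities = ("speech", "lang", "vision")
        # One co-attention block per directed modality pair (6 blocks).
        self.blocks = nn.ModuleDict({
            f"{q}_{c}": CoAttentionBlock(dim)
            for q in modalities for c in modalities if q != c
        })
        self.classifier = nn.Linear(3 * dim, 2)   # MCI vs. NC logits

    def forward(self, feats):  # feats: dict of (batch, seq, dim) tensors
        pooled = []
        for q in ("speech", "lang", "vision"):
            # Each modality attends to the other two; average the two
            # cross-attended views, then mean-pool over the time axis.
            views = [self.blocks[f"{q}_{c}"](feats[q], feats[c])
                     for c in ("speech", "lang", "vision") if c != q]
            pooled.append(torch.stack(views).mean(0).mean(dim=1))
        return self.classifier(torch.cat(pooled, dim=-1))

# Usage with random stand-ins for unimodal embeddings (in practice these
# would come from pretrained speech, language, and vision encoders):
feats = {m: torch.randn(2, 50, 256) for m in ("speech", "lang", "vision")}
logits = TriModalFusion()(feats)   # shape: (2, 2)
```

In this sketch each modality queries the other two and the cross-attended views are averaged before pooling; the actual cross-Transformer layer in the paper may use a different pairing, depth, or pooling scheme.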