Golovanevsky Michal, Schiller Eva, Nair Akira, Han Eric, Singh Ritambhara, Eickhoff Carsten
Department of Computer Science, Brown University, Providence, RI 02912, USA.
Pac Symp Biocomput. 2025;30:580-593. doi: 10.1142/9789819807024_0041.
Multimodal models have become increasingly important as they surpass single-modality approaches on diverse tasks ranging from question answering to disease diagnosis. Despite the importance of multimodal learning, existing efforts focus on vision-language applications, where the number of modalities rarely exceeds four (images, text, audio, video). However, data in the healthcare domain may include many more modalities, such as X-rays, PET scans, MRIs, genetic screening, genomic data, and clinical notes, creating a need for both efficient and accurate data integration. Many state-of-the-art multimodal models rely on cross-attention or self-attention for effective data integration, and these mechanisms do not scale well for applications with more than two modalities. The per-layer complexity of computing attention in either paradigm is, at best, quadratic with respect to the number of modalities, posing a computational bottleneck that impedes broad adoption. To address this, we propose a new attention mechanism, One-Versus-Others (OvO) attention, that scales linearly with the number of modalities, thus offering a significant reduction in computational complexity compared to existing multimodal attention methods. Using three clinical datasets with multiple diverse modalities, we show that our method decreases computation costs while maintaining or increasing performance compared to popular integration techniques. Across all clinical datasets, OvO reduced the number of required floating-point operations (FLOPs) by at least 91.98%, demonstrating its significant impact on efficiency and enabling multimodal predictions in healthcare.
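The abstract does not spell out the OvO formulation, so the following is only a hedged sketch of how a one-versus-others integration step can scale linearly with the number of modalities: each modality attends once to an aggregate (here, the mean) of all other modalities, rather than to every other modality pairwise. The class name OvOAttention, the shared weight matrix w, and the sigmoid gating are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class OvOAttention(nn.Module):
    """Sketch of one-versus-others integration: each modality embedding
    attends to the mean of the remaining modalities, so the number of
    attention computations grows linearly in the number of modalities M,
    instead of quadratically as in pairwise cross-attention."""

    def __init__(self, dim: int):
        super().__init__()
        # One learnable weight matrix shared across modalities
        # (an assumption of this sketch).
        self.w = nn.Linear(dim, dim, bias=False)

    def forward(self, modality_embs: list[torch.Tensor]) -> list[torch.Tensor]:
        # modality_embs: list of (batch, dim) embeddings, one per modality.
        m = len(modality_embs)
        total = torch.stack(modality_embs, dim=0).sum(dim=0)   # (batch, dim)
        fused = []
        for emb in modality_embs:
            others_mean = (total - emb) / (m - 1)               # mean of the others
            # Scalar attention score between this modality and the aggregate.
            score = torch.sigmoid((self.w(others_mean) * emb).sum(-1, keepdim=True))
            fused.append(score * emb)                           # gated representation
        return fused

In a downstream model, the gated per-modality outputs would typically be concatenated or pooled and passed to a classification head; the key point of the sketch is that the loop performs M attention computations rather than the M(M-1) pairwise computations required by cross-attention.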