Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA.
Nat Biomed Eng. 2024 Jan;8(1):57-67. doi: 10.1038/s41551-023-01120-3. Epub 2023 Nov 2.
Large-scale genomic data are well suited to analysis by deep learning algorithms. However, for many genomic datasets, labels are at the level of the sample rather than for individual genomic measures. Machine learning models leveraging these datasets generate predictions by using statically encoded measures that are then aggregated at the sample level. Here we show that a single weakly supervised end-to-end multiple-instance-learning model with multi-headed attention can be trained to encode and aggregate the local sequence context or genomic position of somatic mutations, hence allowing for the modelling of the importance of individual measures for sample-level classification and thus providing enhanced explainability. The model solves synthetic tasks that conventional models fail at, and achieves best-in-class performance for the classification of tumour type and for predicting microsatellite status. By improving the performance of tasks that require aggregate information from genomic datasets, multiple-instance deep learning may generate biological insight.
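The aggregation idea in the abstract can be illustrated with a minimal NumPy sketch of multi-headed attention pooling over a bag of instances, here the mutations of one sample. This is not the authors' architecture: the encoder, the softmax attention, the layer sizes, and the names `attention_mil_forward`, `init_params`, `W_enc`, `W_att`, and `W_out` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(n_features, d, heads, n_classes, seed=0):
    # Hypothetical random weights; a real model would learn these end to end
    # from sample-level labels only (weak supervision).
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(scale=0.1, size=(n_features, d)),
        "W_att": rng.normal(scale=0.1, size=(d, heads)),
        "W_out": rng.normal(scale=0.1, size=(heads * d, n_classes)),
    }


def attention_mil_forward(instances, params):
    """Classify one sample (bag) from its instances.

    instances: array of shape (n_instances, n_features), e.g. one row per
    somatic mutation. Returns (class_probabilities, attention_weights).
    """
    # Encode each instance independently.
    h = np.tanh(instances @ params["W_enc"])        # (n_instances, d)
    # One attention score per instance and per head.
    scores = h @ params["W_att"]                    # (n_instances, heads)
    # Normalise over instances so each head's weights sum to 1.
    attn = softmax(scores, axis=0)                  # (n_instances, heads)
    # Attention-weighted sum of instance encodings, one vector per head.
    pooled = attn.T @ h                             # (heads, d)
    # Concatenate the heads and classify at the sample level.
    logits = pooled.reshape(-1) @ params["W_out"]   # (n_classes,)
    return softmax(logits, axis=0), attn


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    bag = rng.normal(size=(5, 8))  # 5 toy instances with 8 features each
    params = init_params(n_features=8, d=4, heads=2, n_classes=3)
    probs, attn = attention_mil_forward(bag, params)
    print(probs)
    print(attn)
```

The returned `attn` matrix is what makes this kind of model inspectable: each column gives one head's weighting of the individual instances for the sample-level prediction. Because pooling is a weighted sum, the output does not depend on the order of the instances.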