Centre for Bioinformatics, Department of Informatics, University of Oslo, 0373 Oslo, Norway.
UiORealArt Convergence Environment, University of Oslo, 0373 Oslo, Norway.
Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad074. Epub 2023 Oct 17.
Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmark datasets produce naive repertoires that lack a key feature of antigen-experienced repertoires: the many shared receptor sequences selected for common antigens.
We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which certain ML methods may exploit for undesired shortcut learning. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used to construct AIRR-level benchmarks based on a range of assumptions (or experimental data sources) about what constitutes receptor-level immune signals. Benchmarks can be built with or without prior assumptions about the similarity or commonality of the immune state-associated sequences used as true signals. We demonstrate the realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets.
This study sheds light on the shortcut learning opportunities for ML methods that can arise from the current state-of-the-art approach to simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.