simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods.

Affiliations

Centre for Bioinformatics, Department of Informatics, University of Oslo, 0373 Oslo, Norway.

UiORealArt Convergence Environment, University of Oslo, 0373 Oslo, Norway.

Publication information

Gigascience. 2022 Dec 28;12. doi: 10.1093/gigascience/giad074. Epub 2023 Oct 17.

DOI:10.1093/gigascience/giad074
PMID:37848619
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10580376/
Abstract

BACKGROUND

Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing the key feature of many shared receptor sequences (selected for common antigens) found in antigen-experienced repertoires.

RESULTS

We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the possibility of making or not making any prior assumptions regarding the similarity or commonality of immune state-associated sequences that will be used as true signals. We demonstrate the real-world realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets.
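
The shortcut described above can be illustrated with a toy simulation. The sketch below is not the simAIRR algorithm or its API; every function name and parameter (`implant_rate`, `public_pool_size`, `public_rate`) is a hypothetical choice made for illustration. It assumes uniformly random amino acid strings stand in for receptor sequences. When implanted signal sequences are the only sequences shared between repertoires, a label-free "is this sequence shared?" heuristic recovers the signal; adding label-independent background sharing should remove that shortcut.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(rng, length=12):
    """Draw one hypothetical CDR3-like amino acid string uniformly at random."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def simulate_dataset(rng, n_repertoires=40, repertoire_size=200, n_signal=5,
                     implant_rate=0.5, public_pool_size=0, public_rate=0.3):
    """Return (repertoires, labels, signal_sequences) for a toy dataset.

    The first half of the repertoires is labelled positive, and each positive
    repertoire receives every signal sequence with probability implant_rate.
    If public_pool_size > 0, every repertoire (whatever its label) also
    receives each sequence of a common pool with probability public_rate,
    which creates background sharing (an assumption of this sketch).
    """
    signal = [random_sequence(rng) for _ in range(n_signal)]
    public_pool = [random_sequence(rng) for _ in range(public_pool_size)]
    repertoires, labels = [], []
    for i in range(n_repertoires):
        label = i < n_repertoires // 2
        # Private sequences: practically never shared between repertoires.
        rep = {random_sequence(rng) for _ in range(repertoire_size)}
        rep.update(s for s in public_pool if rng.random() < public_rate)
        if label:
            rep.update(s for s in signal if rng.random() < implant_rate)
        repertoires.append(rep)
        labels.append(label)
    return repertoires, labels, set(signal)


def shared_sequences(repertoires, min_repertoires=2):
    """Sequences observed in at least min_repertoires repertoires."""
    counts = Counter(seq for rep in repertoires for seq in rep)
    return {seq for seq, n in counts.items() if n >= min_repertoires}


def shortcut_precision(repertoires, signal):
    """Fraction of shared sequences that are true signal sequences."""
    shared = shared_sequences(repertoires)
    return len(shared & signal) / len(shared) if shared else 0.0


rng = random.Random(0)
# Naive-like data: no background sharing, so sharing alone marks the signal.
naive, _, naive_signal = simulate_dataset(rng)
# Data with background sharing: shared sequences are mostly not signal.
realistic, _, realistic_signal = simulate_dataset(rng, public_pool_size=500)

naive_precision = shortcut_precision(naive, naive_signal)
realistic_precision = shortcut_precision(realistic, realistic_signal)
print(f"naive: {naive_precision:.3f}, with background sharing: {realistic_precision:.3f}")
```

The heuristic never looks at the labels, which is what makes it a shortcut: in the first dataset nearly every shared sequence is a signal sequence, whereas in the second the signal is hidden among a few hundred shared background sequences.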

CONCLUSIONS

This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd2a/10580376/a9ae8ebf0802/giad074fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd2a/10580376/9108b1820230/giad074fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd2a/10580376/8b984df4ed83/giad074fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/cd2a/10580376/69bf81343938/giad074fig4.jpg
