Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification.

Affiliations

Centre for Bioinformatics, Department of Informatics, University of Oslo, Oslo 0373, Norway.

Department of Pathology, Immunology and Laboratory Medicine, University of Florida, FL 32610, USA.

Publication information

Gigascience. 2022 May 25;11. doi: 10.1093/gigascience/giac046.

Abstract

BACKGROUND

Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, no systematic evaluation yet exists of the scenarios in which classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This gap makes it harder to redirect research toward the scenarios where more sophisticated ML approaches may be required.

RESULTS

To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture-associated and immune state-associated sequence patterns (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences.
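To make the setup in the Results concrete, the following is a minimal toy sketch of the general idea: implant a short motif into a small fraction of sequences in "positive" repertoires, encode each repertoire by the k-mers it contains, and fit an L1-penalized logistic regression. It is not the authors' pipeline. The motif `QWEN`, the presence/absence 4-mer encoding, the repertoire sizes, the proximal-gradient solver, and all function names are assumptions chosen for illustration, and the signal frequency here (1 in 50 sequences) is far higher than the 1 in 50,000 reported above.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SIGNAL, K = "QWEN", 4  # hypothetical motif and k-mer length (assumptions)

def encode(repertoire, k):
    """Presence/absence bag of k-mers over all sequences in one repertoire."""
    feats = set()
    for seq in repertoire:
        feats |= {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return feats

def make_repertoire(rng, n_seqs=50, seq_len=12, signal=None, n_signal=0):
    """Uniform random sequences; implant `signal` into the first `n_signal`."""
    rep = ["".join(rng.choice(AMINO_ACIDS) for _ in range(seq_len))
           for _ in range(n_seqs)]
    for i in range(n_signal):
        pos = rng.randrange(seq_len - len(signal) + 1)
        rep[i] = rep[i][:pos] + signal + rep[i][pos + len(signal):]
    return rep

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)

def fit_l1_logreg(X, y, lam=0.05, lr=0.5, epochs=100):
    """Proximal gradient descent on sparse binary features.
    The intercept b is left unpenalized; weights are stored sparsely."""
    w, b, n = {}, 0.0, len(X)
    for _ in range(epochs):
        grad, grad_b = {}, 0.0
        for feats, label in zip(X, y):
            z = b + sum(w.get(f, 0.0) for f in feats)
            err = 1.0 / (1.0 + math.exp(-z)) - label  # predicted prob - label
            grad_b += err / n
            for f in feats:
                grad[f] = grad.get(f, 0.0) + err / n
        b -= lr * grad_b
        for f, g in grad.items():
            new = soft_threshold(w.get(f, 0.0) - lr * g, lr * lam)
            if new != 0.0:
                w[f] = new
            else:
                w.pop(f, None)  # keep only nonzero weights
    return w, b

def predict(w, b, feats):
    return 1 if b + sum(w.get(f, 0.0) for f in feats) > 0 else 0

def make_dataset(rng, n_per_class):
    """Balanced data set: positives carry the motif in 1 of 50 sequences."""
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            rep = make_repertoire(rng, signal=SIGNAL, n_signal=label)
            X.append(encode(rep, K))
            y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_dataset(rng, 20)
X_test, y_test = make_dataset(rng, 10)

weights, bias = fit_l1_logreg(X_train, y_train)
accuracy = sum(predict(weights, bias, x) == t
               for x, t in zip(X_test, y_test)) / len(y_test)
print("nonzero weights:", len(weights), "| test accuracy:", accuracy)
```

In this toy setting the L1 penalty drives the weights of k-mers that occur in only a few repertoires to exactly zero, so the classifier ends up relying on the implanted motif. Real repertoires have far lower signal frequencies and non-uniform sequence backgrounds, so the sketch says nothing about the performance figures reported in the abstract.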

CONCLUSIONS

We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/515e/9154052/0743abd61e88/giac046fig1.jpg
