Comparative Study of Repertoire Classification Methods Reveals Data Efficiency of k-mer Feature Extraction.

Author information

Graduate School of Engineering, The University of Tokyo, Tokyo, Japan.

Institute of Industrial Science, The University of Tokyo, Tokyo, Japan.

Publication information

Front Immunol. 2022 Jul 20;13:797640. doi: 10.3389/fimmu.2022.797640. eCollection 2022.

DOI:10.3389/fimmu.2022.797640

PMID:35936014

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9346074/

Abstract

The repertoire of T cell receptors encodes various types of immunological information. Machine learning is indispensable for decoding such information from repertoire datasets measured by next-generation sequencing (NGS). In particular, the classification of repertoires is the most basic task, which is relevant for a variety of scientific and clinical problems. Supported by the recent appearance of large datasets, efficient but data-expensive methods have been proposed. However, it is unclear whether they can work efficiently when the available sample size is severely restricted, as in practical situations. In this study, we demonstrate that their performance can be impaired substantially below critical sample sizes. To complement this drawback, we propose MotifBoost, which exploits the information of short k-mer motifs of TCRs. MotifBoost can perform the classification as efficiently as a deep learning method on large datasets while providing more stable and reliable results on small datasets. We tested MotifBoost on four small datasets that cover various conditions, such as Cytomegalovirus (CMV), HIV, α-chain, and β-chain, and it consistently preserved its stability. We also clarify that the robustness of MotifBoost can be attributed to the efficiency of k-mer motifs as representation features of repertoires. Finally, by comparing the predictions of these methods, we show that the whole sequence identity and sequence motifs encode partially different information and that a combination of such complementary information is necessary for further development of repertoire analysis.
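As a rough illustration of what representing a repertoire by short k-mer motifs can look like, the sketch below pools all length-k substrings of the receptor sequences in one repertoire into a normalized frequency vector. It is not taken from the paper: the function name `kmer_frequencies`, the default `k = 3`, and the toy sequences are assumptions made for this example, and the classifier that MotifBoost trains on such features is not shown.

```python
from collections import Counter
from typing import Dict, Iterable


def kmer_frequencies(sequences: Iterable[str], k: int = 3) -> Dict[str, float]:
    """Return normalized k-mer frequencies pooled over one repertoire.

    Hypothetical helper for illustration; k = 3 is an assumed default.
    """
    counts: Counter = Counter()
    for seq in sequences:
        # Slide a window of width k along the amino-acid sequence.
        # Sequences shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so the features do not depend on sequencing depth.
    return {kmer: n / total for kmer, n in counts.items()}


# Toy repertoire of two made-up amino-acid sequences (illustrative only).
toy_repertoire = ["CASSLG", "CASSPG"]
features = kmer_frequencies(toy_repertoire, k=3)
```

Each repertoire then becomes one fixed-vocabulary feature vector, regardless of how many sequences it contains, which is one plausible reason such features remain usable when only a few labeled repertoires are available.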

