Statistical Analysis and Tokenization of Epitopes to Construct Artificial Neoepitope Libraries.

Affiliations

Centre for Cooperative Research in Biomaterials (CIC biomaGUNE), Basque Research and Technology Alliance (BRTA), Paseo de Miramón 194, Donostia-San Sebastián, 20014 Spain.

Molecular Biology Institute of Barcelona (IBMB-CSIC), Barcelona Science Park, Baldiri Reixac, 15-21, 08028, Barcelona, Spain.

Publication information

ACS Synth Biol. 2023 Oct 20;12(10):2812-2818. doi: 10.1021/acssynbio.3c00201. Epub 2023 Sep 13.

Abstract

Epitopes are specific regions on an antigen's surface that the immune system recognizes. Epitopes are usually protein regions on foreign immune-stimulating entities such as viruses and bacteria, and in some cases, endogenous proteins may act as antigens. Identifying epitopes is crucial for accelerating the development of vaccines and immunotherapies. However, mapping epitopes in pathogen proteomes is challenging with conventional methods. Screening artificial neoepitope libraries against antibodies can overcome this issue. Here, we applied conventional sequence analysis and methods inspired by natural language processing to reveal specific sequence patterns in the linear epitopes deposited in the Immune Epitope Database (www.iedb.org) that can serve as building blocks for the design of universal epitope libraries. Our results reveal that amino acid frequency in annotated linear epitopes differs from that in the human proteome. Aromatic residues are overrepresented, while cysteines are practically absent from epitopes. Byte pair encoding tokenization shows high frequencies of tryptophan in tokens of 5, 6, and 7 amino acids, corroborating the findings of the conventional sequence analysis. These results can be applied to reduce the diversity of linear epitope libraries by orders of magnitude.
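The two analyses named in the abstract, residue frequency counting and byte pair encoding (BPE) tokenization, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `residue_frequencies` and `learn_bpe_merges` are hypothetical, the three toy peptides are invented for illustration and are not IEDB entries, and the tie-breaking rule for equally frequent pairs is an arbitrary choice made here to keep the output deterministic.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def residue_frequencies(sequences):
    """Relative frequency of each standard amino acid across all sequences."""
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def learn_bpe_merges(sequences, num_merges):
    """Learn BPE merges by repeatedly fusing the most frequent adjacent token pair.

    Returns the ordered list of merged pairs and the tokenized corpus.
    """
    corpus = [list(seq) for seq in sequences]  # start from single residues
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence has collapsed to a single token
        # Assumption: ties are broken by the lexicographically largest pair.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        fused = best[0] + best[1]
        new_corpus = []
        for tokens in corpus:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(fused)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus


# Invented toy peptides, used only to exercise the functions.
toy_epitopes = ["WYFAW", "AWYFK", "KWYFA"]

freqs = residue_frequencies(toy_epitopes)
merges, tokenized = learn_bpe_merges(toy_epitopes, num_merges=2)
```

Comparing `freqs` against frequencies computed the same way on a background set (for example, a proteome supplied by the reader) would give the kind of over- and under-representation contrast the abstract describes, and counting residues inside the longer learned tokens would show which amino acids dominate recurrent motifs.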


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d30c/10594869/79ee28247fc4/sb3c00201_0001.jpg
