Center for Innovations in Medicine, Biodesign Institute, Arizona State University, Tempe, Arizona, United States of America.
School of Molecular Sciences, Arizona State University, Tempe, Arizona, United States of America.
PLoS Comput Biol. 2023 Jun 20;19(6):e1010773. doi: 10.1371/journal.pcbi.1010773. eCollection 2023 Jun.
Past studies have shown that incubation of human serum samples on high-density peptide arrays, followed by measurement of total antibody bound to each peptide sequence, allows detection and discrimination of humoral immune responses to a variety of infectious diseases. This is true even though these arrays consist of peptides with near-random amino acid sequences that were not designed to mimic biological antigens. This "immunosignature" approach is based on a statistical evaluation of the binding pattern for each sample, but it ignores the information contained in the amino acid sequences that the antibodies are binding to. Here, similar array-based antibody profiles are instead used to train a neural network to model the sequence dependence of molecular recognition involved in the immune response of each sample. The binding profiles used resulted from incubating serum from 5 infectious disease cohorts (Hepatitis B and C, Dengue Fever, West Nile Virus and Chagas disease) and an uninfected cohort with 122,926 peptide sequences on an array. These sequences were selected quasi-randomly to represent an even but sparse sample of the entire possible combinatorial sequence space (~10^12). This very sparse sampling of combinatorial sequence space was sufficient to capture a statistically accurate representation of the humoral immune response across the entire space. Processing array data using the neural network not only captures the disease-specific sequence-binding information but also aggregates binding information with respect to sequence, removing sequence-independent noise and improving the accuracy of array-based classification of disease compared with the raw binding data.
Because the neural network model is trained on all samples simultaneously, a highly condensed representation of the differential information between samples resides in the output layer of the model, and the column vectors from this layer can be used to represent each sample for classification or unsupervised clustering applications.
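The idea of training one model on all samples at once and then reading a per-sample vector off the output layer can be illustrated with a minimal sketch. This is not the authors' implementation: the 20-letter amino acid alphabet, the one-hot encoding, the single ReLU hidden layer, the layer sizes, the z-scoring of targets, and the synthetic binding data are all assumptions made for illustration, and the names `one_hot`, `train`, and `sample_vectors` are hypothetical.

```python
import numpy as np

# Assumption: the 20 standard amino acids; the array's actual alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(peptides, max_len):
    """Encode peptides as flattened one-hot vectors, zero-padded to max_len."""
    x = np.zeros((len(peptides), max_len, len(AMINO_ACIDS)))
    for row, pep in enumerate(peptides):
        for pos, aa in enumerate(pep):
            x[row, pos, AA_INDEX[aa]] = 1.0
    return x.reshape(len(peptides), -1)


def train(x, y, hidden=16, lr=0.1, epochs=500, seed=0):
    """Fit a hidden ReLU layer and a linear output layer by gradient descent on MSE.

    y has one column per serum sample, so every sample shares the hidden
    layer and differs only in its column of the output weight matrix w2.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.1, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(0.0, 0.1, (hidden, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    losses = []
    for _ in range(epochs):
        h_pre = x @ w1 + b1
        h = np.maximum(h_pre, 0.0)
        err = h @ w2 + b2 - y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared error through both layers.
        g_pred = 2.0 * err / err.size
        g_w2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_h = (g_pred @ w2.T) * (h_pre > 0)
        g_w1 = x.T @ g_h
        g_b1 = g_h.sum(axis=0)
        w1 -= lr * g_w1
        b1 -= lr * g_b1
        w2 -= lr * g_w2
        b2 -= lr * g_b2
    return w1, b1, w2, b2, losses


# Synthetic stand-in data (assumption): random peptides and made-up binding values.
rng = np.random.default_rng(1)
n_peptides, n_samples, max_len = 200, 6, 8
peptides = [
    "".join(rng.choice(list(AMINO_ACIDS), size=int(rng.integers(5, max_len + 1))))
    for _ in range(n_peptides)
]
x = one_hot(peptides, max_len)
true_w = rng.normal(size=(x.shape[1], n_samples))
y = x @ true_w + 0.1 * rng.normal(size=(n_peptides, n_samples))
y = (y - y.mean(axis=0)) / y.std(axis=0)  # z-score each sample's binding profile

w1, b1, w2, b2, losses = train(x, y)

# One row per sample: the transposed columns of the output-layer weights.
sample_vectors = w2.T
print(sample_vectors.shape, losses[0], losses[-1])
```

In this sketch each row of `sample_vectors` is a short vector for one sample, which could then be passed to a classifier or a clustering routine in place of the full set of per-peptide binding values.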