Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, USA.
Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, USA.
Med Biol Eng Comput. 2022 May;60(5):1279-1293. doi: 10.1007/s11517-022-02539-7. Epub 2022 Mar 18.
Computer-aided rational vaccine design (RVD) and synthetic pharmacology are rapidly developing fields that leverage existing datasets to develop compounds of interest. Computational proteomics uses algorithms and models to probe proteins for functional prediction. A potentially strong target for computational approaches is autoimmune antibodies, which result from broken tolerance in the immune system: unable to distinguish "self" from "non-self", it attacks its own structures (mainly proteins and DNA). Information on the structure, function, and pathogenicity of autoantibodies may assist in engineering RVD against autoimmune diseases. Current computational approaches exploit large datasets curated with extensive domain knowledge; most of them are resource-intensive and have been applied indirectly to problems of interest for DNA, RNA, and monomer protein binding. We present a novel method for discovering potential binding sites. We employed long short-term memory (LSTM) models trained on FASTA primary sequences to predict protein binding in DNA-binding hydrolytic antibodies (abzymes). For comparison, we also applied convolutional neural network (CNN) models to the same dataset. While the CNN model outperformed the LSTM on the primary task of binding prediction, analysis of the internal representations of both models showed that the LSTM models recovered sub-sequences that were strongly correlated with sites known to be involved in binding. These results demonstrate that analysis of the internal processes of LSTM models may serve as a powerful tool for primary sequence analysis.
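As an illustration of what training a sequence model on FASTA primary sequences involves at the input stage, the following minimal Python sketch parses FASTA text and converts each sequence into a fixed-length list of integer indices, the form an embedding layer in front of an LSTM or CNN typically expects. It is not the authors' pipeline: the function names, the index scheme (0 for padding, 21 for unknown residues), the fixed length of 12, and the toy sequences are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Assumed encoding scheme (not from the paper): the 20 standard amino acids
# map to 1..20, 0 is reserved for padding, 21 for unknown/ambiguous residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_INDEX = 0
UNKNOWN_INDEX = len(AMINO_ACIDS) + 1
RESIDUE_TO_INDEX: Dict[str, int] = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines may wrap; concatenate them in order.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def encode_sequence(sequence: str, max_length: int) -> List[int]:
    """Truncate to max_length, map residues to indices, right-pad with PAD_INDEX."""
    indices = [RESIDUE_TO_INDEX.get(res, UNKNOWN_INDEX) for res in sequence[:max_length]]
    return indices + [PAD_INDEX] * (max_length - len(indices))


# Toy, hypothetical records used only to exercise the functions.
example = (
    ">toy_heavy_chain hypothetical\n"
    "EVQLVES\n"
    "GGX\n"
    ">toy_light_chain hypothetical\n"
    "DIQM\n"
)
records = parse_fasta(example)
encoded = [encode_sequence(seq, 12) for _, seq in records]
```

Each encoded list has the same length, so a batch can be stacked into a single integer array before being passed to a model. Right-padding with a reserved index is one common choice; it lets a downstream model mask padded positions.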