Agüero-Chapin Guillermin, Pérez-Machado Gisselle, Sánchez-Rodríguez Aminael, Santos Miguel Machado, Antunes Agostinho
CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, Porto, 4050-123, Portugal.
Centro de Bioactivos Químicos, Universidad Central "Marta Abreu" de Las Villas (UCLV), Santa Clara, 54830, Cuba.
Methods Mol Biol. 2016;1401:253-72. doi: 10.1007/978-1-4939-3375-4_16.
Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allows the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because of the high sequence diversity of A-domains (~10-40% amino acid identity) and their selectivity for 50 different natural/unnatural amino acids. Together, these characteristics make A-domain detection and substrate specificity prediction difficult for traditional sequence alignment methods, e.g., BLAST searches. In this chapter we describe two workflows based on alignment-free methods intended for the identification and substrate specificity prediction of A-domains. To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP version 2.0 (topological indices to biopolymers), which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive dataset) and of unrelated but well-characterized sequences (negative dataset). The spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting alignment-free models can be used to search entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined in an ensemble with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Agüero-Chapin et al., PLoS One 8(7):e65926, 2013). The second workflow, for predicting A-domain substrate specificity, is based on alignment-free models constructed with transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains.
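As a rough illustration of the spectral-moment step in the first workflow, the sketch below computes the moments of a graph as traces of successive powers of its adjacency matrix, mu_k = tr(A^k). This is not the TI2BioP implementation: the function names are hypothetical, and `toy_sequence_graph` is an invented stand-in rule (backbone edges plus edges between identical residues two positions apart), not the four-color map construction, which is not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(adj, max_order):
    """Return [tr(A^0), tr(A^1), ..., tr(A^max_order)] for adjacency matrix A."""
    n = len(adj)
    # Start from the identity matrix, i.e. A^0.
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_order + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, adj)
    return moments


def toy_sequence_graph(seq):
    """Hypothetical toy graph for a sequence (NOT a four-color map).

    Nodes are residues. Consecutive residues are linked, and residues two
    positions apart are linked when they are identical.
    """
    n = len(seq)
    adj = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1
    for i in range(n - 2):
        if seq[i] == seq[i + 2]:
            adj[i][i + 2] = adj[i + 2][i] = 1
    return adj


# The resulting list of moments would serve as the numeric descriptor
# (input predictors) for a statistical classifier.
descriptor = spectral_moments(toy_sequence_graph("ABA"), 3)
```

For an unweighted graph, tr(A^2) equals twice the number of edges and tr(A^3) equals six times the number of triangles, which makes small cases easy to check by hand.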
The construction of the models was implemented in NRPSpredictor and, in a first step, uses the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin synthetase A [PDB ID 1AMU] to derive a feature vector. Homologous positions were extracted from A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered according to the physicochemical properties of the substrate amino acids (AA). In a second step, support vector machines (SVMs) were optimized on the feature vectors of the characterized A-domains in each of the resulting clusters. The SVMs were then used in their transductive variant (TSVMs), which integrates a fraction of the uncharacterized A-domains during training in order to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one per substrate specificity resulting from the clustering. The model that produces the largest score for an uncharacterized A-domain assigns its substrate specificity (Rausch et al., Nucleic Acids Res 33:5799-5808, 2005).
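The final scoring step of the second workflow can be sketched as follows: active-site residues are encoded as a numeric feature vector, each per-specificity model returns a score, and the label of the highest-scoring model is assigned. This is a simplified sketch under stated assumptions, not NRPSpredictor code. Linear decision functions with made-up weights stand in for trained TSVMs, the Kyte-Doolittle hydropathy index is used as a single illustrative property (the physicochemical properties actually used are not specified here), and the toy signatures have 3 positions instead of 34.

```python
# Kyte-Doolittle hydropathy index, used here as one illustrative property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(residues):
    """Turn a string of active-site residues into a numeric feature vector."""
    return [HYDROPATHY[r] for r in residues]


def decision_score(weights, bias, features):
    """Linear decision value w.x + b (stand-in for a trained TSVM score)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_specificity(models, residues):
    """Score the residues with every model and return the best label."""
    features = encode(residues)
    scores = {label: decision_score(w, b, features)
              for label, (w, b) in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical models: label -> (weights, bias). The values are invented
# for illustration and were not obtained by training.
toy_models = {
    "hydrophobic-cluster": ([1.0, 1.0, 1.0], 0.0),
    "polar-cluster": ([-1.0, -1.0, -1.0], 0.0),
}

label, scores = predict_specificity(toy_models, "ILV")
```

With real models, each entry of `toy_models` would be replaced by the trained classifier for one substrate cluster, and the feature vector would cover all extracted homologous positions.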