Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains.

Author information

Agüero-Chapin Guillermin, Pérez-Machado Gisselle, Sánchez-Rodríguez Aminael, Santos Miguel Machado, Antunes Agostinho

Affiliations

CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, Porto, 4050-123, Portugal.

Centro de Bioactivos Químicos, Universidad Central "Marta Abreu" de Las Villas (UCLV), Santa Clara, 54830, Cuba.

Publication information

Methods Mol Biol. 2016;1401:253-72. doi: 10.1007/978-1-4939-3375-4_16.

Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allows the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because of the high sequence diversity of A-domains (~10-40% amino acid identity) and their selectivity for 50 different natural/unnatural amino acids. Together, these characteristics make detecting A-domains and predicting their substrate specificity difficult with traditional sequence alignment methods, e.g., BLAST searches. In this chapter we describe two workflows based on alignment-free methods for the identification and substrate specificity prediction of A-domains. To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP (Topological Indices to BioPolymers) version 2.0, which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive set) and of unrelated but well-characterized sequences (negative set). Spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting alignment-free models can be used to explore entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Aguero-Chapin et al., PLoS One 8(7):e65926, 2013). The second workflow, for predicting the substrate specificity of A-domains, is based on alignment-free models constructed with transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains.
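Before turning to the details of the second workflow, the numerical step of the identification workflow above can be illustrated with a minimal sketch. It is not the TI2BioP implementation: the four-color map is replaced by a much simpler path graph over the residues, and the four-class grouping, the per-class weights, and all function names are assumptions made only for illustration. What it does show is the general idea of a spectral moment, i.e. the trace of the k-th power of a weighted adjacency matrix, computed per sequence to give a fixed-length numeric descriptor.

```python
# Simplified sketch; NOT the TI2BioP algorithm.
# Assumption: a four-class grouping of the 20 standard amino acids.
CLASSES = {
    "nonpolar": "AVLIMFWPG",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
# Assumption: illustrative per-class weights placed on the matrix diagonal.
CLASS_WEIGHT = {"nonpolar": 0.0, "polar": 0.5, "acidic": -1.0, "basic": 1.0}
RESIDUE_CLASS = {aa: name for name, members in CLASSES.items() for aa in members}


def sequence_matrix(seq):
    """Weighted adjacency matrix of a path graph over the residues."""
    n = len(seq)
    matrix = [[0.0] * n for _ in range(n)]
    for i, aa in enumerate(seq):
        # Raises KeyError for non-standard residues such as "X".
        matrix[i][i] = CLASS_WEIGHT[RESIDUE_CLASS[aa]]
        if i + 1 < n:
            # Consecutive residues are connected by an unweighted edge.
            matrix[i][i + 1] = matrix[i + 1][i] = 1.0
    return matrix


def mat_mul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(seq, max_order=5):
    """Return [tr(M^0), tr(M^1), ..., tr(M^max_order)] for the sequence."""
    n = len(seq)
    matrix = sequence_matrix(seq)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_order + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, matrix)
    return moments


if __name__ == "__main__":
    print(spectral_moments("AKD", max_order=2))  # prints [3.0, 0.0, 6.0]
```

In a workflow of the kind described, vectors like these would be computed for the positive and negative sets and passed to a statistical classifier; that training step is not shown here.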
The construction of the models was implemented in NRPSpredictor. In a first step, the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin synthetase A [PDB ID 1AMU] was used to derive a feature vector. Homologous positions were extracted for A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered by the physicochemical properties of the amino acids (AA). In a second step, support vector machines (SVMs) were optimized on the feature vectors of the characterized A-domains in each of the resulting clusters. The SVMs were then used in their transductive variant (TSVMs), which integrates a fraction of the uncharacterized A-domains during training, to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one for each substrate specificity cluster. The model producing the largest score for an uncharacterized A-domain assigns its substrate specificity (Rausch et al., Nucleic Acids Res 33:5799-5808, 2005).
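The final scoring step of this second workflow (encode the active-site residues as a physicochemical feature vector, score it with one model per specificity cluster, and assign the cluster with the largest score) can be sketched as follows. This is an illustration under stated assumptions, not NRPSpredictor code: the abstract does not say which descriptors are used, so the Kyte-Doolittle hydropathy scale and a formal charge stand in for them; the two linear models with hand-set weights stand in for trained TSVMs; the cluster names are hypothetical; and a 2-residue signature is used in place of the 34 positions to keep the example small.

```python
# Kyte-Doolittle hydropathy values. Using this scale here is an assumption;
# the descriptors actually used by NRPSpredictor are not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified formal charge (assumption): every other residue is treated as 0.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def encode(signature):
    """Turn a residue signature into [hydropathy, charge] pairs, flattened."""
    features = []
    for aa in signature:
        features.append(HYDROPATHY[aa])  # KeyError for non-standard residues
        features.append(CHARGE.get(aa, 0.0))
    return features


def score(model, features):
    """Linear decision value w.x + b; a stand-in for a trained (T)SVM."""
    weights, bias = model
    if len(weights) != len(features):
        raise ValueError("model and feature vector differ in length")
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(models, signature):
    """Score the signature with every model; the largest score wins."""
    features = encode(signature)
    scores = {label: score(model, features) for label, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical hand-set models for 2-residue signatures (4 features each).
MODELS = {
    "cluster_hydrophobic": ([1.0, 0.0, 1.0, 0.0], 0.0),  # sums hydropathy
    "cluster_acidic": ([0.0, -1.0, 0.0, -1.0], 0.0),     # counts negative charges
}

if __name__ == "__main__":
    for sig in ("IL", "DE"):
        label, _ = predict(MODELS, sig)
        print(sig, "->", label)
```

The transductive training itself, in which uncharacterized A-domains take part as unlabeled examples, is the substantive part of the method and is not reproduced here; only the assignment rule from the last two sentences above is shown.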
