Community detection in sequence similarity networks based on attribute clustering

Authors

Chowdhary Janamejaya, Löffler Frank E, Smith Jeremy C

Affiliations

Center for Molecular Biophysics, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America.

University of Tennessee-Oak Ridge National Laboratory, Joint Institute for Biological Sciences and Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America.

Publication information

PLoS One. 2017 Jul 24;12(7):e0178650. doi: 10.1371/journal.pone.0178650. eCollection 2017.

DOI:10.1371/journal.pone.0178650

PMID:28738060

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5524321/

Abstract

Networks are powerful tools for the presentation and analysis of interactions in multi-component systems. A commonly studied mesoscopic feature of networks is their community structure, which arises from grouping together similar nodes into one community and dissimilar nodes into separate communities. Here, the community structure of protein sequence similarity networks is determined with a new method: Attribute Clustering Dependent Communities (ACDC). Sequence similarity has hitherto typically been quantified by the alignment score or its expectation value. However, pair alignments with the same score or expectation value cannot thus be differentiated. To overcome this deficiency, the method constructs, for pair alignments, an extended alignment metric, the link attribute vector, which includes the score and other alignment characteristics. Rescaling components of the attribute vectors qualitatively identifies a systematic variation of sequence similarity within protein superfamilies. The problem of community detection is then mapped to clustering the link attribute vectors, selection of an optimal subset of links and community structure refinement based on the partition density of the network. ACDC-predicted communities are found to be in good agreement with gold standard sequence databases for which the "ground truth" community structures (or families) are known. ACDC is therefore a community detection method for sequence similarity networks based entirely on pair similarity information. A serial implementation of ACDC is available from https://cmb.ornl.gov/resources/developments.
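
The abstract names partition density as the quantity used to refine the community structure but does not define it. As a minimal sketch, the code below assumes the definition commonly used for link communities (Ahn, Bagrow and Lehmann), D = (2/M) Σ_c m_c (m_c − (n_c − 1)) / ((n_c − 2)(n_c − 1)), where M is the total number of links and community c has m_c links and n_c nodes. The function name `partition_density` and the input layout are hypothetical and are not taken from the ACDC implementation, which may compute this quantity differently.

```python
def partition_density(link_communities):
    """Partition density of a link partition (assumed Ahn-Bagrow-Lehmann form).

    link_communities: dict mapping a community label to a list of
    (node_u, node_v) links assigned to that community.
    """
    total_links = sum(len(links) for links in link_communities.values())
    if total_links == 0:
        return 0.0

    weighted_sum = 0.0
    for links in link_communities.values():
        m = len(links)                                   # links in community
        n = len({node for link in links for node in link})  # nodes touched
        if n <= 2:
            continue  # a single link contributes zero by convention
        # Link density normalised between a tree (0) and a clique (1),
        # weighted by the number of links in the community.
        weighted_sum += m * (m - (n - 1)) / ((n - 2) * (n - 1))

    return 2.0 * weighted_sum / total_links


if __name__ == "__main__":
    # Hypothetical toy partition: one triangle plus one isolated link.
    toy = {
        "family_A": [("a", "b"), ("b", "c"), ("a", "c")],
        "family_B": [("d", "e")],
    }
    print(partition_density(toy))
```

Under this definition a partition made only of cliques scores 1 and a partition made only of trees scores 0, so a refinement step can compare candidate partitions by this single number.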
