Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.
Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA.
Bioinformatics. 2018 Sep 1;34(17):2997-3003. doi: 10.1093/bioinformatics/bty214.
The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD classifies domains that are closely related to each other based on sequence similarity. Because domain definitions differ between resources, directly applying existing sequence domain databases, such as Pfam, to ECOD has several shortcomings.
We created multiple sequence alignments and profiles from ECOD domains, using structural information for alignment building and boundary delineation. We validated alignment quality by scoring structure superpositions, showing that the alignments are comparable to the curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27% and 16% of ECOD families, respectively, are new; these new families are mostly small, likely because of sampling bias in the PDB. A further 35% and 48% of families have boundaries that are modified compared with their counterparts in Pfam and CDD, respectively.
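As a rough illustration of what labeling a family "new" or "boundary-modified" relative to Pfam or CDD could involve, the sketch below compares one ECOD domain range with a matched range from another database. The abstract does not state the criteria used, so the overlap threshold (`min_overlap`), the boundary tolerance (`tolerance`), and the names `Domain`, `overlap_fraction`, and `classify` are all assumptions made for this example, not the authors' method.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    """Residue range of a domain, 1-based and inclusive (hypothetical type)."""
    start: int
    end: int


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Overlap length divided by the length of the shorter of the two ranges."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return inter / shorter


def classify(ecod: Domain, other: Optional[Domain],
             min_overlap: float = 0.5, tolerance: int = 10) -> str:
    """Label an ECOD range against a counterpart from another database.

    Both thresholds are illustrative assumptions, not published values.
    """
    # No counterpart, or too little overlap: treat the ECOD domain as new.
    if other is None or overlap_fraction(ecod, other) < min_overlap:
        return "new"
    # A counterpart exists but either end differs by more than the tolerance.
    if (abs(ecod.start - other.start) > tolerance
            or abs(ecod.end - other.end) > tolerance):
        return "modified"
    return "consistent"


print(classify(Domain(1, 100), Domain(1, 160)))
```

In practice such a comparison would be aggregated over all members of a family rather than applied to a single pair of ranges, and the thresholds would need to be tuned to the data.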
The new families are now integrated into the ECOD website. The aggregate HMMER profile library and alignments are available for download from the ECOD website (http://prodata.swmed.edu/ecod).
Supplementary data are available at Bioinformatics online.