Pandit: a database of protein and associated nucleotide domains with inferred trees.

Author information

Whelan Simon, de Bakker Paul I W, Goldman Nick

Affiliation

Department of Zoology, University of Cambridge, Downing Street, Cambridge CB2 3EJ, UK.

Publication information

Bioinformatics. 2003 Aug 12;19(12):1556-63. doi: 10.1093/bioinformatics/btg188.

DOI:10.1093/bioinformatics/btg188

PMID:12912837

Abstract

MOTIVATION

A large, high-quality database of homologous sequence alignments with good estimates of their corresponding phylogenetic trees will be a valuable resource to those studying phylogenetics. It will allow researchers to compare current and new models of sequence evolution across a large variety of sequences. The large quantity of data may provide inspiration for new models and methodology to study sequence evolution and may allow general statements about the relative effect of different molecular processes on evolution.

RESULTS

The Pandit 7.6 database contains 4341 families of sequences derived from the seed alignments of Pfam, a database of amino acid alignments of families of homologous protein domains (Bateman et al., 2002). Each family in Pandit includes three alignments: an alignment of amino acid sequences that matches the corresponding Pfam family seed alignment; an alignment of the DNA sequences that contain the coding sequence of the Pfam alignment, for those sequences where the DNA could be recovered (overall, 82.9% of the sequences taken from Pfam); and the amino acid alignment restricted to only those sequences for which a DNA sequence could be recovered. Each alignment has an associated estimate of its phylogenetic tree. The tree topologies were obtained using the neighbor-joining method based on maximum likelihood estimates of the evolutionary distances, with branch lengths then calculated using a standard maximum likelihood approach.
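
The topology step described above can be illustrated with a minimal sketch of the textbook neighbor-joining algorithm. This is not the Pandit pipeline: the function name `neighbor_joining`, the internal node labels and the five-taxon distance matrix are all hypothetical, the distances are toy numbers rather than maximum likelihood estimates, and the subsequent maximum likelihood re-estimation of branch lengths is not shown.

```python
from itertools import combinations


def neighbor_joining(taxa, dist):
    """Build an unrooted tree by neighbor joining.

    taxa: list of leaf labels.
    dist: dict mapping frozenset({a, b}) to the pairwise distance.
    Returns a list of (node, parent_or_neighbor, branch_length) edges.
    """
    d = dict(dist)          # working copy; distances to new nodes are added
    nodes = list(taxa)
    edges = []
    n_internal = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node: sum of distances to all other nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a)
             for a in nodes}

        def q(pair):
            # Neighbor-joining criterion; the pair with the smallest Q is joined.
            a, b = pair
            return (n - 2) * d[frozenset(pair)] - r[a] - r[b]

        i, j = min(combinations(nodes, 2), key=q)
        dij = d[frozenset((i, j))]

        # Branch lengths from the joined pair to their new internal node.
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i

        u = f"node{n_internal}"   # hypothetical label for the internal node
        n_internal += 1
        edges.append((i, u, len_i))
        edges.append((j, u, len_j))

        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    # Connect the last two nodes with the remaining distance.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Toy distance matrix (illustrative values only, not taken from Pandit).
taxa = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7,
    ("d", "e"): 3,
}
dist = {frozenset(k): float(v) for k, v in pairs.items()}

edges = neighbor_joining(taxa, dist)
for node, neighbor, length in edges:
    print(f"{node} - {neighbor}: {length:.2f}")
```

Because the toy matrix is additive (it corresponds exactly to a tree), the branch lengths returned here reproduce the input distances. With real distance estimates that is generally not the case, which is one reason to re-estimate branch lengths in a separate step, as the abstract describes.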
