Yooseph Shibu, Sutton Granger, Rusch Douglas B, Halpern Aaron L, Williamson Shannon J, Remington Karin, Eisen Jonathan A, Heidelberg Karla B, Manning Gerard, Li Weizhong, Jaroszewski Lukasz, Cieplak Piotr, Miller Christopher S, Li Huiying, Mashiyama Susan T, Joachimiak Marcin P, van Belle Christopher, Chandonia John-Marc, Soergel David A, Zhai Yufeng, Natarajan Kannan, Lee Shaun, Raphael Benjamin J, Bafna Vineet, Friedman Robert, Brenner Steven E, Godzik Adam, Eisenberg David, Dixon Jack E, Taylor Susan S, Strausberg Robert L, Frazier Marvin, Venter J Craig
J. Craig Venter Institute, Rockville, Maryland, United States of America.
PLoS Biol. 2007 Mar;5(3):e16. doi: 10.1371/journal.pbio.0050016.
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.