Tress Michael L, Cozzetto Domenico, Tramontano Anna, Valencia Alfonso
Protein Design Group, CNB-CSIC, Calle Darwin, Cantoblanco 28049 Madrid, Spain.
BMC Bioinformatics. 2006 Apr 19;7:213. doi: 10.1186/1471-2105-7-213.
The environmental sequencing of the Sargasso Sea has introduced a huge new resource of genomic information. Unlike the protein sequences held in the current searchable databases, the Sargasso Sea sequences originate from a single marine environment and have been sequenced from species that are not easily obtainable by laboratory cultivation. The resource also contains a great many fragments of whole protein sequences, a side effect of the shotgun sequencing method. These sequences form a significant addendum to the current searchable databases but also present us with some intrinsic difficulties. While it is important to know whether it is possible to assign function to these sequences with the current methods and whether they will increase our capacity to explore sequence space, it is also interesting to know how current bioinformatics techniques will deal with the new sequences in the resource.
The Sargasso Sea sequences seem to introduce a bias that decreases the potential of current methods to propose structure and function for new proteins. In particular, the high proportion of sequence fragments in the resource seems to result in poor-quality multiple alignments.
These observations suggest that the new sequences should be used with care, especially if the information is to be used in large scale analyses. On a positive note, the results may just spark improvements in computational and experimental methods to take into account the fragments generated by environmental sequencing techniques.