Computational Biology Program, University of Kansas, Lawrence, KS 66045.
Department of Molecular Biosciences, University of Kansas, Lawrence, KS 66045.
Proc Natl Acad Sci U S A. 2023 Jul 18;120(29):e2220762120. doi: 10.1073/pnas.2220762120. Epub 2023 Jul 11.
Large datasets contribute new insights to subjects formerly investigated by exemplars. We used coevolution data to create a large, high-quality database of transmembrane β-barrels (TMBBs). By applying simple feature detection to generated evolutionary contact maps, our method (IsItABarrel) achieves 95.88% balanced accuracy when discriminating among protein classes. Moreover, comparison with IsItABarrel revealed a high rate of false positives in previous TMBB algorithms. In addition to being more accurate than previous datasets, our database (available online) contains 1,938,936 bacterial TMBB proteins from 38 phyla, making it 17 and 2.2 times larger than the previous sets TMBB-DB and OMPdb, respectively. We anticipate that, owing to its quality and size, the database will serve as a useful resource wherever high-quality TMBB sequence data are required. We find that TMBBs can be divided into 11 types, three of which have not been previously reported. We also find tremendous variance in the percentage of the proteome devoted to TMBBs among TMBB-containing organisms, with some using 6.79% of their proteome for TMBBs and others using as little as 0.27%. The distribution of TMBB lengths is suggestive of previously hypothesized duplication events. In addition, we find that the C-terminal β-signal varies among different classes of bacteria, although its consensus sequence is LGLGYRF. However, this β-signal is characteristic only of prototypical TMBBs. The ten non-prototypical barrel types have other C-terminal motifs, and it remains to be determined whether these alternative motifs facilitate TMBB insertion or perform any other signaling function.
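The 95.88% figure quoted for IsItABarrel is a balanced accuracy. The abstract does not spell out how it was computed; the sketch below assumes the common definition, the unweighted mean of per-class recall, which is less sensitive to class imbalance than plain accuracy. The function name, the class labels, and the toy predictions are hypothetical and are not taken from the paper or its database.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the unweighted mean of per-class recall.

    Assumption: this is the standard macro-averaged-recall definition;
    the paper's exact evaluation procedure may differ.
    """
    total = defaultdict(int)    # number of true examples per class
    correct = defaultdict(int)  # correctly predicted examples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: 2 barrels and 8 non-barrels.
y_true = ["barrel"] * 2 + ["other"] * 8
# The classifier misses one of the two barrels.
y_pred = ["barrel", "other"] + ["other"] * 8

score = balanced_accuracy(y_true, y_pred)
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"balanced accuracy = {score:.2f}, plain accuracy = {plain:.2f}")
```

In this toy case plain accuracy looks high because the majority class dominates, while balanced accuracy is pulled down by the missed barrel. That is presumably why a balanced metric suits a setting in which TMBBs are a small fraction of all proteins.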