College of Life Information Science & Instrument Engineering, Hangzhou Dianzi University, Hangzhou 310018, China.
School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong.
J Proteome Res. 2018 Jul 6;17(7):2511-2520. doi: 10.1021/acs.jproteome.8b00262. Epub 2018 May 24.
In synthetic biology, a key focus is building a minimal artificial cell that can serve as a basic chassis for functional studies. Recently, the J. Craig Venter Institute published the latest version of its minimal bacterial genome, JCVI-syn3.0, which encodes only 438 essential proteins. However, the functions of 149 of these proteins remain unknown because effective annotation methods are lacking. Here, we report a secondary structure element alignment method called SSEalign, based on an effective training data set extracted from various bacterial genomes. Experimentally validated homologous genes from different species were selected as training positives, while unrelated genes from different species were selected as training negatives. SSEalign uses a set of well-defined basic alignment elements together with the backtracking line search algorithm to derive the best parameters for accurate prediction. Experimental results showed that SSEalign achieved 88.2% test accuracy, which is better than existing prediction methods. SSEalign was subsequently applied to identify the functions of the unannotated proteins in JCVI-syn3.0. Results indicated that at least 136 of the 149 unannotated proteins in the JCVI-syn3.0 genome could be annotated by SSEalign. Our method is effective for identifying protein homology in JCVI-syn3.0 and can be used to annotate hypothetical proteins in other bacterial genomes.
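To make the idea of aligning secondary structure elements concrete, here is a minimal sketch of a global (Needleman-Wunsch style) alignment score between two secondary structure strings. It is not the SSEalign implementation: the three-state encoding (H = helix, E = strand, C = coil), the function names, and the match, mismatch, and gap scores are all assumptions made for illustration. In the published method the scoring parameters are derived from training data rather than fixed by hand.

```python
# Illustrative sketch only; not the SSEalign algorithm or its parameters.
# Assumes each protein is encoded as a three-state string: H, E, C.

# Hypothetical scores, chosen by hand for this example.
MATCH = 2.0
MISMATCH = -1.0
GAP = -2.0


def sse_global_align_score(a: str, b: str,
                           match: float = MATCH,
                           mismatch: float = MISMATCH,
                           gap: float = GAP) -> float:
    """Return the optimal global alignment score of two SSE strings."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(
                dp[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                dp[i - 1][j] + gap,       # gap in b
                dp[i][j - 1] + gap,       # gap in a
            )
    return dp[n][m]


def normalized_score(a: str, b: str) -> float:
    """Scale the score by the best score the longer string could reach."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sse_global_align_score(a, b) / (MATCH * longest)


if __name__ == "__main__":
    print(sse_global_align_score("HHEC", "HHEC"))
    print(sse_global_align_score("HHEC", "HEC"))
    print(normalized_score("HHEC", "HHEC"))
```

A homology call could then be made by comparing the normalized score against a threshold. That threshold, like the scores above, would need to be fitted on labeled positive and negative pairs, which is the role the abstract assigns to the backtracking line search.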