Tuan-Anh Tran, Ly Le Thi, Viet Ngo Quoc, Bao Pham The
Faculty of Mathematics and Computer Science, VNUHCM-University of Science, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Vietnam.
School of Biotechnology, VNUHCM-International University, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam.
BMC Bioinformatics. 2017 Feb 10;18(1):100. doi: 10.1186/s12859-017-1517-z.
Since recombinant proteins were first discovered, they have become increasingly important in many areas of life science. The global pharmaceutical market was valued at $87 billion in 2008, and sales of industrial enzymes exceeded $4 billion in 2012. These figures indicate the great potential of recombinant proteins. However, a native gene introduced into a host can be incompatible with the host's expression system in its codon usage bias, GC content, repeat regions, and Shine-Dalgarno sequence, so yields can drop significantly. Hence, we propose novel methods for gene optimization based on a neural network, Bayesian theory, and Euclidean distance.
The correlation coefficients of our neural network are 0.86, 0.73, and 0.90 in the training, validation, and testing phases, respectively. In addition, genes optimized by our methods appear to be associated with highly expressed genes and give reasonable codon adaptation index (CAI) values. Furthermore, genes optimized by the proposed methods agree closely with previously reported experimental data.
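For readers unfamiliar with the codon adaptation index mentioned above, the following is a minimal Python sketch of the standard definition: the geometric mean of per-codon relative adaptiveness weights derived from a reference set of highly expressed genes. It is not the authors' Matlab implementation. The three-family codon table, the 0.5 pseudocount for unobserved codons, and the function names `relative_adaptiveness` and `cai` are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy table: only three synonymous codon families.
# A real calculation would cover every amino acid with more than one codon.
SYNONYMOUS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Phe": ["TTT", "TTC"],
}


def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def relative_adaptiveness(reference_genes):
    """Weight of each codon = its count / count of the most used synonymous codon."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    weights = {}
    for codons in SYNONYMOUS.values():
        top = max(counts[c] for c in codons)
        if top == 0:
            continue  # family never seen in the reference set
        for c in codons:
            # Assumed pseudocount of 0.5 for unobserved codons, to avoid log(0).
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of all scored codons in the gene."""
    logs = [math.log(weights[c]) for c in split_codons(gene) if c in weights]
    if not logs:
        raise ValueError("gene contains no codons covered by the weight table")
    return math.exp(sum(logs) / len(logs))


# Toy reference set in which AAA, GAA, and TTC are each used twice as often
# as their synonymous alternatives.
reference = ["".join(["AAA", "AAA", "AAG", "GAA", "GAA", "GAG", "TTC", "TTC", "TTT"])]
weights = relative_adaptiveness(reference)
print(cai("AAAGAATTC", weights))  # gene built only from preferred codons
print(cai("AAGGAGTTT", weights))  # gene built only from rarer codons
```

In this toy setting, a gene using only the preferred codons scores higher than one using only the rarer synonymous codons, which is the sense in which a CAI value reflects similarity to the codon usage of highly expressed genes.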
The proposed methods have high potential for gene optimization and for further research in gene expression. We built a demonstration program using Matlab R2014a under Mac OS X. The program is published both as a standalone executable and as Matlab function files. It can be accessed at http://www.math.hcmus.edu.vn/~ptbao/paper_soft/GeneOptProg/ .