Athey John, Alexaki Aikaterini, Osipova Ekaterina, Rostovtsev Alexandre, Santana-Quintero Luis V, Katneni Upendra, Simonyan Vahan, Kimchi-Sarfaty Chava
Division of Plasma Protein Therapeutics, Office of Tissue and Advanced Therapies, Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, USA.
High Performance Integrated Environment, Center for Biologics Evaluation and Research, Food and Drug Administration, Silver Spring, USA.
BMC Bioinformatics. 2017 Sep 2;18(1):391. doi: 10.1186/s12859-017-1793-7.
Due to the degeneracy of the genetic code, most amino acids can be encoded by multiple synonymous codons. Synonymous codons naturally occur with different frequencies in different organisms. The choice of codons may affect protein expression, structure, and function. Recombinant gene technologies commonly take advantage of the first of these effects through a technique termed codon optimization, in which codons are replaced with synonymous ones in order to increase protein expression. This technique relies on accurate knowledge of codon usage frequencies. Accurately quantifying codon usage bias for different organisms is useful not only for codon optimization, but also for evolutionary and translation studies: phylogenetic relations of organisms and host-pathogen co-evolution relationships may be explored through their codon usage similarities. Furthermore, codon usage has been shown to affect protein structure and function by interfering with translation kinetics and cotranslational protein folding.
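As an illustration of what a codon usage frequency is, the following minimal Python sketch counts codons in a single in-frame coding sequence and reports each as a frequency per 1000 codons. The function name `codon_usage_per_thousand`, the per-thousand scaling, and the toy sequence are assumptions made for this example; this is not the HIVE-CUTs implementation, which aggregates over all coding sequences of an organism.

```python
from collections import Counter


def codon_usage_per_thousand(cds: str) -> dict[str, float]:
    """Return codon frequencies per 1000 codons for an in-frame coding sequence."""
    # Normalise case and treat RNA input (U) as DNA (T).
    seq = cds.upper().replace("U", "T")
    # Ignore trailing bases that do not form a complete codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous bases such as N.
    codons = [c for c in codons if set(c) <= set("ACGT")]
    counts = Counter(codons)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Toy sequence (hypothetical): ATG GCT GCT GCC TAA
usage = codon_usage_per_thousand("ATGGCTGCTGCCTAA")
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 1))
```

In this toy sequence, alanine is encoded twice by GCT and once by GCC, so the two synonymous codons receive different frequencies; that kind of imbalance, measured over a whole genome, is what a codon usage table records.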
Despite the obvious need for accurate codon usage tables, currently available resources are either limited in scope, encompassing only organisms from specific domains of life, or greatly outdated. Taking advantage of the exponential growth of GenBank and the creation of NCBI's RefSeq database, we have developed a new database, the High-performance Integrated Virtual Environment-Codon Usage Tables (HIVE-CUTs), to present and analyse codon usage tables for every organism with publicly available sequencing data. Compared to existing databases, this new database is more comprehensive, addresses concerns that limited the accuracy of earlier databases, and provides several new functionalities, such as the ability to view and compare codon usage between individual organisms and across taxonomic clades, either graphically or through commonly used indices. In addition, it is routinely updated to keep pace with the continuous flow of new data into GenBank and RefSeq.
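The abstract does not name the indices used for comparison. As one example of a widely used codon usage index, the sketch below computes relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The choice of RSCU, the standard genetic code table, the function name `rscu`, and the toy counts are all assumptions for illustration; whether HIVE-CUTs reports this particular index is not stated here.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(codon_counts: dict[str, int]) -> dict[str, float]:
    """Relative synonymous codon usage for every codon of each observed amino acid."""
    # Group codons into synonymous families (stop codons form their own family, "*").
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    result = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed; RSCU is undefined
        expected = total / len(codons)  # count under uniform synonymous usage
        for c in codons:
            result[c] = codon_counts.get(c, 0) / expected
    return result


# Toy counts (hypothetical), not taken from any real organism.
counts = Counter({"GCT": 2, "GCC": 1, "ATG": 1, "TAA": 1})
values = rscu(counts)
print(round(values["GCT"], 3), round(values["GCC"], 3), values["GCA"])
```

An RSCU of 1 means a codon is used exactly as often as expected under uniform synonymous usage, and values above or below 1 indicate preference or avoidance. Because the index is normalised within each amino acid, it can be compared between organisms whose proteins differ in amino acid composition.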
Given the impact of codon usage bias on recombinant gene technologies, this database will facilitate effective development and review of recombinant drug products and will be instrumental in a wide range of biological research. The database is available at hive.biochemistry.gwu.edu/review/codon.