Steenwyk Jacob L, Buida Thomas J, Gonçalves Carla, Goltz Dayna C, Morales Grace, Mead Matthew E, LaBella Abigail L, Chavez Christina M, Schmitz Jonathan E, Hadjifrangiskou Maria, Li Yuanning, Rokas Antonis
Department of Biological Sciences, Vanderbilt University, Nashville, TN 37235, USA.
Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA.
Genetics. 2022 Jul 4;221(3). doi: 10.1093/genetics/iyac079.
Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line programs have been developed to conduct some of these analyses individually, but unified toolkits that conduct all of them are lacking. To address this gap, we introduce BioKIT, a versatile command-line toolkit that, upon publication, has 42 functions, several of which were community-sourced, for routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will help facilitate and streamline sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPI (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
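For readers unfamiliar with relative synonymous codon usage (RSCU), the following is a minimal sketch of the standard definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is not BioKIT's implementation. The function name `rscu`, the use of the standard genetic code, and the choice to skip stop codons are assumptions made for illustration; for genomes that use alternative genetic codes, the codon table would need to be replaced.

```python
from collections import Counter
from itertools import product

# Standard genetic code (an assumption for this sketch), in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(sequences):
    """Return {codon: RSCU} for in-frame coding sequences (hypothetical helper).

    RSCU = observed codon count / (amino acid total / number of synonymous codons).
    Stop codons and amino acids that never occur are omitted from the result.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families, leaving out stop codons.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


if __name__ == "__main__":
    for codon, value in sorted(rscu(["ATGGCTGCTGCTGCCTAA"]).items()):
        print(codon, round(value, 2))
```

Under this definition, a value above 1 means a codon is used more often than expected under equal usage of its synonyms, and a value below 1 means it is used less often.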