School of Computer Science and Technology and Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, Guangdong 518055, China; Gordon Life Science Institute, Belmont, MA 02478, USA; and Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah 21589, Saudi Arabia.
Bioinformatics. 2015 Apr 15;31(8):1307-9. doi: 10.1093/bioinformatics/btu820. Epub 2014 Dec 10.
A central challenge in developing computational predictors of the biological features or attributes of DNA is finding a representation that effectively encodes DNA sequences. To facilitate studies of DNA and nucleotides, we developed a Python package called representations of DNAs (repDNA), which generates widely used features reflecting the physicochemical properties and sequence-order effects of DNA and nucleotides. The package provides 15 features in three groups. The first group calculates three nucleic acid composition features that describe local sequence information by means of k-mers. The second group calculates six autocorrelation features that describe the level of correlation between two oligonucleotides along a DNA sequence in terms of their specific physicochemical properties. The third group calculates six pseudo nucleotide composition features, which represent a DNA sequence as a discrete model or vector while still retaining considerable sequence-order information via the physicochemical properties of its constituent oligonucleotides. All of these features can be calculated in repDNA from either built-in or user-defined properties.
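To make the first two feature groups concrete, the following is a minimal plain-Python sketch of a k-mer composition vector and a dinucleotide autocorrelation value. It does not use repDNA and is not the package's implementation. The function names `kmer_composition` and `dinucleotide_autocorrelation` and the toy property table `TOY_GC_COUNT` (the number of G or C bases in each dinucleotide) are assumptions introduced only for illustration, standing in for a real physicochemical index.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing characters outside the alphabet are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[km] / total for km in kmers]


# Hypothetical property table (illustration only): number of G/C bases
# in each dinucleotide, used in place of a real physicochemical index.
TOY_GC_COUNT = {
    a + b: (a in "GC") + (b in "GC") for a, b in product("ACGT", repeat=2)
}


def dinucleotide_autocorrelation(seq, prop, lag=1):
    """Mean product of centered property values of dinucleotides `lag` apart."""
    seq = seq.upper()
    # Property value of each overlapping dinucleotide along the sequence.
    values = [prop[seq[i:i + 2]] for i in range(len(seq) - 1)]
    if lag < 1 or lag >= len(values):
        raise ValueError("lag must be in [1, number of dinucleotides - 1]")
    mean = sum(values) / len(values)
    pairs = zip(values, values[lag:])
    return sum((a - mean) * (b - mean) for a, b in pairs) / (len(values) - lag)
```

For a sequence of any length, `kmer_composition(seq, k)` yields a fixed-length vector of 4^k values, which is what lets sequences of different lengths be compared by a standard classifier. The autocorrelation value adds sequence-order information that the composition vector alone discards.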
The repDNA Python package is freely accessible to the public at http://bioinformatics.hitsz.edu.cn/repDNA/.
Contact: bliu@insun.hit.edu.cn or kcchou@gordonlifescience.org
Supplementary data are available at Bioinformatics online.