Center for Informational Biology at University of Electronic Science and Technology of China.
Brief Bioinform. 2021 Mar 22;22(2):1940-1950. doi: 10.1093/bib/bbaa017.
The locations at which genomic DNA replication is initiated are defined as origins of replication (ORIs). They regulate the onset of DNA replication and play significant roles in the replication process. The study of ORIs is essential for understanding the cell-division cycle and the regulation of gene expression. Computational methods that identify ORIs accurately can provide important clues for DNA replication research and drug development. In this paper, the first integrated predictor, named iORI-Euk, was built to identify ORIs in multiple eukaryotes and multiple cell types. ORI data for seven eukaryotes (Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Pichia pastoris, Schizosaccharomyces pombe and Kluyveromyces lactis) were collected from public databases to construct benchmark datasets. Subsequently, three feature extraction strategies (k-mer, binary encoding, and the combination of k-mer and binary encoding) were used to represent the DNA sequence samples. We also compared the performance of different classification algorithms. The best results were obtained with a support vector machine in both the 5-fold cross-validation test and the independent dataset test. Based on the optimal model, an online web server called iORI-Euk (http://lin-group.cn/server/iORI-Euk/) was established for identifying novel ORIs.
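The three feature extraction strategies named above can be illustrated with a minimal Python sketch. It is not the iORI-Euk implementation: the abstract does not state the value of k, the sequence length, or the normalization, so the choice of k = 2, the use of relative frequencies, the one-hot layout, and all function names here are assumptions made for illustration only.

```python
from itertools import product

ALPHABET = "ACGT"

# Assumed one-hot layout for binary encoding; the paper's exact mapping is not given.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def kmer_frequencies(seq, k=2):
    """Return relative k-mer frequencies, ordered lexicographically by k-mer."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing N or other symbols
            counts[window] += 1
            total += 1
    # Guard against sequences shorter than k (no valid windows).
    return [counts[m] / total if total else 0.0 for m in kmers]


def binary_encoding(seq):
    """Return a flat one-hot vector with 4 values per nucleotide."""
    vector = []
    for base in seq.upper():
        # Unknown symbols are encoded as all zeros (an assumption).
        vector.extend(ONE_HOT.get(base, [0, 0, 0, 0]))
    return vector


def combined_features(seq, k=2):
    """Concatenate the k-mer frequency vector and the binary encoding."""
    return kmer_frequencies(seq, k) + binary_encoding(seq)


example = "ATGCGATA"
features = combined_features(example, k=2)
# 4**2 k-mer frequencies plus 4 binary values for each of the 8 nucleotides
print(len(features))
```

Because the binary encoding grows with sequence length, the combined vector has a fixed size only when all samples have the same length, which is why this kind of encoding is normally applied to fixed-length sequence windows.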