Kabir Muhammad, Hayat Maqsood
Department of Computer Science, Abdul Wali Khan University, Mardan, KP, Pakistan.
Mol Genet Genomics. 2016 Feb;291(1):285-96. doi: 10.1007/s00438-015-1108-5. Epub 2015 Aug 30.
Meiotic recombination is vital for maintaining sequence diversity in the human genome. Meiosis and recombination are considered essential phases of cell division. In meiosis, the genome is divided into equal parts for sexual reproduction, whereas in recombination, diverse genomes are combined to form new combinations of genetic variation. Recombination does not occur randomly across the genome; its frequency differs between regions known as recombination "hotspots" and "coldspots". Owing to the rapid accumulation of polygenetic sequences in data banks, it is no longer feasible to characterize these sequences through conventional methods. Given the significance of recombination spots, it is essential to develop an accurate, fast, robust, and high-throughput automated computational model. In the proposed model, iRSpot-GAEnsC, numerical descriptors are extracted using two sequence representation schemes: dinucleotide composition and trinucleotide composition. The performance of seven classification algorithms was investigated. Finally, the predicted outcomes of the individual classifiers are fused into an ensemble classifier, using either majority voting or a genetic algorithm (GA). The GA-based ensemble model compares favorably with both the individual classifiers and the majority voting-based ensemble model. iRSpot-GAEnsC achieved 84.46% accuracy. The empirical results show that iRSpot-GAEnsC outperforms not only the examined algorithms but also the existing methods in the literature. It is anticipated that the proposed model may be helpful to the research community, academia, and drug discovery.
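The two feature schemes and the majority-voting fusion named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`kmer_composition`, `extract_features`, `majority_vote`), the frequency normalization, and the lexicographic feature ordering are all assumptions made for illustration. The GA-based fusion is not sketched, since the abstract does not describe its encoding or fitness function.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are ignored


def kmer_composition(seq, k):
    """Return normalized frequencies of all 4**k k-mers in lexicographic order.

    k=2 gives a dinucleotide composition (16 values),
    k=3 gives a trinucleotide composition (64 values).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def extract_features(seq):
    """Concatenate dinucleotide and trinucleotide compositions (80 values)."""
    return kmer_composition(seq, 2) + kmer_composition(seq, 3)


def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per sample.

    predictions[c][s] is the label classifier c assigns to sample s.
    With an odd number of binary classifiers (e.g. seven) no ties occur;
    on a tie, the label seen first among the votes is returned.
    """
    fused = []
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


if __name__ == "__main__":
    features = extract_features("ACGTACGT")
    print(len(features))
    print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

In a setup like the one the abstract describes, each sequence would be mapped to such a feature vector, each of the classifiers would predict a hotspot or coldspot label, and `majority_vote` would combine those labels per sequence.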