Narang Vipin, Sung Wing-Kin, Mittal Ankush
Department of Computer Science, 3 Science Drive 2, National University of Singapore, 117543, Singapore.
Genome Inform. 2006;17(2):14-24.
Drosophila melanogaster is one of the most important organisms for studying the genetics of development. Genes are precisely regulated during early development through the control of transcription. The control circuitry is hardwired in the genome as clusters of transcription factor binding sites (TFBSs) known as cis-regulatory modules (CRMs). A number of TFBSs and CRMs have been experimentally annotated in the Drosophila genome. Currently about 661 CRM sequences are known, of which 155 have been annotated with 778 TFBSs. This work attempts computational annotation of TFBSs in the remaining 506 uncharacterized Drosophila CRMs. The difficulty of this task is that the experimental data are insufficient for constructing reliable position weight matrices (PWMs) to predict the TFBSs. A novel feature extraction and classification method for TFBS detection was therefore implemented in this work. The method achieves both high sensitivity and a low false positive rate in cross-validation studies. As a result of this work, a new database has been compiled that aggregates all the CRM and TFBS annotation information for Drosophila available to date and adds the new TFBS annotations.
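For readers unfamiliar with PWMs, the following is a minimal sketch of the conventional PWM approach that the abstract contrasts with, not the feature extraction and classification method of this work. It builds a log-odds matrix from a few aligned binding sites and scans a sequence for the best-scoring window. The function names, the toy sites, the uniform background of 0.25, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length aligned sites.

    pseudocount and background are illustrative assumptions, not values
    taken from the paper.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for i in range(width):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append(
            {base: math.log2((counts[base] / total) / background) for base in BASES}
        )
    return pwm


def score(pwm, window):
    """Sum the per-column log-odds scores for a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits)


# Hypothetical toy data: three aligned sites for an unnamed factor.
toy_sites = ["TAATCC", "TAATCC", "TAAGCT"]
toy_pwm = build_pwm(toy_sites)
hit_score, hit_start = best_hit(toy_pwm, "GGTAATCCGG")
```

With only three sites, each column is estimated from three observations plus pseudocounts, so a single additional site can shift the scores noticeably. This illustrates why sparse experimental data make PWM-based prediction unreliable.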