Chen Feng, Li Zhoufang, Chen Yi-Ping Phoebe
College of Information Science and Engineering, Henan University of Technology, Zhengzhou City, Henan Province 450001, China; Faculty of Science, Technology and Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
College of Information Science and Engineering, Henan University of Technology, Zhengzhou City, Henan Province 450001, China.
Comput Biol Chem. 2014 Aug;51:83-92. doi: 10.1016/j.compbiolchem.2014.03.001. Epub 2014 Mar 12.
A common insertion site (CIS) is a genomic region that is hit by retroviral insertions more frequently than expected by chance. Such regions are strongly associated with cancer gene loci and can therefore be used to detect cancer genes. An algorithm for detecting CISs should satisfy the following requirements: (1) it does not require any prior knowledge of the underlying insertion distribution; (2) it can resolve the insertion biases caused by hotspots; (3) it can detect CISs of any biological width; (4) it can identify noise resulting from statistical errors and non-CIS insertions; and (5) it can identify the widths of CISs as accurately as possible. We develop a method that addresses these requirements. We assess a region's significance from two perspectives: distribution width and distribution depth. The former indicates how many insertions fall in a region, while the latter evaluates how the insertions in that region are distributed across tumors. We compare our method with kernel density estimation and a sliding-window approach on simulated data, showing that our method not only identifies cancer-related insertions effectively but also filters noise correctly. Experiments on real data show that taking the insertion distribution into account can highlight significant CISs. We detect 53 novel CISs, some of which are supported by the biological literature.
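As a rough illustration of the two perspectives named above, the following Python sketch tallies, for one candidate region, how many insertions fall inside it and how those insertions are spread across tumors. It is not the authors' algorithm: the abstract does not define how width and depth are scored or tested for significance, so the `Insertion` record, the `region_summary` function, and the toy data are all hypothetical names and values chosen only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Insertion(NamedTuple):
    """Hypothetical record for one retroviral insertion."""
    chrom: str
    pos: int
    tumor: str


def region_summary(insertions, chrom, start, end):
    """Count insertions in the closed interval [start, end] on `chrom`.

    Returns (number of insertions, Counter of insertions per tumor).
    This is an illustrative tally, not the paper's significance test.
    """
    hits = [i for i in insertions
            if i.chrom == chrom and start <= i.pos <= end]
    per_tumor = Counter(i.tumor for i in hits)
    return len(hits), per_tumor


# Toy data, invented for the example.
insertions = [
    Insertion("chr1", 100, "T1"),
    Insertion("chr1", 150, "T1"),
    Insertion("chr1", 180, "T2"),
    Insertion("chr1", 900, "T3"),
    Insertion("chr2", 120, "T2"),
]

n_insertions, per_tumor = region_summary(insertions, "chr1", 50, 200)
print(n_insertions, len(per_tumor), dict(per_tumor))
# prints: 3 2 {'T1': 2, 'T2': 1}
```

In this toy region, three insertions come from only two tumors, and two of them from the same tumor. A method that considers the cross-tumor distribution can treat such a region differently from one where three insertions come from three separate tumors.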