IEEE/ACM Trans Comput Biol Bioinform. 2018 Mar-Apr;15(2):673-687. doi: 10.1109/TCBB.2016.2636207. Epub 2016 Dec 6.
Identifying combinatorial markers from multiple data sources is a challenging task in bioinformatics. Here, we propose a novel computational framework for identifying significant combinatorial markers using both gene expression and methylation data. The gene expression and methylation data are integrated into a single continuous dataset as well as a (post-discretized) boolean dataset based on their intrinsic (i.e., inverse) relationship. A novel combined score of methylation and expression data is introduced, which is computed on the integrated continuous data to identify an initial non-redundant set of genes. Thereafter, (maximal) frequent closed homogeneous genesets are identified using a well-known biclustering algorithm applied to the integrated boolean data of this non-redundant set of genes. A novel sample-based weighted support is then proposed, which is consecutively calculated on the same integrated boolean data in order to identify the non-redundant significant genesets. The top few resulting genesets are identified as potential significant combinatorial markers. Because the proposed method generates fewer significant non-redundant genesets than other popular methods, it is much faster than those methods. Applying the proposed technique to expression and methylation data for uterine tumor or prostate carcinoma produces a set of significant marker combinations. We expect that such a combination of markers will produce fewer false positives than individual markers.
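To make the integration and support steps concrete, the following is a minimal sketch and not the method from the paper. The abstract does not give the discretization rule, the combined score, or the weighting formula, so every detail here is an assumption: the function names `booleanize` and `weighted_support` are hypothetical, the per-gene median is used as the discretization threshold, the inverse relationship is encoded as "high expression together with low methylation", and the sample weights are supplied by the caller.

```python
from statistics import median


def booleanize(expr, meth):
    """Integrate expression and methylation into one boolean matrix.

    expr, meth: dict mapping gene -> list of per-sample values, with the
    same sample order in both. A cell is True when the gene is expressed
    above its median AND methylated at or below its median in that sample
    (an assumed encoding of the inverse relationship).
    """
    integrated = {}
    for gene in expr:
        e_med = median(expr[gene])
        m_med = median(meth[gene])
        integrated[gene] = [
            e > e_med and m <= m_med
            for e, m in zip(expr[gene], meth[gene])
        ]
    return integrated


def weighted_support(geneset, bool_data, sample_weights):
    """Share of total sample weight held by samples that support the geneset.

    A sample supports the geneset when every gene in it is True for that
    sample. The weighting scheme itself is an assumption of this sketch.
    """
    total = sum(sample_weights)
    covered = sum(
        w
        for i, w in enumerate(sample_weights)
        if all(bool_data[g][i] for g in geneset)
    )
    return covered / total


# Toy data: two genes, four samples (values are illustrative only).
expr = {"A": [1, 5, 6, 2], "B": [2, 7, 8, 1]}
meth = {"A": [0.9, 0.1, 0.2, 0.8], "B": [0.7, 0.2, 0.1, 0.9]}
bool_data = booleanize(expr, meth)
support = weighted_support(["A", "B"], bool_data, [1, 2, 1, 1])
```

In this toy input, both genes follow the inverse pattern in the second and third samples only, so the geneset's weighted support is the weight of those two samples divided by the total weight. Ranking candidate genesets by such a value is one plausible way to pick "the top few" genesets.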