203 B.T. Road, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, 700108, India.
Comput Biol Med. 2017 Nov 1;90:59-67. doi: 10.1016/j.compbiomed.2017.09.010. Epub 2017 Sep 18.
Discretizing gene expression values is an important preprocessing step because it reduces noise and experimental error, which in turn improves results in tasks such as gene regulatory network analysis and disease prediction. This work develops a supervised discretization method for gene expression data that uses gene annotations. The method, called "Gene Annotation Based Discretization" (GABD), determines the discretization width by maximizing the positive predictive value (PPV), computed from gene annotations, for the top 20,000 gene pairs. The resulting discretized values capture gene similarity better than the original expression values do. GABD is compared, in terms of PPV, with existing discretization methods such as equal width discretization, equal frequency discretization, and k-means discretization. Its utility is further demonstrated by clustering genes with the k-medoid algorithm and thereby predicting the function of 23 unclassified Saccharomyces cerevisiae genes using a p-value cutoff of 10. The source code for GABD is available at http://www.sampa.droppages.com/GABD.html.
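The width selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is at the URL above), and it rests on several assumptions the abstract does not state: values are binned by flooring `value / width`, gene pairs are ranked by Pearson correlation of their discretized profiles, and a pair counts as a true positive when the two genes share at least one annotation term. The function names `discretize`, `ppv`, and `best_width` are hypothetical.

```python
import numpy as np

def discretize(expr, width):
    """Assumed binning rule: integer bin index floor(value / width)."""
    return np.floor(np.asarray(expr, dtype=float) / width).astype(int)

def pearson_matrix(x):
    """Pairwise Pearson correlation between rows; constant rows get similarity 0."""
    xc = x - x.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(xc, axis=1)
    norm[norm == 0] = 1.0  # avoid division by zero for constant profiles
    xn = xc / norm[:, None]
    return xn @ xn.T

def ppv(disc, annotations, n_pairs):
    """Fraction of the n_pairs most similar gene pairs that share an annotation.

    annotations is a list of sets, one set of annotation terms per gene
    (an assumed representation, not taken from the paper).
    """
    sim = pearson_matrix(disc)
    i, j = np.triu_indices(len(disc), k=1)  # each unordered pair once
    order = np.argsort(-sim[i, j], kind="stable")[:n_pairs]
    if len(order) == 0:
        return 0.0
    hits = sum(1 for k in order if annotations[i[k]] & annotations[j[k]])
    return hits / len(order)

def best_width(expr, annotations, widths, n_pairs=20000):
    """Return the candidate width with the highest PPV, plus all scores."""
    scores = {w: ppv(discretize(expr, w), annotations, n_pairs) for w in widths}
    return max(scores, key=scores.get), scores

if __name__ == "__main__":
    # Toy data: 4 genes x 3 conditions, with made-up annotation labels.
    expr = np.array([[0.1, 1.1, 2.1],
                     [0.2, 1.2, 2.2],
                     [2.0, 1.0, 0.0],
                     [2.1, 1.1, 0.1]])
    annotations = [{"A"}, {"A"}, {"B"}, {"B"}]
    width, scores = best_width(expr, annotations, widths=[0.5, 1.0], n_pairs=2)
    print(width, scores)
```

The sketch builds the full gene-by-gene similarity matrix, which is only practical for modest gene counts; a genome-scale run would need a more memory-conscious way of finding the top pairs.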