Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Belvaux, Luxembourg.
Bioinformatics. 2017 Jul 1;33(13):1953-1962. doi: 10.1093/bioinformatics/btx111.
The identification of genes or molecular regulatory mechanisms implicated in biological processes often requires the discretization, and in particular the booleanization, of gene expression measurements. However, currently used methods mostly classify each measurement as active or inactive regardless of its statistical support, which can lead to downstream conclusions based on spurious booleanization results.
To overcome the lack of certainty inherent in current methodologies and to improve the process of discretization, we introduce RefBool, a reference-based algorithm for discretizing gene expression data. Instead of requiring each measurement to be classified as active or inactive, RefBool allows a third state that can be interpreted as intermediate expression of a gene. Furthermore, each measurement is associated with a p-value and a q-value indicating the significance of its classification. Validation of RefBool on a neuroepithelial differentiation study, followed by qualitative and quantitative comparison against 10 currently used methods, supports its advantages and shows clear improvements in the resulting clusterings.
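The idea of a three-state, significance-aware discretization can be illustrated with a minimal Python sketch. This is not the RefBool implementation (which is distributed as MATLAB files) and does not reproduce its statistical procedure. It assumes, purely for illustration, that p-values are empirical tail fractions of a gene's reference expression values and that q-values come from the Benjamini-Hochberg adjustment. The function names, the add-one smoothing, and the `alpha` cutoff are all hypothetical choices.

```python
from bisect import bisect_left, bisect_right


def empirical_p_values(value, sorted_reference):
    """Return (p_active, p_inactive) for one measurement.

    Assumption of this sketch: the p-value for "active" is the smoothed
    fraction of reference values at least as high as the measurement, and
    the p-value for "inactive" is the fraction at least as low.
    """
    n = len(sorted_reference)
    at_least_as_high = n - bisect_left(sorted_reference, value)
    at_least_as_low = bisect_right(sorted_reference, value)
    # Add-one smoothing keeps every p-value strictly positive.
    return (at_least_as_high + 1) / (n + 1), (at_least_as_low + 1) / (n + 1)


def benjamini_hochberg(p_values):
    """Convert p-values to q-values with the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def discretize(values, reference, alpha=0.05):
    """Label each value as 'active', 'inactive', or 'intermediate'."""
    sorted_reference = sorted(reference)
    pairs = [empirical_p_values(v, sorted_reference) for v in values]
    q_active = benjamini_hochberg([p_act for p_act, _ in pairs])
    q_inactive = benjamini_hochberg([p_inact for _, p_inact in pairs])
    states = []
    for q_act, q_inact in zip(q_active, q_inactive):
        if q_act < alpha:
            states.append("active")
        elif q_inact < alpha:
            states.append("inactive")
        else:
            # Neither call is statistically supported.
            states.append("intermediate")
    return states, q_active, q_inactive


# Hypothetical reference expression values for one gene, and three query values.
reference = list(range(1, 100))
states, q_active, q_inactive = discretize([0, 50, 200], reference)
print(states)
```

In this sketch, a measurement is only called active or inactive when the corresponding q-value falls below `alpha`; every other measurement falls into the intermediate state rather than being forced into a Boolean call.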
The software is available as MATLAB files in the Supplementary Information and as an online repository (https://github.com/saschajung/RefBool).
Supplementary data are available at Bioinformatics online.