Department of Environmental Health Sciences, School of Public Health and Health Sciences, University of Massachusetts, 686 North Pleasant Street, Amherst, MA 01003, USA.
BMC Genomics. 2022 Mar 14;23(1):204. doi: 10.1186/s12864-022-08427-6.
The rapid development of high-throughput omics technologies has generated increasing interest in algorithms for cutoff point identification. Existing cutoff methods and tools identify cutoff points based on the association of a continuous variable with another variable, such as phenotype, disease state, or treatment group. These approaches are not applicable to descriptive studies, in which continuous variables are reported without a known association with any biologically meaningful variable.
The most common shape of the ranked distribution of continuous variables in high-throughput descriptive studies is a biphasic curve: the first phase includes a large number of variables whose values grow slowly with rank, and the second phase includes a smaller number of variables whose values grow rapidly with rank. This study describes a simple algorithm that identifies the boundary between these phases for use as a cutoff point.
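The idea of locating a boundary on a biphasic ranked curve can be illustrated with a short sketch. The abstract does not spell out the algorithm's steps, so the code below substitutes a generic "knee" heuristic (the point of the normalized ranked curve that lies farthest below the straight line joining its first and last points) and should not be read as the published procedure. The function name `find_biphasic_cutoff` and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def find_biphasic_cutoff(values: Sequence[float]) -> Tuple[int, float, List[float]]:
    """Locate a knee on the ranked curve of `values`.

    ASSUMPTION: this is a generic maximum-distance-to-chord heuristic,
    not necessarily the algorithm described in the paper.

    Returns (knee_index, threshold, shortlist), where knee_index is the
    zero-based rank of the last point assigned to the slow first phase,
    threshold is the value at that rank, and shortlist holds the values
    strictly above the threshold (the fast second phase).
    """
    ranked = sorted(values)  # ascending: rank 0 is the smallest value
    n = len(ranked)
    if n < 3:
        raise ValueError("need at least three values to find a knee")
    lowest, highest = ranked[0], ranked[-1]
    if highest == lowest:
        raise ValueError("all values are equal; the curve has no phases")

    knee_index, best_gap = 0, float("-inf")
    for i, value in enumerate(ranked):
        x = i / (n - 1)                               # rank scaled to [0, 1]
        y = (value - lowest) / (highest - lowest)     # value scaled to [0, 1]
        gap = x - y  # how far the curve sits below the chord y = x
        if gap > best_gap:
            knee_index, best_gap = i, gap

    threshold = ranked[knee_index]
    shortlist = [v for v in ranked if v > threshold]
    return knee_index, threshold, shortlist


# Toy data (hypothetical): 95 slowly growing values, then 5 rapidly growing ones.
toy_values = [100 + i for i in range(95)] + [1000, 2000, 4000, 8000, 16000]
knee_index, threshold, shortlist = find_biphasic_cutoff(toy_values)
print(knee_index, threshold, shortlist)
```

On this toy input the knee falls on the last slowly growing value, so the shortlist contains only the five rapidly growing values. Scaling both axes to [0, 1] keeps the result independent of measurement units, but the heuristic remains sensitive to extreme outliers, which stretch the value axis.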
The major assumption of this approach is that a small number of variables with high values dominate the biological system and determine its major processes and functions. The approach was tested on three different datasets: human genes and their expression values in the human cerebral cortex, mammalian genes and their values of sensitivity to chemical exposures, and human proteins and their expression values in the human heart. In every case, the described cutoff identification method produced shortlists of variables (genes, proteins) highly relevant to the dominant functions and pathways of the analyzed biological systems.
The described method for cutoff identification may be used to prioritize variables in descriptive omics studies for a focused functional analysis in situations where other methods of data dichotomization are unavailable.