Shetta Omar, Niranjan Mahesan
Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK.
R Soc Open Sci. 2020 Feb 5;7(2):190714. doi: 10.1098/rsos.190714. eCollection 2020 Feb.
The application of machine learning to inference problems in biology is dominated by the supervised learning problems of regression and classification, and the unsupervised learning problems of clustering and variants of low-dimensional projection for visualization. A class of problems that has received little attention is the detection of outliers in datasets, which can arise from gross experimental, reporting or labelling errors. Outliers could also be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by estimating the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, a serious problem for omics data, which are often very high-dimensional. We develop an outlier detection method based on structured low-rank approximation. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers, whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection when compared with popular dimensionality reduction techniques.
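The classical density-based approach mentioned above can be sketched as follows. This is a minimal illustration, not the authors' method or their baseline implementation: it assumes a single multivariate Gaussian as the density model, and the function names, the `quantile` and `ridge` parameters, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_log_likelihood(X, mean, cov):
    """Log-density of each row of X under a Gaussian N(mean, cov)."""
    d = X.shape[1]
    diff = X - mean
    # Solve a linear system instead of forming the explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis_sq = np.einsum("ij,ij->i", diff, solved)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (mahalanobis_sq + logdet + d * np.log(2.0 * np.pi))

def density_outliers(X, quantile=0.05, ridge=1e-6):
    """Flag rows whose log-likelihood falls below a chosen quantile.

    `quantile` and `ridge` are assumptions of this sketch: the threshold
    is set from the data, and a small ridge keeps the covariance
    invertible.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    log_lik = gaussian_log_likelihood(X, mean, cov)
    threshold = np.quantile(log_lik, quantile)
    return log_lik < threshold, log_lik

# Synthetic data (assumed for illustration): 200 inliers, 3 shifted outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 5))
outliers = rng.normal(8.0, 1.0, size=(3, 5))
X = np.vstack([inliers, outliers])

flags, log_lik = density_outliers(X, quantile=0.05)
print("flagged rows:", np.flatnonzero(flags))
```

In this sketch the sample covariance has rank at most n - 1, so once the number of features approaches or exceeds the number of samples it is singular and the likelihood depends entirely on the ridge term. That is one concrete way the dimensionality problem described in the abstract shows up in a density-based detector.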