Vogelstein Joshua T, Bridgeford Eric W, Tang Minh, Zheng Da, Douville Christopher, Burns Randal, Maggioni Mauro
Johns Hopkins University, Baltimore, MD, USA.
Nat Commun. 2021 May 17;12(1):2872. doi: 10.1038/s41467-021-23102-2.
To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
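The idea described in the abstract, a projection that combines class-conditional means with principal directions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `lol_projection`, the choice of the first class as the reference for the mean differences, and the final QR orthonormalization are all assumptions made for illustration.

```python
import numpy as np

def lol_projection(X, y, d):
    """Illustrative sketch: return a (p, d) projection matrix whose columns
    span the class-mean differences plus the leading principal directions
    of the class-centered data. Hypothetical helper, not the authors' code."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)

    # Class-conditional means, one row per class: shape (K, p).
    means = np.stack([X[y == c].mean(axis=0) for c in classes])

    # Mean differences relative to the first class: shape (K - 1, p).
    # Using the first class as the reference is an assumption of this sketch.
    deltas = means[1:] - means[0]
    n_delta = deltas.shape[0]
    if not (n_delta <= d <= min(n, p)):
        raise ValueError("d must lie between K - 1 and min(n, p)")

    # Subtract each sample's own class mean (class-conditional centering).
    Xc = X - means[np.searchsorted(classes, y)]

    # Leading right singular vectors of the centered data are the top
    # principal directions of the pooled within-class covariance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)

    # Stack mean differences first, then principal directions, and
    # orthonormalize the columns with a reduced QR factorization.
    basis = np.vstack([deltas, Vt[: d - n_delta]]).T  # shape (p, d)
    Q, _ = np.linalg.qr(basis)
    return Q

# Usage: 40 samples, 200 features, two classes separated along feature 0.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, 0] += 3.0
A = lol_projection(X, y, d=3)
X_low = X @ A  # low-dimensional representation, shape (40, 3)
```

Because the mean differences come first, the first column of the result is the normalized difference of the class means (up to sign), so the discriminating direction is kept even when it is not a high-variance direction. The full singular value decomposition used here is only suitable for small examples; at the scale mentioned in the abstract, a truncated or randomized decomposition would be needed.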