IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2841-2847. doi: 10.1109/TCBB.2021.3076422. Epub 2021 Dec 8.
The classification of clinical samples based on gene expression data is an important part of precision medicine. In this manuscript, we show how transforming gene expression data into a set of personalized (sample-specific) networks allows us to harness existing graph-based methods to improve classifier performance. Existing approaches to personalized gene networks are limited in that they depend on the other samples in the data and must be recomputed whenever a new sample is introduced. Here, we propose a novel method, called Personalized Annotation-based Networks (PAN), that avoids this limitation by using curated annotation databases to transform gene expression data into a graph. Unlike competing methods, a PAN is calculated for each sample independently of the population, making PAN a more efficient way to obtain single-sample networks. Using three breast cancer datasets as a case study, we show that PAN classifiers not only predict cancer relapse better than gene features alone, but also outperform PPI (protein-protein interaction) and population-level graph-based classifiers. This work demonstrates the practical advantages of graph-based classification for high-dimensional genomic data, while offering a new approach to constructing sample-specific networks. Supplementary information: PAN and the baselines are implemented in Python. Source code and data are available at https://github.com/thinng/PAN.
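The idea of computing a network for one sample without reference to any other sample can be illustrated with a minimal sketch. This is not the authors' PAN construction, which the abstract does not specify; the repository above is the reference for that. The sketch makes several assumptions: nodes are annotation terms, a node's value is the mean expression of its annotated genes in that sample, and two terms are linked when they share at least one measured gene. The function name `build_sample_network`, the parameter `min_shared`, the term labels, and the toy expression values are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def build_sample_network(expression, annotations, min_shared=1):
    """Build a network for ONE sample from its own expression values.

    expression:  dict mapping gene -> expression value in this sample
    annotations: dict mapping annotation term -> set of annotated genes
    min_shared:  minimum number of shared measured genes to link two terms

    Aggregation by mean and edges by gene overlap are illustrative
    assumptions, not the published PAN definition.
    """
    # Node value: mean expression of the term's genes measured in this sample.
    node_values = {}
    for term, genes in annotations.items():
        measured = [expression[g] for g in genes if g in expression]
        if measured:  # drop terms with no measured genes
            node_values[term] = mean(measured)

    # Edge weight: number of measured genes the two terms have in common.
    edges = {}
    for a, b in combinations(sorted(node_values), 2):
        shared = {g for g in annotations[a] & annotations[b] if g in expression}
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)

    return node_values, edges


# Toy data (hypothetical values and term labels, for illustration only).
sample_expression = {"BRCA1": 2.0, "TP53": 4.0, "ESR1": 6.0}
term_to_genes = {
    "term_A": {"BRCA1", "TP53"},
    "term_B": {"TP53", "ESR1"},
    "term_C": {"ESR1"},
    "term_D": {"MYC"},  # MYC is not measured, so term_D is dropped
}

node_values, edges = build_sample_network(sample_expression, term_to_genes)
print(node_values)
print(edges)
```

Because the function reads only one sample's expression values plus the fixed annotation table, adding a new sample to the dataset leaves every previously computed network unchanged, which is the property the abstract contrasts with population-dependent methods.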