Department of Computer Science, Purdue University, West Lafayette, IN, USA.
Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA.
BMC Bioinformatics. 2022 May 20;22(Suppl 10):627. doi: 10.1186/s12859-022-04704-z.
Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high-dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors, which are then used for classification. In contrast, this work uses TDA to curate relevant features and obtain somewhat better representatives. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though barcodes are stable signatures, there exists no direct mapping between the data and said barcodes.
The topology-relevant curated data that we obtain improve both shallow-learning and deep-learning-based supervised classification. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly.
In this work, we compute representative persistent cycles to analyze the gene expression data. These cycles allow us to directly identify genes involved in similar processes.