Department of Computer Science, TU Dortmund, Dortmund, Germany.
Department of Medical Biometry, Informatics and Epidemiology (IMBIE), University Hospital Bonn, Bonn, Germany.
PLoS One. 2023 Mar 9;18(3):e0282699. doi: 10.1371/journal.pone.0282699. eCollection 2023.
In genetics, gene sets are grouped in collections according to their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, which precludes a straightforward interpretation of their biological meaning. In data mining, it is often argued that techniques to reduce the dimensionality of data could increase the maneuverability and consequently the interpretability of large data. Moreover, in the past years we have witnessed a growing awareness of the importance of understanding data and of interpretable models in the machine learning and bioinformatics communities. On the one hand, there exist techniques that aggregate overlapping gene sets to create larger pathways. While these methods could partly solve the problem of the collections' large size, modifying biological pathways is hardly justifiable in this biological context. On the other hand, the representation methods proposed so far to increase the interpretability of collections of gene sets have proved to be insufficient. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain the sets' importance scores by computing Shapley values; by making use of microarray games, we avoid the typical exponential computational complexity. Moreover, we address the challenge of constructing redundancy-aware rankings where, in our case, redundancy is a quantity proportional to the size of the intersections among the sets in the collections. We use the obtained rankings to reduce the dimension of the families, thereby obtaining lower redundancy among sets while still preserving a high coverage of their elements. 
Finally, we evaluate our approach on collections of gene sets and apply Gene Set Enrichment Analysis techniques to the now smaller collections: as expected, given the unsupervised nature of the proposed rankings, the differences in the number of significant gene sets for specific phenotypic traits are unremarkable. In contrast, the number of statistical tests performed can be drastically reduced. The proposed rankings are of practical utility in bioinformatics for increasing the interpretability of collections of gene sets, and they represent a step toward including redundancy-awareness in Shapley value computations.
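The scoring idea can be illustrated with a minimal Python sketch, which is not the authors' implementation. It assumes the usual microarray-game structure, in which each element splits one unit of worth equally among the sets that contain it, so that Shapley values follow from a single pass over the family rather than from an enumeration of all coalitions. The function names, the toy family, and in particular the redundancy penalty (a score discounted by the largest Jaccard overlap with the sets already ranked) are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def shapley_scores(family):
    """Closed-form Shapley values under the assumed microarray-game structure.

    family: dict mapping a set name to a Python set of elements (e.g. genes).
    Each element contributes 1 / (number of sets containing it) to every set
    it belongs to; the total is normalised by the number of distinct elements.
    """
    counts = Counter(e for members in family.values() for e in members)
    n_elements = len(counts)
    return {
        name: sum(1.0 / counts[e] for e in members) / n_elements
        for name, members in family.items()
    }


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundancy_aware_ranking(family):
    """Greedy ranking with a hypothetical overlap penalty (an assumption).

    At each step, pick the set whose Shapley score, discounted by its largest
    Jaccard overlap with the sets already ranked, is highest.
    """
    scores = shapley_scores(family)
    ranked, remaining = [], set(family)
    while remaining:
        def adjusted(name):
            overlap = max(
                (jaccard(family[name], family[r]) for r in ranked), default=0.0
            )
            return scores[name] * (1.0 - overlap)

        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=adjusted)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy family: "B" is almost entirely contained in "A".
toy_family = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5}}
plain_ranking = sorted(toy_family, key=shapley_scores(toy_family).get, reverse=True)
aware_ranking = redundancy_aware_ranking(toy_family)
```

In this toy family the plain Shapley ranking places `B` ahead of `C`, while the penalised ranking demotes `B` because it overlaps heavily with the already selected `A`. Truncating such a ranking is one way a family could be reduced while keeping most of its elements covered.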