Shah Najeebullah, Meng Qiuchen, Zou Ziheng, Zhang Xuegong
MOE Key Lab of Bioinformatics & Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China.
School of Life Sciences and Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China.
Bioinform Adv. 2024 Jul 29;4(1):vbae109. doi: 10.1093/bioadv/vbae109. eCollection 2024.
In single-cell studies, principal component analysis (PCA) is widely used to reduce the dimensionality of datasets and to visualize them in 2D or 3D PC plots. Scientists often focus on the clusters within a PC plot and overlook specific phenomena, such as the horse-shoe-like effect, that may reveal hidden knowledge about the underlying biological dataset. This phenomenon remains largely unexplored in single-cell studies.
In this study, we investigate the horse-shoe-like effect in PC plots using simulated and real scRNA-seq datasets. We systematically explain the horse-shoe-like phenomenon from several inter-related perspectives. First, we establish an intuitive understanding with the help of simulated datasets. Then, we generalize the acquired knowledge to real biological scRNA-seq data. The experimental results provide logical explanations for the appearance of the horse-shoe-like effect in PC plots. Furthermore, we identify a potential problem with the well-known 'distance saturation property' theory, which has been proposed as the cause of the horse-shoe phenomenon. Finally, we analyse a mathematical model of the horse-shoe effect that suggests trigonometric solutions for the estimated eigenvectors. Comparing the results of the mathematical model with the simulated and real scRNA-seq datasets, we observe a close resemblance.
The code for reproducing the results of this study is available at: https://github.com/najeebullahshah/PCA-Horse-Shoe.