Li Xiaoguang, Zhou Tao, Feng Xingdong, Yau Shing-Tung, Yau Stephen S-T
School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China.
Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China.
Innovation (Camb). 2024 Jul 22;5(5):100677. doi: 10.1016/j.xinn.2024.100677. eCollection 2024 Sep 9.
Understanding the geometry of genome space is an important problem in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard each genome sequence as a point on a suitable Grassmann manifold, determined by the column space of its FCGR. To assess sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the Euclidean distance used in classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for several genome datasets, including the influenza A virus hemagglutinin gene, Orthocoronavirinae genomes, and SARS-CoV-2 complete genome sequences. A comparison with multiple sequence alignment and alignment-free methods on large-scale sequences showed that our method, which uses the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in both speed and accuracy. In addition, we applied our method to low-dimensional visualization of SARS-CoV-2 genome sequences and spike protein nucleotide sequences, which yielded some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have intrinsic manifold structure, providing a new geometric perspective for molecular biology studies.
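The pipeline summarized above (sequence, then FCGR, then column space, then subspace distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The corner assignment of the chaos game square, the helper names `fcgr`, `column_basis`, and `subspace_distance`, and the truncation of each column space to its leading `r` left singular vectors are all assumptions made for illustration. The distance shown is the ordinary geodesic distance computed from principal angles between two subspaces of equal dimension; the generalized Grassmannian distance named in the abstract also covers subspaces of different dimensions, which this sketch does not attempt.

```python
import numpy as np

# Assumed corner convention for the chaos game square; other layouts are in use.
_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Count the k-mers of seq into a normalized 2^k x 2^k frequency matrix."""
    n = 2 ** k
    mat = np.zeros((n, n))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in _BITS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        row = col = 0
        for base in kmer:
            x, y = _BITS[base]
            row = 2 * row + y  # each base adds one binary digit per axis
            col = 2 * col + x
        mat[row, col] += 1
    total = mat.sum()
    return mat / total if total else mat


def column_basis(mat, r):
    """Orthonormal basis for the leading r-dimensional part of the column space."""
    u, _, _ = np.linalg.svd(mat, full_matrices=False)
    return u[:, :r]  # hypothetical truncation rank, chosen for illustration


def subspace_distance(m1, m2, r=2):
    """Geodesic distance from the principal angles between two column spaces."""
    q1, q2 = column_basis(m1, r), column_basis(m2, r)
    cosines = np.linalg.svd(q1.T @ q2, compute_uv=False)
    angles = np.arccos(np.clip(cosines, 0.0, 1.0))
    return float(np.linalg.norm(angles))
```

Computing `subspace_distance(fcgr(s1), fcgr(s2))` for every pair of sequences would give a distance matrix that a standard tree-building method could consume. The choice of `k` and `r` here is arbitrary and would need tuning on real data.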