Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, Singapore, 117549, Singapore.
BMC Bioinformatics. 2022 Mar 30;23(1):108. doi: 10.1186/s12859-022-04643-9.
Biological sequence clustering is a difficult data clustering problem because computing pairwise sequence distances through sequence alignment is computationally expensive, and because it is hard to choose parameters that yield robust clusters. Current approaches successfully reduce the number of sequence alignments performed, but they generate clusters using a single sequence identity threshold applied to every cluster. A poorly chosen identity threshold therefore leads to low-quality clusters. Users, however, receive little support in selecting thresholds that are well matched to the input sequences.
We present a novel sequence clustering approach called ALFATClust that exploits rapid alignment-free pairwise sequence distance calculations and graph community detection to generate clusters. Instead of applying a single threshold to every generated cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation on the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained.
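The general idea of combining alignment-free distances with graph-based grouping can be illustrated with a minimal sketch. This is not ALFATClust's implementation: the k-mer Jaccard distance, the single global cut-off `max_dist`, the use of connected components as a simple stand-in for community detection, and all function names are assumptions chosen for illustration only. In particular, the sketch uses one fixed threshold, which is exactly the limitation the per-cluster thresholds described above are meant to avoid.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Alignment-free distance: 1 minus the Jaccard index of the k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def cluster_sequences(seqs, max_dist=0.5, k=3):
    """Group sequences by connected components of the graph whose edges
    join pairs with distance <= max_dist (hypothetical global cut-off)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if jaccard_distance(seqs[i], seqs[j], k) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA"]
clusters = cluster_sequences(seqs)
print(clusters)
```

With these toy sequences, the first two and the last two share most of their 3-mers while the two pairs share none, so the sketch returns two clusters of sequence indices. A real tool would replace the exhaustive pairwise loop and the connected-components step with faster distance estimation and a proper community detection algorithm.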
ALFATClust generates sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters, without requiring users to decide on precise similarity cut-off thresholds.