Novel overlapping subgraph clustering for the detection of antigen epitopes.

Author information

Department of Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Hubei, China.

Department of Computer Science, School of Computing and Electronic Information, Guangxi University, Nanning, China.

Publication information

Bioinformatics. 2018 Jun 15;34(12):2061-2068. doi: 10.1093/bioinformatics/bty051.

PMID:29409062

Abstract

MOTIVATION

Antigens that contain overlapping epitopes have occasionally been reported. Because current algorithms mainly take a one-antigen-one-epitope approach to epitope prediction, they cannot accurately detect these multiple, overlapping epitopes, or even the multiple, separated epitopes found in some other antigens.

RESULTS

We introduce a novel subgraph clustering algorithm for more accurate detection of epitopes. The algorithm takes graph partitions as seeds and expands the seeds to merge overlapping subgraphs based on a similarity computed from term frequency-inverse document frequency (TF-IDF) features. Each merged subgraph is then classified as an epitope or a non-epitope. We tested the algorithm on three newly collected datasets of antigens. In the first dataset, each antigen contains only a single epitope; in the second, each antigen contains only multiple, separated epitopes; and in the third, each antigen contains overlapping epitopes. The prediction performance of our algorithm is significantly better than that of state-of-the-art methods. The improvements in average F-score over the best existing methods are 60%, 75% and 22% for single epitope detection, multiple and separated epitope detection, and overlapping epitope detection, respectively.
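
The merging step described above can be illustrated with a minimal Python sketch. It is not the published implementation (see the source code link for that) and rests on several assumptions that the abstract does not state: each subgraph is treated as a "document" whose "terms" are hypothetical residue-type labels on its vertices, similarity is the cosine of smoothed TF-IDF vectors, only subgraphs that share at least one vertex are merge candidates, and the `threshold` value is arbitrary. The function names `tfidf_vectors`, `cosine` and `merge_overlapping` are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(subgraphs, labels):
    """Return one TF-IDF vector (dict: term -> weight) per subgraph.

    Assumption: a subgraph is a "document" and the labels of its
    vertices (hypothetical residue types) are its "terms".
    """
    docs = [Counter(labels[v] for v in sg) for sg in subgraphs]
    n_docs = len(docs)
    # Document frequency: number of subgraphs in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Smoothed IDF, so a term present in every subgraph keeps weight.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_overlapping(subgraphs, labels, threshold=0.8):
    """Greedily merge the most similar pair of overlapping subgraphs.

    Assumption: two subgraphs are merge candidates only if they share
    a vertex; merging stops when no candidate reaches the threshold.
    """
    clusters = [set(sg) for sg in subgraphs]
    while True:
        vecs = tfidf_vectors(clusters, labels)
        candidates = [
            (cosine(vecs[i], vecs[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
            if clusters[i] & clusters[j]
        ]
        if not candidates:
            break
        sim, i, j = max(candidates)
        if sim < threshold:
            break
        clusters[i] |= clusters[j]  # j > i, so index i stays valid
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy data: vertex ids mapped to made-up residue-type labels.
    labels = {1: "K", 2: "E", 3: "K", 4: "E", 5: "L", 6: "L"}
    expanded = [{1, 2, 3}, {3, 4}, {5, 6}]
    print(merge_overlapping(expanded, labels))
```

In this toy input the first two subgraphs share vertex 3 and have similar label profiles, so they are merged, while the third shares no vertex with the others and is left alone. Classifying each merged subgraph as an epitope or a non-epitope would be a separate step that this sketch does not cover.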

AVAILABILITY AND IMPLEMENTATION

The source code is available at github.com/lzhlab/glep/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
