Zhang Jian, Chai Haiting, Yang Guifu, Ma Zhiqiang
School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China.
School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan Province, 464000, People's Republic of China.
BMC Bioinformatics. 2017 Jun 5;18(1):294. doi: 10.1186/s12859-017-1709-6.
Bioluminescent proteins (BLPs) occur widely in living organisms. Because BLPs emit light, they can serve as biomarkers that are easily detected in biomedical research, for example in gene expression analysis and in studies of signal transduction pathways. Accurate identification of BLPs is therefore important for disease diagnosis and biomedical engineering. In this paper, we propose a novel, accurate sequence-based method named PredBLP (Prediction of BioLuminescent Proteins) to predict BLPs.
We collect a series of sequence-derived features that have been shown to be involved in the structure and function of BLPs. These features include amino acid composition, dipeptide composition, sequence motifs and physicochemical properties. We further show that the combination of all four feature types outperforms any other combination and any individual feature type. To remove potentially irrelevant or redundant features, we apply the Fisher-Markov Selector together with a Sequential Backward Selection strategy to select the optimal feature subsets. Additionally, we design a lineage-specific scheme, which proves more effective than traditional universal approaches.
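As an illustration of the two simplest feature types named above, the following minimal Python sketch computes amino acid composition (20 values) and dipeptide composition (400 values) for a single sequence. The function names, the alphabetical residue ordering and the handling of non-standard residues are assumptions made for this example and are not taken from PredBLP; sequence motifs, physicochemical properties and the feature selection step are not covered.

```python
from itertools import product

# The 20 standard amino acids, in an arbitrary (alphabetical) order chosen for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Return the fraction of each standard residue (20 values).

    Non-standard symbols (e.g. X, B, Z) are ignored; this is an assumption
    of the sketch, not a documented PredBLP rule.
    """
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return the fraction of each adjacent residue pair (400 values).

    Pairs containing a non-standard symbol are skipped.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(seq, seq[1:]):
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def sequence_features(seq):
    """Concatenate both compositions into one 420-dimensional vector."""
    return amino_acid_composition(seq) + dipeptide_composition(seq)
```

A vector built this way could be passed to any standard classifier; which features survive selection and which classifier PredBLP uses are not specified in this abstract.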
Experiments on benchmark datasets demonstrate the robustness of PredBLP. We show that lineage-specific models significantly outperform universal ones. We also test the generalization capability of PredBLP on independent testing datasets as well as on newly deposited BLPs in UniProt. PredBLP outperforms many state-of-the-art methods. A web server named PredBLP, which implements the proposed method, is freely available for academic use.