School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China.
Curr Gene Ther. 2018;18(5):268-274. doi: 10.2174/1566523218666180913110949.
Knowing a protein's correct subcellular localization is necessary for understanding its function and for revealing the mechanisms of the many human diseases caused by protein subcellular mislocalization; this knowledge is required before gene therapy can be considered as a treatment. In addition, gene therapy is well known to be an effective way to overcome disease by targeting a gene therapy product to a specific subcellular compartment. Deep neural networks have become increasingly popular for predicting protein function, owing both to the large increase in available genomics data and to their strong non-linear classification ability. However, they still have drawbacks, such as a large number of hyper-parameters and the need for a sufficient amount of labeled data.
We present a deep forest-based protein localization algorithm that relies on sequence information. The prediction model uses a multi-layered network of random forests to identify the subcellular regions of a protein. The model was trained and tested on a protein dataset from the latest UniProt release, and we demonstrate that our deep forest predicts the subcellular location of proteins with high accuracy given only the protein sequence, outperforming the current state-of-the-art algorithms. Moreover, unlike deep neural networks, it has significantly fewer parameters and is much easier to train.
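As a rough illustration of how a sequence-only, multi-layered forest model can be fed, the sketch below shows two steps. The abstract does not state which sequence encoding or cascade scheme was used, so both are assumptions made here for illustration: the amino-acid composition encoding, and the step that appends each forest's class-probability vector to the original features before the next layer (the scheme generally associated with deep forests). The function names `composition_features` and `augment_for_next_layer` are hypothetical, and the probability vectors are made-up placeholders rather than model output.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the feature layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(sequence):
    """Return the 20-dimensional amino-acid composition of a protein sequence.

    Assumption: composition is only one possible sequence encoding; the
    abstract does not say which encoding the authors used.
    """
    seq = sequence.upper()
    # Count only standard residues; ambiguous codes such as X or B are skipped.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def augment_for_next_layer(features, forest_probabilities):
    """Build the input of the next cascade layer.

    Assumption: each forest in a layer emits one class-probability vector,
    and all of them are appended to the original feature vector.
    """
    augmented = list(features)
    for probs in forest_probabilities:
        augmented.extend(probs)
    return augmented


# Hypothetical usage: one short sequence, two forests, three locations.
features = composition_features("MKTAYIAKQR")
layer_input = augment_for_next_layer(
    features,
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # placeholder forest outputs
)
print(len(features), len(layer_input))
```

In a scheme like this, the number of layers can be chosen by adding layers until validation accuracy stops improving, which is one reason such models tend to need fewer hand-tuned hyper-parameters than deep neural networks.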