Sastry Anand, Monk Jonathan, Tegel Hanna, Uhlen Mathias, Palsson Bernhard O, Rockberg Johan, Brunk Elizabeth
Department of Bioengineering, University of California, San Diego, CA, USA.
KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden.
Bioinformatics. 2017 Aug 15;33(16):2487-2495. doi: 10.1093/bioinformatics/btx207.
The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility.
By combining computational biology and machine learning, we identify protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We use these characteristics to guide the selection of protein fragments and so optimize high-throughput experimentation.
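As an illustration of the kind of sequence-derived properties named above, the following minimal Python sketch computes two of them for a protein fragment: aromaticity, taken here as the fraction of F, W and Y residues, and hydropathy, taken here as the mean Kyte-Doolittle score (GRAVY). These definitions, the function names and the example sequence are assumptions made for illustration and are not necessarily those used in the published notebooks. The isoelectric point is omitted because its value depends on the chosen set of pKa constants.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residues counted as aromatic in this sketch (an assumption).
AROMATIC = frozenset("FWY")


def _clean(sequence):
    """Normalize a sequence and reject empty or non-standard input."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return seq


def aromaticity(sequence):
    """Fraction of residues that are F, W or Y."""
    seq = _clean(sequence)
    return sum(aa in AROMATIC for aa in seq) / len(seq)


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over the whole sequence."""
    seq = _clean(sequence)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def fragment_features(sequence):
    """Collect the two properties into a feature dictionary."""
    return {"aromaticity": aromaticity(sequence), "hydropathy": gravy(sequence)}


if __name__ == "__main__":
    # Hypothetical fragment, used only to show the call.
    print(fragment_features("MKWVTFISLLFLFSSAYS"))
```

Feature dictionaries of this form could be assembled into a table and passed to any standard classifier; which model and which feature definitions reproduce the reported accuracies is documented in the notebooks, not in this sketch.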
We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets.
Contact: ebrunk@ucsd.edu or johanr@biotech.kth.se.
Supplementary data are available at Bioinformatics online.