Machine learning in computational biology to accelerate high-throughput protein expression.

Authors

Sastry Anand, Monk Jonathan, Tegel Hanna, Uhlen Mathias, Palsson Bernhard O, Rockberg Johan, Brunk Elizabeth

Affiliations

Department of Bioengineering, University of California, San Diego, CA, USA.

KTH - Royal Institute of Technology, Department of Proteomics and Nanobiotechnology, SE-106 91 Stockholm, Sweden.

Publication information

Bioinformatics. 2017 Aug 15;33(16):2487-2495. doi: 10.1093/bioinformatics/btx207.

DOI:10.1093/bioinformatics/btx207

PMID:28398465

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5870730/

Abstract

MOTIVATION

The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility.

RESULTS

Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation.
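The three key properties named above can each be computed directly from an amino acid sequence. The following is a minimal sketch, not the authors' implementation: the abstract does not say how the features were calculated, so it assumes the common definitions (aromaticity as the fraction of Phe, Trp and Tyr residues; hydropathy as the mean Kyte-Doolittle value; isoelectric point as the pH at which the estimated net charge is zero). The pKa table and all function names are illustrative assumptions.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Illustrative pKa values (an assumption; published tables differ slightly).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def aromaticity(seq):
    """Fraction of residues that are Phe, Trp or Tyr."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)


def gravy(seq):
    """Mean Kyte-Doolittle hydropathy; raises KeyError on non-standard residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def net_charge(seq, ph):
    """Estimated net charge at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # free C-terminus
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def fragment_features(seq):
    """Collect the three features for one protein fragment."""
    seq = seq.upper()
    return {
        "aromaticity": aromaticity(seq),
        "hydropathy": gravy(seq),
        "isoelectric_point": isoelectric_point(seq),
    }
```

Calling `fragment_features("MKTAYIAKQR")` on a made-up sequence returns a dictionary of the three values, which could then serve as inputs to a classifier of the reader's choice. The classifier itself is left out here because the abstract does not name the model used.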

AVAILABILITY AND IMPLEMENTATION

We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets.

CONTACT

ebrunk@ucsd.edu or johanr@biotech.kth.se.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
