Genomic Sciences Center RIKEN, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan.
BMC Bioinformatics. 2010 Mar 1;11:113. doi: 10.1186/1471-2105-11-113.
Efficient dissection of large proteins into their structural domains is critical for high-throughput proteome analysis. So far, no study has focused on mathematically modeling a protein dissection protocol as a production system. Here, we report a mathematical model for empirically optimizing the cost of large-scale domain production in proteomics research.
The model computes the expected number of successfully produced soluble domains, using a conditional probability that relates domain identification to boundary identification. Typical values for the model's parameters were estimated from the experimental results of identifying soluble domains in the 2,032 Kazusa HUGE protein sequences. Among the 215 fragments corresponding to the 24 domains that were expressed correctly, 111, corresponding to 18 domains, were soluble. Our model indicates that, under the conditions used in our pilot experiment, the probability of correctly predicting the existence of a domain was 81% (175/215) and that of correctly predicting its boundary was 63% (111/175). Under these conditions, the most cost- and effort-effective way to produce soluble domains was to prepare one to seven fragments per predicted domain.
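The arithmetic behind these figures can be illustrated with a minimal sketch. It is not the model from the paper, whose equations are not given in this abstract. It assumes, purely for illustration, that each fragment of a correctly predicted domain is soluble independently with the reported boundary probability; the function names `expected_soluble_per_domain` and `fragments_per_soluble_domain` are hypothetical.

```python
# Illustrative sketch only: the independence assumption and the function
# names below are NOT taken from the paper.

# Probabilities as reported in the abstract.
P_DOMAIN = 175 / 215    # domain existence correctly predicted (about 81%)
P_BOUNDARY = 111 / 175  # boundary correctly predicted (about 63%)


def expected_soluble_per_domain(n_fragments, p_domain=P_DOMAIN,
                                p_boundary=P_BOUNDARY):
    """Expected chance of obtaining at least one soluble fragment for one
    predicted domain when n_fragments are tested.

    Assumption (hypothetical): fragments succeed independently with
    probability p_boundary, given that the domain really exists.
    """
    if n_fragments < 0:
        raise ValueError("n_fragments must be non-negative")
    return p_domain * (1.0 - (1.0 - p_boundary) ** n_fragments)


def fragments_per_soluble_domain(n_fragments):
    """Fragments prepared per expected soluble domain: a crude effort measure."""
    return n_fragments / expected_soluble_per_domain(n_fragments)


if __name__ == "__main__":
    for n in range(1, 8):
        yield_n = expected_soluble_per_domain(n)
        effort = fragments_per_soluble_domain(n)
        print(f"n={n}: expected yield={yield_n:.3f}, "
              f"fragments per soluble domain={effort:.2f}")
```

Under this simplified assumption the expected yield per predicted domain rises quickly with the first few fragments and then saturates near the domain-prediction probability, which is consistent in spirit with the finding that only a small number of fragments per domain is worth preparing.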
Our mathematical modeling of protein dissection protocols indicates that the optimum number of fragments tested per domain is much smaller than expected a priori. The model is not limited to protein dissection; it can also be used to design various large-scale mutational analyses or screening libraries.