Whole protein sequencing and quantification without proteolysis, terminal residue cleavage, or purification: A computational model.

Author information

Sampath G

Publication information

bioRxiv. 2024 Mar 19:2024.03.13.584825. doi: 10.1101/2024.03.13.584825.

DOI:10.1101/2024.03.13.584825

PMID:38558980

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC10980043/

Abstract

Sequencing and quantification of whole proteins in a sample without separation, terminal residue cleavage, or proteolysis are modeled computationally. Similar to recent work on DNA sequencing ( , 5233-5238, 2016), a high-volume conjugate is attached to every instance of amino acid (AA) type AA_i, 1 ≤ i ≤ 20, in an unfolded whole protein, which is then translocated through a nanopore. From the volume excluded by 2L residues in a pore of length L nm (a proxy for the blockade current), a partial sequence containing AA_i is obtained. Translocation is assumed to be unidirectional, with residues exiting the pore at a roughly constant rate of ~1/μs ( , 1130-1139, 2023). The blockade signal is sampled at intervals of 1 μs and digitized with a step precision of 70 nm³; the positions of the AA_i residues are obtained from the positions of well-defined quantum jumps in the signal. This procedure is applied to all 20 standard AA types, and the resulting 20 partial sequences are merged to obtain the whole protein sequence. The complexity of subsequence computation is O(N) for a protein with N residues. The method is illustrated with a sample protein from the human proteome (Uniprot id UP000005640_9606). A mixture of M' protein molecules (including multiple copies) can be sequenced by constructing an M' × 20 array of partial sequences from which proteins occurring multiple times are first isolated and their sequences obtained separately. The remaining M singly-occurring molecules are detected from M disjoint paths through the 20 columns of the reduced M × 20 array. Detection complexity is O(M^20), which is nominally in polynomial time but practical only for small M; to use this method a sample may be subdivided into subsamples down to this level. Quantification of proteins can be done by sorting their computed sequences on the sequence strings and counting the number of duplicates. The possibility of translating this procedure into practice and related implementation issues are discussed.
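
The single-protein part of this procedure can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a noise-free signal in which only the conjugate contributes to the excluded volume, one residue entering and one leaving the pore per sampling step, and a known pore occupancy. The names `TAG_VOLUME`, `WINDOW`, `blockade_signal`, `partial_sequence`, `sequence_protein`, and `quantify`, as well as the test string, are hypothetical and chosen only for illustration. The mixture case with disjoint paths through the M × 20 array is not attempted here.

```python
from itertools import groupby

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types
TAG_VOLUME = 70.0  # assumed excluded volume added by one conjugate (illustrative)
WINDOW = 10        # assumed number of residues resident in the pore at once

def blockade_signal(protein, aa, window=WINDOW):
    """Simulate the signal for one tagged type: total tag volume in the pore per step."""
    n = len(protein)
    signal = []
    for t in range(n + window - 1):
        lo, hi = max(0, t - window + 1), min(n, t + 1)  # residues inside the pore
        signal.append(TAG_VOLUME * protein[lo:hi].count(aa))
    return signal

def partial_sequence(signal, window=WINDOW):
    """Recover the tagged positions from the steps in the signal in one O(N) pass."""
    n = len(signal) - window + 1
    tagged = [0] * n
    prev = 0.0
    for t in range(n):
        # Net step = (tag entering at t) - (tag leaving at t - window).
        step = round((signal[t] - prev) / TAG_VOLUME)
        leaving = tagged[t - window] if t >= window else 0
        tagged[t] = step + leaving
        prev = signal[t]
    return [i for i, flag in enumerate(tagged) if flag]

def sequence_protein(protein):
    """Merge the 20 partial sequences into one string ('?' marks unassigned positions)."""
    merged = ["?"] * len(protein)
    for aa in AMINO_ACIDS:
        for pos in partial_sequence(blockade_signal(protein, aa)):
            merged[pos] = aa
    return "".join(merged)

def quantify(sequences):
    """Sort the computed sequence strings and count the duplicates of each."""
    return {seq: len(list(group)) for seq, group in groupby(sorted(sequences))}

# Made-up test string, not taken from UniProt.
example = "MKTAYIAKQRQISFVKSHFSRQ"
recovered = sequence_protein(example)
counts = quantify([recovered, recovered, sequence_protein("ACDE")])
```

Because a tagged residue can enter the pore at the same step at which another leaves, the decoder adds back the already decoded residue at position `t - window` instead of reading positions from upward steps alone. A real signal would also carry native residue volumes and noise, which this sketch ignores.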
