A new family of powerful multivariate statistical sequence analysis techniques.

Author information

van Heel M

Affiliation

Fritz Haber Institute of the Max Planck Society, Berlin Dahlem, Germany.

Publication information

J Mol Biol. 1991 Aug 20;220(4):877-87. doi: 10.1016/0022-2836(91)90360-i.

DOI:10.1016/0022-2836(91)90360-i
PMID:1880802
Abstract

A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. Such typical invariant function is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, whose analysis already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems such as: assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project.
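
As a rough illustration of the invariant function the abstract describes, the sketch below builds a 20 x 20 histogram of amino acid pair occurrences for a sequence and compares two such histograms without any alignment. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not say how pairs are defined or how histograms are compared, so the use of adjacent residues (`gap=1`), the residue ordering in `AMINO_ACIDS`, the frequency normalization, and the Euclidean distance are all assumptions, and the function names are hypothetical.

```python
# Assumed ordering of the 20 standard residues (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pair_histogram(sequence, gap=1):
    """Count residue pairs (sequence[i], sequence[i + gap]) in a 20 x 20 table.

    Assumption: a "pair" is two residues `gap` positions apart (gap >= 1).
    Pairs containing a non-standard symbol (for example X) are skipped.
    """
    counts = [[0] * 20 for _ in range(20)]
    seq = sequence.upper()
    for first, second in zip(seq, seq[gap:]):
        if first in INDEX and second in INDEX:
            counts[INDEX[first]][INDEX[second]] += 1
    return counts


def to_frequencies(counts):
    """Flatten the table into a 400-element vector of relative frequencies."""
    flat = [c for row in counts for c in row]
    total = sum(flat)
    return [c / total for c in flat] if total else [0.0] * len(flat)


def histogram_distance(seq_a, seq_b):
    """Euclidean distance between two pair-frequency vectors (an assumed metric)."""
    a = to_frequencies(pair_histogram(seq_a))
    b = to_frequencies(pair_histogram(seq_b))
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    print(histogram_distance("ACDACDACD", "ACDACDACD"))
    print(histogram_distance("ACDACDACD", "WYWYWYWYW"))
```

Because each sequence is reduced to a fixed-length vector, sequences of different lengths can be compared directly, which is the property that lets the approach sidestep intersequence alignment.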
