School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane City, QLD 4072, Australia.
Systems and Computational Biology, Bio21 Institute, University of Melbourne, Parkville, VIC 3052, Australia.
Bioinformatics. 2023 Jul 1;39(7). doi: 10.1093/bioinformatics/btad402.
Advances in sequencing have made the discovery of new proteins far outpace the capacity and resources available to characterize their functions experimentally. LEGO-CSM (Localization, EC numbers, and GO terms with the structure-based Cutoff Scanning Matrix) is a comprehensive web-based resource that addresses this gap. It combines well-established graph-based signatures with supervised learning models, using both protein sequence and structure information, to predict protein function in terms of subcellular localization, Enzyme Commission (EC) numbers, and Gene Ontology (GO) terms.
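The abstract does not spell out how the graph-based signatures are built. As a rough illustration of the general cutoff-scanning idea only, the sketch below counts how many pairs of points in a structure lie within each of a series of increasing distance cutoffs, which gives a fixed-length vector that a supervised model can consume. The function name, the toy coordinates, and the cutoff range and step are all assumptions made for this example; they are not LEGO-CSM's actual parameters or implementation.

```python
import math
from itertools import combinations


def cutoff_scanning_signature(coords, d_min=2.0, d_max=6.0, step=2.0):
    """Return cumulative pair counts for each distance cutoff.

    coords: list of (x, y, z) tuples, e.g. one representative atom per residue.
    d_min, d_max, step: hypothetical scanning range in angstroms (assumed values).
    """
    # All pairwise Euclidean distances between the points.
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]

    n_steps = int(round((d_max - d_min) / step))
    signature = []
    for i in range(n_steps + 1):
        cutoff = d_min + i * step
        # Number of pairs no farther apart than the current cutoff.
        signature.append(sum(1 for d in distances if d <= cutoff))
    return signature


# Toy "structure": four points on a 3 x 4 rectangle (illustrative only).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
signature = cutoff_scanning_signature(toy_coords)
print(signature)
```

Because the counts depend only on pairwise distances, a signature of this kind does not change when the structure is rotated or translated, which is one reason distance-pattern features suit structure-based learning.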
We show that our models perform as well as or better than alternative approaches, achieving an area under the receiver operating characteristic curve (AUC) of up to 0.93 for subcellular localization, up to 0.93 for EC numbers, and up to 0.81 for GO terms on independent blind tests.
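For readers less familiar with the metric, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are made-up values for illustration, not results from LEGO-CSM, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = has the annotation) and predicted scores.
example_labels = [1, 1, 0, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(example_labels, example_scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the reported figures of 0.81 to 0.93 indicate that positives are ranked above negatives in most pairs.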
LEGO-CSM's web server is freely available at https://biosig.lab.uq.edu.au/lego_csm. In addition, all datasets used to train and test LEGO-CSM's models can be downloaded at https://biosig.lab.uq.edu.au/lego_csm/data.