ProFeatX: A parallelized protein feature extraction suite for machine learning.

Authors

Guevara-Barrientos David, Kaundal Rakesh

Affiliations

Department of Computer Science, College of Science, Utah State University, Logan, UT, USA.

Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, Logan, UT, USA.

Publication information

Comput Struct Biotechnol J. 2022 Dec 29;21:796-801. doi: 10.1016/j.csbj.2022.12.044. eCollection 2023.

Abstract

Machine learning algorithms have been successfully applied in proteomics, genomics and transcriptomics, and have helped the biological community answer complex questions. However, most machine learning methods require large amounts of data, with every data point having the same vector size. Biological sequence data, such as proteins, are amino acid sequences of variable length, which makes it essential to extract a fixed number of features from every protein before it can be used as input to machine learning models. There are numerous methods to achieve this, but only a few tools let researchers encode their proteins using multiple schemes without having to use different programs or, in many cases, code these algorithms themselves, or even come up with new algorithms. In this work, we created ProFeatX, a tool that contains 50 encodings to extract protein features quickly and efficiently, supporting desktop as well as high-performance computing environments. It can also encode concatenated features for protein-protein interactions. The tool has an easy-to-use web interface, allowing non-experts to use feature extraction techniques, as well as a stand-alone version for advanced users. ProFeatX is implemented in C++ and available on GitHub at https://github.com/usubioinfo/profeatx. The web server is available at http://bioinfo.usu.edu/profeatx/.
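
To illustrate the fixed-length requirement described in the abstract, the following is a minimal Python sketch of one widely used encoding, amino acid composition, which maps a protein of any length to a 20-entry vector. It is not ProFeatX code: ProFeatX itself is written in C++, and the abstract does not say which 50 encodings it includes or how they are implemented. The function names `amino_acid_composition` and `encode_pair` are hypothetical, and the pair encoding simply assumes that "concatenated features" means joining the two per-protein vectors end to end.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector lines up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    The output always has 20 entries, whatever the sequence length.
    Non-standard symbols (e.g. X, B, Z) are ignored in this sketch.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        # Empty or fully non-standard input: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def encode_pair(seq_a: str, seq_b: str) -> list[float]:
    """Concatenate two per-protein vectors into one 40-entry vector
    (an assumed reading of 'concatenated features' for interactions)."""
    return amino_acid_composition(seq_a) + amino_acid_composition(seq_b)


if __name__ == "__main__":
    short = "MKV"
    longer = "MKVLAAGIVGLLLAQ"
    print(len(amino_acid_composition(short)))   # 20
    print(len(amino_acid_composition(longer)))  # 20
    print(len(encode_pair(short, longer)))      # 40
```

Both sequences yield vectors of the same size despite their different lengths, which is the property a downstream model needs. Richer encodings capture order or physicochemical information that simple composition discards.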

Graphical abstract: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b9dc/9842958/4a556952ae7a/ga1.jpg
