PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences.

Affiliations

Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.

School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji.

Publication information

Bioinformatics. 2019 Oct 1;35(19):3831-3833. doi: 10.1093/bioinformatics/btz165.

DOI:10.1093/bioinformatics/btz165

PMID:30850831

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6761934/

Abstract

MOTIVATION

Extracting a useful feature set that contains significant discriminatory information is a critical step in representing sequence data effectively for predicting the structure, function, interactions and expression of proteins, DNAs and RNAs. Filtering for informative features and avoiding sparsity in the extracted features also requires efficient feature selection techniques. Here we present PyFeat, a practical and easy-to-use toolkit implemented in Python for extracting various features from proteins, DNAs and RNAs. In building PyFeat, we focused mainly on features that capture the interaction of neighboring residues, so as to provide more local information. We then employ the AdaBoost technique to select the features with maximum discriminatory information. In this way, we significantly reduce the number of extracted features and enable PyFeat to represent combinations of effective features across a large range of neighboring residues. As a result, PyFeat can extract features using 13 different techniques and represent context-free combinations of effective features. The source code for the standalone PyFeat toolkit, the benchmarks employed, and a comprehensive user manual explaining its system and workflow step by step are publicly available.
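To make the idea of features that "capture the interaction of neighboring residues" concrete, here is a minimal sketch of one such feature family: normalized counts of residue pairs separated by a fixed gap. This is an illustration only, not PyFeat's own code. The function name `gapped_pair_features`, the `max_gap` parameter, the feature naming scheme and the DNA alphabet are all assumptions made for this example. The AdaBoost-based selection step is not shown.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"  # assumed alphabet; use "ACGU" for RNA or the 20 amino acids for proteins


def gapped_pair_features(seq, max_gap=2, alphabet=DNA):
    """Hypothetical helper: frequencies of residue pairs X..Y with 0..max_gap residues between them.

    Feature names use one underscore per skipped position, e.g. "AC", "A_C", "A__C".
    """
    seq = seq.upper()
    features = {}
    for gap in range(max_gap + 1):
        step = gap + 1  # distance between the two paired positions
        n_pairs = len(seq) - step  # number of position pairs at this distance
        pairs = Counter((seq[i], seq[i + step]) for i in range(n_pairs))
        total = max(n_pairs, 1)  # avoid division by zero on very short sequences
        # Emit every possible pair so all sequences share the same feature vector layout.
        # Pairs containing symbols outside the alphabet (e.g. "N") count toward the
        # total but get no feature of their own.
        for x, y in product(alphabet, repeat=2):
            features[f"{x}{'_' * gap}{y}"] = pairs[(x, y)] / total
    return features
```

For a DNA sequence this produces 16 features per gap value, so `gapped_pair_features("ACGTAC", max_gap=1)` returns 32 entries. The feature count grows with the gap range, which is why a subsequent selection step, such as the AdaBoost-based one the abstract mentions, is useful for pruning uninformative features.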

RESULTS

https://github.com/mrzResearchArena/PyFeat/blob/master/RESULTS.md.

AVAILABILITY AND IMPLEMENTATION

Toolkit, source code and manual to use PyFeat: https://github.com/mrzResearchArena/PyFeat/.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
