Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
School of Engineering and Physics, University of the South Pacific, Private Mail Bag, Laucala Campus, Suva, Fiji.
Bioinformatics. 2019 Oct 1;35(19):3831-3833. doi: 10.1093/bioinformatics/btz165.
Extracting a feature set that carries significant discriminatory information is a critical step in presenting sequence data effectively for predicting the structure, function, interaction and expression of proteins, DNAs and RNAs. Filtering for informative features and avoiding sparsity in the extracted features also requires efficient feature selection techniques. Here we present PyFeat, a practical and easy-to-use toolkit implemented in Python for extracting various features from proteins, DNAs and RNAs. In building PyFeat, we focused mainly on features that capture information about the interaction of neighboring residues, so as to provide more local information. We then employ the AdaBoost technique to select the features with maximum discriminatory information. In this way, we significantly reduce the number of extracted features and enable PyFeat to represent combinations of effective features from large neighborhoods of residues. As a result, PyFeat extracts features using 13 different techniques and represents context-free combinations of effective features. The source code for the standalone PyFeat toolkit, the benchmarks employed, and a comprehensive user manual explaining its system and workflow step by step are publicly available.
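To make the idea of features that capture the interaction of neighboring residues more concrete, the following is a minimal sketch of one such descriptor: the frequency of residue pairs separated by a fixed gap. It is an illustration only and is not taken from the PyFeat source; the function name `gapped_pair_features`, its parameters and the normalisation are assumptions chosen for this example.

```python
from itertools import product


def gapped_pair_features(seq, alphabet="ACGT", max_gap=2):
    """Frequency of residue pairs (a, b) with 0..max_gap residues between them.

    Hypothetical helper for illustration, not part of the PyFeat API.
    Returns a dict keyed by (a, gap, b); each value is the pair count divided
    by the number of positions at which a pair with that gap can start.
    """
    seq = seq.upper()
    features = {}
    for gap in range(max_gap + 1):
        span = gap + 1  # index distance between the two residues of a pair
        counts = {pair: 0 for pair in product(alphabet, repeat=2)}
        windows = max(len(seq) - span, 0)
        for i in range(windows):
            pair = (seq[i], seq[i + span])
            if pair in counts:  # ignore symbols outside the alphabet, e.g. N
                counts[pair] += 1
        for (a, b), n in counts.items():
            features[(a, gap, b)] = n / windows if windows else 0.0
    return features


# Example: 16 pair types x 2 gap values = 32 features for a DNA sequence.
features = gapped_pair_features("ACGTACGT", max_gap=1)
```

Increasing `max_gap` widens the neighborhood that the descriptor covers, but the number of features grows with it, which is why a selection step is useful afterwards. For proteins, the same sketch would be called with a 20-letter amino acid alphabet.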
https://github.com/mrzResearchArena/PyFeat/blob/master/RESULTS.md.
Toolkit, source code and manual to use PyFeat: https://github.com/mrzResearchArena/PyFeat/.
Supplementary data are available at Bioinformatics online.
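The AdaBoost-based selection step mentioned in the abstract can be sketched roughly as follows. This is an assumption-laden illustration that uses scikit-learn's `AdaBoostClassifier` and ranks columns by its `feature_importances_` attribute. The abstract does not say how PyFeat configures or thresholds AdaBoost, so the helper `select_features`, the `n_keep` parameter and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier


def select_features(X, y, n_keep, n_estimators=50, seed=0):
    """Return the indices of the n_keep columns AdaBoost weights most highly.

    Hypothetical helper for illustration, not part of the PyFeat API.
    """
    model = AdaBoostClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X, y)
    importances = model.feature_importances_
    # Sort columns by importance, highest first, and keep the first n_keep.
    top = np.argsort(importances)[::-1][:n_keep]
    return np.sort(top), importances


# Synthetic data (an assumption for the demo): only column 3 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 3] > 0).astype(int)

kept, importances = select_features(X, y, n_keep=3)
X_reduced = X[:, kept]  # feature matrix restricted to the selected columns
```

In this toy setting the informative column receives the highest importance and the remaining slots are filled by uninformative columns. On real sequence features, one would choose `n_keep` or an importance threshold by cross-validation rather than fixing it in advance.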