Interpol: An R package for preprocessing of protein sequences.

Affiliation

Department of Bioinformatics, Center for Medical Biotechnology, University of Duisburg-Essen, Universitaetsstr. 2, 45141 Essen, Germany.

Publication information

BioData Min. 2011 Jun 17;4:16. doi: 10.1186/1756-0381-4-16.

Abstract

BACKGROUND

Most machine learning techniques currently applied in the literature require input data of fixed dimensionality. Real input data frequently violate this requirement: DNA and protein sequences, for example, often differ in length because of insertions and deletions. In addition, numerical encoding of amino acids often improves performance in classification and regression compared with the commonly used sparse encoding.

RESULTS

The software "Interpol" encodes amino acid sequences as numerical descriptor vectors using a database that currently holds 532 descriptors (mainly from AAindex), and normalizes sequences to a uniform length with one of five linear or non-linear interpolation algorithms. Interpol is distributed as an open-source, platform-independent R package. It is typically used to preprocess amino acid sequences for classification or regression.
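The two preprocessing steps described here, descriptor encoding followed by length normalization, can be illustrated with a minimal sketch. This is not Interpol's API and it is written in Python rather than R. The function names `encode` and `resample_linear` are hypothetical, the Kyte-Doolittle hydropathy scale is assumed as a single example descriptor, and only plain linear interpolation is shown, not the package's five algorithms.

```python
# Kyte-Doolittle hydropathy values, assumed here as one example descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each residue of an amino acid sequence to its descriptor value."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


def resample_linear(values, target_len):
    """Linearly interpolate `values` onto `target_len` evenly spaced points.

    The first and last values are preserved; interior points are weighted
    averages of their two nearest neighbours in the original vector.
    """
    if target_len < 2:
        raise ValueError("target_len must be at least 2")
    if not values:
        raise ValueError("values must not be empty")
    n = len(values)
    if n == 1:
        # A single value carries no slope, so it is simply repeated.
        return [values[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # position on the original axis
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# Two sequences of different length end up as vectors of the same length.
short = resample_linear(encode("ACDEFG"), 10)
long_ = resample_linear(encode("ACDEFGHIKLMNPQ"), 10)
print(len(short), len(long_))  # prints "10 10"
```

After this step every sequence is a fixed-length numeric vector, which is the form of input that most classifiers and regression models expect.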

CONCLUSIONS

The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and it will in many cases improve their performance in classification and regression.
