Diaz-Uriarte Ramón
Statistical Computing Team, Structural Biology and Biocomputing Programme, Spanish National Cancer Center (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain.
BMC Bioinformatics. 2007 Sep 3;8:328. doi: 10.1186/1471-2105-8-328.
Microarray data are often used for patient classification and gene selection. An appropriate tool for end users and biomedical researchers should combine user friendliness with statistical rigor: it should carefully avoid selection biases, allow the analysis of multiple solutions, and give access to additional functional information about the selected genes. Methodologically, such a tool is more useful if it incorporates state-of-the-art computational approaches and makes its source code available.
We have developed GeneSrF, a web-based tool, and varSelRF, an R package, that implement, in the context of patient classification, a validated method for selecting very small sets of genes while preserving classification accuracy. Computation is parallelized, so the software can take advantage of multicore CPUs and clusters of workstations. Output includes bootstrapped estimates of the prediction error rate and assessments of the stability of the solutions. Clickable tables link to additional information for each gene (GO terms, PubMed citations, KEGG pathways), and output can be sent to PaLS to examine the PubMed references, GO terms, and KEGG and Reactome pathways that are characteristic of the sets of genes selected for class prediction. The full source code is available, so the software can be extended. The web-based application is available from http://genesrf2.bioinfo.cnio.es. All source code is available from Bioinformatics.org or The Launchpad. The R package is also available from CRAN.
varSelRF and GeneSrF implement a validated method for gene selection that includes bootstrap estimates of the classification error rate. They are valuable tools for applied biomedical researchers, especially for exploratory work with microarray data. Because of the underlying technology (parallelized computation combined with a web-based application), they are also of methodological interest to bioinformaticians and biostatisticians.
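To illustrate how a bootstrap estimate of the error rate can avoid selection bias, here is a minimal sketch in Python. It is not the varSelRF implementation: the gene ranking (absolute difference of class means) and the nearest-centroid classifier are simple stand-ins for the random-forest-based method, the estimator shown is a plain out-of-bag bootstrap error, and all function names and the toy data are hypothetical. The point it shows is that gene selection is repeated inside every bootstrap sample, so the samples used for testing never influence which genes are chosen.

```python
import random
from statistics import mean


def select_features(X, y, k):
    """Rank features by absolute difference in class means; keep the top k.

    Stand-in for the actual gene selection method (assumes two classes).
    """
    a, b = sorted(set(y))[:2]
    scores = []
    for j in range(len(X[0])):
        mean_a = mean(row[j] for row, label in zip(X, y) if label == a)
        mean_b = mean(row[j] for row, label in zip(X, y) if label == b)
        scores.append((abs(mean_a - mean_b), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, features):
    """Compute one centroid per class, restricted to the selected features."""
    centroids = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]
    return centroids


def predict(centroids, features, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
    return min(centroids, key=dist)


def bootstrap_error(X, y, k=2, n_boot=50, seed=0):
    """Out-of-bag bootstrap error, redoing feature selection in every resample."""
    rng = random.Random(seed)
    n = len(X)
    errors = []
    for _ in range(n_boot):
        in_bag = [rng.randrange(n) for _ in range(n)]
        in_bag_set = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in in_bag_set]
        Xb = [X[i] for i in in_bag]
        yb = [y[i] for i in in_bag]
        # Skip degenerate resamples: nothing to test on, or only one class drawn.
        if not out_of_bag or len(set(yb)) < 2:
            continue
        # Selection and training see only the in-bag samples.
        features = select_features(Xb, yb, k)
        centroids = fit_centroids(Xb, yb, features)
        wrong = sum(predict(centroids, features, X[i]) != y[i] for i in out_of_bag)
        errors.append(wrong / len(out_of_bag))
    if not errors:
        raise ValueError("no usable bootstrap resamples")
    return mean(errors)


def make_toy_data(n_per_class=10, n_features=10, seed=1):
    """Toy expression matrix: features 0 and 1 differ between classes, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += 5.0 * label
            row[1] += 5.0 * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("selected on all data:", sorted(select_features(X, y, 2)))
    print("bootstrap error estimate:", bootstrap_error(X, y))
```

Running selection once on the full data set and then resampling only the classifier would let the test samples leak into the choice of genes, which is the selection bias the abstract warns about. Recording which features are chosen in each resample would also give a simple view of the stability of the solutions.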