Yang Zheng Rong
Department of Computer Science, Exeter University, United Kingdom.
Bioinformatics. 2005 Jun 1;21(11):2644-50. doi: 10.1093/bioinformatics/bti404. Epub 2005 Mar 29.
Although the outbreak of severe acute respiratory syndrome (SARS) is currently over, the disease is expected to return. A critical challenge for scientists across disciplines worldwide is to study the specificity of the cleavage activity of SARS-related coronavirus (SARS-CoV) and to use the knowledge obtained to design effective inhibitors against the disease. The most commonly used inductive programming methods for knowledge discovery from data assume that the elements of input patterns are orthogonal to each other. If a sub-sequence is denoted as P2-P1-P1'-P2', a conventional inductive programming method may produce a rule like 'if P1 = Q, then the sub-sequence is cleaved, otherwise non-cleaved'. If the site P1 is not orthogonal to the others (for instance, P2, P1' and P2'), the predictive power of this kind of rule may be limited. This study therefore aims to develop a novel method for constructing non-orthogonal decision trees for mining protease data.
Eighteen coronavirus polyprotein sequences were downloaded from NCBI (http://www.ncbi.nlm.nih.gov). Among these sequences, 252 cleavage sites were experimentally determined. The sequences were scanned using a sliding window of size k to generate about 50,000 k-mer sub-sequences (k-mers for short). The value of k varies from 4 to 12 in steps of two. The bio-basis function proposed by Thomson et al. is used to transform the k-mers into a high-dimensional numerical space, to which an inductive programming method is applied to derive a decision tree. This transformation is referred to as bio-mapping. The constructed decision trees select about 10 of the 50,000 k-mers. This small set of selected k-mers is regarded as a set of decisive templates. Non-orthogonal decision trees are thus constructed from the selected templates, and the prediction accuracy is significantly improved.
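The sliding-window scan and the bio-mapping step can be illustrated with a minimal Python sketch. This is not the author's implementation. The abstract does not state the formula, so the code assumes a commonly cited form of the bio-basis function, exp(gamma * (s(x, t) - s(t, t)) / s(t, t)), where s sums pairwise residue similarity scores. The function names, the `gamma` parameter and the `toy_score` scoring scheme (2 for a match, -1 for a mismatch) are hypothetical placeholders; the scoring scheme stands in for a real amino-acid substitution matrix.

```python
import math


def kmers(sequence, k):
    """Slide a window of size k along the sequence, one residue at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_score(a, b):
    """Placeholder residue similarity (assumption): 2 for a match, -1 otherwise.

    A real application would look the pair up in a substitution matrix.
    """
    return 2 if a == b else -1


def similarity(x, y, score=toy_score):
    """Sum the residue-by-residue scores of two equal-length k-mers."""
    return sum(score(a, b) for a, b in zip(x, y))


def bio_basis(x, template, gamma=1.0, score=toy_score):
    """Assumed bio-basis form: 1.0 when x equals the template, smaller otherwise."""
    s_xt = similarity(x, template, score)
    s_tt = similarity(template, template, score)
    return math.exp(gamma * (s_xt - s_tt) / s_tt)


def bio_map(x, templates, gamma=1.0):
    """Map one k-mer to a numeric vector with one coordinate per template."""
    return [bio_basis(x, t, gamma) for t in templates]


if __name__ == "__main__":
    windows = kmers("TSAVLQSGFR", 4)      # hypothetical sequence fragment
    templates = ["AVLQ", "LQSG"]          # hypothetical decisive templates
    for w in windows:
        print(w, [round(v, 3) for v in bio_map(w, templates)])
```

Each k-mer becomes a vector of similarities to whole templates, so a decision-tree split on one coordinate depends on all positions of the k-mer jointly rather than on a single site such as P1.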