Department of Life Science Informatics, B-IT, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany.
J Chem Inf Model. 2010 Nov 22;50(11):1935-40. doi: 10.1021/ci100319n. Epub 2010 Oct 20.
The identification of molecular descriptors that contain compound class-specific information is of high relevance in chemoinformatics. A generally applicable way to identify such descriptors is to determine and compare their information content in a given compound activity class and in large databases where the vast majority of compounds do not have the desired activity. For this purpose, the Shannon entropy concept from information theory can in principle be employed. However, previous adaptations of this concept for descriptor profiling are insufficient to select discriminatory descriptors for data sets that dramatically differ in size. Therefore, we introduce a methodology to reliably select such descriptors by transforming the previously introduced differential Shannon entropy formalism into mutual information analysis, another concept from information theory. The newly introduced approach is evaluated by descriptor ranking and correlation analysis on 168 compound activity classes.
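As a rough illustration of the idea described above, the following sketch computes the Shannon entropy of a binned descriptor and the mutual information between the descriptor bin and class membership (activity class versus database). It is not the authors' exact formalism. The function names, the equal-width binning, and the `prior_active` parameter are assumptions made for this example. By default the class prior is taken from the relative data set sizes, and a fixed prior can be passed instead when the two sets differ greatly in size.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Count descriptor values in n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so that v == hi (or values outside the range) fall in an edge bin.
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return counts


def normalize(counts):
    """Turn bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon_entropy(probs):
    """Shannon entropy in bits; empty bins contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(active_counts, database_counts, prior_active=None):
    """I(bin; class) = H(bin) - H(bin | class) for two classes.

    prior_active is the assumed probability of the activity class.
    If omitted, it is estimated from the relative data set sizes.
    """
    n_a, n_d = sum(active_counts), sum(database_counts)
    if prior_active is None:
        prior_active = n_a / (n_a + n_d)
    p_a = normalize(active_counts)
    p_d = normalize(database_counts)
    # Marginal bin distribution: prior-weighted mixture of the two classes.
    mixed = [prior_active * a + (1 - prior_active) * d for a, d in zip(p_a, p_d)]
    h_conditional = (prior_active * shannon_entropy(p_a)
                     + (1 - prior_active) * shannon_entropy(p_d))
    return shannon_entropy(mixed) - h_conditional
```

A descriptor whose value distribution is the same in the activity class and the database yields a mutual information close to zero, while a descriptor whose distributions do not overlap reaches the maximum for the chosen prior (1 bit at a prior of 0.5). Ranking descriptors by this value is one way to approximate the selection described in the abstract.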