Applications Of Data Mining Techniques To Text Classification And Bioinformatics

Posted on: 2009-05-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z L Pei
GTID: 1118360245463130
Subject: Computer application technology
Abstract/Summary:
Data mining is the process of extracting hidden, potentially useful information and knowledge from massive, incomplete, noisy, fuzzy and random data. It is an interdisciplinary subject that draws on machine learning, statistics, artificial intelligence, artificial neural networks, databases, pattern recognition, rough sets, fuzzy mathematics, and so on. In this paper, some applications of data mining techniques to text classification and bioinformatics are studied. For text classification, the paper makes three main contributions: an integrated method of feature selection and feature weight evaluation; a feature selection method that takes redundant features into account; and a feature frequency weighting method based on the variable precision rough set. For bioinformatics, it makes two main contributions: a gene annotation method based on the variable precision rough set; and a method for constructing the evolutionary tree of human populations from the SNP frequency data set of the human Y chromosome. The details are as follows:

(1) Considering that most low-frequency words are noise, a method for filtering low-frequency words is proposed. Experimental results show that this method can improve the effectiveness of text classification. Two improved methods are then developed, one for mutual information (MI) based feature selection and one for tf.idf feature weight evaluation. Using the Rocchio, kNN and SVM classifiers, the improved methods are applied to the benchmark text set Reuters-21578 Top10. Numerical results show that the combination of the two improved methods is effective: the macro accuracy, macro recall rate and macro F1 value are all superior to those of other methods.

(2) An important concept is defined, namely the importance degree of feature frequency, based on the real rough set theory.
Based on this concept, a novel weighting method for feature frequency is proposed. It takes the decision information into account when evaluating the contribution of feature frequency, and can therefore obtain more objective evaluation results. Experimental results show that the proposed method improves the distribution of the samples in the sample space, making samples of the same class more compact and samples of different classes more widely separated; the values of macro accuracy, macro recall rate and macro F1 are all significantly improved.

(3) To address the high dimensionality of the feature space and the high feature redundancy in text classification problems, a Mutual Information and Information Entropy Pair Based Feature Selection Method is developed. Using the relationship established between features and classes, redundant features can be greatly reduced according to the mutual entropy of feature pairs. Two different machine learning methods, namely naive Bayes networks and kNN, are applied to the benchmark data sets Reuters-21578 Top10 and WebKB. Experimental results show that the proposed method is more efficient than MI and CHI.

(4) Determining sequence functions by experimental methods is too expensive and cannot be used for large-scale annotation. TOP BLAST is a simple and commonly used computational method. Compared with other computational methods, its precision, recall rate and harmonic mean are all higher, but the absolute values are still low. In this paper, a sequence function annotation method is proposed that uses variable precision rough set theory and is based on the GO database and the BLAST software. The numerical results show that the proposed method obtains higher macro accuracy than TOP BLAST, with a macro recall rate and macro F1 value similar to those of TOP BLAST.

(5) Differences in the order of genome nucleotides reflect the evolutionary distance between populations.
Constructing a phylogenetic tree according to the degree of difference between DNA molecules can verify the evolutionary relationships between populations established by traditional taxonomy. Since single nucleotide polymorphism (SNP) data preserve most of the information in the DNA molecule, and most of the Y chromosome is a non-recombining region with a low mutation rate, the Y chromosome can record evolutionary events faithfully. Therefore, a new method for constructing the evolutionary tree of human populations from the SNP frequency data set of the human Y chromosome is developed in this paper. The numerical results support the "out of Africa" theory. The method offers a new idea for research on human evolution.

To sum up, this paper develops an integrated method combining improved MI with improved feature weighting, a feature selection method that reduces feature redundancy, and a novel feature frequency weighting method based on the real rough set theory. The work enriches the methods of feature selection and feature weight evaluation and brings some new ideas to the key techniques of text classification. This paper also proposes a gene annotation method based on the variable precision rough set, which performs better on noisy data and promotes the realization of automatic annotation. Finally, a new method for constructing the evolutionary tree of human populations from the SNP frequency data set of the human Y chromosome is developed, which supports the well-known "out of Africa" theory and offers a novel idea for research on human evolution.
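
As a rough illustration of the two text classification steps named in item (1), the sketch below filters low-frequency words and then ranks the remaining terms by mutual information with the class label. It uses the textbook MI definition, not the improved MI method of the dissertation, whose details the abstract does not give. The function names, the `min_df` threshold and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def filter_low_frequency(docs, min_df=2):
    """Drop terms that appear in fewer than min_df documents (assumed threshold)."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_df] for doc in docs]


def mutual_information(docs, labels):
    """Textbook MI between term presence/absence and the class label, in bits."""
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()   # number of documents containing the term
    joint = Counter()        # number of documents of a class containing the term
    for doc, label in zip(docs, labels):
        for term in set(doc):
            term_count[term] += 1
            joint[term, label] += 1

    scores = {}
    for term, n_t in term_count.items():
        score = 0.0
        for label, n_c in class_count.items():
            n_tc = joint[term, label]
            # Two cells per class: term present, term absent.
            for n_joint, n_term in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if n_joint > 0:
                    score += (n_joint / n) * math.log2(n_joint * n / (n_term * n_c))
        scores[term] = score
    return scores


def top_terms(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus (hypothetical, not taken from Reuters-21578).
docs = [
    ["oil", "price", "the"],
    ["oil", "barrel", "the"],
    ["wheat", "crop", "the"],
    ["wheat", "grain", "the"],
]
labels = ["crude", "crude", "grain", "grain"]

filtered = filter_low_frequency(docs, min_df=2)
scores = mutual_information(filtered, labels)
selected = top_terms(scores, k=2)
print(selected)
```

In this toy corpus the word "the" occurs in every document and so carries no information about the class, while the words seen only once are removed by the frequency filter before scoring.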
Keywords/Search Tags: data mining, text classification, bioinformatics, feature selection, feature weight, rough set, gene function annotation, population evolution