The Research And Simulation On The Key Techniques Of Text Mining

Posted on: 2015-04-01 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Chen
GTID: 2308330473953963 | Subject: Software engineering
Abstract/Summary:
Text mining is a branch of data mining and an active research area within natural language processing. Faced with large and complex collections of text, it aims to organize and combine information effectively so that it can be retrieved and located accurately, which improves the efficiency with which users find useful data. Based on an analysis of text mining and its overall framework, this paper studies the following three parts of text mining:

(1) Text feature extraction. This paper presents a method based on an improved genetic algorithm that takes full advantage of the global optimization ability of genetic algorithms. First, the mutual information (MI) feature extraction method is applied in calculating the fitness of the genetic algorithm, which strengthens the correlation between the selected text features and the categories and ultimately improves the accuracy of feature extraction. Then the ant colony algorithm is introduced into the selection process of the genetic algorithm to guide the search and reduce its high randomness, which improves the efficiency of the algorithm and saves time. Finally, simulation experiments test the accuracy of the feature extraction results and the execution time of the algorithm in order to evaluate its efficiency.

(2) Text clustering. This paper proposes a method based on an improved ant colony clustering model. The method makes full use of the self-organization of the ant colony clustering algorithm and its insensitivity to the initial order of the input data, and addresses its shortcomings. To solve the convergence problem of ant colony clustering, the aggregation step of hierarchical clustering is added to reconstruct the clusters, and a global memory is added to control the overall process and prevent clustering from proceeding too slowly. At the same time, the parameters are tuned in detail to increase the adaptability of the artificial ants to their environment and ultimately improve the accuracy of the clustering results. Finally, precision, recall, and F1 are evaluated, and the algorithm is shown to be efficient.

(3) Text classification. This paper presents an improved KNN algorithm. KNN is a lazy algorithm that builds its classifier only during the classification process, which lowers classification efficiency. The proposed method optimizes KNN and makes it more efficient by trimming the training sample set. The algorithm is then shown to be more time-efficient than comparable algorithms.
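As an illustration of item (1), the following is a minimal Python sketch of how a mutual information score could serve as the fitness of a genetic-algorithm chromosome. The abstract does not give the exact formula or encoding, so everything here is an assumption made for illustration and not the thesis's implementation: the pointwise definition MI(t, c) = log(P(t, c) / (P(t) P(c))), the bitmask chromosome over a vocabulary, the max-over-categories term score, and the names `mutual_information`, `term_score`, and `fitness`. The ant-colony-guided selection step is not shown.

```python
import math


def mutual_information(docs, labels, term, category):
    """Pointwise MI between a term and a category, estimated from document counts.

    Assumed definition: log(P(t, c) / (P(t) * P(c))).
    Returns -inf when the term and category never co-occur.
    """
    n = len(docs)
    n_t = sum(1 for d in docs if term in d)        # documents containing the term
    n_c = sum(1 for y in labels if y == category)  # documents in the category
    n_tc = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    if n_tc == 0:
        return float("-inf")
    return math.log((n_tc * n) / (n_t * n_c))


def term_score(docs, labels, term):
    """Score a term by its strongest association with any category (assumed choice)."""
    return max(mutual_information(docs, labels, term, c) for c in set(labels))


def fitness(chromosome, vocabulary, docs, labels):
    """Fitness of a feature subset: mean term score of the selected terms.

    `chromosome` is a 0/1 list aligned with `vocabulary` (hypothetical encoding).
    """
    selected = [t for bit, t in zip(chromosome, vocabulary) if bit]
    if not selected:
        return 0.0
    return sum(term_score(docs, labels, t) for t in selected) / len(selected)


# Toy corpus (made up for this sketch): each document is a set of terms.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "team"},
]
labels = ["sport", "sport", "finance", "finance"]
vocabulary = ["goal", "team"]

print(fitness([1, 0], vocabulary, docs, labels))  # only "goal" selected
print(fitness([0, 1], vocabulary, docs, labels))  # only "team" selected
```

In this toy corpus "goal" appears only in sport documents while "team" is spread evenly across both categories, so a chromosome that selects "goal" receives the higher fitness. A genetic algorithm using such a fitness would favor feature subsets whose terms discriminate between categories.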
Keywords/Search Tags: feature extraction, genetic algorithm, ant colony algorithm, text categorization, text clustering