
Design And Implementation Of Chinese WEB Documents Clustering And Classification System

Posted on: 2010-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2178360278459413
Subject: Computer application technology
Abstract/Summary:
Text classification and clustering are among the most valuable technologies in the field of text information processing. They have prompted extensive study of how to organize, manage and process large amounts of text data so that required information can be located swiftly, accurately and comprehensively. As key technologies for organizing and processing large amounts of text data, they can relieve the problem of information explosion to a great extent. They also serve as the technical basis of information retrieval, search engines, text databases, digital libraries and similar applications. With the advent of the information era, text classification and clustering have become increasingly prominent.

This thesis first introduced the theory of Web mining and analyzed its distinguishing characteristics. It also introduced two important branches, clustering and classification, and their relevance to Web mining. Secondly, feature representation and feature weighting in Web documents were studied systematically, and multi-threaded software was developed for parsing HTML files, extracting features, calculating weights, clustering, classification and result visualization.

Four commonly used clustering algorithms were introduced and implemented: the K-means algorithm, the fuzzy C-means (FCM) algorithm, the Hierarchical Agglomerative Clustering (HAC) algorithm and the particle swarm optimization (PSO) algorithm. Principal component analysis (PCA) was also introduced and applied to the high-dimensional data for dimensionality reduction. Finally, the first two principal components of the PCA transformation were selected for planar visualization.

The thesis analyzed the deficiencies of the traditional particle swarm optimization algorithm and proposed a novel particle swarm optimization clustering algorithm based on density and a PSO initialization method. The algorithm not only has global search capability but also takes the distribution of all the data into account from the viewpoint of density. Through its initialization, the algorithm increases convergence speed and improves local search capability. Experiments with simulated data and the real Iris data set showed that the proposed algorithm clusters better than the PSO and K-means algorithms.

Text classification based on the support vector machine (SVM) was also implemented, including the selection of text features, the construction of the classifier and the category decision strategy. Four feature selection methods were compared experimentally. Finally, the overall design of the system and the detailed design of its modules were completed, and the whole system was implemented in Java.
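
For readers unfamiliar with the K-means baseline named in the abstract, the following is a minimal sketch in Java of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its points). It is not taken from the thesis: the class name `KMeansSketch`, the method names, the sample points and the choice of the first k points as initial centroids are all assumptions made for illustration.

```java
import java.util.Arrays;

/** Minimal K-means sketch; every name here is illustrative, not from the thesis. */
class KMeansSketch {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Clusters the points into k groups and returns the cluster index of each point.
     * Assumption: the first k points are used as initial centroids (a naive choice).
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] labels = new int[points.length];
        Arrays.fill(labels, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (labels[p] != best) {
                    labels[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // Assignments are stable, so the centroids would not move.
            }

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[labels[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[labels[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // Leave an empty cluster's centroid where it is.
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical 2-D points forming two well-separated groups.
        double[][] points = {
            {1.0, 1.0}, {9.0, 9.0}, {1.2, 0.8}, {8.8, 9.1}, {0.9, 1.1}
        };
        System.out.println(Arrays.toString(cluster(points, 2, 100))); // prints [0, 1, 0, 1, 0]
    }
}
```

The result of plain K-means depends on where the centroids start, which is why the initial choice is flagged as an assumption here. In a document clustering setting the input vectors would be feature weights rather than 2-D coordinates.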
Keywords/Search Tags: Preprocessing, Feature selection, Text clustering, Text classification, Visualization