
Research On Web Text Clustering And Classification Algorithm

Posted on: 2012-11-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Liu
GTID: 2218330338973122
Subject: Computer software and theory
Abstract/Summary:
In today's rapidly developing information age, with advances in computer technology and the spread of the Internet, the Web has become a huge, widely distributed, global information service center that is integrated into information exchange in every area of life. The Web has become a main channel for exchanging ideas and obtaining information. In daily life, people can access large amounts of Web text at any time, yet little of it is what they actually need. The openness and heterogeneity of the Internet make it harder for users to obtain the necessary information quickly and accurately, and people face the dilemma of an information explosion that yields little knowledge. To manage the vast amount of information on the Web effectively and to provide users with accurate, fast retrieval services, Web data mining has become an important research focus in information retrieval. Web data mining improves traditional data mining methods and applies them to knowledge discovery on the Web. It can be divided into three categories: Web content mining, Web usage mining and Web structure mining. The main techniques used are cluster analysis, classification and prediction, association analysis, and time series analysis. As important means of Web text mining, Web text clustering and classification are widely used in information retrieval and have attracted much attention from researchers in recent years. Research on text clustering and classification abroad started relatively early: the basic ideas of information extraction and text classification were proposed as early as the 1960s. Research abroad has since moved from the research stage to the practical stage, with wide application in e-mail classification, information filtering and similar tasks.
Text mining research in China started relatively late but has achieved some results, mainly in the field of Chinese word segmentation. This thesis discusses the theory of Web text clustering and classification and studies the traditional text clustering and classification algorithms in depth. Because the volume of Web text data is very large, the main idea is to reduce the computational overhead of text clustering and classification, and the traditional k-means clustering algorithm and the KNN classification algorithm are improved accordingly. The main research work includes the following:

(1) The TF-IDF feature word extraction algorithm measures how well a feature word characterizes a document or a document category, based on the correlation between the feature word and the document or the document category. It does not, however, consider the correlation among the feature words themselves, which leads to poor clustering results. This thesis introduces the concept of feature word co-occurrence to describe the relationship between feature words and uses cluster analysis to extract the feature word set. Because this method does not use the category information of the Web document collection, it can be used for feature extraction in Web document clustering.

(2) For clustering-based feature word extraction, the thesis keeps the principle of the traditional k-means algorithm, in which the center of a cluster is its most representative point, and also considers the limitations of k-means regarding isolated points and initial cluster centers. By introducing the concept of neighborhood correlation into the traditional k-means algorithm, a modified k-means algorithm, Dk-means, is proposed and used for Web text feature word extraction.

(3) The k-means clustering algorithm depends on the initial cluster centers. The thesis analyzes the requirements that Web text clustering and clustering-based feature word extraction place on the initial cluster centers, and the factors that cause the traditional k-means algorithm to become trapped in local optima. Particle swarm optimization is introduced to optimize the initial k-means cluster centers, giving an improved k-means algorithm, PSO-k-means.

(4) The thesis analyzes the factors that affect the efficiency and accuracy of the KNN text classifier. Text clustering is applied to the training set: documents are clustered within each category, and each resulting cluster center represents all documents in its cluster, which lowers the computational overhead of the K-nearest-neighbor search and improves KNN classification efficiency. By analyzing the local distribution density of the training documents around the sample to be classified, its distance from the cluster centers, and the clustering of the categories, the thesis determines how cluster centers are weighted during KNN generalization, and it improves the basic strategy so that the value is determined automatically.

Experiments show that Dk-means places cluster centers in regions of high correlation density, so that the resulting clusters have higher internal correlation; that PSO-k-means obtains well-spread initial cluster centers, which reduces the correlation between cluster centers after clustering; and that the improved KNN substantially reduces computational overhead and improves efficiency while preserving classification accuracy as far as possible.
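
The idea in item (4), replacing the training documents of each category with a few cluster centers before running KNN, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the plain k-means with "first k points" initialization, the function names (`kmeans`, `build_centers`, `classify`), the `size / distance` vote weight, and the toy two-dimensional vectors standing in for TF-IDF document vectors are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def assign(points, centers):
    """Group points by the index of their nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
        groups[nearest].append(p)
    return groups


def kmeans(points, k, iters=20):
    """Plain k-means; returns (center, member_count) pairs.

    Initialization uses the first k points (an assumption, chosen so the
    sketch is deterministic; it is not the PSO-based initialization).
    """
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        groups = assign(points, centers)
        # An empty group keeps its previous center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    groups = assign(points, centers)
    return [(c, len(g)) for c, g in zip(centers, groups) if g]


def build_centers(training, clusters_per_class=2):
    """Cluster each category separately; keep only the cluster centers."""
    by_label = defaultdict(list)
    for vec, label in training:
        by_label[label].append(tuple(vec))
    reduced = []
    for label, vecs in by_label.items():
        for center, size in kmeans(vecs, clusters_per_class):
            reduced.append((center, size, label))
    return reduced


def classify(reduced, vec, k=3):
    """Vote among the k nearest centers, weighted by size / distance."""
    nearest = sorted(reduced, key=lambda r: euclidean(vec, r[0]))[:k]
    votes = defaultdict(float)
    for center, size, label in nearest:
        votes[label] += size / (euclidean(vec, center) + 1e-9)
    return max(votes, key=votes.get)


# Toy data: 2-D vectors standing in for document feature vectors.
training = [
    ((0.9, 0.1), "sports"), ((0.8, 0.2), "sports"),
    ((1.0, 0.0), "sports"), ((0.7, 0.3), "sports"),
    ((0.1, 0.9), "finance"), ((0.2, 0.8), "finance"),
    ((0.0, 1.0), "finance"), ((0.3, 0.7), "finance"),
]
reduced = build_centers(training)
print(len(training), "documents reduced to", len(reduced), "centers")
print(classify(reduced, (0.85, 0.15)))
```

In this sketch a query is compared against the cluster centers only, so the cost of the neighbor search grows with the number of centers rather than the number of training documents. How much accuracy is retained depends on the number of clusters per category and on the weighting, which the thesis treats in more detail than this toy weight does.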
Keywords/Search Tags:Web, Text Clustering, Text Classification, Extraction of feature words, k-means, KNN