Research On Data Mining Technologies Applied To Web Chinese Text

Posted on: 2012-07-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhang
Full Text: PDF
GTID: 2218330338965409
Subject: Communication and Information System
Abstract/Summary:
With the explosive growth in the amount of information, finding relevant information is becoming increasingly difficult, and technology for organizing and processing such large amounts of information is badly needed. Web mining, which combines data mining with web technology, emerged in response. Because text is the main component of web information, text mining has become an active research field. Chinese text mining started later than English text mining and still lags behind it, so we take web Chinese text mining as our research object.

We focus on web Chinese text classification and clustering in this paper. Text classification and clustering are the key technologies in text mining. By organizing and classifying text datasets, they can alleviate the problem of information explosion to a great extent. They also serve as the technical basis for information retrieval, search engines, electronic libraries, text databases and similar applications, and they are becoming more and more widely used with the advent of the information era.

The paper first introduces the relevant theory, including data mining, web mining, text mining, and text classification and clustering. Before classification or clustering, the text must be transformed into a form that a computer can handle, so we study the preprocessing that transforms a web text dataset into a matrix. We then apply our proposed methods to Chinese text classification and clustering. We also introduce and implement the common clustering methods, including k-means and fuzzy c-means.

Transforming web text into a matrix takes several steps. First, we remove the HTML markup, filter out irrelevant information and parse out the text. Second, Chinese text differs from English text.
There are no obvious boundaries between words in Chinese text documents, so Chinese word segmentation is the first step in Chinese text preprocessing. Third, the weight of each term is calculated with a weighting equation. The main idea of the weighting is that a word or phrase that appears frequently in one document but rarely in other documents represents the features of that document's class well and discriminates strongly between classes; it is well suited to classification and should be given a higher weight. In this way web text is transformed into a matrix in which each row represents a document and each column represents a unique term, and the matrix can be analyzed by clustering and classification methods.

The paper analyzes the high dimensionality and sparseness of the text matrix, which cause traditional algorithms to fail when clustering such data. To address this problem, we propose two methods for clustering Chinese text: one based on subspaces and the other based on singular value decomposition.

Documents related to a particular topic are characterized by a subset of terms; that is, the clusters have a feature-subspace structure. We therefore adopt the subspace clustering algorithm TCPSO to cluster Chinese texts. The experimental results show that subspace clustering is suitable for Chinese text clustering and more effective than the traditional algorithms.

Singular value decomposition ranks the singular values of the dataset by importance. On the one hand, unimportant dimensions are discarded as "noise"; on the other hand, the dimension of the document matrix is greatly reduced, which improves the accuracy of document clustering. We first apply singular value decomposition to reduce the dimension and then apply the artificial fish optimization algorithm to cluster the Chinese text.
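As an illustration of the term-weighting step described earlier, the following is a minimal sketch that builds a document-term matrix from text that has already been segmented into words. The abstract does not give the thesis's weighting equation, so the TF-IDF formula used here is an assumption, and the function name `tfidf_matrix` and the toy documents are hypothetical.

```python
import math

def tfidf_matrix(docs):
    """Build a document-term matrix from pre-segmented documents.

    docs: list of token lists (word segmentation is assumed to be done).
    Returns (matrix, vocab): one row per document, one column per unique term.
    The weight is an assumed TF-IDF form: tf * log(N / df).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(term in doc for doc in docs) for term in vocab}
    matrix = []
    for doc in docs:
        row = []
        for term in vocab:
            tf = doc.count(term) / len(doc) if doc else 0.0
            row.append(tf * math.log(n_docs / df[term]))
        matrix.append(row)
    return matrix, vocab

# Hypothetical toy corpus: two sports documents and two finance documents.
docs = [
    ["足球", "比赛", "进球"],
    ["足球", "联赛", "比赛"],
    ["股票", "市场", "上涨"],
    ["股票", "基金", "市场"],
]
X, vocab = tfidf_matrix(docs)
```

In this toy corpus, a term that occurs in only one document receives a higher weight in that document than a term shared with another document, which is the behaviour the weighting idea calls for. Real web text would yield a far larger and sparser matrix.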
The simulation results show that the method based on singular value decomposition improves efficiency.

We also implement Chinese text classification based on an improved support vector machine (SVM). The parameters of an SVM have an important impact on its classification capability; if they are not set appropriately, good classification results cannot be obtained. We therefore use particle swarm optimization (PSO) to optimize the penalty constant C and the kernel function parameter g of the SVM in this paper. The simulation results show good generalization ability and classification accuracy.
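The parameter search described above can be sketched with a basic particle swarm optimizer. This is only an illustration under stated assumptions, not the thesis's implementation: the inertia and acceleration constants are conventional choices, the search runs over log10(C) and log10(g), and `surrogate_error` is a hypothetical stand-in for the objective, which in practice would be something like the cross-validation error of an SVM trained with the candidate parameters.

```python
import random

def pso(objective, bounds, n_particles=20, n_iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize objective over a box using a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position of each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position of the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def surrogate_error(params):
    """Hypothetical stand-in for SVM validation error.

    Its minimum is placed at log10(C) = 1 and log10(g) = -2 purely for
    illustration; a real objective would train and validate an SVM.
    """
    log_c, log_g = params
    return (log_c - 1.0) ** 2 + (log_g + 2.0) ** 2

best, err = pso(surrogate_error, bounds=[(-3.0, 3.0), (-5.0, 1.0)])
best_c, best_g = 10 ** best[0], 10 ** best[1]
```

Searching in log space is a common choice because useful values of C and g typically span several orders of magnitude; whether the thesis does the same is not stated in the abstract.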
Keywords/Search Tags: preprocessing, Vector Space Model (VSM), text clustering, text classification