
Text Classification Method Based On Unsupervised Clustering And Naive Bayesian Classifier

Posted on: 2006-07-15
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhu
GTID: 2208360155966444
Subject: Computer software and theory
Abstract/Summary:
In the real world, most of the information we receive comes in the form of books, research papers, newspapers, digital books, Web pages, e-mail and so on. Such information is commonly called text information. It consists of large numbers of documents that come from various data sources and are mainly stored in text databases. Most of the information stored in a text database is semi-structured data, which is neither completely unstructured nor completely structured. It is reported that 80 percent of data is semi-structured. Text databases are growing rapidly because of the swift rise of electronic information. Data mining should be applied to text information in order to extract useful, interesting and potentially valuable patterns, as well as hidden information, from large, heterogeneous and unstructured data sources. This is text mining. With the rapid growth of text data, text mining has become an important research direction in the field of data mining.
Data mining lets us extract or discover knowledge from large amounts of data, and a pattern is one form in which that knowledge is described. Pattern mining is therefore an important part not only of data mining but also of text mining. Classification and clustering are two common methods in pattern mining.
Unsupervised Text Clustering (UTC) is a method that applies Unsupervised Clustering (UC) to text. Given a cluster radius R, the UTC algorithm clusters the texts of each category and yields the cluster centers. The cluster centers can then be used to classify text: we calculate the distance between each text to be classified and each cluster center, and assign the text to the class that corresponds to the nearest cluster center.
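The abstract does not spell out the UTC algorithm itself, so the following is only a minimal sketch of radius-based clustering followed by nearest-center classification. It assumes a simple leader-style scheme (a text joins the nearest existing cluster if it lies within R, otherwise it starts a new cluster), Euclidean distance, and centers kept as running means; the function names `radius_cluster` and `nearest_center` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def radius_cluster(vectors, radius):
    """Assumed leader-style clustering: return the list of cluster centers.

    A vector joins its nearest cluster when that center is within `radius`;
    otherwise it becomes the center of a new cluster.
    """
    centers = []  # running mean of each cluster
    counts = []   # number of vectors absorbed by each cluster
    for v in vectors:
        best, best_d = None, None
        for i, c in enumerate(centers):
            d = euclidean(v, c)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= radius:
            n = counts[best]
            # Update the running mean of the chosen cluster.
            centers[best] = [(c * n + x) / (n + 1)
                             for c, x in zip(centers[best], v)]
            counts[best] = n + 1
        else:
            centers.append(list(v))
            counts.append(1)
    return centers


def nearest_center(vector, centers):
    """Index of the cluster center closest to `vector`."""
    return min(range(len(centers)),
               key=lambda i: euclidean(vector, centers[i]))
```

In this sketch the result depends on the order in which texts are visited and on the choice of R, which is one plausible reason a purely distance-based assignment is fast but not very accurate.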
Classification by the nearest UTC cluster center is fast, but its accuracy is low.
In recent years, Naive Bayes classification has attracted attention because of its solid mathematical foundation and its rich capacity for expressing probability, and especially because it makes good use of prior information. This has made it a hot topic in data mining, where it is widely used.
This thesis analyzes the characteristics of UTC and Naive Bayes classification and, on that basis, proposes a method that can accurately classify unlabeled text. We represent unlabeled text using the Vector Space Model, so each text can be regarded as a point in an n-dimensional vector space. Given a cluster radius R, the UTC algorithm yields a set of class labels together with the positive sample center and the negative sample center. We then take a portion of the texts that lie close to the positive sample center as training texts to train the Naive Bayes classifier. Finally, we pass the texts in the fuzzy region to the trained Naive Bayes classifier to reclassify them. The method not only avoids manual classification but also gives better classification results with higher precision.
The work presented in this thesis is as follows:
1. Describing the text mining process, with emphasis on the clustering and classification techniques used in pattern mining.
2. Analyzing the characteristics of UTC and Naive Bayes classification, and proposing a method, UNBTC, that accurately classifies unlabeled text by combining the two.
3. Building an automatic text classification prototype system (UNBTC) based on the Vector Space Model and on the text mining process.
4. Implementing the UNBTC algorithm in the prototype system and demonstrating its validity for classifying unlabeled text.
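The combination of UTC and Naive Bayes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis implementation: texts are term-count vectors in the Vector Space Model, the two sample centers are given, texts clearly nearer one center are pseudo-labeled from both sides (the abstract only mentions texts near the positive center), the fuzzy region is defined by a hypothetical `margin` on the difference of the two distances, and the classifier is a multinomial Naive Bayes with Laplace smoothing. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_by_confidence(docs, pos_center, neg_center, margin):
    """Pseudo-label texts that are clearly nearer one center.

    Texts whose two distances differ by less than `margin` (an assumed
    threshold) are returned separately as the fuzzy region.
    """
    labeled, fuzzy = [], []
    for d in docs:
        dp, dn = euclidean(d, pos_center), euclidean(d, neg_center)
        if abs(dp - dn) < margin:
            fuzzy.append(d)
        else:
            labeled.append((d, 1 if dp < dn else 0))
    return labeled, fuzzy


def train_nb(labeled, alpha=1.0):
    """Multinomial Naive Bayes on term counts with Laplace smoothing.

    `labeled` must be a non-empty list of (term_count_vector, label) pairs
    with labels 0 (negative) or 1 (positive).
    """
    n_terms = len(labeled[0][0])
    model = {}
    for c in (0, 1):
        rows = [d for d, y in labeled if y == c]
        totals = [sum(r[j] for r in rows) for j in range(n_terms)]
        denom = sum(totals) + alpha * n_terms
        model[c] = {
            # Smoothed class prior, so an empty class never yields log(0).
            "log_prior": math.log((len(rows) + alpha)
                                  / (len(labeled) + 2 * alpha)),
            "log_lik": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict_nb(model, doc):
    """Return the label (0 or 1) with the higher log-posterior score."""
    def score(c):
        return model[c]["log_prior"] + sum(
            x * w for x, w in zip(doc, model[c]["log_lik"]))
    return max(model, key=score)
```

A typical use would call `split_by_confidence` on all texts, train with `train_nb(labeled)`, and then apply `predict_nb` only to the texts in the fuzzy region, so that no manual labeling is needed at any step.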
Keywords/Search Tags:Text Mining, Unsupervised Text Clustering, Naive Bayes Classification, Vector Space Model, Feature Selection