
Research Of Chinese Text Categorization

Posted on: 2007-09-03
Degree: Master
Type: Thesis
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 2178360182485570
Subject: Computer applications
Abstract/Summary:
With the rapid development of the Internet, the Web has grown into a global, massive, distributed, and shared information space, giving people a new way to search for information. But as the volume of information on the Internet explodes, users are flooded with material irrelevant to their requests, and the information they actually need is buried. In this setting, automatic classifiers play an important role in finding the needed information and in making effective use of shared information: by organizing and managing information effectively, they improve the efficiency of information retrieval.
This paper introduces several technologies relevant to text categorization. Word segmentation is the basis of text categorization, and the character matching method and the statistical method are the two commonly used approaches to it. The character matching method is limited by the number of words in its dictionary: new words appear continuously, and a matching method cannot recognize them accurately. This paper therefore proposes combining the character matching method with the statistical method. Strings are first matched against the dictionary, and the words found there are segmented out. At the same time, the statistical method identifies new words that are absent from the dictionary and adds them to it for use in later segmentation. Experiments show that this method improves segmentation accuracy while retaining its speed.
The naive Bayes method and the k-nearest neighbor method are two commonly used text categorization methods. The naive Bayes method estimates the probability that each text belongs to each category. The k-nearest neighbor classifier assigns each text a category based on the categories of its k nearest neighbors. The two methods are compared on the same platform, "Chinese nature language processing".
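The two classifiers described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy training documents, the function names, the Laplace smoothing, and the cosine similarity over raw term counts are all assumptions made for illustration, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, pre-segmented training documents (illustrative data only).
TRAIN = [
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
    (["球队", "比赛", "胜利"], "sports"),
    (["球员", "比赛", "进球"], "sports"),
]

def train_naive_bayes(docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, words):
    """Return the class with the highest log posterior (multinomial model, add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_knn(docs, words, k=3):
    """Vote among the k training documents most similar to the query."""
    query = Counter(words)
    ranked = sorted(docs, key=lambda d: cosine(query, Counter(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

model = train_naive_bayes(TRAIN)
print(predict_naive_bayes(model, ["比赛", "球队"]))
print(predict_knn(TRAIN, ["比赛", "球队"], k=3))
```

In this sketch the naive Bayes prediction only touches per-class counts, whereas the k-nearest neighbor prediction compares the query against every training document, so its cost grows with the size of the training set.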
Experiments show that the Bayes classifier is faster; it can handle large data sets and can be applied to online categorization. The k-nearest neighbor classifier achieves higher accuracy, so it suits applications that require high accuracy. But its speed is slower, so...
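The combined segmentation approach described in the abstract (dictionary matching plus statistical detection of new words) might look roughly like the following Python sketch. The abstract does not say which matching strategy or which statistic the thesis uses, so forward maximum matching, the run-frequency heuristic, the `min_count` threshold, and all names and sample data here are assumptions.

```python
from collections import Counter

def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def find_new_words(corpus, dictionary, min_count=2):
    """Assumed statistic: runs of consecutive unmatched characters that recur
    at least min_count times across the corpus are treated as new words."""
    counts = Counter()
    for text in corpus:
        run = ""
        for tok in forward_max_match(text, dictionary):
            if len(tok) == 1 and tok not in dictionary:
                run += tok  # extend the current run of unmatched characters
            else:
                if len(run) > 1:
                    counts[run] += 1
                run = ""
        if len(run) > 1:
            counts[run] += 1
    return {word for word, count in counts.items() if count >= min_count}

# Hypothetical dictionary and corpus for illustration.
dictionary = {"我们", "喜欢", "使用"}
corpus = ["我们喜欢博客", "我们使用博客"]

new_words = find_new_words(corpus, dictionary)
dictionary |= new_words  # feed discovered words back for later segmentation
print(forward_max_match("我们喜欢博客", dictionary))
```

Feeding the discovered words back into the dictionary mirrors the abstract's description of supplying new words for later segmentation; a production system would likely use a stronger statistic than raw run counts.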
Keywords/Search Tags:Text Categorization, Text Segmentation, The Matching Method, The Statistic Method, Bayes Method, k-Nearest Neighbor Method