Language Independent Text Categorization

Posted on: 2008-04-10  Degree: Master  Type: Thesis
Country: China  Candidate: L Chen  Full Text: PDF
GTID: 2178360242971544  Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, the Web has grown into a global, massive, distributed, and shared information space that gives people a new means of searching for information. As the volume of information on the Internet has exploded, however, users are flooded with material irrelevant to their requests, and the information they actually need is buried. In this setting, automatic classifiers play an important role in finding the needed information and in making effective use of shared information: by organizing and managing information effectively, they improve the efficiency of information retrieval.

This paper proposes an approach to language-independent text classification that requires no word segmentation. Unlike traditional text classification models, the approach is based on character-level n-gram language modeling and so avoids word segmentation, explicit feature selection, and extensive pre-processing.

The paper first introduces the state of research on text categorization. Second, it compares the major text representation models, including discussion of parameter selection for the n-gram model, smoothing algorithms, and related issues. Third, it presents the functions of the system, with a detailed description of the core function, the classifier. The paper also proposes a chain naïve Bayes classifier, which is tightly combined with the n-gram model and relaxes the n-gram model's independence assumption; experiments show that this method performs well in classification. Fourth, it systematically studies the key implementation factors and describes the evaluation method in detail. Finally, it lists and analyzes the experimental results. Results on two languages, Chinese and English, show that the proposed method achieves good performance on text classification tasks.
Keywords/Search Tags: text classification, n-gram model, classifier, corpus
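
The general idea of classifying text with character-level n-gram language models, as described in the abstract, can be illustrated with a minimal Python sketch. This is not the thesis's implementation, and it is not the chain naïve Bayes classifier the thesis proposes. It is a plain per-class character n-gram model scored with Bayes' rule, shown only to make clear why no word segmentation is needed. The class name `CharNgramClassifier`, the parameters `n` and `alpha`, the add-alpha smoothing, and the toy training data are all assumptions chosen for illustration; the abstract does not say which smoothing algorithm or n-gram order the thesis settles on.

```python
import math
from collections import Counter, defaultdict


class CharNgramClassifier:
    """Illustrative per-class character n-gram language-model classifier.

    Hypothetical sketch: add-alpha smoothing and the default n=3 are
    assumptions, not choices taken from the thesis.
    """

    def __init__(self, n=3, alpha=1.0):
        self.n = n                                   # n-gram order
        self.alpha = alpha                           # add-alpha smoothing constant
        self.ngram_counts = defaultdict(Counter)     # label -> counts of context + char
        self.context_counts = defaultdict(Counter)   # label -> counts of (n-1)-char context
        self.doc_counts = Counter()                  # label -> number of training documents
        self.vocab = set()                           # characters seen during training

    def _ngrams(self, text):
        # Pad the start so the first characters also have a full-length context.
        padded = "\x02" * (self.n - 1) + text
        for i in range(len(text)):
            yield padded[i:i + self.n - 1], padded[i + self.n - 1]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for context, char in self._ngrams(text):
                self.ngram_counts[label][context + char] += 1
                self.context_counts[label][context] += 1
                self.vocab.add(char)
        return self

    def log_score(self, text, label):
        # log P(label) + sum over characters of log P(char | previous n-1 chars, label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        v = len(self.vocab) + 1  # one extra slot reserves mass for unseen characters
        for context, char in self._ngrams(text):
            num = self.ngram_counts[label][context + char] + self.alpha
            den = self.context_counts[label][context] + self.alpha * v
            score += math.log(num / den)
        return score

    def predict(self, text):
        # Pick the label whose language model gives the text the highest score.
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Toy data: the same code path handles English and unsegmented Chinese.
train_texts = [
    "the team won the football match",
    "球队赢得了足球比赛",
    "the bank raised interest rates",
    "银行提高了利率",
]
train_labels = ["sports", "sports", "finance", "finance"]

clf = CharNgramClassifier(n=3).fit(train_texts, train_labels)
for query in ["football team", "足球比赛", "interest rates", "银行利率"]:
    print(query, "->", clf.predict(query))
```

Because the model only ever looks at sequences of characters, Chinese and English documents go through the same code with no tokenizer or feature-selection step. That is the property the abstract emphasizes. A real system would need far more training data and a more careful smoothing scheme than the add-alpha estimate assumed here.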