Research Of Chinese Text Categorization

Posted on:2007-09-03

Degree:Master

Type:Thesis

Country:China

Candidate:L Yang

Full Text:PDF

GTID:2178360182485570

Subject:Computer applications

Abstract/Summary:

With the rapid development of the Internet, the Web has become a global, massive, distributed, and shared information space. It gives people a new means of searching for information. But with the explosive growth of information on the Internet, users are flooded with material irrelevant to their requests, and the information they actually need is buried. Amid this mass of information, automatic classifiers play an important role in finding the needed information and in making effective use of shared information. They improve the efficiency of information retrieval by organizing and managing information effectively.

This paper introduces several technologies relevant to text categorization. Word segmentation is the basis of text categorization. The character matching method and the statistical method are two commonly used word segmentation methods. The character matching method is limited by the number of words in its dictionary: new words appear continuously as modern society develops, and the matching method cannot recognize them accurately. This paper therefore proposes combining the character matching method with the statistical method. The combined method matches strings against the dictionary and segments the words found there. At the same time, it applies the statistical method to identify new words that are absent from the dictionary and adds them to the dictionary for later word segmentation. Experiments show that this method improves segmentation accuracy while retaining its speed.

The naive Bayes method and the k-nearest neighbor method are two commonly used text categorization methods. The naive Bayes method estimates, for each text, the probability that it belongs to each category. The k-nearest neighbor classifier assigns each text a category based on the categories of its k nearest neighbors. A comparative study of the Bayes method and the k-nearest neighbor method is carried out on the same platform, "Chinese natural language processing".
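
The combined segmentation idea described above (dictionary matching first, then statistics to pick up new words and feed them back into the dictionary) can be sketched roughly as follows. The abstract does not say which matching strategy or which statistic the thesis uses, so forward maximum matching and a plain co-occurrence count threshold are assumptions made here for illustration. The names `forward_max_match`, `find_new_words`, `max_len`, and `min_count`, as well as the tiny dictionary and corpus, are hypothetical.

```python
from collections import Counter

def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (assumed strategy, not confirmed by the abstract)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def find_new_words(corpus, dictionary, min_count=2):
    """Propose new words from adjacent unmatched characters that co-occur often.

    The raw count threshold is an illustrative stand-in for the thesis's
    unspecified statistical method.
    """
    pair_counts = Counter()
    for text in corpus:
        tokens = forward_max_match(text, dictionary)
        for left, right in zip(tokens, tokens[1:]):
            both_single = len(left) == 1 and len(right) == 1
            if both_single and left not in dictionary and right not in dictionary:
                pair_counts[left + right] += 1
    return {pair for pair, count in pair_counts.items() if count >= min_count}

# Hypothetical toy data: "博客" (blog) is missing from the dictionary.
dictionary = {"我们", "喜欢", "研究", "文本", "分类"}
corpus = ["我们喜欢博客", "我们研究博客文本分类"]

new_words = find_new_words(corpus, dictionary)
dictionary |= new_words  # supply the new words to the dictionary for later segmentation
segmented = forward_max_match("我们研究博客", dictionary)
```

In this toy run the dictionary pass leaves "博" and "客" as unmatched single characters in both sentences; the pair is then proposed as a new word, so later segmentation treats "博客" as one token. A real system would likely use a stronger statistic than a raw count, for example one that also considers how often each character appears on its own.
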
Experiments show that the Bayes classifier is faster. It can handle large data sets and can be applied to online categorization. The k-nearest neighbor classifier achieves higher accuracy, so it suits applications that require high accuracy. But its speed is slower, so...
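
To make the comparison above concrete, here is a minimal sketch of the two classifiers working on already segmented texts. It is not the thesis's implementation: the multinomial model with add-one smoothing, the cosine similarity measure, the majority vote, and every name and data value below are assumptions chosen for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Collect per-class document counts and word counts from tokenized documents."""
    class_docs = Counter(labels)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_docs, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing assumed)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # class prior
        for word in tokens:
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict_knn(docs, labels, tokens, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = Counter(tokens)
    ranked = sorted(
        zip(docs, labels),
        key=lambda pair: cosine(query, Counter(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already segmented training texts.
docs = [
    ["球队", "比赛", "进球"],
    ["比赛", "冠军", "球队"],
    ["股票", "市场", "上涨"],
    ["银行", "股票", "利率"],
]
labels = ["sports", "sports", "finance", "finance"]
query = ["球队", "冠军", "比赛"]

model = train_naive_bayes(docs, labels)
nb_label = predict_naive_bayes(model, query)
knn_label = predict_knn(docs, labels, query, k=3)
```

The sketch also hints at why the speed difference reported above is plausible: the naive Bayes prediction only loops over the classes and the query's words, while the k-nearest neighbor prediction compares the query against every training document.
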

Keywords/Search Tags:

Text Categorization, Text Segmentation, Matching Method, Statistical Method, Bayes Method, k-Nearest Neighbor Method

Related items

1	Research On Chinese Text Classifier Based On Probability Method
2	Text Categorization Based On Naive Bayes Method
3	Automatic Text Segmentation And Algorithm
4	A Study On M3-kNN Network And Application In Text Categorization
5	A Study On Chinese Text Automatic Categorization
6	A Study On Text Categorization Based On Machine Learning
7	A new feature selection method based on support vector machines for text categorization
8	Design And Implementation Of Kazak Text Categorization System
9	Research On Text Categorization Based On Modified Bayes Method And Its Application In NERMS
10	Application And Research Of Feature Selection Method In Chinese Text Categorization