The Study And Implementation Of Chinese Web Text Classification

Posted on: 2007-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: D Zou
Full Text: PDF
GTID: 2178360182980067
Subject: Computer application technology
Abstract/Summary:
This thesis systematically explains and implements the general workflow of Web text classification. Based on an analysis of the system's requirements, we divide the system into four parts: Web text collection and preprocessing, Chinese word segmentation, selection of training pattern vectors, and training and classification of the text pattern vectors.

Unlike general text classification, Web text classification requires collecting Web pages, preprocessing them, and saving the weight of the text. We implement a preemptive multi-threaded Web text collector that gathers the Web pages of a specific catalog using a depth-first algorithm. We also implement a Web text preprocessor that strips the tags and sets the weight of the Web text using a recursive matching method.

For Chinese word segmentation, we implement a segmenter using the Full Binary Search Maximal Match Algorithm for Chinese Word Segmentation. This algorithm is based on a study of the GB2312 Chinese encoding architecture and the characteristics of Chinese words. We use the bi-directional matching method and obtain good performance in both correctness and speed.

For the selection of training pattern vectors, we first introduce a classifier that uses the nexus between words and types to select training patterns properly and to reduce the number of support vectors. We use TF-IDF to calculate the weights of the vectors.

For training and classification, we introduce the basic theory of the support vector machine (SVM), the application of SVM to text classification, and the LIBSVM software. The extracted patterns and their weights are used to form the input file, through which text training and text classification are carried out.

The author proposes a new solution for Chinese Web text classification, implements it, and obtains good results in testing.
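
As a rough illustration of the segmentation, weighting, and input-file steps described above, the following Python sketch may help. It is not the thesis's implementation. The lexicon is held in a Python set rather than searched with the Full Binary Search structure the thesis names. The rule for choosing between the forward and backward results (fewer tokens, then fewer single-character tokens, then the backward result) is a common heuristic assumed here. The TF-IDF variant (relative term frequency times log(N/df)) is one of several possible definitions. All function names are hypothetical. The output line follows the LIBSVM sparse format `<label> <index>:<value> ...` with ascending indices.

```python
import math
from collections import Counter


def forward_max_match(text, lexicon, max_len):
    """Scan left to right, always taking the longest lexicon word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len):
    """Scan right to left, always taking the longest lexicon word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_match(text, lexicon):
    """Run both scans and keep the result the (assumed) heuristic prefers."""
    max_len = max(map(len, lexicon), default=1)
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = backward_max_match(text, lexicon, max_len)

    def cost(tokens):
        # Fewer tokens first, then fewer single-character tokens.
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd


def tfidf_vectors(docs):
    """Map tokenized documents to sparse {feature_index: weight} dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Feature indices start at 1, as LIBSVM expects.
    vocab = {term: idx for idx, term in enumerate(sorted(df), start=1)}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            vocab[t]: (c / len(doc)) * math.log(n / df[t])
            for t, c in tf.items()
        })
    return vocab, vectors


def to_libsvm_line(label, vec):
    """Format one sparse vector as a LIBSVM input line, skipping zeros."""
    feats = " ".join(
        f"{i}:{w:.6f}" for i, w in sorted(vec.items()) if w != 0.0
    )
    return f"{label} {feats}".rstrip()
```

With a toy lexicon of 研究, 研究生, 生命, 命, and 起源, the forward scan splits 研究生命起源 as 研究生 / 命 / 起源, while the backward scan gives 研究 / 生命 / 起源. The heuristic keeps the backward result because it contains no single-character token.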
Keywords/Search Tags: Chinese word segmentation, Vector Space Machine, Support Vector Machine, Text Classification