Preprocessing Technology For A Chinese Web Text Classification System

Posted on:2010-11-24

Degree:Master

Type:Thesis

Country:China

Candidate:Z P Wang

GTID:2208360275998513

Subject:Computer application technology

Abstract/Summary:

The rapid growth of web page information poses a new challenge for information retrieval. To help search engine users obtain information online more quickly and accurately, the large volume of web pages must be classified according to page content. Web text mining is an effective way to address this problem. It draws on the basic ideas and theory of data mining to discover potential, valuable knowledge in semi-structured, heterogeneous web pages.

Text categorization is an important technology in web text mining, and Chinese text categorization has become an increasingly active topic in web data mining research. Its key techniques include web page cleaning, word segmentation, feature extraction, text representation, and text classification. The first four steps (web page cleaning, word segmentation, feature extraction, and text representation) are collectively called web page preprocessing. The quality of preprocessing is an important factor in the quality of text categorization. This thesis studies each part of preprocessing and implements a preprocessing system.

Within preprocessing, the quality of feature extraction significantly affects both the training time and the accuracy of text classification. Traditional feature extraction methods treat each feature independently and disregard semantic relationships between features, such as relatedness and similarity. This thesis introduces a feature extraction method based on synonym statistics: before feature extraction, each group of synonyms is replaced with a single representative word, which reduces the dimension of the feature space. In experiments on web text categorization using a support vector machine, categorization accuracy was higher with the synonym-based feature extraction method than with feature extraction that did not use synonym statistics.
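The synonym replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The synonym table, the function names, and the toy documents are all hypothetical, the tokens are assumed to be already segmented (English words stand in for Chinese ones), and document frequency is used as a stand-in feature score because the abstract does not name the scoring statistic. The support vector machine stage is omitted.

```python
from collections import Counter

# Hypothetical synonym table: each word maps to one canonical representative.
SYNONYMS = {
    "automobile": "car",
    "vehicle": "car",
    "film": "movie",
}

def merge_synonyms(tokens, synonyms=SYNONYMS):
    """Replace every token that has a table entry with its canonical form."""
    return [synonyms.get(tok, tok) for tok in tokens]

def document_frequency(docs):
    """Count in how many documents each term occurs."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df

def select_features(docs, k):
    """Keep the k terms with the highest document frequency.

    Document frequency is an assumed scoring function; ties are broken
    alphabetically so the result is deterministic.
    """
    df = document_frequency(docs)
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Toy corpus of already-segmented documents (hypothetical data).
docs = [
    ["car", "engine", "price"],
    ["automobile", "engine", "review"],
    ["film", "review", "actor"],
    ["movie", "actor", "vehicle"],
]

merged_docs = [merge_synonyms(d) for d in docs]

vocab_before = {tok for d in docs for tok in d}
vocab_after = {tok for d in merged_docs for tok in d}

print(len(vocab_before), len(vocab_after))  # vocabulary size before and after merging
print(select_features(merged_docs, 3))      # top terms after synonym merging
```

On this toy corpus the vocabulary shrinks from nine terms to six once synonyms are merged, and the counts for "car" and "movie" are pooled across their variants, which is the dimension reduction effect the abstract describes. Merging before scoring matters: a concept spread over several rare synonyms could otherwise fall below the selection cutoff even though it is frequent as a whole.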

Keywords/Search Tags:

Information retrieval, web text categorization, text preprocessing, feature extraction, synonymy statistic

Related items

1	Study Of Text Categorization And Image Restoration In Modern Information Retrieval
2	Research On Text Categorization And Technologies
3	The Research And Implementation Of Chinese Text Categorization
4	Research On Key Problems In Text Mining
5	Research On High-Performance Text Categorization
6	The Research On Several Key Techniques In Text Information Processing
7	An examination of KSS for feature selection for text categorization using support vector machines
8	Studies On Algorithms In Chinese Information Retrieval
9	Research And Implementation Of Text Categorization System Based On VSM
10	Research And Implementation Of Chinese Text Classification, Feature Selection Method,