
Chinese Preprocessing Technology for a Web Text Classification System

Posted on: 2010-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: Z P Wang
Full Text: PDF
GTID: 2208360275998513
Subject: Computer application technology
Abstract/Summary:
The rapid growth of web page information poses a new challenge for information retrieval. To help users of search engines obtain information online more quickly and accurately, the large volume of web pages must be classified according to page content. Web text mining is an effective way to address this problem. It draws on the basic ideas and theory of data mining to discover latent, valuable knowledge in semi-structured, heterogeneous web pages.

Text categorization is an important technique in web text mining, and Chinese text categorization has become an increasingly popular topic in web data mining research. Its key techniques are web page cleaning, word segmentation, feature extraction, text representation, and text classification. The first four are collectively called web page preprocessing. The quality of preprocessing is an important factor in the quality of text categorization. This thesis studies each stage of preprocessing and implements a preprocessing system.

Within preprocessing, feature extraction has a marked effect on the training time and accuracy of text classification. Traditional feature extraction methods treat each feature independently and ignore semantic relationships between features, such as relatedness and similarity. This thesis introduces a feature extraction method based on synonym statistics: before feature extraction, each group of synonyms is replaced with a single word, which reduces the dimensionality of the feature space. In web text categorization experiments using a support vector machine, categorization accuracy was higher with the synonym-based feature extraction method than with feature extraction that did not use synonym statistics.
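The synonym replacement step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. The synonym groups, the sample documents, and the function names (`build_synonym_map`, `normalize`, `select_features`) are all hypothetical. English tokens stand in for the output of Chinese word segmentation, and document frequency is assumed as the feature selection criterion because the abstract does not name one.

```python
from collections import Counter

# Hypothetical synonym groups; a real system would load them from a thesaurus.
SYNONYM_GROUPS = [
    ["computer", "pc", "machine"],
    ["fast", "quick", "rapid"],
]


def build_synonym_map(groups):
    """Map every word in a group to the group's first word (its representative)."""
    mapping = {}
    for group in groups:
        representative = group[0]
        for word in group:
            mapping[word] = representative
    return mapping


def normalize(tokens, synonym_map):
    """Replace each token with its synonym representative, if it has one."""
    return [synonym_map.get(token, token) for token in tokens]


def select_features(docs, top_k):
    """Keep the top_k terms by document frequency (assumed criterion)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]


# Already-segmented documents (hypothetical data).
docs = [
    ["fast", "computer", "review"],
    ["quick", "pc", "review"],
    ["rapid", "machine", "news"],
]

synonym_map = build_synonym_map(SYNONYM_GROUPS)
normalized_docs = [normalize(doc, synonym_map) for doc in docs]

vocab_before = {token for doc in docs for token in doc}
vocab_after = {token for doc in normalized_docs for token in doc}

print(len(vocab_before), len(vocab_after))
print(select_features(docs, 2))
print(select_features(normalized_docs, 2))
```

In this toy data, merging synonyms shrinks the vocabulary and pools the counts of words that would otherwise be counted separately, so the selected features change. The resulting term vectors could then be passed to a classifier such as a support vector machine.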
Keywords/Search Tags: information retrieval, web text categorization, text preprocessing, feature extraction, synonymy statistic