
Web Text Classification Method And System Realization

Posted on: 2011-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: B Cheng
Full Text: PDF
GTID: 2208360308965834
Subject: Software engineering
Abstract/Summary:
In recent years, the Web has rapidly become the world's largest public source of information. Helping Web users locate the information they want quickly and easily within this vast volume of data is an urgent problem, and the accuracy of Web text classification is key to solving it. Derived from automatic classification technology, Web text classification is an important part of Web text mining. It not only improves search efficiency by helping users locate the desired knowledge quickly and accurately, but also captures the interests of different types of users, which can serve as a reference for meeting users' personalized demands.

Most current classification research assumes that text categories are disjoint and flat, and ignores the hierarchy among categories. When the number of categories is large, training a classifier with flat classification learning is expensive, and comparing an unknown document against every class model is clearly inappropriate. Building on an in-depth study of Web text mining and automatic classification, this thesis implements a multi-level Web text classification system that exploits the hierarchical relations among classes. The innovations and key technologies are as follows:

1. Established a hierarchical training and classification model. Because Web content is abundant and involves many types of features from many fields, and flat classification has problems in the multi-class situation, this thesis proposes hierarchical classification and establishes a hierarchical training and classification model.
2. Designed and implemented an automatic Web text extractor. Advertisements and hyperlinks in Web pages cause great trouble for Web text classification. The extractor strips them so that the resulting page contains only the title and body.
3. Proposed a keyword extraction method suited to Web pages. Because words in different locations and with different properties play different roles in Web content, this thesis proposes a keyword extraction method based on word property, location, and weighted word frequency, and applies it to Web text classification with good results.
4. Presented a classification method based on weighted χ² statistics. The χ² statistic reflects the relation between a feature and a category well. This thesis applies the χ² statistic to text classification in a novel way, which not only simplifies the classification process but also achieves better speed and accuracy in practical applications.

Based on the characteristics of Web text, this thesis puts forward an approach to large-scale, multi-category Web text classification and designs a multi-level Web text classification system. The results show that the system performs better than a general flat classifier in practice.
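The abstract does not give the weighting scheme behind the χ² method, so the following is only a minimal sketch of the standard, unweighted χ² statistic between a term and a category, computed from a 2x2 contingency table of document counts. The function names `chi_square` and `chi_square_scores` and the toy corpus are hypothetical, introduced here for illustration, and are not taken from the thesis.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Standard chi-square for one (term, category) pair.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def chi_square_scores(docs):
    """Score every (term, category) pair in a labelled corpus.

    docs: list of (set_of_terms, category) pairs (an assumed input format).
    Returns a dict mapping (term, category) to its chi-square score.
    """
    n_docs = len(docs)
    cat_count = Counter(cat for _, cat in docs)
    term_count = Counter()
    joint = Counter()
    for terms, cat in docs:
        for term in set(terms):
            term_count[term] += 1
            joint[(term, cat)] += 1

    scores = {}
    for term in term_count:
        for cat in cat_count:
            n11 = joint[(term, cat)]
            n10 = term_count[term] - n11
            n01 = cat_count[cat] - n11
            n00 = n_docs - n11 - n10 - n01
            scores[(term, cat)] = chi_square(n11, n10, n01, n00)
    return scores


# Toy corpus, invented for illustration only.
toy_docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"stock", "market"}, "finance"),
    ({"stock", "team"}, "finance"),
]
toy_scores = chi_square_scores(toy_docs)
print(toy_scores[("goal", "sports")])  # 4.0: "goal" occurs only in sports
print(toy_scores[("team", "sports")])  # 0.0: "team" is spread evenly
```

A high score marks a term whose presence depends strongly on the category, which is why the statistic is commonly used to rank candidate features; how the thesis weights or combines these scores for classification is not stated in the abstract.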
Keywords/Search Tags: Hierarchical, Web automatic extraction, Text classification, Feature selection, Keyword extraction