Design and Implementation of a News Page Classification System Combining Statistics and Rules

Posted on: 2012-08-23 | Degree: Master | Type: Thesis
Country: China | Candidate: T T Lv | Full Text: PDF
GTID: 2218330338970115 | Subject: Software engineering
Abstract/Summary:
Text classification is a key technology with high practical value for information organization and management; it can be used to organize and manage textual information on the Internet. Common web page classification methods include text similarity, K-nearest neighbor, naive Bayes, decision trees, and support vector machines. Web page classification algorithms mainly use the vector space model to represent a document as a text feature vector, and then apply a similarity measure to determine the category of the page. This thesis deals mainly with news page classification. Building on a study of text classification theory, it analyzes the text classification process, introduces feature selection algorithms in detail, and describes the evaluation methods for text classification.
The thesis analyzes the structure and characteristics of news pages and, given the unformatted nature of such pages, proposes a method for formatting them. Feature word sets are extracted by combining statistical methods with manual selection. The open source tool NekoHTML is used for web information extraction, regular expressions are used to extract hyperlinks and page titles, and the CHI method is used to extract the features of a news page. HowNet serves as the semantic knowledge base for calculating the distance between words. On this basis, the thesis proposes a news page text classification model based on statistics and rules.
The thesis uses the Java language to implement all aspects of news page classification, including category feature selection, news page information extraction, semantic similarity calculation, and the design of auxiliary rules.
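The abstract does not give the exact form of the CHI statistic used in the thesis. As an illustration only, the following Java sketch uses the chi-square score as it is commonly defined for text feature selection: for a term t and category c, with A = documents in c containing t, B = documents outside c containing t, C = documents in c without t, D = documents outside c without t, and N = A + B + C + D, the score is N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)). The class name, method names, and toy corpus are hypothetical and are not taken from the thesis.

```java
import java.util.*;

// Illustrative sketch only: names and data are assumptions, not the thesis's code.
final class ChiSquareSketch {

    // Chi-square score of a term for a category from the 2x2 contingency counts.
    // a: docs in category with term, b: docs outside category with term,
    // c: docs in category without term, d: docs outside category without term.
    static double chiSquare(long a, long b, long c, long d) {
        long n = a + b + c + d;
        double denom = (double) (a + c) * (b + d) * (a + b) * (c + d);
        if (denom == 0) {
            return 0.0; // degenerate table: the term or the category never varies
        }
        double diff = (double) a * d - (double) c * b;
        return n * diff * diff / denom;
    }

    // Scores every term of a labeled corpus against one category.
    // Each document is represented by the set of terms it contains.
    static Map<String, Double> scoreTerms(List<Set<String>> docs,
                                          List<String> labels,
                                          String category) {
        int n = docs.size();
        int inCategory = 0;
        Map<String, int[]> counts = new HashMap<>(); // [0] inside, [1] outside category
        for (int i = 0; i < n; i++) {
            boolean match = labels.get(i).equals(category);
            if (match) {
                inCategory++;
            }
            for (String term : docs.get(i)) {
                counts.computeIfAbsent(term, k -> new int[2])[match ? 0 : 1]++;
            }
        }
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, int[]> e : counts.entrySet()) {
            long a = e.getValue()[0];
            long b = e.getValue()[1];
            long c = inCategory - a;
            long d = (n - inCategory) - b;
            scores.put(e.getKey(), chiSquare(a, b, c, d));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
                new HashSet<>(Arrays.asList("goal", "match")),
                new HashSet<>(Arrays.asList("goal", "team")),
                new HashSet<>(Arrays.asList("stock", "market")),
                new HashSet<>(Arrays.asList("stock", "team")));
        List<String> labels = Arrays.asList("sports", "sports", "finance", "finance");
        System.out.println(scoreTerms(docs, labels, "sports"));
    }
}
```

Note that the score is symmetric: a term that never appears in a category scores as high as one that appears only in it, so a selection step built on this sketch would also need to check the direction of the association before treating a term as a positive feature for a category.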
Category feature selection combines statistical methods with manual selection. Body text extraction uses the open source tool NekoHTML to parse each news page into a DOM tree, extract the text nodes, and prune useless nodes such as style and script. Semantic distance is calculated with the algorithm of Liu Qun of the Chinese Academy of Sciences, implemented here in Java; the running interface of the system is shown in Chapter 5. Auxiliary rules for news page classification were designed by manually selecting features from the text documents of certain categories. Finally, the thesis describes the implementation of the classification system based on statistics and rules. The study shows that the algorithm is feasible and that the news page classification system can meet the needs of some practical applications.
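The body text extraction step described above (walk a DOM tree, keep text nodes, skip style and script) can be sketched as follows. This is not the thesis's implementation. So that the example runs with the JDK alone, it replaces NekoHTML with the JDK's built-in XML parser, and therefore assumes well-formed XHTML input, which real news pages often are not. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: a stand-in for the NekoHTML-based extraction.
final class BodyTextSketch {

    // Parses well-formed XHTML and returns its visible text,
    // with the contents of <script> and <style> elements dropped.
    static String extractText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collect(doc.getDocumentElement(), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: append non-empty text nodes, prune script/style subtrees.
    private static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            if (name.equals("script") || name.equals("style")) {
                return; // useless node for classification: skip the whole subtree
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>T</title><style>p{color:red}</style></head>"
                + "<body><p>Hello</p><script>var x=1;</script><p>World</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

With a tolerant HTML parser such as NekoHTML producing the DOM, the same recursive walk would apply; only the parsing step would change.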
Keywords/Search Tags: news page classification, text classification, news page body content extraction, classification system