Research And Implementation Of Automatic Classification System And Key Technologies On Chinese Web Page

Posted on: 2014-07-04   Degree: Master   Type: Thesis
Country: China   Candidate: J Zhou   Full Text: PDF
GTID: 2308330479479220   Subject: Software engineering
Abstract/Summary:
With the development of information technology, the amount of data on the Internet has grown dramatically. One effective means of organizing and managing this massive amount of data is web page classification based on page text. Because page content is so varied, traditional data mining classification methods do not perform well on it. How to establish a more practical method to solve this problem is therefore the main research direction. In this paper, we carry out research on this type of problem. We point out the deficiencies that exist in frequently used classification methods and give solutions to them. The contributions and related work of the paper are as follows.

Firstly, we studied the theory of web page classification, including the web classification process, web page representation models, Chinese word segmentation, and feature extraction methods.

Secondly, we proposed a model based on Labeled_LDA to address the problem that some pages contain too few words.

Thirdly, we proposed a pre-classification algorithm to handle news pages that cannot be classified precisely because their content is disorganized.

Fourthly, we designed a new classification architecture that combines all of the ideas mentioned above. Experiments showed that classification based on this architecture improved accuracy by 0.5%-1%.

In addition, we analyze the shortcomings that remain and put forward directions for further improvement.
Keywords/Search Tags: Web Page Classification, Pre-Classification, Feature Vector Expansion, Induction Model, Classification Architecture