Research On Dynamic Centroid Based Webtext Classification Approach And Application

Posted on: 2015-05-18, Degree: Master, Type: Thesis
Country: China, Candidate: C C Jiang, Full Text: PDF
GTID: 2298330422990109, Subject: Computer application technology
Abstract/Summary:
The development and spread of the Internet have made it a global information service platform. The Internet contains a massive amount of data, more than 80 percent of which exists in textual form. Web text classification is of great significance for organizing and managing this huge volume of network information and for improving the efficiency and accuracy of information retrieval. It has therefore become one of the research hotspots in recent years.

The key technologies of Web text classification include preprocessing, feature selection, text representation, and classification methods. Unlike plain text, Web text contains abundant tags, hyperlinks, and metadata; it is semi-structured rich text. The features available for selection are therefore much more flexible and complex. Meanwhile, since Web text is large in scale, easy to collect, and hard to label, its classification often suffers from the “labeling bottleneck”.

Based on the characteristics of Web text and the classification problems it faces, this thesis proposes a dynamic centroid based text classification approach, called DCTC, to address the “labeling bottleneck”, and applies the new method in a blog classification system. The major research contents include:

(1) Web text classification algorithm. A new dynamic centroid based text classification approach is proposed to handle the classification of Web text when the training set is relatively small. The approach computes the initial centroid of each category from the labeled texts, then progressively classifies the unlabeled texts and uses the texts classified with high confidence to adjust the centroids. Experimental results show the effectiveness of the new method.

(2) Application of Web text classification. A blog text classification system based on the DCTC method is designed and implemented. The client side takes the form of a browser extension and is responsible for Web content extraction; the server side is a web site that offers automatic text classification using the DCTC approach and provides result retrieval. The practical application of the new method further validates its effectiveness.
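
The abstract gives only the outline of the centroid-adjustment loop, so the following Python sketch is an illustration under stated assumptions, not the thesis's actual DCTC implementation. It assumes whitespace-tokenized term-frequency vectors, cosine similarity, a confidence measure defined as the margin between the best and second-best centroid similarity, and a hypothetical `threshold` parameter; the function names (`vectorize`, `centroid`, `dctc_sketch`) are invented for this example.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector, L2-normalized (assumed representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Arithmetic mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def dctc_sketch(labeled, unlabeled, threshold=0.1, max_rounds=10):
    """Iteratively classify unlabeled texts and fold confident ones into the centroids.

    labeled:   list of (text, label) pairs
    unlabeled: list of texts
    threshold: assumed confidence margin (best minus second-best similarity)
    """
    members = defaultdict(list)
    for text, label in labeled:
        members[label].append(vectorize(text))

    pending = [(text, vectorize(text)) for text in unlabeled]
    predictions = {}

    for _ in range(max_rounds):
        # Recompute centroids from labeled texts plus texts accepted so far.
        centroids = {label: centroid(vecs) for label, vecs in members.items()}
        still_pending = []
        for text, vec in pending:
            ranked = sorted(
                ((cosine(vec, c), label) for label, c in centroids.items()),
                reverse=True,
            )
            best_sim, best_label = ranked[0]
            runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
            if best_sim - runner_up >= threshold:
                # Confident: accept the label and let the text adjust the centroid.
                predictions[text] = best_label
                members[best_label].append(vec)
            else:
                still_pending.append((text, vec))
        if len(still_pending) == len(pending):
            break  # no new confident texts, so the centroids will not change
        pending = still_pending

    # Texts that never reached the margin fall back to the nearest final centroid.
    centroids = {label: centroid(vecs) for label, vecs in members.items()}
    for text, vec in pending:
        predictions[text] = max(centroids, key=lambda lbl: cosine(vec, centroids[lbl]))
    return predictions


if __name__ == "__main__":
    labeled = [
        ("python code compiler bug", "tech"),
        ("football match goal team", "sports"),
    ]
    unlabeled = ["compiler bug in code", "team scored a goal", "python team"]
    for text, label in dctc_sketch(labeled, unlabeled).items():
        print(f"{label}: {text}")
```

In this toy run the third text is equally similar to both initial centroids, so it is deferred in the first round and only classified after the confidently labeled texts have shifted the centroids. That deferral is the point of the dynamic update: a small labeled set seeds the centroids, and confidently classified unlabeled texts refine them.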
Keywords/Search Tags: web text, text classification, centroid classifier, dynamic centroid classification approach, semi-supervised learning