Research On Blog Identification Based On Comprehensive Feature Space

Posted on: 2010-09-11
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2178360275456512
Subject: Applied Mathematics
Abstract/Summary:
The influence of blogs and the amount of information in the blogosphere are both growing rapidly. Through frequent linking and interaction, blogs have formed a dynamic virtual network on the World Wide Web. This network interacts closely with real society and has become an important source of information about the real world. However, it is impossible to search, identify, and analyze such a large amount of information manually, so blog content must first be distinguished from other content on the Web before any further blog research can proceed. Blog identification is therefore the foundation of other research on blogs. Research on identifying blog pages has been increasing in recent years, but because of the particularity and complexity of the blog domain, applying text classification techniques directly, or with only simple modifications, does not give satisfactory results. This thesis therefore studies and analyzes techniques for identifying blog pages and presents an identification method based on a comprehensive feature space. The main work of this thesis includes:

(1) The extraction of features from web pages is studied, and the whole process of acquiring, tidying, parsing, and extracting web pages is analyzed. A new concept, the comprehensive feature space of blogs, is presented, and the definition and extraction methods of blog features are discussed in detail.

(2) The formal representation of web pages is studied, and the whole process of term extraction, selection, and weighting is analyzed. On this basis, a TF-IDF algorithm and a weight adjustment scheme based on tag information are designed. Formal representations of the text features and the layout features of web pages are then presented, and the definition and acquisition of these features are discussed in detail.

(3) The identification of web pages is studied. The concepts of classification and clustering are introduced, and the KM and KNN algorithms are analyzed in detail. The SILKM algorithm is then presented to improve the KM algorithm, and the KNC algorithm, based on KM and CV, is presented to improve the KNN algorithm.

(4) An identification algorithm for blog pages based on the comprehensive feature space is presented, in which the KNC algorithm is applied to the blog identification phase using the layout features and the text features.
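
As a rough illustration of the ideas named in items (2) and (3), the sketch below combines a tag-weighted TF-IDF representation with a plain KNN vote over cosine similarity. It is not the thesis's own method: the tag weights in `TAG_WEIGHTS`, the smoothed IDF formula, the toy pages, and all function names are assumptions made for this example, and the SILKM and KNC algorithms are not reproduced because the abstract does not describe them.

```python
import math
from collections import Counter

# Hypothetical tag weights; the abstract does not state the actual scheme.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}


def weighted_tf(page):
    """Tag-weighted term counts for a page given as (tag, text) pairs."""
    tf = Counter()
    for tag, text in page:
        weight = TAG_WEIGHTS.get(tag, 1.0)  # any other tag counts once
        for term in text.lower().split():
            tf[term] += weight
    return tf


def fit_idf(pages):
    """Smoothed inverse document frequency over the training pages."""
    n = len(pages)
    df = Counter()
    for page in pages:
        df.update(set(weighted_tf(page)))  # each term counted once per page
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(page, idf):
    """Sparse TF-IDF vector; terms unseen in training are dropped."""
    return {t: c * idf[t] for t, c in weighted_tf(page).items() if t in idf}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, train_vecs, labels, k=3):
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(zip(train_vecs, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
train_pages = [
    [("title", "my blog"), ("body", "posted by alice comments archive")],
    [("title", "travel blog"), ("body", "posted by bob comments trackback")],
    [("title", "acme products"), ("body", "buy cheap products online store")],
    [("title", "news portal"), ("body", "breaking news sports weather")],
]
labels = ["blog", "blog", "other", "other"]

idf = fit_idf(train_pages)
train_vecs = [vectorize(p, idf) for p in train_pages]
query_vec = vectorize(
    [("title", "food blog"), ("body", "posted by carol comments")], idf)
print(knn_predict(query_vec, train_vecs, labels, k=3))
```

Weighting a term by the tag it appears in lets words in prominent positions, such as the page title, contribute more to the vector than the same words in body text, which is one plausible reading of a "weight adjustment scheme based on tag information".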
Keywords/Search Tags: Blog page identification, comprehensive feature space of Blog, feature extraction, web page representation model, KNN algorithm