Research On Blog Identification Based On Comprehensive Feature Space

Posted on:2010-09-11

Degree:Master

Type:Thesis

Country:China

Candidate:M Li

Full Text:PDF

GTID:2178360275456512

Subject:Applied Mathematics

Abstract/Summary:

The influence of blogs is growing rapidly, and so is the amount of information in the blogosphere. Through frequent linking and interaction, blogs have formed a dynamic virtual network on the World Wide Web. This network interacts closely with real society and has become an important source of information about the real world. However, it is impossible to search, identify, and analyze such a large amount of information manually. Blog content must therefore be distinguished from other content on the Web before any further blog research can be carried out, which makes blog identification the foundation of that research. Research on identifying blog pages has been increasing in recent years. However, because of the particularity and complexity of the blog domain, applying text classification techniques directly, or with only simple modifications, does not give satisfactory results. This thesis therefore studies and analyzes existing techniques for identifying blog pages and presents an identification method based on a comprehensive feature space.

The main work of this thesis includes:

(1) Feature extraction from web pages is studied, and the whole process of acquiring, tidying, parsing, and extracting web pages is analyzed. A new concept, the comprehensive feature space of blogs, is presented, and the definition and extraction methods of blog features are discussed in detail.

(2) The formal representation of web pages is studied, and the whole process of term extraction, selection, and weighting is analyzed. On this basis, a TF-IDF algorithm and a weight adjustment scheme based on tag information are designed. Formal representations of the text features and the layout features of web pages are then presented, and the definition and acquisition of these features are discussed in detail.

(3) The identification of web pages is studied. The concepts of classification and clustering are introduced, and the KM and KNN algorithms are analyzed in detail. The SILKM algorithm is then presented to improve the KM algorithm, and the KNC algorithm, based on KM and CV, is presented to improve the KNN algorithm.

(4) An identification algorithm for blog pages is presented based on the comprehensive feature space, and the KNC algorithm is applied in the blog identification phase, using both the layout features and the text features.
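
The abstract does not give the details of the tag-based weight adjustment or of the KNC algorithm, so the following is only a minimal sketch of the general idea, not the method of the thesis. It assumes that each page is already parsed into (tag, text) pairs, that terms inside certain tags count more toward term frequency, and that a plain k-nearest-neighbor vote over cosine similarity stands in for the classifier. The tag weights in TAG_WEIGHTS, the function names, and the sample pages are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical tag weights (assumed values, not taken from the thesis).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}


def weighted_term_counts(page):
    """Count terms in a page given as (tag, text) pairs, scaled by tag weight."""
    counts = Counter()
    for tag, text in page:
        weight = TAG_WEIGHTS.get(tag, 1.0)  # unlisted tags get weight 1.0
        for term in text.lower().split():
            counts[term] += weight
    return counts


def tfidf_vectors(pages):
    """Return one L2-normalised {term: weight} vector per page."""
    counts = [weighted_term_counts(page) for page in pages]
    n_pages = len(pages)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each page contributes 1 per distinct term
    vectors = []
    for c in counts:
        # Smoothed IDF, so a term present in every page keeps a nonzero weight.
        vec = {t: f * (math.log((1 + n_pages) / (1 + doc_freq[t])) + 1.0)
               for t, f in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Dot product of two sparse vectors (cosine, since both are normalised)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def knn_predict(train_vecs, train_labels, query, k=3):
    """Plain kNN majority vote (a stand-in; this is not the KNC algorithm)."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up sample pages: four labelled pages plus one query page.
pages = [
    [("title", "my travel blog"), ("body", "posted by anna comments archive")],
    [("title", "daily diary blog"), ("body", "posted by bob comments permalink")],
    [("title", "acme corp products"), ("body", "buy widgets pricing contact sales")],
    [("title", "acme store"), ("body", "pricing catalog checkout sales cart")],
    [("title", "cooking blog"), ("body", "posted by carol comments archive")],
]
labels = ["blog", "blog", "other", "other"]
vecs = tfidf_vectors(pages)
print(knn_predict(vecs[:4], labels, vecs[4], k=3))  # prints "blog"
```

For brevity the sketch computes IDF over all pages at once, including the query page. A real pipeline would fit the document frequencies on the training pages only and reuse them for unseen pages.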

Keywords/Search Tags:

Blog page identification, comprehensive feature space of Blog, feature extraction, web page representation model, KNN algorithm

Related items

1	Research On Several Key Issues In Blog Search
2	Research And Implementation Of Web Page Segmentation Algorithm MFPS Based On Multi-Feature
3	Discovery Of Implied Communities For Blog Page Based On Topic
4	Semi-supervised Blog Information Extraction Techniques Based On Document Structure
5	Design And Implementation Of The HTTPS Page Classification Detection System Based On Feature Extraction
6	Research On Financial Blog Crawling And Ranking Algorithm
7	Research Of Chinese Page Automatic Classification Based On Vector Space Model
8	Research On Web Page Classification And Information Collection