Research On Chinese Blog Pages Recognition And Content Extraction

Posted on: 2008-09-25
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhang
GTID: 2178360245498115
Subject: Computer Science and Technology
Abstract/Summary:
As a new mode of online information dissemination, the blog has become part of the daily lives of Internet users. It provides a platform for publishing, exchanging, and communicating information.

The rapid growth of blogs has produced a huge body of data, and recovering valuable information from this large-scale collection has become imperative. This thesis studies that problem. We extract the features that distinguish blog pages from ordinary web pages, identify the blog pages among downloaded pages, and extract content from them.

The thesis first analyzes the features of ordinary web pages and then those of blog pages; comparing the two allows blog pages to be identified. Based on long-term observation, statistics, and analysis of blog pages, we establish a set of basic definitions, and we use these definitions and concepts to describe the characteristics of blogs in depth. We first propose a classification of blog pages in two senses: a generalized classification and a narrow classification. We present a method for generalized recognition of blog pages and obtain good results in an experiment. We then carry out an experiment on removing blog navigation pages and, after comparing and analyzing the existing methods, propose a new method.

Data mining in the blog space requires extracting blog posts, comments, and statistical information. Using dozens of major Chinese blog sites as the source, we conduct an extraction experiment, and the results are good.

In summary, this thesis studies blog pages in depth, implements blog page classification, and carries out the relevant experiments. The completed blog page content extraction system lays the groundwork for blog content mining.
Keywords/Search Tags: blog, features analysis, identification of blog pages, content extraction, content mining