Data Collection And Data Set Establishment For Hot Topics From The Internet

Posted on: 2012-12-06
Degree: Master
Type: Thesis
Country: China
Candidate: Z X Zhang
GTID: 2178330335460852
Subject: Communication and Information System
Abstract/Summary:
Hot topics that emerge on complex networks have attracted wide attention, and research on the emergence of hot topics and on community detection is growing. Before hot topics can be studied, data must be collected from the Internet and organized into data sets. This thesis analyzes the sources of data on the Internet through both content analysis and link analysis, and examines the page structure of forums, blogs, news sites and micro-blogs. A wrapper induction method based on visual characteristics is proposed to collect the data and, finally, to establish the data set.

This thesis focuses on the following work:

1. Web pages are analyzed in a structured way: pages are summarized according to block theory and divided into content blocks, function blocks and link blocks, from which page models are extracted.
2. A page extraction method based on human visual characteristics is studied in detail. It extracts the information of interest according to the visual characteristics of human reading.
3. An information extraction system based on a crawler and the wrapper described above is designed. 300 web pages from 10 websites are collected, and data sample sets are established based on statistical characteristics. Tests show that precision, recall and F-score are all above 90%.
4. Data are collected from dig-u.com, a popular micro-blog website, and a standard data set is established for further study. The number of nodes is 200862, the number of friend connections is 4345668, and the number of followers is 4344453.
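The abstract reports precision, recall and F-score but does not say how they were computed. The sketch below shows the standard set-based definitions, assuming the F-score is the balanced F1 measure and that each extracted item either matches a hand-labelled item exactly or does not. The function name `precision_recall_f1` and the sample field names are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(extracted, expected):
    """Compare extracted items against a hand-labelled expected set.

    Assumes exact-match scoring and the balanced F1 measure; the thesis
    may have used a different matching rule or F-score weighting.
    """
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sample: 10 labelled fields, 10 extracted, 9 of them correct.
expected = {f"field_{i}" for i in range(10)}
extracted = {f"field_{i}" for i in range(9)} | {"noise"}

p, r, f = precision_recall_f1(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Under these assumptions, a wrapper passes the 90% threshold quoted above only when all three values returned for its test pages exceed 0.9.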
Keywords/Search Tags: Data Collection, Visual Characteristics, Wrapper Induction, Data Set Establishment