
Research And Implementation Of Information Collection Technology In Network Forum

Posted on: 2015-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: P Yang
Full Text: PDF
GTID: 2208330431976736
Subject: Computer system architecture
Abstract/Summary:
In recent years, online communities such as BBS, post bars, blogs, and microblogs have become important vehicles for releasing, sharing, and exchanging information, and BBS in particular has become a major platform for distributing it. BBS content is created and published by ordinary users, so it mainly reflects the public's point of view. Analyzing these views can reveal the public's attitude toward government policy and can help identify and filter malicious information. The study of online community information is therefore valuable for applications such as public opinion analysis and monitoring, and the monitoring and management of malicious information.
Both applications need BBS posts as raw data for further analysis. Crawling the data is thus the precondition for data analysis and the foundation of all other work in the system. However, the traditional technique of crawling the web page by page is not suitable for BBS data, for three reasons:
1. Traditional crawlers ignore the internal structure of BBS pages and the correlation between pages.
2. Traditional crawlers have no post-filtering function: they analyze the links in a page, not the page content.
3. Traditional crawlers save the complete page without processing the data it contains.
A BBS crawler must therefore consider the internal structure of BBS pages, the correlation among pages, a filtering function that avoids fetching and storing duplicates, and the processing of post-page information. To address these deficiencies of traditional crawlers on BBS sites, this thesis builds on the study of general crawlers, web page updating, and scheduling, analyzes the characteristics of BBS, and obtains results on crawling and acquiring online community information. The main research achievements and contributions are as follows:
1) The characteristics of BBS are studied, and a crawl strategy is customized for BBS based on the three-tier logical architecture of BBS web pages and on the principles of general crawlers. In line with these characteristics, a filtering validation mechanism is proposed and used, based on the URL link address format and on content detection. For the content-detection check, an improved multiple pattern matching algorithm based on BG is proposed and used, namely the BGq-Grams_u Unrolling_s-Shift series algorithm. Experiments on the BBS crawl strategy and on the BBS crawler with the link filtering mechanism show that both recall and accuracy are very high, giving a good crawling result.
2) Websites are updated regularly, but BBS sites are updated faster and their post content grows incrementally. Based on a study of BBS incremental updates, the BBS update data are analyzed statistically. An incremental updating strategy based on a connection pool is employed, different crawl strategies are applied to different kinds of incremental data, and together these form an incremental crawl algorithm for BBS with high accuracy and recall. Applying a crawl scheduling algorithm during crawling sharply reduces the total system delay.
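The filtering mechanism in contribution 1) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the URL formats for the three tiers, the reading of "content detection" as a keyword check on post text, and the function names. The keyword check uses a plain regular-expression alternation as a stand-in and is not the BG-based multiple pattern matching algorithm the thesis proposes.

```python
import re
from typing import Iterable, Optional, Pattern, Set

# Hypothetical URL formats for the three tiers of a BBS site:
# board list -> thread list of one board -> post page of one thread.
TIER_PATTERNS = {
    "board_list": re.compile(r"^/forum/?$"),
    "thread_list": re.compile(r"^/forum/board/(\d+)(?:\?page=(\d+))?$"),
    "post": re.compile(r"^/forum/thread/(\d+)(?:\?page=(\d+))?$"),
}


def classify_url(path: str) -> Optional[str]:
    """Return the tier whose URL format matches, or None for off-structure links."""
    for tier, pattern in TIER_PATTERNS.items():
        if pattern.match(path):
            return tier
    return None


def build_keyword_matcher(keywords: Iterable[str]) -> Pattern[str]:
    """Compile all keywords into one case-insensitive alternation.

    This is a simple stand-in for a multiple pattern matching algorithm.
    """
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    if not escaped:
        raise ValueError("at least one keyword is required")
    return re.compile("|".join(escaped), re.IGNORECASE)


def should_store(path: str, text: str, seen: Set[str], matcher: Pattern[str]) -> bool:
    """Apply the URL-format check, the duplicate check, then the content check."""
    if classify_url(path) != "post":
        return False  # only post pages are stored
    if path in seen:
        return False  # already fetched and stored
    if matcher.search(text) is None:
        return False  # content detection found no keyword
    seen.add(path)
    return True
```

In this sketch a crawler would call `classify_url` on every extracted link to decide whether to follow it, and `should_store` on each fetched post page before writing it to storage.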
Keywords/Search Tags: Data crawl, Filtering mechanism, Incremental updating, Information extraction