Research Of Extracting Interactive Web Contents Based On Incremental Update

Posted on: 2012-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: H C Dong
Full Text: PDF
GTID: 2218330368981947
Subject: Computer software and theory
Abstract/Summary:
In managing online public opinion and Internet intelligence, analysts need the content of forum threads in order to study topic sentiment and how forum topics spread. Given the flood of forum messages, rapid extraction of forum content makes it easier to identify opinion-bearing posts and to track the direction of public opinion. However, because page layouts are complex and users post freely, efficiently extracting structured content from forum pages is very difficult.

Building on a survey of domestic and international research on interactive web content extraction and on web update scheduling, and guided by the distinctive characteristics of interactive web pages, this thesis presents a method for extracting interactive web content based on incremental update. The main contributions are as follows.

First, the thesis presents a new technique for extracting interactive web content based on incremental update. The method overcomes the problems caused by changes in page structure and content and extracts the content effectively. Each web page is converted into a DOM (Document Object Model) tree, which is then matched against templates. When a template does not match exactly, fuzzy matching and repetition matching are used instead. Finally, the page content is extracted. The method adapts to differences in structure and content among the posts of a single interactive web page and correctly extracts the content of each post in the thread.
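The DOM-template matching with a fuzzy fallback described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `Node` class, the `tag.class` labels, the 0.85 similarity threshold, and the use of `difflib.SequenceMatcher` as the fuzzy measure are all assumptions made for the example, and repetition matching is not shown.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical, simplified DOM node: tag, class name, text, children."""
    tag: str
    cls: str = ""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def label(node: Node) -> str:
    # Assumed template vocabulary: "tag.class", e.g. "div.post".
    return f"{node.tag}.{node.cls}"


def matches(node: Node, template_label: str, threshold: float) -> bool:
    # Exact match first; otherwise fall back to a fuzzy string comparison.
    if label(node) == template_label:
        return True
    ratio = SequenceMatcher(None, label(node), template_label).ratio()
    return ratio >= threshold


def find(node: Node, template_label: str, threshold: float = 0.85) -> Iterator[Node]:
    """Yield the outermost nodes in the subtree that match the template label."""
    if matches(node, template_label, threshold):
        yield node
        return  # do not descend into a subtree that already matched
    for child in node.children:
        yield from find(child, template_label, threshold)


def extract_posts(root: Node, post_label: str, body_label: str) -> List[str]:
    """Return the body text of every post found under root."""
    texts = []
    for post in find(root, post_label):
        bodies = [b for child in post.children for b in find(child, body_label)]
        texts.append(" ".join(b.text for b in bodies))
    return texts


# Two posts on one page; the second uses slightly different class names.
page = Node("div", "thread", children=[
    Node("div", "post", children=[Node("div", "post-body", text="first")]),
    Node("div", "post2", children=[Node("div", "post_body", text="second")]),
])
posts = extract_posts(page, "div.post", "div.post-body")
```

Here the second post no longer matches the template exactly, but its labels are similar enough for the fuzzy comparison to accept it, so both bodies are recovered. A real extractor would compare whole subtrees rather than single labels.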
This template-matching method applies to most forum content extraction tasks, with good efficiency and generality.

Second, because crawling and preprocessing of interactive websites is inefficient, the thesis uses the characteristic features of such sites to present a method that crawls interactive web pages incrementally and extracts their content incrementally. The method extracts the changed content of a page accurately and promptly, saves crawling and preprocessing time, and improves efficiency. The thesis also proposes a reward-and-penalty policy that assigns priorities to a forum's main boards. Based on how frequently a board's pages change, the policy dynamically adjusts that board's crawl time, further improving the efficiency of the overall extraction process.

Finally, experiments were carried out on both contributions. Comparison and analysis of the simulation results verify the feasibility and effectiveness of the proposed schemes.
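The reward-and-penalty scheduling idea from the abstract can be illustrated with a small sketch. The abstract does not give the actual policy, so everything concrete here is an assumption: the `Board` class, the halving and doubling factors, the interval bounds, and the `has_new_posts` callback are hypothetical names and values chosen only to show the shape of such a scheduler.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List

MIN_INTERVAL = 60.0    # seconds; assumed lower bound on the crawl interval
MAX_INTERVAL = 3600.0  # seconds; assumed upper bound on the crawl interval


@dataclass(order=True)
class Board:
    """Hypothetical forum board, ordered in the queue by its next crawl time."""
    next_due: float
    name: str = field(compare=False)
    interval: float = field(compare=False, default=600.0)


def adjust(board: Board, changed: bool) -> None:
    # Reward a board that changed (crawl it sooner); penalize one that did not.
    factor = 0.5 if changed else 2.0  # assumed factors, not from the thesis
    board.interval = min(MAX_INTERVAL, max(MIN_INTERVAL, board.interval * factor))


def crawl_round(queue: List[Board], now: float,
                has_new_posts: Callable[[str], bool]) -> List[str]:
    """Crawl every board that is due at `now`, then reschedule it."""
    crawled = []
    while queue and queue[0].next_due <= now:
        board = heapq.heappop(queue)
        adjust(board, has_new_posts(board.name))
        board.next_due = now + board.interval  # always later than `now`
        heapq.heappush(queue, board)
        crawled.append(board.name)
    return crawled


# One active board and one quiet board, both due immediately.
queue = [Board(0.0, "news"), Board(0.0, "archive")]
heapq.heapify(queue)
first_round = crawl_round(queue, 0.0, lambda name: name == "news")
```

After one round the active board's interval is halved and the quiet board's is doubled, so the active board moves to the front of the queue. Incremental extraction of only the changed posts would run inside the crawl step and is left out of this sketch.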
Keywords/Search Tags:Online public opinion, Interactive web, Incremental update, Content extraction, Template