
Research And Implementation Of An Incremental Web Information Collection And Extraction System

Posted on: 2012-03-25
Degree: Master
Type: Thesis
Country: China
Candidate: S S Li
GTID: 2178330335952233
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, people increasingly depend on the network to obtain information. The lifetime of a network information resource is usually only a few dozen days, and as time passes a large amount of older information is buried under newer information. How to collect useful information from the Internet more quickly and accurately has therefore become a research hot spot. Large-scale non-incremental collection is already a mature technology. Incremental collection emerged to avoid the time wasted on re-collecting pages that have not changed. To improve the efficiency of update collection and the accuracy of extraction, this thesis focuses on incrementally updated Web information collection and on information extraction based on the Hidden Markov Model (HMM).

The thesis analyzes the background, significance, and current state of research on Web information collection systems, together with the difficulties and challenges they face, and describes the operating principle of an information collection system and the workflow of a web crawler. Building on the core technologies of information collection and information extraction, and taking the requirements of incremental collection into account, it identifies the problems to be solved in developing the system, proposes a concrete design, and builds an incremental information collection system with good performance and scalability. The system comprises the following modules: page collection, page parsing, duplicate URL removal, duplicate page removal, and an updated-page checker. The main work and innovations are as follows:

1. Index pages are introduced to improve the efficiency of discovering new pages; index pages are identified with the FWKNN algorithm.
2. Because plain MD5 comparison is too strict (any change in a page, however minor, alters the hash), the thesis adopts a method based on the Web page framework and rules: noise is removed first, and the MD5 value is then calculated over the body of the page. To some extent, this method improves the accuracy of page similarity analysis.
3. For predicting how frequently a page changes, the thesis analyzes the shortcomings of the Poisson model and introduces an update-frequency calculation window together with the concepts of content analysis and page-belonging analysis, which improves the accuracy of the predicted page change frequency.
4. Based on a study of the Hidden Markov Model, the thesis improves HMM-based information extraction by using regular expressions to extract fixed-format fields and by smoothing the probabilities of unknown observations. Experiments show that the improved extraction method obtains better results.

Finally, experiments on the improved methods are carried out and the resulting data are analyzed; the results indicate that the system achieves its intended goals.
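The duplicate-page check in item 2 (remove noise, then hash the page body) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual framework and rule set, so the noise patterns below (scripts, styles, comments) and the names `NOISE_PATTERNS`, `body_fingerprint`, and `is_duplicate` are assumptions made only for illustration.

```python
import hashlib
import re

# Hypothetical noise rules: the thesis's real framework-and-rule set is
# not given in the abstract, so these patterns are illustrative only.
NOISE_PATTERNS = [
    re.compile(r"<script\b.*?</script>", re.S | re.I),  # inline scripts
    re.compile(r"<style\b.*?</style>", re.S | re.I),    # inline styles
    re.compile(r"<!--.*?-->", re.S),                    # HTML comments
]
TAG = re.compile(r"<[^>]+>")


def body_fingerprint(html: str) -> str:
    """Strip noise and markup, normalize whitespace, then MD5 the text."""
    for pattern in NOISE_PATTERNS:
        html = pattern.sub(" ", html)
    text = TAG.sub(" ", html)        # drop the remaining tags
    text = " ".join(text.split())    # collapse runs of whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_duplicate(html_a: str, html_b: str) -> bool:
    """Treat two pages as duplicates when their body fingerprints match."""
    return body_fingerprint(html_a) == body_fingerprint(html_b)


page_a = "<html><body><script>var t=1;</script><p>Hello  world</p><!-- ad 1 --></body></html>"
page_b = "<html><body><script>var t=2;</script><p>Hello world</p><!-- ad 2 --></body></html>"
page_c = "<html><body><p>Hello there</p></body></html>"
```

Here `page_a` and `page_b` differ only in script and comment content, so hashing the raw HTML would report a change, while the fingerprint of the cleaned body does not; `page_c` differs in its body text and is still detected as a different page.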
Keywords/Search Tags: Incremental collection, Web information collection, Information extraction