Research And Implementation Of An Incremental Web Information Collection And Extraction System

Posted on:2012-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:S S Li

GTID:2178330335952233

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid development of the Internet, people increasingly depend on the network to obtain information. Network information resources usually have a lifetime of only a few dozen days, and over time much of the older information is buried under newer information. How to collect useful information from the Internet more quickly and accurately has become a research hot spot. Large-scale non-incremental collection is already a mature technology. Incremental collection emerged to avoid the time wasted re-collecting pages that have not changed. To improve the efficiency of collection updates and the accuracy of extraction, this thesis focuses on incrementally updated web information collection and on information extraction based on the Hidden Markov Model (HMM).

This thesis analyzes the background, significance, and current state of research on web information collection systems, along with the difficulties and challenges that research faces, and describes the operating principle of an information collection system and the workflow of a web crawler. Building on the core technologies of information collection and extraction, and taking the requirements of incremental collection into account, it identifies the problems to be solved in developing the system, proposes a specific design, and builds an incremental information collection system with good performance and scalability. The system includes the following modules: page collection, page parsing, duplicate URL removal, duplicate page removal, and an updated-page checker.

The main work and innovations are as follows:

1. Index pages are introduced to improve the efficiency of discovering new pages; the FWKNN algorithm is used to identify index pages.
2. Because a plain MD5 comparison is too strict, this thesis adopts a method based on the web page framework and rules: it first removes noise and then calculates the MD5 value of the page body. To some extent, this method improves the accuracy of page similarity analysis.
3. To predict how frequently a page changes, this thesis analyzes the shortcomings of the Poisson model and introduces an update-frequency calculation window together with the concepts of content analysis and page-belonging analysis, which improves the accuracy of the predicted change frequency.
4. Based on a study of the Hidden Markov Model, this thesis improves the HMM-based information extraction method by using regular expressions to extract fixed-format fields and by smoothing the probabilities of unknown observations. Experiments show that the improved extraction method obtains better results.

Finally, experiments on the improved methods and an analysis of the experimental data show that the system achieves its intended goals.
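
The idea of removing noise before hashing the page body can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the framework-and-rule-based filter, so the `NOISE_TAGS` set, the `BodyTextExtractor` class, and the `body_fingerprint` function are all hypothetical names chosen for illustration, and the noise rule (skip text inside a fixed set of tags) is an assumed stand-in.

```python
import hashlib
import re
from html.parser import HTMLParser

# Assumed noise rule for illustration only: text inside these tags is ignored.
# The abstract does not say which rules the thesis actually uses.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BodyTextExtractor(HTMLParser):
    """Collects text that is not nested inside any noise tag."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # how many noise tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0:
            self.chunks.append(data)


def body_fingerprint(html: str) -> str:
    """Return the MD5 hex digest of the whitespace-normalized body text."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two pages that differ only in navigation, scripts, and footer.
page_a = "<html><body><nav>Home | News</nav><p>Hello   world</p><script>var t=1;</script></body></html>"
page_b = "<html><body><nav>Home | Sports</nav><p>Hello world</p><script>var t=2;</script><footer>2012</footer></body></html>"
# A page whose body text really changed.
page_c = "<html><body><nav>Home | News</nav><p>Hello there</p></body></html>"

same = body_fingerprint(page_a) == body_fingerprint(page_b)
changed = body_fingerprint(page_a) != body_fingerprint(page_c)
```

Hashing the raw HTML of `page_a` and `page_b` would report a change even though the body text is identical, which is the strictness problem described in item 2. Hashing only the filtered body treats them as the same page while still detecting the real change in `page_c`.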

Keywords/Search Tags:

Incremental collection, Web information collection, Information extraction

Related items

1	Research On Technology Of Information Collection About Digitization Of Ancient Books From The Perspective Of Cultural Relics Conservation
2	Research And Implement Of Web Information Intelligence Collection And Personalized Service System
3	User Web Information Collection And Analysis System Based On The Smart Router
4	The Study On Technology Of Information Collection Based On Web Crawler
5	Design And Implementation Of Web Information Collection System
6	Research And Implement Of Web Information Intelligence Collection And Classification
7	The Study On Technology Of Website Information Collection Based On Web Crawler
8	Development Of Agricultural Field Information Collection And Management System Based On WebGIS
9	Design And Implementation Of Chinese Webpage Automatic Collection And Classification
10	Research On The Ways Of Historic House Museum's Collection Management Based On CIMS