Study Of Web Crawler And Web Information Extraction

Posted on:2010-03-14

Degree:Master

Type:Thesis

Country:China

Candidate:Y F Jin

Full Text:PDF

GTID:2178360278466979

Subject:Signal and Information Processing

Abstract/Summary:

The rapid growth of information on the Web makes it a challenge to use that information effectively. Obtaining the needed information from the Web quickly and accurately has become an urgent problem. To make full use of Web information, this thesis studies Web crawlers and Web information extraction.

First, we introduce the basic principles of a Web crawler. On this basis, we analyze several key technologies: Web page gathering, URL extraction and normalization, storage of raw pages, and avoidance of gathering the same page twice. Taking the essentials of a distributed system into consideration, we then analyze four aspects of a distributed Web crawler: parallelism, load balancing, architecture, and scalability. We design and implement a distributed Web crawler that runs on a local area network. Two experiments are carried out to evaluate its performance. The results show that the distributed Web crawler achieves load balancing and scales well.

Second, we discuss Web information extraction on the basis of the Web crawler. Considering practical needs, we develop a Web information extraction system based on extend-XPath, which uses XPath expressions to locate data items and organizes all XPath expressions into a tree that serves as the complete information extraction formula. To make the system more general, the formula is kept separate from the information extraction system.
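
The idea of a tree of XPath expressions kept separate from the extraction engine can be illustrated with a minimal Python sketch. It is not the thesis's extend-XPath implementation: the rule format (`RULES`), the `extract` function, and the sample page are all hypothetical, and the sketch assumes the page is well-formed XML so that the standard library's limited XPath support is sufficient.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule tree (an assumption, not the thesis's format).
# Each node names a data item and gives an XPath relative to its parent.
# Because the rules are plain data, they could be stored outside the program.
RULES = {
    "name": "products",
    "xpath": ".//div[@class='item']",
    "children": [
        {"name": "title", "xpath": "./h2"},
        {"name": "price", "xpath": "./span[@class='price']"},
    ],
}


def extract(node, rule):
    """Apply one rule to `node` and return a list of records or text values."""
    results = []
    for match in node.findall(rule["xpath"]):
        children = rule.get("children")
        if children:
            # Inner rule: build one record per matched element.
            record = {}
            for child in children:
                values = extract(match, child)
                record[child["name"]] = values[0] if values else None
            results.append(record)
        else:
            # Leaf rule: the data item is the element's text content.
            results.append("".join(match.itertext()).strip())
    return results


# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div class="item"><h2>Widget</h2><span class="price">9.50</span></div>
<div class="item"><h2>Gadget</h2><span class="price">12.00</span></div>
</body></html>"""

records = extract(ET.fromstring(PAGE), RULES)
for record in records:
    print(record)
```

Real Web pages are rarely well-formed XML, so a practical system would need an HTML-tolerant parser in front of this step; the sketch only shows how a rule tree drives the extraction.
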
Practice shows that the extraction system extracts data from Web pages accurately and is suitable for small- and medium-scale Web information extraction. We also analyze the problem of phone number recognition encountered during Web information extraction and solve it by hybrid programming with Matlab and C++.

Practical application shows that these results help make more effective use of Web information, meet the needs of small and medium enterprises for Web information gathering and extraction, and have practical value.
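
Two of the crawler techniques named in the abstract, URL normalization and avoiding gathering the same page twice, can be sketched in Python as follows. The abstract does not say how the thesis implements them, so the normalization rules chosen here (resolve relative links, lowercase scheme and host, drop default ports and fragments) and the names `normalize` and `Frontier` are assumptions for illustration only.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(base_url, link):
    """Resolve `link` against `base_url` and return a canonical form.

    Assumed rules: lowercase scheme and host, drop default ports and
    fragments, use "/" for an empty path. User info is discarded.
    """
    parts = urlsplit(urljoin(base_url, link))
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


class Frontier:
    """Queue of URLs to fetch that refuses URLs it has already seen."""

    def __init__(self):
        self._seen = set()
        self._queue = deque()

    def add(self, base_url, link):
        """Queue the link if new; return True when it was queued."""
        url = normalize(base_url, link)
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def pop(self):
        """Return the next URL to fetch, in first-in, first-out order."""
        return self._queue.popleft()


frontier = Frontier()
page = "http://Example.com/a/b.html"
for href in ["../c.html#top", "HTTP://EXAMPLE.com:80/c.html", "d.html?x=1"]:
    print(href, "queued" if frontier.add(page, href) else "duplicate")
```

An in-memory set is enough for a sketch; a crawler distributed over several machines would need a shared or partitioned record of seen URLs, which this example does not attempt.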

Keywords/Search Tags:

Web Crawler, Information Extraction, Search Engine, Distributed System

Related items

1	Research Of A Distributed Web Crawler Search Engine Based On Web Information Collection
2	Distributed Web Crawler System
3	Design And Implementation Of Search Engine Based On Web Crawler
4	Research Of Intranet Information Supervision System Based On Net Crawler And Full-text Search Engine
5	The Research Of Distributed Price Search Engine Based On DHT
6	Research Of Main Technologies Of Vertical Search Engine
7	The Research And Implementation Of Cubic Relationship Search Engine In Taiwan Field
8	The Research On Web Crawler Technology Based On Distributed Calculation
9	Distributed Web Crawler System Design And Implementation
10	Research And Implement Of Individualized Vertical Search Engine