Study Of Web Crawler And Web Information Extraction

Posted on: 2010-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: Y F Jin
Full Text: PDF
GTID: 2178360278466979
Subject: Signal and Information Processing
Abstract/Summary:
The rapid growth of information on the Web poses an enormous challenge: how to use it effectively. Obtaining the needed information from the Web promptly and accurately has become an urgent problem. To make full use of Web information, this thesis studies Web crawlers and Web information extraction.

First, we introduce the basic principles of a Web crawler. On this basis, we analyze several key technologies: Web page gathering, URL extraction and normalization, storage of raw pages, and avoiding repeated gathering of the same page. Then, taking the essentials of a distributed system into consideration, we analyze four aspects of a distributed Web crawler: parallelism, load balancing, architecture, and scalability. We then design and implement a distributed Web crawler based on a local area network. Two experiments are carried out to evaluate its performance. The results show that the distributed Web crawler achieves load balancing and scales well.

Second, building on the Web crawler, we discuss Web information extraction. To meet practical needs, we develop a Web information extraction system based on extend-XPath, which uses XPath expressions to locate data items and organizes all the XPath expressions into a tree that serves as the complete information extraction formula. To make the system more general, the formula is kept separate from the information extraction system. Practice shows that the system extracts data from Web pages accurately and is suitable for small- and medium-scale Web information extraction.

We also analyze the problem of phone number recognition encountered during Web information extraction, and solve it through hybrid programming in Matlab and C++.

Practical application shows that these results help make more effective use of Web information, meet the needs of small and medium enterprises for Web information gathering and extraction, and have practical value.
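
As an illustration of the idea of organizing XPath expressions into a tree that is kept apart from the extraction code, the following is a minimal Python sketch. It is not the thesis's extend-XPath system, whose syntax the abstract does not describe. The rule layout (`FORMULA`), the function `extract`, the field names, and the sample page are all hypothetical. It uses only the limited XPath subset of Python's standard `xml.etree.ElementTree`, and it assumes the page is well-formed XML; real Web pages would usually need an HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction formula: a tree of XPath expressions held as plain
# data, so it could be stored in a file separate from the extraction code.
FORMULA = {
    "name": "companies",
    "xpath": ".//div[@class='company']",
    "children": [
        {"name": "title", "xpath": "./h2"},
        {"name": "phone", "xpath": "./span[@class='phone']"},
    ],
}


def extract(node, rule):
    """Apply one rule to an element, recursing into any child rules."""
    matches = node.findall(rule["xpath"])
    children = rule.get("children")
    if not children:
        # Leaf rule: return the text of the first match, or None if absent.
        if not matches:
            return None
        return "".join(matches[0].itertext()).strip()
    # Inner rule: build one record per matched element, evaluating each
    # child XPath relative to that element.
    return [{c["name"]: extract(m, c) for c in children} for m in matches]


# Made-up, well-formed sample page used only for this illustration.
PAGE = """<html><body>
<div class="company"><h2>Acme Ltd</h2><span class="phone">010-1234567</span></div>
<div class="company"><h2>Beta Co</h2></div>
</body></html>"""

root = ET.fromstring(PAGE)
result = extract(root, FORMULA)
print(result)
```

Because the rule tree is ordinary data, a different site could in principle be handled by supplying a different `FORMULA` without changing `extract`, which is one plausible reading of why the abstract separates the formula from the system.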
Keywords/Search Tags: Web Crawler, Information Extraction, Search Engine, Distributed System