Design And Implementation Of A Directional Information Extraction Model For Dynamic Web Pages

Posted on:2017-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:J Sheng

Full Text:PDF

GTID:2348330503982540

Subject:Computer Science and Technology

Abstract/Summary:

With the emergence and rapid development of Web 2.0, more and more dynamic web pages have appeared on the Internet. Ajax enables asynchronous data transfer between clients and servers, which has improved the user experience and promoted both the spread of dynamic web pages and the development of the Internet. However, because traditional web crawlers extract information from the HTML source code alone, they cannot extract the dynamically loaded content of such pages. Studying information extraction from dynamic web pages therefore has practical significance. To this end, a directional information extraction model for dynamic web pages is proposed.

First, the theory and technology related to information extraction from dynamic web pages are analyzed. The web pages under study are divided into two categories, static and dynamic, and the two are compared in detail. On this basis, the challenges that the widespread use of Ajax in dynamic web pages poses for information extraction are analyzed. Finally, the roles of HTML, the DOM, and regular expressions in information extraction are introduced in detail.

Second, the shortcomings of traditional web crawlers when crawling dynamic web pages are analyzed, and a directional information extraction model for dynamic web pages is proposed. Pages are fetched through HTTP requests. HtmlUnit is used to parse and execute JavaScript code and to simulate form submission. jsoup is used to build the DOM of a page. The extracted data and URLs are then stored in a database.

Third, a concrete implementation of each module of the proposed model is given. Dynamic web pages are crawled using a breadth-first search strategy, a Bloom filter is used to eliminate duplicate URLs, regular expressions and jsoup selectors are used to extract page information and URL links, and multithreaded crawling is adopted to improve the performance of the model.

Finally, the proposed model is evaluated in a case study on the Baidu Tieba forum of Yanshan University. The experiment is designed around two aspects of the model: efficiency and performance. The results show that the proposed model performs well in terms of Precision, Recall, and F-Measure, and they verify its efficiency and performance.
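The crawling and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation, which is built on HtmlUnit and jsoup. It is a Python illustration under stated assumptions: pages come from a hypothetical in-memory dictionary (`PAGES`) instead of HTTP requests, links are found with a simple regular expression rather than jsoup selectors, JavaScript execution and multithreading are left out, and F-Measure is taken to be the balanced F1 score. All class, function, and parameter names are invented for this example.

```python
import hashlib
import re
from collections import deque


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Simplified link pattern (assumption): double-quoted href attributes only.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def crawl_breadth_first(seed, fetch, max_pages=100):
    """Visit pages in breadth-first order, skipping URLs the filter has seen.

    `fetch` is any callable mapping a URL to its HTML text, or None on failure.
    """
    seen = BloomFilter()
    seen.add(seed)
    queue = deque([seed])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in HREF_PATTERN.findall(html):
            if link not in seen:  # a false positive would skip a new URL
                seen.add(link)
                queue.append(link)
    return visited


def precision_recall_f1(extracted, relevant):
    """Precision, recall, and balanced F1 of extracted items against a gold set."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a> <a href="/d">d</a>',
    "/c": '<a href="/d">d</a>',
    "/d": "",
}
```

Calling `crawl_breadth_first("/a", PAGES.get)` walks the four sample pages level by level and queues each URL only once, even though `/a` and `/d` are linked more than once. The Bloom filter trades a small chance of skipping an unseen URL for constant memory per URL, which is the usual reason to prefer it over an exact set in a large crawl.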

Keywords/Search Tags:

Directional information extraction model, Dynamic web page, Web crawler, Dynamic script, Information extraction

Related items

1	Vertical Search Engine For Crawling The Web Page Design And Implementation
2	Research On Multi-page Special Web Page Text Extraction And Merging Technology
3	Based On Templated Web Crawler Technology Of Web Page Information Extraction
4	Research On Internet Public Opinion Information Extraction And Classification
5	Research On Data Acquisition And Information Extraction Technology For Dynamic Web Applications
6	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology
7	The Design And Implementation Of Distributed Web Crawler System Based On Automatic Extraction Of Webpage Information
8	Research On Web Page Classification And Information Collection
9	Study On Model And Algorithm Of Dynamic Feature Fusion Based On Information Sources Selection And Sequential Extraction
10	Research On Entity-level Search Crawler And Information Extraction