Design And Implementation Of A Directional Information Extraction Model For Dynamic Web Pages

Posted on: 2017-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: J Sheng
Full Text: PDF
GTID: 2348330503982540
Subject: Computer Science and Technology
Abstract/Summary:
With the emergence and rapid development of Web 2.0, more and more dynamic web pages have appeared on the Internet. Ajax enables asynchronous data transfer between clients and servers, which has not only improved the user experience but also promoted the spread of dynamic web pages and the development of the Internet. However, it also means that traditional web crawlers, which extract information from the HTML source code alone, cannot extract the dynamically loaded content of such pages. The study of information extraction from dynamic web pages therefore has practical significance. To this end, this thesis proposes a directional information extraction model for dynamic web pages.

First, the theory and technology related to information extraction from dynamic web pages are analyzed. The web pages under study are divided into two categories, static and dynamic, and compared in detail. On this basis, the challenges that the widespread use of Ajax in dynamic web pages poses for information extraction are analyzed. Finally, the roles of HTML, the DOM, and regular expressions in information extraction are introduced in detail.

Second, the shortcomings of traditional web crawlers when crawling dynamic web pages are analyzed, and a directional information extraction model for dynamic web pages is proposed. Pages are fetched with HTTP requests. HtmlUnit is used to parse and execute JavaScript code and to simulate form submission. jsoup is used to build the DOM of a page. The extracted data and URLs are then stored in a database.

Third, a concrete implementation of each module of the proposed model is given. Dynamic web pages are crawled using a breadth-first search strategy, a Bloom filter is used to handle duplicate URLs, regular expressions and jsoup selectors are used to extract page information and URL links, and multithreaded crawling is adopted to improve the performance of the model.

Finally, based on the proposed model, the Baidu Tieba forum of Yanshan University is selected as a case study. The experiment is designed around two aspects of the model: efficiency and performance. The results show that the proposed model performs well on the Precision, Recall, and F-Measure evaluations, and its efficiency and performance are verified.
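The abstract names breadth-first crawling with Bloom-filter URL deduplication but gives no implementation detail, so the following is only a minimal sketch of that general technique, not the thesis's code. It is written in Python for brevity (the tools named in the abstract, HtmlUnit and jsoup, are Java libraries). The names `BloomFilter`, `bfs_crawl`, `get_links`, and `TOY_SITE` are hypothetical, and an in-memory link graph stands in for real page fetching and link extraction.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from one SHA-256 digest using
        # double hashing: position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def bfs_crawl(seed, get_links, max_pages=100):
    """Visit URLs breadth-first, skipping any URL the filter has already seen."""
    seen = BloomFilter()
    seen.add(seed)
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        # get_links is an assumed callback: URL in, list of outgoing URLs out.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical link graph standing in for fetched pages.
TOY_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": [],
}

visit_order = bfs_crawl("/", lambda url: TOY_SITE.get(url, []))
```

A Bloom filter trades a small false-positive rate for constant memory, so a crawler built this way can occasionally skip a URL it has not actually visited. Whether that trade-off is acceptable depends on the crawl size and the filter parameters, which the abstract does not state.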
Keywords/Search Tags: Directional information extraction model, Dynamic web page, Web crawler, Dynamic script, Information extraction