Research On Data Acquisition And Information Extraction Technology For Dynamic Web Applications

Posted on:2020-04-05

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

GTID:2428330572972252

Subject:Computer Science and Technology

Abstract/Summary:

With the development of information technology, society has entered the era of big data. Big data analysis is crucial in business competition, and in the public sector it also plays an important role in promoting economic development and maintaining social stability. Accelerating the development of big data has become an inevitable choice for governments. However, the lack of scientific management and systematic planning of government data has led to fragmented data sources, fragmented data, and fragmented applications and services, and has made it difficult to obtain high-quality data sources for open government data work. How to obtain government data sources efficiently has become a research hotspot in recent years.

This thesis studies how to automatically acquire the heterogeneous and independent data sources in government systems. Based on an analysis of the protocols and structure of government websites, it proposes a dynamic Web page collection framework based on event simulation. The framework introduces a proxy gateway through which JavaScript code can be injected into target sites, and it uses a built-in native browser to parse JavaScript and render pages. In terms of acquisition strategy, the framework improves on the state transition method of existing research and realizes an automated page collection scheme compatible with both dynamic and static websites.

On this basis, the thesis proposes a tree alignment algorithm and a text density algorithm for two typical kinds of semi-structured information in government systems: list information and topic information. The tree alignment algorithm uses characteristics of the HTML DOM tree to identify and segment data records, and uses partial alignment when aligning them, which greatly improves both efficiency and accuracy. The text density algorithm extracts the main content of a news page by exploiting the fact that the text density of the effective information differs significantly from that of the other regions of the page. The two algorithms complement each other and together provide an automated scheme for structured information extraction from government websites. Finally, experiments on several government websites compare the approach with existing algorithms to demonstrate its effectiveness.
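The text density idea in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm, whose details are not given here. It is a minimal approximation under stated assumptions: density is taken as non-link text characters divided by the number of tags in a subtree, only a handful of container tags are considered, and the names `extract_main_text`, `measure`, and `SAMPLE_HTML` are hypothetical.

```python
from html.parser import HTMLParser

# Assumptions for this sketch (not from the thesis):
CONTAINERS = {"div", "article", "section", "td", "body"}  # candidate regions
VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base",
        "col", "embed", "source", "track", "wbr"}         # tags with no end tag
SKIP = {"script", "style"}                                # never counted as text


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # mixed list of str (text) and Node, in document order


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def measure(node, in_link=False):
    """Return (non-whitespace text chars outside links, tag count) for a subtree."""
    if node.tag in SKIP:
        return 0, 1
    in_link = in_link or node.tag == "a"
    chars, tags = 0, 1
    for child in node.children:
        if isinstance(child, str):
            if not in_link:
                chars += len("".join(child.split()))
        else:
            c, t = measure(child, in_link)
            chars, tags = chars + c, tags + t
    return chars, tags


def collect_text(node):
    """Return the text chunks of a subtree in document order, skipping scripts."""
    if node.tag in SKIP:
        return []
    out = []
    for child in node.children:
        out.extend([child] if isinstance(child, str) else collect_text(child))
    return out


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def extract_main_text(html):
    """Return the text of the container element with the highest text density."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_density = None, 0.0
    for node in walk(builder.root):
        if node.tag in CONTAINERS:
            chars, tags = measure(node)
            if chars / tags > best_density:
                best, best_density = node, chars / tags
    return " ".join(" ".join(collect_text(best)).split()) if best else ""


# Hypothetical page: a link-heavy navigation bar, a text-heavy body, a footer.
SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="content">
<h1>Open data notice</h1>
<p>The city published its annual budget dataset today, covering all departments.</p>
<p>Officials said the release is part of a wider open data programme.</p>
</div>
<div id="footer"><a href="/contact">Contact</a> Copyright</div>
</body></html>
"""

print(extract_main_text(SAMPLE_HTML))
```

In this sample the navigation and footer regions score low because their text sits mostly inside links, while the content region has many characters per tag. Real pages would likely need smoothing or thresholds across neighbouring nodes, which this sketch leaves out.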

Keywords/Search Tags:

government big data, event triggering, web page collection, structured information extraction

Related items

1	The Implementation And Application Of Extracting Structured Data From Web Pages
2	Research On Web Page Classification And Information Collection
3	The Research Of Semi-structured Web Pages Information Extraction
4	Research On Event Extraction Based On Structured Learning
5	Design And Implementation Of Grassroots Government Information Collection System
6	Comparison Of Typical Event-triggering Mechanisms And Its Validation In Networked Inverted Pendulum System
7	On Service Triggering In IMS Network
8	Research On Keyword Extraction And Structured List Data Extraction
9	Research On Quantitative Feedback And Event-Triggering Control Of Networked Control Systems
10	Structure Information Extraction: Study And Implementation On Semi-auto Wrapper