Research On Data Acquisition And Information Extraction Technology For Dynamic Web Applications

Posted on: 2020-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2428330572972252
Subject: Computer Science and Technology
Abstract/Summary:
With the development of information technology, society has entered the era of big data. Big data analysis is crucial in business competition, and in the public sector it also plays an important role in promoting economic development and maintaining social stability. Accelerating the development of big data has therefore become an inevitable choice for governments. However, a lack of scientific management and systematic planning has left government data sources, data, and applications and services fragmented, and has made high-quality data sources difficult to obtain when opening up government data. How to obtain government data sources efficiently has become a research hotspot in recent years.

This thesis studies how to automatically acquire the heterogeneous, independent data sources in government systems. Based on an analysis of the protocols and structure of government websites, it proposes a dynamic web page collection framework based on event simulation. The framework introduces a proxy gateway through which JavaScript code can be injected into target sites, and it performs JavaScript parsing and page rendering with a built-in native browser. In terms of acquisition strategy, the framework improves on the state transition method of existing research and realizes an automated page collection scheme compatible with both dynamic and static websites.

On this basis, the thesis proposes a tree alignment algorithm and a text density algorithm for extracting the two typical kinds of semi-structured information in government systems: list information and topic information. The tree alignment algorithm uses characteristics of the HTML DOM tree to identify and segment data records, and uses partial alignment when aligning them, which the thesis reports greatly improves efficiency and accuracy. The text density algorithm extracts the main content of a news page based on the observation that the text density of the effective content differs significantly from that of other regions of the page. The two algorithms complement each other and together provide an automated scheme for structured information extraction from government websites. Finally, experiments on several government websites compare the approach with existing algorithms to demonstrate its effectiveness.
Keywords/Search Tags: government big data, event triggering, web page collection, structured information extraction
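
The text density idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual formula, so everything below is an assumption made for illustration: density is taken as non-link text characters divided by the number of tags in a subtree, and the main content is taken to be the node whose direct children have the largest summed density. The names Node, TreeBuilder, extract_main_text and SAMPLE_HTML are hypothetical, and the sample page is made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"br", "img", "hr", "meta", "link", "input", "area", "base", "col", "wbr"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []   # mix of Node objects and text strings, in document order
        self.density = 0.0


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; tolerant of stray closing tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore if none.
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        text = data.strip()
        if text and self.cur.tag not in SKIP:
            self.cur.children.append(text)


def stats(node, in_link=False):
    """Return (chars, link_chars, tags) for a subtree and store its density."""
    in_link = in_link or node.tag == "a"
    chars = link_chars = 0
    tags = 1
    for child in node.children:
        if isinstance(child, str):
            chars += len(child)
            link_chars += len(child) if in_link else 0
        else:
            c, l, t = stats(child, in_link)
            chars, link_chars, tags = chars + c, link_chars + l, tags + t
    # Assumed density measure: non-link characters per tag in the subtree.
    node.density = (chars - link_chars) / tags
    return chars, link_chars, tags


def walk(node):
    yield node
    for child in node.children:
        if isinstance(child, Node):
            yield from walk(child)


def density_sum(node):
    return sum(c.density for c in node.children if isinstance(c, Node))


def text_of(node):
    for child in node.children:
        if isinstance(child, str):
            yield child
        else:
            yield from text_of(child)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    stats(builder.root)
    best = max(walk(builder.root), key=density_sum)
    return " ".join(text_of(best))


# Made-up sample page: a link-heavy menu, two content paragraphs, a footer.
SAMPLE_HTML = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/about">About</a></div>
<div id="content">
<p>The municipal government published its annual open data report on Monday, covering transport, health and education datasets.</p>
<p>Officials said the number of published datasets grew compared with the previous year and that more departments now participate.</p>
</div>
<div id="footer"><a href="/contact">Contact</a> <span>Copyright</span></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_HTML))
```

Summing the densities of direct children, rather than taking the single densest node, is a design choice in this sketch: a lone paragraph always has a higher characters-per-tag ratio than its container, so ranking by raw density would return one paragraph instead of the whole article body.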