
Research on Web Information Extraction Techniques and Applications Based on NFA Regular Expression Matching

Posted on: 2016-04-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Chen
Full Text: PDF
GTID: 2308330470966149
Subject: Computer technology
Abstract/Summary:
With the rapid development of network technology, the Internet has become an integral part of daily life, and extracting the information we need from the Web has become increasingly important. Most template-based extraction software extracts Web information with an NFA-based regular expression matching algorithm. Such software suffers from low extraction efficiency and templates that are hard to maintain, which makes it difficult to apply widely in practice. This thesis studies how to optimize the NFA and improve the NFA construction algorithm. It also analyzes Web page preprocessing and the integration of three frameworks, and finally implements a housing data extraction system, whose main purpose is to provide sample data for a real estate tax assessment system.

First, based on a study of the NFA (Nondeterministic Finite Automaton), an algorithm for constructing the NFA through extension mode is proposed, along with methods that reduce NFA construction time and save memory.

Second, for NFA-based regular expression engines, the thesis proposes a method for constructing optimized regular expressions and provides strategies for drafting the extraction rules of Web sites.

Third, for Web page preprocessing, a method for identifying the character encoding of a Web page is presented, and a template-based de-noising algorithm is proposed that removes both visible and invisible noise from the page.

Fourth, a development model based on the Ext JS, Spring, and Hibernate frameworks is proposed, combining MVC with DAO. IoC and AOP are introduced to separate business logic code from basic operational code, reducing redundant code.

Finally, based on the above methods, the thesis implements a housing data extraction system. The system regularly and automatically collects, extracts, de-noises, and de-duplicates information using extraction rules and the NFA-based regular expression matching algorithm.
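The extract-and-de-duplicate step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the HTML template, the field names (`title`, `area`, `price`), and the function `extract_listings` are all hypothetical, and the matcher is Python's standard `re` module, a backtracking engine, rather than the NFA engine the thesis develops. The rule uses negated character classes such as `[^<]+` instead of `.*?`, one common way to limit backtracking in such engines; whether the thesis uses the same optimization is an assumption.

```python
import re

# Hypothetical extraction rule for one listing template.
# The markup and the field names are illustrative, not taken from the thesis.
LISTING_RULE = re.compile(
    r'<li class="listing">\s*'
    r'<h3>(?P<title>[^<]+)</h3>\s*'            # [^<]+ cannot run past the closing tag
    r'<span class="area">(?P<area>\d+(?:\.\d+)?)</span>\s*'
    r'<span class="price">(?P<price>\d+)</span>\s*'
    r'</li>'
)


def extract_listings(html):
    """Apply the extraction rule to a page and drop duplicate records."""
    seen = set()
    records = []
    for match in LISTING_RULE.finditer(html):
        record = (
            match.group("title").strip(),
            float(match.group("area")),
            int(match.group("price")),
        )
        # De-duplicate while preserving the order of first appearance.
        if record not in seen:
            seen.add(record)
            records.append(record)
    return records


# Made-up sample page: the third item repeats the first.
SAMPLE_PAGE = """
<ul>
<li class="listing"> <h3>Garden Court 2-101</h3> <span class="area">89.5</span> <span class="price">1200000</span> </li>
<li class="listing"> <h3>Lake View 5-302</h3> <span class="area">120</span> <span class="price">2100000</span> </li>
<li class="listing"> <h3>Garden Court 2-101</h3> <span class="area">89.5</span> <span class="price">1200000</span> </li>
</ul>
"""

if __name__ == "__main__":
    for row in extract_listings(SAMPLE_PAGE):
        print(row)
```

In this sketch each Web site would get its own rule, so a template change means editing one pattern rather than the extraction code.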
Keywords/Search Tags: NFA, Web page de-noising, Regular expression, Extraction rule, Ext JS framework