
Intelligent Information Extraction Technology for the Web

Posted on: 2010-08-30
Degree: Master
Type: Thesis
Country: China
Candidate: C S Zheng
GTID: 2208360275983371
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the national economy, the expansion of the national information infrastructure, and rising living standards, the network has penetrated everyday life and become an essential part of both work and daily life. How to obtain information from the Web quickly and efficiently has therefore become an important research topic. However, information on the network is highly varied, page structures and styles change frequently, and most pages also contain a large amount of noise, such as advertisements, navigation bars, and hot links, all of which cause considerable difficulty for researchers. Current information extraction technologies have many deficiencies: they handle only one type of Web page, extract results at a coarse level of detail, face a conflict between accuracy and efficiency, depend on manual intervention rather than operating intelligently, and do not support incremental information processing. To address these problems, this thesis develops a new information extraction method.

The thesis is based on a template-based information extraction algorithm. First, the rule generator identifies the partition tags of the target entities in a Web page. Next, the template generator configures these partition tags into a template. Finally, the information extractor extracts the relevant information from pages of the same site according to that site's template. Specifically, the work includes the following:

1. By analyzing page structure, Web page layout, and the rules governing the distribution of tags, and drawing on current information extraction technology in China and abroad, the author derives a set of templates that can describe Web page structures of any form, and designs a method for configuring the templates automatically.

2. The information extractor reads the configured template and extracts information from the Web page according to that configuration. An incremental/multi-page processing algorithm handles two cases: content on the same topic that is distributed across several pages and must be merged, and Web content on a subject that is updated dynamically over time and must be extracted incrementally. A duplicate elimination algorithm handles similar or identical content on the same subject that appears on different sites.

3. The structured storage component saves the information extracted according to the configured template in structured form. The information extraction system is designed to be dynamic: it extracts from different Web pages according to different needs by changing the template configuration rather than the code.

The thesis proposes, in theory, an information extraction algorithm that uses templates to extract information automatically from many types of pages, and develops a corresponding system. Results in practice show that the system achieves faster extraction, higher accuracy, and higher recall than current Web-based information extraction techniques.
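The pipeline described above (partition tags stored in a per-site template, an extractor driven by that template, and incremental extraction that skips repeats) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the `TEMPLATE` layout, the tag strings, and the function names `extract`, `fingerprint`, and `extract_incremental` are all hypothetical, and exact-hash matching stands in for the thesis's duplicate elimination algorithm, which also covers similar (not only identical) content.

```python
import hashlib
import re

# Hypothetical per-site template: field name -> (opening partition tag, closing partition tag).
TEMPLATE = {
    "title": ('<h1 class="title">', "</h1>"),
    "body": ('<div id="content">', "</div>"),
}


def extract(page: str, template: dict) -> dict:
    """Return the text found between each field's partition tags."""
    record = {}
    for field, (start_tag, end_tag) in template.items():
        start = page.find(start_tag)
        if start == -1:
            continue  # field is absent on this page
        start += len(start_tag)
        end = page.find(end_tag, start)
        if end == -1:
            continue  # closing partition tag is missing
        # Strip nested markup and collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", page[start:end])
        record[field] = " ".join(text.split())
    return record


def fingerprint(record: dict) -> str:
    """Hash of the normalized record, used to detect exact repeats (an assumption)."""
    canonical = "|".join(f"{key}={record[key].lower()}" for key in sorted(record))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def extract_incremental(pages, template: dict, seen: set) -> list:
    """Extract only records whose fingerprint has not been stored before."""
    fresh = []
    for page in pages:
        record = extract(page, template)
        if not record:
            continue
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh
```

In this sketch, supporting a different site means supplying a different template dictionary; the extraction code itself does not change, which mirrors the configuration-driven design the abstract describes. The `seen` set is kept by the caller so that a later crawl only returns records that were not extracted earlier.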
Keywords/Search Tags: Information extraction, Rule generator, Template generator, Incremental/multi-page processing