
Research on Automatic Web Information Extraction Technology

Posted on: 2013-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: C L Liao
GTID: 2248330374486007
Subject: Computer software and theory
Abstract/Summary:
With the advancement of science and the development of network technology, the Internet has become part of our lives across many fields. As the volume of web page data grows daily, the question is how to obtain the information that is valuable and of interest. Aiming to access the semi-structured data in these pages, this thesis studies and implements a model-based information extraction method and applies it to a working system, M-IE.

The key to this method is letting users define simple extraction rules through an interface and then extracting information according to those rules. After analyzing users' page-browsing behavior and studying browser kernels, this thesis proposes a script description built on three elements. The three elements specify, respectively:
1. which elements in the web page need to be processed;
2. which operation is to be carried out on each element;
3. in which format the information related to the element is exported.
This thesis also explains in detail how the final generated script is analyzed, and gives a detailed algorithm flow for implementing element 1.

The model-based information extraction method is applied in the M-IE system designed in this thesis. The system can accurately extract information from forums, microblogs and web portals and export it as structured data that carries semantic meaning. Information extracted from forums and microblogs may reflect which topics are genuinely popular among grassroots groups. Users can generate extraction rules visually and simply through the program, without any professional knowledge. The overall structure of the M-IE system is divided into an extraction rule generation module, an extraction rule analysis module, an information filtering module, a database module and a data analysis module. Each module has a well-defined interface and can be replaced dynamically.

Examples from a school BBS and Sina MicroBlog, given at the end of this thesis, show how to generate extraction rules through simple operations in the interface. During rule generation, the user can preview the data to be extracted and see that it is structured and carries semantic meaning.
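The three-element rule description in the abstract (which elements, which operation, which export format) can be illustrated with a small sketch. This is not the M-IE script language, whose syntax the abstract does not give; the `Rule` fields, the `text` / `attr:<name>` operation names, and the sample HTML are all assumptions made for illustration, using only the Python standard library.

```python
import json
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Rule:
    """Hypothetical three-element extraction rule (not the thesis's syntax)."""
    tag: str         # element 1: which elements to process (tag name)
    css_class: str   # element 1, continued: class the element must carry
    operation: str   # element 2: "text" or "attr:<name>"
    export_key: str  # element 3: key under which the value is exported


class RuleExtractor(HTMLParser):
    """Applies one Rule while the HTML is parsed and collects records."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.records = []
        self._depth = 0    # > 0 while inside a matching element (text mode)
        self._buffer = []  # text fragments of the current matching element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Track nested elements with the same tag so the match closes correctly.
            if tag == self.rule.tag:
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == self.rule.tag and self.rule.css_class in classes:
            if self.rule.operation.startswith("attr:"):
                name = self.rule.operation.split(":", 1)[1]
                self.records.append({self.rule.export_key: attrs.get(name)})
            else:
                self._depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == self.rule.tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace so the exported text is clean.
                text = " ".join("".join(self._buffer).split())
                self.records.append({self.rule.export_key: text})

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract(html, rule):
    """Return a list of structured records produced by applying rule to html."""
    parser = RuleExtractor(rule)
    parser.feed(html)
    parser.close()
    return parser.records


# Made-up forum listing used only to exercise the sketch.
SAMPLE = """
<ul>
  <li class="post"><a class="title" href="/t/1">First <b>hot</b> topic</a></li>
  <li class="post"><a class="title" href="/t/2">Second topic</a></li>
  <li class="ad"><a class="promo" href="/ad">Buy now</a></li>
</ul>
"""

titles = extract(SAMPLE, Rule("a", "title", "text", "title"))
links = extract(SAMPLE, Rule("a", "title", "attr:href", "url"))
print(json.dumps(titles))
print(json.dumps(links))
```

Keeping the rule as plain data separates rule generation (for example, from clicks in a visual interface) from rule analysis and extraction, which is in the spirit of the replaceable modules described above. A real wrapper would need a richer element selector than tag plus class.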
Keywords/Search Tags: Web Information Extraction, Modeling-based Wrapper, Script