
Technology For Domain-oriented Automatic Information Extraction From Semi-structured Web

Posted on: 2009-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Liu
Full Text: PDF
GTID: 2198360272460970
Subject: Computer software and theory
Abstract/Summary:
To meet the challenges of the information explosion, users urgently need automatic tools that can quickly find and return the information they actually want from massive information sources. Information extraction (IE) was proposed to address this problem. However, most Web IE tools target the pages of one specific website, and their wrappers are usually built manually. Such systems port poorly: they cannot adapt to changes in the structure of the target pages and cannot be reused on other websites. Based on an analysis of the page structures found in many domains, this thesis presents an automatic Web information extraction method built mainly on a domain keyword dictionary. After pre-training, the system automatically generates extraction rules from the structural characteristics of a website's information and then performs the extraction automatically.

The thesis makes two main contributions.

First, a domain keyword dictionary is introduced as the knowledge source for domain information extraction. We present a keyword extraction approach for semi-structured domain texts: following the characteristics of semi-structured text in a specific domain, we build a wrapper manually and combine it with a spider to extract the text from which the domain keyword dictionary is trained. A domain keyword dictionary generated from a Web corpus can markedly improve the adaptability of IE technology.

Second, an approach is presented for automatically generating extraction rules based on the keyword dictionary.
The main process has four steps: parse sample pages from the target site into a DOM tree, extract the effective nodes, match the information in the effective nodes against the domain keyword dictionary, and map the results to extraction rules for the target site.

Other related problems are also addressed in this thesis, including: (1) a method for constructing wrappers for semi-structured domain text using regular expressions, (2) methods for handling the various kinds of links encountered during hyperlink extraction, (3) an improved algorithm for identifying nodes in the DOM tree so that it adapts to semi-structured text, and (4) a method for mapping the domain dictionary and effective nodes into regular expressions.

This thesis takes agricultural product price websites as an example to demonstrate the proposed creation of the domain keyword dictionary and generation of extraction rules. The experimental results show that the automatically generated rules are effective at extracting price information for agricultural products.

In short, extraction rules generated from a domain keyword dictionary have the advantage of drawing on a knowledge source. The system is open and easy to upgrade. With further training on Web corpora, the keyword dictionaries become more complete and the system's adaptability improves accordingly. The approach is suitable for many Web information extraction tasks on semi-structured domain text.
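The four-step process described in the abstract (parse to a tree, extract effective nodes, match the keyword dictionary, map to extraction rules) can be illustrated with a minimal sketch. This is not the thesis's actual algorithm, which the abstract does not spell out. It assumes that an "effective node" is any non-empty text node outside script and style tags, that the target data sits in an HTML table with a header row, and that a rule is a regular expression over table rows. The dictionary `DOMAIN_KEYWORDS`, the sample page, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical domain keyword dictionary: field name -> keywords that signal it.
DOMAIN_KEYWORDS = {
    "product": {"product", "variety"},
    "market": {"market"},
    "price": {"price"},
}

# Made-up sample page for illustration only.
SAMPLE_PAGE = """
<html><body><table>
<tr><th>Product name</th><th>Market</th><th>Price (yuan/kg)</th></tr>
<tr><td>Cabbage</td><td>Beijing</td><td>1.20</td></tr>
<tr><td>Potato</td><td>Shanghai</td><td>2.50</td></tr>
</table></body></html>
"""


class NodeCollector(HTMLParser):
    """Steps 1-2: walk the tag tree and keep text nodes with their tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in ("script", "style"):
            self.nodes.append(("/".join(self.stack), text))


def effective_nodes(html):
    collector = NodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def match_fields(nodes, dictionary):
    """Step 3: label each header cell with the dictionary field it matches."""
    fields = []
    for path, text in nodes:
        if path.split("/")[-1] != "th":
            continue
        words = set(re.findall(r"[a-z]+", text.lower()))
        field = next((name for name, kws in dictionary.items() if words & kws), None)
        # A regex group name may appear only once, so keep the first match.
        fields.append(None if field in fields else field)
    return fields


def build_rule(fields):
    """Step 4: map the matched columns to a row-level regular expression."""
    cells = []
    for field in fields:
        group = f"(?P<{field}>[^<]*)" if field else "[^<]*"
        cells.append(rf"<td[^>]*>\s*{group}\s*</td>")
    return re.compile(r"<tr[^>]*>\s*" + r"\s*".join(cells) + r"\s*</tr>", re.I)


def extract(html, rule):
    """Apply a generated rule and return one dict per matching row."""
    return [{k: v.strip() for k, v in m.groupdict().items()} for m in rule.finditer(html)]


fields = match_fields(effective_nodes(SAMPLE_PAGE), DOMAIN_KEYWORDS)
rule = build_rule(fields)
rows = extract(SAMPLE_PAGE, rule)
```

Because the rule is derived from the header labels and not from fixed column positions, a page that lists the same fields in a different order would yield a different regular expression without manual changes. That is the kind of adaptability the abstract attributes to the dictionary-based approach, shown here only for the simple table case.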
Keywords/Search Tags: Web Information Extraction, Domain Keyword Dictionary, Wrapper, Hyperlink Extraction, Effective Node Extraction, Automatic Rule Generation