Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies

Posted on: 2014-02-23
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
GTID: 2268330395989210
Subject: Computer application technology
Abstract/Summary:
With the development of Internet technology, the number of web pages has grown enormously, and these pages contain a great deal of useful information. Users can usually make direct use only of structured information, whereas the information they need is mostly contained in unstructured or semi-structured text, so it is difficult to use the information in web pages directly. To make better use of it, the target information must be extracted from the web pages and stored in a structured form.
Web information extraction aims to extract structured information from web pages. Web pages usually contain both free text (unstructured text) written in natural language and semi-structured text such as tables or itemized and enumerated lists. This thesis concentrates on extracting information from Chinese web pages and proposes an integrated method, combining web page paragraph selection with data integration, for extracting information from both free text and semi-structured text. Heuristic rules are used to pick out the free text and the semi-structured text from a page separately. NLP techniques such as word segmentation, part-of-speech (POS) tagging, rule-based named-entity recognition (NER), and syntactic/semantic rules are then used to extract information from the free text, while information is extracted from the semi-structured text using single-slot rules generated by wrapper induction. Finally, the information extracted from the two types of text is converted to standardized data in order to resolve data conflicts, and the two sets of results are integrated into the final result.
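The semi-structured branch and the final integration step can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `SlotRule` class models a single-slot rule as a pair of left/right delimiters (one common form of induced wrapper rule), `normalize_capital` stands in for the conversion to standardized data, and `integrate` resolves conflicts by preferring the semi-structured value. The field names and sample strings are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SlotRule:
    """Hypothetical single-slot rule: capture the text between two delimiters."""
    slot: str
    left: str
    right: str

    def extract(self, text: str) -> Optional[str]:
        # Non-greedy match of whatever lies between the left and right delimiters.
        pattern = re.escape(self.left) + r"(.*?)" + re.escape(self.right)
        match = re.search(pattern, text, flags=re.DOTALL)
        return match.group(1).strip() if match else None


def normalize_capital(value: Optional[str]) -> Optional[int]:
    """Convert strings such as '500万元' or '5,000,000元' to an amount in yuan."""
    if value is None:
        return None
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(万)?", value)
    if not match:
        return None
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):  # '万' means ten thousand
        amount *= 10_000
    return int(round(amount))


def integrate(semi: Dict[str, object], free: Dict[str, object]) -> Dict[str, object]:
    """Merge both result sets; on conflict keep the semi-structured value (assumption)."""
    merged = dict(free)
    merged.update({slot: v for slot, v in semi.items() if v is not None})
    return merged


# Hypothetical inputs: one table row and values already extracted from free text.
table_row = "<td>注册资本</td><td>500万元</td>"
rule = SlotRule("registered_capital", "<td>注册资本</td><td>", "</td>")
semi_result = {rule.slot: normalize_capital(rule.extract(table_row))}
free_result = {
    "registered_capital": normalize_capital("5,000,000元"),
    "legal_representative": "张三",
}
print(integrate(semi_result, free_result))
```

In this sketch both sources yield the same standardized capital value once normalized, so the apparent conflict between "500万元" and "5,000,000元" disappears before merging; the conflict-resolution policy actually used in the thesis is not stated in the abstract.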
We apply this method to a real-world application, the extraction of enterprise registration information, and carry out related experiments. The experimental results show that the average precision and recall of the integrated method are 93.41% and 87.44%, respectively. Moreover, the F-value of the integrated extraction is clearly higher than that obtained using free text extraction or semi-structured text extraction alone.
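The abstract does not state the F-value itself or which variant was used. As a rough guide, the sketch below assumes the balanced F1 measure (the harmonic mean of precision and recall) and applies it to the reported averages; the function name `f_measure` and the `beta` parameter are illustrative. A figure computed from averaged precision and recall may differ from one averaged over individual experiments, so this is not necessarily the value reported in the thesis.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1.0 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Reported averages from the abstract, in percent.
f1 = f_measure(93.41, 87.44)
print(f"{f1:.2f}")  # prints 90.33
```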
Keywords/Search Tags: Information extraction, Free text, Semi-structured text, Wrapper induction