Research And Implementation On Chinese Web Pages-Oriented Information Extraction Technologies

Posted on:2014-02-23

Degree:Master

Type:Thesis

Country:China

Candidate:J Chen

GTID:2268330395989210

Subject:Computer application technology

Abstract/Summary:

With the development of Internet technology, the number of web pages has grown enormously, and these pages contain a great deal of useful information. Users can usually make direct use of structured information only, whereas the information they need is mostly embedded in unstructured or semi-structured text, so it is difficult to use the information in web pages directly. To make better use of this information, the target information must be extracted from the web pages and stored in a structured form. Web information extraction aims at extracting structured information from web pages. Web pages usually contain both free text (unstructured text) written in natural language and semi-structured text such as tables or itemized and enumerated lists. This thesis concentrates on extracting information from Chinese web pages and proposes an integrated method that combines web page paragraph selection with data integration to extract information from both free text and semi-structured text. Heuristic rules are used to pick out the free text and the semi-structured text from pages separately. NLP techniques such as word segmentation, part-of-speech (POS) tagging, rule-based named-entity recognition (NER), and syntactic/semantic rules are then used to extract information from the free text, while information is extracted from the semi-structured text using single-slot rules generated by wrapper induction. Finally, the information extracted from the two types of text is converted to standardized data in order to resolve data conflicts, and the results from the two types of text are integrated into the final result.
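
The semi-structured branch and the final integration step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the `SlotRule` class, the slot names, the sample table row, the capital-normalization rule, and the policy of preferring semi-structured values when both sources supply one. The delimiter rules are written by hand here; inducing them from labeled pages (the wrapper induction step) and the NLP-based free-text extraction are not shown.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SlotRule:
    """Hypothetical single-slot rule: take the text between two delimiters."""
    slot: str
    left: str
    right: str

    def extract(self, html: str) -> Optional[str]:
        # Non-greedy match so the value stops at the first right delimiter.
        pattern = re.escape(self.left) + r"(.*?)" + re.escape(self.right)
        match = re.search(pattern, html, flags=re.DOTALL)
        return match.group(1).strip() if match else None


def normalize_capital(value: str) -> Optional[int]:
    """Assumed normalization: '500万元' and '5,000,000元' both become 5000000."""
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)\s*(万)?", value)
    if not match:
        return None
    amount = float(match.group(1).replace(",", ""))
    if match.group(2):  # the unit 万 means ten thousand
        amount *= 10_000
    return int(round(amount))


def integrate(semi: Dict[str, object], free: Dict[str, object]) -> Dict[str, object]:
    """Merge both result sets; semi-structured values win (an assumed policy)."""
    merged = dict(free)
    merged.update({k: v for k, v in semi.items() if v is not None})
    return merged


# Illustrative input: one table row from a made-up enterprise page.
page = "<tr><td>注册资本</td><td>500万元</td></tr>"
rule = SlotRule("registered_capital", "<td>注册资本</td><td>", "</td>")

semi_result = {rule.slot: normalize_capital(rule.extract(page) or "")}
# Stand-in for the output of the free-text branch.
free_result = {"registered_capital": normalize_capital("5,000,000元"),
               "founded": "2003"}

final = integrate(semi_result, free_result)
```

Because both values are normalized to the same unit before merging, the two sources agree on the registered capital and no conflict remains; `final` keeps that value together with the `founded` slot that only the free-text branch supplied.
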
We apply this method to the real-world task of extracting enterprise registration information and conduct related experiments. The experimental results show that the average precision and recall of the integrated method are 93.41% and 87.44%, respectively. The F-value of the integrated extraction is also clearly higher than that obtained using free text extraction or semi-structured text extraction alone.
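
The abstract gives precision and recall but not the F-value itself. As a rough illustration, assuming the F-value is the balanced F1 score (the harmonic mean of precision and recall, which the abstract does not confirm), it can be computed as in the sketch below. The function name `f_measure` and the `beta` parameter are illustrative choices.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Reported averages from the abstract, in percent.
f1 = f_measure(93.41, 87.44)
```

With the reported averages this gives roughly 90.3%. That number is arithmetic on the two averages, not a result quoted from the thesis, and an F-value averaged per test case could differ from it.
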

Keywords/Search Tags:

Information extraction, Free text, Semi-structured text, Wrapper induction

Related items

1	Algorithm Research For Text Information Extraction Based On Wrapper Model
2	The Study Of Semi-supervised Web Data Extraction Rule Induction Based On User Interaction
3	Design And Implementation Of The Core Information Extraction System Of Semi-structured Financial Contract
4	Scalable Detection and Extraction of Data in Lists in OCRed Text for Ontology Population Using Semi-Supervised and Unsupervised Active Wrapper Induction
5	Research And Application Of Extraction Method Of Semi-structured Text Information
6	Information Extraction For Semi-structured Chinese Resume
7	Study And Design Of Text Information Extraction And Classification System
8	Research And Application Of Semi-structured Data Extraction
9	Researches On Models And Algorithms Of Text Information Extraction
10	Identification Of The Semi-Structured Text