Research On Structured Information Extraction Based On Pattern Matching

Posted on:2014-06-08

Degree:Master

Type:Thesis

Country:China

Candidate:C L Yang

GTID:2268330401988837

Subject:Computer software and theory

Abstract/Summary:

Semi-structured text is a kind of domain-oriented applied text with strong domain characteristics. With the rapid development of the Internet, semi-structured text presented on the Web has found a wide range of applications, and extracting information from it is a promising prospect.

Existing Web information extraction methods generally extract information elements, referred to here as coarse-grained extraction results, rather than the structured information that users prefer. Semantic analysis methods that rely on a corpus can achieve satisfactory results, but they do not work effectively in an open environment.

This thesis proposes a structured information extraction method based on pattern matching. Building on the coarse-grained extraction results, the method extracts structured information that carries semantic meaning. The main work of this thesis is as follows:

(1) Domain recognition on the coarse-grained extraction results. Domain recognition is essentially a text categorization task. The thesis builds text vectors from concepts instead of ordinary words, optimizes the vector weights according to the weights of the concepts, and uses an SVM classifier to recognize the domain. Based on the recognized domain, the appropriate domain lexicon is loaded during word segmentation.

(2) An XML-based pattern library and a definition of extraction patterns. Besides the attribute role, which expresses semantic meaning, the extraction patterns include a boundary mechanism that improves pattern accuracy and a keyword synonym mechanism that improves pattern coverage.

(3) Structured information extraction by pattern matching based on keywords and the parts of speech of words. In addition, a pattern extraction method based on keywords and the similarity between patterns is proposed. With this method, previously unknown patterns are extracted and the pattern library is updated automatically.

(4) Pattern clustering. The pattern sets divided by keyword are grouped into clusters. Within each cluster, a method based on the reverse shortest edit distance is used to generalize patterns and improve their coverage.
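
As a rough illustration of the keyword and part-of-speech matching described in item (3), the following minimal Python sketch may help. It is not the thesis's implementation: the `Slot` and `Pattern` classes, the `match` function, the POS tag names, and the sample sentence are all hypothetical, and the boundary mechanism and the XML pattern library are not modeled. It assumes the input has already been segmented and POS-tagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    role: str  # hypothetical attribute role the matched word is stored under
    pos: str   # part-of-speech tag the token must carry


@dataclass(frozen=True)
class Pattern:
    keywords: frozenset  # the keyword together with its synonyms
    slots: tuple         # Slot objects expected immediately after the keyword


def match(pattern, tagged):
    """Return {role: word} for the first match in the tagged tokens, else None."""
    n = len(pattern.slots)
    for i, (word, _) in enumerate(tagged):
        if word not in pattern.keywords:
            continue
        # Tokens right after the keyword must match the slot POS tags in order.
        window = tagged[i + 1 : i + 1 + n]
        if len(window) == n and all(
            tok_pos == slot.pos for (_, tok_pos), slot in zip(window, pattern.slots)
        ):
            return {slot.role: w for (w, _), slot in zip(window, pattern.slots)}
    return None


# Hypothetical pattern: keyword "price" (synonym "cost"), then a number and a noun.
price = Pattern(
    keywords=frozenset({"price", "cost"}),
    slots=(Slot("amount", "NUM"), Slot("currency", "NOUN")),
)
tagged = [("the", "DET"), ("cost", "NOUN"), ("300", "NUM"), ("yuan", "NOUN")]
print(match(price, tagged))  # {'amount': '300', 'currency': 'yuan'}
```

Storing the synonyms alongside the keyword lets one pattern cover several surface forms, which is the coverage effect the abstract attributes to the synonym mechanism. The slot roles give each extracted value a named meaning.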

Keywords/Search Tags:

semi-structured text, coarse-grained extraction results, domain recognition, extraction patterns, pattern matching, pattern generalization

Related items

1	Research Of Pattern Extraction From Semi-structured Data Based On Rules
2	Research On Semantic Information Extraction For Semi-structured Documents
3	The Method Of Fine-Grained Topic Information Extraction And Text Clustering Based On Chinese Phrase
4	Design And Implementation Of The Core Information Extraction System Of Semi-structured Financial Contract
5	Research And Application Of Semi-structured Data Extraction
6	Research On Feature Extraction Method Of Semi-structured Document
7	Research And Application Of Extraction Method Of Semi-structured Text Information
8	Research On Semi-supervised Entity Semantic Relation Extraction
9	Research On The Image Matching Method Based On Point Pattern Matching
10	Semi-Approximate Pattern Matching Algorithm Based On BPM-BM