Research On Structured Information Extraction Based On Pattern Matching

Posted on: 2014-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: C L Yang
GTID: 2268330401988837
Subject: Computer software and theory
Abstract/Summary:
Semi-structured text is a kind of domain-oriented applied text with strong domain characteristics. With the rapid development of the Internet, semi-structured text presented on the Web has found a wide range of applications, and extracting information from it has considerable prospects.

Existing Web information extraction methods generally extract information elements, referred to here as coarse-grained extraction results, rather than the structured information that users prefer. Semantic analysis methods that rely on a corpus can achieve satisfactory results, but they do not work effectively in an open environment.

This thesis proposes a structured information extraction method based on pattern matching. Starting from the coarse-grained extraction results, the method extracts structured information that carries semantic meaning. The main work of this thesis is as follows:

(1) Domain recognition on the coarse-grained extraction results. Domain recognition is essentially a text categorization process. The thesis constructs text vectors from concepts instead of ordinary words, optimizes the vector weights according to the weights of the concepts, and uses an SVM classifier to recognize the domain. Based on the recognized domain, the appropriate domain lexicon is loaded during word segmentation.

(2) An XML-based pattern library and a definition of extraction patterns. Besides the attribute role, which expresses semantic meaning, the extraction patterns include a boundary mechanism, which improves pattern accuracy, and a keyword synonym mechanism, which improves pattern coverage.

(3) Structured information is extracted by pattern matching based on keywords and the part of speech of words. In addition, a pattern extraction method based on keywords and the similarity between patterns is proposed. With this method, unknown patterns are extracted and the pattern library is updated automatically.

(4) The pattern sets, divided by keyword, are grouped into clusters by pattern clustering. Within each cluster, a method based on the reverse shortest edit distance is used to generalize patterns and improve their coverage.
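
The abstract refers to pattern similarity and to an edit-distance-based generalization step without giving their definitions. As a rough illustration only, the sketch below computes the standard Levenshtein edit distance between two patterns represented as token sequences and derives a normalized similarity from it. The token representation (part-of-speech tags mixed with keywords), the function names, and the normalization are assumptions made for this example; they are not the thesis's definitions, and the "reverse shortest edit distance" method itself is not reproduced here.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Standard Levenshtein distance over token sequences (not characters)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def pattern_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical similarity: 1 minus the length-normalized edit distance."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical patterns: part-of-speech slots mixed with a keyword token.
p1 = ["N", "price", "NUM"]
p2 = ["N", "cost", "NUM", "UNIT"]
print(edit_distance(p1, p2), pattern_similarity(p1, p2))
```

Under these assumptions, patterns that share a keyword and differ in only a few slots score close to 1, which is one plausible basis for grouping patterns before generalizing them.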
Keywords/Search Tags: semi-structured text, coarse-grained extraction results, domain recognition, extraction patterns, pattern matching, pattern generalization