Research On Semantic Information Extraction For Semi-structured Documents

Posted on:2005-02-17

Degree:Master

Type:Thesis

Country:China

Candidate:Y Li

Full Text:PDF

GTID:2168360152968065

Subject:Computer software and theory

Abstract/Summary:

A large amount of the electronic information and knowledge available on the Internet is embedded in semi-structured documents, so extracting and making full use of it is a valuable and promising research field. Because the organization and language of semi-structured documents are flexible to a certain extent, general-purpose information extraction algorithms rarely perform well on them, and methods designed specifically for semi-structured information extraction and processing are needed. Such methods can use structural features to guide the extraction procedure, and other features, such as syntax and presentation, can be used alongside them when recognizing specific information.

To describe semi-structured documents thoroughly, we represent them with a three-layer view model composed of the logical view (the document's logical structure), the semantic view (the document's content and its semantic information), and the layout/presentation view (the information's visual appearance). The three views are intertwined, and each presents the semi-structured document at a different level. This thesis concentrates on the mapping model and algorithms from the logical view to the semantic view, including the definition and modeling of semantic metadata, the semantic object matching algorithm, and the syntax-based similarity analysis algorithm.

Both the logical view and the semantic view of a semi-structured document are organized as trees. In the mapping procedure, the congruence between a single logical node and a single semantic object, which is computed from syntactic and other features, is not the only factor to consider; it is also important to determine how well the logical node's subtree matches the semantic object's children. Taking both into account improves the precision of the semantic object matching algorithm.

In the syntax-based similarity analysis algorithm, we use rules to describe the syntactic features of semantic objects, translate each rule into a DFA (deterministic finite automaton), find the input words that match the DFA's transitions consecutively, and then compute the similarity between the input text and the rule. The advantage of this algorithm is that it handles partial matches.

An experimental system was designed and implemented to verify the methods proposed in this thesis. The test data are annual reports from a stock exchange, and the results are satisfactory.
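
The abstract gives no formulas, so the following Python sketch is only one possible reading of the two algorithms it describes, not the thesis's actual implementation. It assumes that a rule is a plain token sequence (compiled into a chain-shaped DFA), that similarity is the longest run of consecutive matched transitions divided by the rule length, and that a node's own score and its children's scores are combined with a fixed weight. The names `SemanticObject`, `LogicalNode`, `rule_similarity`, `match_score`, and `w_self` are all hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class SemanticObject:
    name: str
    rule: list[str]  # assumed rule format: the expected token sequence
    children: list[SemanticObject] = field(default_factory=list)


@dataclass
class LogicalNode:
    text: str
    children: list[LogicalNode] = field(default_factory=list)


def rule_similarity(rule: list[str], text: str) -> float:
    """Longest run of consecutive DFA transitions taken by the input, over rule length."""
    tokens = text.lower().split()
    if not rule:
        return 0.0
    # Chain-shaped DFA: state i moves to state i + 1 on token rule[i].
    delta = {(i, sym): i + 1 for i, sym in enumerate(rule)}
    best = 0
    # Partial matching: the run may start at any state and at any input position.
    for start_state in range(len(rule)):
        for start in range(len(tokens)):
            state, pos = start_state, start
            while pos < len(tokens) and (state, tokens[pos]) in delta:
                state = delta[(state, tokens[pos])]
                pos += 1
            best = max(best, pos - start)
    return best / len(rule)


def match_score(node: LogicalNode, obj: SemanticObject, w_self: float = 0.5) -> float:
    """Combine the node's own similarity with how well its subtree covers obj's children."""
    own = rule_similarity(obj.rule, node.text)
    if not obj.children:
        return own
    # Each semantic child takes its best-scoring logical child (greedy, not one-to-one).
    child_scores = [
        max((match_score(c, child_obj, w_self) for c in node.children), default=0.0)
        for child_obj in obj.children
    ]
    return w_self * own + (1 - w_self) * sum(child_scores) / len(child_scores)


# Hypothetical example: a "balance sheet" object with two expected children.
sheet = SemanticObject("balance_sheet", ["balance", "sheet"], [
    SemanticObject("assets", ["total", "assets"]),
    SemanticObject("liabilities", ["total", "liabilities"]),
])
good = LogicalNode("Balance Sheet", [LogicalNode("Total assets"), LogicalNode("Total liabilities")])
weak = LogicalNode("Balance Sheet", [LogicalNode("Notes")])
print(match_score(good, sheet), match_score(weak, sheet))
```

In this sketch the two logical nodes have identical headings, so a node-only comparison could not tell them apart; only the subtree term ranks `good` above `weak`, which is the effect the abstract attributes to checking the semantic object's children.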

Keywords/Search Tags:

semi-structured document, pattern matching, similarity, information extraction

Related items

1	Research On Structured Information Extraction Based On Pattern Matching
2	Research On Keyword Extraction And Structured List Data Extraction
3	Research Of Pattern Extraction From Semi-structured Data Based On Rules
4	Research On Feature Extraction Method Of Semi-structured Document
5	Study On Information Autonomous Extraction Technology Of Web Pages
6	Study On Semi-structured Data Mining
7	Design And Implementation Of The Core Information Extraction System Of Semi-structured Financial Contract
8	Semantic Based Information Retrieval From Semi-structured Documents
9	Pattern-Based Information Extraction From HTML Documents
10	Research And Application Of Extraction Method Of Semi-structured Text Information