
Research On Semantic Information Extraction For Semi-structured Documents

Posted on: 2005-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2168360152968065
Subject: Computer software and theory
Abstract/Summary:
A large amount of the electronic information and knowledge available on the Internet is embedded in semi-structured documents, so extracting and making full use of it is a valuable and promising research field. Because the organization and language of semi-structured documents are flexible to a certain extent, common information extraction algorithms rarely perform well on them. Research on methods for extracting and processing semi-structured information is therefore required. In such methods, structural features can guide the extraction procedure, and other features, such as syntax and presentation, also carry valuable information that can be used together with structure when recognizing specific information.

To describe semi-structured documents thoroughly, we represent them with a three-layer view model composed of a logical view (the document's logical structure), a semantic view (the document's content and its semantic information), and a layout/presentation view (the visual appearance of the information). The three views are intertwined, and each presents the semi-structured document at a different level. This thesis concentrates on the mapping model and algorithms from the logical view to the semantic view, including the definition and modeling of semantic metadata, the semantic object matching algorithm, and the syntax-based similarity analysis algorithm.

Both the logical view and the semantic view of a semi-structured document are organized as trees. In the mapping procedure, the congruence between a single logical node and a single semantic object, which is computed from syntactic and other features, is not the only factor to consider; it is also important to determine whether the logical node's subtree matches the semantic object's children well.
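The idea of combining a node-level score with a subtree-level score can be illustrated with a minimal sketch. It is not the thesis's algorithm: the `Node` type, the string-similarity stand-in `node_score`, the greedy choice of the best child, and the weight `alpha` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """A node of either tree (hypothetical type): a label plus child nodes."""
    label: str
    children: list["Node"] = field(default_factory=list)


def node_score(logical: Node, semantic: Node) -> float:
    """Stand-in for node-level congruence: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, logical.label.lower(), semantic.label.lower()).ratio()


def match_score(logical: Node, semantic: Node, alpha: float = 0.5) -> float:
    """Blend the node's own score with how well its subtree covers the
    semantic object's children. `alpha` is an assumed weighting."""
    own = node_score(logical, semantic)
    if not semantic.children:
        return own
    # For each semantic child, take the best-scoring logical child
    # (greedy; one logical child may be reused for several semantic children).
    child_scores = [
        max((match_score(lc, sc, alpha) for lc in logical.children), default=0.0)
        for sc in semantic.children
    ]
    subtree = sum(child_scores) / len(child_scores)
    return alpha * own + (1 - alpha) * subtree
```

In this sketch, two logical nodes with identical labels receive different scores when only one of them has children resembling the semantic object's children, which is the effect the paragraph above describes.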
Taking the subtree into account in this way improves the precision of the semantic object matching algorithm.

In the syntax-based similarity analysis algorithm, we use rules to describe the syntactic features of semantic objects, translate each rule into a DFA, track the input words that match the DFA's transitions consecutively, and then compute the similarity between the input text and the rule. The advantage of this algorithm is that it handles partial matches.

In this research, an experimental system was designed and implemented to verify the methods proposed in this thesis. The test data are annual reports from a stock exchange, and the results are satisfactory.
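As a rough illustration of rule-to-DFA similarity with partial matching, the sketch below compiles a rule into a linear DFA and scores an input by the longest run of consecutive transitions it satisfies. The rule syntax, the token classes `NUM` and `WORD`, and the scoring formula are assumptions for the example, not the thesis's definitions.

```python
import re

# Hypothetical token classes usable as transition labels; any other
# symbol in a rule is treated as a literal word.
TOKEN_CLASSES = {
    "NUM": re.compile(r"\d+(\.\d+)?$"),
    "WORD": re.compile(r"[A-Za-z]+$"),
}


def symbol_matches(symbol: str, word: str) -> bool:
    """Check whether an input word satisfies a transition label."""
    pattern = TOKEN_CLASSES.get(symbol)
    if pattern is not None:
        return pattern.match(word) is not None
    return symbol.lower() == word.lower()


def build_dfa(rule: list[str]) -> dict[int, tuple[str, int]]:
    """Linear DFA: state i --rule[i]--> state i + 1; state len(rule) accepts."""
    return {i: (symbol, i + 1) for i, symbol in enumerate(rule)}


def similarity(rule: list[str], words: list[str]) -> float:
    """Fraction of the rule's transitions matched consecutively,
    maximized over every starting position in the input."""
    if not rule:
        return 0.0
    dfa = build_dfa(rule)
    best = 0
    for start in range(len(words)):
        state = 0
        for word in words[start:]:
            if state not in dfa:  # accepting state reached
                break
            symbol, next_state = dfa[state]
            if not symbol_matches(symbol, word):
                break
            state = next_state
        best = max(best, state)
    return best / len(rule)
```

With the assumed rule `["net", "profit", "NUM", "yuan"]`, the input `"the net profit 1200 yuan".split()` scores 1.0, while `"net profit rose".split()` scores 0.5 because only the first two transitions match. A plain accept/reject automaton would discard the second input entirely.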
Keywords/Search Tags: semi-structured document, pattern matching, similarity, information extraction