
Research On Automatic Construction Of Natural Language Thesaurus

Posted on: 2008-06-22
Degree: Master
Type: Thesis
Country: China
Candidate: H P Du
Full Text: PDF
GTID: 2178360242965539
Subject: Information Science
Abstract/Summary:
An information retrieval (IR) system contains four subsystems: the indexing system, the searching system, the thesaurus system, and the user-system interface. The thesaurus, which is the basis of an IR system, greatly affects retrieval efficiency.

The main cause of the low efficiency of network information retrieval is the mismatch problem. To improve the performance of an IR system based on keyword matching, a control mechanism such as a thesaurus is needed, raising matching from the word level to the concept level. A thesaurus constructed by hand is of good quality but costly and time-consuming to build, and its pre-chosen terms may have nothing to do with texts that enter the IR system later. A number of information retrieval experiments have shown that general-purpose thesauri do not improve retrieval effectiveness. A thesaurus therefore needs to be constructed automatically and quickly to improve the efficiency of network information retrieval.

To resolve the above problem, this thesis proposes an automatic method. A natural language thesaurus is constructed that controls the natural language used in indexing and searching; it is an "inner controlled and outer uncontrolled" thesaurus.

Automatic thesaurus construction means that the equivalence, hierarchical, and related relationships between terms are recognized through NLP technologies such as pattern recognition, co-occurrence analysis, and word clustering. This thesis focuses on recognizing the last two kinds: first an association concept space is constructed, then semantically related terms are clustered, and finally the hierarchical relationships between terms within the same cluster are recognized. Pattern recognition and a word similarity algorithm are used to recognize synonyms.

After a taxation thesaurus is generated, it is used to automatically index the texts of tax web pages with the inner control terms.
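The co-occurrence analysis mentioned above can be illustrated with a minimal sketch. The abstract does not give the weighting formula used in the thesis, so the asymmetric weight below (documents containing both terms divided by documents containing the first term) is an assumption, as are the function name `association_weights` and the toy documents.

```python
from collections import Counter
from itertools import combinations


def association_weights(docs):
    """Return {(a, b): weight}, where weight is the share of documents
    containing term a that also contain term b (an assumed formula)."""
    doc_freq = Counter()   # number of documents each term appears in
    co_freq = Counter()    # number of documents each ordered pair shares
    for terms in docs:
        unique = sorted(set(terms))
        doc_freq.update(unique)
        for a, b in combinations(unique, 2):
            co_freq[(a, b)] += 1
            co_freq[(b, a)] += 1
    return {(a, b): n / doc_freq[a] for (a, b), n in co_freq.items()}


# Hypothetical, already-segmented documents from the taxation domain.
docs = [
    ["tax", "income", "deduction"],
    ["tax", "income"],
    ["tax", "invoice"],
]
weights = association_weights(docs)
```

The weight is deliberately asymmetric: a narrow term such as "income" can point strongly to a broad term such as "tax" without the reverse holding, which is one common cue for telling related terms from hierarchical ones.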
Users can search the thesaurus in natural language; the system then supplies the corresponding inner control terms, relieving the user's burden in searching. This thesis also discusses how to update and maintain the constructed thesaurus, and tries an N-gram algorithm to recognize unknown words. The automatic thesaurus construction system is developed using VB.NET and Access.
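As a rough illustration of N-gram based unknown word recognition, the sketch below counts character n-grams and keeps the frequent ones that are not already in the thesaurus. The abstract does not describe the actual algorithm, so the frequency threshold, the function name `unknown_word_candidates`, and the sample strings are all assumptions.

```python
from collections import Counter


def unknown_word_candidates(texts, known_words, n=2, min_count=2):
    """Return character n-grams that occur at least min_count times
    and are not already listed in known_words."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Skip n-grams that span whitespace.
            if not any(ch.isspace() for ch in gram):
                counts[gram] += 1
    return sorted(
        gram for gram, count in counts.items()
        if count >= min_count and gram not in known_words
    )


# Hypothetical unsegmented tax-related strings.
texts = ["增值税发票", "开具增值税", "发票管理"]
candidates = unknown_word_candidates(texts, known_words={"发票"}, n=3)
```

A real system would likely combine several values of `n` and filter candidates with further statistics; this sketch covers only the counting step.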
Keywords/Search Tags: natural language, thesaurus, automatic thesaurus construction, automatic indexing, association concept space, word clustering