Research On Automatic Construction Of Natural Language Thesaurus

Posted on:2008-06-22

Degree:Master

Type:Thesis

Country:China

Candidate:H P Du

Full Text:PDF

GTID:2178360242965539

Subject:Information Science

Abstract/Summary:

An information retrieval (IR) system comprises four subsystems: an indexing system, a searching system, a thesaurus system, and the user-system interface. The thesaurus, which is the basis of an IR system, greatly affects retrieval efficiency. The main cause of the low efficiency of network information retrieval is the mismatch problem. To improve the performance of an IR system based on keyword matching, a control mechanism such as a thesaurus is needed, raising matching from the word level to the concept level. A thesaurus constructed by hand is of good quality but costly and time-consuming to build, and its pre-chosen terms may have nothing to do with texts that enter the IR system later. A number of information retrieval experiments have also shown that general-purpose thesauri do not improve retrieval effectiveness. Thesauri therefore need to be constructed automatically and quickly to improve the efficiency of network information retrieval.

To resolve this problem, this thesis proposes an automatic method. A natural language thesaurus is constructed that controls the natural language used in indexing and searching; it is an "inner controlled and outer uncontrolled" thesaurus.

Automatic thesaurus construction means that the equivalence, hierarchical, and related (associative) relationships between terms are recognized through NLP techniques such as pattern recognition, co-occurrence analysis, and word clustering. This thesis focuses on recognizing the last two kinds of relationship: first an association concept space is constructed, then semantically related terms are clustered, and finally the hierarchical relationships between terms in the same cluster are recognized. Pattern recognition and a word similarity algorithm are used to recognize synonyms.

After a taxation thesaurus is generated, it is used to automatically index tax web page texts with the inner control terms. Users can search the thesaurus in natural language, and the system then supplies the corresponding inner control terms, relieving the user's burden in searching.

This thesis also discusses how to update and maintain the constructed thesaurus, and attempts to use an N-gram algorithm to recognize unknown words. The automatic thesaurus construction system is developed using VB.NET and Access.
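The abstract does not give the formulas used for the association concept space or for clustering, so the following is only a minimal illustrative sketch in Python (the thesis system itself was built with VB.NET). It assumes a cosine-style normalization of document-level co-occurrence counts and a simple threshold-based single-link clustering; the function names `association_space` and `cluster_terms`, the threshold value, and the sample documents are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
import math


def association_space(docs):
    """Return {(a, b): weight} for term pairs that co-occur in a document.

    Assumption: weight = co-occurrence count / sqrt(df(a) * df(b)),
    a cosine-style normalization chosen here for illustration only.
    """
    doc_freq = Counter()   # number of documents containing each term
    co_freq = Counter()    # number of documents containing both terms
    for terms in docs:
        unique = sorted(set(terms))
        doc_freq.update(unique)
        for a, b in combinations(unique, 2):
            co_freq[(a, b)] += 1
    return {
        (a, b): n / math.sqrt(doc_freq[a] * doc_freq[b])
        for (a, b), n in co_freq.items()
    }


def cluster_terms(weights, threshold):
    """Group terms linked by an association weight >= threshold.

    This is single-link clustering via union-find, an assumed stand-in
    for whatever clustering method the thesis actually uses.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        root_a, root_b = find(a), find(b)
        if w >= threshold:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical, already-segmented index terms from four tax web pages.
docs = [
    ["tax", "income", "rate"],
    ["tax", "income"],
    ["vat", "invoice"],
    ["vat", "invoice", "rate"],
]
weights = association_space(docs)
clusters = cluster_terms(weights, threshold=0.8)
print(clusters)
```

In this toy data, "income" and "tax" always appear together, as do "invoice" and "vat", so each pair forms a cluster, while "rate" is only weakly associated with either group and stays on its own. Recognizing hierarchical relationships inside each cluster would be a further step that this sketch does not attempt.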

Keywords/Search Tags:

natural language, thesaurus, automatic thesaurus construction, automatic indexing, association concept space, word clustering

Related items

1	Automatic thesaurus discovery via selective Natural Language Processing: A corpus based approach
2	Research In Thesaurus-based Ontology Building Method
3	Automatic Supervised Thesauri Construction with 'Roget's Thesaurus'
4	Experiments with automatic indexing and a relational thesaurus in a Chinese information retrieval system
5	Establishment And Study Of Cultural Relics Digital Protection Thesaurus
6	Study On The Theory & Practice Of Automatic Indexing Of WWW Science And Technology Information Resources
7	An automatic feedback thesaurus approach and its parallel implementations
8	Resolving quasi-synonym relationships in automatic thesaurus construction using fuzzy rough sets and an inverse term frequency similarity function
9	The Automatic Formation Of Conversion From Chinese Thesaurus To Ontology
10	Research On Maritime Ontology Construction Based On Thesaurus And FCA