Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval

Posted on: 2008-04-06
Degree: Master
Type: Thesis
Country: China
Candidate: T Liu
Full Text: PDF
GTID: 2178360215997632
Subject: Computer application technology
Abstract/Summary:
Chinese automatic word segmentation uses a computer to cut continuous text into strings of word units. Full text retrieval is a retrieval method that treats all of the text as the search object; it improves search accuracy and speed and gives users more freedom in how they search. Chinese automatic word segmentation is the first step of full text retrieval and the foundation of Chinese information processing, so research on it has both theoretical and practical significance.

The work of this thesis is as follows: designing a Hash-based organization of the traditional word segmentation dictionary in order to speed up lookup; improving the traditional word segmentation algorithm by replacing the fixed maximum word length with one determined dynamically, following the long-word-first principle; discussing cross ambiguity and combinatorial ambiguity, and in particular proposing an improved maximum matching method aimed at cross ambiguity, which accounts for 90% of all ambiguity; and discussing three kinds of unknown words (person names, institution names, and place names), and in particular proposing a recognition method for Chinese place names based on mutual information.

Extensive testing shows that the automatic word segmentation method studied and initially implemented in the thesis is comparatively fast, averaging up to 12,000 characters per second. In terms of segmentation precision, the system achieves a recall rate of 97.56% for cross ambiguity and 93.41% for place name recognition. Overall, the method segments words well and can be applied, at an initial stage, to full text retrieval and to other kinds of Chinese text processing.
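As a rough illustration of the first two contributions, the sketch below shows a plain forward maximum matching segmenter over a hash-indexed dictionary, with the maximum word length looked up per first character rather than fixed globally. The abstract does not give the thesis's actual data structure or algorithm, so the function names (`build_dictionary`, `segment`), the choice of the first character as the hash key, and the sample word list are all assumptions made for this example.

```python
from collections import defaultdict


def build_dictionary(words):
    """Index non-empty words by first character (hypothetical layout).

    Returns two hash maps: first character -> set of words, and
    first character -> length of the longest word starting with it.
    """
    index = defaultdict(set)
    max_len = defaultdict(int)
    for word in words:
        head = word[0]
        index[head].add(word)
        max_len[head] = max(max_len[head], len(word))
    return index, max_len


def segment(text, index, max_len):
    """Forward maximum matching, trying the longest candidate first."""
    result = []
    i = 0
    while i < len(text):
        head = text[i]
        # The window size depends on the current first character,
        # not on one global maximum word length.
        limit = min(max_len.get(head, 1), len(text) - i)
        for size in range(limit, 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in index.get(head, ()):
                result.append(piece)
                i += size
                break
    return result


index, max_len = build_dictionary(["研究", "研究生", "生命", "起源"])
print(segment("研究生命起源", index, max_len))  # ['研究生', '命', '起源']
```

The output shows why cross ambiguity matters: unmodified maximum matching picks 研究生 and leaves 命 stranded, whereas the intended reading is 研究 / 生命 / 起源. The improved method described in the abstract targets this kind of case; its details are not given here, so the sketch does not attempt to reproduce it.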
Keywords/Search Tags:Chinese automatic word segmentation, Full text retrieval, Maximum matching method, Ambiguity recognition, Unknown word recognition, Mutual information
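For readers unfamiliar with the mutual information keyword, the following is a minimal sketch of the standard pointwise mutual information between two adjacent characters, estimated from raw counts in a corpus string. The abstract does not state the formula or features the thesis uses for place name recognition, so the function `mutual_information`, the count-based estimate, and the toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter


def mutual_information(corpus, x, y):
    """Pointwise mutual information log2(P(xy) / (P(x) * P(y))).

    Probabilities are estimated from character and adjacent-pair
    counts in `corpus`. Returns -inf when the pair never occurs.
    """
    chars = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    if not (chars[x] and chars[y] and bigrams[x + y]):
        return float("-inf")
    p_x = chars[x] / sum(chars.values())
    p_y = chars[y] / sum(chars.values())
    p_xy = bigrams[x + y] / sum(bigrams.values())
    return math.log2(p_xy / (p_x * p_y))


corpus = "北京北京上海"
print(mutual_information(corpus, "北", "京"))  # higher: 北京 recurs as a unit
print(mutual_information(corpus, "京", "北"))  # lower: 京北 occurs once, by chance
```

A high score suggests that two characters occur together more often than chance would predict, which is the usual reason this measure is used as evidence that they belong to one word or name.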