Research On Automatic Segmentation Based On Dictionary

Posted on: 2011-05-24 | Degree: Master | Type: Thesis
Country: China | Candidate: T K Guo | Full Text: PDF
GTID: 2178330332471470 | Subject: Microelectronics and Solid State Electronics
Abstract/Summary:
With the development of modern information technology, Chinese word segmentation is already widely used in information retrieval, information extraction, machine translation, speech synthesis, and other areas of natural language processing. Given the characteristics of Chinese text, automatic word segmentation has become a basic problem in information processing. Its main task is to cut text into words by algorithm so that a computer can process, understand, and transmit text information. The main difficulties in implementing segmentation are ambiguity and the handling of unknown words. This thesis reviews the existing theory and implementation methods of word segmentation, then studies and implements Chinese word segmentation by combining the maximum matching algorithm with word frequency statistics.
Based on an analysis of automatic word segmentation and its difficulties, and in order to reduce ambiguity and improve segmentation accuracy, the thesis adopts a composite approach that pairs a dictionary and improved matching with word frequency statistics. The system's dictionary is divided into two parts, a basic word dictionary and a word dictionary. Unlike a traditional dictionary, its storage structure is optimized for lookup: a double-character hash index stores each dictionary word under a key formed from its first two characters. This improves both the speed and the accuracy of matching, greatly increases the correct segmentation of proper names, specialized terms, personal names, and quantifiers, and improves the performance of the segmentation system.
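The double-character hash index described above can be sketched roughly as follows. This is a minimal illustration in Python, not the thesis's C# implementation: the class name `TwoCharIndexDictionary`, the method `longest_prefix`, the separate set for single-character words, and the sample words are all assumptions made for the example.

```python
from collections import defaultdict


class TwoCharIndexDictionary:
    """Hypothetical dictionary keyed by the first two characters of each word."""

    def __init__(self, words=()):
        self.single = set()            # one-character words have no two-character key
        self.index = defaultdict(set)  # first two characters -> set of remaining suffixes
        self.max_len = 1               # length of the longest stored word
        for word in words:
            self.add(word)

    def add(self, word):
        if not word:
            return
        if len(word) == 1:
            self.single.add(word)
        else:
            self.index[word[:2]].add(word[2:])
        self.max_len = max(self.max_len, len(word))

    def __contains__(self, word):
        if len(word) == 1:
            return word in self.single
        bucket = self.index.get(word[:2])
        return bucket is not None and word[2:] in bucket

    def longest_prefix(self, text, start=0):
        """Return the longest stored word beginning at text[start].

        Falls back to the single character at that position when no
        multi-character word matches.
        """
        key = text[start:start + 2]
        bucket = self.index.get(key) if len(key) == 2 else None
        if bucket:
            # Try the longest candidate first, shrinking down to two characters.
            longest_end = min(len(text), start + self.max_len)
            for end in range(longest_end, start + 1, -1):
                if text[start + 2:end] in bucket:
                    return text[start:end]
        return text[start:start + 1]


if __name__ == "__main__":
    d = TwoCharIndexDictionary(["中国", "中国人", "中国人民", "人民", "研究"])
    print(d.longest_prefix("中国人民银行"))
```

Because one hash lookup on the first two characters narrows the search to a small bucket, the longest-match scan only compares suffixes of words that already share that prefix.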
The segmentation algorithm applies forward and reverse maximum matching together as bidirectional matching. Where bidirectional matching exposes an ambiguity, word frequency information is used to judge how closely the candidate words are linked, which resolves ambiguity between words and handles unknown words. The algorithm was implemented in C#. It segments correctly both phrases with serious ambiguity and paragraphs without obvious ambiguity. Comparative experiments show that it performs better than maximum matching used alone, and that the system's segmentation accuracy meets the requirements of text information processing.
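As a rough illustration of bidirectional maximum matching with a frequency-based tie-break, consider the Python sketch below. The abstract does not say exactly how word frequency is applied, so the scoring rule here (a sum of add-one-smoothed log relative frequencies) is an assumption, as are the function names and the sample frequency table; the thesis's own implementation was written in C#.

```python
import math


def forward_mm(text, vocab, max_len):
    """Forward maximum matching: scan left to right, longest word first."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # A single character is always accepted as a fallback.
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def backward_mm(text, vocab, max_len):
    """Reverse maximum matching: scan right to left, longest word first."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in vocab or i == j - 1:
                words.append(text[i:j])
                j = i
                break
    words.reverse()
    return words


def score(words, freq, total):
    """Assumed scoring rule: sum of log relative frequencies, add-one smoothed."""
    denom = total + len(freq)
    return sum(math.log((freq.get(w, 0) + 1) / denom) for w in words)


def segment(text, freq):
    """Run both matchers; if they disagree, keep the higher-scoring result."""
    max_len = max(map(len, freq), default=1)
    total = sum(freq.values())
    fwd = forward_mm(text, freq, max_len)
    bwd = backward_mm(text, freq, max_len)
    if fwd == bwd:
        return fwd
    return max((fwd, bwd), key=lambda ws: score(ws, freq, total))


if __name__ == "__main__":
    # Hypothetical word counts, invented for this example only.
    freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 500, "起源": 10}
    print(segment("研究生命的起源", freq))
```

With these invented counts, the forward pass yields 研究生 / 命 / 的 / 起源 and the reverse pass yields 研究 / 生命 / 的 / 起源; the disagreement marks the ambiguous span, and the frequency score selects the reverse result.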
Keywords/Search Tags: Chinese word segmentation, maximum matching algorithm, dictionary, ambiguity processing, word frequency information