
Research On Chinese New Word Identification And Analysis

Posted on: 2007-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: S Q Cui
Full Text: PDF
GTID: 2178360185454173
Subject: Computer software and theory
Abstract/Summary:
A word that is not included in a Chinese segmentation lexicon is called a new word. The identification of Chinese new words is a key technique in Chinese information processing. Because Chinese text has no spaces between words, segmentation faces two problems: ambiguity resolution and new word identification. These have become the bottlenecks to further improving the performance of Chinese segmentation. Research on named entities such as person names, place names and organization names has achieved good results. However, research on common new words is still waiting for a breakthrough.

In this thesis, after computing the term frequency and document frequency, we refine the problem according to linguistic knowledge. In the training step, we extract a garbage-string lexicon, a garbage-head lexicon, a garbage-tail lexicon, a suffix lexicon and the IWP (Independent Word Probability) parameters. In the identification step, we adopt different approaches for different new word patterns, which improves performance. In an experiment on 400 web pages, we detect the new words with frequency greater than 1; the precision reaches 80.4% and the recall reaches 81.8%.

The features of new words include surface features, distribution features and semantic features. There is little research on these features, although studying them is a useful way to understand new words. The new word identification in this thesis is based on a large-scale corpus from the Internet, so abundant information is available from the context. On this basis, we study in depth the space distribution and time distribution of new words from the viewpoint of term frequency, mutual information and word similarity.

The abbreviation relationship is a kind of semantic feature. Because there are many abbreviations among new words, we put forward a method to bootstrap an abbreviation lexicon. In this step, we make use of world knowledge and the corpus, compute the language model of phrases and the alignment model from phrase to word, and give a score for each pair of abbreviation and phrase. In an experiment on 500,000 web pages, we extract abbreviations with frequency greater than 100, obtaining a precision of 51.4% and a recall of 81.7%.

Based on the techniques above, we developed an Internet-oriented Chinese new word identification and analysis system based on B/S architecture, which supports online and real-time operation.
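
The abstract does not give the thesis's actual algorithm, so the following is only a minimal sketch of the general idea it describes: count the term frequency (TF) and document frequency (DF) of candidate strings, then discard candidates that are already in the lexicon or that begin or end with a garbage character. The function names, the character n-gram candidates, the `max_len` limit and the single-character treatment of garbage heads and tails are all assumptions made for illustration. The `min_tf=2` default mirrors the "frequency greater than 1" condition reported in the experiment.

```python
from collections import Counter


def candidate_stats(docs, max_len=4):
    """Count TF and DF of character n-grams (length 2..max_len).

    Hypothetical helper: treats every n-gram as a new word candidate.
    """
    tf, df = Counter(), Counter()
    for doc in docs:
        seen = set()  # n-grams seen in this document, for DF
        for n in range(2, max_len + 1):
            for i in range(len(doc) - n + 1):
                gram = doc[i:i + n]
                tf[gram] += 1
                seen.add(gram)
        df.update(seen)
    return tf, df


def filter_candidates(tf, df, lexicon, garbage_heads, garbage_tails, min_tf=2):
    """Keep candidates that are frequent, unknown to the lexicon,
    and do not start or end with a garbage character (an assumed
    simplification of garbage-head / garbage-tail lexicons)."""
    kept = []
    for gram, freq in tf.items():
        if freq < min_tf or gram in lexicon:
            continue
        if gram[0] in garbage_heads or gram[-1] in garbage_tails:
            continue
        kept.append((gram, freq, df[gram]))
    # Most frequent first; ties broken by the string itself.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = ["的博客", "的博客"]
    tf, df = candidate_stats(docs, max_len=2)
    print(filter_candidates(tf, df, lexicon=set(),
                            garbage_heads={"的"}, garbage_tails=set()))
```

In this toy run the candidate "的博" is rejected because it starts with the garbage head "的", while "博客" survives with its TF and DF. A real system would also need the suffix lexicon and IWP parameters mentioned in the abstract, which are not modeled here.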
Keywords/Search Tags: Candidate New Word, Garbage-String, Space Distribution, Time Distribution, Abbreviation Source Phrase