Research On Chinese New Word Identification And Analysis

Posted on:2007-12-17

Degree:Master

Type:Thesis

Country:China

Candidate:S Q Cui

Full Text:PDF

GTID:2178360185454173

Subject:Computer software and theory

Abstract/Summary:

A word that is not included in a Chinese segmentation lexicon is called a new word. The identification of Chinese new words is a key technique in Chinese information processing. Because there are no spaces between Chinese words, Chinese segmentation faces two problems: ambiguity resolution and new word identification. These have become the bottlenecks to further improving segmentation performance. Research on named entities such as person names, place names and organization names has achieved good results. However, research on common new words is still waiting for a breakthrough.

In this thesis, after computing the term frequency and document frequency, we refine the problem according to linguistic knowledge. In the training step, we extract a garbage-string lexicon, a garbage-head lexicon, a garbage-tail lexicon, a suffix lexicon and the IWP (Independent Word Probability) parameters. In the identification step, we adopt different approaches for different new word patterns, which improves performance. In an experiment on 400 web pages, we detect the new words with frequency greater than 1; the precision reaches 80.4% and the recall reaches 81.8%.

The features of new words include surface features, distribution features and semantic features, among others. There is little research on these features, although they are a useful way to understand new words. The new word identification in this thesis is based on a large-scale corpus from the Internet, so abundant information is available from the context. On this basis, we study in depth the space distribution and time distribution of new words from the viewpoints of term frequency, mutual information and word similarity.

The abbreviation relationship is a kind of semantic feature. Because many new words are abbreviations, we put forward a method to bootstrap an abbreviation lexicon. In this step, we make use of world knowledge and the corpus, compute the language model of phrases and the alignment model from phrase to word, and give a score to each pair of abbreviation and phrase. In an experiment on 500,000 web pages, we extract abbreviations with frequency greater than 100, and obtain a precision of 51.4% and a recall of 81.7%.

Based on the techniques above, we developed an Internet-oriented Chinese new word identification and analysis system based on a browser/server (B/S) architecture, which supports online and real-time operation.
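
The abstract names term frequency, document frequency and IWP without giving formulas. The Python sketch below shows one common way these statistics are combined to filter new word candidates, and is not taken from the thesis itself. It assumes that candidates are character n-grams, that IWP of a character is the fraction of its occurrences in which it stands alone as a one-character word in a segmented training corpus, and that a candidate is rejected when the product of its characters' IWP values is high. The function names, the thresholds (`min_tf`, `max_iwp`, `default_iwp`) and the toy data are all hypothetical.

```python
from collections import Counter
from math import prod


def candidate_stats(docs, max_len=4):
    """Count term frequency (TF) and document frequency (DF) of character n-grams."""
    tf, df = Counter(), Counter()
    for doc in docs:
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(doc) - n + 1):
                gram = doc[i:i + n]
                tf[gram] += 1      # every occurrence counts toward TF
                seen.add(gram)
        df.update(seen)            # each document counts once toward DF
    return tf, df


def estimate_iwp(segmented_docs):
    """Assumed definition: IWP(c) = count(c as a one-character word) / count(c)."""
    alone, total = Counter(), Counter()
    for words in segmented_docs:
        for word in words:
            total.update(word)     # counts each character of the word
            if len(word) == 1:
                alone[word] += 1
    return {c: alone[c] / total[c] for c in total}


def is_candidate(gram, tf, iwp, min_tf=2, max_iwp=0.2, default_iwp=0.5):
    """Keep a gram that is frequent enough and whose characters rarely stand alone."""
    if tf[gram] < min_tf:
        return False
    joint = prod(iwp.get(c, default_iwp) for c in gram)
    return joint <= max_iwp


# Toy data (hypothetical): raw documents and a separately segmented training corpus.
docs = ["博客很火", "我很喜欢博客", "我很好"]
training = [["我", "喜欢", "客人"], ["博士", "很", "好"], ["很", "好客"]]

tf, df = candidate_stats(docs, max_len=2)
iwp = estimate_iwp(training)
kept = [g for g in tf if is_candidate(g, tf, iwp)]
```

In this toy run, a string such as "我很" is frequent but consists of characters that usually appear as independent words, so the IWP test filters it out, while "博客" survives. A real system would add the garbage-string, garbage-head, garbage-tail and suffix lexicons described in the abstract, whose construction is not specified here.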

Keywords/Search Tags:

Candidate New Word, Garbage-String, Space Distribution, Time Distribution, Abbreviation Source Phrase

Related items

1	Using Word and Phrase Abbreviation Patterns to Extract Age From Twitter Microtexts
2	Assembly Line Material Distribution Optimization Research In Just-in-time Based On Distribution BOM
3	Research On Chinese Phrase Annotation And Calculation Based On Multi-level Corpus
4	Light Source Simulation And Light Field Distribution Analysis Of Single Light Field Time-Grating Sensor
5	Study On Parallel Algorithms For Approximate String Matching With Single Pattern And Single Text On Heterogeneous Cluster Computing Systems
6	Distribution And Modeling Of Communication Services
7	Fundamental Studies On Extracting Parameters Of Frequency Agile Signal
8	Research On Vehicle Routing Problem In Logistics Distribution
9	Optimized Route And Scheduling For Responsive Feeder Transit Considering Passengers’ Time And Space Distribution
10	Read Title Era Of Newspaper Headlines Language Characteristics