Chinese New Word Identification Based On Large-scale Corpus

Posted on:2009-03-18

Degree:Master

Type:Thesis

Country:China

Candidate:H L Lv

GTID:2178360242984716

Subject:Computer application technology

Abstract/Summary:

The Internet has given an enormous impetus to information exchange, and more and more new words are entering everyday life. They reflect how the lexicon develops and evolves, have become a concern of linguistics, and pose a major challenge to Natural Language Processing (NLP). Automatic identification of new words is of great significance for Chinese lexicography, information extraction, Chinese word segmentation, and other fields of NLP.

A word that is not included in a Chinese lexical analyzer's lexicon is called a new word. New Word Identification (NWI) is a major problem facing Chinese word segmentation. We approach it as follows: first, download a sufficient number of documents from the Internet and build a corpus; then segment the corpus, so that most new words are cut into fragments; finally, search for repeated strings to obtain new word candidates.

For identifying sequences of single characters, we build a model named the local bi-gram model, which makes use of both the outer linguistic environment and the inner structure of a string. A local bi-gram statistic trained on a large-scale corpus is used to decide whether a character sequence is a new word. Mutual Information is used to measure the coupling between two neighboring characters. To measure the coupling of sequences consisting of more characters, we employ Average Mutual Information (AMI), which equals the mean of the Mutual Information values of all pairs of neighboring characters in the sequence. In the experiments, the local bi-gram model achieves a best F-measure of 79.05%, compared with 71.37% for AMI. We also carried out an experiment combining the two methods mentioned above, and achieved a comparable result, with an F-measure of 79.94%.

For identifying new words in the two-one pattern (a two-character word followed by a single character), two methods are used to construct the suffix-character set. One uses a suffix set built in existing research. The other collects characters that frequently appear at the end of three-character words. We conduct experiments using each method separately, and also carry out an experiment using the union of the two suffix-character sets.
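
The abstract describes AMI only in words, so the following is a minimal Python sketch of one plausible reading, not the thesis's implementation. It assumes the standard pointwise definition MI(a, b) = log2(P(ab) / (P(a) P(b))) with probabilities estimated from raw corpus counts. The function names (`train_counts`, `mutual_information`, `average_mutual_information`) and the choice to score unseen pairs as negative infinity are hypothetical and do not come from the source.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams


def mutual_information(a, b, unigrams, bigrams):
    """Pointwise MI of the adjacent pair (a, b): log2(P(ab) / (P(a) * P(b))).

    Assumption: a pair never seen in the corpus scores -inf.
    """
    if bigrams[a + b] == 0 or unigrams[a] == 0 or unigrams[b] == 0:
        return float("-inf")
    p_ab = bigrams[a + b] / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def average_mutual_information(seq, unigrams, bigrams):
    """Mean MI over all pairs of neighboring characters in seq."""
    if len(seq) < 2:
        raise ValueError("AMI needs a sequence of at least two characters")
    scores = [
        mutual_information(seq[i], seq[i + 1], unigrams, bigrams)
        for i in range(len(seq) - 1)
    ]
    return sum(scores) / len(scores)
```

Under this reading, a candidate string would be accepted as a new word when its AMI exceeds some threshold tuned on held-out data; the abstract does not state the threshold, so none is given here.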

Keywords/Search Tags:

Corpus, New Word Identification, Average Mutual Information, Local Bi-gram Model

Related items

1	Research On Algorithm For Network New Word Recognition
2	Research And Implementation Of New Word Recognition Based On N-gram And Hybrid Strategy
3	The Research On Chinese Word Segmentation System Based On SVM
4	New Word Discovery Based On Large-scale Corpus And Improving Chinese Segmentation System
5	Research Of Chinese New Word Identification
6	Word Sense Disambiguation Corpus Automatic Acquisition
7	Research On The Contextual Cohesion Of Social Media Texts For News
8	Statistical Learning In Chinese Word Segmentation And Application-specific Segmentation
9	Researches Into New Chinese Words Identification Based On Large-Scale Corpus
10	Research For Chinese New Word Identification Based On Context-aware