
Chinese New Word Identification Based On Large-scale Corpus

Posted on: 2009-03-18 | Degree: Master | Type: Thesis
Country: China | Candidate: H L Lv | Full Text: PDF
GTID: 2178360242984716 | Subject: Computer application technology
Abstract/Summary:
The Internet has given enormous impetus to information exchange, and more and more new words are entering everyday life. They reflect the development and evolution of the lexicon, have drawn the attention of linguists, and pose a major challenge to Natural Language Processing (NLP). Automatic identification of new words is of great significance for Chinese lexicography, information extraction, Chinese word segmentation, and other areas of NLP.

A word that is not included in a Chinese lexical analyzer's lexicon is called a new word. New Word Identification (NWI) is a major problem facing Chinese word segmentation. We approach it as follows: first, we download a sufficient number of documents from the Internet and build a corpus; next, we segment the corpus, so that most new words are cut into fragments; finally, we search for repeated strings to obtain new word candidates.

To identify new words that appear as sequences of single characters, we build a model called the local bi-gram model, which uses the external linguistic context and the internal structure of a string simultaneously. A local bi-gram statistic trained on a large-scale corpus is used to decide whether a character sequence is a new word. Mutual Information is used to measure the coupling between two neighboring characters. To measure the coupling of sequences made up of more characters, we employ Average Mutual Information (AMI), which equals the mean of the Mutual Information over all pairs of neighboring characters in the sequence. In our experiments the local bi-gram model reaches an F-Measure of 79.05%, compared with 71.37% for AMI. We also carried out an experiment combining the two methods above, which achieved a comparable F-Measure of 79.94%.

To identify new words in the two-one pattern (a two-character word followed by a single character), two methods are used to construct the suffix-character set. One uses a suffix set built in existing research. The other collects characters that frequently appear at the end of three-character words. We conduct experiments with each method separately, and also an experiment with the union of the two suffix-character sets.
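The AMI score described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact Mutual Information formula, so the code below assumes the common pointwise form log2(p(ab) / (p(a) p(b))) estimated from raw corpus counts; the function names `train_counts`, `mutual_information`, and `average_mutual_information` are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def mutual_information(a, b, unigrams, bigrams):
    """Pointwise MI of the adjacent pair (a, b), in bits.

    Assumed form: log2(p(ab) / (p(a) * p(b))). Returns -inf when the
    pair never occurs in the training corpus.
    """
    pair_count = bigrams[a + b]
    if pair_count == 0:
        return float("-inf")
    p_ab = pair_count / sum(bigrams.values())
    p_a = unigrams[a] / sum(unigrams.values())
    p_b = unigrams[b] / sum(unigrams.values())
    return math.log2(p_ab / (p_a * p_b))


def average_mutual_information(seq, unigrams, bigrams):
    """Mean MI over all neighboring character pairs of seq (AMI)."""
    if len(seq) < 2:
        raise ValueError("AMI needs a sequence of at least two characters")
    scores = [
        mutual_information(seq[i], seq[i + 1], unigrams, bigrams)
        for i in range(len(seq) - 1)
    ]
    return sum(scores) / len(scores)
```

In use, a candidate string would be scored with `average_mutual_information` and compared against a threshold tuned on held-out data; that threshold, like the lack of smoothing for unseen pairs, is a simplification of this sketch rather than something stated in the abstract.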
Keywords/Search Tags: Corpus, New Word Identification, Average Mutual Information, Local Bi-gram Model