
Research On Chinese New Word Discovery Technology Based On Large Scale Network Corpus

Posted on: 2018-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: W Y Wu
GTID: 2428330623950676
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of web social media such as microblogs and WeChat, a large volume of web corpora is being produced, and these corpora contain many new Chinese words. Chinese new word discovery is the process of extracting, from a text corpus, words that are not included in a background lexicon. It is a basic task in Chinese natural language processing and has important theoretical and practical value. How to discover new Chinese words from large-scale web corpora without supervision has gradually become a focus of research. To this end, this thesis designs NWH (New Word Hunter), a Chinese new word discovery framework based on large-scale web corpora, proposes a new feature for Chinese new word detection, and studies the decision model used in detection. The main work covers the following three aspects:

1. A Chinese new word discovery framework based on large-scale web corpora, named NWH, is designed. The framework consists of a candidate new word set construction module and a Chinese new word detection module. The construction module extracts candidate new words from the input corpus by repeated string extraction; the detection module calculates the feature values of the candidates and combines them with a decision model to produce the final new word discovery results.

2. Features have an important influence on the effectiveness of new word detection. Based on the theory of information distance, this thesis designs a new feature, NWD (New Word Distance), for Chinese new word detection. NWD directly and effectively measures the information distance from a Chinese string to its semantics. Experimental results show that NWD detects new words better than seven commonly used features, such as pointwise mutual information (PMI).

3. The decision model plays an important role in new word detection. This thesis uses a support vector machine as the decision model for Chinese new word detection and proposes a novel method for obtaining training data, which makes it possible to use support vector machines as decision models. Experimental results show that a support vector machine decision model with a Gaussian radial basis function kernel detects new words much better than the commonly used multidimensional threshold model.
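
The two-module pipeline described above can be illustrated with a minimal sketch. It is not the NWH implementation: the abstract does not define NWD or the SVM training procedure, so this example substitutes PMI (one of the baseline features the abstract mentions) and a single-threshold decision rule. The function names `extract_candidates`, `pmi`, and `discover_new_words`, along with the parameters `max_len`, `min_count`, and `pmi_threshold`, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def extract_candidates(corpus, max_len=4, min_count=2):
    """Repeated string extraction (illustrative): count every substring of
    1..max_len characters and keep multi-character strings that occur at
    least min_count times."""
    counts = Counter()
    for line in corpus:
        for i in range(len(line)):
            for n in range(1, max_len + 1):
                if i + n <= len(line):
                    counts[line[i:i + n]] += 1
    candidates = {s for s, c in counts.items() if len(s) >= 2 and c >= min_count}
    return candidates, counts


def pmi(s, counts, total_chars):
    """Minimum pointwise mutual information over all binary splits of s.
    Probabilities are rough estimates: count / total characters."""
    p_s = counts[s] / total_chars
    best = float("inf")
    for k in range(1, len(s)):
        p_left = counts[s[:k]] / total_chars
        p_right = counts[s[k:]] / total_chars
        best = min(best, math.log2(p_s / (p_left * p_right)))
    return best


def discover_new_words(corpus, lexicon, max_len=4, min_count=2, pmi_threshold=1.0):
    """Candidate construction, then detection with a single-feature threshold
    (an assumed stand-in for the thesis's decision model)."""
    candidates, counts = extract_candidates(corpus, max_len, min_count)
    total_chars = sum(len(line) for line in corpus)
    scored = {
        s: pmi(s, counts, total_chars)
        for s in candidates
        if s not in lexicon  # new words are those absent from the background lexicon
    }
    return {s: v for s, v in scored.items() if v >= pmi_threshold}
```

Calling `discover_new_words(["今晚吃鸡", "我要吃鸡", "吃鸡了"], lexicon={"今晚"})` keeps only the repeated string "吃鸡", since every other multi-character substring occurs once. A threshold on one feature is the simplest possible decision rule; the thesis reports that an SVM over the feature values performs much better than threshold-based models.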
Keywords/Search Tags:Web Corpus, Chinese New Word Discovery, New Word Distance, Support Vector Machine