A New Words Extraction Method Based On Domain Specificity And Statistical Language Knowledge

Posted on: 2017-01-09
Degree: Master
Type: Thesis
Country: China
Candidate: L L Mei
Full Text: PDF
GTID: 2308330503458927
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, with the rapid development of the economy and society, a large number of new words have entered everyday life. In natural language processing, the automatic extraction of new words is indispensable. As a basic technology of language information processing, new words extraction has both research value and practical application. This thesis proposes a novel new words extraction method. The main work is as follows:

1. This thesis proposes a new words extraction method based on domain specificity and statistical language knowledge. After observing and analyzing the corpus, we apply a filtering algorithm based on domain specificity to obtain a candidate list of new words; we then use statistical language knowledge (word frequency and internal tightness) to extract new words from the candidates. Experiments demonstrate the effectiveness of this method.

2. This thesis optimizes the new words extraction method in two respects. First, internal tightness is improved by using EMI instead of PMI to measure cohesion. Second, external context features are added: left entropy and right entropy measure how freely a candidate combines with its surrounding context. To evaluate the optimized method, we conduct several experiments, including a comparison with state-of-the-art methods, an evaluation of the different kinds of statistical language knowledge, and parameter tuning. Experimental results show that the optimized method greatly improves performance over the previous method: the maximum accuracy increases by 39% and the maximum recall increases by 63%.

3. This thesis also presents two applications of the new words extraction method. The first is word segmentation: experimental results show that the method increases word segmentation accuracy by 10% on a corpus containing new words. The second is English domain lexicon construction: experiments verify that the method is scalable and language independent.

The new words extraction method based on domain specificity and statistical language knowledge is unsupervised. It requires neither a training corpus nor hand-defined rules, which overcomes the shortcomings of traditional methods. In addition, the method is highly scalable and language independent, and it can extract a large number of new words and domain words.
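
The statistical measures named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it scores two-token candidates with plain PMI (the baseline that the abstract says EMI replaces, since EMI is not defined here), and it computes left and right entropy over neighboring tokens. The function names, the restriction to bigrams, and the threshold values are all assumptions made for illustration. The domain-specificity filter is omitted because the abstract does not define it.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbor counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(tokens):
    """Map each adjacent pair to (freq, pmi, left_entropy, right_entropy)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Collect the tokens seen immediately before and after each pair.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(n - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < n:
            right[pair][tokens[i + 2]] += 1

    total_bigrams = n - 1
    scores = {}
    for pair, freq in bigrams.items():
        p_xy = freq / total_bigrams
        p_x = unigrams[pair[0]] / n
        p_y = unigrams[pair[1]] / n
        # PMI: how much more often the pair co-occurs than chance predicts.
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[pair] = (freq, pmi, entropy(left[pair]), entropy(right[pair]))
    return scores


def extract_candidates(tokens, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep pairs that are frequent, cohesive, and free on both sides.

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    kept = []
    for pair, (freq, pmi, h_left, h_right) in score_bigrams(tokens).items():
        if freq >= min_freq and pmi >= min_pmi and min(h_left, h_right) >= min_entropy:
            kept.append(pair)
    return sorted(kept)
```

For Chinese text, `tokens` could be a list of characters; for English, a list of words. A candidate with high PMI but low entropy on one side is usually a fragment of a longer unit, which is why both boundary entropies are checked.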
Keywords/Search Tags: New words extraction, word segmentation, domain specificity, statistical language knowledge, domain words extraction