Research On Algorithm For Network New Word Recognition

Posted on: 2016-10-15
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhou
Full Text: PDF
GTID: 2348330479486992
Subject: Computer application technology
Abstract/Summary:
With the advent of the information age, the Internet has profoundly affected people's daily life, work, and study, and it has also changed the way people communicate. The continual emergence of new network words is good evidence of this. Unlike Indo-European languages, in which words are separated by spaces, Chinese has no explicit delimiter between words. Because the word is the smallest independent unit of Chinese, a computer must first segment text into words before it can identify them. New network words, however, degrade the results of Chinese word segmentation. Research statistics show that most Chinese word segmentation errors arise because the computer cannot identify new words. If new words can be identified quickly and the Chinese dictionary updated in a timely manner, the accuracy of Chinese word segmentation systems will improve considerably. New word recognition has therefore become a thorny problem in automatic Chinese word segmentation.
In recent years, many scholars and research institutes have worked on new word recognition and made some progress, but the efficiency of new word recognition is still not very good. To address this problem, this paper proposes a new word identification method based on the characteristics of Weibo messages.
First, to guarantee the timeliness of the corpus, we construct a Weibo corpus by using a web crawler to crawl Sina Weibo messages. Second, we use atomic segmentation and the N-gram algorithm to segment the Weibo messages and obtain candidate strings; we then filter the candidate strings with a garbage string dictionary and obtain candidate new words from the filtering result. Third, we propose a new word recognition method based on two characteristics of Weibo messages: information entropy maximization and the sparseness of Weibo vocabulary.
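The candidate-generation step (atomic segmentation, N-gram extraction, garbage-string filtering) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the definition of an atom, the names `atomic_segment` and `candidate_strings`, and the parameters `max_n`, `min_count`, and `garbage` are all hypothetical.

```python
import re
from collections import Counter

# Assumption: an "atom" is a single CJK character or an unbroken run of
# Latin letters/digits; anything else (punctuation, spaces) ends a run.
ATOM = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")
RUN = re.compile(r"(?:[A-Za-z0-9]|[\u4e00-\u9fff])+")


def atomic_segment(message):
    """Split a message into runs of atoms, breaking at punctuation."""
    return [ATOM.findall(run) for run in RUN.findall(message)]


def candidate_strings(messages, max_n=4, min_count=2, garbage=frozenset()):
    """Count 2..max_n-grams of atoms and drop rare or garbage strings."""
    counts = Counter()
    for message in messages:
        for atoms in atomic_segment(message):
            for n in range(2, max_n + 1):
                for i in range(len(atoms) - n + 1):
                    counts["".join(atoms[i:i + n])] += 1
    # Keep strings that are frequent enough and not in the garbage dictionary.
    return {s: c for s, c in counts.items()
            if c >= min_count and s not in garbage}


messages = ["今天真给力", "给力的一天", "太给力了!"]
print(candidate_strings(messages, max_n=2))
```

Breaking runs at punctuation keeps N-grams from spanning sentence boundaries; the frequency threshold is one simple way to discard strings that occur too rarely to be scored.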
We combine new word recognition with word segmentation and use the method proposed in this paper to identify candidate new words from the segmentation results. Finally, to address the shortcomings of this method, we use average mutual information to improve it, which raises the accuracy of new word recognition. Experimental results show that, compared with other new word recognition methods, the method proposed in this paper improves the efficiency of new word recognition.
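The abstract does not define average mutual information. One common reading, assumed here, is the pointwise mutual information between the left and right parts of a candidate, averaged over every split point. The sketch below follows that assumption; `ngram_counts`, `average_mutual_information`, and the choice of the corpus length as the probability normalizer are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def average_mutual_information(word, counts, total):
    """Mean over split points of log2(p(word) / (p(left) * p(right))).

    Assumes word has at least two characters and that word and all of its
    parts appear in counts (true when max_n >= len(word) and word occurs).
    """
    if len(word) < 2:
        raise ValueError("word must have at least two characters")

    def p(s):
        return counts[s] / total

    scores = [math.log2(p(word) / (p(word[:k]) * p(word[k:])))
              for k in range(1, len(word))]
    return sum(scores) / len(scores)


text = "给力给力"
counts = ngram_counts(text, max_n=2)
print(average_mutual_information("给力", counts, total=len(text)))
```

A higher score means the parts of the candidate co-occur more often than chance would predict, which is why such a measure can help reject candidates that merely straddle two unrelated words.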
Keywords/Search Tags: new word recognition, Chinese word segmentation, N-gram, average mutual information