Research On Algorithm For Network New Word Recognition

Posted on:2016-10-15

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhou

Full Text:PDF

GTID:2348330479486992

Subject:Computer application technology

Abstract/Summary:

With the advent of the information age, the Internet has profoundly affected people's daily life, work, and study, and it has also changed the way people communicate. The continual emergence of new network words is good evidence of this. Unlike Indo-European languages, in which words are separated by spaces, Chinese text has no explicit delimiter between words. Because the word is the smallest independent unit of the Chinese language, a computer must segment Chinese text into words before it can identify them. New network words, however, degrade the results of Chinese word segmentation. Research statistics show that most Chinese word segmentation errors are caused by the computer failing to identify new words. If new words can be identified quickly and the Chinese dictionary updated in a timely manner, the accuracy of Chinese word segmentation systems can be greatly improved. New word recognition has therefore become a thorny problem in automatic Chinese word segmentation.

In recent years, many scholars and research institutes have done a great deal of work on new word recognition and have made some achievements, but the efficiency of new word recognition is still unsatisfactory. To address this problem, this paper proposes a new word identification method based on the characteristics of Weibo messages.

First, to guarantee the timeliness of the corpus, we construct a Weibo corpus by using a web crawler to crawl Sina Weibo messages. Second, we use atomic segmentation and the N-gram algorithm to segment the Weibo messages and obtain candidate strings, and then filter the candidate strings with a garbage string dictionary to obtain candidate new words. Third, we propose a new word recognition method based on two characteristics of Weibo messages: information entropy maximization and the sparseness of Weibo vocabulary.
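
The candidate-generation steps described above (atomic segmentation, N-gram extraction, and filtering with a garbage string dictionary) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the definition of an "atom", the function names, the `max_n` and `min_count` parameters, and the exact-match filtering rule are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Assumption: an "atom" is a single Chinese character or an unbroken run of
# ASCII letters/digits; punctuation and whitespace end a clause.
CLAUSE = re.compile(r"[A-Za-z0-9\u4e00-\u9fff]+")
ATOM = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def atomic_segment(message):
    """Split a message into clauses, each a list of atoms."""
    return [ATOM.findall(clause) for clause in CLAUSE.findall(message)]


def candidate_strings(messages, max_n=4, min_count=2):
    """Count all 2..max_n atom N-grams and keep those seen at least min_count times."""
    counts = Counter()
    for message in messages:
        for atoms in atomic_segment(message):
            for n in range(2, max_n + 1):
                for i in range(len(atoms) - n + 1):
                    counts["".join(atoms[i:i + n])] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


def filter_garbage(candidates, garbage_strings):
    """Drop candidates listed in the garbage string dictionary (exact match, an assumption)."""
    return {s: c for s, c in candidates.items() if s not in garbage_strings}
```

For example, `candidate_strings(["今天给力啊", "真给力", "给力"])` keeps only `给力`, because it is the only N-gram that occurs at least twice in this toy corpus. A real garbage string dictionary would likely need richer rules than exact matching.
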
We combine new word recognition with word segmentation and use the proposed method to identify candidate new words from the segmentation results. Finally, to address the shortcomings of this method, we propose using average mutual information to improve it, which raises the accuracy of new word recognition. Experimental results show that, compared with other new word recognition methods, the method proposed in this paper improves the efficiency of new word recognition.
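
The abstract does not define how average mutual information is computed. One plausible reading, used here purely as an assumption, is the mean of the pointwise mutual information over every way of splitting a candidate into a left and a right part. The sketch below uses raw character N-gram counts normalized by the corpus length as rough probability estimates; the function names and this normalization are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(corpus, max_n):
    """Count every character N-gram of length 1..max_n in the corpus."""
    counts = Counter()
    for text in corpus:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def average_mutual_information(candidate, counts, total):
    """Mean pointwise mutual information over all binary splits of candidate.

    total is the number of characters in the corpus; count / total is used
    as a rough probability estimate (an assumption of this sketch).
    """
    if len(candidate) < 2 or counts[candidate] == 0:
        return float("-inf")
    scores = []
    for k in range(1, len(candidate)):
        left, right = candidate[:k], candidate[k:]
        if counts[left] == 0 or counts[right] == 0:
            return float("-inf")  # parts were not counted (max_n too small)
        p_whole = counts[candidate] / total
        p_left = counts[left] / total
        p_right = counts[right] / total
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return sum(scores) / len(scores)
```

With a toy corpus such as `["给力给力", "给你力量"]`, `counts = ngram_counts(corpus, 2)`, and `total = sum(len(t) for t in corpus)`, the string `给力` scores higher than `力给`, which matches the intuition that characters inside a real word co-occur more often than chance. A threshold on this score could then be used to accept or reject candidates, although the threshold used in the thesis is not stated in the abstract.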

Keywords/Search Tags:

new word recognition, Chinese word segmentation, N-gram, average mutual information

Related items

1	Research On Chinese Word Segmentation Algorithm Based On News Text
2	The Research On Chinese Word Segmentation System Based On SVM
3	Statistical Learning In Chinese Word Segmentation And Application-specific Segmentation
4	Research On Words Segmentation Algorithm And Word Variant Extraction Method Of Message Variety Based
5	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
6	Research And Implementation Of New Word Recognition Based On N-gram And Hybrid Strategy
7	Chinese New Word Identification Based On Large-scale Corpus
8	Comparative Research On Open-Source Chinese Word Segmentation Machines
9	Research For Chinese New Word Identification Based On Context-aware
10	New Words Discovery Research For Specific Areas