Quantitative Research On Hot Words And Dialects Over The Internet

Posted on: 2015-01-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Zhang
GTID: 1108330476955908
Subject: Computer Science and Technology
Abstract/Summary:
With the advancement of science and technology, particularly the development of information technology and the spread of the Internet, the Chinese language has undergone tremendous change. Vocabulary, as the most active part of the language, has changed the most. On the Internet, vocabulary changes fall into two main categories: the emergence of hot and new words, and the widespread use of dialect words. Both introduce difficulties for Chinese information processing. Studying these vocabulary changes helps improve the performance of Chinese information processing, while detecting hot words and dialect words supports quantitative research on the Chinese language and supplements Chinese dictionaries. In this thesis, we study changes in Chinese vocabulary based on search engine query logs and Chinese Pinyin input method records.

The main contributions of this thesis include:

(1) We propose a method to identify hot words based on search engine query logs and analyze hot words through the temporal dynamics of the queries. We find that the time patterns of query terms can be roughly divided into several categories. Based on the detection of the major burst period, we design an algorithm that automatically detects hot words according to the frequency ratio within the burst period.

(2) Taking into account both the semantic similarity and the time similarity of the query logs, we expand the range of detected hot words. Specifically, we further detect words with low query frequency that are related to already detected hot words, which addresses the difficulty of identifying low-frequency hot and new words. By analyzing the time patterns of the query frequency sequences, we identify new words among the hot words based on their burst modes, and use them to supplement the new word dictionary.

(3) We propose an approach to automatically identify dialect words from Chinese Pinyin input method records. By extracting the users' geographic information, we obtain the geographical features of the logs. By analyzing the applications that call the input method, we obtain the colloquial features of the logs. Through a comprehensive analysis of the geographical and colloquial features, we propose a dialect recognition method based on ranking by the combined features. Experimental results show that combining the colloquial and geographical features greatly improves dialect word recognition performance.

(4) By constructing a bipartite graph between users and input entries, we identify dialect users from the input method records using the idea of collaborative filtering. We then expand the set of detected dialect words according to the input records of those dialect users. We identify the characteristic words of each geographical region by analyzing the coverage of the dialect vocabulary and other related characteristics. By comparing the set of dialect feature words with the set of words popular across the country, we provide quantitative evidence for the theory of dialect region partition. Finally, we visualize the geographical distribution of dialects to assist research on dialect regions.
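
As a rough illustration of the burst-based detection described in contribution (1), the sketch below flags a query term as a hot word when most of its query volume falls inside one major burst period. The abstract does not specify how the burst period is found or which threshold is used, so everything here is an assumption: the burst period is taken to be the contiguous window whose counts exceed the series mean by the largest total margin (Kadane's maximum-subarray algorithm), and the function names and the 0.6 threshold are hypothetical.

```python
from typing import List, Tuple


def major_burst_period(freqs: List[int]) -> Tuple[int, int]:
    """Return (start, end), inclusive, of the assumed major burst period.

    Assumption: the burst is the contiguous window whose counts exceed the
    series mean by the largest total margin (Kadane's algorithm applied to
    deviations from the mean).
    """
    if not freqs:
        raise ValueError("freqs must not be empty")
    mean = sum(freqs) / len(freqs)
    best_sum = float("-inf")
    best = (0, 0)
    cur_sum = 0.0
    cur_start = 0
    for i, f in enumerate(freqs):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; start a new window here.
            cur_sum = f - mean
            cur_start = i
        else:
            cur_sum += f - mean
        if cur_sum > best_sum:
            best_sum = cur_sum
            best = (cur_start, i)
    return best


def burst_frequency_ratio(freqs: List[int]) -> float:
    """Share of the total query volume that falls inside the burst period."""
    total = sum(freqs)
    if total == 0:
        return 0.0
    start, end = major_burst_period(freqs)
    return sum(freqs[start:end + 1]) / total


def is_hot_word(freqs: List[int], ratio_threshold: float = 0.6) -> bool:
    """Flag a term as hot if its burst ratio reaches the (assumed) threshold."""
    return burst_frequency_ratio(freqs) >= ratio_threshold


# Daily query counts for two hypothetical terms.
bursty = [1, 1, 1, 50, 60, 40, 1, 1, 1, 1]
steady = [10] * 10
print(major_burst_period(bursty), is_hot_word(bursty), is_hot_word(steady))
```

With this heuristic, a term queried at a steady rate has a low burst ratio regardless of its overall popularity, while a term whose volume is concentrated in a few days has a high one. A real system would also need to handle multiple bursts and periodic terms, which the abstract attributes to the different time-pattern categories.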
Keywords/Search Tags: Query log, Chinese input method, Hot words detection, Dialects identification