
Identifying Top Chinese Network Buzzwords From Social Media Big Data Set

Posted on: 2016-01-31  Degree: Master  Type: Thesis
Country: China  Candidate: Y L Tang  Full Text: PDF
GTID: 2308330464973824  Subject: Computer application technology
Abstract/Summary:
With the boom of China’s Internet industry and the explosive growth of the mobile Internet, we are experiencing a different lifestyle in the information age. Meanwhile, more and more network buzzwords are gradually entering people’s daily lives. Network buzzwords are the communication language of the Internet: simple, practical, and welcomed by netizens. To some extent they are the main embodiment of Internet culture, and they play an important role in public opinion analysis, social focus tracking, and the study of language evolution.

At present, most rankings of network buzzwords are compiled manually. Questionnaires are the most popular method, but they are subjective and costly. It is therefore necessary to develop an objective, computer-based method built on machine learning. Moreover, the automatic acquisition of network buzzwords, as an application of natural language processing, plays an important role in promoting computational linguistics and Chinese information processing.

This thesis proposes an automatic method for acquiring network buzzwords from a social media big data set. A bag of words is built from three sources: segmentation of the corpus with a conditional random field model, rule-based extraction from Internet encyclopedia platforms, and the cell thesauri of Chinese input methods. A novel algorithm relying on the time-distribution feature of words is proposed, and a KL-divergence measure is used to estimate each word’s popularity so as to identify the buzzwords of a specific period. The time-distribution feature states simply that a buzzword’s usage increases sharply within a very short period; this is modeled formally with the KL-divergence measure.
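The idea of scoring words by how sharply their usage concentrates in time can be illustrated with a minimal Python sketch. It is not the thesis's actual formulation, which the abstract does not give. It assumes that the corpus is already segmented into tokens, that time is split into equal periods, and that a word's popularity is the KL divergence between its observed distribution over periods and a uniform baseline. The function names (`period_counts`, `kl_to_uniform`, `rank_buzzwords`) and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter, defaultdict


def period_counts(posts, vocabulary):
    """Count how often each vocabulary word occurs in each time period.

    posts: iterable of (period_index, tokens) pairs, tokens already segmented.
    vocabulary: set of candidate words (the "bag of words").
    Returns {word: Counter({period_index: count})}.
    """
    counts = defaultdict(Counter)
    for period, tokens in posts:
        for token in tokens:
            if token in vocabulary:
                counts[token][period] += 1
    return counts


def kl_to_uniform(counts_by_period, n_periods):
    """KL divergence D(P || U) between a word's time distribution P and
    a uniform distribution U over n_periods (assumed baseline).

    A word used evenly all year scores near 0; a word whose usage is
    concentrated in a short burst scores close to log(n_periods).
    """
    total = sum(counts_by_period.values())
    if total == 0:
        return 0.0
    q = 1.0 / n_periods
    score = 0.0
    for t in range(n_periods):
        p = counts_by_period.get(t, 0) / total
        if p > 0:  # terms with p == 0 contribute nothing
            score += p * math.log(p / q)
    return score


def rank_buzzwords(posts, vocabulary, n_periods, min_count=5):
    """Rank candidate words by KL score, highest first.

    min_count (hypothetical threshold) drops very rare words, whose
    distributions are trivially concentrated.
    """
    counts = period_counts(posts, vocabulary)
    scored = [
        (word, kl_to_uniform(c, n_periods))
        for word, c in counts.items()
        if sum(c.values()) >= min_count
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: "steady" appears evenly across 4 periods, "burst" only in period 2.
toy_posts = [(t, ["steady"]) for t in range(4)] * 3 + [(2, ["burst"])] * 12
ranked = rank_buzzwords(toy_posts, {"steady", "burst"}, n_periods=4)
```

A uniform baseline rewards any concentration in time, including seasonal words, so a practical system would likely need a background model or additional filtering. That choice is outside what the abstract specifies.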
Experimental results on a 2014 annual social media big data set show that the proposed algorithm identifies buzzwords accurately, and that its output agrees closely with results tagged collaboratively by humans.

In conclusion, a buzzword popularity computing system is designed and implemented, consisting of web page information extraction, word frequency statistics, and network buzzword popularity calculation. Experiments indicate that the automatically acquired results are highly consistent with those of manual collaborative tagging, which validates that the acquisition method objectively reflects the real characteristics and trends of the language. Moreover, the system not only provides a reference basis for determining the popularity characteristics of network buzzwords, but also gives scholars a convenient way to obtain more Chinese buzzwords.
Keywords/Search Tags: network buzzwords, conditional random fields, time distribution, language model, KL divergence