Research Of Kazakh Hot Words Extraction Methods For Internet Public Sentiment

Posted on: 2017-09-05   Degree: Master   Type: Thesis
Country: China   Candidate: B Y Hu   Full Text: PDF
GTID: 2348330503484343   Subject: Engineering, computer application technology
Abstract/Summary:
In recent years, the spread of the Internet has brought an era of big data and information explosion. Xinjiang is a multi-ethnic region in which many languages are widely used. With the rapid economic and cultural development of the region, the number of users of Kazakh, one of its main languages, has increased year by year, and the number of Kazakh web pages is growing quickly as well. How to find hot information quickly and accurately in the vast amount of Kazakh network text has therefore attracted much attention. This study addresses the related technical problem of obtaining recent hot words from a large volume of Kazakh web pages, and studies hot word extraction methods that take the characteristics of the Kazakh language into account. The test data come from the Kazakh edition of People's Daily Online and the Kazakh edition of the Tianshan website, the two largest and most standard Kazakh websites. The data are first preprocessed. Word entropy and the 3σ criterion are then used to eliminate “abnormal words” from the Kazakh news text, and a hot word extraction algorithm is applied to obtain the hot words. The main contents of this paper are as follows:
(1) The research significance, current status, and development of network public opinion and hot word extraction technology are introduced, and the commonly used methods for calculating word weights are described.
(2) Text is extracted from Kazakh news websites by web crawlers so that useless information is removed. The extracted text is then preprocessed, mainly through word segmentation and stop word filtering, which yields the candidate hot words.
(3) During hot word extraction, a certain number of words occur with very low or very high frequency but are unrelated to the meaning being expressed; we call these “abnormal words”. Word entropy and the 3σ criterion are used to eliminate them from the Kazakh news text, and the experimental results suggest that this method removes them effectively.
(4) Based on the current state of Kazakh research, the TF-PDF algorithm is analyzed in depth. We construct the L-HKAD (Local-Hot Keywords Attention Degree) formula, which combines the TF-PDF algorithm with the idea of media attention, to quantitatively describe the attention degree of candidate words. On this basis, the hot words are combined using the frequency information of the vocabulary and the correlation factor of the composition, which reduces the “word separation” phenomenon to a certain extent.
(5) The method, which combines the improved TF-PDF algorithm with word entropy and the 3σ criterion, is verified on real news website data. Two comparative experiments are conducted; compared with other weight calculation methods, the results suggest that this method improves the hot topic coverage rate. The method offers useful guidance for discovering topics of network public opinion. It is also found that some of the hot phrases obtained by combining hot words express the meaning of certain hot topics more completely.
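The abstract does not give the thesis's own formulas, so the following is only an illustrative Python sketch of the two generic building blocks it names: an entropy score filtered by the 3σ (three-sigma) criterion, and the standard TF-PDF weight, W_j = Σ_c |F_jc| · exp(n_jc / N_c). The function names, the definition of word entropy as the entropy of a word's frequency distribution over documents, and the input layout (tokenized documents grouped by news channel) are assumptions made for illustration. The L-HKAD formula itself is not reproduced, since the abstract does not define it.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def word_entropy(docs):
    """Entropy (in bits) of each word's frequency distribution over documents.

    Assumed definition: p(d | w) = tf(w, d) / total tf(w).
    `docs` is a list of token lists.
    """
    per_doc = [Counter(doc) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    entropy = {}
    for word, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[word]:  # Counter returns 0 for absent words
                p = counts[word] / total
                h -= p * math.log2(p)
        entropy[word] = h
    return entropy


def three_sigma_filter(scores, k=3.0):
    """Keep entries whose score lies within k standard deviations of the mean."""
    values = list(scores.values())
    if not values:
        return {}
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return dict(scores)
    return {w: s for w, s in scores.items() if abs(s - mu) <= k * sigma}


def tf_pdf(channels):
    """Standard TF-PDF weight of each word.

    `channels` maps a channel name to its list of tokenized documents.
    """
    weights = defaultdict(float)
    for docs in channels.values():
        freq = Counter(tok for doc in docs for tok in doc)           # F_jc
        doc_freq = Counter(tok for doc in docs for tok in set(doc))  # n_jc
        norm = math.sqrt(sum(f * f for f in freq.values()))
        for word, f in freq.items():
            # Normalized frequency times exp(share of documents containing the word)
            weights[word] += (f / norm) * math.exp(doc_freq[word] / len(docs))
    return dict(weights)
```

In a pipeline like the one described, the candidate words left after segmentation and stop word filtering would be scored with `word_entropy`, pruned with `three_sigma_filter`, and the survivors ranked by `tf_pdf`; how the thesis actually orders and combines these steps is not stated in the abstract.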
Keywords/Search Tags: Kazakh, network public opinion, text mining, media attention, hot words