Research On Extraction Algorithm Of Mongolian Network Hot Words

Posted on: 2016-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: K Li
GTID: 2308330461983103
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid spread of the network in the Mongolian region, the Internet has become the main medium of information transmission for Mongolian people. However, as information accumulates on the Internet, serious information overload arises. How to obtain valuable information from thousands of Mongolian websites has therefore become a major research challenge. In such an environment, accurate extraction of hot words is very important and has become a focus of scientific research. This work takes the Mongolian news stream as its research object and automatically processes massive news coverage to extract Mongolian network hot words. The main contents of this work are as follows:

(1) Based on an analysis of the structural features of Mongolian network news, the frequency of each candidate word is counted separately in the title and in the content when term weights are calculated. A weighting algorithm then assigns a higher weight to terms that appear in both the title and the content. Experiments show that this method improves the detection of hot topics.

(2) Analysis of the candidate word list shows that news content contains some high-frequency words that do not help express the meaning of the news; we call these "abnormal words". We use word entropy and the 3σ criterion to eliminate "abnormal words" from Mongolian news text. Experimental results suggest that this method eliminates them effectively.

(3) In line with the characteristics of hot words, the variation of word weights is used to express the change in attention to hot spots within a specific time period. Experiments indicate that this method can accurately extract Mongolian hot words.

(4) We use four hot word extraction algorithms based on word frequency and TF*PDF to extract Mongolian network hot words, and we conduct a series of comparative experiments that compare the hot topic coverage rates of the algorithms.
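
The abstract does not say how word entropy is computed or on which side the 3σ criterion is applied in item (2). The following is a minimal Python sketch of one possible reading, under two assumptions that are not from the thesis: entropy is taken over a word's distribution across documents, and a word is flagged as "abnormal" when its entropy exceeds the mean by more than three standard deviations (that is, it is spread almost evenly over all news items). The function names `word_entropy` and `abnormal_words` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def word_entropy(docs):
    """Entropy (in bits) of each word's distribution over documents.

    docs is a list of token lists. This definition is an assumption;
    the abstract does not specify the one used in the thesis.
    """
    per_doc = defaultdict(Counter)  # word -> {document index: count}
    for i, tokens in enumerate(docs):
        for tok in tokens:
            per_doc[tok][i] += 1
    entropy = {}
    for word, counts in per_doc.items():
        total = sum(counts.values())
        # sum of p * log2(1/p) with p = c / total
        entropy[word] = sum((c / total) * math.log2(total / c)
                            for c in counts.values())
    return entropy


def abnormal_words(docs, k=3.0):
    """Words whose entropy lies more than k standard deviations above the mean."""
    entropy = word_entropy(docs)
    values = list(entropy.values())
    mu, sigma = mean(values), pstdev(values)
    return {w for w, h in entropy.items() if h > mu + k * sigma}
```

Note that the 3σ rule only becomes meaningful with a reasonably large candidate list: with very few words, no single value can lie three standard deviations from the mean.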
Keywords/Search Tags: extraction of hot words, word frequency, location weighting, TF*PDF
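
For readers unfamiliar with TF*PDF (Term Frequency * Proportional Document Frequency), the sketch below follows the commonly cited formulation: the weight of a term is the sum over news channels of its normalized frequency in the channel multiplied by exp(n/N), where n is the number of documents in that channel containing the term and N is the number of documents in the channel. Whether the thesis uses exactly this form, and what it treats as a "channel", is not stated in the abstract, so the function `tf_pdf` and its input layout are illustrative assumptions.

```python
import math
from collections import Counter


def tf_pdf(channels):
    """TF*PDF weights for terms across news channels.

    channels maps a channel name to a list of documents, each a list of
    tokens. This input layout is an assumption made for illustration.
    """
    weights = Counter()
    for docs in channels.values():
        if not docs:
            continue
        # F_jc: frequency of each term within the channel
        freq = Counter(tok for doc in docs for tok in doc)
        # n_jc: number of documents in the channel containing the term
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # Normalize term frequencies by the channel's Euclidean norm
        norm = math.sqrt(sum(f * f for f in freq.values()))
        for term, f in freq.items():
            weights[term] += (f / norm) * math.exp(doc_freq[term] / len(docs))
    return dict(weights)
```

A term that is reported by many documents in several channels accumulates weight from each channel, which is why TF*PDF favors widely covered topics over terms that are frequent in a single source.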