An Improved Clustering Algorithm Based On The Multilingual Topic Found

Posted on: 2017-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: X M Wang
Full Text: PDF
GTID: 2308330503461495
Subject: Computer application technology
Abstract/Summary:
With the rapidly growing number of mobile terminal users worldwide, and especially the rapid development of mobile information platforms, the Internet has become an important platform for the public to disseminate and access information, and online news media have become the most direct channel for public access to information. Micro-blogs have a profound influence on Internet life. The Internet era has not only made information sources increasingly widespread, but has also raised the requirements for the accuracy, popularity, timeliness and fairness of information. People are no longer satisfied with domestic information alone; they also look for timely international reactions to the same information. According to the relevant authority, the world's most widely used languages are English and French, while English, Chinese and French are the languages used most frequently in online news. For the same event, people's ideas and views differ because of their different countries and cultural backgrounds. Users therefore hope to obtain more comprehensive and complete information in multiple languages and from different perspectives. To retrieve the information users need quickly and broadly from this mass of information, multilingual topic discovery has long been a subject of research.
In this paper, the multi-language topic discovery system is divided into five layers: the information collection layer, the preprocessing and purification layer, the multi-language conversion layer, the text clustering layer and the information presentation layer. The algorithm used in each layer is improved and optimized according to the characteristics of news media, and together the layers form the news topic discovery system.
First, in the multi-language conversion layer, given that current machine translation is quite accurate, this paper translates the multi-language documents into Chinese as the common language using the Microsoft translation software Babylon. Secondly, keywords are obtained by combining the LDA (Latent Dirichlet Allocation) algorithm with the TF-IDF (term frequency-inverse document frequency) algorithm. The LDA algorithm can find the core vocabulary quickly, but at the scale of big data it still lacks precision, while the TF-IDF algorithm can remove common words that appear with high frequency but carry little information. Because core words differ in their importance to a document, a weight calculation for the core vocabulary is introduced. The algorithm design uses the feature weight calculation method proposed in [23], which draws on the IDF part of TF-IDF. Through similarity matrix calculation, keyword clustering to find topics, and the establishment of links between texts and topics, the keyword clustering is completed. The paper also focuses on improving the algorithm for single-language topic discovery. In the topic discovery system, the Single-pass online clustering algorithm takes little time, which suits the characteristics of news reports, but it suffers from clustering errors. The hierarchical clustering algorithm, in contrast, can define the starting position of a cluster at any time. By combining the two algorithms at different stages of text clustering, online updating of the topic clusters is realized.
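The TF-IDF weighting and Single-pass clustering described above can be sketched as follows. This is a minimal illustration and not the thesis's implementation: it leaves out the LDA step, the weighting method of [23] and the hierarchical clustering stage, and the function names, the cosine similarity measure, the running-mean centroid and the threshold value are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Plain idf = log(n / df): a term present in every document gets weight 0.
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors, threshold=0.3):
    """Assign each vector, in arrival order, to the most similar cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": dict, "members": [doc indices]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            k = len(best["members"])
            centroid = best["centroid"]
            # Update the centroid as the running mean of its members.
            for t in set(centroid) | set(vec):
                old = centroid.get(t, 0.0)
                centroid[t] = old + (vec.get(t, 0.0) - old) / k
        else:
            clusters.append({"centroid": dict(vec), "members": [i]})
    return clusters
```

Each document is compared only against the current cluster centroids and is never revisited, which is why the method is fast enough for a stream of news reports. It is also why an early wrong assignment cannot be undone, the weakness that the thesis addresses by adding a hierarchical clustering stage.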
Finally, drawing on experience from journalism and communication studies about how a news story develops from its outbreak to its subsiding, this paper argues that the time feature is particularly important. Drawing on the ideas of [49], the study proposes using the logistic function (as in logistic regression) to add a time factor weight, which can effectively improve the accuracy of the news topic clustering results. Based on the above work, this paper implements the news topic discovery system, carries out data evaluation and experimental analysis within a certain range, and verifies the validity and feasibility of the system on news forum data.
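The abstract does not give the formula for the time factor weight, so the following is only one plausible reading of the idea: a logistic curve that decays as the time gap between a new report and an existing topic grows, multiplied into the text similarity. The function names, the `midpoint` and `steepness` parameters and their default values are hypothetical.

```python
import math


def time_weight(gap_hours, midpoint=48.0, steepness=0.1):
    """Logistic decay in (0, 1): close to 1 for small gaps, exactly 0.5 at
    `midpoint`, close to 0 for large gaps. Parameter values are illustrative."""
    # Clamp the exponent so math.exp cannot overflow for very large gaps.
    x = min(steepness * (gap_hours - midpoint), 700.0)
    return 1.0 / (1.0 + math.exp(x))


def time_adjusted_similarity(text_sim, gap_hours, midpoint=48.0, steepness=0.1):
    """Scale a text similarity score by the time factor weight."""
    return text_sim * time_weight(gap_hours, midpoint, steepness)
```

Under this reading, a report that arrives long after a topic has subsided needs a much higher text similarity to join that topic, which matches the outbreak-to-subsiding life cycle of news described above.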
Keywords/Search Tags:multi-language, topic discovery, single language, clustering method, Single-pass clustering algorithm, hierarchical clustering algorithm, time factor, system