Research On Vietnamese News Topic Recognition Method Based On Suffix Tree Clustering Algorithm

Posted on: 2017-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: L J Zhu
Full Text: PDF
GTID: 2358330488464858
Subject: Computer software and theory
Abstract/Summary:
As an information medium, news affects every aspect of people's life and work. Given the huge volume of news now available, helping readers find the news topics they care about is an urgent problem. Topic detection technology addresses this problem: it identifies topics, organizes the topics a user is interested in, presents how each topic develops, and so meets people's needs for news. Internationally, cooperation between countries is increasing. China and Vietnam are neighbors joined by mountains and rivers, closely related and mutually dependent geographically, so it is very necessary to understand the situation in Vietnam. Based on the characteristics of the Vietnamese language, a news topic detection model is proposed that supports both topic identification and personalized service. This paper reviews the development of topic identification and tracking technology in China and abroad, and designs a topic identification system for Vietnamese news text. The main research contents are as follows:
(1) News text preprocessing is completed on the laboratory's Vietnamese information processing platform. Text features are chosen according to word frequency, word class, named entity, title, and location. The weights are then adjusted according to the word frequency and inverse document frequency of the news texts. Finally, a top-n strategy and a threshold limit are used to extract the features of each news text.
(2) First, the corpus crawled from Vietnamese news websites is preprocessed on the processing platform to obtain a high-quality corpus. The vector space model and the suffix tree model are used to represent the corpus, and a suffix tree is constructed over it. For the common phrases, base clusters are chosen by the method proposed in Chapter ? as the basis of clustering. Second, the similarity formula used to merge base clusters is optimized to improve clustering performance. Finally, cluster labels are used to express the clustering results. An experiment comparing the improved suffix tree clustering algorithm STCV with the traditional suffix tree clustering algorithm STC is designed to verify the performance of the proposed method.
(3) A prototype system is designed on the basis of the above research. Through this system, news topics can be identified, helping people make better use of news information.
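The weighting and selection step in item (1) can be sketched as follows. This is a minimal illustration, assuming plain TF-IDF weights over already segmented tokens. The function name `select_features` and the parameters `top_n` and `min_weight` are hypothetical, and the adjustments for word class, named entity, title, and location that the thesis mentions are not modeled here.

```python
import math
from collections import Counter


def select_features(docs, top_n=3, min_weight=0.0):
    """Return, per document, up to top_n (term, weight) pairs above min_weight.

    docs is a list of token lists (assumed to be already word-segmented).
    Weights are plain TF-IDF; this is an assumption, not the thesis formula.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    selected = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        weights = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Rank by weight (ties broken alphabetically), then apply both limits.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.append([(t, w) for t, w in ranked[:top_n] if w > min_weight])
    return selected


docs = [
    ["ha_noi", "mua", "mua"],
    ["ha_noi", "bong_da"],
    ["ha_noi", "kinh_te"],
]
features = select_features(docs, top_n=2, min_weight=0.0)
```

A term that occurs in every document, such as `ha_noi` in this toy corpus, gets an inverse document frequency of zero and is dropped by the threshold.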
Keywords/Search Tags:Vietnam, Feature Selection, Topic Detection, Suffix Tree Clustering
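The clustering step in item (2) can be illustrated with a simplified sketch of traditional suffix tree clustering (STC). Two assumptions apply: shared phrases are found by enumerating word n-grams instead of building an actual suffix tree, and base clusters are merged with the classic overlap criterion (more than half of each cluster's documents in common), not the optimized similarity formula of STCV, which the abstract does not give. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each phrase (word n-gram) to the set of documents containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for start in range(len(tokens)):
            stop = min(start + max_len, len(tokens))
            for end in range(start + 1, stop + 1):
                phrase_docs[tuple(tokens[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}


def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap strongly both ways."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        overlap = len(clusters[a] & clusters[b])
        if (overlap / len(clusters[a]) > threshold
                and overlap / len(clusters[b]) > threshold):
            parent[find(a)] = find(b)

    merged = defaultdict(lambda: {"phrases": [], "docs": set()})
    for p in phrases:
        root = find(p)
        merged[root]["phrases"].append(" ".join(p))  # phrases act as labels
        merged[root]["docs"] |= clusters[p]
    return list(merged.values())


docs = [
    ["flood", "hits", "hanoi"],
    ["flood", "hits", "hue"],
    ["football", "final", "tonight"],
    ["football", "final", "result"],
]
topics = merge_clusters(base_clusters(docs))
```

Each merged cluster keeps the phrases that produced it, which is how cluster labels can be read off to describe a topic.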