
Key Technology Research On Tibetan Websites Topic Detection And Tracking

Posted on: 2014-10-27
Degree: Master
Type: Thesis
Country: China
Candidate: X H Meng
GTID: 2268330425470722
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet in Tibetan areas, the network has become an indispensable part of life for Tibetan people living in Gansu, Tibet and Qinghai. How to mine valuable information from the huge amount of data generated on the Internet every day has become a new research direction. Topic detection and tracking (TDT) was proposed to help people detect previously unknown topics in large volumes of news reports and to track follow-up reports on known topics. The research object of this paper is the news text of Tibetan news websites.
TDT research mainly consists of six parts: corpus preprocessing, text feature extraction, weight calculation, text vector representation and text similarity comparison, text clustering algorithms, and classification algorithms. This paper focuses on feature weight calculation and text clustering algorithms. When a text is represented as a vector, each element of the vector is a feature weight, so the weights are one of the dominant factors in the study. Starting from the traditional weight calculation, this paper raises the weight of features that appear in Tibetan headlines, so that the weight assigned to each feature is more reliable.
For topic detection, text clustering is comparatively easy to implement and understand. The essence of the research is a dynamic text clustering algorithm; text clustering groups similar documents into a cluster using the vector space model. This paper proposes a clustering algorithm based on a simple clustering algorithm. First, it reduces the effect that the order in which texts are processed has on the clustering results. Second, it introduces the concept of a seed topic and determines the topic categories from the number of seed topics. On a small corpus, the new clustering algorithm shows some improvement over the previous algorithm. This is basically in line with expectations, and the work lays a good foundation for future research on topic detection and tracking for Tibetan websites.
Keywords/Search Tags: Tibetan Sites, TDT, Seed Topic, Clustering Algorithm
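The abstract describes two ideas without giving formulas: boosting the weight of features that appear in the headline, and grouping reports into topics with a simple incremental clustering algorithm over the vector space model. The Python sketch below illustrates one plausible reading of those ideas and is not the thesis's actual method. The smoothed TF-IDF formula, the multiplier `TITLE_BOOST`, the similarity `threshold`, and the function names are all assumptions made for illustration. The seed-topic and order-correction steps are not reproduced because the abstract does not specify them. Tokens are English placeholders, and Tibetan word segmentation is assumed to happen upstream.

```python
import math
from collections import Counter

TITLE_BOOST = 2.0  # assumed multiplier for terms that occur in the headline


def tfidf_vectors(docs, title_boost=TITLE_BOOST):
    """docs: list of (title_tokens, body_tokens); returns one sparse vector per doc."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for title, body in docs:
        df.update(set(title) | set(body))
    vectors = []
    for title, body in docs:
        tf = Counter(body)
        tf.update(title)
        title_terms = set(title)
        vec = {}
        for term, count in tf.items():
            idf = math.log((n + 1) / (df[term] + 1)) + 1.0  # smoothed IDF (assumed)
            weight = count * idf
            if term in title_terms:
                weight *= title_boost  # headline terms count for more
            vec[term] = weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(vectors, threshold=0.3):
    """Baseline single-pass clustering: join the closest cluster or open a new one."""
    centroids = []  # summed member vectors; cosine ignores the scale
    labels = []
    for vec in vectors:
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            for term, w in vec.items():
                centroids[best][term] = centroids[best].get(term, 0.0) + w
            labels.append(best)
        else:
            centroids.append(dict(vec))
            labels.append(len(centroids) - 1)
    return labels


# Toy corpus with placeholder tokens: two flood reports, two football reports.
docs = [
    (["flood", "relief"], ["flood", "river", "relief", "village"]),
    (["flood", "warning"], ["flood", "river", "rain", "warning"]),
    (["football", "final"], ["football", "team", "final", "goal"]),
    (["football", "match"], ["football", "team", "match", "goal"]),
]
vectors = tfidf_vectors(docs)
labels = single_pass(vectors)
print(labels)
```

Because each report is compared only against the clusters that exist when it arrives, the result of this baseline depends on input order, which is the weakness the abstract says the proposed algorithm addresses.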