Key Technology Research On Tibetan Websites Topic Detection And Tracking

Posted on:2014-10-27

Degree:Master

Type:Thesis

Country:China

Candidate:X H Meng

GTID:2268330425470722

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet in Tibetan areas, network use has become an indispensable part of life for Tibetan people living in Gansu, Tibet and Qinghai. How to mine valuable information from the huge volume of data generated on the Internet every day has become a new research direction. Topic detection and tracking (TDT) was proposed to help people discover previously unknown topics in large volumes of news reports and to follow known topics through subsequent reports. The research object of this thesis is the text of news published on Tibetan news websites.

TDT research mainly consists of six parts: corpus preprocessing, text feature extraction, weight calculation, text vector representation, text similarity comparison, and text clustering and classification algorithms. This thesis focuses on feature weight calculation and text clustering algorithms. When a text is represented as a vector, each element of the vector is a feature weight, so the weights are one of the dominant factors in the study. Starting from the traditional weight calculation, this thesis raises the weight of features that appear in Tibetan headlines, so that the weight assigned to each feature is more reliable.

Topic detection is easier to implement and understand when it is treated as a text clustering problem. The core of the research is a dynamic text clustering algorithm; text clustering groups similar documents into clusters using the vector space model. This thesis proposes a clustering algorithm based on a simple clustering algorithm. First, the algorithm reduces the effect that the order of the input texts has on the clustering results. Second, it introduces the concept of a seed topic and determines the topic categories from the number of seed topics. On a small corpus, the new algorithm shows some improvement over the previous algorithm. This is basically in line with expectations, and the work lays a good foundation for future research on topic detection and tracking for Tibetan websites.
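The two ideas the abstract focuses on, headline-boosted feature weights and incremental clustering over the vector space model, can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract gives no formulas, so the `title_boost` parameter, the smoothed TF-IDF formula, the cosine threshold, and the plain single-pass clustering below are all assumptions chosen for illustration. The order-independence improvement and the seed-topic mechanism are not reproduced here. Tokens are placeholder strings standing in for segmented Tibetan words.

```python
import math
from collections import Counter


def title_boosted_tfidf(docs, title_boost=2.0):
    """docs: list of (title_tokens, body_tokens) pairs.

    Returns one sparse vector (dict term -> weight) per document.
    Assumption: each headline occurrence adds `title_boost` to the
    term frequency; the thesis does not state its actual formula.
    """
    n = len(docs)
    df = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))
    vectors = []
    for title, body in docs:
        tf = Counter(body)
        for term in title:
            tf[term] += title_boost
        # Smoothed IDF, always positive.
        vec = {t: f * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, f in tf.items()}
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(vectors, threshold=0.3):
    """Baseline single-pass clustering (assumed stand-in for the
    'simple clustering algorithm' the abstract mentions).

    Each document joins the most similar existing cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster.
    The result depends on input order, which is the weakness the
    thesis says it addresses.
    """
    clusters = []  # each: {"centroid": dict, "members": [doc index]}
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": dict(vec), "members": [i]})
        else:
            best["members"].append(i)
            # Summed vector serves as centroid; cosine ignores scale.
            for t, w in vec.items():
                best["centroid"][t] = best["centroid"].get(t, 0.0) + w
    return [c["members"] for c in clusters]


docs = [
    (["quake"], ["quake", "rescue", "team"]),
    (["quake"], ["quake", "rescue", "aid"]),
    (["festival"], ["festival", "dance", "music"]),
]
vectors = title_boosted_tfidf(docs)
print(single_pass(vectors))  # [[0, 1], [2]]
```

In this toy run the two reports sharing the headline term end up in one topic cluster and the unrelated report starts its own. Setting `title_boost=0.0` falls back to ordinary TF-IDF, which makes it easy to compare the two weightings on the same data.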

Keywords/Search Tags:

Tibetan Sites, TDT, Seed Topic, Clustering Algorithm

Related items

1	Research On Correlative Techniques Of Hot-topic Discovery About Internet Public Opinion
2	Study On Tibetan Information Retrieval & Search Results Clustering And System Implementation
3	Research On Topic Clustering Algorithm Based On Topic Models
4	Network Hot Topic Discovery Based On Topic Model And Clustering Algorithm
5	Study On Hot Topic Detection Based On The Analysis Of Tibetan Public Opinion
6	An Improved Clustering Algorithm Based On The Multilingual Topic Found
7	Research On Multi-Agent And Swarm Intelligence Tibetan Network Public Opinion Management
8	The Research On Semi-supervised Clustering Algorithm With Seed Object Constraints
9	Fuzzy Clustering With AFSA: Methodology And Application To The Analysis Of Websites
10	The Research Of Clustering Algorithm For Hot Topic Detection