
Research Of Topic Detection And Tracking Based On Multisource Data

Posted on: 2018-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: L J Cheng
Full Text: PDF
GTID: 2348330512482996
Subject: Computer software and theory
Abstract/Summary:
With the spread of the Internet and advances in science and technology, network platforms such as news sites and micro-blogs have gradually become an important channel through which the public accesses information. Faced with the massive volume of data on these platforms, how to quickly find the information one needs has become a focus of attention. Topic detection and tracking was proposed for this situation: it detects topics in an information stream and tracks specific topics, helping people understand the relevant events more fully. Because data quality on network platforms is uneven, reports on a topic may be scattered across multiple platforms. However, most existing topic detection and tracking studies focus on a single platform, which can easily lead to cognitive bias or missing topic reports. This thesis takes news and micro-blog reports as its research objects. Because the two kinds of reports share semantically co-occurring words, we combine them to perform topic detection and tracking. The main contents of this thesis are:

(1) We propose a new topic detection method based on frequent word set clustering, which detects topics in both kinds of reports at the same time. More specifically, we cluster the frequent word sets generated from news and micro-blog reports to find the centroid vector of each topic, and then complete the topic detection task. The algorithm is improved in three respects: topic model construction, similarity calculation between frequent word sets, and topic fusion. In the experiments, the average miss rate of the algorithm is below 20%, the average false detection rate is about 5%, and the detection results for the two kinds of reports do not differ greatly.

(2) We implement multi-source topic tracking with an improved KNN algorithm that combines both kinds of reports in the tracking process. In particular, the algorithm first narrows the range of candidate topic categories by comparing similarity with the topic centers, and then uses the exact similarity with each report to determine the categories to be tracked. In addition, in view of the uneven quality of reports and the evolution of topics, this thesis proposes corresponding strategies for selecting feedback reports and for weighting topic words. In the comparative experiments, the proposed algorithm reduces the cost of topic tracking by 5%.

Through these two parts of the work, the data can be combined effectively to achieve multi-source topic detection and tracking. The research can be applied to hot topic detection and continuous tracking of specific topics in public opinion or intelligence systems. In future work, we can study other topic model representations, sentiment analysis of topics, and the integration of more types of data.
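
The two-stage tracking idea in part (2) of the abstract (shortlist topics by similarity to the topic center, then compare against individual reports) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: the sparse term-weight dictionaries, cosine similarity, the similarity-weighted vote, and the parameters `n_candidates`, `k`, and `threshold` are all hypothetical choices, and the feedback-report selection and topic-word weighting strategies are not modeled.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse term-weight dicts."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def track(report, topics, n_candidates=2, k=3, threshold=0.2):
    """Assign `report` to a topic id, or return None if nothing is close.

    `topics` maps a topic id to the list of report vectors (news and
    micro-blog alike) already assigned to it. All parameters are
    hypothetical defaults chosen for this sketch.
    """
    # Stage 1: shortlist the topics whose centers are most similar.
    ranked = sorted(
        topics,
        key=lambda tid: cosine(report, centroid(topics[tid])),
        reverse=True,
    )
    shortlist = ranked[:n_candidates]

    # Stage 2: exact similarity against every report in the shortlist,
    # keeping the k nearest neighbours.
    neighbours = sorted(
        ((cosine(report, v), tid) for tid in shortlist for v in topics[tid]),
        reverse=True,
    )[:k]

    # Similarity-weighted vote among neighbours above the threshold.
    votes = Counter()
    for sim, tid in neighbours:
        if sim >= threshold:
            votes[tid] += sim
    return votes.most_common(1)[0][0] if votes else None
```

With a toy `topics` dictionary holding a few term-weight vectors per topic, `track(new_report, topics)` returns the id of the best-matching topic, or `None` when no neighbour clears the threshold. Restricting the second stage to a shortlist keeps the number of report-level comparisons small, which appears to be the motivation for the coarse-then-exact design described in the abstract.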
Keywords/Search Tags:multisource data, frequent word sets, topic detection, topic tracking