Research On Hot Topic Detection Technology Of Netnews

Posted on:2017-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:S X Li

Full Text:PDF

GTID:2348330518470820

Subject:Computer Science and Technology

Abstract/Summary:

With the vigorous development of the Internet, a huge amount of data is generated on the network every day; the volume of news produced by portal websites alone is considerable. How to extract the topics that attract the most attention from this large volume of information is a subject worthy of study, and topic detection aims to solve this problem. The key technologies in the topic detection process include word segmentation, feature extraction, similarity computation, text representation, and clustering algorithms. Although these have been studied extensively, some aspects still need improvement. This thesis studies the existing topic detection schemes in depth, analyzes their problems, and puts forward improved schemes to solve them. The main work of this thesis is as follows.

Firstly, after an analysis of text representation models, the vector space model (VSM) is chosen as the text representation model and is improved in the aspect of feature item extraction. At present, feature extraction and weight calculation are generally based on word frequency statistics, ignoring the semantic relations between words. This thesis proposes a modified method for extracting and weighting feature words that combines TF-IDF with thesaurus-based word similarity computation.

Secondly, an improvement to the clustering algorithm is proposed. After comparison and analysis, the Single-Pass clustering algorithm, which is suitable for processing dynamic data, is selected. In Single-Pass, the similarity between a document and a cluster is taken as the maximum of the similarities between that document and the documents in the cluster, so the amount of computation per round grows as the number of documents increases. To solve this problem, the thesis proposes an incremental algorithm with cluster centers, which reduces computation time by adjusting the cluster centers and also eases the sensitivity to the initial document order. In addition, the thesis extends the single threshold of Single-Pass to two thresholds, which are used to cluster topics and subtopics, so that the topic hierarchy is more distinct.

Lastly, the improved feature extraction method and the improved Single-Pass clustering algorithm are validated by experiments. The TDT evaluation metrics are used, and the performance and efficiency of the algorithm are verified by comparison with the evaluation results of other algorithms. Experimental results show that the improved scheme increases clustering accuracy and reduces the error cost.
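The idea of a Single-Pass variant that compares each document with cluster centers and uses two thresholds can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not give the update rule or the threshold values, so the running-sum centroid, the names `Cluster`, `best_match`, and `single_pass`, and the default thresholds 0.3 and 0.6 are all assumptions chosen for illustration. Documents are assumed to be already converted to sparse weight vectors (for example by TF-IDF).

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    """Cluster with a running sum of member vectors.

    Cosine similarity ignores scale, so the sum stands in for the mean
    (the cluster center) and is updated incrementally on each insert.
    """

    def __init__(self):
        self.total = defaultdict(float)  # sum of member vectors
        self.members = []                # document ids
        self.children = []               # subtopic clusters (topics only)

    def add(self, doc_id, vec):
        self.members.append(doc_id)
        for t, w in vec.items():
            self.total[t] += w


def best_match(clusters, vec):
    """Return the cluster whose center is most similar to vec."""
    best, best_sim = None, -1.0
    for c in clusters:
        s = cosine(vec, c.total)
        if s > best_sim:
            best, best_sim = c, s
    return best, best_sim


def single_pass(docs, topic_threshold=0.3, subtopic_threshold=0.6):
    """One pass over (doc_id, vector) pairs; thresholds are illustrative."""
    topics = []
    for doc_id, vec in docs:
        # Level 1: one comparison per topic center, not per document.
        topic, sim = best_match(topics, vec)
        if topic is None or sim < topic_threshold:
            topic = Cluster()
            topics.append(topic)
        topic.add(doc_id, vec)
        # Level 2: the stricter threshold separates subtopics in the topic.
        sub, sub_sim = best_match(topic.children, vec)
        if sub is None or sub_sim < subtopic_threshold:
            sub = Cluster()
            topic.children.append(sub)
        sub.add(doc_id, vec)
    return topics
```

Each incoming document costs one similarity computation per cluster rather than one per previously seen document, which is the saving the abstract attributes to cluster centers. The lower threshold decides topic membership and the higher one decides subtopic membership within the chosen topic.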

Keywords/Search Tags:

Topic detection, feature extraction, semantic similarity, Single-Pass clustering algorithm

Related items

1	The Design And Implementation Of The Hot Topic Detection System Based On The Improved Single-Pass Algorithm
2	Improvement Of Single-Pass Clustering Algorithm And Its Application In Microblog Topic Detection
3	Research On Chinese Micro-blog Hot Topics Detection
4	An Improved Clustering Algorithm Based On The Multilingual Topic Found
5	Research On The Method Of Topic Discovery And Hotness Evaluation For News
6	Research On The Key Technology Of Hot Spot Topic Discovery Based On Microblogging
7	Internet News Hot Mining System Research And Implementation
8	Research On Network News Hot Topic Detection Based On LDA Model And Clustering
9	News Topic Detection Based On LDA Fusion Model And Multi-layer Clustering
10	Research On Multi-Level Topic Clustering Based On Cross Degree