Research On Hot Topic Detection Technology Of Netnews

Posted on: 2017-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: S X Li
Full Text: PDF
GTID: 2348330518470820
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, a huge amount of data is generated on the network every day; the volume of news produced by portal websites alone is very large. How to identify the topics that attract the most attention within this mass of information is a subject worthy of study, and topic detection aims to solve this problem. The key technologies in the topic detection process include word segmentation, feature extraction, similarity computation, text representation, and clustering. Although these have been studied extensively, some aspects still need improvement. This thesis examines existing topic detection schemes in depth, analyzes their problems, and proposes improvements to address them. The main work of this thesis is as follows.

Firstly, after analyzing text representation models, the vector space model (VSM) is chosen as the text representation model and is improved in the aspect of feature item extraction. At present, feature extraction and weight calculation are generally based on word frequency statistics and ignore the semantic relations between words. This thesis proposes a modified method for extracting and weighting feature words that combines TF-IDF with thesaurus-based word similarity computation.

Secondly, an improvement to the clustering algorithm is proposed. Through comparison and analysis, the Single-Pass clustering algorithm, which is suitable for processing dynamic data, is selected. In Single-Pass, the similarity between a document and a cluster is taken as the maximum similarity between that document and the documents in the cluster, so the amount of computation per round grows as the number of documents increases. To solve this problem, the thesis proposes an incremental algorithm with cluster centers, which reduces computation time by adjusting the cluster centers and also eases the sensitivity to the initial document order. In addition, the thesis extends the single threshold of Single-Pass to two thresholds, which are used to cluster topics and subtopics respectively, so that the topic hierarchy is more distinct.

Lastly, the improved feature extraction method and the improved Single-Pass clustering algorithm are validated experimentally. The experiments use the TDT evaluation metrics, and the performance and efficiency of the algorithm are verified by comparison with the evaluation results of other algorithms. Experimental results show that the improved scheme increases clustering accuracy and reduces the error cost.
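The center-based, two-threshold variant of Single-Pass described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify how centers are updated or how the two thresholds interact, so the sketch assumes a running-mean center, cosine similarity, a lower threshold for joining a topic, and a higher threshold for joining a subtopic within that topic. The names `Cluster`, `best_match`, `single_pass`, the threshold values, and the toy vectors (standing in for weighted feature vectors) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """A cluster summarised by the mean of its member vectors (assumed center)."""

    def __init__(self):
        self.total = {}      # component-wise sum of member vectors
        self.members = []    # document ids in arrival order
        self.children = []   # subtopic clusters (used at topic level only)

    def add(self, doc_id, vec):
        for term, weight in vec.items():
            self.total[term] = self.total.get(term, 0.0) + weight
        self.members.append(doc_id)

    def center(self):
        n = len(self.members)
        return {term: weight / n for term, weight in self.total.items()}


def best_match(clusters, vec):
    """Return the cluster whose center is most similar to vec, and that similarity."""
    best, best_sim = None, -1.0
    for cluster in clusters:
        sim = cosine(vec, cluster.center())
        if sim > best_sim:
            best, best_sim = cluster, sim
    return best, best_sim


def single_pass(docs, topic_threshold=0.3, subtopic_threshold=0.6):
    """One pass over (doc_id, vector) pairs; thresholds are illustrative values."""
    topics = []
    for doc_id, vec in docs:
        # Compare against one center per topic, not against every stored document.
        topic, sim = best_match(topics, vec)
        if topic is None or sim < topic_threshold:
            topic = Cluster()
            topics.append(topic)
        # Second, stricter threshold decides the subtopic inside the chosen topic.
        sub, sub_sim = best_match(topic.children, vec)
        if sub is None or sub_sim < subtopic_threshold:
            sub = Cluster()
            topic.children.append(sub)
        topic.add(doc_id, vec)
        sub.add(doc_id, vec)
    return topics


# Toy input: four documents with hand-made term weights.
docs = [
    (1, {"quake": 1.0, "rescue": 1.0}),
    (2, {"quake": 1.0, "rescue": 1.0, "aid": 1.0}),
    (3, {"quake": 1.0, "damage": 1.0, "cost": 1.0}),
    (4, {"election": 1.0, "vote": 1.0}),
]
topics = single_pass(docs)
for topic in topics:
    print(topic.members, [sub.members for sub in topic.children])
```

Because each incoming document is compared with one center per cluster, the cost of a round depends on the number of clusters rather than on the number of documents already processed, which is the saving the abstract attributes to the center-based approach.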
Keywords/Search Tags:Topic detection, feature extraction, semantic similarity, Single-Pass clustering algorithm