Topic Discovery Research Oriented To News Text

Posted on: 2019-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: T T Wang
Full Text: PDF
GTID: 2405330551458541
Subject: Applied Statistics
Abstract/Summary:
With advances in science and technology, we have entered the era of big data. The internet now produces a vast amount of information, much of it unorganized, and finding the content that users are interested in within it is a popular and difficult problem in text mining. In recent years, most research on topic discovery has been based on the Vector Space Model (VSM) and the Latent Dirichlet Allocation (LDA) model. Improving clustering quality nevertheless remains a basic problem in topic discovery from news reports. This thesis therefore applies three models to topic discovery and analyzes them: the Vector Space Model, the binary Co-occurrence Latent Semantic Vector Space Model (CLSVSM), and the LDA topic model.

First, because the Vector Space Model has some shortcomings, the thesis constructs it on the basis of part-of-speech extraction. Using the TF-IDF weighting method, it then applies K-means and agglomerative hierarchical clustering and compares the clustering results. Second, because the Co-occurrence Latent Semantic Vector Space Model is known to greatly improve text clustering accuracy compared with the Vector Space Model, the thesis applies the binary CLSVSM to topic discovery and compares it with the other two models in terms of clustering quality and topic recognition. Finally, the experiments use a subset of the Sogou news corpus, and the clustering results are evaluated with the F-measure.

The experiments lead to several conclusions. Within the Vector Space Model, the clustering results obtained with part-of-speech extraction are more accurate, but they are not as good as those of the LDA topic model or the binary CLSVSM. There is no significant difference in clustering quality between the LDA model and the binary CLSVSM. The results verify the effectiveness of constructing the Vector Space Model in combination with part of speech, and they show that applying the binary CLSVSM to topic discovery is reasonable and effective. Furthermore, the thesis draws on the characteristics of the three models to extract topic words from each category, using a different extraction method for each model. From the topic words extracted for each category, the main content of the news text is easy to understand and the main topics it contains can be clearly identified.
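
The abstract states that clustering results are evaluated with the F-measure but does not give the formula. As an illustration only, the sketch below computes one common clustering variant: for each true class, take the best F1 score over all clusters, then average those scores weighted by class size. Whether the thesis uses this exact variant is an assumption; the function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure for a clustering.

    Assumed variant (not confirmed by the thesis): for each true class,
    take the maximum F1 over all clusters, then weight by class size.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("at least one document is required")

    class_sizes = Counter(true_labels)                 # n_i: documents per class
    cluster_sizes = Counter(cluster_ids)               # n_j: documents per cluster
    overlap = Counter(zip(true_labels, cluster_ids))   # n_ij: shared documents

    best_f = {label: 0.0 for label in class_sizes}
    for (label, cluster), n_ij in overlap.items():
        precision = n_ij / cluster_sizes[cluster]
        recall = n_ij / class_sizes[label]
        f1 = 2 * precision * recall / (precision + recall)
        best_f[label] = max(best_f[label], f1)

    # Weight each class's best score by its share of the corpus.
    return sum(class_sizes[label] / n * best_f[label] for label in class_sizes)


# Toy example with made-up news categories.
truth = ["sports", "sports", "finance", "finance"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1])
merged = clustering_f_measure(truth, [0, 0, 0, 0])
```

A clustering that recovers the classes exactly scores 1.0, while merging every document into one cluster lowers precision and therefore the score. This makes the measure usable for comparing the K-means and hierarchical results described above on the same labeled corpus.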
Keywords/Search Tags: Topic discovery, Text clustering, LDA topic model, Co-occurrence Latent Semantic Vector Space Model, Vector Space Model