Research on a Text Clustering Algorithm Based on Word Frequency and Semantics

Posted on: 2018-10-19 | Degree: Master | Type: Thesis
Country: China | Candidate: H L Qi | Full Text: PDF
GTID: 2348330536479665 | Subject: Computer application technology
Abstract/Summary:
Text clustering is a technology that uses computers to automatically identify clusters in a text collection. With the arrival of the big data era, the internet has filled with content generated by a wide variety of applications. From the point of view of data management, information from different industries and fields must be analyzed and extracted quickly, efficiently, and effectively. From the point of view of information security, the sensitive information of individuals and countries must be protected, and malicious and false content must be analyzed. To protect victims, the source of harm has to be found promptly and accurately. Among the many analysis and mining techniques, text clustering is considered fast and accurate at finding usable information and behavior patterns for specific purposes. As a machine learning method, cluster analysis is also an important task in data mining and natural language processing, with important applications in search engines, user segmentation, pattern recognition, and other areas.

To improve the accuracy of text clustering, this paper proposes three methods. First, it proposes a new algorithm based on density and minimum distance to initialize the cluster centers of K-means. The algorithm uses the overall distribution of the data set to calculate the density of each data point, then computes for each point the minimum distance to any point of higher density, and uses these two parameters to select the initial centers for the K-means algorithm. Second, the paper presents a text clustering algorithm (PLSA-KNN) that extracts semantics with the probabilistic latent semantic analysis (PLSA) model. In this model, a document is represented by a three-layer structure that links documents to topics and topics to words. The algorithm first estimates the probability distributions of the three layers. To extract the semantic information of the text, it uses the low-dimensional document-topic and topic-word distributions to represent the high-dimensional word frequency information, and then clusters the texts with the K-nearest neighbor (KNN) algorithm. Finally, building on the second method, the paper presents a new algorithm that places probabilistic latent semantic analysis in a Bayesian framework by using LDA (Latent Dirichlet Allocation) for document modeling. The algorithm analyzes the topics of each document in depth, represents the topic distribution of a document as a multinomial distribution, and extracts the most likely topics from the result. The extracted topics and words are then clustered with KNN, which yields a semantic clustering.
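
The first method (density and minimum-distance initialization for K-means) is described only in outline, so the following is a minimal Python sketch of one way it could look, not the thesis's actual implementation. Several details are assumptions that do not come from the abstract: the Gaussian-kernel density, the cutoff distance taken as a percentile of the pairwise distances (`cutoff_percentile`), the ranking score `density * delta`, and the function names `density_distance_init` and `kmeans`.

```python
import numpy as np


def density_distance_init(X, k, cutoff_percentile=2.0):
    """Pick k initial centers from density and distance to denser points.

    Assumptions (not from the abstract): Gaussian-kernel density, a
    percentile-based cutoff distance, and the score density * delta.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances over the whole data set.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    # Assumed cutoff distance: a low percentile of all pairwise distances.
    d_c = np.percentile(dist[np.triu_indices(n, k=1)], cutoff_percentile)
    d_c = max(d_c, 1e-12)

    # Assumed density: Gaussian kernel sum, with the self term removed.
    density = np.exp(-((dist / d_c) ** 2)).sum(axis=1) - 1.0

    # delta[i]: minimum distance from point i to any point ranked denser.
    # Ranking by sorted order keeps ties between equal densities well defined.
    order = np.argsort(-density, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()  # densest point: largest distance
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Points that are both dense and far from denser points score highest.
    score = density * delta
    idx = np.argsort(-score, kind="stable")[:k]
    return X[idx], idx


def kmeans(X, centers, n_iter=20):
    """Plain Lloyd iterations starting from the given centers."""
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # labels come from the last assignment step.
    return centers, labels
```

In this sketch, a point is a good seed when it lies in a dense region and is far from any denser point, so the seeds tend to fall in different clusters rather than all in the largest one. The output of `density_distance_init(X, k)` can be passed directly to `kmeans(X, centers)`.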
Keywords/Search Tags: Natural language processing, Text clustering, Semantic similarity, Clustering algorithm, Probabilistic topic model