Research On Text Clustering Algorithm Based On Word Frequency And Semantic

Posted on:2018-10-19

Degree:Master

Type:Thesis

Country:China

Candidate:H L Qi

Full Text:PDF

GTID:2348330536479665

Subject:Computer application technology

Abstract/Summary:

Text clustering is a technique for automatically identifying clusters in a collection of texts. With the arrival of the big data era, the internet has filled with content generated by a wide variety of applications. From the point of view of data management, information from different industries and fields must be analyzed and extracted quickly and efficiently. From the point of view of information security, the sensitive information of individuals and countries must be protected and malicious or false content analyzed; to protect victims, the source of harm has to be found promptly and accurately. Among the many analysis and mining techniques, text clustering is regarded as a fast and accurate way to find usable information and behavior patterns for specific purposes. As a machine learning method, cluster analysis is also an important task in data mining and natural language processing, with applications in search engines, user segmentation, pattern recognition and other areas.

To improve the accuracy of text clustering, this paper proposes three methods. First, it proposes a new algorithm, based on density and minimum distance, for initializing the cluster centers of K-means. The algorithm uses the overall distribution of the data set to compute the density of each data point, then computes the minimum distance from each point to any point of higher density, and uses these two quantities to choose the initial centers of the K-means algorithm.

Second, it presents a text clustering algorithm (PLSA-KNN) that extracts semantics with the probabilistic latent semantic analysis (PLSA) model. In this model, a document is represented by a three-layer structure of document-topic and topic-word distributions. The algorithm first estimates the probability distributions of the three layers. To extract the semantic information of the text, it represents the high-dimensional word-frequency information with the low-dimensional document-topic and topic-word distributions, and then applies the K-nearest neighbor (KNN) algorithm to cluster the texts.

Finally, building on the second method, it presents a new algorithm that places probabilistic latent semantic analysis in a Bayesian framework by modeling documents with LDA (Latent Dirichlet Allocation). The algorithm analyzes the topics of each document in depth, represents the topic distribution of a document as a multinomial distribution, and extracts the most likely topics from the results. The extracted topics and words are then clustered with KNN, which realizes semantic clustering.
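The density and minimum-distance initialization described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function name `density_distance_init`, the `cutoff` radius used to count neighbors, the index-based tie-breaking, and the choice to rank points by the product of density and distance are all assumptions, since the abstract does not say how the two quantities are combined.

```python
import math

def density_distance_init(points, k, cutoff):
    """Pick k initial K-means centers from density and minimum distance.

    Hypothetical sketch: `cutoff` (neighborhood radius) and the ranking
    score density * distance are assumptions, not taken from the thesis.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Density of a point: number of other points closer than `cutoff`.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]

    # Minimum distance from each point to any point of higher density.
    # Equal densities are ordered by index so the comparison is strict.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n)
                  if rho[j] > rho[i] or (rho[j] == rho[i] and j < i)]
        # The densest point has no denser neighbor: use its largest distance.
        delta.append(min(higher) if higher else max(dist[i]))

    # Points that are both dense and far from denser points rank first.
    ranked = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    return [points[i] for i in ranked[:k]]

# Two well-separated groups of five points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5)]
centers = density_distance_init(points, k=2, cutoff=1.0)
```

In this toy data set the middle point of each group has the most neighbors within the radius, so the two selected centers fall in different groups. The returned centers would then seed an ordinary K-means run.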
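The document-topic and topic-word layers of PLSA mentioned in the abstract are usually estimated with expectation-maximization. The sketch below shows the standard EM updates on a small word-count matrix; it is an assumption-laden illustration rather than the thesis's code. The function name `plsa`, the random initialization, the iteration count, and the toy `counts` matrix are all hypothetical, and the final step simply labels each document with its dominant topic instead of reproducing the KNN stage of PLSA-KNN.

```python
import random

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Estimate P(topic | doc) and P(word | topic) with standard EM.

    counts[d][w] is the frequency of word w in document d.
    Hypothetical sketch; parameter names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row) or 1.0  # guard against an all-zero row
        return [x / total for x in row]

    # Random positive initialization of both distributions.
    doc_topic = [normalize([rng.random() + 1e-3 for _ in range(n_topics)])
                 for _ in range(n_docs)]
    topic_word = [normalize([rng.random() + 1e-3 for _ in range(n_words)])
                  for _ in range(n_topics)]

    for _ in range(n_iter):
        new_doc_topic = [[0.0] * n_topics for _ in range(n_docs)]
        new_topic_word = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = [doc_topic[d][z] * topic_word[z][w]
                        for z in range(n_topics)]
                norm = sum(post)
                if norm == 0.0:
                    continue
                for z in range(n_topics):
                    r = counts[d][w] * post[z] / norm
                    new_doc_topic[d][z] += r
                    new_topic_word[z][w] += r
        # M-step: renormalize the expected counts.
        doc_topic = [normalize(row) for row in new_doc_topic]
        topic_word = [normalize(row) for row in new_topic_word]
    return doc_topic, topic_word

# Toy corpus: documents 0-1 use only words 0-1, documents 2-3 only words 2-3.
counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 2], [0, 0, 3, 4]]
doc_topic, topic_word = plsa(counts, n_topics=2)

# Assumption: label each document with its most probable topic.
labels = [max(range(2), key=lambda z: row[z]) for row in doc_topic]
```

Each row of `doc_topic` is the low-dimensional representation of one document; a clustering step such as the KNN stage named in the abstract would operate on these rows rather than on the raw word frequencies.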

Keywords/Search Tags:

Natural language processing, Text clustering, Semantic similarity, Clustering algorithm, Probabilistic topic model

Related items

1	The Research On Chinese Sentential Semantic Model Parsing And Text Representation
2	Research On Topic Clustering Algorithm Based On Topic Models
3	Text Semantic Similarity Algorithm Based On Transformer
4	Research On Text Clustering Based On Semantic Similarity
5	Research And Application Of Short Text Semantic Similarity Model Based On Deep Learning
6	Research And Implementation Of The Internet Hot Topic Clustering
7	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
8	Research And Application Of Short Text Similarity Algorithm Based On Semantic Dependency Tree
9	Research On The Construction Method Of Technology Domain Thematic Library Based On Multilevel Topic Vector
10	Research And Application Of Topic Model For Short Texts Based On Part-of-Speech Feature And Semantic Enhancement