Research On Short Text Clustering Algorithm Based On Machine Learning

Posted on: 2020-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: G K Zhang
Full Text: PDF
GTID: 2428330596478734
Subject: Information security
Abstract/Summary:
Finding useful information quickly and accurately in massive volumes of network data has become an important problem. As an information processing method, text clustering is one of the main approaches to mining text data. Traditional clustering algorithms have two weaknesses when applied to short texts. First, they suffer from insufficient feature information, high feature dimensionality, and the loss of small-class information. Second, they largely ignore the extrinsic features of short texts, which lowers clustering accuracy on short web texts. To address these problems, this thesis proposes two short text clustering algorithms: (1) FIPC, a frequent-itemset collaborative pruning iteration clustering framework, aimed at the "long tail phenomenon"; and (2) HINLP, a short text clustering algorithm based on a binary heterogeneous network and label propagation, which takes the extrinsic features of short texts into account. The main work is as follows:

(1) The research status of traditional algorithms in the field of short text clustering in recent years is analyzed and summarized. Techniques related to short text clustering are studied from three aspects: short text feature expansion algorithms, short text feature selection algorithms, and short text clustering algorithms.

(2) For the "long tail phenomenon" in short text data, the vector space model is used to represent the short texts, and TF-IDF is used to reduce the dimensionality of the feature space. A collaborative pruning clustering framework is built on a collaborative pruning strategy. Combining this framework with the K-medoids algorithm yields the optimized algorithm FIPC, which mines the small-class short text information in the "long tail". This addresses the high feature dimensionality and the loss of small-class information that traditional clustering algorithms exhibit on "long tail" short texts, and improves the credibility of the clustering results. In addition, a filter-threshold reduction mechanism based on frequent words is used to avoid overlapping clusters.

(3) Traditional short text clustering algorithms do not combine the content of short texts with their network information, which biases the clustering results. This thesis therefore proposes HINLP, a short text clustering algorithm based on a binary heterogeneous network and label propagation. In the data preprocessing phase, the algorithm mines the associations between short texts that arise from external features (such as author features and forwarding features) to make the short text representation more accurate. In the heterogeneous network construction phase, weighted meta-paths represent the similarity relationships between short texts. In the text clustering phase, HINLP uses the label propagation algorithm to identify communities, and the texts in each community form a cluster. Because the network is directed and weighted, the random propagation of labels seen in traditional label propagation is avoided.

(4) The thesis comprehensively reviews current short text clustering algorithms and network community discovery algorithms. Compared with several classical short text clustering algorithms, FIPC effectively addresses the high feature dimensionality and the loss of small-class information in "long tail" short text clustering. Comparative experiments show that HINLP achieves good clustering accuracy.
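The clustering phase described in (3) can be illustrated with a minimal sketch of weighted label propagation. It is not the thesis's HINLP implementation. The function name `weighted_label_propagation`, the toy edge weights, and the choice of an undirected graph are assumptions made for illustration (the thesis uses a directed weighted network built from meta-paths). To mirror the idea of avoiding random label propagation, ties are broken deterministically by taking the smallest label.

```python
from collections import defaultdict


def weighted_label_propagation(edges, max_iters=100):
    """Group nodes by weighted label propagation.

    edges: dict mapping (u, v) -> similarity weight. Edges are treated as
    undirected here, which is a simplifying assumption of this sketch.
    Returns a dict mapping each node to its final label.
    """
    # Build a symmetric adjacency map: node -> {neighbor: weight}.
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = w
        neighbors[v][u] = w

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_iters):
        changed = False
        # A fixed visiting order keeps the result reproducible.
        for node in sorted(neighbors):
            # Sum the edge weights carried by each neighboring label.
            scores = defaultdict(float)
            for nbr, w in neighbors[node].items():
                scores[labels[nbr]] += w
            best = max(scores.values())
            # Deterministic tie-break: smallest label among the top scorers.
            new_label = min(l for l, s in scores.items() if s == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy graph: two tight groups of short texts joined by one weak link.
edges = {
    ("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.9,
    ("d", "e"): 0.9, ("d", "f"): 0.8, ("e", "f"): 0.9,
    ("c", "d"): 0.1,
}
labels = weighted_label_propagation(edges)
```

In this toy graph the weak link between `c` and `d` carries too little weight to pull either group into the other, so the two groups end up with different labels. In a real pipeline the weights would come from the meta-path similarity computed during network construction.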
Keywords/Search Tags:short text clustering, heterogeneous network, label propagation, long tail phenomenon, cooperative pruning