
Clustering And Locality Sensitive Hashing Algorithms On Text Stream Data Under Classification-oriented Measure

Posted on: 2022-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Q Zhou
Full Text: PDF
GTID: 2518306557969269
Subject: Data mining
Abstract/Summary:
A data stream consists of a sequence of continuous data blocks. Text retrieval in the data stream scenario is an important basic application in the field of data mining, and it plays an irreplaceable role in various analysis tasks such as machine learning. Traditional retrieval methods cannot fully consider the semantic relationships between samples, so they perform poorly with regard to utility and efficiency. This thesis uses clustering and locality sensitive hashing to retrieve the category labels of text. Based on a review of text retrieval, text clustering and locality sensitive hashing, existing clustering algorithms and approximate nearest neighbor retrieval techniques are studied and analyzed in depth, and a corresponding research scheme is designed. The details are as follows.

Clustering is an effective way to retrieve text data without labels. In existing research, however, it often suffers from poor clustering utility caused by the choice of similarity measure, and many scholars have proposed improved algorithms to alleviate this. Whether Pearson similarity or TF-IDF similarity is chosen, neither guarantees effective clustering: a measure may work well on one dataset and poorly on others. In the data stream scenario, the server needs to cluster the data blocks in the initial stage. To meet the requirement of real-time response and achieve high clustering efficiency, this thesis proposes an error-driven multi-similarity fuzzy C-means algorithm, called PCM. PCM fuses Pearson similarity, TF-IDF similarity and Jaccard similarity, and uses particle swarm optimization (PSO) to solve for the weight of each similarity adaptively. Finally, because the data blocks are unevenly distributed, directly using an existing hard clustering algorithm such as k-means will result in most samples being assigned to the clusters that already hold more samples. We therefore adopt an improved T-S model to improve fuzzy C-means (FCM). Experimental results on one collected dataset and two real-world datasets confirm that PCM outperforms traditional clustering algorithms. At the same time, owing to the adoption of PSO, PCM reduces time cost.

For each new data block, this thesis matches every record in the block using locality sensitive hashing to balance accuracy and efficiency. The way each feature in the data stream is constructed determines both the accuracy and the complexity of retrieval. Traditional feature construction methods yield poorly discriminative features and take a long time to build, so they cannot meet the real-time needs of the data stream scenario considered in this thesis. Existing research shows that hash-based methods are better with regard to utility and efficiency. To construct high-precision features at low complexity, this thesis proposes a data-driven multi-layer supervised kernel locality sensitive hashing method, called SKH. The original hashing method produces random hash features; its effect is limited by the random function and tends to oscillate. SKH draws hierarchical training information from carefully constructed supervision information, uses a data-driven scheme to learn the hash codes, and introduces a kernel function to enhance the separability of the data, which further improves retrieval efficiency. Experimental results on one collected dataset and two real-world datasets confirm that SKH outperforms traditional retrieval algorithms. At the same time, owing to the adoption of the kernel function, SKH reduces time cost.
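The similarity fusion described for PCM can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the formulas, so the sketch assumes that "TF-IDF similarity" means cosine similarity between TF-IDF vectors, that Pearson similarity is computed over the same vectors, and that Jaccard similarity is taken over token sets. The function names, the whitespace tokenizer and the smoothed IDF are hypothetical choices, and the weights are fixed by hand here, whereas the thesis solves for them with particle swarm optimization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple stand-in)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed TF-IDF similarity)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pearson(u, v, vocab):
    """Pearson correlation of two sparse vectors expanded over the vocabulary."""
    if not vocab:
        return 0.0
    x = [u.get(t, 0.0) for t in vocab]
    y = [v.get(t, 0.0) for t in vocab]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def jaccard(text_a, text_b):
    """Jaccard similarity between the token sets of two documents."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def fused_similarity_matrix(docs, weights):
    """Weighted sum of Pearson, TF-IDF cosine and Jaccard for every document pair.

    `weights` is (w_pearson, w_tfidf, w_jaccard). It is hand-set in this sketch;
    the thesis learns the weights with PSO instead.
    """
    w_pearson, w_tfidf, w_jaccard = weights
    vectors = tfidf_vectors(docs)
    vocab = sorted({t for vec in vectors for t in vec})
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = (
                w_pearson * pearson(vectors[i], vectors[j], vocab)
                + w_tfidf * cosine(vectors[i], vectors[j])
                + w_jaccard * jaccard(docs[i], docs[j])
            )
    return matrix
```

Calling `fused_similarity_matrix(docs, (0.5, 0.3, 0.2))` on a list of strings returns an n-by-n matrix of fused scores. Such a matrix could feed a fuzzy C-means style membership update, and an optimizer such as PSO could search over the three weights against a clustering error measure; both of those steps are outside this sketch.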
Keywords/Search Tags:Data Stream, Text Retrieval, Clustering, Particle Swarm Optimization, Similarity, Locality Sensitive Hashing