
Research And Application Of Web Chinese Text Clustering Algorithm Based On Minimum Spanning Tree

Posted on: 2022-02-20
Degree: Master
Type: Thesis
Country: China
Candidate: Z Xing
GTID: 2518306722988719
Subject: Computer technology
Abstract/Summary:
As the number of Chinese Internet users and search engine users grows and Internet penetration rises, the Internet holds a massive amount of data. Finding valuable information within this data, and managing and distributing it in a unified way, is a core problem in data mining and automated information processing. However, existing text clustering algorithms still have many shortcomings, which calls for improving them and for proposing new clustering methods and theory. Traditional clustering algorithms depend heavily on initial or prior parameters, have difficulty processing samples with somewhat larger data volumes or feature dimensions, and suffer from high computational complexity that lowers their efficiency. Given the ever-growing scale of Internet data and the pressing practical need for better Web Chinese text clustering tools, this thesis innovates and improves on both the clustering algorithm and the Web Chinese text clustering tool. The main contributions and innovations are as follows:

Firstly, an Improved Minimum Spanning Tree Clustering (IMSTC) algorithm is proposed. Its main ideas are as follows. First, the sample data set to be processed is reconstructed using the main ideas of principal component analysis. Next, the reconstructed sample data set is abstracted into a weighted complete graph (WCG), which is then transformed into a connected minimum spanning tree. The one-dimensional weight space of the minimum spanning tree's edge set is then clustered to determine the pruning parameter, and the minimum spanning tree is cut according to this parameter. Finally, noise and outliers in the preliminary clustering result are filtered out to obtain the final set of clusters. The algorithm maintains a low dependence on initial or prior parameters while handling input data sets with somewhat larger data volumes or feature dimensions, which makes it applicable to a wider range of input samples. It also reduces computational complexity and improves efficiency.

Secondly, a Web Chinese Text Clustering Framework (WCTCF) is developed by combining the IMSTC algorithm with JIEBA word segmentation and a TF-IDF term-frequency vector text model. WCTCF processes real Web Chinese text and completes standardized clustering of that text.

Finally, the clustering quality and performance of the IMSTC algorithm are tested and verified on two-dimensional random data sets and on three data sets with different characteristics selected from the classical UCI data sets. WCTCF is evaluated on real Web Chinese text data sets to verify the practicality of the framework and the improvement in clustering quality brought by the IMSTC algorithm.
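The IMSTC steps described above (complete graph, minimum spanning tree, one-dimensional clustering of edge weights, pruning, noise filtering) can be illustrated with a minimal sketch. This is not the thesis's implementation. The function names, the `min_cluster_size` parameter, and the use of a simple two-means split to pick the pruning threshold are all assumptions made for illustration; the abstract does not say which one-dimensional clustering method is used. The PCA-style reconstruction step is omitted, and the sketch expects at least two input points.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete graph with Euclidean edge weights."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cut_threshold(weights, iters=50):
    """Assumed stand-in for the 1-D clustering step: two-means over edge
    weights, returning the midpoint between the two centroids."""
    lo, hi = min(weights), max(weights)
    for _ in range(iters):
        mid = (lo + hi) / 2
        short = [w for w in weights if w <= mid]
        long_ = [w for w in weights if w > mid]
        if not long_:
            break
        new_lo, new_hi = sum(short) / len(short), sum(long_) / len(long_)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2

def imstc_sketch(points, min_cluster_size=2):
    """Cut MST edges longer than the threshold; small components become noise."""
    edges = mst_edges(points)
    threshold = cut_threshold([w for w, _, _ in edges])
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]   # path halving
            i = root[i]
        return i

    for w, u, v in edges:
        if w <= threshold:            # keep short edges, prune long ones
            root[find(u)] = find(v)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    clusters = [g for g in groups.values() if len(g) >= min_cluster_size]
    noise = [i for g in groups.values() if len(g) < min_cluster_size for i in g]
    return clusters, noise

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 10)]
clusters, noise = imstc_sketch(points)
```

On this toy input the two tight squares come out as separate clusters and the isolated point is reported as noise. A two-means split is sensitive to very long edges, so a single extreme outlier can raise the threshold enough to merge clusters that should be separate; a real implementation would need a more robust choice.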
Keywords/Search Tags:Data mining, Text clustering, K-means++ algorithm, Minimum spanning tree, Data dimension reduction