Research And Application Of Web Chinese Text Clustering Algorithm Based On Minimum Spanning Tree

Posted on:2022-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:Z Xing

Full Text:PDF

GTID:2518306722988719

Subject:Computer technology

Abstract/Summary:

As the number of Chinese Internet users and search engine users grows and Internet penetration rises, a massive amount of data has accumulated online. Finding valuable information within this huge volume of data, and managing and distributing it in a unified way, is a core problem for data mining and automated information processing. However, existing text clustering algorithms still have many shortcomings, which calls for improving them and for proposing new clustering methods and theory. Traditional clustering algorithms depend heavily on initial or prior parameters, have difficulty with samples whose data volume or feature dimension is even slightly large, and suffer from high computational complexity that leads to low efficiency. At the same time, as Internet data keeps growing in scale, better tools for clustering Web Chinese text are badly needed. This thesis therefore improves both the clustering algorithm and the Web Chinese text clustering tooling.

The main contributions and innovations of this thesis are as follows. First, an Improved Minimum Spanning Tree Clustering (IMSTC) algorithm is proposed. Its main ideas are as follows. The sample data set to be processed is first reconstructed using the main ideas of principal component analysis. The reconstructed data set is then abstracted into a weighted complete graph (WCG), and the WCG is transformed into a fully connected minimum spanning tree. Next, the one-dimensional weight space of the minimum spanning tree's edge set is clustered to determine the pruning parameter, and the minimum spanning tree is cut and pruned according to this parameter. Finally, noise and outliers in the preliminary clustering results are filtered out to obtain the final set of clusters. The algorithm keeps its dependence on initial or prior parameters low and can also handle input data sets with a slightly larger data volume or feature dimension, which broadens the range of input samples it applies to. Computational complexity is reduced and the efficiency of the algorithm is improved.

Second, by combining the IMSTC algorithm with JIEBA word segmentation and a TF-IDF word-frequency vector text model, a Web Chinese Text Clustering Framework (WCTCF) is developed. WCTCF processes real Web Chinese text and can complete the standardized clustering of that text.

Finally, the clustering quality and performance of the IMSTC algorithm are tested and verified on two-dimensional random data sets and on three data sets with different characteristics selected from the classical UCI data sets. WCTCF is evaluated on real Web Chinese text data sets to verify the practicality of the framework and the improvement in clustering quality that the IMSTC algorithm brings.
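The graph-based steps of IMSTC described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it skips the PCA-style reconstruction step and starts from the points directly, it assumes Euclidean distance as the edge weight, and it assumes a simple one-dimensional two-means split of the edge weights to pick the pruning threshold (the abstract does not say which clustering method is used on the weights). The function names and the `min_size` noise cutoff are hypothetical.

```python
import math
from collections import defaultdict


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known distance from the tree to each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def pruning_threshold(weights, iters=100):
    """Assumed rule: 1-D two-means on the edge weights, cut at the centroid midpoint."""
    lo, hi = min(weights), max(weights)
    if lo == hi:
        return math.inf     # all edges are alike, so nothing is cut
    for _ in range(iters):
        mid = (lo + hi) / 2
        short = [w for w in weights if w <= mid]
        long_ = [w for w in weights if w > mid]
        new_lo, new_hi = sum(short) / len(short), sum(long_) / len(long_)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def imstc_sketch(points, min_size=2):
    """Build the MST, cut long edges, and label components; -1 marks noise."""
    labels = [-1] * len(points)
    edges = mst_edges(points)
    if not edges:
        return labels
    cut = pruning_threshold([w for w, _, _ in edges])

    # Union-find over the edges that survive pruning.
    root = list(range(len(points)))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for w, i, j in edges:
        if w <= cut:
            root[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)

    # Components smaller than min_size are treated as noise or outliers.
    next_label = 0
    for members in groups.values():
        if len(members) >= min_size:
            for i in members:
                labels[i] = next_label
            next_label += 1
    return labels
```

Calling `imstc_sketch` on a list of coordinate tuples returns one label per point. The quadratic Prim loop is chosen for brevity; it is adequate for small examples but is not meant to reflect the efficiency claims made in the abstract.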

Keywords/Search Tags:

Data mining, Text clustering, K-means++ algorithm, Minimum spanning tree, Data dimension reduction

Related items

1	KK-means Clustering Method Improved Based-on Minimum Cost Spanning Tree And Its Applications In Seismic Data
2	Research On Clustering Algorithm Based On Minimum Spanning Tree
3	Research On Non-parameterized Clustering Algorithm And Its Application In Text Clustering
4	Research Of Clustering Algorithms Based On Minimum Spanning Tree
5	Research On Clustering Algorithm Based On Minimum Spanning Tree
6	The Research On Dynamic And Abstract Clustering Method Of High Dimensional Sparse Data
7	Dimension Reduction Technology Research Based On Text Features
8	Research And Application Of Clustering Analysis In Intrusion Detection
9	A minimum spanning tree based clustering algorithm for high throughput biological data
10	Clustering Analysis Of Big Data Based On The Minimum Spanning Tree Of Network Optimization