
Research on Chinese Short Text Clustering Algorithms

Posted on: 2017-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Cheng
GTID: 2308330482495694
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, China has entered the era of digital and information technology, and in all walks of life the transmission and exchange of information depend on advanced text mining technology. Text clustering is an important branch of text mining and a form of unsupervised machine learning: it automatically detects the degree of association between texts and groups the most similar texts into the same cluster. In recent years, short texts have become widespread in instant messaging, online chat logs, bulletin board system titles, blog comments, online news comments, short message service (SMS), microblogging, and other fields. Such short texts are highly condensed and sparse, span a wide range of domains, and exist in huge quantities. Traditional text clustering methods often fail to achieve satisfactory results on them, so clustering short texts efficiently and reliably is a major challenge in the field of text clustering. Many clustering algorithms have already been applied to short text clustering. Among them, the CHIR-TCFS algorithm improves the chi-square test and resolves the problem of applying a supervised feature selection algorithm to short text clustering. To address the CHIR algorithm's weakness with low-frequency features, this thesis proposes CHIRF, an improved CHIR algorithm that takes the number of occurrences of each feature into account. To address the random selection of initial cluster centers in the TCFS algorithm, it proposes ICCP, an initial cluster center selection algorithm based on points. Combining CHIRF and ICCP, it proposes the short text clustering algorithm CHIRF-NTCFS.
Comparative experiments were carried out, and the results show that the clustering effect of this algorithm is better than that of the K-means and CHIR-TCFS algorithms.

The main work of this thesis is as follows:
1) Describes the research background and significance of short text clustering, its main characteristics and difficulties, and the state of research at home and abroad; briefly introduces the text preprocessing methods used in short text clustering, including Chinese word segmentation, stop-word removal, and feature selection.
2) Introduces several traditional short text clustering algorithms, including the K-means, K-medoids, BIRCH, and EM algorithms, and evaluates their advantages and shortcomings.
3) Introduces in detail CHIR, an improved feature selection algorithm based on the chi-square test, and CHIR-TCFS, a short text clustering algorithm based on CHIR. The CHIR algorithm solves the problem that the chi-square test cannot distinguish positive from negative relationships between a feature and a class. The CHIR-TCFS algorithm solves the problem of applying the feature selection algorithm in short text clustering.
4) To address the CHIR algorithm's weakness with low-frequency words, proposes CHIRF, an improved CHIR algorithm that takes the number of occurrences of each feature into account, thereby optimizing feature selection in short text clustering. To address the random selection of initial cluster centers in the TCFS algorithm, proposes ICCP, an initial cluster center selection algorithm based on points.
Combining the CHIRF and ICCP algorithms, this thesis proposes the short text clustering algorithm CHIRF-NTCFS, which solves the problem of applying the CHIRF algorithm to short text clustering.
5) Implements the K-means, CHIR-TCFS, and CHIRF-NTCFS algorithms in the MATLAB programming environment and obtains the optimal parameters of the CHIRF algorithm through parameter-value experiments. With the optimal parameters substituted into the CHIRF-NTCFS algorithm, two experiments were designed and completed according to the size of the text set and the number of clusters; the results show that the CHIRF-NTCFS algorithm achieves a better clustering effect than the other two algorithms.
Keywords/Search Tags: Short text, Chinese Text clustering, CHIR-TCFS, CHIRF, ICCP, CHIRF-NTCFS
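
The abstract notes that the plain chi-square test cannot tell whether a feature is positively or negatively related to a class. The Python sketch below illustrates that limitation with the standard 2x2 chi-square statistic for one (term, class) pair, alongside the ratio of observed to expected co-occurrence as one possible indicator of direction. It is not the thesis's CHIR or CHIRF algorithm, whose details are not given here. The function name, argument names, and the handling of degenerate tables are assumptions made for illustration.

```python
def chi_square_with_direction(a, b, c, d):
    """Chi-square statistic and dependency ratio for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Assumption: a degenerate table carries no evidence either way.
        return 0.0, 1.0
    # Standard chi-square for a 2x2 contingency table; the squared term
    # discards the sign of (a*d - b*c), so direction is lost.
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Expected count of "term present and in class" under independence.
    expected = (a + b) * (a + c) / n
    # Ratio > 1: term is over-represented in the class (positive relation);
    # ratio < 1: term is under-represented (negative relation).
    ratio = a / expected
    return chi2, ratio


# Two hypothetical terms over 100 documents, 50 of them in the class.
positive_term = chi_square_with_direction(30, 10, 20, 40)
negative_term = chi_square_with_direction(10, 30, 40, 20)
```

Both hypothetical terms receive the same chi-square score, yet the first is over-represented in the class (ratio 1.5) and the second under-represented (ratio 0.5), which is the ambiguity the abstract attributes to the unmodified chi-square test.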