
Research And Implementation Of Short Text Clustering In Social Network

Posted on: 2017-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: X N Zhao
GTID: 2428330566953133
Subject: Information and Communication Engineering
Abstract/Summary:
With the advent of the big data era, the Internet has penetrated people's lives ever more rapidly and conveniently and plays an increasingly important role. Many social network platforms have emerged and become an important way for people to follow current events, obtain information, and express their views. Information in social networks is varied but unstructured, and this disorder creates many difficulties for text research tasks such as hot topic discovery, public opinion monitoring, and information retrieval. For organizations and individuals alike, it is hard to quickly locate the needed content in a large volume of information. Studying the clustering of social network information is therefore of great significance for mining its commercial and military value. Unlike other text, most social network information takes the form of short text, for which traditional long-text clustering methods are not suitable. How to cluster short texts quickly and effectively is thus a new challenge for text clustering. In this thesis, we survey the literature and techniques related to text clustering. Based on an analysis of the characteristics of social network short text, we propose solutions to the technical difficulties of short text clustering. The main research work is as follows:

(1) To address the effect on Chinese word segmentation of the large number of catchwords created in social networks, a new word detection method based on a generalized suffix tree is proposed. Building on word segmentation and part-of-speech tagging with the NLPIR system, we define new-word combination rules to extract strings from short texts and then construct a generalized suffix tree of these strings. This reduces, to a certain extent, the time and space complexity of constructing the tree. We then propose detecting new words from string length and internal mutual information features, which corrects both the bias of considering only the external features or only the internal correlation of strings and the errors in Chinese word segmentation. This work lays the foundation for short text clustering.

(2) To address the sparse features of short text and the sensitivity of the K-means clustering algorithm to the initial value of K and to the initial cluster centers, an improved feature extraction method and a K-means clustering algorithm based on word co-occurrence are proposed for short text clustering. For short text representation, we combine the rich meaning of new words with part-of-speech features to extract feature words. We then construct a word co-occurrence graph from the frequent words of the short text set, use the number of clusters in the graph as the K value of the K-means algorithm, and extract the key words of each cluster as the initial cluster centers according to the importance of each node. Clustering short texts with the improved feature representation model and these initial values addresses the problem of feature word extraction and corrects the bias in clustering results caused by random selection of the initial cluster values. After clustering, the words that contribute most to the texts of each cluster are extracted as cluster labels to identify the topic words of each category, which makes up for the lack of a cluster theme and presents the clustering result more intuitively.

(3) Combining the proposed new word detection method with the K-means clustering method based on word co-occurrence, we design and implement a prototype short text clustering system for social networks. The system integrates reading the original data, Chinese word segmentation, new word detection, feature word extraction, initial cluster center extraction, and short text clustering. Using this system, we cluster real social network text and verify the correctness and validity of the proposed methods.
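
The idea in item (2), deriving K and the initial cluster centers from a word co-occurrence graph, can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the texts are already segmented into word lists, treats connected components of the graph as its "clusters", and scores node importance by weighted degree (the sum of a word's co-occurrence counts). The function names and the thresholds `min_word_freq`, `min_pair_freq`, and `top_n` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(docs, min_word_freq=2, min_pair_freq=2):
    """Build a weighted, undirected graph over frequent words.

    An edge links two words that appear together in at least
    `min_pair_freq` documents; its weight is that document count.
    """
    word_freq = Counter(w for doc in docs for w in set(doc))
    frequent = {w for w, c in word_freq.items() if c >= min_word_freq}
    pair_freq = Counter()
    for doc in docs:
        pair_freq.update(combinations(sorted(set(doc) & frequent), 2))
    graph = {w: {} for w in frequent}
    for (a, b), count in pair_freq.items():
        if count >= min_pair_freq:
            graph[a][b] = count
            graph[b][a] = count
    return graph


def connected_components(graph):
    """Return the connected components of the graph as sorted word lists."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def initial_centers(docs, top_n=2, min_word_freq=2, min_pair_freq=2):
    """Estimate K and pick the top-ranked words of each component as seeds.

    Assumption: components with a single word are ignored, and importance
    is the weighted degree of a node (ties broken alphabetically).
    """
    graph = cooccurrence_graph(docs, min_word_freq, min_pair_freq)
    components = [c for c in connected_components(graph) if len(c) > 1]
    centers = []
    for component in components:
        ranked = sorted(component, key=lambda w: (-sum(graph[w].values()), w))
        centers.append(ranked[:top_n])
    return len(centers), centers
```

Calling `initial_centers(docs)` on a list of segmented short texts returns the estimated K together with one list of seed words per component. Those seed words could then be turned into vectors in the chosen feature space and passed to a K-means implementation as its initial centers instead of random ones.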
Keywords/Search Tags: Social Network, Short Text, New Word Detection, Clustering