
Research On Social Network Data Mining Based On Natural Language Processing

Posted on: 2018-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: P H Zhang
Full Text: PDF
GTID: 2348330515957624
Subject: Engineering
Abstract/Summary:
Micro-blog is a very popular social platform on which users share and exchange information in real time through short texts or multimedia content. Each post is short, but the data accumulated over a long period contains a wealth of personalized features and other user information, so the platform's user data carries rich social information value. Mining the data of micro-blog users is therefore of great significance for the development of social networks and the analysis of social information. The main function of social network data mining is to obtain users' personalized features and other information by analyzing and mining the mass of short micro-blog texts. The first task is to collect a large amount of micro-blog data from the network and store it in a specific format. Word segmentation and feature representation are then performed. Finally, data mining methods are used to identify users and analyze user types.

In this paper, we design a user data crawling system based on simulated login using web crawler technology, which provides a method for obtaining a large amount of micro-blog data from the network. According to the structural characteristics of the user data, the data is stored in JSON format in a NoSQL database.

To address the difficulty that current word segmentation methods have in discovering new words, a new Chinese word segmentation method based on lexical matching and statistical annotation is proposed. The method builds on dictionary matching, incorporates the CRF annotation algorithm, and trains iteratively during segmentation to achieve a self-learning ability. The matching method and the labeling method are combined, and the final segmentation result is selected according to Chinese semantic rules, which effectively improves both segmentation accuracy and the discovery of unrecorded (out-of-vocabulary) words. Experimental results on the test corpus show that the proposed method improves the F-value by 9.6% over the matching method and by 2.9% over the CRF algorithm.

One-hot representation is a commonly used feature representation in micro-blog data mining; its shortcoming is that it cannot express contextual semantics. In this paper, the user feature representation is based on word2vec, which adds context information to the representation and reduces the dimension of the user information vector, improving the efficiency of the subsequent data mining algorithms.

Analysis of the micro-blog user data shows that some users introduce noise into the data mining. A spam (garbage) user identification model based on SVM is therefore designed; its F-value on the spam user identification test set is 0.94. Then, according to the content that micro-blog users follow, the K-means clustering algorithm is used to divide users into communities. Because the partitioning of user communities is uncertain, the optimal clustering centers are determined with the DB-index algorithm, which improves the intra-class similarity and inter-class separation of the clustering results.
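
As a rough illustration of the dictionary-matching baseline and the F-value used to evaluate it, the sketch below segments a sentence by forward maximum matching and scores the result against a reference segmentation. The abstract does not say which matching variant the thesis uses, so forward maximum matching is an assumption here. The names `forward_max_match`, `to_spans` and `f_value`, the toy dictionary, and the example sentence are all hypothetical. The example also shows why pure matching struggles with unrecorded words: "微博" is not in the dictionary, so it is split into single characters.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def f_value(predicted, gold):
    """Harmonic mean of word-level precision and recall."""
    pred_spans, gold_spans = to_spans(predicted), to_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy dictionary and sentence (illustrative only, not from the thesis corpus).
vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理"}
sentence = "我喜欢自然语言处理微博"
gold = ["我", "喜欢", "自然语言", "处理", "微博"]

predicted = forward_max_match(sentence, vocab)
score = f_value(predicted, gold)
print(predicted, round(score, 3))
```

In the proposed method, a statistical labeler such as CRF is combined with the matching output to recover words like "微博" that the dictionary misses; that part is not sketched here.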
Keywords/Search Tags: micro-blog, segmentation, SVM classifier, k-means