Analysis Method Of Targeted Information Based On Weibo Topics

Posted on: 2017-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhao
GTID: 2428330566953515
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development and spread of social media, more and more Internet users prefer actively creating information to passively receiving it, a change that has greatly enriched Internet information resources. Online content involving pornography, violence and terrorism, reactionary remarks and the like has been a focus of the Public Information Network Security Supervisory Organ. How to guarantee Internet information security and build a healthy Internet environment has attracted widespread concern across society. However, characteristics of social media such as its not fully open nature and its use of Ajax pose new requirements and challenges for traditional information security supervision. This thesis studies Sina Weibo (hereinafter "Weibo"), which is regarded as the most distinctive representative of social media. It applies the theory and methods of data mining to the detection of targeted information in Weibo topics, and analyzes and discusses how to improve the identification and detection rate of targeted information on Weibo. The main work is as follows:

1. Designing and implementing a topic data acquisition system for Weibo. To address the difficulties and shortcomings that a traditional crawler faces when acquiring data from Weibo, the thesis designed and implemented a Weibo focused crawler that can log in to Weibo and extract asynchronously loaded Ajax data. To improve acquisition efficiency, an API-based data acquisition method was combined with the focused crawler to form the Weibo topic data acquisition system. The data acquisition experiments show that the combined system extracts data more efficiently than the focused crawler alone.

2. Researching suitable preprocessing for raw Weibo data. To make the target data easier for the core algorithm to identify and process, the thesis first analyzed the characteristics of the raw data and completed data cleaning. Then, based on an evaluation of four Chinese word segmentation tools, it chose NLPIR as the most suitable segmenter for Weibo text. In addition, comparative analyses were conducted on different methods and models for Weibo text representation, feature selection and weight computation. As a result, the Vector Space Model (VSM), Document Frequency (DF) and an improved TF-IDF algorithm were judged more suitable for Weibo data in terms of time efficiency, algorithm complexity and objectivity. The experiments show that the preprocessing is feasible and efficient.

3. Proposing a targeted information detection method based on the classification of co-occurred targeted words (DMCCTW). To address the deficiencies of the K-means clustering algorithm, the thesis implemented the Canopy parallel clustering algorithm based on MapReduce on the Hadoop platform. Then, on the basis of the clustering algorithm, the DMCCTW model was built with reference to the co-occurred word phenomenon. It identifies and detects targeted information that would otherwise be missed by shifting the focus of detection to isolated points (or groups) in clusters. Meanwhile, the Algorithm on Mining Co-occurrence Targeted Words (AMCTW) was proposed to increase the coverage of the targeted word thesaurus, which further improves DMCCTW.

4. Accomplishing data visualization by building a D3-Cloud platform. Based on the amount of targeted information and the behavioral characteristics of the users who posted it in topics, and with the help of graphical tools, the D3-Cloud platform was built to display the targeted topics and the corresponding users qualitatively and quantitatively in the form of word clouds. By analyzing the hidden characteristics and relationships, another channel for the diffusion of targeted information on Weibo was found: the topic name in Weibo.

Through the above work, the thesis completes an analysis method for targeted information detection based on Weibo topics. The experimental results show that the analysis methods of this thesis are feasible and effective.
Keywords/Search Tags:Targeted Information, Web Crawler, Ajax, Clustering, Co-occurred word
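
For readers unfamiliar with the Canopy clustering step named in the third contribution, the following is a minimal single-machine sketch of the general Canopy idea. It is not the thesis's MapReduce implementation on Hadoop. The function names, the Euclidean distance, the sample points and the thresholds `t1` (loose) and `t2` (tight) are all assumptions made for illustration; the abstract does not state which distance or thresholds were used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2, distance=euclidean):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold and t2 the tight threshold, with 0 < t2 < t1.
    Returns a list of (center_index, member_indices) pairs.
    Illustrative sketch only; names and defaults are hypothetical.
    """
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")

    remaining = list(range(len(points)))  # indices still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]  # take the next remaining point as a center
        members = []
        still_remaining = []
        for i in remaining:
            d = distance(points[center], points[i])
            if d < t1:
                members.append(i)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(i)  # not tightly close: stays eligible
        canopies.append((center, members))
        remaining = still_remaining  # the center itself (d == 0) is dropped
    return canopies


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors standing in for weighted Weibo texts.
    sample = [(0, 0), (0.5, 0), (1, 0.2), (5, 5), (5.2, 5.1), (10, 0)]
    for center, members in canopy(sample, t1=3.0, t2=1.5):
        print(center, members)
```

Because Canopy needs only two distance thresholds rather than a preset number of clusters, it is often used as a cheap first pass before or instead of K-means. In this sketch a canopy containing a single member is a rough stand-in for the "isolated points" the abstract mentions; how DMCCTW actually selects and classifies such points is not described in the abstract.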