
Towards Microblog Data Analysis And Management Based On Graph Model

Posted on: 2013-04-01 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: B Zhao | Full Text: PDF
GTID: 1228330395955783 | Subject: Computer application technology
Abstract/Summary:
A microblog is a popular Web 2.0 system, such as Twitter or Sina Weibo. It allows users to post short messages, also known as tweets, of up to 140 characters. Despite this limit, tweets cover a wide variety of content, ranging from breaking news and discussion to personal life, activities, and interests. Microblogs have become a broadcast medium for expressing public opinion. Around hot events, microblogs usually collect diverse and abundant thoughts, comments, and opinions from many individual viewpoints in a short period. In the end, these individual viewpoints converge into several collective ones.

In this thesis, we study microblog mining techniques based on a graph model. Compared with traditional media, microblogs have several distinguishing characteristics: short length, massive size, low quality, real-time nature, and social networking. These characteristics pose several challenges. First, tweets are deficient in statistical and linguistic features because of their short length, so existing methods for mining long-text corpora are not suitable for microblogs. Second, microblog messages contain all kinds of noisy data, such as typos, ad hoc abbreviations, and phonetic substitutions, which adversely affect NLP (Natural Language Processing) tools. Third, owing to the massive size of microblogs, the proposed approaches need to guarantee both scalability and efficiency. Finally, microblogs embed not only massive text messages but also large amounts of unstructured data, such as social networking data based on a graph model. The key challenge is the efficiency of mining algorithms: without an appropriate disk block design and indexing structures, microblog mining algorithms will not be efficient.

To summarize, our main contributions are as follows:

· The tremendous increase in spam has become a serious problem. We aim to detect spammer communities by means of the retweeting relationship. First, we define a new function for rating the intensity of spammer behaviours. We then propose two spam detection algorithms based on a reuse detection model: a sentence-level detection algorithm and a term-level one. The sentence-level algorithm focuses on the behaviour pattern of spammers and ignores the topic of spam messages. The term-level algorithm focuses on the topic of spam messages and compensates for the shortcomings of the sentence-level one.

· In order to identify collective viewpoints, we propose a Term-Tweet-User graph, which simultaneously incorporates text content, temporal information, and community structure, to model postings over time. Based on this model, we propose Time-Sensitive Random Walk to effectively measure the relevance between pairs of terms by considering temporal aspects, and then group terms into collective viewpoints. Additionally, we propose Incremental Random Walk to recompute the relevance between nodes incrementally and efficiently.

· Bipartite graph data management (BGDM) is an important issue. First, we present the common atomic operators in BGDM, which can be implemented using max-stars. We then discuss a bipartite graph block structure in detail, along with the relevant query algorithms, which use a Bloom filter to avoid loading the whole block for star vertex queries.

Finally, extensive experiments on real data collected from microblogs demonstrate that our proposals outperform the state-of-the-art approaches.
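
The Bloom filter check mentioned in the third contribution can be illustrated with a minimal sketch. The abstract does not describe the thesis's actual block layout, so everything below is an assumption made for illustration: the `Block` class, its in-memory dictionary standing in for the on-disk payload, the `neighbors_of` helper, and the filter parameters are all hypothetical. The sketch only shows the general idea of consulting a small per-block filter before paying the cost of loading a block.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Force an odd step so the stride is never zero.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class Block:
    """Hypothetical block: star centre vertex -> neighbour list, plus a filter."""

    def __init__(self, stars):
        self._stars = dict(stars)  # stands in for the on-disk payload (assumption)
        self.filter = BloomFilter()
        for vertex in self._stars:
            self.filter.add(vertex)
        self.loads = 0  # counts simulated block loads

    def load(self):
        self.loads += 1
        return self._stars


def neighbors_of(blocks, vertex):
    """Return the neighbour list of a star vertex, or None if it is not stored."""
    for block in blocks:
        if not block.filter.might_contain(vertex):
            continue  # filter rules the block out, so it is never loaded
        stars = block.load()
        if vertex in stars:  # re-check, since the filter can give false positives
            return stars[vertex]
    return None
```

For example, with `blocks = [Block({"u1": ["t1", "t2"]}), Block({"u3": ["t3"]})]`, calling `neighbors_of(blocks, "u3")` returns `["t3"]`. A Bloom filter never reports a stored key as absent, so the lookup stays correct; a false positive only costs one unnecessary block load.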
Keywords/Search Tags: Microblog, Random Walk, Spammer, Reuse Detection, Bipartite Graph, Graph Clustering