
Research On Quality Evaluation And Application Of The Information In Web Social Media

Posted on: 2013-09-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X H Han
Full Text: PDF
GTID: 1228330395470216
Subject: Computer system architecture
Abstract/Summary:
Social media is a group of Internet-based applications built on the ideological and technological foundations of Web 2.0. It allows users to share information and personal reviews, to converse publicly with other users, and to build virtual social relationships. Social media platforms of different types, such as forums, blogs, microblogs and social networking services (SNS), have become prevalent channels for knowledge sharing and information propagation. The characteristics of social media are: (1) a huge number of users; (2) strong interactivity; (3) broad information topics; (4) highly real-time content; (5) multimedia and multi-dimensional data. Because of these characteristics, a large amount of highly valuable knowledge and information has accumulated in social media, and the ability to analyze and use it has become important for both scientific research and business applications.
However, social media mining also faces problems and challenges: (1) conventional text mining methods are not very effective on social media data because of data sparseness; (2) the proportion of low-quality information is very high; (3) it is difficult to fuse multimedia and multi-dimensional information.
To address these problems and challenges, this thesis, supported by NSF, investigates the quality evaluation of information in social media and its applications. The main research contents and innovations of the thesis are as follows:
(1) We propose an LDA-based approach to detect low-quality posts in Web forums. Web forums contain a large number of low-quality posts, which inconvenience users and also hamper forum-based research. Filtering low-quality posts is therefore an important and necessary preprocessing step before forum information is used.
In this thesis, we propose a binary-classification approach to detect low-quality posts. Unlike existing methods, it takes both the semantic and the statistical features of a post into account when evaluating its quality. Semantic features are computed in the LDA topic space to reduce the negative effects of data sparseness: an LDA topic model is first built on the start-post collection and is then used to compute three semantic features, i.e., J/I topic proportion, topic uncertainty and topic relevance. Statistical features include content surface features, syntactic features and forum-specific features, which are chosen based on an analysis of post contents. We train an SVM classifier to filter out low-quality posts. Experiments on three different types of Web forum collections show that the new approach outperforms previous ones in terms of precision, recall and F1 measure.
(2) We propose a novel machine-learning-based approach to rank Web forum posts. Initiating a thread and browsing its posts in a Web forum is similar to searching for documents with a search engine. It would therefore be very helpful to users and to other Web forum applications to rank posts by quality, much as search results are ranked. Building on studies of learning to rank in the information retrieval context, we propose a novel machine-learning-based approach named LGPRank that ranks Web forum posts according to the quality of their content. We treat the start post of a thread as a query and all reply posts as the documents relevant to that query. LGPRank employs a genetic programming (GP) framework to learn an optimal ranking function from the training dataset. As before, both the semantic and the statistical information of a post is taken into account in the learning process. Wikipedia is used as an external repository for estimating the LDA model from which the semantic features are computed.
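The abstract does not define the topic-space semantic features. As a minimal sketch only, assume that topic uncertainty is the Shannon entropy of a post's LDA topic distribution and that topic relevance is the cosine similarity between the topic distributions of a reply and its start post. Both definitions, and the function names `topic_uncertainty` and `topic_relevance`, are illustrative assumptions and are not taken from the thesis.

```python
import math


def topic_uncertainty(theta):
    """Shannon entropy (in nats) of a topic distribution.

    Assumption: a post spread evenly over many topics is treated as
    more 'uncertain' than a post concentrated on a few topics.
    """
    return -sum(p * math.log(p) for p in theta if p > 0)


def topic_relevance(theta_reply, theta_start):
    """Cosine similarity between two topic distributions.

    Assumption: a reply is relevant when its topic mixture resembles
    that of the start post of its thread.
    """
    dot = sum(a * b for a, b in zip(theta_reply, theta_start))
    norm = math.sqrt(sum(a * a for a in theta_reply)) * math.sqrt(
        sum(b * b for b in theta_start)
    )
    return dot / norm if norm > 0 else 0.0


# Hypothetical 3-topic distributions, for illustration only.
start_post = [0.7, 0.2, 0.1]
focused_reply = [0.8, 0.1, 0.1]
diffuse_reply = [1 / 3, 1 / 3, 1 / 3]

for name, theta in [("focused", focused_reply), ("diffuse", diffuse_reply)]:
    print(name, topic_uncertainty(theta), topic_relevance(theta, start_post))
```

Under these assumptions the diffuse reply has higher entropy and lower similarity to the start post than the focused reply. Values like these could be fed, together with the statistical features, into a classifier or a learned ranking function.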
Experiments are conducted on two Web forum datasets in comparison with methods used in prior ranking research. LGPRank outperforms all the other methods in terms of P@N, NDCG@N and MAP. The experimental results also indicate that the proposed LDA semantic features improve ranking performance.
(3) We propose a novel approach that uses social media data to detect hot events. Events that happen in the real world can be reported promptly and widely in social media. With the rapid development of digital imaging technology, people can now easily record every moment of their lives with a variety of devices, such as digital cameras and smartphones, and can upload and share their digital images through Web image communities (e.g., Flickr). A large proportion of the images in image communities were taken when particular real-world events happened. Besides images, image communities also contain textual user annotations and geographic information. Image communities have thus become a good data source for event detection research. However, using image community data also poses challenges, such as data sparseness and noise.
In this thesis, we propose a novel hot event detection approach that uses image community (Flickr) data as its data source. Text words and visual words are first extracted from user annotations and images, respectively, to form a Flickr document. To fuse the textual and visual information in a Flickr document, an LDA-based data representation is also presented. The proposed hot event detection algorithm applies a geographical closeness constraint during event detection and uses Aging Theory to model the life cycle of an event. Hot events are detected by ranking events according to their energy values in a specific time span. Experiments are conducted on a real Flickr collection.
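The abstract does not give the energy formulation, so the following is only a simplified sketch of the life-cycle idea: each new document assigned to an event adds energy ("nutrition"), and a constant amount decays at every time step. The linear update, the parameters `alpha` and `beta`, ranking by peak energy, and the function names are all assumptions made for illustration. They are not the thesis's algorithm, and the geographical closeness constraint is omitted.

```python
def energy_trajectory(doc_counts, alpha=1.0, beta=0.5):
    """Return an event's energy after each time step.

    doc_counts[t] is the number of new documents assigned to the event
    at step t. Assumed update: energy gains alpha per document and loses
    a constant beta per step, never dropping below zero.
    """
    energy = 0.0
    trajectory = []
    for count in doc_counts:
        energy = max(0.0, energy + alpha * count - beta)
        trajectory.append(energy)
    return trajectory


def rank_hot_events(events, start, end, alpha=1.0, beta=0.5):
    """Rank event ids by peak energy within the time span [start, end)."""
    scores = {
        event_id: max(energy_trajectory(counts, alpha, beta)[start:end], default=0.0)
        for event_id, counts in events.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical per-day document counts for two events.
events = {
    "festival": [0, 5, 8, 2, 0, 0],
    "minor": [1, 1, 0, 0, 0, 0],
}
print(rank_hot_events(events, start=0, end=6))
```

An event that keeps attracting documents stays energetic, while one that stops receiving them fades out, which is the behavior a life-cycle model is meant to capture.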
The experimental results show that the proposed data representation and algorithm improve event detection performance in terms of precision, recall and F1 value. The hot event detection results are also shown to be reasonable according to the P@10 score. Both ordinary users and governments can benefit from this research: ordinary users can find important information more easily, and governments should find it helpful for public opinion analysis.
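For readers unfamiliar with the ranking measures cited in the evaluation, P@N and NDCG@N can be sketched as below. The abstract does not state which NDCG variant was used. This sketch assumes the common form with gain 2^rel - 1 and a log2 position discount, and it treats any relevance grade above zero as relevant for P@N.

```python
import math


def precision_at_n(relevances, n):
    """Fraction of the top-n ranked items with relevance > 0 (divides by n)."""
    return sum(1 for r in relevances[:n] if r > 0) / n


def dcg_at_n(relevances, n):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 discount."""
    return sum(
        (2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:n])
    )


def ndcg_at_n(relevances, n):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels, listed in ranked order.
ranked = [2, 0, 1, 0]
print(precision_at_n(ranked, 2), ndcg_at_n(ranked, 4))
```

A score such as P@10 is then simply `precision_at_n(ranked, 10)` computed over the ten top-ranked items.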
Keywords/Search Tags: social media, quality evaluation, topic model, genetic programming, event detection