Semi-supervised Evolutionary Ensembles And Its Applications In Web Video Categorization

Posted on: 2016-03-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Amjad Mahmood
GTID: 1108330485988606
Subject: Computer application technology
Abstract/Summary:
Data mining, and text mining in particular, has grown tremendously in recent years because of immense advances in hardware and software. Large amounts of text data are created in social network, Web, and other information-centric applications. This growing volume of text data has created a need for algorithms that can learn interesting patterns from the data in a dynamic and scalable way. Clustering ensembles have emerged as a prominent technique in data mining: they leverage the consensus across multiple clustering solutions and combine their predictions into a single solution with improved robustness, stability and accuracy. Multimedia advances and the popularity of the social Web have together made it easy to generate videos in bulk. This abundance of videos makes it difficult for a user to search for and find the desired video, and the categorization of such Web videos has become an active research challenge. First, a detailed survey of the existing data mining literature is performed, which suggests that the most commonly studied problems in this domain relate to clustering and classification. Different algorithms are explored to find those that best suit the problem of Web video categorization. In particular situations, additional support in the form of semi-supervision plays an important role. In this research work, we propose three successive algorithms for social media mining tasks such as Web Video Categorization (WVC), using the videos' low-cost textual features, intrinsic relations and extrinsic Web support. The main contributions of this research work are as follows.

Initially, a new algorithm, the Semi-supervised Cluster-based Similarity Partitioning Algorithm (SS-CSPA), is proposed to categorize videos using the textual data provided by their uploaders.
The distinguishing feature of this algorithm is its combination of unsupervised learning, consensus among clusterings, and the additional support of pairwise constraints. First, after the textual features are extracted, each video is represented as a vector of feature terms based on the Vector Space Model (VSM). The pairwise constraints take the form of must-link pairs grouped in a mesh topology, i.e., if a video is related to a group of videos, all videos of that group are related to each other as well. Finally, the base clustering results of three different clustering algorithms are aggregated using a clustering ensemble technique under the supervision of the must-link constraints. Experimental validation of the proposed algorithm gives promising results.

In the next phase of this research, a modified algorithm, the Semi-Supervised Cluster-based Similarity Partitioning Algorithm evolved by GA (SS-CSPA-GA), is proposed. The main goal of this algorithm is to improve the measure of similarity between two videos. This is achieved by extending the traditional VSM to a Semantic VSM (S-VSM), which considers the semantic similarity between feature terms and uses WordNet to measure the extent of the relation between two feature terms. The clustering ensemble process is iterated with the help of a GA guided by a new measure, Pre-Paired Percentage (PPP), which is used as the fitness function during the genetic cycle. The purpose of this measure is to allow two solutions to be compared in the absence of ground truth labels. The idea is that one clustering solution is considered better than another if it satisfies more of the must-link constraints. Crossover and mutation operators, the most important steps in the genetic cycle, must be defined in order to produce new solutions from the existing population. These key operations are expressed in terms of an intelligent clustering ensemble mechanism.
The proposed idea is twofold: enlarging the solution space and ensuring healthy offspring, i.e., new solutions that are more accurate than their parents. Finally, experiments are carried out on real-world social Web data (YouTube), which verify the effectiveness of the proposed algorithms.

In the last phase of this research, a comprehensive framework, Semi-supervised Evolutionary Ensemble (SS-EE), is proposed using the videos' low-cost textual features, intrinsic relations and Web support. The previous algorithm revealed that there are certain categories between which no clear boundary can be drawn. The overlap between such categories is resolved by defining a new distance measure, Triangular Similarity (TrS), between two Textual Feature Vectors (TFV), based on the frequencies of the most relevant terms in each category. The novelty of this approach is that the extent of similarity between two videos is measured indirectly, by comparing each of them with a third reference video, instead of comparing them directly with each other. Further contributions of this research work are the extension of the traditional VSM to a new S-VSM that considers the semantic similarity between feature terms using the Normalized Google Distance (NGD) approach, and the termination of the genetic cycle by a new measure, Clustering Quality (Cq), based on the similarity matrix and the clustering labels. Experiments on real-world social Web data (YouTube) have been performed to validate the SS-EE framework.
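
As a rough illustration of the must-link mesh and the Pre-Paired Percentage (PPP) described in the abstract, the following Python sketch expands groups of related videos into pairwise constraints and scores a clustering by the percentage of must-link pairs it keeps in the same cluster. The abstract does not give the exact PPP formula, so the ratio used here, along with the function names and the toy video ids, is an assumption made for illustration only.

```python
from itertools import combinations


def expand_must_links(groups):
    """Turn groups of related videos into the full set of must-link pairs.

    Each group is treated as a mesh: every video in a group is linked to
    every other video in that group. Overlapping groups are not merged
    here (a simplifying assumption).
    """
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs


def pre_paired_percentage(labels, must_links):
    """Percentage of must-link pairs that a clustering keeps together.

    `labels` maps a video id to its cluster label. The formula is an
    assumption based on the abstract's description, not the thesis's
    exact definition.
    """
    if not must_links:
        return 0.0
    satisfied = sum(1 for a, b in must_links if labels[a] == labels[b])
    return 100.0 * satisfied / len(must_links)


# Hypothetical toy data: two groups of related videos.
groups = [["v1", "v2", "v3"], ["v4", "v5"]]
must_links = expand_must_links(groups)

# Two candidate clustering solutions, with no ground truth labels needed.
solution_a = {"v1": 0, "v2": 0, "v3": 0, "v4": 1, "v5": 1}
solution_b = {"v1": 0, "v2": 0, "v3": 1, "v4": 1, "v5": 2}

ppp_a = pre_paired_percentage(solution_a, must_links)
ppp_b = pre_paired_percentage(solution_b, must_links)
print(ppp_a, ppp_b)
```

In this toy example the first solution keeps all four must-link pairs together and the second keeps only one, so a genetic cycle using this score as its fitness would prefer the first.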
Keywords/Search Tags: Social Media Mining, Clustering Ensemble, Genetic Algorithm, Semantic Similarity, Pairwise Constraints