Contributions to Anomaly Detection and to the Development of Recommendation Systems

Posted on: 2013-11-21
Degree: Ph.D
Type: Thesis
University: Université de Sherbrooke (Canada)
Candidate: Shu, Wu
Full Text: PDF
GTID: 2458390008969226
Subject: Computer Science
Abstract/Summary:
Data mining, also called Knowledge Discovery in Databases, is a relatively young, interdisciplinary research field of computer science. It is the process of analyzing large-scale datasets, extracting knowledge, and transforming that knowledge into a human-understandable structure for further use. Outlier detection and recommendation systems are two important tasks in data mining. Outlier detection refers to detecting observations in a given dataset that do not conform to normal observations, while recommendation systems try to predict users' preferences for items from historical purchase data and other related socio-economic data about the users. This thesis studies two key issues in these areas: outlier detection in large-scale categorical datasets, and recommendation from highly skewed rating datasets.

Previous research on recommendation systems has neglected one significant rating scenario that is widespread in real Web applications, such as e-commerce sites (e.g. Amazon, Taobao) and content provider websites (e.g. YouTube). The rating datasets collected from these websites differ in character from traditional movie and music rating datasets: their rating distributions are highly skewed. After examining the properties of this kind of rating dataset, we propose a new framework for estimating ratings and quantitative high-order preferences for skewed rating datasets. This framework makes it possible to generate novel and more effective matrix factorization and neighborhood models. Experimental results on typical highly skewed datasets show that new models created under this framework outperform conventional methods on skewed rating datasets, not only for rating prediction but also for Top-N recommendation.

Detecting outliers in large-scale categorical datasets is an important open topic in outlier detection. Existing methods in this area suffer from low effectiveness and low efficiency because of the high dimensionality and large size of the datasets, the high complexity of statistical tests, or inefficient proximity-based measures. In this thesis, we provide a formal definition of an outlier in categorical datasets and design two effective and efficient algorithms with only one parameter for outlier detection in large-scale categorical datasets.
Keywords/Search Tags:Detection, Datasets, Recommendation systems