
Research Of Machine Learning Models And Algorithms For Information Filtering And Information Retrieval

Posted on: 2008-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 1118360245992497
Subject: Management Science and Engineering
Abstract/Summary:
With the rapid development of Internet technologies, information networks play an increasingly important role in people's work and daily life. Obtaining the information that people actually need from this mass of data, quickly and efficiently, has become a key problem for the information research community. There are two main approaches to this problem: information filtering (IF) and information retrieval (IR), both of which are of academic interest and practical value. This thesis studies statistical machine learning methods, in particular models and algorithms for IF/IR. The main contents are as follows:

First, a brief introduction to IF/IR is given, covering their concepts, structure, and features as well as their origin and history. As the theoretical basis of the thesis, several statistical machine learning methods and their roles in IF/IR are also introduced.

Second, after reviewing several popular collaborative filtering approaches, this thesis presents a new probabilistic model for collaborative filtering, named the real preference Gaussian mixture model. It has two latent variables, corresponding to user classes and item classes, so each user or item may be probabilistically assigned to more than one group. The model also accounts for user rating style and item public praise. The new model is more realistic and practical than the other methods.

Third, another focus of this thesis is the use of finite mixture models to cluster large-scale document data. A generalized method for unsupervised text clustering is presented. It integrates the mixture model's model selection, feature selection, and parameter estimation into a single framework. Moreover, a modified version of "feature significance" is proposed: the features' relevance to the mixture components is introduced into the mixture model as a set of latent variables, and component-relevant features are selected while the model's parameters are estimated.
As an example of the generalized framework, a multinomial mixture model with feature selection is discussed in detail.

Fourth, this thesis uses graph-based methods to address semi-supervised learning problems. The main idea is to investigate the similarities between data examples by defining a density-based distance over the graph. The inner structure of the dataset is then obtained and used to compute the classifier. For semi-supervised classification, a kNN density-based distance is presented to re-weight the graph, and the Laplacian kernel method is then introduced to build classifiers over the whole feature space. For semi-supervised clustering, a density-based constraint expansion method is proposed. The constraint set is expanded according to the similarity of the data samples. The expanded constraint set then contains the manifold information of the dataset and can be used in any semi-supervised clustering algorithm.

Finally, the main research contents are summarized at the end of the thesis, together with an outlook on future research.
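
As a rough illustration of the multinomial mixture model mentioned in the third part of the abstract, the sketch below fits a plain mixture of multinomials to a document-term count matrix with EM. It is not the thesis's method: the abstract does not give the model's equations, and the feature selection and model selection steps it describes are omitted here. The function name `fit_multinomial_mixture`, its parameters, the smoothing constant `alpha`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def fit_multinomial_mixture(counts, n_components, n_iter=50, alpha=1e-2, seed=0):
    """Fit a mixture of multinomials with EM (illustrative sketch only).

    counts: (n_docs, n_terms) array of term counts.
    Returns mixing weights, per-component term probabilities, and the
    soft assignment (responsibility) of each document to each component.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_docs, _ = counts.shape

    # Start from random soft assignments; each row sums to 1.
    resp = rng.dirichlet(np.ones(n_components), size=n_docs)

    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and term probabilities.
        # alpha is a small smoothing constant (an assumption) that keeps
        # every probability strictly positive.
        weights = (resp.sum(axis=0) + alpha) / (n_docs + n_components * alpha)
        term_counts = resp.T @ counts + alpha
        term_probs = term_counts / term_counts.sum(axis=1, keepdims=True)

        # E-step: posterior over components for each document, in log space.
        # The multinomial coefficient is constant per document and cancels.
        log_post = np.log(weights) + counts @ np.log(term_probs).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

    return weights, term_probs, resp


# Toy data (hypothetical): three documents use terms 0-1, three use terms 2-3.
toy_counts = np.array([
    [9, 11, 0, 0],
    [12, 8, 0, 0],
    [10, 10, 0, 0],
    [0, 0, 11, 9],
    [0, 0, 8, 12],
    [0, 0, 10, 10],
])
weights, term_probs, resp = fit_multinomial_mixture(toy_counts, n_components=2)
labels = resp.argmax(axis=1)  # hard cluster label per document
```

Because the responsibilities are soft, a document can belong to several components with different probabilities. The latent feature-relevance variables described in the abstract would add a further set of quantities to estimate in the same EM loop, which this sketch does not attempt.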
Keywords/Search Tags: Collaborative Filtering, Unsupervised Learning, Semi-supervised Classification, Semi-supervised Clustering, Finite Mixture Models