
Research On Semi-supervised Text Clustering Algorithm For Personalized Topics

Posted on: 2017-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
GTID: 2358330503988905
Subject: Computer application technology
Abstract/Summary:
With the spread of the Internet across the world, the number of Internet users keeps increasing. This has led to exponential growth in accumulated Internet data, a considerable portion of which is document data. Analyzing these documents to find previously unknown, valuable information has therefore become an important problem. Semi-supervised document clustering is an important method for text analysis in data mining: it uses a small amount of supervised information to improve clustering performance, and it has attracted increasing attention from scholars and engineers. Document clustering automatically divides a document set into several clusters by analyzing and recognizing the relationships between the documents in the set, so that documents on the same topic fall into the same cluster as far as possible and documents on different topics fall into different clusters.

Most existing semi-supervised document clustering algorithms ignore the user's individual wishes and cannot achieve a satisfactory personalized division of documents. The form of supervised information required by some of these algorithms is difficult for users to provide, which limits the applicability of those algorithms. In practical applications, the amount of supervised information available from users is also very small compared with the large volume of document data, so its impact on the clustering process is limited.

Based on an analysis of the background of semi-supervised document clustering and the problems of existing semi-supervised clustering algorithms: (1) This thesis presents a new semi-supervised clustering algorithm with a novel supervised-information format that lets users provide supervised information more conveniently. The proposed format consists of keywords that the user marks as “interested in” or “uninterested in”. (2) It also addresses the problem of reflecting user personalization and the problem of the supervised-information form. Based on the supervised information provided by users, the documents, and the word distributions of the latent topics, the proposed method learns from and expands the supervised information to compensate for its scarcity.

Considering the good clustering performance of the LDA topic model and the latent topics it mines during clustering, LDA is introduced into the semi-supervised document clustering problem and combined with the new form of supervised information and its expansion. The result is a new extendable semi-supervised document clustering algorithm based on user preferences, extended LDA (ex LDA). To verify the effectiveness of the algorithm, several experiments are designed on real data sets. First, the rationality and effectiveness of the supervised information are analyzed from the perspective of its format. The experiments on real data sets show that, compared with traditional and recent semi-supervised document clustering algorithms, the extendable algorithm based on user preferences significantly improves document clustering results and realizes a user-personalized division of documents.
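The idea of expanding keyword supervision with the word distributions of latent topics can be illustrated with a small sketch. This is not the ex LDA algorithm from the thesis, whose details the abstract does not give. It is a minimal illustration under stated assumptions: the function name `expand_keywords`, its parameters, the scoring rule (sum of seed-word probability per topic), and the toy topic-word matrix `phi` are all hypothetical.

```python
import numpy as np

def expand_keywords(phi, vocab, interested, uninterested, n_topics=1, n_words=2):
    """Grow the 'interested' keyword set using topic-word distributions.

    phi: (K, V) array; row k is the word distribution of latent topic k.
    This scoring rule is an assumption, not the method from the thesis.
    """
    index = {w: i for i, w in enumerate(vocab)}
    seed_ids = [index[w] for w in interested if w in index]
    # Words that must never be added: seeds already present and rejected words.
    blocked = set(uninterested) | set(interested)

    # Score each topic by the probability mass it puts on the seed words.
    topic_scores = phi[:, seed_ids].sum(axis=1)
    best_topics = np.argsort(-topic_scores)[:n_topics]

    expanded = list(interested)
    for k in best_topics:
        added = 0
        # Walk the topic's words from most to least probable.
        for word_id in np.argsort(-phi[k]):
            word = vocab[word_id]
            if word in blocked:
                continue
            expanded.append(word)
            blocked.add(word)
            added += 1
            if added == n_words:
                break
    return expanded

# Toy data: two hand-written topics over a six-word vocabulary.
vocab = ["goal", "match", "league", "stock", "market", "price"]
phi = np.array([
    [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],  # toy "sports" topic
    [0.03, 0.03, 0.04, 0.35, 0.30, 0.25],  # toy "finance" topic
])
result = expand_keywords(phi, vocab, interested=["goal"], uninterested=["stock"])
print(result)
```

In this toy run the single seed word "goal" selects the first topic, and the two most probable unblocked words of that topic are added to the seed set. In practice `phi` would come from a fitted topic model rather than being written by hand, and the expanded keywords would then feed back into the clustering step.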
Keywords/Search Tags: Data Mining, semi-supervised document clustering, ex LDA, user preferences, user-personalized document division