
Research On Web Log And Subspace Clustering Mining Algorithms

Posted on: 2009-09-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R Hu
Full Text: PDF
GTID: 1118360275470893
Subject: Computer software and theory
Abstract/Summary:
Data mining aims to identify valid, novel, potentially useful, and ultimately understandable patterns in data. With the rapid development of information technologies, the data gathered in many fields grows exponentially every day. In particular, large-scale and complex data are generated in applications such as web services, natural science, and electronic business. Helping users extract knowledge from these data effectively is an urgent problem. It is therefore of both theoretical and practical significance to design effective mining algorithms for large-scale, high-dimensional data that take into account the needs of the applications and the data characteristics of each field.

For the problem of mining translations of web queries from web click-through data, the framework MTQC uses web logs as an effective corpus for mining web query translations. Based on an analysis of web logs, which record the interactions between web users and search engines, MTQC exploits bilingual URL pairs and the queries related to these URLs. Mining proceeds in two steps: first it identifies bilingual URL pairs, then it matches query translation pairs. Two algorithms, MTQC-1 and MTQC-2, are built on this framework. They have several useful properties: they require no crawling or word segmentation, they can capture popular translations, and they can extract semantically relevant translations that improve Cross-Lingual Information Retrieval. Experiments on large-scale, real click-through data show that, compared to state-of-the-art translation algorithms, the proposed algorithms are effective in translating out-of-vocabulary queries and popular queries.

For the problem of mining pattern-based maximal subspace clusters, traditional distance-based subspace clustering algorithms are unable to mine such patterns, and the classic pattern-based mining algorithms have some limitations. The maximal pattern-based subspace clustering algorithm EMaPle is a novel algorithm for mining patterns that satisfy coherence, size, and sign constraints. It exploits the characteristics of gene expression data and computes MDSc only in the attribute space, which is small compared to the large number of objects in typical datasets. It then applies global pruning rules to the MDSc and traverses the organized prefix tree depth-first, applying local pruning rules along the way to prune meaningless attributes and subtrees. Experiments show that the proposed algorithm significantly outperforms the classic algorithm on both real and synthetic datasets.

The problem of subspace skyline clustering concerns how, in an arbitrary subspace of a high-dimensional space, to organize skyline results better and make them more manageable, and thus help users make decisions more efficiently. Based on an analysis of the difficulties of skyline queries in high dimensions, the subspace skyline cluster is proposed as a novel and compact structure. By introducing clustering into the skyline query, it combines the advantages of both. Traditional skyline query algorithms are expected to meet several requirements: progressiveness, correctness, efficiency, fairness, and extensibility. The SSSCM and TSSCM algorithms draw on nearest-neighbor search and presorting, and are inspired by top-k query algorithms. They satisfy all of the requirements mentioned above. Experiments on two real datasets and two synthetic datasets show the efficiency and effectiveness of the proposed algorithms.
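To make the notion of a skyline in an arbitrary subspace concrete, the following is a minimal sketch of a generic presorted skyline computation. It is not the SSSCM or TSSCM algorithm, whose details the abstract does not give. The names `dominates` and `subspace_skyline`, the sample data, and the convention that smaller values are better are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point, dims: Sequence[int]) -> bool:
    """True if p is no worse than q on every chosen dimension and
    strictly better on at least one (assumption: smaller is better)."""
    return (all(p[d] <= q[d] for d in dims)
            and any(p[d] < q[d] for d in dims))


def subspace_skyline(points: List[Point], dims: Sequence[int]) -> List[Point]:
    """Return the points not dominated by any other point when only
    the dimensions in `dims` are considered."""
    # Presort by the sum of the chosen coordinates. If p dominates q,
    # then sum(p) < sum(q), so a point can never be dominated by one
    # that comes later in this order.
    ordered = sorted(points, key=lambda p: sum(p[d] for d in dims))
    skyline: List[Point] = []
    for candidate in ordered:
        # It suffices to compare against the skyline found so far.
        if not any(dominates(s, candidate, dims) for s in skyline):
            skyline.append(candidate)
    return skyline


# Hypothetical records: (price, distance, noise), smaller is better.
hotels = [(50, 8, 3), (60, 2, 5), (80, 1, 1), (70, 9, 4), (55, 8, 2)]

print(subspace_skyline(hotels, dims=(0, 1)))  # price and distance
print(subspace_skyline(hotels, dims=(0, 2)))  # price and noise
```

The two calls return different result sets for the same data, which illustrates why organizing skyline results across subspaces becomes a manageability problem as the number of dimensions grows.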
Keywords/Search Tags: Web log mining, query translation, subspace clustering, pattern similarity, skyline query, subspace skyline cluster