
The Research On Web Structure Mining And High Dimensional Data Mining

Posted on: 2013-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Yu
Full Text: PDF
GTID: 1228330395999263
Subject: Computer software and theory
Abstract/Summary:
Data mining is a frontier research direction in artificial intelligence, machine learning, pattern recognition and decision making. With the rapid development of the Web and the growing capacity to collect data, Web mining and high-dimensional data mining have become two important branches of data mining.
The Web is an important platform for spreading and obtaining information. At present there are more than one billion web pages on the Internet, and the number grows dramatically every day, as does the amount of information the Web contains. On the other hand, the Web is self-organized and unstructured, so classical information retrieval techniques cannot be applied to Web data mining. Besides web pages, the Web contains a huge number of hyperlinks. Since hyperlinks carry information that can be used to evaluate the importance of a web page, Web structure mining (also called hyperlink analysis) has become an important way to improve the performance of Web information retrieval.
Clustering is one of the basic methods in data mining and is widely used in many domains. Recently, the data in many clustering applications has become high-dimensional, for example transaction data, document-word frequency data, user rating data, Web logs and multimedia data. Most classical clustering algorithms assume that the data being processed is low-dimensional, so they cannot produce effective clustering results on high-dimensional data. High-dimensional data clustering is now one of the key research problems in cluster analysis. Manifold clustering is a high-dimensional data clustering technique that has developed quickly and been studied widely in recent years.
In this dissertation, we focus on Web link analysis and high-dimensional data clustering, two classical research problems in data mining.
We study page ranking algorithms based on link analysis in search engines, maximum flow algorithms for finding web communities, effective dissimilarity measures in manifold clustering algorithms, and a sampling-based low-rank approximation scheme for reducing the computational burden of large-scale manifold learning. The major contributions of the dissertation are summarized as follows:
(1) We analyze the characteristics of the classical link-analysis-based page ranking algorithms used in search engines, namely PageRank and HITS. For PageRank, which is topic-independent, we propose a multi-level importance propagation framework for the static ranking of web pages. It assigns different weights to direct and indirect hyperlinks according to a given attenuation model. Experiments demonstrate that the proposed modified PageRank framework improves the accuracy of search results. For HITS, which is topic-dependent, we assign different weights to links according to web page similarity and link popularity. The modified HITS algorithm effectively alleviates the topic drift problem.
(2) We study the relation between edge capacity and the scale of the web community in the maximum flow method of identifying communities, and mine the characteristics of the link structure with a view to identifying communities. We improve the original maximum flow algorithm by exploiting the power-law distribution of web pages' in-degree and out-degree, differentiating the links among pages and assigning edge capacities variably and efficiently. The improved maximum flow algorithm picks up fewer noise pages and improves the quality of the identified communities.
(3) We propose a neighbor-path-based effective dissimilarity to enhance the cluster structure of the low-dimensional manifold obtained by manifold learning algorithms. It consistently improves clustering performance.
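For reference, the classical PageRank baseline that contribution (1) starts from can be sketched as a power iteration. This is plain PageRank only, not the multi-level importance propagation framework proposed in the dissertation. The function name `pagerank`, the damping value 0.85, the convergence tolerance and the toy link graph are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Classical PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All names and default values here are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}

        # Each page passes its rank equally along its direct out-links.
        for u, targets in links.items():
            for v in targets:
                new[v] += damping * rank[u] / len(targets)

        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy graph (hypothetical): page "c" receives links from "a", "b" and "d".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

In this baseline only direct hyperlinks carry importance, each with equal weight. A variant in the spirit of contribution (1) would additionally weight indirect links according to an attenuation model, which this sketch does not attempt.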
We analyze how the approximation quality of the Nyström method depends on the choice of landmark points, and how matrix approximation error affects the clustering performance of manifold clustering algorithms. We propose an incremental sampling scheme for Nyström-based manifold clustering, which improves the clustering performance of fast manifold clustering approximated by the Nyström method.
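The dependence of Nyström approximation quality on the landmark points can be illustrated with a minimal sketch. It uses the standard Nyström formula K ≈ C W⁺ Cᵀ with uniformly random landmarks, which is an assumption made for illustration and not the incremental sampling scheme proposed in the dissertation. The RBF kernel, the `gamma` value, the synthetic data and all function names are likewise hypothetical.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel between every row of X and every row of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def nystrom_approx(X, landmark_idx, gamma=1.0):
    """Standard Nystrom approximation K ~ C W^+ C^T from chosen landmarks."""
    C = rbf_kernel(X, X[landmark_idx], gamma)  # n x m: all points vs landmarks
    W = C[landmark_idx]                        # m x m: landmarks vs landmarks
    return C @ np.linalg.pinv(W) @ C.T


def relative_error(K, K_hat):
    """Frobenius-norm error of the approximation, relative to ||K||."""
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)


# Synthetic data (illustrative only): 60 points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
K = rbf_kernel(X, X, gamma=0.5)

# Uniformly sampled landmarks (an assumption, not the thesis's scheme).
few = rng.choice(60, size=5, replace=False)
K_few = nystrom_approx(X, few, gamma=0.5)
err_few = relative_error(K, K_few)

# Using every point as a landmark recovers K up to numerical error.
err_all = relative_error(K, nystrom_approx(X, np.arange(60), gamma=0.5))
```

Comparing `err_few` against the error for other landmark sets of the same size gives a rough feel for how much the choice of landmarks matters, which is the question an incremental sampling scheme addresses.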
Keywords/Search Tags: Web Structure Mining, Web Link Analysis, High Dimensional Data Clustering, Manifold Clustering, Incremental Sampling