Research And Implementation Of Web Document Clustering Algorithm Based On Semantic Gravitation And Density Distribution

Posted on: 2012-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Z T Li
GTID: 2248330395958231
Subject: Computer application technology
Abstract/Summary:
With the continuous development of network technology and the great enrichment of web information resources, how to access internet resources efficiently and then analyze and process them effectively has increasingly become a major problem for data mining. Some traditional methods still collect massive amounts of information with common web spiders, which yields rough results containing a large amount of noisy data and has an unnecessary adverse effect on further data analysis and processing. At the same time, a good clustering algorithm is also an important part of the text analysis procedure. Faced with the high-dimensional features of web documents, some distance-based similarity measures have already shown their shortcomings. For example, in the text feature-word space, because the vectors are high-dimensional, there are many vectors with a small norm, most of whose feature components are zero. Calculation results show that the similarities between any vectors with this characteristic are large. However, semantic analysis of these texts shows that their contents are not similar. This is an obvious difference between high-dimensional and low-dimensional data when they are expressed as vectors. Therefore, the major research goal of this thesis is to find an algorithm that is suitable for similarity calculation on high-dimensional text and can improve the accuracy of large-scale web document clustering.

Based on the above analysis, this thesis starts from web text collection. It adds text analysis and content relevance assessment steps to the data collection procedure and proposes a calculation method that uses data gravitation as the similarity measure. Following this main idea, the thesis studies web document clustering algorithms. The work covers the following aspects:

(1) For the collection of internet resources, this thesis makes appropriate improvements to the traditional spider.
By adding content analysis and relevance evaluation to the crawling process, unrelated internet resources are filtered out at an early stage. As a result, the effectiveness and relevance of the downloaded data sets can be ensured.

(2) After analyzing the principles of traditional web page analysis systems, this thesis presents a semi-automated template generation tool with manual intervention. The tool avoids the complex work of analyzing web page code. The generated templates are versatile and can extract content from a class of web pages that share the same layout structure.

(3) The most widely used similarity measure is based on Euclidean distance. Its advantage is a sound basis in mathematical theory. By converting texts into the corresponding feature-word vectors, it can be applied directly to calculation and to visualization of the results. However, text vector models are often high-dimensional, and the distribution of data in a high-dimensional space cannot be represented by a low-dimensional model, so the characteristics of the data in high-dimensional space are not reflected well. This thesis therefore presents a data gravitation similarity measure that retains the advantage of the Euclidean distance method in low-dimensional calculation. The method takes into account that the traditional Euclidean distance cannot reflect the semantics of texts, and it emphasizes the correlation between data attributes. As a result, it is able to obtain good results even on irregularly distributed samples.

(4) Since traditional similarity measures have trouble with data clustering, especially with high-dimensional data, the similarity calculation method based on data gravitation is introduced into the clustering process.
To overcome this method's shortcomings in expressing similarity relations between classes, this thesis presents a new clustering algorithm based on semantic gravitation and density distribution, which combines and improves the partition-based and density-based clustering methods. After calculating the density of the objects, the new algorithm uses the higher-density objects as cluster centers. It thereby reduces the impact of bias in the selection of the initial cluster centers and ensures better accuracy.

Experimental results show that the algorithm in this thesis produces more accurate clustering results, particularly in text clustering with high-dimensional, sparse data.
Keywords/Search Tags: data gravitation, similarity calculation, hierarchical clustering, webpage analysis
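
The density-based choice of cluster centers described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula of the data gravitation measure, so cosine similarity is swapped in here as a plainly labelled stand-in. The function names, the neighbour threshold of 0.8, and the sample term-count vectors are all hypothetical and are not taken from the thesis. The last lines also reproduce the sparse-vector effect mentioned in the abstract: two short documents with no terms in common end up closer in Euclidean distance than two longer documents on the same topic.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity; a stand-in, NOT the thesis's data gravitation measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def densities(vectors, sim, threshold):
    """For each object, count the other objects at least `threshold` similar to it."""
    return [
        sum(1 for j, v in enumerate(vectors) if j != i and sim(u, v) >= threshold)
        for i, u in enumerate(vectors)
    ]

def pick_seeds(vectors, k, sim, threshold):
    """Choose up to k high-density objects that are not neighbours of each other."""
    dens = densities(vectors, sim, threshold)
    order = sorted(range(len(vectors)), key=lambda i: -dens[i])  # densest first
    seeds = []
    for i in order:
        # Skip candidates that fall inside the neighbourhood of a chosen seed.
        if all(sim(vectors[i], vectors[s]) < threshold for s in seeds):
            seeds.append(i)
        if len(seeds) == k:
            break
    return seeds

def assign(vectors, seeds, sim):
    """Partition step: label each object with the index of its most similar seed."""
    return [max(seeds, key=lambda s: sim(v, vectors[s])) for v in vectors]

# Hypothetical term-count vectors: documents 0-2 share one vocabulary, 3-5 another.
docs = [
    [3, 2, 0, 0], [2, 3, 0, 0], [4, 3, 1, 0],
    [0, 0, 2, 3], [0, 1, 3, 2], [0, 0, 3, 3],
]
seeds = pick_seeds(docs, k=2, sim=cosine, threshold=0.8)
labels = assign(docs, seeds, cosine)
print("seeds:", seeds, "labels:", labels)

# Sparse-vector effect: short_a and short_b share no terms, long_c and long_d
# use the same terms with different counts.
short_a, short_b = [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]
long_c, long_d = [0, 0, 5, 4, 3, 0], [0, 0, 3, 5, 5, 0]
print("euclidean:", euclidean(short_a, short_b), euclidean(long_c, long_d))
print("cosine:", cosine(short_a, short_b), cosine(long_c, long_d))
```

Because the seeds are taken from dense regions instead of being drawn at random, repeated runs on the same data start from the same centers, which is one way to read the abstract's claim about reducing the bias of initial center selection. Any other similarity function with the same two-argument signature could be passed as `sim`.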