
Research On Web Document Clustering Approaches Based On Phrase Features

Posted on: 2011-11-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R L Yang
Full Text: PDF
GTID: 1118330338982763
Subject: Computer application technology
Abstract/Summary:
Over the past decades, the Internet has become a vast information repository far beyond our expectation, where most resources are stored as Web documents. To make full use of Web resources, help users browse information accurately through quick searches, and broaden the applications of Web documents, Web data mining technologies have been developed extensively for many years. In particular, by classifying, clustering, analyzing relevance, and predicting trends among data, we can find and extract useful patterns and hidden information from massive data.

Web document clustering is an important branch of Web data mining. It can be used to organize documents efficiently, extract document summaries, and navigate document collections without any supervised training or manually labelled documents. It is regarded as one solution to the problem of information explosion. Thus, Web document clustering has broad application prospects in information retrieval, information filtering, search engines, and digital libraries.

Following a description of the basic concepts, methods, current research, and open issues in Web text clustering, this thesis introduces the relevant theories and technologies of Web document clustering, including document representation models, similarity measures, clustering algorithms, evaluation metrics, and complexity analysis. On this basis, the thesis proposes three Web document clustering approaches based on phrase features.

① In semi-structured Web documents, HTML tags designate the different parts of a document, and this structure can be used to identify its key parts. Using this feature, this thesis proposes a novel Weighted Suffix Tree Clustering (WSTC) method for Web documents.
According to a document's structure, different parts of the document are assigned different levels of significance; each part is then partitioned into sentences, and each sentence into words. The weighted suffix tree document model is built from sentences with associated weights, each of which is stored in a node of the weighted suffix tree. Moreover, when identifying base clusters and merging clusters, several factors are considered together, such as document and sentence frequencies, phrase lengths, and the structure weight of each node. Experimental results demonstrate that the WSTC algorithm improves clustering quality.

② In the weighted suffix tree document model, each node represents a phrase and can therefore serve as a feature term of documents. Based on this observation, a hybrid model called WSTVSM is proposed, which combines the weighted suffix tree model with the vector space model for Web document clustering. To obtain a WSTVSM, we first construct a weighted suffix tree document model, then map each node and its weight to a unique feature term in the vector space model. The feature term weights of the vector space model extend the TF-IDF term weighting scheme with weighted phrases. The new weighted phrase-based document similarity is then computed from the resulting WSTVSM. Finally, the Group-average Hierarchical Clustering (GAHC) algorithm is employed to generate the final clusters.

③ Although conventional partitional clustering algorithms such as K-means are easy to implement and run fast, they usually lack robustness. To address this drawback, this thesis proposes a hybrid clustering algorithm that optimizes the initial center values of K-means using the WSTC algorithm. First, the initial center values are extracted after the Web document set is clustered using the WSTC algorithm.
Second, each internal node of the weighted suffix tree is mapped into the vector space model, and each feature term weight is computed using the TF-IDF weighting scheme extended with weighted phrases. The final result is generated by the K-means algorithm with the optimized initial center values. The evaluation experiments indicate that the novel hybrid algorithm is not only more effective for document clustering than the ordinary K-means and WSTC algorithms, but also runs as fast as they do.
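
The idea of seeding K-means from a preliminary phrase-based clustering can be illustrated with a minimal Python sketch. It is not the thesis implementation, and several parts are assumptions made for illustration only: the weighted suffix tree is not built here, so phrases and the document part they occur in are given directly; the `PART_WEIGHTS` values are hypothetical; the TF-IDF variant (`weight * log(N / df)`) and the use of cosine similarity are assumed, since the abstract does not specify them; and `seed_groups` merely stands in for the output of a WSTC-style first pass.

```python
import math
from collections import defaultdict

# Hypothetical structure weights for document parts (not taken from the thesis).
PART_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def weighted_phrase_counts(occurrences):
    """Sum part weights over (phrase, part) occurrences of one document."""
    counts = defaultdict(float)
    for phrase, part in occurrences:
        counts[phrase] += PART_WEIGHTS[part]
    return dict(counts)

def tfidf_vectors(docs):
    """Turn weighted phrase counts into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for phrase in doc:
            df[phrase] += 1
    return [{p: wf * math.log(n / df[p]) for p, wf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for p, w in vec.items():
            total[p] += w
    return {p: w / len(vectors) for p, w in total.items()}

def kmeans(vectors, seeds, max_iter=20):
    """K-means with given initial centers, assigning by highest cosine similarity."""
    centers = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centers)), key=lambda k: cosine(vec, centers[k]))
            for vec in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [vec for vec, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = centroid(members)
    return labels

# Toy corpus: each document is a list of (phrase, document part) occurrences.
raw_docs = [
    [("suffix tree", "title"), ("document clustering", "body"), ("document clustering", "body")],
    [("suffix tree", "heading"), ("base cluster", "body")],
    [("k means", "title"), ("initial center", "heading")],
    [("k means", "body"), ("initial center", "heading"), ("document clustering", "body")],
]
vectors = tfidf_vectors([weighted_phrase_counts(d) for d in raw_docs])

# Stand-in for a preliminary clustering: groups of document indices.
seed_groups = [[0], [2]]
seeds = [centroid([vectors[i] for i in group]) for group in seed_groups]

labels = kmeans(vectors, seeds)
print(labels)  # [0, 0, 1, 1]
```

Because the initial centers come from groups that already share phrases, the first assignment step starts near a sensible partition instead of depending on random initialization, which is the robustness concern the abstract raises about plain K-means.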
Keywords/Search Tags: Web document clustering, suffix tree clustering, weighted suffix tree, weighted phrase, K-means clustering algorithm