Research On Document Clustering Based On Semantic Similarity Of Hownet

Posted on: 2011-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: J N Xu
GTID: 2178360302993445
Subject: Information Science
Abstract/Summary:
We live in an age of "information explosion". The volume of online information, including news, e-magazines, e-mail, technical reports, documents, and online libraries, keeps growing, and a large part of it is unstructured or semi-structured. Faced with this mass of semi-structured and unstructured text, information workers and researchers are eager to find ways to classify and organize it quickly and efficiently and to give users accurate and useful information. How to classify and identify free documents in the absence of specific guidance has drawn increasing attention from researchers.

Based on an extensive survey of the current state of Chinese text clustering, this paper summarizes the key technologies of text clustering: automatic text segmentation, text feature selection, text feature reconstruction, text representation, the measurement of text similarity, and the clustering algorithm. It analyzes several feature selection methods for text clustering and their influence on clustering results, introduces some Chinese text representation models and several clustering methods, and points out their weaknesses.

The text representation model and semantic association are known to be the hard parts of Chinese text processing, so this paper focuses on the semantic analysis of text. A text pre-processing method based on the lexical categories of the text, together with compression through semantic features, greatly reduces the dimension of the text features and in turn speeds up clustering. Selecting features according to TF (term frequency) and IDF (inverse document frequency) makes the feature set better represent the semantic content of the text.
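
The TF and IDF based feature selection mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact weighting or ranking rule, so everything here is an assumption: the textbook weight tf × log(N / df), ranking each term by its highest weight in any document, and the hypothetical function names `tfidf_weights` and `select_features`. Documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumed weighting (not taken from the thesis):
    relative term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


def select_features(docs, top_k):
    """Keep the top_k terms, ranked by their highest tf-idf weight in any document."""
    best = {}
    for doc_weights in tfidf_weights(docs):
        for term, score in doc_weights.items():
            best[term] = max(best.get(term, 0.0), score)
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:top_k]


docs = [
    ["stock", "market", "price", "the"],
    ["football", "match", "the"],
    ["stock", "price", "rise", "the"],
]
features = select_features(docs, 3)
```

A term that occurs in every document, such as "the" in this toy corpus, receives weight zero and is never selected ahead of a more distinctive term, which is the effect the IDF factor is meant to have.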
The established text representation model is based on term frequency and also reflects the strength of the semantic characteristics of the text. It expresses a document as a set of features, so that the similarity between texts can be calculated from the similarity between their terms. This makes it possible to analyze the similarity between texts semantically, in a way that is closer to people's subjective judgment, and to quantify that similarity, which eases processing by computer. A feature-based semantic similarity clustering model is then built on this text representation model. The paper ends with an implementation of the clustering method and its validation through several experiments.
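
As a rough illustration of computing text similarity from term similarity, the sketch below represents each text as a `{term: weight}` dictionary and combines word-level scores into a single value. The abstract does not state the combination formula, so the weighted best-match average used here is an assumption, and `text_similarity`, `toy_term_sim`, and the similarity values in `_TOY_SIM` are hypothetical placeholders standing in for a real HowNet-based word similarity.

```python
def text_similarity(weights_a, weights_b, term_sim):
    """Symmetric similarity of two texts given as {term: weight} dicts.

    Assumed scheme (not taken from the thesis): each term is matched to its
    most similar term in the other text, the matches are averaged by term
    weight, and the two directions are averaged.
    """
    def directed(src, dst):
        total = sum(src.values())
        if total == 0 or not dst:
            return 0.0
        score = sum(
            w * max(term_sim(t, u) for u in dst) for t, w in src.items()
        )
        return score / total

    return 0.5 * (directed(weights_a, weights_b) + directed(weights_b, weights_a))


# Toy stand-in for a HowNet-based word similarity; the values are made up.
_TOY_SIM = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("car", "banana")): 0.1,
}


def toy_term_sim(a, b):
    """Return 1.0 for identical terms, a table value if known, else 0.0."""
    if a == b:
        return 1.0
    return _TOY_SIM.get(frozenset((a, b)), 0.0)


same = text_similarity({"car": 1.0}, {"car": 1.0}, toy_term_sim)
near = text_similarity({"car": 1.0}, {"automobile": 2.0}, toy_term_sim)
far = text_similarity({"car": 1.0}, {"banana": 1.0}, toy_term_sim)
```

Under this scheme two texts that share no literal term can still score high when their terms are semantically close, which a purely frequency-based vector comparison would not capture. The resulting pairwise scores could then be passed to whichever clustering algorithm is chosen.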
Keywords/Search Tags: Clustering, Feature extraction, Semantic similarity, Text Model