Research On Document Clustering Based On Semantic Similarity Of Hownet

Posted on:2011-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:J N Xu

Full Text:PDF

GTID:2178360302993445

Subject:Information Science

Abstract/Summary:

We live in an age of "information explosion". The volume of online information, including news, e-magazines, e-mail, technical reports, documents, and online libraries, keeps increasing, and a large part of it is unstructured or semi-structured. Faced with this mass of semi-structured and unstructured text, information workers and researchers are eager to find ways to classify and organize it quickly and efficiently and to provide users with accurate and useful information. How to classify and identify free documents in the absence of specific guidance has drawn growing attention from researchers.

Based on an extensive survey of the current state of Chinese text clustering, this paper summarizes the key technologies of text clustering: automatic text segmentation, text feature selection, text feature reconstruction, text representation, text similarity measurement, and the clustering algorithm. It analyzes several feature selection methods for text clustering and their influence on clustering results, introduces several Chinese text representation models and clustering methods, and points out their weaknesses.

Text representation and semantic association are known to be the hard parts of Chinese text processing, so this paper focuses on the semantic analysis of text. Text pre-processing based on the lexical categories of words, together with compression through semantic features, greatly reduces the dimension of the text features, which in turn speeds up clustering. Selecting features according to TF (term frequency) and IDF (inverse document frequency) makes the feature set better represent the semantic content of the text.
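The abstract does not give the exact weighting formula, so the following is only a minimal sketch of TF-IDF feature selection under common assumptions: documents are already segmented into token lists, TF is the relative frequency of a term in a document, and IDF is the natural log of (number of documents / document frequency). The function names `tf_idf_scores` and `select_features` and the parameter `top_k` are illustrative and do not come from the thesis.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {term: tf-idf score} dict per document.

    docs is a list of token lists (text already segmented into words).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def select_features(docs, top_k):
    """Keep the top_k highest-scoring terms of each document as its features."""
    selected = []
    for doc_scores in tf_idf_scores(docs):
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.append([term for term, _ in ranked[:top_k]])
    return selected


if __name__ == "__main__":
    corpus = [
        ["cluster", "text", "text", "the"],
        ["news", "report", "the"],
        ["cluster", "news", "the"],
    ]
    print(select_features(corpus, top_k=2))
```

A term that occurs in every document (here "the") gets an IDF of zero and therefore drops to the bottom of the ranking, which is the intended effect of combining TF with IDF.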
The text representation model established here is based on term frequency and also reflects the intensity of the semantic characteristics of the text. It expresses a document as a set of features, so that the similarity between texts can be computed from the similarity between their terms. In this way the similarity between texts is analyzed semantically, comes closer to people's subjective judgment, and is quantified, which eases computer processing. On top of this text representation model, a clustering model based on the semantic similarity of features is built. The paper ends with the implementation of the clustering method and its validation through several experiments.
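The idea of deriving text similarity from term similarity can be sketched as follows. This is an illustration under stated assumptions, not the thesis's actual formula: each text is a dict of weighted feature terms, each term is matched to its most similar term in the other text, and the two directed scores are averaged. The word-level function `toy_word_sim` and the table `TOY_SIM` are hypothetical stand-ins for a HowNet-based word similarity, which is not reproduced here.

```python
def text_similarity(features_a, features_b, word_sim):
    """Similarity of two texts given as {term: weight} feature dicts.

    word_sim(a, b) must return a value in [0, 1].
    """
    def directed(src, dst):
        total_weight = sum(src.values())
        if total_weight == 0 or not dst:
            return 0.0
        # Match every source term to its most similar target term,
        # weighting the match by the source term's weight.
        matched = sum(
            weight * max(word_sim(term, other) for other in dst)
            for term, weight in src.items()
        )
        return matched / total_weight

    # Average both directions so the measure is symmetric.
    return 0.5 * (directed(features_a, features_b)
                  + directed(features_b, features_a))


# Hypothetical word-similarity table standing in for a HowNet lookup.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_word_sim(a, b):
    """Toy word similarity: 1.0 for identical words, table lookup otherwise."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


if __name__ == "__main__":
    doc_a = {"car": 2.0, "road": 1.0}
    doc_b = {"automobile": 1.0, "road": 1.0}
    print(text_similarity(doc_a, doc_b, toy_word_sim))
```

With a purely frequency-based model, "car" and "automobile" would contribute nothing to the similarity of the two example texts; with a semantic word similarity they count as a near match, which is the behavior the abstract describes as closer to human judgment.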

Keywords/Search Tags:

Clustering, Feature extraction, Semantic similarity, Text Model

Related items

1	Research On Text Clustering Based On Semantic Similarity
2	Text Similarity Computing Theory And Applied Research
3	Research On Text Clustering Algorithm Based On Word Frequency And Semantic
4	Research Of Multi-Documents Summarization Based On Information Extraction And Semantic Similarity
5	Research On Thesis Text Clustering Based On Semantic Similarity
6	Research And Application Of Semantic-based Automatic Text Summarization Generation Technology
7	Search Of Group Intelligent Text Clustering Methods Based On Semantic Similarity
8	Clustering Algorithm Research Of Short Text Based On Semantic Similarity
9	Chinese Text Clustering Based On Text Similarity
10	Research On Feature Word Extraction Of APP Based On User's Comments