
Research And Development Of Document Clustering Based On Semantic Feature

Posted on: 2009-08-27    Degree: Master    Type: Thesis
Country: China    Candidate: D Wang    Full Text: PDF
GTID: 2178360242481302    Subject: Computer software and theory
Abstract/Summary:
With the rapid development and wide use of the Internet, the amount of electronic information has grown enormously, and most of it is textual. To find the small fraction of information we need within this huge resource, we need tools to organize and manage it. As a branch of data mining research, text clustering plays an important role in web information retrieval: it classifies documents automatically, narrowing the search space and improving search efficiency. Text clustering has been applied in many fields, for example organizing the documents returned by a search engine, extracting rules for document classification, and organizing personal emails according to their content.

Research on clustering began in the 1960s, and over the decades researchers have proposed many clustering algorithms. They can generally be divided into several categories: partition-based clustering, hierarchical clustering, density-based clustering, grid-based clustering, and so on. These algorithms are all based on the Vector Space Model (VSM), a model of text representation. VSM converts unstructured text into a structured vector, so that each document is represented as a point in a high-dimensional space. The strength of VSM is that we can easily compute the distance between any two points, which measures the similarity of the two documents the points represent. The clustering algorithm then operates on these vectors, called feature vectors. Classic clustering algorithms and their derivatives have proved effective for structured data, but for documents there is a common problem: VSM assumes that words are independent of one another and ignores their semantic relations. It compares words only by their spelling, so synonymy and polysemy degrade clustering quality. That is why these algorithms perform poorly when applied directly to documents. We therefore need to study how to extract semantic information from text and how to use it to improve document clustering.

Researchers have proposed several methods of using semantic information to improve document clustering, two of which are introduced in this paper: POS tagging and Andres's method. POS tagging is a technique from linguistics; experiments show that it does not bring much improvement to document clustering. Andres's method integrates semantic information into the feature vectors of documents. Its advantage is that existing clustering algorithms based on VSM can still be used, but it also has deficiencies, such as introducing a lot of noise and considering semantic relations incompletely. Based on Andres's method, we therefore propose a new strategy for integrating semantic information into the feature vectors of documents.

In Chapter 3 we introduce the core algorithm of this paper. First, the documents are preprocessed and we build a global term list containing all the nouns in the document set. Then we use WordNet to determine the relations between these terms and obtain sets of similar terms. Each group of similar terms is represented by a specific term, which we call a concept. WordNet is a semantic dictionary widely used in data mining research; the terms in it are organized by their semantic relations rather than by spelling.
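As an illustration of this concept-building step, the following is a minimal sketch (not the thesis code) that groups a global noun list into concepts using WordNet's noun synsets and direct hypernyms through the NLTK interface. The function names and the grouping policy (alphabetically first member of a group used as its concept label) are assumptions made here for clarity.

```python
# A minimal sketch, assuming NLTK with the WordNet corpus installed, of
# grouping the global noun list into "concepts": terms that share a first
# noun synset, or whose first noun synsets share a direct hypernym, are
# mapped to a common representative term.
from collections import defaultdict
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def first_noun_synset(term):
    """Return the most frequent noun synset of a term, or None."""
    synsets = wn.synsets(term, pos=wn.NOUN)
    return synsets[0] if synsets else None


def build_concept_map(global_terms):
    """Map each term to a representative 'concept' term (illustrative policy)."""
    groups = defaultdict(list)
    for term in global_terms:
        synset = first_noun_synset(term)
        if synset is None:
            continue
        # Key each term by its own synset and by its direct hypernyms, so that
        # terms with a common hypernym fall into the same group.
        for key in [synset] + synset.hypernyms():
            groups[key].append(term)

    concept_of = {}
    for members in groups.values():
        if len(members) < 2:
            continue  # singleton groups carry no semantic grouping
        concept = min(members)  # arbitrary but deterministic representative
        for term in members:
            concept_of.setdefault(term, concept)
    return concept_of


if __name__ == "__main__":
    print(build_concept_map(["car", "automobile", "truck", "banana", "apple"]))
```

In this sketch, two terms end up under the same concept whenever they share a synset or a direct hypernym, which mirrors the "common hypernym" relation used in the method described below.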
WordNet also provides a programming interface. After the similar term sets are built, we process the term set of each document in one of two ways. In the first strategy, for each term that appears in a similar term set, if its concept is already in the document term set we add the term's frequency to the concept; if not, we add the concept to the document term set and remove the original term. The second strategy differs from the first only in that the original term is not removed. After this processing, we count term frequency and document frequency, compute the weights of the feature vectors, and apply a clustering algorithm to the documents. In this way, semantically related terms that differ in spelling are connected, improving clustering performance. The method introduces no additional noise and can also connect terms that share a common hypernym.

In Chapter 4 we carry out experiments to test the effectiveness of our method. The four data sets we use are extracted from 20 Newsgroups, a categorized document collection, and they vary in total number of documents, number of classes, and standard deviation, so that the algorithm is tested under different circumstances. The measures used to judge clustering performance are purity and entropy. We choose the K-means and bisecting K-means algorithms to cluster the processed feature vectors, and we run the original algorithm as well as the two strategies of integrating semantic features with each clustering algorithm. From the experimental results we conclude that both ways of integrating semantic information improve clustering performance, and that the second strategy is more effective in all circumstances.

Semantics-based document clustering can thus effectively improve the original clustering algorithms. It still has some deficiencies, such as the efficiency of building the similar term sets and the measure used to determine similarity between terms, which need further improvement.
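To make the two integration strategies and the purity measure described above concrete, here is a small hypothetical sketch that folds a document's term frequencies into concept frequencies, either replacing the original terms (strategy 1) or keeping them (strategy 2), and computes purity for a clustering. The function names and merging details are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch, assuming a per-document term-frequency dict and the
# term -> concept map from the previous sketch; not the thesis code.
from collections import Counter, defaultdict


def integrate_concepts(doc_tf, concept_of, keep_original=False):
    """Fold semantically related terms into their concept.

    keep_original=False: strategy 1, the term's frequency is moved to its
    concept and the original term is removed from the document vector.
    keep_original=True: strategy 2, the concept frequency is added but the
    original term is kept as well.
    """
    new_tf = Counter(doc_tf)
    for term, freq in doc_tf.items():
        concept = concept_of.get(term)
        if concept is None or concept == term:
            continue
        new_tf[concept] += freq
        if not keep_original:
            del new_tf[term]
    return dict(new_tf)


def purity(clusters, labels):
    """Purity: fraction of documents belonging to the majority class of
    their cluster (clusters: doc id -> cluster id, labels: doc id -> class)."""
    by_cluster = defaultdict(Counter)
    for doc, cluster in clusters.items():
        by_cluster[cluster][labels[doc]] += 1
    majority = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority / len(clusters)
```

Applying strategy 2 (keep_original=True) to every document before the weighting and clustering steps corresponds to the variant that, according to the experiments reported above, performs best in all circumstances.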
Keywords/Search Tags: Development