Research On Search Results Clustering And Label Extraction

Posted on: 2011-06-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Chen
GTID: 1118360332457944
Subject: Computer application technology
Abstract/Summary:
Search result clustering aims to group search results into topics on the fly and to give each cluster an accurate, readable label. Users can then navigate directly to the result set that interests them. They can also gain more insight into their query by scanning these labels, which may help them reformulate it. Compared with traditional text clustering tasks, search result clustering faces extra challenges: incomplete texts, real-time computation and accurate cluster labeling. To meet these requirements, a novel clustering algorithm is proposed, based on the Self-organizing Map (SOM) and Latent Semantic Indexing (LSI). LSI allows texts and feature words to be re-represented in the same semantic space. After this re-representation, the texts are used for SOM training and the feature words for computing neuron labels. These labels, combined with the neuron weights, are then used to merge neurons into clusters with concise descriptions.

This work extracts semantic information to support the clustering process, in keeping with the particular nature of search result clustering. An automatic label generation algorithm based on semantic space transformation is developed by mining the semantic relations between texts. To keep the algorithm practical, we try to preserve time efficiency, precision and coverage on the search results. We mainly study the following issues:

1. Optimized selection of initial data based on semantic feature extraction

The clustered results are presented directly to users, which makes traditional machine learning techniques inadequate for this task. Understanding semantics is the ultimate goal of natural language processing, and clustering search results is a necessary part of text understanding. Since Chinese semantic analysis theory is not yet mature, we propose a method that actively learns several kinds of semantic features from the Internet, thesauri and Chinese semantic analysis, and thereby introduces Chinese semantic analysis techniques into search result clustering.

2. Clustering algorithm based on dynamic LSI and SOM

Clustering precision is the basic requirement for a good search result clustering method. We tried several clustering algorithms and chose the self-organizing feature map (SOM) as our basic clustering method. Because different result sets contain different numbers of topics, we adopt an improved version of SOM whose map can grow from a small initial size to a suitable one. Because a neuron's weight approximates the centroid of the samples mapped onto it, the map has a large quantization error when there are too few neurons. In addition, the texts in search results are short and noisy, which leads to severe sparsity in the vector space model (VSM). We believe LSI is the strongest of the various feature extraction techniques: it not only reduces dimensionality, but also extracts better features by mining the semantic relations between sparse features.

3. Cluster label extraction based on LSI and SOM

Automatic label extraction is another key issue of our work. As an important way of describing clusters, cluster labels have been studied for several years. Many researchers use word-frequency-based methods, which take only a word's frequency into account. While such methods have relatively high coverage, their precision suffers from the many keywords maliciously stuffed into pages to promote ranking. This work overcomes that limitation. Through space transformation, the neurons and features that carry cluster information are mapped into the new semantic space by LSI decomposition, and cluster labels are computed from the inner product of neurons and features. These candidate cluster labels are further filtered by means of semantic analysis and user requirements.

4. Merging of initial clusters based on label similarity

Besides providing better visualization, label extraction helps users find the right information more efficiently. The extracted labels are also fed back to improve the clustering process and to merge the small clusters generated during map expansion.

This work includes some preparatory research on search result clustering and label extraction. A new method that closely integrates Chinese semantic analysis is proposed, which lays the groundwork for further investigation.
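
As a rough illustration of issues 3 and 4, the sketch below scores candidate labels by the inner product between a neuron's weight vector and each feature word's vector, then merges neurons whose label sets are similar. It is only a minimal sketch under stated assumptions, not the dissertation's implementation: the two-dimensional vectors are toy values standing in for the output of an LSI decomposition (not shown), and the names `top_labels` and `merge_by_labels`, the Jaccard similarity measure and the 0.5 threshold are all hypothetical choices, since the abstract does not say how label similarity is measured.

```python
from itertools import combinations


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_labels(neuron, term_vectors, n=2):
    """Return the n terms whose vectors have the largest inner product
    with the neuron weight vector (both assumed to live in one space)."""
    ranked = sorted(term_vectors,
                    key=lambda term: dot(neuron, term_vectors[term]),
                    reverse=True)
    return ranked[:n]


def jaccard(a, b):
    """Jaccard similarity of two label lists (an assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def merge_by_labels(labels, threshold=0.5):
    """Group neuron indices whose label sets are similar enough,
    using a small union-find structure."""
    parent = list(range(len(labels)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(labels)), 2):
        if jaccard(labels[i], labels[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(labels)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: hypothetical term and neuron vectors in a 2-D semantic space.
TERMS = {
    "car": (0.9, 0.1),
    "engine": (0.8, 0.0),
    "cat": (0.1, 0.9),
    "jungle": (0.0, 0.8),
}
NEURONS = [(1.0, 0.0), (0.9, 0.2), (0.0, 1.0)]

labels = [top_labels(w, TERMS) for w in NEURONS]
clusters = merge_by_labels(labels)
```

With this toy data the first two neurons receive the same labels and end up in one cluster, while the third stays on its own. A real system would replace the Jaccard test with whatever label similarity the method actually defines and would also take the neuron weights into account, as the abstract indicates.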
Keywords/Search Tags: Search Engine, Search Results Clustering, Self-organizing Feature Maps, Latent Semantic Indexing, Label Extraction