The Research On Clustering Algorithm For Text Search Engine

Posted on: 2013-05-24  Degree: Master  Type: Thesis
Country: China  Candidate: K Liu
GTID: 2298330362967414  Subject: Software engineering
Abstract/Summary:
An urgent problem in search engine applications is how to return user-friendly results for ambiguous queries, so that users can save time on their searches. Traditional search engines such as Google, Baidu, and Bing return a long flat list of records sorted by their relevance to the query, and users waste much time manually filtering this large number of results to find the page they actually want. Text clustering algorithms that present query results in a more usable form have therefore become a hot research topic.

Before a text clustering algorithm can be applied, the text data must be represented mathematically. One common choice is the Vector Space Model (VSM). Although simple, it easily suffers from the curse of dimensionality. It also tends to be less efficient and less accurate because it handles polysemous and synonymous words poorly, which leads to unfriendly query results. To solve these problems, this paper develops a new text clustering method based on suffix trees and HowNet-based semantic similarity, and designs a Chinese Clustering Search Engine (CCSE) aimed mainly at clustering Chinese text. CCSE first builds a suffix tree from all the texts of the search results and picks out the suffix phrases that contain only nouns, verbs, and adjectives and that end with a noun. Second, it selects descriptive phrases with high Term Frequency-Inverse Document Frequency (TF-IDF) scores and names them candidate phrases. Third, CCSE executes the suffix tree clustering (STC) algorithm to merge similar clusters. Fourth, it calculates the semantic similarity between the candidate phrases and, if the candidate phrases of several clusters are similar enough, merges those clusters into the one with the highest TF-IDF score.
Lastly, CCSE removes clusters with low Intra-Cluster Similarity (ICS) and presents the clustering results to users. Semantic similarity with HowNet is computed using an improved method that handles new words well. The method uses an extensible semantic-oriented algorithm that can include new words in the similarity computation, which improves clustering quality. Meanwhile, because the cluster labels are generated at the start of the algorithm, they remain highly descriptive.

This paper first discusses the feasibility of applying clustering algorithms in search engines and introduces the structure of a search engine, the cluster model, and HowNet. It then describes the system design of the text search engine and analyzes in detail several key technologies: the new-word problem, sentence similarity computation, and the model of the clustering algorithm. Finally, it reports an evaluation of the system's performance and analyzes the test results. The experimental results indicate that the system design is feasible and practical.
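The last two pipeline steps in the abstract (merging clusters whose candidate phrases are semantically similar, then dropping clusters with low ICS) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the thresholds, the toy synonym table standing in for HowNet similarity, and the definition of ICS as mean pairwise Jaccard similarity are all assumptions made for illustration.

```python
from itertools import combinations


def merge_by_label_similarity(clusters, label_sim, threshold=0.8):
    """Fold each cluster into the highest-scoring kept cluster with a similar label.

    clusters: list of dicts with keys "label", "score" (TF-IDF), "docs" (set of ids).
    label_sim: callable(label_a, label_b) -> similarity in [0, 1] (assumed interface).
    """
    kept = []
    # Visit clusters from highest to lowest score so merges go into the top-scoring one.
    for c in sorted(clusters, key=lambda c: c["score"], reverse=True):
        for k in kept:
            if label_sim(k["label"], c["label"]) >= threshold:
                k["docs"] |= c["docs"]
                break
        else:
            kept.append({"label": c["label"], "score": c["score"], "docs": set(c["docs"])})
    return kept


def intra_cluster_similarity(doc_ids, doc_tokens):
    """Mean pairwise Jaccard similarity of a cluster's documents (assumed ICS definition)."""
    pairs = list(combinations(sorted(doc_ids), 2))
    if not pairs:
        return 1.0  # a single-document cluster is treated as fully coherent
    total = 0.0
    for a, b in pairs:
        union = doc_tokens[a] | doc_tokens[b]
        if union:
            total += len(doc_tokens[a] & doc_tokens[b]) / len(union)
    return total / len(pairs)


def filter_by_ics(clusters, doc_tokens, min_ics=0.2):
    """Keep only clusters whose ICS reaches the (hypothetical) threshold."""
    return [c for c in clusters if intra_cluster_similarity(c["docs"], doc_tokens) >= min_ics]


# Toy stand-in for HowNet-based similarity: a hand-written synonym table.
SYNONYMS = {frozenset({"apple computer", "mac computer"})}


def toy_label_sim(a, b):
    return 1.0 if a == b or frozenset({a, b}) in SYNONYMS else 0.0


doc_tokens = {
    1: {"apple", "computer", "laptop"},
    2: {"mac", "computer", "laptop"},
    3: {"apple", "fruit", "juice"},
    4: {"stock", "price"},
}
clusters = [
    {"label": "apple computer", "score": 2.0, "docs": {1}},
    {"label": "mac computer", "score": 1.5, "docs": {2}},
    {"label": "apple fruit", "score": 1.0, "docs": {3, 4}},
]

merged = merge_by_label_similarity(clusters, toy_label_sim)
result = filter_by_ics(merged, doc_tokens, min_ics=0.2)
for c in result:
    print(c["label"], sorted(c["docs"]))
```

In this toy run the two computer-related clusters are merged under the higher-scoring label, and the "apple fruit" cluster is dropped because its two documents share no tokens. A real system would replace `toy_label_sim` with a HowNet-based measure and choose the thresholds empirically.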
Keywords/Search Tags: Text search engine, Chinese clustering, HowNet, Semantic similarity