
The Research Of A Multi-language Supporting Description-oriented Clustering Algorithm On Meta-Search Engine Result

Posted on: 2012-07-29
Degree: Master
Type: Thesis
Country: China
Candidate: G X Jiang
GTID: 2178330332483126
Subject: Computer application technology
Abstract/Summary:
With the explosive growth of the Internet, search engines have become a widely used and widely studied tool for information retrieval. However, the share of Internet content that any existing search engine covers is limited: according to one study, a single search engine finds only forty-five percent of the total relevant information. In addition, although search engines have adopted a variety of techniques to improve retrieval accuracy, they still present results as a linear list that mixes unrelated documents with relevant ones, which places a heavy burden on users.

This thesis is devoted to clustering search results on top of meta-search engine techniques. We use the popular search engines as data sources; after the results returned by the source engines are pre-processed, hierarchical clusters are formed and returned to the user who issued the query. During clustering, we first generate the cluster labels from the global data, so that the labels are more readable for users. Moreover, records within the same category are close to each other, while records in different categories are distant. Both properties aim to ease the burden on users looking for information on the Internet.

Unlike other search result clustering algorithms, we propose a multi-language, label-first clustering algorithm, which we name DCFC. The algorithm supports both Chinese and English queries, focuses on generating human-readable labels, and presents search results in a hierarchical structure. We also provide several parameters that let users adjust the results; for example, users can control the maximum length of a category label and the maximum number of records used for clustering. The DCFC algorithm has five steps: data pre-processing, segmentation, frequent phrase generation, hierarchical category label generation, and assignment of the data to the corresponding categories.
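The frequent phrase generation and category assignment steps named above can be illustrated with a minimal Java sketch. This is not the DCFC implementation, which the abstract does not detail. The class name `LabelFirstSketch`, the methods `tokenize`, `frequentPhrases`, and `assign`, the `minSupport` threshold, and the `(other)` bucket are all assumptions made for illustration. Tokenization here simply splits on non-letter characters, so it stands in for English only; Chinese input would need a real word segmenter. The hierarchy of labels is omitted, and only a flat label-first assignment is shown.

```java
import java.util.*;

// Hypothetical illustration of a label-first approach: find frequent phrases,
// treat them as candidate labels, then assign snippets to the labels.
final class LabelFirstSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    // Assumption: adequate for English; Chinese needs a proper segmenter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Count, for every word n-gram of up to maxLabelWords words, how many
    // snippets contain it; keep phrases found in at least minSupport snippets.
    static Map<String, Integer> frequentPhrases(List<String> snippets,
                                                int maxLabelWords,
                                                int minSupport) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String snippet : snippets) {
            List<String> tokens = tokenize(snippet);
            Set<String> seen = new HashSet<>(); // count each phrase once per snippet
            for (int start = 0; start < tokens.size(); start++) {
                StringBuilder phrase = new StringBuilder();
                for (int len = 1; len <= maxLabelWords && start + len <= tokens.size(); len++) {
                    if (len > 1) phrase.append(' ');
                    phrase.append(tokens.get(start + len - 1));
                    seen.add(phrase.toString());
                }
            }
            for (String p : seen) docFreq.merge(p, 1, Integer::sum);
        }
        docFreq.values().removeIf(count -> count < minSupport);
        return docFreq;
    }

    // Assign each snippet (by index) to every label whose phrase it contains;
    // snippets matching no label go to a catch-all "(other)" bucket.
    static Map<String, List<Integer>> assign(List<String> snippets,
                                             Collection<String> labels) {
        Map<String, List<Integer>> clusters = new LinkedHashMap<>();
        for (String label : labels) clusters.put(label, new ArrayList<>());
        List<Integer> other = new ArrayList<>();
        for (int i = 0; i < snippets.size(); i++) {
            // Pad with spaces so that only whole-word phrase matches count.
            String normalized = " " + String.join(" ", tokenize(snippets.get(i))) + " ";
            boolean placed = false;
            for (String label : labels) {
                if (normalized.contains(" " + label + " ")) {
                    clusters.get(label).add(i);
                    placed = true;
                }
            }
            if (!placed) other.add(i);
        }
        if (!other.isEmpty()) clusters.put("(other)", other);
        return clusters;
    }
}
```

In this sketch the `maxLabelWords` argument plays the role of the user-controlled maximum label length mentioned in the abstract. Choosing labels before assigning records is what makes the approach label-first: a cluster exists only if a readable phrase describes it.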
Experiments, presented in Chapter 5, show that the DCFC algorithm generates more readable category labels than other clustering algorithms, and that both the overall results and the accuracy are more meaningful.

We implement the DCFC system in Java. It has two main parts: a personalized, high-performance, distributed meta-search engine that serves as the data source of the DCFC system, and a clustering module. Comparing the experimental results of DCFC with those of other search result clustering systems, such as LINGO, VIVISIMO, and QUINTURA, we conclude that the DCFC algorithm is more effective.
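As a rough illustration of how a meta-search data source might combine results before clustering, the following Java sketch interleaves ranked URL lists from several engines and drops duplicates. The abstract does not describe how the DCFC system merges results, so everything here is an assumption: the class `MetaSearchMergeSketch`, the round-robin interleaving, and the simple `normalize` rule (lowercasing and stripping a trailing slash, which is lossy for case-sensitive paths). The `maxRecords` argument corresponds to the user-controlled maximum number of records used for clustering.

```java
import java.util.*;

// Hypothetical sketch: merge ranked result lists from several search engines.
final class MetaSearchMergeSketch {

    // Take results rank by rank across engines (round-robin), skipping URLs
    // that were already seen, until maxRecords results have been collected.
    static List<String> merge(List<List<String>> rankedUrlLists, int maxRecords) {
        Set<String> seen = new HashSet<>();
        List<String> merged = new ArrayList<>();
        int longest = 0;
        for (List<String> list : rankedUrlLists) longest = Math.max(longest, list.size());
        for (int rank = 0; rank < longest && merged.size() < maxRecords; rank++) {
            for (List<String> list : rankedUrlLists) {
                if (rank >= list.size() || merged.size() >= maxRecords) continue;
                String url = list.get(rank);
                // Set.add returns false when the normalized URL is a duplicate.
                if (seen.add(normalize(url))) merged.add(url);
            }
        }
        return merged;
    }

    // Simplified duplicate key: trim, lowercase, drop one trailing slash.
    static String normalize(String url) {
        String u = url.trim().toLowerCase(Locale.ROOT);
        return u.endsWith("/") ? u.substring(0, u.length() - 1) : u;
    }
}
```

A real meta-search engine would likely weight engines and ranks rather than interleave them evenly; the round-robin order is chosen here only because it is easy to follow.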
Keywords/Search Tags: Meta-Search Engine, Chinese Segmentation, Text Clustering, DCFC Clustering