
Design And Implementation Of Meta Search Engine Based On Suffix Array Clustering

Posted on: 2011-11-13
Degree: Master
Type: Thesis
Country: China
Candidate: G D Hu
Full Text: PDF
GTID: 2178360332957238
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, and especially the spread of the Internet, we have entered an age of abundant information, which in turn gave rise to search engines. Traditional search engines return results by keyword, but the result sets are so large that users cannot quickly find what interests them. Because each search engine has its own index database and retrieval technology, results differ from engine to engine, and users tend to rely only on the top-ranked results of each. Meta search was invented to combine the results of individual engines and return them to the user.

Because a search returns many documents, it helps to group them so that users can search and browse more easily, and document clustering algorithms can do this grouping automatically. A growing number of search engines use clustering technology, such as Vivisimo and Carrot2, and they perform well. More meta search engines are appearing as clustering engine technology develops. Clustering engines still have problems, however, such as the accuracy of cluster labels and multi-language support, so further research is needed.

This thesis designs a meta search engine system based on Suffix Array Clustering (SAC), which applies suffix array technology in the clustering engine. Suffix trees and suffix arrays are both excellent data structures for string processing. A suffix array occupies little space; a suffix tree is faster in some cases, but not markedly so. In our suffix array clustering engine, we build two mapping tables that take little space, and they save a great deal of time during cluster selection. Compared with a clustering engine based on suffix trees, our system is therefore at no disadvantage in speed.

In our system, we first segment the words of all documents and remove stop words, and then build the suffix array. Once the suffix array is built, the most important problem is selecting and forming the clusters. We divide cluster selection into two steps: basic selection and filtering based on the Grouper rules. During basic selection, we introduce a threshold to pick out the initially eligible labels and form an initial cluster set. We then filter this set with the Grouper rules to obtain the final cluster set. Processing does not end there: we also handle label semantics, which means merging similar labels. Finally, we return the results to the user.

We also adopt an interactive clustering design. After the clustering results are returned, a user who is interested in one cluster can click its label; we then cluster the documents belonging to that label a second time and place the results under the label as sub-clusters. With this design, the second pass clusters only the documents the user is interested in, which greatly reduces the number of documents to cluster and improves system efficiency.

Our experimental statistics indicate that a meta search engine based on SAC clusters quickly and effectively, but the rationality and readability of its results can be improved further. Directions for improvement include:

1. Although the current meta search engine already uses semantics, it only merges similar labels. One direction is to introduce hierarchical result clusters based on semantics, which would give users better-organized results with clearer layers so that they can find information of interest quickly. Another direction is to improve the clusters further using hypernymy/hyponymy, antonymy, and other important semantic relationships. Both need further research.
2. Introduce Chinese dictionaries and add Chinese semantic handling. At present we can only merge similar English labels, using WordNet; merging similar Chinese labels also needs to be considered.
3. Introduce domain dictionaries, either during word segmentation or during final clustering and filtering. A domain dictionary can greatly help in clustering documents from specialized fields.
4. Rank documents from several search engines together by introducing weights. At present we can only select one engine, such as Google or Yahoo; the system does not support integrated handling of results from different engines.
5. To provide more personalized service, add options to the user interface, such as the choice of member engines, the number of documents, and the number of clusters, so that users can set them as they wish.

As traditional search engine technology matures, it provides ever stronger support for the development of meta search engines. I therefore believe that clustering-based meta search engines will become increasingly popular and will develop greatly in the future.
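
The first stages described in the abstract (stop-word removal, building a suffix array, and threshold-based selection of candidate labels) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the stop-word list, the function names, and the `min_docs` and `max_len` parameters are assumptions made for illustration, the suffix array is built by naive sorting of word-level suffixes, and the two mapping tables, the Grouper-rule filtering, and the WordNet-based label merging are all omitted.

```python
from itertools import groupby

# Illustrative stop-word list (an assumption, not the list used in the thesis).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "to", "is", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_suffix_array(docs_tokens):
    """Return every word-level suffix as a (doc_id, start) pair, sorted
    lexicographically by the words it contains (naive construction)."""
    suffixes = [(d, i) for d, toks in enumerate(docs_tokens) for i in range(len(toks))]
    suffixes.sort(key=lambda s: docs_tokens[s[0]][s[1]:])
    return suffixes


def candidate_clusters(docs, min_docs=2, max_len=3):
    """Map each phrase of up to max_len words to the documents containing it,
    keeping only phrases shared by at least min_docs documents."""
    docs_tokens = [tokenize(d) for d in docs]
    suffix_array = build_suffix_array(docs_tokens)
    clusters = {}
    for k in range(1, max_len + 1):
        # Suffixes shorter than k words cannot contribute a k-word phrase.
        usable = [s for s in suffix_array if len(docs_tokens[s[0]]) - s[1] >= k]

        def phrase_of(s, k=k):
            d, i = s
            return tuple(docs_tokens[d][i:i + k])

        # In a sorted suffix array, suffixes sharing a k-word prefix are adjacent,
        # so one linear pass with groupby collects every occurrence of a phrase.
        for phrase, group in groupby(usable, key=phrase_of):
            doc_ids = {d for d, _ in group}
            if len(doc_ids) >= min_docs:
                clusters[" ".join(phrase)] = sorted(doc_ids)
    return clusters


if __name__ == "__main__":
    docs = [
        "suffix array clustering for search results",
        "clustering search results with a suffix array",
        "meta search engine design",
    ]
    for label, doc_ids in candidate_clusters(docs).items():
        print(label, doc_ids)
```

Under the same assumptions, the interactive second pass could be approximated by calling `candidate_clusters` again on only the documents listed under the label the user clicked.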
Keywords/Search Tags: Suffix array, Clustering, Meta search engine