
Design And Implementation Of Meta Search Engine Based On Suffix Tree Clustering

Posted on: 2018-08-21
Degree: Master
Type: Thesis
Country: China
Candidate: J H Chen
GTID: 2348330515973958
Subject: Computer technology
Abstract/Summary:
The Internet revolution has brought great convenience to people. With the arrival of the big data era, how to obtain information efficiently is receiving more and more attention, and search engine technology is one of the effective tools for this problem. Currently, however, search engine technology still has many shortcomings. Even though there are many commercial search engines to choose from, users often cannot quickly find the information they are interested in. Because different search engines are built on different databases and ranking algorithms, the result sets they return differ greatly. Meta search engine technology was created to improve search coverage and recall.

Traditional search engines have another deficiency. When a user submits a keyword, the result set is often too cluttered to read, especially when the keyword has several meanings, and users spend a lot of time looking for useful information among a large number of results. One solution is to cluster the results, so that a keyword query returns more accurate results and search becomes more efficient. At present, more and more meta search engines are adding result clustering, such as Vivisimo and the open source search engine Carrot2. However, clustering search engine technology is not yet fully mature: classification ability, label readability, and support for Chinese, among other aspects, still need further study.

This thesis analyzes and studies meta search engines and clustering algorithms and, on this basis, designs and implements a meta search engine based on the suffix tree clustering algorithm, using MyEclipse 10 as the main development tool. The main work is as follows:

1. The thesis introduces the working principle of meta search engines and describes how each module of a meta search engine works.
2. It studies short-text clustering, introduces several commonly used clustering algorithms, compares their advantages and disadvantages, and analyzes the suffix tree clustering algorithm in detail.
3. To address the poor readability of the labels produced by the clustering algorithm, the label selection method is first improved so that labels conforming to Chinese usage receive higher scores. Second, after clustering, clusters with the same label are merged to ensure that every label is distinct. Finally, semantic rules are used to filter the labels of all clusters, and only the clusters that pass the filter are returned, which ensures the readability of the labels.

The thesis also analyzes the performance of the meta search engine system. The experimental results show that the system supports Chinese better, that the efficiency and classification ability of the clustering algorithm are satisfactory, that the quality of the cluster labels improves somewhat, and that the number of meaningless labels is significantly reduced.

However, the system still has the following problems to be solved:

1. The system clusters only the titles and abstracts of the retrieved results and does not weight them. In the future, weights could be assigned to the title and the abstract, and more information could be used, such as the first sentence of each paragraph, to strengthen the text features and improve the clustering.
2. The suffix tree clustering implementation works entirely in main memory, so the system cannot handle large amounts of data. In the future, the algorithm could be improved to use both main memory and external storage so that the system can process large data sets. When the number of search results exceeds a certain threshold, clustering could be performed in multiple passes for the user.
3. Few Chinese synonym thesauri are currently available. In the future, the system could add a Chinese synonym thesaurus and use semantic similarity calculation to merge clusters with similar semantics.
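The label post-processing described in the third contribution (merge clusters that share a label, then filter labels by rule) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the actual semantic rules, so the stop-label list `STOP_LABELS`, the `min_docs` threshold, and the label normalization (trimming and lower-casing) are all hypothetical stand-ins, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stop-label list. The thesis's real semantic rules are not
# stated in the abstract, so this is only a placeholder.
STOP_LABELS = {"the", "of", "and", "click here"}


def merge_same_label(clusters):
    """Merge clusters that share a label so that each label appears once.

    `clusters` is a list of (label, set_of_doc_ids) pairs. Labels are
    normalized by trimming and lower-casing (an assumption of this sketch).
    """
    merged = defaultdict(set)
    for label, docs in clusters:
        merged[label.strip().lower()] |= set(docs)
    return dict(merged)


def filter_labels(merged, min_docs=2):
    """Keep only clusters whose label passes a simple, assumed rule:
    non-empty, not a stop label, and covering at least `min_docs` documents."""
    return {
        label: docs
        for label, docs in merged.items()
        if label and label not in STOP_LABELS and len(docs) >= min_docs
    }


def postprocess(clusters, min_docs=2):
    """Run both steps: merge duplicate labels, then filter the result."""
    return filter_labels(merge_same_label(clusters), min_docs)


# Example input: raw clusters as they might come out of a clustering step.
raw = [
    ("Jaguar car", {1, 2}),
    ("jaguar car", {3}),
    ("the", {1, 2, 3, 4}),
    ("big cat", {5}),
]
result = postprocess(raw)
```

In this example the two "jaguar car" clusters are merged into one covering documents 1, 2, and 3, the cluster labeled "the" is dropped as meaningless, and "big cat" is dropped because it covers only one document. A real system would replace the stop-label check with the language-specific semantic rules the thesis describes.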
Keywords/Search Tags: Short-text clustering, Suffix tree, Meta search engine, Label processing