Search Engine Design Analysis and Improvements to Search Result Clustering

Posted on: 2008-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Dong
Full Text: PDF
GTID: 2208360212475348
Subject: Computer software and theory
Abstract/Summary:
With the rapid growth of information on the Web, information is easy to obtain, but useful information is hard to find. Search engines speed up information retrieval to some degree, but general-purpose search engines cannot satisfy needs in specialized fields, so it is necessary to design and develop specialized search engines. The domain knowledge required to design a search engine, however, makes implementation difficult. This thesis describes the design and development of a search engine and related technologies such as search result clustering.

First, the two main parts of information retrieval (IR), indexing and searching, are discussed on the basis of the open source project Lucene, including the index file format, index construction, and the search process. The thesis then analyzes the architecture and components of a search engine and the related technologies: the spider, the web page parser, link analysis, index building, the index file format, the search process, and the data structures used to speed up searching.

Second, a shortcoming of the snippets produced by common search engines is pointed out: they show only the phrases surrounding the keyword and cannot describe the semantic character of the text. To remedy this, a semantic feature extractor is proposed and then implemented.

Suffix arrays are used to extract the feature words. The theory of suffix arrays is introduced, and it is shown that complete substrings can describe the features of a text. A module is then developed that constructs the suffix array and derives the left-complete substrings, the right-complete substrings, and the complete substrings.

Clustering search results is an effective way to enhance a search engine. The clustering process and the differences among clustering methods are discussed. The thesis then analyzes a clustering method based on singular value decomposition and modifies the framework of a common search engine to improve its clustering function. Finally, the experiments are presented.
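The suffix-array step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's module: the function names are hypothetical, the construction is the naive sort-based one, and the repeated substrings it reports (read off from common prefixes of adjacent suffixes) are only a simple stand-in for the complete substrings the thesis defines.

```python
def build_suffix_array(text):
    """Return the start indices of all suffixes of text, in lexicographic order.

    Naive construction: sorts the suffixes directly, which is fine for short
    texts but not for large collections.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the length of the common prefix of suffixes sa[k-1] and sa[k].

    lcp[0] is 0 by convention, since the first suffix has no predecessor.
    """
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_substrings(text, min_len=2):
    """Collect substrings of at least min_len characters that occur more than once.

    Each adjacent pair of suffixes in sorted order contributes its longest
    common prefix, which by construction occurs at least twice in text.
    """
    sa = build_suffix_array(text)
    lcp = lcp_array(text, sa)
    found = set()
    for k in range(1, len(sa)):
        if lcp[k] >= min_len:
            found.add(text[sa[k]:sa[k] + lcp[k]])
    return found
```

For the text "banana", the suffix array is [5, 3, 1, 0, 4, 2] and the repeated substrings of length two or more are "ana" and "na". A feature extractor along the lines the abstract describes would apply further filtering to such candidates, which this sketch does not attempt.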
Keywords/Search Tags: Search engine, feature extraction, text clustering, Lucene, Nutch