
Research And Implementation Of The Domain-Dependent Vertical Search System

Posted on: 2010-08-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z J Wang
Full Text: PDF
GTID: 2178360272970111
Subject: Systems Engineering
Abstract/Summary:
At present, the main Internet search engines, such as Google, Baidu, and Yahoo, provide users with a large amount of information in a horizontal (general-purpose) way. Although a general search engine can satisfy the user's need for massive information, it has difficulty delivering both accuracy and relevance in its search results. Because it attempts to index the whole Web, its coverage of Web pages is low and its indexes become out of date. In particular, a general search engine lacks focus for domain users, whose information needs are relatively concentrated and more detailed. As a branch of search engine research, the vertical search engine collects Web page information from multiple resources in a specific domain and reorganizes it as structured data, so it can provide a more professional and individualized information service for specialized users and satisfy their requests for detailed domain information.

The research work can be divided into two parts. Firstly, this paper studies web spider technology and information extraction technology. It focuses on solving a series of problems for the vertical spider, including defining the domain topic, the search strategy, and the similarity calculation. To define the domain topic, the initial seed websites are designated by domain specialists and the feature keywords are extracted from the related Web pages. Unlike the traditional graph search algorithms used by general search engines, a best-first search strategy is employed to guide the spider to collect the most relevant pages efficiently while searching the Web. The vector space model (VSM) is used to calculate topic relevance, and keywords are assigned different weights depending on the part of the page in which they occur. Another issue in vertical search is Web information extraction.
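As a rough illustration of the relevance scoring and best-first ordering described in this part of the abstract, the sketch below computes a cosine similarity between a topic vector and a page vector whose term counts are weighted by page location, and keeps candidate URLs in a priority queue. It is not the thesis's implementation: the location names and weight values in `LOCATION_WEIGHTS`, the function names, and the `BestFirstFrontier` class are all assumptions made for this example, since the abstract gives no concrete values.

```python
import heapq
import math
from collections import Counter

# Hypothetical location weights: the abstract says keywords get different
# weights in different parts of a page but does not give the values.
LOCATION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def page_vector(parts):
    """Build a weighted term-frequency vector from {location: [terms]}."""
    vec = Counter()
    for location, terms in parts.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term in terms:
            vec[term] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class BestFirstFrontier:
    """Priority queue of URLs; the highest-scoring URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawler built along these lines, each fetched page would be scored against the topic vector with `cosine`, and its outgoing links would be pushed onto the frontier with that score so that links found on the most relevant pages are visited first.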
The existing information extraction technologies are reviewed and analyzed, and the regular expression method is introduced in detail and employed to extract structured data from Web pages.

Secondly, this paper focuses on search results clustering. Based on an analysis of the characteristics of search results clustering and the disadvantages of traditional clustering algorithms, a new Suffix Tree Clustering (STC) method is proposed for clustering Web search results in a Chinese context. The suffix tree in the paper is built over Chinese words. An effective strategy is introduced to solve the problem of cluster description when clusters are merged based on the binary similarity measure, and similar phrase clusters are also merged based on a semantic similarity calculation to improve cluster quality. Experiments show that the proposed STC algorithm performs better in both precision and speed than traditional document clustering algorithms.

Finally, this paper designs and implements a vertical search system for the patent domain. Lucene, an open source search engine library, is used to implement the indexing and searching functions. The Forward Maximum Matching (FMM) method is used to segment Chinese text, and information visualization technology is used to visualize the clustered search results.
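The Forward Maximum Matching method mentioned above can be sketched in a few lines. This is a minimal, generic version of the technique, not the thesis's code: the function name `fmm_segment`, its parameters, and the tiny dictionary in the usage note are assumptions made for illustration. At each position it takes the longest dictionary word that starts there and falls back to a single character when nothing matches.

```python
def fmm_segment(text, dictionary, max_word_len=None):
    """Segment text by Forward Maximum Matching against a word set."""
    if max_word_len is None:
        # Default to the longest word in the dictionary (1 if it is empty).
        max_word_len = max((len(w) for w in dictionary), default=1)
    max_word_len = max(1, max_word_len)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

For example, with the illustrative dictionary `{"专利", "垂直", "搜索", "引擎", "搜索引擎"}`, calling `fmm_segment("专利垂直搜索引擎", dictionary)` yields the words 专利, 垂直, and 搜索引擎, because the longer match 搜索引擎 is preferred over 搜索 followed by 引擎.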
Keywords/Search Tags:Vertical Search, Web Spider, Information Extraction, Suffix Tree Clustering