In recent years, with the rapid development of the Internet, the amount of information available online has expanded rapidly, and helping users find the information they need quickly and accurately has become exceptionally important. Driven by this demand, search engine technology has made great progress: a number of very good search engines have emerged, including several clustering-based search engines. Compared with traditional search engines, which return results as a linear list, the biggest advantage of a clustering-based search engine is that it returns results grouped into clusters, which makes it easier for users to locate the information they need within a mass of results. However, existing clustering-based search engines simply cluster pages by their Web content and ignore the relationship between the user's search terms and the pages. This thesis studies a Web page clustering algorithm based on the user's search terms, applying the ideas of CBC (Clustering By Committee), an algorithm for synonym clustering, to Web page clustering. Working within the vector space model, we improve the CBC clustering algorithm in three respects: the calculation of feature weights, the similarity computation between text vectors, and the determination of cluster centers. In particular, we increase the weight of the features that correspond to the search terms in each text vector, so that the user's search terms are reflected in the Web page clustering results. Experiments show that the improved algorithm is feasible and effective. Finally, based on the proposed clustering algorithm, we design and implement a Chinese Web page clustering system. The system is modular in design and implements the entire process from Web pages to Web page clusters.
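The search-term weighting described above can be illustrated with a minimal sketch. It is not the thesis's actual formula, which the abstract does not give: the smoothed TF-IDF weighting, the multiplicative `boost` factor, and the function names `tfidf_vectors` and `cosine` are all assumptions made for illustration. Pages are assumed to be already tokenized (for Chinese text, already word-segmented).

```python
import math
from collections import Counter


def tfidf_vectors(docs, query_terms, boost=2.0):
    """Build sparse TF-IDF vectors for tokenized docs.

    Weights of terms that appear in `query_terms` are multiplied by
    `boost` (a hypothetical parameter, not taken from the thesis).
    """
    n = len(docs)
    # Document frequency: number of docs containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    query = set(query_terms)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {}
        for term, count in tf.items():
            # Smoothed inverse document frequency (an assumed variant).
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            weight = (count / len(tokens)) * idf
            if term in query:
                weight *= boost  # emphasize the user's search terms
            vec[term] = weight
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    # Toy pages for the hypothetical query "apple".
    pages = [
        ["apple", "fruit", "juice"],
        ["apple", "phone", "screen"],
        ["banana", "fruit", "juice"],
    ]
    plain = tfidf_vectors(pages, [], boost=1.0)
    boosted = tfidf_vectors(pages, ["apple"], boost=3.0)
    print("pages 0 and 1, plain:  ", cosine(plain[0], plain[1]))
    print("pages 0 and 1, boosted:", cosine(boosted[0], boosted[1]))
```

In this toy example, boosting the search term raises the similarity between the two pages that share it and lowers the similarity between pages that overlap only on other terms, which is one way the search terms could influence the similarity values that a clustering algorithm such as CBC consumes.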