
Research and Implementation of a Distributed Grid Resource Search Engine with Clustering Feedback

Posted on: 2015-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Xu
GTID: 2208330431478042
Subject: Computer application technology
Abstract/Summary:
Enterprise-level search engines have emerged in response to the explosive growth of enterprise information and the widespread demand for information sharing. Enterprise search supports business decision-making and operations, which makes it quite different from Internet search, so both recall and precision must be ensured. Enterprise search faces three problems. First, for TB-level enterprise data, an existing centralized search engine server cannot store and manage the index effectively, so multiple distributed servers are required; moreover, when many indexing and retrieval tasks run concurrently, search engine performance declines sharply, and multiple servers must share the work to maintain efficiency. Second, unstructured data, which accounts for more than 80% of total enterprise information resources, keeps growing and carries important information about enterprise development, so solving the retrieval problem for unstructured data is extremely important for companies. Third, the demand for search accuracy is rising, yet most search engines return many records that are not grouped by subject. It is difficult for users to find the information they need quickly and accurately in a linearly arranged result list. Clustering feedback on search results can help users locate information to a certain degree. With the development of information technology at Power Grid Corporation, enterprise staff urgently need a grid resource search engine to find data. Combining distributed computing technology with search engine technology, this thesis designs and implements a distributed grid resource search engine that can handle massive data, support highly concurrent tasks, and respond quickly.
In addition, text information extraction and text clustering techniques are used to display clustered search results, making it easy for users to locate documents quickly and accurately. The main contents of this thesis are: (1) Based on a professional grid vocabulary, Chinese word segmentation is implemented with IKAnalyzer. Combining full-text search technology with distributed computing, the indexing and retrieval subsystems of the search engine are modeled and analyzed. (2) The K-means clustering algorithm is optimized with respect to determining the initial cluster centers and selecting the value of K. The initial cluster centers are selected on a farthest-distance basis; clustering is run for each candidate K value, and the total variance of the clustering results is evaluated to determine K. Testing shows that the improved algorithm achieves a good clustering effect for adaptive clustering of text collections. (3) The overall architecture and the important modules of the grid repository search engine are designed. The SolrCloud distributed search engine is implemented in detail with Solr and Zookeeper. A load balancing strategy coordinates the active nodes of the distributed search engine servers. The servers use distributed indexing and search strategies to index massive data in parallel and to support a large number of concurrent users performing searches. (4) The distributed search engine is deployed, its indexing and retrieval performance is tested, and its retrieval and clustering feedback capabilities are demonstrated with search examples.
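The K-means improvement in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`farthest_first_centers`, `kmeans`, `choose_k`), the `min_gain` threshold, and the elbow-style stopping rule are assumptions made for illustration, since the abstract says only that the total variance is evaluated across candidate K values. Small 2-D tuples stand in for document vectors.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centers(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Standard Lloyd iterations seeded with farthest-first centers.
    Returns (labels, centers, total within-cluster squared error)."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return labels, centers, sse

def choose_k(points, k_max, min_gain=0.1):
    """Assumed elbow rule: return the smallest k for which moving to k + 1
    reduces the total squared error by less than min_gain of the k = 1 error."""
    sse = {k: kmeans(points, k)[2] for k in range(1, k_max + 1)}
    base = sse[1] or 1.0
    for k in range(1, k_max):
        if (sse[k] - sse[k + 1]) / base < min_gain:
            return k
    return k_max

# Hypothetical data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]
print(choose_k(points, 5))  # 3
```

Because the total squared error never increases as K grows, some stopping rule is needed to pick K from it; the relative-gain threshold used here is one common choice and may differ from the criterion used in the thesis.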
Keywords/Search Tags:search engine, distributed, Solr, K-means, clustering feedback, enterprise, grid