Design And Implementation Of A Clustering Search Engine For Open Source Communities

Posted on: 2013-09-28  Degree: Master  Type: Thesis
Country: China  Candidate: Y Cao  Full Text: PDF
GTID: 2268330392473882  Subject: Computer technology
Abstract/Summary:
Open-source software improves the efficiency and quality of software development, and its use has become an important trend in software engineering. With the rapid development and wide application of open-source software, a growing number of open-source communities on the internet support open-source software development and sharing. These communities now host open-source software in great variety and quantity, so finding and selecting useful software in this sea of information is a major challenge. Automatically collecting and mining data from open-source communities, clustering the search results, and providing users with a search service oriented toward open-source software are therefore regarded as important directions for both research and practice.

This paper analyzes related work on search engines that use clustering methods. Based on the distribution and data characteristics of open-source software, a search system named Influx is designed. It downloads information from open-source communities with a web crawler, extracts and searches project attributes, and analyzes the search results with a clustering algorithm. Influx effectively supports clustered search across open-source communities. The main work of this paper includes:

First, this paper compares and analyzes related technologies for search engines and cluster search. Based on the special requirements of a search system for open-source communities, a framework for the cluster search system Influx is proposed, which consists of data storage, data searching, data analysis, and data access layers. The proposed Influx system has good scalability.

Next, this paper designs the data searching and cluster analysis mechanisms for the open-source cluster search system. By integrating the Heritrix and Lucene platforms, the system crawls open-source software information, extracts it, and indexes project attributes. Based on the principle of the K-means algorithm, an improved K-means clustering algorithm is proposed to classify the search results and provide a user interface for selective browsing.

Finally, the proposed open-source oriented search system Influx was implemented, and experiments were performed to validate the system's mechanisms and capabilities. The experimental results indicate that the Influx system can effectively support open-source software search and cluster analysis across communities on the internet.
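As a rough illustration of how search results might be grouped, the sketch below runs plain K-means over TF-IDF vectors built from short project descriptions. It is not the improved variant proposed in the thesis, whose details the abstract does not give. The function names, the whitespace tokenizer, the first-k centroid initialization, and the sample descriptions are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Convert each text into a unit-length dense TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty descriptions
        vec = [
            (counts[term] / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term in vocab
        ]
        norm = math.sqrt(sum(w * w for w in vec)) or 1.0
        vectors.append([w / norm for w in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain K-means; the first k points seed the centroids (an assumption)."""
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]
    assignment = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: attach each vector to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no vector changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def cluster_results(descriptions, k):
    """Return one cluster label per search-result description."""
    return kmeans(tfidf_vectors(descriptions), k)


# Hypothetical search results, invented for this example.
results = [
    "java web crawler crawler",
    "python machine learning library",
    "java web crawler spider",
    "python machine learning toolkit",
]
labels = cluster_results(results, k=2)
print(labels)  # [0, 1, 0, 1]
```

Seeding the centroids with the first k results keeps the example deterministic. A production system would more likely use randomized or K-means++ seeding and a proper text analyzer.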
Keywords/Search Tags: open-source software, cluster search, cluster algorithm, web crawler, Heritrix, Lucene