The Research And Implementation Of Enterprise Search Engine Based On Nutch

Posted on: 2012-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: B Chen
Full Text: PDF
GTID: 2218330362457689
Subject: Computer technology
Abstract/Summary:
With the development of information technology, the volume of information inside a modern enterprise is growing explosively. This volume makes it difficult to find useful information and lowers employee efficiency, so searching an enterprise's internal information has become a hot topic. Traditional enterprise search engines often use the B/S (browser/server) architecture. Because of its low scalability, this architecture hits a bottleneck of limited computing power, storage, and network bandwidth once the enterprise data grows beyond its capacity. Based on a detailed study of the open source search engine Nutch and its related technologies, a complete enterprise search engine with a distributed processing architecture was designed. According to the characteristics and update patterns of the data sources, three crawlers were designed to crawl document, database, and website data. In this system, the collecting, indexing, and searching subsystems all work in a distributed manner. The collecting module uses the MapReduce programming model to crawl data and stores the analyzed data in the original database; the indexing module reads data from the original database and creates an index database; the search module returns results by searching the index database. All of the subsystems communicate with each other through the distributed file system HDFS. Tests show that the system successfully completes real-time indexing of the different data sources in a distributed processing environment and achieves the intended goal.
Keywords/Search Tags:Nutch, Enterprise Search, Distributed Processing, Distributed Crawlers
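
The flow described in the abstract (collected data in an original database, an index built from it, and a search module querying that index) can be illustrated with a minimal single-process sketch. This is not the thesis implementation and uses no Nutch, Hadoop, or HDFS APIs; the sample documents and the names ORIGINAL_DB, map_phase, shuffle, reduce_phase, build_index, and search are hypothetical and only mimic the map and reduce steps of an inverted-index build.

```python
from collections import defaultdict

# Hypothetical stand-in for the "original database" filled by the collecting stage.
ORIGINAL_DB = {
    "doc1": "nutch enterprise search",
    "doc2": "distributed search on hdfs",
    "doc3": "enterprise data on hdfs",
}


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group emitted doc ids by term, as a MapReduce framework does between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: produce the sorted posting list for one term."""
    return term, sorted(set(doc_ids))


def build_index(original_db):
    """Indexing stage: read the original database and create the index database."""
    pairs = []
    for doc_id, text in original_db.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def search(index, query):
    """Search stage: return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


index = build_index(ORIGINAL_DB)
print(search(index, "enterprise search"))
print(search(index, "hdfs"))
```

In a real distributed deployment the map and reduce functions would run on separate nodes and exchange data through a shared file system; here everything runs in one process so that the data flow between the three stages is easy to follow.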