The Research And Development Of Distributed Web Text Retrieval System Based On Hadoop

Posted on: 2014-02-06
Degree: Master
Type: Thesis
Country: China
Candidate: D N R H M J Mai
GTID: 2248330398967118
Subject: Computer application technology
Abstract/Summary:
This thesis presents the research and development of a Nutch-based distributed text retrieval system running on Hadoop. The retrieval system runs on a cluster of several PCs. Data processing uses the MapReduce distributed programming model, and data storage uses the distributed file system HDFS. The system's modules exchange data through HDFS: the data collection module stores the parsed original data in the collection database on HDFS; the indexing module reads the collected data from HDFS and stores the inverted index it builds in the index database on HDFS; and the retrieval module searches the index on HDFS and returns results to the user. Together, the data collection, indexing, and retrieval modules serve the user's search requests.

After confirming that Nutch runs normally on Hadoop, secondary development was carried out on Nutch, giving a preliminary solution to the problems of Uyghur text input and input switching, page layout, writing direction, and fonts. The result is a Hadoop-based distributed Web text retrieval system that supports Uyghur text. To guarantee reliability, the design has no single point of failure, so the failure of a single server does not affect the use of the whole system. The system also applies backup measures at different levels, backing up all data to varying degrees to improve data security.

The specific work done in the course of design and research is as follows:
1. Analyzed the Hadoop open-source cloud computing platform and the technology relevant to the Nutch search engine, including its characteristics and working principles;
2. Built a Hadoop platform with three nodes;
3. Installed and configured the Nutch open-source search engine on the Hadoop platform;
4. Carried out secondary development on Nutch, based on key techniques for handling Uyghur text features.
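
The indexing and retrieval steps described in the abstract (building an inverted index with the MapReduce model, then searching it) can be illustrated with a small single-process sketch. This is not the thesis's code and does not use the Hadoop or Nutch APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_inverted_index`, `search`) and the whitespace tokenizer are assumptions made for illustration only. In a real deployment the map and reduce steps would run as distributed tasks, with input and output stored on HDFS.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document.

    Tokenization here is a plain whitespace split (an assumption for this
    sketch); a real system would use a language-aware analyzer.
    """
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step: turn the grouped document ids into a sorted posting list."""
    return term, sorted(set(doc_ids))


def build_inverted_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of {doc_id: text}."""
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    # Hypothetical sample documents, used only to exercise the sketch.
    docs = {
        "d1": "hadoop stores data in hdfs",
        "d2": "nutch runs on hadoop",
        "d3": "uyghur text retrieval",
    }
    index = build_inverted_index(docs)
    print(index["hadoop"])
    print(search(index, "nutch hadoop"))
```

The sketch keeps the three stages separate on purpose: only `map_phase` and `reduce_phase` contain application logic, which mirrors the division of work in the MapReduce model, where grouping by key is handled by the framework.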
Keywords/Search Tags: Cloud computing, Uyghur text retrieval, Hadoop technology, Nutch search system