
OA Journals Resource Discovery And Acquisition Methods Based On Hadoop

Posted on: 2014-11-30
Degree: Master
Type: Thesis
Country: China
Candidate: B R Du
Full Text: PDF
GTID: 2268330392964385
Subject: Computer application technology
Abstract/Summary:
OA journals are usually classified as deep Web resources, so traditional search engines cannot index them effectively. An effective way to solve this problem is to build an integration platform for online journals that provides a unified, transparent retrieval service interface. In this solution, discovering and acquiring OA journals are two important aspects of integrating OA journal paper resources. Previous work has shown that distributed storage and parallel processing are both useful for processing OA journals. We therefore adopt Hadoop's distributed file system (HDFS) and its parallel programming model (MapReduce) to handle massive amounts of information. In this paper, we implement the discovery and acquisition of OA journals based on Hadoop, and study the problem in the following aspects.

First, we design an acquisition system for OA journals to support the indexing of deep Web resources. We design the overall framework, module structure, and processing flow of the system, and describe the function and workflow of each module in detail.

Second, based on how paper resources are accessed within an OA journal site, we propose a paper resource discovery method oriented to OA journal sites. We build a C4.5 decision tree from features extracted from the home page of each OA journal site and classify the sites into two types: Vol directory and retrieval interface. We then propose a paper resource discovery algorithm for each of the two types and build the paper information resource database file.

Third, in order to construct the metadata repository of OA journals, this paper proposes a paper acquisition method for OA journal sites.
First, it obtains the download information and relevant parameters of the PDF papers by analyzing the paper information resource database file. Then it uses the HTTP protocol to download the PDF papers, compresses multiple PDF files into Hadoop SequenceFiles, and uploads them to HDFS.

Finally, we implement a Hadoop-based prototype system for OA journal resource acquisition and use it to verify the methods experimentally.
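
As a rough illustration of the C4.5 step described above, the sketch below computes the gain ratio, which is the criterion C4.5 uses to choose a split attribute, and applies it to a toy set of journal home pages. The feature names (`has_volume_links`, `has_search_form`) and the four sample sites are hypothetical; the abstract does not state which home page features were actually extracted, so this should be read as one possible setup rather than the method used in the thesis.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """C4.5 gain ratio of splitting `rows` on a discrete `feature`."""
    total = len(labels)
    # Group the class labels by the value the feature takes in each row.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Information gain = entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises features with many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


def best_feature(rows, labels, features):
    """Return the feature with the highest gain ratio."""
    return max(features, key=lambda f: gain_ratio(rows, labels, f))


# Hypothetical home page features for four OA journal sites (assumed data).
sites = [
    {"has_volume_links": True, "has_search_form": False},
    {"has_volume_links": True, "has_search_form": True},
    {"has_volume_links": False, "has_search_form": True},
    {"has_volume_links": False, "has_search_form": True},
]
site_types = ["vol_directory", "vol_directory",
              "retrieval_interface", "retrieval_interface"]

root = best_feature(sites, site_types, ["has_volume_links", "has_search_form"])
print("root split:", root)
```

A full C4.5 tree would apply `best_feature` recursively to each resulting subset and would also handle continuous attributes and pruning, which this sketch leaves out.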
Keywords/Search Tags: Open access, OA journal site, paper resource discovery, C4.5 decision tree, Vol directory, retrieval interface