OA Journals Resource Discovery And Acquisition Methods Based On Hadoop

Posted on:2014-11-30

Degree:Master

Type:Thesis

Country:China

Candidate:B R Du

Full Text:PDF

GTID:2268330392964385

Subject:Computer application technology

Abstract/Summary:

OA journals are usually classified as deep Web resources, which traditional search engines cannot index effectively. An effective way to solve this problem is to build an integration platform for online journals that provides a unified, transparent retrieval interface. In such a platform, discovering and acquiring OA journals are two key steps in integrating OA journal paper resources.

Previous work has shown that distributed storage and parallel processing are useful for handling OA journals. We therefore adopt the Hadoop Distributed File System (HDFS) and the MapReduce parallel programming model to handle massive amounts of information. In this thesis, we implement the discovery and acquisition of OA journals on Hadoop and study the problem from the following aspects.

Firstly, we design an acquisition system for OA journals to support the indexing of deep Web resources. We design its overall framework, modules, and process, and detail the function and workflow of each module.

Secondly, according to how paper resources are accessed within an OA journal site, we propose a paper resource discovery method oriented to OA journal sites. We build a C4.5 decision tree from features extracted from the home page of an OA journal site and classify the site as either Vol directory type or retrieval interface type. We then propose a paper resource discovery algorithm for each of the two site types and build the paper information resource database file.

Thirdly, to construct the metadata repository of OA journals, we propose a paper acquisition method for OA journal sites. It first obtains the download information and relevant parameters of the PDF papers by analyzing the paper information resource database file. It then downloads the PDF papers over HTTP, compresses multiple PDF files into Hadoop SequenceFiles, and uploads them to HDFS.

Finally, we implement a Hadoop-based prototype system for OA journal resource acquisition and verify the approach experimentally with the prototype.
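
The site classification step mentioned above relies on a C4.5 decision tree, which selects each split by gain ratio (information gain divided by split information). The following is a minimal sketch of that criterion only, not the thesis's implementation. The feature names `has_search_form` and `has_volume_links`, the class labels, and the six sample rows are hypothetical and chosen purely for illustration; the abstract does not list the actual home-page features.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, labels, feature):
    """C4.5 gain ratio of a categorical feature: information gain / split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Weighted entropy that remains after splitting on the feature.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features that fragment the data into many values.
    split_info = -sum(
        len(p) / total * math.log2(len(p) / total) for p in partitions.values()
    )
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical home-page features of six OA journal sites (illustrative data only).
rows = [
    {"has_search_form": True, "has_volume_links": False},
    {"has_search_form": True, "has_volume_links": False},
    {"has_search_form": False, "has_volume_links": True},
    {"has_search_form": False, "has_volume_links": True},
    {"has_search_form": True, "has_volume_links": True},
    {"has_search_form": False, "has_volume_links": True},
]
# Hypothetical labels: "retrieval" = retrieval interface type, "vol" = Vol directory type.
labels = ["retrieval", "retrieval", "vol", "vol", "vol", "vol"]

# The feature with the highest gain ratio would become the root split.
best = max(rows[0], key=lambda f: gain_ratio(rows, labels, f))
print(best, {f: round(gain_ratio(rows, labels, f), 3) for f in rows[0]})
```

A full C4.5 tree would apply this selection recursively to each partition and add handling for continuous features and pruning, which this sketch omits.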

Keywords/Search Tags:

Open access, OA journal site, paper resource discovery, C4.5 decision tree, Vol directory, retrieval interface

Related items

1	Research On Automatic Acquisition Method For Open Access Journal Papers
2	Research On The Unified Platform For Access And Full Text Retrieval In Open Access Journals
3	Research On And Implementation Of Accessing A Network And Discovery Of Resource Directory For Iot Node
4	Study On Vehicles To Implement Open Access And The Effect Of Open Access
5	Open Access: One New Model Of Academic Information Communication
6	Open Access Research In Japan
7	Research On Automatic Search Method For Open Access Journal Websites
8	Study On The Status Of Implementing Open Access In Chinese Academic Journal Press
9	A Comparative Study Of Chinese And American Open Access Journals
10	Research On Resource Discovery And Location Technology For Constrained Applications