OA Journal Site Resource Extraction And Storage Methods

Posted on: 2015-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhang
Full Text: PDF
GTID: 2298330422470668
Subject: Computer technology
Abstract/Summary:
With the growth of OA (Open Access) journals on the Internet, the "OA island" problem has become more evident and restricts the effective use of OA resources. One way to solve this problem is online integration. For this approach, how to effectively extract OA journal resources from the Internet and how to store them are two of the core issues. Based on a comprehensive analysis of domestic and foreign research, the paper conducts in-depth research on the extraction and storage of OA journal resources.

Firstly, the paper introduces the general concepts and methods of Web information extraction, as well as the architecture of the Hadoop distributed file system and distributed computing framework, how they work, and how to use them.

Secondly, traditional OA journal crawlers lack knowledge of site and page structure, so their extraction of OA journal site resources is poor in both comprehensiveness and accuracy. To address these and other defects, the paper presents a resource extraction method for OA journal sites together with two template generation methods for template-based extraction, and on these foundations proposes a resource extraction framework for OA journal sites.

Thirdly, because Hadoop does not store OA resources well, the paper presents a merge-based storage method and a multiple-index method. The storage method controls the number of files in HDFS by combining small files, which reduces the memory used on the NameNode, the key node of the cluster. Indexes can be built on different attributes, which improves query speed.

Finally, the paper tests and analyzes the methods and looks ahead to further research.
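The merge-based storage and multiple-index idea described above can be illustrated with a minimal in-memory sketch. It is not the thesis's implementation and does not use any Hadoop API: the class `MergedStore`, its methods, and the attribute names `journal` and `year` are all hypothetical, and a `bytearray` stands in for a single merged file in HDFS. The sketch only shows the principle that many small payloads become one container, while per-attribute indexes map values back to positions inside it.

```python
from collections import defaultdict


class MergedStore:
    """Hypothetical stand-in for one merged file plus per-attribute indexes."""

    def __init__(self):
        self.blob = bytearray()   # all small payloads, concatenated
        self.entries = []         # entry id -> (offset, length, metadata)
        # attribute name -> attribute value -> list of entry ids
        self.indexes = defaultdict(lambda: defaultdict(list))

    def add(self, payload: bytes, **metadata) -> int:
        """Append a small payload and register it in every attribute index."""
        entry_id = len(self.entries)
        self.entries.append((len(self.blob), len(payload), metadata))
        self.blob.extend(payload)
        for attr, value in metadata.items():
            self.indexes[attr][value].append(entry_id)
        return entry_id

    def read(self, entry_id: int) -> bytes:
        """Recover one payload from the merged blob by offset and length."""
        offset, length, _ = self.entries[entry_id]
        return bytes(self.blob[offset:offset + length])

    def find(self, attr: str, value) -> list:
        """Look up payloads through the index built on the given attribute."""
        ids = self.indexes.get(attr, {}).get(value, [])
        return [self.read(i) for i in ids]


store = MergedStore()
first = store.add(b"paper one", journal="J1", year=2014)
store.add(b"paper two", journal="J1", year=2015)
print(store.read(first))
print(store.find("year", 2015))
```

In this sketch the file system would see one object instead of one per article, which is the effect the abstract attributes to merging; a query on any indexed attribute avoids scanning the whole container.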
Keywords/Search Tags: OA resource, Extract, Small files, Hadoop, Index