Cloud-Based Service for Access Optimization to Textual Big Data

Posted on: 2019-01-16
Degree: Ph.D.
Type: Dissertation
University: Indiana University
Candidate: Peng, Zong
Full Text: PDF
GTID: 1478390017488929
Subject: Computer Science
Abstract/Summary:
Libraries are increasingly amassing large digitized textual corpora. Digitized volumes (i.e., books) are converted into page-level searchable text, enabling discovery and computational analysis at the page or even phrase level, which was not possible in the days of the card catalog. Efficiently storing, searching, and mining this mass of information adds value to the corpus but also brings challenges. Our research is carried out in the context of the HathiTrust Research Center (HTRC), a platform that services over 5 billion pages of digitized content from the HathiTrust Digital Library. This big data challenge is further constrained by the collection containing both public domain and in-copyright works.

A data management environment in support of a big data textual collection is a collection of heterogeneous services that work together seamlessly. It is provided to the community as a cloud-based service. This dissertation focuses on optimizing this heterogeneous (i.e., polyglot) environment.

The contributions are threefold. First, efficiently storing, searching, and mining the mass of information imposes heavier workloads on the search tools than GUI-based (web browser) access does, resulting in performance interference between users. An approach that guarantees a fair share of resource utilization in the search service is proposed. Results show that, under constrained circumstances, the proposed fair-share techniques give better performance.

Second, data management in a polyglot setting of HTRC data for analytical access is complicated by the sensitive nature of the majority of digitized volumes. This dissertation identifies and evaluates solutions for storing and accessing a digitized collection that is heterogeneously restricted. Various data storage systems are analyzed to identify the optimal option using quantitative performance evaluation and qualitative study.

Finally, digitized volumes and their metadata are non-static: volumes are added and metadata records are updated. The effect is that researchers see inconsistent analytical results. To allow reproduction and citation of a data set used in a previous research experiment, we propose a query-based solution with a versioning mechanism that ensures data sets can be reconstructed dynamically. The analysis evaluates its cost and trade-offs when used with different version management approaches.
Keywords/Search Tags:Textual, Digitized, Page level, Data, Service, Access