The Conversation Corpus Management System Based On Spark

Posted on:2021-04-12

Degree:Master

Type:Thesis

Country:China

Candidate:S Wang

Full Text:PDF

GTID:2428330620461339

Subject:Engineering

Abstract/Summary:

In recent years, with the rapid development of computer technology, corpora have played an important role as a tool of linguistic research, promoting the study of Chinese, English, and other languages. Corpus construction has also attracted growing attention from international and domestic researchers. A corpus is a large-scale collection of transcripts that is structured, representative, and retrievable by computer programs. Corpora of different scales and types influence linguistic research differently, and as corpus processing develops, their scope of application keeps widening. Taking dialogue as the research object and building a corresponding dialogue corpus helps express the grammatical rules of a language more formally and computationally. This thesis focuses on the design of a corpus management system for dialogue corpora and studies how such corpora are stored and queried. A dialogue corpus has a definite structure, so it can be represented as XML documents and stored in a distributed manner with the Spark computing framework. The main contents of this thesis are as follows: (1) A Spark-based dialogue corpus management system, comprising a storage module and a query module, was designed and implemented. Users can upload corpus data to build corpora according to their own requirements, and can also run aggregate queries on them. (2) The storage of dialogue corpora was studied in light of their linguistic structure. XML is used to store dialogue corpora: data uploaded by users is transformed into XML documents and stored in a big data environment. (3) For corpora at big-data scale, the response time of conventional XPath or XQuery queries over XML documents grows longer and longer. To improve query efficiency, Spark is used to distribute queries over the XML documents, with XPath or XQuery expressing the queries on the corpus.
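
Points (2) and (3) above can be illustrated with a minimal sketch. It is not the system described in the thesis: the element names (`dialogue`, `turn`), the attributes (`id`, `n`, `speaker`), and all function names are hypothetical, and Spark itself is not used. Instead, the per-partition query and the merge step are imitated with the Python standard library, whose `xml.etree.ElementTree` module supports only a limited subset of XPath. In a Spark job, the per-partition step would roughly correspond to a `flatMap` over an RDD of XML documents.

```python
import xml.etree.ElementTree as ET


def dialogue_to_xml(dialogue_id, turns):
    """Serialize one dialogue, given as (speaker, text) pairs, to an XML string.

    The <dialogue>/<turn> layout is an assumption made for this sketch.
    """
    root = ET.Element("dialogue", id=dialogue_id)
    for index, (speaker, text) in enumerate(turns, start=1):
        turn = ET.SubElement(root, "turn", n=str(index), speaker=speaker)
        turn.text = text
    return ET.tostring(root, encoding="unicode")


def query_partition(xml_docs, path):
    """Evaluate the path expression on every document of one partition."""
    hits = []
    for doc in xml_docs:
        root = ET.fromstring(doc)
        hits.extend(node.text for node in root.findall(path))
    return hits


def distributed_query(partitions, path):
    """Query each partition independently, then merge the partial results.

    A plain map() stands in for the parallel execution a cluster would provide.
    """
    merged = []
    for partial in map(lambda part: query_partition(part, path), partitions):
        merged.extend(partial)
    return merged


# Three small dialogues, split into two hypothetical partitions.
docs = [
    dialogue_to_xml("d1", [("A", "Hello"), ("B", "Hi, how are you?")]),
    dialogue_to_xml("d2", [("A", "Where is the station?"), ("B", "Turn left.")]),
    dialogue_to_xml("d3", [("B", "Good morning"), ("A", "Morning")]),
]
partitions = [docs[:2], docs[2:]]

# All utterances by speaker A, collected across partitions.
speaker_a = distributed_query(partitions, ".//turn[@speaker='A']")
print(speaker_a)
```

Because each partition is queried without reference to the others, the work can be spread over as many workers as there are partitions, which is the property the abstract relies on to keep response times down as the corpus grows.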

Keywords/Search Tags:

Dialogue corpus, Spark computing framework, Distributed query, XML

Related items

1	An Ad-hoc Query Engine Based On Spark SQL
2	Graph Reachability Distributed Computing And Application Based On Spark
3	A Distributed Computing Framework to Manage, Query, and Analyze Big Geospatial Data for Urban Studies - Case Studies with Urban Heat Island and Tourist Movement Pattern Minin
4	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
5	Research Of Query Processing Technology For Geospatial Big Data Based On Spark
6	Parallel Research On Data Mining Algorithm Based On YARN And Spark Framework
7	The Optimal Sequenced Route Query Based On Distributed Systems
8	A System For Distributed MD Data Analysis Based On Spark
9	Design And Implementation Of Forum Data Analysis Platform Based On SPARK
10	Research On Optimizing Top-k Join Queries Based On Spark