
The Conversation Corpus Management System Based On Spark

Posted on: 2021-04-12 | Degree: Master | Type: Thesis
Country: China | Candidate: S Wang | Full Text: PDF
GTID: 2428330620461339 | Subject: Engineering
Abstract/Summary:
In recent years, with the rapid development of computer technology, corpora have played an important role as a tool for linguistic research on Chinese, English, and other languages. Corpus construction has also attracted growing attention from international and domestic researchers. A corpus is a large-scale collection of transcripts that is structured, representative, and retrievable by computer programs. Corpora of different scales and types influence linguistic research in different ways, and as corpus processing develops, their range of application keeps widening. Taking dialogue as the object of study and building a corresponding dialogue corpus helps researchers express the grammatical rules of a language more formally and computationally. This thesis focuses on the design of a management system for dialogue corpora and studies how such corpora are stored and queried. A dialogue corpus has a regular structure, so it can be stored as XML documents and distributed using the Spark computing framework. The main contributions of this thesis are as follows:
(1) A Spark-based dialogue corpus management system is designed and implemented, comprising a storage module and a query module. Users can upload material to build a corpus according to their own requirements, and they can run aggregate queries over the corpus.
(2) The storage of dialogue corpora is studied in light of the structural characteristics of dialogue. XML is used as the storage format: data uploaded by users is converted into XML documents and stored in the big data environment.
(3) For a corpus at big-data scale, the response time of plain XPath or XQuery queries over the XML documents grows longer and longer as the corpus grows. To improve query efficiency, Spark is used to distribute queries over the XML documents, with XPath or XQuery expressing the corpus queries themselves.
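The storage and query ideas summarized above can be illustrated with a minimal sketch. It is not the thesis's implementation: the XML layout (`dialogue` and `turn` elements with `id` and `speaker` attributes) and the function names `dialogue_to_xml`, `count_turns_by_speaker`, and `aggregate` are all hypothetical, and plain Python `map` and `reduce` stand in for a distributed engine. The shape is the relevant part: each document is queried independently with an XPath expression, and the partial results are then merged into one aggregate.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from functools import reduce


def dialogue_to_xml(dialogue_id, turns):
    """Serialize one dialogue, given as (speaker, utterance) pairs, to an XML string.

    The element and attribute names are assumptions made for this sketch.
    """
    root = ET.Element("dialogue", {"id": dialogue_id})
    for speaker, utterance in turns:
        turn = ET.SubElement(root, "turn", {"speaker": speaker})
        turn.text = utterance
    return ET.tostring(root, encoding="unicode")


def count_turns_by_speaker(xml_doc, path=".//turn"):
    """Map step: run an XPath query on one document and count matches per speaker."""
    root = ET.fromstring(xml_doc)
    return Counter(node.get("speaker") for node in root.findall(path))


def aggregate(xml_docs, path=".//turn"):
    """Reduce step: merge the per-document counts into one corpus-wide count."""
    partials = map(lambda doc: count_turns_by_speaker(doc, path), xml_docs)
    return reduce(lambda a, b: a + b, partials, Counter())


# Two small uploaded dialogues, converted to XML documents.
corpus = [
    dialogue_to_xml("d1", [("A", "Hello."), ("B", "Hi, how are you?"), ("A", "Fine.")]),
    dialogue_to_xml("d2", [("B", "Are you there?"), ("C", "Yes.")]),
]

# Aggregate query over the whole corpus: number of turns per speaker.
totals = aggregate(corpus)

# The same aggregate restricted by an XPath attribute predicate.
only_a = aggregate(corpus, ".//turn[@speaker='A']")

print(dict(totals))
print(dict(only_a))
```

Because each document is processed without reference to the others, the map step is the part that a framework such as Spark could spread across worker nodes, which is why response time need not grow with a single machine's workload. Note that `xml.etree.ElementTree` supports only a subset of XPath; full XPath or XQuery would need a dedicated engine.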
Keywords/Search Tags:Dialogue corpus, Spark computing framework, Distributed query, XML