
Research on Natural Language to SQL Methods for Complex Queries

Posted on: 2023-09-23 | Degree: Master | Type: Thesis
Country: China | Candidate: Z K Deng | Full Text: PDF
GTID: 2558307061951029 | Subject: Computer technology
Abstract/Summary:
NL2SQL aims to convert natural language questions into corresponding SQL queries. It has been one of the hot topics in natural language processing in recent years and has important application prospects in data retrieval, question answering, and other scenarios. However, as database schemas grow more complex and natural language expressions more diverse, NL2SQL research for complex questions still faces three challenges: 1) for cross-domain database schemas, it is difficult to generalize from the database schemas seen during training to unseen ones; 2) in schema linking, it is difficult to identify the mentions in a natural language question that correspond to elements of its SQL query; 3) in schema modeling, it is difficult to encode natural language questions (unstructured text) and database schemas (structured data) efficiently because of their structural differences.

In response to these problems, this paper first proposes a table pre-training framework based on intermediate representation completion. By learning schema linking information from a large amount of data in the pre-training stage, it improves the schema linking and generalization ability of downstream models. Then, through syntactic analysis, natural language questions and database schemas are uniformly represented as graphs to address the schema modeling problem. The main work of this paper includes:

1) A mask-based table pre-training framework, IRC, is proposed. The framework consists of two stages: data generation and model training. In the data generation stage, it constructs a large amount of cross-domain data based on a context-free grammar and fine-tunes a pre-trained language model to obtain a table pre-training model, which enhances the generalization ability of the downstream NL2SQL model and alleviates the cross-domain problem. In the model training stage, the framework introduces intermediate representation completion and schema linking pre-training tasks to address the schema linking problem. By completing the intermediate representation of a masked SQL query, the model learns the mapping among natural language questions, database schemas, and intermediate representations; by identifying the correlation between natural language questions and database schemas, its schema linking ability is further enhanced. Experimental results show that the proposed table pre-training model outperforms traditional pre-trained language models on the Spider and DuSQL datasets, and, with RAT-SQL as the downstream model, exceeds the mainstream table pre-training model GRAPPA by 0.3% and 1.0%, respectively.

2) An NL2SQL model based on annotation decomposition, ACQD, is proposed. The model consists of two modules: natural language question decomposition and parsing. In the question decomposition module, the model initializes natural language embeddings with IRC and iteratively decomposes a complex question into multiple simple questions through split prediction, text span prediction, modified word identification, and relation classification, which reduces the difficulty of encoding complex questions and captures the hierarchical structure of complex semantics that existing models ignore. In the parsing module, the model represents the natural language question and the database schema as graphs through syntactic analysis and uses the linking information between them to connect the question graph and the schema graph, addressing the schema modeling problem. In addition, a schema dependency learning subtask learns the dependencies between natural language questions and database schemas, which further enhances the model's schema linking ability. Experimental results show that the proposed model not only surpasses traditional NL2SQL models on the Spider and DuSQL datasets, but also surpasses comparable grammar-decoding NL2SQL models, approaching or exceeding the mainstream GRAPPA-enhanced SmBoP model (−0.6% and +0.3%, respectively).
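To make the IRC pre-training idea above more concrete, the following is a minimal, self-contained Python sketch of how cross-domain training data could be generated from a toy context-free grammar and how intermediate-representation tokens could be masked for a completion objective. The grammar rules, IR token format, and function names here are illustrative assumptions, not the thesis's actual implementation.

```python
import random

# Toy context-free grammar for single-table SELECT queries (illustrative only).
CFG = {
    "query": [
        "SELECT {col} FROM {tab}",
        "SELECT {col} FROM {tab} WHERE {col} {op} {val}",
    ],
    "op": ["=", ">", "<"],
}

def generate_example(schema):
    """Sample a (question, SQL, intermediate representation) triple from the grammar."""
    tab = random.choice(list(schema))
    col = random.choice(schema[tab])
    template = random.choice(CFG["query"])
    op = random.choice(CFG["op"])
    sql = template.format(col=col, tab=tab, op=op, val="'X'")
    question = f"show the {col} of {tab}"
    if "WHERE" in sql:
        question += f" whose {col} is {op} X"
    # Intermediate representation: a simplified, schema-linked sketch of the SQL.
    ir = ["SELECT", f"COLUMN::{tab}.{col}", "FROM", f"TABLE::{tab}"]
    if "WHERE" in sql:
        ir += ["FILTER", f"COLUMN::{tab}.{col}", op]
    return question, sql, ir

def mask_ir(ir, mask_rate=0.3, mask_token="[MASK]"):
    """Randomly mask IR tokens; the pre-training objective is to recover them
    from the question and the database schema (IR completion)."""
    return [mask_token if random.random() < mask_rate else tok for tok in ir]

schema = {"students": ["name", "age", "grade"]}
question, sql, ir = generate_example(schema)
print(question, "|", sql, "|", mask_ir(ir))
```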
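Similarly, the joint graph representation used by the ACQD parsing module can be pictured with the sketch below, which builds a question sub-graph, a schema sub-graph, and exact-string-match linking edges between them. The node types, relation labels, and matching rule are assumptions chosen for illustration; they stand in for the syntactic-analysis and schema-linking procedure described above.

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    nodes: list = field(default_factory=list)   # (node_id, node_type, text)
    edges: list = field(default_factory=list)   # (src_id, dst_id, relation)

def build_joint_graph(question_tokens, schema):
    g = Graph()
    # Question sub-graph: one node per token, chained by adjacency edges
    # (a syntactic-dependency parse would add richer edges here).
    for i, tok in enumerate(question_tokens):
        g.nodes.append((f"q{i}", "token", tok))
        if i > 0:
            g.edges.append((f"q{i-1}", f"q{i}", "next-token"))
    # Schema sub-graph: table nodes connected to their column nodes.
    for tab, cols in schema.items():
        g.nodes.append((f"t:{tab}", "table", tab))
        for col in cols:
            g.nodes.append((f"c:{tab}.{col}", "column", col))
            g.edges.append((f"t:{tab}", f"c:{tab}.{col}", "has-column"))
    # Schema-linking edges: connect question tokens that mention a schema item.
    for i, tok in enumerate(question_tokens):
        for tab, cols in schema.items():
            if tok.lower() == tab.lower():
                g.edges.append((f"q{i}", f"t:{tab}", "exact-match"))
            for col in cols:
                if tok.lower() == col.lower():
                    g.edges.append((f"q{i}", f"c:{tab}.{col}", "exact-match"))
    return g

g = build_joint_graph("show the age of students".split(), {"students": ["name", "age"]})
print(len(g.nodes), "nodes,", len(g.edges), "edges")
```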
Keywords/Search Tags: Natural language processing, NL2SQL, Table pre-training, Intermediate representation, Annotation-based decomposition