Research On Generating SQL Statements Through Natural Language Based On Deep Learning

Posted on: 2022-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: C J Fan
GTID: 2518306347492634
Subject: Computer technology

Abstract/Summary:
The era of big data has arrived. Massive amounts of data are stored in a variety of databases, and how to mine valuable information from these data has become a focus of research. Industries such as medicine, education, finance and software development frequently use SQL statements to add, delete, modify and query data. For people with some programming background, using SQL may be relatively easy, but most people first need to learn professional knowledge about a particular database and the SQL language, and must be familiar with the database schema before they can write SQL statements fluently. Therefore, reducing the cost of learning SQL, generating SQL queries faster and more accurately, and operating databases in a more natural way are problems worthy of study.

Natural language processing (NLP) is an important research direction within artificial intelligence (AI). NLP aims to enable computers to understand human natural language and, conversely, to generate language that humans can understand from non-linguistic data. These are the two main problems of NLP: natural language understanding (NLU) and natural language generation (NLG). Progress in NLP can greatly promote progress in AI as a whole, so it has both theoretical value and practical significance.

Querying data through natural language interaction between a user and a database is a widely used approach. It not only saves the cost of learning professional knowledge but also improves the efficiency of data queries, so the Text2SQL task is of great research value. How to eliminate the gap in expression and structure between natural language, the structure and content of database tables, and SQL statements, how to correctly understand the semantics of natural language, and how to translate the user's intention into a correct SQL statement are the
main challenges of the Text2SQL task. The task can be divided into two parts: the first encodes natural language into a representation that the computer can process, and the second decodes that representation into SQL statements, that is, the classic encoder-decoder process. In addition, there are two different SQL generation scenarios, namely single-table queries and multi-table queries. The two scenarios differ in many respects, but some techniques can be reused across them.

The quality of natural language encoding directly affects the results of the downstream tasks. One-hot encoding is a direct and simple approach that gives each word a unique mathematical representation, but the sequence information of the words in the text is lost in such a representation. Word2vec solves the dimension explosion problem of one-hot vectors and also captures contextual semantic features. However, like the traditional one-hot word vector, it cannot solve the problem of polysemy, because each word still has a single fixed vector. The pre-trained language model BERT has proved effective on natural language processing tasks in many experiments, so applying a pre-trained language model to natural language encoding has become a popular practice.

After encoding, we obtain a representation of the natural language query that contains lexical, grammatical and semantic information. Based on this representation, we need to understand the user's actual query requirements and transform them into correct SQL statements. At present there are two kinds of methods for this task: (1) pipeline methods that are not based on deep learning, and (2) end-to-end methods based on deep learning. A pipeline method first transforms the natural language query into an intermediate expression and then transforms that intermediate expression into an SQL statement. The advantage of this method is that it does
not need a large number of "natural language query-SQL statement" pairs, which are costly and time-consuming to annotate. The disadvantage is that it cannot handle complex and varied natural language descriptions and can only deal with a few relatively fixed expressions. Moreover, it depends on well-defined templates and manually designed features, so it transfers poorly to new domains.

According to the training data used, end-to-end deep learning methods can be divided into weakly supervised learning and supervised learning. Although weakly supervised learning has the advantages of fast training data collection and low annotation cost, a model trained on sufficient "natural language query-SQL statement" pairs undeniably performs better.

Seq2SQL divides the generated SQL statement into three parts: the aggregation operation (SUM, COUNT, MIN, MAX, etc.), the SELECT column, and the WHERE conditions. Each part is predicted with a different method, and the authors also propose using reinforcement learning to optimize the model based on the query results. To address the limited benefit of reinforcement learning in Seq2SQL, SQLNet divides the SQL statement into two parts, SELECT and WHERE, each containing several slots, so that generation only requires filling the corresponding symbols into the slots; finally, the operator and condition value are predicted with an attention mechanism. The TypeSQL model builds on SQLNet and also uses template filling to generate SQL statements. SQLNet sets up a separate model for each component of the template; TypeSQL improves on this by merging similar, mutually dependent components, such as SELECT-COL, COND-COL and CONDS (the number of conditions), into a single model so that their dependencies are modeled better. Whereas earlier decoders output a linear piece of text, SyntaxSQLNet introduces structural information into the
process of decoding, which means that the decoding output is an SQL syntax tree. This technique greatly improves the exact matching accuracy of the model. Similar to SyntaxSQLNet, IRNet also generates SQL statements through a tree structure, which it defines with a set of context-free grammar (CFG) rules. Its other main improvement is in schema linking, that is, in finding the tables and columns mentioned in the question.

Existing models mainly target single-table queries, where reused column names can greatly reduce accuracy. They also include few optimizations specific to multi-table queries, although practical applications usually involve queries over multiple tables and are considerably more complex. In the multi-table scenario, we need to search a larger scope for the content that matches the natural language query, determine which tables in the database that content comes from, and consider more SQL elements when generating the statement.

Based on a thorough survey of related work on the Text2SQL task, this thesis takes Text2SQL as its research object and implements a complete process and method for generating SQL statements from natural language queries, covering the single-table scenario and its extension to multi-table queries. The main work and contributions of this thesis can be summarized as follows:

1. This thesis proposes a complete process and method that enhances the structured representation with the contextual output of a pre-trained model and completes the downstream Text2SQL tasks with different classification models for the different parts of the SQL statement. I first parse the SQL statements and transform the sequence generation problem into a template filling problem, that is, decompose it into multiple classification problems, and label the data according to
the transformed problems. Then, in the single-table query scenario, the Text2SQL task is divided into two subtasks, handled by a general classification model and a condition value acquisition model. The outputs of the two models together form a complete SQL statement. In the extended multi-table query scenario, the corresponding SQL structure is much more complex than in the single-table case, and the prediction results are difficult to manage with a simple template. Therefore, we use a tree-shaped syntax structure, add an SPC (statement position code) to handle subqueries, and generate the SQL recursively. At the same time, the encoder layer is redesigned to strengthen the relationship between the natural language question and the structure and content of the database tables, improving the accuracy with which the model selects tables and columns.

2. In the single-table query scenario, the Text2SQL task is decomposed into multiple subtasks based on template filling. Exploiting the strong feature representation ability of the pre-trained model in text processing, a classification model is built for each subtask through fine-tuning. The general classification model predicts every filled part of the SQL statement except the condition values; when filling in column names, it alleviates the problem of column name reuse. The condition value acquisition model obtains the condition values of the SQL statement: separate models are built for text-type and numeric-type condition values, making full use of the content of both the natural language query and the database. This model reduces the inconsistency between how a value is described in the natural language query and how it is stored in the database column. Comparative experiments show that the method improves the accuracy of SQL generation, in terms of column name prediction, condition value prediction and other aspects, on the WikiSQL dataset and the
WikiTableQuestions dataset.

3. In the multi-table query scenario, the approach of the single-table scenario is retained: based on template filling, the Text2SQL task is decomposed into classification subtasks, but the model is optimized for the particular characteristics of multi-table queries. The encoder layer for single-table queries uses only a single attention mechanism to fuse information. For multiple tables, we introduce a Transformer and a heuristic fusion function to strengthen contextual connections and construct a more complex model, which greatly enhances the connection between the question and the database tables in both structure and content. Because it considers the table structure and the information in the question at the same time, this model is better at determining the required tables and columns in the multi-table case. In addition, because the SQL structure for multi-table tasks is much more complex, it is difficult to manage the prediction process with a simple template. Therefore, an SQL statement parsing tree is designed according to the characteristics of SQL statements, and SPC (statement position code) information is added to solve the problem of nested queries. Comparative experiments show that this method transfers effectively from the single-table scenario to the multi-table scenario and completes the generation of multi-table SQL statements; compared with multiple models on the multi-table query dataset Spider, it achieves good results.
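
The template-filling idea for single-table queries described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the `Slots` container, the `fill_template` and `quote` helpers, and the label lists `AGG_OPS` and `COND_OPS` are hypothetical names chosen for this example, and the slot values are written by hand where a real system would take them from the classification models and the condition value model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Illustrative label sets (assumed for this sketch). Each classifier would
# predict an index into one of these lists.
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<"]

Value = Union[str, int, float]


@dataclass
class Slots:
    """Hypothetical container for the predicted slots of one query."""
    agg: int                      # index into AGG_OPS
    sel_col: int                  # index of the selected column
    # Each condition is (column index, operator index, value).
    conds: List[Tuple[int, int, Value]] = field(default_factory=list)


def quote(value: Value) -> str:
    """Render a condition value as an SQL literal."""
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quote so text values stay inside the literal.
    return "'" + str(value).replace("'", "''") + "'"


def fill_template(slots: Slots, table: str, columns: List[str]) -> str:
    """Assemble SELECT [agg](col) FROM table [WHERE cond AND ...]."""
    col = f'"{columns[slots.sel_col]}"'
    agg = AGG_OPS[slots.agg]
    select = f"{agg}({col})" if agg else col
    sql = f'SELECT {select} FROM "{table}"'
    if slots.conds:
        parts = [
            f'"{columns[c]}" {COND_OPS[op]} {quote(v)}'
            for c, op, v in slots.conds
        ]
        sql += " WHERE " + " AND ".join(parts)
    return sql


# Question (hypothetical): "How many people from China are older than 30?"
columns = ["name", "country", "age"]
slots = Slots(agg=3, sel_col=0, conds=[(1, 0, "China"), (2, 1, 30)])
sql = fill_template(slots, "people", columns)
print(sql)
# SELECT COUNT("name") FROM "people" WHERE "country" = 'China' AND "age" > 30
```

Because every slot is an index or a value, each one can be predicted by its own classifier, which is what turns sequence generation into several classification problems. A flat template like this cannot express nested queries, which is why the multi-table scenario moves to a tree structure.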
Keywords: NLP, Text2SQL, Pre-training Model, SQL Generation