Natural Language Text to SQL Query Conversion Using
NLP and Deep Learning
Phase 1
Introduction
In the present-day world, a significant amount of data is stored in Relational Database Management Systems. Many enterprise applications require information to be stored in and retrieved from databases, which demands a thorough knowledge of Structured Query Language (SQL). Although SQL is one of the most widely used languages in the tech world, most people do not have adequate knowledge to write SQL queries. Therefore, there is a need to provide a natural language interface to databases so that non-technical users can access information. The primary objective of this project is to develop a model that enables novice users to query databases using natural language. This project uses Natural Language Processing and a deep learning model to convert natural language queries into SQL queries[1].
Implementation Plan
The process of converting Natural Language text to SQL query includes the following steps:
Dataset Collection
Text Preprocessing
Word Embedding
Deep Learning model
Training and Testing
Basic Architecture of Text to SQL conversion
Dataset
The Dataset used in this project is a crowd-sourced dataset for developing natural language questions for relational databases. This dataset consists of 80654 hand-annotated questions and SQL queries distributed across 24241 tables from Wikipedia. The data files used in this project can be found in the Data folder. It is released along with the paper "Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning"[2].
Data Content:
The 80654 questions in the dataset are divided into three separate files for training, validation, and testing, consisting of 56355, 8421, and 15878 natural language questions with their SQL queries, respectively.
The Dataset contains the following files-
train.jsonl, dev.jsonl, and test.jsonl contain the following fields:
table_id: the database id to which this question is addressed.
question: the natural language question
sql: the SQL query corresponding to the question.
train.tables.jsonl, dev.tables.jsonl, and test.tables.jsonl contain the following information for each database:
id: database id
header: table header
types: datatypes for header
rows: corresponding rows of the table
name: original table names stored in the database.
page_title: name of the page
section_title: name of the section
caption: table caption
page_id: Page id
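As a quick reference, the following is a minimal sketch of how these JSON-Lines files can be loaded and joined on table_id. The paths under a local Data folder are assumptions; the field names follow the description above.

```python
import json

def load_jsonl(path):
    """Read a JSON-Lines file into a list of dictionaries, one per line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Assumed local paths; adjust to wherever the Data folder lives.
questions = load_jsonl("Data/train.jsonl")
tables = {t["id"]: t for t in load_jsonl("Data/train.tables.jsonl")}

sample = questions[0]
table = tables[sample["table_id"]]
print(sample["question"])               # natural language question
print(sample["sql"])                    # structured SQL annotation
print(table["header"], table["types"])  # column names and their datatypes
```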
Related Work
A system that converts a simple information-retrieval request into the corresponding SQL query would help many novice users across various sectors. Substantial research is ongoing to solve this problem, and various methodologies have been proposed for converting natural language text or speech into a database query.
In 2016, Li Dong et al. [3] proposed a model based on an attention-enhanced sequence-to-sequence architecture to convert language to logical form. In their paper, natural language utterances are encoded into vectors and the corresponding logical forms are generated as sequences or trees using recurrent neural networks with LSTM (long short-term memory) units. The model was trained on several datasets, including GEO, ATIS, and JOBS.
In 2019, Uma M et al. [4] put forward a system to convert natural language questions into SQL queries for a railway reservation database. The dataset used has 2880 structured natural language queries about train fares and seat availability. They used natural language processing and regular expressions to map English-language queries to SQL. In the same year, Jiaqi Guo et al. [5] proposed a neural approach called IRNet for complex and cross-domain text-to-SQL conversion. This model performs schema linking over a question and a database schema and adopts a grammar-based neural model to synthesize a SemQL query with domain knowledge.
Presentation
Phase 2
Text Preprocessing
The structured SQL annotations in the input files are converted into plain-text SQL queries.
The SQL queries and natural language questions are tokenized and stored in the output files folder.
NLTK is used for word tokenization of the datasets.
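A minimal sketch of this tokenization step with NLTK might look as follows; the example question and query are illustrative, not taken from the dataset.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models, fetched once

question = "What is the capacity of the stadium in Gothenburg?"
sql_query = "SELECT Capacity FROM table WHERE Location = 'Gothenburg'"

question_tokens = word_tokenize(question.lower())
sql_tokens = word_tokenize(sql_query.lower())

print(question_tokens)
print(sql_tokens)
```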
Exploratory Data Analysis
To observe and understand more about the natural language questions and SQL queries, distribution plots have been created. The graphs below show the distribution of the number of words in the questions and queries, and the distribution of question and query lengths by number of characters.
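A minimal sketch of how such distribution plots could be produced with matplotlib is shown below; it assumes the token lists from the preprocessing step are available, and the sample data here is only a placeholder.

```python
import matplotlib.pyplot as plt

# Placeholder inputs: lists of token lists produced by the preprocessing step.
question_tokens = [["what", "is", "the", "capacity", "?"],
                   ["how", "many", "players", "are", "there", "?"]]
query_tokens = [["select", "capacity", "where", "location", "=", "gothenburg"],
                ["select", "count", "player", "where", "team", "=", "x"]]

question_lengths = [len(tokens) for tokens in question_tokens]
query_lengths = [len(tokens) for tokens in query_tokens]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(question_lengths, bins=20)
axes[0].set_title("Words per question")
axes[1].hist(query_lengths, bins=20)
axes[1].set_title("Words per SQL query")
plt.tight_layout()
plt.show()
```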
Word Embedding
As machine learning models cannot process text directly, the textual data is converted into numerical data using word embeddings. Word embeddings capture the semantic and syntactic context of a word and help measure how similar or dissimilar it is to other words.
There are many popular word embedding methods for extracting features from text, such as Word2Vec, TF-IDF, Bag of Words, and GloVe.
In this project, the Word2Vec and GloVe methods are used to create word embeddings. The advantage of GloVe is that, unlike Word2Vec, it does not rely only on local statistics but also incorporates global statistics to obtain word vectors[6].
Here, I'm using glove.6B.100d.txt, which covers 6B tokens with a 400K vocabulary and can be found here: https://nlp.stanford.edu/projects/glove/.
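A minimal sketch of loading the GloVe file and building an embedding matrix for the project vocabulary; the vocabulary shown is a tiny illustrative placeholder.

```python
import numpy as np

EMBEDDING_DIM = 100  # matches glove.6B.100d.txt

def load_glove(path="glove.6B.100d.txt"):
    """Parse the GloVe text file into {word: vector}; each line holds a word
    followed by 100 floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

def build_embedding_matrix(word_to_index, glove):
    """Rows follow the vocabulary indices; words missing from GloVe stay zero."""
    matrix = np.zeros((len(word_to_index), EMBEDDING_DIM), dtype=np.float32)
    for word, idx in word_to_index.items():
        vector = glove.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Illustrative vocabulary; in practice it is built from the tokenized dataset.
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "select": 3, "count": 4, "capacity": 5}
glove = load_glove()
embedding_matrix = build_embedding_matrix(vocab, glove)
print(embedding_matrix.shape)  # (vocab_size, 100)
```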
Model Construction
Attention - Sequence to Sequence model
Converting natural language text to SQL queries can be framed as a translation problem, so in this project I have used a Seq2Seq model. A sequence-to-sequence model maps an input sequence to an output sequence, where the lengths of the input and the output may differ[7].
To build the model, an Encoder-Decoder architecture with Recurrent Neural Networks is used. This model consists of three parts: encoder, intermediate (encoder) vector, and decoder[7].
Encoder:
The encoder of a seq2seq network is an RNN that outputs some value for every word from the input sentence. For every input word, the encoder outputs a vector and a hidden state and uses the hidden state for the next input word.
It encodes the entire sequence into a fixed-length vector called a context vector.
The purpose of the embedding layer is to convert tokens into feature vectors.
A GRU layer is used to address the short-term memory problem of plain recurrent neural networks. GRUs have internal mechanisms called gates that help regulate the flow of information[7].
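The encoder described above could be sketched in PyTorch roughly as follows; PyTorch is an assumption here, and the vocabulary and hidden sizes are placeholders.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Embeds the input tokens and runs them through a GRU, returning the
    per-step outputs and the final hidden state (the context vector)."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, input_ids, hidden=None):
        embedded = self.embedding(input_ids)          # (batch, seq_len, hidden)
        outputs, hidden = self.gru(embedded, hidden)  # outputs: (batch, seq_len, hidden)
        return outputs, hidden

encoder = EncoderRNN(vocab_size=10000, hidden_size=256)
tokens = torch.randint(0, 10000, (1, 12))    # one question of 12 token ids
encoder_outputs, context = encoder(tokens)
print(encoder_outputs.shape, context.shape)  # (1, 12, 256) and (1, 1, 256)
```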
Encoder Vector:
This is the final hidden state produced from the encoder part of the model.
This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
Attention Decoder:
Unlike a plain decoder, where a single vector must carry the information of the entire sentence, the attention decoder focuses on different parts of the encoder's outputs at every step. Attention weights are calculated from the decoder inputs and the hidden state. The model consists of an input layer, an embedding layer, an attention layer, a GRU layer, and activation functions.
Softmax is used to produce a probability vector from which the final output token is determined.
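A possible PyTorch sketch of such an attention decoder, in the spirit of the description above; the layer sizes and the fixed max_length used by the attention layer are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderRNN(nn.Module):
    """Decoder that attends over the encoder outputs at every step; attention
    weights come from the current input embedding and the previous hidden state."""
    def __init__(self, vocab_size, hidden_size, max_length):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.attn = nn.Linear(hidden_size * 2, max_length)
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_id, hidden, encoder_outputs):
        # input_id: (batch, 1); hidden: (1, batch, hidden);
        # encoder_outputs: (batch, max_length, hidden)
        embedded = self.embedding(input_id)                              # (batch, 1, hidden)
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[:, 0], hidden[0]), dim=1)), dim=1
        )                                                                # (batch, max_length)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)  # (batch, 1, hidden)
        gru_input = torch.relu(
            self.attn_combine(torch.cat((embedded, context), dim=2))
        )
        output, hidden = self.gru(gru_input, hidden)
        log_probs = F.log_softmax(self.out(output[:, 0]), dim=1)         # (batch, vocab)
        return log_probs, hidden, attn_weights
```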
Training the RNN:
To train the RNN quickly and efficiently, the teacher forcing technique is used, which feeds the ground truth from the prior time step as the next decoder input. This training method is widely used for deep learning language models in machine translation, text summarization, and image captioning, among many other applications[9]. Here, a teacher forcing ratio of 1 is applied.
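A sketch of one training step with teacher forcing, assuming the encoder and attention decoder shown earlier, an NLLLoss criterion, and inputs padded to the decoder's max_length; with a ratio of 1.0 the gold token is always fed back.

```python
import random
import torch

def train_step(input_ids, target_ids, encoder, decoder, enc_opt, dec_opt,
               criterion, sos_index, teacher_forcing_ratio=1.0):
    """One optimization step over a batch of (question, query) pairs."""
    enc_opt.zero_grad()
    dec_opt.zero_grad()

    encoder_outputs, hidden = encoder(input_ids)   # encode the question
    decoder_input = torch.full((input_ids.size(0), 1), sos_index)  # <sos> tokens
    loss = 0.0

    use_teacher_forcing = random.random() < teacher_forcing_ratio
    for t in range(target_ids.size(1)):
        log_probs, hidden, _ = decoder(decoder_input, hidden, encoder_outputs)
        loss = loss + criterion(log_probs, target_ids[:, t])
        if use_teacher_forcing:
            decoder_input = target_ids[:, t].unsqueeze(1)           # feed the gold token
        else:
            decoder_input = log_probs.argmax(dim=1, keepdim=True)   # feed own prediction

    loss.backward()
    enc_opt.step()
    dec_opt.step()
    return loss.item() / target_ids.size(1)
```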
Results:
The results obtained with the attentional sequence-to-sequence model are not satisfactory, since it does not use any hand-engineered grammar and the output space of its softmax is unnecessarily large for this task. To overcome this problem, Victor Zhong et al. [2] proposed Seq2SQL in 2017, a deep neural network for translating natural language into corresponding SQL queries. The model uses the structure of SQL to reduce the output space of the generated queries, and it uses in-the-loop query execution over the database as a reward signal to learn a policy for generating the unordered components of the query. I tried to implement the same model on this dataset to see its performance.
Presentation
Phase 3
Seq2SQL Model
The Seq2SQL model uses an augmented pointer network to generate each SQL query token by token by selecting tokens from an input sequence. In this case, the input sequence is the concatenation of the column names, the question, and the vocabulary of the SQL language (SELECT, COUNT, etc.). With this input, the pointer network can produce the SQL query.
The network encodes the input sequence using a bidirectional LSTM. The inputs to the encoder are the embeddings corresponding to the words of the concatenated input sequence; GloVe embeddings are used here. The output of the encoder is fed to the decoder, which uses a unidirectional LSTM. The output of the decoder is a scalar attention score for each position of the input sequence.
Internally, this model comprises three individual sub-models: an aggregation predictor, a selection predictor, and a condition predictor. First, the network predicts an aggregation operation for the query. Next, it finds the column of the input table corresponding to the SELECT clause. Finally, it generates the conditions of the query[2].
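For illustration, a minimal sketch of how the augmented input sequence can be assembled; the sentinel tokens and the exact contents of the SQL vocabulary here are assumptions.

```python
# Illustrative SQL vocabulary; the real list depends on the model configuration.
SQL_VOCAB = ["select", "where", "and", "count", "min", "max", "sum", "avg",
             "=", ">", "<", "<end>"]

def build_input_sequence(column_headers, question_tokens):
    """Concatenate the SQL vocabulary, the column names, and the question,
    separated by sentinel tokens, so that every token of the output query
    can be produced by pointing at a position in this sequence."""
    sequence = ["<sql>"] + SQL_VOCAB
    for header in column_headers:
        sequence += ["<col>"] + header.lower().split()
    sequence += ["<question>"] + question_tokens + ["<end>"]
    return sequence

headers = ["Player", "No.", "Nationality", "Position"]
question = "what is the nationality of the player in position guard ?".split()
print(build_input_sequence(headers, question))
```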
Results
The model is trained for 30 epochs. To evaluate it, overall logical-form and execution accuracy, as well as the individual accuracies of the condition, selection, and aggregation predictors, are calculated for the training, validation, and test datasets; a sketch of how these metrics can be computed follows the results below.
The results are as follows:
Overall dev accuracy - 74%
Aggregation predictor - 90%
Selection predictor - 88%
Condition predictor - 59%
Test logical-form accuracy - 49%
Test execution accuracy - 57%
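A rough sketch of how the two query-level metrics could be computed; the execute helper that runs a query against a table is hypothetical.

```python
def logical_form_accuracy(predicted_queries, gold_queries):
    """Fraction of predictions that exactly match the gold query string."""
    matches = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predicted_queries, gold_queries))
    return matches / len(gold_queries)

def execution_accuracy(predicted_queries, gold_queries, execute):
    """Fraction of predictions whose result set equals the gold result;
    `execute` is a hypothetical helper that runs a query on the table."""
    matches = 0
    for p, g in zip(predicted_queries, gold_queries):
        try:
            if execute(p) == execute(g):
                matches += 1
        except Exception:
            pass  # a malformed prediction counts as a miss
    return matches / len(gold_queries)
```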
Conclusion and Future Work
In this project, two different models were trained on the data: an attention-based sequence-to-sequence model that uses an encoder-decoder architecture, and an augmented pointer network model (Seq2SQL). The results obtained from the augmented pointer network model are more satisfactory than those of the sequence-to-sequence model, but there is still room to increase the accuracy of the model.
In the future, I would like to use the BERT (Bidirectional Encoder Representations from Transformers) model, which makes use of the Transformer attention mechanism to learn contextual relations between words (or sub-words) in text, and train it on more complex datasets to improve the model's performance[10].
Presentation
References
[1] Veysel Kocaman, "Text2SQL in Spark NLP", October 30, 2021. [Online]. Available: https://medium.com/swlh/text2sql-in-spark-nlp-converting-natural-language-questions-to-sql-queries-on-scale-6ae9a9061d74.
[2] Victor Zhong et al., "Seq2SQL: Generating Structured Queries from Natural Language Using Reinforcement Learning", November 9, 2017. [Online]. Available: https://arxiv.org/pdf/1709.00103.pdf.
[3] Li Dong et al., "Language to Logical Form with Neural Attention", January 2016. [Online]. Available: https://www.aclweb.org/anthology/P16-1004.pdf.
[4] Uma M et al., "Formation of SQL from Natural Language Query using NLP", 2019. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8862080.
[5] Jiaqi Guo et al., "Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation", May 29, 2019. [Online]. Available: https://arxiv.org/pdf/1905.08205.pdf.
[6] Thushan Ganegedara, "Intuitive Guide to Understanding GloVe Embeddings", May 5, 2019. [Online]. Available: https://towardsdatascience.com/light-on-math-ml-intuitive-guide-to-understanding-glove-embeddings-b13b4f19c010.
[7] Simeon Kostadinov, "Understanding Encoder-Decoder Sequence to Sequence Model", February 4, 2019. [Online]. Available: https://towardsdatascience.com/understanding-encoder-decoder-sequence-to-sequence-model-679e04af4346.
[8] Michael Phi, "Illustrated Guide to LSTM's and GRU's: A Step by Step Explanation", September 24, 2018. [Online]. Available: https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21.
[9] Eduardo Munoz, "A Guide to the Encoder-Decoder Model and the Attention Mechanism", October 11, 2020. [Online]. Available: https://betterprogramming.pub/a-guide-on-the-encoder-decoder-model-and-the-attention-mechanism-401c836e2cdb.
[10] Rani Horev, "BERT Explained: State of the Art Language Model for NLP", November 10, 2018. [Online]. Available: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270.
[11] Tao Yu, "Spider: One More Step Towards Natural Language Interfaces to Databases", September 25, 2018. [Online]. Available: https://medium.com/@tao.yu/spider-one-more-step-towards-natural-language-interfaces-to-databases-62298dc6df3c.
[12] Tao Yu et al., "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task", February 2, 2019. [Online]. Available: https://arxiv.org/pdf/1809.08887.pdf.
[13] Aerin Kim, "[Text-to-SQL] Learning to Query Tables with Natural Language", June 4, 2020. [Online]. Available: https://aerinykim.medium.com/text-to-sql-learning-to-query-tables-with-natural-language-7d714e60a70d.
[14] Anisha TS et al., "Text to SQL Query Conversion Using Deep Learning: A Comparative Analysis", September 7, 2019. [Online]. Available: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3437616.
[15] Jason Brownlee, "How Does Attention Work in Encoder-Decoder Recurrent Neural Networks", October 13, 2017. [Online]. Available: https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/.
[16] Jason Brownlee, "Encoder-Decoder Recurrent Neural Network Models for Neural Machine Translation", August 7, 2019. [Online]. Available: https://machinelearningmastery.com/encoder-decoder-recurrent-neural-network-models-neural-machine-translation/.