Subqueries
An Introduction to Extract, Transform, and Load (ETL) Processes
NoSQL Databases
A subquery in MySQL is a query nested inside another SQL query and embedded within a SELECT, INSERT, UPDATE, or DELETE statement, usually together with one of the comparison operators.
Subqueries can themselves contain further subqueries. The subquery is known as the inner query, and the query that contains the subquery is known as the outer query.
The inner query executes first and passes its result to the outer query; then the main/outer query runs. MySQL allows us to use a subquery almost anywhere, but it must be enclosed within parentheses.
The following are the rules to use subqueries:
Subqueries must always be enclosed in parentheses.
A subquery can have only one column in its SELECT clause, unless the main query supplies multiple columns for the subquery to compare against.
We can use various comparison operators with a subquery, such as >, <, =, IN, ANY, SOME, and ALL. A multiple-row operator (IN, ANY, ALL) is needed when the subquery returns more than one row.
We cannot use the ORDER BY clause inside a subquery, although it can be used in the main query.
A subquery cannot be immediately enclosed in a set (aggregate) function.
The following are the advantages of using subqueries:
Subqueries give a query a structured form that allows us to isolate each part of the statement.
Subqueries provide an alternative way to query data that would otherwise require complex joins and unions.
Subqueries are often more readable than the equivalent complex join or union statements.
SUBQUERY SYNTAX:
SELECT column_list(s) FROM table_name
WHERE column_name OPERATOR
   (SELECT column_list(s) FROM table_name [WHERE]);
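The syntax above can be exercised end to end. The following is a minimal sketch using Python's built-in sqlite3 module; the customers table and its data are hypothetical, not from the text, and SQLite's subquery syntax here matches MySQL's:

```python
import sqlite3

# In-memory database with a hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", "Pune", 50000), ("Bob", "Pune", 30000), ("Carol", "Delhi", 70000)],
)

# Subquery in the WHERE clause: customers earning more than the overall average.
# The inner query runs first; its single value feeds the outer comparison.
rows = conn.execute(
    "SELECT name FROM customers "
    "WHERE income > (SELECT AVG(income) FROM customers)"
).fetchall()
print(rows)  # [('Carol',)]
```

The average income is 50000, so only Carol (70000) qualifies; Alice's 50000 is not strictly greater.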
A comparison operator compares two values and returns either true or false. MySQL supports comparison operators such as <, >, =, <>, <=>, <=, and >=.
A correlated subquery in MySQL is a subquery that depends on the outer query.
It uses data from the outer query, or contains a reference to a column of the parent (outer) query. MySQL evaluates it once for each row of the outer query.
EXAMPLE:
We select the name and city of each customer whose income is higher than the average income of all customers in the same city.
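A sketch of this example, again using Python's sqlite3 module with a hypothetical customers table; note how the inner query references c1.city from the outer query, so it is re-evaluated for each outer row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", "Pune", 60000), ("Bob", "Pune", 30000),
     ("Carol", "Delhi", 80000), ("Dave", "Delhi", 40000)],
)

# Correlated subquery: the inner query depends on c1.city from the outer
# query, so the per-city average is computed once per outer row.
rows = conn.execute(
    "SELECT name, city FROM customers AS c1 "
    "WHERE income > (SELECT AVG(income) FROM customers AS c2 "
    "                WHERE c2.city = c1.city) "
    "ORDER BY name"
).fetchall()
print(rows)  # [('Alice', 'Pune'), ('Carol', 'Delhi')]
```

Pune's average is 45000 and Delhi's is 60000, so Alice and Carol exceed their own city's average.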
The EXISTS operator is a Boolean operator that returns either true or false result. It is used with a subquery and checks the existence of data in a subquery.
If a subquery returns any record at all, this operator returns true. Otherwise, it will return false.
The NOT EXISTS operator is the negation: it returns true when the subquery does not return any row, and false otherwise. Both EXISTS and NOT EXISTS are typically used with correlated subqueries.
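A small sketch of EXISTS and NOT EXISTS with hypothetical customers and orders tables, using Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (1, 120.0);
""")

# EXISTS returns true as soon as the correlated subquery yields any row.
with_orders = conn.execute(
    "SELECT name FROM customers c "
    "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
).fetchall()

# NOT EXISTS is the negation: customers with no orders at all.
without_orders = conn.execute(
    "SELECT name FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
).fetchall()
print(with_orders, without_orders)  # [('Alice',)] [('Bob',)]
```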
We can use a subquery followed by the keyword ALL, ANY, or SOME after a comparison operator. The following is the syntax for subqueries with ALL, ANY, or SOME:
operand comparison_operator ANY (subquery)
operand comparison_operator ALL (subquery)
operand comparison_operator SOME (subquery)
The ALL keyword compares a value with every value returned by the subquery; the comparison returns TRUE only if it is TRUE for all of the returned values.
The ANY keyword returns TRUE if the comparison is TRUE for any of the values returned by the subquery. ANY and SOME behave identically because they are aliases of each other.
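SQLite, used in the sketch below via Python's sqlite3 module, does not support the ALL/ANY/SOME quantifiers, so the sketch uses the standard MAX/MIN rewrites, which are equivalent when the subquery returns at least one non-NULL value; the orders table and its data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 100), ("east", 200), ("west", 250), ("west", 300)])

# MySQL:  amount > ALL (SELECT amount FROM orders WHERE region = 'east')
# Equivalent rewrite: greater than the MAXIMUM of the subquery's values.
gt_all = conn.execute(
    "SELECT amount FROM orders "
    "WHERE amount > (SELECT MAX(amount) FROM orders WHERE region = 'east') "
    "ORDER BY amount"
).fetchall()

# MySQL:  amount > ANY (...)
# Equivalent rewrite: greater than the MINIMUM of the subquery's values.
gt_any = conn.execute(
    "SELECT amount FROM orders "
    "WHERE amount > (SELECT MIN(amount) FROM orders WHERE region = 'east') "
    "ORDER BY amount"
).fetchall()
print(gt_all)  # [(250.0,), (300.0,)]
print(gt_any)  # [(200.0,), (250.0,), (300.0,)]
```

"> ALL" must beat the east maximum (200), while "> ANY" only needs to beat the east minimum (100).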
ETL, which stands for extract, transform and load, is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system.
ETL is a type of data integration that refers to the three steps (extract, transform, load) used to blend data from multiple sources. It's often used to build a data warehouse.
During this process, data is taken (extracted) from a source system, converted (transformed) into a format that can be analyzed, and stored (loaded) into a data warehouse or other system. Extract, load, transform (ELT) is an alternate but related approach designed to push processing down to the database for improved performance.
When used with an enterprise data warehouse (data at rest), ETL provides deep historical context for the business.
By providing a consolidated view, ETL makes it easier for business users to analyze and report on data relevant to their initiatives.
ETL can improve data professionals' productivity because it codifies and reuses data-movement processes, without requiring users to write code or scripts by hand.
ETL has evolved over time to support emerging integration requirements for things like streaming data.
Organizations need both ETL and ELT to bring data together, maintain accuracy and provide the auditing typically required for data warehousing, reporting and analytics.
During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of data sources, which can be structured or unstructured.
Those sources include but are not limited to:
SQL or NoSQL servers
CRM and ERP systems
Flat files
Web pages
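The extraction step can be sketched in a few lines of Python; the flat-file source below is hypothetical, and the "staging area" is just an in-memory list of raw rows:

```python
import csv
import io

# Hypothetical flat-file source (one of the source types listed above).
source = io.StringIO("id,name,amount\n1,Alice,100\n2,Bob,200\n")

# Extraction: copy the raw rows into a staging area without transforming
# them yet. All values stay as strings, exactly as found in the source.
staging = list(csv.DictReader(source))
print(staging)
```

In practice the staging area would be a database or object store, and the source a CRM export, an API, or a production table.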
In the staging area, the raw data undergoes data processing. Here, the data is transformed and consolidated for its intended analytical use case.
This phase can involve the following tasks:
Filtering, cleansing, de-duplicating, validating, and authenticating the data.
Performing calculations, translations, or summarizations based on the raw data. This can include changing row and column headers for consistency, converting currencies or other units of measurement, editing text strings, and more.
Conducting audits to ensure data quality and compliance.
Removing, encrypting, or protecting data governed by industry or governmental regulators.
Formatting the data into tables or joined tables to match the schema of the target data warehouse.
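A minimal sketch of some of the transformation tasks above (de-duplication, text cleansing, and a unit conversion) over hypothetical staged rows:

```python
# Hypothetical raw rows in the staging area: a duplicate, inconsistent
# casing and whitespace, and amounts in cents that should become dollars.
staging = [
    {"id": "1", "name": " alice ", "amount_cents": "10050"},
    {"id": "1", "name": " alice ", "amount_cents": "10050"},  # duplicate
    {"id": "2", "name": "BOB", "amount_cents": "20000"},
]

seen, transformed = set(), []
for row in staging:
    if row["id"] in seen:              # de-duplicate on the key
        continue
    seen.add(row["id"])
    transformed.append({
        "id": int(row["id"]),
        "name": row["name"].strip().title(),       # cleanse text strings
        "amount": int(row["amount_cents"]) / 100,  # unit conversion
    })
print(transformed)
```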
In this last step, the transformed data is moved from the staging area into a target data warehouse. Typically, this involves an initial loading of all data, followed by periodic loading of incremental data changes and, less often, full refreshes to erase and replace data in the warehouse.
For most organizations that use ETL, the process is automated, well-defined, continuous and batch-driven.
Typically, ETL takes place during off-hours when traffic on the source systems and the data warehouse is at its lowest.
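The initial-load-plus-incremental-load pattern can be sketched with SQLite's upsert syntax via Python's sqlite3 module (MySQL would use INSERT ... ON DUPLICATE KEY UPDATE instead); the warehouse table and values are hypothetical:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# Initial full load of all transformed data.
warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                      [(1, 100.5), (2, 200.0)])

# Periodic incremental load: upsert only the new or changed rows rather
# than reloading everything.
changes = [(2, 250.0), (3, 75.0)]
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    changes,
)
rows = warehouse.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)  # [(1, 100.5), (2, 250.0), (3, 75.0)]
```

A full refresh would instead truncate the table and repeat the initial load.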
NoSQL databases have dynamic schemas for unstructured data
NoSQL databases are horizontally scalable
NoSQL databases are document, key-value, graph, or wide-column stores
NoSQL is better for unstructured data like documents or JSON
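The document-store model can be illustrated with plain Python dictionaries: each record is a JSON-like document, and a dynamic schema means documents in the same collection need not share fields (the collection and its fields below are hypothetical):

```python
import json

# Two documents in the same hypothetical collection: note the second has
# fields (tags, a nested address) that the first lacks entirely.
docs = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "tags": ["vip"], "address": {"city": "Pune"}},
]

# A document store serializes records as JSON; a simple "query" can
# filter on whatever fields happen to exist in each document.
vips = [d for d in docs if "vip" in d.get("tags", [])]
print(json.dumps(vips))
```

Real document databases add indexing and a query language on top of this model, but the underlying records look much like the dictionaries above.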
NoSQL (“non SQL” or “not only SQL”) databases were developed in the late 2000s with a focus on scaling, fast queries, allowing for frequent application changes, and making programming simpler for developers.
Handle large volumes of data at high speed with a scale-out architecture
Store unstructured, semi-structured, or structured data
Enable easy updates to schema and fields
Developer-friendly
Don’t support ACID (atomicity, consistency, isolation, durability) transactions across multiple documents
NoSQL databases can require more storage than SQL databases, because denormalized data is often duplicated across records
Depending on the NoSQL database type you select, you may not be able to achieve all of your use cases in a single database. For example, graph databases are excellent for analyzing relationships in your data but may not provide what you need for everyday retrieval of the data such as range queries.
MongoDB is a source-available, document-oriented, cross-platform database program. MongoDB is classified as a NoSQL database that stores data in JSON-like documents with optional schemas. MongoDB also offers indexing, ad hoc queries, and real-time aggregation for accessing and analyzing data. As a distributed database, MongoDB provides high availability, geographical distribution, and horizontal scalability.
Apache Cassandra is an open-source distributed NoSQL database that offers high scalability and availability without affecting the performance of the application. Apache Cassandra manages unstructured data with thousands of writes every second. Proven fault tolerance and linear scalability on cloud infrastructure or commodity hardware make Cassandra a perfect choice for mission-critical data. To assure scalability and reliability, Cassandra is tested on 1000 node clusters, 100 real-world schemas and use cases by Apple, Netflix, Amazon and others. Besides, it can even handle failed replacements of nodes without shutting down the system. Cassandra also replicates data across multiple nodes automatically, making it an ideal choice for developers.
Redis is also an open-source, in-memory data structure store. Redis is used as a cache, database and message broker that comes with optional durability. Redis also possesses bitmaps, range queries, streams, hyperloglogs, geospatial indexes along with data structures like sets, hashes, strings, sorted sets, and lists. Redis is written in ANSI C and can be used with nearly all of the programming languages. With its in-memory dataset, Redis preserves its extremely fast performance.
DynamoDB is a fully managed NoSQL database offered by Amazon Web Services that supports both document and key-value data models. DynamoDB offers seamless scalability with fast performance. It also offers encryption at rest to reduce the complexity of protecting sensitive data. With DynamoDB, you also get on-demand backup capabilities and can enable point-in-time recovery for DynamoDB tables. It provides developers with auto-scaling, backup, restore, and in-memory caching for application data.