Comparing a database index to the index at the back of a book is an apt analogy. It gives a clear, intuitive picture of how indexing works in databases.
Creating an index builds a reference structure for the data. Just as a book's index lets you jump straight to the right page, a database index lets the database retrieve specific rows without scanning the entire dataset, which makes queries that filter on the indexed column considerably faster.
The sections below explore how databases achieve this efficiency.
Creating an index in a relational database is done with the CREATE INDEX statement.
The syntax varies slightly between database management systems (DBMS), but the general form is shown below.
Syntax:
CREATE INDEX [index_name]
ON [table_name] ([column_name]);
Query:
CREATE INDEX product_category_index
ON product (category);
Running this statement takes noticeably longer than a typical query. The database has to scan all 12 million rows and build the new 'category' index from scratch, a process that takes approximately 4 minutes.
Now, let's assess how the performance of the original query improves after the implementation of indexing.
SELECT COUNT(*)
FROM product
WHERE category = 'electronics';
You'll observe a significant improvement in the query's speed this time. It is likely to complete in a much shorter timeframe, perhaps around 400 milliseconds.
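A quick way to confirm that a query actually uses a new index is to ask the database for its query plan before and after creating it. The sketch below is only an illustration: it assumes SQLite (through Python's built-in sqlite3 module) rather than the DBMS used in this article, and the three sample categories and the row count are made up. Plan wording differs between engines and versions, so the code only checks whether the index name appears.

```python
import sqlite3

def query_plan(conn, sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT)"
)
# Hypothetical sample rows; the article's table has 12 million.
conn.executemany(
    "INSERT INTO product (category) VALUES (?)",
    [("electronics",), ("books",), ("toys",)] * 1000,
)

count_sql = "SELECT COUNT(*) FROM product WHERE category = 'electronics'"

plan_before = query_plan(conn, count_sql)  # no index on category yet
conn.execute("CREATE INDEX product_category_index ON product (category)")
plan_after = query_plan(conn, count_sql)   # should now mention the index

electronics_count = conn.execute(count_sql).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("rows matched:", electronics_count)
```

On a table this small the timing difference is negligible; the point is only that the plan switches from a scan to an index search.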
The benefit of indexing 'category' also extends beyond queries that filter on this column alone. Queries that combine 'category' with additional conditions become faster too, as the next example shows.
SELECT COUNT(*)
FROM product
WHERE category = 'electronics'
AND product_subcategory = 'headphone';
In this case the query is also faster than it was without the index, perhaps completing in around 600 milliseconds. The database uses the index to locate all 'electronics' products, which yields a much smaller set of records, and then filters that set for 'headphone' in the usual way.
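The two-stage idea described above (narrow by the indexed column, then filter the remainder normally) can be sketched in plain Python. This is a toy model, not how a database is implemented: a dictionary stands in for the index, the rows are invented, and "rows examined" is just a counter used to make the saving visible.

```python
# Hypothetical rows: (category, product_subcategory).
products = (
    [("electronics", "headphone")] * 50
    + [("electronics", "laptop")] * 150
    + [("books", "fiction")] * 400
    + [("toys", "puzzle")] * 400
)

# Stand-in for an index on category: value -> positions of matching rows.
category_index = {}
for position, (category, _subcategory) in enumerate(products):
    category_index.setdefault(category, []).append(position)

def count_full_scan(category, subcategory):
    """Check every row; return (matches, rows examined)."""
    matches = sum(1 for row in products if row == (category, subcategory))
    return matches, len(products)

def count_with_index(category, subcategory):
    """Narrow by category through the index, then filter by subcategory."""
    candidates = category_index.get(category, [])
    matches = sum(1 for pos in candidates if products[pos][1] == subcategory)
    return matches, len(candidates)

scan_result = count_full_scan("electronics", "headphone")
index_result = count_with_index("electronics", "headphone")
print("full scan  (matches, examined):", scan_result)
print("with index (matches, examined):", index_result)
```

Both approaches return the same count, but the indexed version only examines the rows in the 'electronics' group.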
Now, let's explore the impact of changing the order of conditions in the 'WHERE' clause.
SELECT COUNT(*)
FROM product
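WHERE product_subcategory = 'headphone'
AND category = 'electronics';
The order of conditions in the WHERE clause does not change the result, and query planners generally pick the 'category' index regardless of that order, so this query should perform about the same as the previous one.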
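One way to check whether condition order matters on a given engine is to compare the query plans for both orderings. The sketch below assumes SQLite via Python's sqlite3 module, with invented sample rows; other planners may word their plans differently, so treat the outcome as something to verify on your own DBMS rather than a guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_id INTEGER PRIMARY KEY, category TEXT, product_subcategory TEXT)"
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO product (category, product_subcategory) VALUES (?, ?)",
    [("electronics", "headphone"), ("electronics", "laptop"),
     ("books", "fiction"), ("toys", "puzzle")] * 500,
)
conn.execute("CREATE INDEX product_category_index ON product (category)")

def plan_details(sql):
    """Return the human-readable plan steps for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]

category_first = (
    "SELECT COUNT(*) FROM product "
    "WHERE category = 'electronics' AND product_subcategory = 'headphone'"
)
subcategory_first = (
    "SELECT COUNT(*) FROM product "
    "WHERE product_subcategory = 'headphone' AND category = 'electronics'"
)

plan_a = plan_details(category_first)
plan_b = plan_details(subcategory_first)
count_a = conn.execute(category_first).fetchone()[0]
count_b = conn.execute(subcategory_first).fetchone()[0]

print("category first:   ", plan_a)
print("subcategory first:", plan_b)
```

If both plans name the same index, the planner has treated the two orderings as equivalent.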
Database indexes come in two primary types: clustered and non-clustered.
A clustered index is the single index that determines how a table's rows are organized, and it is built on the primary key. Unlike a non-clustered index, it usually needs no explicit declaration: in systems such as MySQL (InnoDB) and SQL Server it is created automatically when the primary key is defined, although the exact behavior varies by DBMS. By default, the clustered index orders rows by ascending primary key.
These clustered indexes play a pivotal role in shaping the physical arrangement of data within the table, facilitating efficient retrieval and storage operations.
Let me demonstrate this with an easy example.
The 'product' table will automatically come with a clustered index named 'product_pkey,' and this index is structured around the primary key 'product_id.'
Now, when you run a query that searches the table by 'product_id' (like the query below), the clustered index lets the database find the row efficiently and return the result faster.
SELECT product_name, category, price
FROM product
WHERE product_id = 3;
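To see a primary-key lookup being served by the table's own key structure, you can inspect the plan for the query above. The sketch assumes SQLite via Python's sqlite3 module, where a table is stored as a B-tree keyed on its INTEGER PRIMARY KEY; the three sample rows are invented, and the plan text will differ on other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "product_id INTEGER PRIMARY KEY, product_name TEXT, "
    "category TEXT, price REAL)"
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?, ?)",
    [
        (1, "desk lamp", "home", 19.5),
        (2, "novel", "books", 8.0),
        (3, "earbuds", "electronics", 25.0),
    ],
)

lookup_sql = "SELECT product_name, category, price FROM product WHERE product_id = 3"

# Ask the planner how it would locate the row.
pk_plan = " | ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + lookup_sql)
)
found_row = conn.execute(lookup_sql).fetchone()

print("plan:", pk_plan)
print("row: ", found_row)
```

The plan should report a search on the primary key rather than a scan of the whole table.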
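You might be curious about how it accomplishes this. Conceptually, indexes rely on an efficient search technique called binary search. (In practice, most databases store indexes as B-trees, which apply the same idea of repeatedly narrowing the search range over sorted keys.)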
Binary search stands out as a highly effective algorithm for locating an item within a sorted list. Its methodology involves iteratively dividing the data in half and assessing whether the target entry, sought through a query, is positioned before or after the middle entry in the dataset.
If the query value is less than the middle entry, the search narrows down to the lower half; otherwise, it focuses on the upper half.
This process continues until the desired value is located. The brilliance of binary search lies in its ability to minimize the number of searches required, resulting in faster query execution.
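The halving procedure described above can be written in a few lines. The following is a minimal sketch over a sorted Python list, with a step counter added purely for illustration; the list of one million consecutive ids is a made-up stand-in for an indexed column.

```python
def binary_search(sorted_values, target):
    """Return (position, steps); position is -1 if target is absent."""
    low, high = 0, len(sorted_values) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            return mid, steps
        if target < sorted_values[mid]:
            high = mid - 1   # target can only be in the lower half
        else:
            low = mid + 1    # target can only be in the upper half
    return -1, steps

# Hypothetical sorted "column" of one million ids.
product_ids = list(range(1, 1_000_001))

position, steps_taken = binary_search(product_ids, 3)
missing_position, missing_steps = binary_search(product_ids, 0)

print("found id 3 at position", position, "after", steps_taken, "steps")
print("id 0 absent, gave up after", missing_steps, "steps")
```

Even when the value is missing, the loop stops after a number of steps that grows with the logarithm of the list size, not with the size itself.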
The following table helps to understand the impact of binary search in terms of number of searches:
Likewise, for our dataset of 12 million rows, a binary search needs at most 24 searches, compared with up to 12 million in the worst case for a row-by-row scan.
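The figure of 24 follows from the fact that each step halves the remaining range, so the worst case for n sorted entries is floor(log2(n)) + 1 steps. The short sketch below computes that bound for a few row counts; apart from 12 million, the sizes are arbitrary examples chosen for illustration.

```python
def max_binary_search_steps(n):
    """Worst-case steps for binary search over n sorted entries (n >= 1).

    For a positive integer, bit_length() equals floor(log2(n)) + 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n.bit_length()

# Example row counts; only 12,000,000 comes from the article's dataset.
row_counts = [10, 1_000, 1_000_000, 12_000_000]
worst_case_steps = {n: max_binary_search_steps(n) for n in row_counts}

for n, steps in worst_case_steps.items():
    print(f"{n:>12,} rows: at most {steps} binary-search steps "
          f"vs up to {n:,} for a linear scan")
```

Doubling the number of rows adds only one more step to the worst case.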
This underscores the formidable efficiency and power of indexes in optimizing data retrieval processes.
Now, the challenge is to extend the benefits of indexing beyond the primary key, and the solution lies in non-clustered indexes.
All the queries we initially explored to enhance query performance relied on non-clustered indexes—indexes that need to be explicitly defined.
A non-clustered index is stored separately from the actual data in the table. It works much like the index pages of a book, as mentioned earlier: the index pages sit in one place, while the contents of the book are in another. Because the index is a separate structure, a table can have more than one non-clustered index.
But how is this achieved?
Consider crafting a query that involves searching for an entry in a column for which you've already established a non-clustered index. This type of index inherently encompasses:
1. Column entries for which the index is created.
2. Addresses of the corresponding rows in the main table to which the column entries belong.
You can see this visually in the left mini-table in the figure:
Let me explain this using a query.
CREATE INDEX product_category_index
ON product (category);
SELECT product_name, category, price
FROM product
WHERE category = 'electronics';
The database operates through three key steps:
1. **Firstly:** It goes to the non-clustered index (here, 'product_category_index') and locates the entry you searched for (e.g., category = 'electronics') using the efficient binary search method.
2. **Secondly:** It reads the address of the main-table row that belongs to that index entry.
3. **Finally:** It goes to that row in the main table and retrieves the other column values your query needs (e.g., product_name, price).
It's important to note that a non-clustered index involves an additional step compared to a clustered index—it requires finding the address and then going to the corresponding row in the main table. This additional step makes non-clustered indexes relatively slower than their clustered counterparts.
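The three steps above can be mimicked with ordinary Python data structures. This is a simplified model under stated assumptions, not a description of any real storage engine: the "main table" is a list whose positions play the role of row addresses, the index is a separately stored sorted list of (category, address) pairs, and all rows are invented.

```python
from bisect import bisect_left, bisect_right

# "Main table": list position stands in for the row address.
# Hypothetical rows: (product_id, product_name, category, price).
main_table = [
    (1, "desk lamp", "home", 19.5),
    (2, "earbuds", "electronics", 25.0),
    (3, "novel", "books", 8.0),
    (4, "webcam", "electronics", 40.0),
    (5, "puzzle", "toys", 12.0),
]

# "Non-clustered index": sorted (category, row address) pairs, kept
# separately from the table itself.
category_index = sorted(
    (row[2], address) for address, row in enumerate(main_table)
)
index_keys = [key for key, _address in category_index]

def select_by_category(category):
    """Return (product_name, category, price) for rows in `category`."""
    # Step 1: binary-search the sorted index for the requested value.
    start = bisect_left(index_keys, category)
    stop = bisect_right(index_keys, category)
    # Step 2: read the row addresses stored next to the matching entries.
    addresses = [address for _key, address in category_index[start:stop]]
    # Step 3: follow each address into the main table for the other columns.
    return [main_table[a][1:] for a in addresses]

electronics_rows = select_by_category("electronics")
print(electronics_rows)
```

The extra hop in step 3 is the additional cost that a lookup through a separate index pays compared with a lookup on the key that organizes the table itself.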
CREATE TABLE demo (
    id INT NOT NULL,
    first_name VARCHAR(25) NOT NULL,
    last_name VARCHAR(35) NOT NULL,
    age INT NOT NULL,
    PRIMARY KEY (id)
);
/* Index on First Name */
CREATE INDEX demo_fname ON demo (first_name);
/* Will tell us whether our query uses the intended index */
EXPLAIN SELECT * FROM demo WHERE first_name = 'Donald' \G
Output
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: demo
partitions: NULL
type: ref
possible_keys: demo_fname
key: demo_fname
key_len: 27
ref: const
rows: 1
filtered: 100.00
Extra: NULL
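To benefit from the index, keep the indexed column isolated in the condition: it should appear on its own, not wrapped in a function or made part of an expression. For example, WHERE first_name = 'Donald' can use demo_fname, whereas WHERE UPPER(first_name) = 'DONALD' generally cannot, unless the DBMS supports an index defined on that expression.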
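The effect of wrapping the column in a function can be checked by comparing query plans. The sketch below assumes SQLite via Python's sqlite3 module instead of the MySQL session shown above, and the inserted names are invented sample data; the same comparison can be made in MySQL with EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo ("
    "id INT NOT NULL, first_name VARCHAR(25) NOT NULL, "
    "last_name VARCHAR(35) NOT NULL, age INT NOT NULL, PRIMARY KEY (id))"
)
conn.execute("CREATE INDEX demo_fname ON demo (first_name)")
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?, ?)",
    [(1, "Donald", "Knuth", 80), (2, "Ada", "Lovelace", 36),
     (3, "Grace", "Hopper", 85)],
)

def plan_text(sql):
    """Return the planner's human-readable steps as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Column left on its own: the planner can search the index.
isolated_plan = plan_text("SELECT * FROM demo WHERE first_name = 'Donald'")
# Column wrapped in a function: the plain index on first_name no longer applies.
wrapped_plan = plan_text("SELECT * FROM demo WHERE upper(first_name) = 'DONALD'")

print("isolated:", isolated_plan)
print("wrapped: ", wrapped_plan)
```

If the second plan shows a scan instead of the index name, rewriting the condition so the column stands alone (or normalizing the stored values) lets the index be used again.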