Offloading Data from Oracle to HDFS Using Sqoop: A Case Study
At my current client, we faced a significant challenge: we needed to move historical data from an Oracle database to the Hadoop Distributed File System (HDFS) to cut storage costs and improve data-processing efficiency. The client's Oracle database had grown substantially over the years, driving up storage costs and degrading performance. By offloading this data to HDFS, we aimed to take advantage of Hadoop's scalable storage and processing capabilities.
Here’s a detailed look at the procedure we used to achieve this migration using Apache Sqoop.
Prerequisites
Before starting, ensure you have the following:
1. Oracle Database: Access to the Oracle database with the necessary read permissions.
2. Hadoop Cluster: A running Hadoop cluster with HDFS configured.
3. Sqoop Installed: Sqoop installed and configured on the Hadoop cluster.
4. JDBC Connector: The Oracle JDBC driver downloaded and placed in Sqoop's lib directory.
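A quick sanity check before starting can save time later; the commands below are a minimal sketch, assuming the Hadoop and Sqoop binaries are already on the `PATH`.
```bash
# Confirm the tooling is available before starting (assumes a standard install)
hadoop version     # Hadoop client is installed
hdfs dfs -ls /     # HDFS is reachable from this node
sqoop version      # Sqoop is installed and configured
```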
Steps to Offload Data
1. Install and Configure Sqoop
- Download and install Sqoop on the Hadoop cluster.
- Place the Oracle JDBC driver (e.g., `ojdbc8.jar`) in the `$SQOOP_HOME/lib` directory.
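Once the driver is in place, it is worth confirming that Sqoop can actually reach Oracle before running a full import. The sketch below uses Sqoop's `list-tables` tool with the same placeholder connection details used later; `$SQOOP_HOME` is assumed to point at the Sqoop installation.
```bash
# Copy the Oracle JDBC driver into Sqoop's lib directory
cp ojdbc8.jar "$SQOOP_HOME/lib/"

# Verify connectivity by listing the tables visible to this user
sqoop list-tables \
--connect jdbc:oracle:thin:@//hostname:port/service_name \
--username your_username \
-P
```
The `-P` flag prompts for the password interactively, which keeps it out of the shell history.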
2. Create a Directory in HDFS
- Create a target directory in HDFS where the data from Oracle will be stored.
```bash
# -p creates parent directories if they do not already exist
hdfs dfs -mkdir -p /user/yourusername/oracle_data
```
3. Run the Sqoop Import Command
- Use the Sqoop command to import data from Oracle to HDFS. Below is an example command:
```bash
sqoop import \
--connect jdbc:oracle:thin:@//hostname:port/service_name \
--username your_username \
--password your_password \
--table your_table_name \
--target-dir /user/yourusername/oracle_data/your_table_name \
--num-mappers 1 \
--split-by primary_key_column \
--as-parquetfile
```
- Parameters:
- `--connect`: JDBC connection string for Oracle.
- `--username`: Oracle database username.
- `--password`: Oracle database password.
- `--table`: The table name in Oracle to be imported.
- `--target-dir`: HDFS directory where data will be stored.
- `--num-mappers`: Number of parallel map tasks (increase for faster imports of large tables).
- `--split-by`: Column used to split the data across mappers (typically the primary key); only relevant when more than one mapper is used.
- `--as-parquetfile`: Stores the imported data as Parquet files (use `--as-textfile` for plain text).
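Because the goal here was to offload historical data, you often want only the older rows rather than the whole table. Sqoop's `--where` argument pushes a filter down to Oracle; in the sketch below the `CREATED_DATE` column and the cutoff date are hypothetical and would need to match the actual table.
```bash
# Import only rows older than a cutoff date (CREATED_DATE is a hypothetical column)
sqoop import \
--connect jdbc:oracle:thin:@//hostname:port/service_name \
--username your_username \
-P \
--table your_table_name \
--where "CREATED_DATE < DATE '2020-01-01'" \
--target-dir /user/yourusername/oracle_data/your_table_name_historical \
--num-mappers 4 \
--split-by primary_key_column \
--as-parquetfile
```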
4. Verify the Imported Data
- Check the imported data in HDFS to ensure the process completed successfully.
```bash
hdfs dfs -ls /user/yourusername/oracle_data/your_table_name
hdfs dfs -cat /user/yourusername/oracle_data/your_table_name/part-m-00000
```
Note that `-cat` only produces readable output for text-format imports; Parquet files are binary.
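A simple completeness check is to compare the source row count with what landed in HDFS. The sketch below queries Oracle through Sqoop's `eval` tool; the HDFS-side count assumes a text-format import, since Parquet data is easier to count with a Parquet-aware tool such as Hive or Spark.
```bash
# Row count on the Oracle side
sqoop eval \
--connect jdbc:oracle:thin:@//hostname:port/service_name \
--username your_username \
-P \
--query "SELECT COUNT(*) FROM your_table_name"

# Record count on the HDFS side (text-format imports only)
hdfs dfs -cat /user/yourusername/oracle_data/your_table_name/part-m-* | wc -l
```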
Example
Assuming the Oracle database is running on `oracle.example.com`, port `1521`, with a service name `ORCL`, and you want to import the table `EMPLOYEES`:
```bash
sqoop import \
--connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
--username scott \
--password tiger \
--table EMPLOYEES \
--target-dir /user/yourusername/oracle_data/EMPLOYEES \
--num-mappers 4 \
--split-by EMPLOYEE_ID \
--as-parquetfile
```
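One caveat about the example above: passing `--password tiger` on the command line leaves the credential in the shell history and process list. A safer pattern, sketched below under the assumption that a dedicated password file is acceptable in your environment, is Sqoop's `--password-file` option (the HDFS path is just an example).
```bash
# Store the password in a file on HDFS readable only by the importing user
echo -n "tiger" | hdfs dfs -put - /user/yourusername/.oracle_password
hdfs dfs -chmod 400 /user/yourusername/.oracle_password

# Reference the file instead of passing the password inline
sqoop import \
--connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
--username scott \
--password-file /user/yourusername/.oracle_password \
--table EMPLOYEES \
--target-dir /user/yourusername/oracle_data/EMPLOYEES \
--num-mappers 4 \
--split-by EMPLOYEE_ID \
--as-parquetfile
```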
Handling Large Data Sets
For large datasets, consider using Sqoop’s incremental import feature to offload data regularly:
```bash
sqoop import \
--connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
--username scott \
--password tiger \
--table EMPLOYEES \
--target-dir /user/yourusername/oracle_data/EMPLOYEES \
--incremental append \
--check-column EMPLOYEE_ID \
--last-value 1000
```
This imports only rows whose `EMPLOYEE_ID` is greater than the supplied `--last-value`; at the end of each run Sqoop prints the new high-water mark to use as `--last-value` for the next import.
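Tracking `--last-value` by hand becomes tedious once the incremental import runs on a schedule. Sqoop's saved-job facility stores the last imported value in its metastore and updates it after every run; the sketch below shows the general shape, reusing the same connection details (the job name is arbitrary).
```bash
# Create a saved job that remembers the incremental state between runs
sqoop job --create employees_incremental -- import \
--connect jdbc:oracle:thin:@//oracle.example.com:1521/ORCL \
--username scott \
-P \
--table EMPLOYEES \
--target-dir /user/yourusername/oracle_data/EMPLOYEES \
--incremental append \
--check-column EMPLOYEE_ID \
--last-value 0

# Each execution imports only the rows added since the previous run
sqoop job --exec employees_incremental
```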
You can also enable compression to reduce storage space by adding the following arguments to the import command:
```bash
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
```
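For a text-format import you can confirm that compression took effect simply by listing the output: the part files pick up the codec's extension. Parquet handles compression internally, so its file names do not change.
```bash
hdfs dfs -ls /user/yourusername/oracle_data/your_table_name
# A compressed text import shows files like part-m-00000.snappy
```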
Conclusion
By following these steps, we successfully offloaded the client's historical data from Oracle to HDFS. This not only reduced storage costs but also improved data-processing capability by leveraging Hadoop's scalable architecture. Sqoop proved invaluable in this migration, offering a straightforward and efficient way to move data from relational databases to big data platforms.