Kerberos consists of a central Key Distribution Center (KDC), which hosts two services – the Authentication Server (AS) and the Ticket Granting Server (TGS). Authentication happens in two steps – client authentication and user authentication. The Authentication Server is responsible for authenticating the client and issuing a Ticket Granting Ticket (TGT). Given a valid TGT, the Ticket Granting Service (TGS) then authenticates the user whenever the user communicates with a Kerberos-enabled service. The KDC contains a database of principals and their keys, somewhat similar to /etc/passwd. The following diagram shows an example of accessing /user/user_hdfs/data_details.txt. In this case the user user_hdfs wants to execute the command hadoop fs -get /user/user_hdfs/data_details.txt. In secure mode, the HDFS namenode and datanode will not permit any communication that does not carry a valid Kerberos ticket.
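For example, before running the command above, the user must first obtain a TGT. A minimal sketch (the user name and file path are taken from the scenario above; the default realm is assumed to be configured on the client):

# obtain a TGT from the Authentication Server for this user
kinit user_hdfs
# show the cached TGT (a krbtgt entry for the default realm)
klist
# the HDFS client uses the cached TGT to request a service ticket via the TGS
hadoop fs -get /user/user_hdfs/data_details.txt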
Cloudera Security with Kerberos – Setup
In Hadoop, two forms of authentication take place with respect to Kerberos: the authentication of nodes within the cluster, to ensure that only trusted machines are part of the cluster, and the authentication of users who access the cluster to interact with its services. HDFS and MapReduce follow the same architecture. Each daemon in the cluster is given a unique principal, so it is necessary to create a principal for each daemon on each host in the cluster, along with a keytab that is stored on the local disk. When Hadoop runs in secure mode, multiple principals are used, and the format of each principal is service-name/hostname@KRB.REALM.COM. The service-name is hdfs for HDFS and mapred for MapReduce. The following points need to be considered while implementing Kerberos security in Hadoop.
Since the datanode and tasktracker run on the same machine, two principals need to be generated: one for the datanode and one for the tasktracker.
A principal also needs to be created for each user in Hadoop, including hdfs and mapred.
The keytab file, which stores the keys for the principals, must be protected with proper file system permissions.
Users performing HDFS operations or running MapReduce programs must be authenticated.
When the daemons start up, the keytab file is used to authenticate with the KDC and obtain a ticket. These tickets are then used to connect to the namenode and jobtracker; datanodes and tasktrackers will not be able to connect to the namenode or jobtracker without a valid ticket (see the sketch after this list).
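As a minimal sketch of how a daemon-style principal and its keytab are used (the path, the hdfs:hadoop ownership, and the ADD.COM realm are assumptions based on the examples later in this document):

# protect the keytab with proper ownership and file system permissions
chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
chmod 400 /etc/hadoop/conf/hdfs.keytab
# list the principals stored in the keytab
klist -kt /etc/hadoop/conf/hdfs.keytab
# authenticate non-interactively from the keytab, as a daemon does at startup
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/$(hostname -f)@ADD.COM
klist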
Configuring Hadoop Security:
Before we configure Hadoop security with Kerberos, the following are the important points to remember:
Hadoop security has to be implemented for all of Hadoop's existing processes, so it is necessary to identify all of them, including the Hadoop admin scripts and tools.
Before implementing Hadoop security, make sure the Hadoop cluster is up and running properly.
Use MIT Kerberos with Hadoop, as it is well tested. Verify that the basic Kerberos operations work, such as authenticating and receiving a ticket-granting ticket (see the pre-flight sketch after this list).
Make sure each hostname is correct and consistent. Each Hadoop daemon has its own principal that is used for authentication, and since hostnames are part of the principals, all hostnames must be consistent and known at the time the principals are created.
Each daemon on each host of the cluster must have a distinct Kerberos principal.
Export principal keys to keytabs and distribute them to the proper cluster nodes.
Update the Hadoop configuration files after the principals have been generated.
Restart all the Hadoop services and test them.
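A quick pre-flight check covering the points above (consistent hostnames and working basic Kerberos operations) might look like the following sketch; testuser is a placeholder principal:

# the fully qualified hostname must match the name used in the principals
hostname -f
# forward DNS lookup should resolve consistently on every node
host $(hostname -f)
# authenticate and confirm that a ticket-granting ticket is received
kinit testuser@ADD.COM
klist
kdestroy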
Steps to implement Kerberos Hadoop security:
Open the existing KDC configuration file (kdc.conf) and define the default realm.
The sample contents are as follows:
[kdcdefaults]
 v4_mode = nopreauth
 kdc_tcp_ports = 88,750

[realms]
 ADD.COM = {
  database_name = /var/kerberos/krb5kdc/principal
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  key_stash_file = /var/kerberos/krb5kdc/stash
  kdc_ports = 750,88
  max_life = 8h 0m 0s
  max_renewable_life = 1d 0h 0m 0s
  master_key_type = des3-hmac-sha1
  supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
  default_principal_flags = +preauth
 }
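Before any principals can be created, the KDC database must exist and the KDC services must be running. On an MIT Kerberos installation on a Red Hat style system this typically looks like the following sketch (service names can differ by platform):

# create the KDC database for the realm and stash the master key
kdb5_util create -r ADD.COM -s
# start the KDC and the kadmin administration service
service krb5kdc start
service kadmin start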
Create the principals on the KDC server. These principals are created through the kadmin command prompt, as shown below.
As described, the format of a principal is hdfs/hadoop01.mydomain.com@HADOOP.CLOUDERA.COM, which denotes the hdfs service on host hadoop01.mydomain.com in the realm HADOOP.CLOUDERA.COM. The following are the sample principals created on the ADD cluster.
addprinc -randkey hdfs/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey hdfs/100-164-178-27.cloud.opsource.net@ADD.COM
addprinc -randkey mapred/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey mapred/100-164-178-27.cloud.opsource.net@ADD.COM
addprinc -randkey HTTP/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey HTTP/100-164-178-27.cloud.opsource.net@ADD.COM
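These addprinc commands are issued from inside the kadmin prompt on the KDC host, for example (a sketch):

kadmin.local
kadmin.local:  addprinc -randkey hdfs/100-164-178-26.cloud.opsource.net@ADD.COM
kadmin.local:  listprincs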
Create the key tables (keytabs) for the principals.
xst -norandkey -k hdfs.keytab hdfs/INFCSPAD0006 HTTP/INFCSPAD0006
xst -norandkey -k hdfs.keytab hdfs/INFCSPAD0007 HTTP/INFCSPAD0007
xst -norandkey -k mapred.keytab mapred/INFCSPAD0006 HTTP/INFCSPAD0006
xst -norandkey -k mapred.keytab mapred/INFCSPAD0007 HTTP/INFCSPAD0007
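Once exported, each keytab must be copied to the node it belongs to and locked down. A sketch, assuming the /etc/hadoop/conf location used in the configuration below (the hostname is illustrative):

# inspect the exported keytab and its encryption types
klist -kt -e hdfs.keytab
# copy the keytab to the matching cluster node and restrict access
scp hdfs.keytab INFCSPAD0006:/etc/hadoop/conf/hdfs.keytab
ssh INFCSPAD0006 'chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab && chmod 400 /etc/hadoop/conf/hdfs.keytab'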
Update the Hadoop configuration files
This includes the security configuration for the namenode, datanode, and secondary namenode.
Namenode security
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@ADD.COM</value>
</property>
Datanode Security
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
Secondary namenode security
<property>
  <name>dfs.secondary.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@ADD.COM</value>
</property>
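In addition to the HDFS properties above, the cluster-wide core-site.xml must have hadoop.security.authentication set to kerberos (and normally hadoop.security.authorization set to true); Cloudera Manager applies this when Kerberos is enabled. A quick way to confirm it in the deployed client configuration (a sketch):

# show the security-related properties in the deployed client configuration
grep -A 1 "hadoop.security.auth" /etc/hadoop/conf/core-site.xml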
Restart the server and deploy the client configuration through Cloudera Manager.
After restarting, start all the services through Cloudera Manager.
Testing the Hadoop security settings
Check whether all the principals have been created correctly. To check this, log in to the namenode and run the listprincs command from the kadmin prompt.
[adduser@INFCSPAD0006 ~]$ kadmin.local
kadmin.local:  listprincs
Try to execute the Hadoop command hadoop fs -ls /
Without a Kerberos ticket, the above command should fail with a PriviledgedActionException, as follows.
[adduser@INFCSPAD0006 ~]$ hadoop fs -ls /
13/08/30 05:23:13 ERROR security.UserGroupInformation: PriviledgedActionException as:adduser (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
13/08/30 05:23:13 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
13/08/30 05:23:13 ERROR security.UserGroupInformation: PriviledgedActionException as:adduser (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "INFCSPAD0006/10.164.178.26"; destination host is: "10-164-178-26.cloud.opsource.net":8020;
Initialize Kerberos for the principal (authuser) and run the same command again.
You should be able to get the following output.
[adduser@INFCSPAD0006 ~]$ kinit authuser
Password for authuser@ADD.COM:
[adduser@INFCSPAD0006 ~]$ hadoop fs -ls /
Found 2 items
drwxrwxrwt   - hdfs supergroup          0 2013-08-28 14:19 /tmp
drwxr-xr-x   - hdfs supergroup          0 2013-08-28 14:40 /user
[adduser@INFCSPAD0006 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_509
Default principal: authuser@ADD.COM

Valid starting     Expires            Service principal
08/30/13 05:23:28  08/30/13 13:23:28  krbtgt/ADD.COM@ADD.COM
        renew until 08/31/13 05:23:25

Kerberos 4 ticket cache: /tmp/tkt509
klist: You have no tickets cached
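To double-check that the cluster is really enforcing Kerberos, destroy the cached ticket and retry; the command should fail again with the GSS initiate error shown earlier (a sketch):

# discard the cached ticket
kdestroy
# this should now fail again with "No valid credentials provided"
hadoop fs -ls /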
Kerberos with Talend Setup
Talend provides a powerful and versatile open source big data product that makes the job of working with big data technologies easy and helps drive and improve business performance, without the need for specialist knowledge or resources. Talend’s big data product combines big data components for MapReduce 2.0 (YARN), Hadoop, HBase, Hive, HCatalog, Oozie, Sqoop and Pig into a unified open source environment so we can quickly load, extract, transform and process large and diverse data sets from disparate systems.
Talend provides an easy-to-use graphical environment that allows developers to visually map big data sources and targets without the need to learn and write complicated code. Running 100% natively on Hadoop, Talend Big Data provides massive scalability. Once a big data connection is configured, the underlying code is automatically generated and can be deployed remotely as a job that runs natively on your big data cluster – HDFS, Pig, HCatalog, HBase, Sqoop or Hive. Talend's big data components have been tested and certified to work with leading big data Hadoop distributions, including Amazon EMR, Cloudera, IBM PureData, Hortonworks, MapR, Pivotal Greenplum, Pivotal HD, and SAP HANA. Talend provides out-of-the-box support for big data platforms from the leading appliance vendors, including Greenplum/Pivotal, Netezza, Teradata, and Vertica.
Developers can use the Talend Studio without restrictions. Because Talend's big data products rely on standard Hadoop APIs, developers can easily migrate data integration jobs between different Hadoop distributions without any concerns about underlying platform dependencies. Support for Apache Oozie is provided out of the box, allowing operators to schedule their data jobs through open source software. With 450+ connectors, Talend integrates almost any data source, so we can transform and integrate data in real time or batch. Pre-built connectors for HBase, MongoDB, Cassandra, CouchDB, Couchbase, Neo4J and Riak speed development without requiring specific NoSQL knowledge. Talend big data components can be configured to bulk upload data to Hadoop or another big data appliance, either as a manual process or on an automatic schedule for incremental data updates.
Talend Job with Kerberos
Steps to configure and run a Talend Kerberos setup
Prerequisites:
Create an Edge node that has Hadoop access.
Using the settings defined above, enable Kerberos for the new edge node.
Once Kerberos is enabled, set up the keytab using the following command: