Kerberos consists of a central Key Distribution Center (KDC), which hosts two services – the Authentication Server (AS) and the Ticket Granting Server (TGS). Authentication happens in two steps – client authentication and user authentication. The Authentication Server is responsible for authenticating the client and issuing a Ticket Granting Ticket (TGT). Given a valid TGT, the Ticket Granting Service (TGS) then authenticates the user whenever the user communicates with a Kerberos-enabled service. The KDC contains a database of principals and their keys, somewhat similar to /etc/passwd. The following diagram shows an example of accessing /user/user_hdfs/data_details.txt. In this case the user user_hdfs wants to execute the command hadoop fs -get /user/user_hdfs/data_details.txt. In secure mode, the HDFS namenode and datanode will not permit any communication that does not carry a valid Kerberos ticket.
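For example, before running the command above, the user must first obtain a TGT. A minimal sketch (the user name and file path are taken from the scenario above; the default realm is assumed to be configured on the client):

# obtain a TGT from the Authentication Server for this user
kinit user_hdfs
# show the cached TGT (a krbtgt entry for the default realm)
klist
# the HDFS client uses the cached TGT to request a service ticket via the TGS
hadoop fs -get /user/user_hdfs/data_details.txt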
Cloudera Security with Kerberos – Setup
In Hadoop, two forms of authentication take place with respect to Kerberos: the authentication of nodes within the cluster, to ensure that only trusted machines are part of the cluster, and the authentication of users who access the cluster to interact with its services. HDFS and MapReduce follow the same architecture. Each daemon in the cluster is given a unique principal, so it is necessary to create a principal for each daemon on each host in the cluster, along with a keytab that is stored on the local disk. When Hadoop runs in secure mode, multiple principals are used, and the format of each principal is service-name/hostname@KRB.REALM.COM. The service-name is hdfs for HDFS and mapred for MapReduce. The following points need to be considered while implementing Kerberos security in Hadoop.
Since the datanode and tasktracker run on the same machine, two principals need to be generated: one for the datanode and one for the tasktracker.
A principal also needs to be created for each user in Hadoop, including hdfs and mapred.
The keytab file, which stores the keys for the principals, must be protected with proper file system permissions.
Users performing HDFS operations or running MapReduce programs must be authenticated.
When the daemons start up, the keytab file is used to authenticate with the KDC and obtain a ticket. These tickets are then used to connect to the namenode and jobtracker; datanodes and tasktrackers will not be able to connect to the namenode or jobtracker without a valid ticket (see the sketch after this list).
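As a minimal sketch of how a daemon-style principal and its keytab are used (the path, the hdfs:hadoop ownership, and the ADD.COM realm are assumptions based on the examples later in this document):

# protect the keytab with proper ownership and file system permissions
chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab
chmod 400 /etc/hadoop/conf/hdfs.keytab
# list the principals stored in the keytab
klist -kt /etc/hadoop/conf/hdfs.keytab
# authenticate non-interactively from the keytab, as a daemon does at startup
kinit -kt /etc/hadoop/conf/hdfs.keytab hdfs/$(hostname -f)@ADD.COM
klist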
Configuring Hadoop Security:
Before we configure Hadoop security with Kerberos, the following are the important points to remember:
Hadoop security has to be implemented for all of Hadoop's existing processes, so it is necessary to identify all of them, including the Hadoop admin scripts and tools.
Before implementing Hadoop security, make sure the Hadoop cluster is up and running properly.
Use MIT Kerberos with Hadoop, as it is well tested. Verify that the basic Kerberos operations work, such as authenticating and receiving a ticket-granting ticket (see the pre-flight sketch after this list).
Make sure each hostname is correct and consistent. Each Hadoop daemon has its own principal that is used for authentication, and since hostnames are part of the principals, all hostnames must be consistent and known at the time the principals are created.
Each daemon on each host of the cluster must have a distinct Kerberos principal.
Export principal keys to keytabs and distribute them to the proper cluster nodes.
Update the Hadoop configuration files after the principals have been generated.
Restart all the Hadoop services and test them.
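A quick pre-flight check covering the points above (consistent hostnames and working basic Kerberos operations) might look like the following sketch; testuser is a placeholder principal:

# the fully qualified hostname must match the name used in the principals
hostname -f
# forward DNS lookup should resolve consistently on every node
host $(hostname -f)
# authenticate and confirm that a ticket-granting ticket is received
kinit testuser@ADD.COM
klist
kdestroy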
Steps to implement Kerberos Hadoop security:
Open the existing KDC configuration file (kdc.conf) and define the default realm.
The sample contents are as follows:
[kdcdefaults]
 v4_mode = nopreauth
 kdc_tcp_ports = 88,750

[realms]
 ADD.COM = {
  database_name = /var/kerberos/krb5kdc/principal
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  key_stash_file = /var/kerberos/krb5kdc/stash
  kdc_ports = 750,88
  max_life = 8h 0m 0s
  max_renewable_life = 1d 0h 0m 0s
  master_key_type = des3-hmac-sha1
  supported_enctypes = des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
  default_principal_flags = +preauth
 }
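Before any principals can be created, the KDC database must exist and the KDC services must be running. On an MIT Kerberos installation on a Red Hat style system this typically looks like the following sketch (service names can differ by platform):

# create the KDC database for the realm and stash the master key
kdb5_util create -r ADD.COM -s
# start the KDC and the kadmin administration service
service krb5kdc start
service kadmin start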
Create the principals on the KDC server. These principals are created through the kadmin command prompt, as shown below.
As described, the format of a principal is hdfs/hadoop01.mydomain.com@HADOOP.CLOUDERA.COM, which denotes the hdfs service on host hadoop01.mydomain.com in the realm HADOOP.CLOUDERA.COM. The following are the sample principals created on the ADD cluster.
addprinc -randkey hdfs/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey hdfs/100-164-178-27.cloud.opsource.net@ADD.COM
addprinc -randkey mapred/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey mapred/100-164-178-27.cloud.opsource.net@ADD.COM
addprinc -randkey HTTP/100-164-178-26.cloud.opsource.net@ADD.COM
addprinc -randkey HTTP/100-164-178-27.cloud.opsource.net@ADD.COM
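These addprinc commands are issued from inside the kadmin prompt on the KDC host, for example (a sketch):

kadmin.local
kadmin.local:  addprinc -randkey hdfs/100-164-178-26.cloud.opsource.net@ADD.COM
kadmin.local:  listprincs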
Create the key tables (keytabs) for the principals.
xst -norandkey -k hdfs.keytab hdfs/INFCSPAD0006 HTTP/INFCSPAD0006
xst -norandkey -k hdfs.keytab hdfs/INFCSPAD0007 HTTP/INFCSPAD0007
xst -norandkey -k mapred.keytab mapred/INFCSPAD0006 HTTP/INFCSPAD0006
xst -norandkey -k mapred.keytab mapred/INFCSPAD0007 HTTP/INFCSPAD0007
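Once exported, each keytab must be copied to the node it belongs to and locked down. A sketch, assuming the /etc/hadoop/conf location used in the configuration below (the hostname is illustrative):

# inspect the exported keytab and its encryption types
klist -kt -e hdfs.keytab
# copy the keytab to the matching cluster node and restrict access
scp hdfs.keytab INFCSPAD0006:/etc/hadoop/conf/hdfs.keytab
ssh INFCSPAD0006 'chown hdfs:hadoop /etc/hadoop/conf/hdfs.keytab && chmod 400 /etc/hadoop/conf/hdfs.keytab'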
Update the Hadoop configuration files
This includes the security configuration for the namenode, datanode, and secondary namenode.
Namenode security
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@ADD.COM</value>
</property>
Datanode Security
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
Secondary namenode security
<property>
  <name>dfs.secondary.namenode.keytab.file</name>
  <value>/etc/hadoop/conf/hdfs.keytab</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@ADD.COM</value>
</property>
<property>
  <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@ADD.COM</value>
</property>
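In addition to the HDFS properties above, the cluster-wide core-site.xml must have hadoop.security.authentication set to kerberos (and normally hadoop.security.authorization set to true); Cloudera Manager applies this when Kerberos is enabled. A quick way to confirm it in the deployed client configuration (a sketch):

# show the security-related properties in the deployed client configuration
grep -A 1 "hadoop.security.auth" /etc/hadoop/conf/core-site.xml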
Restart the server and deploy the client configuration through Cloudera Manager.
After restarting, start all the services through Cloudera Manager.
Testing the Hadoop security settings
Check whether all the principals have been created correctly. To check this, log in to the namenode and run the listprincs command from the kadmin prompt.
[adduser@INFCSPAD0006 ~]$ kadmin.local
kadmin.local:  listprincs
Try to execute the Hadoop command hadoop fs -ls /
Without a Kerberos ticket, the above command should fail with a PriviledgedActionException, as follows.
[adduser@INFCSPAD0006 ~]$ hadoop fs -ls /
13/08/30 05:23:13 ERROR security.UserGroupInformation: PriviledgedActionException as:adduser (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
13/08/30 05:23:13 WARN ipc.Client: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
13/08/30 05:23:13 ERROR security.UserGroupInformation: PriviledgedActionException as:adduser (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ls: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "INFCSPAD0006/10.164.178.26"; destination host is: "10-164-178-26.cloud.opsource.net":8020;
Initialize Kerberos for the principal (authuser) and run the same command again.
You should be able to get the following output.
[adduser@INFCSPAD0006 ~]$ kinit authuser
Password for authuser@ADD.COM:
[adduser@INFCSPAD0006 ~]$ hadoop fs -ls /
Found 2 items
drwxrwxrwt   - hdfs supergroup          0 2013-08-28 14:19 /tmp
drwxr-xr-x   - hdfs supergroup          0 2013-08-28 14:40 /user
[adduser@INFCSPAD0006 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_509
Default principal: authuser@ADD.COM

Valid starting     Expires            Service principal
08/30/13 05:23:28  08/30/13 13:23:28  krbtgt/ADD.COM@ADD.COM
        renew until 08/31/13 05:23:25

Kerberos 4 ticket cache: /tmp/tkt509
klist: You have no tickets cached
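To double-check that the cluster is really enforcing Kerberos, destroy the cached ticket and retry; the command should fail again with the GSS initiate error shown earlier (a sketch):

# discard the cached ticket
kdestroy
# this should now fail again with "No valid credentials provided"
hadoop fs -ls /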
Kerberos with Talend Setup
Talend provides a powerful and versatile open source big data product that makes the job of working with big data technologies easy and helps drive and improve business performance, without the need for specialist knowledge or resources. Talend’s big data product combines big data components for MapReduce 2.0 (YARN), Hadoop, HBase, Hive, HCatalog, Oozie, Sqoop and Pig into a unified open source environment so we can quickly load, extract, transform and process large and diverse data sets from disparate systems.
Talend provides an easy-to-use graphical environment that allows developers to visually map big data sources and targets without the need to learn and write complicated code. Running 100% natively on Hadoop, Talend Big Data provides massive scalability. Once a big data connection is configured, the underlying code is automatically generated and can be deployed remotely as a job that runs natively on your big data cluster – HDFS, Pig, HCatalog, HBase, Sqoop or Hive. Talend's big data components have been tested and certified to work with leading big data Hadoop distributions, including Amazon EMR, Cloudera, IBM PureData, Hortonworks, MapR, Pivotal Greenplum, Pivotal HD, and SAP HANA. Talend provides out-of-the-box support for big data platforms from the leading appliance vendors, including Greenplum/Pivotal, Netezza, Teradata, and Vertica.
Developers can use the Talend Studio without restrictions. Because Talend's big data products rely on standard Hadoop APIs, developers can easily migrate data integration jobs between different Hadoop distributions without any concerns about underlying platform dependencies. Support for Apache Oozie is provided out of the box, allowing operators to schedule their data jobs through open source software. With 450+ connectors, Talend integrates almost any data source, so we can transform and integrate data in real time or batch. Pre-built connectors for HBase, MongoDB, Cassandra, CouchDB, Couchbase, Neo4J and Riak speed development without requiring specific NoSQL knowledge. Talend big data components can be configured to bulk upload data to Hadoop or another big data appliance, either as a manual process or on an automatic schedule for incremental data updates.
Talend Job with Kerberos
Steps to configure and run a Talend Kerberos setup
Prerequisites:
Create an Edge node that has Hadoop access.
Using the settings defined above, enable Kerberos for the new edge node.
Once Kerberos is enabled, set up the keytab using the following command: