Using perl and Thrift to access HDFS

posted Mar 1, 2011, 5:30 PM by Gino Ledesma   [ updated Mar 1, 2011, 7:37 PM ]
Hadoop is awesome, and the Hadoop Distributed File System (HDFS) is a critical piece used by other Hadoop applications. Hadoop is written entirely in Java, which makes most of the tools and examples on the Internet very Java-centric. That doesn't mean, however, that you need to write Java to use Hadoop.

Thrift makes it possible for non-JVM languages to interact with Hadoop components, such as HBase and HDFS. This guide will walk you through building a Perl client to access files in HDFS.

Getting Started

Assumptions

This covers Hadoop 0.20.2. Later versions may work, barring major API changes.

Prerequisites

You will need the following to build the perl module and client:
  • Thrift
  • Hadoop
You will need the following to run the client:
  • Java
  • HadoopThriftServer

Building the Perl module

  1. Build and install Thrift. For CentOS/RHEL 5 users, I've provided instructions and binary RPMs for easy installation.
  2. Download the Hadoop binary distribution. (Note: Cloudera's CDH distribution does not include the hadoopfs.thrift IDL file required to build the module.)
  3. Unpack the archive:
    tar zxvf hadoop-0.20.2.tar.gz
  4. Download and apply the namespace patch. For details on what the patch does, see my notes below.
    cd hadoop-0.20.2
    patch -p1 < ~/hadoop-thriftfs-namespace.patch
  5. Generate the Thrift module for Perl:
    cd src/contrib/thriftfs/
    rm -rf gen-perl
    thrift --gen perl if/hadoopfs.thrift
  6. The Perl module will be inside the gen-perl/ directory, containing the following files:
    1. HadoopFS/Constants.pm
    2. HadoopFS/FileSystem.pm
    3. HadoopFS/Types.pm
I've also provided binaries containing the Perl module in tar.gz and RPM form (see the attachments at the bottom of this page).
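
To let perl find the generated module, add the gen-perl directory (or wherever you installed it) to @INC. Here's a minimal sanity check, assuming the module was generated into ./gen-perl relative to your script (adjust the path if you installed it elsewhere):

#!/usr/bin/perl
use strict;
use warnings;
use lib './gen-perl';        # assumed location; point this at your install path
use HadoopFS::FileSystem;    # fails at compile time if the module (or Thrift) is missing
print "HadoopFS module loaded\n";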

Writing the client

I've written a demo client that exercises some of the file system operations in HDFS, such as open(), read(), write(), stat(), rm(), and listStatus(). You can download hadoopfs-client.pl from the attachments at the bottom of this page.

#!/usr/bin/perl

use strict;
use warnings;


use Thrift::Socket;
use Thrift::BufferedTransport;
use Thrift::BinaryProtocol;
use HadoopFS::FileSystem;
use Getopt::Long;
use POSIX qw(strftime);

my $host = $ARGV[0] || "localhost";
my $port = $ARGV[1] || 35922;
my $pathname = $ARGV[2] || "/";

my $socket = Thrift::Socket->new ($host, $port);
$socket->setSendTimeout (10000);
$socket->setRecvTimeout (20000);

my $transport = Thrift::BufferedTransport->new ($socket);
my $protocol = Thrift::BinaryProtocol->new ($transport);
my $client = HadoopFS::FileSystemClient->new ($protocol);

eval {
    $transport->open ();
};
if ($@)
{
    print "Unable to connect: $@->{message}\n";
    exit 1;
}

# Create a file and write some data to it
my $filename = "/foo.txt";
print "File: $filename\n";
my $path = HadoopFS::Pathname->new ( { pathname => $filename } );
if ($client->exists ($path))
{
    # Delete the file if it exists
    print "  ... file exists.\n";
    $client->rm ($path);
    print "  ... deleted.\n";
}

print "  ... creating\n";
my $fh = $client->create ( $path );
print "  ... writing data\n";
$client->write ($fh, "Hello, World!\n");
$client->close ($fh);
print "  ... closed\n";

# Read the contents of the file
eval {
    print "  ... re-opening\n";
    $fh = $client->open ($path);
    my $data = undef;
    my $offset = 0;
    my $length = 4096;
    print "  ... reading data\n";
    # Read 4K blocks at a time
    while ($data = $client->read ($fh, $offset, $length))
    {
        print "$filename: $data";
        last if (length ($data) < $length);
        $offset += $length;
        print "\n";
    }
    $client->close ($fh);
    print "  ... closed\n";
};
if ($@)
{
    print $@->{message} . "\n";
}

print "\n";
# Get information about the file we just created
my $file = $client->stat ($path);
print "File stat:\n";
printf(" ... Filename: %s\n", $file->{path});
printf(" ... Length: %s\n", $file->{length});
printf(" ... Block Replication: %s\n", ($file->{blockReplication} ? $file->{blockReplication} : "-"));
printf(" ... Block Size: %s\n", ($file->{blockSize} ? $file->{blockSize} : "-"));
printf(" ... Modification Time: %s\n", strftime ("%Y-%m-%d %H:%M:%S", localtime ($file->{modificationTime} / 1000)));
printf(" ... Directory: %s\n", ($file->{isdir} ? "yes" : "no"));
printf(" ... Permission: %s\n", $file->{permission});
printf(" ... Owner: %s\n", $file->{owner});
printf(" ... Group: %s\n", $file->{group});

print "\n";
# Get the file blocks of this file
print "File Blocks:\n";
my $blocks = $client->getFileBlockLocations ($path, 0, 4096);
foreach my $block (@{$blocks})
{
    printf ("... Offset: %d\n", $block->{offset});
    printf ("... Length: %d\n", $block->{length});
    printf ("... Hosts:\n");
    foreach my $datanode (@{$block->{names}})
    {
        print "      $datanode\n";
    }
}

# Using ls
print "\n";
print "ls $pathname:\n";
$path = HadoopFS::Pathname->new ( { pathname => $pathname } );
my $files = $client->listStatus ( $path );
if ($files && @{$files} > 0)
{
    printf ("Found %d items\n", scalar @{$files});
}
foreach my $entry (@{$files})
{
    my $name = $entry->{path};
    my $hdfshost;
    # Strip the hdfs://<host> prefix so only the path remains
    if ($name =~ m/^hdfs:\/\/([^\/]+)(\/.+)$/)
    {
        $hdfshost = $1;
        $name = $2;
    }
    printf ("%s%s %-3s %-9s %-9s %10s %s %s\n",
        $entry->{isdir} ? "d" : "-",
        $entry->{permission},
        $entry->{blockReplication} ? $entry->{blockReplication} : "-",
        $entry->{owner},
        $entry->{group},
        $entry->{length},
        # modificationTime is in milliseconds, as in the stat output above
        strftime ("%Y-%m-%d %H:%M", localtime ($entry->{modificationTime} / 1000)),
        $name
    );
}

$transport->close ();

exit 0;

I'll cover the Thrift API calls in a separate post, but the above client should give you enough of an idea of what it can do.

Using the client

For the perl client to work, it needs to connect to a running HadoopThriftServer. This server is part of the Hadoop libraries.

Running the HadoopThriftServer

There is a script that starts the HadoopThriftServer in the stock distribution:
cd hadoop-0.20.2/src/contrib/thriftfs/scripts
chmod +x ./start_thrift_server.sh
./start_thrift_server.sh

Note that you will need to include your Hadoop configuration directory, i.e. the one containing the core-site.xml and hdfs-site.xml files, in your CLASSPATH.
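
For example, assuming your configuration lives in /etc/hadoop/conf:

export CLASSPATH=/etc/hadoop/conf:$CLASSPATH
./start_thrift_server.sh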

HadoopThriftServer in CDH3

For those using Cloudera's CDH3 distribution, the HadoopThriftServer class lives in /usr/lib/hadoop/contrib/thriftfs/hadoop-thriftfs-0.20.2+737.jar, but it will not be able to find its thriftfs.api.* dependencies. Those are provided by the hadoopthriftapi.jar and libthrift.jar files shipped in the stock distribution under src/contrib/thriftfs/lib/, so you'll need to add both JAR files to your CLASSPATH. A convenient location for them is /usr/lib/hadoop/contrib/thriftfs/lib/:

sudo cp hadoop-0.20.2/src/contrib/thriftfs/lib/*.jar /usr/lib/hadoop/contrib/thriftfs/lib/

You can use the following script to start HadoopThriftServer:

#!/bin/sh

HADOOP_CONF_DIR=/etc/hadoop/conf/

# hadoop config
CLASSPATH=${HADOOP_CONF_DIR}

# hadoop jars, including the thrift jars copied into contrib/thriftfs/lib/ above
HADOOP_HOME=/usr/lib/hadoop
for f in ${HADOOP_HOME}/*.jar \
         ${HADOOP_HOME}/lib/*.jar \
         ${HADOOP_HOME}/contrib/thriftfs/*.jar \
         ${HADOOP_HOME}/contrib/thriftfs/lib/*.jar
do
    CLASSPATH=${CLASSPATH}:${f}
done

java -Dcom.sun.management.jmxremote -classpath $CLASSPATH org.apache.hadoop.thriftfs.HadoopThriftServer $*

To start the HadoopThriftServer, you can optionally specify a port:
./start_thrift_server.sh 35922
Starting the hadoop thrift server on port [35922]...
11/03/01 16:15:35 INFO hadoop.thrift: Starting the hadoop thrift server on port [35922]...
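
Before pointing the client at it, you can check that the server is actually listening (the port here assumes the example above):

netstat -tln | grep 35922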

Running the client

The hadoopfs-client.pl script takes the following optional positional arguments (an example invocation follows the list):
  • hostname of HadoopThriftServer (default: localhost)
  • port of HadoopThriftServer (default: 35922)
  • HDFS root path to query for ls (default: /)
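For example, to point the client at a specific HadoopThriftServer and list /user (the hostname below is a placeholder):

perl hadoopfs-client.pl hadoop-namenode 35922 /user
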
The following is an example output of running the client:

perl hadoopfs-client.pl 
File: /foo.txt
  ... file exists.
  ... deleted.
  ... creating
  ... writing data
  ... closed
  ... re-opening
  ... reading data
/foo.txt: Hello, World!
  ... closed

File stat:
 ... Filename: hdfs://hadoop-namenode/foo.txt
 ... Length: 14
 ... Block Replication: 6
 ... Block Size: 67108864
 ... Modification Time: 2011-03-01 19:20:21
 ... Directory: no
 ... Permission: rw-r--r--
 ... Owner: gledesma
 ... Group: supergroup

File Blocks:
... Offset: 0
... Length: 14
... Hosts:
      10.16.70.155:50010
      10.16.70.163:50010
      10.16.70.167:50010
      10.16.70.165:50010
      10.16.70.156:50010
      10.16.70.159:50010

ls /:
Found 3 items
-rw-r--r-- 6   gledesma  supergroup         14 2011-03-01 19:20 /foo.txt
drwxr-xr-x -   root      supergroup          0 2011-01-20 14:26 /user
drwxr-xr-x -   hadoop    supergroup          0 2011-01-12 19:00 /var

HDFS Thrift Namespace Patch


The hadoopfs.thrift IDL file only declares namespaces for Java and PHP. Building the Thrift module for other languages, such as Python, Perl, C++, and Ruby, causes the modules to be declared in the top-level namespace. This can lead to namespace collisions (e.g., Pathname and MalformedInputException may already be taken in a given language).

I've adapted Carlos Valiente's patch (part of HDFS-417) to the hadoopfs.thrift IDL file, with some small changes. The original patch does the following (sketched below):
  • Set "hadoopfs" as the namespace for Ruby, Perl, and C++.
  • Rename ThriftHadoopFileSystem to FileSystem. Together with the namespace, this becomes hadoopfs.filesystem or HadoopFS::FileSystem in perl.
  • Rename ThriftHandle to FileHandle
  • Rename ThriftIOException to IOException
  • Rename the following fields of FileStatus:
    • block_replication => blockReplication
    • block_size => blockSize
    • modification_time => modificationTime
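After the patch, the namespace declarations at the top of hadoopfs.thrift look roughly like this (a sketch; the exact identifiers and casing are in the attached patch):

namespace java org.apache.hadoop.thriftfs.api
namespace perl HadoopFS
namespace rb HadoopFS
namespace cpp hadoopfs
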
I've preserved the stock IDL struct definitions for FileHandle (formerly ThriftHandle), Pathname, MalformedInputException, and IOException (formerly ThriftIOException). Carlos's patch adds the field ID to the field keys, but this causes problems when reading from and writing to the Thrift socket: you'll end up with an error along the lines of "TSocket could not read 4 bytes from <hostname>:<port>".
Attachments:
  • hadoop-thriftfs-namespace.patch (5k)
  • hadoopfs-client.pl (3k)
  • perl-HadoopFS-0.20.2-1.el5.noarch.rpm (11k)
  • perl-HadoopFS-0.20.2.tar.gz (15k)