Hadoop is awesome, and the Hadoop Filesystem (HDFS) is a critical piece used by other Hadoop applications. Hadoop is written entirely in Java, which makes most of the tools and examples on the Internet very Java-centric. That doesn't mean you have to write Java to use Hadoop.
Thrift makes it possible for non-JVM languages to interact with Hadoop components, such as HBase and HDFS. This guide will walk you through building a Perl client to access files in HDFS.
This covers Hadoop 0.20.2. Later versions may work, barring major API changes.
You will need the following to build the Perl module and client:
You will need the following to run the client:
I've written a demo client that touches on some of the file system operations in HDFS, such as open(), read(), write(), stat(), rm(), and listStatus(). You can download the hadoopfs-client.pl from this site.
I'll cover the Thrift API calls in a separate post, but the above client should give you enough of an idea of what it can do.
Using the client
For the Perl client to work, it needs to connect to a running HadoopThriftServer. This server is part of the Hadoop libraries.
There is a script that starts the HadoopThriftServer in the stock distribution:
Note that you need to include your Hadoop configuration path in your CLASSPATH. This is the directory that contains the core-site.xml and hdfs-site.xml files.
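As a rough sketch of the CLASSPATH step, the snippet below prepends a configuration directory to CLASSPATH before the server is started. The helper name prepend_classpath and the fallback path /etc/hadoop/conf are assumptions made for illustration, not part of Hadoop; substitute the directory that actually holds your core-site.xml and hdfs-site.xml.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): prepend a directory to CLASSPATH
# without producing an empty entry when CLASSPATH starts out unset or empty.
prepend_classpath() {
    _dir="$1"
    if [ -n "$CLASSPATH" ]; then
        CLASSPATH="$_dir:$CLASSPATH"
    else
        CLASSPATH="$_dir"
    fi
    export CLASSPATH
}

# Assumed default location; point this at the directory that contains
# core-site.xml and hdfs-site.xml on your system.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

prepend_classpath "$HADOOP_CONF_DIR"
echo "CLASSPATH=$CLASSPATH"
```

Run this in the same shell session (or source it) before launching the server script, so the exported variable is visible to the Java process.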
The hadoopfs-client.pl takes the following optional arguments:
The following is an example output of running the client:
drwxr-xr-x - root supergroup 0 43024-09-26 19:05 /user
The hadoopfs.thrift IDL file only has namespaces for Java and PHP. Building the Thrift module for other languages, such as Python, Perl, C++, and Ruby, will cause the modules to be declared in the top-level namespace. This can cause problems because of potential namespace collisions (e.g. Pathname and MalformedInputException may already be taken in that particular language).
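One quick way to see which languages an IDL file covers is to scan it for namespace declarations. The sketch below is only an illustration: has_namespace is a hypothetical helper, and the stand-in IDL file uses placeholder namespace values rather than the real contents of hadoopfs.thrift. To check your own copy, pass its path instead of the temporary file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the IDL file declares a namespace
# for the given Thrift language scope (java, php, perl, py, rb, cpp, ...).
has_namespace() {
    _lang="$1"
    _file="$2"
    grep -Eq "^[[:space:]]*namespace[[:space:]]+${_lang}[[:space:]]" "$_file"
}

# Stand-in IDL file for the demo; the namespace values are placeholders.
idl="$(mktemp)"
cat > "$idl" <<'EOF'
namespace java example.placeholder.api
namespace php example_placeholder
EOF

for scope in java php perl py rb cpp; do
    if has_namespace "$scope" "$idl"; then
        echo "$scope: namespace declared"
    else
        echo "$scope: no namespace, generated code lands in the top level"
    fi
done

rm -f "$idl"
```

Any scope reported as missing is one where generated types such as Pathname would be declared at the top level.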
I've adapted Carlos Valiente's patch (part of HDFS-417) to the hadoopfs.thrift IDL file but made some small changes. The original does a couple of things:
I've preserved the stock IDL struct definitions for FileHandle (formerly ThriftHandle), Pathname, MalformedInputException, and IOException (formerly ThriftIOException). Carlos's patch adds the field ID to the field keys, but this causes problems with reading from and writing to the Thrift socket. You'll end up getting an error along the lines of "TSocket could not read 4 bytes from <hostname>:<port>."