open-issues

Last meeting: Friday 1/14/11 (see Weijia's Slides)

Matt L made the first draft of these notes from the meeting; others can revise/append. Names in bold indicate "to do" items.

Main Issues

    • Slow data transfer to/from HDFS
      • Weijia will email Flip Kromer:
      • Weijia will have TACC un-RAID5 the Hadoop disks (Hadoop already provides redundancy and is robust to disk failure; Flip recommended against RAID because removing it reduces the IO time of jobs and will free up another 0.5 TB per node)
      • Hohyon/Jae Hyeon will investigate use of distcp for parallel transfer (see the upload sketch after this list)
      • currently only getting 100 MB/s, while InfiniBand supports much higher throughput
      • Possible causes: limited low-level IO support in Java? Hadoop routing all data through the namenode? That we simply aren't using distcp?
      • Possible to consider leaving data on nodes across sessions, though this would introduce more problems with scheduling since we would need specific nodes
      • Matt asked if all the time was spent sending bits across the wire or if some of it was spent converting between NFS and HDFS block structures -- Weijia thought the latter would be minimal compared to the former
      • This issue is now answered:
        • an HDFS upload pushes all of the data through the single uploading client out to the nodes, so upload speed does not scale with the number of nodes
    • Difficulty getting cluster time (scheduling nodes) due to contention with other users
      • Weijia expects demand will decrease (1) when the new Lonestar4 cluster comes online in February and (2) when the viz conference deadline passes at the end of March
      • Demand will increase, however, as we invite the UT community to join us
      • Yinon suggested one could possibly scale the number of nodes dynamically -- i.e., start with however many nodes are free, keep submitting qsub requests for one additional node, and once a request is granted, dynamically bring the new node into the namenode's knowledge
      • Jason and Matt will plan now on using Amazon EC2 as backup for their course to ensure students have compute time when they need it (i.e. at the last minute before their assignments are due)
    • Need working examples to get people started
    • File cleanup is not happening automatically
      • Weijia to investigate
      • scripts should be cleaning up files when shutting down Hadoop but aren't
    • Need some metrics / standard benchmark numbers
      • Hohyon / Jae Hyeon / Weijia to do (e.g., perhaps we already have some statistics on the jobs Hohyon and Jae Hyeon have been running that would show where the bottlenecks are? Have they used HPROF to profile their jobs? See the profiling/compression sketch after this list.)
      • Basically we need to know that the cluster is set up correctly and where the bottlenecks are -- for example, with the current data transfer problems, a benchmark would show that we differ from expected performance -- this will be useful for catching more subtle issues and opportunities for optimization
      • Comparison across the different hard disk speeds on the /hadoop and /largemem machines
      • Metrics will suggest tuning, e.g., of the default Hadoop block size
      • For data transfer during MapReduce, use of combiners and compression to minimize IO
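
Sketch for the distcp item above: distcp runs the copy as its own MapReduce job, so multiple map tasks move files in parallel instead of funneling everything through one client the way a plain "hadoop fs -put" does. As a smaller illustration of the same bottleneck, below is a rough, untested Java sketch that uploads a directory of local files into HDFS from a thread pool; the class name, thread count, and argument layout are made up for illustration, and even this only parallelizes within a single client node, which is why distcp remains the thing to investigate.

    // Rough sketch only -- not something we have run on Longhorn.
    // Uploads every file in a local directory to HDFS using several threads.
    import java.io.File;
    import java.io.IOException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelHdfsUpload {
        public static void main(String[] args) throws Exception {
            // args[0] = local source directory, args[1] = HDFS destination directory
            Configuration conf = new Configuration();   // reads fs.default.name from the usual config files
            final FileSystem fs = FileSystem.get(conf);
            final Path dstDir = new Path(args[1]);

            File[] inputs = new File(args[0]).listFiles();
            ExecutorService pool = Executors.newFixedThreadPool(8);  // thread count is a guess
            for (final File f : inputs) {
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // roughly what "hadoop fs -put" does for each file, but concurrently
                            fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                                                 new Path(dstDir, f.getName()));
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }
    }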
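
And for the benchmarking/profiling item: with the old (org.apache.hadoop.mapred) JobConf API, assuming that is what the jobs use, HPROF profiling of a few tasks and compression of intermediate map output can be switched on per job. The snippet below is only a sketch of the relevant settings; the class and method names are placeholders, and the actual mapper/reducer/input/output setup would come from whichever job Hohyon or Jae Hyeon is running.

    // Sketch of JobConf settings for profiling a few tasks and shrinking shuffle IO.
    // Helper name is hypothetical; apply it to the JobConf of an existing job.
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class TuningSketch {
        public static void applyTuning(JobConf conf) {
            // Turn on HPROF for a small sample of tasks
            conf.setProfileEnabled(true);
            conf.setProfileTaskRange(true, "0-2");   // profile map tasks 0-2
            conf.setProfileTaskRange(false, "0-2");  // profile reduce tasks 0-2
            conf.setProfileParams(
                "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");

            // Shrink the map -> reduce shuffle: combiner plus compressed map output
            // conf.setCombinerClass(MyReducer.class);  // reuse the job's reducer if it is associative
            conf.setCompressMapOutput(true);
            conf.setMapOutputCompressorClass(GzipCodec.class);

            // Optionally compress the final output written to HDFS as well
            FileOutputFormat.setCompressOutput(conf, true);
            FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
        }
    }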

Secondary issues

    • Can't use all 64 nodes at once (hadoop and largemem queues are disjoint)
      • possible to change and can revisit later
      • such a combined queue would likely also have the same 8 hour limit as largemem since those nodes are in greater demand
    • Maximum time limits / Knowing run-time of job before it's run
      • the hadoop queue has a 24-hour limit; the user must request time in advance
        • all TACC clusters have hard limits of some duration, though one can submit a special-case request for an extended run if needed
        • If you request the maximum time, the grid engine scheduler may delay starting your job, but you will have it for the full 24 hours, and if you finish early you won't be charged for the unused hours
        • If you request less time, it is possible to request an extension for more time
        • the largemem queue also supports Hadoop, with an 8-hour limit
      • Yinon -- is checkpointing possible with Hadoop, to save state before the time limit expires and later resume the interrupted job?
    • VNC
      • Yinon requested an ssh tunnel rather than VNC, similar to what they have on the comp ling cluster, so one can use the Web GUI without a VNC connection, which requires more bandwidth than is typically available in coffee shops -- Weijia indicated this is unlikely due to security concerns at TACC (though shouldn't an ssh tunnel from a known user be as safe as normal ssh by that user?)
      • Weijia suggested possibility of adding Hadoop option to https://portal.longhorn.tacc.utexas.edu as another alternative to VNC
      • Currently the VNC session starts up and cleans up Hadoop -- Weijia suggested we could have alternative scripts that do the startup/teardown without VNC, for those who have already tested their jobs and simply want to run them
    • /corral currently only accessible from login nodes
      • Weijia says fix is in the works
      • may be able to use /scratch instead -- see next bullet
    • We may be able to make greater use of /scratch and less use of /corral
      • TACC documentation says /scratch is purged without warning once a file becomes N days old, so we've avoided /scratch for all but temporary files
      • Weijia says practice differs from the documentation (e.g., purging hasn't ever happened yet, and email notification is provided before files are deleted), so use of /scratch should be explored further
      • Even if there is no purging, /corral is backed up while /scratch is not
    • Hadoop configuration files / centralization issues
      • Currently users have to install something themselves; Weijia suggested a central install in /hadoop (is this the same bullet as the next one? Is there a way to reduce startup effort while maintaining usage flexibility?)
      • Weijia raised the issue of a centralized (same for all) vs. individual configuration
        • centralized may seem simpler for new users but would eliminate individual flexibility
        • since the user doesn't need to touch his individual files and the defaults will work (like standard .login scripts), Matt voted to leave things as they are
      • Hohyon pointed out that the Hadoop log files shouldn't be stored in /home since they grow too fast for the available space
      • We may want some path for standard Hadoop libraries used by multiple users to migrate to a shared /hadoop location for use by all
    • Weijia can make an email list based on the unix hadoop group at some point if we want it instead of / in addition to the Google Group email list