MyCluster Overview
MyCluster is a system that builds personal Condor, OpenPBS, or SGE clusters on demand. The system uses the concept of a job proxy, which is submitted to remote host server clusters in lieu of the actual user job. These job proxies, when dispatched by the schedulers on the host server clusters, provision CPUs into personal clusters created for the user. Depending on when job proxies are dispatched and terminated, the scientist sees a personal cluster that expands and shrinks over time. Most importantly, user jobs are submitted, managed, and controlled in these dynamic personal clusters through a single uniform job management interface of the scientist's choice.

The system may also be used to augment existing departmental compute infrastructures, by configuring the job proxies to provision CPUs from large remote data centers into local clusters during periods of peak computational demand. Thus, for example, scientific user jobs can seamlessly scale from departmental to national HPC infrastructures, transparently migrating to NSF supercomputing centers when needed.

Finally, the job proxies submitted and managed by MyCluster are malleable job abstractions. They may be submitted with job sizes and durations that are adapted over time to improve the throughput of the user jobs in the provisioned personal clusters. For example, the system implements adaptive mechanisms that migrate job proxies between contributing host server clusters based on the observed throughput of the wide-area systems. This is important because scientific experiments supported by the system can run for days or weeks, so adapting to changing load conditions across the wide-area systems is critical.

Virtual Login Session
MyCluster introduces the abstraction of a virtual login session. The figure above shows a virtual login session, instantiated from a scientist's desktop computer.
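As a rough illustration of how dispatched and terminated job proxies make the personal cluster expand and shrink, the following Python sketch models each proxy as a placeholder that contributes CPUs while it is running. It is not MyCluster code: the `JobProxy` class, its fields, the site names, and the time steps are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobProxy:
    """Hypothetical model of a job proxy (not a MyCluster data structure)."""
    site: str           # host server cluster the proxy was submitted to
    cpus: int           # CPUs the proxy provisions while it is running
    dispatched_at: int  # time step at which the site scheduler starts it
    terminated_at: int  # time step at which the proxy exits (exclusive)

    def is_running(self, t: int) -> bool:
        return self.dispatched_at <= t < self.terminated_at


def cluster_size(proxies: list[JobProxy], t: int) -> int:
    """CPUs visible in the personal cluster at time step t."""
    return sum(p.cpus for p in proxies if p.is_running(t))


def size_over_time(proxies: list[JobProxy], horizon: int) -> list[int]:
    """Personal cluster size for each time step in [0, horizon)."""
    return [cluster_size(proxies, t) for t in range(horizon)]


# Three illustrative proxies submitted to two made-up sites.
proxies = [
    JobProxy("site_a", cpus=2, dispatched_at=1, terminated_at=5),
    JobProxy("site_b", cpus=4, dispatched_at=2, terminated_at=4),
    JobProxy("site_a", cpus=2, dispatched_at=3, terminated_at=7),
]

timeline = size_over_time(proxies, horizon=8)
print(timeline)  # [0, 2, 6, 8, 4, 2, 2, 0]
```

The printed list is the number of CPUs the scientist would see at each step: the cluster grows as proxies are dispatched and shrinks again as they terminate, while user jobs are only ever submitted to the personal cluster itself.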
The virtual login session emulates the experience of logging into the head node of a traditional cluster to submit and manage jobs on its compute nodes. The figure above shows the system provisioning CPUs from Amazon EC2. The cluster created on the scientist's desktop is a four-CPU cluster, and the scientist is shown issuing three Condor commands: condor_status, to inspect the state of the available CPUs; condor_submit, to submit an ensemble of 10 jobs; and condor_q, to check the status of the submitted jobs.

Process Architecture
A high-level overview of the process architecture of MyCluster is shown above. When a user first starts a virtual login session on the desktop, the system spawns the appropriate job queue manager processes on the scientist's desktop, along with a number of proxy manager agents. These proxy managers are started on gateway nodes to the server clusters contributing CPUs to the session, and are responsible for submitting job proxies to the local scheduler at each contributing site. Importantly, the proxy managers are semi-autonomous processes that implement decentralized policies for surviving outages, maximizing user job throughput, and enacting site-specific semantics. When the job proxies are dispatched by the local site schedulers, they start the appropriate job starter daemons for the selected job management interface, i.e. Condor, SGE, or OpenPBS. These job starter daemons then register back with the job queue manager on the scientist's desktop across the network. Jobs submitted by the scientist to the personal cluster can then be dispatched to the newly registered job starter daemons, with the scientist seeing the cluster expand and shrink as job starter daemons register and terminate over time.

Install and User Documentation
You may download an alpha version of the software here (version 2.0.9 - ALPHA). This version supports the creation of Condor and SGE clusters only. Please see our quick install guide for installation and configuration instructions. Note that you can use our Amazon EC2 public AMI (ami-81b252e8) to quickly test-drive the system. MyCluster depends on the MyRT run-time overlay to enable the transparent deployment of job management systems like Condor and SGE on the wide-area network. Consult the man pages for the vo-login, ec2_pool, and vo_pool command line tools. For usage examples, see the screenshots of Amazon EC2 provisioned Condor and SGE personal clusters using the ec2_pool tool.
Also see the screenshots of TeraGrid provisioned Condor and SGE personal clusters using the vo_pool tool. An earlier version of MyCluster is also installed on all the HPC clusters on the NSF TeraGrid. The MyCluster TeraGrid user guide is available here. You can email us at mycluster-dev (with the domain tacc.utexas.edu).

Related Publications
1. Edward Walker, “Continuous Adaptation for High Performance Throughput Computing across Distributed Clusters”, Proceedings of IEEE Cluster2008, Tsukuba, Japan, Oct 2008. (pdf)
2. Edward Walker, “Liability-Adjusted Throughput as a Metric for Improving User Job Throughput”, Proceedings of TeraGrid’08, Las Vegas, NV, June 2008.
3. Edward Walker, Jeffrey P. Gardner, Vladimir Litvin, and Evan Turner, “Personal Adaptive Clusters as Containers for Scientific Jobs”, Cluster Computing, vol. 10(3), Sept. 2007.
4. Edward Walker and Chona Guiang, “Challenges in Executing Large Parameter Sweep Studies in Widely Distributed Computing Environments”, Proceedings of IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE’2007), Monterey, CA, June 2007. (pdf)
5. Edward Walker, David Earl, and Michael Deem, “How To Run a Million Jobs in Six Months on the NSF TeraGrid”, Proceedings of TeraGrid’07, Madison, WI, June 2007. (pdf)
6. Edward Walker, Jeffrey P. Gardner, Vladimir Litvin, and Evan Turner, “Creating Personal Adaptive Clusters for Managing Scientific Jobs in a Distributed Computing Environment”, Proceedings of IEEE Workshop on Challenges of Large Applications in Distributed Environments (CLADE’2006), Paris, July 2006. (pdf)
This material is based in whole or in part on work supported by the National Science Foundation (NSF) under the NSF Middleware Initiative — Grant No. OCI-0721931. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.



