Open OnDemand: Linux Desktop for the Masses!
The SSCC supports users with a wide range of technical acumen who use Linux for data processing. See how we’re making Linux accessible to a broader range of users by letting them interact with Linux through a virtual desktop in addition to the ssh/shell command line.
ResearchDrive for Batch Processing
ResearchDrive represents a significant opportunity to leverage a campus service, both to manage the explosion in research data storage (can anyone say “Budget”?) and to provide a data store that users can access seamlessly from “anywhere.” Batch processing platforms present an additional challenge for ubiquitous access to ResearchDrive. I’ll present a solution that can be integrated with Slurm, Condor, and other batch platforms. This will be an interactive discussion.
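As one hedged illustration (not necessarily the solution that will be presented), a batch job could stage data to and from ResearchDrive with smbclient from inside the job itself, avoiding any root-level mounts on compute nodes. Server, share, and file names below are placeholders:

    #!/bin/bash
    #SBATCH --job-name=stage-from-researchdrive
    #SBATCH --time=01:00:00
    # Pull inputs from an SMB share into node-local scratch (placeholder paths;
    # credentials come from a protected authentication file).
    smbclient //researchdrive.example.wisc.edu/pi-share \
        --authentication-file="$HOME/.smbauth" \
        -c "get inputs/run42.tar.gz run42.tar.gz"
    tar -xzf run42.tar.gz
    # ... run the actual analysis here ...
    # Push results back to the same share when the job finishes.
    smbclient //researchdrive.example.wisc.edu/pi-share \
        --authentication-file="$HOME/.smbauth" \
        -c "put results.tar.gz results/run42-results.tar.gz"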
Platform R: High-performance Computing @ SMPH
Brian will talk about the new high performance computing resource for analysis of omics and imaging data that is under development at the SMPH.
Em and Derek will talk about their project-focused approach to storing and tracking research data for WID and Morgridge. They will describe their “DataVault” and the utilities used to help researchers initialize and maintain authorized access, quotas, data retention compliance, and other core metadata.
NSPM-33 refers to guidance issued in January 2022 (from the White House Office of Science and Technology Policy) to federal agencies for implementing National Security Presidential Memorandum 33. This guidance includes cybersecurity areas and protocols that may be relevant to UW-Madison’s research systems.
John, who is Interim Director of the Research Security Program at the OVCRGE, will lead a discussion of the requirements, the UW-Madison response so far, and the implications for UW-Madison’s research computing and data systems.
The Infrastructure Services (IS) team at the Center for High Throughput Computing manages a fabric of services that enable research computing on campus and across the nation. At this meeting, the IS team will discuss these services and what it takes to maintain them while advancing the frontiers of research cyberinfrastructure.
Cory will talk about his work creating an integration between Kubernetes and GitLab.
Brian and Ken will talk about some of the research support they're doing and their work expanding services available to research groups in the College of Engineering, including storage and data movement.
Russell and David will talk about a new SLURM partitioning resource that is being set up with funding from the Research Core Revitalization Program. They’ll share their experience putting this new resource into production as well as the training and documentation they’ve developed to enable researchers to take advantage of this new service.
IceCube has recently deployed a 9.6PB (raw) Ceph cluster and migrated to it some of their Lustre file systems. Vladimir will talk about his experiences making this happen.
Suman will discuss research projects and resources at the Wisconsin Wireless and NetworkinG Systems (WiNGS) lab.
Jeff will share information about UW System Policy 1038 and its implications for event logging. Topics could include logging efforts at the System, campus, and departmental level as well as backend technologies.
This will be a very informal discussion in which we’ll share updates from each of our groups. Please feel free to briefly discuss any projects, new resources and capabilities, or even ideas you are addressing in your work.
Brian will share updates on activities at the CHTC.
Brian will present on the Great Lakes Bioenergy Research Center (GLBRC) Data Catalog, the one-stop shop for GLBRC data, and how he and colleagues at the Wisconsin Energy Institute host, maintain, and deploy a highly available Rails application and its supporting infrastructure. This covers a range of topics across their app stack, including GitOps/IaC, app test and release processes, high availability, object storage, Ansible, logging, and the like.
Brian will discuss recent improvements in capabilities for machine learning at the Center for High Throughput Computing.
Richard will talk about his work setting up various mechanisms for delivering data to UW Biotechnology Center customers. These include methods such as ResearchDrive uploads, Globus, SFTP and Web.
Matt will talk about the computing and storage that help support the Cryo-EM Research Center. He’ll discuss the uses of cryo-EM, detectors and generation of movie data, GPU processing requirements, his center’s Ceph storage, and use of Globus with the systems.
UW-Madison now has an enterprise license for Globus for data transfers between groups on campus and with external collaborators. The presenters will provide an overview of what the enterprise license includes and what's new with Globus Connect Server version 5 that may be of interest. The enterprise license supports multiple servers/nodes at our institution, and we'd like to facilitate a discussion at this meeting about collaborations and data sharing use cases at your research center and how Globus could support them.
This session will provide an update on new NVIDIA capabilities pertinent to high-performance computing clusters, which were formally announced by the NVIDIA CEO on May 14 (see https://www.youtube.com/nvidia). Topics will include the A100 GPU and the DGX A100 server, plus some software updates: Spark 3.0, Jarvis, and Parabricks. This session will be interactive, with plenty of opportunities for questions and discussion.
Colin Vanden Heuvel from Mechanical Engineering will talk about management of heterogeneous hardware at the Simulation Based Engineering Lab.
David Schultz and Steve Barnet, Physics, will talk about IceCube cloud bursting in the public cloud.
Chris Harrison from Biostatistics and Medical Informatics, will describe his experience organizing the UW-Madison booth at SuperComputing 19 this fall and get your ideas about how to represent UW-Madison at next year's conference.
Sage Weil, founder and chief architect of Ceph, will share some Ceph basics plus what's new in Nautilus and what's coming in the Octopus release. Bring details of your Ceph implementation to share, along with any questions you'd like to ask Sage.
Tom Limoncelli, Site Reliability Engineering Manager, StackOverflow
Tom is the keynote speaker at the IT Professionals conference on June 6 (https://itproconf.wisc.edu). He agreed to meet with the RSAG group shortly after he arrives in Madison the day before! This will be an informal discussion that can be about any topics we like, so it will likely be completely different from his keynote the next day, which is about applying DevOps outside of software development.
Tom suggested these topics, as a starter:
Time Management
What's it like to work at Stack Overflow
Or just a general "AMA" (Ask Me Anything) like on Reddit
Tom’s background is in systems administration, and he has authored several books on that and related topics (https://www.amazon.com/Thomas-A.-Limoncelli/e/B004J0QIVM%3Fref=dbs_a_mng_rwt_scns_share)
CaRCC (Campus Research Computing Consortium)
Lauren Michael from the Center for High Throughput Computing will talk about CaRCC (https://carcc.org), the role UW is playing, and opportunities for RSAG members and others to participate.
Storage Discussion
Several research centers on campus will share their current storage solutions and challenges:
Biotechnology Center (Richard Kunert)
IceCube (Steve Barnet)
SSEC Ceph (Kevin Hrpcek)
SSEC Lustre (Scott Nolin)
WEI (Dirk Norman)
DoIT Storage (Mike Layde)
Dennis Lange and Mike Ippolito, DoIT Network Services
Dennis and Mike will facilitate an open discussion to collect input from RSAG on the impact and challenges that changes to the routable private address space would create for research networking.
RSAG members who visited Argonne National Labs last month will debrief on what they saw and learned.
A subset of RSAG members are taking a field trip to Argonne National Labs.
Chris Harrison, Biostats
Chris discussed early findings from a paper he’ll be presenting at a workshop associated with the 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018) at the end of May. The topic of the paper is "atSNPInfrastructure, a case study for searching billions of records while providing significant cost savings over cloud providers.”
Chad Seys, Physics
Chad will talk about HDFS (the Hadoop Distributed File System) at Physics, continuing the distributed file system theme we’ve been on.
Kevin Hrpcek, SSEC
Kevin was back to talk about SSEC's Ceph installation: how it is set up and how it works. The installation is currently Ceph Luminous 12.2.2, sized at 5 PB with 50 million objects. Features of the system include an erasure-coded pool, which saves space compared with straight replication at the cost of about a 25% IOPS overhead. BlueStore is used as the storage backend, and the Ceph RADOS Block Device (RBD) is integrated with Kubernetes to section off some of the data collection/analysis they need to do on short timeframes. For automated management and monitoring, the system uses Puppet and Icinga with the NRPE plugin.
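For readers unfamiliar with the pieces involved, here is a minimal sketch (not SSEC's actual configuration) of how an erasure-coded pool backing RBD images might be created on a Luminous-era cluster; pool names, placement-group counts, and k/m values are illustrative:

    # Define an erasure-code profile: 4 data chunks + 2 coding chunks per object.
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    # Create the erasure-coded data pool using that profile.
    ceph osd pool create ec-data 2048 2048 erasure ec-4-2
    # RBD on an EC pool (BlueStore) requires overwrites to be enabled,
    # plus a small replicated pool to hold image metadata.
    ceph osd pool set ec-data allow_ec_overwrites true
    ceph osd pool create rbd-meta 128 128 replicated
    rbd create --size 1T --data-pool ec-data rbd-meta/k8s-volume-01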
Kevin Hrpcek, SSEC
Kevin described some unique things he’s doing with Condor, flocking, and Docker. He runs a Docker container on all of his compute hosts and allows flocking only within this container when his management scripts allow it. This is all done to increase security and reduce risk to his hosts and network.
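A rough sketch of the standard HTCondor flocking knobs involved is below; host names and the container image are placeholders, and Kevin's management scripts layer additional controls on top of settings like these:

    # On the pool accepting flocked jobs (central manager / execute side):
    FLOCK_FROM = submit.remote-pool.example.edu
    # On the pool sending jobs out (submit side):
    FLOCK_TO = cm.accepting-pool.example.edu

    # The execute-side daemons run inside a Docker container rather than on the
    # bare host, so flocked jobs are confined to the container's filesystem and
    # network namespace (image name and bind mount are placeholders):
    docker run -d --name condor-execute \
        -v /scratch/condor:/var/lib/condor \
        example/condor-execute:el7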
Heath Skarlupka, IceCube
Heath discussed his recent deployment of a Kubernetes cluster for running a containerized ElasticSearch Cluster.
Tom Jordan, DoIT Middleware
Tom discussed federated identity management in the research space and how it can be used for cross-institutional research collaboration. DoIT’s middleware group has been pretty engaged with Internet2 / InCommon, and Tom described how other research cooperatives (LIGO, XSEDE, etc.) are using federated identity management, as well as the campus resources DoIT can provide to support cross-institutional collaboration, including help with federated identity management in vendor products and configuring research Service Providers to work with InCommon and eduGAIN. He introduced COmanage, a tool that can be used to build and manage identity within cross-institutional virtual organizations.
Richard Kunert, Biotechnology Center
Richard discussed several ways the Biotechnology Center supports data downloads, including Web downloads, SFTP, and Globus Connect. He also covered how the Biotechnology Center manages authentication to enable data sharing between PIs.
Pat Christian, DoIT Network Services and Jan Cheetham, CIO Office
Pat and Jan summarized the workshop on the National Research Platform (NRP) they attended last month and the group discussed the potential of the platform to benefit UW-Madison researchers who participate in data-intensive research with external collaborators.
The NRP is a vision for connecting the science DMZs across US research institutions to enhance data transfers and enable data-driven, collaborative science. It has implications for the physical sciences as well as biomedical and genomics research.
Vladimir Brik, Virtualization Systems Administrator, WIPAC
Vlad demonstrated and discussed a real-time monitoring tool called Netdata (https://my-netdata.io/). He uses Netdata to replace real-time command-line monitoring tools like dstat. It is easy to install and use even without configuration, has thousands of metrics already built in, and offers a good visualization interface. Because it samples the system every second, it is useful for analyzing performance problems that require high resolution and occur in short timeframes. Vlad first started using Netdata when he needed to diagnose an issue with a ZFS server and wanted to try Netdata's ZFS-specific metrics. Because it collects a lot of data, Netdata is not as useful for following histories over hours or weeks; instead, Vlad finds it complements tools like Nagios or Ganglia that are better suited to monitoring long-term trends.
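For anyone who wants to try it, installation at the time was a one-line bootstrap from the project site (run it on a test host first; the dashboard port shown is the default):

    # Install Netdata with the project's kickstart script:
    bash <(curl -Ss https://my-netdata.io/kickstart.sh)
    # The per-second dashboard is then available at:
    #   http://<hostname>:19999
    # Most collectors (disks, cgroups, ZFS, network interfaces) are detected
    # and charted automatically, with no configuration required.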
Derek Cooper and Neil Van Lysel, Morgridge Institute
Derek and Neil discussed the data storage system they designed and built for the Jan Huisken lab at the Morgridge Institute. The Huisken lab uses light sheet microscopy to capture terabytes of data per day from living specimens. Neil and Derek conducted a proof of concept that tested the capabilities of two different network-attached storage systems. The microscope setup consists of an array of up to 12 cameras, each capturing and streaming 800 Mbps of data (100 frames per second, 40 megapixels/frame) to an analysis server that is connected to the file server system. The two network-attached file systems, Nimble and Isilon Nitro, were tested against several metrics such as file system throughput, image file creation time, and network throughput.
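As a hedged illustration of the kind of file-system throughput test involved (parameters here are invented for the example, not the actual proof-of-concept settings), a tool like fio can approximate many cameras writing simultaneously:

    # Twelve concurrent sequential writers, roughly mimicking 12 camera streams
    # landing on the storage system under test (sizes and paths are placeholders):
    fio --name=camera-stream --directory=/mnt/nas-under-test \
        --rw=write --bs=1M --size=20G --numjobs=12 --direct=1 \
        --group_reporting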
Andy discussed the use of Docker with software used to model radiation transport and the geometry of complex nuclear systems. Because the applications integrate with libraries such as MOAB (Mesh-Oriented Database) and HDF5 and require specialized compilers, using containers to pre-package the dependencies has simplified the start-up process for new members of the research group. It has also provided a persistent testing environment and enhanced the reproducibility of simulations and analysis.
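A minimal, hypothetical Dockerfile along these lines (base image, package list, and build steps are placeholders, not the group's actual image) shows how the dependency stack can be pinned once for everyone:

    # Pin the compilers and mesh/IO libraries once, so every new group member
    # and every test run sees an identical environment.
    FROM ubuntu:16.04
    RUN apt-get update && apt-get install -y \
            build-essential gfortran cmake git \
            libhdf5-dev \
        && rm -rf /var/lib/apt/lists/*
    # ... build MOAB and the group's transport codes against these libraries here ...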
Erin talked about using Docker for running computing jobs on resources at the Center for High Throughput Computing. The CHTC has supported this for a couple of years and is currently upgrading their computers to the CentOS 7 operating system, which supports Docker much more reliably than Enterprise Linux 6. To run a CHTC job inside a Docker container, the container must be hosted on the DockerHub website and a few changes need to be made to the HTCondor submit file. A number of Docker containers for running applications often used by CHTC customers already exist on DockerHub, for example, containers for R and Python, which can be deployed on CHTC resources.
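For illustration, a minimal docker-universe submit file follows the usual vanilla-universe pattern with two extra lines; the image name and resource requests below are placeholders rather than CHTC's recommended values:

    universe                = docker
    docker_image            = python:3.6
    executable              = run_analysis.sh
    transfer_input_files    = run_analysis.sh, data.csv
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    request_cpus            = 1
    request_memory          = 2GB
    request_disk            = 4GB
    log    = job.log
    output = job.out
    error  = job.err
    queue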
Jesse described ways Docker is being used in campus infrastructure for cloud productivity and collaboration services like Office 365 and G Suite to improve processes like account provisioning, SMTP relaying, and DMARC processing for authentication of email. Docker has allowed Jesse’s team, which includes student developers working across multiple platforms, to move away from the monolithic code bases that have been used to run infrastructure services toward a set of microservices that are easier to upgrade, integrate, and deploy across the team. However, the move to containerization has introduced some new complexities, too, including the need to experiment with new ways of integrating external data sources and challenges with orchestration.
William will cover his work in a single application area (anomaly detection in infrastructure logs) and discuss some of the architectural lessons his team has learned as they've put machine learning techniques into production.
Jesse Stroik and Scott Nolin, system administrators in the Space Science and Engineering Center (SSEC), described the software management system they created so that SSEC scientists can use the software they’ve built in a user-friendly environment that provides a uniform experience for every researcher and workstation at SSEC. Behind the scenes, the platform manages a large body of software versions, compilers, libraries, and operating systems, resources that previously had to be maintained by individual research groups for each workstation.
The system is built with Lmod (a Lua-based module system developed at the Texas Advanced Computing Center) and the RPM Package Manager for Linux. They’ve found the system easy to use with configuration management tools like Puppet, which are well suited to managing software distributed via RPM. This allowed SSEC to define a software stack in configuration management and distribute it consistently to a set of hosts. Lmod then gives users a reliable mechanism to ensure the right compiler and supporting software are installed and loaded in their current environment.
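As a small, hypothetical example of what this looks like from the Lmod side (paths, versions, and the module tree are invented for illustration), a Lua modulefile ties a piece of software to the exact toolchain it was built with:

    -- e.g. /opt/modulefiles/hdf5/1.8.18.lua (illustrative path and versions)
    help([[HDF5 1.8.18 built against the gcc/6.3 toolchain]])
    whatis("Name: HDF5")
    whatis("Version: 1.8.18")
    -- Loading this module also loads (and pins) the matching compiler,
    -- so users always get the library/compiler pairing it was built with.
    load("gcc/6.3")
    prepend_path("PATH",            "/opt/sw/gcc-6.3/hdf5/1.8.18/bin")
    prepend_path("LD_LIBRARY_PATH", "/opt/sw/gcc-6.3/hdf5/1.8.18/lib")

A user then only needs a single "module load hdf5/1.8.18" in an interactive session, batch script, or cron job to reproduce that environment.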
One challenge they encountered was consistency in naming and versioning the software. This was mitigated by setting a naming and versioning policy for Lmod, which brought the added benefits of reproducibility and quick user adoption by reducing uncertainty in researchers’ own software builds. In fact, because the system records the links between a software version and its libraries, compilers, etc., it essentially creates a provenance trail for scientists, documenting all the software, library, and compiler versions used to generate a particular result with a given version of an application.
This system is used by SSEC scientists and their external collaborators for testing proof-of-concept software enhancements. It works well with cluster applications, since Lmod is designed to manage MPI libraries. Researchers have successfully used the system to specify cluster and cron jobs in MATLAB: when a new version of MATLAB is loaded, the system recognizes that the user still needs the older version to run those jobs. The software management system also underlies the S4 cluster, a new cluster for a NOAA-funded project. Finally, the system supports researchers who need to use older software for specific tasks, because it preserves the older operating systems and compilers required to make that software run, which is essential for long-term reproducibility.