Software Update Guide
NOTE: To avoid a kernel panic, remove the oldest kernels from /boot before updating. Leave at least two kernels installed in case you have to fall back to a previous kernel after a reboot (a cleanup sketch follows).
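A minimal cleanup sketch, assuming yum-utils is installed (it provides package-cleanup); keeping two kernels means the running one plus one fallback:
rpm -q kernel                            # list installed kernels
df -h /boot                              # check how full /boot is
package-cleanup --oldkernels --count=2   # remove all but the two newest kernels (the running kernel is kept)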
Please read all of the directions in this section before proceeding.
For the key service nodes, only update known security problems, and never reboot; just restart services (see the security-only sketch after this list):
Whenever possible, don't do a kernel update, because that requires a reboot, and we want to NEVER reboot these!
hepcms-hn.umd.edu
hepcms-ovirt.umd.edu
hepcms-foreman.umd.edu
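A security-only update sketch for these nodes, assuming the SL6 yum-plugin-security plugin (the install line is an assumption; skip it if the plugin is already there):
yum install yum-plugin-security
yum --security check-update   # list only updates flagged as security errata
yum --security update         # apply only those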
Any nodes not in the base hiera group will need automatic SL6 software updates turned off
Always do the following before applying any updates:
yum install yum-utils yum-plugin-ps
yum check-update
needs-restarting
Be sure to keep a log of the update; occasionally a single part of the update will fail and you'll have to check what broke (a logging sketch follows).
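One simple way to keep that log, sketched here (the log file path is just an example):
yum update 2>&1 | tee /root/yum-update-$(date +%Y%m%d).log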
General yum guidelines (a few more tricks are on our cheat sheet as well):
After the update, run needs-restarting again to find out which services need restarting.
Here is an example of restart commands from Trey (copy/paste into the command line as root); these "Should eliminate all but the udev-d and tty processes from the list":
# Restart services - so far these are ones I've found
for s in nfslock messagebus haldaemon munge ntpd postfix sshd sssd zabbix-agent mcollective ovirt-guest-agent httpd tomcat6 atd crond ; do
    if test -f /etc/init.d/${s} ; then
        /etc/init.d/${s} status &>/dev/null
        if [ $? -eq 0 ]; then
            /etc/init.d/${s} restart
            /etc/init.d/${s} status
        fi
    fi
done
# Dell tools too on Dell systems
if test -f /opt/dell/srvadmin/sbin/srvadmin-services.sh ; then
    /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
fi
OSG notes:
In general, you don't want to upgrade OSG, condor, or hadoop unless you do it on ALL nodes
For OSG node updates of OSG software, don't reboot; just restart services, including OSG services, in the proper order: http://hep-t3.physics.umd.edu/HowToForAdmins/errors.html#errorsOSG (that page needs to be rewritten)
Make sure to have a saved copy of the condor configuration before the upgrade
After upgrading condor, restart condor
Before upgrading hadoop, put hadoop in safe mode on hepcms-namenode (safe-mode commands are sketched below)
Copy the current NameNode index to /data as well as to your laptop
Do the upgrade, then restart services as needed (on hepcms-namenode and hepcms-secondary-namenode)
Make sure the full hadoop disk space is still there, and once the FULL upgrade procedure for the outage is done, take hadoop out of safe mode
Worker Nodes are also hadoop datanodes, so try not to reboot them; just restart services
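A minimal sketch of the safe-mode commands (hdfs dfsadmin is the non-deprecated form; the older hadoop dfsadmin shown later on this page also works):
hdfs dfsadmin -safemode enter    # before starting the upgrade
hdfs dfsadmin -safemode get      # check the current state
hdfs dfsadmin -safemode leave    # only after the FULL upgrade procedure is done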
Interactive nodes should be restarted (as a planned outage, announced to users)
If an unplanned outage is needed, use the command w to see who is logged in
Also check ps ahux to see if anyone left a screen, tmux, or other long-running process (they're not supposed to leave long-running processes on the interactive nodes)
I have often done reboots during weeknights when nobody's logged in, or on a weekend, but it always runs the risk of a bare metal interactive node not coming back online (see the next note)
Note that rebooting is a nice way to clear abandoned ssh logins, badly cancelled processes, and swap memory
Warning about bare metal machines (worker nodes, hepcms-in2, main cluster service nodes):
Occasionally they won't come back up after a reboot; either they hang on some command (or disk mount), or they need someone at the machine to fsck by hand a disk (usually a data disk) that failed its disk check on reboot
Therefore you want to have at least one person on campus who can intervene when a bare metal machine is rebooted
node           cpus  disk (TB)  condor slots  hadoop usage  osg-version  singularity  notes
compute-0-5    8     3.5        7             4             3.4          yes          compute-0-5 has priority slots, so the number reads 14 but the actual slots are 7
compute-0-10   8     3.5        7                           3.4          yes
compute-0-7    8     3.5        7                           3.4          yes
compute-0-11   8     3.5        7                           3.4          yes          apparently never updated using Doug's instructions; was updated earlier with --skip-broken; updated singularity to 2.6.1 but it didn't work, so just removed /usr/libexec/singularity/bin/start-suid for now
compute-0-8    8     3.5        7                           3.3
compute-0-6    8     3.5        7                           3.4          yes          rm /usr/libexec/singularity/bin/start-suid instead
r510-0-9       24    22         nothing                     3.4          yes
r510-0-11      24    22         nothing                     3.4          yes          /dev/sda1 is 100% full so can't install; rm /usr/libexec/singularity/bin/start-suid instead
r510-0-6       24    22         23                          3.4          yes          rm /usr/libexec/singularity/bin/start-suid instead
r510-0-1       24    22         23                          3.4          yes
r510-0-4       24    22         23                          3.4          yes
r510-0-10      24    22         23                          3.4          yes
r510-0-5       24    22         23                          3.4          yes          rm /usr/libexec/singularity/bin/start-suid
r720-0-1       32    22                                     3.4
r720-0-2       32    22         31                          3.4          yes
total          280   218
Adding singularity to worker nodes compute-0-10, r510-0-11, r510-0-6, and r510-0-4.
Fixed some issues by changing
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
to replace all instances of /sharesoft/cmssw with /cvmfs/cms.cern.ch (a sed sketch follows).
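A sketch of that replacement done with sed on a local checkout of the file (the filename/path here is an assumption; the real file lives in the gitlab repo above):
sed -i 's|/sharesoft/cmssw|/cvmfs/cms.cern.ch|g' site-local-config.xml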
Remove the osg-wn-client-glexec package, as we no longer need it:
yum remove osg-wn-client-glexec
And don't update the packages causing trouble:
yum -y update --disablerepo=<whatever repo is making the update fail>
You might even have to disable these repos when installing just one or two packages:
yum -y install --disablerepo=dell-omsa-indep,dell-omsa-specific https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
using Doug's instructions:
cd /etc/yum.repos.d/
mkdir SaveOSGRepos
mv osg* SaveOSGRepos/
yum clean all
yum -y install https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install lcmaps vo-client-lcmaps-voms osg-configure-misc llrun voms-clients
yum -y update
osg-configure -c
ln -s /data/osg/scripts/grid-mapfile /etc/grid-security/
Sometimes the rpm would fail with an error along the lines of "will not update" / "nothing to do"; a reinstall of the RPM helped.
These are the exact instructions that worked on r510-0-9:
cd /etc/yum.repos.d/
mkdir SaveOSGRepos
mv osg* SaveOSGRepos/
yum clean all
yum -y reinstall --disablerepo=dell-omsa-indep,dell-omsa-specific https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install lcmaps vo-client-lcmaps-voms osg-configure-misc llrun voms-clients
yum -y update --disablerepo=dell-omsa-indep,dell-omsa-specific
osg-version
osg-configure -c
ln -s /data/osg/scripts/grid-mapfile /etc/grid-security/
yum clean all
yum -y install singularity-runtime
emacs -nw /etc/singularity/singularity.conf
gridftp already has OSG 3.4:
[root@hepcms-gridftp boot]# osg-version
OSG 3.4.17
[root@hepcms-gridftp boot]#
Critical vulnerability found; update to singularity-runtime-2.6.1:
yum -y install singularity-runtime
Singularity install
from https://opensciencegrid.org/docs/worker-node/install-singularity/
yum -y install https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install singularity-runtime
To update:
yum update singularity-runtime
[root@compute-0-10 ~]# emacs -nw /etc/singularity/singularity.conf   (or vi)
Configure these settings:
# ENABLE UNDERLAY: [yes/no]
# DEFAULT: no
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container, similar to the overlay
# option. This will only be used if overlay is not enabled.
enable underlay = yes
# ENABLE OVERLAY: [yes/no/try]
# DEFAULT: try
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container. If 'try' is chosen,
# overlayfs will be tried but if it is unavailable it will be silently ignored.
enable overlay = no
# MAX LOOP DEVICES: [INT]
# DEFAULT: 256
# Set the maximum number of loop devices that Singularity should ever attempt
# to utilize.
max loop devices = 0
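A non-interactive way to apply the same three settings, sketched on the assumption that each key already appears exactly once in the stock config:
sed -i -e 's/^enable underlay = .*/enable underlay = yes/' \
       -e 's/^enable overlay = .*/enable overlay = no/' \
       -e 's/^max loop devices = .*/max loop devices = 0/' \
       /etc/singularity/singularity.conf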
Validate:
[jabeen@compute-0-10 ~]$ singularity exec --contain --ipc --pid \
> --home $PWD:/srv \
> --bind /cvmfs \
> /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo:el6 \
> ps -ef
WARNING: Container does not have an exec helper script, calling 'ps' directly
UID PID PPID C STIME TTY TIME CMD
jabeen 1 0 1 00:25 ? 00:00:00 shim-init ps -ef
jabeen 2 1 0 00:25 ? 00:00:00 ps -ef
[jabeen@compute-0-10 ~]$
Warning
If you modify /etc/singularity/singularity.conf, be careful with your upgrade procedures. RPM will not automatically merge your changes with new upstream configuration keys, which may cause a broken install or inadvertently change the site configuration. Singularity changes its default configuration file more frequently than typical OSG software.
Look for singularity.conf.rpmnew after upgrades and merge in any changes to the defaults (a quick check is sketched below).
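A minimal sketch of that check (run as root on the node that was just upgraded):
find /etc/singularity -name '*.rpmnew'                                               # any hit means the packaged defaults changed
diff -u /etc/singularity/singularity.conf /etc/singularity/singularity.conf.rpmnew   # if one exists, review and merge by hand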
hepcms-gridftp
yum update --disablerepo=puppetlabs-products
[root@hepcms-gridftp ~]# service globus-gridftp-server restart
hepcms-ce
yum update --disablerepo=puppetlabs-products
[root@hepcms-1 ~]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-1 ~]# service condor-cron restart
Stopping Condor-cron daemons: [ OK ]
Starting Condor-cron daemons: [ OK ]
[root@hepcms-1 ~]# service rsv restart
Stopping RSV: Stopping all metrics on all hosts.
Stopping consumers.
Starting RSV: Starting 13 metrics for host 'hepcms-1.umd.edu'.
Starting 2 metrics for host 'hepcms-0.umd.edu:8443'.
Starting 1 metrics for host 'hepcms-gridftp.umd.edu'.
Starting 2 consumers.
hepcms-se
yum update --disablerepo=puppetlabs-products
[root@hepcms-0 ~]# service bestman2 restart
[root@hepcms-0 ~]# service xrootd restart
[root@hepcms-0 ~]# service cmsd restart
hepcms-gums
yum update --disablerepo=puppetlabs-products
[root@hepcmsdev-6 ~]# service tomcat6 restart
[root@hepcmsdev-6 ~]# service mysqld restart
hepcms-squid
yum update --disablerepo=puppetlabs-products
[root@hepcms-squid ~]# service frontier-squid restart
yum update
(or yum update --disablerepo=puppetlabs-products if the puppetlabs repo was in the list of updates)
LINK: https://www.scientificlinux.org/category/sl-errata/slsa-20171100-1/
List of nodes to update & restart services on or reboot
(put a red x by nodes that have been updated; green means rebooting is done for the whole group):
yum check-update tells you if anything still needs to be updated
We are only updating the two packages mentioned (nss and nss-util), testing on compute-0-10 first:
yum update nss nss-util
Dependencies Resolved
=========================================================================================================================================================
Package Arch Version Repository Size
=========================================================================================================================================================
Updating:
nss x86_64 3.28.4-1.el6_9 sl-security 879 k
nss-util x86_64 3.28.4-1.el6_9 sl-security 67 k
Updating for dependencies:
nspr x86_64 4.13.1-1.el6 sl-security 113 k
nss-sysinit x86_64 3.28.4-1.el6_9 sl-security 50 k
nss-tools x86_64 3.28.4-1.el6_9 sl-security 445 k
Transaction Summary
=========================================================================================================================================================
Upgrade 5 Package(s)
[root@compute-0-10 ~]# reboot
[jabeen@hepcms-in2 ~]$ ssh -XY compute-0-10
[root@compute-0-10 ~]# service gmond status
Everything seems fine; /home, /data, and cvmfs are mounted.
[root@compute-0-10 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_compute010-lv_root
50G 2.8G 44G 6% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
/dev/sda1 477M 114M 339M 26% /boot
fuse_dfs 200T 140T 61T 70% /mnt/hadoop
10.1.0.1:/export/home
7.2T 1.1T 5.8T 16% /home
r720-datanfs.privnet:/data
37T 32T 4.7T 88% /data
cvmfs2 20G 60M 20G 1% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 60M 20G 1% /cvmfs/cms.cern.ch
[root@compute-0-10 ~]#
Now updating all nodes using clush
[root@hepcms-hn ~]# ssh-agent $SHELL
[root@hepcms-hn ~]# ssh-add
Check the updates and check for any broken dependencies:
clush -w @compute yum check-update nss nss-util
clush -w @compute yum update nss nss-util
Finally, update using the -y option:
clush -w @compute yum update -y nss nss-util
Do the same for groups R510, INT, R720-datanfs and others
[root@hepcms-hn ~]# clush -w hepcms-namenode,hepcms-secondary-namenode,hepcms-ce,hepcms-se,hepcms-gums,hepcms-squid yum update -y nss nss-util
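To confirm the packages actually landed everywhere, a quick verification sketch (assumes the same clush groups used above; -b consolidates identical output across nodes):
clush -b -w @compute 'rpm -q nss nss-util nspr'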
hepcms-hn, hepcms-ovirt, and hepcms-foreman are done separately, by ssh'ing to ovirt and foreman.
Copied the hadoop namenode checkpoint files to a personal computer for safekeeping (worth verifying the copy, as sketched below):
[root@hepcms-namenode hadoop]# tar -cvzf namenode-data-osg-hadoop-May25.tgz ./checkpoint*
Shabnams-MacBook-Air-2:~ jabeen$ scp jabeen@hepcms.umd.edu:/data/osg/hadoop/namenode-data-osg-hadoop-May25.tgz .
jabeen@hepcms.umd.edu's password:
namenode-data-osg-hadoop-May25.tgz 100% 134MB 13.4MB/s 00:10
Shabnams-MacBook-Air-2:~ jabeen$ pwd
/Users/jabeen
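A quick integrity check on the copy, sketched here; run it in the directory holding the tarball on both hepcms-namenode and the laptop and compare the output:
md5sum namenode-data-osg-hadoop-May25.tgz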
Putting hadoop in safe mode before rebooting the rest of the nodes:
[root@hepcms-namenode ~]# hadoop dfsadmin -safemode enter
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is ON
[root@hepcms-namenode ~]#
We rebooted all the worker nodes and in2 one by one after putting hadoop in safe mode.
Ovirt and, in turn, all the VMs were rebooted after we fixed the ovirt memory problem.
All the machines were brought back according to the recovery plan:
https://sites.google.com/a/physics.umd.edu/tier-3-umd/admin-guide/troubleshooting/powerup
yum check-update tells you if anything still needs to be updated
For these nodes, try not to update anything unless it's needed for security, and never reboot; just restart services:
hepcms-foreman (nothing has been updated except security updates since the VM was made) - definitely don't update foreman!
hepcms-hn (yum cron still running 13 Nov 2015)
hepcms-ovirt (nothing has been updated except security updates since the machine was made)
xhepcms-squid.privnet (remove the public IP from this node before rebooting; it was removed from Foreman but didn't go away from hepcms-ovirt until reboot)
hepcms-0.umd.edu -- need to remake maybe
hepcms-1.umd.edu -- need to remake maybe
xhepcms-in1.umd.edu
xhepcms-in2.umd.edu
xhepcms-in3.umd.edu
hepcms-in4.umd.edu
xhepcms-sl5.umd.edu
Had to mount /data and /home by hand (it's a kluged machine so I'm not surprised)
Also had to restart condor as condor_q didn't work (and then it was fine)
xhepcms-foreman.umd.edu
xhepcms-ovirt.umd.edu
xhepcms-hn.umd.edu
xr720-datanfs.privnet
xhepcms-namenode.privnet
xhepcms-secondary-namenode.privnet
xr720-0-1.privnet (reboot WNs one at a time and give hadoop time to recover)
xr720-0-2.privnet
xr510-0-1.privnet
xr510-0-9.privnet (not yet on hadoop - use as test kickstart with wipe from Foreman?)
xcompute-0-5.privnet
NSS software vulnerability notes from Trey, including the restart-services script:
# Install and/or update yum-utils and yum-plugin-ps
yum install yum-utils yum-plugin-ps
# Check if anything needs restarting before update
needs-restarting
# Check what is running that links to nss or nspr packages
yum ps nss\* nspr\*
yum update nss\* nspr\*
# Check what needs to be restarted
needs-restarting
# Restart services - so far these are ones I've found
for s in nfslock messagebus haldaemon munge ntpd postfix sshd sssd zabbix-agent mcollective ovirt-guest-agent httpd tomcat6 atd crond ; do
    if test -f /etc/init.d/${s} ; then
        /etc/init.d/${s} status &>/dev/null
        if [ $? -eq 0 ]; then
            /etc/init.d/${s} restart
            /etc/init.d/${s} status
        fi
    fi
done
# Dell tools too on Dell systems
if test -f /opt/dell/srvadmin/sbin/srvadmin-services.sh ; then
    /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
fi
Per advice from Trey, turned these things off on hepcms-hn (disable commands are sketched after the transcript below):
Trey: "Ah, I'd first do `/etc/init.d/cups stop ; chkconfig cups off`, no reason to run print services"
Trey: "May be worth disabling the stuff you don't need, like `libvirtd`, `dnsmasq`"
`polkitd` was also recommended ("if it's a service"), but it turned out not to be a service:
[root@hepcms-hn ~]# service polkitd status
polkitd: unrecognized service
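A sketch of the disable steps Trey suggested, assuming SysV init on SL6 (the loop simply skips anything not installed on the node):
for s in cups libvirtd dnsmasq ; do
    if test -f /etc/init.d/${s} ; then
        service ${s} stop
        chkconfig ${s} off
    fi
done
chkconfig --list | grep ':on'   # review what is still set to start at boot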