Software Update Guide
NOTE: To avoid a kernel panic, remove the oldest kernels from /boot before updating. Leave at least two kernels installed in case you have to fall back to a previous kernel after a reboot (a cleanup sketch follows).
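A minimal cleanup sketch, assuming yum-utils is installed (it provides package-cleanup); keeping two kernels means the running one plus one fallback:
rpm -q kernel                            # list installed kernels
df -h /boot                              # check how full /boot is
package-cleanup --oldkernels --count=2   # remove all but the two newest kernels (the running kernel is kept)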
Please read all of the directions in this section before proceeding.
For the key service nodes, only update known security problems, and never reboot; just restart services (see the security-only sketch after this list):
Whenever possible, don't do a kernel update, because that requires a reboot, and we want to NEVER reboot these!
hepcms-hn.umd.edu
hepcms-ovirt.umd.edu
hepcms-foreman.umd.edu
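A security-only update sketch for these nodes, assuming the SL6 yum-plugin-security plugin (the install line is an assumption; skip it if the plugin is already there):
yum install yum-plugin-security
yum --security check-update   # list only updates flagged as security errata
yum --security update         # apply only those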
Any nodes not in the base hiera group will need automatic SL6 software updates turned off
Always do the following before applying any updates:
yum install yum-utils yum-plugin-ps
yum check-update
needs-restarting
Be sure to keep a log of the update; occasionally a single part of the update will fail and you'll have to check what broke (a logging sketch follows).
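One simple way to keep that log, sketched here (the log file path is just an example):
yum update 2>&1 | tee /root/yum-update-$(date +%Y%m%d).log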
General yum guidelines (a few more tricks are on our cheat sheet as well):
After the update, run needs-restarting again to find out which services need restarting.
Here is an example of restart commands from Trey (copy/paste into the command line as root); these "Should eliminate all but the udev-d and tty processes from the list":
# Restart services - so far these are ones I've found
for s in nfslock messagebus haldaemon munge ntpd postfix sshd sssd zabbix-agent mcollective ovirt-guest-agent httpd tomcat6 atd crond ; do
    if test -f /etc/init.d/${s} ; then
        /etc/init.d/${s} status &>/dev/null
        if [ $? -eq 0 ]; then
            /etc/init.d/${s} restart
            /etc/init.d/${s} status
        fi
    fi
done
# Dell tools too on Dell systems
if test -f /opt/dell/srvadmin/sbin/srvadmin-services.sh ; then
    /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
fi
OSG notes:
In general, you don't want to upgrade OSG, condor, or hadoop unless you do it on ALL nodes
For OSG node updates of OSG software, don't reboot; just restart services, including OSG services, in the proper order: http://hep-t3.physics.umd.edu/HowToForAdmins/errors.html#errorsOSG (that page needs to be rewritten)
Make sure to have a saved copy of the condor configuration before the upgrade
After upgrading condor, restart condor
Before upgrading hadoop, put hadoop in safe mode on hepcms-namenode (safe-mode commands are sketched below)
Copy the current NameNode index to /data as well as to your laptop
Do the upgrade, then restart services as needed (on hepcms-namenode and hepcms-secondary-namenode)
Make sure the full hadoop disk space is still there, and once the FULL upgrade procedure for the outage is done, take hadoop out of safe mode
Worker Nodes are also hadoop datanodes, so try not to reboot them; just restart services
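A minimal sketch of the safe-mode commands (hdfs dfsadmin is the non-deprecated form; the older hadoop dfsadmin shown later on this page also works):
hdfs dfsadmin -safemode enter    # before starting the upgrade
hdfs dfsadmin -safemode get      # check the current state
hdfs dfsadmin -safemode leave    # only after the FULL upgrade procedure is done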
Interactive nodes should be restarted (as a planned outage, announced to users)
If an unplanned outage is needed, use the command w to see who is logged in
Also check ps ahux to see if anyone left a screen, tmux, or other long-running process (they're not supposed to leave long-running processes on the interactive nodes)
I have often done reboots during weeknights when nobody's logged in, or on a weekend, but it always runs the risk of a bare metal interactive node not coming back online (see the next note)
Note that rebooting is a nice way to clear abandoned ssh logins, badly cancelled processes, and swap memory
Warning about bare metal machines (worker nodes, hepcms-in2, main cluster service nodes):
Occasionally they won't come back up after a reboot; either they hang on some command (or disk mount), or they need someone at the machine to fsck by hand a disk (usually a data disk) that failed its disk check on reboot
Therefore you want to have at least one person on campus who can intervene when a bare metal machine is rebooted
node           cpus  disk (TB)  condor slots  hadoop usage  osg-version  singularity  notes
compute-0-5    8     3.5        7             4             3.4          yes          compute-0-5 has priority slots, so the number reads 14 but the actual slots are 7
compute-0-10   8     3.5        7                           3.4          yes
compute-0-7    8     3.5        7                           3.4          yes
compute-0-11   8     3.5        7                           3.4          yes          apparently never updated using Doug's instructions; was updated earlier with --skip-broken; updated singularity to 2.6.1 but it didn't work, so just removed /usr/libexec/singularity/bin/start-suid for now
compute-0-8    8     3.5        7                           3.3
compute-0-6    8     3.5        7                           3.4          yes          rm /usr/libexec/singularity/bin/start-suid instead
r510-0-9       24    22         nothing                     3.4          yes
r510-0-11      24    22         nothing                     3.4          yes          /dev/sda1 is 100% full so can't install; rm /usr/libexec/singularity/bin/start-suid instead
r510-0-6       24    22         23                          3.4          yes          rm /usr/libexec/singularity/bin/start-suid instead
r510-0-1       24    22         23                          3.4          yes
r510-0-4       24    22         23                          3.4          yes
r510-0-10      24    22         23                          3.4          yes
r510-0-5       24    22         23                          3.4          yes          rm /usr/libexec/singularity/bin/start-suid
r720-0-1       32    22                                     3.4
r720-0-2       32    22         31                          3.4          yes
total          280   218
Adding singularity to worker nodes compute-0-10, r510-0-11, r510-0-6, and r510-0-4.
Fixed some issues by changing
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
to replace all instances of /sharesoft/cmssw with /cvmfs/cms.cern.ch (a sed sketch follows).
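A sketch of that replacement done with sed on a local checkout of the file (the filename/path here is an assumption; the real file lives in the gitlab repo above):
sed -i 's|/sharesoft/cmssw|/cvmfs/cms.cern.ch|g' site-local-config.xml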
Remove the osg-wn-client-glexec package, as we no longer need it:
yum remove osg-wn-client-glexec
And don't update the packages causing trouble:
yum -y update --disablerepo=<whatever repo is making the update fail>
You might even have to disable these repos when installing just one or two packages:
yum -y install --disablerepo=dell-omsa-indep,dell-omsa-specific https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
using Doug's instructions:
cd /etc/yum.repos.d/
mkdir SaveOSGRepos
mv osg* SaveOSGRepos/
yum clean all
yum -y install https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install lcmaps vo-client-lcmaps-voms osg-configure-misc llrun voms-clients
yum -y update
osg-configure -c
ln -s /data/osg/scripts/grid-mapfile /etc/grid-security/
Sometimes the rpm would fail with an error along the lines of "will not update" / "nothing to do"; a reinstall of the RPM helped.
These are the exact instructions that worked on r510-0-9:
cd /etc/yum.repos.d/
mkdir SaveOSGRepos
mv osg* SaveOSGRepos/
yum clean all
yum -y reinstall --disablerepo=dell-omsa-indep,dell-omsa-specific https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install lcmaps vo-client-lcmaps-voms osg-configure-misc llrun voms-clients
yum -y update --disablerepo=dell-omsa-indep,dell-omsa-specific
osg-version
osg-configure -c
ln -s /data/osg/scripts/grid-mapfile /etc/grid-security/
yum clean all
yum -y install singularity-runtime
emacs -nw /etc/singularity/singularity.conf
gridftp already has OSG 3.4:
[root@hepcms-gridftp boot]# osg-version
OSG 3.4.17
[root@hepcms-gridftp boot]#
Critical vulnerability found; update to singularity-runtime-2.6.1:
yum -y install singularity-runtime
Singularity install
from https://opensciencegrid.org/docs/worker-node/install-singularity/
yum -y install https://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all
yum -y install singularity-runtime
To update:
yum update singularity-runtime
[root@compute-0-10 ~]# emacs -nw /etc/singularity/singularity.conf   (or vi)
Configure these settings:
# ENABLE UNDERLAY: [yes/no]
# DEFAULT: no
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container, similar to the overlay
# option. This will only be used if overlay is not enabled.
enable underlay = yes
# ENABLE OVERLAY: [yes/no/try]
# DEFAULT: try
# Enabling this option will make it possible to specify bind paths to locations
# that do not currently exist within the container. If 'try' is chosen,
# overlayfs will be tried but if it is unavailable it will be silently ignored.
enable overlay = no
# MAX LOOP DEVICES: [INT]
# DEFAULT: 256
# Set the maximum number of loop devices that Singularity should ever attempt
# to utilize.
max loop devices = 0
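A non-interactive way to apply the same three settings, sketched on the assumption that each key already appears exactly once in the stock config:
sed -i -e 's/^enable underlay = .*/enable underlay = yes/' \
       -e 's/^enable overlay = .*/enable overlay = no/' \
       -e 's/^max loop devices = .*/max loop devices = 0/' \
       /etc/singularity/singularity.conf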
Validate:
[jabeen@compute-0-10 ~]$ singularity exec --contain --ipc --pid \
> --home $PWD:/srv \
> --bind /cvmfs \
> /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo:el6 \
> ps -ef
WARNING: Container does not have an exec helper script, calling 'ps' directly
UID PID PPID C STIME TTY TIME CMD
jabeen 1 0 1 00:25 ? 00:00:00 shim-init ps -ef
jabeen 2 1 0 00:25 ? 00:00:00 ps -ef
[jabeen@compute-0-10 ~]$
Warning
If you modify /etc/singularity/singularity.conf, be careful with your upgrade procedures. RPM will not automatically merge your changes with new upstream configuration keys, which may cause a broken install or inadvertently change the site configuration. Singularity changes its default configuration file more frequently than typical OSG software.
Look for singularity.conf.rpmnew after upgrades and merge in any changes to the defaults (a quick check is sketched below).
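A minimal sketch of that check (run as root on the node that was just upgraded):
find /etc/singularity -name '*.rpmnew'                                               # any hit means the packaged defaults changed
diff -u /etc/singularity/singularity.conf /etc/singularity/singularity.conf.rpmnew   # if one exists, review and merge by hand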
hepcms-gridftp
yum update --disablerepo=puppetlabs-products
[root@hepcms-gridftp ~]# service globus-gridftp-server restart
hepcms-ce
yum update --disablerepo=puppetlabs-products
[root@hepcms-1 ~]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-1 ~]# service condor-cron restart
Stopping Condor-cron daemons: [ OK ]
Starting Condor-cron daemons: [ OK ]
[root@hepcms-1 ~]# service rsv restart
Stopping RSV: Stopping all metrics on all hosts.
Stopping consumers.
Starting RSV: Starting 13 metrics for host 'hepcms-1.umd.edu'.
Starting 2 metrics for host 'hepcms-0.umd.edu:8443'.
Starting 1 metrics for host 'hepcms-gridftp.umd.edu'.
Starting 2 consumers.
hepcms-se
yum update --disablerepo=puppetlabs-products
[root@hepcms-0 ~]# service bestman2 restart
[root@hepcms-0 ~]# service xrootd restart
[root@hepcms-0 ~]# service cmsd restart
hepcms-gums
yum update --disablerepo=puppetlabs-products
[root@hepcmsdev-6 ~]# service tomcat6 restart
[root@hepcmsdev-6 ~]# service mysqld restart
hepcms-squid
yum update --disablerepo=puppetlabs-products
[root@hepcms-squid ~]# service frontier-squid restart
yum update
(or yum update --disablerepo=puppetlabs-products if the puppetlabs repo was in the list of updates)
LINK: https://www.scientificlinux.org/category/sl-errata/slsa-20171100-1/
List of nodes to update & restart services on or reboot
(put a red x by nodes that have been updated; green means rebooting is done for the whole group):
yum check-update tells you if anything still needs to be updated
We are only updating the two packages mentioned (nss and nss-util), testing on compute-0-10 first:
yum update nss nss-util
Dependencies Resolved
=========================================================================================================================================================
Package Arch Version Repository Size
=========================================================================================================================================================
Updating:
nss x86_64 3.28.4-1.el6_9 sl-security 879 k
nss-util x86_64 3.28.4-1.el6_9 sl-security 67 k
Updating for dependencies:
nspr x86_64 4.13.1-1.el6 sl-security 113 k
nss-sysinit x86_64 3.28.4-1.el6_9 sl-security 50 k
nss-tools x86_64 3.28.4-1.el6_9 sl-security 445 k
Transaction Summary
=========================================================================================================================================================
Upgrade 5 Package(s)
[root@compute-0-10 ~]# reboot
[jabeen@hepcms-in2 ~]$ ssh -XY compute-0-10
[root@compute-0-10 ~]# service gmond status
Everything seems fine; /home, /data, and cvmfs are mounted.
[root@compute-0-10 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_compute010-lv_root
50G 2.8G 44G 6% /
tmpfs 7.8G 0 7.8G 0% /dev/shm
/dev/sda1 477M 114M 339M 26% /boot
fuse_dfs 200T 140T 61T 70% /mnt/hadoop
10.1.0.1:/export/home
7.2T 1.1T 5.8T 16% /home
r720-datanfs.privnet:/data
37T 32T 4.7T 88% /data
cvmfs2 20G 60M 20G 1% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 60M 20G 1% /cvmfs/cms.cern.ch
[root@compute-0-10 ~]#
Now updating all nodes using clush
[root@hepcms-hn ~]# ssh-agent $SHELL
[root@hepcms-hn ~]# ssh-add
Check the updates and check for any broken dependencies:
clush -w @compute yum check-update nss nss-util
clush -w @compute yum update nss nss-util
Finally, update using the -y option:
clush -w @compute yum update -y nss nss-util
Do the same for groups R510, INT, R720-datanfs and others
[root@hepcms-hn ~]# clush -w hepcms-namenode,hepcms-secondary-namenode,hepcms-ce,hepcms-se,hepcms-gums,hepcms-squid yum update -y nss nss-util
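To confirm the packages actually landed everywhere, a quick verification sketch (assumes the same clush groups used above; -b consolidates identical output across nodes):
clush -b -w @compute 'rpm -q nss nss-util nspr'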
hepcms-hn, hepcms-ovirt, and hepcms-foreman are done separately, by ssh'ing to ovirt and foreman.
Copied the hadoop namenode checkpoint files to a personal computer for safekeeping (worth verifying the copy, as sketched below):
[root@hepcms-namenode hadoop]# tar -cvzf namenode-data-osg-hadoop-May25.tgz ./checkpoint*
Shabnams-MacBook-Air-2:~ jabeen$ scp jabeen@hepcms.umd.edu:/data/osg/hadoop/namenode-data-osg-hadoop-May25.tgz .
jabeen@hepcms.umd.edu's password:
namenode-data-osg-hadoop-May25.tgz 100% 134MB 13.4MB/s 00:10
Shabnams-MacBook-Air-2:~ jabeen$ pwd
/Users/jabeen
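A quick integrity check on the copy, sketched here; run it in the directory holding the tarball on both hepcms-namenode and the laptop and compare the output:
md5sum namenode-data-osg-hadoop-May25.tgz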
Putting hadoop in safe mode before rebooting the rest of the nodes:
[root@hepcms-namenode ~]# hadoop dfsadmin -safemode enter
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Safe mode is ON
[root@hepcms-namenode ~]#
We rebooted all the worker nodes and in2 one by one after putting hadoop in safe mode.
Ovirt and, in turn, all the VMs were rebooted after we fixed the ovirt memory problem.
All the machines were brought back according to the recovery plan:
https://sites.google.com/a/physics.umd.edu/tier-3-umd/admin-guide/troubleshooting/powerup
yum check-update tells you if anything still needs to be updated
For these nodes, try not to update anything unless it's needed for security, and never reboot; just restart services:
hepcms-foreman (nothing has been updated except security updates since the VM was made) - definitely don't update foreman!
hepcms-hn (yum cron still running 13 Nov 2015)
hepcms-ovirt (nothing has been updated except security updates since the machine was made)
xhepcms-squid.privnet (remove the public IP from this node before rebooting; it was removed from Foreman but didn't go away from hepcms-ovirt until reboot)
hepcms-0.umd.edu -- need to remake maybe
hepcms-1.umd.edu -- need to remake maybe
xhepcms-in1.umd.edu
xhepcms-in2.umd.edu
xhepcms-in3.umd.edu
hepcms-in4.umd.edu
xhepcms-sl5.umd.edu
Had to mount /data and /home by hand (it's a kluged machine so I'm not surprised)
Also had to restart condor as condor_q didn't work (and then it was fine)
xhepcms-foreman.umd.edu
xhepcms-ovirt.umd.edu
xhepcms-hn.umd.edu
xr720-datanfs.privnet
xhepcms-namenode.privnet
xhepcms-secondary-namenode.privnet
xr720-0-1.privnet (reboot WNs one at a time and give hadoop time to recover)
xr720-0-2.privnet
xr510-0-1.privnet
xr510-0-9.privnet (not yet on hadoop - use as test kickstart with wipe from Foreman?)
xcompute-0-5.privnet
NSS software vulnerability notes from Trey, including the restart-services script:
# Install and/or update yum-utils and yum-plugin-ps
yum install yum-utils yum-plugin-ps
# Check if anything needs restarting before update
needs-restarting
# Check what is running that links to nss or nspr packages
yum ps nss\* nspr\*
yum update nss\* nspr\*
# Check what needs to be restarted
needs-restarting
# Restart services - so far these are ones I've found
for s in nfslock messagebus haldaemon munge ntpd postfix sshd sssd zabbix-agent mcollective ovirt-guest-agent httpd tomcat6 atd crond ; do
    if test -f /etc/init.d/${s} ; then
        /etc/init.d/${s} status &>/dev/null
        if [ $? -eq 0 ]; then
            /etc/init.d/${s} restart
            /etc/init.d/${s} status
        fi
    fi
done
# Dell tools too on Dell systems
if test -f /opt/dell/srvadmin/sbin/srvadmin-services.sh ; then
    /opt/dell/srvadmin/sbin/srvadmin-services.sh restart
fi
Per advice from Trey, turned these things off on hepcms-hn (disable commands are sketched after the transcript below):
Trey: "Ah, I'd first do `/etc/init.d/cups stop ; chkconfig cups off`, no reason to run print services"
Trey: "May be worth disabling the stuff you don't need, like `libvirtd`, `dnsmasq`"
`polkitd` was also recommended ("if it's a service"), but it turned out not to be a service:
[root@hepcms-hn ~]# service polkitd status
polkitd: unrecognized service
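A sketch of the disable steps Trey suggested, assuming SysV init on SL6 (the loop simply skips anything not installed on the node):
for s in cups libvirtd dnsmasq ; do
    if test -f /etc/init.d/${s} ; then
        service ${s} stop
        chkconfig ${s} off
    fi
done
chkconfig --list | grep ':on'   # review what is still set to start at boot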