Link to internal troubleshooting page: Troubleshooting
Link to weekly meetings: meetings
https://umdt3.slack.com/archives/t3newsysadmin/p1465839605000051
Link to the BIGGER admin guide: UMDT3
Things to do before using any network stop or similar command (or any change in IP table)
Please make sure you are physically in front of the machine.
Please run the commands by others working on the cluster.
RESOURCES And Commands
Yum commands YUM
/bin/ — Used to store user commands. The directory /usr/bin/ also stores user commands.
/sbin/ — Location of many system commands, such as shutdown. The directory /usr/sbin/ also contains many system commands.
/root/ — The home directory of root, the superuser.
/misc/ — This directory is used for automatically mounting directories on removable devices (such as Zip drives) and remote directories (such as NFS shares) using autofs. Refer to the autofs manual page (type man autofs at a shell prompt) for more information.
/mnt/ — This directory typically contains the mount points for file systems mounted after the system is booted.
/media/ — This directory contains the mount points for removable media, such as diskettes, CD-ROMs, and USB flash drives.
/boot/ — Contains the kernel and other files used during system startup.
/lost+found/ — Used by fsck to place orphaned files (files without names).
/lib/ — Contains many device modules and library files used by programs in /bin/ and /sbin/. The directory /usr/lib/ contains library files for user applications.
/dev/ — Stores device files.
/etc/ — Contains configuration files and directories.
/var/ — For variable (or constantly changing) files, such as log files and the printer spool.
/usr/ — Contains files and directories directly relating to users of the system, such as programs and supporting library files.
/proc/ — A virtual file system (not actually stored on the disk) that contains system information used by certain programs.
/initrd/ — A directory that is used to mount the initrd.img image file and load needed device modules during bootup.
Warning: Do not delete the /initrd/ directory. You will be unable to boot your computer if you delete the directory and then reboot your Red Hat Enterprise Linux system.
/tftpboot/ — Contains files and applications needed for Preboot Execution Environment (PXE), a service that allows client machines and machines without hard drives to boot an operating system from an image on a central PXE server.
/tmp/ — The temporary directory for users and programs. /tmp/ allows all users on a system read and write access.
/home/ — Default location of user home directories.
/opt/ — Directory where optional files and programs are stored. This directory is used mainly by third-party developers for easy installation and uninstallation of their software packages.
NIS Account Creation /Management
In case there are two similar accounts (e.g. oscillatorb and OscillatorB)
To add a sysadmin to the sudoers file and implement sudo on an individual node:
Condor/Grid Jobs
Replace bad disks:
Check Status:
Format new disk
Identify bad disk:
omreport storage pdisk controller=0 pdisk=0:0:3
Identify the disk on the machine; disk 0:0:3 should have a blinking light after the following command is run:
omconfig storage pdisk action=blink controller=0 pdisk=0:0:3
Replace the disk and check if new disk is in non-critical state.
omreport storage pdisk controller=0 pdisk=0:0:3
Stop blinking:
omconfig storage pdisk action=unblink controller=0 pdisk=0:0:3
DISK CLEAN UP
Remove temp files more than 5 days old
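A minimal sketch of the kind of command this cleanup refers to; the path and the 5-day threshold come from the line above, but the actual cleanup script used on the cluster may differ:
# remove regular files under /tmp not modified in the last 5 days
find /tmp -type f -mtime +5 -exec rm -f {} \;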
Hadoop PhEDEx Cleaning
HADOOP /NFS
Hadoop-hdfs-datanode service fails and a partition is unmounted
Hadoop service failed to start on a datanode with a java exception error
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
Additional NFS disk mount problem: requested NFS version or transport protocol is not supported
Did you reboot things in the wrong order? NFS disks not mounted?
Puppet/Foreman
Oh no you ran puppet where you shouldn't have and want to roll back your file
Check what the changes were after you try something in Puppet:
Test puppet changes without implementing them on a node for just one feature (tags):
Stop a puppet agent (these run automatically on a node either in kickstart or crontab):
Make that puppet agent not start automatically upon node reboot:
Start a puppet agent (these run automatically on a node either in kickstart or crontab):
Want to add a puppet class in base.pp or site.pp instead of on hepcms-foreman web?
Want to add a puppet class in a hiera yaml instead of on hepcms-foreman web?
r10k make sure we don't lose changes updated locally and not in github:
Foreman kickstart telling you there's not enough disk space for partitions?
Is the Foreman build of a baremetal machine working (checking during build):
Add a puppet module by hand in an area (locally) where r10k & git won't affect it:
Do proper ordering of install to ensure program (i.e. facter) comes from puppetlabs instead of epel:
Change in some .yaml parameter or class not taking effect at all on a node?
Check puppet agent behavior for a specific module (on that node):
All your nodes in the hepcms-foreman web page suddenly orange for "not in sync"?
How to change a puppet configuration file in your hiera .yaml?
Did your hiera implementation give you something weird, like ["?
Why is my node stuck in blue A and always doing the same update?
Implementing a new puppet module and get an error about "Could not find class"?
RENEWING GRID SITE CERTIFICATES
MONITORING
Ganglia web interface not working:
HARDWARE ISSUES
Consult this web page for more technical information about hardware identity
Machine has Orange blinking light, or orange "electrical", or orange "hard drive" symbol
https://servername.privnet:1311 from firefox on the local network (i.e. hepcmsdev-1):
Need to know which disk number in Dell corresponds to which hard disk?
Site Certs
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallCertAuth
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/OsgCaCertsUpdater
Condor commands
ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [
the command to see your CE reporting is:
$ condor_status -pool collector.opensciencegrid.org:9619 -any | grep -i umd
#Stop service after current jobs stop
condor_off -startd -peaceful r720-0-2
# start queue on node
systemctl start condor
general
df -ah
umount -nf /data
mount /data
To see the partitions and mounted system information on a server:
/etc/fstab
/etc/exports
ps -ef | grep rsync
ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# kill 1938084
[root@hepcms-in2 condor]# kill 9911
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
[checked that condor_schedd is killed]
root 2979085 0.0 0.0 6448 692 pts/21 S+ 21:33 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [
Hadoop commands
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
hadoop dfsadmin -report
hadoop fsck / -blocks
service hadoop-hdfs-datanode status
This logs in /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-xxxx.privnet.out
You can grep for warnings:
grep -i warn /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.log
If a disk fails, you need to take it out of hadoop. Then exclude this datanode (use the internal name, like r510-0-5.privnet): edit /etc/hadoop/conf/hosts-exclude on hepcms-namenode, then run hdfs dfsadmin -refreshNodes (a sketch is given below).
lsof /mnt/hadoop can show if someone has a lot of ROOT files open at the same time.
To balance hadoop manually, run hdfs balancer on hepcms-namenode (the command can be run from any hadoop node).
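A sketch of the datanode-exclusion step described above; the hostname is just the example name used above, and appending with echo assumes the exclude file is a plain one-host-per-line list:
# On hepcms-namenode: add the datanode to the exclude file and re-read it
echo "r510-0-5.privnet" >> /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes
# watch the decommissioning progress
hadoop dfsadmin -report | grep -A 2 r510-0-5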
RSV tests
http://opensciencegrid.org/docs/monitoring/rsv-control/
on hepcms-ce
List all metrics: rsv-control --job-list
Run these two metrics: rsv-control --run --host hepcms-1 org.osg.general.ping-host org.osg.general.java-version
Disable the metrics no longer required after the update to 3.4 (bestman and gratia):
rsv-control --disable --host hepcms-1 org.osg.srm.srmcp-readwrite
rsv-control --disable --host hepcms-1 org.osg.gratia.metric
rsv-control --disable --host hepcms-0.umd.edu:8443 org.osg.srm.srmcp-readwrite
The names of the hosts and which metrics are enabled are on the monitoring RSV page.
DNS Server:
sudo yum install bind bind-utils
Here are the instructions:
https://www.digitalocean.com/community/tutorials/how-to-configure-bind-as-a-private-network-dns-server-on-centos-7
The following files need changes. To add a new IP address, edit the following three files:
/etc/named.conf,
/var/named/dynamic/db.privnet
/var/named/dynamic/db.1.10.in-addr.arpa
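A sketch of the records you would add for a new host; the hostname and address below are placeholders, and remember to bump the zone serial in each file's SOA record:
; in /var/named/dynamic/db.privnet (forward zone)
newnode    IN  A    10.1.0.99
; in /var/named/dynamic/db.1.10.in-addr.arpa (reverse zone)
99.0       IN  PTR  newnode.privnet.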
possible solution:
1.) Backup your .ssh folder by mv .ssh .ssh_backup
2.) delete the .ssh folder in your home directory
3.) ssh into username@hepcms.umd.edu
It could be that the user tried a wrong name or password too many times (3) and the IP was blocked by denyhosts.
On each interactive node use the script /root/unblock_denyhosts.sh to clear the block. It only needs one argument, the IP address.
If you do not have the IP address of a user, but the user tried and failed to login, you can determine the user's IP address by searching the /var/log/secure* files for the username (which should have the IP address listed). Running "grep -i [username] /var/log/*" will help here.
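Putting the two steps above together (the username and IP below are placeholders):
# find the blocked user's IP in the secure logs
grep -i username /var/log/secure*
# then, on each interactive node, clear the block for that IP
/root/unblock_denyhosts.sh 1.2.3.4   # replace with the user's actual IP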
Only for CMS HEP (Nick Hadley, Sarah Eno, Andris Skuja, Alberto Belloni, Drew Baden), CMS Heavy Ion (Alice Mignerey's group), UMD theory (Raman Sundrum's group), and for Higgs Honors class taught by Shabnam Jabeen and Sarah Eno (those accounts for the duration of the class and not longer)
Can get special permission for CMS colleagues offsite, talk to Nick Hadley, but we prefer they use grid tools and not local accounts to access CMS shared files
If you have an email request that seems legitimate, it could still be a social hack. Please confirm with a professor/graduate student/postdoc in the above groups, as well as looking up that person in the UMD directory (note that occasionally postdocs are slow to show up in the directory).
The scripts are now in /root/scripts_accounts/ on hepcms-hn
Reference: http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch30_:_Configuring_NIS#Adding_New_NIS_Users
Add Users
Log into the head node (HN) (hepcms-hn.umd.edu)
su - to become root
Note that we prefer NOT to make accounts with uppercase, as it generally causes case-sensitive login problems; only make them lowercase.
run the script MakeAccount.sh
cd scripts_accounts
./MakeAccount.sh [usernames] [password] # use "" with multiple names
The above script is equivalent to:
Make the user and give them a first password (it can be anything; it doesn't matter since they have to change it), and set them required to change it upon first login (if they are a theory user, do -g theory):
useradd -g users -c "Full Name" username
passwd username
Set the password to expire so the user must change it on first login: chage -d 0 username. This appears to work with NIS in tests (i.e. it forced the student to change their password and the new one was propagated across all machines); if for some reason it doesn't, use yppasswd instead for changes.
Anytime you make changes to the main NIS database of users (password, new users, etc.), update the maps:
cd /var/yp; make
Optional: You can check to see if the user's authentication information has been updated by using the ypmatch command, which should return the user's encrypted password string: (optional, a bit buggy? was not working properly as of 1/6/2016. This step can be skipped for now ) - Margarita
ypmatch username passwd
Tested and it seems to work. Remember: use yppasswd instead of passwd.
Again, above steps are now in a script (mainly to create large number of accounts) : /root/scripts_accounts/MakeAccount.sh
Make /data Directory (for some users)
Note that Higgs Honors students do NOT get /data area that is provided to normal HEP users.
make a /data/users/username area (if theory they may also create space in /data/groups/theory/username upon request)
From the head node, go through internal network to r720-datanfs, be sure you are root (su -):
ssh r720-datanfs
mkdir /data/users/username
chown username:users /data/users/username
Only if requested, make an SE area in /hadoop (instructions HERE)
Document new user
document the new user in our .csv file (so they can get sysadmin emails) and generate text for a welcome email
The following script should now automatically run as part of /root/scripts_accounts/MakeAccount.sh
To write to the file, ensure that you have entered No when asked to add a new user, and do not abort before that.
Would you like to add a new user (Y/N)? N
The following users have been added:
xxxxx
Write new users to output file '/root/cronscripts/hepcms_Users.csv' (Y/N)? Y
<Writing to output file '/root/cronscripts/hepcms_Users.csv'>
Here is the manual command:
cd /root/scripts_accounts
python AddNewUser.py
group examples: "theory" "HIN" (heavy ion) "Fall2015" (Higgs class), the default is CMS HEP, so no group needs to be specified
Generate welcome email text you can COPY/PASTE into your mail program, see options here: python pyNewUserInstructions.py --help
Note that the python script may not be able to handle special characters in the password properly, so make sure it puts the information in correctly; use, for instance: --passwd="Special&Pass"
As a side note you can see how this .csv file is used elsewhere: python SendMail.py --help, and python parseUsers.py --help
If you have an email request, ensure that this is a true request and not a hacked account or a social hack, try to make phone/voice contact with the user if possible
From ANY machine on the cluster as root (su -) or sudoers (sudo -i)
yppasswd username
chage -d 0 username
cd /var/yp; make
This should automatically update; if for some reason it doesn't, you can make changes on the head node as root and update the NIS maps:
cd /var/yp; make
In case of error in changing password due to chfn:
You will get an error in the HN /var/log/messages like this:
ONLY for this problem, on the HN (as root su -), use system-config-users to edit the password by hand. Be very careful, a lot of account destruction can be done with this program
Then re-make the NIS database with this command (on the HN as root su -):
cd /var/yp; make
How to remove a user:
Not yet documented, please consider data retention policies (my general guideline is once the proofs have been submitted to the journal and you've followed experimental guidelines in data retention, you can delete the files). Also consider sometimes users use things in other people's areas (like the geant files in /data/users/jtemple)
userdel -r username
cd /var/yp; make
For the following two areas which won't get automatically moved, please consider other users may share files made by one user in these areas. Additionally, there might be a PhEDEx registered dataset in a user's private hadoop SE area!
Don't forget that they will have an area in /data/users/username
They might also have an area in /mnt/hadoop/cms/user/username, and an associated grid certificate account on our gums service (check the GUMS page to remove that)
hepcms-hn: cd /var/yp; make
The GUI controls users and groups for accounts on hepcms-hn, it’s just another way to do Linux account management other than command line.
NIS handles spreading that information to the rest of the cluster. Anytime you change accounts, either with system-config-users, or with useradd, userdel, or any other *Linux* users tools, you have to tell NIS to pick up and spread the changes.
as root or sudo
chown -R username: foldername
A) Old oscillator and OscillatorB removed via system-config-users.
B) Instances of andrej removed carefully from /etc/group, /etc/gshadow, and /etc/shadow; after the above, system-config-users no longer complains about andrej.
C)
cd /var/yp; make; cd -
just to be sure everything's in sync for NIS.
Well, I had already copied some files from the two home directories before all this and put them here:
/data/users/oscillator
In one is a CMS map program in a public_html folder, which was pretty cool, and the tarfile has some programs, so it's up to Fred to keep or delete. I did
chown -R oscillator:users /data/users/oscillator/oscillatorb*
to make them properly owned.
ypchsh
For the above command to work, this argument needed to be set in /etc/sysconfig/yppasswdd on the headnode.
https://www.linux.com/learn/tutorials/306766:linux-101-introduction-to-sudo
Also (see sudoers here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands)
On that specific node, as root ("su -" OR "sudo -i" ) add the user username to the "wheel" group.
usermod -aG wheel username (to add)
gpasswd -d username wheel (to remove from the group)
Then, on the individual node where you want to give the user access, use visudo to edit the /etc/sudoers file (be very careful because you can mess up the system with changes to this file).
Make sure to use visudo, since it will check to make sure that the sudoers file is properly formatted.
visudo
find the following two lines
## Allows people in group wheel to run all commands
# %wheel ALL=(ALL) ALL
Move your cursor onto the # before %wheel and delete the # by pressing x.
Alternatively, you can press Insert to go into editing mode, and use Backspace to erase the #.
Save and exit by typing :x (press Esc first; you can look up vi text editor commands if need be).
It should look like this now for that line: %wheel ALL=(ALL) ALL
Make sure you are editing it as root (su -), otherwise the changes do not save.
Exit out of root, and log in as your regular username (the one used with usermod -aG wheel username)
Test this (may not work right away -- see below)
groups
sudo su - (it appears that the command "sudo -i" works instead)
The command groups should show you being in the group users, and wheel
You will be warned, and you should now have root access *on that node only*. NIS doesn't sync sudo in the current settings.
The command sudo tells unix to run a single command as root, in this case the su - will elevate you to root permanently, thus allowing you to enter in more commands as root
Enter in your user password, and you should see a # instead of $ indicating you are currently root
Note (6 Oct 2015): I wasn't able to get the su USERNAME - command to work, I successfully added "belt" to the wheel group on hepcms-in2, and successfully edited /etc/sudoers (with visudo) as root to have:
%wheel ALL=(ALL) ALL
And still sudo whoami doesn't work as belt on hepcms-in2. No idea why. (but it is apparently ok!)
http://linuxpoison.blogspot.com/2008/12/configuring-sudo-and-adding-users-to.html
Test that this works (days later it magically worked, and didn't work immediately on hepcms-in1 3:10pm 19 Oct 2015):
[belt@hepcms-in2 ~]$ sudo more /etc/sudoers.d/10_wheel
[sudo] password for belt:
%wheel ALL=(ALL) ALL
[belt@hepcms-in2 ~]$ more /etc/sudoers.d/10_wheel
/etc/sudoers.d/10_wheel: Permission denied
5:34pm 19 Oct 2015: it works now on hepcms-in1! So apparently there's some time delay needed after setting this up.
We don't use GUMS anymore. Move on to the map file section below.
Note: this is done ONLY by request
Requirements:
Admin must be able to make a new user account on hepcms-hn
Admin must be in the admins group on GUMS (https://hepcmsdev-6.umd.edu:8443/gums/manualUserGroups.jsp) - authenticate with grid cert
Have the user's grid certificate DN (the output of voms-proxy-info helps; they have instructions on our user's page: http://hep-t3.physics.umd.edu/HowToForUsers.html#crab)
Have the user's CERN account name (could be different than their hepcms account name), this is because crab jobs will write to /mnt/hadoop/cms/store/user/CERNUsername
Note that CERNUsername could be the same as HepcmsUsername
Make a new user account, with HepcmsUsername_g in the (default) users group, it's not a standard login, so we make it with /bin/true
useradd -g users -c "HepcmsUsername grid user" -n HepcmsUsername_g -s /bin/true
Note that we prefer not to have usernames with capitalization, it's indicated in this section only for your readability
There's no password since it's not a login account, so we proceed to sync the NIS maps anyway
Anytime you make changes to the main NIS database of users (password, new users, etc.), update the maps:
cd /var/yp; make
Then make their area on hadoop, as root on any node (maybe ideally on hepcms-namenode? maybe should use hadoop dfs commands? don't know for sure, it worked like this)
cd /mnt/hadoop/cms/store/user
mkdir CERNUsername
chown HepcmsUsername_g:users CERNUsername
Next, you need the cern DN for the user. You can either ask them to run the voms-proxy-info command and send you the output or get it directly from here:
https://lcg-voms2.cern.ch:8443/voms/cms/user/search.action
Edit /data/osg/scripts/grid-mapfile on any node where /data is mounted. This is the file with the grid user mapping, and it is linked on every node as /etc/grid-security/grid-mapfile.
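For reference, a grid-mapfile entry is the quoted DN followed by the local account it maps to; a sketch using the placeholder DN and account names from elsewhere on this page (match the quoting style of the existing entries in the file):
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=CERNUsername/CN=SOMENUMBER/CN=Full User Name" HepcmsUsername_g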
Send the following information to the user:
Your storage space on the UMD HEP T3 SE has been created as /store/user/username
Files are written to this space primarily with CRAB jobs, for documentation, see:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab
Ownership of these files is via a second user account linked to your grid certificate, so you will not be able to move, delete, or rename these files with your regular login, only through SE commands. This is the only difference you will see from a normal local filesystem. Some examples are given here:
https://sites.google.com/a/physics.umd.edu/umdt3/user-guide/file-transfer-from-to-the-cluster#TOC-T3_US_UMD-hadoop-examples:
If you have difficulty using this area, please contact the sysadmins.
Keep in mind that hadoop is internally replicated, so the disk space available is half of what is shown with "df -h". Additionally, one R510 node can store 12TB (after replication), so it is best to keep at least 24TB (before replication) free in case one node goes down to protect the data.
You are strongly encouraged to retain files of 1GB or larger for the health of the hadoop system.
For other information:
https://sites.google.com/a/physics.umd.edu/umdt3/user-guide
old not-used-anymore instructions
Then authenticate with your grid certificate to GUMS web page https://hepcmsdev-6.umd.edu:8443/gums/manualUserGroups.jsp and map their grid certificate to their account
Click on "Manual Account Mappings" (note that if you get some weird page about security, click back on "Home" in the upper left and try again, should be fixed)
Click "add" at the very bottom of the page
Put their full DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=CERNUsername/CN=SOMENUMBER/CN=Full User Name
Choose localAccountMapper
Put in their HepcmsUsername_g
Click "save"
Notify the user that they have an account, there is a shell script on hepcms-hn in /root/cronscripts/HadoopSE_NewUserWelcome.txt, that can be used as input for pyNewUserInstructions.py
python pyNewUserInstructions.py -u CERNUsername -t HadoopSE_NewUserWelcome.txt
It will ask for First Name and Password, those values don't matter as they aren't in the NewUserWelcome.txt file, you can put anything there or press enter
Email the user the text of the new user welcome
It turns out they do have a ROOT distribution for SLC6; they just don't announce it on the ROOT page (odd!).
source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc7-opt/setup.sh
After installing new certificates, the xrootd service failed.
[root@hepcms-0 xrd]# service cmsd restart
Shutting down xrootd (cmsd, default): [ OK ]
Starting xrootd (cmsd, default): [ OK ]
[root@hepcms-0 xrd]# service xrootd status
[default] xrootd dead but pid file exists
[root@hepcms-0 xrd]# service cmsd status
[default] cmsd (pid 18945) is running...
First tried removing the stale pid file:
[root@hepcms-0 xrd]# ps -eaf | grep pid
xrootd 18945 1 0 11:27 ? 00:00:05 /usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k fifo -b -s /var/run/xrootd/cmsd-default.pid -n default
root 19675 17790 0 11:40 pts/0 00:00:00 grep pid
[root@hepcms-0 xrd]#
root@hepcms-0 xrd]# ls -slrt /var/run/xrootd/
total 20
0 prw-r----- 1 xrootd xrootd 0 Oct 20 12:24 ofsEvents
4 -rw-r--r-- 1 xrootd xrootd 5 Oct 27 07:21 xrootd.pid
4 -rw-r--r-- 1 xrootd xrootd 169 Oct 27 07:21 xrootd.anon.env
4 -rw-r--r-- 1 xrootd xrootd 68 Jan 5 11:51 cmsd.pid
4 -rw-r--r-- 1 xrootd xrootd 5 Jan 5 11:51 cmsd-default.pid
4 -rw-r--r-- 1 xrootd xrootd 167 Jan 5 11:51 cmsd.anon.env
[root@hepcms-0 xrd]# mv /var/run/xrootd/xrootd.pid /var/run/xrootd/xrootd.pid-old
[root@hepcms-0 xrd]# service xrootd start
Starting xrootd (xrootd, default): [FAILED]
[root@hepcms-0 xrd]# service xrootd restart
Shutting down xrootd (xrootd, default): [FAILED]
Starting xrootd (xrootd, default): [FAILED]
Killed all PIDs associated with xrootd, obtained via:
root@hepcms-0 ~]# ps -ef | grep xroot
xrootd 18990 1 0 11:27 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
xrootd 20223 1 0 11:43 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
xrootd 21145 1 0 11:51 ? 00:00:08 /usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k fifo -b -s /var/run/xrootd/cmsd-default.pid -n default
xrootd 21190 21145 0 11:51 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
Located an error message in /var/log/xrootd/xrootd.log: xrootd.t2.ucsd.edu:9930 '; Name or service not known'.
Edited /etc/xrootd/xrootd-clustered.cfg according to instructions here:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDMonitoring#XRootD_Site_Configuration
service xrootd start worked.
hadoop12 and hadoop8 lost their partitions, so hadoop was filling up the / area while trying to write to /hadoop12.
Moved /hadoop12/data/current to /hadoop1.
[root@r510-0-11 ~]# parted /dev/sda print all | grep /dev
Disk /dev/sda: 2000GB
Disk /dev/sdb: 2000GB
Disk /dev/sdc: 2000GB
Disk /dev/sdd: 2000GB
Disk /dev/sde: 2000GB
Disk /dev/sdf: 2000GB
Error: /dev/sdl: unrecognised disk label
Disk /dev/sdj: 2000GB
Disk /dev/sdk: 2000GB
Disk /dev/sdg: 2000GB
Disk /dev/sdi: 2000GB
Error: /dev/sdh: unrecognised disk label
[root@r510-0-11 ~]# mkfs.ext4 /dev/sdl
[root@r510-0-11 ~]# mkfs.ext4 /dev/sdh
root@r510-0-11 ~]# blkid
/dev/sdh: UUID="2e4df78f-1cdc-4243-b309-f24f20154e14" TYPE="ext4"
/dev/sdl: UUID="cdb099c6-1798-4c8e-84cd-7b2f59641110" TYPE="ext4"
Added the UUIDs back to /etc/fstab and mounted the disks again (example entries sketched below).
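A sketch of what the two /etc/fstab entries might look like; the UUID-to-mountpoint pairing and the "defaults" mount options are assumptions for illustration, so check them against the existing entries:
UUID=2e4df78f-1cdc-4243-b309-f24f20154e14  /hadoop8   ext4  defaults  0 0
UUID=cdb099c6-1798-4c8e-84cd-7b2f59641110  /hadoop12  ext4  defaults  0 0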
[root@r510-0-11 ~]# mount -a /hadoop12
[root@r510-0-11 ~]# mount -a /hadoop8
make sure both /hadoop disks have data directories
ls -slrt /hadoop*
/hadoop12:
total 20
16 drwx------ 2 root root 16384 Mar 25 13:10 lost+found
4 drwxr-xr-x 2 hdfs hadoop 4096 Mar 25 13:45 data
/hadoop8:
total 20
16 drwx------ 2 root root 16384 Mar 25 13:27 lost+found
4 drwxr-xr-x 2 hdfs hadoop 4096 Mar 25 13:45 data
Also check that the hadoop disks are not masked (i.e. that they are still listed in /etc/hadoop/conf/hdfs-site.xml):
[root@r510-0-11 ~]# grep hadoop12 /etc/hadoop/conf/hdfs-site.xml
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop4/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop11/data,/hadoop12/data</value>
[root@r510-0-11 ~]#
restart hadoop service
[root@r510-0-11 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
stopping datanode
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r510-0-11.privnet.out
[root@r510-0-11 ~]#
CE troubleshooting
https://opensciencegrid.org/docs/compute-element/troubleshoot-htcondor-ce/
The error is generated in the .bashrc login file:
# CMSSW
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch/
. $VO_CMS_SW_DIR/cmsset_default.sh
Unmount and mount cvmfs:
umount -l /cvmfs/cms.cern.ch ; mount /cvmfs/cms.cern.ch
To check or release the jobs, go to the scheduler (in1 or in2):
[jabeen@hepcms-in1 ~]$condor_q -hold -af HoldReason
Error from slot2@r510-0-4.privnet: Failed to execute '/data/users/ahorst/hgcal_tile/build/condor-executable.sh': (errno=13: 'Permission denied') 637568.0 [????????????] [?????????] Error from slot2@r510-0-4.privnet: Failed to execute '/data/users/ahorst/hgcal_tile/build/condor-executable.sh': (errno=13: 'Permission denied')
[jabeen@hepcms-in1 ~]$ ls -lsrt /data/users/ahorst/hgcal_tile/build/condor-executable.sh
4 -rw-r--r-- 1 ahorst users 2035 Jun 15 09:47 /data/users/ahorst/hgcal_tile/build/condor-executable.sh
The script does not seem to have executable permissions.
You can also use
condor_q -analyze 3393151.0
Once the hold reason is fixed, release the job using its ID:
condor_release 3393151.0
or release all your jobs as:
condor_release jabeen
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SiteConfInGitlab
https://gitlab.cern.ch/SITECONF/T3_US_UMD
Changed the file on the git site for UMD.
This storage.xml file is the one in the cvmfs area and is owned by cvmfs. Changes are propagated from git to the local cluster in an hour or so.
[jabeen@hepcms-0 ~]$ cd /cvmfs/cms.cern.ch/SITECONF/T3_US_UMD/PhEDEx/
The file in the hepcms-se xrootd area should be identical.
[root@hepcms-0 xrootd]# ls -slrt /etc/xrootd/storage.xml
[root@hepcms-0 xrootd]# cp storage.xml storage.xml_Feb2018
[root@hepcms-0 xrootd]# emacs -nw storage.xml
You can update the git file manually.
If the storage.xml file in your cvmfs area hasn't updated to the latest git commit at UMD yet, you could try the following as root:
# cvmfs_talk -i cms.cern.ch evict /SITECONF/T3_US_UMD/PhEDEx/storage.xml
OK
# stat /cvmfs/cms.cern.ch//SITECONF/T3_US_UMD/PhEDEx/storage.xml
File: `/cvmfs/cms.cern.ch//SITECONF/T3_US_UMD/PhEDEx/storage.xml'
Size: 1384 Blocks: 3 IO Block: 4096 regular file
Device: 1ah/26d Inode: 188799077 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 498/ cvmfs) Gid: ( 498/ cvmfs)
Access: 2018-02-16 13:43:28.000000000 -0600
Modify: 2018-02-16 13:43:28.000000000 -0600
Change: 2018-02-16 13:43:28.000000000 -0600
#
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
Changed all instances of /sharesoft/cmssw to /cvmfs/cms.cern.ch.
From Stephan Lammel at FNAL:
T3_US_UMD is currently using Posix cp as stage-out command.
Once we go to Singularity, a command that understands certificates
would be needed. Can i ask you what your plans are? Would it be
possible to resolve/replace cp with, for instance, gfal2 or xrdcp
now? (The SAM WN-mc test requires role=production and we would
like to switch it to lcgadmin and ship a certificate with it
instead. This, however, will not work in case of Posix cp.)
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
changed
<command value="cp" />
<catalog url="trivialcatalog_file://sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx/storage.xml?protocol=direct"/>
to
<command value="gfal2"/>
<catalog url="trivialcatalog_file://sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx/storage.xml?protocol=srmv2"/>
(For instance to switch from cp to gfal2, which most sites use.)
Instructions on how to change files of your site in SITECONF are
at https://twiki.cern.ch/twiki/bin/view/CMSPublic/SiteConfInGitlab .
Grid SAM metric 13 critical and 15 warning, and GRID RSV
org.osg.srm.srmcp-readwrite
13 org.cms.SRM-VOPut (/cms/Role_production)
15 org.cms.SRM-VOGet (/cms/Role_production)
Detailed output of Metric Result
Field Value
Hostname hepcms-0.umd.edu
Metric org.cms.SRM-VOPut
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2017-08-22T21:45:05Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=1800558125/CN=1185662883
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.9.3
VOPut: Copy file using gfal.filecopy().
Parameters:
source: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-0.umd.edu/testFile.txt
dest: srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-0.umd.edu/lcg-util/testfile-put-nospacetoken-1503438003-1b90049def0b.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2017-08-22 23:41:21.609877
ERROR: DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on etf-18.cern.ch reports Error reading token data header: Connection reset by peer
VO specific Detailed Output: None critical= 1 File was NOT copied to SRM. file= testfile-put-nospacetoken-1503438003-1b90049def0b.txt
metricName: org.osg.srm.srmcp-readwrite
metricType: status
timestamp: 2017-08-22 18:34:51 EDT
metricStatus: CRITICAL
serviceType: OSG-SRM
serviceURI: hepcms-0.umd.edu:8443
gatheredAt: hepcms-1.umd.edu
summaryData: CRITICAL
detailsData: Failed to transfer file to remote server.
Command: gfal-copy 'file:///usr/share/rsv/probe-helper-files/storage-probe-test-file' 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1503440880.3035853' 2>&1
Output from gfal-copy:
gfal-copy error: 70 (Communication error on send) - DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send, err: [SE][PrepareToPut][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-1.umd.edu reports Error reading token data header: Connection reset by peer
Copying 306 bytes file:///usr/share/rsv/probe-helper-files/storage-probe-test-file => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1503440880.3035853
Bestman seems to be running a large number of threads; shutting down and restarting bestman didn't help.
The log shows this exception:
[root@hepcms-0 xrd]# more /var/log/bestman2/bestman2.log
securePort=8443
-- done with listing web service parameters --
BeStMan: space mgt component is disabled.
[Note:] srmcacheKeywordOn is set to true automatically when space mgt is disabled.
............ no static tokens defined for bestman
.........local SRM is on: httpg://hepcms-0.umd.edu:8443/srm/v2/server current user:bestman
.... using gsi connection.
...appling /etc/bestman2/conf/WEB-INF/jetty.xml
........pool:null qtp310490400{10<=0<=0/256,-1}
..........acceptQueueSize:0
..................acceptor:1
java.net.BindException: Address already in use
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
It turns out there are thousands of processes running on that port.
[root@hepcms-0 xrd]# lsof -i:8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 30784 bestman 50u IPv6 98423388 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts430.cern.ch:34328 (ESTABLISHED)
java 30784 bestman 71u IPv6 45003155 0t0 TCP *:pcsync-https (LISTEN)
java 30784 bestman 72u IPv6 97613623 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts432.cern.ch:42720 (ESTABLISHED)
java 30784 bestman 74u IPv6 108512931 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts433.cern.ch:52592 (CLOSE_WAIT)
java 30784 bestman 75u IPv6 98581670 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts435.cern.ch:38534 (ESTABLISHED)
java 30784 bestman 76u IPv6 108512932 0t0 TCP hepcms-0.umd.edu:pcsync-https->bighep.ucr.edu:37924 (CLOSE_WAIT)
java 30784 bestman 77u IPv6 108512933 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts433.cern.ch:52600 (CLOSE_WAIT)
Killed all these processes:
[root@hepcms-0 xrd]# kill 30784
[root@hepcms-0 xrd]# lsof -i:8443
[root@hepcms-0 xrd]#
[root@hepcms-0 xrd]#
Stop and start bestman again:
[root@hepcms-0 xrd]# service bestman2 stop
Shutting down bestman2: [ OK ]
[root@hepcms-0 xrd]# service bestman2 start
Starting bestman2: [ OK ]
This fixed the 100% CPU issue
Cleaned cvmfs for the @all and @vm nodes. NOTE: gridftp is not yet part of the vm group for clush, so did it separately.
on [root@hepcms-hn ~]#
ssh-agent $SHELL
ssh-add
clush -w @all /data/osg/scripts/fixCVMFS.sh
clush -w @all df -ah | grep cvmfs
clush -w @vm /data/osg/scripts/fixCVMFS.sh
clush -w @vm df -ah | grep cvmfs
[root@hepcms-gridftp ~]# /data/osg/scripts/fixCVMFS.sh
[root@hepcms-gridftp ~]# service globus-gridftp-server status
GridFTP server is running (pid=11545)
In about 10 minutes crab checkwrite was a success, and later the RSV and SAM metrics were green as well.
[jabeen@hepcms-in2 src]$ cmsenv
voms-proxy-init -voms cms
source /cvmfs/cms.cern.ch/crab3/crab.csh
crab checkwrite --site=T3_US_UMD
metricName: org.osg.srm.srmping metricType: status timestamp: 2018-02-13 14:33:03 EST metricStatus: OK serviceType: OSG-SRM serviceURI: hepcms-0.umd.edu:8443 gatheredAt: hepcms-1.umd.edu summaryData: OK detailsData: SRM server running at hepcms-0.umd.edu:8443 is alive and responding to the srm-ping command. Output from srm-ping: srm-ping 2.2.2.3.0 Wed Nov 7 16:03:09 CST 2012 BeStMan and SRM-Clients Copyright(c) 2007-2012, Lawrence Berkeley National Laboratory. All rights reserved. Support at SRM@LBL.GOV and documents at http://sdm.lbl.gov/bestman OSG Support at osg-software@opensciencegrid.org and documentation at https://www.opensciencegrid.org/bin/view/Documentation/Release3/ ############################################################## # SRM_HOME = /etc/bestman2 # BESTMAN_LIB = /usr/share/java/bestman2 # JAVA_HOME = /etc/alternatives/java_sdk java version "1.7.0_151" OpenJDK Runtime Environment (rhel-2.6.11.0.el6_9-x86_64 u151-b00) OpenJDK 64-Bit Server VM (build 24.151-b00, mixed mode) # BESTMAN_SYSCONF = /etc/sysconfig/bestman2 ############################################################## ################################################################# # BeStMan and BeStMan Clients Copyright(c) 2007-2011, # Lawrence Berkeley National Laboratory. All rights reserved. # Support at SRM@LBL.GOV and documents at http://sdm.lbl.gov/bestman ################################################################# # # BESTMAN_SYSCONF contains both external env settings and internal definitions #
Problem.
Solution
This means the SAM test file is missing from your storage (at the least, there may be more problems). I used the central phedex machine to drop that file in place where I think SAM is going to look for it. We'll see if xrootd will pass now:
Carl Lundstedt
refer to this link: condor_config
For priority tag see this link
Might be due to extra condor_schedd process running on interactive node.
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# kill 1938084
[root@hepcms-in2 condor]# kill 9911
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
[checked that condor_schedd is killed]
root 2979085 0.0 0.0 6448 692 pts/21 S+ 21:33 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
condor 2979179 1.6 0.0 102888 8592 ? Ss 21:33 0:00 condor_schedd -f
root 2979213 0.0 0.0 6452 728 pts/21 S+ 21:33 0:00 grep condor_schedd
User grid jobs are failing with status 60321: "Site related issue: no space, SE down, refused connection".
Also checkwrite errors.
Two RSV tests failing - critical
1 of 16 - metricName: org.osg.srm.srmcp-readwrite
16 of 16: Running metric org.osg.globus.gridftp-simple
metricName: org.osg.globus.gridftp-simple
metricName: org.osg.srm.srmcp-readwrite metricType: status timestamp: 2017-02-11 19:28:03 EST metricStatus: CRITICAL serviceType: OSG-SRM serviceURI: hepcms-0.umd.edu:8443 gatheredAt: hepcms-1.umd.edu summaryData: CRITICAL detailsData: Failed to transfer file to remote server. Command: gfal-copy 'file:///usr/share/rsv/probe-helper-files/storage-probe-test-file' 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1486859280.706269' 2>&1 Output from gfal-copy: gfal-copy error: 13 (Permission denied) - DESTINATION SRM_PUT_TURL srm-ifce err: Permission denied, err: [SE][PrepareToPut][SRM_AUTHORIZATION_FAILURE] httpg://hepcms-0.umd.edu:8443/srm/v2/server: not mapped./DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-1.umd.edu Copying 306 bytes file:///usr/share/rsv/probe-helper-files/storage-probe-test-file => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1486859280.706269 EOT
Solution:
The http cert on hepcmsdev-6 (gums server) had expired; renew it and restart mysqld and tomcat6 on hepcmsdev-6.
This could also happen if a new cert does not have the right permissions. For example, bestmancert.pem is not owned by bestman.
Grid SAM SRM CRITICAL, also checkwrite failure
Metric 13 CRITICAL and 15 WARNING showed that the gridftp cert on hepcms-gridftp had expired.
This also made bestman2 on hepcms-se 'dead'.
13 org.cms.SRM-VOPut (/cms/Role_production)
15 org.cms.SRM-VOGet (/cms/Role_production)
Field Value
Hostname hepcms-0.umd.edu
Metric org.cms.SRM-VOPut
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2017-07-14T13:22:22Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=22887794/CN=116673394
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.9.3
VOPut: Copy file using gfal.filecopy().
Parameters:
source: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-0.umd.edu/testFile.txt
dest: srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-0.umd.edu/lcg-util/testfile-put-nospacetoken-1500038540-8a81d45b79fb.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2017-07-14 15:22:20.847726
ERROR: globus_ftp_client: the server responded with an error 530 530-globus_xio: Server side credential failure 530-globus_gsi_gssapi: Error with GSI credential 530-globus_gsi_gssapi: Error with gss credential handle 530-globus_credential: Error with credential: The host credential: /etc/grid-security/hostcert.pem 530- with subject: /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu 530- has expired 1151 minutes ago. 530- 530 End.
VO specific Detailed Output: None critical= 1 File was NOT copied to SRM. file= testfile-put-nospacetoken-1500038540-8a81d45b79fb.txt
Solution:
Renewed the gridftp certs on hepcms-gridftp.
For details see the cert renewal instructions.
Gridftp cancelling transfer due to over-load limit. Also, SAM metric 12 (VOPut error).
metricName: org.osg.globus.gridftp-simple
metricType: status
timestamp: 2018-04-04 13:18:10 EDT
metricStatus: CRITICAL
serviceType: GridFTP
serviceURI: hepcms-gridftp.umd.edu
gatheredAt: hepcms-1.umd.edu
summaryData: CRITICAL
detailsData: Successful transfer to remote host.
Failed to transfer from remote host.
Command: globus-url-copy 'gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/osg/rsv/gridftp-probe-test-file.1522861680.1747958.remote' 'file:///tmp/gridftp-probe-test-file.1522861680.1747958.local' 2>&1
Output:
error: globus_ftp_client: the server responded with an error
530 Login incorrect. : Server is cancelling transfer due to over-load limit (host=hepcms-gridftp.umd.edu, user=rsv, path=(null))
on hepcms-gridftp
Ran fixCVMFS.
Also, /var was 90% full, so removed old log files.
This seems to fix the SAM errors, but not RSV read/write and gridftp.
Field Value
Hostname hepcms-gridftp.umd.edu
Metric org.cms.SRM-VOGet
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2018-04-04T18:28:38Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=1369956787/CN=1557781315
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.14.2
2018-04-04T18:24:00Z
Source: gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-gridftp.umd.edu/lcg-util/testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
Destination: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-gridftp.umd.edu/testFileIn.txt
Get file using gfal.filecopy().
Parameters:
source: gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-gridftp.umd.edu/lcg-util/testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
dest: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-gridftp.umd.edu/testFileIn.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2018-04-04 20:24:00.610629
ERROR: Could not open source: globus_ftp_client: the server responded with an error 530 Login incorrect. : Server is cancelling transfer due to over-load limit (host=hepcms-gridftp.umd.edu, user=sam, path=(null))
2018-04-04T18:28:38Z
VO specific Detailed Output: None critical= 1 File was NOT copied from SRM. file= testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
Also, top on hepcms-gridftp is extremely busy with Young's processes. That could also be why transfers are failing with overload errors.
RSV tests are still red: srmcp read/write and gridftp. Let's see if they clear.
This did not fix it.
But noticed hadoop is not mounted on gridftp.
Mounted hadoop and restarted the gridftp service:
[root@hepcms-gridftp ~]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp ~]# mount -a /mnt/hadoop
[root@hepcms-gridftp ~]# service globus-gridftp-server start
Starting globus-gridftp-server: [ OK ]
[root@hepcms-gridftp ~]#
Finally read-write RSV test was green but gridftp was still refusing transfers because of over-load.
On the head node, the cvmfs fix seems to fix it:
clush -w @all /data/osg/scripts/fixCVMFS.sh
Everything is back to normal
For now use this workaround:
voms-proxy-init -voms cms
cp /tmp/x509up_u`id -u` ~/
The first line should create your proxy file in the /tmp/ area, which is needed by condor to use your proxy.
The second copies it to your home area because condor can see files in this area.
Then you can add a line in your .jdl explicitly telling condor where to look for the proxy file:
x509userproxy = /home/yhshin/x509up_u1112
We should in fact use the solution below, but as of 16 Feb the solution below does not work. Sent email to T3:
From: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookXrootdService#OpenCondor
Open a file in Condor Batch or CERN Batch
Condor
If one wants to use the local condor batch to analyze user/group skims located at remote sites, the only modification needed is adding:
use_x509userproxy = true
in your condor jdl file (the file which defines universe, Executable, etc..).
For OLDER versions of HTCondor (before 8.0.0), you need:
x509userproxy = /tmp/x509up_uXXXX
The string /tmp/x509up_uXXXX is the string in the "path:" statement from output of "voms-proxy-info -all", which contains your valid grid proxy. Condor will pass this information to the working node of the condor batch.
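A minimal condor submit-file sketch showing where this line goes; the executable and output/log file names are placeholders, not from this page:
universe = vanilla
executable = run.sh
use_x509userproxy = true
# for HTCondor older than 8.0.0, give the proxy path explicitly instead:
# x509userproxy = /tmp/x509up_uXXXX
output = job.out
error = job.err
log = job.log
queue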
Instructions to duplicate issue:
mkdir dummy
cd dummy
cmsrel CMSSW_8_0_6
cd CMSSW_8_0_6/src/
cmsenv
source /cvmfs/cms.cern.ch/crab3/crab.sh
voms-proxy-init -voms cms
crab checkwrite --site=T3_US_UMD
Output of the "checkwrite" command:
Will check write permission in the default location /store/user/<username>
Retrieving DN from proxy...
DN is: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yoshin/CN=742847/CN=Young Ho Shin
Retrieving username from SiteDB...
Username is: yoshin
Validating LFN /store/user/yoshin...
LFN /store/user/yoshin is valid.
Will use `gfal-copy`, `gfal-rm` commands for checking write permissions
Will check write permission in /store/user/yoshin on site T3_US_UMD
Attempting to create (dummy) directory crab3checkwrite_20160617_201033 and copy (dummy) file crab3checkwrite_20160617_201033.tmp to /store/user/yoshin
Executing command: env -i X509_USER_PROXY=/tmp/x509up_u1112 gfal-copy -p -v -t 180 file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp'
Please wait...
Failed running copy command
Stdout:
Copying 85 bytes file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
event: [1466208644235] BOTH GFAL2:CORE:COPY LIST:ENTER
event: [1466208644236] BOTH GFAL2:CORE:COPY LIST:ITEM file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
event: [1466208644236] BOTH GFAL2:CORE:COPY LIST:EXIT
event: [1466208648618] BOTH SRM PREPARE:ENTER
Stderr:
WARNING Failed to ping srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
WARNING Transfer failed with: DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-in1.umd.edu reports Error reading token data header: Connection reset by peer
gfal-copy error: 70 (Communication error on send) - DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-in1.umd.edu reports Error reading token data header: Connection reset by peer
Checkwrite Result:
Unable to check write permission in /store/user/yoshin on site T3_US_UMD
Please try again later or contact the site administrators sending them the 'crab checkwrite' output as printed above.
Note: You cannot write to a site if you did not ask permission.
Solution:
bestman2 runs out of memory (on hepcms-se); verify via /var/log/messages.
Fixed it by restarting the bestman2 service (on hepcms-se).
Result:
Checkwrite Result:
Success: Able to write in /store/user/yoshin on site T3_US_UMD
Note that alternately this (and other CRAB writing output issues) could be due to the following (a few quick checks are sketched after this list):
gridftp not running (hepcms-gridftp)
The user doesn't have a SE account (https://sites.google.com/a/physics.umd.edu/umdt3/user-guide/submitting-analysis-jobs#TOC-To-stage-your-data-back-to-the-hepcms-SE: to request)
The user didn't authenticate with their grid proxy properly (they should have gotten an error about that)
The user's grid certificate is not properly linked in GUMS to their SE user that owns /store/user/CERNusername, or GUMS is not working on hepcmsdev-6 (hepcms-gums)
hadoop is down (see hadoop troubleshooting if the /mnt/hadoop directory's not there)
Some sort of weird hadoop permissions. Note that crab will automatically make *subdirectories* of /store/user/CERNusername for our SE users.
Some node is missing the /store softlink (it's puppetized for everyone)
If it's regular crab job output transfer to SE issues, check the CMS dashboard for the site status of the place the job came from if all our stuff passes tests for SE, CE and hadoop health
A different-looking error appears if there is trouble communicating with CERN.
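A few quick commands corresponding to some of the checks above (run on the relevant nodes; this is just a sketch, not an exhaustive triage):
# on hepcms-gridftp: is the gridftp server running?
service globus-gridftp-server status
# is hadoop mounted where CRAB output lands?
df -h /mnt/hadoop
# is the /store softlink present and not broken?
ls -l /store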
hepcms-gridftp was rebooted August 2nd.
Apparently, /etc/resolv.conf was overwritten at the reboot, messing up the nameservers in resolv.conf. This resulted in not being able to mount /mnt/hadoop or /data.
March 2019: apparently it happens when DNS1 and DNS2 are specified in the ifcfg file. Commented them out (they are also commented out on the HN).
[root@hepcms-gridftp ~]# more /etc/sysconfig/network-scripts/ifcfg-eth1
###DNS1="128.8.74.2"
###DNS2="128.8.76.2"
Old solution:
NOTE: /etc/resolv.conf is overwritten on reboot; a copy is saved as:
4 -rw-r--r-- 1 root root 125 Aug 2 11:56 resolv.conf.save
[root@hepcms-gridftp log]# more /etc/resolv.conf.save
options rotate timeout:1
# This file is being maintained by Puppet.
# DO NOT EDIT
search privnet umd.edu
nameserver 10.1.0.2
The new file somehow had two more addresses, which we commented out.
[root@hepcms-gridftp log]# more /etc/resolv.conf
options rotate timeout:1
# This file is being maintained by Puppet.
# DO NOT EDIT
search privnet umd.edu
nameserver 10.1.0.2
#nameserver 128.8.74.2
#nameserver 128.8.76.2
[root@hepcms-gridftp log]
Now unmount and mount hadoop and start the service.
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 16G 2.9G 13G 19% /
/dev/sda3 16G 573M 15G 4% /tmp
/dev/sda5 7.9G 2.5G 5.1G 33% /var
10.1.0.1:/export/home
7.2T 1.2T 5.7T 18% /home
10.1.0.7:/data 37T 34T 2.9T 93% /data
cvmfs2 20G 398M 20G 2% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 398M 20G 2% /cvmfs/cms.cern.ch
[root@hepcms-gridftp log]# mount -a /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@hepcms-gridftp log]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 16G 2.9G 13G 19% /
/dev/sda3 16G 573M 15G 4% /tmp
/dev/sda5 7.9G 2.5G 5.1G 33% /var
10.1.0.1:/export/home
7.2T 1.2T 5.7T 18% /home
10.1.0.7:/data 37T 34T 2.9T 93% /data
cvmfs2 20G 398M 20G 2% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 398M 20G 2% /cvmfs/cms.cern.ch
fuse_dfs 198T 134T 64T 68% /mnt/hadoop
[root@hepcms-gridftp log]# ls -slrt /
Make sure soft links to store and hadoop in / are green.
[root@hepcms-gridftp log]# service globus-gridftp-server status
GridFTP server is not running
[root@hepcms-gridftp log]# service globus-gridftp-server start
Starting globus-gridftp-server: [ OK ]
[root@hepcms-gridftp log]#
All of the errors below are resolved.
These should have the same permissions as /tmp:
[root@hepcms-in1 ~]# chmod 1777 /dev/shm
[root@hepcms-in1 ~]# ls -ld /dev/shm
drwxrwxrwt 2 root root 40 Feb 27 19:22 /dev/shm
Add a cron script for user@privnet on the head node, in `/root/cronscripts/EnoPriority.sh` for example.
Add in the crontab on the HN (edit `/var/spool/cron/root`) this entry:
*/20 * * * * /root/cronscripts/EnoPriority.sh
That resets his user priority every 20 minutes.
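The script itself is not reproduced on this page; a minimal sketch of what such a priority-reset script might contain, assuming it uses condor_userprio (the username is a placeholder):
#!/bin/bash
# Reset the accumulated usage for one heavy user so their effective priority recovers.
# (condor_userprio -setfactor could be used instead to pin a fixed priority factor.)
condor_userprio -resetusage username@privnet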
These are documented here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/t3-cluster-building-manual/headnode
so, if the user submits a bunch of jobs right now, he would have priority to start taking over the cluster *as existing jobs leave*, even over the 380 idle jobs waiting to start, for instance:
condor_status -submitters
but the `219` and `25` jobs running would have to finish and they can be up to 24 hours long (crab has limits built in, we do not impose time limits on our condor queue)
https://sites.google.com/a/physics.umd.edu/tier-3-umd/dont-edit/commands/clustershell#TOC-Commands:-
clush -w @r510 -b service hadoop-hdfs-datanode restart
These are on the HN in /etc/clustershell/groups:
all: hepcms-in1,hepcms-in2,r720-0-1,r720-0-2,r720-datanfs,r510-0-1,r510-0-5,r510-0-6,r510-0-9,r510-0-10,r510-0-11,r510-0-4,compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11,hepcms-ce,hepcms-se,hepcms-namenode,hepcms-secondary-namenode,hepcms-squid,hepcms-gums,hepcms-gridftp,foreman-vmtest2
bm: hepcms-in2,r720-0-1,r720-0-2,r720-datanfs,r510-0-1,r510-0-4,r510-0-5,r510-0-6,r510-0-9,r510-0-10,r510-0-11,hepcms-gridftp,compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11
vm: hepcms-in1,hepcms-ce,hepcms-se,hepcms-namenode,hepcms-secondary-namenode,hepcms-squid,hepcms-gums,hepcms-gridftp,foreman-vmtest2
int: hepcms-in1,hepcms-in2,hepcms-in3
compute: compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11
r510: r510-0-9,r510-0-5,r510-0-4,r510-0-1,r510-0-6,r510-0-10,r510-0-11
r720: r720-0-1,r720-0-2,r720-datanfs
se: hepcms-se
ce: hepcms-ce
gridftp: hepcms-gridftp
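To double-check what a group expands to before running anything against it, something like this can help (a sketch; nodeset ships with clustershell alongside clush):
nodeset -f @r510
clush -w @r510 -b hostname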
[root@hepcms-hn ~]# clush -w @vm ls -ls /cvmfs/cms.cern.ch/cmsset_default.csh
foreman-vmtest2: ssh: Could not resolve hostname foreman-vmtest2: Name or service not known
clush: foreman-vmtest2: exited with exit code 255
hepcms-se: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-squid: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-squid: exited with exit code 2
hepcms-in1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-namenode: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-namenode: exited with exit code 2
hepcms-secondary-namenode: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-secondary-namenode: exited with exit code 2
hepcms-gums: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-gums: exited with exit code 2
hepcms-ce: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
[root@hepcms-hn ~]# clush -w @all ls -ls /cvmfs/cms.cern.ch/cmsset_default.csh
foreman-vmtest2: ssh: Could not resolve hostname foreman-vmtest2: Name or service not known
clush: foreman-vmtest2: exited with exit code 255
hepcms-in3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
clush: hepcms-in3: exited with exit code 255
compute-0-7: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-8: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-11: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-6: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-in2: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-4: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-5: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-11: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-9: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r720-0-1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r720-0-2: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
The current group definitions on hepcms-hn (more /etc/clustershell/groups):
###
### File managed by puppet
###
all: hepcms-in2,hepcms-in3,compute-0-8,compute-0-6,compute-0-7,compute-0-11,r720-0-1,r720-0-2,r510-0-1,r510-0-5,r510-0-9,r510-0-11,
vm: se, ce, squid, gums, namenode, secondary-namenode, hepcms-in1
INT: hepcms-in1,hepcms-in2,hepcms-in3,hepcms-in4,hepcms-in5,hepcms-in6,hepcms-in7
compute: compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-11
R510: r510-0-1,r510-0-5,r510-0-9,r510-0-11
R720: r720-0-1,r720-0-2
SE: hepcms-se
CE: hepcms-ce
You can also use
clush -w nodename command
clush -w node, then +nodename to add further nodes (e.g. the rest with the broken fuse mount)
clush -w @nodegroup -nodename +nodename ...
to remove nodes from or add nodes to a group for that interactive session.
from head node as root:
ssh-agent $SHELL
ssh-add
clush -w @all df -h
clush -w @compute cvmfs_config wipecache
clush -w @R510 cvmfs_config wipecache
clush -w @R720 cvmfs_config wipecache
Hadoop logs in /scratch can fill a disk to 100% and show up as a problem in Ganglia.
As root on the offending node:
It turns out that new logs are saved in a different directory; for those, run:
[root@r510-0-5 scripts]# python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(r510-0-5.privnet).log --dir /scratch/hadoop/hadoop-hdfs/
You can use clush for the above command.
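For example, a sketch of running the cleanup on all r510 nodes at once (assuming the -s argument should be each node's own hostname, mirroring the single-node command above):
clush -w @r510 'python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(hostname).log --dir /scratch/hadoop/hadoop-hdfs/'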
Check that the logs are being updated:
ls -alrh /scratch/hadoop/hadoop-hdfs/
If not, restart hadoop:
service hadoop-hdfs-datanode start
[root@r510-0-5 scripts]# service hadoop-hdfs-datanode status
Hadoop datanode is running [ OK ]
Old log directory:
[root@r510-0-5 scripts]# ls -1 /scratch/hadoop/log/
hadoop-hdfs-datanode-R510-0-5.local.log.2015-04-19
[root@r510-0-5 scripts]/data/osg/scripts
[root@r510-0-5 scripts]# python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(R510-0-5.local).log --dir /scratch/hadoop/log/
For hadoop commands:
find ./* -mtime +5 -exec rm -rf {} \;
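Given the warning elsewhere on this page about rm and wildcards, a cautious variant is to list the matches first before deleting (same age cutoff, no deletion):
find . -mtime +5 -print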
July 2018: because of a stress test there was about 25 TB of data in the following directories; all of it was removed.
/mnt/hadoop/cms/store/PhEDEx_Debug/
/mnt/hadoop/cms/store/PhEDEx_LoadTest07/
rm -rf LoadTest07_Debug_*
[root@hepcms-hn ~]# swapoff -a && swapon -a
[root@hepcms-hn ~]# sync; echo 1 > /proc/sys/vm/drop_caches
sync flushes the file system buffers. Commands separated by ";" run sequentially.
https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
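To see the effect, memory and swap usage can be checked before and after (a generic check, not specific to this cluster):
free -m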
from home:
/etc/fstab
[jabeen@hepcms-hn ~]$ more /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed May 13 13:30:22 2015
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=b4133507-22e5-4ba5-8521-6836d7051ca5 / ext4 defaults 1 1
UUID=f8ef7370-5a4e-4206-96ca-69a5f63cd8e6 /export ext4 defaults,usrquota,grpquota 1 2
UUID=5443d7bc-f5f5-4418-a02b-70edb813c428 /scratch ext4 defaults 1 2
UUID=031c91c8-cb63-4e83-a8c2-3e706127e123 /tmp ext4 defaults 1 2
UUID=88a037dd-9936-4f30-b01b-a1edfaabfdeb /var ext4 defaults 1 2
UUID=44898750-d91f-4b02-ad4b-a324c5d20f4d swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.1.0.7:/data /data nfs rw,async,intr,nolock,nfsvers=3 0 0
10.1.0.100:/data2 /data2 nfs rw,async,intr,nolock,nfsvers=3 0 0
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_Physics /CampusBackup nfs nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072
,wsize=524288 0 0
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_HEP_00 /DataCampusBackup nfs nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131
072,wsize=524288 0 0
[jabeen@hepcms-hn ~]$ /etc/export
[jabeen@hepcms-hn ~]$ more /etc/exports
/export 10.0.0.0/255.0.0.0(fsid=1,rw,async,no_subtree_check,no_root_squash)
/export 128.8.164.11(fsid=1,rw,async,no_subtree_check,no_root_squash)
on datanfs
[root@r720-datanfs ~]# more /etc/fstab
# HEADER: This file was autogenerated at Thu Dec 17 16:09:27 -0500 2015
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
#
# /etc/fstab
# Created by anaconda on Tue Jul 28 11:50:16 2015
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=0408cd1c-e1b3-4ce8-9a32-1e69f1b44914 / ext4 defaults 1 1
UUID=a92a7d16-cc33-4b41-aae4-2492de2b0daf /data xfs defaults 1 2
UUID=60462962-9ef3-4bad-95f6-000cd5961fc7 /scratch ext4 defaults 1 2
UUID=51ae0cf9-e6cb-4abb-b827-7400605388d0 /tmp ext4 defaults 1 2
UUID=58d92d33-0155-43d9-aab0-72edc1768fb2 /var ext4 defaults 1 2
UUID=039baec5-1182-4265-9c5e-4688e2d410c4 swap swap defaults 0 0
UUID=71d4ccff-a837-43f8-871e-5118d81a413b swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.1.0.1:/export/home /home nfs rw,async,intr,nolock,nfsvers=3 0 0
hadoop-fuse-dfs /mnt/hadoop fuse server=hepcms-namenode.privnet,port=9000,rdbuffer=131072,allow_other 0 0
[root@r720-datanfs ~]# more /etc/exports
# File managed by Puppet, do not edit!
/data 10.1.0.0/16(fsid=1,rw,async,no_subtree_check,no_root_squash) 10.1.255.231(fsid=1,rw,async,no_subtree_check,no
_root_squash)
/data/hadoop 10.1.255.232(fsid=1,rw,async,no_subtree_check,no_root_squash)
On siab-1
/etc/exports had /data2 (rw,sync,no_root_squash)
Replaced it to match the HN's /etc/exports:
/data2 10.0.0.0/255.0.0.0(fsid=1,rw,async,no_subtree_check,no_root_squash)
Now export the directories in /etc/exports with the command
[0806] root@siab-1 ~# exportfs -arv
On hepcms-in2, added this line to /etc/fstab:
10.1.0.100:/data2/home /home nfs rw,async,intr,nolock,nfsvers=3 0 0
If the previous /home mount is stale, first unmount /home and then remount:
umount /home
mount -a /home
The campus backup shares are mounted on hepcms-hn.
The campus backup was unmounted using a lazy unmount, as the normal command gave "device is busy":
umount -l /CampusBackup
umount -l /DataCampusBackup
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_Physics
500G 259G 242G 52% /CampusBackup
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_HEP_00
9.0T 7.6T 1.5T 84% /DataCampusBackup
VOFQAN /cms/Role=lcgadmin
Service Flavour HTCONDOR-CE
Metric org.cms.WN-xrootd-access
VOFQAN /cms/Role=lcgadmin
Service Flavour HTCONDOR-CE
Hostname hepcms-0.umd.edu
Metric org.cms.SE-xrootd
VOFQAN read
Service Flavour XROOTD
Hostname hepcms-gridftp.umd.edu
Metric org.cms.SRM-VOGet
VOFQAN /cms/Role=production
Service Flavour SRM
Solution:
[root@compute-0-11 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@compute-0-11 ~]# mount /hadoop1
[root@compute-0-11 ~]# mount /hadoop2
[root@compute-0-11 ~]# service hadoop-hdfs-datanode stop
Stopping Hadoop datanode: [ OK ]
no datanode to stop
[root@compute-0-11 ~]# service hadoop-hdfs-datanode start
Starting Hadoop datanode: [ OK ]
Make sure all the individual disk mounts have the correct permissions:
chown hdfs:hadoop /hadoop1/data
Make sure the services are running on hepcms-namenode and the secondary namenode:
service hadoop-hdfs-namenode status
service hadoop-hdfs-secondarynamenode status
on hepcms-namenode
Check whether safemode is ON or OFF: hdfs dfsadmin -safemode get
If safemode is ON, issue the following command to leave it: hdfs dfsadmin -safemode leave
For a working hadoop it should be OFF.
clush -b -w @r510 hadoop fsck / -blocks > hadoop-fsck-pipe-blocks.output
clush -b -w @r720 hadoop fsck / -blocks >> hadoop-fsck-pipe-blocks.output
clush -b -w @compute hadoop fsck / -blocks >> hadoop-fsck-pipe-blocks.output
grep -i --before-context=20 "r510" hadoop-fsck-pipe-blocks.output > hadoop-fsck-pipe-blocks-output.log
grep -i --before-context=20 "compute" hadoop-fsck-pipe-blocks.output >> hadoop-fsck-pipe-blocks-output.log
grep -i --before-context=20 "r720" hadoop-fsck-pipe-blocks.output >> hadoop-fsck-pipe-blocks-output.log
rm hadoop-fsck-pipe-blocks.output
-------
To check individual files:
For some reason, on the interactive nodes hadoop commands default to the local file system. You should use the HDFS path, i.e. `/cms/store/user/...`, if the command refers to the hadoop system.
[kakw@compute-0-6 0000]$ hdfs dfs -ls /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ | head
Found 1523 items
drwxr-xr-x - yhshin_g users 0 2017-11-20 10:50 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/failed
-rw-rw-r-- 2 yhshin_g users 350540231 2017-11-19 20:17 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ntuple_1.root
-rw-rw-r-- 2 yhshin_g users 287446471 2017-11-20 11:21 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ntuple_10.root
columns are: permissions number_of_replicas userid groupid filesize modification_date modification_time filename
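To check the block health of an individual file or directory (rather than just listing it), hdfs fsck can be pointed at an HDFS path; a sketch using the directory from the listing above:
hdfs fsck /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ -files -blocks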
[root@r720-0-1 ~]# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 195044131741696 (177.39 TB)
Present Capacity: 186079690669334 (169.24 TB)
DFS Remaining: 3311928012800 (3.01 TB)
DFS Used: 182767762656534 (166.23 TB)
DFS Used%: 98.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 13 (14 total, 1 dead)
Live datanodes:
Name: 10.1.0.18:50010 (r510-0-1.privnet)
Hostname: r510-0-1.privnet
Decommission Status : Normal
Configured Capacity: 23351871590400 (21.24 TB)
DFS Used: 21859788259328 (19.88 TB)
Non DFS Used: 1076121092096 (1002.22 GB)
DFS Remaining: 415962238976 (387.40 GB)
DFS Used%: 93.61%
DFS Remaining%: 1.78%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.30:50010 (r510-0-11.privnet)
Hostname: r510-0-11.privnet
Decommission Status : Normal
Configured Capacity: 21420767245312 (19.48 TB)
DFS Used: 19468739567994 (17.71 TB)
Non DFS Used: 984552755846 (916.94 GB)
DFS Remaining: 967474921472 (901.03 GB)
DFS Used%: 90.89%
DFS Remaining%: 4.52%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.28:50010 (compute-0-7.privnet)
Hostname: compute-0-7.privnet
Decommission Status : Decommissioned
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 1199918022656 (1.09 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 2417253322752 (2.20 TB)
DFS Used%: 31.65%
DFS Remaining%: 63.76%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.33:50010 (compute-0-6.privnet)
Hostname: compute-0-6.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3581889528174 (3.26 TB)
Non DFS Used: 173738649234 (161.81 GB)
DFS Remaining: 35281809408 (32.86 GB)
DFS Used%: 94.49%
DFS Remaining%: 0.93%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.27:50010 (compute-0-11.privnet)
Hostname: compute-0-11.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3585854763008 (3.26 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 31316582400 (29.17 GB)
DFS Used%: 94.59%
DFS Remaining%: 0.83%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.17:50010 (r510-0-9.privnet)
Hostname: r510-0-9.privnet
Decommission Status : Normal
Configured Capacity: 21258521805824 (19.33 TB)
DFS Used: 19962238316544 (18.16 TB)
Non DFS Used: 977815778304 (910.66 GB)
DFS Remaining: 318467710976 (296.60 GB)
DFS Used%: 93.90%
DFS Remaining%: 1.50%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.31:50010 (r510-0-4.privnet)
Hostname: r510-0-4.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21804435677184 (19.83 TB)
Non DFS Used: 1067771875328 (994.44 GB)
DFS Remaining: 372150071296 (346.59 GB)
DFS Used%: 93.81%
DFS Remaining%: 1.60%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.24:50010 (compute-0-8.privnet)
Hostname: compute-0-8.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3579222183936 (3.26 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 37949161472 (35.34 GB)
DFS Used%: 94.42%
DFS Remaining%: 1.00%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.29:50010 (r510-0-10.privnet)
Hostname: r510-0-10.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21895725548612 (19.91 TB)
Non DFS Used: 1067771898812 (994.44 GB)
DFS Remaining: 280860176384 (261.57 GB)
DFS Used%: 94.20%
DFS Remaining%: 1.21%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.23:50010 (r510-0-6.privnet)
Hostname: r510-0-6.privnet
Decommission Status : Normal
Configured Capacity: 21403204469760 (19.47 TB)
DFS Used: 20167376171392 (18.34 TB)
Non DFS Used: 983660692096 (916.11 GB)
DFS Remaining: 252167606272 (234.85 GB)
DFS Used%: 94.23%
DFS Remaining%: 1.18%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.32:50010 (r510-0-5.privnet)
Hostname: r510-0-5.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21901919940608 (19.92 TB)
Non DFS Used: 1067914788864 (994.57 GB)
DFS Remaining: 274522894336 (255.67 GB)
DFS Used%: 94.22%
DFS Remaining%: 1.18%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.5:50010 (r720-0-2.privnet)
Hostname: r720-0-2.privnet
Decommission Status : Normal
Configured Capacity: 21542112138240 (19.59 TB)
DFS Used: 20266237505536 (18.43 TB)
Non DFS Used: 990715188224 (922.68 GB)
DFS Remaining: 285159444480 (265.58 GB)
DFS Used%: 94.08%
DFS Remaining%: 1.32%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.19:50010 (compute-0-5.privnet)
Hostname: compute-0-5.privnet
Decommission Status : Normal
Configured Capacity: 3761933637632 (3.42 TB)
DFS Used: 3494417171562 (3.18 TB)
Non DFS Used: 226901070742 (211.32 GB)
DFS Remaining: 40615395328 (37.83 GB)
DFS Used%: 92.89%
DFS Remaining%: 1.08%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Dead datanodes:
Name: 10.1.0.6:50010 (r720-0-1.privnet)
Hostname: r720-0-1.privnet
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Sun Aug 06 08:28:01 EDT 2017
[root@r720-0-1 ~]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 3.7G 15G 21% /
tmpfs 48G 0 48G 0% /dev/shm
/dev/sdb1 1.8T 1.2T 517G 71% /hadoop1
/dev/sdj1 1.8T 1.2T 520G 71% /hadoop10
/dev/sdk1 1.8T 1.2T 519G 71% /hadoop11
/dev/sdl1 1.8T 1.2T 520G 71% /hadoop12
/dev/sdc1 1.8T 1.2T 519G 71% /hadoop2
/dev/sdd1 1.8T 1.2T 522G 71% /hadoop3
/dev/sde1 1.8T 1.2T 527G 70% /hadoop5
/dev/sdf1 1.8T 1.2T 525G 70% /hadoop6
/dev/sdg1 1.8T 1.2T 523G 70% /hadoop7
/dev/sdh1 1.8T 1.2T 526G 70% /hadoop8
/dev/sdi1 1.8T 1.2T 515G 71% /hadoop9
/dev/sda3 20G 5.3G 13G 29% /scratch
/dev/sda7 72G 650M 68G 1% /tmp
/dev/sda6 7.6G 2.4G 4.8G 34% /var
fuse_dfs 178T 167T 12T 94% /mnt/hadoop
r720-datanfs.privnet:/data
37T 35T 2.1T 95% /data
10.1.0.1:/export/home
7.2T 1.4T 5.5T 20% /home
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
no datanode to stop
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
no datanode to stop
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]#
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# grep -i warn /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.log
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:-6597879525639482474 on failed volume /hadoop11/data/current
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:-391729697961201349 on failed volume /hadoop11/data/current
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:1433264680818142941 on failed volume /hadoop11/data/current
2017-08-06 08:28:03,062 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is shutting down: DataNode failed volumes:/hadoop11/data/current;
2017-08-22 05:47:58,027 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool <registering> (storage id DS-692293337-10.1.0.6-50010-1445911035181) service to hepcms-namenode.privnet/10.1.0.16:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0
2017-08-22 05:49:04,106 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool <registering> (storage id DS-692293337-10.1.0.6-50010-1445911035181) service to hepcms-namenode.privnet/10.1.0.16:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0
umount -l /hadoop11
mount -a /hadoop11
If /hadoop11 has failed, it needs to be taken out of hadoop: remove it from the list in hdfs-site.xml.
[root@r720-0-1 ~]# vi /etc/hadoop/conf/hdfs-site.xml
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop11/data,/hadoop12/data</value>
Save and exit (C-x C-c in emacs, :wq in vi).
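For reference, after dropping /hadoop11 the value line would look like this (a sketch based on the line above):
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop12/data</value>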
restart the service:
For example to restart the service on all r510s use clush:
[root@hepcms-hn ~]# clush -w @r510 -b service hadoop-hdfs-datanode restart
It didn't automatically disappear from df -ah; waited about half an hour.
If you need to put it back in, add it to /etc/hadoop/conf/hdfs-site.xml and restart the service. You might have to mount /hadoopXX again.
[root@r540-0-21 ~]# service hadoop-hdfs-datanode start
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.out
Failed to start Hadoop datanode. Return value: 1 [FAILED]
[root@r540-0-21 ~]#
See the error in the log file
grep -i exc /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.log
2020-07-14 15:47:28,865 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (r540-0-21.privnet:50010:DataXceiver error processing READ_BLOCK operation src: /10.1.0.14:21620 dst: /10.1.0.102:50010); java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.0.102:50010 remote=/10.1.0.14:21620]
Apparently the port was still held by a connection to an r510 node:
[root@r540-0-20 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 4400 hdfs 108u IPv4 24956 0t0 TCP *:50010 (LISTEN)
[root@r540-0-20 ~]#
[root@r540-0-21 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 4136 hdfs 108u IPv4 22859 0t0 TCP *:50010 (LISTEN)
java 4136 hdfs 638u IPv4 100705722 0t0 TCP r540-0-21.privnet:50010->r510-0-1.privnet:33374 (ESTABLISHED)
[root@r540-0-21 ~]#
Killed the process and restarted the hadoop service:
[root@r540-0-21 ~]#
[root@r540-0-21 ~]# kill -9 4136
[root@r540-0-21 ~]#
[root@r540-0-21 ~]#
[root@r540-0-21 ~]# lsof -i:50010
[root@r540-0-21 ~]# systemctl restart hadoop-hdfs-datanode
[root@r540-0-21 ~]# systemctl status hadoop-hdfs-datanode
● hadoop-hdfs-datanode.service - LSB: Hadoop datanode
Loaded: loaded (/etc/rc.d/init.d/hadoop-hdfs-datanode; bad; vendor preset: disabled)
Active: active (exited) since Fri 2020-07-17 14:53:17 EDT; 9s ago
Docs: man:systemd-sysv-generator(8)
Process: 91326 ExecStart=/etc/rc.d/init.d/hadoop-hdfs-datanode start (code=exited, status=0/SUCCESS)
Jul 17 14:53:08 r540-0-21.privnet systemd[1]: Starting LSB: Hadoop datanode...
Jul 17 14:53:08 r540-0-21.privnet su[91357]: (to hdfs) root on none
Jul 17 14:53:08 r540-0-21.privnet hadoop-hdfs-datanode[91326]: starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.out
Jul 17 14:53:17 r540-0-21.privnet hadoop-hdfs-datanode[91326]: Started Hadoop datanode (hadoop-hdfs-datanode):[ OK ]
Jul 17 14:53:17 r540-0-21.privnet systemd[1]: Started LSB: Hadoop datanode.
[root@r540-0-21 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 91379 hdfs 108u IPv4 100832449 0t0 TCP *:50010 (LISTEN)
[root@r540-0-21 ~]#
That fixes the issue.
checked health of the node:
[root@r720-0-1 ~]# hadoop fsck / -blocks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to namenode via http://hepcms-namenode.privnet:50070
FSCK started by root (auth:SIMPLE) from /10.1.0.6 for path / at Tue Aug 22 12:16:58 EDT 2017
....................................................................................................
....................................................................................................
....................................................................................................
............................................................Status: HEALTHY
Total size: 89634301416782 B (Total open files size: 118798848 B)
Total dirs: 58584
Total files: 1060360 (Files currently being written: 2)
Total blocks (validated): 1632671 (avg. block size 54900406 B) (Total open file blocks (not validated): 2)
Minimally replicated blocks: 1632671 (100.0 %)
Over-replicated blocks: 24598 (1.506611 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.024795
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 13
Number of racks: 1
FSCK ended at Tue Aug 22 12:17:20 EDT 2017 in 21951 milliseconds
The filesystem under path '/' is HEALTHY
[root@r720-0-1 ~]# umount /mnt/hadoop
umount: /mnt/hadoop: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
Apparently there are a lot of open files on the node.
[root@r720-0-1 ~]# lsof /mnt/hadoop
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
main 3722473 kakw 8r REG 0,17 317564959 3916093 /mnt/hadoop/cms/store/user/yoshin/EmJetAnalysis/Analysis-20170609-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20170609/170609_100342/0000/ntuple_374.root
main 3722473 kakw 9r REG 0,17 317564959 3916093 /mnt/hadoop/cms/store/user/yoshin/EmJetAnalysis/Analysis-20170609-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20170609/170609_100342/0000/ntuple_374.root
kill -9 3722473 3722475 3722489 3722501 3722723 3722801 3722854 3723314
Now unmount hadoop and then the bad disk; remount and start the service.
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# umount /mnt/hadoop
[root@r720-0-1 ~]# umount /dev/sdk1
[root@r720-0-1 ~]# umount /dev/sdk1
umount: /dev/sdk1: not mounted
[root@r720-0-1 ~]# umount /dev/sdk1
umount: /dev/sdk1: not mounted
[root@r720-0-1 ~]# mount -a /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 3.7G 15G 21% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
devpts 0 0 0 - /dev/pts
tmpfs 48G 0 48G 0% /dev/shm
/dev/sdb1 1.8T 1.2T 517G 71% /hadoop1
/dev/sdj1 1.8T 1.2T 520G 71% /hadoop10
/dev/sdl1 1.8T 1.2T 520G 71% /hadoop12
/dev/sdc1 1.8T 1.2T 519G 71% /hadoop2
/dev/sdd1 1.8T 1.2T 522G 71% /hadoop3
/dev/sde1 1.8T 1.2T 527G 70% /hadoop5
/dev/sdf1 1.8T 1.2T 525G 70% /hadoop6
/dev/sdg1 1.8T 1.2T 523G 70% /hadoop7
/dev/sdh1 1.8T 1.2T 526G 70% /hadoop8
/dev/sdi1 1.8T 1.2T 515G 71% /hadoop9
/dev/sda3 20G 5.3G 13G 29% /scratch
/dev/sda7 72G 885M 68G 2% /tmp
/dev/sda6 7.6G 2.4G 4.8G 34% /var
none 0 0 0 - /proc/sys/fs/binfmt_misc
sunrpc 0 0 0 - /var/lib/nfs/rpc_pipefs
r720-datanfs.privnet:/data
37T 35T 2.0T 95% /data
10.1.0.1:/export/home
7.2T 1.4T 5.5T 20% /home
fuse_dfs 178T 167T 12T 94% /mnt/hadoop
[root@r720-0-1 ~]# service hadoop-hdfs-datanode start
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is running [ OK ]
[root@r720-0-1 ~]#
node is alive again
[root@r720-0-1 ~]# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 214627371145216 (195.20 TB)
Present Capacity: 204762305384448 (186.23 TB)
DFS Remaining: 21700580372480 (19.74 TB)
DFS Used: 183061725011968 (166.49 TB)
DFS Used%: 89.40%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 14 (14 total, 0 dead)
Live datanodes:
Name: 10.1.0.18:50010 (r510-0-1.privnet)
Hostname: r510-0-1.privnet
......
Name: 10.1.0.6:50010 (r720-0-1.privnet)
Hostname: r720-0-1.privnet
Decommission Status : Normal
Configured Capacity: 19583239403520 (17.81 TB)
DFS Used: 13090004533248 (11.91 TB)
Non DFS Used: 900624750592 (838.77 GB)
DFS Remaining: 5592610119680 (5.09 TB)
DFS Used%: 66.84%
DFS Remaining%: 28.56%
Last contact: Tue Aug 22 13:17:46 EDT 2017
Source: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
The following command can be run from any hadoop node.
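The command in question is the HDFS balancer (per the link above; it is also run on hepcms-namenode elsewhere in this page), e.g. as root:
hdfs balancer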
You can check the progress with:
[root@hepcms-hn ~]# clush -w @all df -h | grep hadoop
It took 35 hours the first time:
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.31:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.33:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.18:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.27:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.24:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.17:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.30:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.19:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.6:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.32:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.23:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.29:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.5:50010
17/04/04 22:09:33 INFO balancer.Balancer: 0 over-utilized: []
17/04/04 22:09:33 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Balancing took 35.48840777777778 hours
Problem:
yhshin@r510-0-5 ~]$ ls /store
ls: cannot access /store: Transport endpoint is not connected
[yhshin@r510-0-5 ~]$ ls /mnt/hadoop
Solution:
Nebraska T2 sees this from time to time: it can happen if a job is moving a large file or doing something else that puts a big load on the node, which explains the file system crash that causes the fuse mount issue. The only fix they have is to check, before running a job, that the node is mounted. Carl will send instructions and we can implement it.
All that said, if the job running on the node itself is doing something to crash the file system, then there is nothing we can do.
The following solution is for hepcms-gums
ssh to the offending node
[root@hepcms-gums ~]# ls -alrth /mnt
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
total 8.0K
d?????????? ? ? ? ? ? hadoop
drwxr-xr-x. 3 root root 4.0K Dec 19 10:25 .
dr-xr-xr-x. 28 root root 4.0K Dec 19 10:35 ..
[root@hepcms-gums ~]# umount /mnt/hadoop
[root@hepcms-gums ~]# chown hadoop:hdfs /mnt/hadoop
[root@hepcms-gums ~]# mount /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@hepcms-gums ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_root
96G 2.3G 89G 3% /
tmpfs 498M 0 498M 0% /dev/shm
/dev/vda1 477M 70M 383M 16% /boot
r720-datanfs.privnet:/data
37T 30T 6.6T 82% /data
10.1.0.1:/export/home
7.2T 324G 6.5T 5% /home
fuse_dfs 64T 41T 24T 64% /mnt/hadoop
This happened twice this week (June 11-17 2017). r510-0-10 was unmounted because of an unknown issue, and r510-0-11 was unmounted when Shabnam tried to replace it on the 14th. When this happens, the Hadoop logs in /scratch should show when the error occurred and which partition had the problem. Then if you try to ls in the problem partition you will see something like this:
[root@r510-0-11 hadoop8]# ls
ls: reading directory .: Input/output error
First remove the partition from hdfs-site.xml in /etc/hadoop/conf
service hadoop-hdfs-datanode start
unmount Hadoop disk
fsck -y /hadoop#
mount partition
check if disk is working
add the partition back into hdfs-site.xml and restart hadoop-hdfs-datanode (a command-level sketch of these steps follows)
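A command-level sketch of the sequence above, assuming the failed partition is /hadoop8 (substitute the real one):
# 1. remove /hadoop8/data from the dfs.datanode.data.dir list in /etc/hadoop/conf/hdfs-site.xml, then
service hadoop-hdfs-datanode start
# 2. repair the disk
umount /hadoop8
fsck -y /hadoop8
mount /hadoop8
# 3. if the disk checks out, add /hadoop8/data back to hdfs-site.xml, then
service hadoop-hdfs-datanode restart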
In the case of compute nodes we will take them out of hadoop and the blocks will be replicated. Compute nodes do not add much to hadoop in terms of space, so it is better to keep them out of hadoop. Normally, if you take a node out of hadoop, hadoop will replicate the missing blocks.
From an interactive node, run firefox and check the hadoop status:
http://hepcms-namenode.privnet:50070/dfshealth.jsp
You may find out that files were corrupt by checking the web page, because a user complained about a file, or by running:
[root@hepcms-namenode ~]# hdfs fsck /
........................Status: HEALTHY
Total size: 45691338212949 B
Total dirs: 60217
Total files: 304924
Total blocks (validated): 695338 (avg. block size 65710975 B)
Minimally replicated blocks: 695338 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.3250217
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 13
Number of racks: 1
FSCK ended at Wed Jan 18 09:33:11 EST 2017 in 8158 milliseconds
You have verified from the hadoop log on that node (r510-0-5: /scratch/hadoop/hdfs-hadoop/) that the particular disk (say /hadoop4) is not writeable, but you can see it's still mounted, and you can
ls -alh /hadoop4/data
You should NOT have the disk removed from the list in /etc/hadoop/conf/hdfs-site.xml on the datanode (r510-0-5)
The datanode will have its hadoop-hdfs-datanode service stop working because it cannot write to that disk
Login to the hepcms-namenode, become root
Then exclude this datanode (use the internal name, e.g. r510-0-5.privnet) by editing /etc/hadoop/conf/hosts-exclude on hepcms-namenode (a combined command sketch of the exclude/re-include cycle appears after these steps)
hdfs dfsadmin -refreshNodes
Login to the datanode (r510-0-5), become root, and start the datanode service
service hadoop-hdfs-datanode start
Wait quite some time until hadoop reports no more corrupt blocks and the node is "Decommissioned"; don't do any of the following until there are no more corrupt blocks.
Note that if more than one node/disk has failed, the corrupt blocks could be elsewhere; the logs on the namenode should help you figure that out.
At this point you can do one of the following:
leave the datanode decommissioned
or remove the disk from the list in /etc/hadoop/conf/hdfs-site.xml on the datanode (r510-0-5), and service hadoop-hdfs-datanode restart
Fix the disk (umount /hadoop4; fsck /dev/sdd) - note it may fail again in a week or two, be sure to check its health with omreport storage pdisk controller=0
Wipe the disk (as above)
Replace the disk (above)
To allow the datanode back into hadoop, remove its hostname from /etc/hadoop/conf/hosts-exclude on the hepcms-namenode
hdfs dfsadmin -refreshNodes
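A combined sketch of the exclude/re-include cycle on hepcms-namenode (r510-0-5.privnet is the example datanode used above):
# exclude the datanode
echo "r510-0-5.privnet" >> /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes
# ... wait until the node shows as Decommissioned and there are no corrupt blocks ...
# re-include it
sed -i '/r510-0-5.privnet/d' /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes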
[root@r720-0-1 ~]# mount /data
mount.nfs: Stale file handle
[root@r720-0-1 ~]# ls /data
ls: cannot access /data: Stale file handle
[root@r720-0-1 ~]# umount -nf /data
[root@r720-0-1 ~]# mount /data
[root@r720-0-1 ~]# ls /data
groups osg test-compute-0-2 TESTING users
Check firewall settings on the node the disk mounts from and the node it mounts to.
Check /etc/exports for proper settings on the node the disk mounts from.
Check /etc/fstab for proper settings on the node the disk mounts to.
On the node the disk mounts to, you can run: showmount -e <IP>
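For example, a quick pair of checks for the /data mount (a sketch; r720-datanfs.privnet is the server that exports /data on this cluster):
# on the client
showmount -e r720-datanfs.privnet
# on the server (r720-datanfs)
exportfs -v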
/sharesoft/osg/ce/setup.csh: No such file or directory.
[belt@hepcms-in1 ~]$ su -
Password:
[root@hepcms-in1 ~]# mount /data
mount.nfs: Failed to resolve server r720-datanfs.privnet: Temporary failure in name resolution
Look at /etc/resolv.conf; it was being modified by NetworkManager (/etc/init.d/NetworkManager status). Turn that off and have Puppet keep it from running, in base.pp: service { 'NetworkManager': ensure => 'stopped', enable => false }
[root@compute-0-5 ~]# mount /data
mount.nfs: requested NFS version or transport protocol is not supported
[root@compute-0-5 ~]# mount -t nfs r720-datanfs.privnet:/data /data
[root@compute-0-5 ~]# ls /data
cmssw cvmfs groups gums lost+found osg root_backup root.old scratch share site_conf TESTING users
Note: Don't have a puppet fix at this time (July 7, 2016) as this is using Trey's nfs puppet module
http://serverfault.com/questions/212178/chown-on-a-mounted-nfs-partition-gives-operation-not-permitted
In /etc/exports, no_root_squash is needed as an option.
Check that hepcms-ovirt is up and the VMs are running. Make sure hepcms-foreman is up and healthy: it runs the DNS and routing, so it needs to be up for nodes to read the proper disk settings.
On the machine that rebooted, run ls /home; ls /data. If the output is not what's expected, mount the disks:
mount -a /home; mount -a /data
If they are "already mounted": umount -nf /home (for example; occasionally you may need to run it multiple times to make it work)
Check NIS healthy and working
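A quick NIS sanity check from a client node (a sketch using the standard NIS client tools):
ypwhich
ypcat passwd | head -3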
In all cases, do not delete user files without consulting with them unless it's clear they are breaking the cluster. Assume nothing is backed up! NEVER EVER use wildcards with rm. It is better to write a deletion script, get confirmation from the user that is the list of files to delete, and then run the script.
For /home, login to hepcms-hn.umd.edu, su - to become root, and look at /export/home
For /data, login to any cluster node, then ssh r720-datanfs, su - to become root, and look at /data
For /hadoop, end of Aug. 2015, it's on /data
For actual hadoop (mounted at /mnt/hadoop; once the SE is up it will be soft-linked as /hadoop): login to any cluster node, then ssh hepcms-namenode, su - to become root, and look at /mnt/hadoop
Note that there are hadoop dfs commands one can use that don't use the fuse mount which may be more efficient for file manipulation
https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/hadoopnamenodesetup#TOC-Automated-puppet-fuse-mount-not-working-
[root@r510-0-6 ~]# ls -alrth /mnt
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
total 8.0K
d?????????? ? ? ? ? ? hadoop
drwxr-xr-x. 3 root root 4.0K Jul 15 12:59 .
dr-xr-xr-x. 29 root root 4.0K Jul 15 13:09 ..
[root@r510-0-6 ~]# chown hdfs:hadoop /mnt/hadoop
chown: cannot access `/mnt/hadoop': Transport endpoint is not connected
[root@r510-0-6 ~]# umount /mnt/hadoop
[root@r510-0-6 ~]# chown hdfs:hadoop /mnt/hadoop
[root@r510-0-6 ~]# mount /mnt/hadoop
http://hep-t3.physics.umd.edu/HowToForAdmins/errors.html#errorsHadoopFsck
Before any debugging run service hadoop-hdfs-datanode stop
NOTE: IF YOUR BAD DISK IS /hadoop1 (that is where the OS is stored as well on r510 and compute nodes), do not execute the following commands as they will wipe the OS :) In this scenario it may be necessary to re-kickstart the machine...
First, run df -h to identify the names of the Hadoop disks. On an r510 machine there should be 12 disks; they start with "sd" and go from 'a' through 'l'.
an example of what it should look like with 12 disks:
[root@r510-0-5 ~]# df -h
/dev/sda7 1.6T 1.1T 397G 74% /hadoop1
/dev/sdb1 1.8T 1.3T 438G 75% /hadoop2
/dev/sdc1 1.8T 1.2T 551G 68% /hadoop3
/dev/sdd1 1.8T 1.2T 561G 68% /hadoop4
/dev/sde1 1.8T 86G 1.7T 5% /hadoop5
/dev/sdf1 1.8T 1.2T 557G 68% /hadoop6
/dev/sdg1 1.8T 1.2T 560G 68% /hadoop7
/dev/sdh1 1.8T 1.2T 546G 69% /hadoop8
/dev/sdi 1.8T 68M 1.7T 1% /hadoop9
/dev/sdj1 1.8T 1.2T 527G 70% /hadoop10
/dev/sdk1 1.8T 1.2T 564G 68% /hadoop11
/dev/sdl1 1.8T 1.3T 414G 76% /hadoop12
Here you can see that all 12 disks are present. If any of the hadoop disks are missing, note which /dev/sd? it is, as they are alphabetical. So if it was /dev/sdi that was missing:
run the command lsblk -d
check to see if /dev/sdi (or whichever is your missing disk) is listed on this.
The output should look like this:
[root@r510-0-5 ~]# lsblk -d
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sde 8:64 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sda 8:0 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
sdb 8:16 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdc 8:32 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
If you don't see your missing disk in this list, then use omreport storage pdisk controller=0 to identify exactly which one it is; it is likely broken at the hardware level and may need to be replaced, or you may attempt re-seating the disk.
Using omreport, identify the disk by its number and status. Log into the Dell OMSA manager in firefox, and flash the LEDs of the drives until you find your target.
If you do see your missing disk in the lsblk -d list:
The first step is to unmount the disk: umount /dev/device (see the note about this at the end); use umount -nf /dev/device if that's not working.
run the command: fsck -y /dev/sdi
If this fails:
run the command mkfs.ext4 /dev/sdi to re-make the file system on that disk
run the command blkid /dev/sdi, take note of its UUID, and update the UUID of that drive in /etc/fstab; make sure you update its file system type from ext3 to ext4 if need be.
run the command mount /dev/sdi, or you could use mount /hadoop# if the drive is labeled. If this gives an error suggesting that the mount point does not exist, make sure the directory /hadoop# exists; if it doesn't, run mkdir /hadoop# and run the mount command again.
Note: if you get an error that it still doesn't exist, make sure you typed the UUID of the device into /etc/fstab correctly; check with the command blkid /dev/sdi.
To label the drive, run the command e2label /dev/sdi /hadoop#. You can also add this label to the /etc/fstab file to make troubleshooting a bit easier, by using LABEL=/hadoop5 in the appropriate line; it will look like this:
LABEL=/hadoop5 /hadoop5 ext3 defaults 1 2 (appending the label is optional in the file, using the e2label command is enough)
Once this is complete, make sure to run service hadoop-hdfs-datanode start to restart hadoop. After a few moments, run service hadoop-hdfs-datanode status to ensure that the repair was successful.
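A compact sketch of the rebuild sequence above, assuming the failed disk is /dev/sdi mounted as /hadoop9 (as in the example df output):
service hadoop-hdfs-datanode stop
umount /dev/sdi            # or umount -nf if it is busy
fsck -y /dev/sdi
# if fsck fails, remake the filesystem, then update /etc/fstab with the new UUID
mkfs.ext4 /dev/sdi
blkid /dev/sdi
e2label /dev/sdi /hadoop9
mount /hadoop9
service hadoop-hdfs-datanode start
service hadoop-hdfs-datanode status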
Run ls -alh /sys/block/sdg/device to identify which physical disk this is; it may also help troubleshoot which disk needs repair.
This is also helpful, how to interpret the info listed:
http://unix.stackexchange.com/questions/40351/how-do-i-correlate-dev-sd-devices-to-the-hardware-they-represent
* After ANY operation that involves stopping the hadoop-hdfs-datanode service, make sure to run service hadoop-hdfs-datanode start to turn it back on (very important)
* If a regular umount /dev/device is not working, use umount -nf, for example when an entry like
/dev/sdh1 1.8T 1.4T 313G 82% /hadoop8
changes into something like: /dev/sdh1 16G 2.6G 13G 17% /hadoop8
service hadoop-hdfs-datanode status returns a failed message
Can check log: tail -n100 /scratch/hadoop/hadoop-hdfs/*.log
then run the command service hadoop-hdfs-datanode start
Run service hadoop-hdfs-datanode status and make sure you get the green "OK"; if it shows as failed, check the logs again.
Oct2020
After a site wide shutdown
oVirt 4.1, on new setups, creates PKI infrastructure that uses SHA256 signatures.
Existing setups upgraded to 4.1 do not currently have PKI migrated.
This Howto explains how to manually migrate the PKI of such setups to use SHA256 signatures.
Previous versions of oVirt used SHA-1 for signatures of SSL certificates created by the internal CA. This is no longer considered secure; see e.g. Firefox, Chrome, Edge/IE, or shattered.io.
See Features/PKI for general details about PKI in oVirt.
If you are only worried by a recent browser warning about, or rejection of, your SHA-1-signed certificate, it might be enough to re-sign only the apache certificate, or only the CA+apache certificates. Currently, this procedure has only been tested in its entirety.
This step is not needed on >= 4.1.
On < 4.1, upgrading to a newer < 4.1 version (e.g. 4.0.6 to 4.0.7) might revert this change, so you need to repeat it per each upgrade until 4.1.
On the engine machine, run these commands:
# Backup existing conf
cp -p /etc/pki/ovirt-engine/openssl.conf /etc/pki/ovirt-engine/openssl.conf."$(date +"%Y%m%d%H%M%S")"
# Edit it to default to SHA256
sed -i 's/^default_md = sha1/default_md = sha256/' /etc/pki/ovirt-engine/openssl.conf
If you only use this procedure because your browser warns/rejects, then it might be enough to skip this part. If your browser requires both the CA cert and the https cert to have SHA256 signatures, you have to complete it.
On the engine machine, run these commands:
# Backup CA cert
cp -p /etc/pki/ovirt-engine/private/ca.pem /etc/pki/ovirt-engine/private/ca.pem."$(date +"%Y%m%d%H%M%S")"
# Create a new cert into ca.pem.new
openssl x509 -signkey /etc/pki/ovirt-engine/private/ca.pem -in /etc/pki/ovirt-engine/ca.pem -out /etc/pki/ovirt-engine/ca.pem.new -days 3650 -sha256
# Replace the existing cert with the new one
/bin/mv /etc/pki/ovirt-engine/ca.pem.new /etc/pki/ovirt-engine/ca.pem
Decide what you want, among the options below:
If only apache httpd (for browsers that reject SHA1 signatures), run:
names="apache"
If also the engine cert:
names="apache engine"
If all normally-existing entities:
names="engine apache websocket-proxy jboss imageio-proxy"
If you replaced the https cert with a cert signed by a 3rd party, you should not include “apache” in above - e.g. use one of:
names="engine"# ornames="engine websocket-proxy jboss imageio-proxy"
If this is a self-hosted-engine, move it to global maintenance.
Run this (in the same terminal of previous subsection above):
for name in $names; do
    subject="$(openssl x509 -in /etc/pki/ovirt-engine/certs/"${name}".cer -noout -subject | sed 's;subject= \(.*\);\1;')"
    /usr/share/ovirt-engine/bin/pki-enroll-pkcs12.sh --name="${name}" --password=mypass --subject="${subject}" --keep-key
done
If you included apache:
systemctl restart httpd
If you included engine:
systemctl restart ovirt-engine
If you included ovirt-websocket-proxy/ovirt-imageio-proxy:
systemctl restart ovirt-websocket-proxy
systemctl restart ovirt-imageio-proxy
If this is a self-hosted-engine, exit global maintenance.
Your browser will likely refuse to continue working with the web admin ui. You might need to restart it and/or remove the engine cert and/or engine ca cert.
In my own case I unchecked “Permanently store this exception” when I first logged in, and after restarting httpd the browser showed an error about using the same serial number. Restarting the browser was enough to login again.
For all of your hosts, one host at a time, using the web admin ui:
Set it to Maintenance
Choose “Enroll Certificates”
Activate
You can do this step at any time, also before starting this procedure.
Certs that use SHA1 will show as having ‘sha1WithRSAEncryption’. Certs that use SHA256 will show as having ‘sha256WithRSAEncryption’.
On engine machine:
openssl x509 -in /etc/pki/ovirt-engine/ca.pem -text | grep Signature
for name in engine apache websocket-proxy jboss imageio-proxy; do echo $name:; openssl x509 -in /etc/pki/ovirt-engine/certs/"${name}".cer -text | grep Signature; done
On hosts:
openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -text | grep Signature
openssl x509 -in /etc/pki/vdsm/certs/cacert.pem -text | grep Signature
Sep 2018
[root@hepcms-ovirt images]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-LV_root
72G 67G 1.7G 98% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
devpts 0 0 0 - /dev/pts
tmpfs 44G 4.0K 44G 1% /dev/shm
/dev/sda2 477M 57M 395M 13% /boot
/dev/mapper/vg_ovirt-lv_ovirt
19T 1.5T 17T 8% /opt/ovirt
none 0 0 0 - /proc/sys/fs/binfmt_misc
sunrpc 0 0 0 - /var/lib/nfs/rpc_pipefs
nfsd 0 0 0 - /proc/fs/nfsd
127.0.0.1:/opt/ovirt/import_export
19T 1.5T 17T 8% /rhev/data-center/mnt/127.0.0.1:_opt_ovirt_import__export
127.0.0.1:/opt/ovirt/iso
19T 1.5T 17T 8% /rhev/data-center/mnt/127.0.0.1:_opt_ovirt_iso
/dev/mapper shows 98% and most of it is due to crash reports.
[root@hepcms-ovirt crash]# pwd
/var/crash
[root@hepcms-ovirt crash]# ls -slrt
total 8
4 drwxr-xr-x 2 root root 4096 Aug 28 2016 127.0.0.1-2016-08-28-15:17:56
4 drwxr-xr-x 2 root root 4096 Jan 14 05:02 127.0.0.1-2018-01-14-04:39:12
[root@hepcms-ovirt crash]# rm -rf 127.0.0.1-2016-08-28-15\:17\:56/
[root@hepcms-ovirt crash]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-LV_root
72G 42G 27G 62% /
Note that right now, common.yaml pins the version of puppet at 3.7.5, which is what the master is at. So you want to change it both in common.yaml and on the puppet master at the same time.
Trey recommends:
I'd stick with 3.7.5 and test the upgrade later. Usually a newer master with older clients will work if the difference is 3.8.x vs 3.7.x. I personally do Puppet tests by taking one of my masters and removing it from round-robin DNS puppet.brazos.tamu.edu then upgrade it and a few clients and do puppet agent --test --server puppetmaster02.brazos.tamu.edu --noop. For a single-master situation, the easiest solution is likely to snapshot the VM , upgrade , then test on a few clients to make sure things are fine. Usually good idea to ensure clients have all changes applied before testing to see if updated Puppet modifies behavior
Clients are easy to rollback, just yum downgrade which can be done by Puppet too
You'll note that hepcms-foreman has a cron job for puppet agent --test --noop, which may list all sorts of common changes. However, it's in a working stage and we do not intend to actually run puppet without "--noop". Please don't change our foreman without being sure you know what you are doing, and certainly never without backing it up.
important things such as iptables may break
https://docs.puppetlabs.com/puppet/latest/reference/man/filebucket.html
Look at the report on hepcms-foreman, and it says something like this: checksum was <md5sum>
you can usually do something like puppet filebucket restore /etc/yp.conf <md5sum>
Example: notice /Stage[main]/Sudo/File[/etc/sudoers]/content content changed '{md5}26bf78728f812c729cfe82b1664e0f5a' to '{md5}4093e52552d97099d003c645f15f9372'
puppet filebucket restore /etc/sudoers 26bf78728f812c729cfe82b1664e0f5a
In common.yaml:
puppet::version: '3.7.5-1.el6'
Note that you need the puppet class on the node to make this take effect:
classes:
- puppet
https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/margueritedebuglog/vmtest
cp -a /etc/puppet /etc/puppet-$(date +%F)
cp -a /var/lib/puppet /var/lib/puppet-$(date +%F)
A simple cp -a /etc/puppet /etc/puppet.bak or something similar should work. You can then use the backup to see what changed, with something like diff -r -w --brief /etc/puppet/ /etc/puppet.bak/ to verify no unexpected changes took place.
-w ignores whitespace
(# is not typed, it's the prompt for [root@hepcms-foreman ~]# ):
# hiera --config /etc/puppet/hiera/production/hiera.yaml foreman_proxy::trusted_hosts ::environment=production ::hostgroup='base/mgmt/dns' ::fqdn=ns01.brazos.tamu.edu
["foreman.brazos.tamu.edu"]
That command will print out the value of foreman_proxy::trusted_hosts when environment=production, Foreman's hostgroup=base/mgmt/dns, and fqdn=ns01.brazos.tamu.edu. The "::" denotes facts, and Foreman's hostgroup value is treated as a fact. If your hiera value is collected in Puppet using `hiera_array` you can use the `--array` option, and for `hiera_hash` you can use the `--hash` option; those options will print out values using the appropriate "merge" functionality. This is a way to test what value will be seen by Puppet for a particular system.
Example:
# hiera --config /etc/puppet/hiera/production/hiera.yaml ntp::servers ::environment=production ::fqdn=r720-datanfs.privnet
WARN: Thu Jul 09 14:10:39 -0400 2015: Cannot load backend eyaml: no such file to load -- hiera/backend/eyaml_backend
["0.centos.pool.ntp.org", "1.centos.pool.ntp.org", "2.centos.pool.ntp.org"]
The only time eyaml_backend is used in your hiera is for encrypted values. We'll ignore those for now, as that concept is easier to deal with once Hiera is better understood.
Restart your puppet master on foreman /etc/init.d/puppetserver restart
puppet agent --test --noop
puppet agent --test --noop --tags nfs
Another example: puppet agent --test --tags profile::base --noop
Also, for running the osg class in the int.yaml file: puppet agent --test --tags profile::osg --noop
puppet agent --test
/etc/init.d/puppet stop
Make the puppet agent not start automatically upon node reboot:
chkconfig puppet off
Start a puppet agent (these run automatically on a node either in kickstart or crontab):
/etc/init.d/puppet start
See if the puppet agent is running: ps ahux | grep puppet. If not, do the steps above by hand; otherwise the change will get picked up automatically by the running agent.
Read the error message, for instance:
Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[osg-se-hadoop-client] is already declared in file /etc/puppet/environments/production/manifests/site.pp:23; cannot redeclare at /etc/puppet/environments/production/modules/profile/manifests/osg/hadoop_client.pp:33 on node foreman-vmtest2.local
In this case it tells you exactly what was declared in two places that the node had implemented (one in a class added in hepcms-foreman, one in site.pp). Try not to remove things from common.yaml or base.pp to resolve these duplicates, as OTHER nodes depend on them!
This particular one was documented here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/hadoopnamenodesetup#TOC-Make-the-Fuse-client-to-mount-hadoop-elsewhere-in-puppet:
These errors may be fine once you run puppet agent --test, if they are the sort that complain about a missing (configuration) file that only exists after puppet actually installs the software. Run puppet agent --test and they should go away.
If they don't go away after running without --noop, read the error and try to figure out what dependency is actually missing.
Use the following example puppet text in the above files, update it in github, run getr10k on foreman, and run puppet agent --test on the node(s).
include ::osg
include ::profile::osg::hadoop_client
https://docs.puppetlabs.com/hiera/1/complete_example.html#assigning-a-class-to-a-node-with-hiera
site.pp needs hiera_include('classes') (in the default node; note that if you have a specific FQDN block, you need it there as well, otherwise the classes: part in the hiera yaml below will be ignored). If there is trouble, try hiera_include('classes',[]).
Your .yaml will have something like:
classes:
- omsa
- profile::osg::hadoop_client
1.) Backup the /var/lib/puppet/ssl folder
mv /var/lib/puppet/ssl /var/lib/puppet/ssl.bak
2.) run puppet agent --test --noop on that node
3.) On the Puppet Master (Foreman) , run : puppet cert sign nodename_.umd.edu
* If the wrong name persists, go to /etc/puppet/puppet.conf and change it by hand.
the line is : certname = hepcms-in2.umd.edu
* Then run puppet agent --test --noop on that node again and confirm the update has applied (a combined sketch of these steps follows).
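Put together, the steps above look roughly like this (nodename is a placeholder for the node's certificate name):
# on the node
mv /var/lib/puppet/ssl /var/lib/puppet/ssl.bak
puppet agent --test --noop
# on the puppet master (hepcms-foreman)
puppet cert sign nodename.umd.edu
# back on the node, confirm
puppet agent --test --noop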
Make sure to do the command as root
[belt@hepcms-in2 ~]$ puppet agent --test
Info: Creating a new SSL key for hepcms-in2.umd.edu
Error: Could not request certificate: Find /production/certificate/ca?fail_on_404=true resulted in 404 with the message: {"message":"Not Found: Error: Invalid URL - Puppet expects requests that conform to the /puppet and /puppet-ca APIs.\n\nNote that Puppet 3 agents aren't compatible with this version; if you're running Puppet 3, you must either upgrade your agents to match the server or point them to a server running Puppet 3.\n\nMaster Info:\n Puppet version: 4.2.1\n Supported /puppet API versions: v3\n Supported /puppet-ca API versions: v1","issue_kind":"HANDLER_NOT_FOUND"}
Exiting; failed to retrieve certificate and waitforcert is disabled
Puppet error with certificates: running puppet agent --test --noop gives errors such as "could not request certificate" or "certificate retrieved from the master does not match the agent's private key":
hepcms-namenode: hdfs balancer
1. Make sure to back up the /etc and /var folders before continuing; follow the steps in the screenshot.
2. Log onto hepcms-foreman and run puppet cert clean #node-name#.privnet (fill in your node name).
3. Then, on the node/agent itself, run find /var/lib/puppet/ssl -name hepcms-in3.privnet.pem -delete
4. Then run `puppet agent --test --noop` on hepcms-foreman and it will show something like this:
5. Afterwards, go to the node of concern (in my case it was hepcms-in3) and run the command `puppet agent --test` there;
you should get something like the following:
If you get similar messages it means Puppet has picked up on changes successfully.
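A condensed sketch of the cert-clean sequence above (hepcms-in3.privnet is the example node from the steps; substitute the node you are actually fixing):
# On hepcms-foreman (the puppet master): revoke and remove the old signed cert
puppet cert clean hepcms-in3.privnet
# On the affected agent node: remove the stale cert copy, then re-run the agent
find /var/lib/puppet/ssl -name hepcms-in3.privnet.pem -delete
puppet agent --test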
backup puppet as shown above
r10k, when run in verbose mode, should print what it is removing (if anything), so you can then identify what in the backup needs to be restored.
on hepcms-foreman as root: run /usr/local/sbin/mysqlbackup.sh
makes output in /opt/mysql_backups/mysql_backup*20150709-143033*.bz2 (for instance for 14:30:33 on 9 July 2015), take the latest outputs and backup elsewhere
/var/lib/dhcpd/dhcpd.leases
service dhcpd restart
service foreman-proxy restart
When a host is removed the entries in the following directories should also be removed if not cleaned up automatically.
/var/lib/puppet/yaml/facts/
/var/lib/puppet/yaml/node/
/var/lib/puppet/yaml/foreman/
The leases file will still contain removed machines. If you remove information from this file, restart the dhcpd service.
If the VM is still in Foreman and managed, you can delete it from Foreman and that should delete it from oVirt. If the VM is no longer in Foreman, or not managed in Foreman, you have to delete it from oVirt directly.
For DHCP errors, look in /var/log/foreman-proxy/proxy.log on the DHCP server.
Sometimes we run into problems where permissions on /etc/dhcp are too restrictive. Usually chmod 0755 /etc/dhcp fixes the issue; then restart foreman-proxy.
The DHCP conflict entries may be due to entries left in DHCP. Deleting a host from Foreman should clean up DHCP too. You may have to open /var/lib/dhcpd/dhcpd.leases and delete the things that shouldn't be there. Then restart dhcpd service
Make a backup first just to be safe
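For example, a minimal backup-and-edit sequence (the date suffix on the backup name is just a suggestion):
cp /var/lib/dhcpd/dhcpd.leases /var/lib/dhcpd/dhcpd.leases.bak.$(date +%Y%m%d)
vi /var/lib/dhcpd/dhcpd.leases    # remove the stale host entries by hand
service dhcpd restart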
debug information logs:
/var/log/foreman-proxy/proxy.log
/var/log/foreman/production.log
/var/log/messages
/var/log/boot.log
You can clean the cache:
/var/run/foreman/cache/
PXE boot failure will be seen when the downloaded PXE files are corrupted. The easiest fix is removing them and forcing them to redownload.
First remove the associated files in `/var/lib/tftpboot/boot` on the Foreman server.
So if the host was supposed to build SL 6.7, the files are likely called `Scientific-6.7-x86_64-initrd.img` and `Scientific-6.7-x86_64-vmlinuz`.
Then cancel build for host in Foreman and click Build again , that will trigger Foreman Proxy to redownload the files (since they will be missing).
When you click the Build button, one thing Foreman does is instruct Foreman Proxy to ensure TFTP boot files exist. If you remove them, Foreman Proxy will download them again.
But only if you instruct a host to Build after removing them.
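A sketch of the removal step, assuming the SL 6.7 filenames mentioned above (check what is actually in the directory first):
ls /var/lib/tftpboot/boot
rm /var/lib/tftpboot/boot/Scientific-6.7-x86_64-initrd.img /var/lib/tftpboot/boot/Scientific-6.7-x86_64-vmlinuz
# Then in the Foreman web UI: cancel Build for the host and click Build again to trigger the re-download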
On the foreman web page:
Configure … Host Groups … click on group (base)
Click on Puppet Classes
Click the + to expand the puppet class and click the + next to the particular thing you want to add
Click Submit on the bottom
Then be sure on that node to run puppet agent --test to pick up the changes (see --noop for testing above)
Check that you are partitioning the right disk (/dev/sda for instance; you can use --ondisk=/dev/sda to force it in the kickstart)
See error message:
Cannot open root device "(null)" or unknown-block(8,6)
Please append a correct "root=" boot option: here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,6)
Pid 1, comm: swapper Not tainted 2.6.32-504.el6.x86_64 #1
Call Trace:
I've seen that when the downloaded PXE files are corrupted. The easiest fix is removing them and forcing them to redownload
First remove the associated files in `/var/lib/tftpboot/boot` on the Foreman server
So if the host was supposed to build SL 6.7, the files are likely called `Scientific-6.7-x86_64-initrd.img` and `Scientific-6.7-x86_64-vmlinuz`.
Then cancel build for host in Foreman and click Build again , that will trigger Foreman Proxy to redownload the files (since they will be missing).
When you click the Build button, one thing Foreman does is instruct Foreman Proxy to ensure TFTP boot files exist. If you remove them, Foreman Proxy will download them again.
But only if you instruct a host to Build after removing them.
If you see a red and blue screen with "Error downloading kickstart file. Please modify the kickstart parameter below or press Cancel to proceed as an interactive installation", follow the directions above for "Kickstart on PXE boot failure" to remove and force Foreman to re-download the kickstart files.
Follow directions above for "Kickstart on PXE boot failure" to remove and force Foreman to re-download the kickstart files.
First check httpd service is running, if not start it.
[root@hepcms-foreman ~]# service httpd status
httpd is stopped
[root@hepcms-foreman ~]# service httpd start
[Wed May 31 15:55:18 2017] [warn] module passenger_module is already loaded, skipping
Syntax error on line 4 of /etc/httpd/conf.d/activemq-httpd.conf:
Invalid command 'ProxyRequests', perhaps misspelled or defined by a module not included in the server configuration
The offending config /etc/httpd/conf.d/activemq-httpd.conf
was installed as part of the mcollective yum install, a service that is not currently working and was installed a couple of months ago.
For now, move it out of the httpd config area:
[root@hepcms-foreman conf.d]# mv activemq-httpd.conf /root/
restart the services
[root@hepcms-foreman conf.d]# service httpd start
Starting httpd: [Fri Jun 02 13:19:39 2017] [warn] module passenger_module is already loaded, skipping
[ OK ]
[root@hepcms-foreman conf.d]# passenger-status
Version : 4.0.18
Date : Fri Jun 02 13:19:52 -0400 2017
Instance: 25594
----------- General information -----------
Max pool size : 6
Processes : 0
Requests in top-level queue : 0
----------- Application groups -----------
/usr/share/foreman#default:
App root: /usr/share/foreman
(spawning new process...)
Requests in queue: 2
[root@hepcms-foreman conf.d]# service foreman status
Foreman is running under passenger [PASSED]
ssh root@hepcms-foreman.umd.edu, check /var/log/foreman/production.log for something like this:
Started PUT "/hosts/hepcms-gridftp.umd.edu/setBuild?auth_object=hepcms-gridftp.umd.edu&permission=build_hosts" for 206.196.186.151 at 2016-06-17 14:14:37 -0400
2016-06-17 14:14:37 [I] Processing by HostsController#setBuild as HTML
2016-06-17 14:14:37 [I] Parameters: {"utf8"=>"✓", "authenticity_token"=>"SQ1q1B1aPMbBXTfyYJ/YUVO9bn3nXHLBbOvzl2os3eY=", "commit"=>"Build", "auth_object"=>"hepcms-gridftp.umd.edu", "permission"=>"build_hosts", "id"=>"hepcms-gridftp.umd.edu"}
2016-06-17 14:14:37 [I] Add the TFTP configuration for hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Fetching required TFTP boot files for hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Redirected to https://hepcms-foreman.umd.edu/hosts/hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Completed 302 Found in 601ms (ActiveRecord: 10.7ms)
Note: the time it takes to build (shown at the bottom of this page) can be very long, ~20-30 minutes or more, depending on the disks attached to the node.
ssh root@hepcms-foreman.umd.edu
May wish to backup on hepcms-foreman as you see above, type the alias backup
Command is: r10k deploy -v info environment -p. This is aliased as getr10k, which runs the command above and also prints the date, so you can keep track of when you ran it in a workflow.
Did you get r10k updates above?
Did you have a bug in your code? (Run puppet agent --test --noop on the machine you are trying to change.)
Go to the area it's complaining about (on hepcms-foreman) and do a git status. Currently there is no ssh key on hepcms-foreman, we are using git as read only
To continue to use git as read only, do the following:
Commit and push any changes you have made by hand to git using another server (not ideal, best to add a ssh key and config your hepcms-foreman git as root)
Go to the affected areas that git complains about (e.g. the directories below), and in each area run the following git commands:
/etc/puppet/hiera/production/hieradata
/etc/puppet/environments/production/modules/profile
git fetch --all
git reset --hard origin/master (or git reset --hard)
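A sketch of the full read-only reset over both checkouts (this discards any local hand edits, so back up first):
for d in /etc/puppet/hiera/production/hieradata /etc/puppet/environments/production/modules/profile; do
  ( cd "$d" && git status && git fetch --all && git reset --hard origin/master )
done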
On hepcms-foreman, put it in /etc/puppet/modules. That directory is defined in /etc/puppet/puppet.conf as part of basemodulepath, a path picked up by Puppet for modules but not touched by r10k.
In puppet:
host { 'hepcms-hn.umd.edu':
ensure => 'present',
host_aliases => ['hepcms-hn'],
ip => '10.1.0.1',
}
From command line in puppet:
puppet resource host hepcms-hn.umd.edu ensure=present host_aliases=hepcms-hn ip=10.1.0.1
By hand with no puppet: In /etc/hosts: 10.1.0.1 hepcms-hn.umd.edu
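A quick check (a sketch) that the host entry took effect, whichever of the three methods you used:
getent hosts hepcms-hn.umd.edu          # resolves via /etc/hosts (and NSS)
puppet resource host hepcms-hn.umd.edu  # prints the entry as Puppet sees it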
In github umd_hepcms_puppet_modules, edit Puppetfile - This file is a list of modules that are to be installed on the puppet master by r10k. Make sure to check in the edit and run r10k to pick it up.
Example lines:
mod 'puppetlabs/denyhosts', '0.1.0'
mod 'osg', :git => 'https://github.com/treydock/puppet-osg'
mod 'role', :git => 'https://github.com/UMD-HEPCMS/umd_hepcms_puppet_roles'
In hepcms-foreman, be sure to update the Puppet classes available, Configure… Puppet classes.. click on button to Import from hepcms-puppet.umd.edu
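After getr10k, a quick check (a sketch) that the new module actually landed in the production environment:
ls /etc/puppet/environments/production/modules/ | grep -i osg   # substitute the module name you added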
On the node that complains, check the facter information, then fix it to pick up facter from the puppetlabs repo as below:
facter -p operatingsystemmajrelease
facter --version
Check your foreman provisioning template (web interface) for that node (can click on Templates) , in this case it's:
<% if puppet_enabled && @host.params['enable-puppetlabs-repo'] && @host.params['enable-puppetlabs-repo'] == 'true' -%>
In this case we want to set that universally, so in foreman (web interface), Configure… Global Parameters…
Name: enable-puppetlabs-repo Value: true
facter::package_ensure: "2.4.4-1.el%{::operatingsystemmajrelease}"
Or hard code release:
facter::package_ensure: "2.4.4-1.el6"
Now it's possible that puppet tries to update facter before adding the repo. Puppet's order of applying things is 'random'. You'd have to tell Puppet that `Package[facter]` requires `Yumrepo[puppetlabs-products]`.
One really bad hack I use in site.pp is this: Yumrepo <| |> -> Package <| |>
That basically tells Puppet to ensure all repos are added before packages
It has caused me a few problems but the problems were with modules I developed so updated my own modules to allow for such a hack
Add to profile::base something like this:
include ::facter
include ::puppetlabs_yum
Class['::puppetlabs_yum'] -> Class['::facter']
That will ensure anything with profile::base has the puppetlabs_yum class applied before facter
Is there a warning message with puppet agent --test --noop run on that node?
Is the puppet module added to the base class or the node (check the hepcms-foreman web page)?
Can add on the hepcms-foreman web page (which should only affect kickstart), or better, add in base.pp below:
Add in base.pp for instance:
include ::facter
include ::puppetlabs_yum
Class['::puppetlabs_yum'] -> Class['::facter']
puppet agent --test --tags facter,puppetlabs_yum
Check that /data and /home are properly mounted. Check that the head node and r720-datanfs machines are healthy (df -h) and have proper network and firewall settings. Interestingly enough, this caused problems in puppet agent when I messed up the r720-datanfs firewall; the other symptom was that df -h would hang.
Look in the module's .erb file to see what variables modify the configuration file, for instance:
https://github.com/treydock/puppet-osg/blob/master/templates/cvmfs/default.local.erb
Format in your .yaml like so:
osg::cvmfs::http_proxies:
- 'http://hepcms-squid:3128'
http://rnelson0.com/2014/10/20/rewriting-a-puppet-module-for-use-with-hiera/
https://docs.puppetlabs.com/hiera/1/puppet.html#automatic-parameter-lookup
Example: https://forge.puppetlabs.com/jfryman/selinux
Add in Puppetfile: mod 'jfryman/selinux', '0.2.5'
Add in base.pp: include ::selinux
Add in common.yaml: selinux::mode: 'disabled'
Be sure to run r10k to pick up changes, run puppet agent --test
To use in a specific GUMS.yaml: selinux::mode: 'enforcing'
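After the puppet run, a quick check on the node (a sketch) that the selinux mode really changed:
getenforce                             # runtime mode
grep '^SELINUX=' /etc/selinux/config   # configured mode (takes effect at boot)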
Are you trying to implement a single-value variable as if it were an array (or vice versa)?
array:
osg::cvmfs::http_proxies:
- 'http://hepcms-squid.privnet:3128'
Single value (with or without single quotes; note the space after the last colon):
osg::cvmfs::cms_local_site: T3_US_UMD
Hiera picks up values from common.yaml, the Hostgroup yaml (only one, NOT an inherited structure, so using both Worker.yaml and R720.yaml is a bad idea; stick to just one), and the fqdn/FullFQDN.yaml.
I want --dport 9000:9999 in my firewall; the puppet module accepts the following in hiera (note it doesn't accept the string "9000:9999"):
dport: [9000,9999]
This is an array, which is actually equivalent to:
dport:
- 9000
- 9999
But I want the range, which is coded correctly as a string:
dport: '9000-9999'
And I get:
-A INPUT -p tcp -m multiport --dports 9000:9999 -m comment --comment "004 Condor ports open" -j ACCEPT
Note:
dport matches the destination port (used here for inbound rules); sport matches the source port (used here for outbound rules)
for example :
'003 allow GRAM callback inbound':
dport: '40000-40199'
proto: "tcp"
action: 'accept'
'004 allow GRAM callback outbound':
sport: '20000-25000'
proto: "tcp"
action: 'accept'
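Once puppet has applied the rules, a quick check on the node (a sketch) that the ranges ended up in iptables as expected:
iptables -L INPUT -n --line-numbers | grep -i condor
iptables -S | grep 40000          # should show the GRAM callback range from the example above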
In the example above with selinux, it keeps applying the setenforce 0, but stops after reboot, so just reboot the node. If that doesn't fix it, then check the hepcms-foreman web page Reports for what puppet keeps applying, maybe you set something up wrong.
Did you spell the class name right in implementation?
Did you check the Dependencies web page for the puppet module? Make sure it's installed in the Puppetfile
ERROR -> Forge module names must match 'owner/modulename'
Did you forget a comma in your Puppetfile ?
http://www.puppetcookbook.com/posts/creating-a-directory.html
puppet snippet:
# Same as command: ln -s /etc/puppet/hiera/production/hiera.yaml /etc/hiera.yaml
file { '/etc/hiera.yaml':
ensure => 'symlink',
target => '/etc/puppet/hiera/production/hiera.yaml',
}
Foreman proxy not starting
Check ps aux. The smart-proxy process may still be running even though the foreman-proxy daemon is stopped; its status was SNl in ps aux and it had been running since the last time the proxy worked. The issue can be solved by manually killing the smart-proxy process and restarting foreman-proxy.
498 22939 0.4 0.7 169560 62244 ? Sl 14:36 0:11 ruby /usr/share/foreman-proxy/bin/smart-pr
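A sketch of the kill-and-restart sequence (the PID comes from the ps output, like the line above):
ps aux | grep [s]mart-proxy       # note the PID of the stale process
kill <PID>                        # <PID> is a placeholder -- use the number from ps
service foreman-proxy restart
service foreman-proxy status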
look at install logs on `/root/`
Reboot the server. When it gets to the screen where it chooses the Linux version, choose an older version (kernel) instead.
Once in, remove the oldest kernel from the /boot directory (not the one in use now).
Check if other directories are filled and if so release some space.
You can check which kernel is being used right now
[root@r720-0-1 boot]# uname -or
2.6.32-642.6.2.el6.x86_64 GNU/Linux
[root@r720-0-1 boot]#
and reinstall the new kernel again.
Instructions from Doug:
Do the kernel panics occur while booting or sometime later? Either way, I recommend booting into a previous kernel. I suspect that a disk partition filled and resulted in a corrupt upgrade. You can usually recover from this. After booting, check the disks for full partitions. If it is the result of logs or crash dumps (/var/crash), clean these up. The most likely problem is that /boot filled. This is a little tricky to clean up. To remove packages, /var must have free space. Then you can remove "old" kernels; not the one you are running. When there is enough free space, try reinstalling the most recent kernel.
yum reinstall kernel-##.##....
Do not use --skip-broken. It is best to keep cleaning and resolving yum errors until you can run yum cleanly. I recently went through this for a machine that would not boot. We spent 4 or 5 hours resolving the issues, but did not have to reinstall the OS.
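A sketch of the cleanup sequence described above (the version strings are examples; take the real ones from `uname -r` and `rpm -q kernel` on the node):
df -h /boot /var                       # check which partition actually filled
uname -r                               # the running kernel -- never remove this one
rpm -q kernel                          # list all installed kernels
yum remove kernel-<old-version>        # pick an old entry from the list above
yum reinstall kernel-2.6.32-642.6.2.el6.x86_64   # most recent kernel, e.g. the version seen earlier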
https://www.digicert.com/secure/profile-settings/
These certs are now through UMD, so on the digicert website log in through SSO for University of Maryland, College Park.
We have three certs at the moment that expire in January
I love digicert! They are for linux
usr: digicert
jabeen@umd.edu
Log in to the digicert account and request a new grid ssl host certificate. It requires a CSR file.
On the SE, CE and gridftp machines, cd to
cd /data/site_conf/certs/DIGICERT-2019/
and create certificate csr and key files.
openssl req -new -newkey rsa:2048 -nodes -out hepcms-0_umd.edu.csr -keyout hepcms-0.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-0.umd.edu/emailAddress=jabeen@umd.edu"
openssl req -new -newkey rsa:2048 -nodes -out hepcms-1_umd.edu.csr -keyout hepcms-1.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-1.umd.edu/emailAddress=jabeen@umd.edu"
openssl req -new -newkey rsa:2048 -nodes -out hepcms-gridftp_umd.edu.csr -keyout hepcms-gridftp.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-gridftp.umd.edu/emailAddress=jabeen@umd.edu"
more hepcms-1.umd.edu.csr
Copy and paste the CSR contents into the digicert website. All the other fields are filled automatically. Order the certificate and wait for approval.
Once you have it, download the file, copy it to the /data directory, and copy the relevant files to /etc/grid-security on all three machines.
Make sure they have correct permissions.
On hepcms-se (hepcms-0) (this is for both the SE and xrootd):
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd.edu.csr .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd_edu_14042245/hepcms-0_umd_edu.crt .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd_edu_14042245/DigiCertCA.crt .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0.umd.edu.key .
cp hepcms-0.umd.edu.key hostkey.pem
cp hepcms-0_umd_edu.crt hostcert.pem
chmod 444 hostcert.pem
chmod 400 hostkey.pem
cd xrd/
cp ../hostkey.pem xrdkey.pem
cp ../hostcert.pem xrdcert.pem
chmod 444 xrdcert.pem
chmod 400 xrdkey.pem
restart the services
service condor-ce restart on CE (hepcms-1)
service xrootd restart on SE (hepcms-0)
service cmsd restart on SE (hepcms-0)
service globus-gridftp-server restart on gridftp
check the dates and that the cert matches the key
[root@hepcms-1 grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=com/DC=DigiCert-Grid/C=US/ST=Maryland/L=College Park/O=University of Maryland/CN=hepcms-1.umd.edu
issuer= /C=US/O=DigiCert Grid/OU=www.digicert.com/CN=DigiCert Grid Trust CA G2
notBefore=Dec 6 00:00:00 2019 GMT
notAfter=Jan 5 12:00:00 2021 GMT
[root@hepcms-1 grid-security]#
[root@hepcms-1 grid-security]# openssl x509 -noout -modulus -in hostcert.pem | openssl md5
(stdin)= a6c9ac5f7a36ff49efa6de7f861359e9
[root@hepcms-1 grid-security]# openssl rsa -noout -modulus -in hostkey.pem | openssl md5
(stdin)= a6c9ac5f7a36ff49efa6de7f861359e9
[root@hepcms-1 grid-security]#
ls -slrt
service condor-ce status
service condor-ce restart
tail -100 /var/log/condor-ce/SchedLog
history
Same for SE and hepcms-gridftp.umd.edu
Check that the services are working:
[jabeen@hepcms-in2 PDF]$ xrdfs root://hepcms-0.umd.edu:1094/ ls /store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//AE237916-5D76-E711-A48C-FA163EEEBFED.root
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//CE860B10-5D76-E711-BCA8-FA163EAA761A.root
cd /data/users/jabeen/CMSSW_8_0_26_patch1/src/WG_Analysis/
cmsenv
source /cvmfs/cms.cern.ch/crab3/crab.csh
crab checkwrite --site=T3_US_UMD
On all the nodes that need certificates, generate the RSA key and CSR:
clush -w hepcms-gridftp -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-gridftp.umd.edu.key -out hepcms-gridftp.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-1.umd.edu.key -out hepcms-1.umd.edu.csr
clush -w hepcms-se -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-0.umd.edu.key -out hepcms-0.umd.edu.csr
clush -w siab-1 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout siab-1.umd.edu.key -out siab-1.umd.edu.csr
clush -w hepcms-gridftp -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-gridftp.umd.edu.key -out hepcms-gridftp.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout http/hepcms-1.umd.edu.key -out http/hepcms-1.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout rsv/hepcms-1.umd.edu.key -out rsv/hepcms-1.umd.edu.csr
clush -w hepcms-in2 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-in2.umd.edu.key -out hepcms-in2.umd.edu.csr
Not getting this one: clush -w hepcmsdev-6 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcmsdev-6.umd.edu.key -out hepcmsdev-6.umd.edu.csr
From https://opensciencegrid.org/docs/security/host-certs/
Verify that the issuer CN field is InCommon IGTF Server CA:
Install the host certificate and key: in /etc/grid-security
$ openssl x509 -in <PATH TO CERTIFICATE> -noout -issuer
issuer= /C=US/O=Internet2/OU=InCommon/CN=InCommon IGTF Server CA
root@host # cp <PATH TO CERTIFICATE> hostcert.pem
root@host # cp <PATH TO KEY> hostkey.pem
root@host # chmod 444 hostcert.pem
root@host # chmod 400 hostkey.pem
From https://www.digicert.com/csr-ssl-installation/apache-openssl.htm#ssl_certificate_install
[root@hepcms-gridftp grid-security]# grep -i -r "SSLCertificateFile" /etc/
/etc/sfcb/sfcb.cfg:sslCertificateFilePath: /etc/sfcb/server.pem
[root@hepcms-gridftp grid-security]#
Hosts that need certificates.
osg-gridadmin-cert-request --hostname=hepcms-in2.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=siab-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-gridftp.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcmsdev-6.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Command to get the certificate
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTS/hepcms-ce]$ osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTS/hepcms-ce]$ osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTShepcms-ce]$ osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
ssh hepcms-ce
cd /etc/grid-security/
compare the new and old to see they have the same ID
[root@hepcms-1 2017certs]# openssl x509 -in hepcms-1.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Apr 4 17:11:52 2017 GMT
notAfter=May 4 17:16:52 2018 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Mar 4 19:38:57 2016 GMT
notAfter=Apr 3 19:43:57 2017 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../rsv/rsvcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Mar 4 19:39:30 2016 GMT
notAfter=Apr 3 19:44:30 2017 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../http/httpcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=http/hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=May 2 21:35:41 2016 GMT
notAfter=Jun 1 21:40:41 2017 GMT
now move old certs to xxx-old and copy all three new certs to their proper names and directories
[root@hepcms-1 grid-security]# mv hostcert.pem hostcert.pem-old
[root@hepcms-1 grid-security]# mv hostkey.pem hostkey.pem-old
[root@hepcms-1 grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-ce/*.pem .
[root@hepcms-1 grid-security]# mv hepcms-1.umd.edu.pem hostcert.pem
[root@hepcms-1 grid-security]# mv hepcms-1.umd.edu-key.pem hostkey.pem
[root@hepcms-1 grid-security]# cd http/
[root@hepcms-1 http]# mv httpcert.pem httpcert.pem-old
[root@hepcms-1 http]# mv httpkey.pem httpkey.pem-old
[root@hepcms-1 http]# mv ../http-hepcms-1.umd.edu.pem httpcert.pem
[root@hepcms-1 http]# mv ../http-hepcms-1.umd.edu-key.pem httpkey.pem
[root@hepcms-1 http]# cd ../rsv/
[root@hepcms-1 rsv]# mv rsvcert.pem rsvcert.pem-old
[root@hepcms-1 rsv]# mv rsvkey.pem rsvkey.pem-old
[root@hepcms-1 rsv]# mv ../rsv-hepcms-1.umd.edu.pem rsvcert.pem
[root@hepcms-1 rsv]# mv ../rsv-hepcms-1.umd.edu-key.pem rsvkey.pem
Make sure they have the right ownership
[root@hepcms-1 grid-security]# chmod 444 hostcert.pem http/httpcert.pem rsv/rsvcert.pem
[root@hepcms-1 grid-security]# chmod 400 hostkey.pem http/httpkey.pem rsv/rsvkey.pem
[root@hepcms-1 grid-security]# chown root:root *.pem
[root@hepcms-1 rsv]# chown rsv:rsv *.pem
[root@hepcms-1 http]# chown tomcat:tomcat *.pem
[root@hepcms-1 rsv]# service rsv restart
Stopping RSV: Stopping all metrics on all hosts.
Stopping consumers.
Starting RSV: Starting 13 metrics for host 'hepcms-1.umd.edu'.
Starting 2 metrics for host 'hepcms-0.umd.edu:8443'.
Starting 1 metrics for host 'hepcms-gridftp.umd.edu'.
Starting 2 consumers.
[root@hepcms-1 rsv]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-1 rsv]#
Get all the new certs (you have to be a GridAdmin for that).
As yourself, log in to hepcms-in2 and make a new area to save the new certs. This SITE_CERTS directory is softlinked in /data/users/jabeen, which makes these certs accessible from all needed nodes.
[jabeen@hepcms-in2 ~]$ mkdir SITE_CERTS
[jabeen@hepcms-in2 ~]$ cd SITE_CERTS/
[jabeen@hepcms-in2 ~/SITE_CERTS]$ mkdir hepcms-se
[jabeen@hepcms-in2 ~/SITE_CERTS/hepcms-se]$ osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-0.umd.edu
Generating certificate...
Writing key to ./hepcms-0.umd.edu-key.pem
Id is: 9155
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-0.umd.edu.pem
[jabeen@hepcms-in2 ~/SITE_CERTS]$ ls -slrt hepcms-se/
total 8
4 -rw------- 1 jabeen users 1679 Jan 13 18:49 hepcms-0.umd.edu-key.pem
4 -rw-r--r-- 1 jabeen users 1668 Jan 13 18:49 hepcms-0.umd.edu.pem
Apply SE and bestman Certificates (same)
[jabeen@hepcms-in2 ~]$ ssh hepcms-se
[root@hepcms-0 /]# cd ./etc/grid-security/
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-r-------- 1 root root 1.7K Feb 6 2016 hostkey.pem
-r--r--r-- 1 root root 1.7K Feb 6 2016 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jul 6 2016 .
[root@hepcms-0 grid-security]# sftp jabeen@hepcms.umd.edu
Connecting to hepcms.umd.edu...
jabeen@hepcms.umd.edu's password:
sftp> cd /home/jabeen/SITE_CERTS/hepcms-se
sftp> mget *.pem
Fetching /home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu-key.pem to hepcms-0.umd.edu-key.pem
/home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu-key.pem 100% 1679 1.6KB/s 00:00
Fetching /home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu.pem to hepcms-0.umd.edu.pem
/home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu.pem 100% 1668 1.6KB/s 00:00
sftp> bye
[root@hepcms-0 grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu* .
[root@hepcms-0 grid-security]# ls -slrt
total 96
4 -r-------- 1 root root 1679 Feb 6 2016 hostkey.pem
4 -r--r--r-- 1 root root 1672 Feb 6 2016 hostcert.pem
4 -rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
4 drwxr-xr-x 2 bestman bestman 4096 May 26 2016 bestman
4 drwxr-xr-x 2 xrootd xrootd 4096 Jun 29 2016 xrd
4 -rw-r--r-- 1 root root 1781 Aug 9 20:46 gsi.conf
60 drwxr-xr-x 2 root root 61440 Oct 20 00:44 certificates
4 drwxr-xr-x 46 root root 4096 Nov 4 16:24 vomsdir
4 -rw------- 1 root root 1679 Jan 13 18:58 hepcms-0.umd.edu-key.pem
4 -rw-r--r-- 1 root root 1668 Jan 13 18:58 hepcms-0.umd.edu.pem
Check that old and new certs are for the same host:
[root@hepcms-0 grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 grid-security]# openssl x509 -in hepcms-0.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 11 17:03:20 2018 GMT
notAfter=Mar 13 17:08:20 2019 GMT
[root@hepcms-0 grid-security]#
[root@hepcms-0 grid-security]# mv hepcms-0.umd.edu-key.pem hostkey.pem
[root@hepcms-0 grid-security]# mv hepcms-0.umd.edu.pem hostcert.pem
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-rw------- 1 root root 1.7K Jan 13 18:58 hostkey.pem
-rw-r--r-- 1 root root 1.7K Jan 13 18:58 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 .
[root@hepcms-0 grid-security]# chmod 400 hostkey.pem
[root@hepcms-0 grid-security]# chmod 444 hostcert.pem
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-r-------- 1 root root 1.7K Jan 13 18:58 hostkey.pem
-r--r--r-- 1 root root 1.7K Jan 13 18:58 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 .
For bestman certs:
[root@hepcms-0 grid-security]# chown bestman:bestman bestman
[root@hepcms-0 grid-security]# ls bestman/
[root@hepcms-0 grid-security]# cp *.pem bestman/
[root@hepcms-0 grid-security]# cd bestman/
[root@hepcms-0 bestman]# chown bestman:bestman *.pem
Check bestman certs are the same as hepcms-0
[root@hepcms-0 bestman]# openssl x509 -in bestmancert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 bestman]#
[root@hepcms-0 bestman]# ls -alrh
total 24K
-r-------- 1 bestman bestman 1.7K Jan 13 19:08 hostkey.pem
-r--r--r-- 1 bestman bestman 1.7K Jan 13 19:08 hostcert.pem
-r-------- 1 bestman bestman 1.7K Mar 1 2016 bestmankey.pem
-r-------- 1 bestman bestman 1.7K Mar 1 2016 bestmancert.pem
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 ..
drwxr-xr-x 2 bestman bestman 4.0K Jan 13 19:08 .
[root@hepcms-0 bestman]# mv hostkey.pem bestmankey.pem
[root@hepcms-0 bestman]# mv hostcert.pem bestmancert.pem
[root@hepcms-0 bestman]# openssl x509 -in /etc/grid-security/bestman/bestmancert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 bestman]# chmod 400 bestmankey.pem
[root@hepcms-0 bestman]# chmod 444 bestmancert.pem
[root@hepcms-0 bestman]# ls -alrh
total 16K
-r-------- 1 bestman bestman 1.7K Jan 13 19:08 bestmankey.pem
-r--r--r-- 1 bestman bestman 1.7K Jan 13 19:08 bestmancert.pem
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 ..
drwxr-xr-x 2 bestman bestman 4.0K Jan 13 19:15 .
The xrd cert is the same as the SE host cert:
[root@hepcms-0 xrd]# openssl x509 -in xrdcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 6 16:27:05 2016 GMT
notAfter=Mar 7 16:32:05 2017 GMT
[root@hepcms-0 xrd]# pwd
/etc/grid-security/xrd
[root@hepcms-0 xrd]# mv xrdcert.pem xrdcert.pem-old
[root@hepcms-0 xrd]# mv xrdkey.pem xrdkey.pem-old
[root@hepcms-0 xrd]# cp ../hostcert.pem ./xrdcert.pem
[root@hepcms-0 xrd]# cp ../hostkey.pem xrdkey.pem
[root@hepcms-0 xrd]# ls -slrt
total 16
4 -r-------- 1 xrootd xrootd 1679 Jun 29 2016 xrdkey.pem-old
4 -r--r--r-- 1 xrootd xrootd 1672 Jun 29 2016 xrdcert.pem-old
4 -r--r--r-- 1 root root 1668 Mar 3 20:49 xrdcert.pem
4 -r-------- 1 root root 1679 Mar 3 20:50 xrdkey.pem
[root@hepcms-0 xrd]# chown xrootd:xrootd xrdcert.pem
[root@hepcms-0 xrd]# chown xrootd:xrootd xrdkey.pem
[root@hepcms-0 xrd]# ls -slrt
total 16
4 -r-------- 1 xrootd xrootd 1679 Jun 29 2016 xrdkey.pem-old
4 -r--r--r-- 1 xrootd xrootd 1672 Jun 29 2016 xrdcert.pem-old
4 -r--r--r-- 1 xrootd xrootd 1668 Mar 3 20:49 xrdcert.pem
4 -r-------- 1 xrootd xrootd 1679 Mar 3 20:50 xrdkey.pem
Get the cert
[jabeen@hepcms-in2 ~/SITE_CERTS]$ mkdir hepcms-gridftp
[jabeen@hepcms-in2 ~/SITE_CERTS]$ cd hepcms-gridftp/
[jabeen@hepcms-in2 hepcms-gridftp]$ osg-gridadmin-cert-request --hostname=hepcms-gridftp.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-gridftp.umd.edu
Generating certificate...
Writing key to ./hepcms-gridftp.umd.edu-key.pem
Id is: 9156
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-gridftp.umd.edu.pem
[jabeen@hepcms-in2 hepcms-gridftp]$ ls
hepcms-gridftp.umd.edu-key.pem hepcms-gridftp.umd.edu.pem
cd ../
Copy to hepcms-gridftp
ssh -Y hepcms-gridftp
[root@hepcms-gridftp grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-gridftp/hepcms-gridftp.umd.edu* .
Check new and old are for gridftp
[root@hepcms-gridftp grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jun 13 18:05:50 2016 GMT
notAfter=Jul 13 18:10:50 2017 GMT
[root@hepcms-gridftp grid-security]# openssl x509 -in hepcms-gridftp.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 14 00:13:12 2017 GMT
notAfter=Feb 13 00:18:12 2018 GMT
replace old certs and check that permissions are same for old and new certs
[root@hepcms-gridftp grid-security]# mv hostkey.pem hostkey.pem-old
mv hostcert.pem hostcert.pem-old
mv hepcms-gridftp.umd.edu.pem hostcert.pem
mv hepcms-gridftp.umd.edu-key.pem hostkey.pem
ls -slrt
chmod 400 hostkey.pem
chmod 444 hostcert.pem
openssl x509 -in hostcert.pem -subject -issuer -dates -noout
[root@hepcms-gridftp grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 14 00:13:12 2017 GMT
notAfter=Feb 13 00:18:12 2018 GMT
gridftp cert also needs to be on datanfs
This is the place where we should keep all the certs to be deployed through puppet.
[jabeen@hepcms-in2 ~/SITE_CERTS]$ ls
hepcms-gridftp hepcms-se
[jabeen@hepcms-in2 ~/SITE_CERTS]$ tar cfvz hepcms-gridftp-cert.tgz hepcms-gridftp
hepcms-gridftp/
[root@hepcms-in2 ~]# cp /home/jabeen/SITE_CERTS/hepcms-gridftp-cert.tgz /data/site_conf/certs/
[root@hepcms-in2 ~]# ssh r720-datanfs
[root@r720-datanfs ~]# cd /data/site_conf/certs
[root@r720-datanfs certs]# cp /data/users/jabeen/SITE_CERTS/hepcms-gridftp/hepcms-gridftp.umd.edu* .
[root@r720-datanfs certs]# mv hepcms-gridftcert.pem hepcms-gridftcert.pem-old
[root@r720-datanfs certs]# mv hepcms-gridftpkey.pem hepcms-gridftpkey.pem-old
[root@r720-datanfs certs]# chown root:root *.pem
[root@r720-datanfs certs]# ls -alrh
[root@r720-datanfs certs]# ls -slrt
total 32
4 -r--r--r-- 1 root root 1675 May 25 2016 http
4 -r-------- 1 9 13 1679 Jun 3 2016 httpkey.pem
4 -r--r--r-- 1 9 13 1681 Jun 3 2016 httpcert.pem
0 drwxr-xr-x 2 root root 71 Jun 4 2016 grid-security
0 drwxr-xr-x 2 root root 41 Jun 4 2016 rsv
4 -rw-r--r-- 1 root root 35 Jun 9 2016 README
4 -rw------- 1 root root 1675 Jan 13 2017 hepcms-gridftpkey.pem-old
4 -rw-r--r-- 1 root root 1690 Jan 13 2017 hepcms-gridftcert.pem-old
4 -rw------- 1 root root 1679 Feb 11 13:08 hepcms-gridftpkey.pem
4 -rw-r--r-- 1 root root 1690 Feb 11 13:08 hepcms-gridftcert.pem
Get the cert for hepcmsdev-6
[jabeen@hepcms-in1 http]$ osg-gridadmin-cert-request --hostname=hepcmsdev-6.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcmsdev-6.umd.edu
Generating certificate...
Writing key to ./hepcmsdev-6.umd.edu-key.pem
Id is: 9312
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcmsdev-6.umd.edu.pem
[jabeen@hepcms-in1 http]$ openssl x509 -in hepcmsdev-6.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcmsdev-6.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 12 04:01:17 2017 GMT
notAfter=Mar 14 04:06:17 2018 GMT
now copy these to the location on hepcmsdev-6
[root@hepcmsdev-6 ~]# cd /etc/grid-security/http/
[root@hepcmsdev-6 http]# ls -slrt
total 8
4 -r--------. 1 tomcat tomcat 1675 Jan 13 2016 httpkey.pem
4 -r--r--r--. 1 tomcat tomcat 1692 Jan 13 2016 httpcert.pem
Rename the new certs to the standard names and fix permissions:
[root@hepcmsdev-6 http]# mv httpcert.pem httpcert.pem-old
[root@hepcmsdev-6 http]# mv httpkey.pem httpkey.pem-old
[root@hepcmsdev-6 http]# mv hepcmsdev-6.umd.edu-key.pem httpkey.pem
[root@hepcmsdev-6 http]# mv hepcmsdev-6.umd.edu.pem httpcert.pem
check that old and new certs match:
[root@hepcmsdev-6 http]# openssl x509 -in /etc/grid-security/http/httpcert.pem -dates -noout
notBefore=Jan 13 16:57:15 2016 GMT
notAfter=Feb 11 17:02:15 2017 GMT
[root@hepcmsdev-6 http]# chmod 400 httpkey.pem
[root@hepcmsdev-6 http]# chmod 444 httpcert.pem
[root@hepcmsdev-6 http]# chown tomcat.tomcat httpcert.pem
[root@hepcmsdev-6 http]# chown tomcat.tomcat httpkey.pem
Now restart the services:
service mysqld restart; service tomcat6 restart
More info here:
https://sites.google.com/a/physics.umd.edu/tier-3-umd/margueritedebuglog/gumsdebugging15dec2015
copied from
https://sites.google.com/a/physics.umd.edu/tier-3-umd/dont-edit/sitegridcertificates
Get CE and RSV site certificates
4 March 2016 (MBT)
hepcms-in2 already has osg-pki-tools installed so a GridAdmin can get site certificates, it also has osg, osg::cacerts and osg::cacerts::updater
Second, be sure the FQDN of your public IP of your node matches what hostname reports (on that node), use that below for HOSTNAME in the request
Third, that FQDN needs to exist as a service in OIM for the GridAdmin to get certificates (it already does, full instructions above in SE general example)
Login as myself on hepcms-in2, make sure I have my grid certificate installed on my /home/.globus
http://hep-t3.physics.umd.edu/HowToForUsers.html#CertAndProxy
HTCondorCE page says we need:
Host certificate
/etc/grid-security/hostcert.pem
/etc/grid-security/hostkey.pem
RSV page says we need:
RSV service certificate
/etc/grid-security/rsv/rsvcert.pem
/etc/grid-security/rsv/rsvkey.pem
Also double-checked the older cert to see the form the rsv service cert took (OLD FQDN used there):
OLD COMMAND: osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
OLD COMMAND: osg-gridadmin-cert-request --hostname=rsv/hepcms-0.umd.edu --vo=CMS
I will now run:
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Got my certs in my local area; the certs were made and approved, and I got 2 grid emails per cert about this.
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-1.umd.edu
Generating certificate...
Writing key to ./hepcms-1.umd.edu-key.pem
Id is: 7251
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-1.umd.edu.pem
[belt@hepcms-in2 SiteCE]$ cd ..
[belt@hepcms-in2 ~/SiteCertCE]$ dir
total 12K
drwxr-xr-x 3 belt users 4.0K Mar 4 14:43 .
drwxr-xr-x 78 belt users 4.0K Mar 4 14:33 ..
drwxr-xr-x 2 belt users 4.0K Mar 4 14:44 SiteCE
[belt@hepcms-in2 ~/SiteCertCE]$ mkdir RSVCE
[belt@hepcms-in2 ~/SiteCertCE]$ cd RSVCE/
[belt@hepcms-in2 RSVCE]$ osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for rsv/hepcms-1.umd.edu
Generating certificate...
Writing key to ./rsv-hepcms-1.umd.edu-key.pem
Id is: 7252
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./rsv-hepcms-1.umd.edu.pem
[belt@hepcms-in2 RSVCE]$
Apparently (2 May 2016) we also need a http site cert for CEMon (didn't see that before). Get it to my area below:
[belt@hepcms-in2 http]$ osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for http/hepcms-1.umd.edu
Generating certificate...
Writing key to ./http-hepcms-1.umd.edu-key.pem
Id is: 7740
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./http-hepcms-1.umd.edu.pem
[belt@hepcms-in2 http]$ pwd
/home/belt/SiteCertCE/http
Copy CE and RSV certs to proper areas on hepcms-1.umd.edu
These certs are still in my user area (~belt/SiteCertCE/SiteCE/*.pem for CE and ~belt/SiteCertCE/RSVCE/*.pem for RSV, you can login to hepcms-hn (su - to become root) and scp them to hepcms-ce as needed)
rsv user needs to exist, so you may need to *install* rsv before properly chown-ing the cert
properly rename the certificates when you move them to /etc/grid-security/ and /etc/grid-security/rsv
Make sure the permissions are appropriate (chmod 400 *key.pem; chmod 444 *cert.pem)
Make sure they are owned correctly (OSG twiki will guide you, or blocks I copied above),
CE cert: chown root:root /etc/grid-security/*.pem
RSV cert: chown rsv:rsv /etc/grid-security/rsv/*.pem
Make sure the subdirectory is properly chowned (for RSV): chown rsv:rsv /etc/grid-security/rsv
HTTP Certs are located in ~belt/SiteCertCE/http/
Note that all the above certs are now in /data/site_conf and accessible through puppet
HTTP Cert properties:
file { '/etc/grid-security/http':
ensure => 'directory',
owner => 'tomcat',
group => 'tomcat',
mode => '0755',
}
file { '/etc/grid-security/http/httpcert.pem':
ensure => 'file',
owner => 'tomcat',
group => 'tomcat',
mode => '0444',
source => $osg::ce::_httpcert_source,
require => File['/etc/grid-security/http'],
}
file { '/etc/grid-security/http/httpkey.pem':
ensure => 'file',
owner => 'tomcat',
group => 'tomcat',
mode => '0400',
source => $osg::ce::_httpkey_source,
require => File['/etc/grid-security/http'],
}
http://hep-t3.physics.umd.edu/HowToForUsers.html#CertAndProxy
https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/GetHostServiceCertificates
Some experience with certs and installing osg-pki-tools here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/margueritedebuglog/12jan2016gumsdebug
See at the top how to test the status of certificates
Also check they have the proper permissions and ownerships
Also see the various OSG troubleshooting web pages
I hadn't renewed the certificates, grid jobs were no longer coming in, and we had rsv errors.
Debugged this with:
Saw this error in CE: globus-gatekeeper.log:
PID: 8094 -- Notice: 0: GATEKEEPER_JM_ID 2014-03-11.09:37:37.0000031277.0000000061 for /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-0.umd.edu on ::ffff:128.8.164.12
Failure: globus_gss_assist_gridmap() failed authorization. globus_gss_assist: Error invoking callout
globus_callout_module: The callout returned an error
There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused
Fix: check the gmond and gmetad services; restart them, then restart httpd.
[root@hepcms-hn ~]# service gmetad status
gmetad dead but subsys locked
[root@hepcms-hn ~]# service gmetad start
Starting GANGLIA gmetad: [ OK ]
[root@hepcms-hn ~]#
[root@hepcms-hn ~]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-hn ~]#
Ganglia displaying a dead node
delete dead node from /var/lib/ganglia/rrds/UMD HEP CMS T3 and restart the services.
cd /var/lib/ganglia/rrds/
cd UMD\ HEP\ CMS\ T3/
rm -rf hepcms-ovirt2.privnet/
service gmetad start
service gmond restart
Overrode the hostname in gmond.conf (override_hostname = "r510-0-6"):
vi /etc/ganglia/gmond.conf
service gmond restart
Restart both the gmetad and gmond services:
[root@hepcms-hn ~]# service gmetad restart
Shutting down GANGLIA gmetad: [ OK ]
Starting GANGLIA gmetad: [ OK ]
[root@hepcms-hn ~]# service gmond restart
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
[root@hepcms-hn ~]#
Note also the spreadsheet attached (in Excel format there and pdf) with physical connection information
This spreadsheet is POSTED inside the C-21 back rack door physically at Rivertech
Reboot
press F2 to get into BIOS setup
Go to DEVICE SETTINGS
go to INTEGRATED RAID CONTROLLER utility
go to Configuration Management
PHYSICAL DISK MANAGEMENT will show the disks' status
go back to CONFIGURATION MANAGEMENT
go to Manage Foreign Configuration
Preview Foreign Configuration
Clear foreign configuration
compute-0-11 became unreachable. Ping and foreman connections didn't work.
At Rivertech, connected to the monitor and hard rebooted the node.
It seems to fail while checking the file system.
login as root.
root@compute-0-11> fsck
enter y for all the questions.
reboot
The node came back with everything perfectly mounted.
enter fdisk /dev/sdX
press n for a new partition
press p for adding a new partition
if the number of partitions is below 4, it will ask you for a partition number.
you can enter where the partitions start and end on the disk, but the defaults should be fine. (If you are adding another partition to the disk, the default should be right after the previous partition ends)
Usually it should partition and be ready to use, but if the disk is busy, you will have to restart the node.
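A sketch of that fdisk sequence (replace sdX with the real device; double-check it is the disk you mean before writing anything):
fdisk /dev/sdX        # then: n (new), p (primary), pick a partition number, accept the default start/end, w (write)
partprobe /dev/sdX    # ask the kernel to re-read the partition table, if partprobe is available
cat /proc/partitions  # confirm the new partition shows up; if the disk was busy, reboot the node instead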
Print Screen,
arrow up and down to select machines (can sometimes also type letter#, like c8 for "compute-0-8")
Get the second menu by arrowing to the KVM switch and pressing Enter or Print Screen again
One or both KVMs not working?
Check that they are powered on (green light in back), if not, check the power cord connection on phase3 (innermost PDU), as the sheaths don't fit, so they have some "wiggle room" and can get jostled
Also check that other physical connections are made (try not to jostle anything else)
Print Screen is the key to use with KVMs, get the second menu by arrowing to the KVM switch and pressing Enter or Print Screen again
Use Dell OMSA commands to help debug, or Dell OMSA webserver (note that in the past we have had R510s report a power problem according to Dell OMSA commands when actually they just needed a Firmware update!!!)
Could be that one of the two power cords is loose at the machine or at the PDU (be careful not to jostle any others)
If one is loose and you unplug the other, you will power down the machine suddenly, not good for disks
Could be that one of the hard drives has completely failed (beyond fsck failure), and needs to be physically replaced
If you cannot get to the operating system, try to reboot to the (F11, I think) menu to run Dell Diagnostics
If need be, some machines were setup with iDRAC access, with internal network identities accessible from firefox within the cluster
Check it physically at Rivertech
ping it (see also network troubleshooting)
At Rivertech, check that it can see the outside world, check if there's something interesting on the screen (like it wanted to fsck some disks upon reboot), if you can login as root, then fsck -y /hadoop2 (or any other disks it might complain about)
More network troubleshooting: go to /etc/init.d and run bash network stop, then bash network start, then have another machine ping it / try to ping a different machine.
Dell Disks/OMSA
after doing the usual ssh-agent $SHELL; ssh-add stuff
you can find all the service tags for all the machines by using clush and doing
clush -w @all_baremetal omreport system summary | grep Service
that will report everything currently in clush, not ovirt and r720-datanfs though
omreport system summary
Log in with the root account on the particular node.
be sure iptables is turned off:
by hand: service iptables stop; chkconfig iptables off
In puppet: Can disable iptables in Hiera using firewall::ensure: 'stopped' which requires the firewall class be included
Or if it's on, make sure that internal ports are open
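Quick checks on the node (a sketch) of whether iptables is actually on and what it allows:
service iptables status
chkconfig --list iptables
iptables -L -n | head -30     # eyeball the rules for the internal ports you need (e.g. 1311 for OMSA)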
Restart webserver: omconfig system webserver action=restart
Is Dell OMSA running? Try omreport system
[root@hepcms-gridftp ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@hepcms-gridftp ~]# omconfig system webserver action=restart
DSM SA Connection Service restarted successfully.
From firefox on hn connect to https://hepcms-gridftp.privnet:1311
clearing hn log
https://umdt3.slack.com/messages/C0B4U4C2G/
[root@hepcms-gridftp ~]# service iptables start
iptables: Applying firewall rules: [ OK ]
[root@hepcms-gridftp ~]#
omreport storage vdisk controller=0
omreport storage vdisk controller=1 # for hepcmsdev-1 and r720-datanfs virtual disks
omreport storage pdisk controller=0 # physical disk status - virtual disks only exist on the RAIDed machines
Open up firefox from an interactive node on the cluster
Put in https://r510-0-11.privnet:1311 (for instance, use the name of your node)
Use the root login and password for that node
Click on Storage, and find the Physical Disks
You can Blink and Unblink specific disks (use process of elimination if your drive is missing).
replaced 0-1-13 on r720-0-1. This is the right small disk on the back.
Cleared badblock on virtual hadoop bad disk
If you can't get to the OMSA web interface you can use the blink script:
[root@hepcms-gridftp ~]# more /data/osg/scripts/BlinkLED.sh
#!/bin/sh
omconfig chassis leds led=identify flash=on
omconfig storage pdisk action=blink controller=1 pdisk=0:2:0
Don't forget to unblink.
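The unblink counterpart is not in the script; a sketch, using the same controller/pdisk as the blink script above (the flash=off form is assumed from the flash=on usage):
omconfig storage pdisk action=unblink controller=1 pdisk=0:2:0
omconfig chassis leds led=identify flash=off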
OMSA COMMANDS
Location of Dell System Summaries is on the hepcms-hn and is in /root/omsa_report
Commands
Example of how to grab report files and retrieve them from "node group" and diff them.
clush -v -w @<node group> --rcopy /root/omsa_chassis_report --dest /root/omsa_reports
clush -v -w @R510 --rcopy /root/omsa_chassis_report --dest /root/omsa_reports
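Then diff two of the retrieved reports; clush --rcopy appends the node name to each copied file, so the filenames look roughly like this (a sketch, node names are examples):
diff /root/omsa_reports/omsa_chassis_report.r510-0-1 /root/omsa_reports/omsa_chassis_report.r510-0-2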
omreport chassis biossetup
omreport chassis firmware
omreport chassis memory
omreport chassis nics
omreport chassis removableflashmedia
omreport system esmlog
omreport system alertaction
omreport storage pdisk controller=0
parted /dev/sda 'print'
omconfig storage controller action=exportlog controller=0
omreport -?
omreport chassis batteries
omreport chassis pwrmanagement
omreport chassis pwrsupplies
omreport system summary
omreport chassis memory
omreport chassis
omreport storage pdisk controller=0
Omconfig Chassis Leds Or Omconfig Mainsystem Leds
Use the omconfig chassis leds or omconfig mainsystem leds command to specify when to flash a chassis fault LED or chassis identification LED. This command also allows you to clear the LED of the system hard drive. The following table displays the valid parameters for the command.