Link to internal troubleshooting page: Troubleshooting
Link to weekly meetings: meetings
https://umdt3.slack.com/archives/t3newsysadmin/p1465839605000051
Link to the BIGGER admin guide: UMDT3
Things to do before using any network stop or similar command (or any change in IP table)
Please make sure you are physically in front of the machine.
Please run the commands by others working on the cluster.
RESOURCES And Commands
Yum commands YUM
/bin/ — Used to store user commands. The directory /usr/bin/ also stores user commands.
/sbin/ — Location of many system commands, such as shutdown. The directory /usr/sbin/ also contains many system commands.
/root/ — The home directory of root, the superuser.
/misc/ — This directory is used for automatically mounting directories on removable devices (such as Zip drives) and remote directories (such as NFS shares) using autofs. Refer to the autofs manual page (type man autofs at a shell prompt) for more information.
/mnt/ — This directory typically contains the mount points for file systems mounted after the system is booted.
/media/ — This directory contains the mount points for removable media, such as diskettes, CD-ROMs, and USB flash drives.
/boot/ — Contains the kernel and other files used during system startup.
/lost+found/ — Used by fsck to place orphaned files (files without names).
/lib/ — Contains many device modules and library files used by programs in /bin/ and /sbin/. The directory /usr/lib/ contains library files for user applications.
/dev/ — Stores device files.
/etc/ — Contains configuration files and directories.
/var/ — For variable (or constantly changing) files, such as log files and the printer spool.
/usr/ — Contains files and directories directly relating to users of the system, such as programs and supporting library files.
/proc/ — A virtual file system (not actually stored on the disk) that contains system information used by certain programs.
/initrd/ — A directory that is used to mount the initrd.img image file and load needed device modules during bootup.
Warning: Do not delete the /initrd/ directory. You will be unable to boot your computer if you delete the directory and then reboot your Red Hat Enterprise Linux system.
/tftpboot/ — Contains files and applications needed for Preboot Execution Environment (PXE), a service that allows client machines and machines without hard drives to boot an operating system from an image on a central PXE server.
/tmp/ — The temporary directory for users and programs. /tmp/ allows all users on a system read and write access.
/home/ — Default location of user home directories.
/opt/ — Directory where optional files and programs are stored. This directory is used mainly by third-party developers for easy installation and uninstallation of their software packages.
NIS Account Creation /Management
In case there are two similar accounts (e.g. oscillatorb and OscillatorB)
To add a sysadmin to the sudoers file and implement sudo on an individual node:
Condor/Grid Jobs
Replace bad disks:
Check Status:
Format new disk
Identify bad disk:
omreport storage pdisk controller=0 pdisk=0:0:3
Identify the disk on the machine; disk 0:0:3 should have a blinking light after the following command is run:
omconfig storage pdisk action=blink controller=0 pdisk=0:0:3
Replace the disk and check if new disk is in non-critical state.
omreport storage pdisk controller=0 pdisk=0:0:3
Stop blinking:
omconfig storage pdisk action=unblink controller=0 pdisk=0:0:3
DISK CLEAN UP
Remove temp files more than 5 days old
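A minimal sketch of the kind of command this cleanup refers to; the path and the 5-day threshold come from the line above, but the actual cleanup script used on the cluster may differ:
# remove regular files under /tmp not modified in the last 5 days
find /tmp -type f -mtime +5 -exec rm -f {} \;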
Hadoop PhEDEx Cleaning
HADOOP /NFS
Hadoop-hdfs-datanode service fails and a partition is unmounted
Hadoop service failed to start on a datanode with a java exception error
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
Additional NFS disk mount problem: requested NFS version or transport protocol is not supported
Did you reboot things in the wrong order? NFS disks not mounted?
Puppet/Foreman
Oh no you ran puppet where you shouldn't have and want to roll back your file
Check what the changes were after you try something in Puppet:
Test puppet changes without implementing them on a node for just one feature (tags):
Stop a puppet agent (these run automatically on a node either in kickstart or crontab):
Make that puppet agent not start automatically upon node reboot:
Start a puppet agent (these run automatically on a node either in kickstart or crontab):
Want to add a puppet class in base.pp or site.pp instead of on hepcms-foreman web?
Want to add a puppet class in a hiera yaml instead of on hepcms-foreman web?
r10k make sure we don't lose changes updated locally and not in github:
Foreman kickstart telling you there's not enough disk space for partitions?
Is the Foreman build of a baremetal machine working (checking during build):
Add a puppet module by hand in an area (locally) where r10k & git won't affect it:
Do proper ordering of install to ensure program (i.e. facter) comes from puppetlabs instead of epel:
Change in some .yaml parameter or class not taking effect at all on a node?
Check puppet agent behavior for a specific module (on that node):
All your nodes in the hepcms-foreman web page suddenly orange for "not in sync"?
How to change a puppet configuration file in your hiera .yaml?
Did your hiera implementation give you something weird, like ["?
Why is my node stuck in blue A and always doing the same update?
Implementing a new puppet module and get an error about "Could not find class"?
RENEWING GRID SITE CERTIFICATES
MONITORING
Ganglia web interface not working:
HARDWARE ISSUES
Consult this web page for more technical information about hardware identity
Machine has Orange blinking light, or orange "electrical", or orange "hard drive" symbol
https://servername.privnet:1311 from firefox on the local network (i.e. hepcmsdev-1):
Need to know which disk number in Dell corresponds to which hard disk?
Site Certs
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/InstallCertAuth
https://twiki.grid.iu.edu/bin/view/Documentation/Release3/OsgCaCertsUpdater
Condor commands
ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [
the command to see your CE reporting is:
$ condor_status -pool collector.opensciencegrid.org:9619 -any | grep -i umd
#Stop service after current jobs stop
condor_off -startd -peaceful r720-0-2
# start queue on node
systemctl start condor
general
df -ah
umount -nf /data
mount /data
To see the partitions and mounted system information on a server:
/etc/fstab
/etc/exports
ps -ef | grep rsync
ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# kill 1938084
[root@hepcms-in2 condor]# kill 9911
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
[checked that condor_schedd is killed]
root 2979085 0.0 0.0 6448 692 pts/21 S+ 21:33 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [
Hadoop commands
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
hadoop dfsadmin -report
hadoop fsck / -blocks
service hadoop-hdfs-datanode status
This logs in /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-xxxx.privnet.out
You can grep for warnings:
grep -i warn /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.log
If a disk fails, you need to take it out of hadoop. Then exclude this datanode (use the internal name, like r510-0-5.privnet): edit /etc/hadoop/conf/hosts-exclude on hepcms-namenode, then run hdfs dfsadmin -refreshNodes (a sketch is given below).
lsof /mnt/hadoop can show if someone has a lot of ROOT files open at the same time.
To balance hadoop manually, run hdfs balancer on hepcms-namenode (the command can be run from any hadoop node).
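A sketch of the datanode-exclusion step described above; the hostname is just the example name used above, and appending with echo assumes the exclude file is a plain one-host-per-line list:
# On hepcms-namenode: add the datanode to the exclude file and re-read it
echo "r510-0-5.privnet" >> /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes
# watch the decommissioning progress
hadoop dfsadmin -report | grep -A 2 r510-0-5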
RSV tests
http://opensciencegrid.org/docs/monitoring/rsv-control/
on hepcms-ce
List all metrics: rsv-control --job-list
Run these two metrics: rsv-control --run --host hepcms-1 org.osg.general.ping-host org.osg.general.java-version
Disable the metrics no longer required after the update to 3.4 (bestman and gratia):
rsv-control --disable --host hepcms-1 org.osg.srm.srmcp-readwrite
rsv-control --disable --host hepcms-1 org.osg.gratia.metric
rsv-control --disable --host hepcms-0.umd.edu:8443 org.osg.srm.srmcp-readwrite
The names of the hosts and which metrics are enabled are on the monitoring RSV page.
DNS Server:
sudo yum install bind bind-utils
Here are the instructions:
https://www.digitalocean.com/community/tutorials/how-to-configure-bind-as-a-private-network-dns-server-on-centos-7
The following files need changes. To add a new IP address, edit the following three files:
/etc/named.conf,
/var/named/dynamic/db.privnet
/var/named/dynamic/db.1.10.in-addr.arpa
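A sketch of the records you would add for a new host; the hostname and address below are placeholders, and remember to bump the zone serial in each file's SOA record:
; in /var/named/dynamic/db.privnet (forward zone)
newnode    IN  A    10.1.0.99
; in /var/named/dynamic/db.1.10.in-addr.arpa (reverse zone)
99.0       IN  PTR  newnode.privnet.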
possible solution:
1.) Backup your .ssh folder by mv .ssh .ssh_backup
2.) delete the .ssh folder in your home directory
3.) ssh into username@hepcms.umd.edu
It could be that the user tried a wrong name or password too many times (3) and the IP was blocked by denyhosts.
On each interactive node use the script /root/unblock_denyhosts.sh to clear the block. It only needs one argument, the IP address.
If you do not have the IP address of a user, but the user tried and failed to login, you can determine the user's IP address by searching the /var/log/secure* files for the username (which should have the IP address listed). Running "grep -i [username] /var/log/*" will help here.
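Putting the two steps above together (the username and IP below are placeholders):
# find the blocked user's IP in the secure logs
grep -i username /var/log/secure*
# then, on each interactive node, clear the block for that IP
/root/unblock_denyhosts.sh 1.2.3.4   # replace with the user's actual IP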
Only for CMS HEP (Nick Hadley, Sarah Eno, Andris Skuja, Alberto Belloni, Drew Baden), CMS Heavy Ion (Alice Mignerey's group), UMD theory (Raman Sundrum's group), and for Higgs Honors class taught by Shabnam Jabeen and Sarah Eno (those accounts for the duration of the class and not longer)
Can get special permission for CMS colleagues offsite, talk to Nick Hadley, but we prefer they use grid tools and not local accounts to access CMS shared files
If you have an email request that seems legitimate, it could still be a social hack. Please confirm with a professor/graduate student/postdoc in the above groups, as well as looking up that person in the UMD directory (note that occasionally postdocs are slow to show up in the directory).
The scripts are now in /root/scripts_accounts/ on hepcms-hn
Reference: http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch30_:_Configuring_NIS#Adding_New_NIS_Users
Add Users
Log into the head node (HN) (hepcms-hn.umd.edu)
su - to become root
Note that we prefer NOT to make accounts with uppercase, as it generally causes case-sensitive login problems; only make them lowercase.
run the script MakeAccount.sh
cd scripts_accounts
./MakeAccount.sh [usernames] [password] # use "" with multiple names
The above script is equivalent to:
Make the user and give them a first password (it can be anything; it doesn't matter since they have to change it), and set them required to change it upon first login (if they are a theory user, do -g theory):
useradd -g users -c "Full Name" username
passwd username
Set the password to expire so the user must change it on first login: chage -d 0 username. This appears to work with NIS in tests (i.e. it forced the student to change their password and the new one was propagated across all machines); if for some reason it doesn't, use yppasswd instead for changes.
Anytime you make changes to the main NIS database of users (password, new users, etc.), update the maps:
cd /var/yp; make
Optional: You can check to see if the user's authentication information has been updated by using the ypmatch command, which should return the user's encrypted password string: (optional, a bit buggy? was not working properly as of 1/6/2016. This step can be skipped for now ) - Margarita
ypmatch username passwd
Tested and it seems to work. Remember: use yppasswd instead of passwd.
Again, above steps are now in a script (mainly to create large number of accounts) : /root/scripts_accounts/MakeAccount.sh
Make /data Directory (for some users)
Note that Higgs Honors students do NOT get /data area that is provided to normal HEP users.
make a /data/users/username area (if theory they may also create space in /data/groups/theory/username upon request)
From the head node, go through internal network to r720-datanfs, be sure you are root (su -):
ssh r720-datanfs
mkdir /data/users/username
chown username:users /data/users/username
Only if requested, make an SE area in /hadoop (instructions HERE)
Document new user
document the new user in our .csv file (so they can get sysadmin emails) and generate text for a welcome email
The following script should now automatically run as part of /root/scripts_accounts/MakeAccount.sh
To write to the file, ensure that you have entered No when asked to add a new user, and do not abort before that.
Would you like to add a new user (Y/N)? N
The following users have been added:
xxxxx
Write new users to output file '/root/cronscripts/hepcms_Users.csv' (Y/N)? Y
<Writing to output file '/root/cronscripts/hepcms_Users.csv'>
Here is the manual command:
cd /root/scripts_accounts
python AddNewUser.py
group examples: "theory" "HIN" (heavy ion) "Fall2015" (Higgs class), the default is CMS HEP, so no group needs to be specified
Generate welcome email text you can COPY/PASTE into your mail program, see options here: python pyNewUserInstructions.py --help
Note that the python script may not be able to handle special characters in the password properly, so make sure it puts the information in correctly; use, for instance: --passwd="Special&Pass"
As a side note you can see how this .csv file is used elsewhere: python SendMail.py --help, and python parseUsers.py --help
If you have an email request, ensure that this is a true request and not a hacked account or a social hack, try to make phone/voice contact with the user if possible
From ANY machine on the cluster as root (su -) or sudoers (sudo -i)
yppasswd username
chage -d 0 username
cd /var/yp; make
This should automatically update; if for some reason it doesn't, you can make changes on the head node as root and update the NIS maps:
cd /var/yp; make
In case of error in changing password due to chfn:
You will get an error in the HN /var/log/messages like this:
ONLY for this problem, on the HN (as root su -), use system-config-users to edit the password by hand. Be very careful, a lot of account destruction can be done with this program
Then re-make the NIS database with this command (on the HN as root su -):
cd /var/yp; make
How to remove a user:
Not yet documented, please consider data retention policies (my general guideline is once the proofs have been submitted to the journal and you've followed experimental guidelines in data retention, you can delete the files). Also consider sometimes users use things in other people's areas (like the geant files in /data/users/jtemple)
userdel -r username
cd /var/yp; make
For the following two areas which won't get automatically moved, please consider other users may share files made by one user in these areas. Additionally, there might be a PhEDEx registered dataset in a user's private hadoop SE area!
Don't forget that they will have an area in /data/users/username
They might also have an area in /mnt/hadoop/cms/user/username, and an associated grid certificate account on our gums service (check the GUMS page to remove that)
hepcms-hn: cd /var/yp; make
The GUI controls users and groups for accounts on hepcms-hn, it’s just another way to do Linux account management other than command line.
NIS handles spreading that information to the rest of the cluster. Anytime you change accounts, either with system-config-users, or with useradd, userdel, or any other *Linux* users tools, you have to tell NIS to pick up and spread the changes.
as root or sudo
chown -R username: foldername
A) Old oscillator and OscillatorB removed via system-config-users.
B) Instances of andrej removed carefully from /etc/group, /etc/gshadow, and /etc/shadow; after the above, system-config-users no longer complains about andrej.
C)
cd /var/yp; make; cd -
just to be sure everything's in sync for NIS.
Well, I had already copied some files from the two home directories before all this and put them here:
/data/users/oscillator
In one is a CMS map program in a public_html folder, which was pretty cool, and the tarfile has some programs, so it's up to Fred to keep or delete. I did
chown -R oscillator:users /data/users/oscillator/oscillatorb*
to make them properly owned.
ypchsh
For the above command to work, this argument needed to be set in /etc/sysconfig/yppasswdd on the headnode.
https://www.linux.com/learn/tutorials/306766:linux-101-introduction-to-sudo
Also (see sudoers here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands)
On that specific node, as root ("su -" OR "sudo -i" ) add the user username to the "wheel" group.
usermod -aG wheel username (to add)
gpasswd -d username wheel (to remove from the group)
Then, on the individual node where you want to give the user access, use visudo to edit the /etc/sudoers file (be very careful because you can mess up the system with changes to this file).
Make sure to use visudo, since it will check to make sure that the sudoers file is properly formatted.
visudo
find the following two lines
## Allows people in group wheel to run all commands
# %wheel ALL=(ALL) ALL
Move your cursor onto the # before %wheel and delete the # by pressing x.
Alternatively, you can press Insert to go into editing mode, and use Backspace to erase the #.
Save and exit by typing :x (press Esc first; you can look up vi text editor commands if need be).
It should look like this now for that line: %wheel ALL=(ALL) ALL
Make sure you are editing it as root (su -), otherwise the changes do not save.
Exit out of root, and log in as your regular username (the one used with usermod -aG wheel username)
Test this (may not work right away -- see below)
groups
sudo su - (it appears that the command "sudo -i" works instead)
The command groups should show you being in the group users, and wheel
You will be warned, and you should now have root access *on that node only*. NIS doesn't sync sudo in the current settings.
The command sudo tells unix to run a single command as root, in this case the su - will elevate you to root permanently, thus allowing you to enter in more commands as root
Enter in your user password, and you should see a # instead of $ indicating you are currently root
Note (6 Oct 2015): I wasn't able to get the su USERNAME - command to work, I successfully added "belt" to the wheel group on hepcms-in2, and successfully edited /etc/sudoers (with visudo) as root to have:
%wheel ALL=(ALL) ALL
And still sudo whoami doesn't work as belt on hepcms-in2. No idea why. (but it is apparently ok!)
http://linuxpoison.blogspot.com/2008/12/configuring-sudo-and-adding-users-to.html
Test that this works (days later it magically worked, and didn't work immediately on hepcms-in1 3:10pm 19 Oct 2015):
[belt@hepcms-in2 ~]$ sudo more /etc/sudoers.d/10_wheel
[sudo] password for belt:
%wheel ALL=(ALL) ALL
[belt@hepcms-in2 ~]$ more /etc/sudoers.d/10_wheel
/etc/sudoers.d/10_wheel: Permission denied
5:34pm 19 Oct 2015: it works now on hepcms-in1! So apparently there's some time delay needed after setting this up.
We don't use GUMS anymore. Move on to the map file section below.
Note: this is done ONLY by request
Requirements:
Admin must be able to make a new user account on hepcms-hn
Admin must be in the admins group on GUMS (https://hepcmsdev-6.umd.edu:8443/gums/manualUserGroups.jsp) - authenticate with grid cert
Have the user's grid certificate DN (the output of voms-proxy-info helps; they have instructions on our user's page: http://hep-t3.physics.umd.edu/HowToForUsers.html#crab)
Have the user's CERN account name (could be different than their hepcms account name), this is because crab jobs will write to /mnt/hadoop/cms/store/user/CERNUsername
Note that CERNUsername could be the same as HepcmsUsername
Make a new user account, with HepcmsUsername_g in the (default) users group, it's not a standard login, so we make it with /bin/true
useradd -g users -c "HepcmsUsername grid user" -n HepcmsUsername_g -s /bin/true
Note that we prefer not to have usernames with capitalization, it's indicated in this section only for your readability
There's no password since it's not a login account, so we proceed to sync the NIS maps anyway
Anytime you make changes to the main NIS database of users (password, new users, etc.), update the maps:
cd /var/yp; make
Then make their area on hadoop, as root on any node (maybe ideally on hepcms-namenode? maybe should use hadoop dfs commands? don't know for sure, it worked like this)
cd /mnt/hadoop/cms/store/user
mkdir CERNUsername
chown HepcmsUsername_g:users CERNUsername
Next, you need the cern DN for the user. You can either ask them to run the voms-proxy-info command and send you the output or get it directly from here:
https://lcg-voms2.cern.ch:8443/voms/cms/user/search.action
Edit /data/osg/scripts/grid-mapfile on any node where /data is mounted. This is the file with the grid user mapping, and it is linked on every node as /etc/grid-security/grid-mapfile.
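For reference, a grid-mapfile entry is the quoted DN followed by the local account it maps to; a sketch using the placeholder DN and account names from elsewhere on this page (match the quoting style of the existing entries in the file):
"/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=CERNUsername/CN=SOMENUMBER/CN=Full User Name" HepcmsUsername_g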
Send the following information to the user:
Your storage space on the UMD HEP T3 SE has been created as /store/user/username
Files are written to this space primarily with CRAB jobs, for documentation, see:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrab
Ownership of these files is via a second user account linked to your grid certificate, so you will not be able to move, delete, or rename these files with your regular login, only through SE commands. This is the only difference you will see from a normal local filesystem. Some examples are given here:
https://sites.google.com/a/physics.umd.edu/umdt3/user-guide/file-transfer-from-to-the-cluster#TOC-T3_US_UMD-hadoop-examples:
If you have difficulty using this area, please contact the sysadmins.
Keep in mind that hadoop is internally replicated, so the disk space available is half of what is shown with "df -h". Additionally, one R510 node can store 12TB (after replication), so it is best to keep at least 24TB (before replication) free in case one node goes down to protect the data.
You are strongly encouraged to retain files of 1GB or larger for the health of the hadoop system.
For other information:
https://sites.google.com/a/physics.umd.edu/umdt3/user-guide
old not-used-anymore instructions
Then authenticate with your grid certificate to GUMS web page https://hepcmsdev-6.umd.edu:8443/gums/manualUserGroups.jsp and map their grid certificate to their account
Click on "Manual Account Mappings" (note that if you get some weird page about security, click back on "Home" in the upper left and try again, should be fixed)
Click "add" at the very bottom of the page
Put their full DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=CERNUsername/CN=SOMENUMBER/CN=Full User Name
Choose localAccountMapper
Put in their HepcmsUsername_g
Click "save"
Notify the user that they have an account, there is a shell script on hepcms-hn in /root/cronscripts/HadoopSE_NewUserWelcome.txt, that can be used as input for pyNewUserInstructions.py
python pyNewUserInstructions.py -u CERNUsername -t HadoopSE_NewUserWelcome.txt
It will ask for First Name and Password, those values don't matter as they aren't in the NewUserWelcome.txt file, you can put anything there or press enter
Email the user the text of the new user welcome
It turns out they do have a ROOT distribution for SLC6; they just don't announce it on the ROOT page (odd!).
source /cvmfs/sft.cern.ch/lcg/views/LCG_95/x86_64-slc6-gcc7-opt/setup.sh
After installing new certificates, the xrootd service failed.
[root@hepcms-0 xrd]# service cmsd restart
Shutting down xrootd (cmsd, default): [ OK ]
Starting xrootd (cmsd, default): [ OK ]
[root@hepcms-0 xrd]# service xrootd status
[default] xrootd dead but pid file exists
[root@hepcms-0 xrd]# service cmsd status
[default] cmsd (pid 18945) is running...
First tried removing the stale pid file:
[root@hepcms-0 xrd]# ps -eaf | grep pid
xrootd 18945 1 0 11:27 ? 00:00:05 /usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k fifo -b -s /var/run/xrootd/cmsd-default.pid -n default
root 19675 17790 0 11:40 pts/0 00:00:00 grep pid
[root@hepcms-0 xrd]#
root@hepcms-0 xrd]# ls -slrt /var/run/xrootd/
total 20
0 prw-r----- 1 xrootd xrootd 0 Oct 20 12:24 ofsEvents
4 -rw-r--r-- 1 xrootd xrootd 5 Oct 27 07:21 xrootd.pid
4 -rw-r--r-- 1 xrootd xrootd 169 Oct 27 07:21 xrootd.anon.env
4 -rw-r--r-- 1 xrootd xrootd 68 Jan 5 11:51 cmsd.pid
4 -rw-r--r-- 1 xrootd xrootd 5 Jan 5 11:51 cmsd-default.pid
4 -rw-r--r-- 1 xrootd xrootd 167 Jan 5 11:51 cmsd.anon.env
[root@hepcms-0 xrd]# mv /var/run/xrootd/xrootd.pid /var/run/xrootd/xrootd.pid-old
[root@hepcms-0 xrd]# service xrootd start
Starting xrootd (xrootd, default): [FAILED]
[root@hepcms-0 xrd]# service xrootd restart
Shutting down xrootd (xrootd, default): [FAILED]
Starting xrootd (xrootd, default): [FAILED]
Killed all PIDs associated with xrootd, obtained via:
root@hepcms-0 ~]# ps -ef | grep xroot
xrootd 18990 1 0 11:27 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
xrootd 20223 1 0 11:43 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
xrootd 21145 1 0 11:51 ? 00:00:08 /usr/bin/cmsd -l /var/log/xrootd/cmsd.log -c /etc/xrootd/xrootd-clustered.cfg -k fifo -b -s /var/run/xrootd/cmsd-default.pid -n default
xrootd 21190 21145 0 11:51 ? 00:00:00 perl /usr/share/xrootd/utils/XrdOlbMonPerf 30
Located an error message in /var/log/xrootd/xrootd.log: xrootd.t2.ucsd.edu:9930 '; Name or service not known'.
Edited /etc/xrootd/xrootd-clustered.cfg according to instructions here:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/XRootDMonitoring#XRootD_Site_Configuration
service xrootd start worked.
hadoop12 and hadoop8 lost their partitions, so hadoop was filling up the / area while trying to write to /hadoop12.
Moved /hadoop12/data/current to /hadoop1.
[root@r510-0-11 ~]# parted /dev/sda print all | grep /dev
Disk /dev/sda: 2000GB
Disk /dev/sdb: 2000GB
Disk /dev/sdc: 2000GB
Disk /dev/sdd: 2000GB
Disk /dev/sde: 2000GB
Disk /dev/sdf: 2000GB
Error: /dev/sdl: unrecognised disk label
Disk /dev/sdj: 2000GB
Disk /dev/sdk: 2000GB
Disk /dev/sdg: 2000GB
Disk /dev/sdi: 2000GB
Error: /dev/sdh: unrecognised disk label
[root@r510-0-11 ~]# mkfs.ext4 /dev/sdl
[root@r510-0-11 ~]# mkfs.ext4 /dev/sdh
root@r510-0-11 ~]# blkid
/dev/sdh: UUID="2e4df78f-1cdc-4243-b309-f24f20154e14" TYPE="ext4"
/dev/sdl: UUID="cdb099c6-1798-4c8e-84cd-7b2f59641110" TYPE="ext4"
Added the UUIDs back to /etc/fstab and mounted the disks again (example entries sketched below).
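A sketch of what the two /etc/fstab entries might look like; the UUID-to-mountpoint pairing and the "defaults" mount options are assumptions for illustration, so check them against the existing entries:
UUID=2e4df78f-1cdc-4243-b309-f24f20154e14  /hadoop8   ext4  defaults  0 0
UUID=cdb099c6-1798-4c8e-84cd-7b2f59641110  /hadoop12  ext4  defaults  0 0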
[root@r510-0-11 ~]# mount -a /hadoop12
[root@r510-0-11 ~]# mount -a /hadoop8
make sure both /hadoop disks have data directories
ls -slrt /hadoop*
/hadoop12:
total 20
16 drwx------ 2 root root 16384 Mar 25 13:10 lost+found
4 drwxr-xr-x 2 hdfs hadoop 4096 Mar 25 13:45 data
/hadoop8:
total 20
16 drwx------ 2 root root 16384 Mar 25 13:27 lost+found
4 drwxr-xr-x 2 hdfs hadoop 4096 Mar 25 13:45 data
Also check that the hadoop disks are not masked (i.e. that they are still listed in /etc/hadoop/conf/hdfs-site.xml):
[root@r510-0-11 ~]# grep hadoop12 /etc/hadoop/conf/hdfs-site.xml
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop4/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop11/data,/hadoop12/data</value>
[root@r510-0-11 ~]#
restart hadoop service
[root@r510-0-11 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
stopping datanode
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r510-0-11.privnet.out
[root@r510-0-11 ~]#
CE troubleshooting
https://opensciencegrid.org/docs/compute-element/troubleshoot-htcondor-ce/
The error is generated in the .bashrc login file:
# CMSSW
export VO_CMS_SW_DIR=/cvmfs/cms.cern.ch/
. $VO_CMS_SW_DIR/cmsset_default.sh
Unmount and mount cvmfs:
umount -l /cvmfs/cms.cern.ch ; mount /cvmfs/cms.cern.ch
To check or release the jobs, go to the scheduler (in1 or in2):
[jabeen@hepcms-in1 ~]$condor_q -hold -af HoldReason
Error from slot2@r510-0-4.privnet: Failed to execute '/data/users/ahorst/hgcal_tile/build/condor-executable.sh': (errno=13: 'Permission denied') 637568.0 [????????????] [?????????] Error from slot2@r510-0-4.privnet: Failed to execute '/data/users/ahorst/hgcal_tile/build/condor-executable.sh': (errno=13: 'Permission denied')
[jabeen@hepcms-in1 ~]$ ls -lsrt /data/users/ahorst/hgcal_tile/build/condor-executable.sh
4 -rw-r--r-- 1 ahorst users 2035 Jun 15 09:47 /data/users/ahorst/hgcal_tile/build/condor-executable.sh
The script does not seem to have executable permissions.
You can also use
condor_q -analyze 3393151.0
Once the hold reason is fixed, release the job using its ID:
condor_release 3393151.0
or release all your jobs as:
condor_release jabeen
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SiteConfInGitlab
https://gitlab.cern.ch/SITECONF/T3_US_UMD
Changed the file on the git site for UMD.
This storage.xml file is the one in the cvmfs area and is owned by cvmfs. Changes are propagated from git to the local cluster in an hour or so.
[jabeen@hepcms-0 ~]$ cd /cvmfs/cms.cern.ch/SITECONF/T3_US_UMD/PhEDEx/
The file in the hepcms-se xrootd area should be identical.
[root@hepcms-0 xrootd]# ls -slrt /etc/xrootd/storage.xml
[root@hepcms-0 xrootd]# cp storage.xml storage.xml_Feb2018
[root@hepcms-0 xrootd]# emacs -nw storage.xml
You can update the git file manually.
If the storage.xml file in your cvmfs area hasn't updated to the latest git commit at UMD yet, you could try the following as root:
# cvmfs_talk -i cms.cern.ch evict /SITECONF/T3_US_UMD/PhEDEx/storage.xml
OK
# stat /cvmfs/cms.cern.ch//SITECONF/T3_US_UMD/PhEDEx/storage.xml
File: `/cvmfs/cms.cern.ch//SITECONF/T3_US_UMD/PhEDEx/storage.xml'
Size: 1384 Blocks: 3 IO Block: 4096 regular file
Device: 1ah/26d Inode: 188799077 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 498/ cvmfs) Gid: ( 498/ cvmfs)
Access: 2018-02-16 13:43:28.000000000 -0600
Modify: 2018-02-16 13:43:28.000000000 -0600
Change: 2018-02-16 13:43:28.000000000 -0600
#
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
Changed all instances of /sharesoft/cmssw to /cvmfs/cms.cern.ch.
From Stephan Lammel at FNAL:
T3_US_UMD is currently using Posix cp as stage-out command.
Once we go to Singularity, a command that understands certificates
would be needed. Can i ask you what your plans are? Would it be
possible to resolve/replace cp with, for instance, gfal2 or xrdcp
now? (The SAM WN-mc test requires role=production and we would
like to switch it to lcgadmin and ship a certificate with it
instead. This, however, will not work in case of Posix cp.)
https://gitlab.cern.ch/SITECONF/T3_US_UMD/blob/master/JobConfig/site-local-config.xml
changed
<command value="cp" />
<catalog url="trivialcatalog_file://sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx/storage.xml?protocol=direct"/>
to
<command value="gfal2"/>
<catalog url="trivialcatalog_file://sharesoft/cmssw/SITECONF/T3_US_UMD/PhEDEx/storage.xml?protocol=srmv2"/>
(For instance to switch from cp to gfal2, which most sites use.)
Instructions on how to change files of your site in SITECONF are
at https://twiki.cern.ch/twiki/bin/view/CMSPublic/SiteConfInGitlab .
Grid SAM metric 13 critical and 15 warning, and GRID RSV
org.osg.srm.srmcp-readwrite
13 org.cms.SRM-VOPut (/cms/Role_production)
15 org.cms.SRM-VOGet (/cms/Role_production)
Detailed output of Metric Result
Field Value
Hostname hepcms-0.umd.edu
Metric org.cms.SRM-VOPut
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2017-08-22T21:45:05Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=1800558125/CN=1185662883
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.9.3
VOPut: Copy file using gfal.filecopy().
Parameters:
source: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-0.umd.edu/testFile.txt
dest: srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-0.umd.edu/lcg-util/testfile-put-nospacetoken-1503438003-1b90049def0b.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2017-08-22 23:41:21.609877
ERROR: DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on etf-18.cern.ch reports Error reading token data header: Connection reset by peer
VO specific Detailed Output: None critical= 1 File was NOT copied to SRM. file= testfile-put-nospacetoken-1503438003-1b90049def0b.txt
metricName: org.osg.srm.srmcp-readwrite
metricType: status
timestamp: 2017-08-22 18:34:51 EDT
metricStatus: CRITICAL
serviceType: OSG-SRM
serviceURI: hepcms-0.umd.edu:8443
gatheredAt: hepcms-1.umd.edu
summaryData: CRITICAL
detailsData: Failed to transfer file to remote server.
Command: gfal-copy 'file:///usr/share/rsv/probe-helper-files/storage-probe-test-file' 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1503440880.3035853' 2>&1
Output from gfal-copy:
gfal-copy error: 70 (Communication error on send) - DESTINATION SRM_PUT_TURL srm-ifce err: Communication error on send, err: [SE][PrepareToPut][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-1.umd.edu reports Error reading token data header: Connection reset by peer
Copying 306 bytes file:///usr/share/rsv/probe-helper-files/storage-probe-test-file => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1503440880.3035853
Bestman seems to be running a large number of threads; shutting down and restarting bestman didn't help.
The log shows this exception:
[root@hepcms-0 xrd]# more /var/log/bestman2/bestman2.log
securePort=8443
-- done with listing web service parameters --
BeStMan: space mgt component is disabled.
[Note:] srmcacheKeywordOn is set to true automatically when space mgt is disabled.
............ no static tokens defined for bestman
.........local SRM is on: httpg://hepcms-0.umd.edu:8443/srm/v2/server current user:bestman
.... using gsi connection.
...appling /etc/bestman2/conf/WEB-INF/jetty.xml
........pool:null qtp310490400{10<=0<=0/256,-1}
..........acceptQueueSize:0
..................acceptor:1
java.net.BindException: Address already in use
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
It turns out there are thousands of processes running on that port.
[root@hepcms-0 xrd]# lsof -i:8443
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 30784 bestman 50u IPv6 98423388 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts430.cern.ch:34328 (ESTABLISHED)
java 30784 bestman 71u IPv6 45003155 0t0 TCP *:pcsync-https (LISTEN)
java 30784 bestman 72u IPv6 97613623 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts432.cern.ch:42720 (ESTABLISHED)
java 30784 bestman 74u IPv6 108512931 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts433.cern.ch:52592 (CLOSE_WAIT)
java 30784 bestman 75u IPv6 98581670 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts435.cern.ch:38534 (ESTABLISHED)
java 30784 bestman 76u IPv6 108512932 0t0 TCP hepcms-0.umd.edu:pcsync-https->bighep.ucr.edu:37924 (CLOSE_WAIT)
java 30784 bestman 77u IPv6 108512933 0t0 TCP hepcms-0.umd.edu:pcsync-https->fts433.cern.ch:52600 (CLOSE_WAIT)
Killed all these processes:
[root@hepcms-0 xrd]# kill 30784
[root@hepcms-0 xrd]# lsof -i:8443
[root@hepcms-0 xrd]#
[root@hepcms-0 xrd]#
Stop and start bestman again:
[root@hepcms-0 xrd]# service bestman2 stop
Shutting down bestman2: [ OK ]
[root@hepcms-0 xrd]# service bestman2 start
Starting bestman2: [ OK ]
This fixed the 100% CPU issue
Cleaned cvmfs for the @all and @vm nodes. NOTE: gridftp is not yet part of the vm group for clush, so did it separately.
on [root@hepcms-hn ~]#
ssh-agent $SHELL
ssh-add
clush -w @all /data/osg/scripts/fixCVMFS.sh
clush -w @all df -ah | grep cvmfs
clush -w @vm /data/osg/scripts/fixCVMFS.sh
clush -w @vm df -ah | grep cvmfs
[root@hepcms-gridftp ~]# /data/osg/scripts/fixCVMFS.sh
[root@hepcms-gridftp ~]# service globus-gridftp-server status
GridFTP server is running (pid=11545)
In about 10 minutes crab checkwrite was a success, and later the RSV and SAM metrics were green as well.
[jabeen@hepcms-in2 src]$ cmsenv
voms-proxy-init -voms cms
source /cvmfs/cms.cern.ch/crab3/crab.csh
crab checkwrite --site=T3_US_UMD
metricName: org.osg.srm.srmping metricType: status timestamp: 2018-02-13 14:33:03 EST metricStatus: OK serviceType: OSG-SRM serviceURI: hepcms-0.umd.edu:8443 gatheredAt: hepcms-1.umd.edu summaryData: OK detailsData: SRM server running at hepcms-0.umd.edu:8443 is alive and responding to the srm-ping command. Output from srm-ping: srm-ping 2.2.2.3.0 Wed Nov 7 16:03:09 CST 2012 BeStMan and SRM-Clients Copyright(c) 2007-2012, Lawrence Berkeley National Laboratory. All rights reserved. Support at SRM@LBL.GOV and documents at http://sdm.lbl.gov/bestman OSG Support at osg-software@opensciencegrid.org and documentation at https://www.opensciencegrid.org/bin/view/Documentation/Release3/ ############################################################## # SRM_HOME = /etc/bestman2 # BESTMAN_LIB = /usr/share/java/bestman2 # JAVA_HOME = /etc/alternatives/java_sdk java version "1.7.0_151" OpenJDK Runtime Environment (rhel-2.6.11.0.el6_9-x86_64 u151-b00) OpenJDK 64-Bit Server VM (build 24.151-b00, mixed mode) # BESTMAN_SYSCONF = /etc/sysconfig/bestman2 ############################################################## ################################################################# # BeStMan and BeStMan Clients Copyright(c) 2007-2011, # Lawrence Berkeley National Laboratory. All rights reserved. # Support at SRM@LBL.GOV and documents at http://sdm.lbl.gov/bestman ################################################################# # # BESTMAN_SYSCONF contains both external env settings and internal definitions #
Problem.
Solution
This means the SAM test file is missing from your storage (at the least, there may be more problems). I used the central phedex machine to drop that file in place where I think SAM is going to look for it. We'll see if xrootd will pass now:
Carl Lundstedt
refer to this link: condor_config
For priority tag see this link
Might be due to extra condor_schedd process running on interactive node.
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
condor 9911 0.0 3.2 668508 529988 ? Ss 2016 55:47 condor_schedd -f
condor 1938084 0.0 0.1 121632 25228 ? S Mar09 2:49 condor_schedd [extra process]
root 2979082 0.0 0.0 6452 724 pts/21 S+ 21:32 0:00 grep condor_schedd
[root@hepcms-in2 condor]# kill 1938084
[root@hepcms-in2 condor]# kill 9911
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
[checked that condor_schedd is killed]
root 2979085 0.0 0.0 6448 692 pts/21 S+ 21:33 0:00 grep condor_schedd
[root@hepcms-in2 condor]# service condor restart [restart service]
Stopping Condor daemons: [ OK ]
Starting Condor daemons: [ OK ]
[root@hepcms-in2 condor]# ps aux | grep condor_schedd
condor 2979179 1.6 0.0 102888 8592 ? Ss 21:33 0:00 condor_schedd -f
root 2979213 0.0 0.0 6452 728 pts/21 S+ 21:33 0:00 grep condor_schedd
User grid jobs are failing with status 60321: "Site related issue: no space, SE down, refused connection".
Also checkwrite errors.
Two RSV tests failing - critical
1 of 16 - metricName: org.osg.srm.srmcp-readwrite
16 of 16: Running metric org.osg.globus.gridftp-simple
metricName: org.osg.globus.gridftp-simple
metricName: org.osg.srm.srmcp-readwrite metricType: status timestamp: 2017-02-11 19:28:03 EST metricStatus: CRITICAL serviceType: OSG-SRM serviceURI: hepcms-0.umd.edu:8443 gatheredAt: hepcms-1.umd.edu summaryData: CRITICAL detailsData: Failed to transfer file to remote server. Command: gfal-copy 'file:///usr/share/rsv/probe-helper-files/storage-probe-test-file' 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1486859280.706269' 2>&1 Output from gfal-copy: gfal-copy error: 13 (Permission denied) - DESTINATION SRM_PUT_TURL srm-ifce err: Permission denied, err: [SE][PrepareToPut][SRM_AUTHORIZATION_FAILURE] httpg://hepcms-0.umd.edu:8443/srm/v2/server: not mapped./DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-1.umd.edu Copying 306 bytes file:///usr/share/rsv/probe-helper-files/storage-probe-test-file => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/osg/rsv/storage-probe-test-file.1486859280.706269 EOT
Solution:
The http cert on hepcmsdev-6 (gums server) had expired; renew it and restart mysqld and tomcat6 on hepcmsdev-6.
This could also happen if a new cert does not have the right permissions. For example, bestmancert.pem is not owned by bestman.
Grid SAM SRM CRITICAL, also checkwrite failure
Metric 13 CRITICAL and 15 WARNING showed that the gridftp cert on hepcms-gridftp had expired.
This also made bestman2 on hepcms-se 'dead'.
13 org.cms.SRM-VOPut (/cms/Role_production)
15 org.cms.SRM-VOGet (/cms/Role_production)
Field Value
Hostname hepcms-0.umd.edu
Metric org.cms.SRM-VOPut
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2017-07-14T13:22:22Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=22887794/CN=116673394
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.9.3
VOPut: Copy file using gfal.filecopy().
Parameters:
source: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-0.umd.edu/testFile.txt
dest: srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-0.umd.edu/lcg-util/testfile-put-nospacetoken-1500038540-8a81d45b79fb.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2017-07-14 15:22:20.847726
ERROR: globus_ftp_client: the server responded with an error 530 530-globus_xio: Server side credential failure 530-globus_gsi_gssapi: Error with GSI credential 530-globus_gsi_gssapi: Error with gss credential handle 530-globus_credential: Error with credential: The host credential: /etc/grid-security/hostcert.pem 530- with subject: /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu 530- has expired 1151 minutes ago. 530- 530 End.
VO specific Detailed Output: None critical= 1 File was NOT copied to SRM. file= testfile-put-nospacetoken-1500038540-8a81d45b79fb.txt
Solution:
Renewed the gridftp certs on hepcms-gridftp.
For details see the cert renewal instructions.
Gridftp cancelling transfer due to over-load limit. Also, SAM metric 12 (VOPut error).
metricName: org.osg.globus.gridftp-simple
metricType: status
timestamp: 2018-04-04 13:18:10 EDT
metricStatus: CRITICAL
serviceType: GridFTP
serviceURI: hepcms-gridftp.umd.edu
gatheredAt: hepcms-1.umd.edu
summaryData: CRITICAL
detailsData: Successful transfer to remote host.
Failed to transfer from remote host.
Command: globus-url-copy 'gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/osg/rsv/gridftp-probe-test-file.1522861680.1747958.remote' 'file:///tmp/gridftp-probe-test-file.1522861680.1747958.local' 2>&1
Output:
error: globus_ftp_client: the server responded with an error
530 Login incorrect. : Server is cancelling transfer due to over-load limit (host=hepcms-gridftp.umd.edu, user=rsv, path=(null))
on hepcms-gridftp
Ran fixCVMFS.
Also, /var was 90% full, so removed old log files.
This seems to fix the SAM errors, but not RSV read/write and gridftp.
Field Value
Hostname hepcms-gridftp.umd.edu
Metric org.cms.SRM-VOGet
VOFQAN /cms/Role=production
Service Flavour SRM
Timestamp 2018-04-04T18:28:38Z
Status CRITICAL
Summary CRITICAL:
Details
CRITICAL:
Testing from: etf-18.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=sciaba/CN=430796/CN=Andrea Sciaba/CN=596854710/CN=1175387511/CN=1369956787/CN=1557781315
VOMS FQANs: /cms/Role=production/Capability=NULL, /cms/ALARM/Role=NULL/Capability=NULL, /cms/Role=NULL/Capability=NULL, /cms/TEAM/Role=NULL/Capability=NULL
gfal2 2.14.2
2018-04-04T18:24:00Z
Source: gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-gridftp.umd.edu/lcg-util/testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
Destination: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-gridftp.umd.edu/testFileIn.txt
Get file using gfal.filecopy().
Parameters:
source: gsiftp://hepcms-gridftp.umd.edu//mnt/hadoop/cms/store/unmerged/SAM/testSRM/SAM-hepcms-gridftp.umd.edu/lcg-util/testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
dest: file:///var/lib/gridprobes/cms.Role.production/org.cms/SRM/hepcms-gridftp.umd.edu/testFileIn.txt
src_spacetoken:
dst_spacetoken:
timeout: 120
StartTime of the transfer: 2018-04-04 20:24:00.610629
ERROR: Could not open source: globus_ftp_client: the server responded with an error 530 Login incorrect. : Server is cancelling transfer due to over-load limit (host=hepcms-gridftp.umd.edu, user=sam, path=(null))
2018-04-04T18:28:38Z
VO specific Detailed Output: None critical= 1 File was NOT copied from SRM. file= testfile-put-nospacetoken-1522865933-f53f33fe9b45.txt
Also, top on hepcms-gridftp is extremely busy with Young's processes. That could also be why transfers are failing with overload errors.
RSV tests are still red: srmcp read/write and gridftp. Let's see if they clear.
This did not fix it.
But noticed hadoop is not mounted on gridftp.
Mounted hadoop and restarted the gridftp service:
[root@hepcms-gridftp ~]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp ~]# mount -a /mnt/hadoop
[root@hepcms-gridftp ~]# service globus-gridftp-server start
Starting globus-gridftp-server: [ OK ]
[root@hepcms-gridftp ~]#
Finally read-write RSV test was green but gridftp was still refusing transfers because of over-load.
On the head node, the cvmfs fix seems to fix it:
clush -w @all /data/osg/scripts/fixCVMFS.sh
Everything is back to normal
For now use this workaround:
voms-proxy-init -voms cms
cp /tmp/x509up_u`id -u` ~/
The first line should create your proxy file in the /tmp/ area, which is needed by condor to use your proxy.
The second copies it to your home area because condor can see files in this area.
Then you can add a line in your .jdl explicitly telling condor where to look for the proxy file:
x509userproxy = /home/yhshin/x509up_u1112
We should in fact use the solution below, but as of 16 Feb the solution below does not work. Sent email to T3:
From: https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookXrootdService#OpenCondor
Open a file in Condor Batch or CERN Batch
Condor
If one wants to use the local condor batch to analyze user/group skims located at remote sites, the only modification needed is adding:
use_x509userproxy = true
in your condor jdl file (the file which defines universe, Executable, etc..).
For OLDER versions of HTCondor (before 8.0.0), you need:
x509userproxy = /tmp/x509up_uXXXX
The string /tmp/x509up_uXXXX is the string in the "path:" statement from output of "voms-proxy-info -all", which contains your valid grid proxy. Condor will pass this information to the working node of the condor batch.
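A minimal condor submit-file sketch showing where this line goes; the executable and output/log file names are placeholders, not from this page:
universe = vanilla
executable = run.sh
use_x509userproxy = true
# for HTCondor older than 8.0.0, give the proxy path explicitly instead:
# x509userproxy = /tmp/x509up_uXXXX
output = job.out
error = job.err
log = job.log
queue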
Instructions to duplicate issue:
mkdir dummy
cd dummy
cmsrel CMSSW_8_0_6
cd CMSSW_8_0_6/src/
cmsenv
source /cvmfs/cms.cern.ch/crab3/crab.sh
voms-proxy-init -voms cms
crab checkwrite --site=T3_US_UMD
Output of the "checkwrite" command:
Will check write permission in the default location /store/user/<username>
Retrieving DN from proxy...
DN is: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=yoshin/CN=742847/CN=Young Ho Shin
Retrieving username from SiteDB...
Username is: yoshin
Validating LFN /store/user/yoshin...
LFN /store/user/yoshin is valid.
Will use `gfal-copy`, `gfal-rm` commands for checking write permissions
Will check write permission in /store/user/yoshin on site T3_US_UMD
Attempting to create (dummy) directory crab3checkwrite_20160617_201033 and copy (dummy) file crab3checkwrite_20160617_201033.tmp to /store/user/yoshin
Executing command: env -i X509_USER_PROXY=/tmp/x509up_u1112 gfal-copy -p -v -t 180 file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp 'srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp'
Please wait...
Failed running copy command
Stdout:
Copying 85 bytes file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
event: [1466208644235] BOTH GFAL2:CORE:COPY LIST:ENTER
event: [1466208644236] BOTH GFAL2:CORE:COPY LIST:ITEM file:///home/yhshin/dummy/CMSSW_8_0_6/src/crab3checkwrite_20160617_201033.tmp => srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
event: [1466208644236] BOTH GFAL2:CORE:COPY LIST:EXIT
event: [1466208648618] BOTH SRM PREPARE:ENTER
Stderr:
WARNING Failed to ping srm://hepcms-0.umd.edu:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/user/yoshin/crab3checkwrite_20160617_201033/crab3checkwrite_20160617_201033.tmp
WARNING Transfer failed with: DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-in1.umd.edu reports Error reading token data header: Connection reset by peer
gfal-copy error: 70 (Communication error on send) - DESTINATION MAKE_PARENT srm-ifce err: Communication error on send, err: [SE][Mkdir][] httpg://hepcms-0.umd.edu:8443/srm/v2/server: CGSI-gSOAP running on hepcms-in1.umd.edu reports Error reading token data header: Connection reset by peer
Checkwrite Result:
Unable to check write permission in /store/user/yoshin on site T3_US_UMD
Please try again later or contact the site administrators sending them the 'crab checkwrite' output as printed above.
Note: You cannot write to a site if you did not ask permission.
Solution:
bestman2 runs out of memory (on hepcms-se); verify via /var/log/messages.
Fixed it by restarting the bestman2 service (on hepcms-se).
Result:
Checkwrite Result:
Success: Able to write in /store/user/yoshin on site T3_US_UMD
Note that alternately this (and other CRAB writing output issues) could be due to the following (a few quick checks are sketched after this list):
gridftp not running (hepcms-gridftp)
The user doesn't have a SE account (https://sites.google.com/a/physics.umd.edu/umdt3/user-guide/submitting-analysis-jobs#TOC-To-stage-your-data-back-to-the-hepcms-SE: to request)
The user didn't authenticate with their grid proxy properly (they should have gotten an error about that)
The user's grid certificate is not properly linked in GUMS to their SE user that owns /store/user/CERNusername, or GUMS is not working on hepcmsdev-6 (hepcms-gums)
hadoop is down (see hadoop troubleshooting if the /mnt/hadoop directory's not there)
Some sort of weird hadoop permissions. Note that crab will automatically make *subdirectories* of /store/user/CERNusername for our SE users.
Some node is missing the /store softlink (it's puppetized for everyone)
If it's regular crab job output transfer to SE issues, check the CMS dashboard for the site status of the place the job came from if all our stuff passes tests for SE, CE and hadoop health
A different-looking error appears if there is trouble communicating with CERN.
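A few quick commands corresponding to some of the checks above (run on the relevant nodes; this is just a sketch, not an exhaustive triage):
# on hepcms-gridftp: is the gridftp server running?
service globus-gridftp-server status
# is hadoop mounted where CRAB output lands?
df -h /mnt/hadoop
# is the /store softlink present and not broken?
ls -l /store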
hepcms-gridftp was rebooted August 2nd.
Apparently, /etc/resolv.conf was overwritten at the reboot, messing up the nameservers in resolv.conf. This resulted in not being able to mount /mnt/hadoop or /data.
March 2019: apparently it happens when DNS1 and DNS2 are specified in the ifcfg file. Commented them out (they are also commented out on the HN).
[root@hepcms-gridftp ~]# more /etc/sysconfig/network-scripts/ifcfg-eth1
###DNS1="128.8.74.2"
###DNS2="128.8.76.2"
Old solution:
NOTE: /etc/resolv.conf is overwritten on reboot; a copy is saved as:
4 -rw-r--r-- 1 root root 125 Aug 2 11:56 resolv.conf.save
[root@hepcms-gridftp log]# more /etc/resolv.conf.save
options rotate timeout:1
# This file is being maintained by Puppet.
# DO NOT EDIT
search privnet umd.edu
nameserver 10.1.0.2
The new file somehow had two more addresses, which we commented out.
[root@hepcms-gridftp log]# more /etc/resolv.conf
options rotate timeout:1
# This file is being maintained by Puppet.
# DO NOT EDIT
search privnet umd.edu
nameserver 10.1.0.2
#nameserver 128.8.74.2
#nameserver 128.8.76.2
[root@hepcms-gridftp log]
Now unmount and mount hadoop and start the service.
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# umount /mnt/hadoop
umount: /mnt/hadoop: not mounted
[root@hepcms-gridftp log]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 16G 2.9G 13G 19% /
/dev/sda3 16G 573M 15G 4% /tmp
/dev/sda5 7.9G 2.5G 5.1G 33% /var
10.1.0.1:/export/home
7.2T 1.2T 5.7T 18% /home
10.1.0.7:/data 37T 34T 2.9T 93% /data
cvmfs2 20G 398M 20G 2% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 398M 20G 2% /cvmfs/cms.cern.ch
[root@hepcms-gridftp log]# mount -a /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@hepcms-gridftp log]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 16G 2.9G 13G 19% /
/dev/sda3 16G 573M 15G 4% /tmp
/dev/sda5 7.9G 2.5G 5.1G 33% /var
10.1.0.1:/export/home
7.2T 1.2T 5.7T 18% /home
10.1.0.7:/data 37T 34T 2.9T 93% /data
cvmfs2 20G 398M 20G 2% /cvmfs/config-osg.opensciencegrid.org
cvmfs2 20G 398M 20G 2% /cvmfs/cms.cern.ch
fuse_dfs 198T 134T 64T 68% /mnt/hadoop
[root@hepcms-gridftp log]# ls -slrt /
Make sure soft links to store and hadoop in / are green.
[root@hepcms-gridftp log]# service globus-gridftp-server status
GridFTP server is not running
[root@hepcms-gridftp log]# service globus-gridftp-server start
Starting globus-gridftp-server: [ OK ]
[root@hepcms-gridftp log]#
All of the errors below are resolved.
These should have the same permissions as /tmp:
[root@hepcms-in1 ~]# chmod 1777 /dev/shm
[root@hepcms-in1 ~]# ls -ld /dev/shm
drwxrwxrwt 2 root root 40 Feb 27 19:22 /dev/shm
Add a cron script for user@privnet on the head node, in `/root/cronscripts/EnoPriority.sh` for example.
Add in the crontab on the HN (edit `/var/spool/cron/root`) this entry:
*/20 * * * * /root/cronscripts/EnoPriority.sh
That resets his user priority every 20 minutes.
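The script itself is not reproduced on this page; a minimal sketch of what such a priority-reset script might contain, assuming it uses condor_userprio (the username is a placeholder):
#!/bin/bash
# Reset the accumulated usage for one heavy user so their effective priority recovers.
# (condor_userprio -setfactor could be used instead to pin a fixed priority factor.)
condor_userprio -resetusage username@privnet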
These are documented here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/t3-cluster-building-manual/headnode
so, if the user submits a bunch of jobs right now, he would have priority to start taking over the cluster *as existing jobs leave*, even over the 380 idle jobs waiting to start, for instance:
condor_status -submitters
but the `219` and `25` jobs running would have to finish and they can be up to 24 hours long (crab has limits built in, we do not impose time limits on our condor queue)
https://sites.google.com/a/physics.umd.edu/tier-3-umd/dont-edit/commands/clustershell#TOC-Commands:-
clush -w @r510 -b service hadoop-hdfs-datanode restart
These are on the HN in /etc/clustershell/groups:
all: hepcms-in1,hepcms-in2,r720-0-1,r720-0-2,r720-datanfs,r510-0-1,r510-0-5,r510-0-6,r510-0-9,r510-0-10,r510-0-11,r510-0-4,compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11,hepcms-ce,hepcms-se,hepcms-namenode,hepcms-secondary-namenode,hepcms-squid,hepcms-gums,hepcms-gridftp,foreman-vmtest2
bm: hepcms-in2,r720-0-1,r720-0-2,r720-datanfs,r510-0-1,r510-0-4,r510-0-5,r510-0-6,r510-0-9,r510-0-10,r510-0-11,hepcms-gridftp,compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11
vm: hepcms-in1,hepcms-ce,hepcms-se,hepcms-namenode,hepcms-secondary-namenode,hepcms-squid,hepcms-gums,hepcms-gridftp,foreman-vmtest2
int: hepcms-in1,hepcms-in2,hepcms-in3
compute: compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-10,compute-0-11
r510: r510-0-9,r510-0-5,r510-0-4,r510-0-1,r510-0-6,r510-0-10,r510-0-11
r720: r720-0-1,r720-0-2,r720-datanfs
se: hepcms-se
ce: hepcms-ce
gridftp: hepcms-gridftp
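To double-check what a group expands to before running anything against it, something like this can help (a sketch; nodeset ships with clustershell alongside clush):
nodeset -f @r510
clush -w @r510 -b hostname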
[root@hepcms-hn ~]# clush -w @vm ls -ls /cvmfs/cms.cern.ch/cmsset_default.csh
foreman-vmtest2: ssh: Could not resolve hostname foreman-vmtest2: Name or service not known
clush: foreman-vmtest2: exited with exit code 255
hepcms-se: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-squid: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-squid: exited with exit code 2
hepcms-in1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-namenode: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-namenode: exited with exit code 2
hepcms-secondary-namenode: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-secondary-namenode: exited with exit code 2
hepcms-gums: ls: cannot access /cvmfs/cms.cern.ch/cmsset_default.csh: No such file or directory
clush: hepcms-gums: exited with exit code 2
hepcms-ce: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
[root@hepcms-hn ~]# clush -w @all ls -ls /cvmfs/cms.cern.ch/cmsset_default.csh
foreman-vmtest2: ssh: Could not resolve hostname foreman-vmtest2: Name or service not known
clush: foreman-vmtest2: exited with exit code 255
hepcms-in3: Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
clush: hepcms-in3: exited with exit code 255
compute-0-7: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-8: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-11: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
compute-0-6: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
hepcms-in2: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-4: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-5: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-11: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-9: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r720-0-1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r510-0-1: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
r720-0-2: 2 -rwxr-xr-x 1 cvmfs cvmfs 1259 Feb 1 2017 /cvmfs/cms.cern.ch/cmsset_default.csh
The current group definitions on hepcms-hn (more /etc/clustershell/groups):
###
### File managed by puppet
###
all: hepcms-in2,hepcms-in3,compute-0-8,compute-0-6,compute-0-7,compute-0-11,r720-0-1,r720-0-2,r510-0-1,r510-0-5,r510-0-9,r510-0-11,
vm: se, ce, squid, gums, namenode, secondary-namenode, hepcms-in1
INT: hepcms-in1,hepcms-in2,hepcms-in3,hepcms-in4,hepcms-in5,hepcms-in6,hepcms-in7
compute: compute-0-5,compute-0-6,compute-0-7,compute-0-8,compute-0-11
R510: r510-0-1,r510-0-5,r510-0-9,r510-0-11
R720: r720-0-1,r720-0-2
SE: hepcms-se
CE: hepcms-ce
You can also use
clush -w nodename command
clush -w node, then +nodename to add further nodes (e.g. the rest with the broken fuse mount)
clush -w @nodegroup -nodename +nodename ...
to remove nodes from or add nodes to a group for that interactive session.
from head node as root:
ssh-agent $SHELL
ssh-add
clush -w @all df -h
clush -w @compute cvmfs_config wipecache
clush -w @R510 cvmfs_config wipecache
clush -w @R720 cvmfs_config wipecache
Hadoop logs in /scratch can fill a disk to 100% and show up as a problem in Ganglia.
As root on the offending node:
It turns out that new logs are saved in a different directory; for those, run:
[root@r510-0-5 scripts]# python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(r510-0-5.privnet).log --dir /scratch/hadoop/hadoop-hdfs/
You can use clush for the above command.
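For example, a sketch of running the cleanup on all r510 nodes at once (assuming the -s argument should be each node's own hostname, mirroring the single-node command above):
clush -w @r510 'python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(hostname).log --dir /scratch/hadoop/hadoop-hdfs/'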
Check that the logs are being updated:
ls -alrh /scratch/hadoop/hadoop-hdfs/
If not, restart hadoop:
service hadoop-hdfs-datanode start
[root@r510-0-5 scripts]# service hadoop-hdfs-datanode status
Hadoop datanode is running [ OK ]
Old log directory:
[root@r510-0-5 scripts]# ls -1 /scratch/hadoop/log/
hadoop-hdfs-datanode-R510-0-5.local.log.2015-04-19
[root@r510-0-5 scripts]/data/osg/scripts
[root@r510-0-5 scripts]# python /data/osg/scripts/pyCleanupHadoopLogs.py -k 15 -s $(R510-0-5.local).log --dir /scratch/hadoop/log/
For hadoop commands:
find ./* -mtime +5 -exec rm -rf {} \;
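Given the warning elsewhere on this page about rm and wildcards, a cautious variant is to list the matches first before deleting (same age cutoff, no deletion):
find . -mtime +5 -print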
July 2018: because of a stress test there was about 25 TB of data in the following directories; all of it was removed.
/mnt/hadoop/cms/store/PhEDEx_Debug/
/mnt/hadoop/cms/store/PhEDEx_LoadTest07/
rm -rf LoadTest07_Debug_*
[root@hepcms-hn ~]# swapoff -a && swapon -a
[root@hepcms-hn ~]# sync; echo 1 > /proc/sys/vm/drop_caches
sync flushes the file system buffers. Commands separated by ";" run sequentially.
https://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
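To see the effect, memory and swap usage can be checked before and after (a generic check, not specific to this cluster):
free -m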
from home:
/etc/fstab
[jabeen@hepcms-hn ~]$ more /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed May 13 13:30:22 2015
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=b4133507-22e5-4ba5-8521-6836d7051ca5 / ext4 defaults 1 1
UUID=f8ef7370-5a4e-4206-96ca-69a5f63cd8e6 /export ext4 defaults,usrquota,grpquota 1 2
UUID=5443d7bc-f5f5-4418-a02b-70edb813c428 /scratch ext4 defaults 1 2
UUID=031c91c8-cb63-4e83-a8c2-3e706127e123 /tmp ext4 defaults 1 2
UUID=88a037dd-9936-4f30-b01b-a1edfaabfdeb /var ext4 defaults 1 2
UUID=44898750-d91f-4b02-ad4b-a324c5d20f4d swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.1.0.7:/data /data nfs rw,async,intr,nolock,nfsvers=3 0 0
10.1.0.100:/data2 /data2 nfs rw,async,intr,nolock,nfsvers=3 0 0
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_Physics /CampusBackup nfs nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072
,wsize=524288 0 0
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_HEP_00 /DataCampusBackup nfs nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131
072,wsize=524288 0 0
[jabeen@hepcms-hn ~]$ /etc/export
[jabeen@hepcms-hn ~]$ more /etc/exports
/export 10.0.0.0/255.0.0.0(fsid=1,rw,async,no_subtree_check,no_root_squash)
/export 128.8.164.11(fsid=1,rw,async,no_subtree_check,no_root_squash)
on datanfs
[root@r720-datanfs ~]# more /etc/fstab
# HEADER: This file was autogenerated at Thu Dec 17 16:09:27 -0500 2015
# HEADER: by puppet. While it can still be managed manually, it
# HEADER: is definitely not recommended.
#
# /etc/fstab
# Created by anaconda on Tue Jul 28 11:50:16 2015
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
UUID=0408cd1c-e1b3-4ce8-9a32-1e69f1b44914 / ext4 defaults 1 1
UUID=a92a7d16-cc33-4b41-aae4-2492de2b0daf /data xfs defaults 1 2
UUID=60462962-9ef3-4bad-95f6-000cd5961fc7 /scratch ext4 defaults 1 2
UUID=51ae0cf9-e6cb-4abb-b827-7400605388d0 /tmp ext4 defaults 1 2
UUID=58d92d33-0155-43d9-aab0-72edc1768fb2 /var ext4 defaults 1 2
UUID=039baec5-1182-4265-9c5e-4688e2d410c4 swap swap defaults 0 0
UUID=71d4ccff-a837-43f8-871e-5118d81a413b swap swap defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
10.1.0.1:/export/home /home nfs rw,async,intr,nolock,nfsvers=3 0 0
hadoop-fuse-dfs /mnt/hadoop fuse server=hepcms-namenode.privnet,port=9000,rdbuffer=131072,allow_other 0 0
[root@r720-datanfs ~]# more /etc/exports
# File managed by Puppet, do not edit!
/data 10.1.0.0/16(fsid=1,rw,async,no_subtree_check,no_root_squash) 10.1.255.231(fsid=1,rw,async,no_subtree_check,no
_root_squash)
/data/hadoop 10.1.255.232(fsid=1,rw,async,no_subtree_check,no_root_squash)
On siab-1
/etc/exports had /data2 (rw,sync,no_root_squash)
Replaced it to match the HN's /etc/exports:
/data2 10.0.0.0/255.0.0.0(fsid=1,rw,async,no_subtree_check,no_root_squash)
Now export the directories in /etc/exports with the command
[0806] root@siab-1 ~# exportfs -arv
On hepcms-in2, added this line to /etc/fstab:
10.1.0.100:/data2/home /home nfs rw,async,intr,nolock,nfsvers=3 0 0
If the previous /home mount is stale, first unmount /home and then remount:
umount /home
mount -a /home
The campus backup shares are mounted on hepcms-hn.
The campus backup was unmounted using a lazy unmount, as the normal command gave "device is busy":
umount -l /CampusBackup
umount -l /DataCampusBackup
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_Physics
500G 259G 242G 52% /CampusBackup
nfs.isipnl.nas.umd.edu:/ifs/data/CMNS_HEP_00
9.0T 7.6T 1.5T 84% /DataCampusBackup
VOFQAN /cms/Role=lcgadmin
Service Flavour HTCONDOR-CE
Metric org.cms.WN-xrootd-access
VOFQAN /cms/Role=lcgadmin
Service Flavour HTCONDOR-CE
Hostname hepcms-0.umd.edu
Metric org.cms.SE-xrootd
VOFQAN read
Service Flavour XROOTD
Hostname hepcms-gridftp.umd.edu
Metric org.cms.SRM-VOGet
VOFQAN /cms/Role=production
Service Flavour SRM
Solution:
[root@compute-0-11 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@compute-0-11 ~]# mount /hadoop1
[root@compute-0-11 ~]# mount /hadoop2
[root@compute-0-11 ~]# service hadoop-hdfs-datanode stop
Stopping Hadoop datanode: [ OK ]
no datanode to stop
[root@compute-0-11 ~]# service hadoop-hdfs-datanode start
Starting Hadoop datanode: [ OK ]
Make sure all the individual disk mounts have the correct permissions:
chown hdfs:hadoop /hadoop1/data
Make sure the services are running on hepcms-namenode and the secondary namenode:
service hadoop-hdfs-namenode status
service hadoop-hdfs-secondarynamenode status
on hepcms-namenode
Check whether safemode is ON or OFF: hdfs dfsadmin -safemode get
If safemode is ON, issue the following command to leave it: hdfs dfsadmin -safemode leave
For a working hadoop it should be OFF.
clush -b -w @r510 hadoop fsck / -blocks > hadoop-fsck-pipe-blocks.output
clush -b -w @r720 hadoop fsck / -blocks >> hadoop-fsck-pipe-blocks.output
clush -b -w @compute hadoop fsck / -blocks >> hadoop-fsck-pipe-blocks.output
grep -i --before-context=20 "r510" hadoop-fsck-pipe-blocks.output > hadoop-fsck-pipe-blocks-output.log
grep -i --before-context=20 "compute" hadoop-fsck-pipe-blocks.output >> hadoop-fsck-pipe-blocks-output.log
grep -i --before-context=20 "r720" hadoop-fsck-pipe-blocks.output >> hadoop-fsck-pipe-blocks-output.log
rm hadoop-fsck-pipe-blocks.output
-------
To check individual files:
For some reason, on the interactive nodes hadoop commands default to the local file system. You should use the HDFS path, i.e. `/cms/store/user/...`, if the command refers to the hadoop system.
[kakw@compute-0-6 0000]$ hdfs dfs -ls /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ | head
Found 1523 items
drwxr-xr-x - yhshin_g users 0 2017-11-20 10:50 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/failed
-rw-rw-r-- 2 yhshin_g users 350540231 2017-11-19 20:17 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ntuple_1.root
-rw-rw-r-- 2 yhshin_g users 287446471 2017-11-20 11:21 /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ntuple_10.root
columns are: permissions number_of_replicas userid groupid filesize modification_date modification_time filename
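To check the block health of an individual file or directory (rather than just listing it), hdfs fsck can be pointed at an HDFS path; a sketch using the directory from the listing above:
hdfs fsck /cms/store/user/yoshin/EmJetAnalysis/Analysis-20171103-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20171103/171119_212955/0000/ -files -blocks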
[root@r720-0-1 ~]# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 195044131741696 (177.39 TB)
Present Capacity: 186079690669334 (169.24 TB)
DFS Remaining: 3311928012800 (3.01 TB)
DFS Used: 182767762656534 (166.23 TB)
DFS Used%: 98.22%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 13 (14 total, 1 dead)
Live datanodes:
Name: 10.1.0.18:50010 (r510-0-1.privnet)
Hostname: r510-0-1.privnet
Decommission Status : Normal
Configured Capacity: 23351871590400 (21.24 TB)
DFS Used: 21859788259328 (19.88 TB)
Non DFS Used: 1076121092096 (1002.22 GB)
DFS Remaining: 415962238976 (387.40 GB)
DFS Used%: 93.61%
DFS Remaining%: 1.78%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.30:50010 (r510-0-11.privnet)
Hostname: r510-0-11.privnet
Decommission Status : Normal
Configured Capacity: 21420767245312 (19.48 TB)
DFS Used: 19468739567994 (17.71 TB)
Non DFS Used: 984552755846 (916.94 GB)
DFS Remaining: 967474921472 (901.03 GB)
DFS Used%: 90.89%
DFS Remaining%: 4.52%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.28:50010 (compute-0-7.privnet)
Hostname: compute-0-7.privnet
Decommission Status : Decommissioned
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 1199918022656 (1.09 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 2417253322752 (2.20 TB)
DFS Used%: 31.65%
DFS Remaining%: 63.76%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.33:50010 (compute-0-6.privnet)
Hostname: compute-0-6.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3581889528174 (3.26 TB)
Non DFS Used: 173738649234 (161.81 GB)
DFS Remaining: 35281809408 (32.86 GB)
DFS Used%: 94.49%
DFS Remaining%: 0.93%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.27:50010 (compute-0-11.privnet)
Hostname: compute-0-11.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3585854763008 (3.26 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 31316582400 (29.17 GB)
DFS Used%: 94.59%
DFS Remaining%: 0.83%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.17:50010 (r510-0-9.privnet)
Hostname: r510-0-9.privnet
Decommission Status : Normal
Configured Capacity: 21258521805824 (19.33 TB)
DFS Used: 19962238316544 (18.16 TB)
Non DFS Used: 977815778304 (910.66 GB)
DFS Remaining: 318467710976 (296.60 GB)
DFS Used%: 93.90%
DFS Remaining%: 1.50%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.31:50010 (r510-0-4.privnet)
Hostname: r510-0-4.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21804435677184 (19.83 TB)
Non DFS Used: 1067771875328 (994.44 GB)
DFS Remaining: 372150071296 (346.59 GB)
DFS Used%: 93.81%
DFS Remaining%: 1.60%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.24:50010 (compute-0-8.privnet)
Hostname: compute-0-8.privnet
Decommission Status : Normal
Configured Capacity: 3790909986816 (3.45 TB)
DFS Used: 3579222183936 (3.26 TB)
Non DFS Used: 173738641408 (161.81 GB)
DFS Remaining: 37949161472 (35.34 GB)
DFS Used%: 94.42%
DFS Remaining%: 1.00%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Name: 10.1.0.29:50010 (r510-0-10.privnet)
Hostname: r510-0-10.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21895725548612 (19.91 TB)
Non DFS Used: 1067771898812 (994.44 GB)
DFS Remaining: 280860176384 (261.57 GB)
DFS Used%: 94.20%
DFS Remaining%: 1.21%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.23:50010 (r510-0-6.privnet)
Hostname: r510-0-6.privnet
Decommission Status : Normal
Configured Capacity: 21403204469760 (19.47 TB)
DFS Used: 20167376171392 (18.34 TB)
Non DFS Used: 983660692096 (916.11 GB)
DFS Remaining: 252167606272 (234.85 GB)
DFS Used%: 94.23%
DFS Remaining%: 1.18%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.32:50010 (r510-0-5.privnet)
Hostname: r510-0-5.privnet
Decommission Status : Normal
Configured Capacity: 23244357623808 (21.14 TB)
DFS Used: 21901919940608 (19.92 TB)
Non DFS Used: 1067914788864 (994.57 GB)
DFS Remaining: 274522894336 (255.67 GB)
DFS Used%: 94.22%
DFS Remaining%: 1.18%
Last contact: Tue Aug 22 12:19:00 EDT 2017
Name: 10.1.0.5:50010 (r720-0-2.privnet)
Hostname: r720-0-2.privnet
Decommission Status : Normal
Configured Capacity: 21542112138240 (19.59 TB)
DFS Used: 20266237505536 (18.43 TB)
Non DFS Used: 990715188224 (922.68 GB)
DFS Remaining: 285159444480 (265.58 GB)
DFS Used%: 94.08%
DFS Remaining%: 1.32%
Last contact: Tue Aug 22 12:18:59 EDT 2017
Name: 10.1.0.19:50010 (compute-0-5.privnet)
Hostname: compute-0-5.privnet
Decommission Status : Normal
Configured Capacity: 3761933637632 (3.42 TB)
DFS Used: 3494417171562 (3.18 TB)
Non DFS Used: 226901070742 (211.32 GB)
DFS Remaining: 40615395328 (37.83 GB)
DFS Used%: 92.89%
DFS Remaining%: 1.08%
Last contact: Tue Aug 22 12:19:01 EDT 2017
Dead datanodes:
Name: 10.1.0.6:50010 (r720-0-1.privnet)
Hostname: r720-0-1.privnet
Decommission Status : Normal
Configured Capacity: 0 (0 B)
DFS Used: 0 (0 B)
Non DFS Used: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used%: 100.00%
DFS Remaining%: 0.00%
Last contact: Sun Aug 06 08:28:01 EDT 2017
[root@r720-0-1 ~]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 3.7G 15G 21% /
tmpfs 48G 0 48G 0% /dev/shm
/dev/sdb1 1.8T 1.2T 517G 71% /hadoop1
/dev/sdj1 1.8T 1.2T 520G 71% /hadoop10
/dev/sdk1 1.8T 1.2T 519G 71% /hadoop11
/dev/sdl1 1.8T 1.2T 520G 71% /hadoop12
/dev/sdc1 1.8T 1.2T 519G 71% /hadoop2
/dev/sdd1 1.8T 1.2T 522G 71% /hadoop3
/dev/sde1 1.8T 1.2T 527G 70% /hadoop5
/dev/sdf1 1.8T 1.2T 525G 70% /hadoop6
/dev/sdg1 1.8T 1.2T 523G 70% /hadoop7
/dev/sdh1 1.8T 1.2T 526G 70% /hadoop8
/dev/sdi1 1.8T 1.2T 515G 71% /hadoop9
/dev/sda3 20G 5.3G 13G 29% /scratch
/dev/sda7 72G 650M 68G 1% /tmp
/dev/sda6 7.6G 2.4G 4.8G 34% /var
fuse_dfs 178T 167T 12T 94% /mnt/hadoop
r720-datanfs.privnet:/data
37T 35T 2.1T 95% /data
10.1.0.1:/export/home
7.2T 1.4T 5.5T 20% /home
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
no datanode to stop
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# service hadoop-hdfs-datanode restart
Stopping Hadoop datanode: [ OK ]
no datanode to stop
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]#
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# grep -i warn /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.log
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:-6597879525639482474 on failed volume /hadoop11/data/current
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:-391729697961201349 on failed volume /hadoop11/data/current
2017-08-06 08:28:02,422 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing replica BP-953065178-10.1.0.16-1445909897155:1433264680818142941 on failed volume /hadoop11/data/current
2017-08-06 08:28:03,062 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is shutting down: DataNode failed volumes:/hadoop11/data/current;
2017-08-22 05:47:58,027 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool <registering> (storage id DS-692293337-10.1.0.6-50010-1445911035181) service to hepcms-namenode.privnet/10.1.0.16:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0
2017-08-22 05:49:04,106 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool <registering> (storage id DS-692293337-10.1.0.6-50010-1445911035181) service to hepcms-namenode.privnet/10.1.0.16:9000
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed volumes - current valid volumes: 10, volumes configured: 11, volumes failed: 1, volume failures tolerated: 0
umount -l /hadoop11
mount -a /hadoop11
If /hadoop11 has failed, it needs to be taken out of hadoop: remove it from the list in hdfs-site.xml.
[root@r720-0-1 ~]# vi /etc/hadoop/conf/hdfs-site.xml
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop11/data,/hadoop12/data</value>
Save and exit (C-x C-c in emacs, :wq in vi).
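For reference, after dropping /hadoop11 the value line would look like this (a sketch based on the line above):
<value>/hadoop1/data,/hadoop2/data,/hadoop3/data,/hadoop5/data,/hadoop6/data,/hadoop7/data,/hadoop8/data,/hadoop9/data,/hadoop10/data,/hadoop12/data</value>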
restart the service:
For example to restart the service on all r510s use clush:
[root@hepcms-hn ~]# clush -w @r510 -b service hadoop-hdfs-datanode restart
It didn't automatically disappear from df -ah; waited about half an hour.
If you need to put it back in, add it to /etc/hadoop/conf/hdfs-site.xml and restart the service. You might have to mount /hadoopXX again.
[root@r540-0-21 ~]# service hadoop-hdfs-datanode start
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.out
Failed to start Hadoop datanode. Return value: 1 [FAILED]
[root@r540-0-21 ~]#
See the error in the log file
grep -i exc /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.log
2020-07-14 15:47:28,865 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (r540-0-21.privnet:50010:DataXceiver error processing READ_BLOCK operation src: /10.1.0.14:21620 dst: /10.1.0.102:50010); java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.1.0.102:50010 remote=/10.1.0.14:21620]
Apparently the port was still held by a connection to an r510 node:
[root@r540-0-20 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 4400 hdfs 108u IPv4 24956 0t0 TCP *:50010 (LISTEN)
[root@r540-0-20 ~]#
[root@r540-0-21 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 4136 hdfs 108u IPv4 22859 0t0 TCP *:50010 (LISTEN)
java 4136 hdfs 638u IPv4 100705722 0t0 TCP r540-0-21.privnet:50010->r510-0-1.privnet:33374 (ESTABLISHED)
[root@r540-0-21 ~]#
Killed the process and restarted the hadoop service:
[root@r540-0-21 ~]#
[root@r540-0-21 ~]# kill -9 4136
[root@r540-0-21 ~]#
[root@r540-0-21 ~]#
[root@r540-0-21 ~]# lsof -i:50010
[root@r540-0-21 ~]# systemctl restart hadoop-hdfs-datanode
[root@r540-0-21 ~]# systemctl status hadoop-hdfs-datanode
● hadoop-hdfs-datanode.service - LSB: Hadoop datanode
Loaded: loaded (/etc/rc.d/init.d/hadoop-hdfs-datanode; bad; vendor preset: disabled)
Active: active (exited) since Fri 2020-07-17 14:53:17 EDT; 9s ago
Docs: man:systemd-sysv-generator(8)
Process: 91326 ExecStart=/etc/rc.d/init.d/hadoop-hdfs-datanode start (code=exited, status=0/SUCCESS)
Jul 17 14:53:08 r540-0-21.privnet systemd[1]: Starting LSB: Hadoop datanode...
Jul 17 14:53:08 r540-0-21.privnet su[91357]: (to hdfs) root on none
Jul 17 14:53:08 r540-0-21.privnet hadoop-hdfs-datanode[91326]: starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r540-0-21.privnet.out
Jul 17 14:53:17 r540-0-21.privnet hadoop-hdfs-datanode[91326]: Started Hadoop datanode (hadoop-hdfs-datanode):[ OK ]
Jul 17 14:53:17 r540-0-21.privnet systemd[1]: Started LSB: Hadoop datanode.
[root@r540-0-21 ~]# lsof -i:50010
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 91379 hdfs 108u IPv4 100832449 0t0 TCP *:50010 (LISTEN)
[root@r540-0-21 ~]#
That fixes the issue.
checked health of the node:
[root@r720-0-1 ~]# hadoop fsck / -blocks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Connecting to namenode via http://hepcms-namenode.privnet:50070
FSCK started by root (auth:SIMPLE) from /10.1.0.6 for path / at Tue Aug 22 12:16:58 EDT 2017
....................................................................................................
....................................................................................................
....................................................................................................
............................................................Status: HEALTHY
Total size: 89634301416782 B (Total open files size: 118798848 B)
Total dirs: 58584
Total files: 1060360 (Files currently being written: 2)
Total blocks (validated): 1632671 (avg. block size 54900406 B) (Total open file blocks (not validated): 2)
Minimally replicated blocks: 1632671 (100.0 %)
Over-replicated blocks: 24598 (1.506611 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.024795
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 13
Number of racks: 1
FSCK ended at Tue Aug 22 12:17:20 EDT 2017 in 21951 milliseconds
The filesystem under path '/' is HEALTHY
[root@r720-0-1 ~]# umount /mnt/hadoop
umount: /mnt/hadoop: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
Apparently there are a lot of open files on the node.
[root@r720-0-1 ~]# lsof /mnt/hadoop
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
main 3722473 kakw 8r REG 0,17 317564959 3916093 /mnt/hadoop/cms/store/user/yoshin/EmJetAnalysis/Analysis-20170609-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20170609/170609_100342/0000/ntuple_374.root
main 3722473 kakw 9r REG 0,17 317564959 3916093 /mnt/hadoop/cms/store/user/yoshin/EmJetAnalysis/Analysis-20170609-v0/QCD_HT1000to1500/QCD_HT1000to1500_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/Analysis-20170609/170609_100342/0000/ntuple_374.root
kill -9 3722473 3722475 3722489 3722501 3722723 3722801 3722854 3723314
Now unmount hadoop and then the bad disk; remount and start the service.
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is dead and pid file exists [FAILED]
[root@r720-0-1 ~]# umount /mnt/hadoop
[root@r720-0-1 ~]# umount /dev/sdk1
[root@r720-0-1 ~]# umount /dev/sdk1
umount: /dev/sdk1: not mounted
[root@r720-0-1 ~]# umount /dev/sdk1
umount: /dev/sdk1: not mounted
[root@r720-0-1 ~]# mount -a /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.7.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@r720-0-1 ~]#
[root@r720-0-1 ~]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 3.7G 15G 21% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
devpts 0 0 0 - /dev/pts
tmpfs 48G 0 48G 0% /dev/shm
/dev/sdb1 1.8T 1.2T 517G 71% /hadoop1
/dev/sdj1 1.8T 1.2T 520G 71% /hadoop10
/dev/sdl1 1.8T 1.2T 520G 71% /hadoop12
/dev/sdc1 1.8T 1.2T 519G 71% /hadoop2
/dev/sdd1 1.8T 1.2T 522G 71% /hadoop3
/dev/sde1 1.8T 1.2T 527G 70% /hadoop5
/dev/sdf1 1.8T 1.2T 525G 70% /hadoop6
/dev/sdg1 1.8T 1.2T 523G 70% /hadoop7
/dev/sdh1 1.8T 1.2T 526G 70% /hadoop8
/dev/sdi1 1.8T 1.2T 515G 71% /hadoop9
/dev/sda3 20G 5.3G 13G 29% /scratch
/dev/sda7 72G 885M 68G 2% /tmp
/dev/sda6 7.6G 2.4G 4.8G 34% /var
none 0 0 0 - /proc/sys/fs/binfmt_misc
sunrpc 0 0 0 - /var/lib/nfs/rpc_pipefs
r720-datanfs.privnet:/data
37T 35T 2.0T 95% /data
10.1.0.1:/export/home
7.2T 1.4T 5.5T 20% /home
fuse_dfs 178T 167T 12T 94% /mnt/hadoop
[root@r720-0-1 ~]# service hadoop-hdfs-datanode start
Starting Hadoop datanode: [ OK ]
starting datanode, logging to /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-datanode-r720-0-1.privnet.out
[root@r720-0-1 ~]# service hadoop-hdfs-datanode status
Hadoop datanode is running [ OK ]
[root@r720-0-1 ~]#
node is alive again
[root@r720-0-1 ~]# hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
Configured Capacity: 214627371145216 (195.20 TB)
Present Capacity: 204762305384448 (186.23 TB)
DFS Remaining: 21700580372480 (19.74 TB)
DFS Used: 183061725011968 (166.49 TB)
DFS Used%: 89.40%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 14 (14 total, 0 dead)
Live datanodes:
Name: 10.1.0.18:50010 (r510-0-1.privnet)
Hostname: r510-0-1.privnet
......
Name: 10.1.0.6:50010 (r720-0-1.privnet)
Hostname: r720-0-1.privnet
Decommission Status : Normal
Configured Capacity: 19583239403520 (17.81 TB)
DFS Used: 13090004533248 (11.91 TB)
Non DFS Used: 900624750592 (838.77 GB)
DFS Remaining: 5592610119680 (5.09 TB)
DFS Used%: 66.84%
DFS Remaining%: 28.56%
Last contact: Tue Aug 22 13:17:46 EDT 2017
Source: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
The following command can be run from any hadoop node.
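The command in question is the HDFS balancer (per the link above; it is also run on hepcms-namenode elsewhere in this page), e.g. as root:
hdfs balancer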
You can check the progress with:
[root@hepcms-hn ~]# clush -w @all df -h | grep hadoop
It took 35 hours the first time:
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.31:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.33:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.18:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.27:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.24:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.17:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.30:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.19:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.6:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.32:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.23:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.29:50010
17/04/04 22:09:33 INFO net.NetworkTopology: Adding a new node: /default-rack/10.1.0.5:50010
17/04/04 22:09:33 INFO balancer.Balancer: 0 over-utilized: []
17/04/04 22:09:33 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
Balancing took 35.48840777777778 hours
Problem:
yhshin@r510-0-5 ~]$ ls /store
ls: cannot access /store: Transport endpoint is not connected
[yhshin@r510-0-5 ~]$ ls /mnt/hadoop
Solution:
Nebraska T2 sees this from time to time: it can happen if a job is moving a large file or doing something else that puts a big load on the node, which explains the file system crash that causes the fuse mount issue. The only fix they have is to check, before running a job, that the node is mounted. Carl will send instructions and we can implement it.
All that said, if the job running on the node itself is doing something to crash the file system, then there is nothing we can do.
The following solution is for hepcms-gums
ssh to the offending node
[root@hepcms-gums ~]# ls -alrth /mnt
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
total 8.0K
d?????????? ? ? ? ? ? hadoop
drwxr-xr-x. 3 root root 4.0K Dec 19 10:25 .
dr-xr-xr-x. 28 root root 4.0K Dec 19 10:35 ..
[root@hepcms-gums ~]# umount /mnt/hadoop
[root@hepcms-gums ~]# chown hadoop:hdfs /mnt/hadoop
[root@hepcms-gums ~]# mount /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /mnt/hadoop
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option allow_other
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option dev
INFO /builddir/build/BUILD/hadoop-2.0.0-cdh4.1.1/src/hadoop-hdfs-project/hadoop-hdfs/src/main/native/fuse-dfs/fuse_options.c:115 Ignoring option suid
[root@hepcms-gums ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_root
96G 2.3G 89G 3% /
tmpfs 498M 0 498M 0% /dev/shm
/dev/vda1 477M 70M 383M 16% /boot
r720-datanfs.privnet:/data
37T 30T 6.6T 82% /data
10.1.0.1:/export/home
7.2T 324G 6.5T 5% /home
fuse_dfs 64T 41T 24T 64% /mnt/hadoop
This happened twice this week (June 11-17 2017). r510-0-10 was unmounted because of an unknown issue, and r510-0-11 was unmounted when Shabnam tried to replace it on the 14th. When this happens, the Hadoop logs in /scratch should show when the error occurred and which partition had the problem. Then if you try to ls in the problem partition you will see something like this:
[root@r510-0-11 hadoop8]# ls
ls: reading directory .: Input/output error
First remove the partition from hdfs-site.xml in /etc/hadoop/conf
service hadoop-hdfs-datanode start
unmount Hadoop disk
fsck -y /hadoop#
mount partition
check if disk is working
add the partition back into hdfs-site.xml and restart hadoop-hdfs-datanode (a command-level sketch of these steps follows)
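A command-level sketch of the sequence above, assuming the failed partition is /hadoop8 (substitute the real one):
# 1. remove /hadoop8/data from the dfs.datanode.data.dir list in /etc/hadoop/conf/hdfs-site.xml, then
service hadoop-hdfs-datanode start
# 2. repair the disk
umount /hadoop8
fsck -y /hadoop8
mount /hadoop8
# 3. if the disk checks out, add /hadoop8/data back to hdfs-site.xml, then
service hadoop-hdfs-datanode restart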
In the case of compute nodes we will take them out of hadoop and the blocks will be replicated. Compute nodes do not add much to hadoop in terms of space, so it is better to keep them out of hadoop. Normally, if you take a node out of hadoop, hadoop will replicate the missing blocks.
From an interactive node, run firefox and check the hadoop status:
http://hepcms-namenode.privnet:50070/dfshealth.jsp
You may find out that files were corrupt by checking the web page, because a user complained about a file, or by running:
[root@hepcms-namenode ~]# hdfs fsck /
........................Status: HEALTHY
Total size: 45691338212949 B
Total dirs: 60217
Total files: 304924
Total blocks (validated): 695338 (avg. block size 65710975 B)
Minimally replicated blocks: 695338 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.3250217
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 13
Number of racks: 1
FSCK ended at Wed Jan 18 09:33:11 EST 2017 in 8158 milliseconds
You have verified from the hadoop log on that node (r510-0-5: /scratch/hadoop/hdfs-hadoop/) that the particular disk (say /hadoop4) is not writeable, but you can see it's still mounted, and you can
ls -alh /hadoop4/data
You should NOT have the disk removed from the list in /etc/hadoop/conf/hdfs-site.xml on the datanode (r510-0-5)
The datanode will have its hadoop-hdfs-datanode service stop working because it cannot write to that disk
Login to the hepcms-namenode, become root
Then exclude this datanode (use the internal name, e.g. r510-0-5.privnet) by editing /etc/hadoop/conf/hosts-exclude on hepcms-namenode (a combined command sketch of the exclude/re-include cycle appears after these steps)
hdfs dfsadmin -refreshNodes
Login to the datanode (r510-0-5), become root, and start the datanode service
service hadoop-hdfs-datanode start
Wait quite some time until hadoop reports no more corrupt blocks and the node is "Decommissioned"; don't do any of the following until there are no more corrupt blocks.
Note that if more than one node/disk has failed, the corrupt blocks could be elsewhere; the logs on the namenode should help you figure that out.
At this point you can do one of the following:
leave the datanode decommissioned
or remove the disk from the list in /etc/hadoop/conf/hdfs-site.xml on the datanode (r510-0-5), and service hadoop-hdfs-datanode restart
Fix the disk (umount /hadoop4; fsck /dev/sdd) - note it may fail again in a week or two, be sure to check its health with omreport storage pdisk controller=0
Wipe the disk (as above)
Replace the disk (above)
To allow the datanode back into hadoop, remove its hostname from /etc/hadoop/conf/hosts-exclude on the hepcms-namenode
hdfs dfsadmin -refreshNodes
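A combined sketch of the exclude/re-include cycle on hepcms-namenode (r510-0-5.privnet is the example datanode used above):
# exclude the datanode
echo "r510-0-5.privnet" >> /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes
# ... wait until the node shows as Decommissioned and there are no corrupt blocks ...
# re-include it
sed -i '/r510-0-5.privnet/d' /etc/hadoop/conf/hosts-exclude
hdfs dfsadmin -refreshNodes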
[root@r720-0-1 ~]# mount /data
mount.nfs: Stale file handle
[root@r720-0-1 ~]# ls /data
ls: cannot access /data: Stale file handle
[root@r720-0-1 ~]# umount -nf /data
[root@r720-0-1 ~]# mount /data
[root@r720-0-1 ~]# ls /data
groups osg test-compute-0-2 TESTING users
Check firewall settings on the node the disk mounts from and the node it mounts to.
Check /etc/exports for proper settings on the node the disk mounts from.
Check /etc/fstab for proper settings on the node the disk mounts to.
On the node the disk mounts to, you can run: showmount -e <IP>
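For example, a quick pair of checks for the /data mount (a sketch; r720-datanfs.privnet is the server that exports /data on this cluster):
# on the client
showmount -e r720-datanfs.privnet
# on the server (r720-datanfs)
exportfs -v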
/sharesoft/osg/ce/setup.csh: No such file or directory.
[belt@hepcms-in1 ~]$ su -
Password:
[root@hepcms-in1 ~]# mount /data
mount.nfs: Failed to resolve server r720-datanfs.privnet: Temporary failure in name resolution
Look at /etc/resolv.conf; it was being modified by NetworkManager (/etc/init.d/NetworkManager status). Turn that off and have Puppet keep it from running, in base.pp: service { 'NetworkManager': ensure => 'stopped', enable => false }
[root@compute-0-5 ~]# mount /data
mount.nfs: requested NFS version or transport protocol is not supported
[root@compute-0-5 ~]# mount -t nfs r720-datanfs.privnet:/data /data
[root@compute-0-5 ~]# ls /data
cmssw cvmfs groups gums lost+found osg root_backup root.old scratch share site_conf TESTING users
Note: Don't have a puppet fix at this time (July 7, 2016) as this is using Trey's nfs puppet module
http://serverfault.com/questions/212178/chown-on-a-mounted-nfs-partition-gives-operation-not-permitted
In /etc/exports, no_root_squash is needed as an option.
Check that hepcms-ovirt is up and the VMs are running. Make sure hepcms-foreman is up and healthy: it runs the DNS and routing, so it needs to be up for nodes to read the proper disk settings.
On the machine that rebooted, run ls /home; ls /data. If the output is not what's expected, mount the disks:
mount -a /home; mount -a /data
If they are "already mounted": umount -nf /home (for example; occasionally you may need to run it multiple times to make it work)
Check NIS healthy and working
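A quick NIS sanity check from a client node (a sketch using the standard NIS client tools):
ypwhich
ypcat passwd | head -3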
In all cases, do not delete user files without consulting with them unless it's clear they are breaking the cluster. Assume nothing is backed up! NEVER EVER use wildcards with rm. It is better to write a deletion script, get confirmation from the user that is the list of files to delete, and then run the script.
For /home, login to hepcms-hn.umd.edu, su - to become root, and look at /export/home
For /data, login to any cluster node, then ssh r720-datanfs, su - to become root, and look at /data
For /hadoop, end of Aug. 2015, it's on /data
For actual hadoop (mounted at /mnt/hadoop; once the SE is up it will be soft-linked as /hadoop): login to any cluster node, then ssh hepcms-namenode, su - to become root, and look at /mnt/hadoop
Note that there are hadoop dfs commands one can use that don't use the fuse mount which may be more efficient for file manipulation
https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/hadoopnamenodesetup#TOC-Automated-puppet-fuse-mount-not-working-
[root@r510-0-6 ~]# ls -alrth /mnt
ls: cannot access /mnt/hadoop: Transport endpoint is not connected
total 8.0K
d?????????? ? ? ? ? ? hadoop
drwxr-xr-x. 3 root root 4.0K Jul 15 12:59 .
dr-xr-xr-x. 29 root root 4.0K Jul 15 13:09 ..
[root@r510-0-6 ~]# chown hdfs:hadoop /mnt/hadoop
chown: cannot access `/mnt/hadoop': Transport endpoint is not connected
[root@r510-0-6 ~]# umount /mnt/hadoop
[root@r510-0-6 ~]# chown hdfs:hadoop /mnt/hadoop
[root@r510-0-6 ~]# mount /mnt/hadoop
http://hep-t3.physics.umd.edu/HowToForAdmins/errors.html#errorsHadoopFsck
Before any debugging run service hadoop-hdfs-datanode stop
NOTE: IF YOUR BAD DISK IS /hadoop1 (that is where the OS is stored as well on r510 and compute nodes), do not execute the following commands as they will wipe the OS :) In this scenario it may be necessary to re-kickstart the machine...
First, run df -h to identify the names of the Hadoop disks. On an r510 machine there should be 12 disks; they start with "sd" and go from 'a' through 'l'.
an example of what it should look like with 12 disks:
[root@r510-0-5 ~]# df -h
/dev/sda7 1.6T 1.1T 397G 74% /hadoop1
/dev/sdb1 1.8T 1.3T 438G 75% /hadoop2
/dev/sdc1 1.8T 1.2T 551G 68% /hadoop3
/dev/sdd1 1.8T 1.2T 561G 68% /hadoop4
/dev/sde1 1.8T 86G 1.7T 5% /hadoop5
/dev/sdf1 1.8T 1.2T 557G 68% /hadoop6
/dev/sdg1 1.8T 1.2T 560G 68% /hadoop7
/dev/sdh1 1.8T 1.2T 546G 69% /hadoop8
/dev/sdi 1.8T 68M 1.7T 1% /hadoop9
/dev/sdj1 1.8T 1.2T 527G 70% /hadoop10
/dev/sdk1 1.8T 1.2T 564G 68% /hadoop11
/dev/sdl1 1.8T 1.3T 414G 76% /hadoop12
Here you can see that all 12 disks are present. If any of the hadoop disks are missing, note which /dev/sd? it is, as they are alphabetical. So if it was /dev/sdi that was missing:
run the command lsblk -d
check to see if /dev/sdi (or whichever is your missing disk) is listed on this.
The output should look like this:
[root@r510-0-5 ~]# lsblk -d
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sde 8:64 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sda 8:0 0 1.8T 0 disk
sdh 8:112 0 1.8T 0 disk
sdl 8:176 0 1.8T 0 disk
sdb 8:16 0 1.8T 0 disk
sdi 8:128 0 1.8T 0 disk
sdk 8:160 0 1.8T 0 disk
sdg 8:96 0 1.8T 0 disk
sdc 8:32 0 1.8T 0 disk
sdf 8:80 0 1.8T 0 disk
sdj 8:144 0 1.8T 0 disk
If you don't see your missing disk in this list, then use omreport storage pdisk controller=0 to identify exactly which one it is; it is likely broken at the hardware level and may need to be replaced, or you may attempt re-seating the disk.
Using omreport, identify the disk by its number and status. Log into the Dell OMSA manager in firefox, and flash the LEDs of the drives until you find your target.
If you do see your missing disk in the lsblk -d list:
The first step is to unmount the disk: umount /dev/device (see the note about this at the end); use umount -nf /dev/device if that's not working.
run the command: fsck -y /dev/sdi
If this fails:
run the command mkfs.ext4 /dev/sdi to re-make the file system on that disk
run the command blkid /dev/sdi, take note of its UUID, and update the UUID of that drive in /etc/fstab; make sure you update its file system type from ext3 to ext4 if need be.
run the command mount /dev/sdi, or you could use mount /hadoop# if the drive is labeled. If this gives an error suggesting that the mount point does not exist, make sure the directory /hadoop# exists; if it doesn't, run mkdir /hadoop# and run the mount command again.
Note: if you get an error that it still doesn't exist, make sure you typed the UUID of the device into /etc/fstab correctly; check with the command blkid /dev/sdi.
To label the drive, run the command e2label /dev/sdi /hadoop#. You can also add this label to the /etc/fstab file to make troubleshooting a bit easier, by using LABEL=/hadoop5 in the appropriate line; it will look like this:
LABEL=/hadoop5 /hadoop5 ext3 defaults 1 2 (appending the label is optional in the file, using the e2label command is enough)
Once this is complete, make sure to run service hadoop-hdfs-datanode start to restart hadoop. After a few moments, run service hadoop-hdfs-datanode status to ensure that the repair was successful.
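A compact sketch of the rebuild sequence above, assuming the failed disk is /dev/sdi mounted as /hadoop9 (as in the example df output):
service hadoop-hdfs-datanode stop
umount /dev/sdi            # or umount -nf if it is busy
fsck -y /dev/sdi
# if fsck fails, remake the filesystem, then update /etc/fstab with the new UUID
mkfs.ext4 /dev/sdi
blkid /dev/sdi
e2label /dev/sdi /hadoop9
mount /hadoop9
service hadoop-hdfs-datanode start
service hadoop-hdfs-datanode status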
Run ls -alh /sys/block/sdg/device to identify which physical disk this is; it may also help troubleshoot which disk needs repair.
This is also helpful, how to interpret the info listed:
http://unix.stackexchange.com/questions/40351/how-do-i-correlate-dev-sd-devices-to-the-hardware-they-represent
* After ANY operation that involves stopping the hadoop-hdfs-datanode service, make sure to run service hadoop-hdfs-datanode start to turn it back on (very important)
* If a regular umount /dev/device is not working, use umount -nf, for example when an entry like
/dev/sdh1 1.8T 1.4T 313G 82% /hadoop8
changes into something like: /dev/sdh1 16G 2.6G 13G 17% /hadoop8
service hadoop-hdfs-datanode status returns a failed message
Can check log: tail -n100 /scratch/hadoop/hadoop-hdfs/*.log
then run the command service hadoop-hdfs-datanode start
Run service hadoop-hdfs-datanode status and make sure you get the green "OK"; if it shows as failed, check the logs again.
Oct2020
After a site wide shutdown
oVirt 4.1, on new setups, creates PKI infrastructure that uses SHA256 signatures.
Existing setups upgraded to 4.1 do not currently have PKI migrated.
This Howto explains how to manually migrate the PKI of such setups to use SHA256 signatures.
Previous versions of oVirt used SHA-1 for signatures of SSL certificates created by the internal CA. This is no longer considered secure; see e.g. Firefox, Chrome, Edge/IE, or shattered.io.
See Features/PKI for general details about PKI in oVirt.
If you are only worried by a recent browser warning about, or rejection of, your SHA-1-signed certificate, it might be enough to re-sign only the apache certificate, or only the CA+apache certificates. Currently, this procedure has only been tested in its entirety.
This step is not needed on >= 4.1.
On < 4.1, upgrading to a newer < 4.1 version (e.g. 4.0.6 to 4.0.7) might revert this change, so you need to repeat it per each upgrade until 4.1.
On the engine machine, run these commands:
# Backup existing conf
cp -p /etc/pki/ovirt-engine/openssl.conf /etc/pki/ovirt-engine/openssl.conf."$(date +"%Y%m%d%H%M%S")"
# Edit it to default to SHA256
sed -i 's/^default_md = sha1/default_md = sha256/' /etc/pki/ovirt-engine/openssl.conf
If you only use this procedure because your browser warns/rejects, then it might be enough to skip this part. If your browser requires both the CA cert and the https cert to have SHA256 signatures, you have to complete it.
On the engine machine, run these commands:
# Backup CA cert
cp -p /etc/pki/ovirt-engine/private/ca.pem /etc/pki/ovirt-engine/private/ca.pem."$(date +"%Y%m%d%H%M%S")"
# Create a new cert into ca.pem.new
openssl x509 -signkey /etc/pki/ovirt-engine/private/ca.pem -in /etc/pki/ovirt-engine/ca.pem -out /etc/pki/ovirt-engine/ca.pem.new -days 3650 -sha256
# Replace the existing cert with the new one
/bin/mv /etc/pki/ovirt-engine/ca.pem.new /etc/pki/ovirt-engine/ca.pem
Decide what you want, among the options below:
If only apache httpd (for browsers that reject SHA1 signatures), run:
names="apache"
If also the engine cert:
names="apache engine"
If all normally-existing entities:
names="engine apache websocket-proxy jboss imageio-proxy"
If you replaced the https cert with a cert signed by a 3rd party, you should not include “apache” in above - e.g. use one of:
names="engine"# ornames="engine websocket-proxy jboss imageio-proxy"
If this is a self-hosted-engine, move it to global maintenance.
Run this (in the same terminal of previous subsection above):
for name in $names; do
    subject="$(openssl x509 -in /etc/pki/ovirt-engine/certs/"${name}".cer -noout -subject | sed 's;subject= \(.*\);\1;')"
    /usr/share/ovirt-engine/bin/pki-enroll-pkcs12.sh --name="${name}" --password=mypass --subject="${subject}" --keep-key
done
If you included apache:
systemctl restart httpd
If you included engine:
systemctl restart ovirt-engine
If you included ovirt-websocket-proxy/ovirt-imageio-proxy:
systemctl restart ovirt-websocket-proxy
systemctl restart ovirt-imageio-proxy
If this is a self-hosted-engine, exit global maintenance.
Your browser will likely refuse to continue working with the web admin ui. You might need to restart it and/or remove the engine cert and/or engine ca cert.
In my own case I unchecked “Permanently store this exception” when I first logged in, and after restarting httpd the browser showed an error about using the same serial number. Restarting the browser was enough to login again.
For all of your hosts, one host at a time, using the web admin ui:
Set it to Maintenance
Choose “Enroll Certificates”
Activate
You can do this step at any time, also before starting this procedure.
Certs that use SHA1 will show as having ‘sha1WithRSAEncryption’. Certs that use SHA256 will show as having ‘sha256WithRSAEncryption’.
On engine machine:
openssl x509 -in /etc/pki/ovirt-engine/ca.pem -text | grep Signature
for name in engine apache websocket-proxy jboss imageio-proxy; do echo $name:; openssl x509 -in /etc/pki/ovirt-engine/certs/"${name}".cer -text | grep Signature; done
On hosts:
openssl x509 -in /etc/pki/vdsm/certs/vdsmcert.pem -text | grep Signature
openssl x509 -in /etc/pki/vdsm/certs/cacert.pem -text | grep Signature
Sep 2018
[root@hepcms-ovirt images]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-LV_root
72G 67G 1.7G 98% /
proc 0 0 0 - /proc
sysfs 0 0 0 - /sys
devpts 0 0 0 - /dev/pts
tmpfs 44G 4.0K 44G 1% /dev/shm
/dev/sda2 477M 57M 395M 13% /boot
/dev/mapper/vg_ovirt-lv_ovirt
19T 1.5T 17T 8% /opt/ovirt
none 0 0 0 - /proc/sys/fs/binfmt_misc
sunrpc 0 0 0 - /var/lib/nfs/rpc_pipefs
nfsd 0 0 0 - /proc/fs/nfsd
127.0.0.1:/opt/ovirt/import_export
19T 1.5T 17T 8% /rhev/data-center/mnt/127.0.0.1:_opt_ovirt_import__export
127.0.0.1:/opt/ovirt/iso
19T 1.5T 17T 8% /rhev/data-center/mnt/127.0.0.1:_opt_ovirt_iso
/dev/mapper shows 98% and most of it is due to crash reports.
[root@hepcms-ovirt crash]# pwd
/var/crash
[root@hepcms-ovirt crash]# ls -slrt
total 8
4 drwxr-xr-x 2 root root 4096 Aug 28 2016 127.0.0.1-2016-08-28-15:17:56
4 drwxr-xr-x 2 root root 4096 Jan 14 05:02 127.0.0.1-2018-01-14-04:39:12
[root@hepcms-ovirt crash]# rm -rf 127.0.0.1-2016-08-28-15\:17\:56/
[root@hepcms-ovirt crash]# df -ah
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_sys-LV_root
72G 42G 27G 62% /
Note that right now, common.yaml pins the version of puppet at 3.7.5, which is what the master is at. So you want to change it both in common.yaml and on the puppet master at the same time.
Trey recommends:
I'd stick with 3.7.5 and test the upgrade later. Usually a newer master with older clients will work if the difference is 3.8.x vs 3.7.x. I personally do Puppet tests by taking one of my masters and removing it from round-robin DNS puppet.brazos.tamu.edu then upgrade it and a few clients and do puppet agent --test --server puppetmaster02.brazos.tamu.edu --noop. For a single-master situation, the easiest solution is likely to snapshot the VM , upgrade , then test on a few clients to make sure things are fine. Usually good idea to ensure clients have all changes applied before testing to see if updated Puppet modifies behavior
Clients are easy to rollback, just yum downgrade which can be done by Puppet too
You'll note that hepcms-foreman has a cron job for puppet agent --test --noop, which may list all sorts of common changes. However, it's in a working stage and we do not intend to actually run puppet without "--noop". Please don't change our foreman without being sure you know what you are doing, and certainly never without backing it up.
important things such as iptables may break
https://docs.puppetlabs.com/puppet/latest/reference/man/filebucket.html
Look at the report on hepcms-foreman, and it says something like this: checksum was <md5sum>
you can usually do something like puppet filebucket restore /etc/yp.conf <md5sum>
Example: notice /Stage[main]/Sudo/File[/etc/sudoers]/content content changed '{md5}26bf78728f812c729cfe82b1664e0f5a' to '{md5}4093e52552d97099d003c645f15f9372'
puppet filebucket restore /etc/sudoers 26bf78728f812c729cfe82b1664e0f5a
In common.yaml:
puppet::version: '3.7.5-1.el6'
Note that you need the puppet class on the node to make this take effect:
classes:
- puppet
https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/margueritedebuglog/vmtest
cp -a /etc/puppet /etc/puppet-$(date +%F)
cp -a /var/lib/puppet /var/lib/puppet-$(date +%F)
A simple cp -a /etc/puppet /etc/puppet.bak or something similar should work. You can then use the backup to see what changed, with something like diff -r -w --brief /etc/puppet/ /etc/puppet.bak/ to verify no unexpected changes took place.
-w ignores whitespace
(# is not typed, it's the prompt for [root@hepcms-foreman ~]# ):
# hiera --config /etc/puppet/hiera/production/hiera.yaml foreman_proxy::trusted_hosts ::environment=production ::hostgroup='base/mgmt/dns' ::fqdn=ns01.brazos.tamu.edu
["foreman.brazos.tamu.edu"]
That command will print out the value of foreman_proxy::trusted_hosts when environment=production, Foreman's hostgroup=base/mgmt/dns, and fqdn=ns01.brazos.tamu.edu. The "::" denotes facts, and Foreman's hostgroup value is treated as a fact. If your hiera value is collected in Puppet using `hiera_array` you can use the `--array` option, and for `hiera_hash` you can use the `--hash` option; those options will print out values using the appropriate "merge" functionality. This is a way to test what value will be seen by Puppet for a particular system.
Example:
# hiera --config /etc/puppet/hiera/production/hiera.yaml ntp::servers ::environment=production ::fqdn=r720-datanfs.privnet
WARN: Thu Jul 09 14:10:39 -0400 2015: Cannot load backend eyaml: no such file to load -- hiera/backend/eyaml_backend
["0.centos.pool.ntp.org", "1.centos.pool.ntp.org", "2.centos.pool.ntp.org"]
The only time eyaml_backend is used in your hiera is for encrypted values. We'll ignore those for now, as that concept is easier to deal with once Hiera is better understood.
Restart your puppet master on foreman /etc/init.d/puppetserver restart
puppet agent --test --noop
puppet agent --test --noop --tags nfs
Another example: puppet agent --test --tags profile::base --noop
Also, for running the osg class in the int.yaml file: puppet agent --test --tags profile::osg --noop
puppet agent --test
/etc/init.d/puppet stop
Make the puppet agent not start automatically upon node reboot:
chkconfig puppet off
Start a puppet agent (these run automatically on a node either in kickstart or crontab):
/etc/init.d/puppet start
See if the puppet agent is running: ps ahux | grep puppet. If not, do the steps above by hand; otherwise the change will get picked up automatically by the running agent.
Read the error message, for instance:
Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[osg-se-hadoop-client] is already declared in file /etc/puppet/environments/production/manifests/site.pp:23; cannot redeclare at /etc/puppet/environments/production/modules/profile/manifests/osg/hadoop_client.pp:33 on node foreman-vmtest2.local
In this case it tells you exactly what was declared in two places that the node had implemented (one in a class added in hepcms-foreman, one in site.pp). Try not to remove things from common.yaml or base.pp to resolve these duplicates, as OTHER nodes depend on them!
This particular one was documented here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/hadoopnamenodesetup#TOC-Make-the-Fuse-client-to-mount-hadoop-elsewhere-in-puppet:
These errors may be fine once you run puppet agent --test, if they are the sort that complain about a missing (configuration) file that only exists after puppet actually installs the software. Run puppet agent --test and they should go away.
If they don't go away after running without --noop, read the error and try to figure out what dependency is actually missing.
Use the following example puppet text in the above files, update it in github, run getr10k on foreman, and run puppet agent --test on the node(s).
include ::osg
include ::profile::osg::hadoop_client
https://docs.puppetlabs.com/hiera/1/complete_example.html#assigning-a-class-to-a-node-with-hiera
site.pp needs hiera_include('classes') (in the default node; note that if you have a specific FQDN block, you need it there as well, otherwise the classes: part in the hiera yaml below will be ignored). If there is trouble, try hiera_include('classes',[]).
Your .yaml will have something like:
classes:
- omsa
- profile::osg::hadoop_client
1.) Backup the /var/lib/puppet/ssl folder
mv /var/lib/puppet/ssl /var/lib/puppet/ssl.bak
2.) run puppet agent --test --noop on that node
3.) On the Puppet Master (Foreman) , run : puppet cert sign nodename_.umd.edu
* If the wrong name persists, go to /etc/puppet/puppet.conf and change it by hand.
the line is : certname = hepcms-in2.umd.edu
* Then run puppet agent --test --noop on that node again and confirm the update has applied (a combined sketch of these steps follows).
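Put together, the steps above look roughly like this (nodename is a placeholder for the node's certificate name):
# on the node
mv /var/lib/puppet/ssl /var/lib/puppet/ssl.bak
puppet agent --test --noop
# on the puppet master (hepcms-foreman)
puppet cert sign nodename.umd.edu
# back on the node, confirm
puppet agent --test --noop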
Make sure to do the command as root
[belt@hepcms-in2 ~]$ puppet agent --test
Info: Creating a new SSL key for hepcms-in2.umd.edu
Error: Could not request certificate: Find /production/certificate/ca?fail_on_404=true resulted in 404 with the message: {"message":"Not Found: Error: Invalid URL - Puppet expects requests that conform to the /puppet and /puppet-ca APIs.\n\nNote that Puppet 3 agents aren't compatible with this version; if you're running Puppet 3, you must either upgrade your agents to match the server or point them to a server running Puppet 3.\n\nMaster Info:\n Puppet version: 4.2.1\n Supported /puppet API versions: v3\n Supported /puppet-ca API versions: v1","issue_kind":"HANDLER_NOT_FOUND"}
Exiting; failed to retrieve certificate and waitforcert is disabled
Puppet error with certificates: running puppet agent --test --noop gives errors such as "could not request certificate" or "certificate retrieved from the master does not match the agent's private key":
hepcms-namenode: hdfs balancer
1. Make sure to back up the /etc and /var folders before continuing; follow the steps in the screenshot.
2. Log onto hepcms-foreman and run puppet cert clean #node-name#.privnet (fill in your node name).
3. Then, on the node/agent itself, run find /var/lib/puppet/ssl -name hepcms-in3.privnet.pem -delete
4. Then run `puppet agent --test --noop` on hepcms-foreman and it will show something like this:
5. Afterwards, go to the node of concern (in my case it was hepcms-in3) and run the command `puppet agent --test` there;
you should get something like the following:
If you get similar messages it means Puppet has picked up on changes successfully.
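A condensed sketch of the cert-clean sequence above (hepcms-in3.privnet is the example node from the steps; substitute the node you are actually fixing):
# On hepcms-foreman (the puppet master): revoke and remove the old signed cert
puppet cert clean hepcms-in3.privnet
# On the affected agent node: remove the stale cert copy, then re-run the agent
find /var/lib/puppet/ssl -name hepcms-in3.privnet.pem -delete
puppet agent --test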
backup puppet as shown above
r10k, when run in verbose mode, should print what it is removing (if anything), so you can then identify what in the backup needs to be restored.
on hepcms-foreman as root: run /usr/local/sbin/mysqlbackup.sh
makes output in /opt/mysql_backups/mysql_backup*20150709-143033*.bz2 (for instance for 14:30:33 on 9 July 2015), take the latest outputs and backup elsewhere
/var/lib/dhcpd/dhcpd.leases
service dhcpd restart
service foreman-proxy restart
When a host is removed the entries in the following directories should also be removed if not cleaned up automatically.
/var/lib/puppet/yaml/facts/
/var/lib/puppet/yaml/node/
/var/lib/puppet/yaml/foreman/
The leases file will still contain removed machines. If you remove information from this file, restart the dhcpd service.
If the VM is still in Foreman and managed, you can delete it from Foreman and that should delete it from oVirt. If the VM is no longer in Foreman, or not managed in Foreman, you have to delete it from oVirt directly.
For DHCP errors, look in /var/log/foreman-proxy/proxy.log on the DHCP server.
Sometimes we run into problems where permissions on /etc/dhcp are too restrictive. Usually chmod 0755 /etc/dhcp fixes the issue; then restart foreman-proxy.
The DHCP conflict entries may be due to entries left in DHCP. Deleting a host from Foreman should clean up DHCP too. You may have to open /var/lib/dhcpd/dhcpd.leases and delete the things that shouldn't be there. Then restart dhcpd service
Make a backup first just to be safe
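For example, a minimal backup-and-edit sequence (the date suffix on the backup name is just a suggestion):
cp /var/lib/dhcpd/dhcpd.leases /var/lib/dhcpd/dhcpd.leases.bak.$(date +%Y%m%d)
vi /var/lib/dhcpd/dhcpd.leases    # remove the stale host entries by hand
service dhcpd restart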
debug information logs:
/var/log/foreman-proxy/proxy.log
/var/log/foreman/production.log
/var/log/messages
/var/log/boot.log
You can clean the cache:
/var/run/foreman/cache/
PXE boot failure will be seen when the downloaded PXE files are corrupted. The easiest fix is removing them and forcing them to redownload.
First remove the associated files in `/var/lib/tftpboot/boot` on the Foreman server.
So if the host was supposed to build SL 6.7, the files are likely called `Scientific-6.7-x86_64-initrd.img` and `Scientific-6.7-x86_64-vmlinuz`.
Then cancel build for host in Foreman and click Build again , that will trigger Foreman Proxy to redownload the files (since they will be missing).
When you click the Build button, one thing Foreman does is instruct Foreman Proxy to ensure TFTP boot files exist. If you remove them, Foreman Proxy will download them again.
But only if you instruct a host to Build after removing them.
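A sketch of the removal step, assuming the SL 6.7 filenames mentioned above (check what is actually in the directory first):
ls /var/lib/tftpboot/boot
rm /var/lib/tftpboot/boot/Scientific-6.7-x86_64-initrd.img /var/lib/tftpboot/boot/Scientific-6.7-x86_64-vmlinuz
# Then in the Foreman web UI: cancel Build for the host and click Build again to trigger the re-download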
On the foreman web page:
Configure … Host Groups … click on group (base)
Click on Puppet Classes
Click the + to expand the puppet class and click the + next to the particular thing you want to add
Click Submit on the bottom
Then be sure on that node to run puppet agent --test to pick up the changes (see --noop for testing above)
Check that you are partitioning the right disk (/dev/sda for instance; you can use --ondisk=/dev/sda to force it in the kickstart)
See error message:
Cannot open root device "(null)" or unknown-block(8,6)
Please append a correct "root=" boot option: here are the available partitions:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,6)
Pid 1, comm: swapper Not tainted 2.6.32-504.el6.x86_64 #1
Call Trace:
I've seen that when the downloaded PXE files are corrupted. The easiest fix is removing them and forcing them to redownload
First remove the associated files in `/var/lib/tftpboot/boot` on the Foreman server
So if the host was supposed to build SL 6.7, the files are likely called `Scientific-6.7-x86_64-initrd.img` and `Scientific-6.7-x86_64-vmlinuz`.
Then cancel build for host in Foreman and click Build again , that will trigger Foreman Proxy to redownload the files (since they will be missing).
When you click the Build button, one thing Foreman does is instruct Foreman Proxy to ensure TFTP boot files exist. If you remove them, Foreman Proxy will download them again.
But only if you instruct a host to Build after removing them.
If you see a red and blue screen with "Error downloading kickstart file. Please modify the kickstart parameter below or press Cancel to proceed as an interactive installation", follow the directions above for "Kickstart on PXE boot failure" to remove and force Foreman to re-download the kickstart files.
Follow directions above for "Kickstart on PXE boot failure" to remove and force Foreman to re-download the kickstart files.
First check httpd service is running, if not start it.
[root@hepcms-foreman ~]# service httpd status
httpd is stopped
[root@hepcms-foreman ~]# service httpd start
[Wed May 31 15:55:18 2017] [warn] module passenger_module is already loaded, skipping
Syntax error on line 4 of /etc/httpd/conf.d/activemq-httpd.conf:
Invalid command 'ProxyRequests', perhaps misspelled or defined by a module not included in the server configuration
The offending config /etc/httpd/conf.d/activemq-httpd.conf
was installed as part of the mcollective yum install, a service that is not currently working and was installed a couple of months ago.
For now, move it out of the httpd config area:
[root@hepcms-foreman conf.d]# mv activemq-httpd.conf /root/
restart the services
[root@hepcms-foreman conf.d]# service httpd start
Starting httpd: [Fri Jun 02 13:19:39 2017] [warn] module passenger_module is already loaded, skipping
[ OK ]
[root@hepcms-foreman conf.d]# passenger-status
Version : 4.0.18
Date : Fri Jun 02 13:19:52 -0400 2017
Instance: 25594
----------- General information -----------
Max pool size : 6
Processes : 0
Requests in top-level queue : 0
----------- Application groups -----------
/usr/share/foreman#default:
App root: /usr/share/foreman
(spawning new process...)
Requests in queue: 2
[root@hepcms-foreman conf.d]# service foreman status
Foreman is running under passenger [PASSED]
ssh root@hepcms-foreman.umd.edu, check /var/log/foreman/production.log for something like this:
Started PUT "/hosts/hepcms-gridftp.umd.edu/setBuild?auth_object=hepcms-gridftp.umd.edu&permission=build_hosts" for 206.196.186.151 at 2016-06-17 14:14:37 -0400
2016-06-17 14:14:37 [I] Processing by HostsController#setBuild as HTML
2016-06-17 14:14:37 [I] Parameters: {"utf8"=>"✓", "authenticity_token"=>"SQ1q1B1aPMbBXTfyYJ/YUVO9bn3nXHLBbOvzl2os3eY=", "commit"=>"Build", "auth_object"=>"hepcms-gridftp.umd.edu", "permission"=>"build_hosts", "id"=>"hepcms-gridftp.umd.edu"}
2016-06-17 14:14:37 [I] Add the TFTP configuration for hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Fetching required TFTP boot files for hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Redirected to https://hepcms-foreman.umd.edu/hosts/hepcms-gridftp.umd.edu
2016-06-17 14:14:37 [I] Completed 302 Found in 601ms (ActiveRecord: 10.7ms)
Note: the time it takes to build (shown at the bottom of this page) can be very long, ~20-30 minutes or more, depending on the disks attached to the node.
ssh root@hepcms-foreman.umd.edu
May wish to backup on hepcms-foreman as you see above, type the alias backup
Command is: r10k deploy -v info environment -p. This is aliased as getr10k, which runs the command above and also prints the date, so you can keep track of when you ran it in a workflow.
Did you get r10k updates above?
Did you have a bug in your code? (Run puppet agent --test --noop on the machine you are trying to change.)
Go to the area it's complaining about (on hepcms-foreman) and do a git status. Currently there is no ssh key on hepcms-foreman, we are using git as read only
To continue to use git as read only, do the following:
Commit and push any changes you have made by hand to git using another server (not ideal, best to add a ssh key and config your hepcms-foreman git as root)
Go to the affected areas that git complains about (e.g. the directories below), and in each area run the following git commands:
/etc/puppet/hiera/production/hieradata
/etc/puppet/environments/production/modules/profile
git fetch --all
git reset --hard origin/master (or git reset --hard)
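A sketch of the full read-only reset over both checkouts (this discards any local hand edits, so back up first):
for d in /etc/puppet/hiera/production/hieradata /etc/puppet/environments/production/modules/profile; do
  ( cd "$d" && git status && git fetch --all && git reset --hard origin/master )
done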
On hepcms-foreman, put it in /etc/puppet/modules. That directory is defined in /etc/puppet/puppet.conf as part of basemodulepath, a path picked up by Puppet for modules but not touched by r10k.
In puppet:
host { 'hepcms-hn.umd.edu':
ensure => 'present',
host_aliases => ['hepcms-hn'],
ip => '10.1.0.1',
}
From command line in puppet:
puppet resource host hepcms-hn.umd.edu ensure=present host_aliases=hepcms-hn ip=10.1.0.1
By hand with no puppet: In /etc/hosts: 10.1.0.1 hepcms-hn.umd.edu
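A quick check (a sketch) that the host entry took effect, whichever of the three methods you used:
getent hosts hepcms-hn.umd.edu          # resolves via /etc/hosts (and NSS)
puppet resource host hepcms-hn.umd.edu  # prints the entry as Puppet sees it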
In github umd_hepcms_puppet_modules, edit Puppetfile - This file is a list of modules that are to be installed on the puppet master by r10k. Make sure to check in the edit and run r10k to pick it up.
Example lines:
mod 'puppetlabs/denyhosts', '0.1.0'
mod 'osg', :git => 'https://github.com/treydock/puppet-osg'
mod 'role', :git => 'https://github.com/UMD-HEPCMS/umd_hepcms_puppet_roles'
In hepcms-foreman, be sure to update the Puppet classes available, Configure… Puppet classes.. click on button to Import from hepcms-puppet.umd.edu
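After getr10k, a quick check (a sketch) that the new module actually landed in the production environment:
ls /etc/puppet/environments/production/modules/ | grep -i osg   # substitute the module name you added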
On the node that complains, check the facter information, then fix it to pick up facter from the puppetlabs repo as below:
facter -p operatingsystemmajrelease
facter --version
Check your foreman provisioning template (web interface) for that node (can click on Templates) , in this case it's:
<% if puppet_enabled && @host.params['enable-puppetlabs-repo'] && @host.params['enable-puppetlabs-repo'] == 'true' -%>
In this case we want to set that universally, so in foreman (web interface), Configure… Global Parameters…
Name: enable-puppetlabs-repo Value: true
facter::package_ensure: "2.4.4-1.el%{::operatingsystemmajrelease}"
Or hard code release:
facter::package_ensure: "2.4.4-1.el6"
Now it's possible that puppet tries to update facter before adding the repo. Puppet's order of applying things is 'random'. You'd have to tell Puppet that `Package[facter]` requires `Yumrepo[puppetlabs-products]`.
One really bad hack I use in site.pp is this: Yumrepo <| |> -> Package <| |>
That basically tells Puppet to ensure all repos are added before packages
It has caused me a few problems but the problems were with modules I developed so updated my own modules to allow for such a hack
Add to profile::base something like this:
include ::facter
include ::puppetlabs_yum
Class['::puppetlabs_yum'] -> Class['::facter']
That will ensure anything with profile::base has the puppetlabs_yum class applied before facter
Is there a warning message with puppet agent --test --noop run on that node?
Is the puppet module added to the base class or the node (check the hepcms-foreman web page)?
Can add on the hepcms-foreman web page (which should only affect kickstart), or better, add in base.pp below:
Add in base.pp for instance:
include ::facter
include ::puppetlabs_yum
Class['::puppetlabs_yum'] -> Class['::facter']
puppet agent --test --tags facter,puppetlabs_yum
Check that /data and /home are properly mounted. Check that the head node and r720-datanfs machines are healthy (df -h) and have proper network and firewall settings. Interestingly enough, this caused problems in puppet agent when I messed up the r720-datanfs firewall; the other symptom was that df -h would hang.
Look in the module's .erb file to see what variables modify the configuration file, for instance:
https://github.com/treydock/puppet-osg/blob/master/templates/cvmfs/default.local.erb
Format in your .yaml like so:
osg::cvmfs::http_proxies:
- 'http://hepcms-squid:3128'
http://rnelson0.com/2014/10/20/rewriting-a-puppet-module-for-use-with-hiera/
https://docs.puppetlabs.com/hiera/1/puppet.html#automatic-parameter-lookup
Example: https://forge.puppetlabs.com/jfryman/selinux
Add in Puppetfile: mod 'jfryman/selinux', '0.2.5'
Add in base.pp: include ::selinux
Add in common.yaml: selinux::mode: 'disabled'
Be sure to run r10k to pick up changes, run puppet agent --test
To use in a specific GUMS.yaml: selinux::mode: 'enforcing'
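After the puppet run, a quick check on the node (a sketch) that the selinux mode really changed:
getenforce                             # runtime mode
grep '^SELINUX=' /etc/selinux/config   # configured mode (takes effect at boot)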
Are you trying to implement a single-value variable as if it were an array (or vice versa)?
array:
osg::cvmfs::http_proxies:
- 'http://hepcms-squid.privnet:3128'
Single value (with or without single quotes; note the space after the last colon):
osg::cvmfs::cms_local_site: T3_US_UMD
Hiera picks up values from common.yaml, the Hostgroup yaml (only one, NOT an inherited structure, so using both Worker.yaml and R720.yaml is a bad idea; stick to just one), and the fqdn/FullFQDN.yaml.
I want --dport 9000:9999 in my firewall; the puppet module accepts the following in hiera (note it doesn't accept the string "9000:9999"):
dport: [9000,9999]
This is an array, which is actually equivalent to:
dport:
- 9000
- 9999
But I want the range, which is coded correctly as a string:
dport: '9000-9999'
And I get:
-A INPUT -p tcp -m multiport --dports 9000:9999 -m comment --comment "004 Condor ports open" -j ACCEPT
Note:
dport matches the destination port (used here for inbound rules); sport matches the source port (used here for outbound rules)
for example :
'003 allow GRAM callback inbound':
dport: '40000-40199'
proto: "tcp"
action: 'accept'
'004 allow GRAM callback outbound':
sport: '20000-25000'
proto: "tcp"
action: 'accept'
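Once puppet has applied the rules, a quick check on the node (a sketch) that the ranges ended up in iptables as expected:
iptables -L INPUT -n --line-numbers | grep -i condor
iptables -S | grep 40000          # should show the GRAM callback range from the example above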
In the example above with selinux, it keeps applying the setenforce 0, but stops after reboot, so just reboot the node. If that doesn't fix it, then check the hepcms-foreman web page Reports for what puppet keeps applying, maybe you set something up wrong.
Did you spell the class name right in implementation?
Did you check the Dependencies web page for the puppet module? Make sure it's installed in the Puppetfile
ERROR -> Forge module names must match 'owner/modulename'
Did you forget a comma in your Puppetfile ?
http://www.puppetcookbook.com/posts/creating-a-directory.html
puppet snippet:
# Same as command: ln -s /etc/puppet/hiera/production/hiera.yaml /etc/hiera.yaml
file { '/etc/hiera.yaml':
ensure => 'symlink',
target => '/etc/puppet/hiera/production/hiera.yaml',
}
Foreman proxy not starting
Check ps aux. The smart-proxy process may still be running even though the foreman-proxy daemon is stopped; its status was SNl in ps aux and it had been running since the last time the proxy worked. The issue can be solved by manually killing the smart-proxy process and restarting foreman-proxy.
498 22939 0.4 0.7 169560 62244 ? Sl 14:36 0:11 ruby /usr/share/foreman-proxy/bin/smart-pr
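A sketch of the kill-and-restart sequence (the PID comes from the ps output, like the line above):
ps aux | grep [s]mart-proxy       # note the PID of the stale process
kill <PID>                        # <PID> is a placeholder -- use the number from ps
service foreman-proxy restart
service foreman-proxy status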
look at install logs on `/root/`
Reboot the server. When it gets to the screen where it chooses the Linux version, choose an older version (kernel) instead.
Once in, remove the oldest kernel from the /boot directory (not the one in use now).
Check if other directories are filled and if so release some space.
You can check which kernel is being used right now
[root@r720-0-1 boot]# uname -or
2.6.32-642.6.2.el6.x86_64 GNU/Linux
[root@r720-0-1 boot]#
and reinstall the new kernel again.
Instructions from Doug:
Do the kernel panics occur while booting or sometime later? Either way, I recommend booting into a previous kernel. I suspect that a disk partition filled and resulted in a corrupt upgrade. You can usually recover from this. After booting, check the disks for full partitions. If it is the result of logs or crash dumps (/var/crash), clean these up. The most likely problem is that /boot filled. This is a little tricky to clean up. To remove packages, /var must have free space. Then you can remove "old" kernels; not the one you are running. When there is enough free space, try reinstalling the most recent kernel.
yum reinstall kernel-##.##....
Do not use --skip-broken. It is best to keep cleaning and resolving yum errors until you can run yum cleanly. I recently went through this for a machine that would not boot. We spent 4 or 5 hours resolving the issues, but did not have to reinstall the OS.
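A sketch of the cleanup sequence described above (the version strings are examples; take the real ones from `uname -r` and `rpm -q kernel` on the node):
df -h /boot /var                       # check which partition actually filled
uname -r                               # the running kernel -- never remove this one
rpm -q kernel                          # list all installed kernels
yum remove kernel-<old-version>        # pick an old entry from the list above
yum reinstall kernel-2.6.32-642.6.2.el6.x86_64   # most recent kernel, e.g. the version seen earlier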
https://www.digicert.com/secure/profile-settings/
These certs are now through UMD, so on the digicert website log in through SSO for University of Maryland, College Park.
We have three certs at the moment that expire in January
I love digicert! They are for linux
usr: digicert
jabeen@umd.edu
Log in to the digicert account and request a new grid ssl host certificate. It requires a CSR file.
On the SE, CE and gridftp machines, cd to
cd /data/site_conf/certs/DIGICERT-2019/
and create certificate csr and key files.
openssl req -new -newkey rsa:2048 -nodes -out hepcms-0_umd.edu.csr -keyout hepcms-0.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-0.umd.edu/emailAddress=jabeen@umd.edu"
openssl req -new -newkey rsa:2048 -nodes -out hepcms-1_umd.edu.csr -keyout hepcms-1.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-1.umd.edu/emailAddress=jabeen@umd.edu"
openssl req -new -newkey rsa:2048 -nodes -out hepcms-gridftp_umd.edu.csr -keyout hepcms-gridftp.umd.edu.key -subj "/C=us/ST=CO/L=College Park/O=University of Maryland/OU=Physics/CN=hepcms-gridftp.umd.edu/emailAddress=jabeen@umd.edu"
more hepcms-1.umd.edu.csr
Copy and paste the CSR contents into the digicert website. All the other fields are filled automatically. Order the certificate and wait for approval.
Once you have it, download the file, copy it to the /data directory, and copy the relevant files to /etc/grid-security on all three machines.
Make sure they have correct permissions.
On hepcms-se (hepcms-0) (this is for both the SE and xrootd):
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd.edu.csr .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd_edu_14042245/hepcms-0_umd_edu.crt .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0_umd_edu_14042245/DigiCertCA.crt .
cp /data/site_conf/certs/DIGICERT-2019/hepcms-0.umd.edu.key .
cp hepcms-0.umd.edu.key hostkey.pem
cp hepcms-0_umd_edu.crt hostcert.pem
chmod 444 hostcert.pem
chmod 400 hostkey.pem
cd xrd/
cp ../hostkey.pem xrdkey.pem
cp ../hostcert.pem xrdcert.pem
chmod 444 xrdcert.pem
chmod 400 xrdkey.pem
restart the services
service condor-ce restart on CE (hepcms-1)
service xrootd restart on SE (hepcms-0)
service cmsd restart on SE (hepcms-0)
service globus-gridftp-server restart on gridftp
check the dates and that the cert matches the key
[root@hepcms-1 grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=com/DC=DigiCert-Grid/C=US/ST=Maryland/L=College Park/O=University of Maryland/CN=hepcms-1.umd.edu
issuer= /C=US/O=DigiCert Grid/OU=www.digicert.com/CN=DigiCert Grid Trust CA G2
notBefore=Dec 6 00:00:00 2019 GMT
notAfter=Jan 5 12:00:00 2021 GMT
[root@hepcms-1 grid-security]#
[root@hepcms-1 grid-security]# openssl x509 -noout -modulus -in hostcert.pem | openssl md5
(stdin)= a6c9ac5f7a36ff49efa6de7f861359e9
[root@hepcms-1 grid-security]# openssl rsa -noout -modulus -in hostkey.pem | openssl md5
(stdin)= a6c9ac5f7a36ff49efa6de7f861359e9
[root@hepcms-1 grid-security]#
ls -slrt
service condor-ce status
service condor-ce restart
tail -100 /var/log/condor-ce/SchedLog
history
Same for SE and hepcms-gridftp.umd.edu
Check that the services are working:
[jabeen@hepcms-in2 PDF]$ xrdfs root://hepcms-0.umd.edu:1094/ ls /store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000/
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//A64CCCF2-5C76-E711-B359-0CC47A78A3F8.root
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//AE237916-5D76-E711-A48C-FA163EEEBFED.root
/store/test/xrootd/T3_US_UMD/store/mc/SAM/GenericTTbar/AODSIM/CMSSW_9_2_6_91X_mcRun1_realistic_v2-v1/00000//CE860B10-5D76-E711-BCA8-FA163EAA761A.root
cd /data/users/jabeen/CMSSW_8_0_26_patch1/src/WG_Analysis/
cmsenv
source /cvmfs/cms.cern.ch/crab3/crab.csh
crab checkwrite --site=T3_US_UMD
On all the nodes that need certificates, generate the RSA key and CSR:
clush -w hepcms-gridftp -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-gridftp.umd.edu.key -out hepcms-gridftp.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-1.umd.edu.key -out hepcms-1.umd.edu.csr
clush -w hepcms-se -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-0.umd.edu.key -out hepcms-0.umd.edu.csr
clush -w siab-1 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout siab-1.umd.edu.key -out siab-1.umd.edu.csr
clush -w hepcms-gridftp -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-gridftp.umd.edu.key -out hepcms-gridftp.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout http/hepcms-1.umd.edu.key -out http/hepcms-1.umd.edu.csr
clush -w hepcms-ce -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout rsv/hepcms-1.umd.edu.key -out rsv/hepcms-1.umd.edu.csr
clush -w hepcms-in2 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcms-in2.umd.edu.key -out hepcms-in2.umd.edu.csr
Not getting this one: clush -w hepcmsdev-6 -b openssl req -new -batch -newkey rsa:2048 -nodes -keyout hepcmsdev-6.umd.edu.key -out hepcmsdev-6.umd.edu.csr
From https://opensciencegrid.org/docs/security/host-certs/
Verify that the issuer CN field is InCommon IGTF Server CA:
Install the host certificate and key: in /etc/grid-security
$ openssl x509 -in <PATH TO CERTIFICATE> -noout -issuer
issuer= /C=US/O=Internet2/OU=InCommon/CN=InCommon IGTF Server CA
root@host # cp <PATH TO CERTIFICATE> hostcert.pem
root@host # cp <PATH TO KEY> hostkey.pem
root@host # chmod 444 hostcert.pem
root@host # chmod 400 hostkey.pem
From https://www.digicert.com/csr-ssl-installation/apache-openssl.htm#ssl_certificate_install
[root@hepcms-gridftp grid-security]# grep -i -r "SSLCertificateFile" /etc/
/etc/sfcb/sfcb.cfg:sslCertificateFilePath: /etc/sfcb/server.pem
[root@hepcms-gridftp grid-security]#
Hosts that need certificates.
osg-gridadmin-cert-request --hostname=hepcms-in2.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=siab-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcms-gridftp.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=hepcmsdev-6.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Command to get the certificate
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTS/hepcms-ce]$ osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTS/hepcms-ce]$ osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
[jabeen@hepcms-in1 ~/SITE_CERTShepcms-ce]$ osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
ssh hepcms-ce
cd /etc/grid-security/
compare the new and old to see they have the same ID
[root@hepcms-1 2017certs]# openssl x509 -in hepcms-1.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Apr 4 17:11:52 2017 GMT
notAfter=May 4 17:16:52 2018 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Mar 4 19:38:57 2016 GMT
notAfter=Apr 3 19:43:57 2017 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../rsv/rsvcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Mar 4 19:39:30 2016 GMT
notAfter=Apr 3 19:44:30 2017 GMT
[root@hepcms-1 2017certs]# openssl x509 -in ../http/httpcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=http/hepcms-1.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=May 2 21:35:41 2016 GMT
notAfter=Jun 1 21:40:41 2017 GMT
now move old certs to xxx-old and copy all three new certs to their proper names and directories
[root@hepcms-1 grid-security]# mv hostcert.pem hostcert.pem-old
[root@hepcms-1 grid-security]# mv hostkey.pem hostkey.pem-old
[root@hepcms-1 grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-ce/*.pem .
[root@hepcms-1 grid-security]# mv hepcms-1.umd.edu.pem hostcert.pem
[root@hepcms-1 grid-security]# mv hepcms-1.umd.edu-key.pem hostkey.pem
[root@hepcms-1 grid-security]# cd http/
[root@hepcms-1 http]# mv httpcert.pem httpcert.pem-old
[root@hepcms-1 http]# mv httpkey.pem httpkey.pem-old
[root@hepcms-1 http]# mv ../http-hepcms-1.umd.edu.pem httpcert.pem
[root@hepcms-1 http]# mv ../http-hepcms-1.umd.edu-key.pem httpkey.pem
[root@hepcms-1 http]# cd ../rsv/
[root@hepcms-1 rsv]# mv rsvcert.pem rsvcert.pem-old
[root@hepcms-1 rsv]# mv rsvkey.pem rsvkey.pem-old
[root@hepcms-1 rsv]# mv ../rsv-hepcms-1.umd.edu.pem rsvcert.pem
[root@hepcms-1 rsv]# mv ../rsv-hepcms-1.umd.edu-key.pem rsvkey.pem
Make sure they have the right ownership
[root@hepcms-1 grid-security]# chmod 444 hostcert.pem http/httpcert.pem rsv/rsvcert.pem
[root@hepcms-1 grid-security]# chmod 400 hostkey.pem http/httpkey.pem rsv/rsvkey.pem
[root@hepcms-1 grid-security]# chown root:root *.pem
[root@hepcms-1 rsv]# chown rsv:rsv *.pem
[root@hepcms-1 http]# chown tomcat:tomcat *.pem
[root@hepcms-1 rsv]# service rsv restart
Stopping RSV: Stopping all metrics on all hosts.
Stopping consumers.
Starting RSV: Starting 13 metrics for host 'hepcms-1.umd.edu'.
Starting 2 metrics for host 'hepcms-0.umd.edu:8443'.
Starting 1 metrics for host 'hepcms-gridftp.umd.edu'.
Starting 2 consumers.
[root@hepcms-1 rsv]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-1 rsv]#
Get all the new certs (you have to be a GridAdmin for that).
As yourself, log in to hepcms-in2 and make a new area to save the new certs. This SITE_CERTS directory is softlinked in /data/users/jabeen, which makes these certs accessible from all needed nodes.
[jabeen@hepcms-in2 ~]$ mkdir SITE_CERTS
[jabeen@hepcms-in2 ~]$ cd SITE_CERTS/
[jabeen@hepcms-in2 ~/SITE_CERTS]$ mkdir hepcms-se
[jabeen@hepcms-in2 ~/SITE_CERTS/hepcms-se]$ osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-0.umd.edu
Generating certificate...
Writing key to ./hepcms-0.umd.edu-key.pem
Id is: 9155
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-0.umd.edu.pem
[jabeen@hepcms-in2 ~/SITE_CERTS]$ ls -slrt hepcms-se/
total 8
4 -rw------- 1 jabeen users 1679 Jan 13 18:49 hepcms-0.umd.edu-key.pem
4 -rw-r--r-- 1 jabeen users 1668 Jan 13 18:49 hepcms-0.umd.edu.pem
Apply SE and bestman Certificates (same)
[jabeen@hepcms-in2 ~]$ ssh hepcms-se
[root@hepcms-0 /]# cd ./etc/grid-security/
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-r-------- 1 root root 1.7K Feb 6 2016 hostkey.pem
-r--r--r-- 1 root root 1.7K Feb 6 2016 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jul 6 2016 .
[root@hepcms-0 grid-security]# sftp jabeen@hepcms.umd.edu
Connecting to hepcms.umd.edu...
jabeen@hepcms.umd.edu's password:
sftp> cd /home/jabeen/SITE_CERTS/hepcms-se
sftp> mget *.pem
Fetching /home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu-key.pem to hepcms-0.umd.edu-key.pem
/home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu-key.pem 100% 1679 1.6KB/s 00:00
Fetching /home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu.pem to hepcms-0.umd.edu.pem
/home/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu.pem 100% 1668 1.6KB/s 00:00
sftp> bye
[root@hepcms-0 grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-se/hepcms-0.umd.edu* .
[root@hepcms-0 grid-security]# ls -slrt
total 96
4 -r-------- 1 root root 1679 Feb 6 2016 hostkey.pem
4 -r--r--r-- 1 root root 1672 Feb 6 2016 hostcert.pem
4 -rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
4 drwxr-xr-x 2 bestman bestman 4096 May 26 2016 bestman
4 drwxr-xr-x 2 xrootd xrootd 4096 Jun 29 2016 xrd
4 -rw-r--r-- 1 root root 1781 Aug 9 20:46 gsi.conf
60 drwxr-xr-x 2 root root 61440 Oct 20 00:44 certificates
4 drwxr-xr-x 46 root root 4096 Nov 4 16:24 vomsdir
4 -rw------- 1 root root 1679 Jan 13 18:58 hepcms-0.umd.edu-key.pem
4 -rw-r--r-- 1 root root 1668 Jan 13 18:58 hepcms-0.umd.edu.pem
Check that old and new certs are for the same host:
[root@hepcms-0 grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 grid-security]# openssl x509 -in hepcms-0.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 11 17:03:20 2018 GMT
notAfter=Mar 13 17:08:20 2019 GMT
[root@hepcms-0 grid-security]#
[root@hepcms-0 grid-security]# mv hepcms-0.umd.edu-key.pem hostkey.pem
[root@hepcms-0 grid-security]# mv hepcms-0.umd.edu.pem hostcert.pem
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-rw------- 1 root root 1.7K Jan 13 18:58 hostkey.pem
-rw-r--r-- 1 root root 1.7K Jan 13 18:58 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 .
[root@hepcms-0 grid-security]# chmod 400 hostkey.pem
[root@hepcms-0 grid-security]# chmod 444 hostcert.pem
[root@hepcms-0 grid-security]# ls -alrh
total 104K
drwxr-xr-x 2 xrootd xrootd 4.0K Jun 29 2016 xrd
drwxr-xr-x 46 root root 4.0K Nov 4 16:24 vomsdir
-r-------- 1 root root 1.7K Jan 13 18:58 hostkey.pem
-r--r--r-- 1 root root 1.7K Jan 13 18:58 hostcert.pem
-rw-r--r-- 1 root root 1.8K Aug 9 20:46 gsi.conf
-rw-r--r-- 1 root root 60 Feb 29 2016 gsi-authz.conf
drwxr-xr-x 2 root root 60K Oct 20 00:44 certificates
drwxr-xr-x 2 bestman bestman 4.0K May 26 2016 bestman
drwxr-xr-x. 111 root root 12K Jan 13 18:51 ..
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 .
For bestman certs:
[root@hepcms-0 grid-security]# chown bestman:bestman bestman
[root@hepcms-0 grid-security]# ls bestman/
[root@hepcms-0 grid-security]# cp *.pem bestman/
[root@hepcms-0 grid-security]# cd bestman/
[root@hepcms-0 bestman]# chown bestman:bestman *.pem
Check bestman certs are the same as hepcms-0
[root@hepcms-0 bestman]# openssl x509 -in bestmancert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 bestman]#
[root@hepcms-0 bestman]# ls -alrh
total 24K
-r-------- 1 bestman bestman 1.7K Jan 13 19:08 hostkey.pem
-r--r--r-- 1 bestman bestman 1.7K Jan 13 19:08 hostcert.pem
-r-------- 1 bestman bestman 1.7K Mar 1 2016 bestmankey.pem
-r-------- 1 bestman bestman 1.7K Mar 1 2016 bestmancert.pem
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 ..
drwxr-xr-x 2 bestman bestman 4.0K Jan 13 19:08 .
[root@hepcms-0 bestman]# mv hostkey.pem bestmankey.pem
[root@hepcms-0 bestman]# mv hostcert.pem bestmancert.pem
[root@hepcms-0 bestman]# openssl x509 -in /etc/grid-security/bestman/bestmancert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 13 23:44:12 2017 GMT
notAfter=Feb 12 23:49:12 2018 GMT
[root@hepcms-0 bestman]# chmod 400 bestmankey.pem
[root@hepcms-0 bestman]# chmod 444 bestmancert.pem
[root@hepcms-0 bestman]# ls -alrh
total 16K
-r-------- 1 bestman bestman 1.7K Jan 13 19:08 bestmankey.pem
-r--r--r-- 1 bestman bestman 1.7K Jan 13 19:08 bestmancert.pem
drwxr-xr-x 6 root root 4.0K Jan 13 19:02 ..
drwxr-xr-x 2 bestman bestman 4.0K Jan 13 19:15 .
The xrd cert is the same as the SE host cert:
[root@hepcms-0 xrd]# openssl x509 -in xrdcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-0.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 6 16:27:05 2016 GMT
notAfter=Mar 7 16:32:05 2017 GMT
[root@hepcms-0 xrd]# pwd
/etc/grid-security/xrd
[root@hepcms-0 xrd]# mv xrdcert.pem xrdcert.pem-old
[root@hepcms-0 xrd]# mv xrdkey.pem xrdkey.pem-old
[root@hepcms-0 xrd]# cp ../hostcert.pem ./xrdcert.pem
[root@hepcms-0 xrd]# cp ../hostkey.pem xrdkey.pem
[root@hepcms-0 xrd]# ls -slrt
total 16
4 -r-------- 1 xrootd xrootd 1679 Jun 29 2016 xrdkey.pem-old
4 -r--r--r-- 1 xrootd xrootd 1672 Jun 29 2016 xrdcert.pem-old
4 -r--r--r-- 1 root root 1668 Mar 3 20:49 xrdcert.pem
4 -r-------- 1 root root 1679 Mar 3 20:50 xrdkey.pem
[root@hepcms-0 xrd]# chown xrootd:xrootd xrdcert.pem
[root@hepcms-0 xrd]# chown xrootd:xrootd xrdkey.pem
[root@hepcms-0 xrd]# ls -slrt
total 16
4 -r-------- 1 xrootd xrootd 1679 Jun 29 2016 xrdkey.pem-old
4 -r--r--r-- 1 xrootd xrootd 1672 Jun 29 2016 xrdcert.pem-old
4 -r--r--r-- 1 xrootd xrootd 1668 Mar 3 20:49 xrdcert.pem
4 -r-------- 1 xrootd xrootd 1679 Mar 3 20:50 xrdkey.pem
Get the cert
[jabeen@hepcms-in2 ~/SITE_CERTS]$ mkdir hepcms-gridftp
[jabeen@hepcms-in2 ~/SITE_CERTS]$ cd hepcms-gridftp/
[jabeen@hepcms-in2 hepcms-gridftp]$ osg-gridadmin-cert-request --hostname=hepcms-gridftp.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-gridftp.umd.edu
Generating certificate...
Writing key to ./hepcms-gridftp.umd.edu-key.pem
Id is: 9156
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-gridftp.umd.edu.pem
[jabeen@hepcms-in2 hepcms-gridftp]$ ls
hepcms-gridftp.umd.edu-key.pem hepcms-gridftp.umd.edu.pem
cd ../
Copy to hepcms-gridftp
ssh -Y hepcms-gridftp
[root@hepcms-gridftp grid-security]# cp /data/users/jabeen/SITE_CERTS/hepcms-gridftp/hepcms-gridftp.umd.edu* .
Check new and old are for gridftp
[root@hepcms-gridftp grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jun 13 18:05:50 2016 GMT
notAfter=Jul 13 18:10:50 2017 GMT
[root@hepcms-gridftp grid-security]# openssl x509 -in hepcms-gridftp.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 14 00:13:12 2017 GMT
notAfter=Feb 13 00:18:12 2018 GMT
replace old certs and check that permissions are same for old and new certs
[root@hepcms-gridftp grid-security]# mv hostkey.pem hostkey.pem-old
mv hostcert.pem hostcert.pem-old
mv hepcms-gridftp.umd.edu.pem hostcert.pem
mv hepcms-gridftp.umd.edu-key.pem hostkey.pem
ls -slrt
chmod 400 hostkey.pem
chmod 444 hostcert.pem
openssl x509 -in hostcert.pem -subject -issuer -dates -noout
[root@hepcms-gridftp grid-security]# openssl x509 -in hostcert.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcms-gridftp.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Jan 14 00:13:12 2017 GMT
notAfter=Feb 13 00:18:12 2018 GMT
gridftp cert also needs to be on datanfs
This is the place where we should keep all the certs to be deployed through puppet.
[jabeen@hepcms-in2 ~/SITE_CERTS]$ ls
hepcms-gridftp hepcms-se
[jabeen@hepcms-in2 ~/SITE_CERTS]$ tar cfvz hepcms-gridftp-cert.tgz hepcms-gridftp
hepcms-gridftp/
[root@hepcms-in2 ~]# cp /home/jabeen/SITE_CERTS/hepcms-gridftp-cert.tgz /data/site_conf/certs/
[root@hepcms-in2 ~]# ssh r720-datanfs
[root@r720-datanfs ~]# cd /data/site_conf/certs
[root@r720-datanfs certs]# cp /data/users/jabeen/SITE_CERTS/hepcms-gridftp/hepcms-gridftp.umd.edu* .
[root@r720-datanfs certs]# mv hepcms-gridftcert.pem hepcms-gridftcert.pem-old
[root@r720-datanfs certs]# mv hepcms-gridftpkey.pem hepcms-gridftpkey.pem-old
[root@r720-datanfs certs]# chown root:root *.pem
[root@r720-datanfs certs]# ls -alrh
[root@r720-datanfs certs]# ls -slrt
total 32
4 -r--r--r-- 1 root root 1675 May 25 2016 http
4 -r-------- 1 9 13 1679 Jun 3 2016 httpkey.pem
4 -r--r--r-- 1 9 13 1681 Jun 3 2016 httpcert.pem
0 drwxr-xr-x 2 root root 71 Jun 4 2016 grid-security
0 drwxr-xr-x 2 root root 41 Jun 4 2016 rsv
4 -rw-r--r-- 1 root root 35 Jun 9 2016 README
4 -rw------- 1 root root 1675 Jan 13 2017 hepcms-gridftpkey.pem-old
4 -rw-r--r-- 1 root root 1690 Jan 13 2017 hepcms-gridftcert.pem-old
4 -rw------- 1 root root 1679 Feb 11 13:08 hepcms-gridftpkey.pem
4 -rw-r--r-- 1 root root 1690 Feb 11 13:08 hepcms-gridftcert.pem
Get the cert for hepcmsdev-6
[jabeen@hepcms-in1 http]$ osg-gridadmin-cert-request --hostname=hepcmsdev-6.umd.edu --vo=CMS
Using timeout of 5 minutes
Please enter the pass phrase for '/home/jabeen/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcmsdev-6.umd.edu
Generating certificate...
Writing key to ./hepcmsdev-6.umd.edu-key.pem
Id is: 9312
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcmsdev-6.umd.edu.pem
[jabeen@hepcms-in1 http]$ openssl x509 -in hepcmsdev-6.umd.edu.pem -subject -issuer -dates -noout
subject= /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=hepcmsdev-6.umd.edu
issuer= /DC=org/DC=cilogon/C=US/O=CILogon/CN=CILogon OSG CA 1
notBefore=Feb 12 04:01:17 2017 GMT
notAfter=Mar 14 04:06:17 2018 GMT
now copy these to the location on hepcmsdev-6
[root@hepcmsdev-6 ~]# cd /etc/grid-security/http/
[root@hepcmsdev-6 http]# ls -slrt
total 8
4 -r--------. 1 tomcat tomcat 1675 Jan 13 2016 httpkey.pem
4 -r--r--r--. 1 tomcat tomcat 1692 Jan 13 2016 httpcert.pem
Rename the new certs to the standard names and fix permissions:
[root@hepcmsdev-6 http]# mv httpcert.pem httpcert.pem-old
[root@hepcmsdev-6 http]# mv httpkey.pem httpkey.pem-old
[root@hepcmsdev-6 http]# mv hepcmsdev-6.umd.edu-key.pem httpkey.pem
[root@hepcmsdev-6 http]# mv hepcmsdev-6.umd.edu.pem httpcert.pem
check that old and new certs match:
[root@hepcmsdev-6 http]# openssl x509 -in /etc/grid-security/http/httpcert.pem -dates -noout
notBefore=Jan 13 16:57:15 2016 GMT
notAfter=Feb 11 17:02:15 2017 GMT
[root@hepcmsdev-6 http]# chmod 400 httpkey.pem
[root@hepcmsdev-6 http]# chmod 444 httpcert.pem
[root@hepcmsdev-6 http]# chown tomcat.tomcat httpcert.pem
[root@hepcmsdev-6 http]# chown tomcat.tomcat httpkey.pem
Now restart the services:
service mysqld restart; service tomcat6 restart
More info here:
https://sites.google.com/a/physics.umd.edu/tier-3-umd/margueritedebuglog/gumsdebugging15dec2015
copied from
https://sites.google.com/a/physics.umd.edu/tier-3-umd/dont-edit/sitegridcertificates
Get CE and RSV site certificates
4 March 2016 (MBT)
hepcms-in2 already has osg-pki-tools installed so a GridAdmin can get site certificates, it also has osg, osg::cacerts and osg::cacerts::updater
Second, be sure the FQDN of your public IP of your node matches what hostname reports (on that node), use that below for HOSTNAME in the request
Third, that FQDN needs to exist as a service in OIM for the GridAdmin to get certificates (it already does, full instructions above in SE general example)
Login as myself on hepcms-in2, make sure I have my grid certificate installed on my /home/.globus
http://hep-t3.physics.umd.edu/HowToForUsers.html#CertAndProxy
HTCondorCE page says we need:
Host certificate
/etc/grid-security/hostcert.pem
/etc/grid-security/hostkey.pem
RSV page says we need:
RSV service certificate
/etc/grid-security/rsv/rsvcert.pem
/etc/grid-security/rsv/rsvkey.pem
Also double-checked the older cert to see the form the rsv service cert took (OLD FQDN used there):
OLD COMMAND: osg-gridadmin-cert-request --hostname=hepcms-0.umd.edu --vo=CMS
OLD COMMAND: osg-gridadmin-cert-request --hostname=rsv/hepcms-0.umd.edu --vo=CMS
I will now run:
osg-gridadmin-cert-request --hostname=hepcms-1.umd.edu --vo=CMS
osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Got my certs in my local area; the certs were made and approved, and I got 2 grid emails per cert about this.
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for hepcms-1.umd.edu
Generating certificate...
Writing key to ./hepcms-1.umd.edu-key.pem
Id is: 7251
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./hepcms-1.umd.edu.pem
[belt@hepcms-in2 SiteCE]$ cd ..
[belt@hepcms-in2 ~/SiteCertCE]$ dir
total 12K
drwxr-xr-x 3 belt users 4.0K Mar 4 14:43 .
drwxr-xr-x 78 belt users 4.0K Mar 4 14:33 ..
drwxr-xr-x 2 belt users 4.0K Mar 4 14:44 SiteCE
[belt@hepcms-in2 ~/SiteCertCE]$ mkdir RSVCE
[belt@hepcms-in2 ~/SiteCertCE]$ cd RSVCE/
[belt@hepcms-in2 RSVCE]$ osg-gridadmin-cert-request --hostname=rsv/hepcms-1.umd.edu --vo=CMS
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for rsv/hepcms-1.umd.edu
Generating certificate...
Writing key to ./rsv-hepcms-1.umd.edu-key.pem
Id is: 7252
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./rsv-hepcms-1.umd.edu.pem
[belt@hepcms-in2 RSVCE]$
Apparently (2 May 2016) we also need a http site cert for CEMon (didn't see that before). Get it to my area below:
[belt@hepcms-in2 http]$ osg-gridadmin-cert-request --hostname=http/hepcms-1.umd.edu --vo=CMS
Using timeout of 5 minutes
The timeout is set to 5
Please enter the pass phrase for '/home/belt/.globus/userkey.pem':
Waiting for response from Quota Check API. Please wait.
Beginning request process for http/hepcms-1.umd.edu
Generating certificate...
Writing key to ./http-hepcms-1.umd.edu-key.pem
Id is: 7740
Connecting to server to approve certificate...
Issuing certificate...
Certificate written to ./http-hepcms-1.umd.edu.pem
[belt@hepcms-in2 http]$ pwd
/home/belt/SiteCertCE/http
Copy CE and RSV certs to proper areas on hepcms-1.umd.edu
These certs are still in my user area (~belt/SiteCertCE/SiteCE/*.pem for CE and ~belt/SiteCertCE/RSVCE/*.pem for RSV, you can login to hepcms-hn (su - to become root) and scp them to hepcms-ce as needed)
rsv user needs to exist, so you may need to *install* rsv before properly chown-ing the cert
properly rename the certificates when you move them to /etc/grid-security/ and /etc/grid-security/rsv
Make sure the permissions are appropriate (chmod 400 *key.pem; chmod 444 *cert.pem)
Make sure they are owned correctly (OSG twiki will guide you, or blocks I copied above),
CE cert: chown root:root /etc/grid-security/*.pem
RSV cert: chown rsv:rsv /etc/grid-security/rsv/*.pem
Make sure the subdirectory is properly chowned (for RSV): chown rsv:rsv /etc/grid-security/rsv
HTTP Certs are located in ~belt/SiteCertCE/http/
Note that all the above certs are now in /data/site_conf and accessible through puppet
HTTP Cert properties:
file { '/etc/grid-security/http':
ensure => 'directory',
owner => 'tomcat',
group => 'tomcat',
mode => '0755',
}
file { '/etc/grid-security/http/httpcert.pem':
ensure => 'file',
owner => 'tomcat',
group => 'tomcat',
mode => '0444',
source => $osg::ce::_httpcert_source,
require => File['/etc/grid-security/http'],
}
file { '/etc/grid-security/http/httpkey.pem':
ensure => 'file',
owner => 'tomcat',
group => 'tomcat',
mode => '0400',
source => $osg::ce::_httpkey_source,
require => File['/etc/grid-security/http'],
}
http://hep-t3.physics.umd.edu/HowToForUsers.html#CertAndProxy
https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/GetHostServiceCertificates
Some experience with certs and installing osg-pki-tools here: https://sites.google.com/a/physics.umd.edu/tier-3-umd/commands/margueritedebuglog/12jan2016gumsdebug
See at the top how to test the status of certificates
Also check they have the proper permissions and ownerships
Also see the various OSG troubleshooting web pages
I hadn't renewed the certificates, grid jobs were no longer coming in, and we had rsv errors.
Debugged this with:
Saw this error in CE: globus-gatekeeper.log:
PID: 8094 -- Notice: 0: GATEKEEPER_JM_ID 2014-03-11.09:37:37.0000031277.0000000061 for /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=rsv/hepcms-0.umd.edu on ::ffff:128.8.164.12
Failure: globus_gss_assist_gridmap() failed authorization. globus_gss_assist: Error invoking callout
globus_callout_module: The callout returned an error
There was an error collecting ganglia data (127.0.0.1:8652): fsockopen error: Connection refused
Fix: check the gmond and gmetad services; restart them, then restart httpd.
[root@hepcms-hn ~]# service gmetad status
gmetad dead but subsys locked
[root@hepcms-hn ~]# service gmetad start
Starting GANGLIA gmetad: [ OK ]
[root@hepcms-hn ~]#
[root@hepcms-hn ~]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[root@hepcms-hn ~]#
Ganglia displaying a dead node
delete dead node from /var/lib/ganglia/rrds/UMD HEP CMS T3 and restart the services.
cd /var/lib/ganglia/rrds/
cd UMD\ HEP\ CMS\ T3/
rm -rf hepcms-ovirt2.privnet/
service gmetad start
service gmond restart
Overrode the hostname in gmond.conf (override_hostname = "r510-0-6"):
vi /etc/ganglia/gmond.conf
service gmond restart
Restart both the gmetad and gmond services:
[root@hepcms-hn ~]# service gmetad restart
Shutting down GANGLIA gmetad: [ OK ]
Starting GANGLIA gmetad: [ OK ]
[root@hepcms-hn ~]# service gmond restart
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
[root@hepcms-hn ~]#
Note also the spreadsheet attached (in Excel format there and pdf) with physical connection information
This spreadsheet is POSTED inside the C-21 back rack door physically at Rivertech
Reboot
press F2 to get into BIOS setup
Go to DEVICE SETTINGS
go to INTEGRATED RAID CONTROLLER utility
go to Configuration Management
PHYSICAL DISK MANAGEMENT will show the disks' status
go back to CONFIGURATION MANAGEMENT
go to Manage Foreign Configuration
Preview Foreign Configuration
Clear foreign configuration
compute-0-11 became unreachable. Ping and foreman connections didn't work.
At Rivertech, connected to the monitor and hard rebooted the node.
It seems to fail while checking the file system.
login as root.
root@compute-0-11> fsck
enter y for all the questions.
reboot
The node came back with everything perfectly mounted.
enter fdisk /dev/sdX
press n for a new partition
press p for adding a new partition
if the number of partitions is below 4, it will ask you for a partition number.
you can enter where the partitions start and end on the disk, but the defaults should be fine. (If you are adding another partition to the disk, the default should be right after the previous partition ends)
Usually it should partition and be ready to use, but if the disk is busy, you will have to restart the node.
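A sketch of that fdisk sequence (replace sdX with the real device; double-check it is the disk you mean before writing anything):
fdisk /dev/sdX        # then: n (new), p (primary), pick a partition number, accept the default start/end, w (write)
partprobe /dev/sdX    # ask the kernel to re-read the partition table, if partprobe is available
cat /proc/partitions  # confirm the new partition shows up; if the disk was busy, reboot the node instead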
Print Screen,
arrow up and down to select machines (can sometimes also type letter#, like c8 for "compute-0-8")
Get the second menu by arrowing to the KVM switch and pressing Enter or Print Screen again
One or both KVMs not working?
Check that they are powered on (green light in back), if not, check the power cord connection on phase3 (innermost PDU), as the sheaths don't fit, so they have some "wiggle room" and can get jostled
Also check that other physical connections are made (try not to jostle anything else)
Print Screen is the key to use with KVMs, get the second menu by arrowing to the KVM switch and pressing Enter or Print Screen again
Use Dell OMSA commands to help debug, or Dell OMSA webserver (note that in the past we have had R510s report a power problem according to Dell OMSA commands when actually they just needed a Firmware update!!!)
Could be that one of the two power cords is loose at the machine or at the PDU (be careful not to jostle any others)
If one is loose and you unplug the other, you will power down the machine suddenly, not good for disks
Could be that one of the hard drives has completely failed (beyond fsck failure), and needs to be physically replaced
If you cannot get to the operating system, try to reboot to the (F11, I think) menu to run Dell Diagnostics
If need be, some machines were setup with iDRAC access, with internal network identities accessible from firefox within the cluster
Check it physically at Rivertech
ping it (see also network troubleshooting)
At Rivertech, check that it can see the outside world, check if there's something interesting on the screen (like it wanted to fsck some disks upon reboot), if you can login as root, then fsck -y /hadoop2 (or any other disks it might complain about)
More network troubleshooting: go to /etc/init.d and run bash network stop, then bash network start, then have another machine ping it / try to ping a different machine.
Dell Disks/OMSA
after doing the usual ssh-agent $SHELL; ssh-add stuff
you can find all the service tags for all the machines by using clush and doing
clush -w @all_baremetal omreport system summary | grep Service
that will report everything currently in clush, not ovirt and r720-datanfs though
omreport system summary
Log in with the root account on the particular node.
be sure iptables is turned off:
by hand: service iptables stop; chkconfig iptables off
In puppet: Can disable iptables in Hiera using firewall::ensure: 'stopped' which requires the firewall class be included
Or if it's on, make sure that internal ports are open
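Quick checks on the node (a sketch) of whether iptables is actually on and what it allows:
service iptables status
chkconfig --list iptables
iptables -L -n | head -30     # eyeball the rules for the internal ports you need (e.g. 1311 for OMSA)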
Restart webserver: omconfig system webserver action=restart
Is Dell OMSA running? Try omreport system
[root@hepcms-gridftp ~]# service iptables stop
iptables: Setting chains to policy ACCEPT: filter [ OK ]
iptables: Flushing firewall rules: [ OK ]
iptables: Unloading modules: [ OK ]
[root@hepcms-gridftp ~]# omconfig system webserver action=restart
DSM SA Connection Service restarted successfully.
From firefox on hn connect to https://hepcms-gridftp.privnet:1311
clearing hn log
https://umdt3.slack.com/messages/C0B4U4C2G/
[root@hepcms-gridftp ~]# service iptables start
iptables: Applying firewall rules: [ OK ]
[root@hepcms-gridftp ~]#
omreport storage vdisk controller=0
omreport storage vdisk controller=1 # for hepcmsdev-1 and r720-datanfs virtual disks
omreport storage pdisk controller=0 # physical disk status - virtual disks only exist on the RAIDed machines
Open up firefox from an interactive node on the cluster
Put in https://r510-0-11.privnet:1311 (for instance, use the name of your node)
Use the root login and password for that node
Click on Storage, and find the Physical Disks
You can Blink and Unblink specific disks (use process of elimination if your drive is missing).
replaced 0-1-13 on r720-0-1. This is the right small disk on the back.
Cleared badblock on virtual hadoop bad disk
If you can't get to the OMSA web interface you can use the blink script:
[root@hepcms-gridftp ~]# more /data/osg/scripts/BlinkLED.sh
#!/bin/sh
omconfig chassis leds led=identify flash=on
omconfig storage pdisk action=blink controller=1 pdisk=0:2:0
Don't forget to unblink.
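The unblink counterpart is not in the script; a sketch, using the same controller/pdisk as the blink script above (the flash=off form is assumed from the flash=on usage):
omconfig storage pdisk action=unblink controller=1 pdisk=0:2:0
omconfig chassis leds led=identify flash=off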
OMSA COMMANDS
Location of Dell System Summaries is on the hepcms-hn and is in /root/omsa_report
Commands
Example of how to grab report files and retrieve them from "node group" and diff them.
clush -v -w @<node group> --rcopy /root/omsa_chassis_report --dest /root/omsa_reports
clush -v -w @R510 --rcopy /root/omsa_chassis_report --dest /root/omsa_reports
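Then diff two of the retrieved reports; clush --rcopy appends the node name to each copied file, so the filenames look roughly like this (a sketch, node names are examples):
diff /root/omsa_reports/omsa_chassis_report.r510-0-1 /root/omsa_reports/omsa_chassis_report.r510-0-2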
omreport chassis biossetup
omreport chassis firmware
omreport chassis memory
omreport chassis nics
omreport chassis removableflashmedia
omreport system esmlog
omreport system alertaction
omreport storage pdisk controller=0
parted /dev/sda 'print'
omconfig storage controller action=exportlog controller=0
omreport -?
omreport chassis batteries
omreport chassis pwrmanagement
omreport chassis pwrsupplies
omreport system summary
omreport chassis memory
omreport chassis
omreport storage pdisk controller=0
Omconfig Chassis Leds Or Omconfig Mainsystem Leds
Use the omconfig chassis leds or omconfig mainsystem leds command to specify when to flash a chassis fault LED or chassis identification LED. This command also allows you to clear the LED of the system hard drive. The following table displays the valid parameters for the command.