To clarify, the main goal is to gather the code already written across the cluster into a folder named cluster-health in the /data space (I created it recently). However, since different nodes carry different versions of the same code, we need to consolidate them in that one place.
The main objectives fall into these 6 categories:
1- backup of the files on each and every node
2- check the cvmfs, data, home and hadoop mounts
3- check log areas for 100% full disks
4- check for bad usage and kill jobs or programs
5- check certificate expiration
6- omsa health reports
For backing up of the files:
tar_scripts.sh
To check the cvmfs, data, home and hadoop mounts:
Check_Mounting.py, Check_Mounting2.py, Check_Hadoop.py
To check log areas for 100% full disks:
disk_monitor.sh, collect_disk_monitor.sh
To check bad usage and kill jobs or programs:
Kill_Jobs_2.py, kill_email.py, kill_ps.py
To check certificate expiration (I am not sure this one actually puts the health of the cluster at risk):
check_cert_expire.py, check_cert_expire_daily.py, check_cert_expire_daily_new.py, check_cert_expire_new.py
To get the health reports using omsa:
omsa_reports.sh
I gathered all the Python and bash scripts on these nodes in the attached PDF file:
r510-0-1, r520-0-4, r520-0-5, r520-0-6, r520-0-9, r520-0-10, r520-0-11, r720-0-1, r720-0-2, r540-0-20, r540-0-21, compute-0-6, compute-0-7, compute-0-10, hepcms-hn, hepcms-in2, hepcms-in1, hepcms-namenode
I looked at the mounting and backup files across the entire cluster and gathered the final versions of those files in an Excel spreadsheet. I was wondering if you could help me do the same for the files in categories 3, 4 and 5, which are the more complete ones.
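As a sketch of that consolidation step (hypothetical; the version_name/collect helpers, the node subset, and the use of scp are my assumptions, not an existing tool), one way to pull each node's copy of a script into /data/cluster-health for diffing:

```python
#Hypothetical consolidation sketch: copy each node's version of a script into
#/data/cluster-health under a per-node name so the versions can be diffed.
import os
import subprocess

def version_name(script, node, dest="/data/cluster-health"):
    #e.g. /data/cluster-health/disk_monitor.sh.r510-0-1
    return os.path.join(dest, os.path.basename(script) + "." + node)

def collect(nodes, script, dest="/data/cluster-health"):
    #run from the head node; assumes passwordless ssh/scp to each node
    for node in nodes:
        subprocess.call(["scp", "%s:%s" % (node, script), version_name(script, node, dest)])

#example (not run here): collect(["r510-0-1", "r720-0-1"], "/root/cronscripts/disk_monitor.sh")
```

The per-node suffix makes it easy to run diff over the copies and pick the final version.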
# HEADER: This file was autogenerated at Fri Dec 23 15:05:03 -0500 2016 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: puppet every hour at x:09
9 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
# Runs disk_monitor.sh every 5 minutes
*/5 * * * * /root/cronscripts/disk_monitor.sh
# Puppet Name: get_omsa_reports runs at 22:00 daily
0 22 * * * /bin/bash /data/monitoring/scripts/run_omreport.sh
# Cleans CVMFS cache when it reaches 75% or above runs every 12 hours
0 */12 * * * /usr/bin/env /root/cronscripts/clean_cvmfs_cache.py
# gmetric cpu temp script every 5 minutes
*/5 * * * * /etc/ganglia/sensors_gmetric.sh > /dev/null 2>&1
#!/usr/bin/python
'''
This script checks certificate expiration dates in /etc/grid-security (excluding /etc/grid-security/certificates however) and emails admins if any of them are going to expire within 1-30 days. NOTE: This script must be run as someone who has access to these certificates for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins and directory containing certificates
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
GRID_SECURITY_DIRS = ['/etc/grid-security/']
#Produces a list of directories in /etc/grid-security/ to search through
ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
if not os.path.isdir("/etc/grid-security"):
    quit()
os.chdir("/etc/grid-security")
o0 = subprocess.Popen("find . -type d | grep -v './certificates'", stdout=subprocess.PIPE, shell=True)
o0_text = str(o0.communicate())
dirs = o0_text.split(r"\n")
del dirs[0]
del dirs[-1]
for dir in dirs:
    GRID_SECURITY_DIRS.append("/etc/grid-security" + dir[1:] + "/")
#Creates message string, a list of certificates that will expire soon, and a dictionary for use later
msg = ""
certs_expiring = []
day = {}
#Produces list of .pem files in directories of GRID_SECURITY_DIRS
for i in range(0, len(GRID_SECURITY_DIRS)):
    os.chdir(GRID_SECURITY_DIRS[i])
    o1 = subprocess.Popen("ls | grep .pem | grep -v .pem-old | grep -v empty.pem", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    files = o1_text.split(r"\n")
    files[0] = files[0][2:]
    del files[-1]
    #Checks which certificates will expire in 1-30 day(s); compiles a message containing the expiring certificates and their expiration dates
    for file in files:
        for x in range(1,31):
            o_day = subprocess.Popen("openssl x509 -checkend %d -noout -in %s%s ; echo $?" % ((x*86400), GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
            day_list = str(o_day.communicate()).split(r"\n")
            day_list[0] = day_list[0][2:]
            del day_list[-1]
            day["day%d" % x] = day_list
        o_exp = subprocess.Popen("openssl x509 -enddate -noout -in %s%s" % (GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
        exp_date = str(o_exp.communicate()).split(r"\n")
        exp_date = exp_date[0][11:]
        if exp_date == "" :
            msg += "check_cert_expire.py does not have access to %s%s, and its expiration date cannot be determined. \n" % (GRID_SECURITY_DIRS[i], file)
            break
        for y in range(1,31):
            if day["day%d" % y] == ["1"]:
                msg += "%s%s will expire within %d day(s). Its expiration date is: %s \n" % (GRID_SECURITY_DIRS[i], file, y, exp_date)
                certs_expiring.append(file)
                break
#Emails admins if any certificates are nearing expiration
if msg:
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Certificates %s are nearing expiration." % " ".join(certs_expiring)
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
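A possible simplification for a future version (my sketch, not part of check_cert_expire.py; written in Python 3, while the deployed scripts are Python 2): rather than invoking "openssl x509 -checkend" up to 30 times per certificate, parse the "-enddate" output once and compute the days remaining directly. The sample notAfter line is invented.

```python
#Sketch: compute days-to-expiry from one "openssl x509 -enddate -noout" line.
from datetime import datetime, timezone

def days_until_expiry(enddate_line, now=None):
    #enddate_line looks like: "notAfter=Dec 23 15:05:03 2026 GMT"
    expires = datetime.strptime(enddate_line.split("=", 1)[1],
                                "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days

#a warning would then fire when 0 <= days_until_expiry(line) < 30
```

One openssl call per certificate instead of thirty, and the warning window stays a plain integer comparison.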
This script checks the status of nodes and emails admins about any critical nodes.
#!/usr/bin/python
#Emails admins about critical nodes
#Import packages
from os import walk
from smtplib import SMTP
#Function that creates a dictionary from a set of data
def make_dict(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[0].strip() : pair[1].strip('\n').strip()})
    return d
#Function that creates a dictionary from a set of data, this one is reversed/mirrored
def make_dict_rev(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[1].strip('\n').strip() : pair[0].strip()})
    return d
#Define variables
REPORT_DIR = '/data/monitoring/omsa_reports'
msg = ""
admin_emails = ['jrtaylor95@gmail.com']
critical_nodes = []
node_name = ''
msg_part = ''
#Searches for critical nodes, compiles a list of these nodes into a msg
for (dirpath, dirname, filename) in walk(REPORT_DIR):
    if filename:
        for file in filename:
            with open(dirpath + "/" + file, "r", 0) as fo:
                lines = fo.readlines()
            for i, line in enumerate(lines):
                if 'Critical' in line:
                    if 'pdisk' in file:
                        d = make_dict(lines[i-1:i+38])
                        node_name = file[:-11]
                        msg_part = '%s\n' % node_name
                        for key in ['Name', 'Status', 'Failure Predicted']:
                            msg_part += '\t%-20s\t%-20s\n' % (key, d[key])
                    elif 'vdisk' in file:
                        d = make_dict(lines[i-1:i+17])
                        node_name = file[:-11]
                        msg_part = '%s\n' % node_name
                        for key in ['Name', 'Status', 'Device Name']:
                            msg_part += '\t%-20s\t%-20s\n' % (key, d[key])
                    else:
                        d = make_dict_rev(lines[5:-3])
                        node_name = file[:-4]
                        msg_part = '%s\n' % node_name
                        for k, v in d.items():
                            if 'Critical' in v:
                                msg_part += '\t%-20s\t%-20s\n' % (k, v)
                    critical_nodes.append(node_name)
                    msg += '%s\n' % msg_part
#Emails admins about any critical nodes
if msg:
    critical_nodes = list(set(critical_nodes))
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admin_emails)
    subject = "WARNING: Node(s) %s are critical" % " ".join(critical_nodes)
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    #print email
    server = SMTP('localhost')
    for addr in admin_emails:
        server.sendmail(from_addr, addr, email)
    server.quit()
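For reference, this is how the make_dict helper above behaves on omreport-style "Key : Value" lines. The demo is self-contained (so make_dict is repeated here) and the sample lines are invented, not real omreport output.

```python
#Self-contained demo of the make_dict helper used by the critical-node report.
def make_dict(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[0].strip(): pair[1].strip('\n').strip()})
    return d

sample = ["Status : Critical\n", "Failure Predicted : Yes\n"]
print(make_dict(sample))   # {'Status': 'Critical', 'Failure Predicted': 'Yes'}
```

Splitting on ' :' rather than ':' keeps values like disk IDs ("0:0:3") intact.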
This script wipes the cvmfs cache when its usage is greater than 75%.
#!/usr/bin/env python
#cleans cvmfs cache and temp scratch if they are almost full
#Import modules
import subprocess
import re
#Checks disk usage, wipes cache if >75% is used up
output = subprocess.Popen(['df', '-h'], stdout=subprocess.PIPE)
#finds line usage % is on and compares value to 75%
for line in iter(output.stdout.readline, ''):
    if "cvmfs" in line:
        percent = line.split()[4]
        percent = int(percent.strip("%"))
        if percent > 75:
            subprocess.call(["cvmfs_config", "wipecache"])
        break
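The threshold test above boils down to pulling the Use% column out of one "df -h" line. As a standalone helper (the sample line is invented):

```python
#Parse the Use% column (field 5) from a single "df -h" output line.
def used_percent(df_line):
    return int(df_line.split()[4].rstrip("%"))

line = "cvmfs2  20G   16G  4.0G  80% /cvmfs/cms.cern.ch"
print(used_percent(line))   # 80, so a 75% threshold would trigger a wipe
```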
This script cleans hadoop logs when usage is greater than a user-specified value, which defaults to 95%.
#!/usr/bin/env python
#cleans hadoop logs when /scratch is almost full
import subprocess, socket, sys, getopt
#sets options for running this script
def parse_args( argv ):
    try:
        opts, args = getopt.getopt(argv, 't:h', ['threshold=', 'help'])
    except getopt.GetoptError:
        print 'clean_hadoop_logs.py [options <argument>]\ntry clean_hadoop_logs.py -h | --help for more information'
        sys.exit(2)
    output = None
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print 'Usage: clean_hadoop_logs.py [options <argument>]\n\t-t, --threshold\tSet threshold for log deletion \n\t-h, --help\tDisplay help dialog'
            sys.exit(2)
        elif opt in ('-t', '--threshold'):
            output = arg
    return output
#Checks disk usage, cleans logs if usage is greater than 95% or user specified amount
def main( argv ):
    arg = parse_args(argv)
    hostname = socket.gethostname()
    output = subprocess.Popen(['df', '-h'], stdout=subprocess.PIPE)
    if arg:
        threshold = int(arg)   #cast so the comparison below is numeric
    else:
        threshold = 95
    for line in iter(output.stdout.readline, ''):
        if "scratch" in line:
            percent = line.split()[4]
            percent = int(percent.strip("%"))
            if percent > threshold:
                subprocess.call(['python', '/data/osg/scripts/pyCleanupHadoopLogs.py', '-k', '15', '-s', '$(' + hostname + ').log', '--dir', '/scratch/hadoop/hadoop-hdfs/'])
            break
#Runs script if being called directly
if __name__ == "__main__":
    main(sys.argv[1:])
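To show how the getopt handling in parse_args behaves, here is a minimal, testable sketch of just the threshold option (same option spec as above; the int cast is mine, so the returned value is always numeric):

```python
#Minimal demo of the '-t/--threshold' parsing used by clean_hadoop_logs.py.
import getopt

def parse_threshold(argv, default=95):
    opts, args = getopt.getopt(argv, 't:h', ['threshold=', 'help'])
    for opt, arg in opts:
        if opt in ('-t', '--threshold'):
            return int(arg)   #cast so callers can compare it to an integer percent
    return default

print(parse_threshold(['-t', '90']))   # 90
print(parse_threshold([]))             # 95
```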
#!/bin/bash
NODE_NAME=${HOSTNAME%%.*}
#Sets node type of NODE_NAME
if [[ "$NODE_NAME" =~ "in" ]]; then
    NODE_TYPE="interactive"
elif [[ "$NODE_NAME" =~ "compute" ]]; then
    NODE_TYPE="Compute"
elif [[ "$NODE_NAME" =~ "r510" ]]; then
    NODE_TYPE="R510"
elif [[ "$NODE_NAME" =~ "r720" ]]; then
    NODE_TYPE="R720"
elif [[ "$NODE_NAME" =~ "gridftp" ]]; then
    NODE_TYPE="GRIDFTP"
elif [[ "$NODE_NAME" =~ "hn" ]]; then
    NODE_TYPE="HN"
else
    NODE_TYPE="OTHER"
fi
OMSA_REPORTS_FOLDER="/data/monitoring/omsa_reports"
SYSTEM_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/system/$NODE_TYPE"
CHASSIS_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/chassis/$NODE_TYPE"
STORAGE_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/storage/$NODE_TYPE"
#makes directories if they do not exist, and outputs .txt file corresponding to NODE_NAME
if [ ! -d "$SYSTEM_OUTPUT_FOLDER" ]; then
    mkdir -p "$SYSTEM_OUTPUT_FOLDER"
fi
omreport system summary -outc "$SYSTEM_OUTPUT_FOLDER/$NODE_NAME.txt"
if [ ! -d "$CHASSIS_OUTPUT_FOLDER" ]; then
    mkdir -p "$CHASSIS_OUTPUT_FOLDER"
fi
omreport chassis -outc "$CHASSIS_OUTPUT_FOLDER/$NODE_NAME.txt"
if [ ! -d "$STORAGE_OUTPUT_FOLDER" ]; then
    mkdir -p "$STORAGE_OUTPUT_FOLDER"
fi
#outputs .txt files corresponding to specified node, deletes them if the exit status indicates an error
omreport storage pdisk controller=0 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-pdisk0.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-pdisk0.txt"
fi
omreport storage vdisk controller=0 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk0.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk0.txt"
fi
omreport storage vdisk controller=1 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk1.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk1.txt"
fi
#!/usr/bin/env python
'''
This script checks the usage of /tmp and /scratch, and wipes these directories if usage exceeds 50%. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /tmp and /scratch
tmp_list = []
scratch_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/tmp":
        tmp_list = usage_array[e]
    if usage_array[e][5] == "/scratch":
        scratch_list = usage_array[e]
#Deletes files in /tmp and/or /scratch if usage exceeds 50%
if tmp_list != [] and int(tmp_list[4].strip("%")) > 50:
    o2 = subprocess.Popen("rm -r /tmp/*", stdout=subprocess.PIPE, shell=True)
if scratch_list != [] and int(scratch_list[4].strip("%")) > 50:
    o2 = subprocess.Popen("rm -r /scratch/*", stdout=subprocess.PIPE, shell=True)
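Several of these scripts repeat the same "df -h" cleanup: when a long filesystem name wraps onto its own line, the one-token line must be merged with the five-token line that follows it. As one standalone function (the sample lines are invented):

```python
#Join "df -h" lines where a long filesystem name wrapped onto its own line.
def merge_wrapped(lines):
    merged = []
    i = 0
    while i < len(lines):
        if (len(lines[i].split()) == 1 and i + 1 < len(lines)
                and len(lines[i+1].split()) == 5):
            merged.append(lines[i].strip() + " " + lines[i+1].strip())
            i += 2
        else:
            merged.append(lines[i].strip())
            i += 1
    return merged

wrapped = ["hadoop-fuse-dfs",
           "  99T 80T 19T 81% /mnt/hadoop",
           "/dev/sda1 50G 10G 40G 20% /"]
print(merge_wrapped(wrapped))
```

Factoring this out would let the /tmp, /scratch, /mnt/hadoop, and mount-check scripts share one tested helper instead of four copies of the loop.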
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and kills those that have exceeded 70% CPU usage for 30 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
from pwd import getpwuid
to_kill = []
notify = []
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 30 minutes or more.
    #The if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 30 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it kills the jobs in to_kill
if to_kill == []:
    quit()
else:
    for g in range(0, len(to_kill)):
        o2 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
        o3 = subprocess.Popen("echo 'Dear user, you were running a CPU intensive job for an extended period of time on the cluster. Due to usage policies, this job has been terminated. Please use HTCondor in order to run this job without issues. Thank you for your compliance.' | write %s " % notify[g] , stdout=subprocess.PIPE, shell=True)
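The int(ps_array[f][9][:-3]) test above strips the seconds from ps's TIME column ("MM:SS" under "ps u"). A slightly more defensive helper (my sketch; it also tolerates an "HH:MM:SS" form, which the bare slice would not parse):

```python
#Convert a ps TIME field ("MM:SS", or "HH:MM:SS") to whole minutes.
def minutes_running(time_field):
    parts = [int(p) for p in time_field.split(":")]
    if len(parts) == 2:          # MM:SS
        return parts[0]
    hours, mins, _secs = parts   # HH:MM:SS
    return hours * 60 + mins

print(minutes_running("42:17"))     # 42
print(minutes_running("1:05:09"))   # 65
```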
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and kills those that have exceeded 70% CPU usage for 1 hour or more. It also warns the user by email about their intensive job for those exceeding 70% CPU usage for 20 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
import smtplib
from pwd import getpwuid
to_kill = []
notify = []
time_running = []
CPU_usage = []
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
send_to = []
user_email = ""
o2_text = ""
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 20 minutes or more.
    #The if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 20 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
            CPU_usage.append(ps_array[f][2])
            time_running.append(ps_array[f][9])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it kills the jobs in to_kill running for times greater than 1 hour and emails admins and the users in the notify array.
if to_kill == []:
    quit()
else:
    for g in range(0, len(to_kill)):
        #Used to obtain user's email if it is in hepcms_Users.csv
        o2 = subprocess.Popen("python /home/hon-martyn/scripts/cronscripts/parseUsers.py -m %s /home/hon-martyn/scripts/cronscripts/hepcms_Users.csv" % notify[g], stdout=subprocess.PIPE, shell=True)
        o2_text = str(o2.communicate())
        #If the user's email is not in hepcms_Users.csv, the job is killed without notifying the user; otherwise, the user is notified.
        if "Unknown user" in o2_text:
            if int(time_running[g][:-3]) >= 60:
                o3 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
        else:
            user_email = o2_text.split(r"\n")[-3]
            #If the job has been running for more than 1 hour, it is killed. Otherwise, the user is only emailed a warning.
            if int(time_running[g][:-3]) >= 60:
                o4 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
                ohost1 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
                host = str(ohost1.communicate())[2:-10]
                msg = "Dear %s, \n \n You were running an intensive job on %s for an extended period of time. Due to cluster policies, this job has been terminated. Please consider using HTCondor to run this job, or contact Tier 3 for more information. Thank you. \n \n Statistics summary (user, job number, percent CPU usage, time running): \n " % (notify[g], host)
                msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
                from_addr = 'root@hepcms-hn.umd.edu'
                to_addr1 = ", ".join(admins) + ", " + user_email
                send_to = admins + [user_email]   #copy admins so the list itself is not modified across iterations
                subject = "Note about Intensive Job Being Run on Cluster"
                header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr1, subject)
                email = header + msg
                server = smtplib.SMTP('localhost')
                for addr in send_to:
                    server.sendmail(from_addr, addr, email)
                server.quit()
            else:
                ohost2 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
                host = str(ohost2.communicate())[2:-10]
                msg = "Dear %s, \n \n You are currently running an intensive job on %s. Due to cluster policies, this job will be terminated if it continues to run for more than 1 hour. You will receive two warning emails before termination. Please consider using HTCondor to run this job, or contact Tier 3 for more information. Thank you. \n \n Statistics summary (user, job number, percent CPU usage, time running): \n " % (notify[g], host)
                msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
                from_addr = 'root@hepcms-hn.umd.edu'
                to_addr2 = ", ".join(admins) + ", " + user_email
                send_to = admins + [user_email]   #copy admins so the list itself is not modified across iterations
                subject = "Note about Intensive Job Being Run on Cluster"
                header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr2, subject)
                email = header + msg
                server = smtplib.SMTP('localhost')
                for addr in send_to:
                    server.sendmail(from_addr, addr, email)
                server.quit()
#Separately emails admins about all intensive jobs
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
ohost3 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
host = str(ohost3.communicate())[2:-10]
msg = "Intensive jobs being run on " + host + " (user, job number, % CPU usage, time running): \n "
for g in range(0, len(to_kill)):
    msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
from_addr = 'root@hepcms-hn.umd.edu'
to_addr3 = ", ".join(admins)
subject = "WARNING: Intensive Jobs Being Run on Cluster"
header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr3, subject)
email = header + msg
server = smtplib.SMTP('localhost')
for addr in admins:
    server.sendmail(from_addr, addr, email)
server.quit()
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and emails admins about those that have exceeded 70% CPU usage for 30 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
import smtplib
from pwd import getpwuid
to_kill = []
notify = []
CPU_usage = []
time_running = []
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 30 minutes or more.
    #The first if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 30 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
            CPU_usage.append(ps_array[f][2])
            time_running.append(ps_array[f][9])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it emails admins about the intensive jobs
if to_kill == []:
    quit()
ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
host = str(ohost.communicate())[2:-10]
msg = "Intensive jobs being run on " + host + " (user, job number, % CPU usage, time running): \n "
for g in range(0, len(to_kill)):
    msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
from_addr = 'root@hepcms-hn.umd.edu'
to_addr = ", ".join(admins)
subject = "WARNING: Intensive Jobs Being Run on Cluster"
header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
email = header + msg
server = smtplib.SMTP('localhost')
for addr in admins:
    server.sendmail(from_addr, addr, email)
server.quit()
#!/usr/bin/env python
'''
This script checks the usage of /tmp and notifies admins when it is 90% full or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /tmp
tmp_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/tmp":
        tmp_list = usage_array[e]
#print tmp_list
#print usage_array
#Sends email to admins if /tmp usage exceeds 90%
if tmp_list != [] and int(tmp_list[4].strip("%")) > 90:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + "/tmp is " + str(tmp_list[4].strip("%")) + "% full."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: /tmp usage is high."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
check_cert_expire_new.py
#!/usr/bin/python
'''
This script checks certificate expiration dates in /data/site_conf/certs (excluding /data/site_conf/certs/old and any .pem file with "key" in its name, however) and emails admins if any of them are going to expire within 1-30 days. NOTE: This script must be run as someone who has access to these certificates (i.e. sudo) for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins and directory containing certificates
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
GRID_SECURITY_DIRS = ['/data/site_conf/certs/'] #Be sure to have a '/' at the end of any directory in this list
#Checks if /data/site_conf/certs exists
if not os.path.isdir("/data/site_conf/certs"):
    quit()
#Creates message string, a list of certificates that will expire soon, and a dictionary for use later
msg = ""
certs_expiring = []
day = {}
#Produces list of .pem files in directories of GRID_SECURITY_DIRS
for i in range(0, len(GRID_SECURITY_DIRS)):
    os.chdir(GRID_SECURITY_DIRS[i])
    o1 = subprocess.Popen("ls | grep .pem | grep -v .pem-old | grep -v empty.pem | grep -v key", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    files = o1_text.split(r"\n")
    files[0] = files[0][2:]
    del files[-1]
    #Checks which certificates will expire in 1-30 day(s); compiles a message containing the expiring certificates and their expiration dates
    for file in files:
        for x in range(1,31):
            o_day = subprocess.Popen("openssl x509 -checkend %d -noout -in %s%s ; echo $?" % ((x*86400), GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
            day_list = str(o_day.communicate()).split(r"\n")
            day_list[0] = day_list[0][2:]
            del day_list[-1]
            day["day%d" % x] = day_list
        o_exp = subprocess.Popen("openssl x509 -enddate -noout -in %s%s" % (GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
        exp_date = str(o_exp.communicate()).split(r"\n")
        exp_date = exp_date[0][11:]
        if exp_date == "" :
            msg += "check_cert_expire.py does not have access to %s%s, and its expiration date cannot be determined. \n" % (GRID_SECURITY_DIRS[i], file)
            break
        for y in range(1,31):
            if day["day%d" % y] == ["1"]:
                msg += "%s%s will expire within %d day(s). Its expiration date is: %s \n" % (GRID_SECURITY_DIRS[i], file, y, exp_date)
                certs_expiring.append(file)
                break
#Emails admins if any certificates are nearing expiration
if msg:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Certificates nearing expiration."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
#!/usr/bin/env python
'''
This script checks the usage of /mnt/hadoop and notifies admins when it is 80% full or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /mnt/hadoop
hadoop_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/mnt/hadoop":
        hadoop_list = usage_array[e]
#print to_del
#print hadoop_list
#print usage_array
#Sends email to admins if /mnt/hadoop usage exceeds 80%
if hadoop_list != [] and int(hadoop_list[4].strip("%")) > 80:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + "/mnt/hadoop is " + str(hadoop_list[4].strip("%")) + "% full. Please consider deleting or removing files."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: /mnt/hadoop usage is high"
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
Check_Hadoop.py
#!/usr/bin/python
'''
This script checks if Hadoop is running properly and emails admins if it is not. NOTE: This script must be run on a node where hadoop is set up (i.e. r510) and run by someone who has root access (i.e. sudo) for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs command to check if hadoop is running properly
o1 = subprocess.Popen("service hadoop-hdfs-datanode status", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
#Emails admins if hadoop is not running properly
if 'Hadoop datanode is running' not in o1_text:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n Hadoop is not running properly. \' service hadoop-hdfs-datanode status \' did not return its expected output."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Hadoop is not running properly."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
Check_Mounting.py
(Note: /root/cronscripts/tar_script.sh resides on all nodes and is run from each node's local crontab.)
#!/usr/bin/env python
'''
This script checks if /home, /hadoop, /data, and /cvmfs are properly mounted on a node. If cvmfs is not mounted, 'ls' and
'clean_cvmfs_cache.py' are run to try to mount it. If any of these directories are not mounted, an email is sent to the admins.
NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib

admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']

#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())

#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list) - 1):  # -1 avoids an IndexError when the last line is a lone filesystem name
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]

#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))

#Creates list corresponding to data usage
directory_list = ["/home", "/hadoop", "/data", "/cvmfs"]
directory_mounted = [0, 0, 0, 0]
for e in range(0, len(directory_list)):
    for f in range(0, len(usage_list)+1):
        if directory_list[e] in usage_array[f][5]:
            directory_mounted[e] = 1

#If cvmfs is not mounted, ls is run in an attempt to mount cvmfs, and df -h is used again to check if cvmfs has been mounted
if directory_mounted[3] == 0:
    o2 = subprocess.Popen("ls", stdout=subprocess.PIPE, shell=True)
    o3 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
    o3_text = str(o3.communicate())
    usage_list = o3_text.split(r"\n")
    del usage_list[0]
    del usage_list[-1]
    for a in range(0, len(usage_list)):
        usage_list[a] = " ".join(usage_list[a].split())
    #Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
    to_del = []
    for b in range(0, len(usage_list) - 1):
        if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
            usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
            to_del.append(b)
    for c in range(0, len(to_del)):
        del usage_list[to_del[c]-c]
    #Creates array out of output list
    usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
    for d in range(0, len(usage_list)):
        usage_array.append(usage_list[d].split(" "))
    #Checks if cvmfs has been mounted
    for h in range(0, len(usage_list)+1):
        if directory_list[3] in usage_array[h][5]:
            directory_mounted[3] = 1

#If cvmfs is still not mounted, clean_cvmfs_cache.py is run in an attempt to mount cvmfs, and df -h is used again to check if cvmfs has been mounted
if directory_mounted[3] == 0:
    o2 = subprocess.Popen("python /root/cronscripts/clean_cvmfs_cache.py", stdout=subprocess.PIPE, shell=True)
    o3 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
    o3_text = str(o3.communicate())
    usage_list = o3_text.split(r"\n")
    del usage_list[0]
    del usage_list[-1]
    for a in range(0, len(usage_list)):
        usage_list[a] = " ".join(usage_list[a].split())
    #Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
    to_del = []
    for b in range(0, len(usage_list) - 1):
        if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
            usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
            to_del.append(b)
    for c in range(0, len(to_del)):
        del usage_list[to_del[c]-c]
    #Creates array out of output list
    usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
    for d in range(0, len(usage_list)):
        usage_array.append(usage_list[d].split(" "))
    #Checks if cvmfs has been mounted
    for h in range(0, len(usage_list)+1):
        if directory_list[3] in usage_array[h][5]:
            directory_mounted[3] = 1

#Message compiled corresponding to unmounted storage devices
msg = ""
for i in range(0, len(directory_mounted)):
    if directory_mounted[i] == 0:
        msg = msg + directory_list[i] + " is not mounted properly. \n"

#Sends email to admins if any storage devices are not mounted properly
if msg != "":
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: data directories are not properly mounted"
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
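The `df -h` parsing above has to repair wrapped lines and is repeated three times; the core mount check could be done with `os.path.ismount`. A sketch, with the caveat that for autofs-managed /cvmfs this only sees repositories that have already been triggered, so the ls/clean_cvmfs_cache retry steps from the script would still be needed:

```python
import os

def unmounted(paths=("/home", "/hadoop", "/data", "/cvmfs")):
    """Return the subset of `paths` that are not currently active mount
    points (nonexistent paths also count as not mounted)."""
    return [p for p in paths if not os.path.ismount(p)]

# Usage: any non-empty result would be formatted into the warning email.
# unmounted() -> e.g. ["/cvmfs"] if only /cvmfs is missing
```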
For backup of /root, /var, and /etc: hepcms-hn:/root/cronscripts/backup_systemfiles_allnodes.sh exists but
has not been put into use yet.
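To collect the per-node copies of these scripts into /data/cluster-health before comparing versions, something like the following could be cron'd or run by hand on the head node. This is only a sketch: it assumes passwordless root ssh to each node, rsync on both ends, and a per-node subfolder layout under /data/cluster-health (that layout is my invention); the node list should be extended to the full set above.

```python
import subprocess

NODES = ["r510-0-1", "r520-0-4", "hepcms-in1"]  # extend to the full node list
DEST = "/data/cluster-health"  # the recently created collection area

def gather_cmd(node, src="/root/cronscripts/", dest=DEST):
    """Build the rsync command that pulls one node's script area into a
    per-node subfolder, so differing versions can be compared side by side."""
    return ["rsync", "-a", "root@%s:%s" % (node, src), "%s/%s/" % (dest, node)]

# Actual collection run (commented out; requires ssh access to the nodes):
# for node in NODES:
#     subprocess.call(gather_cmd(node))
```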
# HEADER: This file was autogenerated at Fri Jan 29 23:29:33 -0500 2016 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# CRONTAB FILE LOCATION: /var/spool/cron/root
# edit via: env EDITOR='emacs -nw' crontab -e
# (or just edit the file directly)
#
# crontab syntax:
# <minute> <hour> <day (numerical)> <month (numerical)> <day of week (Sunday=0)> job
#
# 1:15am daily backup of the etc on the head node to /data
#15 01 * * * /root/scripts/backup_headNode.sh >& /root/LogScripts/DailyEtcBackup_$(date +\%Y\%m\%d).log 2>&1
# 1:30am rsync of /export/home on the head node to /DataCampusBackup
30 01 * * 0 /root/cronscripts/rsync_home.sh &
# tars /etc, /var, and /root. Works with backup_headNode.sh
0 0 * * * /root/cronscripts/tar_script.sh
# Puppet Name: puppet
6 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
# Yearly reminder to change passwords, for those who've had accounts for more than a year.
#@yearly python /root/cronscripts/SendMail.py -a -s "Time to update hepcms passwords" -g 365 -m /root/cronscripts/messages/UpdatePasswords.txt
# cron script to post condor status on web for ease of monitoring, every 10 minutes
1,11,21,31,41,51 * * * * /root/condor-status-script.sh
*/5 * * * * /root/cronscripts/collect_disk_monitor.sh
# This retrieves the status of PhEDEx
*/5 * * * * /usr/bin/python /root/cronscripts/phedex_scraper.py
# Checks omreports and emails admins if there is a critical report
10 22 * * * /usr/bin/python /root/cronscripts/check_omreports.py
*/20 * * * * /root/cronscripts/EnoPriority.sh
*/20 * * * * /root/cronscripts/yhshinPriority.sh
# Runs omreport on this node for system, disks, and chassis
0 22 * * * /bin/bash /data/monitoring/scripts/run_omreport.sh
# Runs temperature gmetric script
*/5 * * * * /etc/ganglia/sensors_gmetric.sh 2>&1 /dev/null
# Pings all nodes and emails admins if nodes are unreachable
0 5 * * * /root/cronscripts/ping_nodes.py
# Cleans CVMFS cache when it reaches 75% or above
0 0 * * 2 /usr/bin/env /root/cronscripts/clean_cvmfs_cache.py
0 0 */3 * * python /root/cronscripts/check_cert_expire_new.py
0 0 */1 * * python /root/cronscripts/check_cert_expire_daily_new.py
# HEADER: This file was autogenerated at Thu Dec 17 20:38:13 -0500 2015 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: puppet
7 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
0 23 */3 * * python /root/Check_Mounting.py
# tars /etc, /root, and /var
0 0 * * * /root/cronscripts/tar_script.sh
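Since the same scripts (tar_script.sh, Check_Mounting.py, etc.) appear in several nodes' crontabs, the "which copy is the final version" question in categories 3, 4, and 5 can be narrowed down by hashing the collected copies. A sketch, assuming the copies have been gathered into per-node subfolders under one root (that layout is an assumption, not something the cluster already has):

```python
import hashlib
import os

def version_map(root, filename):
    """Map content digest -> sorted list of nodes whose copy of `filename`
    has that digest, given per-node subfolders under `root`. One key in the
    result means all nodes agree; more keys mean divergent versions to diff."""
    versions = {}
    for node in sorted(os.listdir(root)):
        path = os.path.join(root, node, filename)
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            versions.setdefault(digest, []).append(node)
    return versions

# Usage (hypothetical layout): version_map("/data/cluster-health", "disk_monitor.sh")
```

Any digest group with a single node is the likely outlier to inspect with diff.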