To clarify, the main goal is to gather the code already written across the cluster into a folder named cluster-health in the /data space (I created it recently). However, since different nodes carry different versions of the same code, we need to consolidate them in that one place.
The main objectives fall into these 6 categories:
1- backup of the files on each and every node
2- check the cvmfs, data, home and hadoop mounts
3- check log areas for 100% full disks
4- check for bad usage and kill jobs or programs
5- check certificate expiration
6- omsa health reports
For backing up of the files:
tar_scripts.sh
To check the cvmfs, data, home and hadoop mounts:
Check_Mounting.py, Check_Mounting2.py, Check_Hadoop.py
To check log areas for 100% full disks:
disk_monitor.sh, collect_disk_monitor.sh
To check bad usage and kill jobs or programs:
Kill_Jobs_2.py, kill_email.py, kill_ps.py
To check certificate expiration (I am not sure this one actually puts the health of the cluster at risk):
check_cert_expire.py, check_cert_expire_daily.py, check_cert_expire_daily_new.py, check_cert_expire_new.py
To get the health reports using omsa:
omsa_reports.sh
I gathered all the Python and bash scripts on these nodes in the attached PDF file:
r510-0-1, r520-0-4, r520-0-5, r520-0-6, r520-0-9, r520-0-10, r520-0-11, r720-0-1, r720-0-2, r540-0-20, r540-0-21, compute-0-6, compute-0-7, compute-0-10, hepcms-hn, hepcms-in2, hepcms-in1, hepcms-namenode
I looked at the mounting and backup files across the entire cluster and gathered the final versions of those files in an Excel spreadsheet. I was wondering if you could help me do the same for the files in categories 3, 4 and 5, which are the more complete ones.
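As a sketch of that consolidation step (hypothetical; the version_name/collect helpers, the node subset, and the use of scp are my assumptions, not an existing tool), one way to pull each node's copy of a script into /data/cluster-health for diffing:

```python
#Hypothetical consolidation sketch: copy each node's version of a script into
#/data/cluster-health under a per-node name so the versions can be diffed.
import os
import subprocess

def version_name(script, node, dest="/data/cluster-health"):
    #e.g. /data/cluster-health/disk_monitor.sh.r510-0-1
    return os.path.join(dest, os.path.basename(script) + "." + node)

def collect(nodes, script, dest="/data/cluster-health"):
    #run from the head node; assumes passwordless ssh/scp to each node
    for node in nodes:
        subprocess.call(["scp", "%s:%s" % (node, script), version_name(script, node, dest)])

#example (not run here): collect(["r510-0-1", "r720-0-1"], "/root/cronscripts/disk_monitor.sh")
```

The per-node suffix makes it easy to run diff over the copies and pick the final version.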
# HEADER: This file was autogenerated at Fri Dec 23 15:05:03 -0500 2016 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: puppet every hour at x:09
9 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
# Runs disk_monitor.sh every 5 minutes
*/5 * * * * /root/cronscripts/disk_monitor.sh
# Puppet Name: get_omsa_reports runs at 22:00 daily
0 22 * * * /bin/bash /data/monitoring/scripts/run_omreport.sh
# Cleans CVMFS cache when it reaches 75% or above runs every 12 hours
0 */12 * * * /usr/bin/env /root/cronscripts/clean_cvmfs_cache.py
# gmetric cpu temp script every 5 minutes
*/5 * * * * /etc/ganglia/sensors_gmetric.sh > /dev/null 2>&1
#!/usr/bin/python
'''
This script checks certificate expiration dates in /etc/grid-security (excluding /etc/grid-security/certificates however) and emails admins if any of them are going to expire within 1-30 days. NOTE: This script must be run as someone who has access to these certificates for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins and directory containing certificates
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
GRID_SECURITY_DIRS = ['/etc/grid-security/']
#Produces a list of directories in /etc/grid-security/ to search through
ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
if not os.path.isdir("/etc/grid-security"):
    quit()
os.chdir("/etc/grid-security")
o0 = subprocess.Popen("find . -type d | grep -v './certificates'", stdout=subprocess.PIPE, shell=True)
o0_text = str(o0.communicate())
dirs = o0_text.split(r"\n")
del dirs[0]
del dirs[-1]
for dir in dirs:
    GRID_SECURITY_DIRS.append("/etc/grid-security" + dir[1:] + "/")
#Creates message string, a list of certificates that will expire soon, and a dictionary for use later
msg = ""
certs_expiring = []
day = {}
#Produces list of .pem files in directories of GRID_SECURITY_DIRS
for i in range(0, len(GRID_SECURITY_DIRS)):
    os.chdir(GRID_SECURITY_DIRS[i])
    o1 = subprocess.Popen("ls | grep .pem | grep -v .pem-old | grep -v empty.pem", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    files = o1_text.split(r"\n")
    files[0] = files[0][2:]
    del files[-1]
    #Checks which certificates will expire in 1-30 day(s); compiles a message containing the expiring certificates and their expiration dates
    for file in files:
        for x in range(1,31):
            o_day = subprocess.Popen("openssl x509 -checkend %d -noout -in %s%s ; echo $?" % ((x*86400), GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
            day_list = str(o_day.communicate()).split(r"\n")
            day_list[0] = day_list[0][2:]
            del day_list[-1]
            day["day%d" % x] = day_list
        o_exp = subprocess.Popen("openssl x509 -enddate -noout -in %s%s" % (GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
        exp_date = str(o_exp.communicate()).split(r"\n")
        exp_date = exp_date[0][11:]
        if exp_date == "" :
            msg += "check_cert_expire.py does not have access to %s%s, and its expiration date cannot be determined. \n" % (GRID_SECURITY_DIRS[i], file)
            break
        for y in range(1,31):
            if day["day%d" % y] == ["1"]:
                msg += "%s%s will expire within %d day(s). Its expiration date is: %s \n" % (GRID_SECURITY_DIRS[i], file, y, exp_date)
                certs_expiring.append(file)
                break
#Emails admins if any certificates are nearing expiration
if msg:
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Certificates %s are nearing expiration." % " ".join(certs_expiring)
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
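A possible simplification for a future version (my sketch, not part of check_cert_expire.py; written in Python 3, while the deployed scripts are Python 2): rather than invoking "openssl x509 -checkend" up to 30 times per certificate, parse the "-enddate" output once and compute the days remaining directly. The sample notAfter line is invented.

```python
#Sketch: compute days-to-expiry from one "openssl x509 -enddate -noout" line.
from datetime import datetime, timezone

def days_until_expiry(enddate_line, now=None):
    #enddate_line looks like: "notAfter=Dec 23 15:05:03 2026 GMT"
    expires = datetime.strptime(enddate_line.split("=", 1)[1],
                                "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days

#a warning would then fire when 0 <= days_until_expiry(line) < 30
```

One openssl call per certificate instead of thirty, and the warning window stays a plain integer comparison.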
This script checks the status of nodes and emails admins about any critical nodes.
#!/usr/bin/python
#Emails admins about critical nodes
#Import packages
from os import walk
from smtplib import SMTP
#Function that creates a dictionary from a set of data
def make_dict(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[0].strip() : pair[1].strip('\n').strip()})
    return d
#Function that creates a dictionary from a set of data, this one is reversed/mirrored
def make_dict_rev(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[1].strip('\n').strip() : pair[0].strip()})
    return d
#Define variables
REPORT_DIR = '/data/monitoring/omsa_reports'
msg = ""
admin_emails = ['jrtaylor95@gmail.com']
critical_nodes = []
node_name = ''
msg_part = ''
#Searches for critical nodes, compiles a list of these nodes into a msg
for (dirpath, dirname, filename) in walk(REPORT_DIR):
    if filename:
        for file in filename:
            with open(dirpath + "/" + file, "r", 0) as fo:
                lines = fo.readlines()
            for i, line in enumerate(lines):
                if 'Critical' in line:
                    if 'pdisk' in file:
                        d = make_dict(lines[i-1:i+38])
                        node_name = file[:-11]
                        msg_part = '%s\n' % node_name
                        for key in ['Name', 'Status', 'Failure Predicted']:
                            msg_part += '\t%-20s\t%-20s\n' % (key, d[key])
                    elif 'vdisk' in file:
                        d = make_dict(lines[i-1:i+17])
                        node_name = file[:-11]
                        msg_part = '%s\n' % node_name
                        for key in ['Name', 'Status', 'Device Name']:
                            msg_part += '\t%-20s\t%-20s\n' % (key, d[key])
                    else:
                        d = make_dict_rev(lines[5:-3])
                        node_name = file[:-4]
                        msg_part = '%s\n' % node_name
                        for k, v in d.items():
                            if 'Critical' in v:
                                msg_part += '\t%-20s\t%-20s\n' % (k, v)
                    critical_nodes.append(node_name)
                    msg += '%s\n' % msg_part
#Emails admins about any critical nodes
if msg:
    critical_nodes = list(set(critical_nodes))
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admin_emails)
    subject = "WARNING: Node(s) %s are critical" % " ".join(critical_nodes)
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    #print email
    server = SMTP('localhost')
    for addr in admin_emails:
        server.sendmail(from_addr, addr, email)
    server.quit()
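For reference, this is how the make_dict helper above behaves on omreport-style "Key : Value" lines. The demo is self-contained (so make_dict is repeated here) and the sample lines are invented, not real omreport output.

```python
#Self-contained demo of the make_dict helper used by the critical-node report.
def make_dict(data):
    d = {}
    for element in data:
        pair = element.split(' :')
        d.update({pair[0].strip(): pair[1].strip('\n').strip()})
    return d

sample = ["Status : Critical\n", "Failure Predicted : Yes\n"]
print(make_dict(sample))   # {'Status': 'Critical', 'Failure Predicted': 'Yes'}
```

Splitting on ' :' rather than ':' keeps values like disk IDs ("0:0:3") intact.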
This script wipes the cvmfs cache when its usage is greater than 75%.
#!/usr/bin/env python
#cleans cvmfs cache and temp scratch if they are almost full
#Import modules
import subprocess
import re
#Checks disk usage, wipes cache if >75% is used up
output = subprocess.Popen(['df', '-h'], stdout=subprocess.PIPE)
#finds line usage % is on and compares value to 75%
for line in iter(output.stdout.readline, ''):
    if "cvmfs" in line:
        percent = line.split()[4]
        percent = int(percent.strip("%"))
        if percent > 75:
            subprocess.call(["cvmfs_config", "wipecache"])
        break
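The threshold test above boils down to pulling the Use% column out of one "df -h" line. As a standalone helper (the sample line is invented):

```python
#Parse the Use% column (field 5) from a single "df -h" output line.
def used_percent(df_line):
    return int(df_line.split()[4].rstrip("%"))

line = "cvmfs2  20G   16G  4.0G  80% /cvmfs/cms.cern.ch"
print(used_percent(line))   # 80, so a 75% threshold would trigger a wipe
```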
This script cleans hadoop logs when usage is greater than a user-specified value, which defaults to 95%.
#!/usr/bin/env python
#cleans hadoop logs when /scratch is almost full
import subprocess, socket, sys, getopt
#sets options for running this script
def parse_args( argv ):
    try:
        opts, args = getopt.getopt(argv, 't:h', ['threshold=', 'help'])
    except getopt.GetoptError:
        print 'clean_hadoop_logs.py [options <argument>]\ntry clean_hadoop_logs.py -h | --help for more information'
        sys.exit(2)
    output = None
    for opt, arg in opts:
        if opt in ('-h', '--help'):
            print 'Usage: clean_hadoop_logs.py [options <argument>]\n\t-t, --threshold\tSet threshold for log deletion \n\t-h, --help\tDisplay help dialog'
            sys.exit(2)
        elif opt in ('-t', '--threshold'):
            output = arg
    return output
#Checks disk usage, cleans logs if usage is greater than 95% or user specified amount
def main( argv ):
    arg = parse_args(argv)
    hostname = socket.gethostname()
    output = subprocess.Popen(['df', '-h'], stdout=subprocess.PIPE)
    if arg:
        threshold = int(arg)   #cast so the comparison below is numeric
    else:
        threshold = 95
    for line in iter(output.stdout.readline, ''):
        if "scratch" in line:
            percent = line.split()[4]
            percent = int(percent.strip("%"))
            if percent > threshold:
                subprocess.call(['python', '/data/osg/scripts/pyCleanupHadoopLogs.py', '-k', '15', '-s', '$(' + hostname + ').log', '--dir', '/scratch/hadoop/hadoop-hdfs/'])
            break
#Runs script if being called directly
if __name__ == "__main__":
    main(sys.argv[1:])
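To show how the getopt handling in parse_args behaves, here is a minimal, testable sketch of just the threshold option (same option spec as above; the int cast is mine, so the returned value is always numeric):

```python
#Minimal demo of the '-t/--threshold' parsing used by clean_hadoop_logs.py.
import getopt

def parse_threshold(argv, default=95):
    opts, args = getopt.getopt(argv, 't:h', ['threshold=', 'help'])
    for opt, arg in opts:
        if opt in ('-t', '--threshold'):
            return int(arg)   #cast so callers can compare it to an integer percent
    return default

print(parse_threshold(['-t', '90']))   # 90
print(parse_threshold([]))             # 95
```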
#!/bin/bash
NODE_NAME=${HOSTNAME%%.*}
#Sets node type of NODE_NAME
if [[ "$NODE_NAME" =~ "in" ]]; then
    NODE_TYPE="interactive"
elif [[ "$NODE_NAME" =~ "compute" ]]; then
    NODE_TYPE="Compute"
elif [[ "$NODE_NAME" =~ "r510" ]]; then
    NODE_TYPE="R510"
elif [[ "$NODE_NAME" =~ "r720" ]]; then
    NODE_TYPE="R720"
elif [[ "$NODE_NAME" =~ "gridftp" ]]; then
    NODE_TYPE="GRIDFTP"
elif [[ "$NODE_NAME" =~ "hn" ]]; then
    NODE_TYPE="HN"
else
    NODE_TYPE="OTHER"
fi
OMSA_REPORTS_FOLDER="/data/monitoring/omsa_reports"
SYSTEM_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/system/$NODE_TYPE"
CHASSIS_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/chassis/$NODE_TYPE"
STORAGE_OUTPUT_FOLDER="$OMSA_REPORTS_FOLDER/storage/$NODE_TYPE"
#makes directories if they do not exist, and outputs .txt file corresponding to NODE_NAME
if [ ! -d "$SYSTEM_OUTPUT_FOLDER" ]; then
    mkdir -p "$SYSTEM_OUTPUT_FOLDER"
fi
omreport system summary -outc "$SYSTEM_OUTPUT_FOLDER/$NODE_NAME.txt"
if [ ! -d "$CHASSIS_OUTPUT_FOLDER" ]; then
    mkdir -p "$CHASSIS_OUTPUT_FOLDER"
fi
omreport chassis -outc "$CHASSIS_OUTPUT_FOLDER/$NODE_NAME.txt"
if [ ! -d "$STORAGE_OUTPUT_FOLDER" ]; then
    mkdir -p "$STORAGE_OUTPUT_FOLDER"
fi
#outputs .txt files corresponding to specified node, deletes them if the exit status indicates an error
omreport storage pdisk controller=0 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-pdisk0.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-pdisk0.txt"
fi
omreport storage vdisk controller=0 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk0.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk0.txt"
fi
omreport storage vdisk controller=1 -outc "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk1.txt"
if [ $? -eq 255 ]; then
    rm -f "$STORAGE_OUTPUT_FOLDER/$NODE_NAME-vdisk1.txt"
fi
#!/usr/bin/env python
'''
This script checks the usage of /tmp and /scratch, and wipes these directories if usage exceeds 50%. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /tmp and /scratch
tmp_list = []
scratch_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/tmp":
        tmp_list = usage_array[e]
    if usage_array[e][5] == "/scratch":
        scratch_list = usage_array[e]
#Deletes files in /tmp and/or /scratch if usage exceeds 50%
if tmp_list != [] and int(tmp_list[4].strip("%")) > 50:
    o2 = subprocess.Popen("rm -r /tmp/*", stdout=subprocess.PIPE, shell=True)
if scratch_list != [] and int(scratch_list[4].strip("%")) > 50:
    o2 = subprocess.Popen("rm -r /scratch/*", stdout=subprocess.PIPE, shell=True)
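Several of these scripts repeat the same "df -h" cleanup: when a long filesystem name wraps onto its own line, the one-token line must be merged with the five-token line that follows it. As one standalone function (the sample lines are invented):

```python
#Join "df -h" lines where a long filesystem name wrapped onto its own line.
def merge_wrapped(lines):
    merged = []
    i = 0
    while i < len(lines):
        if (len(lines[i].split()) == 1 and i + 1 < len(lines)
                and len(lines[i+1].split()) == 5):
            merged.append(lines[i].strip() + " " + lines[i+1].strip())
            i += 2
        else:
            merged.append(lines[i].strip())
            i += 1
    return merged

wrapped = ["hadoop-fuse-dfs",
           "  99T 80T 19T 81% /mnt/hadoop",
           "/dev/sda1 50G 10G 40G 20% /"]
print(merge_wrapped(wrapped))
```

Factoring this out would let the /tmp, /scratch, /mnt/hadoop, and mount-check scripts share one tested helper instead of four copies of the loop.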
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and kills those that have exceeded 70% CPU usage for 30 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
from pwd import getpwuid
to_kill = []
notify = []
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 30 minutes or more.
    #The if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 30 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it kills the jobs in to_kill
if to_kill == []:
    quit()
else:
    for g in range(0, len(to_kill)):
        o2 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
        o3 = subprocess.Popen("echo 'Dear user, you were running a CPU intensive job for an extended period of time on the cluster. Due to usage policies, this job has been terminated. Please use HTCondor in order to run this job without issues. Thank you for your compliance.' | write %s " % notify[g] , stdout=subprocess.PIPE, shell=True)
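The int(ps_array[f][9][:-3]) test above strips the seconds from ps's TIME column ("MM:SS" under "ps u"). A slightly more defensive helper (my sketch; it also tolerates an "HH:MM:SS" form, which the bare slice would not parse):

```python
#Convert a ps TIME field ("MM:SS", or "HH:MM:SS") to whole minutes.
def minutes_running(time_field):
    parts = [int(p) for p in time_field.split(":")]
    if len(parts) == 2:          # MM:SS
        return parts[0]
    hours, mins, _secs = parts   # HH:MM:SS
    return hours * 60 + mins

print(minutes_running("42:17"))     # 42
print(minutes_running("1:05:09"))   # 65
```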
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and kills those that have exceeded 70% CPU usage for 1 hour or more. It also warns the user by email about their intensive job for those exceeding 70% CPU usage for 20 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
import smtplib
from pwd import getpwuid
to_kill = []
notify = []
time_running = []
CPU_usage = []
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
send_to = []
user_email = ""
o2_text = ""
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 20 minutes or more.
    #The if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 20 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
            CPU_usage.append(ps_array[f][2])
            time_running.append(ps_array[f][9])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it kills the jobs in to_kill running for times greater than 1 hour and emails admins and the users in the notify array.
if to_kill == []:
    quit()
else:
    for g in range(0, len(to_kill)):
        #Used to obtain user's email if it is in hepcms_Users.csv
        o2 = subprocess.Popen("python /home/hon-martyn/scripts/cronscripts/parseUsers.py -m %s /home/hon-martyn/scripts/cronscripts/hepcms_Users.csv" % notify[g], stdout=subprocess.PIPE, shell=True)
        o2_text = str(o2.communicate())
        #If the user's email is not in hepcms_Users.csv, the job is killed without notifying the user; otherwise, the user is notified.
        if "Unknown user" in o2_text:
            if int(time_running[g][:-3]) >= 60:
                o3 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
        else:
            user_email = o2_text.split(r"\n")[-3]
            #If the job has been running for more than 1 hour, it is killed. Otherwise, the user is only emailed a warning.
            if int(time_running[g][:-3]) >= 60:
                o4 = subprocess.Popen("kill %s" % to_kill[g], stdout=subprocess.PIPE, shell=True)
                ohost1 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
                host = str(ohost1.communicate())[2:-10]
                msg = "Dear %s, \n \n You were running an intensive job on %s for an extended period of time. Due to cluster policies, this job has been terminated. Please consider using HTCondor to run this job, or contact Tier 3 for more information. Thank you. \n \n Statistics summary (user, job number, percent CPU usage, time running): \n " % (notify[g], host)
                msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
                from_addr = 'root@hepcms-hn.umd.edu'
                to_addr1 = ", ".join(admins) + ", " + user_email
                send_to = admins + [user_email]   #copy admins so the list itself is not modified across iterations
                subject = "Note about Intensive Job Being Run on Cluster"
                header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr1, subject)
                email = header + msg
                server = smtplib.SMTP('localhost')
                for addr in send_to:
                    server.sendmail(from_addr, addr, email)
                server.quit()
            else:
                ohost2 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
                host = str(ohost2.communicate())[2:-10]
                msg = "Dear %s, \n \n You are currently running an intensive job on %s. Due to cluster policies, this job will be terminated if it continues to run for more than 1 hour. You will receive two warning emails before termination. Please consider using HTCondor to run this job, or contact Tier 3 for more information. Thank you. \n \n Statistics summary (user, job number, percent CPU usage, time running): \n " % (notify[g], host)
                msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
                from_addr = 'root@hepcms-hn.umd.edu'
                to_addr2 = ", ".join(admins) + ", " + user_email
                send_to = admins + [user_email]   #copy admins so the list itself is not modified across iterations
                subject = "Note about Intensive Job Being Run on Cluster"
                header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr2, subject)
                email = header + msg
                server = smtplib.SMTP('localhost')
                for addr in send_to:
                    server.sendmail(from_addr, addr, email)
                server.quit()
#Separately emails admins about all intensive jobs
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
ohost3 = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
host = str(ohost3.communicate())[2:-10]
msg = "Intensive jobs being run on " + host + " (user, job number, % CPU usage, time running): \n "
for g in range(0, len(to_kill)):
    msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
from_addr = 'root@hepcms-hn.umd.edu'
to_addr3 = ", ".join(admins)
subject = "WARNING: Intensive Jobs Being Run on Cluster"
header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr3, subject)
email = header + msg
server = smtplib.SMTP('localhost')
for addr in admins:
    server.sendmail(from_addr, addr, email)
server.quit()
#!/usr/bin/env python
'''
This script checks the CPU usage of jobs running on the cluster and emails admins about those that have exceeded 70% CPU usage for 30 minutes or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import time
import os
import signal
import smtplib
from pwd import getpwuid
to_kill = []
notify = []
CPU_usage = []
time_running = []
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs commands multiple times to ensure no intensive processes are missed
for i in range(0, 6):
    #Runs ps and makes an array of the output
    o1_text = ""
    ps_list = []
    ps_array = []
    o1 = subprocess.Popen("ps -G users u --sort=-pcpu", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    ps_list = o1_text.split(r"\n")
    del ps_list[0]
    del ps_list[-1]
    for c in range(0, len(ps_list)):
        ps_list[c] = " ".join(ps_list[c].split())
    ps_array = [["USER", "PID", "%CPU", "MEM", "VSZ", "RSS", "TTY", "STAT", "START", "TIME", "COMMAND"]]
    for d in range(0, len(ps_list)):
        ps_array.append(ps_list[d].split(" "))
    #Adds PID's to to_kill corresponding to jobs that are using > 70% CPU usage and have been running for 30 minutes or more.
    #The first if statement below also checks that the job is not already in to_kill from a previous iteration.
    for f in range(1, len(ps_list)+1):
        if float(ps_array[f][2]) >= 70 and int(ps_array[f][9][:-3]) >= 30 and ps_array[f][1] not in to_kill:
            to_kill.append(ps_array[f][1])
            if ps_array[f][0].isdigit():
                notify.append(getpwuid(int(ps_array[f][0])).pw_name)
            else:
                notify.append(ps_array[f][0])
            CPU_usage.append(ps_array[f][2])
            time_running.append(ps_array[f][9])
    #Waits a few seconds in between each iteration
    if i < 5:
        time.sleep(2)
#Quits program if to_kill is empty (no intensive jobs being run); otherwise, it emails admins about the intensive jobs
if to_kill == []:
    quit()
ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
host = str(ohost.communicate())[2:-10]
msg = "Intensive jobs being run on " + host + " (user, job number, % CPU usage, time running): \n "
for g in range(0, len(to_kill)):
    msg += notify[g] + ", " + to_kill[g] + ", " + CPU_usage[g] + ", " + time_running[g] + "\n"
from_addr = 'root@hepcms-hn.umd.edu'
to_addr = ", ".join(admins)
subject = "WARNING: Intensive Jobs Being Run on Cluster"
header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
email = header + msg
server = smtplib.SMTP('localhost')
for addr in admins:
    server.sendmail(from_addr, addr, email)
server.quit()
#!/usr/bin/env python
'''
This script checks the usage of /tmp and notifies admins when it is 90% full or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /tmp
tmp_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/tmp":
        tmp_list = usage_array[e]
#print tmp_list
#print usage_array
#Sends email to admins if /tmp usage exceeds 90%
if tmp_list != [] and int(tmp_list[4].strip("%")) > 90:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + "/tmp is " + str(tmp_list[4].strip("%")) + "% full."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: /tmp usage is high."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
check_cert_expire_new.py
#!/usr/bin/python
'''
This script checks certificate expiration dates in /data/site_conf/certs (excluding /data/site_conf/certs/old and any .pem file with "key" in its name, however) and emails admins if any of them are going to expire within 1-30 days. NOTE: This script must be run as someone who has access to these certificates (i.e. sudo) for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins and directory containing certificates
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
GRID_SECURITY_DIRS = ['/data/site_conf/certs/'] #Be sure to have a '/' at the end of any directory in this list
#Checks if /data/site_conf/certs exists
if not os.path.isdir("/data/site_conf/certs"):
    quit()
#Creates message string, a list of certificates that will expire soon, and a dictionary for use later
msg = ""
certs_expiring = []
day = {}
#Produces list of .pem files in directories of GRID_SECURITY_DIRS
for i in range(0, len(GRID_SECURITY_DIRS)):
    os.chdir(GRID_SECURITY_DIRS[i])
    o1 = subprocess.Popen("ls | grep .pem | grep -v .pem-old | grep -v empty.pem | grep -v key", stdout=subprocess.PIPE, shell=True)
    o1_text = str(o1.communicate())
    files = o1_text.split(r"\n")
    files[0] = files[0][2:]
    del files[-1]
    #Checks which certificates will expire in 1-30 day(s); compiles a message containing the expiring certificates and their expiration dates
    for file in files:
        for x in range(1,31):
            o_day = subprocess.Popen("openssl x509 -checkend %d -noout -in %s%s ; echo $?" % ((x*86400), GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
            day_list = str(o_day.communicate()).split(r"\n")
            day_list[0] = day_list[0][2:]
            del day_list[-1]
            day["day%d" % x] = day_list
        o_exp = subprocess.Popen("openssl x509 -enddate -noout -in %s%s" % (GRID_SECURITY_DIRS[i], file), stdout=subprocess.PIPE, shell=True)
        exp_date = str(o_exp.communicate()).split(r"\n")
        exp_date = exp_date[0][11:]
        if exp_date == "" :
            msg += "check_cert_expire.py does not have access to %s%s, and its expiration date cannot be determined. \n" % (GRID_SECURITY_DIRS[i], file)
            break
        for y in range(1,31):
            if day["day%d" % y] == ["1"]:
                msg += "%s%s will expire within %d day(s). Its expiration date is: %s \n" % (GRID_SECURITY_DIRS[i], file, y, exp_date)
                certs_expiring.append(file)
                break
#Emails admins if any certificates are nearing expiration
if msg:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Certificates nearing expiration."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
#!/usr/bin/env python
'''
This script checks the usage of /mnt/hadoop and notifies admins when it is 80% full or more. NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())
#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list)-1):
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]
#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))
#Creates list corresponding to data usage of /mnt/hadoop
hadoop_list = []
for e in range(0, len(usage_list)+1):
    if usage_array[e][5] == "/mnt/hadoop":
        hadoop_list = usage_array[e]
#print to_del
#print hadoop_list
#print usage_array
#Sends email to admins if /mnt/hadoop usage exceeds 80%
if hadoop_list != [] and int(hadoop_list[4].strip("%")) > 80:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + "/mnt/hadoop is " + str(hadoop_list[4].strip("%")) + "% full. Please consider deleting or removing files."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: /mnt/hadoop usage is high"
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
Check_Hadoop.py
#!/usr/bin/python
'''
This script checks if Hadoop is running properly and emails admins if it is not. NOTE: This script must be run on a node where hadoop is set up (i.e. r510) and run by someone who has root access (i.e. sudo) for it to function properly.
'''
#Import necessary modules
import smtplib
import subprocess
import os
#Set admins
admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']
#Runs command to check if hadoop is running properly
o1 = subprocess.Popen("service hadoop-hdfs-datanode status", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
#Emails admins if hadoop is not running properly
if 'Hadoop datanode is running' not in o1_text:
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n Hadoop is not running properly. \' service hadoop-hdfs-datanode status \' did not return its expected output."
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: Hadoop is not running properly."
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
Check_Mounting.py
(Note: /root/cronscripts/tar_script.sh resides on all nodes and is run from each node's local crontab.)
#!/usr/bin/env python
'''
This script checks if /home, /hadoop, /data, and /cvmfs are properly mounted on a node. If cvmfs is not mounted, 'ls' and
'clean_cvmfs_cache.py' are run to try to mount it. If any of these directories are not mounted, an email is sent to the admins.
NOTE: This script must be run as sudo.
'''
#Import necessary modules
import subprocess
import smtplib

admins = ['jabeen@umd.edu', 'kakw@umd.edu', 'youngho.shin@cern.ch', 'johnmichaelmartyn@gmail.com', 'jamiebmazza@gmail.com', 'mnudelli@terpmail.umd.edu']

#Runs "df -h" and makes a list of the output
o1 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
o1_text = str(o1.communicate())
usage_list = o1_text.split(r"\n")
del usage_list[0]
del usage_list[-1]
for a in range(0, len(usage_list)):
    usage_list[a] = " ".join(usage_list[a].split())

#Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
to_del = []
for b in range(0, len(usage_list) - 1):  # -1 avoids an IndexError when the last line is a lone filesystem name
    if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
        usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
        to_del.append(b)
for c in range(0, len(to_del)):
    del usage_list[to_del[c]-c]

#Creates array out of output list
usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
for d in range(0, len(usage_list)):
    usage_array.append(usage_list[d].split(" "))

#Creates list corresponding to data usage
directory_list = ["/home", "/hadoop", "/data", "/cvmfs"]
directory_mounted = [0, 0, 0, 0]
for e in range(0, len(directory_list)):
    for f in range(0, len(usage_list)+1):
        if directory_list[e] in usage_array[f][5]:
            directory_mounted[e] = 1

#If cvmfs is not mounted, ls is run in an attempt to mount cvmfs, and df -h is used again to check if cvmfs has been mounted
if directory_mounted[3] == 0:
    o2 = subprocess.Popen("ls", stdout=subprocess.PIPE, shell=True)
    o3 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
    o3_text = str(o3.communicate())
    usage_list = o3_text.split(r"\n")
    del usage_list[0]
    del usage_list[-1]
    for a in range(0, len(usage_list)):
        usage_list[a] = " ".join(usage_list[a].split())
    #Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
    to_del = []
    for b in range(0, len(usage_list) - 1):
        if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
            usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
            to_del.append(b)
    for c in range(0, len(to_del)):
        del usage_list[to_del[c]-c]
    #Creates array out of output list
    usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
    for d in range(0, len(usage_list)):
        usage_array.append(usage_list[d].split(" "))
    #Checks if cvmfs has been mounted
    for h in range(0, len(usage_list)+1):
        if directory_list[3] in usage_array[h][5]:
            directory_mounted[3] = 1

#If cvmfs is still not mounted, clean_cvmfs_cache.py is run in an attempt to mount cvmfs, and df -h is used again to check if cvmfs has been mounted
if directory_mounted[3] == 0:
    o2 = subprocess.Popen("python /root/cronscripts/clean_cvmfs_cache.py", stdout=subprocess.PIPE, shell=True)
    o3 = subprocess.Popen("df -h", stdout=subprocess.PIPE, shell=True)
    o3_text = str(o3.communicate())
    usage_list = o3_text.split(r"\n")
    del usage_list[0]
    del usage_list[-1]
    for a in range(0, len(usage_list)):
        usage_list[a] = " ".join(usage_list[a].split())
    #Cleans up output list by combining elements that should be on a single line and deleting duplicated elements
    to_del = []
    for b in range(0, len(usage_list) - 1):
        if len(usage_list[b].split()) == 1 and len(usage_list[b+1].split()) == 5:
            usage_list[b+1] = usage_list[b] + " " + usage_list[b+1]
            to_del.append(b)
    for c in range(0, len(to_del)):
        del usage_list[to_del[c]-c]
    #Creates array out of output list
    usage_array = [["Filesystem", "Size", "Used", "Avail", "Use%", "Mounted on"]]
    for d in range(0, len(usage_list)):
        usage_array.append(usage_list[d].split(" "))
    #Checks if cvmfs has been mounted
    for h in range(0, len(usage_list)+1):
        if directory_list[3] in usage_array[h][5]:
            directory_mounted[3] = 1

#Message compiled corresponding to unmounted storage devices
msg = ""
for i in range(0, len(directory_mounted)):
    if directory_mounted[i] == 0:
        msg = msg + directory_list[i] + " is not mounted properly. \n"

#Sends email to admins if any storage devices are not mounted properly
if msg != "":
    ohost = subprocess.Popen("hostname", stdout=subprocess.PIPE, shell=True)
    host = str(ohost.communicate())[2:-10]
    msg = "On " + host + ": \n" + msg
    from_addr = 'root@hepcms-hn.umd.edu'
    to_addr = ", ".join(admins)
    subject = "WARNING: data directories are not properly mounted"
    header = "From: %s\nTo: %s\nSubject: %s\n" % (from_addr, to_addr, subject)
    email = header + msg
    server = smtplib.SMTP('localhost')
    for addr in admins:
        server.sendmail(from_addr, addr, email)
    server.quit()
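The `df -h` parsing above has to repair wrapped lines and is repeated three times; the core mount check could be done with `os.path.ismount`. A sketch, with the caveat that for autofs-managed /cvmfs this only sees repositories that have already been triggered, so the ls/clean_cvmfs_cache retry steps from the script would still be needed:

```python
import os

def unmounted(paths=("/home", "/hadoop", "/data", "/cvmfs")):
    """Return the subset of `paths` that are not currently active mount
    points (nonexistent paths also count as not mounted)."""
    return [p for p in paths if not os.path.ismount(p)]

# Usage: any non-empty result would be formatted into the warning email.
# unmounted() -> e.g. ["/cvmfs"] if only /cvmfs is missing
```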
For backup of /root, /var, and /etc: hepcms-hn:/root/cronscripts/backup_systemfiles_allnodes.sh exists but
has not been put into use yet.
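To collect the per-node copies of these scripts into /data/cluster-health before comparing versions, something like the following could be cron'd or run by hand on the head node. This is only a sketch: it assumes passwordless root ssh to each node, rsync on both ends, and a per-node subfolder layout under /data/cluster-health (that layout is my invention); the node list should be extended to the full set above.

```python
import subprocess

NODES = ["r510-0-1", "r520-0-4", "hepcms-in1"]  # extend to the full node list
DEST = "/data/cluster-health"  # the recently created collection area

def gather_cmd(node, src="/root/cronscripts/", dest=DEST):
    """Build the rsync command that pulls one node's script area into a
    per-node subfolder, so differing versions can be compared side by side."""
    return ["rsync", "-a", "root@%s:%s" % (node, src), "%s/%s/" % (dest, node)]

# Actual collection run (commented out; requires ssh access to the nodes):
# for node in NODES:
#     subprocess.call(gather_cmd(node))
```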
# HEADER: This file was autogenerated at Fri Jan 29 23:29:33 -0500 2016 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# CRONTAB FILE LOCATION: /var/spool/cron/root
# edit via: env EDITOR='emacs -nw' crontab -e
# (or just edit the file directly)
#
# crontab syntax:
# <minute> <hour> <day (numerical)> <month (numerical)> <day of week (Sunday=0)> job
#
# 1:15am daily backup of the etc on the head node to /data
#15 01 * * * /root/scripts/backup_headNode.sh >& /root/LogScripts/DailyEtcBackup_$(date +\%Y\%m\%d).log 2>&1
# 1:30am rsync of /export/home on the head node to /DataCampusBackup
30 01 * * 0 /root/cronscripts/rsync_home.sh &
# tars /etc, /var, and /root. Works with backup_headNode.sh
0 0 * * * /root/cronscripts/tar_script.sh
# Puppet Name: puppet
6 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
# Yearly reminder to change passwords, for those who've had accounts for more than a year.
#@yearly python /root/cronscripts/SendMail.py -a -s "Time to update hepcms passwords" -g 365 -m /root/cronscripts/messages/UpdatePasswords.txt
# cron script to post condor status on web for ease of monitoring, every 10 minutes
1,11,21,31,41,51 * * * * /root/condor-status-script.sh
*/5 * * * * /root/cronscripts/collect_disk_monitor.sh
# This retrieves the status of PhEDEx
*/5 * * * * /usr/bin/python /root/cronscripts/phedex_scraper.py
# Checks omreports and emails admins if there is a critical report
10 22 * * * /usr/bin/python /root/cronscripts/check_omreports.py
*/20 * * * * /root/cronscripts/EnoPriority.sh
*/20 * * * * /root/cronscripts/yhshinPriority.sh
# Runs omreport on this node for system, disks, and chassis
0 22 * * * /bin/bash /data/monitoring/scripts/run_omreport.sh
# Runs temperature gmetric script
*/5 * * * * /etc/ganglia/sensors_gmetric.sh 2>&1 /dev/null
# Pings all nodes and emails admins if nodes are unreachable
0 5 * * * /root/cronscripts/ping_nodes.py
# Cleans CVMFS cache when it reaches 75% or above
0 0 * * 2 /usr/bin/env /root/cronscripts/clean_cvmfs_cache.py
0 0 */3 * * python /root/cronscripts/check_cert_expire_new.py
0 0 */1 * * python /root/cronscripts/check_cert_expire_daily_new.py
# HEADER: This file was autogenerated at Thu Dec 17 20:38:13 -0500 2015 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: puppet
7 * * * * /usr/bin/env puppet agent --config /etc/puppet/puppet.conf --onetime --no-daemonize --noop
0 23 */3 * * python /root/Check_Mounting.py
# tars /etc, /root, and /var
0 0 * * * /root/cronscripts/tar_script.sh
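Since the same scripts (tar_script.sh, Check_Mounting.py, etc.) appear in several nodes' crontabs, the "which copy is the final version" question in categories 3, 4, and 5 can be narrowed down by hashing the collected copies. A sketch, assuming the copies have been gathered into per-node subfolders under one root (that layout is an assumption, not something the cluster already has):

```python
import hashlib
import os

def version_map(root, filename):
    """Map content digest -> sorted list of nodes whose copy of `filename`
    has that digest, given per-node subfolders under `root`. One key in the
    result means all nodes agree; more keys mean divergent versions to diff."""
    versions = {}
    for node in sorted(os.listdir(root)):
        path = os.path.join(root, node, filename)
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            versions.setdefault(digest, []).append(node)
    return versions

# Usage (hypothetical layout): version_map("/data/cluster-health", "disk_monitor.sh")
```

Any digest group with a single node is the likely outlier to inspect with diff.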