of different hardware and software work in building the cluster as well as troubleshooting various issues
After the squid issue, r720-0-1 was taken out of Condor as of Nov 23, 2016, to keep the traffic low.
[root@r720-0-1 ~]# service condor status
condor_master (pid 9826) is running...
[root@r720-0-1 ~]# service condor stop
Stopping Condor daemons: [ OK ]
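The status check and stop above could be combined so that the stop is only attempted when Condor is actually running. This is a minimal sketch, assuming the init script prints "is running" as in the transcript above; SERVICE_CMD is a hypothetical override added here so the logic can be tried without root, and it is not part of the cluster setup.

```shell
#!/bin/bash
# Sketch: stop Condor only if the init script reports it running.
# SERVICE_CMD is a hypothetical override (defaults to the real "service").
SERVICE_CMD="${SERVICE_CMD:-service}"

condor_is_running() {
    # The init script prints "... is running..." when condor_master is up.
    $SERVICE_CMD condor status 2>/dev/null | grep -q "is running"
}

stop_condor() {
    if condor_is_running; then
        $SERVICE_CMD condor stop
    else
        echo "condor already stopped"
    fi
}
```

Run stop_condor as root on the node after sourcing the functions.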
Node lets you enter password and then kicks you off without prompt
Node doesn't let you enter password, just ssh closed - Denyhosts block
were asked to add the following software:
from hn
ssh-agent $SHELL
ssh-add
clush -w @R510 yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
clush -w @compute yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
clush -w @R720 yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
clush -w @interactive yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
clush -w @hepcms-in1 yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
more /etc/clustershell/groups
clush -w @INT yum install -y tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33
-y answers yes automatically to yum's prompts. Test the install on one node first to make sure everything works.
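The per-group installs above could be wrapped in a small loop that tries a single node before touching the groups. This is a minimal sketch, assuming the group names used above match /etc/clustershell/groups and that ssh-agent and ssh-add have already been run from hn. TEST_NODE, DRY_RUN and the helper names are hypothetical, introduced only for illustration.

```shell
#!/bin/bash
# Sketch only: group names come from the commands above; TEST_NODE and
# DRY_RUN are illustrative names, not part of the cluster setup.
PKGS="tcl tk zsh perl-ExtUtils-Embed compat-libstdc++-33"
GROUPS_TO_DO="R510 compute R720 interactive hepcms-in1 INT"
TEST_NODE="${TEST_NODE:-r720-0-1}"   # assumption: pick any single node to try first
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_everywhere() {
    # Step 1: one node first; stop here if that fails.
    run clush -w "$TEST_NODE" yum install -y $PKGS || return 1
    # Step 2: the same install on every group.
    for g in $GROUPS_TO_DO; do
        run clush -w "@$g" yum install -y $PKGS
    done
}

install_everywhere
```

By default this only prints the clush commands; set DRY_RUN=0 once the printed list looks right.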
8 Check what the changes were after you try something in Puppet:
10 Test puppet changes without implementing them on a node:
10.1 Test puppet changes without implementing them on a node for just one feature (tags):
10.3 Stop a puppet agent (these run automatically on a node either in kickstart or crontab):
10.4 Make that puppet agent not start automatically upon node reboot:
10.5 Start a puppet agent (these run automatically on a node either in kickstart or crontab):
10.8 puppet agent --test --noop reports failed dependencies?
10.9 Want to add a puppet class in base.pp or site.pp instead of on hepcms-foreman web?
10.10 Want to add a puppet class in a hiera yaml instead of on hepcms-foreman web?
10.11 Changed hostname on a machine, need to re-get puppet cert:
10.13 r10k make sure we don't lose changes updated locally and not in github:
10.16 Foreman kickstart telling you there's not enough disk space for partitions?
10.20 Is the Foreman build of a baremetal machine working (checking during build):
10.25 Add a puppet module by hand in an area (locally) where r10k & git won't affect it:
10.28 Puppet module complaining about operating system issues?
10.33 Change in some .yaml parameter or class not taking effect at all on a node?
10.34 Check puppet agent behavior for a specific module (on that node):
10.35 All your nodes in the hepcms-foreman web page suddenly orange for "not in sync"?
10.36 How to change a puppet configuration file in your hiera .yaml?
10.38 Did your hiera implementation give you something weird, like ["?
10.41 Why is my node stuck in blue A and always doing the same update?
10.42 Implementing a new puppet module and get an error about "Could not find class"?
5 Some ethernet port is not working even when I modify it in Foreman?
9 ssh_exchange_identification: read: Connection reset by peer:
10 Kickstarted a node but Foreman's having trouble seeing an (external) port:
Power Down and Power Up procedures for the hepcms cluster
Dell linux hot swap sysadmin reference: https://grox.net/sysadm/unix/linux_disk_hotplug_helpful_commands
Setting up xfs: https://www.percona.com/blog/2011/12/16/setting-up-xfs-the-simple-edition/
Dangerous - delete partition with fdisk: http://www.cyberciti.biz/faq/linux-how-to-delete-a-partition-with-fdisk-command/
Dangerous - fdisk management: http://www.thegeekstuff.com/2010/09/linux-fdisk/
RedHat quotas SL6: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-disk-quotas.html
Dangerous: hadoop datanode disk repair: http://hep-t3.physics.umd.edu/HowToForAdmins/errors.html#errorsHadoopFsck
http://www.cyberciti.biz/faq/linux-disk-format/
http://www.ehow.com/how_1000631_hard-drive-linux.html
http://lissot.net/partition/ext2fs/labels.html
Main Dell landing page for server update: http://www.dell.com/support/article/us/en/19/SLN293301
Dell: Maintaining R510 drives: http://www.dell.com/Support/Article/us/en/19/HOW10062
Dell firmware: https://lonesysadmin.net/2011/03/07/the-easiest-way-to-update-a-dell-servers-firmware/
For PowerVault (r720-datanfs, hepcms-gridftp): http://www.dell.com/support/home/us/en/04/Drivers/DriversDetails?driverId=F96NR
R720 Manual
http://topics-cdn.dell.com/pdf/poweredge-r720_owner's%20manual_en-us.pdf
RAM snapshots: http://www.ovirt.org/Features/RAM_Snapshots#Commit_to_snapshot
Single disk snapshot: http://www.ovirt.org/Features/Single_Disk_Snapshot
Live snapshots: http://www.ovirt.org/Live_Snapshots
Puppet getting started with hiera: http://docs.puppetlabs.com/hiera/1/#getting-started-with-hiera
OSG puppet github: https://github.com/opensciencegrid/puppet-contrib
Looking for official puppet modules: https://forge.puppetlabs.com
How to hiera/module: http://garylarizza.com/blog/2013/12/08/when-to-hiera/
Refactoring puppet class to use with Hiera: http://rnelson0.com/2014/10/20/rewriting-a-puppet-module-for-use-with-hiera/
Puppet, getting different data types with Hiera: http://codingbee.net/tutorials/puppet/puppet-retrieving-data-from-yaml-files-using-hiera/
Hiera/r10k blog post: http://rnelson0.com/2014/07/21/hiera-r10k-and-the-end-of-manifests-as-we-know-them/
Puppet: languages and namespaces: https://docs.puppetlabs.com/puppet/latest/reference/lang_namespaces.html#autoloader-behavior
Puppet: beginner's guide to modules: https://docs.puppetlabs.com/guides/module_guides/bgtm.html#step-one-giving-your-module-purpose
Hiera yaml format: http://symfony.com/doc/current/components/yaml/yaml_format.html
http://www.puppetcookbook.com/posts/creating-a-directory.html
CERN IT support status: old page http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/default-dynamic.asp
CERN status (improved page, login for most info): https://cern.service-now.com/service-portal/sls.do
T2Admin guide twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/T2AdminGuide
CompOpsStoreTemp policy for hadoop: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsStoreTemp
Other CompOps information: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOps
Puppet OSG from Trey: https://github.com/treydock/puppet-osg
Our old cluster (SL5): http://hep-t3.physics.umd.edu/
How much CPU/RAM do I have on this machine? http://geroldm.com/2012/06/find-out-cpu-memory-and-harddisk-info-in-linux/
25 most common iptables example: http://www.thegeekstuff.com/2011/06/iptables-rules-examples/
Marguerite's UNIX/ROOT/CMSSW cheat sheet: http://cmsdoc.cern.ch/~belt/CheatSheet.html
rsync: http://www.tecmint.com/rsync-local-remote-file-synchronization-commands/
rsync: https://gist.github.com/KartikTalwar/4393116
10 useful commands to configure (and check) a network interface - you are better off using Foreman or ovirt to configure, but this is if you really need to do it by hand: http://www.tecmint.com/ip-command-examples/
Yum&RPM from OSG Release3: https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/YumRpmBasics
Bootable Linux USB from Mac (used this successfully): http://borgstrom.ca/2010/10/14/os-x-bootable-usb.html
Alternate Bootable Linux USB from Mac:
Step one, need hdiutil convert: http://unix.stackexchange.com/questions/114984/how-to-create-a-bootable-linux-installation-usb-from-an-iso-in-os-x
step 2: http://blog.tinned-software.net/create-bootable-usb-stick-from-iso-in-mac-os-x/
Alternate Bootable Linux USB from Mac: http://osxdaily.com/2015/06/05/copy-iso-to-usb-drive-mac-os-x-command/
Bootable Mac Linux loader (makes liveCD where you can write things - may or may not be Linux bootable, but I hope it is): http://blog.sevenbits.tk/Mac-Linux-USB-Loader/
Install and secure NIS master: http://www.setuptips.com/unix/install-configure-nis-master/
NIS tools (managing): http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch13_04.htm
NIS best practices: http://archive09.linux.com/feature/114201
NIS, what files are managed: http://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch03_03.htm
NIS add new users: http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch30_:_Configuring_NIS#Adding_New_NIS_Users
complete guide to useradd: http://www.tecmint.com/add-users-in-linux/
The main link for all OSG stuff: https://twiki.grid.iu.edu/bin/view/Documentation/Release3/WebHome
https://twiki.cern.ch/twiki/bin/view/CMSPublic/USCMSTier3Doc
https://twiki.cern.ch/twiki/bin/viewauth/CMS/T2AdminGuide (requires CMS authentication; this is the T2, NOT the T3 guide, and some but not all of the info overlaps)
a book on Ganglia, available online:
https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/
http://tldp.org/HOWTO/Clock-2.html
http://www.freeraidrecovery.com/library/raid-5-6.aspx
Manual for the controller for the RAID6 disks on hepcms-hn: http://www.flagshiptech.com/eBay/Dell/poweredgeh310h710h810UsersGuide.pdf
https://www.lsc-group.phys.uwm.edu/daswg/download/vmwareSL6.html (I don't recall what I looked at here)
SE-0-2:
http://en.community.dell.com/support-forums/servers/f/906/t/19590720
(Copied from old cheatsheet Kak: Nov-4)
Nodes with oldest root pass:
sl5, any node not part of Foreman that still has SL5 and Rocks 5.4
SE-0-3 : broken disk
interactive-0-4: broken disk
public network switch web page, username admin, old root pass
private network switch web page, username root, old root pass
hepcms-gridftp
iDRAC
clush password
Nodes with new root pass:
hepcms-ovirt web page, username admin, new pass
Nodes with 2016 summer root pass:
hepcms-in8, hepcms-in1, hepcms-gums, hepcms-se, hepcms-squid, hepcms-ce, hepcms-hn-backup, hepcms-secondary-namenode, hepcms-namenode, hepcms-vmtest, foreman-vmtest2, r720-0-1, r720-0-2, r720-datanfs, compute-0-10, compute-0-5, foreman, head node, hepcms-in2, compute-0-11, compute-0-8, compute-0-6, r510-0-1, r510-0-4, r510-0-5, r510-0-6, r510-0-9, r510-0-10, r510-0-11
compute-0-5, compute-0-11, r510-0-1
Kak, Margarita, James, Marguerite
hepcms-hn
drjohn, treydock, reduardo, Shabnam, Marguerite, Margarita, Kak
Check that the user doesn't already exist. If the software uses a shared filesystem, the user must have the same UID and GID on all systems, so check the information on hepcms-hn as well as on the individual nodes:
id hdfs
Also can get list of groups on a node this way:
getent group
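The UID/GID comparison between hepcms-hn and a node can be scripted. This is a minimal sketch that compares two passwd-format lines (as printed by getent passwd); the function names uid_gid and same_ids are hypothetical, introduced only for illustration.

```shell
#!/bin/bash
# Sketch: compare the UID and GID of a user between two passwd-format entries.
# uid_gid and same_ids are illustrative helper names.

# Print "uid:gid" for user $1 from passwd-format lines on stdin.
uid_gid() {
    awk -F: -v u="$1" '$1 == u { print $3 ":" $4 }'
}

# Succeed only if the user exists in the reference entry and both entries
# carry the same UID and GID.
# $1 = user, $2 = reference passwd line (e.g. from hepcms-hn), $3 = local passwd line
same_ids() {
    ref="$(printf '%s\n' "$2" | uid_gid "$1")"
    loc="$(printf '%s\n' "$3" | uid_gid "$1")"
    [ -n "$ref" ] && [ "$ref" = "$loc" ]
}
```

On a node, something like same_ids hdfs "$(ssh hepcms-hn getent passwd hdfs)" "$(getent passwd hdfs)" && echo match would report whether the two agree, assuming ssh to hepcms-hn works from that node.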