[drjohn@hepcms-1 config.d]$ condor_ce_q
-- Schedd: hepcms-1.umd.edu : <128.8.216.10:9619?... @ 05/19/20 06:49:34
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
drjohn ID: 773977 5/19 00:30 _ _ _ 1 1 773977.0
The mapping done by /etc/condor-ce/condor_mapfile was not working. I changed the regular expression and now it maps. Your SAM tests are running:
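## for reference, lines in /etc/condor-ce/condor_mapfile have the form: <auth method> <regex matching the certificate subject> <mapped user>
## a hypothetical GSI entry (the DN pattern and local user below are illustrative only, not the actual fix):
GSI "^\/DC=org\/DC=opensciencegrid\/.*\/CN=.*" osguser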
StarterLogs
[root@r540-0-20 condor]# condor_config_val daemon_list
MASTER, STARTD
Downgrading from 8.8.5 to 8.6.13 on siab-1
Removed:
blahp.x86_64 0:1.18.41.bosco-2.osg34.el7 boost169-python2.x86_64 0:1.69.0-2.el7 condor.x86_64 0:8.8.5-1.4.osg34.el7
condor-classads.x86_64 0:8.8.5-1.4.osg34.el7 condor-procd.x86_64 0:8.8.5-1.4.osg34.el7 munge-libs.x86_64 0:0.5.11-3.el7
python2-condor.x86_64 0:8.8.5-1.4.osg34.el7
Installed:
blahp.x86_64 0:1.18.41.bosco-1.osg34.el7 condor.x86_64 0:8.6.13-1.4.osg34.el7 condor-classads.x86_64 0:8.6.13-1.4.osg34.el7
condor-procd.x86_64 0:8.6.13-1.4.osg34.el7 python2-condor.x86_64 0:8.6.13-1.4.osg34.el7
Downgrade process:
yum install yum-plugin-versionlock
yum remove condor condor-classads condor-procd blahp python2-condor
## put osg3.4 into /etc/yum.repos.d/osg.repo
yum install condor-classads-8.6.13-1.4.osg34.el7.x86_64 --disablerepo=* --enablerepo=osg3.4
yum versionlock condor-classads-8.6.13-1.4.osg34.el7
yum install blahp-1.18.41.bosco-1.osg34.el7 --disablerepo=* --enablerepo=osg3.4
yum versionlock blahp-1.18.41.bosco-1.osg34.el7
yum install condor-classads-8.6.13-1.4.osg34.el7.x86_64 python2-condor-8.6.13-1.4.osg34.el7.x86_64 condor-8.6.13-1.4.osg34.el7 --disablerepo=* --enablerepo=osg3.4
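## afterwards, one can verify the pins and the installed version (sketch; exact output will vary):
yum versionlock list
condor_version
## condor_version should report 8.6.13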
As a first approximation, try changing your submit file to
should_transfer_files = YES
Second, in case there is a firewall, make sure the following ports are open: 9618 and 9000-9999 (LOWPORT-HIGHPORT).
By the way, I use LOWPORT-HIGHPORT in my condor config file, but yours shows IN_LOWPORT, IN_HIGHPORT.
Condor does not warn about firewall issues, which makes this hard to debug.
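## a sketch of the relevant pieces, assuming firewalld and a condor config fragment in /etc/condor/config.d/ (adjust to your setup):
## condor config snippet pinning the daemon port range
LOWPORT = 9000
HIGHPORT = 9999
## open the collector port and that range in the firewall
firewall-cmd --permanent --add-port=9618/tcp
firewall-cmd --permanent --add-port=9000-9999/tcp
firewall-cmd --reload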
Main reference:
https://opensciencegrid.org/docs/data/install-hadoop/#updating-to-osg-34
General reference:
https://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/ ## older version commands but the idea is similar
https://hadoop.apache.org/docs/r2.6.0/index.html
Overview:
1. Backup namenode, take snapshot of relevant health stats
2. Hadoop enters safe mode; turn off the datanodes, then the secondary namenode, and lastly the namenode
3. upgrade namenode, turn on and check health page
4. then upgrade datanode and bring them up one by one, checking progress on dfshealth web interface
5. the health page should show block reporting at 100%; confirm that the namenode leaves safe mode
6. finalize the upgrade, which deletes the old metadata and unfreezes the old blocks
Before upgrade:
1. check that hadoop usage is <90%
2. check that each and every hadoop disk (>100 of them) has at least 1GB of free space (unfortunately hadoop does not balance usage across disks within the same node; if one disk gets too full, the extra upgrade metadata will use up the remaining space and prevent the upgrade from moving on)
3. make sure as many disks as possible are online
4. clean up data and run the hadoop balancer if possible to address 1 and 2 (see the example after the check commands below)
## on the namenode, check corrupted and under-replicated blocks
hdfs fsck / | egrep -v '^\.+$' | grep -v replica| sed '/^$/d'
hdfs dfsadmin -report
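## if usage is uneven, the balancer mentioned in step 4 can be run roughly like this (the 5% threshold is only an example value):
hdfs balancer -threshold 5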
Shutdown and backup:
1. Enter safe mode to stop write operations
2. the namenode's name directory needs to be backed up
3. back up the configuration files in /etc/hadoop too
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
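## optionally confirm that safe mode is on before shutting things down:
hdfs dfsadmin -safemode get
## should give: Safe mode is ON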
## log into each datanode to turn off datanode services
/etc/init.d/hadoop-hdfs-datanode stop
## (use clush) to confirm they are indeed off
/etc/init.d/hadoop-hdfs-datanode status
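## e.g. with clush (the node set below is illustrative; substitute the actual datanode list):
clush -w compute-0-[0-30] '/etc/init.d/hadoop-hdfs-datanode status'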
## confirm location of and back up name data, currently in /var/lib/hadoop-hdfs/cache/hdfs/dfs/
grep -C1 dfs.namenode.name.dir /etc/hadoop/conf/hdfs-site.xml
# the above property is the recommended one, but our configuration currently sets the older dfs.name.dir instead
grep -C1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml
cp -r /var/lib/hadoop-hdfs/cache/hdfs/dfs/ /data/backup/hadoopconfig20190419/hepcms-namenode/
## this is the single most important piece of data in the hadoop system, so one has to be very careful. Everything was backed up to /data/ and /data2/ for redundancy
## back up the configuration files on the namenode and on the datanodes
cp -r /etc/hadoop /data/backup/hadoopconfig20190419/hepcms-namenode/
## then turn off namenode and secondary namenode
/etc/init.d/hadoop-hdfs-secondarynamenode stop
/etc/init.d/hadoop-hdfs-secondarynamenode status
/etc/init.d/hadoop-hdfs-namenode stop
/etc/init.d/hadoop-hdfs-namenode status
Upgrade:
1. yum update namenode service
2. run namenode service in upgrade mode
3. upgrade other nodes
## in case osg3.4 is not installed yet. usually not necessary
rpm -e osg-release
rpm -Uvh http://repo.opensciencegrid.org/osg/3.4/osg-3.4-el6-release-latest.rpm
yum clean all --enablerepo=*
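## optional sanity check that the osg repositories are now visible:
yum repolist enabled | grep -i osg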
## since the meta-package osg-se-hadoop-* is not present in osg-upcoming (yet), I directly fetched the hadoop version from osg-upcoming repo
## first things first: upgrade the namenode. hadoop needs to be in safe mode and completely down
## check that it fetches the right hadoop versions
yum --enablerepo=osg-upcoming update hadoop-hdfs-namenode hadoop
## initiate upgrade
/etc/init.d/hadoop-hdfs-namenode upgrade
## then monitor /var/log/hadoop-hdfs/ to see everything is ok
[root@hepcms-namenode ~]# grep upgrade /scratch/hadoop/hadoop-hdfs/hadoop-hdfs-namenode-hepcms-namenode.privnet.log
2019-04-24 08:38:22,860 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Starting upgrade of local storage directories.
2019-04-24 08:38:22,862 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Starting upgrade of storage directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
2019-04-24 08:38:31,398 INFO org.apache.hadoop.hdfs.server.namenode.NNUpgradeUtil: Performing upgrade of storage directory /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
/etc/init.d/hadoop-hdfs-namenode status ## should be running
## at this point one can check the upgrade progress on the web interface. An ssh port forward can be used to maintain a stable connection over a long period of time. On a terminal:
ssh -L localhost:8000:hepcms-namenode.privnet:50070 kakw@hepcms-hn.umd.edu
## check on browser: http://localhost:8000/dfshealth.html
# upgrade datanode
yum --enablerepo=osg-upcoming update hadoop-hdfs-datanode hadoop
/etc/init.d/hadoop-hdfs-datanode start
## when all datanodes are up and the webpage shows 99.99%+ blocks found, then
hdfs dfsadmin -safemode get
## should give: Safe mode is OFF
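## if it stays ON even though all blocks are reported, safe mode can be left manually with:
hdfs dfsadmin -safemode leave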
## At this point, hadoop is writable. Redo the health checks done before the update, such as:
hdfs fsck / | egrep -v '^\.+$' | grep -v replica| sed '/^$/d'
# There should be no increase in corrupted blocks
## On nodes with hadoop mounted, you need to upgrade the hdfs-fuse-client
## You don't need to run the rpm command if you don't need to upgrade the node to osg 3.4
## on gridftp and se, restart gridftp service/ cmsd/ xrootd service after upgrade
## It's a good idea to diff the output of "service --status-all" before and after upgrades, since we cannot otherwise tell what gets turned off or crashes in the process
yum --enablerepo=osg-upcoming update hadoop-client hadoop
## on all nodes with hadoop mounted, do:
mount /mnt/hadoop
## and confirm that the mount is accessible
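## e.g. a quick check that the fuse mount responds:
df -h /mnt/hadoop
ls /mnt/hadoop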
## similarly, for secondary namenode:
yum --enablerepo=osg-upcoming update hadoop-hdfs-secondarynamenode hadoop
/etc/init.d/hadoop-hdfs-secondarynamenode start
# check the log to see that it is running. This is the least important of all the nodes and the last to be updated
Finish Upgrade
1. Once every system is found to work with the upgraded setup, one has to finalize the update, thus completing the upgrade process. (This has not been done yet as of 2 May 2019.) The command is:
hdfs dfsadmin -finalizeUpgrade
2. On datanodes, the upgrade process moves data blocks to previous/ directories and makes hard links between them and the current/ directory:
/hadoop2/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/
/hadoop2/data/current/BP-953065178-10.1.0.16-1445909897155/previous/finalized/
3. data blocks in the previous/ directory are frozen until finalize deletes them. If the upgrade is unsuccessful, the blocks in previous/ can be used to restore the datanode to its pre-upgrade state
4. This also means that no space can be freed until the finalize command is run, and the hadoop balancer will not be effective either
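## after finalizing, the previous/ directories should disappear from the datanodes; one can check with e.g.:
ls -d /hadoop*/data/current/BP-953065178-10.1.0.16-1445909897155/previous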
Debugging
A. Crashed datanode upgrade
1. the datanode restart was not completely successful but didn't crash. The service remains in the running state, but blocks are reported missing. The log reported memory overflow issues.
2. Check that previous.tmp/ exists; when the migration completes, these directories are renamed previous/
ls -d /hadoop*/data/current/BP-953065178-10.1.0.16-1445909897155/previous.tmp
3. the solution in that case is to simply restart the datanode; it will pick up from where it left off:
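## i.e. stop and start the service again, then watch the log and the dfshealth page:
/etc/init.d/hadoop-hdfs-datanode stop
/etc/init.d/hadoop-hdfs-datanode start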
B.
##### in case a node is too full (dangerous operation) ######
1. datanode service fails
2. one has to move a few blocks away (two full-sized blocks, ~256MB, will suffice for each full disk)
3. the subdirectory structure needs to be preserved, in case one needs to restore the blocks (hopefully this won't be necessary thanks to the replication factor)
4. the block and its metadata need to move in unison, e.g.:
-rw-r--r-- 1 hdfs users 128M Jan 31 2015 /hadoop2/data/current/blk_-121168754068712736
-rw-r--r-- 1 hdfs users 129K Jan 31 2015 /hadoop2/data/current/blk_-121168754068712736_3879327.meta
references:
http://gbif.blogspot.com/2015/05/dont-fill-your-hdfs-disks-upgrading-to.html
https://hadoopsters.net/2017/05/11/dont-just-plug-that-disk-in/
## example: compute-0-6
# make directory
mkdir -p /data/backup/hadoopconfig20190419/compute-0-6/hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/
## nothing is done at this step; it just confirms what one wants to do. The three rsync commands below are identical except for the --dry-run and --remove-source-files options
rsync -aiv /hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/ /data/backup/hadoopconfig20190419/compute-0-6/hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/ --dry-run
## data is copied over (only step that takes time)
rsync -aiv /hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/ /data/backup/hadoopconfig20190419/compute-0-6/hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/
## remove the source files (this effectively commits the move)
rsync -aiv /hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/ /data/backup/hadoopconfig20190419/compute-0-6/hadoop1/data/current/BP-953065178-10.1.0.16-1445909897155/current/finalized/subdir0/ --remove-source-files