Minimum system requirements:
AIX 6.1 TL6 (CAA was introduced at this level), however AIX 6.1 TL7 or TL8 with the latest SP is recommended
AIX 7.1 base, however AIX 7.1 TL1 or TL2 with the latest SP is recommended
bos.ahafs (autonomic health advisor filesystem used by CAA to communicate events)
bos.cluster (CAA)
other filesets may be required depending on what's already installed / which HA filesets are required
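A quick way to see what's already installed (lslpp is standard; bos.cluster.rte is the main fileset within the bos.cluster package, and the exact HA fileset names will vary slightly with the edition you install):
lslpp -l bos.cluster.rte bos.ahafs
lslpp -l "cluster.es.*"    # the PowerHA filesets themselves, once installed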
Things that you can no longer do with 7.1 - important to bear in mind for migrations:
No AIX 5.3 support as we need CAA
You can't have disk heartbeat any more, or any other non-IP network - CAA now uses target mode on the HBAs to provide heartbeat over the SAN. A dedicated heartbeat disk could perhaps be re-used as the CAA repository if migrating to 7.1
No more Token Ring, FDDI, ATM (!)
Host name changes when the cluster fails over (a favourite for SAP app teams given root access to play around?) won't work - CAA will break as soon as the hostname changes
The AIX hostname cannot be different from the HA node name, although it normally isn't anyway
Non-ECVG volume groups (i.e. VGs that are not enhanced concurrent) won't work
IPAT via replacement - there shouldn't really be any reason to be using this nowadays anyway
Heartbeat over IP aliasing - a few people use this, but it's not recommended and will no longer work
Two-node configuration assistant - doubt this will be missed much
Summary of changes in 7.1 - be aware of these if nothing else!
7.1 uses Cluster Aware AIX (CAA). This is not an HA-specific thing and is also used by VIOS Shared Storage Pools. It is new AIX functionality introduced in 6.1 TL6 and 7.1.
Very importantly, CAA requires:
A repository disk >= 1GB
A multicast IP address
There is now a much-needed clmgr command line interface, which I think is really neat, so you can do things like:
clmgr add cluster mycluster nodes=nodea,nodeb repository=hdisk4
clmgr add service myvip network=net_ether_01
clmgr add rg myRG nodes=nodea,nodeb fallback=nfb service_label=myvip volume_group=sharedvg
A cluster (albeit a very simple one) built in 3 easy commands - this is a massive improvement on before. Veritas has been able to do this all along, and the time taken to build and administer clusters using only SMIT has been a complaint for years.
This also facilitates automated cluster builds, which were difficult and not recommended before.
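As a sketch of what a scripted build could look like (same example names as above; the clmgr verify/sync actions are assumed to match your SP level, and /etc/cluster/rhosts must already be populated on both nodes - see the installation section below):
#!/usr/bin/ksh
set -e    # stop at the first failure
clmgr add cluster mycluster nodes=nodea,nodeb repository=hdisk4
clmgr add service myvip network=net_ether_01
clmgr add rg myRG nodes=nodea,nodeb fallback=nfb service_label=myvip volume_group=sharedvg
clmgr verify cluster
clmgr sync cluster    # the sync is what creates the underlying CAA cluster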
Director plugin in 7.1 - a welcome replacement for WebSMIT. It allows you to do most things on a cluster and looks nice for people who like a GUI picture of the cluster / cluster admin; again, something that Veritas has been able to do for as long as I can remember
SMIT menus - the whole layout has changed (again), but most things are still there
At last! HA 7.1 can be configured to respond immediately (i.e. fail over) when rootvg is lost - something it never used to do, which caused several major issues with customers on SAN boot, so this is very welcome
New SmartAssists:
SAP
FileNet
Tivoli Storage Manager
Lotus Domino Server
MaxDB
New resource group dependencies and custom resources - more info on these to come below
Installation of HA71 - experience, CAA gotchas
HA filesets generally have similar names to previous versions, so it should be fairly easy to see what you need to install.
As above, certain AIX pre-reqs are required, but these are all on the AIX media if not already installed
The first gotcha was that initial HA discovery fails with the very helpful error message:
/usr/es/sbin/cluster/utilities/clvt_kshHandler: line 1526: 4194454: Memory fault. (core dump)
-- this may well be improved at the latest HA SP4, but the issue was the lack of the /etc/cluster/rhosts file, which needs to contain the host name of all nodes. Note the different path in this version - it's part of CAA now, not an HA-specific thing
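A minimal fix, assuming the nodea/nodeb example names from earlier (one hostname or IP per line, on every node, then a clcomd refresh):
cat > /etc/cluster/rhosts <<EOF
nodea
nodeb
EOF
refresh -s clcomd    # or stopsrc -s clcomd && startsrc -s clcomd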
You can't sync the cluster or do much else without the repository disk defined, which as above needs to be a minimum of 1GB - you define the repository disk in the HA menus or on the command line with clmgr add cluster (see the example above)
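A quick sanity check on the candidate repository disk (hdisk4 is just the example name used in the clmgr command above):
bootinfo -s hdisk4    # size in MB, so expect 1024 or more
lspv | grep hdisk4    # should show None for the VG - the disk must not already be in use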
In order to use SFWCOM (the new SAN-based monitoring replacing non-IP heartbeat) on an LPAR which uses either VSCSI or NPIV, you need to create VLAN 3358 on your system and create a Virtual Ethernet Adapter on the LPAR, plus enable target mode on the Fibre Adapter(s) on the VIO server(s).
The process is described here https://www-304.ibm.com/support/entdocview.wss?uid=isg1IV03643
You can tell if it's worked because a new sfwcommX device should appear on AIX as a child device of the new virtual ethernet adapter:
ha71a:/ # lsdev -Ccadapter
ent0   Available         Virtual I/O Ethernet Adapter (l-lan)
ent1   Available         Virtual I/O Ethernet Adapter (l-lan)
fcs0   Defined    48-T1  Virtual Fibre Channel Client Adapter
fcs1   Available  48-T1  Virtual Fibre Channel Client Adapter
vsa0   Available         LPAR Virtual Serial Adapter
ha71a:/ # lsdev -p ent1
sfwcomm2 Available vLAN Storage Framework Comm
NOTE: for NPIV, there is no target mode on a virtual fibre adapter - this doesn't work at the NPIV level, only on the physical port on the VIOS.
If using dedicated adapters on the LPAR, you need to enable target mode on them
This is not in the Redbook or InfoCenter yet (when I last checked), only in a documentation APAR
This also won't work if you're below VIOS 2.2.0.11-FP24SP01
After the basic cluster is created (which in HA terms shouldn't be much different from 6.1, unless you like to use the great new clmgr command!), the initial cluster verify and sync creates the CAA AIX cluster under the covers (it calls the mkcluster AIX command):
# lspv
hdisk0          00cbd0ce8a323917    rootvg                  active
caa_private0    00cbd0cef06f31cf    >>>>> caavg_private     active <<<<<
hdisk2          00cbd0ce8a9c68fa    rootvg                  active
NOTE: CAA no longer renames the disk at newer AIX TLs (Oct 2011) - CAA has changed significantly from the initial release
At this stage the lscluster commands should work, to query the state of the CAA cluster on both nodes, e.g.:
ha71a:/ # lscluster -i
Network/Storage Interface Query

Cluster Name:  ha71acluster
Cluster uuid:  d9fcabe0-ad5d-11e0-b5c3-722d73a58903
Number of nodes reporting = 2
Number of nodes expected = 2
Node ha71a
Node uuid = 3b15b84c-6b4e-11e0-9b00-722d73a58903
Number of interfaces discovered = 3
        Interface number 1 en0                  <--- Ethernet interface
                ifnet type = 6 ndd type = 7
                Mac address length = 6
                Mac address = 72.2d.73.a5.89.2
                Smoothed rrt across interface = 7
                Mean Deviation in network rrt across interface = 3
                Probe interval for interface = 100 ms
                ifnet flags for interface = 0x1e080863
                ndd flags for interface = 0x21081b
                Interface state UP              <--- UP = good
                Number of regular addresses configured on interface = 2   <-- boot and service address
                IPV4 ADDRESS: 9.137.62.148 broadcast 9.137.62.255 netmask 255.255.255.0
                IPV4 ADDRESS: 9.137.62.218 broadcast 9.137.62.255 netmask 255.255.255.0
                Number of cluster multicast addresses configured on interface = 1
                IPV4 MULTICAST ADDRESS: 228.137.62.148 broadcast 0.0.0.0 netmask 0.0.0.0
        Interface number 2 sfwcom               <--- this is the monitoring over SAN
                ifnet type = 0 ndd type = 304
                Mac address length = 0
                Mac address = 0.0.0.0.0.0
                Smoothed rrt across interface = 7
                Mean Deviation in network rrt across interface = 3
                Probe interval for interface = 100 ms
                ifnet flags for interface = 0x0
                ndd flags for interface = 0x9
                Interface state UP              <--- main thing to check I guess :)
        Interface number 3 dpcom                <--- this is the repository disk; it's not being used for
                                                     monitoring, and will ONLY be used if all ethernet and
                                                     storage interfaces can't be used
                ifnet type = 0 ndd type = 305
                Mac address length = 0
                Mac address = 0.0.0.0.0.0
                Smoothed rrt across interface = 750
                Mean Deviation in network rrt across interface = 1500
                Probe interval for interface = 22500 ms
                ifnet flags for interface = 0x0
                ndd flags for interface = 0x9
                Interface state UP RESTRICTED AIX_CONTROLLED    <-- this is normal
As far as I am aware, this is the lowest level you would normally get to in terms of checking interface monitoring. It replaces the topsvcs logs, nim.XXXX files, etc. that used to sit in /var/ha/log and were used to look at missed heartbeats; all of that is no more, and the info above replaces it.
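Other lscluster views are worth knowing for the same sort of checking (flags as I understand them at this level):
lscluster -m    # node membership and state
lscluster -d    # cluster storage, including the repository disk
lscluster -c    # cluster configuration, including the multicast address
lscluster -s    # cluster network statistics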
Basic CAA PD
/var/adm/ras/syslog.caa <-- useful log for CAA, you can see HA creating CAA cluster etc
There are also some fairly scary-looking logs within /var/ct which I haven't investigated fully yet, for example the /var/ct/<clustername>/log/cthags group services logs
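These are all plain-text logs, so the usual tools apply - e.g. to watch HA creating the CAA cluster during the first sync, or to look back for problems:
tail -f /var/adm/ras/syslog.caa
grep -i error /var/adm/ras/syslog.caa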
Check status of CAA:
# lssrc -g caa
Subsystem         Group            PID          Status
 cld              caa              6291678      active
 clcomd           caa              6095042      active
 clconfd          caa              3932170      active
 solidhac         caa              7012454      active
 solid            caa              4653292      active
#
The above are the CAA daemons, including the solidDB database which is made highly available by CAA across the two nodes.
NOTE: solidDB is no longer used at AIX 6.1 TL7 / 7.1 TL1 - more details to come
# df
Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4          393216         0  100%    10232    93% /
...
/aha                   -         -    -        42     1% /aha
/dev/fslv03       524288    499424    5%       17     1% /clrepos_private2
#
On AIX 6.1 TL6 / 7.1 TL0, on a 2-node cluster, clrepos_private1 should be mounted on one node, and clrepos_private2 on the other. Later AIX releases don't feature the clrepos_private* filesystems.
Fingers crossed many stability issues are fixed in the latest fix pack (I've not encountered them yet on SP4), but I have had to do all of the following at some stage:
if one of the clrepos_private* filesystems wasn't mounted on a node, manually mount it
HOWEVER - clrepos_private* filesystems only exist at 6.1 TL6 / 7.1 TL0 - recommended to upgrade to next TL
startsrc -g caa if not already running
reboot node if all else fails
do not try to manually create or modify the CAA cluster - HA should be doing it
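For the record, the manual steps above look something like this (the clrepos filesystem names apply to 6.1 TL6 / 7.1 TL0 only, as noted):
mount | grep clrepos || mount /clrepos_private1    # or /clrepos_private2 on the other node
lssrc -g caa       # check the CAA daemons are active
startsrc -g caa    # start them if not
shutdown -Fr       # last resort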
Latest enhancements in PowerHA 7.1.1 (June 2012)
Repository disk resiliency to avoid repository disk SPOF (no longer a showstopper for customers who do cross-SAN LVM mirroring)
netmon.cf can (and should) be configured as for previous versions of PowerHA
see http://www-01.ibm.com/support/docview.wss?uid=isg1IV14422
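For reference, a netmon.cf sketch for a virtual ethernet environment - the file lives in /usr/es/sbin/cluster/, and the target addresses below are purely illustrative (pick addresses outside the machine, e.g. the default gateway, that en0 must be able to reach for the interface to be considered up):
!REQD en0 9.137.62.1
!REQD en0 9.137.62.2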
Flex p260/p460 compute nodes supported - http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FLASH10779
More on CAA - important as this is the main difference with 7.1
CAA is an AIX-level cluster so in an HA setup we have 2 distinct clusters, the CAA one and the PowerHA one.
You only create the PowerHA one, and the CAA one is created automatically. It takes the same name because HA gives it the same name when it calls mkcluster on AIX. Unless things go wrong, you generally don't need to worry about or directly touch the CAA cluster (yeah, right).
A CAA cluster can exist without PowerHA (and is used for other things like VIOS SSP), but it doesn't provide any quorum or failover capability - it just monitors interfaces and provides disk fencing
CAA takes care of heartbeating - RSCT topology services is no longer used, disk heartbeat is no longer used.
CAA requires a repository disk (a disk visible to both cluster nodes) of minimum 1GB
CAA requires a multicast IP address, which you can either specify or let it auto-generate, so multicast traffic needs to be permitted on the network
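A simple way to prove multicast actually works between the nodes before building the cluster is mping, which ships with AIX (the address here is the auto-generated one from the lscluster output above - use whatever your cluster will use):
mping -r -v -a 228.137.62.148    # on the first node: listen
mping -s -v -a 228.137.62.148    # on the second node: send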
Careful planning is required: the repository disk, cluster name and cluster IP cannot be changed without removal and recreation of the cluster, or at least without some playing around and an outage
CAA will also heartbeat over the SAN, using target mode on the Fibre Adapters (remember SSA!)
you need to chdev the following attributes on the adapters (on the VIOS [vscsi] or the LPAR [npiv/physical]):
fcs tme=yes
fscsi dyntrk=yes
fscsi fc_err_recov=fast_fail
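A sketch of those changes (fcs0/fscsi0 are example device names; which partition they belong on depends on VSCSI vs NPIV vs dedicated as described above, and on a VIOS you would run them from oem_setup_env). The adapters are usually busy, so -P plus a reboot, or an rmdev/cfgmgr of the adapter, is needed:
chdev -l fcs0   -a tme=yes -P
chdev -l fscsi0 -a dyntrk=yes -a fc_err_recov=fast_fail -P
shutdown -Fr    # or unconfigure/reconfigure the adapter pair instead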
Ensure your Fibre Adapter is supported for this CAA communication.
This CAA monitoring happens in kernel space as opposed to user space and will, apparently, not be impacted by CPU starvation, so the Deadman Switch (DMS) no longer exists (this was required in previous versions to allow a node which was under extreme load to "kill itself" before another node attempted to take over and cause cluster partitioning/split brain)
This can all happen using dedicated adapters or VIO using VSCSI or NPIV, but there are gotchas around AIX SP levels, and VIOS FP24-SP01 is the minimum required
Certain things need to be done on the VIOS, like enabling TM on the HBAs, and for VSCSI/NPIV the secret VLAN 3358 needs creating (see above) - a totally crucial but so far totally undocumented requirement
CAA monitoring is supposed to be very clever and self-adjusting, i.e. it tolerates slower networks etc. If you're interested, Scalable Reliable Multicast is used - I have no idea how it works, but there's plenty on Google; I believe it's a standard protocol rather than an IBM-specific thing. More detail if you are interested:
Pre-7.1 HA versions using topology services:
7.1 now uses CAA and not topsvcs: