06 EM Agent Does Not Start After Server is on New Subnet

References

Oracle® Enterprise Manager Advanced Configuration 10g Release 5 (10.2.0.5)

RHEL 5.0 Deployment Guide

Overview

After moving BANKSTOWN from 10.1.1.0/24 subnet to the 192.168.1.0/24 one, the EM agent failed on startup. Correcting the hosts entry entry fixed the problem.

Analysis

Startup

Using Starting, Stopping, and Checking the Status of the Management Agent on UNIX, the startup of the agent is done by emctl start agent. The message returned is:

[oracle@bankstown ~]$ emctl start agent

Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.

Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved.

Starting agent ........................................................................................................................... failed.

Consult the log files in: /u00/app/oracle/agent10g/sysman/log

Logs

The following logs were checked for messages at the time of agent startup at 21:58 on 2 February 2011:

EMAGENT.LOG

About the Management Agent Log and Trace Files describes emagent.log as:

    • The Management Agent log file (emagent.log)
    • The Agent saves information to the log file when the Agent performs an action (such as starting, stopping, or connecting to a Management Service) or when the Agent generates an error (for example, when the Agent cannot connect to the Management Service).

/u00/app/oracle/agent10g/sysman/log/emagent.log shows:

2012-02-02 21:58:49,603 Thread-10050336 Starting Agent 10.2.0.5.0 from /u00/app/oracle/agent10g (00701)

2012-02-02 21:58:54,547 Thread-10050336 [E-Mail IMAP Server] InstanceProperty (imap_host) is marked OPTIONAL but is being used. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml (00506)

2012-02-02 21:58:57,291 Thread-10050336 [E-Mail POP Server] InstanceProperty (pop_host) is marked OPTIONAL but is being used. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/pop.xml (00506)

2012-02-02 21:58:57,502 Thread-10050336 [E-Mail SMTP Inbound Server] InstanceProperty (smtp_host) is marked OPTIONAL but is being used. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/smtp_in.xml (00506)

2012-02-02 21:58:57,518 Thread-10050336 [E-Mail SMTP Outbound Server] InstanceProperty (smtp_host) is marked OPTIONAL but is being used. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/smtp_out.xml (00506)

2012-02-02 22:00:11,958 Thread-10050336 EMAgent started successfully (00702)

The highlighted message contradicts the output from the emctl start agent.

EMAGENT.TRC

About the Management Agent Log and Trace Files describes emagent.trc as:

    • The Management Agent trace file (emagent.trc)
    • The Agent saves information to the log file when the Agent performs an action (such as starting, stopping, or connecting to a Management Service) or when the Agent generates an error (for example, when the Agent cannot connect to the Management Service).

/u00/app/oracle/agent10g/sysman/log/emagent.trc shows:

2012-02-02 21:58:50,555 Thread-10050336 WARN command: Job Subsystem Timeout set at 600 seconds

2012-02-02 21:58:50,642 Thread-10050336 WARN upload: Upload manager has no Failure script: disabled

2012-02-02 21:58:51,315 Thread-10050336 WARN metadata: File=file:/u00/app/oracle/agent10g/sysman/admin/metadata/bea_alsb.xml,line=10: <AssocTargetInstance> is not valid element, will be ignored

2012-02-02 21:58:54,543 Thread-10050336 WARN metadata: TABLE metric umsg_imap_user_count can't define table_name: mail_imap_usercount. Filename:/u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,543 Thread-10050336 WARN metadata: Metric umsg_imap_user_count does not have any data columns. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,543 Thread-10050336 WARN metadata: TABLE metric umsg_imap_db_session_count can't define table_name: umsg_imap_dbconnections. Filename:/u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: Metric umsg_imap_db_session_count does not have any data columns. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: TABLE metric umsg_imap_socket_count can't define table_name: umsg_imap_socketcount. Filename:/u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: Metric umsg_imap_socket_count does not have any data columns. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: TABLE metric umsg_imap_connections can't define table_name: mail_imap_connections. Filename:/u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: TABLE metric umsg_imap_connections_lost can't define table_name: umsg_imap_connections_lost. Filename:/u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

2012-02-02 21:58:54,544 Thread-10050336 WARN metadata: Metric umsg_imap_connections_lost does not have any data columns. Filename /u00/app/oracle/agent10g/sysman/admin/metadata/imap.xml

This log file does not show any errors regarding the startup problem, if any.

EMAGENT.NOHUP

About the Management Agent Log and Trace Files describes emagent.nohup as:

    • The Management Agent startup log file (emagent.nohup)
    • The Management Agent saves information to the startup log file when there is a problem starting the agent. This file is updated by the Management Agent Watchdog Process. When the Watchdog Process finds any problems, it logs to this file.

/u00/app/oracle/agent10g/sysman/log/emagent.nohup shows:

----- Thu Feb 2 21:58:48 2012::Time elapsed between Launch of Watchdog process and execing EMAgent is 1 secs -----

----- Thu Feb 2 21:58:48 2012::Agent Launched with PID 5799 at tim Thu Feb 2 21:58:48 2012 -----

(pid=5799): starting emagent version 10.2.0.5.0

----- Thu Feb 2 21:59:22 2012::Checking status of EMAgent : 5799 - ---

----- Thu Feb 2 21:59:52 2012::Checking status of EMAgent : 5799 - ---

(pid=5799): emagent started successfully

----- Thu Feb 2 22:00:26 2012::Checking status of EMAgent : 5799 - ---

This log also indicates that the startup of the EM agent was successful as indicated by the highligthed message. Then, why does the output from the emagent start agent command say otherwise?

EMDCTL.TRC

About the Management Agent Log and Trace Files describes emdctl.trc as:

Agent control utility trace file

/u00/app/oracle/agent10g/sysman/log/emdctl.trc shows:

2012-02-02 21:58:36,476 Thread-9304864 WARN http: nmehl_connect_internal: connect failed to (bankstown.yaocm.id.au:3872): No route to host (error = 113)

2012-02-02 21:58:36,537 Thread-9304864 ERROR main: nmectla_agentctl: Error connecting to https://bankstown.yaocm.id.au:3872/emd/main/. Returning status code 1

Here we have, in the highlighted message above, the first clue to what the problem is: the agent cannot connect to the host itis monitoring.

EMAGENT_PERL.TRC

About the Management Agent Log and Trace Files describes emagent_perl.trc as:

Trace file for the PERL scripts. This includes the PERL metrics and the discovery

/u00/app/oracle/agent10g/sysman/log/emagent_perl.trc shows:

dynamic_esa_config.pl: Thu Feb 2 21:59:00 2012: ERROR: dynamic_esa_config.pl STARTED

dynamic_esa_config.pl: Thu Feb 2 21:59:00 2012: ERROR: File /u00/app/oracle/agent10g/sysman/config/esa/bankstown.yaocm.id.au_host.properties was not found at location /u00/app/oracle/agent10g/sysman/config/esa

dynamic_esa_config.pl: Thu Feb 2 21:59:00 2012: ERROR: dynamic_esa_config.pl ENDED

maxRowCount.pl: Thu Feb 2 21:59:00 2012: ERROR: File /u00/app/oracle/agent10g/sysman/config/esa/LISTENER_bankstown.yaocm.id.au_oracle_listener.properties was not found at location /u00/app/oracle/agent10g/sysman/config/esa

has_metrics.pl: Thu Feb 2 21:59:18 2012: WARN: WARN:has::Common:: has::Common::hasGetSQLResults: Could not connect to dbsnmp/(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=bankstown.yaocm.id.au)(Port=1521))(CONNECT_DATA=(SID=ocm11g))): ORA-12560: TNS:protocol adapter error (DBD ERROR: OCIServerAttach) at /u00/app/oracle/agent10g/sysman/admin/scripts/has/SQL.pl line 252, <STDIN> line 2.

has_metrics.pl: Thu Feb 2 21:59:22 2012: WARN: WARN:has::Common:: has::Common::hasGetSQLResults: Could not connect to dbsnmp/(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=bankstown.yaocm.id.au)(Port=1521))(CONNECT_DATA=(SID=ocm11g))): ORA-12560: TNS:protocol adapter error (DBD ERROR: OCIServerAttach) at /u00/app/oracle/agent10g/sysman/admin/scripts/has/SQL.pl line 252, <STDIN> line 2.

The highlighted error message reinforces the idea that a TCP client client cannot connect to the host called bankstown.yaocm.id.au.

EMCTL.LOG

The log, emctl.log, is not described in About the Management Agent Log and Trace Files.

/u00/app/oracle/agent10g/sysman/log/emctl.log shows:

5200 :: Thu Feb 2 21:58:30 2012::AgentLifeCycle.pm: Processing status agent

5200 :: Thu Feb 2 21:58:30 2012::AgentStatus.pm:Processing status agent

5200 :: Thu Feb 2 21:58:36 2012::AgentStatus.pm:emdctl status returned 1

5763 :: Thu Feb 2 21:58:44 2012::AgentLifeCycle.pm: Processing start agent

5763 :: Thu Feb 2 21:58:44 2012::AgentLifeCycle.pm: EMHOME is /u00/app/oracle/agent10g

5763 :: Thu Feb 2 21:58:44 2012::AgentLifeCycle.pm: service name is

5763 :: Thu Feb 2 21:58:47 2012::AgentLifeCycle.pm:status agent returned with retCode=1

5763 :: Thu Feb 2 22:06:43 2012::AgentLifeCycle.pm: Exited loop with retCode=1

7544 :: Thu Feb 2 22:27:59 2012::AgentLifeCycle.pm: Processing status agent

7544 :: Thu Feb 2 22:27:59 2012::AgentStatus.pm:Processing status agent

7544 :: Thu Feb 2 22:28:02 2012::AgentStatus.pm:emdctl status returned 1

Three (3) processes exited with status of one (1). No reasons are given.

AGABEND.LOG

About the Management Agent Log and Trace Files describes agabend.log as:

This log provided in 10.2.0.3 or higher contains all the Agent startup errors. Errors will be added for each failed startup to this file. The Agent watchdog mines this file, to report on an abnormal end of the Agent.

/u00/app/oracle/agent10g/sysman/log/agabend.log shows:

Thu Feb 2 21:58:47 2012

XXXXXXXXXXXXXXXX

No messages of interest here at all.

Result of Log Analysis

The following logs indicate that there is a TCP connectivity issue with the host called bankstown.yaocm.id.au:

  1. EMAGENT_PERL.TRC
  2. EMDCTL.TRC

Diagnosis

Since the error messages indicate a TCP connectivity issue, let's try a simple ping test first:

[oracle@bankstown log]$ ping -c 5 bankstown.yaocm.id.au

PING bankstown.yaocm.id.au (10.1.1.244) 56(84) bytes of data.

From bankstown.yaocm.id.au (192.168.1.244) icmp_seq=2 Destination Host Unreachable

From bankstown.yaocm.id.au (192.168.1.244) icmp_seq=3 Destination Host Unreachable

From bankstown.yaocm.id.au (192.168.1.244) icmp_seq=4 Destination Host Unreachable

From bankstown.yaocm.id.au (192.168.1.244) icmp_seq=5 Destination Host Unreachable

--- bankstown.yaocm.id.au ping statistics ---

5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4141ms,

pipe 2

Love how the IP address of the reverse name lookup changes with the second ping message!

Check Hosts Data

Used the procedure in 16.9. Managing Hosts to verify that the following hosts entry existed:

IP

10.1.1.244

Name

bankstown.yaocm.id.au

Aliases

bankstown

Changed the IP address from 10.1.1.244 to 192.168.1.244 and saved the changes.

Ping is now Successful

Pinging bankstown.yaocm.id.au is now successful:

[oracle@bankstown log]$ ping -c 5 bankstown.yaocm.id.au

PING bankstown.yaocm.id.au (192.168.1.244) 56(84) bytes of data.

64 bytes from bankstown.yaocm.id.au (192.168.1.244): icmp_seq=1 ttl=64 time=0.369 ms

64 bytes from bankstown.yaocm.id.au (192.168.1.244): icmp_seq=2 ttl=64 time=0.097 ms

64 bytes from bankstown.yaocm.id.au (192.168.1.244): icmp_seq=3 ttl=64 time=0.032 ms

64 bytes from bankstown.yaocm.id.au (192.168.1.244): icmp_seq=4 ttl=64 time=0.032 ms

64 bytes from bankstown.yaocm.id.au (192.168.1.244): icmp_seq=5 ttl=64 time=0.033 ms

--- bankstown.yaocm.id.au ping statistics ---

5 packets transmitted, 5 received, 0% packet loss, time 3999ms

rtt min/avg/max/mdev = 0.032/0.112/0.369/0.131 ms

Start the Agent Again

Using the emctl start agent from Starting, Stopping, and Checking the Status of the Management Agent on UNIX, I get the following result:

[oracle@bankstown log]$ emctl start agent

Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.

Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved. Agent is already running

This confirms the successful startup messages seen in the following logs:

Check the Status of the EM Agent

Using the emctl start agent from Starting, Stopping, and Checking the Status of the Management Agent on UNIX, I get the following result:

[oracle@bankstown log]$ emctl status agent

Oracle Enterprise Manager 10g Release 5 Grid Control 10.2.0.5.0.

Copyright (c) 1996, 2009 Oracle Corporation. All rights reserved.

---------------------------------------------------------------

Agent Version : 10.2.0.5.0

OMS Version : 10.2.0.5.0

Protocol Version : 10.2.0.5.0

Agent Home : /u00/app/oracle/agent10g

Agent binaries : /u00/app/oracle/agent10g

Agent Process ID : 5799

Parent Process ID : 5781

Agent URL : https://bankstown.yaocm.id.au:3872/emd/main/

Repository URL : https://gridctrl.yaocm.id.au:1159/em/upload

Started at : 2012-02-02 21:58:48

Started by user : oracle

Last Reload : 2012-02-02 22:09:00

Last successful upload : 2012-02-02 22:55:56

Total Megabytes of XML files uploaded so far : 5.30

Number of XML files pending upload : 0

Size of XML files pending upload(MB) : 0.00

Available disk space on upload filesystem : 52.08%

Last successful heartbeat to OMS : 2012-02-02 22:59:35

---------------------------------------------------------------

Agent is Running and Ready

Conclusion

As with the drama documented in OMS Does Not Start After Server is on New Subnet, the hosts entry was the root cause of the problem.