Chapter Wordtags: Network Baseline, Network Documentation, Network Performance, Bottom up, Top down, Divide and Conquer, core dump, extended ping, RFC, Jabber, latency, jitter, Carrier
Documenting Your Network
To efficiently diagnose and correct network problems, a network engineer needs to know how a network has been designed and what the expected performance for this network should be under normal operating conditions. This information is called the network baseline and is captured in documentation such as configuration tables and topology diagrams.
Network configuration documentation provides a logical diagram of the network and detailed information about each component. This information should be kept in a single location, either as hard copy or on the network on a protected website. Network documentation should include these components:
Network configuration table
End-system configuration table
Network topology diagram
Network Configuration Table
Contains accurate, up-to-date records of the hardware and software used in a network. The network configuration table should provide the network engineer with all the information necessary to identify and correct the network fault.
The table in the figure illustrates the data set that should be included for all components:
Type of device, model designation
IOS image name
Device network hostname
Location of the device (building, floor, room, rack, panel)
If it is a modular device, include all module types and in which module slot they are located
Data Link layer addresses
Network layer addresses
Any additional important information about physical aspects of the device
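As a rough illustration, the data set listed above could be held as one record per device. This is a minimal sketch only: the class and field names (DeviceRecord, InterfaceRecord, missing_fields) are hypothetical and do not come from any vendor tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InterfaceRecord:
    """Addressing for one interface (Data Link and Network layer)."""
    name: str
    mac_address: str
    ip_address: str


@dataclass
class DeviceRecord:
    """One row of a network configuration table (field names are illustrative)."""
    hostname: str
    device_type: str
    model: str
    ios_image: str
    location: str  # building, floor, room, rack, panel
    modules: Dict[int, str] = field(default_factory=dict)  # slot number -> module type
    interfaces: List[InterfaceRecord] = field(default_factory=list)
    notes: str = ""  # any other physical details worth recording

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are still blank."""
        required = ("hostname", "device_type", "model", "ios_image", "location")
        return [name for name in required if not getattr(self, name).strip()]


# Example: a partially documented router (all values are placeholders).
r1 = DeviceRecord(hostname="R1", device_type="Router", model="",
                  ios_image="", location="Building A, floor 1, rack 2")
print(r1.missing_fields())
```

A check like missing_fields() is one way to spot incomplete rows before the table is needed during an outage.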
End-system Configuration Table
Contains baseline records of the hardware and software used in end-system devices such as servers, network management consoles, and desktop workstations. An incorrectly configured end system can have a negative impact on the overall performance of a network.
For troubleshooting purposes, the following information should be documented:
Device name (purpose)
Operating system and version
IP address
Subnet mask
Default gateway, DNS server, and WINS server addresses
Any high-bandwidth network applications that the end-system runs
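One common end-system misconfiguration is a default gateway that is not on the host's own subnet. The sketch below checks a documented IP address, subnet mask, and default gateway using Python's standard ipaddress module. The function name and the sample addresses are illustrative assumptions.

```python
import ipaddress


def gateway_is_on_local_subnet(ip: str, mask: str, gateway: str) -> bool:
    """Return True if the default gateway lies inside the host's own subnet."""
    # ip_interface accepts "address/netmask" and derives the network from it.
    network = ipaddress.ip_interface(f"{ip}/{mask}").network
    return ipaddress.ip_address(gateway) in network


# Example with placeholder addresses from a documentation table.
print(gateway_is_on_local_subnet("192.168.10.10", "255.255.255.0", "192.168.10.1"))
print(gateway_is_on_local_subnet("192.168.10.10", "255.255.255.0", "192.168.11.1"))
```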
Network Topology Diagram
Graphical representation of a network, which illustrates how each device in a network is connected and its logical architecture. A topology diagram shares many of the same components as the network configuration table. Each network device should be represented on the diagram with consistent notation or a graphical symbol. Also, each logical and physical connection should be represented using a simple line or other appropriate symbol. Routing protocols can also be shown.
At a minimum, the topology diagram should include:
Symbols for all devices and how they are connected
Interface types and numbers
IP addresses
Subnet masks
A network performance baseline helps answer the following questions:
How does the network perform during a normal or average day?
Where are the underutilized and over-utilized areas?
Where are the most errors occurring?
What thresholds should be set for the devices that need to be monitored?
Can the network deliver the identified policies?
Planning for the First Baseline
Because the initial network performance baseline sets the stage for measuring the effects of network changes and subsequent troubleshooting efforts, it is important to plan for it carefully. Here are the recommended steps for planning the first baseline:
Step 1. Determine what types of data to collect
When conducting the initial baseline, start by selecting a few variables that represent the defined policies. If too many data points are selected, the amount of data can be overwhelming, making analysis of the collected data difficult. Start out simply and fine-tune along the way. Generally, some good starting measures are interface utilization and CPU utilization. The figure shows some screenshots of interface and CPU utilization data, as displayed by a Fluke Networks network management system.
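Interface utilization is usually derived from two readings of an interface byte counter taken a known interval apart. The following is a minimal sketch of that arithmetic. The function name and parameters are assumptions, and it presumes the counter wrapped at most once between readings.

```python
def interface_utilization_percent(octets_start, octets_end, interval_seconds,
                                  speed_bps, counter_bits=32):
    """Percent utilization in one direction from two octet-counter readings."""
    if interval_seconds <= 0 or speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    # Counters wrap back to zero; the modulo handles a single wrap.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_sent = delta_octets * 8
    return bits_sent * 100 / (interval_seconds * speed_bps)


# Example: 375,000,000 octets in 5 minutes on a 100 Mb/s link.
print(interface_utilization_percent(0, 375_000_000, 300, 100_000_000))
```

Polling too infrequently on a fast link can let a 32-bit counter wrap more than once, which this sketch cannot detect.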
Step 2. Identify devices and ports of interest
The next step is to identify those key devices and ports for which performance data should be measured. Devices and ports of interest include:
Network device ports that connect to other network devices
Servers
Key users
Anything else considered critical to operations
In the topology shown in the figure, the network administrator has highlighted the devices and ports of interest to monitor during the baseline test. The devices of interest include routers R1, R2, and R3, PC1 (the Admin terminal), and SRV1 (the Web/TFTP server). The ports of interest include those ports on R1, R2, and R3 that connect to the other routers or to switches, and on router R2, the port that connects to SRV1 (Fa0/0).
By narrowing the ports polled, the results are concise, and network management load is minimized. Remember that an interface on a router or switch can be a virtual interface, such as a switch virtual interface (SVI).
This step is easier if you have configured the device port description fields to indicate what connects to the port. For example, for a router port that connects to the distribution switch in the Engineering workgroup, you might configure the description, "Engineering LAN distribution switch."
Step 3. Determine the baseline duration
It is important that the length of time and the baseline information being gathered are sufficient to establish a typical picture of the network. This period should be at least seven days to capture any daily or weekly trends. Weekly trends are just as important as daily or hourly trends.
The figure shows examples of several screenshots of CPU utilization trends captured over a daily, weekly, monthly, and yearly period. The work week trends are too short to accurately reveal the recurring nature of the utilization surge that occurs every weekend on Saturday evening when a major database backup operation consumes network bandwidth. This recurring pattern is revealed in the monthly trend. The yearly trend shown in the example is too long a duration to provide meaningful baseline performance details. A baseline needs to last no more than six weeks, unless specific long-term trends need to be measured. Generally, a two-to-four-week baseline is adequate.
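To see why a capture shorter than a full week can miss a recurring weekend surge, the samples can be grouped by day of the week. This is a sketch on synthetic data. The function name and the 20 percent and 90 percent values are invented for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def mean_by_weekday(samples):
    """samples: iterable of (datetime, value). Returns {day name: mean value}."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[DAY_NAMES[timestamp.weekday()]].append(value)
    return {day: mean(values) for day, values in buckets.items()}


# Synthetic two-week capture: one sample per day, with a surge every Saturday.
start = datetime(2024, 1, 1)  # a Monday
samples = []
for day in range(14):
    when = start + timedelta(days=day)
    samples.append((when, 90.0 if when.weekday() == 5 else 20.0))

print(mean_by_weekday(samples))      # Saturday stands out
print(mean_by_weekday(samples[:5]))  # a work-week capture has no Saturday at all
```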
You should not perform a baseline measurement during times of unique traffic patterns because the data would provide an inaccurate picture of normal network operations. You would get an inaccurate measure of network performance if you performed a baseline measurement on a holiday or during a month when most of the company is on vacation.
Baseline analysis of the network should be conducted on a regular basis. Perform an annual analysis of the entire network or baseline different sections of the network on a rotating basis. Analysis must be conducted regularly to understand how the network is affected by growth and other changes.
Network engineers, administrators, and support personnel realize that troubleshooting is a process that takes the greatest percentage of their time. Using efficient troubleshooting techniques shortens overall troubleshooting time when working in a production environment.
Two extreme approaches to troubleshooting almost always result in disappointment, delay, or failure. At one extreme is the theorist, or rocket scientist, approach. At the other extreme is the impractical, or caveman, approach.
The rocket scientist analyzes and reanalyzes the situation until the exact cause at the root of the problem has been identified and corrected with surgical precision. While this process is fairly reliable, few companies can afford to have their networks down for the hours or days that it can take for this exhaustive analysis.
The caveman's first instinct is to start swapping cards, cables, hardware, and software until miraculously the network begins operating again. This does not mean that the network is working properly, just that it is operating. While this approach may achieve a change in symptoms faster, it is not very reliable, and the root cause of the problem may still be present.
Since both of these approaches are extremes, the better approach is somewhere in the middle using elements of both. It is important to analyze the network as a whole rather than in a piecemeal fashion. A systematic approach minimizes confusion and cuts down on time otherwise wasted with trial and error.
When one layer is found to be functioning properly, the technician moves on to another layer that could be causing the problem.
OSI Reference Model
The OSI model provides a common language for network engineers and is commonly used in troubleshooting networks. Problems are typically described in terms of a given OSI model layer.
The OSI reference model describes how information from a software application in one computer moves through a network medium to a software application in another computer.
The upper layers (5-7) of the OSI model deal with application issues and generally are implemented only in software. The Application layer is closest to the end user. Both users and Application layer processes interact with software applications that contain a communications component.
The lower layers (1-4) of the OSI model handle data-transport issues. Layers 3 and 4 are generally implemented only in software. The Physical layer (Layer 1) and Data Link layer (Layer 2) are implemented in hardware and software. The Physical layer is closest to the physical network medium, such as the network cabling, and is responsible for actually placing information on the medium.
The stages of the general troubleshooting process are:
Stage 1 Gather symptoms - Troubleshooting begins with the process of gathering and documenting symptoms from the network, end systems, and users. In addition, the network administrator determines which network components have been affected and how the functionality of the network has changed compared to the baseline. Symptoms may appear in many different forms, including alerts from the network management system, console messages, and user complaints.
While gathering symptoms, questions should be used as a method of localizing the problem to a smaller range of possibilities.
Stage 2 Isolate the problem - The problem is not truly isolated until a single problem, or a set of related problems, is identified. To do this, the network administrator examines the characteristics of the problems at the logical layers of the network so that the most likely cause can be selected. At this stage, the network administrator may gather and document more symptoms depending on the problem characteristics that are identified.
Stage 3 Correct the problem - Having isolated and identified the cause of the problem, the network administrator works to correct the problem by implementing, testing, and documenting a solution. If the network administrator determines that the corrective action has created another problem, the attempted solution is documented, the changes are removed, and the network administrator returns to gathering symptoms and isolating the problem.
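Stage 1 compares current network behavior against the baseline. A minimal sketch of that comparison follows. The metric names, the sample values, and the 20 percent tolerance are assumptions chosen only for illustration.

```python
def deviations_from_baseline(baseline, current, tolerance_percent=20.0):
    """Return {metric: percent change} for metrics that moved beyond the tolerance."""
    flagged = {}
    for metric, base_value in baseline.items():
        # Skip metrics that were not measured or have no usable baseline value.
        if metric not in current or base_value == 0:
            continue
        change = (current[metric] - base_value) / base_value * 100
        if abs(change) > tolerance_percent:
            flagged[metric] = round(change, 1)
    return flagged


# Example with placeholder readings (percent utilization).
baseline = {"cpu": 20.0, "interface": 40.0}
current = {"cpu": 50.0, "interface": 42.0}
print(deviations_from_baseline(baseline, current))
```

Metrics flagged this way are symptoms to document, not causes. Isolation still follows in Stage 2.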
Troubleshooting Methods
There are three main methods for troubleshooting networks:
Bottom up
Top down
Divide and conquer
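The three methods differ mainly in the order in which the OSI layers are examined. The sketch below expresses that order, based on the usual definitions of the methods: bottom up starts at the Physical layer, top down starts at the Application layer, and divide and conquer starts at a chosen middle layer and moves up if that layer works or down if it does not. The function and parameter names are hypothetical.

```python
OSI_LAYERS = ["Physical", "Data Link", "Network", "Transport",
              "Session", "Presentation", "Application"]


def layer_check_order(method, start_layer=3, start_layer_ok=True):
    """Return the order in which layers would be examined for a given method."""
    if method == "bottom-up":
        return list(OSI_LAYERS)
    if method == "top-down":
        return list(reversed(OSI_LAYERS))
    if method == "divide-and-conquer":
        index = start_layer - 1
        if start_layer_ok:
            # The starting layer works, so lower layers are presumed fine: go up.
            return OSI_LAYERS[index:]
        # The starting layer fails: work down toward the Physical layer.
        return OSI_LAYERS[index::-1]
    raise ValueError(f"unknown method: {method}")


print(layer_check_order("divide-and-conquer", start_layer=3, start_layer_ok=False))
```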
Software Troubleshooting Tools
A wide variety of software and hardware tools are available to make troubleshooting easier. These tools may be used to gather and analyze symptoms of network problems and often provide monitoring and reporting functions that can be used to establish the network baseline.
NMS Tools
Network management system (NMS) tools include device-level monitoring, configuration, and fault management tools. The figure shows an example display from the WhatsUp Gold NMS software. These tools can be used to investigate and correct network problems. Network monitoring software graphically displays a physical view of network devices, allowing network managers to monitor remote devices without physically checking them. Device management software provides dynamic status, statistics, and configuration information for switched products. Examples of commonly used network management tools are CiscoView, HP OpenView, SolarWinds, and WhatsUp Gold.
Knowledge Bases
Online knowledge bases from network device vendors have become indispensable sources of information. When vendor knowledge bases are combined with Internet search engines like Google, a network administrator has access to a vast pool of experience-based information.
The figure shows the Cisco Tools & Resources page found at http://www.cisco.com. This is a free tool providing information on Cisco-related hardware and software. It contains troubleshooting procedures, implementation guides, and original white papers on most aspects of networking technology.
Baselining Tools
Many tools for automating the network documentation and baselining process are available. These tools are available for Windows, Linux, and AUX operating systems. The figure shows a screen capture of the SolarWinds LAN surveyor and CyberGauge software. Baselining tools help you with common baselining documentation tasks. For example, they can help you draw network diagrams, keep network software and hardware documentation up to date, and cost-effectively measure baseline network bandwidth use.
Protocol Analyzers
A protocol analyzer decodes the various protocol layers in a recorded frame and presents this information in a relatively easy-to-use format. The figure shows a screen capture of the Wireshark protocol analyzer. The information displayed by a protocol analyzer includes the physical, data link, and protocol information, along with descriptions, for each frame. Most protocol analyzers can filter traffic that meets certain criteria so that, for example, all traffic to and from a particular device can be captured.
Hardware Troubleshooting Tools
Network Analysis Module
A network analysis module (NAM) can be installed in Cisco Catalyst 6500 series switches and Cisco 7600 series routers to provide a graphical representation of traffic from local and remote switches and routers. The NAM is an embedded browser-based interface that generates reports on the traffic that consumes critical network resources. In addition, the NAM can capture and decode packets and track response times to pinpoint an application problem to the network or the server.
Digital Multimeters
Digital multimeters (DMMs) are test instruments that are used to directly measure electrical values of voltage, current, and resistance. In network troubleshooting, most of the multimeter tests involve checking power-supply voltage levels and verifying that network devices are receiving power.
Cable Testers
Cable testers are specialized, handheld devices designed for testing the various types of data communication cabling. Cabling testers can be used to detect broken wires, crossed-over wiring, shorted connections, and improperly paired connections. These devices can be inexpensive continuity testers, moderately priced data cabling testers, or expensive time-domain reflectometers (TDRs).
TDRs are used to pinpoint the distance to a break in a cable. These devices send signals along the cable and wait for them to be reflected. The time between sending the signal and receiving it back is converted into a distance measurement. The TDR function is normally packaged with data cabling testers. TDRs used to test fiber optic cables are known as optical time-domain reflectometers (OTDRs).
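The time-to-distance conversion a TDR performs can be sketched as follows. The signal travels to the fault and back, so the one-way distance is half the round trip. The propagation speed is the speed of light scaled by the cable's nominal velocity of propagation (NVP). The NVP value in the example is an assumed figure, and the real value should come from the cable's datasheet.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def distance_to_fault_m(round_trip_seconds, nvp):
    """Distance in meters to a reflection, given round-trip time and cable NVP."""
    if not 0 < nvp <= 1:
        raise ValueError("NVP is a fraction of the speed of light (0 < nvp <= 1)")
    if round_trip_seconds < 0:
        raise ValueError("round-trip time cannot be negative")
    # Divide by 2: the pulse covers the distance twice (out and back).
    return SPEED_OF_LIGHT_M_PER_S * nvp * round_trip_seconds / 2


# Example: reflection received 500 ns after the pulse, assumed NVP of 0.69.
print(round(distance_to_fault_m(500e-9, 0.69), 1), "meters")
```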
Cable Analyzers
Cable analyzers are multifunctional handheld devices that are used to test and certify copper and fiber cables for different services and standards. The more sophisticated tools include advanced troubleshooting diagnostics that measure distance to performance defect (NEXT, RL), identify corrective actions, and graphically display crosstalk and impedance behavior. Cable analyzers also typically include PC-based software. Once field data is collected, the handheld device can upload its data so that accurate, up-to-date reports can be created.
Portable Network Analyzers
Portable network analyzers are portable devices used for troubleshooting switched networks and VLANs. By plugging the network analyzer in anywhere on the network, a network engineer can see the switch port to which the device is connected and the average and peak utilization. The analyzer can also be used to discover VLAN configuration, identify top network talkers, analyze network traffic, and view interface details. The device can typically output to a PC that has network monitoring software installed for further analysis and troubleshooting.
Research Activity
The following are links to various troubleshooting tools.
Software Tools
Network Management Systems:
http://www.ipswitch.com/products/whatsup/index.asp?t=demo
http://www.solarwinds.com/products/network_tools.aspx
Baselining Tools:
http://www.networkuptime.com/tools/enterprise/
Knowledge Bases:
http://www.cisco.com
Protocol Analyzers:
http://www.flukenetworks.com/fnet/en-us/products/OptiView+Protocol+Expert/
Hardware Tools
Cisco Network Analyzer Module (NAM):
http://www.cisco.com/en/US/docs/net_mgmt/network_analysis_module_software/3.5/user/guide/user.html
Cable Testers:
http://www.flukenetworks.com/fnet/en-us/products/CableIQ+Qualification+Tester/Demo.htm
Cable Analyzers:
http://www.flukenetworks.com/fnet/en-us/products/DTX+CableAnalyzer+Series/Demo.htm
Network Analyzers:
http://www.flukenetworks.com/fnet/en-us/products/OptiView+Series+III+Integrated+Network+Analyzer/Demos.htm
WAN Connection Technologies
A typical private WAN uses a combination of technologies that are usually chosen based on traffic type and volume. ISDN, DSL, Frame Relay, or leased lines are used to connect individual branches into an area. Frame Relay, ATM, or leased lines are used to connect external areas back to the backbone. ATM or leased lines form the WAN backbone. Technologies that require the establishment of a connection before data can be transmitted, such as basic telephone, ISDN, or X.25, are not suitable for WANs that require rapid response time or low latency.
Physical Network Diagram
A physical network diagram shows the physical layout of the devices connected to the network. Knowing how devices are physically connected is necessary for troubleshooting problems at the Physical layer, such as cabling or hardware problems. Information recorded on the diagram typically includes:
Device type
Model and manufacturer
Operating system version
Cable type and identifier
Cable specification
Connector type
Cabling endpoints
Logical Network Diagram
A logical network diagram shows how data is transferred on the network. Symbols are used to represent network elements such as routers, servers, hubs, hosts, VPN concentrators, and security devices. Information recorded on a logical network diagram may include:
Device identifiers
IP address and subnet
Interface identifiers
Connection type
DLCI for virtual circuits
Site-to-site VPNs
Routing protocols
Static routes
Data-link protocols
WAN technologies used
Causes of Data Link Layer Problems
Issues at the Data Link layer that commonly result in network connectivity or performance problems include:
Encapsulation errors
An encapsulation error occurs because the bits placed in a particular field by the sender are not what the receiver expects to see. This condition occurs when the encapsulation at one end of a WAN link is configured differently from the encapsulation used at the other end.
Address mapping errors
In topologies such as point-to-multipoint, Frame Relay, or broadcast Ethernet, it is essential that an appropriate Layer 2 destination address be given to the frame. This ensures its arrival at the correct destination. To achieve this, the network device must match a destination Layer 3 address with the correct Layer 2 address using either static or dynamic maps.
When using static maps in Frame Relay, an incorrect map is a common mistake. Simple configuration errors can result in a mismatch of Layer 2 and Layer 3 addressing information.
In a dynamic environment, the mapping of Layer 2 and Layer 3 information can fail for the following reasons:
Devices may have been specifically configured not to respond to ARP or Inverse-ARP requests.
The Layer 2 or Layer 3 information that is cached may have physically changed.
Invalid ARP replies are received because of a misconfiguration or a security attack.
Framing errors
Frames usually work in groups of 8-bit bytes. A framing error occurs when a frame does not end on an 8-bit byte boundary. When this happens, the receiver may have problems determining where one frame ends and another frame starts. Depending on the severity of the framing problem, the interface may be able to interpret some of the frames. Too many invalid frames may prevent valid keepalives from being exchanged.
Framing errors can be caused by a noisy serial line, an improperly designed cable (too long or not properly shielded), or an incorrectly configured channel service unit (CSU) line clock.
STP failures or loops
The purpose of Spanning Tree Protocol (STP) is to resolve a redundant physical topology into a tree-like topology by blocking redundant ports. Most STP problems revolve around these issues:
Forwarding loops that occur when no port in a redundant topology is blocked and traffic is forwarded in circles indefinitely. When the forwarding loop starts, it usually congests the lowest bandwidth links along its path. If all the links are of the same bandwidth, all links are congested. This congestion causes packet loss and leads to a downed network in the affected L2 domain.
Excessive flooding because of a high rate of STP topology changes. The role of the topology change mechanism is to correct Layer 2 forwarding tables after the forwarding topology has changed. This is necessary to avoid a connectivity outage because, after a topology change, some MAC addresses previously accessible through particular ports might become accessible through different ports. A topology change should be a rare event in a well-configured network. When a link on a switch port goes up or down, there is eventually a topology change when the STP state of the port is changing to or from forwarding. However, when a port is flapping (oscillating between up and down states), this causes repetitive topology changes and flooding.
Slow STP convergence or reconvergence, which can be caused by a mismatch between the real and documented topology; a configuration error, such as an inconsistent configuration of STP timers; an overloaded switch CPU during convergence; or a software defect.
Step 1. Identify that an STP loop is occurring.
When a forwarding loop has developed in the network, these are the usual symptoms:
Loss of connectivity to, from, and through the affected network regions
High CPU utilization on routers connected to affected segments or VLANs
High link utilization (often 100 percent)
High switch backplane utilization (compared to the baseline utilization)
Syslog messages that indicate packet looping in the network (for example, Hot Standby Router Protocol duplicate IP address messages)
Syslog messages that indicate constant address relearning or MAC address flapping messages
Increasing number of output drops on many interfaces
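One of the symptoms above, constant address relearning, can be illustrated by counting how often a MAC address moves between ports. This sketch works on an abstract, time-ordered list of (MAC address, port) observations. It does not parse any real syslog format, and the function names, sample values, and threshold of three moves are assumptions.

```python
from collections import defaultdict


def count_mac_moves(events):
    """events: time-ordered (mac, port) pairs. Returns {mac: number of port changes}."""
    last_port = {}
    moves = defaultdict(int)
    for mac, port in events:
        if mac in last_port and last_port[mac] != port:
            moves[mac] += 1
        last_port[mac] = port
    return dict(moves)


def flapping_macs(events, threshold=3):
    """Return MAC addresses that changed port at least `threshold` times."""
    return sorted(mac for mac, n in count_mac_moves(events).items() if n >= threshold)


# Placeholder observations: the first address bounces between two ports.
events = [
    ("0000.0000.00aa", "Fa0/1"), ("0000.0000.00aa", "Fa0/2"),
    ("0000.0000.00bb", "Fa0/3"), ("0000.0000.00aa", "Fa0/1"),
    ("0000.0000.00aa", "Fa0/2"), ("0000.0000.00bb", "Fa0/3"),
]
print(flapping_macs(events))
```

A host that legitimately moves once will not reach the threshold. Repeated moves between the same ports are more suggestive of a loop.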
http://cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a0080136673.shtml#troubleshoot
A useful tool for viewing ACL operation is the log keyword on ACL entries. This keyword instructs the router to place an entry in the system log whenever that entry condition is matched. The logged event includes details of the packet that matched the ACL element.
The log keyword is especially useful for troubleshooting and also provides information on intrusion attempts being blocked by the ACL.
If the router is running both ACLs and NAT, the order in which each of these technologies is applied to a traffic flow is important:
Inbound traffic is processed by the inbound ACL before being processed by outside-to-inside NAT.
Outbound traffic is processed by the outbound ACL after being processed by inside-to-outside NAT.
Complex wildcard masks provide significant improvements in efficiency, but are more subject to configuration errors. For example, the address 10.0.32.0 with wildcard mask 0.0.32.15 matches addresses .0 through .15 (the network address plus the first 15 host addresses) in either the 10.0.0.0 network or the 10.0.32.0 network.
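Because complex wildcard masks are easy to get wrong, it can help to test a mask offline before applying it. In a wildcard mask, a 0 bit means "must match" and a 1 bit means "ignore". The sketch below applies that rule to the 10.0.32.0 and 0.0.32.15 example. The function name is hypothetical.

```python
import ipaddress


def matches_wildcard(address, base, wildcard):
    """Return True if address matches base under an ACL-style wildcard mask."""
    addr_int = int(ipaddress.IPv4Address(address))
    base_int = int(ipaddress.IPv4Address(base))
    # Invert the wildcard to get the bits that must match exactly.
    care_bits = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr_int & care_bits) == (base_int & care_bits)


for candidate in ("10.0.0.15", "10.0.32.7", "10.0.0.16", "10.0.16.1"):
    print(candidate, matches_wildcard(candidate, "10.0.32.0", "0.0.32.15"))
```

The first two candidates match and the last two do not. Bit 32 of the third octet is ignored, so that octet may be 0 or 32, and the low four bits of the fourth octet are ignored, so it may be 0 through 15.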
Symptoms of Application Layer Problems
Application layer problems prevent services from being provided to application programs. A problem at the Application layer can result in unreachable or unusable resources when the physical, data link, network, and Transport layers are functional. It is possible to have full network connectivity, but the application simply cannot provide data.
Another type of problem at the Application layer occurs when the physical, data link, network, and Transport layers are functional, but the data transfer and requests for network services from a single network service or application do not meet the normal expectations of a user.
A problem at the Application layer may cause users to complain that the network or the particular application that they are working with is sluggish or slower than usual when transferring data or requesting network services.
The figure shows some of the possible symptoms of Application layer problems.
Chapter Commands:
#show access-list
#clear access-list counters
#show ip nat translations
#clear ip nat translation *
#debug ip nat
#show frame-relay pvc
#show interfaces serial
#debug ppp authentication
Chapter Labs: 8.6.1