VoIP: Voice over Internet Protocol

An Introduction to VoIP Protocols

Voice over IP (VoIP) offers the vision of a converged network carrying multiple types of traffic (voice, video, and data, to name a few). To carry out this vision, VoIP employs a number of different protocols that aren't used by the enterprise applications with which you are already familiar. And VoIP also has a unique set of performance requirements that make it a challenge for any data network. Understanding the operation of core VoIP protocols is therefore a first step in understanding the performance requirements that VoIP will place on your network.


A VoIP phone call occurs in two stages:

* Call setup. This stage is required to set up everything needed to make the telephone connection between the person making the call (the caller) and the person receiving the call (the called party).

* The call itself. The audio component of the conversation must be encoded and transmitted across the network.

Let's begin by looking at some of the protocols that are used in the call setup portion of a VoIP call.

Call Setup

The call setup stage of the call requires protocols that enable dial tone, number lookup, ringing, and busy signals before the call even occurs. In addition, the call setup protocols handle things that happen after the call -- any resource cleanup and statistical reporting.

Call setup protocols use the Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) to transfer data during the setup and takedown phases of a telephone call. Each protocol uses a well-known port or ports to communicate with a call server, which functions like a PBX to enable IP phone calls. The required setup messages are sent back and forth between the caller, called party, and call server. For calls that travel between the VoIP network and the Public Switched Telephone Network (PSTN), the call server converses with a voice gateway using the same call setup protocol.

The setup messages, which vary in size and number, handle functions like the mapping of phone numbers to IP addresses, generating dial tones and busy signals, ringing the called party, and hanging up. Many different call setup protocols are in current use for VoIP deployments; some are standardized and some proprietary. The major call setup protocols are described below.


The call setup protocol H.323 is standardized by the International Telecommunication Union (ITU). H.323 is widely deployed and, among the call setup protocols, has been around the longest. In a VoIP environment, H.323 is a common protocol running on voice gateways to connect the VoIP network to the PSTN.

H.323 is actually a family of telephony-based standards for multimedia, including voice and videoconferencing. This set of interrelated protocols has been refined over many years. As a result, it is robust and flexible, but the price of those capabilities is high overhead: a calling session includes many handshakes and data exchanges for each function performed. Because H.323 call signaling runs over TCP, setting up a call with H.323 can require many back-and-forth TCP flows. Be aware of this behavior if you need to investigate network performance issues affecting call setup.

H.323 can require additional configuration on the voice gateway, which maintains information about how calls are routed.


The Media Gateway Control Protocol (MGCP) is another commonly used call setup protocol. It is covered in the informational RFC 2705. MGCP differs from some other call setup protocols in that the endpoints, or phones, do not use MGCP to control the phone call itself. More commonly, MGCP is used so that a call server can control a voice gateway connection to the PSTN.

MGCP sends messages between the gateway and call server over UDP port 2427. Because the call server controls the gateway, the bulk of the call control intelligence resides there. Likewise, call routing information is configured in the call server instead of in the gateway.


SIP (Session Initiation Protocol) is a lightweight protocol developed by the IETF in RFC 3261 (with Proposed Standard status). SIP represents typical data-networking logic, which asks: Why use a heavyweight protocol (such as H.323) when a lightweight protocol (such as SIP) gets the job done most of the time? SIP represents the future for call setup as more vendors, including Cisco and Avaya, offer SIP phone/endpoint support. In addition, Microsoft recently announced the availability of their Office Communications Server, which uses SIP for call setup.

Although SIP can use either TCP or UDP for transport, most implementations use UDP, with 5060 as the default port. SIP messages are similar to HTTP in that they are text-based and generally follow a request-response structure.
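To make the text-based, request-response structure concrete, here is a minimal sketch that splits a SIP request into its start line and headers. The sample message, its addresses, and the function name parse_sip_request are hypothetical and exist only for illustration; a real SIP stack handles far more (repeated headers, compact header names, message bodies).

```python
# Hypothetical SIP request; every name and address here is made up for illustration.
SAMPLE_INVITE = (
    "INVITE sip:bob@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP pc1.example.com:5060;branch=z9hG4bK74bf9\r\n"
    "From: <sip:alice@example.com>;tag=1928301774\r\n"
    "To: <sip:bob@example.com>\r\n"
    "Call-ID: a84b4c76e66710@pc1.example.com\r\n"
    "CSeq: 1 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)


def parse_sip_request(text):
    """Split a SIP request into (method, request URI, version, headers, body)."""
    # As in HTTP, a blank line separates the headers from the body.
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The first line of a request is "METHOD request-URI SIP-version".
    method, request_uri, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Simplification: if a header repeats, only the last value is kept.
        headers[name.strip().lower()] = value.strip()
    return method, request_uri, version, headers, body


method, request_uri, version, headers, body = parse_sip_request(SAMPLE_INVITE)
print(method, request_uri, version)
print(headers["cseq"])
```

The called party answers with a response whose first line carries a status code and reason phrase, much as an HTTP response does.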


In addition to the standardized call setup protocols discussed above, certain vendors have provided their own proprietary protocols. One popular example is the Cisco Skinny Client Control Protocol (SCCP). SCCP or "Skinny" provides a simple, lightweight call setup protocol for Cisco devices. Skinny passes messages using TCP and port 2000.

There is no single, dominant call setup protocol in use today. The protocols discussed here (H.323, MGCP, SIP, and SCCP) are all commonly used in VoIP equipment. However, the trend is moving toward SIP as the call setup protocol of choice.
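When you investigate call setup traffic in a packet capture, the well-known ports mentioned above give a quick first guess at which protocol is in use. The sketch below is only a lookup table built from the ports named in this article; the names CALL_SETUP_PORTS and classify_call_setup are hypothetical, and H.323 is left out because no port for it is given here. A port match is a hint, not proof, since any application can use any port.

```python
# Well-known call setup ports as named in this article (transport, port) -> protocol.
CALL_SETUP_PORTS = {
    ("udp", 2427): "MGCP",
    ("tcp", 5060): "SIP",
    ("udp", 5060): "SIP",
    ("tcp", 2000): "SCCP (Skinny)",
}


def classify_call_setup(transport, port):
    """Return a likely call setup protocol for a flow, or 'unknown'."""
    return CALL_SETUP_PORTS.get((transport.lower(), port), "unknown")


print(classify_call_setup("UDP", 2427))
print(classify_call_setup("tcp", 2000))
print(classify_call_setup("tcp", 8080))
```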

VoIP Conversations

The conversation portion of the call must be converted from analog to digital, translated into packets, sent across the network in packet format, reassembled, and converted from digital back to analog. A number of different components, standards, and protocols are required to enable the VoIP traffic to travel across the data network.


Codecs encode and decode both ends of the conversation to allow the signal to be sent and received across the network. Different codecs have different bandwidth requirements and different characteristics that can affect network performance.

Some commonly used codecs are also ITU standards; these are named G.711 and G.729. The codec's job is to take speech audio and transform it into a payload for transmission across the data network. Some codecs, like G.711, employ no compression schemes. The lack of compression means no additional data loss, but the tradeoff is that the codec requires more bandwidth from the network. Other codecs, like G.729, compress the data and therefore require less bandwidth. However, such compression is usually "lossy," which means that some degradation in voice quality results from that process.

Once the codec has its payload ready, it's up to another protocol, the Real-time Transport Protocol (RTP), to transfer the data to the intended recipient.


Unlike call setup protocols, where no one protocol dominates, the single protocol that is used almost exclusively for transfer of VoIP conversations is RTP. (We won't discuss Skype in this article, but you should be aware that it uses proprietary protocols.) RTP was originally defined in RFC 1889, which was obsoleted in July 2003 by RFC 3550 (with Internet Standard status). Widely used for streaming audio and video, RTP is designed for applications that need real-time performance to send data in one direction with no acknowledgments.

Since a VoIP call is bidirectional, two RTP streams carry the conversation, one in each direction. The path that these RTP streams take through the network and the impairments encountered along the way are important factors in determining the quality of voice conversations carried over data networks.

RTP is an application protocol that uses UDP for transport. All the fields related to RTP are enclosed within the UDP payload. Like UDP, RTP is a connectionless protocol. The software that creates RTP datagrams is not commonly part of the TCP/IP protocol stack, so applications are written to add and recognize an additional 12-byte header in each UDP datagram. The sender fills in each header, which contains four important fields:

* RTP Payload Type -- Specifies the codec that is used. The receiver needs the payload type so that it can apply the same codec to decode the data in the payload.
* Sequence Number -- Helps the receiving side reassemble the data and detect lost, out-of-order, and duplicate datagrams.
* Time Stamp -- Used to reconstruct the timing of the original audio or video. It also helps the receiving side determine variations in datagram arrival times, known as jitter. The time stamp brings real value to RTP. An RTP sender puts a time stamp in each datagram. The receiving side of an RTP application notes when each datagram actually arrives and compares the spacing between arrivals with the spacing between time stamps. If datagrams arrive with the same spacing at which they were sent, there is no variation. However, depending on network conditions, there could be lots of variation in datagram arrival times, that is, jitter. The receiving side can easily calculate the level of jitter using the time stamp.
* Source ID -- Each sender generates a unique source ID and places it in the RTP header. This ID allows the software at the receiving side to distinguish among multiple, simultaneous incoming streams.
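The four fields above can be sketched in code. The example below unpacks the 12-byte fixed RTP header laid out in RFC 3550 and applies the running jitter estimate that the same RFC describes. The names RtpHeader, parse_rtp_header, and update_jitter are hypothetical, and the sketch assumes arrival times have already been converted to the same clock units as the RTP time stamp (8000 units per second for common voice codecs).

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    payload_type: int
    sequence_number: int
    timestamp: int
    source_id: int


def parse_rtp_header(datagram):
    """Unpack the 12-byte fixed RTP header from the start of a UDP payload."""
    if len(datagram) < 12:
        raise ValueError("the fixed RTP header is 12 bytes long")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit time stamp, 32-bit source ID.
    first, second, seq, timestamp, source_id = struct.unpack("!BBHII", datagram[:12])
    return RtpHeader(
        version=first >> 6,           # top two bits of the first byte
        payload_type=second & 0x7F,   # low seven bits of the second byte
        sequence_number=seq,
        timestamp=timestamp,
        source_id=source_id,
    )


def update_jitter(jitter, prev_arrival, prev_timestamp, arrival, timestamp):
    """Running jitter estimate from RFC 3550; all values in time stamp units."""
    # Difference between arrival spacing and sending spacing of two datagrams.
    d = (arrival - prev_arrival) - (timestamp - prev_timestamp)
    return jitter + (abs(d) - jitter) / 16


# A made-up header: version 2, payload type 18 (the static type assigned to G.729).
sample = struct.pack("!BBHII", 0x80, 18, 1, 160, 0xDEADBEEF) + b"\x00" * 20
header = parse_rtp_header(sample)
print(header)

# Sent 160 units apart (20 ms at 8000 Hz) but received 176 units apart.
jitter = update_jitter(0.0, 0, 0, 176, 160)
print(jitter)
```

If the two datagrams had arrived exactly 160 units apart, the difference would be zero and the jitter estimate would stay where it was.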

Bandwidth Considerations and Tradeoffs

While the RTP header is important to support the real-time nature of the protocol, the accumulation of headers can add a lot of overhead, especially considering the relative sizes of VoIP codec payloads. For example, a typical payload size when using the G.729 codec is 20 bytes, which means that the codec produces 20-byte chunks of the VoIP call at a predetermined rate, usually every 20 milliseconds. With a payload that small, two-thirds of each datagram is header, because the total header overhead consists of:

RTP (12 bytes) + UDP (8 bytes) + IP (20 bytes) = 40 bytes

The real bandwidth consumption of a VoIP call is higher than the codec's nominal rate suggests. The G.729 codec, for example, has a data payload rate of 8 kbps. When sending at 20-ms intervals, its payload size is 20 bytes per datagram. To this, add the 40 bytes of RTP, UDP, and IP headers and any additional Layer 2 headers. For example, Ethernet drivers generally add 18 more bytes. The Bandwidth Required column in Table 1-1 shows a more accurate picture of actual bandwidth usage for some common codecs on an Ethernet network.


Codec                     Nominal Data Rate   Typical Speech Packet Size   Bandwidth Required

G.711 (variant missing)   64.0 kbps           20 ms                        87.2 kbps
G.711 (variant missing)   64.0 kbps           20 ms                        87.2 kbps
(name missing)            32.0 kbps           20 ms                        55.2 kbps
G.729                     8.0 kbps            20 ms                        31.2 kbps
G.723.1 MPMLQ             6.3 kbps            30 ms                        21.9 kbps
G.723.1 ACELP             5.3 kbps            30 ms                        20.8 kbps

Table 1-1 -- Codec attributes
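The Bandwidth Required figures in Table 1-1 can be reproduced with a few lines of arithmetic. The sketch below assumes 40 bytes of RTP, UDP, and IP headers plus 18 bytes of Ethernet overhead per datagram, as described above. Rounding the payload up to a whole number of bytes is an assumption made here so that the 30-ms rows come out as in the table, and the function name bandwidth_kbps is hypothetical.

```python
import math

IP_UDP_RTP_HEADER_BYTES = 40   # RTP (12) + UDP (8) + IP (20)
ETHERNET_OVERHEAD_BYTES = 18   # Layer 2 overhead used in this article


def bandwidth_kbps(codec_bps, packet_ms, layer2_bytes=ETHERNET_OVERHEAD_BYTES):
    """Bandwidth of one direction of a call, in kbps, including headers."""
    # Assumption: the payload is rounded up to a whole number of bytes.
    payload = math.ceil(codec_bps * packet_ms / 8000)
    datagram_bytes = payload + IP_UDP_RTP_HEADER_BYTES + layer2_bytes
    datagrams_per_second = 1000 / packet_ms
    return datagram_bytes * 8 * datagrams_per_second / 1000


# (nominal data rate in bits/sec, speech packet size in ms) for each table row
for codec_bps, packet_ms in [(64000, 20), (32000, 20), (8000, 20), (6300, 30), (5300, 30)]:
    print(f"{codec_bps / 1000:.1f} kbps at {packet_ms} ms: "
          f"{bandwidth_kbps(codec_bps, packet_ms):.1f} kbps on the wire")
```

Because a call carries one RTP stream in each direction, these figures apply to each direction separately.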

Some IP phones let you set the "delay between packets" or "speech packet size," which is the rate at which the sender delivers datagrams onto the network. For example, at 64 kbps, a 20-ms speech datagram implies that the sending side creates a 160-byte datagram payload every 20 ms. A simple equation relates the codec speed, the speech packet size, and the datagram payload size:

Payload size (in bytes) = (Codec speed (in bits/sec) * speech packet size (ms)) / (8 (bits/byte) * 1000 (ms/sec))

In this example:

160 bytes = (64000 * 20)/8000

For a given data rate, increasing the speech packet size in milliseconds also increases the datagram size in bytes because datagrams are sent less frequently to transport the same quantity of data. A speech packet size of 30 ms at a data rate of 64 kbps would require sending 240-byte datagrams. You may ask, Why don't all codecs just increase speech packet size to produce larger datagrams and reduce the impact of header overhead? The answer is that the larger speech packet size adds delay, which can have a negative impact on call quality. In addition, a larger speech packet size places more voice data in a single datagram, which, if lost, can have a negative impact on call quality as well.
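The tradeoff can be made visible with a short calculation. This sketch applies the payload size equation to an 8-kbps codec at several speech packet sizes and shows how much of each datagram is RTP, UDP, and IP header. The function names payload_bytes and header_share are hypothetical, and Layer 2 overhead is left out to keep the comparison simple.

```python
HEADER_BYTES = 40  # RTP (12) + UDP (8) + IP (20)


def payload_bytes(codec_bps, packet_ms):
    """Payload size in bytes for a codec speed and speech packet size."""
    return codec_bps * packet_ms / (8 * 1000)


def header_share(codec_bps, packet_ms):
    """Fraction of each datagram (above Layer 2) taken up by headers."""
    payload = payload_bytes(codec_bps, packet_ms)
    return HEADER_BYTES / (HEADER_BYTES + payload)


# 8 kbps is the nominal rate of G.729; larger packets mean less header
# overhead but more delay before each datagram can be sent.
for packet_ms in (10, 20, 30, 40):
    size = payload_bytes(8000, packet_ms)
    share = header_share(8000, packet_ms)
    print(f"{packet_ms} ms: payload {size:.0f} bytes, headers {share:.0%} of the datagram")
```

Each step up in speech packet size lowers the header share, but it also adds that many milliseconds of voice that must be collected before a datagram leaves the sender, and that much voice is lost whenever a datagram is dropped.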

Now that you understand more about the VoIP protocols, you are ready to move on to a more detailed discussion of how VoIP affects network performance. We'll cover this topic in the next installment of our VoIP article series.