What is a network packet in my VoIP system?

Calls sound smooth only when thousands of tiny network packets behave well. When they do not, you hear gaps, echoes, or silence, even while “the internet” looks fine.

Table of Contents

1 How is a network packet structured—headers, payload, checksum?

1.1 The layers inside a VoIP packet

1.1.1 Link layer: Ethernet and VLAN tags

1.1.2 Network and transport: IP and UDP

1.1.3 RTP header and payload

1.1.4 RTCP and SRTP

2 Which packet fields matter for my SIP and RTP traffic?

2.1 Packet fields that drive SIP behavior

2.2 Packet fields that drive RTP behavior

2.3 DSCP and QoS markings

3 How do packets travel across my VLANs, NAT, and firewalls?

3.1 From phone to gateway: inside the voice VLAN

3.2 Through routers and QoS in the core

3.3 Crossing NAT and firewalls

3.4 End-to-end view

4 How do I capture and analyze packets with Wireshark or tcpdump?

4.1 Where and how to capture

5 Footnotes

In a VoIP system, a network packet is a small, structured chunk of data that carries SIP signaling or RTP media over IP. Each packet has headers for routing and QoS, and a payload with your encoded voice.

Figure: SIP Signaling vs RTP Media Call Paths

Once I think in packets, call quality stops feeling random. I can see how SIP messages and RTP audio slices move through VLANs, NAT, and firewalls, and I can prove where they break by capturing them with Wireshark ¹ or tcpdump ².

How is a network packet structured—headers, payload, checksum?

VoIP issues are hard to fix if “packet” just means “some data on the wire.” The structure inside each packet explains why small changes in overhead, MTU, or checksums can break calls.

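A VoIP packet is built in layers: link header (Ethernet), network header (IP), transport header (UDP/TCP), RTP or SIP headers, and a payload with codec data or signaling. Integrity checks sit at several of these layers: the Ethernet frame check sequence, the IPv4 header checksum, and the UDP or TCP checksum. RTP itself has no checksum.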

Figure: Ethernet, VLAN, IP, and UDP Headers over VoIP Payload

The layers inside a VoIP packet

When a device sends RTP audio over IPv4 on Ethernet, the packet looks like a stack:

Layer	Example header size	What it holds
Ethernet	14 bytes (18 with an 802.1Q VLAN tag)	Source MAC, destination MAC, EtherType, optional VLAN tag
IP (v4)	20 bytes (without options)	Source/dest IP, TTL, DSCP, protocol, header checksum
UDP	8 bytes	Source/dest port, length, checksum
RTP	12 bytes (minimum)	Sequence, timestamp, payload type, SSRC, marker
Payload	Varies (20 ms audio)	Codec frames: G.711, Opus, G.729, etc.

With IPv4, IP+UDP+RTP overhead is 40 bytes per packet (20 + 8 + 12), before Ethernet. With IPv6, the IP header is 40 bytes, so overhead grows to 60 bytes.
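To make the overhead numbers concrete, here is a minimal Python sketch that turns header sizes and packetization time into per-stream bandwidth. The function name and parameters are my own for illustration, and the payload sizes assume G.711 at 64 kbit/s (160 bytes per 20 ms) and G.729 at 8 kbit/s (20 bytes per 20 ms).

```python
def voip_bandwidth_bps(payload_bytes, ptime_ms, ip_header=20, l2_overhead=0):
    """One-way bandwidth in bits per second for a single RTP stream.

    ip_header is 20 for IPv4 without options and 40 for IPv6.
    l2_overhead is optional link-layer overhead per packet (0 = IP layer only).
    """
    udp_header, rtp_header = 8, 12
    packet_bytes = payload_bytes + rtp_header + udp_header + ip_header + l2_overhead
    packets_per_second = 1000 / ptime_ms
    return packet_bytes * 8 * packets_per_second


g711_ipv4 = voip_bandwidth_bps(160, 20)                # 80000.0 bit/s
g711_ipv6 = voip_bandwidth_bps(160, 20, ip_header=40)  # 88000.0 bit/s
g729_ipv4 = voip_bandwidth_bps(20, 20)                 # 24000.0 bit/s
```

The G.729 case shows why overhead matters: the headers are twice the size of the audio they carry.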

Link layer: Ethernet and VLAN tags

At the bottom, Ethernet wraps the packet for local delivery:

Source MAC and destination MAC.
EtherType (for example, IPv4 or IPv6).
Optional IEEE 802.1Q VLAN tag ³ with a priority code point (PCP).

In a voice VLAN, that tag often carries a higher priority. Switches can then place VoIP frames in a higher-priority queue during congestion.
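The tag fields described above can be read straight out of a raw frame. This is a small sketch, assuming a single 802.1Q tag (TPID 0x8100) right after the two MAC addresses; the function name is hypothetical, and PCP 5 and VLAN 20 in the usage lines are only illustrative values.

```python
import struct


def parse_vlan_tag(frame):
    """Return the 802.1Q tag fields of an Ethernet frame, or None if untagged."""
    if len(frame) < 18:
        return None
    # Bytes 12-13: TPID, bytes 14-15: tag control info, bytes 16-17: inner EtherType
    tpid, tci, inner_ethertype = struct.unpack("!HHH", frame[12:18])
    if tpid != 0x8100:
        return None
    return {
        "pcp": tci >> 13,           # 3-bit priority code point
        "dei": (tci >> 12) & 0x1,   # drop eligible indicator
        "vlan_id": tci & 0x0FFF,    # 12-bit VLAN ID
        "ethertype": inner_ethertype,
    }


# Illustrative frame: zeroed MACs, VLAN 20, PCP 5, carrying IPv4
tagged = bytes(12) + struct.pack("!HHH", 0x8100, (5 << 13) | 20, 0x0800)
tag = parse_vlan_tag(tagged)
```

Keep in mind that a capture taken on an access port or on the endpoint itself often shows no tag at all, because the switch or the network driver has already removed it.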

Network and transport: IP and UDP

Above Ethernet, the IP header holds:

Source and destination IP addresses.
DSCP bits for QoS (voice often uses EF, value 46).
TTL and fragmentation fields.
Header checksum (IPv4).

Next, UDP carries:

Source and destination ports (for example, 5060 for SIP, high range for RTP).
Length and checksum.

VoIP media prefers UDP because it does not wait for retransmissions. Lost packets simply vanish. This keeps delay low, which users feel more than small artifacts.
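The IPv4 header checksum listed above is a 16-bit one's-complement sum over the header. Here is a minimal sketch of that calculation; the function name is my own, and the sample header is a commonly used textbook example, not one taken from a real call.

```python
def ipv4_checksum(header):
    """16-bit one's-complement checksum over an IPv4 header.

    Pass the header with the checksum field zeroed to compute it,
    or the header as received to verify it (a valid header gives 0).
    """
    if len(header) % 2:
        header += b"\x00"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    # Fold any carry bits back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


zeroed = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
received = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
computed = ipv4_checksum(zeroed)     # 0xB861
verified = ipv4_checksum(received)   # 0
```

Every router that decrements TTL has to recompute this value, which is one reason IPv6 dropped the header checksum.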

RTP header and payload

RTP wraps voice frames and gives them timing:

Sequence number: used to detect loss and reordering.
Timestamp: maps packets back to playback time.
Payload type (PT): codec type or profile.
SSRC: stream identifier.
Marker bit: signals events like talk bursts or frame boundaries.

The payload is one or more codec frames. A typical RTP packet holds 20 ms of audio, but 10/30/40/60 ms are also common. Longer packetization reduces overhead but increases delay and risk per packet.
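The fixed 12-byte RTP header can be decoded with a few bit operations. This is a sketch under simple assumptions: it reads only the fixed header and ignores CSRC entries and header extensions. The function name and the sample values are hypothetical.

```python
import struct


def parse_rtp_header(data):
    """Decode the fixed 12-byte RTP header from the start of a UDP payload."""
    if len(data) < 12:
        raise ValueError("RTP header needs at least 12 bytes")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "csrc_count": first & 0x0F,
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Illustrative packet: version 2, marker set, payload type 0 (PCMU), 160 bytes of audio
sample = bytes([0x80, 0x80]) + struct.pack("!HII", 4242, 160, 0xDEADBEEF) + bytes(160)
header = parse_rtp_header(sample)
```

A version field other than 2 is a quick hint that the UDP payload is not RTP at all.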

RTCP and SRTP

RTCP packets travel alongside RTP and carry:

Reports on jitter, packet loss, and delay.
Sender and receiver reports for sync.
They do not carry audio.

When SRTP is used:

The RTP payload is encrypted. The RTP header stays in the clear but is authenticated.
An authentication tag is added.
Keys come from SDES (carried in the SDP inside SIP) or from DTLS-SRTP.

The structure stays the same in shape. The content becomes unreadable without keys, but analysis still uses sequence, timing, and header info.

Finally, each IP packet must fit within the link's Maximum Transmission Unit (MTU) ⁴ (often 1500 bytes on Ethernet). If a packet exceeds this, it is fragmented or dropped. Fragmentation adds delay and loss risk, so VoIP designs keep packets small enough to avoid it.

Which packet fields matter for my SIP and RTP traffic?

Not all header bits matter day to day. A few key fields decide whether calls register, audio flows, and QoS works as planned.

For SIP, the important fields are IPs, ports, DSCP, and transport (UDP/TCP/TLS). For RTP, the critical fields are ports, DSCP, sequence numbers, timestamps, SSRC, and payload type.

Figure: Anatomy of a SIP INVITE with IPs, Ports, and Transport

Packet fields that drive SIP behavior

The Session Initiation Protocol (SIP) ⁵ is signaling. It usually runs over UDP or TCP on port 5060, and over TLS on port 5061. Important fields at packet level:

Layer	Field	Why it matters for SIP
IP	Source/dest IP	NAT mapping, routing, firewall rules
IP	DSCP	Priority of SIP signaling vs data
UDP	Source/dest port	SIP port bindings, ALG behavior
TCP	Flags, sequence, ACK	For SIP over TCP/TLS, session stability

When I debug registrations, I check:

Source IP and port of SIP packets leaving the client.
How NAT changes those values on the public side.
Whether DSCP marks survive across hops.

If SIP leaves on one source IP and returns to another, or if a firewall closes the TCP session early, registrations flap or calls fail.
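One quick check I find useful here is comparing the address a client advertises in its Contact header with the source address actually seen on the wire. The sketch below is a rough illustration only: the function name is hypothetical, the regular expression handles just the long-form Contact header with an IPv4 address or hostname, and it assumes port 5060 when none is given.

```python
import re

CONTACT_RE = re.compile(
    r"^Contact:\s*<?sips?:(?:[^@>\s]+@)?([^:;>\s]+)(?::(\d+))?",
    re.IGNORECASE | re.MULTILINE,
)


def contact_matches_source(sip_text, src_ip, src_port):
    """True if the Contact host:port equals the observed packet source.

    Returns None when no Contact header is found.
    """
    match = CONTACT_RE.search(sip_text)
    if match is None:
        return None
    host = match.group(1)
    port = int(match.group(2) or 5060)  # assumed default when the URI has no port
    return (host, port) == (src_ip, src_port)


register = (
    "REGISTER sip:pbx.example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP 10.20.20.50:5060;branch=z9hG4bK776\r\n"
    "Contact: <sip:1001@10.20.20.50:5060>\r\n"
    "\r\n"
)
inside = contact_matches_source(register, "10.20.20.50", 5060)     # True
outside = contact_matches_source(register, "203.0.113.10", 40001)  # False
```

A mismatch on the public side is normal behind NAT. It simply tells me the far end must handle it, for example by replying to the observed source instead of the advertised Contact.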

Packet fields that drive RTP behavior

The Real-time Transport Protocol (RTP) ⁶ is media. Many call quality problems are visible without opening the payload, just by watching packet headers.

Key RTP-related fields:

Field	Role in RTP and audio
UDP ports	Identify media streams, NAT pinholes
DSCP	Controls QoS class for media
Sequence number	Detects loss and reordering
Timestamp	Controls playout timing and jitter buffer behavior
SSRC	Distinguishes different streams
Payload type (PT)	Codec / format type

From these, I can see:

Packet loss (missing sequence numbers).
Reordering (sequence jumps and returns).
Jitter (arrival gaps that vary compared with the spacing of the RTP timestamps).
Codec mix-ups (wrong PT at one end).

For DJSlink SIP intercoms and phones, matching payload types and SSRC between endpoints and PBX or SBC is part of basic interoperability testing.

DSCP and QoS markings

QoS often depends on one simple IP header field:

DSCP EF (46) for RTP, defined by the Expedited Forwarding (EF) PHB ⁷.
A lower-priority mark for SIP signaling, or sometimes the same.

Routers and switches then map EF into priority queues. If this marking is wrong or stripped, voice competes with bulk data, and calls sound bad when links are busy.

Packet tools let me see:

DSCP values on live calls.
Whether network devices remark or clear them.
If SIP and RTP follow the intended classes.
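When a capture tool shows the raw Traffic Class or ToS byte instead of the DSCP name, the conversion is a two-bit shift. A minimal sketch, with helper names of my own choosing:

```python
EF = 46  # Expedited Forwarding code point


def dscp_from_tos(tos):
    """Upper six bits of the IPv4 ToS / IPv6 Traffic Class byte."""
    return (tos >> 2) & 0x3F


def ecn_from_tos(tos):
    """Lower two bits carry ECN, not QoS class."""
    return tos & 0x03


def tos_from_dscp(dscp, ecn=0):
    """Build the full byte from a DSCP value and optional ECN bits."""
    return ((dscp & 0x3F) << 2) | (ecn & 0x03)


ef_byte = tos_from_dscp(EF)          # 184, shown as 0xB8 in captures
is_voice = dscp_from_tos(0xB8) == EF
stripped = dscp_from_tos(0x00)       # 0 means the mark was cleared to best effort
```

So if RTP leaves the phone with 0xB8 and arrives at the SBC with 0x00, some device in the path has cleared the marking.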

Once I know which fields matter, packet captures become much easier to read. I do not need to understand every bit; I focus on the ones that match real symptoms.

How do packets travel across my VLANs, NAT, and firewalls?

Packets do not live in a flat world. They move through separate VLANs, hit NAT, pass firewalls, and maybe reach an SBC or cloud PBX somewhere else. That path shapes call behavior.

Packets start in a local VLAN, hit the default gateway, then pass through routers, QoS queues, NAT, and firewalls before reaching SBCs or providers. Each hop may retag VLANs, change IPs, or filter ports.

Figure: Voice VLAN and DSCP EF QoS Path for DJSlink IP Phones

From phone to gateway: inside the voice VLAN

A typical voice layout:

Phones and SIP intercoms sit in Voice VLAN X.
Each device has an IP like 10.20.20.50/24.
Default gateway is 10.20.20.1 on a Layer 3 switch or router.

Packet steps:

An RTP packet leaves the phone, tagged with VLAN X and DSCP EF.
Access switch receives it, keeps the tag, and queues based on DSCP.
Switch sends it to the voice VLAN gateway.

At this point:

MAC addresses change.
IP addresses and ports stay the same.
VLAN tags help separate voice and data policies.

Through routers and QoS in the core

From the gateway, packets move into the core or WAN:

Routers inspect IP, DSCP, and routing tables.
QoS policies apply shaping and priorities.
Some links may re-mark DSCP to internal classes.

If design is good:

EF traffic gets low-latency queues.
SIP signaling is protected enough, but not at the cost of media.
Data flows use best effort or lower priority queues.

If design is bad:

Voice shares queues with backups or bulk transfers.
Latency and jitter spike when links are near saturation.
MOS (mean opinion score) falls and users hear choppy calls.

Crossing NAT and firewalls

At the edge, packets hit NAT and firewalls:

Private IPs (for example 10.20.20.50) become public IPs.
Source ports may change, especially with symmetric NAT.
Firewalls track “flows” from inside to outside and open pinholes for RTP.
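To show why symmetric NAT is harder on RTP, here is a toy model of port mapping. It is a deliberate simplification and not how any specific NAT device is implemented: the class name, the starting port, and the addresses (taken from documentation ranges) are all made up for the example.

```python
import itertools


class ToyNat:
    """Toy NAT: maps inside (ip, port) pairs to ports on one public IP."""

    def __init__(self, public_ip, symmetric, first_port=40000):
        self.public_ip = public_ip
        self.symmetric = symmetric
        self._next_port = itertools.count(first_port)
        self._mappings = {}

    def translate(self, src, dst):
        # Symmetric NAT keys the mapping on source AND destination,
        # so each new destination gets a different public port.
        key = (src, dst) if self.symmetric else src
        if key not in self._mappings:
            self._mappings[key] = next(self._next_port)
        return (self.public_ip, self._mappings[key])


phone = ("10.20.20.50", 16384)
sip_server = ("198.51.100.5", 5060)
media_server = ("198.51.100.9", 20000)

cone = ToyNat("203.0.113.10", symmetric=False)
same_port = cone.translate(phone, sip_server) == cone.translate(phone, media_server)

sym = ToyNat("203.0.113.10", symmetric=True)
different_port = sym.translate(phone, sip_server) != sym.translate(phone, media_server)
```

With the symmetric variant, a public port learned by one server is useless to another, so the far end has to send RTP back to the address it actually receives media from.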

Typical problems:

Layer	Issue	Symptom
NAT	Symmetric or random port mapping	Remote cannot send RTP back correctly
Firewall	SIP ALG rewriting SDP badly	One-way audio or no audio
Routing	Asymmetric paths	Only one direction works through stateful FW

To handle this cleanly, many designs place an SBC at the edge. All SIP and RTP flows terminate there. The SBC then manages:

Public IP and port exposure.
Policy and security.
Interop with providers.

Packets still cross NAT and firewalls, but in a controlled way. The endpoints talk to a stable SIP peer, not directly to the provider.

End-to-end view

A short life story of one RTP packet:

Leaves phone in Voice VLAN, DSCP EF, private IP.
Traverses switches, stays tagged and prioritized.
Hits router, goes into WAN with EF priority.
Reaches edge device, NATs to public IP and port.
Passes firewall state checks and rules.
Arrives at SBC or PBX, which decodes RTP.

If any step drops DSCP, breaks NAT, or sends packets on a strange path, that one small packet may never arrive. Enough missing packets and a once-clear call sounds broken.

How do I capture and analyze packets with Wireshark or tcpdump?

It is hard to argue about where packets got lost if nobody has seen them. Packet captures turn “I think” into “I see”.

To capture packets, I mirror traffic or capture on the endpoint, use tools like Wireshark or tcpdump to collect SIP and RTP, then apply filters and RTP analysis to see loss, jitter, and codec details.

Figure: Using tcpdump on Port 5060 to Analyze SIP Traffic in a VoIP Lab

Where and how to capture

Common capture points:

On the softphone PC (Wireshark on Windows).
On the PBX or SBC (tcpdump on Linux).
On a switch SPAN/port-mirror port for IP phones or DJSlink intercoms.

Examples:

On a Linux-based PBX or SBC:

tcpdump -i eth0 -n -s 0 -w voip.pcap 'port 5060 or portrange 10000-20000'
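A capture written by tcpdump can also be inspected with a short script when Wireshark is not at hand. The sketch below assumes the classic pcap format with microsecond timestamps (not pcapng), untagged Ethernet, and IPv4. The function names are hypothetical, and the file name simply matches the command above.

```python
import struct


def read_pcap(stream):
    """Yield (timestamp_seconds, frame_bytes) from a classic pcap stream."""
    global_header = stream.read(24)
    magic = global_header[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    while True:
        record = stream.read(16)
        if len(record) < 16:
            return
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(endian + "IIII", record)
        yield ts_sec + ts_usec / 1e6, stream.read(incl_len)


def is_sip_frame(frame, sip_port=5060):
    """True for an untagged Ethernet/IPv4/UDP frame to or from sip_port."""
    if len(frame) < 42 or frame[12:14] != b"\x08\x00":
        return False
    ihl = (frame[14] & 0x0F) * 4          # IPv4 header length in bytes
    if frame[23] != 17 or len(frame) < 14 + ihl + 4:
        return False                       # protocol 17 is UDP
    src_port, dst_port = struct.unpack("!HH", frame[14 + ihl:14 + ihl + 4])
    return sip_port in (src_port, dst_port)


def count_sip_frames(path):
    with open(path, "rb") as capture:
        return sum(1 for _, frame in read_pcap(capture) if is_sip_frame(frame))
```

Calling count_sip_frames("voip.pcap") then gives a quick sanity check that signaling was captured at all before I open the file in a full analyzer.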

Footnotes

1. Practical Wireshark guide for VoIP call playback and RTP stream troubleshooting.
2. Reference for tcpdump options to capture SIP/RTP traffic cleanly on Linux systems.
3. Overview of VLAN tagging and how 802.1Q separates voice and data on switches.
4. MTU basics and why oversized frames trigger fragmentation and voice-quality issues.
5. Official SIP specification for signaling behavior, message formats, and interoperability basics.
6. RTP standard defining sequencing, timestamps, and media transport fundamentals.
7. Explains EF PHB and why DSCP 46 is used for low-latency voice queuing.

About The Author

DJSLink R&D Team
