What is SIP calling and how does it work?

SIP calling looks simple until calls fail at the worst moment. One-way audio, random registration drops, and DTMF not working can turn a clean rollout into daily firefighting.

SIP calling uses SIP messages to set up, manage, and end calls, while the actual voice travels separately over RTP. SIP negotiates codecs and media ports via SDP, then endpoints stream audio directly or through an SBC.

Figure: SIP and RTP call flow overview. SIP signaling messages control RTP audio/video streams between a cloud PBX server and softphones.

SIP calling: signaling sets the call, RTP carries the voice

SIP (Session Initiation Protocol ¹) is the signaling layer. It is the part that says “who is calling whom,” “should the phone ring,” “which codecs are allowed,” and “when is the call ended.” The voice itself is usually not inside SIP. It is carried by RTP (Real-time Transport Protocol ²), which runs in parallel once SIP finishes negotiating the session.

The basic call flow in plain language

A normal SIP call follows a predictable sequence:

1) The caller sends an INVITE to the callee (often through a PBX, proxy, or SBC).
2) The callee (or server) answers with progress messages like 100 Trying and 180 Ringing.
3) The final acceptance is 200 OK, usually containing SDP that describes the chosen codec and media IP/port.
4) The caller confirms with ACK.
5) Media starts flowing via RTP in both directions.
6) The call ends with BYE and a final 200 OK response.
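
The sequence above can be sketched as a small state machine seen from the caller's side. This is an illustrative model only: `CallState`, `TRANSITIONS`, `advance`, and `run_call` are hypothetical names, and a real SIP stack also handles retransmissions, CANCEL, early media, and error responses.

```python
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    INVITE_SENT = auto()
    PROCEEDING = auto()    # 100 Trying / 180 Ringing received
    ANSWERED = auto()      # 200 OK received, ACK still owed
    CONFIRMED = auto()     # ACK sent, RTP flows in both directions
    TERMINATING = auto()   # BYE sent
    ENDED = auto()


# (current state, message) -> next state, from the caller's point of view.
TRANSITIONS = {
    (CallState.IDLE, "INVITE"): CallState.INVITE_SENT,
    (CallState.INVITE_SENT, "100 Trying"): CallState.PROCEEDING,
    (CallState.INVITE_SENT, "180 Ringing"): CallState.PROCEEDING,
    (CallState.PROCEEDING, "180 Ringing"): CallState.PROCEEDING,
    (CallState.INVITE_SENT, "200 OK"): CallState.ANSWERED,
    (CallState.PROCEEDING, "200 OK"): CallState.ANSWERED,
    (CallState.ANSWERED, "ACK"): CallState.CONFIRMED,
    (CallState.CONFIRMED, "BYE"): CallState.TERMINATING,
    (CallState.TERMINATING, "200 OK"): CallState.ENDED,
}


def advance(state: CallState, message: str) -> CallState:
    """Return the next state, or raise if the message is out of order."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"unexpected {message!r} in state {state.name}") from None


def run_call(messages):
    """Feed a list of messages through the state machine."""
    state = CallState.IDLE
    for message in messages:
        state = advance(state, message)
    return state
```

Feeding the six steps in order ends in `CallState.ENDED`; an out-of-order message such as an ACK before any INVITE raises `ValueError`, which is roughly what a trace reader looks for when a call setup stalls.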

SDP (Session Description Protocol ³) is the “menu” inside SIP messages. It lists codec options, packetization, IP/port for RTP, and sometimes SRTP keying details. Most call problems come from one of these areas:

SIP signaling cannot reach the far end (routing, auth, ports).
RTP media cannot reach the far end (NAT, firewall, wrong IP in SDP).
The negotiated codec/DTMF mode is incompatible.
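
To make the SDP "menu" concrete, here is a minimal parsing sketch. The sample body, the documentation-range address, and the `parse_audio_sdp` helper are assumptions for illustration; it handles a single audio stream with a session-level `c=` line and is not a complete SDP parser.

```python
SAMPLE_SDP = """\
v=0
o=- 12345 1 IN IP4 203.0.113.10
s=call
c=IN IP4 203.0.113.10
t=0 0
m=audio 10500 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=ptime:20
"""


def parse_audio_sdp(sdp: str) -> dict:
    """Extract the fields that matter most when debugging media."""
    info = {"ip": None, "port": None, "proto": None, "codecs": {}, "ptime": None}
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("c="):
            # c=<nettype> <addrtype> <address>
            info["ip"] = line.split()[2]
        elif line.startswith("m=audio"):
            # m=audio <port> <proto> <payload types...>
            fields = line.split()
            info["port"] = int(fields[1])
            info["proto"] = fields[2]
        elif line.startswith("a=rtpmap:"):
            payload_type, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            info["codecs"][int(payload_type)] = encoding
        elif line.startswith("a=ptime:"):
            info["ptime"] = int(line[len("a=ptime:"):])
    return info
```

The three fields to check first map onto the three problem areas: `ip` and `port` for media reachability, `codecs` for codec mismatch, and the presence of `telephone-event` for the DTMF mode.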

Why SIP works well for phones, PBX, and intercoms

SIP scales because it separates control from media. Phones and SIP intercoms can register to a PBX for inbound reachability and can also place outbound calls through trunks. A PBX can fork calls (ring multiple devices), route by dial plan, and provide features like transfer, paging, and hunt groups.

In practical deployments, it helps to think in two planes:

Control plane (SIP): identities, authentication, routing, and features.
Media plane (RTP/SRTP): audio quality, jitter, packet loss, and encryption.

Plane	Typical ports	What breaks first	What to test
SIP signaling	5060 UDP/TCP, 5061 TLS	Registration, call setup, ringing	SIP logs, REGISTER/INVITE traces
RTP media	Dynamic UDP range	One-way/no audio, DTMF issues	RTCP stats, firewall pinholes
Security	TLS + SRTP	Compliance gaps, MITM risk	Certs, cipher suites, keying mode

SIP calling becomes predictable when signaling and media are treated as separate systems with separate failure modes.

If the goal is reliable phone + intercom projects, the next step is understanding how registration and SIP trunks connect everything into one dial plan.

How do SIP registration and trunks connect PBX, phones, and intercoms?

SIP networks fail when roles are mixed up. Phones “register,” trunks “peer,” and intercoms sometimes do both. Clarity here saves a lot of time.

Registration connects endpoints to a PBX by publishing their current contact address, while SIP trunks connect PBXs to carriers or other PBXs. Phones and intercoms usually register as extensions; trunks usually authenticate or IP-peer to exchange inbound/outbound calls.

Figure: SIP registration and extensions. A SIP registrar maps PBX devices, intercoms, and extensions during the registration process.

Registration: how an endpoint becomes reachable

Registration is the process where a phone or SIP intercom tells the PBX: “this is my current IP/port; send calls for extension 101 here.” The endpoint sends REGISTER requests ⁴ to a registrar on the PBX (or hosted platform). The PBX typically answers the first REGISTER with an authentication challenge (401); the endpoint resends with credentials, and the PBX accepts with 200 OK. The PBX stores the Contact location with an expiry timer, and the endpoint must re-register before that timer runs out.
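
The idea of a stored Contact with an expiry timer can be sketched as a toy location table. The `Registrar` class, its method names, and the example addresses are hypothetical; authentication and multiple contacts per extension are deliberately left out.

```python
import time


class Registrar:
    """Toy location table: address-of-record -> (contact, expiry time)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._bindings = {}

    def register(self, aor: str, contact: str, expires: int) -> None:
        if expires == 0:
            # Expires: 0 is how an endpoint removes its binding.
            self._bindings.pop(aor, None)
            return
        self._bindings[aor] = (contact, self._clock() + expires)

    def lookup(self, aor: str):
        """Return the current contact, or None if missing or expired."""
        entry = self._bindings.get(aor)
        if entry is None:
            return None
        contact, expiry = entry
        if self._clock() >= expiry:
            del self._bindings[aor]
            return None
        return contact
```

If the endpoint stops refreshing, `lookup` starts returning `None` once the timer lapses, which is why an intercom that silently lost its network path becomes unreachable only after the expiry interval.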

Registration is ideal for:

SIP phones on desks
Indoor stations and door intercoms
Remote devices behind NAT (with keepalives)

For intercoms, registration also supports predictable inbound calling (call the door station) and feature control (DTMF for relay, paging, busy lamp, etc.).

Trunks: how the PBX reaches the outside world (or another domain)

A SIP trunk is a PBX-to-provider (or PBX-to-PBX) connection. Instead of “registering like a phone,” trunks often work as:

Registration-based trunks (PBX registers to provider with credentials)
IP-auth trunks (provider trusts calls from the PBX public IP, often with ACLs)
Mutual TLS trunks (strong identity, cert-based trust)

Trunks are where DID numbers, inbound routes, and outbound caller ID policies live. They often require SBC-like behavior: NAT awareness, codec control, and security enforcement.

Common connection patterns that work well

In mixed projects with SIP phones + intercoms, three patterns are common:
1) Local PBX + SIP endpoints (register)
Phones and intercoms register to the PBX. The PBX routes internal calls and uses a trunk for PSTN.
2) Hosted PBX + remote endpoints (register)
Endpoints register over the internet using TLS/SRTP. NAT traversal and keepalives matter.
3) SBC in front of PBX
Endpoints and trunks terminate on an SBC, which handles NAT, encryption, and policy. This often reduces “random” one-way audio in the field.

Element	Identity model	Typical auth	Best practice
Phone / Intercom	Extension (AOR + Contact)	Digest / TLS client auth	Short keepalive, stable registration refresh
PBX	Dial plan controller	N/A	Normalize codecs, DTMF, and RTP ranges
SIP trunk	Carrier peer	IP ACL / registration / mTLS	Use SBC, lock down inbound IPs
SBC (optional)	Security + NAT boundary	Certificates + policy	Terminate TLS/SRTP, hide topology

For SIP intercom deployments, a practical approach is to register each intercom as a normal extension and treat PSTN access as a trunk function only. It keeps roles clean and troubleshooting fast.

Next comes the question that causes most interoperability pain: ports, codecs, and DTMF. This is where “default settings” break multi-vendor systems.

What SIP ports, codecs, and DTMF settings should I configure?

Most “SIP doesn’t work” tickets end up being “RTP blocked,” “wrong codec,” or “DTMF mode mismatch.” A small set of settings prevents most of that.

Configure signaling on 5060 (UDP/TCP) or 5061 (TLS), open a defined RTP UDP port range for media, allow a small codec set for interoperability, and standardize DTMF as RTP events (RFC 2833/4733) unless a platform requires SIP INFO.

Figure: SIP ports and RTP media range. SIP signaling over UDP/TCP 5060 and TLS 5061, with an RTP media port range to desk and wireless phones.

SIP signaling ports

5060 UDP: common default, efficient, but easier to spoof if exposed to the internet.
5060 TCP: useful when NAT devices mishandle UDP or when message size grows.
5061 TLS: encrypted signaling. Preferred for internet-facing deployments.

A clean rule: use TLS externally, and restrict 5060/5061 exposure to known IPs (SBC, PBX, provider).
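
The "known IPs only" rule can be expressed as a small allowlist check. This is a sketch of the policy logic, not a firewall configuration: the networks below are documentation-range placeholders, and `ALLOWED_SIP_PEERS` and `allow_sip_packet` are hypothetical names.

```python
import ipaddress

# Hypothetical allowlist; documentation-range addresses stand in for real peers.
ALLOWED_SIP_PEERS = [
    ipaddress.ip_network("198.51.100.0/28"),   # provider signaling range (example)
    ipaddress.ip_network("203.0.113.25/32"),   # SBC public address (example)
]
SIP_PORTS = {5060, 5061}


def allow_sip_packet(src_ip: str, dst_port: int) -> bool:
    """Accept signaling only on SIP ports and only from listed peers."""
    if dst_port not in SIP_PORTS:
        return False
    address = ipaddress.ip_address(src_ip)
    return any(address in network for network in ALLOWED_SIP_PEERS)
```

In practice the same rule lives in the firewall or SBC access list; the point of the sketch is that both conditions (port and source) must hold.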

RTP media ports (the real firewall work)

RTP uses dynamic UDP ports negotiated in SDP. Many PBXs and SBCs let you define a port range (example ranges seen in the field: 10000–20000, 20000–40000). The exact range is not universal, so the safest move is to:

Set a fixed RTP range on the PBX/SBC
Ensure endpoints match or can accept that range
Open that UDP range between the correct network zones
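
A fixed range also makes capacity easy to reason about. The sketch below assumes the 10000–20000 example range and the common convention of an even RTP port with RTCP on the next odd port; the constant and function names are hypothetical.

```python
RTP_PORT_MIN = 10000   # assumed range; match the PBX/SBC setting
RTP_PORT_MAX = 20000


def port_in_rtp_range(port: int) -> bool:
    """True if a port advertised in SDP falls inside the firewall pinhole."""
    return RTP_PORT_MIN <= port <= RTP_PORT_MAX


def rtp_port_pairs(low: int = RTP_PORT_MIN, high: int = RTP_PORT_MAX):
    """Yield (rtp, rtcp) pairs: even RTP port, RTCP on the next odd port."""
    first = low if low % 2 == 0 else low + 1
    for rtp in range(first, high, 2):
        yield rtp, rtp + 1


def max_media_streams(low: int = RTP_PORT_MIN, high: int = RTP_PORT_MAX) -> int:
    """Number of RTP/RTCP pairs the range can hold at once."""
    return sum(1 for _ in rtp_port_pairs(low, high))
```

With this range there is room for 5000 simultaneous pairs. A PBX that bridges two call legs consumes one pair per leg, so the usable call count is lower than the pair count.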

Codec selection that avoids surprises

A small codec policy works best:

G.711 (PCMU/PCMA): highest compatibility, higher bandwidth (~80–90 kbps per call including overhead at 20 ms ptime).
Opus: excellent quality and resilience, flexible bitrate; great when both ends support it.
G.729: lower bandwidth (8 kbps payload), but some platforms still license it per channel, and its compression distorts in-band tones such as DTMF.

Packetization time (ptime) commonly defaults to 20 ms, which is a solid balance of latency and overhead.
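
The G.711 figure can be checked with a little arithmetic. The sketch below assumes IPv4 without options, no VLAN tag, and counts one direction only; `call_bandwidth_kbps` is a hypothetical helper.

```python
def call_bandwidth_kbps(codec_kbps: float, ptime_ms: int, overhead_bytes: int) -> float:
    """One-way bandwidth for a voice stream, including per-packet overhead."""
    payload_bytes = codec_kbps * ptime_ms / 8      # kbit/s * ms = bits per packet
    packets_per_second = 1000 / ptime_ms
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second / 1000


IP_UDP_RTP = 20 + 8 + 12    # IPv4 + UDP + RTP headers, in bytes
ETHERNET = 14 + 4           # Ethernet header + frame check sequence

g711_ip = call_bandwidth_kbps(64, 20, IP_UDP_RTP)              # 80.0
g711_eth = call_bandwidth_kbps(64, 20, IP_UDP_RTP + ETHERNET)  # 87.2
```

At 20 ms each packet carries 160 bytes of G.711 audio plus 40 bytes of headers, which gives 80 kbps at the IP layer and about 87 kbps on Ethernet. Raising ptime lowers the overhead share but adds latency, which is the trade-off behind the 20 ms default.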

DTMF settings (critical for door release and IVR)

DTMF can be carried in several ways:

RTP events (RFC 2833/4733) ⁵: best interoperability for VoIP. Recommended default.
SIP INFO: used by some systems, but can break across proxies or when not normalized.
In-band: depends on codec fidelity; often unreliable with compressed codecs.

For SIP intercoms controlling relays, RTP events are usually the safest. If a platform insists on SIP INFO, keep it consistent end-to-end and avoid transcoding gateways that drop INFO.
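
For reference, an RTP event carries each digit in a 4-byte payload: event code, an end bit plus volume, and a duration. The sketch below builds only that payload (not the RTP header); the function names and the default volume are assumptions for illustration.

```python
import struct

# Event codes: digits 0-9 map to 0-9, "*" is 10, "#" is 11.
DTMF_EVENTS = {**{str(d): d for d in range(10)}, "*": 10, "#": 11}


def encode_telephone_event(digit: str, duration: int, end: bool = False,
                           volume: int = 10) -> bytes:
    """Build the 4-byte telephone-event payload.

    duration is in RTP timestamp units (8000 per second at the usual clock
    rate); volume is 0-63.
    """
    if not 0 <= volume <= 63:
        raise ValueError("volume must fit in 6 bits")
    flags = (0x80 if end else 0x00) | volume   # E bit, reserved bit 0, volume
    return struct.pack("!BBH", DTMF_EVENTS[digit], flags, duration)


def decode_telephone_event(payload: bytes):
    """Return (event, end_flag, volume, duration) from a payload."""
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    return event, bool(flags & 0x80), flags & 0x3F, duration
```

When a door relay ignores digits, decoding captured payloads this way shows quickly whether events arrive at all and whether the end bit is ever set.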

Item	Recommended default	When to change	Symptom if wrong
SIP transport	TLS/5061 externally	Legacy endpoints	Random registration failure, security risk
RTP range	Fixed UDP range on PBX/SBC	Multi-zone firewalls	One-way/no audio
Codec set	G.711 + Opus (optional)	Low bandwidth links	No audio, transcoding artifacts
ptime	20 ms	High-loss links (sometimes)	Latency or choppy audio
DTMF	RFC 2833/4733	Platform requires INFO	Door open fails, IVR ignores digits

A practical checklist for SIP devices (phones + intercoms) is: restrict codecs, lock ptime, standardize DTMF, and pin RTP ranges. It avoids most multi-vendor interoperability problems.

Once ports and codecs are correct, the next failure mode is NAT. NAT is where signaling may work but audio fails, and SIP ALG often makes it worse.

How do NAT, STUN, and SIP ALG affect audio and signaling?

SIP can register successfully and still produce one-way audio. That is the classic sign of NAT and SDP problems, not “bad SIP credentials.”

NAT rewrites IP/ports and can cause SIP/SDP to advertise unreachable private addresses; STUN/ICE help endpoints discover public mappings; SIP ALG tries to rewrite SIP/SDP but often breaks modern VoIP and should usually be disabled in favor of SBCs and proper NAT traversal.

Figure: SIP, SDP, and RTP with NAT. SIP and SDP pass through network address translation while RTP media is relayed.

Why NAT breaks media more than signaling

Signaling (SIP) often goes to a known server IP/port, so NAT creates a stable outbound mapping. Media (RTP) uses dynamic ports and may be peer-to-peer. If the SDP advertises a private IP (like 192.168.x.x) to a remote endpoint, the far end sends RTP to an unreachable address. Result: one-way audio or no audio.
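
A quick way to spot this in a trace is to check the `c=` lines of the SDP for RFC 1918 addresses. The sketch below is a hypothetical helper that only looks at IPv4 connection lines; it does not account for an SBC or PBX that rewrites or relays media on purpose.

```python
import ipaddress

# Private IPv4 ranges defined by RFC 1918.
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def private_media_addresses(sdp: str):
    """Return private IPv4 addresses advertised on c= lines."""
    flagged = []
    for raw in sdp.splitlines():
        line = raw.strip()
        if not line.startswith("c=IN IP4 "):
            continue
        text = line.split()[2].split("/")[0]   # drop any multicast TTL suffix
        try:
            address = ipaddress.ip_address(text)
        except ValueError:
            continue                           # hostname instead of an address
        if any(address in network for network in RFC1918):
            flagged.append(text)
    return flagged
```

A non-empty result on SDP that crossed the internet is a strong hint that the far end is sending RTP somewhere it can never arrive.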

Common NAT-friendly behaviors that help:

rport and symmetric response (server replies to source port)
SIP keepalives to maintain mappings
Symmetric RTP (send RTP back to the source address/port seen)
Short REGISTER refresh (balanced to avoid load)

STUN, TURN, and ICE in VoIP reality

STUN: tells a client its public mapped address. Works for many NAT types, but not for symmetric NATs, where the mapping changes per destination.
TURN: relays media through a server. Heavier, but reliable when direct media fails.
Interactive Connectivity Establishment (ICE) ⁶: negotiates the best candidate path (host, STUN-reflexive, TURN-relayed). Common in WebRTC and modern softphones.

For many enterprise SIP phones, STUN support is limited or optional. For softphones and WebRTC clients, ICE is often the default approach.
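
To show what "discovering the public mapping" means on the wire, here is a minimal sketch of a STUN Binding Request and of decoding the XOR-MAPPED-ADDRESS attribute from a response. It covers IPv4 only, performs no network I/O, and skips message integrity checks; the function names are hypothetical.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442
BINDING_REQUEST = 0x0001
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id=None) -> bytes:
    """20-byte STUN header: type, length 0, magic cookie, 96-bit transaction ID."""
    transaction_id = transaction_id or os.urandom(12)
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_xor_mapped_address(response: bytes):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    offset = 20   # attributes start after the fixed header
    while offset + 4 <= len(response):
        attr_type, attr_len = struct.unpack_from("!HH", response, offset)
        value = response[offset + 4: offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack_from("!HI", value, 2)
            port = xport ^ (MAGIC_COOKIE >> 16)
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)   # attributes are padded to 4 bytes
    return None
```

The address and port returned are what the NAT presented to the STUN server. Behind a symmetric NAT that mapping may differ for the real media peer, which is where TURN or an SBC relay comes in.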

SIP ALG: why it causes “works but broken”

SIP ALG inspects and rewrites SIP/SDP to “help” NAT traversal. In theory, it fixes private IPs and opens pinholes. In practice, many ALGs:

Rewrite headers inconsistently
Mangle SDP ports
Break re-INVITEs and mid-call changes
Fail with TLS (an ALG cannot read or rewrite encrypted SIP)
Conflict with ICE/symmetric RTP behaviors

The most reliable pattern is: disable SIP ALG and use one of these instead:

An SBC at the edge
A PBX that supports NAT-aware contact handling and symmetric RTP
Proper firewall rules with predictable port ranges

Situation	Best approach	What to avoid	Typical symptom
Phones behind NAT to hosted PBX	TLS + keepalive + SBC	SIP ALG	Random one-way audio
WebRTC clients	ICE (STUN/TURN)	Forcing direct RTP only	Calls connect, audio fails
Site-to-site PBX	IPsec + SBC	Overlapping ALGs	Mid-call drops on re-INVITE
Mixed VLANs/firewalls	Fixed RTP range	Wide-open any-any UDP	Security risk + still unstable

For SIP intercom projects, NAT issues often show up as: registration works, but door station audio is one-way when calling outside the LAN. The fix is almost always SDP correctness, RTP pinholes, and disabling SIP ALG.

After NAT is handled, security becomes the next question: how to encrypt calls in transit and satisfy compliance needs without breaking interoperability.

How does SRTP secure calls and meet compliance requirements?

Unencrypted VoIP is easy to intercept on shared networks. That risk grows in multi-tenant buildings and cloud deployments. Encryption needs to be designed, not bolted on.

SRTP encrypts RTP media to protect voice content and adds integrity and replay protection; it is often paired with SIP over TLS for signaling. Compliance goals are met by encrypting in transit, controlling keys, enforcing strong identity, and logging security events without storing sensitive media unnecessarily.

Figure: Plain RTP vs SIP over TLS and SRTP. Comparison of plain RTP versus SIP over TLS with SRTP-encrypted media for SIP phones and intercoms.

What SRTP protects (and what it does not)

Secure Real-time Transport Protocol (SRTP) ⁷ secures the media stream:

Confidentiality (encryption)
Integrity (tamper detection)
Replay protection (blocks reused packets)
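
Replay protection is usually implemented as a sliding window over packet indices. The sketch below is a simplified illustration of that technique, not an SRTP implementation: the `ReplayWindow` class is hypothetical, the 64-packet size is a commonly used minimum, and a real stack authenticates the packet before it updates the window.

```python
class ReplayWindow:
    """Sliding-window replay check over packet indices (simplified)."""

    def __init__(self, size: int = 64):
        self.size = size
        self.highest = -1   # highest index accepted so far
        self.bitmap = 0     # bit i set means (highest - i) was already seen

    def accept(self, index: int) -> bool:
        if index > self.highest:
            # Newer packet: slide the window forward and mark it as seen.
            shift = index - self.highest
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.size) - 1)
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False    # older than the window, reject
        if self.bitmap & (1 << offset):
            return False    # duplicate, reject
        self.bitmap |= 1 << offset
        return True
```

Late packets inside the window are accepted once, so normal reordering still works, while a captured packet that is sent again is dropped.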

SIP over TLS protects signaling metadata in transit (dialed numbers, headers, SDP). Without TLS, SRTP may still protect audio, but SIP messages can leak sensitive call details and can be modified by attackers.

Keying methods: why interoperability matters

SRTP requires both ends to agree on keys. Common approaches include:

SDES-SRTP: keys are carried in SDP. Easy to deploy, but keys must be protected by TLS to be safe.
DTLS-SRTP: keys negotiated via DTLS handshakes. Common in WebRTC and some modern endpoints.
SRTP via SBC: the SBC terminates and re-encrypts, which is not end-to-end but is practical for compliance and interop.

For many enterprise PBX systems, “TLS + SDES-SRTP” is a common, workable baseline. For WebRTC, DTLS-SRTP is typical.
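
With SDES, the key material travels in an `a=crypto` line inside the SDP, which is why the signaling must be protected by TLS. The sketch below generates and parses such a line for the AES_CM_128_HMAC_SHA1_80 suite (16-byte key plus 14-byte salt). The helper names are hypothetical, and only the `inline:` key method is handled.

```python
import base64
import os

KEY_LEN = 16    # AES-128 master key
SALT_LEN = 14   # 112-bit master salt


def make_crypto_line(tag: int = 1, suite: str = "AES_CM_128_HMAC_SHA1_80") -> str:
    """Build an a=crypto line with fresh random key material."""
    material = os.urandom(KEY_LEN + SALT_LEN)
    inline = base64.b64encode(material).decode("ascii")
    return f"a=crypto:{tag} {suite} inline:{inline}"


def parse_crypto_line(line: str):
    """Return (tag, suite, master_key, master_salt) from an a=crypto line."""
    body = line[len("a=crypto:"):]
    tag, suite, key_params = body.split(" ", 2)
    inline = key_params.split()[0]                     # ignore session parameters
    if not inline.startswith("inline:"):
        raise ValueError("unsupported key method")
    encoded = inline[len("inline:"):].split("|")[0]    # drop lifetime / MKI
    material = base64.b64decode(encoded)
    if len(material) != KEY_LEN + SALT_LEN:
        raise ValueError("unexpected key length for this suite")
    return int(tag), suite, material[:KEY_LEN], material[KEY_LEN:]
```

Anyone who can read this line in a plain-text SIP trace can decrypt the call, so a capture showing `a=crypto` over UDP 5060 is a finding in itself.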

Compliance: focus on controls, not buzzwords

Compliance requirements vary, but the technical controls that usually matter are stable:

Encrypt signaling (TLS) and media (SRTP) in transit
Restrict who can connect (ACLs, mTLS where possible, SBC policies)
Manage certificates and rotation
Log authentication events and configuration changes
Protect recordings with access control and encryption at rest (if recordings exist)

In regulated environments, an SBC often becomes the policy enforcement point. It can require TLS, enforce cipher suites, prevent downgrade to plain RTP, and provide audit-friendly telemetry.

Practical SRTP settings that reduce failure

Keep a clear fallback policy: either require SRTP or allow fallback only on trusted LAN segments.
Avoid mixed keying modes across domains unless the SBC normalizes them.
Ensure time and certificates are correct on endpoints (clock drift breaks certificate validation).
Validate DTMF under SRTP (RTP events still work, but interop should be tested).
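
The fallback policy in the first item can be written as a simple check on the media profile in the SDP `m=` line, where `RTP/AVP` means plain RTP and the `SAVP` variants mean SRTP. This is a sketch of the decision logic only; the function names and the `trusted_lan` flag are hypothetical, and in a real deployment the SBC or PBX enforces this.

```python
SECURE_PROFILES = {"RTP/SAVP", "RTP/SAVPF", "UDP/TLS/RTP/SAVP", "UDP/TLS/RTP/SAVPF"}


def audio_profiles(sdp: str):
    """Return the transport profile of every m=audio line."""
    profiles = []
    for line in sdp.splitlines():
        fields = line.split()
        if line.startswith("m=audio") and len(fields) >= 3:
            profiles.append(fields[2])
    return profiles


def srtp_policy_allows(sdp: str, trusted_lan: bool = False) -> bool:
    """Require SRTP everywhere; tolerate plain RTP only on trusted segments."""
    profiles = audio_profiles(sdp)
    if not profiles:
        return False
    if all(profile in SECURE_PROFILES for profile in profiles):
        return True
    return trusted_lan
```

Keeping the exception explicit (one flag, set per network segment) avoids the silent downgrade where a single misconfigured device pulls a whole call path back to plain RTP.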

Security layer	Recommended baseline	Benefit	Common pitfall
SIP signaling	TLS (5061)	Protects SIP headers/SDP	Bad certs, wrong time, MITM warnings
Media	SRTP	Protects voice content	Mixed keying modes, forced fallback
Edge control	SBC policy + ACLs	Stops scanning and abuse	Exposing 5060 to the internet
Operations	Logs + rotation	Supports audits and response	No visibility when issues happen

For SIP phones and intercoms, SRTP works best when it is treated as a standard requirement, not an optional feature. The deployment becomes simpler when every device is expected to speak TLS/SRTP, and exceptions are isolated behind controlled gateways.

Conclusion

SIP calling uses SIP/SDP to negotiate and control sessions while RTP carries the audio. Reliable deployments standardize registration/trunk roles, lock ports/codecs/DTMF, handle NAT without SIP ALG, and secure signaling and media with TLS + SRTP.

Footnotes

¹ Official SIP standard for message flows, dialogs, and core signaling behavior.
² Learn how RTP transports real-time voice packets, timing, and sequence handling.
³ SDP reference for codec offers, media attributes, and negotiated IP/port details.
⁴ Details REGISTER bindings, Contact refresh rules, and registrar processing expectations.
⁵ Standard for carrying DTMF digits as RTP events across proxies and SBCs.
⁶ ICE explains practical NAT traversal using STUN/TURN candidates and connectivity checks.
⁷ SRTP standard for encrypting RTP media with integrity and replay protection.

About The Author

DJSLink R&D Team
