Dead-silent calls feel broken. Users think the line dropped, then they talk over each other, and support teams chase “network issues” that are not real.
Comfort Noise Generation (CNG) adds a controlled, low-level background noise during silence so VoIP calls feel continuous. It works with VAD and DTX to save bandwidth without making the line sound dead.

Why CNG exists in real VoIP networks
Silence is not neutral for humans
Silence in packet voice is not the same as silence in PSTN. Traditional phone lines always had some background noise from the analog circuit. People got used to that soft “air.” When VoIP uses silence suppression, the receiver can output absolute digital silence. That can feel like a disconnect, even when the call is active. In customer support, this is one of the most common false alarms: “the call drops when nobody talks.”
CNG solves that perception gap. It generates a small noise bed that matches the remote environment so the listener hears continuity. This matters a lot in SIP door intercoms and speakerphones, where talk spurts and pauses are frequent. A visitor speaks, pauses to listen, and silence can feel like the intercom stopped working.
How CNG is delivered without streaming noise
CNG is efficient because it does not transmit full audio during silence. Instead, the sender uses VAD to detect non-speech frames. When silence is detected, the sender stops sending normal voice payload and sends compact updates that describe the background noise. In many VoIP systems these updates are called Silence Insertion Descriptor (SID) frames 1. The receiver uses SID parameters to synthesize noise locally. This is why bandwidth drops during silence but the call still sounds “alive.”
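As a rough illustration of the receiver side, the sketch below parses a comfort noise payload and synthesizes noise at the signalled level. It assumes the RTP CN payload layout from footnote 1 (first byte carries the noise level in -dBov in its low 7 bits, optional spectral bytes follow). The function names, the mapping of 0 dBov to 16-bit full scale, and the choice to ignore spectral shaping and emit plain white noise are all simplifying assumptions, not how any particular endpoint does it.

```python
import random

FULL_SCALE = 32767  # assumption: 16-bit linear PCM, with 0 dBov treated as full scale


def parse_cn_payload(payload):
    """Split a CN payload into (noise level in -dBov, list of raw spectral bytes)."""
    if not payload:
        raise ValueError("empty CN payload")
    level = payload[0] & 0x7F           # top bit is unused, low 7 bits carry 0..127
    return level, list(payload[1:])     # optional quantized spectral coefficients


def synth_comfort_noise(level_dbov, n_samples, seed=None):
    """Generate white Gaussian noise whose RMS sits level_dbov below full scale.

    Spectral shaping is deliberately skipped here (assumption for brevity).
    """
    rng = random.Random(seed)
    rms = FULL_SCALE * 10 ** (-level_dbov / 20)
    samples = []
    for _ in range(n_samples):
        value = int(round(rng.gauss(0.0, rms)))
        samples.append(max(-32768, min(32767, value)))  # clamp to 16-bit range
    return samples


level, spectral = parse_cn_payload(bytes([0x3C, 0x10]))  # 0x3C = 60, so -60 dBov
noise = synth_comfort_noise(level, n_samples=160, seed=1)  # one 20 ms frame at 8 kHz
```

A real decoder would also apply the spectral information and smooth level changes between updates; this only shows why a payload of a few bytes is enough to keep the line sounding alive.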
Where CNG sits in the media pipeline
In a good audio chain, CNG should not interfere with echo cancellation or mixing. A practical placement is:
- Capture near-end mic
- Run AEC and noise suppression
- Decide VAD / DTX state
- Generate or update comfort noise parameters
- Send RTP (Real-time Transport Protocol) 2 packets (speech or SID updates)
In conferencing, CNG needs extra care. If every participant injects comfort noise, the mixer can sum multiple noise beds and create a higher hiss floor. This is why some conference bridges prefer to generate one comfort noise bed centrally, or they reduce CNG energy when many silent participants exist.
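The size of that hiss build-up can be estimated with simple power addition. The sketch below assumes the noise beds are uncorrelated and that levels are expressed in dBFS; both function names are hypothetical and the numbers are only illustrative.

```python
import math


def summed_noise_dbfs(levels_dbfs):
    """Power-sum uncorrelated noise beds (each in dBFS) and return the total in dBFS."""
    total_power = sum(10 ** (level / 10) for level in levels_dbfs)
    return 10 * math.log10(total_power)


def per_stream_target_dbfs(total_target_dbfs, n_streams):
    """Per-stream level that keeps the mixed noise floor at total_target_dbfs."""
    return total_target_dbfs - 10 * math.log10(n_streams)


# Four silent participants at -60 dBFS each mix to roughly -54 dBFS,
# so the floor rises by about 6 dB.
mixed = summed_noise_dbfs([-60.0] * 4)

# To hold the mix at -60 dBFS, each of the four streams would need about -66 dBFS.
each = per_stream_target_dbfs(-60.0, 4)
```

Under these assumptions the floor rises by 10·log10(N) dB for N equal streams, which is why a bridge either scales per-stream noise down or generates a single bed itself.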
| Component | Job | What goes wrong when mis-set |
|---|---|---|
| VAD | Detect speech vs silence | Speech gets cut or noise triggers “talking” |
| DTX / silence suppression | Stop sending full-rate speech | NAT bindings can age out if no keepalives |
| SID updates | Describe background noise | “Pumping” or sudden noise jumps |
| CNG synthesis | Generate matching noise locally | Hiss too loud, dead silence, or clicks |
CNG is not a “nice-to-have.” It is part of making VoIP feel natural, especially for SIP phones, SIP intercoms, and paging endpoints that live in noisy spaces.
If CNG is understood as a controlled illusion, setup becomes simpler: keep it subtle, keep it stable, and keep transitions smooth.
Now the next question is the core mechanics: how CNG works with VAD and silence suppression in real RTP streams.
A clear mental model prevents most bad tuning decisions.
How does CNG work with VAD and silence suppression?
Silence suppression saves bandwidth, but it can make calls feel unstable. If users think the call dropped, they talk over each other and call quality feels worse.
VAD decides when speech stops, DTX stops sending full speech packets, and CNG fills the silence with synthetic noise. The sender sends small SID updates, and the receiver generates noise until speech resumes.

The three-part handshake: VAD → DTX → CNG
Voice Activity Detection (VAD) 3 is the gatekeeper. It looks at short frames, often 10–30 ms, and labels them speech or non-speech. When VAD says “non-speech,” discontinuous transmission (DTX) 4 kicks in. DTX is the policy that reduces transmission during silence. Instead of sending 50 RTP packets per second of near-zero audio (the rate at 20 ms packetization), the endpoint sends either nothing or very small periodic updates.
CNG is the listener-side result. The receiver generates a noise bed so silence does not sound like a hard mute. Good receivers also cross-fade when switching from synthetic noise to real speech to avoid clicks or sudden level jumps at the start of a talk spurt.
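To make the VAD and DTX hand-off concrete, here is a minimal energy-based sketch. Production VADs use far richer features than frame energy; the threshold, the hangover length, and the function names below are hypothetical values chosen only to show the mechanism.

```python
import math


def frame_dbfs(frame):
    """RMS level of one frame of 16-bit samples in dBFS, floored at -100."""
    if not frame:
        return -100.0
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0:
        return -100.0
    return max(-100.0, 10 * math.log10(mean_square / (32768.0 ** 2)))


def classify_frames(frames, threshold_dbfs=-45.0, hangover_frames=5):
    """Label each frame "speech" (send full RTP) or "silence" (DTX may apply).

    After the level drops below the threshold, the next hangover_frames frames
    are still labelled speech so that word endings are not cut off.
    """
    labels = []
    hang = 0
    for frame in frames:
        if frame_dbfs(frame) > threshold_dbfs:
            hang = hangover_frames
            labels.append("speech")
        elif hang > 0:
            hang -= 1
            labels.append("speech")   # hangover period
        else:
            labels.append("silence")
    return labels


loud = [3000] * 160    # about -21 dBFS
quiet = [10] * 160     # about -70 dBFS
print(classify_frames([loud, quiet, quiet, quiet, quiet], hangover_frames=2))
# ['speech', 'speech', 'speech', 'silence', 'silence']
```

The hangover is the knob that trades a little bandwidth for cleaner word endings, which is usually a good trade on intercoms.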
Why SID update rate matters
Background noise is not constant. A quiet office can turn into keyboard clicks. A lobby can shift from calm to busy. If SID frames are too infrequent, the receiver keeps using stale parameters, and the comfort noise can “pump” or jump when an update finally arrives. If SID frames are too frequent, bandwidth savings drop and some devices behave poorly.
A practical approach is to keep SID updates periodic and allow extra updates when noise changes a lot. Many systems do this automatically. When manual settings exist, stable is better than aggressive.
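One way to picture "periodic, plus extra updates on large changes" is the small decision function below. It is only a sketch: the period, the change threshold, the minimum gap, and the function name are hypothetical, and standardized codecs define their own SID rules.

```python
def should_send_sid(now_ms, last_sid_ms, level_db, last_sent_level_db,
                    period_ms=200, change_db=3.0, min_gap_ms=40):
    """Decide whether to send a noise update during a silence period.

    Sends when the periodic timer expires, or earlier if the measured noise
    level moved by at least change_db, but never more often than min_gap_ms.
    """
    since_last = now_ms - last_sid_ms
    if since_last < min_gap_ms:
        return False                      # rate limit: keeps the noise bed stable
    if since_last >= period_ms:
        return True                       # periodic refresh
    return abs(level_db - last_sent_level_db) >= change_db


should_send_sid(1000, 900, -60.0, -60.0)   # False: nothing changed, timer not expired
should_send_sid(1000, 900, -55.0, -60.0)   # True: noise rose by 5 dB
should_send_sid(1000, 980, -40.0, -60.0)   # False: too soon after the last update
```

The minimum gap is what keeps the scheme "stable rather than aggressive": a burst of wind or a door slam cannot trigger a flurry of updates.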
Interactions with NAT and keepalives
Silence suppression can reduce RTP traffic enough that NAT mappings age out on some routers. This can cause one-way audio after long silence, especially on consumer-grade NAT with short UDP timeouts. In deployments with SIP intercoms behind NAT, it helps to have the following in place:
- SIP keepalives (REGISTER refresh and/or CRLF keepalive) so the signaling binding stays open
- RTP keepalive behavior, if supported, so the media binding stays open
- Reasonable UDP timeout settings on the edge firewall
The goal is simple: silence suppression should not accidentally close the media path.
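A simple way to reason about this is to keep the longest gap between outgoing packets well below the NAT's UDP timeout. The sketch below assumes the timeout is known (in practice it often has to be measured) and uses a hypothetical safety factor of 0.5; the function names are illustrative only.

```python
def keepalive_interval_s(nat_udp_timeout_s, safety_factor=0.5):
    """Longest gap to allow between outgoing packets on a NAT binding."""
    if nat_udp_timeout_s <= 0 or not 0 < safety_factor < 1:
        raise ValueError("timeout must be positive and safety_factor between 0 and 1")
    return nat_udp_timeout_s * safety_factor


def keepalive_due(now_s, last_packet_sent_s, nat_udp_timeout_s, safety_factor=0.5):
    """True when a keepalive (or SID update) should go out to refresh the binding."""
    gap = now_s - last_packet_sent_s
    return gap >= keepalive_interval_s(nat_udp_timeout_s, safety_factor)


# With an assumed 30 s UDP timeout, something should be sent at least every 15 s.
keepalive_due(now_s=20.0, last_packet_sent_s=0.0, nat_udp_timeout_s=30.0)  # True
keepalive_due(now_s=10.0, last_packet_sent_s=0.0, nat_udp_timeout_s=30.0)  # False
```

If periodic SID updates are already more frequent than this interval, they refresh the binding on their own and no separate media keepalive is needed.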
| Stage | Sender behavior | Receiver behavior | Key risk |
|---|---|---|---|
| Speech | Send full RTP with codec payload | Decode and play | Normal jitter/loss handling |
| Silence begins | VAD switches state, DTX reduces packets | Cross-fade into comfort noise | Word endings cut off if VAD hangover is too short |
| Long silence | Send periodic SID updates | Generate noise from SID | NAT timeout or stale noise parameters |
| Speech resumes | Resume full RTP | Fade out CNG and play speech | First syllable clipped if VAD is slow, clicks if transitions are abrupt |
When these pieces are aligned, a call stays natural and efficient. When they are not aligned, users complain about “random hiss,” “cut words,” or “audio drops after silence.”
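The "speech resumes" transition can be sketched as a short linear cross-fade between the synthetic noise and the first decoded samples. This is an illustrative assumption about how a receiver might do it; real decoders choose their own fade shape and length, and the function name is hypothetical.

```python
def crossfade(noise, speech):
    """Linearly fade from comfort noise to decoded speech over equal-length buffers."""
    if len(noise) != len(speech):
        raise ValueError("buffers must have the same length")
    n = len(noise)
    if n == 0:
        return []
    if n == 1:
        return [speech[0]]
    out = []
    for i in range(n):
        w = i / (n - 1)                     # 0.0 at the start, 1.0 at the end
        out.append(int(round((1 - w) * noise[i] + w * speech[i])))
    return out


print(crossfade([100] * 5, [0] * 5))
# [100, 75, 50, 25, 0]
```

The fade should stay short (a few milliseconds) so it removes the click without softening the first syllable.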
Next is a practical question that comes up in every installation checklist: what comfort noise level should be used for phones and intercoms.
The answer is not one number, but there are safe targets.
What comfort noise level suits SIP phones and intercoms?
Comfort noise that is too low feels like a dead line. Comfort noise that is too high sounds like a hiss problem. Both create support tickets.
A good CNG level is subtle and close to the real background noise floor. For SIP phones, it should be barely noticeable. For SIP intercoms in noisy areas, it should match the environment but never mask speech onset.

Think in “perceived continuity,” not absolute voltage
Many devices do not expose comfort noise in dBFS or Vrms. They expose it as Low/Medium/High or Auto. In those cases, the target is still the same: make silence feel continuous but not distracting. The best CNG is the one nobody notices.
In quiet offices, comfort noise should be very low. In loud lobbies, comfort noise can be higher, but it should not become a constant hiss that operators hear all day. On door intercoms, the environment can change fast. Wind and street noise may spike, then drop. If the system chases those changes too aggressively, the noise bed becomes unstable.
Practical tuning approach that works across brands
A simple setup method avoids guessing:
1) Put the endpoint in its real environment.
2) Start a call and stay silent for 10–15 seconds.
3) Listen on the far end with good speakers or a headset.
4) Increase CNG only until silence feels “connected.”
5) Speak softly and check that the first syllable is not masked or clipped.
If the far end hears a strong hiss during silence, CNG is too high, or the noise model is wrong. If the far end hears pure dead silence and thinks the call dropped, CNG is too low or disabled.
Special notes for SIP intercoms
Intercoms often use speakerphone mode. That means AEC, noise suppression, and AGC are active, and they can reshape the noise floor. If AGC boosts during silence, the system may raise the perceived comfort noise too much. For that reason, it helps to keep AGC moderate and avoid excessive mic gain.
In DJSlink-style deployments, a common pattern is to keep CNG set to Auto or Low, then focus on microphone placement and noise suppression quality. When the physical audio is clean, CNG can stay subtle.
| Environment | Suggested CNG approach | What to watch |
|---|---|---|
| Quiet office SIP phones | Very low / Auto | Dead silence perception vs faint hiss |
| Call center headset | Often disable or very low | Headsets make hiss more obvious |
| Lobby intercom | Low to medium, stable | Noise “pumping” when crowd changes |
| Outdoor door station | Auto with smoothing | Wind noise causing false noise updates |
| Industrial paging point | Medium, but controlled | Noise masking soft speech starts |
Comfort noise should support conversation, not become the main thing people hear. If CNG becomes noticeable, the next things to check are codec behavior and how SID is carried.
That leads into codec-specific behavior, because not all codecs handle CNG the same way.
How do codecs G.711, G.729, and Opus implement CNG?
Many VoIP issues happen after a codec change. People blame compression, but the real change was VAD/DTX/CNG behavior and how silence is represented.
G.729 has a standardized VAD/DTX/CNG option (Annex B) that uses SID frames. G.711 has no built-in DTX; it is usually paired with the generic RTP comfort noise (CN) payload, which is based on G.711 Appendix II. Opus has built-in DTX and decoder-side noise handling, and the exact behavior depends on the implementation.

G.711: simple waveform, optional CN payload behavior
The ITU-T G.711 codec standard 5 is a waveform codec (PCMU/PCMA). It is heavy in bitrate compared to compressed codecs, but it is simple and interoperable. CNG is not part of the core G.711 codec; it is commonly added through the RTP comfort noise (CN) payload, which is based on G.711 Appendix II. In practice, systems either:
- Send “silence” as low-level PCM continuously (no DTX)
- Or enable VAD/DTX so silence uses SID updates and receiver-side CNG
Because G.711 is so widely supported, mismatches can happen when one side expects CN payload handling and the other side does not. In those cases, silence can become dead quiet or turn into odd comfort noise artifacts.
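A quick way to spot this kind of mismatch is to compare what each side advertises in SDP. The sketch below looks for the comfort noise payload, either as static payload type 13 on the `m=audio` line or as a `CN/<rate>` rtpmap entry. The function names and the sample SDP bodies are hypothetical, and a real check would also need to handle multiple media sections.

```python
import re


def offers_comfort_noise(sdp):
    """True if the SDP advertises the RTP comfort noise (CN) payload for audio."""
    for raw in sdp.splitlines():
        line = raw.strip()
        if re.match(r"a=rtpmap:\d+\s+CN/\d+", line, re.IGNORECASE):
            return True                          # explicit CN mapping
        if line.startswith("m=audio"):
            payload_types = line.split()[3:]     # fields after port and profile
            if "13" in payload_types:
                return True                      # static payload type 13 is CN
    return False


def cn_mismatch(offer_sdp, answer_sdp):
    """True when only one side of the call advertises CN."""
    return offers_comfort_noise(offer_sdp) != offers_comfort_noise(answer_sdp)


offer = "v=0\r\nm=audio 49170 RTP/AVP 0 13\r\na=rtpmap:0 PCMU/8000\r\na=rtpmap:13 CN/8000\r\n"
answer = "v=0\r\nm=audio 49172 RTP/AVP 0\r\na=rtpmap:0 PCMU/8000\r\n"
print(cn_mismatch(offer, answer))
# True
```

A mismatch flagged this way is only a hint: the sender should then stop using CN for that call, and whether it actually does is worth confirming in a packet capture.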
G.729: bandwidth saver with common Annex B usage
The ITU-T G.729 speech coding standard 6 is a low-bitrate codec, and it often uses VAD/DTX/CNG options in real deployments. When enabled, silence periods reduce bandwidth further, and SID parameters update the receiver’s noise generator. The key practical point is that G.729 endpoints can differ in how aggressively they trigger VAD and how often they send SID updates. In mixed-vendor networks, this is a common source of “hiss changes” or clipped word starts.
For SIP intercoms that need reliability, the safest approach is to validate cross-vendor behavior in a short test matrix before large deployment. One bad codec pairing can create hundreds of “audio is weird” tickets.
Opus: flexible, modern, and implementation-dependent
The Opus codec specification (RFC 6716) 7 supports a wide range of bitrates and bandwidth modes. It also supports DTX-style behavior so the encoder can reduce traffic during silence. In many Opus deployments, the decoder can generate a comfort-noise-like output based on recent signal statistics, and packet loss concealment can also behave like synthetic noise during gaps. Some systems also use a dedicated comfort noise payload type in RTP for explicit CN handling, depending on the stack and negotiation.
The practical takeaway is not the internal math. The takeaway is interoperability:
- Keep Opus settings consistent across the call path when possible.
- Avoid unnecessary transcoding at the PBX/SBC.
- Validate silence behavior on the exact endpoints used in the project.
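For the first point in the list above, one concrete thing to compare across endpoints is whether DTX is requested for Opus in SDP. The sketch below assumes the `usedtx` format parameter from the Opus RTP payload specification, where an absent parameter means DTX is not requested. The function name and the sample SDP are hypothetical.

```python
import re


def opus_dtx_requested(sdp):
    """True if any Opus payload type in the SDP carries usedtx=1 in its fmtp line."""
    lines = [raw.strip() for raw in sdp.splitlines()]

    # Collect the dynamic payload types that are mapped to Opus.
    opus_pts = set()
    for line in lines:
        m = re.match(r"a=rtpmap:(\d+)\s+opus/48000", line, re.IGNORECASE)
        if m:
            opus_pts.add(m.group(1))

    # Look for usedtx=1 on the fmtp line of any of those payload types.
    for line in lines:
        m = re.match(r"a=fmtp:(\d+)\s+(.*)", line)
        if not m or m.group(1) not in opus_pts:
            continue
        params = {}
        for item in m.group(2).split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                params[key.strip().lower()] = value.strip()
        if params.get("usedtx") == "1":
            return True
    return False


sdp = (
    "m=audio 5004 RTP/SAVPF 111\r\n"
    "a=rtpmap:111 opus/48000/2\r\n"
    "a=fmtp:111 minptime=10;useinbandfec=1;usedtx=1\r\n"
)
print(opus_dtx_requested(sdp))
# True
```

Running this on the SDP from each leg of a call makes it easy to see when a PBX or SBC in the middle has changed the DTX setting.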
| Codec | Typical CNG style | Strength | Common deployment risk |
|---|---|---|---|
| G.711 | Optional DTX/CNG mode | Best compatibility | “One side ignores SID/CN” mismatch |
| G.729 | Often used with VAD/DTX/CNG | Low bandwidth | Aggressive VAD clips word starts |
| Opus | DTX + decoder noise handling | Best quality per bitrate | Different stacks behave differently, transcoding hurts |
Codec choice should be driven by the network and the endpoints, not by a single “best codec” claim. For SIP intercoms, clarity and interoperability usually beat squeezing bandwidth at all costs.
If the symptom is “hiss during silence,” the codec is only one suspect. The next section covers why hiss happens and how to fix it without guessing.
Why do I hear hiss with CNG and how to fix it?
Hiss complaints are common because comfort noise is supposed to be subtle. When users notice it, the level is wrong, the noise model is unstable, or the chain is amplifying noise.
Hiss with CNG usually comes from comfort noise set too high, unstable SID updates, gain staging that boosts noise, or codec/interoperability mismatches. Fix it by lowering CNG level, stabilizing VAD/DTX behavior, and correcting gain and DSP order.

The most common causes in the field
1) CNG level too high
This is the simplest. Many devices ship with aggressive defaults meant to avoid dead silence. On headsets and quiet rooms, that “safe” level becomes obvious hiss.
2) AGC or mic preamp noise being amplified
If the near-end audio chain is noisy, CNG can expose it. Some AGC implementations raise gain during silence, which lifts the mic and preamp noise floor until it becomes audible. The fix is often proper mic gain staging and less aggressive AGC.
3) SID updates too slow or noise floor tracking too reactive
If the environment changes and SID updates lag, the noise bed can jump. People perceive this as pumping or hiss that changes. A smoother update behavior is better than chasing every change.
4) Transcoding or mixed endpoint behavior
When a PBX or SBC transcodes, it can break silence behavior. One side may send SID or CN updates, while the other side expects continuous comfort noise, or vice versa. This mismatch can create hiss bursts, dead silence, or strange tonal noise.
5) Conference mixing summing multiple CNG beds
In multi-party calls, multiple comfort noise sources can add up. The mix becomes a louder hiss floor. A bridge that manages comfort noise centrally is often cleaner.
Fix steps that work in a predictable order
I prefer a simple sequence:
- First, lower CNG level (or set it to Auto/Low).
- Second, confirm VAD is not clipping speech. If it clips, adjust thresholds and hangover so talk spurts start cleanly.
- Third, check gain staging: reduce mic gain if it clips, reduce speaker gain if it distorts, and avoid extreme AGC.
- Fourth, avoid transcoding. Keep a consistent codec end-to-end when possible.
- Fifth, retest in the real environment, not only in a lab.
Quick troubleshooting table for support teams
| Symptom | Likely cause | Fast fix |
|---|---|---|
| Constant noticeable hiss in silence | CNG level too high | Set CNG Low/Auto or reduce noise level |
| Hiss “pumps” up and down | SID updates unstable | Increase smoothing, avoid aggressive noise tracking |
| Hiss only on conference calls | Summed comfort noise | Reduce per-stream CNG or use bridge-controlled noise |
| Hiss only after codec change | Interop mismatch or transcoding | Lock codec path, verify VAD/DTX settings match |
| Hiss plus clipped first syllables | VAD too aggressive | Lower the speech-onset threshold, lengthen hangover, keep CNG subtle |
For SIP intercoms connected to amplifiers or paging systems, the same logic applies, with one extra point: paging amps can amplify noise floors aggressively. If the line input gain is too high, any comfort noise becomes more obvious. A clean gain plan keeps intercom audio near nominal and trims the amplifier input instead of boosting everything.
Conclusion
CNG keeps VoIP calls feeling alive during silence by generating subtle background noise. With VAD and DTX it saves bandwidth, but correct levels, stable SID behavior, and clean gain staging prevent hiss and speech clipping.
Footnotes
1) Defines RTP comfort-noise payload type and CN packet format.
2) Specifies RTP packet format used to carry real-time audio and video.
3) Overview of VAD concepts and common uses in communications systems.
4) Explains DTX and its role in reducing transmissions during silence.
5) ITU-T standard for 64 kbps PCM telephony audio (PCMU/PCMA).
6) ITU-T standard for 8 kbps CS-ACELP speech coding used in VoIP.
7) Defines Opus interactive speech and audio codec used in VoIP/WebRTC.








