Voice and video calling on self-hosted infrastructure

Text messaging is the foundation of enterprise communication platforms, but voice and video calling have become baseline expectations. For organizations running self-hosted messaging, delivering reliable real-time media is achievable—but it introduces technical requirements that differ fundamentally from text-based communication.

Why real-time media is different

Text messages are small, tolerant of latency, and trivially stored. A message that arrives 200 milliseconds late is indistinguishable from one that arrives instantly. Voice and video traffic is none of these things. Audio streams require consistent delivery with one-way latency below roughly 150 milliseconds for natural conversation. Video adds bandwidth demands that scale with resolution and participant count. Both are sensitive to packet loss: even 1-2% loss can produce audible artifacts in voice and visible degradation in video, although codec-level loss concealment and forward error correction mask some of it.

These characteristics mean that the infrastructure and software stack for real-time media must be designed with different priorities than the messaging layer. A messaging server that handles thousands of concurrent text conversations on modest hardware may struggle with a dozen simultaneous video calls if media processing is not appropriately resourced.

Infrastructure requirements

The core component for self-hosted voice and video is a media server, most commonly a Selective Forwarding Unit (SFU). An SFU receives media streams from each participant and forwards them selectively to other participants, reducing each client's upstream bandwidth compared to mesh topologies where each participant sends to every other participant directly. Open-source SFUs such as Jitsi Videobridge, LiveKit, and mediasoup provide mature, deployable solutions.
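The uplink difference between mesh and SFU topologies can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 1,500 kbps per-stream bitrate is an assumed figure, not a measurement from any particular deployment.

```python
def uplink_streams(participants: int, topology: str) -> int:
    """Number of copies of its own media each participant must upload."""
    if participants < 2:
        return 0
    if topology == "mesh":
        # Mesh: one copy per remote participant.
        return participants - 1
    if topology == "sfu":
        # SFU: a single copy goes to the server, which fans it out.
        return 1
    raise ValueError(f"unknown topology: {topology}")


def uplink_kbps(participants: int, topology: str, stream_kbps: int = 1500) -> int:
    """Per-participant upstream bandwidth; stream_kbps is an assumed bitrate."""
    return uplink_streams(participants, topology) * stream_kbps


for n in (2, 4, 8):
    print(f"{n} participants: mesh={uplink_kbps(n, 'mesh')} kbps, "
          f"sfu={uplink_kbps(n, 'sfu')} kbps")
```

Under these assumptions an eight-person mesh call asks each client to upload seven copies of its video, while the SFU case stays at one copy regardless of call size.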

CPU and memory requirements depend on the SFU implementation and expected concurrent session load. Video forwarding without transcoding is less CPU-intensive than a Multipoint Control Unit (MCU) that composites streams, but features like recording, simulcast layer selection, and server-side noise suppression add overhead. Capacity planning should be based on load testing with realistic participant counts.
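Before load testing, a rough upper bound on the network side of SFU capacity can be computed from participant counts. This is a back-of-the-envelope sketch, assuming every participant publishes one stream at an assumed 1,500 kbps and receives every other participant's stream in full. Simulcast layer selection and limits on forwarded streams would lower the real figure, and CPU cost is not modeled at all.

```python
from typing import Tuple


def sfu_bandwidth_mbps(calls: int, participants_per_call: int,
                       stream_kbps: int = 1500) -> Tuple[float, float]:
    """Return (ingress_mbps, egress_mbps) for an SFU forwarding without transcoding.

    Worst case: each participant sends one stream in and receives one
    stream from every other participant in the same call.
    """
    ingress_streams = calls * participants_per_call
    egress_streams = calls * participants_per_call * (participants_per_call - 1)
    return (ingress_streams * stream_kbps / 1000,
            egress_streams * stream_kbps / 1000)


ingress, egress = sfu_bandwidth_mbps(calls=12, participants_per_call=6)
print(f"ingress ~{ingress:.0f} Mbps, egress ~{egress:.0f} Mbps")
```

Egress grows with the square of call size, which is why a handful of large calls can stress a server that handles many small ones comfortably. An estimate like this is only a starting point for the load test, not a substitute for it.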

Network quality matters more than network speed. A 1 Gbps connection with consistent 5 ms jitter will deliver better call quality than a 10 Gbps connection with sporadic 50 ms spikes. QoS policies that prioritize real-time media traffic over bulk transfers improve reliability. For organizations with multiple offices, placing media servers close to user concentrations reduces latency: a media server in the same region as its users will outperform one across the continent regardless of bandwidth.
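Jitter here means variation in packet transit time, and it can be estimated with the smoothed interarrival formula from RFC 3550 (the RTP specification). The sketch below applies that formula to lists of send and receive times. The function name and the use of milliseconds rather than RTP timestamp units are simplifications made for illustration.

```python
from typing import Sequence


def interarrival_jitter(send_ms: Sequence[float], recv_ms: Sequence[float]) -> float:
    """Smoothed interarrival jitter, following the RFC 3550 estimator.

    For each consecutive packet pair, D is the change in transit time;
    the running estimate moves 1/16 of the way toward |D|.
    """
    if len(send_ms) != len(recv_ms):
        raise ValueError("send and receive lists must have the same length")
    jitter = 0.0
    for i in range(1, len(send_ms)):
        d = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(d) - jitter) / 16
    return jitter


send = [0, 20, 40, 60]            # packets sent every 20 ms
steady = [50, 70, 90, 110]        # constant 50 ms transit time
spiky = [50, 70, 140, 160]        # third packet delayed by an extra 50 ms

print(interarrival_jitter(send, steady))
print(interarrival_jitter(send, spiky))
```

A constant delay, however large, contributes nothing to this estimate. Only changes in delay do, which is why a stable link can outperform a faster but erratic one.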

TURN (Traversal Using Relays around NAT) servers are necessary for participants behind restrictive NATs or firewalls. The TURN server relays media when direct peer-to-peer or client-to-SFU connections cannot be established. Running self-hosted TURN servers—using coturn or similar—keeps relay traffic on owned infrastructure. Failing to provision TURN capacity is a common source of “calls work in the office but fail for remote users” complaints.
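Self-hosted TURN servers are usually protected with short-lived credentials rather than static passwords. The sketch below shows the shared-secret scheme commonly used for this, assuming the TURN server is configured for time-limited credentials (in coturn, the use-auth-secret mode with a static-auth-secret value). The function name, secret, and user ID are placeholders, and the exact configuration should be checked against the TURN server's own documentation.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional, Tuple


def turn_credentials(shared_secret: str, user_id: str,
                     ttl_seconds: int = 3600,
                     now: Optional[float] = None) -> Tuple[str, str]:
    """Build a time-limited TURN username/password pair.

    username = "<expiry unix time>:<user id>"
    password = base64(HMAC-SHA1(shared_secret, username))
    """
    current = time.time() if now is None else now
    expiry = int(current + ttl_seconds)
    username = f"{expiry}:{user_id}"
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    password = base64.b64encode(digest).decode()
    return username, password


# Placeholder values for illustration only.
username, password = turn_credentials("replace-with-real-secret", "alice", ttl_seconds=600)
print(username, password)
```

The signaling server would hand these values to a client just before call setup. Because the expiry time is part of the signed username, a leaked credential stops working once the time-to-live elapses.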

Encryption and privacy

WebRTC, the standard underpinning most modern voice and video implementations, requires media streams to be encrypted with DTLS-SRTP. This provides encryption in transit between each client and the SFU. However, the SFU itself has access to the unencrypted media streams as they pass through, because DTLS-SRTP terminates at the SFU, which decrypts incoming packets and re-encrypts them for each recipient.

For organizations requiring true end-to-end encryption of media (where even the server cannot access the stream content), the Insertable Streams API (now standardized as WebRTC Encoded Transform) lets clients apply an additional encryption layer to each encoded frame before it reaches the SFU, typically using a frame-encryption format such as SFrame. The SFU forwards the encrypted payloads without being able to decrypt them. Support varies across browsers and SFU implementations, and the approach rules out server-side features that need the decrypted media, such as recording and transcription.

Operational considerations

Monitoring real-time media requires metrics that text messaging infrastructure does not typically capture: packet loss rate, jitter, round-trip time, audio/video codec negotiation outcomes, and call setup latency. Integrating these metrics into the organization’s observability stack enables proactive detection of quality degradation before users report it.
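As an example of turning raw counters into a quality metric, the sketch below derives a packet loss rate for one reporting interval from two snapshots of cumulative receiver counters, such as the packetsReceived and packetsLost fields in WebRTC inbound RTP statistics. The class name, function names, and the 1% alert threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class InboundSnapshot:
    """Cumulative receiver counters sampled at one point in time."""
    packets_received: int
    packets_lost: int


def interval_loss_rate(prev: InboundSnapshot, curr: InboundSnapshot) -> float:
    """Fraction of expected packets lost between two snapshots."""
    received = curr.packets_received - prev.packets_received
    # The cumulative loss counter can move backwards when late or duplicate
    # packets arrive, so clamp the interval delta at zero.
    lost = max(curr.packets_lost - prev.packets_lost, 0)
    expected = received + lost
    if expected <= 0:
        return 0.0
    return lost / expected


def should_alert(loss_rate: float, threshold: float = 0.01) -> bool:
    """Flag intervals at or above the (assumed) 1% loss threshold."""
    return loss_rate >= threshold


prev = InboundSnapshot(packets_received=1000, packets_lost=10)
curr = InboundSnapshot(packets_received=1980, packets_lost=30)
rate = interval_loss_rate(prev, curr)
print(f"loss {rate:.1%}, alert={should_alert(rate)}")
```

Computing loss per interval rather than over the whole call keeps a brief burst of loss from being averaged away by a long, clean start to the session.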

Scaling real-time media horizontally is more complex than scaling a messaging API. Session affinity, meaning that all participants in a call connect to the same SFU instance, is the simplest way to keep a call coherent; spreading one call across instances requires the SFUs to relay streams to each other. Load balancing strategies must account for session state and geographic proximity, not just request volume.
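One way to obtain session affinity without a shared lookup table is to derive the SFU instance deterministically from the call identifier. The sketch below uses rendezvous (highest-random-weight) hashing as an illustrative technique. It is not how any particular SFU assigns rooms, and the node names and function name are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_sfu(room_id: str, sfu_nodes: Sequence[str]) -> str:
    """Map a room to one SFU node using rendezvous hashing.

    Every signaling server that knows the same node list picks the same
    node for a given room, so all participants land on one instance.
    """
    if not sfu_nodes:
        raise ValueError("no SFU nodes available")

    def score(node: str) -> bytes:
        return hashlib.sha256(f"{node}|{room_id}".encode()).digest()

    return max(sfu_nodes, key=score)


nodes = ["sfu-a.example.internal", "sfu-b.example.internal", "sfu-c.example.internal"]
print(pick_sfu("weekly-standup", nodes))
```

Removing a node only moves the rooms that were assigned to it. The sketch ignores current load and geographic proximity, so a production selector would first narrow the candidate list to healthy nodes near the participants and only then apply the hash.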

Takeaway

Self-hosted voice and video calling is technically demanding but entirely feasible with current open-source tooling. Success depends on respecting the distinct requirements of real-time media: low-latency networking, appropriately sized media servers, TURN relay capacity, and monitoring that captures quality metrics. Organizations that approach it with the same rigor they apply to other production infrastructure can deliver a communication experience that matches cloud alternatives and, in privacy terms, can exceed them.