As CTO of XirSys, it is often my job to work out various intricacies of new technologies and how they fit in with our infrastructure. To do this, I have had to become something of a NAT traversal expert: building and testing our own TURN servers, understanding connectivity and its various problems and, ultimately, using that knowledge to help our customers through support tickets. Having answered many tickets over the past couple of years, I see common patterns in customer issues and questions, and I’m almost always ready with a cookie-cutter response to help point the customer in the right direction. I have decided, therefore, to write this article, explaining the technology and providing ideas about where to look to help get your application off the ground.
STUN and TURN
To debug your application or connection, it first pays to understand a little of what STUN and TURN actually do. Most people using WebRTC know that STUN helps you to connect peer-to-peer and TURN helps with relay, but few people bother to look into what actually happens. It is important to know this if you want to fix your issues, so I’m going to start by laying it all out here.
When one machine wishes to establish a peer-to-peer connection with another machine on a different network, the two first need to trade public IPs. It’s kind of like swapping phone numbers with a friend. However, machines on a network are usually only privy to their local IP, which will look something like 192.168.x.x. This is no good for connectivity, because the machines are on separate networks, so those IPs are useless. You can’t phone your friend if all you have is her telephone extension number, but not the actual outside telephone number.
When your machine makes requests to the internet, it sends a packet to a server via your NAT or router, including your local IP in the header. It is the job of the NAT to translate your local IP to a public IP, thereby updating the packet header. The packet then continues its journey to the server. When the server responds back to your computer, its packet will also contain your public IP in the header, but it gets translated to your local IP when it reaches your NAT, before being passed to you. Thus, your machine never gains access to the public IP.
The STUN server works in the same way, by acting as an echo server. However, in order to inform you of your public IP, it first copies the IP (and port) from the header into the packet body, which your NAT will leave intact. When you receive the packet, you will know your public IP and can share this with the peer.
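To make the echo idea concrete, here is a minimal sketch in Python of what a STUN client puts on the wire and how it reads its public address back out of the response body. It assumes the RFC 5389 message layout; the function names are my own illustrative choices, only IPv4 is handled, and the transaction ID is not checked against the response, so treat it as a sketch rather than a complete client.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
XOR_MAPPED_ADDRESS = 0x0020


def build_binding_request(transaction_id: bytes) -> bytes:
    """Return a 20-byte STUN Binding Request with no attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be exactly 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + transaction_id


def parse_xor_mapped_address(message: bytes):
    """Return (ip, port) from a Binding Success Response, or None (IPv4 only)."""
    msg_type, msg_len, cookie = struct.unpack("!HHI", message[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    offset, end = 20, 20 + msg_len          # attributes follow the 20-byte header
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", message[offset:offset + 4])
        value = message[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            # The address is XOR-ed with the magic cookie so that NATs which
            # rewrite IPs found inside packet bodies leave it alone.
            port = xport ^ (MAGIC_COOKIE >> 16)
            ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
            return ip, port
        offset += 4 + attr_len + (-attr_len % 4)   # attributes are padded to 4 bytes
    return None


def query_public_address(host: str, port: int = 3478, timeout: float = 2.0):
    """Send one Binding Request over UDP and return the reflected (ip, port)."""
    request = build_binding_request(os.urandom(12))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, port))
        response, _ = sock.recvfrom(2048)
    return parse_xor_mapped_address(response)
```

Calling query_public_address with the hostname of a STUN server you are entitled to use (the host is something you must supply) should return the address your NAT presents to the internet, provided UDP is not being blocked. A timeout here is itself a useful hint that STUN over UDP cannot get out of the network.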
TURN is an extension of STUN. It uses the same packet format, but extends it with additional features, enabling you to establish a server session for both you and the peer, so you can pass data to each other through the server. This is a necessary process if, for some reason, a direct peer-to-peer connection cannot be established between the two machines. The TURN connection has quite a complex handshake. In all connections with TURN, one party will be known as the client and the other party will be the peer. The client is the one who initiates the connection. This is important, because the client is responsible for sending all the necessary data and creating the link between the two machines. Only the client needs the authentication to use the TURN server. The client literally provides the access needed for the peer to connect, so the peer can be quite dumb about the process. Due to this ‘delegation of authority’, I have seen circumstances where TURN initiated in one direction fails, while initiating in the other direction works.
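The TURN handshake starts off with an ‘Allocation’. An allocation is essentially the client ‘allocating’ a port on the TURN server for the peer. It is at this point that the handshake may fail with an Allocate error response, usually because the client credentials are rejected. Note that a single 401 (Unauthorized) error at the very start is normally part of the authentication challenge and is followed by a second, authenticated request, so it is the error that follows the retry that matters. An Allocate success response will contain the IP and port that the peer must connect to on the TURN server.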
Next up is the CreatePermission packet, which is only sent if the allocation succeeded. CreatePermission is provided with the peer’s IP address and is used to bind the peer to the allocation. This is for security purposes and stops other peers – those who do not have permission – from connecting to the IP and port provided by the allocation. A permission expires after five minutes, so the client refreshes it at intervals shorter than that.
If all has gone well at this point, the client can now send data to the peer by wrapping it in a ‘Send Indication’ packet. Data travelling the other way is sent by the peer to the allocated IP and port, and the TURN server forwards it to the client wrapped in a ‘Data Indication’ packet. Under some circumstances, however, the client will opt to send a ‘ChannelBind’ request and then (optionally) send future packets as ‘ChannelData’ packets. ChannelData packets and Send Indications carry the same payload; the difference is that ChannelData packets have a much smaller header (4 bytes), resulting in smaller overall packets and thus less overhead for audio and video. Whether channels are used, however, is at the discretion of the WebRTC client, and I am not aware of anything that can be done programmatically to ensure this.
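The size difference between the two framings can be sketched in a few lines of Python. The layouts below assume RFC 5766 (a Send Indication carrying an IPv4 XOR-PEER-ADDRESS attribute and a DATA attribute, and a ChannelData message with a 2-byte channel number and 2-byte length); the function names and the 160-byte payload are illustrative assumptions of mine, not anything a WebRTC client exposes.

```python
import socket
import struct

MAGIC_COOKIE = 0x2112A442
SEND_INDICATION = 0x0016
XOR_PEER_ADDRESS = 0x0012
DATA = 0x0013


def build_send_indication(peer_ip: str, peer_port: int, payload: bytes,
                          transaction_id: bytes) -> bytes:
    """Wrap payload in a TURN Send Indication addressed to an IPv4 peer."""
    xport = peer_port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(peer_ip))[0] ^ MAGIC_COOKIE
    # Attribute header (4 bytes) + reserved byte, family, port, address (8 bytes).
    peer_attr = struct.pack("!HHBBHI", XOR_PEER_ADDRESS, 8, 0, 0x01, xport, xaddr)
    padding = b"\x00" * (-len(payload) % 4)
    data_attr = struct.pack("!HH", DATA, len(payload)) + payload + padding
    body = peer_attr + data_attr
    header = struct.pack("!HHI", SEND_INDICATION, len(body), MAGIC_COOKIE)
    return header + transaction_id + body


def build_channel_data(channel: int, payload: bytes) -> bytes:
    """Wrap payload in a ChannelData message (no padding, as used over UDP)."""
    if not 0x4000 <= channel <= 0x7FFF:      # channel range given in RFC 5766
        raise ValueError("channel number out of range")
    return struct.pack("!HH", channel, len(payload)) + payload


if __name__ == "__main__":
    payload = bytes(160)                      # stand-in for one media packet
    indication = build_send_indication("203.0.113.7", 5000, payload, bytes(12))
    channel = build_channel_data(0x4000, payload)
    print("Send Indication overhead:", len(indication) - len(payload), "bytes")
    print("ChannelData overhead:", len(channel) - len(payload), "bytes")
```

For an IPv4 peer, the Send Indication adds 36 bytes to every packet while ChannelData adds 4. On small audio packets, sent many times per second, that difference is a noticeable share of the bandwidth.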
The Connection Process
With this knowledge in tow, let’s look at how the WebRTC client establishes a connection with its peer. WebRTC uses ICE (Interactive Connectivity Establishment) together with an extension called Trickle ICE. ICE establishes a connection with the peer by trying out connectivity options in order until one passes, and Trickle ICE allows those options to be shared with the peer as they are discovered, rather than all at once. This can result in long initial connection times for the most complex NAT traversal, but results in very fast connectivity for simple situations. The trade-off is that only some will see a delay, rather than everyone. Connectivity information is shared with the peer via the signalling platform, which can be anything from WebSockets and COMET to emails and note taking: essentially, any means of getting a chunk of text data from client to peer and vice versa. The packets describing the connectivity options are called ICE candidates, and they look a little like this:
a=candidate:4022866446 1 udp 2113937151 192.168.0.197 36768 typ host generation 0
This particular ICE candidate shows the machine’s local IP and a ‘typ’ value of ‘host’. There are three types of ICE candidate that you will typically see exchanged: host, srflx and relay.
Host candidates are for same-network connections. If you’re connecting to a peer who is on the same network as you, and thus behind the same NAT, then you don’t need STUN or TURN. There’s no need for a public IP, because you’ll connect behind your NAT without having to pass through the public internet. Likewise, there’s no need for TURN, because there will be no problem seeing peers on your own network. It is for this reason that the host candidates are the first to be shared and the first connectivity to be attempted.
If the same-network connectivity fails, the client will attempt to make a STUN request to acquire its public IP. STUN requests, at least for WebRTC, are UDP only, so if UDP is blocked this step fails and is skipped entirely. If the client acquires its IP, it’ll send an SRFLX (server reflexive) candidate to the peer. The peer will then repeat the process to acquire its IP and, if successful, will send that back to the client.
a=candidate:1853887674 1 udp 1845501695 220.127.116.11 36768 typ srflx raddr 192.168.0.197 rport 36768 generation 0
If both client and peer manage to acquire their IPs and share them, this does not guarantee that peer-to-peer connectivity will work, only that it will be attempted.
If peer-to-peer fails, or if the STUN requests fail due to UDP packet blocking, the client will go ahead and make a TURN allocation with the TURN server. An allocation returns the client’s public IP in its response, so this is the client’s second chance to learn it. Once the allocation succeeds, the client sends relay candidates to the peer.
a=candidate:2157334355 1 udp 33562367 18.104.22.168 54278 typ relay raddr 22.214.171.124 rport 38135 generation 0
Once relay candidates have been exchanged, TURN connectivity will be attempted and will, hopefully, succeed. For more information about ICE candidates and the SDP (Session Description Protocol) format in general, take a look at The Anatomy of a WebRTC SDP over at WebRTCHacks, for an interactive breakdown.
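If you are collecting candidates from logs or from your signalling channel, it helps to pull them apart programmatically. The following is a minimal Python sketch that parses candidate lines like the three shown above and decodes the priority field using the formula from the ICE specification (type preference × 2^24 + local preference × 2^8 + 256 − component). The class and function names are my own, and the parser only understands the simple key/value extensions seen in these examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IceCandidate:
    foundation: str
    component: int
    transport: str
    priority: int
    ip: str
    port: int
    typ: str                       # "host", "srflx" or "relay"
    raddr: Optional[str] = None    # related (base) address, if present
    rport: Optional[int] = None


def parse_candidate(line: str) -> IceCandidate:
    """Parse one 'a=candidate:...' line into an IceCandidate."""
    text = line.strip()
    if text.startswith("a="):
        text = text[2:]
    if not text.startswith("candidate:"):
        raise ValueError("not an ICE candidate line")
    fields = text[len("candidate:"):].split()
    foundation, component, transport, priority, ip, port = fields[:6]
    # Everything after the first six fields is a sequence of key/value pairs.
    extras = dict(zip(fields[6::2], fields[7::2]))
    rport = extras.get("rport")
    return IceCandidate(
        foundation=foundation,
        component=int(component),
        transport=transport.lower(),
        priority=int(priority),
        ip=ip,
        port=int(port),
        typ=extras["typ"],
        raddr=extras.get("raddr"),
        rport=int(rport) if rport is not None else None,
    )


def decode_priority(priority: int):
    """Split a priority into (type preference, local preference, component)."""
    type_preference = priority >> 24
    local_preference = (priority >> 8) & 0xFFFF
    component = 256 - (priority & 0xFF)
    return type_preference, local_preference, component


if __name__ == "__main__":
    line = ("a=candidate:4022866446 1 udp 2113937151 "
            "192.168.0.197 36768 typ host generation 0")
    candidate = parse_candidate(line)
    print(candidate.typ, candidate.ip, decode_priority(candidate.priority))
```

Run against the three example candidates, the type preference comes out as 126 for the host candidate, 110 for the srflx candidate and 2 for the relay candidate, which is why the options are tried in that order: the higher the priority, the earlier the attempt.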
There are many reasons why a connection may fail, and understanding what is happening will help you decipher what the problem actually is. We’ll take a look at several of the symptoms here and discuss them. However, let’s first look at the reasoning behind many issues and identify the general problem.
STUN and TURN are DUMB
At XirSys, we use every trick under the sun to give our TURN servers every opportunity to establish connectivity. Despite this, however, no amount of clever server-side trickery is going to amount to anything if the client never contacts the server in the first place. As an active member on the support tickets, I often receive questions such as “Can you please check the server logs to see why we’re not connecting” or “I think your TURN server is to blame for the connection not being established”.
Now, I can go on all day about how well our servers manage almost impossible connectivity issues until I’m exhausted and blue in the face, but let’s look at this logically. Firstly, we provide connectivity to many thousands of connections per minute to many thousands of happy customers. We have very large bandwidth capabilities and spread our TURN servers out across a cluster. We provide the full range of ICE server connectivity options on standard ports as well as port 443, for those really awkward networks, and we have found our WebRTC connectivity often works where VOIP applications, such as Skype, fail.
Secondly, when looking at the topology of all of this, many call failures happen before any form of connectivity to the STUN and TURN servers has occurred. Now, the TURN server is a very simple beast with a very complex protocol. If you connect and you stream data, your connection will work. If you don’t connect or you fail to send data, then it can’t possibly work. None of this, however, is the fault of the TURN server. Therefore, each time a connection fails to work, you must first look at the client application and local network. After all, we can’t possibly send you connectivity logs if the connection never occurred in the first place.
ICE Candidate Issues
When attempting to work out the problem with your connection, try running your application with WebRTC-Internals open. You can do this by opening a browser tab and navigating to chrome://webrtc-internals. Now, keep that tab open while running your app and then take a look at what WebRTC-Internals has captured.
The first thing you’ll notice is that it lists all of the ICE candidates generated during the call. You can open them all up and identify their ‘typ’ attribute, as noted previously in this article. Now, if all you see are HOST candidates, then your application was never able to reach out to our servers to help establish a connection. Sometimes, receiving only HOST candidates can mean that your network is restricting both the STUN UDP packets, as well as the TURN allocations over TCP (deep packet sniffing maybe?). However, as even I have encountered today, while running an internal project, HOST candidates can also indicate a coding issue. The quickest way to work this out is to run a WebRTC demo you know to work and to see what candidates that produces. Running our own demo internally quickly helped us work out that it was in fact a code issue that was to blame.
SRFLX (STUN) Candidates
If you see SRFLX candidates, then are you seeing them from both sides? In order to connect peer-to-peer, both client and peer need to send these. Receiving these on one side is not enough. What’s more, if you see SRFLX candidates and the connection is not occurring, why are you not also seeing RELAY candidates? Are you certain you’re using the correct ICE server string? Perhaps you forgot to swap out the public Google STUN server URL you were originally testing with?
RELAY (TURN) Candidates
So, your client is producing RELAY candidates. This is great. This means your client has at least sent Allocation packets to the TURN server and received a response, so it knows relay is an option. So, why isn’t it connecting? Does your network perform deep packet inspection and refuse UDP tunnelled through TCP, or is it something else?
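The checks in the three sections above can be turned into a rough first-pass triage. Here is a minimal Python sketch that takes the candidate lines gathered on each side (for example, copied out of WebRTC-Internals) and returns hints in the spirit of this article. The function names and hint wording are my own assumptions, and the output is a starting point for investigation rather than a verdict.

```python
import re

ONLY_HOST = ("Only host candidates: the client never reached a STUN or TURN "
             "server. Check for blocked UDP/TCP, then compare against a "
             "known-good demo to rule out a coding issue.")
ONE_SIDED_SRFLX = ("srflx candidates on this side only: peer-to-peer needs "
                   "them from both client and peer.")
NO_RELAY = ("No relay candidates: check the ICE server list (TURN URL and "
            "credentials) that is passed to the connection.")
RELAY_OK = ("Relay candidates present: the TURN allocation succeeded, so "
            "look at the network path next, ideally with a packet capture.")


def candidate_types(lines):
    """Return the set of 'typ' values found in a list of candidate lines."""
    types = set()
    for line in lines:
        match = re.search(r"\btyp\s+(\w+)", line)
        if match:
            types.add(match.group(1))
    return types


def diagnose(local_lines, remote_lines):
    """Return a list of hints based on which candidate types each side produced."""
    local = candidate_types(local_lines)
    remote = candidate_types(remote_lines)
    if not local & {"srflx", "relay"}:
        return [ONLY_HOST]
    hints = []
    if "srflx" in local and "srflx" not in remote:
        hints.append(ONE_SIDED_SRFLX)
    hints.append(RELAY_OK if "relay" in local else NO_RELAY)
    return hints


if __name__ == "__main__":
    local = [
        "a=candidate:4022866446 1 udp 2113937151 192.168.0.197 36768 typ host generation 0",
        "a=candidate:1853887674 1 udp 1845501695 220.127.116.11 36768 typ srflx "
        "raddr 192.168.0.197 rport 36768 generation 0",
    ]
    for hint in diagnose(local, remote_lines=[]):
        print("-", hint)
```

Run it once from each side of the call, swapping the two lists, since a problem that is invisible from the client’s point of view is often obvious from the peer’s.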
Wireshark to the Rescue
To get a deeper look at what’s going on, try running Wireshark. You can even capture sessions of your application connectivity and send them to us, along with your ICE candidates, so we can help decipher the problem with you. Wireshark is a free application that works on most operating systems. It works by analysing the packets travelling to and from your machine, so any packets your machine sends or receives should be captured. Once captured, you can sort the packets sent during your connection attempts by ‘Protocol’ and simply scroll to the packets marked as STUN to see just what happened during your tests. Both STUN and TURN packets are marked as the STUN protocol, because TURN uses the same packet format, as mentioned earlier in this article, so you should be able to get a full picture of events very easily.
In a STUN exchange, the packets are listed as Bindings. A Binding is a request or response from STUN. In the response, the server (in our case, the TURN server, which also answers STUN requests) returns the client’s IP address stored in the XOR-MAPPED-ADDRESS attribute.
A TURN session shows up as Allocate, CreatePermission and Indication packets. If the client makes allocations for Trickle ICE, then those allocations may be all that you see. Under such circumstances, TURN has likely not succeeded. If, however, the Permission and Indication packets are clearly visible, then you know that TURN connectivity has been established successfully.
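The names Wireshark shows for these packets come from the 16-bit message type at the start of each STUN or TURN message. As a minimal sketch, the Python below decodes that field using the method and class bit layout from RFC 5389 and the TURN methods from RFC 5766, then applies the rule of thumb from this section to a list of message types taken from a capture. The function names and the three summary strings are my own, and ChannelData packets, which are not STUN-formatted messages, are not covered.

```python
METHODS = {
    0x001: "Binding",
    0x003: "Allocate",
    0x004: "Refresh",
    0x006: "Send",
    0x007: "Data",
    0x008: "CreatePermission",
    0x009: "ChannelBind",
}
CLASSES = {
    0b00: "Request",
    0b01: "Indication",
    0b10: "Success Response",
    0b11: "Error Response",
}


def describe(message_type: int) -> str:
    """Turn a STUN message type such as 0x0101 into a readable name."""
    # The two class bits sit at bit 8 and bit 4, interleaved with the method bits.
    message_class = ((message_type & 0x0100) >> 7) | ((message_type & 0x0010) >> 4)
    method = ((message_type & 0x000F)
              | ((message_type & 0x00E0) >> 1)
              | ((message_type & 0x3E00) >> 2))
    name = METHODS.get(method, "Method 0x%03x" % method)
    return "%s %s" % (name, CLASSES[message_class])


def summarize(message_types) -> str:
    """Classify a capture as 'no TURN traffic', 'allocation only' or 'relay established'."""
    names = {describe(t) for t in message_types}
    has_permission = "CreatePermission Success Response" in names
    has_indication = bool(names & {"Send Indication", "Data Indication"})
    if has_permission and has_indication:
        return "relay established"
    if any(name.startswith("Allocate") for name in names):
        return "allocation only"
    return "no TURN traffic"


if __name__ == "__main__":
    capture = [0x0003, 0x0113, 0x0003, 0x0103, 0x0008, 0x0108, 0x0016, 0x0017]
    for message_type in capture:
        print("0x%04x" % message_type, describe(message_type))
    print(summarize(capture))
```

A capture that summarizes as ‘allocation only’ matches the situation described above, where the client made its allocations but relay was never actually used.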
Connected, but No Video and / or Audio
As I mentioned above, the STUN and TURN servers are dumb. The TURN server will stream any packets sent to it from the client to the peer and vice versa. If you’ve successfully connected, then think about what else could be affecting your video stream. Are you passing the correct stream to the connection object? Are you capturing the stream at all? The point is, the TURN connectivity is working; it is your application, or the browser, that is failing to send either audio or video.
Connected, but Only For X Minutes
The TURN server maintains an allocation for a period of time after disconnection. If you’re experiencing disconnection issues, it’s likely to be a network problem. However, as with all networks, connectivity can be re-established, and this doesn’t have to mean setting up the connection again from scratch. Simply repeating the last few steps of initiating the connection should get you re-established, often with no sign that the disconnection ever happened.
The point of this article was to highlight the roles that STUN and TURN play in the connectivity game and to provide some ways of figuring out where your problems lie. With this knowledge, tracking down that buggy code should be so much easier. We hope this article helps you track down those pesky bugs, but remember, you are free to send us your ICE candidates and Wireshark captures if you’d like a second set of eyes on the problem.