Adding Voice Chat to Space Nerds in Space

2020-04-02 — Here’s a little description of what it took to add voice chat to Space Nerds in Space.

For audio in general, I am using portaudio, which is a somewhat low level sound library. It is low level in that it requires you to write your own mixer code, and does not provide primitives for playing WAV files or Ogg files or anything like that. You write a callback function which portaudio calls at a specified frequency, and it's the job of this function to provide a buffer of audio data for portaudio to play during the next little bit of time. If you want to play multiple sounds concurrently, this function must keep track of the current position within each sound, mix the sounds together, and hand portaudio the mixed fragment of audio data on each callback.
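
To make that concrete, here's the general shape of such a callback. This is a minimal sketch with made-up mixer state, not the actual Space Nerds in Space code:

#include <stdint.h>
#include <string.h>
#include <portaudio.h>

#define MAX_SOUNDS 10

static struct sound {
        int16_t *samples;       /* preloaded PCM data */
        int nsamples;           /* total samples in the sound */
        int pos;                /* how much of it we've played so far */
        int active;
} sound[MAX_SOUNDS];            /* hypothetical mixer state */

static int mixer_callback(const void *input, void *output,
                unsigned long frames, const PaStreamCallbackTimeInfo *time_info,
                PaStreamCallbackFlags status_flags, void *user_data)
{
        int16_t *out = output;

        memset(out, 0, frames * sizeof(*out)); /* start with silence */
        for (int i = 0; i < MAX_SOUNDS; i++) { /* mix in every active sound */
                struct sound *s = &sound[i];
                if (!s->active)
                        continue;
                for (unsigned long j = 0; j < frames && s->pos < s->nsamples; j++)
                        out[j] += s->samples[s->pos++]; /* naive mix; real code must clip */
                if (s->pos >= s->nsamples)
                        s->active = 0; /* sound finished playing */
        }
        return paContinue; /* keep the stream going */
}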

I had long used portaudio, and had built up a small audio library around it that does provide very basic features like simple triggering of playback of particular preloaded sounds, mixing however many such sounds as required, and so on.
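
The interface amounts to something like the following (a simplified sketch from memory; the real declarations live in the wwviaudio library, and the exact signatures may differ):

/* Sketch of the kind of interface the library provides: */
int wwviaudio_read_ogg_clip(int sound_number, char *filename); /* preload a sound */
int wwviaudio_add_sound(int sound_number);  /* trigger playback, mixed with whatever else is playing */
void wwviaudio_cancel_sound(int sound_number); /* stop a sound that is playing */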

But for voice chat, I needed more than just simple playback of preloaded sounds.

  1. We need to make sure the number of concurrent streams sent to a client from the server does not exceed the number of streams the client can handle.
  2. We need the ability to record sound, and receive a stream of audio data from the microphone as a series of callbacks.
  3. We need the ability to compress this audio data into packets and send them to my server process. For this I used libopus.
  4. We need the server to forward these packets to destination clients, keeping in mind that there might be multiple clients streaming audio to the server, and these would need to be then fanned back out to the destination clients. The streams would need to be kept separate though, because the decompressor is stateful, and you can’t combine multiple streams of packets and send them through a single decompressor instance. They each need their own decompressor.
  5. At the clients, we need to receive the audio data streams from the server and decompress them, and queue them up for the mixer to chew on.

To ensure that clients never receive more streams of audio than they can handle, a token system is used.

There is one token for each audio channel the clients are able to handle (nominally there are 4 of them). Just before a client begins recording and transmitting audio data to the server, it requests a token from the server. It then transmits to the server (whether or not it eventually gets the token) knowing that if it doesn't get a token, the server will just drop its packets. The server has a fixed number of tokens which it assigns to clients as they ask for them until they are all in use. If the server receives audio packets from a client which it knows does not have a token, it just drops those packets. In this way, the server never transmits more streams of audio than the clients can handle.
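
Conceptually the server-side bookkeeping for the tokens is quite small. Here's a sketch of the idea (the owner array and function names are made up, and the real snis_server code also does locking, but NO_TALKING_STICK is the same constant you'll see in the server code below):

#define NTALKING_STICKS 4
#define NOBODY (-1)

static int talking_stick_owner[NTALKING_STICKS] = { NOBODY, NOBODY, NOBODY, NOBODY };

/* Grant a free token to a client, or NO_TALKING_STICK if all are in use. */
static int grant_talking_stick(int client)
{
        for (int i = 0; i < NTALKING_STICKS; i++) {
                if (talking_stick_owner[i] == NOBODY) {
                        talking_stick_owner[i] = client;
                        return i;
                }
        }
        return NO_TALKING_STICK; /* audio from this client will be dropped */
}

/* Reclaim whatever token a client holds when it releases it (or disconnects). */
static void reclaim_talking_stick(int client)
{
        for (int i = 0; i < NTALKING_STICKS; i++)
                if (talking_stick_owner[i] == client)
                        talking_stick_owner[i] = NOBODY;
}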

Recording audio

Recording is triggered by a keypress event and terminated by a key release event, as it is a “push to talk” system. Pressing the key requests a token from the server (but doesn't wait for it to be given) and starts the recording process, which sets up a portaudio thread that reads from the microphone and periodically calls back a function, passing along the recorded PCM audio data 1920 samples at a time (at a sampling rate of 48000 Hz). (I chose 1920 and 48kHz because these are reasonable values supported by libopus. This did mean I had to resample my existing audio files from 44.1kHz to 48kHz.)
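
Setting up that kind of recording stream with portaudio boils down to something like this (a sketch; mic_callback is a stand-in name, and the real code is wrapped up in my audio library with more error checking):

#define VC_SAMPLE_RATE 48000
#define VC_BUFFER_SIZE 1920 /* 40ms of audio at 48kHz; a frame size libopus supports */

        PaStream *stream;
        PaError rc;

        /* One input channel (the mic), zero output channels, 16-bit samples,
         * delivering VC_BUFFER_SIZE samples per call to mic_callback. */
        rc = Pa_OpenDefaultStream(&stream, 1, 0, paInt16, VC_SAMPLE_RATE,
                                VC_BUFFER_SIZE, mic_callback, NULL);
        if (rc == paNoError)
                rc = Pa_StartStream(stream);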

We transmit without waiting for the token from the server so that the instant the token is granted (before the client even receives it) the server may begin accepting audio packets from the client. It's also easier to code, since we don't need any logic to wait for the token.

The key-press handling code looks like this:

        if (event->keyval == GDK_F12) { /* F12 key pressed? */
                pthread_mutex_lock(&voip_mutex);
                if (!have_talking_stick) {
                        pthread_mutex_unlock(&voip_mutex);
                        request_talking_stick();
                } else {
                        pthread_mutex_unlock(&voip_mutex);
                }
                /* We transmit regardless of whether we have a talking stick.
                 * If we do not have it, snis_server will drop our messages */
                if (control_key_pressed)
                        voice_chat_start_recording(VOICE_CHAT_DESTINATION_ALL, 0);
                else
                        voice_chat_start_recording(VOICE_CHAT_DESTINATION_CREW, 0);
        }

voice_chat_start_recording() ultimately starts up a portaudio thread to begin recording data and calling the recording_callback function described below.

This recording callback function cannot just compress and transmit the data to the server directly, as it might conceivably fall behind the recording process if, say, writing to the network socket blocks or is slow. So it puts the data into an “outgoing” queue and returns. It looks like this:

static void recording_callback(void *cookie, int16_t *buffer, int nsamples)
{
        if (nsamples != VC_BUFFER_SIZE)
                return;
        recording_buffer.nsamples = nsamples;
        if (recording_audio)
                recording_level = get_max_level(&recording_buffer);
        else
                recording_level = 0;
        pthread_mutex_lock(&outgoing.mutex);
        enqueue_audio_data(&outgoing, recording_buffer.audio_buffer, recording_buffer.nsamples,
                        recording_buffer.destination, recording_buffer.snis_radio_channel);
        pthread_mutex_unlock(&outgoing.mutex);
}
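
The queue itself is nothing fancy, just a mutex-protected linked list with a condition variable so the consumer thread can sleep until data shows up. Its shape is something like this (a sketch of the idea, not the actual structures, which also carry the destination and radio channel along):

struct audio_buffer {
        struct audio_buffer *next;
        /* ... PCM samples, sample count, destination, radio channel ... */
};

struct audio_queue {
        pthread_mutex_t mutex;
        pthread_cond_t event_cond;      /* signaled when a buffer arrives */
        struct audio_buffer *head, *tail;
        int time_to_stop;               /* tells the consumer thread to exit */
};

/* Append a buffer to the queue and wake the consumer.
 * Caller holds q->mutex, as in recording_callback() above. */
static void append_audio_buffer(struct audio_queue *q, struct audio_buffer *b)
{
        b->next = NULL;
        if (q->tail)
                q->tail->next = b;
        else
                q->head = b;
        q->tail = b;
        pthread_cond_broadcast(&q->event_cond);
}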

When the “transmit” key is released, the portaudio recording thread is stopped, and if the client is in possession of any token, it is released to the server where it may then be handed out again to whichever client asks for a token.

        if (event->keyval == GDK_F12) {
                voice_chat_stop_recording(); /* This shuts down the portaudio thread that was recording. */
                /* We release even if we don't have it; snis_server will know the real deal. */
                release_talking_stick();
        }

Compressing audio

There is then another thread that consumes audio data from this outgoing queue, compresses it with libopus, then sends it on to the server. The meat of that function looks like this:

        while (1) {

                /* Get an audio buffer from the queue */
                pthread_mutex_lock(&q->mutex);
                b = dequeue_audio_buffer(q);
                if (!b) {
                        rc = pthread_cond_wait(&q->event_cond, &q->mutex);
                        if (q->time_to_stop) {
                                pthread_mutex_unlock(&q->mutex);
                                goto quit;
                        }
                        pthread_mutex_unlock(&q->mutex);
                        if (rc != 0)
                                fprintf(stderr, "pthread_cond_wait failed %s:%d.\n", __FILE__, __LINE__);
                        continue;
                }
                pthread_mutex_unlock(&q->mutex);
/* ... */
                /* Encode audio buffer */
                len = opus_encode(encoder, b->audio_buffer, VC_BUFFER_SIZE, b->opus_buffer, OPUS_PACKET_SIZE);
                if (len < 0) { /* Error */
                        fprintf(stderr, "opus_encode failed: %s\n", opus_strerror(len));
                        goto quit;
                }

                /* Transmit audio buffer to server */
                transmit_opus_packet_to_server(b->opus_buffer, len, b->destination, b->snis_radio_channel);
                free(b);
        }

The function that does the compression is opus_encode(), and transmit_opus_packet_to_server() transmits the compressed audio to the server.
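
For completeness, creating and configuring the encoder looks roughly like this. These are standard libopus calls; the bitrate is just an illustrative value, not necessarily what the game uses:

        int err;
        OpusEncoder *encoder;

        /* 48kHz, mono, tuned for speech. */
        encoder = opus_encoder_create(48000, 1, OPUS_APPLICATION_VOIP, &err);
        if (err != OPUS_OK) {
                fprintf(stderr, "opus_encoder_create failed: %s\n", opus_strerror(err));
                return -1;
        }
        opus_encoder_ctl(encoder, OPUS_SET_BITRATE(24000)); /* e.g. 24 kbit/s; plenty for voice */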

Receiving and routing the audio on the server

When the server receives a packet of compressed audio from a client, it knows which client it came from (by which socket it came in on and which thread is monitoring that socket), and which token, if any, that client currently possesses (the server handed out the tokens in the first place, so it remembers which client has which one, if any).

If the client does not have a token, the packet is dropped. If it does have a token, then this token determines which of the 4 audio channels this data belongs to, and the data is fanned out to the destination clients along with the token number. The client which sent the data is generally excluded from receiving its own audio data back, as there’s no point in repeating back to them what they just said but with a slight delay.

That code looks like this:

        pthread_mutex_lock(&universe_mutex);
        client_lock();
        if (c->talking_stick == NO_TALKING_STICK) {
                /* Client does not have talking stick. */
                client_unlock();
                pthread_mutex_unlock(&universe_mutex);
                return 0;
        }
        /* Ignore audio chain from client, it put NO_TALKING_STICK there anyway 'cause it doesn't know */
        audio_chain = c->talking_stick;
        client_unlock();
        pthread_mutex_unlock(&universe_mutex);
        pb = packed_buffer_allocate(10 + datalen);
        packed_buffer_append(pb, "bhbwhr", OPCODE_OPUS_AUDIO_DATA,
                                (uint16_t) audio_chain, destination, radio_channel, datalen, buffer, datalen);

        /* Don't send a client's own audio back at him. */
        except.nclients = 1;
        except.client[0] = c - &client[0];
        except.shipid[0] = c->shipid;

        switch (destination) {
        case VOICE_CHAT_DESTINATION_CREW:
                send_packet_to_all_clients_on_a_bridge_except(c->shipid, pb, ROLE_ALL, &except);
                break;
        case VOICE_CHAT_DESTINATION_ALL:
        case VOICE_CHAT_DESTINATION_CHANNEL: /* TODO: implement radio channels */
                send_packet_to_all_clients_except(pb, ROLE_ALL, &except);
                break;
        default:
                fprintf(stderr, "Unexpected destination code %hhu in opus audio packet\n", destination);
                return -1;
        }

Decompressing and playing back data

When the client receives audio data, it is put into an “incoming” queue. The data is accompanied by a token number.

void voice_chat_play_opus_packet(uint8_t *opus_buffer, int buflen, int audio_chain)
{
        if (buflen > VC_BUFFER_SIZE)
                buflen = VC_BUFFER_SIZE;
        if (audio_chain < 0 || audio_chain >= WWVIAUDIO_CHAIN_COUNT)
                return;
        pthread_mutex_lock(&incoming.mutex);
        enqueue_opus_audio(&incoming, opus_buffer, buflen, audio_chain);
        pthread_mutex_unlock(&incoming.mutex);
}

The “incoming” queue is consumed by a thread for decoding the audio packets. The thread uses the token number for each audio packet to determine which of the 4 opus decoders (decompressors) is used to decompress the data. The opus decoders are stateful, and their state depends on previously decoded packets, so it is important not to interleave packets from different clients into a decoder.
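
So the client creates one decoder per token up front, along these lines (standard libopus calls, sketched; WWVIAUDIO_CHAIN_COUNT and the opus_decoder array are the same names used in the surrounding code):

static OpusDecoder *opus_decoder[WWVIAUDIO_CHAIN_COUNT]; /* one decoder per token */

        for (int i = 0; i < WWVIAUDIO_CHAIN_COUNT; i++) {
                int err;

                opus_decoder[i] = opus_decoder_create(48000, 1, &err); /* 48kHz, mono */
                if (err != OPUS_OK) {
                        fprintf(stderr, "opus_decoder_create failed: %s\n", opus_strerror(err));
                        return -1;
                }
        }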

Once the data is decompressed, it is appended, according to the token number, to one of the 4 chains of VOIP audio data that the mixer consumes.

The meat of that code looks like this:

        while (1) {

                /* Get an audio buffer from the queue */
                pthread_mutex_lock(&q->mutex);
                b = dequeue_audio_buffer(q);
                if (!b) {
                        rc = pthread_cond_wait(&q->event_cond, &q->mutex);
                        if (q->time_to_stop) {
                                pthread_mutex_unlock(&q->mutex);
                                goto quit;
                        }
                        pthread_mutex_unlock(&q->mutex);
                        if (rc != 0)
                                fprintf(stderr, "pthread_cond_wait failed %s:%d.\n", __FILE__, __LINE__);
                        continue;
                }
                pthread_mutex_unlock(&q->mutex);

                /* decode audio buffer */
                i = b->audio_chain;
                len = opus_decode(opus_decoder[i], b->opus_buffer, b->nopus_bytes, b->audio_buffer, VC_BUFFER_SIZE, 0);
                if (len < 0) {
                        fprintf(stderr, "opus_decode failed\n");
                        goto quit;
                }
/* ... */
                playback_level = get_max_level(b);

                /* If it's been a few seconds since we've seen data on this chain then
                 * inject 100ms of silence ahead of the data to put the mixer 100ms behind
                 * it, so that if there's jitter or some space between subsequent packets,
                 * there's a little bit of slack before the mixer runs out.
                 */
                mcc = wwviaudio_get_mixer_cycle_count();
                difference = mcc - last_mixer_cycle_count[b->audio_chain];
                if (difference > (4 * 48000) / VC_BUFFER_SIZE && difference < (unsigned int) 0xfffff000)  {
                        /* > about 4 seconds at VC_BUFFER_SIZE samples per mixer cycle */
                        /* < 0xfffff000 to avoid hiccup at mcc wraparound */
                        wwviaudio_append_to_audio_chain(short_silence, ARRAYSIZE(short_silence),
                                                        b->audio_chain, NULL, NULL);
                }
                last_mixer_cycle_count[b->audio_chain] = mcc;

                /* Let the mixer have the data */
                wwviaudio_append_to_audio_chain(b->audio_buffer, len, b->audio_chain, free_audio_buffer, b);
        }

Mixing the audio

The mixer mixes several (about 10-20) channels of data dedicated to preloaded sound effects and 4 channels of VOIP data. The VOIP data is in the form of a linked list. When one chunk of audio data is consumed, the mixer calls a callback function associated with that data (typically used to free the buffers containing the data) and then moves on to the next chunk in the linked list. Data may be appended to the list at any time by the thread decompressing audio data incoming from the network.
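
The VOIP part of that boils down to walking a linked list per chain, something like this heavily simplified sketch (the real mixer also handles volume levels and the preloaded sound effect channels):

struct audio_chain_entry {
        int16_t *samples;               /* decompressed PCM data */
        int nsamples;
        int pos;                        /* how much the mixer has consumed */
        void (*free_fn)(void *cookie);  /* called when the chunk is used up */
        void *cookie;
        struct audio_chain_entry *next;
};

/* Mix one VOIP chain into the output buffer the portaudio callback is filling. */
static void mix_chain(struct audio_chain_entry **chain, int16_t *out, int frames)
{
        struct audio_chain_entry *e = *chain;

        for (int i = 0; i < frames && e; i++) {
                out[i] += e->samples[e->pos++]; /* naive mix; real code scales and clips */
                if (e->pos >= e->nsamples) {    /* chunk consumed: free it, move to next */
                        *chain = e->next;
                        if (e->free_fn)
                                e->free_fn(e->cookie);
                        e = *chain;
                }
        }
}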

The mixer function is quite complex, but it's there in the wwviaudio source if you want to dig into it.

~ by scaryreasoner on April 2, 2020.
