Writing a Basic FIR Filter in C

Introduction and Theory

A FIR (Finite Impulse Response) filter is mathematically identical to a convolution between the input array and a set of taps which represent the impulse response of the filter. This can be implemented using a series of dot products between the tap vector and a vector consisting of the current input sample and those which came immediately before it.

y n = N = 0 N 1 x n N b N

where y is the output array, x is the input array, b is the array of taps and N is the number of taps.

When implementing this in code we need to decide how to treat the first N samples as we have no indication of what data should come before them. We can take one of two methods, either we can assume all previous samples are 0 or we can start our calculation at the Nth sample. For this example we will choose the zero assumption method as this gives us the same number of output samples as input.

Basic FIR Filter

We will first create an unoptimised FIR filter which cab be called multiple times as data comes in, with memory of the previous samples. This will require 4 functions, one to initialize the required memory, one to apply the FIR filter, one to reset the memory in the case of a discontinuity in the received data, and one to free all the filter memory. To pass pass around this memory we will first declare the following structure:

typedef struct FIR_Stream_Mem {
    size_t tap_len;
    double *taps;
    double *buffer;
} FIR_Stream_Mem;

The tap_len field contains the number of taps stored in the taps array. buffer will contain the previous samples and will also have size tap_len. We will allocate and populate this structure using the following function:

FIR_Stream_Mem *init_fir_stream(double* taps, size_t tap_len){
    /* Create the output object in memory */
    FIR_Stream_Mem *output = (FIR_Stream_Mem*)(malloc(sizeof(FIR_Stream_Mem)));

    /* Copy the taps across */
    output->tap_len = tap_len;
    output->taps = (double*)(malloc(sizeof(double) * tap_len));
    memcpy(output->taps, taps, tap_len * sizeof(double));

    /* Set the buffer to all zeros */
    output->buffer = (double*)(malloc(sizeof(double) * tap_len));
    memset(output->buffer, 0x00, tap_len * sizeof(double));

    return output;
}

This first allocates the memory to store the FIR_Stream_Mem structure itself, before allocating memory for the buffer and taps. The taps are copied over from the input to allow for the structure to be passed around more easily, and the buffer is set to all zeros as discussed above.

We are now ready to start using the FIR filter. To do this, we need to implement the algorithm so it can be run on our data. We will use the following function:

void apply_fir(FIR_Stream_Mem *mem, 
               const double *input, 
               double *output, 
               size_t size){
    /* Loop through all input samples */
    for(size_t i = 0; i < size; i++){
        /* Calculate first output sample */
        double sample_output = input[i] * mem->taps[0];

        /* Loop through all the taps */
        for(size_t j = 1; j < mem->tap_len; j++){
            sample_output += mem->buffer[j] * mem->taps[j];
        }

        /* Update buffer */
        for(size_t j = mem->tap_len - 2; j > 0; j--){
            mem->buffer[j + 1] = mem->buffer[j];
        }
        mem->buffer[1] = input[i];
        output[i] = sample_output;
    }
}

This has the following arguments:

It loops through all the input samples and calculates the dot product as discussed above. It then shifts all the samples in the buffer to the right and adds the current sample to the front before calculating the next sample.

This function can now be applied on chunks of our input data as they come in, with the memory from the last chunk being kept in the FIR_Stream_Mem struct.

However, if a discontinuity is detected in the input stream of samples the history stored in buffer will be incorrect and we will have no way of determining what it should be. In this case it is best to reset the filter by setting the filter’s buffer to all zeros, which can be done with the following function:

void reset_fir(FIR_Stream_Mem *mem){
    memset(mem->buffer, 0x00, mem->tap_len * sizeof(double));
}

Finally, when we are done with the filter we should free all associated memory. This can be done with the following function:

void deinit_fir(FIR_Stream_Mem *mem){
    free(mem->buffer);
    free(mem->taps);
    free(mem);
}

Speed Optimizations

The biggest flaw with the previous implementation is the shifting of the buffer array in the FIR_Stream_Mem struct after every sample is processed. This is done by re-writing every element of the buffer, as shown in the following table:

Changes to FIR buffer
Sample Buffer Content
N 7 6 5 4 3 2 1 0
N + 1 8 7 6 5 4 3 2 1
N + 2 9 8 7 6 5 4 3 2

This means that we are doing FIR_Stream_Mem->tap_len writes to memory in order to update the buffer. However, only one of the values in buffer has actually changed, the others have just shifted position. By changing the start position of the buffer, we can reduce the number of writes to only one per sample. The table below shows this in action, where the bolded positions are the start of the array.

Optimized FIR buffer memory layout
Sample Buffer Content
N 7 6 5 4 3 2 1 0
N + 1 7 6 5 4 3 2 1 8
N + 2 7 6 5 4 3 2 9 8

To implement this method we first need to modify the FIR_Stream_Mem struct to record the starting position of buffer :

typedef struct FIR_Stream_Mem {
    size_t tap_len;
    double *taps;
    double *buffer;
    size_t buffer_start;
} FIR_Stream_Mem;

The buffer_start value should also be initialized to 0 in init_fir_stream:

FIR_Stream_Mem *init_fir_stream(double* taps, size_t tap_len){
    /* Create the output object in memory */
    FIR_Stream_Mem *output = (FIR_Stream_Mem*)(malloc(sizeof(FIR_Stream_Mem)));

    /* Copy the taps across */
    output->tap_len = tap_len;
    output->taps = (double*)(malloc(sizeof(double) * tap_len));
    memcpy(output->taps, taps, tap_len * sizeof(double));

    /* Set the buffer to all zeros */
    output->buffer = (double*)(malloc(sizeof(double) * tap_len));
    memset(output->buffer, 0x00, tap_len * sizeof(double));

    /* Set the buffer start position to 0 */
    output->buffer_start = 0;

    return output;
}

Finally, we need to modify apply_fft to loop index and update buffer correctly. We loop though buffer when applying the filter taps, so we make out first mod here. Rather than using the for-loop index, we calculate our own index using the following algorithm:

size_t idx;

if(mem->buffer_start + j < mem->tap_len){
    idx = mem->buffer_start + j;
} else {
    idx = mem->buffer_start + j - mem->tap_len;
}

This ensures we don’t get an out-of-bounds error by looping the index back to the start if it will otherwise overrun.

We also need to ensure that buffer gets updated correctly. This is done by setting the value at buffer_start to be our current input sample, then subtracting one from buffer_start. We again avoid an overrun by checking if buffer_start is 0, in which case we set buffer_start to be the last element in the array. This is done using the following code:

mem->buffer[mem->buffer_start] = input[i];
if(mem->buffer_start == 0){
    mem->buffer_start = mem->tap_len - 1;
} else {
    mem->buffer_start--;
}

This gives a final apply_fft function of:

void apply_fir(FIR_Stream_Mem *mem, const double *input, double *output, size_t size){
    /* Loop through all input samples */
    for(size_t i = 0; i < size; i++){
        /* Calculate first output sample */
        double sample_output = input[i] * mem->taps[0];

        /* Loop through all the taps */
        for(size_t j = 1; j < mem->tap_len; j++){
            size_t idx;

            if(mem->buffer_start + j < mem->tap_len){
                idx = mem->buffer_start + j;
            } else {
                idx = mem->buffer_start + j - mem->tap_len;
            }

            sample_output += mem->buffer[idx] * mem->taps[j];
        }

        /* Update buffer */
        mem->buffer[mem->buffer_start] = input[i];
        if(mem->buffer_start == 0){
            mem->buffer_start = mem->tap_len - 1;
        } else {
            mem->buffer_start--;
        }

        output[i] = sample_output;
    }
}

Speed Boost

Modern compilers are pretty good at optimizing the code given to them, so does this modification actually give a noticeable speed increase? To test this we can use the following function which calls apply_fft 1,000,000 times and compare the execution times between the two implementations:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

#include "fir_filter.h"

static double randfrom(double min, double max){
    double range = max - min;
    double div = RAND_MAX / range;
    return min + (rand() / div);
}

int main(){
    /* Seed random number generator */
    srand(time(NULL));

    /* Create some random input data */
    double data[1024];

    for(size_t i = 0; i < 1024; i++){
        data[i] = randfrom(-1.0, 1.0);
    }

    /* Create random taps */
    double taps[8] = {1, 2, 3, 4, 4, 3, 2, 1};

    /* Init fir filter */
    FIR_Stream_Mem *fir_filter = init_fir_stream(taps, 8);

    /* Filter data */
    double output[1024];
    for(size_t i = 0; i < 1000000; i++){
        apply_fir(fir_filter, data, output, 1024);
    }

    /* Deallocate the FIR memory */
    deinit_fir(fir_filter);
}

Running the code with the unoptimized apply_fft took 25.936 seconds, whereas the optimized version took only 22.511 seconds.

Memory Optimization

As modern computers have a lot of memory I have not experimented with the possible memory optimizations which could be made to this code. However, there are two changes which could be made. The first is that the init_fir_stream function could store the pointer its given to taps rather than making a copy. This would reduce the memory footprint of the FIR filter at the expense of thread safety in multithreaded programs. Secondly, the buffer element in FIR_Stream_Mem is currently one byte longer than it needs to be - the final element is never used as part of the filter. This could be fixed, but it would make the indexing in apply_fft a bit more complicated.