In-place array uniq in C
I’ve been developing Insight even though the uni project has come to an end, because it’s fun! I also want to make it more stable and eventually release it under an open-source licence of some kind. There will be an update coming soon, I promise! I now have Internet, so I can write up some things…

Anyway, one of the interesting things I wanted to do for Insight was an in-place form of uniq for an array, ideally without any additional memory allocation. It seems that this is something nobody else has done yet! So I set about doing it myself…

For those of you who are unfamiliar with the Linux/UNIX command uniq, it takes a sorted list and removes any duplicates. This is almost exactly what I’m trying to do, with one caveat: I need to keep the “discarded” duplicates. I have an array containing a number of strings, and these have all been dynamically allocated via malloc() or calloc(). If I just remove or overwrite their pointers, they’ll vanish and cause a memory leak. While I’ve now fixed a large number of leaks thanks to Valgrind, I’m trying to actively avoid any possibility of adding more. Read on for the details…
The Idea
The basic idea is that we have a sorted list of items that may have some duplicates, say:
{A0, A1, B0, C0, C1, E0, F0, F1, F2, G0, H0, H1}
and we want to remove all of the duplicates, so we get two sets:
{A0, B0, C0, E0, F0, G0, H0} {A1, C1, F1, F2, H1}
Note that it doesn’t matter which of the duplicates ends up in which set, just that the items in the first are all unique and that we don’t lose any. Also, A0 and A1 have the same value; the subscript just helps to distinguish exactly which of the As I’m talking about. Now that we have the problem, let’s look at some solutions.
Lossless solution – additional lists
So the first solution uses two temporary lists, unique and duplicate. We start with two pointers into the original list: p and q. We set p to point to the start of the list, and q to point to the element after p. We then copy the item p points at to our unique list. Then, while we still have more items to examine:
- If the items pointed to by p and q have the same value, then we copy the item q points at to the duplicate list and advance p and q.
- If they are different, we copy the item at q to our unique list, and advance p and q.
We keep doing this until we get to the end of the array, then copy the unique list to the start of the array and the duplicate list after that (both overwriting the previous contents), and set the number of unique items to the length of the unique list.
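To make that concrete, here’s a minimal sketch of the two-list approach for plain ints (the function name and scratch buffers are my own, not from Insight); it assumes the caller supplies two buffers of at least n elements each:

    #include <string.h>

    /* Two-list uniq sketch: unique and duplicate are caller-supplied
     * scratch buffers, each with room for n items. */
    size_t uniq_two_lists(int *set, size_t n, int *unique, int *duplicate) {
        size_t p = 0, q = 1, u = 0, d = 0;

        if (n == 0)
            return 0;

        unique[u++] = set[p];              /* the first item is always unique */
        while (q < n) {
            if (set[p] == set[q])
                duplicate[d++] = set[q];   /* same value: stash the duplicate */
            else
                unique[u++] = set[q];      /* new value: it joins the unique list */
            p++;                           /* p and q advance together either way */
            q++;
        }

        /* Unique items go back first, duplicates after them. */
        memcpy(set, unique, u * sizeof *set);
        memcpy(set + u, duplicate, d * sizeof *set);
        return u;                          /* number of unique items */
    }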
Lossy solution – one list
We can do a similar thing in-place, although this will result in data loss. The essential idea is:
- We start in the same way: p pointing to the first element, and q to the second
- While q has the same value as p, advance q along the list
- When they are different, advance p, copy q to p, and advance q
- Once q goes past the end of the list, we’re done, and the offset of p, plus one, is the number of unique items.
Now this is quite simple, and only requires one pass through the list, but it results in the loss of information, which is unacceptable in this case.
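As a sketch, again assuming plain ints (my naming; this is essentially what C++’s std::unique does):

    /* Lossy in-place uniq sketch: duplicate slots are overwritten,
     * so never use this on malloc()ed pointers. */
    size_t uniq_lossy(int *set, size_t n) {
        size_t p = 0, q = 1;

        if (n == 0)
            return 0;

        while (q < n) {
            if (set[p] != set[q]) {
                p++;
                set[p] = set[q];   /* overwrites whatever was at p: data loss */
            }
            q++;
        }
        return p + 1;              /* offset of p, plus one */
    }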
Lossless solution – one list
For this solution, we can also do it with one pass through the list and two pointers. The basic intuition is that we’re going through the array, accumulating a rotating block of stuff that we don’t want. At any given point:
- Everything from the start up to p (inclusive) is the unique list (so far)
- Everything strictly between p and q is an unwanted duplicate
- Everything from q (inclusive) to the end is yet to be processed.
For example, midway through our sample list the state would be {A0, B0, C0 | A1, C1 | E0, F0, F1, F2, G0, H0, H1}, with p at C0 and q at E0. At the end, we return the number of unique items at the start of the list; everything above that is a duplicate.
For integers
Here is some pseudo-Java that performs my algorithm for a set of integers.
int set_uniq(int[] set) {
    int p = 0, q = 1, tmp;

    if (set.length == 0)
        return 0;

    while (q < set.length) {
        if (set[p] == set[q]) {
            // Duplicate: leave it sitting in the block between p and q.
            q = q + 1;
        } else {
            // New unique value: grow the unique region by one and swap
            // the item at q into place, pushing a duplicate out to q.
            p = p + 1;
            if (p < q) {
                tmp = set[p];
                set[p] = set[q];
                set[q] = tmp;
            }
            q = q + 1;
        }
    }
    // p is the index of the last unique item.
    return p + 1;
}
The general code
The arguments to the general set_uniq() function are very similar to the arguments to standard qsort():
- The (sorted) array to work on
- The number of items in the array
- The size of each item in the array
- A comparator function that returns a negative integer if the first argument is less than the second, zero if they are equal, and a positive integer if the first argument is greater than the second.
So, without further ado, my code:
#include <stdlib.h>
#include <string.h>

int set_uniq(void *set,
             size_t count,
             size_t elem_size,
             int (*cmp)(const void*, const void*)) {
    char *p, *q, *max;
    void *tmp;

    if (count < 2)
        return (int) count;
    tmp = calloc(1, elem_size); /* TODO: check for failure */
    p = set;
    q = (char *) set + elem_size;
    max = (char *) set + (count * elem_size);

    while (q < max) {
        if (cmp(p, q) == 0) {
            q += elem_size;   /* duplicate: leave it between p and q */
        } else {
            p += elem_size;   /* grow the unique region */
            if (p < q) {      /* swap the element at q into place */
                memcpy(tmp, p, elem_size);
                memcpy(p, q, elem_size);
                memcpy(q, tmp, elem_size);
            }
            q += elem_size;
        }
    }
    free(tmp);
    return (int) ((p - (char *) set) / elem_size + 1);
}
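To show how this hangs together, here’s a hypothetical usage sketch (cmp_str, the test strings, and main() are my own, and it assumes set_uniq() from above is in scope): sort with qsort(), dedupe with set_uniq(), and note that every pointer is still reachable afterwards, so nothing leaks:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Comparator for an array of char*, in the shape qsort() expects. */
    static int cmp_str(const void *a, const void *b) {
        return strcmp(*(const char * const *) a, *(const char * const *) b);
    }

    int main(void) {
        char *words[4];
        size_t n = 4, i, unique;

        words[0] = strdup("banana");   /* strdup() is POSIX */
        words[1] = strdup("apple");
        words[2] = strdup("apple");
        words[3] = strdup("cherry");

        qsort(words, n, sizeof words[0], cmp_str);
        unique = set_uniq(words, n, sizeof words[0], cmp_str);

        for (i = 0; i < unique; i++)
            printf("%s\n", words[i]);  /* apple, banana, cherry */

        /* The duplicate survived past the unique region, so we can
         * free all n pointers: no leaks. */
        for (i = 0; i < n; i++)
            free(words[i]);
        return 0;
    }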
Conclusion
This code is released under a Creative Commons licence (see below). A similar idea will work well for other solutions, like set difference. I’ll post those later if anyone really cares. Also, there is a better way of swapping the items by using the fact that:
a ^= b;
b ^= a;
a ^= b;
swaps a and b without needing a temporary variable. To swap whole elements, you can just go along the data in 32-bit chunks, performing those XOR operations on each pair of words. This actually works out to be 30% or so faster (if I remember correctly). Finally, general code like this is always going to be slower than code that’s been specialised for a particular purpose. Calling this on integers, for example, will be much slower than an integer-specific version. I should also point out that integers can simply be overwritten, as you don’t need to free them 😉
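For illustration, a chunked XOR swap might look like the sketch below (my own code, not from Insight). It assumes elem_size is a multiple of four, that both pointers are suitably aligned, and, crucially, that a and b never alias: XOR-swapping a location with itself zeroes it, which is one reason the p < q guard above matters.

    #include <stddef.h>
    #include <stdint.h>

    /* Swap two distinct, non-overlapping elements 32 bits at a time. */
    static void xor_swap(void *a, void *b, size_t elem_size) {
        uint32_t *x = a, *y = b;
        size_t i, words = elem_size / sizeof(uint32_t);

        for (i = 0; i < words; i++) {
            x[i] ^= y[i];
            y[i] ^= x[i];
            x[i] ^= y[i];
        }
    }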