Monday, February 25, 2013

Bitfield Tutorial

Tutorial on the usage and pitfalls of C style bitfields

Overview

Bitfields are a handy C construct for setting individual bits in a structure. They are seemingly straightforward but have many small gotchas that can be painful to figure out how to deal with or ruin program portability. This tutorial will give a brief overview of how to use bitfields, and explore a few of the more problematic issues I've personally run into and how to work around them.

Basic Usage

You can declare a variable of type signed or unsigned that uses foo bits in any struct. For example:
  struct
  {

    unsigned header    : 7;
    unsigned data      : 20;

    unsigned footer    : 5;
  } bitExample;
Will make a structure with the header set as the first 7 bits, the data as the middle 20 bits, and the footer as the last 5 bits. Fairly intuitive so far. Also one can do something like:
  struct 
  {
     unsigned         : 7;
     unsigned  data   : 25;

  } bitExample;
which will allocate the first 7 bits as unnamed padding. This is *not* guaranteed to be 0 when reading out the data, more on that later.
Getting the data out is easily accomplished using unions:
  union 
  {

     struct 
     {
        unsigned         : 7;
        unsigned  data   : 25;

     } bitExample;
     int rawData;
  } bitUnion;

or using memcpy:
  struct 
  {
     unsigned         : 7;

     unsigned  data   : 25;
  } bitExample;
  
  int rawData = 0;

  memcpy( &rawData, &bitExample, sizeof(int) );


Problem 1 -- Bad bit ordering due to little endian architecture

Consider the following example program:
  #include 

  int main()

  {
     union
     {
        struct
        {
           unsigned x  :3;

           unsigned y  :5;
        }bitExample;

 char rawData;

     } bitUnion; 
     
     bitUnion.bitExample.x = 0;

     bitUnion.bitExample.y = 0xF; 

     printf("%hhXn", bitUnion.rawData);

  }
We would expect the binary and hex output to look like:
   x    y       
  000 01111  == 0000 1111 == 0F
Which will be the case if running on a big endian machine. But on a little endian machine we will see:
  
                       y    x
  78 == 0111 1000 == 01111 000

As you can see it built it up backwards from what you would expect! Simply reversing the bit order of the struct like:
     struct
     {
        unsigned y  :5;

        unsigned x  :3;
     }bitExample;
fixes this. To allow it to work on either big or little endian one could do something such as:
     struct
     {
        #ifdef LITTLE_ENDIAN
    unsigned y  :5;

    unsigned x  :3;
 #else
    unsigned x  :3;

    unsigned y  :5;
        #endif
     } bitExample;


Problem 2 -- The bitfield reordering for little endianess is only done on a word by word basis

Consider the following on a little endian machine:
  #include 

  int main()

  {
     union
     {
        struct
        {
           unsigned x  :32;

           unsigned y  :32;
        }bitExample;

 int rawData[2];

     } bitUnion; 
     
     bitUnion.bitExample.x = 0xFFFF;

     bitUnion.bitExample.y = 0; 

     printf("As integers:n");

     printf("%X, %Xnn", bitUnion.rawData[0]
                      , bitUnion.rawData[1]);

  }
you would expect from the previous example that we would see:
  0, FFFF
as the output since the bitfields are made backwards. But this only occurs per word, so we actually see:
  FFFF, 0

So the ordering of words should be the same, but the bits in those words should be reversed for little endian. So for the following struct:
  struct
  {
     unsigned x  : 3;

     unsigned y  : 29;

     unsigned z  : 26;

     unsigned a  : 2;
     unsigned b  : 4;


     unsigned c  : 32;
  }
to be compatible on both big and little endian you would need to do the following:
  struct
  {
     #ifdef LITTLE_ENDIAN
         unsigned y  : 29;

         unsigned x  : 3;

         unsigned b  : 4;

         unsigned a  : 2;
         unsigned z  : 26;

 
         unsigned c  : 32;
     #else
         unsigned x  : 3;

         unsigned y  : 29;

         unsigned z  : 26;

         unsigned a  : 2;
         unsigned b  : 4;

 
         unsigned c  : 32;
     #endif  
  }
Again swapping bits within a 32 bit word, but not the words themselves.

Problem 3 -- Misreprensentation due to copying to less than word sized structures

Yet another endianess problem. Consider the following program:
    #include 

    int main()
    {
       union
       {

          struct
          {
             unsigned x  :32;
             unsigned y  :30;

             unsigned z  :2;
          }bitExample;
  
          int rawData[2];

          short sRawData[4];
       } bitUnion; 
     
       bitUnion.rawData[0] = 0;

       bitUnion.rawData[1] = 0;
       bitUnion.bitExample.x = 0xFFFF;

       bitUnion.bitExample.y = 0; 
       bitUnion.bitExample.z = 3; 

       printf("As integers:n");

       printf("%X, %Xnn", bitUnion.rawData[0]
                        , bitUnion.rawData[1]);

       printf("As shorts:n");
       printf("%4hX, %4hXn", bitUnion.sRawData[0]

                           , bitUnion.sRawData[1]);
       printf("%4hX, %4hXn", bitUnion.sRawData[2]

                           , bitUnion.sRawData[3]);
    }
As before, everything behaves as you would expect on a big endian machine. But lets do some bit busting assuming little endian architecture:
   x = 0xFFFF hex therefore 
   x = 1111 1111 1111 1111 

   z = 3 dec therefore 
   z = 11 bin
   y = 0 dec therefore 
   y = 00 0000 0000 0000 0000 0000 0000 0000 bin
   
so using the idea that word order is the same but bits are added backwards we would assume:
   0000 FFFF C000 0000 
   
as the hex output. Taking into account the printf formatting in the example we would expect:
   As integers:
   FFFF, C0000000

   As shorts:

   0, FFFF
   C000, 0
   
Unfortunately we actually get:
   As integers:
   FFFF, C0000000

   As shorts:
   FFFF, 0
   0, C000
   
As you can see when printing it out in 16 bit chunks we get each set of shorts swapped.
Unfortunately the fix entirely depends on how you plan to spit it out in the end. If it is in 32 bit chunks you would be fine, 16 bit chunks you would need to swap the values that you loaded up in 16 bit chunks using either htons functions (For TCP/IP, since it is big-endian often converted to little-endian) or making your own simple variant.

Problem 4 -- Internal memory padding causing extra padding during retrieval of raw data

The first problem that is not an endianess issue, it will cause issues on any architecture. This is assuming a word size of 32 bits, on a 64 bit architecture it will be different. For portabilities sake one should assume 32 bit word boundaries, since that will work for both.
Consider the following example program:
  #include 

  int main()

  {
     union
     {
        struct
        {
           unsigned x  :30;

           unsigned y  :5;
           unsigned z  :29;

        }bitExample;

        int rawData[2];
     } bitUnion; 
     
     bitUnion.bitExample.x = 1073741823; // 30 bit max

     bitUnion.bitExample.y = 31; // 5 bit max
     bitUnion.bitExample.z = 536870911; // 29 bit max 

     printf("%Xn", bitUnion.rawData[0]);

     printf("%Xn", bitUnion.rawData[1]);

  }
Running it on my machine produced(It can be different depending on the state of memory, but probably won't be right):
  BFFFFFFF
  B7F5DFFF
Which is wrong since we maxed out x,y,z we would expect to see both integers as FFFFFFFF
The issue is that the bitfields are not aligned on the word boundary. To speed up handling of bitfields internally the C compiler makes sure that the fields are aligned, if something goes past it pushes it into the next word. So for example the the 30 + 5 = 35 which is greater than the 32 bit word boundary so the 5 bits is moved and we pick up garbage at the end of the word in the last 2 bits. Then we have 5 + 29 = 34 same thing and we see more garbage in the last 27 bits.
A quick fix is below:
  #include 

  int main()

  {
     union
     {
        struct
        {
    // Word 1

           unsigned x  :30;
           unsigned y1  :2;

           // Word 2
           unsigned y2  :3;
           unsigned z  :29;

        }bitExample;
        int rawData[2];
     } bitUnion; 
     
     bitUnion.bitExample.x = 1073741823; // 30 bit max 
     bitUnion.bitExample.y1 = 3; // 2 bit max 
     bitUnion.bitExample.y2 = 7; // 3 bit max 
     bitUnion.bitExample.z = 536870911; // 29 bit max

     printf("%Xn", bitUnion.rawData[0]);

     printf("%Xn", bitUnion.rawData[1]);

  }
  
As expected we get:
  FFFFFFFF
  FFFFFFFF
     
But the problem is that the field is now split into two. Unfortunately we are limited in what we can do, the easiest thing I've found to do is just split up an incoming value like so:
  unsigned integer y = 31;

  // 0x03 == 0b00011  So take only the last 2 bits

  bitUnion.bitExample.y1 = y & (0x03); 

  // Shift out the last bits so that the top 3 bits 
  //   are now the entire number

  bitUnion.bitExample.y2 = (y >> 2);   

Any value of y will now work, although any bitfield across words has to be tweaked like this.

Problem 5 -- Padding Fields Are Worthless

Consider the following:
#include

int main(){

   union {
      struct {
         unsigned   :16;

         unsigned x :16;
      } fields;
      int out;

   } test;

   test.fields.x = 65535;  // Max 16 bit field

   
   printf("%Xn", test.out);
   return 0;

}
We would expect FFFF0000 on a little endian machine, of course this isn't what happens. It depends on the state of your machines memory but I got FFFF7050. The problem is depending on compiler (This was done using gcc 4.1) it will not initialize padding to 0, and you can't explicit reference it. You could set test.out = 0; or just name the value and set it to 0
Also if instead we did something like test.out = 0xAAAAAAAA; it would overwrite the padding. The easiest fix is to name the area and before using the bitfield set it to 0.

Conclusion

If you've read this far and really had to make an attempt to understand this, I'm sorry. Bitfields should be a simple construct for building up binary but due to inane compiler design they can become almost useless on little endian without properly knowing how they are internally managed. Don't use them unless there really is no other simpler alternative or you know you will always be running on a big-endian architecture. Also make sure your fields are word aligned if you can, if not you are just introducing more headaches.

No comments: