Profiling with gprof

I was messing around with my little game today, trying to squeeze a bit more speed out of it. It is not unusual for there to be as many as 3000 objects being kept track of, moved, and drawn 30 times per second in this game, so it’s important to make sure time isn’t being wasted in the inner loop.

One tool which can help find bottlenecks is gprof, the GNU profiler. To use it, simply compile your program with the “-pg” option. Then, when you run your program, it will produce a file called “gmon.out” Using gprof, you can then get a report about your program.

$ gcc -pg -o yourprogram yourprogram.c
$ yourprogram > output.txt
$ gprof yourprogram gmon.out > report.txt

One such report I got from analyzing a run of my game looked like this:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 41.54      0.27     0.27      315   857.19   888.94  draw_objs
 41.54      0.54     0.27                             advance_game
  7.69      0.59     0.05                             patestCallback
  4.62      0.62     0.03    23365     1.28     1.28  find_free_obj
  1.54      0.63     0.01   166345     0.06     0.06  draw_spark
  1.54      0.64     0.01    15936     0.63     0.63  abs_xy_draw_letter
  1.54      0.65     0.01    14681     0.68     0.68  draw_letter
  0.00      0.65     0.00   188684     0.00     0.00  move_spark
  0.00      0.65     0.00    28235     0.00     0.00  randomn
  0.00      0.65     0.00    23166     0.00     1.28  add_generic_object
  0.00      0.65     0.00    22915     0.00     1.28  add_spark
  0.00      0.65     0.00    20295     0.00     0.00  interpolate
  0.00      0.65     0.00    13552     0.00     0.00  draw_on_radar
  0.00      0.65     0.00    10725     0.00     2.48  bridge_move

draw_objs as number one isn’t surprising, it has the most work to do. Same with advance_game (which moves every object, 30 times a second). patestCallback is called to keep the sound card’s buffer full, so it’s not too surprising it’s up there. find_free_obj is the object allocator. It is using a rather naive algorithm, of sequentially scanning an array for slots which are marked as being unused. Replacing that algorithm with a smarter one knocked find_free_obj down into the noise. The various drawing routines, draw_spark, draw_letter, etc. aren’t too surprising, there are lots of sparks in this game. I was a little surprised randomn was as high as it was in the list. A quick look at that code revealed why it was so high:

The function randomn() returns a random integer between 0 and n. It’s used for things like deciding the x and y velocities of a hundred or so sparks every time something explodes.

int randomn(int n)
        return (int) (((random() + 0.0) / (RAND_MAX + 0.0)) * (n + 0.0));

Aauugh! Floating point math! A floating point divide! What the hell was I thinking?

It’s a very simple minded algorithm. Get a random number between 0 and RAND_MAX (RAND_MAX is 2 billion or so), divide it by RAND_MAX, to get a random floating point number between 0 and 1, and multiply that by n, then convert n back to an integer.

It works.


How to speed this up? First, get rid of the floating point math. Part of the trick to doing that is to realize that in this game, we never need random numbers that are bigger than 16 bits can hold ( that is, never bigger than 32767).

So, we can do something like this:

        return ((random() & 0x0000ffff) * n) / 0x0000ffff;

This says, get a random number between 0 and RAND_MAX, and zero out the high sixteen bits. Now we have a random number between 0 and 65535 decimal (0xffff in hex). Multiply that by n, then divide by 65535. As long as n is reasonably small, which we know it is in this application, we won’t have to worry about that multiply overflowing in a 32 bit integer.

There’s one more trick we can do though. Dividing by 65535 is almost the same thing as dividing by 65536. It differs by only 1 part in 65000 or so. And were talking about random numbers here, in a game, not a cryptographic application. Are we really going to notice that 1 part in 65000 difference? No.

Why is dividing by 65536 better than dividing by 65535?

Look at it in hex:


        return ((random() & 0x0000ffff) * n) / 0x0000ffff;

vs. this:

        return ((random() & 0x0000ffff) * n) / 0x00010000;

The second one is identical to this:

        return ((random() & 0x0000ffff) * n) >> 16;

And so we can replace that expensive divide with a cheap shift.

Profiling a test application shows that this last version is 5 times faster than the floating point version on my system.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 83.67      0.05     0.05  1000000    50.20    50.20  randomn2
 16.73      0.06     0.01  1000000    10.04    10.04  randomn1

Fixing that, and declaring randomn as a static inline, and now gprof shows that randomn() is down in the noise where it belongs in my game. There are probably better, faster random number scaling algorithms, but this makes it good enough that there are better places to spend my time optimizing.

~ by scaryreasoner on February 18, 2008.

2 Responses to “Profiling with gprof”

  1. Regarding your random number generation optimization, I want to quote the rand(3) manpage:

    “If you want to generate a random integer between 1 and 10, you should always do it by using high-order bits, as in

    j = 1 + (int) (10.0 * (rand() / (RAND_MAX + 1.0)));

    and never by anything resembling

    j = 1 + (rand() % 10);

    (which uses lower-order bits).”

    The main problem is that the distribution provided by rand() and random() is only “good” if you consider 100% of the bits. If you look at a small subset of the bits, you may get a terrible distribution (example: some older versions of rand() alternated between even and odd numbers with 100% predictability).

    You could easily write your own random number generator from scratch (based off of any of the many examples available on the net) that would only provide the necessary number of bits.

  2. Thanks for the informative comment. I’m sure you’re correct. For my application, (mostly generating random velocities for sparks for explosions in a video game) what I did turned out to be good enough (which only means there was no obvious non-randomness in the generated explosions.) In that application, speed is far far more important than having a good random number generator, so long as the randomness is “good enough.”

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: