New Infinite Machine and site design

Today I’m launching two things simultaneously: a new site design of quadpoint.org and a rewrite of Infinite Machine, the software behind quadpoint.org.

Old: Old

New: New

Back in February, I implemented something I called ‘Infinite Caching’ in infmx. It was basically a feature that allowed for virtually infinite cache expire times. Thus, all of the pages on my site would stay in memory, completely rendered, and requests theoretically never hit the disk because the cache was always warm. Whenever a content update was made, I ran a little utility to update the cache asynchronously. I had a git hook set up to run the utility whenever I pushed into the production content repository, so that changes would show up soon right after the push. It was pretty slick, or so I thought.

A couple of months ago I got the idea of an even better caching scheme: the persistent cache. The idea was that static HTML pages should be generated and served by the webserver directly instead of inserting the pages into a memory cache and served by Django. I hacked a persistent cache implementation into my cache-updating utility in July.

Not even a week later I had an epiphany: why is my setup so complicated? Why am I running Django when all of my pages are pre-rendered and in a cache? My dynamic web application was actually just generating the same static content over and over. So a week after implementing the pcache, I started ripping out all of the dynamic bits and rewriting Infinite Machine to implement Jekyll’s interface. Jekyll has the same idea: produce static pages from formatted content. (Aside: I considered switching to Jekyll, but it lacks some of the features that I use in infmx, like breadcrumbs and intrawiki linking.) Now, infmx produces static content just like it did before, except it writes it all to an output directory on the filesystem that can be served up by a webserver. The only dependency on Django now is for the template engine, which is only required during document compilation, and not required on the server.

There are a few major (and obvious) benefits to this. Firstly, it’s static content served directly by the webserver. Even though I was previously serving pages straight from memcached, the fastcgi + django + python + memcached overhead was significant. The same content served from a static file is a full order of magnitude faster: ~800 requests/second for django+memcached versus 8000 requests/second with static. For a 6KB page, that’s over 50MB/s. In other words, I’ll saturate my network link before the webserver (nginx) becomes a bottleneck.

Second, no more Django on my server. That means one less thing to fail and one less thing to make sure is running when the box reboots. My deployment process is now much easier, too. To update, I basically do:

% infmx && ./deploy.sh prod

where deploy.sh is a simple wrapper around rsync to mirror my local copy to my server. Simple.

The takeaway is this: think hard about what you’re doing with your web application. In the early days of the web, everything was static HTML. Then, scripting languages like PHP came along and everything was dynamic. Now that browsers and JavaScript no longer suck, we’re converging to a reasonable hybrid of the two. Try to avoid doing anything ‘smart’ on the fly. Preprocess as much as you can and let your webserver do the work. For fancy dynamic things, JavaScript and a lightweight server-side backend work well.

launching libgithub v1.0

Today I’m launching libgithub, a JavaScript library for displaying interesting GitHub data on websites.

Right now, it can generate badges for one or more commits to a project, and looks something like this:

libgithub badge

Commit badges are great for showing recent activity on a project page. A user can quickly get a feel for a project’s freshness without going to GitHub to see the commit history. For instance, I use commit badges on my projects index for this reason. Badges are also useful for a sidebar/widget display. My friend Jesse Crouch made a demo showing how libgithub badges can be used in a sidebar.

Badges are just a start; more widgets belong in libgithub. Read more about the project, how to use it, and how to contribute at the libgithub project page.

Engine Yard hashing contest post-mortem

It’s over. Finally. On July 20-21, 2009, Engine Yard held a contest to find the smallest hamming distance to a given SHA-1 hash by hashing permutations of 12 words from a given dictionary, and optionally 5 ASCII characters appended to the end. I’ve spent the last several evenings writing, rewriting, and optimizing my C implementation and distributed infrastructure.

This is my post-mortem of the project.

About the contest

Before I explain what I did specifically, I’ll explain what the above paragraph means. A hash function simply takes some input and produces some fixed-size output. They are deterministic, too, so the same input will always produce the same output. Hash values are essentially just very large numbers that input values map to. In the case of SHA-1, the output is 160 bits long, meaning that SHA-1 hash values range from 0 to 2160 - 1 = 1461501637330902918203684832716283019655932542975 (base 10).

The given phrase for the contest was “I would much rather hear more about your whittling project” which hashes to 1249c4b7f578204f10798c0269f8488280fb9981 in hexadecimal (base 16). In binary (base 2), this is:

0001001001001001110001001011011111110101
0111100000100000010011110001000001111001
1000110000000010011010011111100001001000
1000001010000000111110111001100110000001

To reiterate, the goal of the contest was to find a phrase of 12 words (and optionally 5 ASCII characters at the end) that would produce a SHA-1 hash with the fewest bit flips away from the challenge hash. The number of bit flips between hashes is called the hamming distance. For example, the hamming distance between “00011” and “10010” is 2; two bits need to be flipped in “00011” to turn it into “10010”.

Hash functions are supposed to generate fairly well-distributed hashes such that even a very small change (say, one bit flip in the input) will produce an entirely different hash output. This ‘randomness’ makes the math quite easy: if we assume that if we have a 50% chance of correctly guessing each of the 160 bits when generating the hash for a random phrase, the result is a binomial distribution. This means that for each phrase, we’re essentially flipping 160 coins and hoping to get as many heads as possible.

Some math

So, the question is, how hard is it, really, to get all heads, or even close? The short answer is: hard. For this contest, the winning entry had a hamming distance of 30. Generating hashes at random, the probability of getting a 30 is 1 in:

To even come close to a 30, one would need a very large number of powerful processors, or some hardware designed to do calculations. Steve Worley released a CUDA implementation for the contest on nVidia’s forums a few days ago. High-end nVidia GPUs were able to process hundreds of millions of hashes per second, whereas lowly CPU-only implementations could only do a few million per second per core.

My implementation

I wrote my implementation in C. It started out in C++, but after simplifying my searching algorithm, I removed a lot of complexity and dropped down to pure C. The fastest I was able to get out of my implementation was 3.2M hashes/sec/core on a Xeon 2.13GHz.

My code is available on GitHub: http://github.com/msparks/sham

My general searching algorithm:

  1. Pick a random set of 12 words
  2. For 1,000,000 iterations, randomize one of the ASCII characters at the end of the phrase per iteration.
  3. Calculate SHA-1.
  4. Repeat

My main goal was to do as little work as possible each iteration, and just crank out as many hashes as the processor could compute. Thus, I did not vary capitalization of letters, and only randomized one ASCII character per iteration, instead of all 5.

Here are some other things I did to reduce time per iteration, in no particular order:

  • Used OpenSSL for hashing. In the interest of time, correctness, and speed, I did not try to implement SHA-1 for this contest.
  • Compiler optimizations. -O3. A given.
  • Kept a SHA_CTX structure around for the current phrase. SHA1_Update() in libcrypto will calculate chunks of input at a time, and the result context can be stored in memory to be updated later and finalized. I picked a new set of 12 words every million iterations, and when I picked a new set, I calculated the SHA_CTX for that set and kept it in memory. Thus, each iteration only needed to run the characters since the last SHA chunk boundary and the following 5 ASCII characters, instead of the whole string.
  • Pick sets ‘intelligently’ based on length. pick_minimal_set() in my implementation would repeatedly pick a random set of 12 words until the length of the resulting phrase was within 5 characters of a chunk boundary (512 bits; 64 bytes). Thus, each iteration would have to SHA a maximum of 10 characters.
  • Use gcc builtin. Counting bits manually is slow. Use __builtin_popcount*() to do it for you.
  • Profile. I used Shark.

Distributed Infrastructure

Calculating SHA-1s from a well-known fixed dictionary is an embarrassingly parallel problem. Nevertheless, I still wanted an easy way to monitor all of the running instances. I wrote a simple Python client to send result phrases to a matching Python server sitting on my server. It computed the hamming distance and showed the new minimum distances as they came in.

Here’s the log from my monitoring server, showing timestamps and lowest distances. Hostnames removed.

2009-07-20 12:04:53,357 - *** shamserver starting ***
2009-07-20 12:04:53,357 - challenge phrase: I would much rather hear more about your whittling project
2009-07-20 12:04:53,357 - challenge hash: 1249c4b7f578204f10798c0269f8488280fb9981
2009-07-20 12:05:46,233 - [xxxxx] distance: 46
2009-07-20 12:05:46,233 - [xxxxx] phrase:
pieprzyk sass git utc regex command sqlserver sysoev threading newman 
sockets lampson ][;?=
2009-07-20 12:05:56,453 - [xxxxx] distance: 45
2009-07-20 12:05:56,453 - [xxxxx] phrase:
record rake oscon rfc1157 clone processes applet procs frozen inline chain 
resource ^QvH5
2009-07-20 12:06:12,061 - [xxxxx] distance: 44
2009-07-20 12:06:12,061 - [xxxxx] phrase:
debug chore amp renders behlendorf reports kurtz url joy icon response 
bigtable &CVpu
2009-07-20 12:06:30,838 - [xxxxx] distance: 43
2009-07-20 12:06:30,838 - [xxxxx] phrase:
lampson dependency key xkcd syslogs assoc mailers uniq json atbash helper 
browser 4c/3Q
2009-07-20 12:06:51,466 - [xxxxx] distance: 38
2009-07-20 12:06:51,466 - [xxxxx] phrase:
rfc959 error hillis lib debug href proc watson capistrano hawkes deprecate 
iverson 1,*D@
2009-07-20 12:11:49,624 - [xxxxx] distance: 36
2009-07-20 12:11:49,624 - [xxxxx] phrase:
textmate ubuntu montulli winograd paths joshp binary rails friedman tests 
kindi oikarinen dg'7O
2009-07-20 14:00:27,171 - [xxxxx] distance: 35
2009-07-20 14:00:27,171 - [xxxxx] phrase:
stearns sub nodoc whitfield thread parse beta solo erb kurtz torvalds 
nitems E,giH
2009-07-20 14:23:22,954 - [xxxxx] distance: 34
2009-07-20 14:23:22,954 - [xxxxx] phrase:
dystopia services syntax regexp heinemeier computes metaprogramming 
rescue summit yin record wadler 3[G^_
2009-07-20 20:44:23,431 - [xxxxx] distance: 32
2009-07-20 20:44:23,431 - [xxxxx] phrase:
blocks rfc2810 shannon nodes host rdoc wep joshp size perform applet 
rescue HlKLr

The lowest distance I found was 32. I estimate that I calculated about 25-30 trillion SHA-1s in the 30-hour period.

Final thoughts

Overall, the contest was a blast. I’ve done nothing else for the last few evenings, but writing the code and optimizing was a great learning experience. Using compiler builtins and Apple’s Shark were new to me. Shark is especially awesome and generally more accessible than gprof for profiling.

In hindsight, I should have focused on a CUDA implementation early on. Maybe for the next challenge in September.

Now that it’s over, I can go back to being productive for a while. :)

alphasign 1.0!

I just pushed out a new version of alphasign, my API for talking to LED signs like the Betabrite. With this new version I added a lot of documentation and examples, which are fairly useful for an API. The generated documentation can be found here. This is also my first project on PyPI.

Unfortunately, while I’m at college, I am away from my Betabrite, so I can’t add support for the DOTS files (essentially bitmaps for the sign) for quite a while. Maybe someone will send me a patch. :)