I’ve been a fan of freeDB for years. It’s a great way of crowd-sourcing CD title/artist/track information and is a huge help when converting CDs into part of your digital music collection (“ripping”).
However, more recently I have noticed that the majority of times I submit a new CD to freeDB, it gets rejected due to a discid collision. This is due to a fundamental limitation in the discid hashing algorithm which freeDB inherited from CDDB – it’s only a 32-bit number, of which a mere 8 bits are used as a checksum for the individual track starting times. So it’s no surprise that we’re getting collisions galore, at an increasing frequency as the database continually grows.
Even worse, CDDB attempts to deal with collisions by making CD entries in the database uniquely retrievable by (discid, category) pairs, where category is one of only 11 musical genres. Of course this is woefully inadequate, because there are countless genres and most music defies classification anyway. They attempted to deal with this by calling the 11th category “misc”, but that still has the problem of restricting entries to one unique discid per genre. Unsurprisingly this has caused a huge number of collisions, especially in the “misc” category. As a result, people have been re-submitting collided entries into the wrong genre, simply because having an entry with the wrong genre in the database is still better than not having it at all.
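To make the limitation concrete, here is a minimal Python sketch of the classic discid calculation as it is commonly documented for CDDB: 8 bits of digit-sum checksum over the track start times, 16 bits of total length in seconds, and 8 bits of track count. The function names and the two example discs are made up for illustration, and offsets are assumed to be in CD frames (75 per second).

```python
def digit_sum(n: int) -> int:
    """Sum the decimal digits of n (e.g. 210 -> 3)."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def cddb_discid(track_offsets: list[int], leadout: int) -> int:
    """Classic 32-bit discid; offsets are in CD frames (75 per second)."""
    # Top 8 bits: digit sums of each track's start time in seconds, mod 255.
    checksum = sum(digit_sum(off // 75) for off in track_offsets) % 0xFF
    # Middle 16 bits: playing time in seconds, from first track to lead-out.
    length_secs = leadout // 75 - track_offsets[0] // 75
    # Bottom 8 bits: number of tracks.
    return (checksum << 24) | (length_secs << 8) | len(track_offsets)


# Two hypothetical discs whose second track starts at 200 s vs. 110 s.
disc_a = cddb_discid([150, 15000, 30000], 45000)
disc_b = cddb_discid([150, 8250, 30000], 45000)
print(f"{disc_a:08x} {disc_b:08x}")
```

Both discs come out with the same discid, because 200 and 110 have the same digit sum and everything else matches. That is how little it takes to collide.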
Gracenote, the eventual owner of CDDB, has developed a new-generation database imaginatively called CDDB2, which adds a much richer metadata structure. Gracenote has taken advantage of this to clean up the mess caused by attempting to shoe-horn classical CDs into an inadequate schema, and to license the results to Apple for iTunes. Unfortunately that’s no use to those of us who recognise the value of freedom over vendor lock-in.
It seems that the freeDB server software hasn’t been updated since 2006, so presumably there’s not much of an active community left. That leaves a ripe opportunity for a smart philanthropist hacker to breathe new life into this valuable project. It sounds ideal for a Google Summer of Code task, for instance. As this is largely a lazyweb blog post, here are my thoughts on what needs to be done; it’s unlikely I’ll ever manage to prioritise it above other things already on my plate:
- Design a new hashing algorithm for which collisions are vanishingly unlikely in practice. It should produce at least 128-bit hashes, and include as much information about the contents of the physical CD as possible, namely:
  - number of tracks
  - starting times of all tracks
  - total playing time

  This algorithm could be as simple as calculating the MD5 digest of a delimiter-separated concatenation of the above items represented as integers.
  Note that the input should be limited to information which can be retrieved very quickly; for instance, producing MD5 digests of the contents of each track takes too long to be useful in practice.
- Design the next level of the CDDB protocol (which at the time of writing would be level 7), which allows additional querying by this new 128-bit (or larger) digest.
- Extend the existing freeDB server software to support this new level whilst remaining backwards-compatible with existing clients. In other words, database entries should be retrievable both by the old (32-bit discid, category) pair and the new digest. This would require iterating once through all existing entries to recalculate the new digest for each.
- (Optional) Extend one or more F/OSS clients to use the new protocol level, and encourage the maintainers of other clients to do the same …
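The hashing idea in the first bullet could be sketched roughly as follows. This is only one possible reading of it, not a proposed standard: the function name, the comma delimiter, the field order, and the use of the lead-out offset to stand in for total playing time are all assumptions of mine, and offsets are again taken to be in CD frames.

```python
import hashlib


def disc_digest(track_offsets: list[int], leadout: int) -> str:
    """128-bit digest of the disc's table of contents, as 32 hex characters."""
    # Field order (an assumption): track count, each start offset, lead-out.
    fields = [len(track_offsets), *track_offsets, leadout]
    # Delimiter-separated integers, so "1,23" can never be confused with "12,3".
    payload = ",".join(str(int(f)) for f in fields)
    return hashlib.md5(payload.encode("ascii")).hexdigest()


# Two hypothetical discs that differ only in where the second track starts.
print(disc_digest([150, 15000, 30000], 45000))
print(disc_digest([150, 8250, 30000], 45000))
```

Every input here comes straight from the disc's table of contents, so it is as quick to obtain as the old discid. Unlike the old scheme, any change to a track's start offset, even by a single frame, changes the input string and therefore the digest. Another digest from `hashlib` could be swapped in for MD5 without changing the shape of the code.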
For bonus points, you could extend the database schema in a similar way to CDDB2, and then start a crowd-sourcing project for cleaning up the database with respect to all those pesky classical tracks which have distinct composer / performer metadata.
So, any takers? You’d win the admiration and gratitude of a few, the satisfaction of knowing you helped slightly improve the lives of millions, and a place in heaven 😉