The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Short URL

Hi guys,

I'm trying to figure out a good way to shorten URL's, but without the need for a database to store them first. Basically trying to compress/deflate the original URL to a short string (hash), that can be decompressed/inflated to its original form on request.

Example:

http://discuss.joelonsoftware.com/default.asp?pg=pgDiscussPost&ixDiscussTopicParent=0&ixDiscussGroup=4

... becomes ...

http://discuss.joelonsoftware.com/s83jslk3plk2j


Maybe the strongest of string compression algorithms might not be useful in such a scenario, but well I'm trying to ponder on it. Maybe its not as big a problem as it seems.

Any ideas or suggestions?

Cheers :)
14thNorton
Wednesday, March 05, 2008
 
 
General case or specific web app?  Your key problem is making it reversible while lacking a map.

That would seem to boil down to a compression algorithm of one sort or another.  Try researching short string compression.  I'm not familiar enough with compression to know if it's even efficient to try to reversibly compress short strings (where 'short' is ~10-100 characters), and then render them in a URL friendly character set.  Is there enough regularity to do some sort of run-length encoding?

If you have a more specific webapp you could have an implicit map of common values.  By knowing what possible parameters can appear on a URL, you can shortcut them without an explicit database, but this still implies keeping a map.
Dan Fleet Send private email
Wednesday, March 05, 2008
 
 
I suppose you could use one-way hashes like MD5? Perhaps you should ask yourself why you want to do it. From a SEO -- and indeed, human -- point of view, using garbled URLs makes it difficult to discern what the address is about.

Joel made that mistake with his CMS software. The earlier versions had ugly, ugly names like "fog0000000001.html".

Also, if people suspect it might be a sort of "tinyurl" address they might not click on it because they fear getting goatse'd.
The Luggage
Wednesday, March 05, 2008
 
 
A one-way hash won't work. The OP has to be able to reverse it to get back the original URL.
clcr
Wednesday, March 05, 2008
 
 
What is the use case? Eg. why do you want to do this?
Troels Knak-Nielsen Send private email
Wednesday, March 05, 2008
 
 
You might try Arithmetic Coding with a fixed table of symbol frequencies:

http://en.wikipedia.org/wiki/Arithmetic_encoding
H. Bowman
Wednesday, March 05, 2008
 
 
What if you later want to use http://discuss.joelonsoftware.com/s83jslk3plk2j as a URL?
Ruatara P Send private email
Wednesday, March 05, 2008
 
 
Think of it in terms of URL forwarding.

http://discuss.joelonsoftware.com/s83jslk3plk2j

..forwards to..

http://discuss.joelonsoftware.com/default.asp?pg=pgDiscussPost&ixDiscussTopicParent=0&ixDiscussGroup=4

..but no information is stored in a database. Instead the original URL is compresses/packed to create a shorter URL. If someone accesses the short URL, then the URL key (s83jslk3plk2j) is unpacked to get the original forwarding URL.

Dan and Bowman seem to have understood where I'm coming from. Burrows-Wheeler Transform (BWT) is another algorithm that I've come acorss atto compress the original URL string. But it will take some toying around to find a good solution (if at all possible).

Any more ideas guys?!?!
14thNorton
Thursday, March 06, 2008
 
 
Troels Knak-Nielsen Send private email
Thursday, March 06, 2008
 
 
14th,

Take into account that what you say cannot work if these two conditions hold:

- No database
- Arbitrary set of URL's

I mean, who assures you that http://discuss.joelonsoftware.com/s83jslk3plk2j

is not an actual, uncompressed URL?

In other words, if you have the original url's:

http://discuss.joelonsoftware.com/actual_url_47
http://discuss.joelonsoftware.com/complicated_article_url_23456

What happens if "actual_url_47" hashes to "abcdx56" and "complicated_article_url_23456" hashes to... "actual_url_47"?

In practice you should forbid short uncompressed url's, that may cause problems or force an awkward design (what happens with index.html?)
Daniel
Monday, March 10, 2008
 
 
If I read the original post correctly, the shortened version does not need to be a legal URL; just some truncated form that can be 'decompressed/inflated' back to a legal URL?

Thus the decompressed form can be distinguished from a real URL by making it any old illegal string.

Re-reading subsequent comments, the actual problem does slightly different in that they want TinyURL style functionality without a database; so a text encoding that reduces length but not hashing that could result in collisions.

Do TinyURL allow themselves to be used as a web service? Could be a near no-work solution. ;-)
Grant Black Send private email
Thursday, March 13, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz