| ||
|
The "Design of Software" discussion group has been merged with the main
Joel on Software discussion group.
The archives will remain online indefinitely. |
Hi guys, I'm trying to figure out a good way to shorten URL's, but without the need for a database to store them first. Basically trying to compress/deflate the original URL to a short string (hash), that can be decompressed/inflated to its original form on request. Example: http://discuss.joelonsoftware.com/default.asp?pg=pgDiscussPost&ixDiscussTopicParent=0&ixDiscussGroup=4 ... becomes ... http://discuss.joelonsoftware.com/s83jslk3plk2j Maybe the strongest of string compression algorithms might not be useful in such a scenario, but well I'm trying to ponder on it. Maybe its not as big a problem as it seems. Any ideas or suggestions? Cheers :)
14thNorton Wednesday, March 05, 2008
General case or specific web app? Your key problem is making it reversible while lacking a map. That would seem to boil down to a compression algorithm of one sort or another. Try researching short string compression. I'm not familiar enough with compression to know if it's even efficient to try to reversibly compress short strings (where 'short' is ~10-100 characters), and then render them in a URL friendly character set. Is there enough regularity to do some sort of run-length encoding? If you have a more specific webapp you could have an implicit map of common values. By knowing what possible parameters can appear on a URL, you can shortcut them without an explicit database, but this still implies keeping a map.
I suppose you could use one-way hashes like MD5? Perhaps you should ask yourself why you want to do it. From a SEO -- and indeed, human -- point of view, using garbled URLs makes it difficult to discern what the address is about. Joel made that mistake with his CMS software. The earlier versions had ugly, ugly names like "fog0000000001.html". Also, if people suspect it might be a sort of "tinyurl" address they might not click on it because they fear getting goatse'd.
The Luggage Wednesday, March 05, 2008
A one-way hash won't work. The OP has to be able to reverse it to get back the original URL.
clcr Wednesday, March 05, 2008
You might try Arithmetic Coding with a fixed table of symbol frequencies: http://en.wikipedia.org/wiki/Arithmetic_encoding
H. Bowman Wednesday, March 05, 2008
What if you later want to use http://discuss.joelonsoftware.com/s83jslk3plk2j as a URL?
Think of it in terms of URL forwarding. http://discuss.joelonsoftware.com/s83jslk3plk2j ..forwards to.. http://discuss.joelonsoftware.com/default.asp?pg=pgDiscussPost&ixDiscussTopicParent=0&ixDiscussGroup=4 ..but no information is stored in a database. Instead the original URL is compresses/packed to create a shorter URL. If someone accesses the short URL, then the URL key (s83jslk3plk2j) is unpacked to get the original forwarding URL. Dan and Bowman seem to have understood where I'm coming from. Burrows-Wheeler Transform (BWT) is another algorithm that I've come acorss atto compress the original URL string. But it will take some toying around to find a good solution (if at all possible). Any more ideas guys?!?!
14thNorton Thursday, March 06, 2008
14th, Take into account that what you say cannot work if these two conditions hold: - No database - Arbitrary set of URL's I mean, who assures you that http://discuss.joelonsoftware.com/s83jslk3plk2j is not an actual, uncompressed URL? In other words, if you have the original url's: http://discuss.joelonsoftware.com/actual_url_47 http://discuss.joelonsoftware.com/complicated_article_url_23456 What happens if "actual_url_47" hashes to "abcdx56" and "complicated_article_url_23456" hashes to... "actual_url_47"? In practice you should forbid short uncompressed url's, that may cause problems or force an awkward design (what happens with index.html?)
Daniel Monday, March 10, 2008
If I read the original post correctly, the shortened version does not need to be a legal URL; just some truncated form that can be 'decompressed/inflated' back to a legal URL? Thus the decompressed form can be distinguished from a real URL by making it any old illegal string. Re-reading subsequent comments, the actual problem does slightly different in that they want TinyURL style functionality without a database; so a text encoding that reduces length but not hashing that could result in collisions. Do TinyURL allow themselves to be used as a web service? Could be a near no-work solution. ;-) | |
Powered by FogBugz
