The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Metadata a la Flickr and YouTube


I'm creating an application in which various pieces of content will have tags assigned to them that describe the content. It will be very similar to how Flickr and YouTube do it. Does anyone know of any resources that talk about managing tagging data like this? Some of the specific things I'm researching are how to use the tags to find related content (including some sort of relevancy rating), whether or not to specify different tag types (to influence the relevancy rating), and any sort of anecdotal evidence about how to make these systems performant and efficient.

Eric Marthinsen
Thursday, August 10, 2006
I run a forum which manages content using tagging. Among other topics, it has a section on tagging: .

The blogosphere has a fair number of articles where people discuss how to improve tagging from a user or research point of view, but you will find maybe one or two really technical articles. There is a Flickr presentation at and another short but interesting article in which the author suggests taking into account the number of users who used a particular tag when calculating tag weight. I can't find it right now, but I will post it when I find it.
Denis Krukovsky
Thursday, August 10, 2006
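
One way to read the tag-weight suggestion above is to count how many distinct users applied a tag, rather than how many times it was applied, so that one enthusiastic user cannot inflate a tag on their own. The article itself was not located, so the sketch below is only an illustration of that idea and not its method; the `(user_id, item_id, tag)` tuple layout and the function name `tag_weights` are assumptions.

```python
from collections import defaultdict


def tag_weights(assignments):
    """Weight each tag by the number of distinct users who applied it.

    `assignments` is an iterable of (user_id, item_id, tag) tuples.
    That layout is an assumption made for this sketch.
    """
    users_by_tag = defaultdict(set)
    for user_id, _item_id, tag in assignments:
        users_by_tag[tag].add(user_id)
    return {tag: len(users) for tag, users in users_by_tag.items()}


# Hypothetical sample data: user 1 applies "dog" three times,
# while two different users each apply "seattle" once.
assignments = [
    (1, "photo-a", "dog"),
    (1, "photo-b", "dog"),
    (1, "photo-c", "dog"),
    (2, "photo-d", "seattle"),
    (3, "photo-e", "seattle"),
]
weights = tag_weights(assignments)
print(weights)
```

With this weighting, "seattle" outranks "dog" even though "dog" was applied more often, because two different people chose it.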

Thanks for the info. This is all good stuff. If you come across that other article, definitely post it here. I'd be very interested to read it. Thanks again.

Eric
Friday, August 11, 2006
Make sure you are using a database that can do subselects - unless you like pain, you'll be using them a lot.

My website's tag schema looks like this:


user_id (so I can see who added the tag)

tag_soundex (for offering "suggestions")
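
The schema listing above is incomplete in this archive. For readers who want something concrete, here is a minimal sketch of a schema that would be consistent with the query posted later in the thread. The table and column names `tags`, `story_tags`, `stories`, `tag_id`, `tag_name`, and `sid` come from that query, and `user_id` and `tag_soundex` come from the list above. Which table each of those two columns belongs to, the `title` column, the keys, and the use of SQLite are all assumptions.

```python
import sqlite3

# Assumed layout: one row per tag, plus a join table recording
# which user attached which tag to which story.
SCHEMA = """
CREATE TABLE stories (
    sid   INTEGER PRIMARY KEY,
    title TEXT                      -- hypothetical column
);
CREATE TABLE tags (
    tag_id      INTEGER PRIMARY KEY,
    tag_name    TEXT NOT NULL UNIQUE,
    tag_soundex TEXT                -- used to offer "suggestions"
);
CREATE TABLE story_tags (
    sid     INTEGER NOT NULL REFERENCES stories(sid),
    tag_id  INTEGER NOT NULL REFERENCES tags(tag_id),
    user_id INTEGER NOT NULL,       -- who added the tag
    PRIMARY KEY (sid, tag_id, user_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables that were created.
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```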

When I rolled out tags to my users (, they were used quite a bit initially.  People would go back and tag up their old stories and everything.  Six months later, hardly any new content is tagged.

I suspect the reasons are

1) Tags are not a requirement for posting a story.
2) My interface isn't as clean as it could be.  For example, many would try to comma-separate their tags and I don't support that.
3) It isn't as integrated as it could be.  In other words, there isn't much benefit to using tags.

and maybe....

4) Tags really are just hype and don't add much value to a website.

I've seen this pattern on other sites that rolled out tags as well.  The feature gets used a lot initially, until the novelty wears off and nobody uses it.

There are plenty of other social / design issues to address as well but I can save that for another post.
Cory R. King
Friday, August 11, 2006
Here are some of the queries I use:

This is my related tags query - $tag is the name of the tag a user is viewing right now (example: ):
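        # $tag is assumed to be quoted already, e.g. with $dbh->quote($raw_tag)
        # ($raw_tag is a hypothetical name); $count is assumed to be an integer.
        my $query = qq(
                SELECT t.tag_name, count(st.sid) AS count
                FROM tags t, story_tags st
                WHERE st.tag_id = t.tag_id
                        AND t.tag_name != $tag
                        AND st.sid IN (
                                SELECT st.sid
                                FROM story_tags st, tags t, stories s
                                WHERE t.tag_name = $tag
                                        AND s.sid = st.sid
                                        AND st.tag_id = t.tag_id
                        )
                GROUP BY t.tag_name
                ORDER BY count DESC
                LIMIT $count
        );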
Cory R. King
Friday, August 11, 2006
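
For anyone who wants to try the related-tags query above without a full site behind it, here is a small self-contained sketch against an in-memory SQLite database. It keeps the same idea (find the stories carrying the current tag, then count the other tags on those stories) but uses explicit joins and bound parameters instead of string interpolation. The sample rows, the `uses` alias, the name-based tie-break in `ORDER BY`, and the stripped-down table definitions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stories (sid INTEGER PRIMARY KEY);
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag_name TEXT UNIQUE);
CREATE TABLE story_tags (sid INTEGER, tag_id INTEGER);
""")

# Hypothetical sample data: stories 1 and 2 are tagged "dog".
conn.executemany("INSERT INTO stories (sid) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, "dog"), (2, "park"), (3, "seattle")])
conn.executemany("INSERT INTO story_tags VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 3)])

RELATED_TAGS = """
SELECT t.tag_name, COUNT(st.sid) AS uses
FROM tags t
JOIN story_tags st ON st.tag_id = t.tag_id
WHERE t.tag_name != ?
  AND st.sid IN (
      SELECT st2.sid
      FROM story_tags st2
      JOIN tags t2 ON t2.tag_id = st2.tag_id
      JOIN stories s ON s.sid = st2.sid
      WHERE t2.tag_name = ?
  )
GROUP BY t.tag_name
ORDER BY uses DESC, t.tag_name
LIMIT ?
"""


def related_tags(conn, tag, limit=10):
    """Return (tag_name, uses) rows for tags that co-occur with `tag`."""
    return conn.execute(RELATED_TAGS, (tag, tag, limit)).fetchall()


related = related_tags(conn, "dog")
print(related)
```

Binding the tag as a parameter avoids having to quote it by hand, which matters because tag names come straight from users.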

Wow. Great info. I like your use of the soundex value. How well does it work for you? Your query was also very helpful.

Eric
Friday, August 11, 2006
The soundex worked okay, but I had to tweak it to only use the first four characters instead of the full five or six.  If I didn't, it wouldn't return any results.  Before I made the move to Postgres, I had the system use MySQL's fulltext searching on tags too.  When both were used to assist somebody assigning tags ("did you mean <this>?") it worked quite well.
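
To make the "first four characters" tweak concrete, here is a hedged sketch of prefix matching on Soundex codes. The `soundex` function below is a simplified, unpadded variant written for this example; it is not guaranteed to match the output of MySQL's or Postgres's implementations, and the names `suggest`, `known_tags`, and `prefix_len` are assumptions.

```python
# Consonant groups used by Soundex; vowels, H, W and Y have no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word, length=None):
    """Simplified Soundex: keep the first letter, encode later consonants
    as digits, and collapse adjacent repeats of the same digit.
    With length=None the code is neither padded nor truncated."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = CODES.get(c, "")
        if digit and digit != previous:
            code += digit
        if c not in "HW":  # H and W do not separate repeated digits
            previous = digit
    return code if length is None else code[:length].ljust(length, "0")


def suggest(typed, known_tags, prefix_len=4):
    """Offer existing tags whose Soundex code shares a prefix with `typed`."""
    key = soundex(typed)[:prefix_len]
    return [t for t in known_tags
            if t != typed and soundex(t)[:prefix_len] == key]


known_tags = ["photography", "photographer", "seattle", "portrait"]
suggestions = suggest("photografy", known_tags)
print(suggestions)
```

In this sample, comparing full codes would miss "photographer", whose code is one digit longer than the code for "photografy". Comparing only a short prefix is looser, which fits the observation that the longer codes returned no results.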

Flickr looks like it does something a bit more crazy with their tags.  Somehow they are splitting things up by word - .  You don't see "My Dog" as a top tag, only "dog".  You don't see "Seattle, WA", only "Seattle".

Whatever they are doing, I haven't looked enough to figure out their algorithm, but I really like it.  I don't like the system I and many others use of storing tags as phrases.  It doesn't make for pretty tag clouds, and I suspect phrases give you less data to derive relationships with.  Probably somebody better versed in database / set theory would say something like "Flickr further normalized the data" or something but alas it is 2am...
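
Flickr's actual algorithm is not known here, so the following is only a guess at the general idea: split each phrase tag into lowercase words and count the words rather than the phrases. The function names and sample phrases are assumptions.

```python
import re
from collections import Counter


def word_tags(phrase):
    """Split a phrase tag into lowercase word tags, dropping punctuation."""
    return re.findall(r"\w+", phrase.lower())


def top_words(phrases, n=2):
    """Count individual words across all phrase tags and return the top n."""
    counts = Counter()
    for phrase in phrases:
        counts.update(word_tags(phrase))
    return counts.most_common(n)


# Hypothetical tags as users might have typed them.
phrases = ["My Dog", "dog", "Dog park", "Seattle, WA", "Seattle", "downtown Seattle"]
top = top_words(phrases)
print(top)
```

Counted as phrases, every one of these six tags appears exactly once. Counted as words, "dog" and "seattle" each appear three times and rise to the top, while "my" and "wa" stay at one, which matches the behaviour described above.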

This, by the way, was where I learned enough set theory to make the query above:

It is good stuff.  I'll reiterate that you *need* an SQL implementation that supports subqueries.  Tags forced me to upgrade to MySQL 5 and finally Postgres once I got tired of MySQL's habit of corrupting my tag tables.

While I question its ROI, creating the tag system for Photographica was a hell of a lot of fun.

Good luck!
Cory R. King
Saturday, August 12, 2006

Good stuff. I'll check out that link. My DB is SQL Server 2000, so we are good on the subqueries. I think Flickr does some interesting things as well. It looks like they take a tag and normalize it by removing spaces and capitalization, but store it in both its normalized and denormalized forms, so a search for "San Francisco" will bring up things with the sanfrancisco tag. Although a search for sanfrancisco doesn't seem to bring up things tagged with "San Francisco". Clearly, I'm still trying to work it out.
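
Here is a minimal sketch of the "store both forms" idea described above: keep the tag as typed for display, and index items under a normalized form with spaces and capitalization removed. This is a guess at the approach, not Flickr's implementation, and the `TagIndex` class and its method names are assumptions. Unlike the behaviour observed above, this sketch normalizes the search term too, so lookups work in both directions.

```python
def normalize_tag(raw):
    """Remove all whitespace and lowercase the tag."""
    return "".join(raw.split()).lower()


class TagIndex:
    """Toy in-memory index keyed by normalized tag."""

    def __init__(self):
        self.items_by_tag = {}   # normalized tag -> set of item ids
        self.raw_forms = {}      # normalized tag -> set of tags as typed

    def add(self, item_id, raw_tag):
        key = normalize_tag(raw_tag)
        self.items_by_tag.setdefault(key, set()).add(item_id)
        self.raw_forms.setdefault(key, set()).add(raw_tag)

    def search(self, query):
        """Return item ids whose tags normalize to the same key as `query`."""
        return sorted(self.items_by_tag.get(normalize_tag(query), set()))


index = TagIndex()
index.add("photo-1", "San Francisco")
index.add("photo-2", "sanfrancisco")
print(index.search("San Francisco"))
print(index.search("sanfrancisco"))
```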

Thanks for the info.

Eric
Sunday, August 13, 2006

This topic is archived. No further replies will be accepted.
