The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Optimizing a custom tag search

I'm writing a tag based search algorithm with about 800 tags on several tens of thousands of searchable items.  I have a simple db structure: an item table, a tag table and a reference table with a foreign key relationship to both.

Here's the flow from the user's point of view:

User sees a listing all the tags.  Each tag has the number of items tagged by that tag next to it.

User selects tag, screen refreshes to show the selected tag and a listing of all the tags that are left: that is, tags that are attached to both the selected item and the tag itself.  (For example: the user selects 'foo'.  The tag bar still shows up because there are 12 items that have foo and bar.  The 'bert' tag doesn't show up anymore since there aren't any items tagged by 'foo' and 'bert')

Then the user can keep on selecting tags and narrowing the search. 

The way the data is, there are often searches that have 6 or seven tags. 

In it's current unoptimized (but functional) state, I have a script that reads all the selected tags and joins the reference table with itself n times where n is the number of tags selected.  Here's an example query on just 4 tags:

select * from item_talent,
 ref_tag t0,
 ref_tag t1,
 ref_tag t2,
 ref_tag t3
where 1 and (1
  and t0.tag = 7
  and t1.tag = 98
  and t2.tag = 87
  and t3.tag = 103)
 and t0.ftable = 'item_talent'
 and item_talent.item_talent = t0.fkey
 and t1.ftable = 'item_talent'
 and item_talent.item_talent = t1.fkey
 and t2.ftable = 'item_talent'
 and item_talent.item_talent = t2.fkey
 and t3.ftable = 'item_talent'
 and item_talent.item_talent = t3.fkey

This query takes 0.1537 sec or so, but here's the rub:  if there are 100 tags left that share these four, that means that this query has to be run 100 times with a fifth option. 

I currently have a system in place that caches each query as it is done, but the data is updated fairly regularly, so I have to keep on dumping significant percentages of my cache... and even on 15s page load is unacceptable... not to mention what happens when 'many' visitors show up at once.

I'm on the verge of extracting the data straight out of the db every hour or so and building a 100% script site mechanism. 

Before I do this, can anyone think of a way I can use the db more efficiently?

Thanks in advance!
jde Send private email
Thursday, October 25, 2007
 
 
Instead of having one expensive query (derived), couldn't you instead perform multiple cheap (basic) queries? You could then do the expense locally of retaining only the items that all employ. This is also much more cachable and performing a union locally should be very fast.
Benjamin Manes Send private email
Friday, October 26, 2007
 
 
The query I use for an almost identical system looks like:

SELECT t.name FROM tag_map tm
JOIN tags t ON (t.id=tm.tag_id)
JOIN companies c ON (tm.company_id=c.id)
WHERE tm.company_id IN (
  SELECT tm.company_id FROM tag_map tm JOIN tags t ON
  tm.tag_id=t.id WHERE t.name IN ('foo','bar') GROUP BY   
  tm.company_id HAVING count(tm.company_id)=2
) GROUP BY tm.tag_id, t.name;

You'll obviously want to check this is better with your setup...

When you're building the query dynamically, the only bits that need to change are the 'IN' list, and the count value- the count wants to be the same as the number of tags.
G Jones Send private email
Friday, October 26, 2007
 
 
This isn't really an optimization, but seems to be a cleaner way of expressing the query:

select distinct * from item
inner join itemtag it0 ON it0.itemid = item.itemid AND it0.tagid = 7
inner join itemtag it1 ON it1.itemid = item.itemid AND it1.tagid = 98
inner join itemtag it2 ON it2.itemid = item.itemid AND it2.tagname = 107
inner join itemtag it3 ON it3.itemid = item.itemid AND it3.tagname = 177

Friday, October 26, 2007
 
 
Oops, shouldn't do

select distinct * from... as that doesn't reduce the result set.

select distinct itemtag.itemid from ...

Friday, October 26, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz