The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Bidirectional self join.


As per this article:

There were a lot of comments on this but also a lot of ranting. I have what seems to be the exact situation and would appreciate some help in deciding which route to take.

I have a table which represents a Candidate applying for a job. Simplified as follows.


One feature of the system is to indicate if a candidate has applied twice. This is customised in code and could be simply candidates with the exact same email address, or a combination of checks on parts of Surname, Initials, Address, etc.

The "possible" duplicates are shown to the user. For example, looking at candidate John Smith the user is informed there may be duplicates and is shown a list of all Candidates that match J. Smith.

The user is then given the option to confirm if any of the suggestions are definitely duplicates or definitely NOT duplicates.

My planned structure for this is:
Duplicate (bit)

For the candidate being viewed, the suggested duplicates are checked in this table and should Duplicate be true then they are flagged as definite, if Duplicate is falst, they are flagged as not definite, if they do not exist in the table they are left as suggested duplicates.

As with the previous poster, my question is, should I store two entries in this table for every relationship, or just one.

My mind is screaming, just one, there is no additional information being stored if I store 2 records, therefore I shouldn't.

However, my worries are with integrity of the data and ease of use for myself and other developers.

Can I or should I add a unique constraint to ensure a record is only ever entered once?

Will a coder make an "assumption" that it works a different way and write incorrect code. How can it be made obvious to them to only create one entry or force them to make two entries? Same with using the table, how can I make it obvious that they need to check both directions, or just one?

One possibility is to create a View to display a UNION of the links to simulate the two record option?

At the moment I am going to go with my gut feeling, there is no data advantage in two records so I will stick to one record and leave it to the developers (of which I am one) to only add a record if required.

Any discussion would be helpful.
Robin Day Send private email
Thursday, July 17, 2008
If you don't duplicate the records then you can simulate the duplication when you select: by selecting a union, or by selecting from a view (or from an indexed/materialized view) of a union.

The problem with simulating the duplication (instead of really duplicating) is when you update or delete: the application needs to update or delete the real record and not the fake record. Wrapping access to the table into stored procedures would encapsulate the knowledge of which of the two logical records actually physically exists (so that the application wouldn't have to know).

If you want to physically duplicate the records, I think you can do that by defining a trigger on insert/update/delete (so that the application changes one record and the trigger automatically does the same to that record's mirror).

Perhaps if the 'mirror' records are maintained automatically by database triggers, then any check constraints on the relational integrity of the mirroring of the records are non-essential (if you trust the triggers).

Technically, having that extra "Duplicate (bit)" column has denormalized the thing: you have the same information in two records. An alternative, which I don't necessarily recommend, would be:

--- This table may contain mirrored records

--- One instance of the IsDuplicate data for each pair
IsDuplicate (bit)
Christopher Wells Send private email
Thursday, July 17, 2008
Make one of the conditions for your join that the ID from the first table be greater than the ID from the 2nd table.  That way, you'll only every see each set once.  They'll always be in the same order, so you if you flag the pair as "not duplicate" you can easily exclude them from results.
Joel Coehoorn Send private email
Thursday, July 17, 2008
Use a single record, but put a constraint in your
system that ID1 < ID2. Maybe you can do this with
a trigger or check constraint.

Now, you can have (ID1,ID2) as a unique key if you

You will have to intercept additions of the form
(BigID, SmallID) and convert them to (SmallID,BigID).
You might also need to subsequently convert any
additions of the form (ID2,ID3) to the form (ID1,ID3).

You may be able to manage all this automatically
in the database with an updatable view.
Object Hater
Thursday, July 17, 2008
It would seem much simpler to me to split the concept of candidate and application into to two tables. Then it becomes rather trivial to discover if the same candidate has applied more than once.
Financial programmer
Thursday, July 17, 2008
Not necessarily.  You can still have situations where the same person ends up with more than one candidate record, even if you key on something like e-mail address.  The OP was also checking for complicated name matches, etc.
Joel Coehoorn Send private email
Thursday, July 17, 2008
Thanks for the help. I hadn't thought of using an ID1 > ID2 method for a constraint.

I'm thinking the single method approach but not enforce a uniqueness. Therefore, should a developer insert the data a second time then so be it. I can then add a trigger to the table to ensure that should an update or delete happen then it will update or delete the second record.
Robin Day Send private email
Friday, July 18, 2008

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz