The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

DB design for multilingual dictionary

Hi all,
I'm trying to make a multilingual dictionary in MySQL.  Its main table looks something like this:

ID | english | spanish | french | german | ...
1    house    casa      NULL    Haus    ...
2    dog      perro    chien    Hund    ...
...

I've got text files with a lot of word translations, one for each language pair.  For example, there's an English-German file that looks like this:
house ||| Haus
dog ||| Hund
...

I've gotten some mileage out of INSERT ... ON DUPLICATE KEY UPDATE... which works if I, say, insert English-German and then English-Spanish.  But a problem occurs if I insert, say, "dog - perro" from English-Spanish and then "chien - Hund" from French-German, and the table looks like this:

ID | english | spanish | french | german | ...
1    dog      perro    NULL    NULL
2    NULL      NULL      chien    Hund

and then if I try to insert Spanish-French "perro - chien", it gives me an error because it would create duplicate rows.  I'd like it to just merge the two rows together.

So what do I do?
- Is there a SQL command that does this?
- Should I just make sure to insert all the english-* files first, so I'm only adding one foreign language at a time?
- Should I redesign my database, maybe to have one table for each language pair?  n^2 tables seems bad, but then I could just JOIN them together or something whenever I want to get a translation, right?

Apologies if this is dumb; I'm new to SQL and databases.  Thanks for the help!
Sedate Snail Send private email
Tuesday, July 31, 2007
 
 
You have bigger problems than the SQL.  To have anything even close to useful you need to consider synonymy (two words may mean the same thing) and polysemy (one word may have more than one meaning). For example, in English both 'house' and 'dog' are verbs, not translated by 'casa', 'perro', etc.

Words do not have translations; meanings have translations.  Your tables' structure (you'll need more than one) will need to accommodate that fact.
Will Dowling Send private email
Tuesday, July 31, 2007
 
 
+1 to Will Dowling.

One possible solution is to give each "meaning" a unique primary key. For now, don't worry about what the actual meaning is, just acknowledge that primary key 1 has a specific, explicit definition that's different than primary key 2.

In separate tables, one for each language, make a list of words. Give each word a foreign key that points back to the "meaning" in the primary key table.  For example...

Meanings
--------
1 - (noun) a four footed canine animal
2 - (verb) to harass someone

English
-------
dog - 1
dog - 2

Spanish
-------
perro - 1

French
------
chein - 1

This would imply there are no direct Spanish equivalents to the English verb as in "mi nino [dogged] la maestra".
TheDavid
Tuesday, July 31, 2007
 
 
Right; sorry, I didn't make this clear.  I know that words have multiple meanings and that language translation is much harder than looking up each word (otherwise, babelfish would be all we need!)

Nonetheless, I need a multilingual lexicon.  I'm using a phrase-based model that takes in short phrases (like "the black dog" or "above the table" or even "house around the" or whatever), so I need to translate these phrases.  In reality, I don't just have "dog" in the database; it looks more like:

ID | english        | spanish ...
1    the black dog    el perro negro ...

I was just using the simple one-word phrases for an example.  So is there a good way to do this, disregarding issues about the difficulty of translation?

Thanks for your replies, though!
Sedate Snail Send private email
Tuesday, July 31, 2007
 
 
No.

This is a computation problem, not a raw data problem. Even in the best case scenario where you capture all of the possible lexicons in a database, you still have the problem of searching through all those records to find entries that match.

To use your example, you've got a "black dog" referring to the actual animal, the movie "Black Dog", the beer "Black Dog" and I would presume there's both a song and a book out there somewhere. Adding an article as in "the black dog" narrows it down a little bit in some ways, and expands it in other ways. For example, am I still referring to the animal, or the character in the movie?

To the best of my knowledge, most research in this area centers around heuristics and word frequency counts to derive a basic, arbituary and "unpublishable" meaning which then serves as a primary key, as opposed to documenting literal one-to-one word translations.

For example, Google recently published a whitepaper. The gist of the paper was such that if Google sees "black dog" and the name of a chemical used together frequently throughout the page, it can reasonably infer that "black dog" refers to a medical subject rather than the movie. On the other hand, "black dog" and Patrick Swayze likely refers to the movie.

I think Google refered to it as a sphere of association but I'm not certain.

One of the interesting tangents of the paper (and this is how I learned about it) was that once you've identified the sphere of association, you can give it a culturally appropriate name in each language and use it where needed.

For example, in the US, Patrick Swayze may be known as the Black Dog. In France, he may be known as the Devil Dog. In Russia, he may be known as the Road Wolf. I'm obviously making these names up but a pure lexical dictionary would have problems equating Black Dogs with Road Wolves unless you set out to enumerate every single possible meaning, and even then, you'd have difficulty building a searchable index.

Anyway, I didn't mean to ramble, I just personally find it fascinating to look at why technology is being used the way it is, in addition to how.
TheDavid
Tuesday, July 31, 2007
 
 
It is interesting.  I understand your concerns, but for this project, I'm not worried about meaning at all.  I don't want to sound all authoritative, but I've thought of these requirements beforehand.

Let me rephrase my original question.  Suppose I have a table called "table1" with four columns, each with a unique index:

ID | col1 | col2 | col3 | col4
1    a      b      NULL  NULL
2    NULL  NULL  c      d

and I know that "b" in col2 corresponds to "c" in col3.  So I'd like to run some command that makes the database look like this after I run it:

ID | col1 | col2 | col3 | col4
1    a      b      c      d

Sorry if this is harder to understand, but I think it gets to the heart of the issue better.
Sedate Snail Send private email
Tuesday, July 31, 2007
 
 
>In reality, I don't just have "dog" in the database; it looks more like....

Then why not use something like:

LanguageTable
-------------
LanguageID
Language

&

PhraseTable
-----------
ID
LanguageID
Phrase

Where the PK for PhraseTable is the pair (ID, LanguageID), presuming there is some sort of uniqueness there.

So you get
LanguageTable:
ID....Language
1.....English
2.....Spanish

&

PhraseTable:
ID..LiD...Phrase
1...1.....The Black Dog
1...2.....el perro negro
Peter Send private email
Tuesday, July 31, 2007
 
 
Well, you wouldn't use the "Insert", you'd use the "Update" keyword.  But first, you'd have to find what the equivalent English word was for your "WHERE" clause.  And before that, you'd have to find out if the equivalent English word was even in the table yet -- if it was not, THEN you'd use the INSERT keyword.
AllanL5
Tuesday, July 31, 2007
 
 
Okay, yeah.  I like your idea, Peter; it makes a lot of sense.  Taking it a step further, because I'd like to add tags about individual phrases (like "this phrase is a person's name") as well as weights for a translation (like 0.75: "black dog = perro negro 75% of the time"), I came up with this:

phrase_pair
-----------
src_phrase_id (combined PK)
dest_phrase_id (combined PK)
weight

phrase_text
-----------
phrase_id
phrase_text

phrase_lang
-----------
phrase_id
lang_id

language
--------
lang_id (PK)
lang_name

phrase_tag
----------
phrase_id
tag_id

tag
--------
tag_id (PK)
tag_name

I guess I don't need the INSERT/UPDATE query then, but thanks anyway, AllanL5.

And thanks to all of you for your responses!
Sedate Snail Send private email
Tuesday, July 31, 2007
 
 
For transitive closure of pairwise relations distinct attributes see keyword FULL OUTER JOIN.

Tuesday, July 31, 2007
 
 
We have been through a number of variations on this, our current one being:

CREATE TABLE [dbo].[Languages_Static](
    [LangID] [varchar](10) NOT NULL,
    [Description] [varchar](50) NOT NULL
)

For LangID, we use the ISO language codes: en-us, etc (http://msdn2.microsoft.com/en-us/library/ms533052.aspx)

CREATE TABLE [dbo].[LangMaster](
    [TextID] [int] NOT NULL,
    [TextEnglish] [nvarchar](255) NOT NULL,
    [TextLocation] [varchar](100) NULL,
    [TextDescription] [varchar](255) NULL
)

The master dictionary is in this table, this has a TextID autonumber, with the English text, the location in the application (URL, form or source file) and the description of the usage.

CREATE TABLE [dbo].[LangTranslation](
    [LangCode] [varchar](10) NOT NULL,
    [TextID] [int] NOT NULL,
    [LocalText] [nvarchar](255) NULL
)

The translation table holds the localized versions.

For phrases with varying insertion points, we use a system like:

{0:Number of foxes} foxes jumped over {1:Number of dogs} lazy dogs.

For a translation, it might become:

{1:Number of dogs} lazy dogs were jumped over by {0:Number of foxes} foxes.

This handles context (verb/noun) or whatever - each usage must result in a separate entry in the LangMaster.

You can tell when you are missing translations for a language by doing a LEFT JOIN from LangMaster to LangTranslation for a particular LangID.

We similarly use a LEFT JOIN in our SP to allow the app to default to the English translation if one has not been entered.
Cade Roux Send private email
Tuesday, July 31, 2007
 
 
'To use your example, you've got a "black dog" referring to the actual animal, the movie "Black Dog", the beer "Black Dog" and I would presume there's both a song and a book out there somewhere. Adding an article as in "the black dog" narrows it down a little bit in some ways, and expands it in other ways. For example, am I still referring to the animal, or the character in the movie?'

"No, not a lab [Labrador Retriever, not laboratory].  The Black Dog that I am referring to is a beer.  The Black Dog that I had yesterday was delicious."

Language translation is not a solved problem by any means.

Sincerely,

Gene Wirchenko
Gene Wirchenko Send private email
Tuesday, July 31, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz