The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Can the next major database not have column widths, please?

Maybe someone can provide the opposing argument, but I do not see any reason anymore for database columns to have widths. For that matter, why can't any row contain optional additional tagged pieces of information? Say 10% of the records need an additional ID associated, so why not stick these in the same table rather than creating an additional table?

In the "olden days" things were done very rigidly. FORTRAN programs had specific columns in which to enter instructions, but it really didn't serve a strong purpose, it was more a holdover from punch card days. CHAR columns space-padded the fields out to the column width.

But nowadays we might even be using different text encodings on the client and server so that a column width loses all meaning when a string has different byte widths in different encodings.

A row in a database table would generally save a lot of space if it was preceded by a small array of integers specifying field byte widths. Doesn't anyone else feel that reserving 32 bytes for every username, when the average length is 8, is less than efficient? And why should we have to go through convulsions over that one in a thousand case which requires a long value? When I create a database table it wants me to know the field widths, but I don't know! Why should I? It is an artificial constraint! (and please don't tell me to use a CLOB for usernames).
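The proposal can be sketched roughly in Python. This is a purely illustrative encoding (the field layout and helper names are made up, not any real engine's on-disk format): each row carries a small array of integer lengths, followed by the raw variable-length bytes.

```python
import struct

def pack_row(fields):
    # Illustrative layout: [u16 field count][u16 length per field][raw bytes]
    encoded = [f.encode("utf-8") for f in fields]
    header = struct.pack("<H", len(encoded))
    header += b"".join(struct.pack("<H", len(e)) for e in encoded)
    return header + b"".join(encoded)

def unpack_row(blob):
    (count,) = struct.unpack_from("<H", blob, 0)
    lengths = struct.unpack_from("<%dH" % count, blob, 2)
    offset = 2 + 2 * count
    fields = []
    for n in lengths:
        fields.append(blob[offset:offset + n].decode("utf-8"))
        offset += n
    return fields

row = ["jsmith", "John Smith", "555-1234"]
blob = pack_row(row)
assert unpack_row(blob) == row
assert len(blob) == 32   # vs. 96 bytes reserved by three CHAR(32) columns
```

For these three short values the length-prefixed blob comes to 32 bytes, against the 96 that three fixed CHAR(32) columns would reserve.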

Or how about having a row like a tagged XML fragment so that additional pieces of information could be added without affecting pre-existing logic?

Why is a database enforcing value lengths when that should be a business logic or UI decision anyway?

For those who feel we need column widths for performance I disagree -- knowing the field offset ahead of time is a very small part of the performance picture.

I ask this question in "Rowboat: A Database Design."
http://www.benpoint.com/database.htm
Thanks!
Ben Bryant
Thursday, August 03, 2006
 
 
I've started using string fields with 255 chars (the standard maximum) for my databases. I have also started adding some extra fields by default to all the tables. And by using object orientation and a data access layer I can change some of these behaviors. For instance, I could add tagging support to some tables.

With hard drives approaching terabytes, and the complexity of programs increasing, I think it's a timely adjustment.

I'm generally happy with my database abstractions.
Lostacular
Thursday, August 03, 2006
 
 
> Doesn't anyone else feel that reserving 32 bytes for every username, when the average length is 8, is less than efficient

Isn't that the difference between defining the column as "char(32)" or as "varchar(32)"?
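The difference can be sketched in Python terms (a sketch of the general CHAR/VARCHAR storage semantics, not any particular engine's implementation):

```python
# CHAR(32): the value is space-padded out to the declared width.
# VARCHAR(32): only the bytes actually present are stored.
username = "jsmith"
char_cell = username.ljust(32)     # what a CHAR(32) cell holds
varchar_cell = username            # what a VARCHAR(32) cell holds
assert len(char_cell) == 32
assert char_cell.rstrip() == username  # trailing padding stripped on read
assert len(varchar_cell) == 6
```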

> With harddrives approaching the terabytes ...

Performance is surely affected by how well the database's data fits in RAM.
Christopher Wells
Thursday, August 03, 2006
 
 
I didn't read the linked article.

However, you could ask programmers the same question: why do you still need to define integers and characters? Why not just go the Perl route and have the system determine what type of data it's dealing with?

The primary answer (to both questions) is performance. I'm sure you've heard it all before, so I'll move on...

A good secondary answer is that some data types do have value, particularly the SQL datetime data type. Unfortunately, there's no intuitive way to universally communicate "it can be any data type you want unless I explicitly say it's a datetime." Ergo, we're stuck specifying the column width for lack of a better solution.
TheDavid
Thursday, August 03, 2006
 
 
"Performance is surely affected by how well the database's data fits in RAM."

Yep, and that RAM needs to be put to good use. Some people have 4 GB of RAM and waste it trying to cache the whole database, which can easily exceed it anyway. Coarse-grained use of the database can cause lots of headaches with only a few gigs of RAM, compared to the terabytes on disk. I propose some fine-grained use of the RAM. :-)
Lostacular
Thursday, August 03, 2006
 
 
>> And why should we have to go through convulsions over that one in a thousand case which requires a long value?

Use variable length character strings, and increasing the size is just a tiny command away.
David Aldridge
Thursday, August 03, 2006
 
 
I'm sensing general agreement. Integer and date types are a different issue; I'm not saying throw away data types. If a column contains integers, then that column in every row would happen to be the same number of bytes, but this could still be implemented using a small group of integer offsets at the start of the row. It's just that text types, or any variable length type, could be truly variable. And to me, VARCHAR versus CHAR may involve a different implementation internally, but you are still forced to specify the maximum width of a VARCHAR field.
Ben Bryant
Thursday, August 03, 2006
 
 
Everything has a price, and with most software you pay a little more up front (development) hoping to reap the savings through thousands of executions.

For one-offs you use things like Variant or other weak types.
l belk
Thursday, August 03, 2006
 
 
I'll quote bits from the OP and respond to each:

"A row in a database table would generally save a lot of space if it was preceded by a small array of integers specifying field byte widths."

Why would specifying the width of fields in a second place (in the row, in addition to the table definition) save any space at all?

"Doesn't anyone else feel that reserving 32 bytes for every username, when the average length is 8, is less than efficient?"

Reserving too much space for the data is a big, big problem. This should have been prevented with good table design. It's a PITA to fix later.

"And why should we have to go through convulsions over that one in a thousand case which requires a long value? When I create a database table it wants me to know the field widths, but I don't know! Why should I? It is an artificial constraint! (and please don't tell me to use a CLOB for usernames)."

Couple of things here. One, by the title of your post and the first sentence in the quote, you do not understand how a relational database stores data. I'm a SQL server guy, here is a link to some details on 2005:

http://msdn2.microsoft.com/en-us/library/ms190969.aspx

As to the last part of the above quote (why should you & artificial constraint...), you need to have the data designed before you try to design the database. If you are storing a phone number, how many digits is that? Local numbers only? That's 7 digits (555-1234). Need the area code? That's 10 (555-555-1234). International? Now you need to throw in the country code.

Some things like names you may not know, which AFAIK is why they came up with VARCHAR...see here:

http://www.mssqlcity.com/FAQ/General/char_vs_varchar.htm

"Or how about having a row like a tagged XML fragment so that additional pieces of information could be added without affecting pre-existing logic?"

Well, there are certain types that are basically pointers to files, but what do you mean by adding it? Do you need to index it? Does the data repeat? Does it then throw the tables and data schema out of proper normal form? You could 'pack' the data:

1 = bool #1
2 = bool #2
4 = bool #3
etc...

but then you incur cost when you have to write it, as well as reading and updating. And finding the value of just one of the items in the 'packed' field would suck.
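That packing scheme can be sketched in Python. The flag names here are made up for illustration; each boolean occupies one bit of a single integer column, and reading any one of them back means unpacking the field.

```python
# Hypothetical flag names; each boolean gets one bit of one integer column.
FLAG_ACTIVE = 1   # bool #1
FLAG_ADMIN = 2    # bool #2
FLAG_LOCKED = 4   # bool #3

def pack_flags(active, admin, locked):
    return ((FLAG_ACTIVE if active else 0)
            | (FLAG_ADMIN if admin else 0)
            | (FLAG_LOCKED if locked else 0))

def has_flag(value, flag):
    # Finding the value of just one item requires bit-twiddling the field:
    return bool(value & flag)

packed = pack_flags(True, False, True)
assert packed == 5
assert has_flag(packed, FLAG_LOCKED)
assert not has_flag(packed, FLAG_ADMIN)
```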

"Why is a database enforcing value lengths when that should be a business logic or UI decision anyway?"

AH-HA! Here we go. Don't feel bad, as there are boatloads of folks out there that can't figure this one out. Think of yourself as one of those poor souls that Joel refers to as not having the part of the brain that enables you to understand pointers.

However, you're in luck as we will plant that in your head, so if you actually read all this gibberish I'm typing you'll get it.

The most important aspect of a database (IMO, YMMV) is data validity. If the schema is loose (missing foreign keys, bad types, etc) and the validity of the data is called into question, you're screwed.

There are two types of rules here. Data rules and business rules. The grand debate of where the business rules should reside I will leave for another day. And your application can certainly check for validation on the front end. But there should always be data validation occurring on the database side (in the database to which the data is entered. It's ok to sometimes eliminate that stuff from OLAP/reporting databases that are getting their data from a database that has the checks).

If you look at the discussions about RoR (DHH comes to mind) you may note that he admittedly deals mostly with 'application databases'. Great.

But the majority of databases that I deal with are 'integration databases'. What's the difference? Basically the whole N-tier structure is lax in the app db's because only one application will deal with the data.

So if you screw up the schema but the validation on the app catches it, you're ok.

But the stuff I deal with (and what I think most folks deal with in medium to large companies) I might have one or several applications hitting my database. I may have a reporting application hitting it. I might have any number of exports of that data to customers, to other db's and so on and so on.

And when that data comes out of that database it has to be right. No excuses.

I hope this answered some of your questions. :)
D in PHX
Friday, August 04, 2006
 
 
"Maybe someone can provide the opposing argument, but I do not see any reason anymore for database columns to have widths."

Your hypothetical database would have bad performance. If the DBMS knows ahead of time what the data type is, then it can pack a fixed number of rows onto a 4K page and not have to jump through as many hoops figuring out where one piece of data ends and the next starts. You give up some flexibility up front so that the database can optimize resources and perform better.
CIFChamp
Friday, August 04, 2006
 
 
CIFChamp, this is the usual assumption, that performance is greatly aided by knowing the column width, but it is not true. With all the inertia behind this assumption, the best I can do here is say that from reading about things like the SQL Server row overflow data and other variable char field and database page implementations, my opinion goes against it. To see my plan for performance, look at my row block design.
Ben Bryant
Friday, August 04, 2006
 
 
D in PHX, thanks for the point by point response. You ask why specifying field widths in both the row and the table definition would save space, but to avoid a long discussion, I am essentially only suggesting that the offsets or lengths for variable fields be placed in each row, which is not entirely different from current practice. Thanks for the links; it looks like recently (SQL Server 2005) this overflow data unit does something similar to the row block mechanism I suggest (not that I invented it, it is standard computer science), but it is handled as an exception or an overflow, and maximum widths are still specified. In general, rows-per-page calculations are affected by the varchar sizes. Your telephone number example is a great example for my case. Why should the developer have to know the max telephone number width at database design time? You can have validation features without making max field width part of the core storage technology -- and field width is only one of many validity checks, so making it part of the core doesn't gain you anything. If you want to design a column as fixed width, that's fine; it doesn't go against my point.

In XML you tag the city value without even thinking of what the maximum city length is. Why should you? Especially when text width doesn't mean much anymore when you move beyond fixed byte encodings. It seems obvious now that the evolution of databases will be to no longer require the max width for columns.
Ben Bryant
Friday, August 04, 2006
 
 
The time when it's especially useful for a column to have a finite length might be when the column is indexed and used for joins: because you want your index data to be contiguous and small enough to fit in memory.
Christopher Wells
Friday, August 04, 2006
 
 
>> I'm sensing general agreement.

I don't think so. Variable length character types already exist in database and are probably more commonly used than fixed char types. They are easy to expand if needed.

Data lengths are there to protect your data from nonsensical additions, such as addresses that are 2000 characters long. A great many RDBMS features that people see as hindrances are actually there for protection -- remember that a database is not just a data dump, it is also a guarantor of data integrity, so the data is an accurate representation of the real world.

Here's a relevant question at AskTom:

http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:54274298311615
David Aldridge
Friday, August 04, 2006
 
 
Average column name length = 8? WTF? C'mon guys, gals, in-betweens, have mercy on the poor soul who comes after you. Typing a few extra characters is not onerous, really, and modern computers do have that nifty "cut-and-paste" feature now.

Seriously, 30 or 32 characters is usually enough to create names that convey real meaning to those who come after you. It's hard enough to pick up someone else's code.
Steve Hirsch
Friday, August 04, 2006
 
 
Personally, I think it's very helpful in a database for fields NOT to have a variable size.  If they did, you'd have to read every record in order to find out where the next record began.

Or maintain an 'index of offsets' somewhere, which would have to be updated every time ANY insertion was done.

Since database speed is still a limiting factor in many applications, this "space saving" probably would have an unacceptable impact on speed.
AllanL5
Friday, August 04, 2006
 
 
Eventually you will need to distribute database handling across several machines. If it could be done cheaply, hardware expense only, it wouldn't be a problem. The question is how far we are from this inexpensive dream. :-) I'm betting that in the near future we will have some general solution for this, and then micro-managed databases will no longer have a reason to exist.

Let's say, 10 servers providing redundancy/distribution, with 1 Gbit ethernet connecting them. Some 8 GB of RAM per server. I think we are getting closer to this, and probably some prototypes/early releases already try to do it.

That is, one should not need to be a Google or to have to partner with Microsoft to have such a solution.
Lostacular
Friday, August 04, 2006
 
 
Ben, I'm not really sure where to start with your initial post, and with your followups.  Others have tried above, but I don't think you are listening or getting it (I don't mean that in a rude way, but you seem to be missing a lot).

Your issues are not new.  But this is not the days of IMS fixed width data segment layouts.  I can't speak about SQL Server specifically (I do more Oracle), but I'm sure it is similar.

When you specify varchar(30) and only put the letter 'a' in there, it only takes up one byte.  Just like when you specify number(10), for a max ten digit number, and only put the number 5 in there, it will take up less space than a larger 10 digit number would.

If you want to "reserve" a block of space, and use it all, padding as needed, then you use the char type.  Plus, when you specify varchar(30), do you mean 30 bytes or 30 characters in the (possibly multibyte) character set you are dealing with?

None of these issues are new, and they are all dealt with quite elegantly in Oracle (and I would imagine many other databases like sql server etc).  I'd suggest reading about how Oracle uses language support, and about data types and storage in some of their manuals.  Likewise with other dbs.
smartin
Friday, August 04, 2006
 
 
Oh, and the second part of your original post, about 10% needing an extra id, sounds like you are talking about the issue of normalization (although you may mean something different?).

You are certainly free to concat two different data values and store them in the same column.  But that takes you back a step in the normalization rules.

But if you want to be normalized (at least 3rd), then just join to some other table as the need arises.  Joins are what a RDBMS is built to do.

And if you want to store XML, Oracle has data types specifically for that, and ways to easily manipulate specific attributes out of an XML column.
smartin
Friday, August 04, 2006
 
 
>> Personally, I think it's very helpful in a database for fields NOT to have a variable size.  If they did, you'd have to read every record in order to find out where the next record began.

Or maintain an 'index of offsets' somewhere, which would have to be updated every time ANY insertion was done.

Since database speed is still a limiting factor in many applications, this "space saving" probably would have an unacceptable impact on speed. <<

This has already been thought out over many years by the database vendors, and they make extensive use of variable lengths because it addresses the balance of speed and flexibility better than the alternatives.

Here's how Oracle does it

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14220/logical.htm#CNCPT302

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref704
David Aldridge
Friday, August 04, 2006
 
 
Is it just me, or is this thread a lot of smoke and noise about nothing?  Major DBMSs like Oracle, SQLServer, etc have had variable length fields and nullable fields since like forever and have long since figured out how to store and access this kind of data with good performance and effective space utilization.
Mike S
Friday, August 04, 2006
 
 
Thanks, David, I was too lazy to look up the references :)
Mike S
Friday, August 04, 2006
 
 
I don't think it's a bunch of smoke.

It's more like people that most likely should not be doing database work are.

There are a hell of a lot of people writing SQL code who should only be touching the db thru the DAL.
D in PHX
Friday, August 04, 2006
 
 
Given the choice, lots of programmers would not like to touch the RDBMSs at all. :-)

For example, Google created something they call a big, spreadsheet-like technology, or BigTable for short. And it too can store more information per database than you might ever want.
Lostacular
Friday, August 04, 2006
 
 
You're right. Most people are disagreeing, or at least I haven't communicated my position well. The first response was along the lines of going with 255 and seemed to concur on the problem with specifying column widths, and then I saw the discussion of varchars and dates as a divergence. But it is pretty clear at this point my argument is not making headway.

"When yo specify varchar(30), and only put the letter 'a' in there, it only takes up one byte"

To me this shows a disregard for the realities of how paging is implemented in the database. But ultimately, yes, I agree that varchar implementations save storage space with some insignificant performance tradeoffs, akin to the row block implementation I recommend.

"You are certainly free to concat two different data values and store them in the same column"

I'm not talking about hacks to work around common database design, I am trying to suggest a new design.

"Oracle, SQLServer, etc have had variable length fields"

Do you mean like LONG and CLOB? If so, I would partially agree, only that in normal everyday usage you create a table like street1(50), street2(50), city(50), zip(10), whereas I suggest we could be doing street1, street2, city, zip instead.

"Data lengths are there to protect your data from nonsensical additions"

That strikes me as an arbitrary benefit. There are so many data integrity issues not solved by the data schema, but anyway I won't repeat what I said about validation earlier.

"useful for a column to have a finite length might be when the column is indexed"

Yes, but this is a table schema choice, again not necessary to be part of the core database implementation.

I really appreciate all the feedback, even if you disagree with me!
Ben Bryant
Friday, August 04, 2006
 
 
Search for multi-valued, multi-dimensional, post-relational or Pick databases. It was done in the 80's and 90's. It was very, very flexible. Unfortunately the flexibility led to chaos unless the system was managed and maintained by inflexible people. Access speed was maintained by indexes external to the files.

The technology could not compete against rdb's and died. The only survivor I know of is IBM UniVerse.
Phil C
Friday, August 04, 2006
 
 
My understanding is that even when varchar is used, all in-memory processing of db data is done in fixed length columns (and thus fixed length rows).  Data is stored in shortened form only on disk.  (The shortened disk records have the benefit not only of taking up less disk space, but of lowering disk I/O.  But even with variable length records on disk, all are expanded into fixed length records when read into RAM.)

The need for in-memory processing in fixed length columns is for speed.  If you don't have fixed length columns and, for example, an index lookup tells the db it needs to return the 366,332nd record, then the db will have to do a brute force search through records to find which is actually the 366,332nd record.  With fixed length records the db can just calculate a simple offset and go to the 366,332nd record immediately.
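That offset arithmetic can be sketched in Python. The record format here is made up for illustration (a 4-byte id plus a 28-byte padded name, 32 bytes per record); the point is that the n-th record is a single computed seek away, with no scanning.

```python
import io
import struct

# Illustrative fixed-width format: 4-byte id + 28-byte name = 32 bytes/record.
RECORD = struct.Struct("<i28s")

buf = io.BytesIO()
for i in range(1000):
    # struct null-pads the 28s field out to its fixed width.
    buf.write(RECORD.pack(i, ("user%d" % i).encode("utf-8")))

def read_record(f, n):
    # With fixed-length records, record n is at a simple computed offset:
    f.seek(n * RECORD.size)
    rec_id, name = RECORD.unpack(f.read(RECORD.size))
    return rec_id, name.rstrip(b"\x00").decode("utf-8")

assert read_record(buf, 366) == (366, "user366")
```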

I'm sure someone will correct me if I'm wrong, but that's my understanding.  What the OP is asking for is basically to have variable length records, not just on disk but in RAM, and that would both (1) be a big change from the way db's are made right now, and (2) introduce horrific performance problems.
Herbert Sitz
Friday, August 04, 2006
 
 
For the record:

"BigTable is a system for storing and managing very large amounts of structured data. The system is designed to manage several petabytes of data distributed across thousands of machines, with very high update and read request rates coming from thousands of simultaneous clients."
http://glinden.blogspot.com/2005/09/googles-bigtable.html
Lostacular
Friday, August 04, 2006
 
 
> the db will have to do a brute force search

You could have an array of [fixed-sized] pointers to the beginning of each record: one extra level of indirection.
Christopher Wells
Friday, August 04, 2006
 
 
"You could have an array of [fixed-sized] pointers to the begining of each record: one extra level of indirection. "

How would that work?  Say you've got a db with 1M records.  Then a change is made that affects the length of the 15th record (and say that record is also the 15th record in your array of pointers).  Won't the pointers for the 15th record _and_ each of the following 999,985 records need to be updated?  And this is going to have to be done each time any record is changed?

Maybe I'm wrong in pointing out a possible failing in the array-of-pointers approach above, and in any case I'm sure there are various ways to make the search for variable-length records in RAM more efficient than a pure brute force search.  But I doubt whether any of them would approach the speed of having fixed length records.

I suspect the db designers have considered the pluses and minuses of these different design decisions in trying to come up with the best blend of performance and ease of use.  Having to specify column lengths doesn't seem like a big problem to me.  I'm happy to do so if it has the benefit of much faster db operation, which I believe it does.
Herbert Sitz
Friday, August 04, 2006
 
 
"I believe"

I think it was Tim Bray who said recently that we shouldn't assume so much.
Lostacular
Friday, August 04, 2006
 
 
> How would that work?

Only since you ask:

* If the record gets smaller then you just let it shrink: leave a hole in memory where its end used to be; don't even update the pointer to it (since its location is unchanged).

* If the record gets bigger then move it into some new memory, update the (single) pointer which points to it, and leave a hole where it used to be.
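That scheme can be sketched in Python as a toy slot directory (a dict stands in for raw memory; class and method names are made up for illustration). Records are addressed by stable slot numbers through one level of indirection, so a move updates only a single pointer:

```python
class SlotStore:
    """Toy slot directory: slot number -> location -> bytes."""
    def __init__(self):
        self.heap = {}       # location -> bytes (stand-in for raw memory)
        self.slots = []      # slot number -> location
        self.next_loc = 0

    def insert(self, data):
        loc = self.next_loc
        self.next_loc += len(data)
        self.heap[loc] = data
        self.slots.append(loc)
        return len(self.slots) - 1   # stable slot number

    def update(self, slot, data):
        old_loc = self.slots[slot]
        if len(data) <= len(self.heap[old_loc]):
            self.heap[old_loc] = data    # shrink in place; pointer unchanged
        else:
            del self.heap[old_loc]       # leave a hole where it used to be
            loc = self.next_loc
            self.next_loc += len(data)
            self.heap[loc] = data
            self.slots[slot] = loc       # update the single pointer

    def read(self, slot):
        return self.heap[self.slots[slot]]

s = SlotStore()
a = s.insert(b"short")
b = s.insert(b"other")
s.update(a, b"a much longer value")     # grows: moved, one pointer updated
assert s.read(a) == b"a much longer value"
assert s.read(b) == b"other"            # other slots untouched
```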
Christopher Wells
Friday, August 04, 2006
 
 
"It was done in the 80's and 90's. It was very, very  flexible. Unfortunately the flexibility lead to chaos..."

Phil C, this is the best point I've seen here and thanks for the names: "multi-valued, multi-dimensional, post-relational or Pick databases" I will look into them to see how they relate to my proposal.

As for the points from Sitz and Wells, I have thought a lot about these issues, and again I touched on them in the recommendation in my article, but it is better argued with a whiteboard and is very time consuming here. Again, your assumptions about performance and fixed widths versus offsets come up, which would take a whole book to argue. I don't see us coming to an understanding about performance and implementation specifics here, but regarding finding the 366,332nd record, I would mention that queries generally don't depend on the database order of records.
Ben Bryant
Friday, August 04, 2006
 
 
Christopher -- Neat, but I see that requires big changes to the way things would work, of which I can get only a glimmer.  Changes that would require more overhead (vs. an offset lookup) and make things slower.

Ben said: "regarding finding the 366,332nd record I would mention that queries generally don't depend on database order of records."

Well, if you're using indexes (which as a practical matter is the only way to get efficient performance from a relational db) then locating records responsive to the query definitely depends on the database order of records.  An index is used to find the offset or location info of responsive records; that info is used to go directly to the records and accumulate them into a set that's responsive to the query.  (Then if the query has conditions that aren't indexed, a brute force search will proceed, but only in the subset of the table that has already been identified by the index.)

But really, don't you think that relational database designers have pretty much weighed the pros and cons of requiring designs that have fixed length records and decided that the pros outweigh the cons?  I mean, some of those guys are pretty smart cookies. . . .  We're not likely to come up with anything in this thread that they haven't thought through forward and backward many times before, and come out on the side of fixed length records.
Herbert Sitz
Friday, August 04, 2006
 
 
Actually, the problem with the array of pointers to the beginning of each record is not when the record size changes, but when a record is deleted, because then the 366,332nd record is the 366,333rd in the array. But anyway, that has got nothing to do with my proposal which recommends "row blocks" inside of pages. We're not talking about variable width rows slapped next to each other, but linked lists of row blocks inside of database blocks or pages. The performance characteristics have been considered.
Ben Bryant
Friday, August 04, 2006
 
 
Herbert Sitz, I mention indexes, and of course they use direct pointers to the records, not record numbers. A record number is pretty much useless when a preceding record is deleted.
Ben Bryant
Friday, August 04, 2006
 
 
"those guys are pretty smart cookies" no doubt but they are solving problems within broad guidelines shaped by the technology they are building on top of. The main problem with my proposal is the shift in usage and incompatibility with established systems and techniques. However, the shift has already begun like in the XML example I mentioned where essentially data schemas are being created on the fly without regard to the artificial constraint of a maximum field width. The technology will lag behind the usages, but I believe the shift is well underway.
Ben Bryant
Friday, August 04, 2006
 
 
Indexes don't use pointers in the sense of a pointer to a memory location.  Load a table into memory today, and records go to some arbitrary location.  Load it in tomorrow, and they end up somewhere different.  The same index is used in both cases.

Any "pointer" that an index uses has to be an offset from some base location (e.g., from physical beginning of table).  Indexes don't have anything to do with "record numbers" (which technically have no meaning within a relational db), but the nitty gritty of indexes definitely does have to do with locating records based on where they lie in a table's physical order.
Herbert Sitz
Friday, August 04, 2006
 
 
"Indexes don't use pointers" okay it is semantics. The offset "points" to the record. The offset is not from the beginning of the table, it is a combination of file block number and record block number within that file block. I thought you were referring to essentially the record number based on the database order of records which noww you've agreed is useless, so I must have missed that one.
Ben Bryant
Friday, August 04, 2006
 
 
Herbert, I see you did refer to "the offset or location info of responsive records," which is what I meant when I said pointer, i.e. a direct way of referencing the record. I was in a hurry to distinguish it from the notion of record number based on database order of records, which is what I thought you were talking about, but I should have read more closely. Thanks.
Ben Bryant
Friday, August 04, 2006
 
 
> multi-valued, multi-dimensional, post-relational or Pick databases

Multi-dimensional databases are now quite a big deal in the OLAP/data mining crowd. They aren't dead yet.
son of parnas
Friday, August 04, 2006
 
 
""Indexes don't use pointers" okay it is semantics. The offset "points" to the record."

Yes, semantics.  But since Christopher W. was talking about an array with pointers that were, I assume, _real_ pointers (i.e., to absolute memory addresses), seemed like it made sense to clarify.
Herbert Sitz
Friday, August 04, 2006
 
 
I used wikipedia to get the basics on "multi-valued, multi-dimensional, post-relational or Pick databases" and none of them are anything like my proposal. They move away from the relational model which is where I am firmly planted.

Probably the first database example everyone does is something like:

CREATE TABLE customer
(
CUSTID NUMBER,
NAME VARCHAR2(30),
PHONE VARCHAR2(10),
ZIP VARCHAR2(10)
)

The most basic part of what I am suggesting is that in the future this should be something like:

CREATE TABLE customer
(
CUSTID NUMBER,
NAME VARCHARX,
PHONE VARCHARX,
ZIP VARCHARX
)
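As it happens, SQLite already works roughly this way: declared lengths on text columns are not enforced, TEXT columns have no maximum width, and rows are stored variable-length. A quick check in Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# No declared widths on the text columns:
con.execute("CREATE TABLE customer (custid INTEGER, name TEXT, phone TEXT, zip TEXT)")
con.execute("INSERT INTO customer VALUES (?, ?, ?, ?)",
            (1, "A" * 10000, "555-1234", "98101"))
# The 10,000-character name goes in without any width ever being declared:
(n,) = con.execute("SELECT length(name) FROM customer").fetchone()
assert n == 10000
```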

No one here has responded to the issue of text encoding, but when you encode a Far Eastern name in a field with a column width, you are forced at table creation time to consider the repercussions of your column width decision in future client and server encoding choices (you can have a double byte client accessing a UTF-8 server etc), but I contend there is really no value to being forced to make this decision at this time. The database should not enforce an arbitrary text memory size when data stores like XML do not. Table schema developers should not have to worry about what encodings the text might end up passing through, but a column width could end up forcing the issue back on them.
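The encoding point is easy to demonstrate (the four-character name below is hypothetical):

```python
# The same string occupies a different number of bytes in each encoding,
# so a byte-based column width means different things to different clients.
name = "東京太郎"                            # a hypothetical 4-character name
assert len(name) == 4
assert len(name.encode("utf-8")) == 12       # 3 bytes per character here
assert len(name.encode("utf-16-le")) == 8    # 2 bytes per character
```

A column sized in bytes therefore holds a different number of characters depending on which encoding the data passes through.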
Ben Bryant
Friday, August 04, 2006
 
 
So how large can a name be?  2 million bytes?  No, that's silly. Or is it?  Define how large it should be.  On Oracle you can specify characters or bytes for the column width.  Yes, if you are using multibyte characters you need to take that into consideration.  (Oracle handles the code page translation provided you choose a code page in the database that is a set or a superset of what the client wants, e.g. UTF-8 to Western European 8-bit.)

The fact that XML has unrestricted character widths is irrelevant.  It is sloppy hacking.  Again, how big is a name?  If I don't know what you are limiting your name width to, how can you easily deal with another system?  If you want your point of sale system to talk to my banking system, we had better agree on the maximum length of a name!

How do you prevent stuffing a field with data it isn't intended for?  If I allow gigs of space for a name - which we agree is way more than is necessary - then the user can stuff all sorts of crap there that has nothing to do with the name field.  Your data and application quickly lose value.

Also, disk space isn't free.  It is cheaper, but it isn't free.  In larger systems we are dealing with SANs and disk redundancy and backup.  (Backing up to tape is still quite common, and slow.)  Unless the data is trivial it has to be backed up.  Size does matter.

I can see a lot of downsides to unrestricted field lengths.  I don't see any upsides (other than that the developer can ignore a business requirement, or doesn't have to type varchar(50) vs. char).
Jim Send private email
Friday, August 04, 2006
 
 
Jim, I am not saying unrestricted name lengths, only that the database core is not the place to restrict it. It is a business logic issue, as is the fact that a phone number should contain digits in a particular format and a U.S. state should belong to a set of accepted values.

Speaking of B2B, at a more granular level, one business may have constraints different from another. These are not managed by the database schema, these are managed in the code that does the data exchange.
Ben Bryant
Friday, August 04, 2006
 
 
The advantages for me are that Ruby has only a handful of data types, and they are basically limitless by default. Also, my own frameworks work with the native Ruby data types, so I don't have extra work. That said, the overall productivity gain allows me to create more applications in parallel. Also, the same application can be used by more than one customer, and if it requires some customization, it had better be as painless as possible.

This development environment is basically free of extra licensing, so I don't use SQL Server by default, for example. My default database choice is Firebird, because it has good support for my locale, its license is friendly, and I already know it quite well because I used to work with its precursor, Interbase.

So far, I have 5 web projects and maybe 8 desktop projects. All the web projects are database oriented. And most of the desktop projects are database oriented as well. This is only the beginning.

Besides the varchar(255), I use longint and double float (currency).

All of this because I remember having to adjust some of those limits in a Delphi/Interbase project and it was painful. This time I want to err on the other extreme. :-)
Lostacular
Friday, August 04, 2006
 
 
unrestricted name length reminds me of a scene in a comedy (Monty Python?) where I think it was a news reporter interviewing a guy with a very long name and he would commit himself to saying the entire name properly -- I think the guy died before he could finish...

But I mean really, why do you have to pull a number out of a hat like 100 bytes is all I am allowing for a name. Then you get some rare exotic name in Unicode involving a bunch of diacritics and combining characters and cause somebody some grief when they are trying to use your system properly. Column width is not a way of stopping people from entering the wrong data into your column anyway. If users are entering a million bytes in the name field, I think there is something wrong with your data entry.
Ben Bryant
Friday, August 04, 2006
 
 
"You want your point of sale system to talk to my banking system we better agree on the maximum length of a name"

no, if your side requires a limit I will conform my data to it when I send it, but my side is only concerned that the data is GOOD data, i.e. the names you give me are correct. There is nothing that says I should be concerned about the max length. This is not a hack. On the contrary, enforcing a maximum length is the hack in this case.
Ben Bryant
Friday, August 04, 2006
 
 
Now I get it: have the application code the limit, not the database.  Sorry, that is a loser.  If I have 10 applications that need to use the data, not an uncommon requirement, then I have to code the limit into 10 applications and test it in 10 places.  If I code it in the database once, then everything else has to use it.  Sorry, but been there, done that.  It is a bad idea to rely only on the application for such limits.  Data outlives the application that uses it.
Jim Send private email
Friday, August 04, 2006
 
 
Yep. But I do have unit testing, system testing and DAL. So I could test the rules with one click.
Lostacular
Friday, August 04, 2006
 
 
And my biggest program is all this development environment. Take some, give some. No one can win everything.
Lostacular
Friday, August 04, 2006
 
 
Jim, if you're going to rely on the database, you still have to code in 10 places to test the result code from the database and truncate the data or otherwise resolve the situation.
Ben Bryant
Friday, August 04, 2006
 
 
Let me see if there's some middle ground here.

Certainly it's absurd to code everything as varchar(4000) or some such, because some doofus somewhere will put in a 4000-character name, just to muck up the speed of your system (or more than just the speed if they do it in several fields in the same row).

On the other hand, who among us hasn't felt the irritation of entering a 12-character password and getting back "Sorry, your password has to have 8 to 10 characters."  That kind of limit is just plain dumb.  The problem is too-small limits.

Now suppose that a database distinguished between "soft" limits and "hard" limits.  A varchar(50) field (we'll call it LastName) could have either a soft or a hard limit.

A hard limit would mean you get fifty characters and that's it (just like now).  A soft limit would mean that when a guy comes along with a 60-character last name, the database would make an exception and let him put it in, perhaps in sort of a probationary status.  Then from time to time, whoever maintains the database could check out these exceptional records and see if they warrant raising the maximum limit of the field.  Or perhaps an alert would be sent out that a field limit was exceeded, allowing you to take a look and judge whether it warranted a change in field limits.

Obviously you can already change the size of a field.  The difference here would be that your field limit wouldn't absolutely prevent somebody from putting in longer data.  So in our password example, if some dolt set the password field as a varchar(10), but it had a soft limit, people wouldn't be prevented from entering longer passwords; and the DBA or whoever would receive notifications that the field might not be long enough.  Perhaps the database would keep track of the percentage of entries per field exceeding the soft limit; the occasional too-long entry could be ignored (and left as an exception), whereas if the number grew, it would suggest raising the standard limit.
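
One way Kyralessa's soft/hard distinction might behave in code (an editorial sketch with hypothetical names, not a feature of any existing database): a hard limit rejects over-length values outright, while a soft limit accepts them but flags them for later review.

```python
# A "soft" limit accepts over-length values but flags them for review;
# a "hard" limit rejects them outright, like varchar(n) does today.
def check_field(value, limit, hard=False):
    """Return (accepted, flagged_for_review)."""
    if len(value) <= limit:
        return True, False
    if hard:
        return False, False  # rejected outright
    return True, True        # accepted on "probationary" status

# A 60-character last name against a 50-character limit:
print(check_field("a" * 60, 50))             # (True, True)  soft: let it in, flag it
print(check_field("a" * 60, 50, hard=True))  # (False, False) hard: reject
print(check_field("Smith", 50))              # (True, False) within limit
```

The flagged records are what the DBA would periodically review to decide whether the declared limit should be raised.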

What do you think?
Kyralessa Send private email
Saturday, August 05, 2006
 
 
Kyralessa, I agree that real-world answers often lie in compromises between extremes; however, I don't believe that my proposal is the extreme. Maybe those other types of databases, like multi-dimensional, are the extreme, with value only in specific circumstances. But I am going after the most common database usage, as I showed in the CREATE TABLE example above, starting with a small improvement relating to string declarations (no max width).

In the "olden days", when I was coding up a function in C, I would declare a string buffer with a length and always be mindful not to overrun the buffer. Nowadays I use a string class and do not check the length except when the circumstances (the UI, the business logic) require it. People are developing XML schemas generally without specifying limits on field sizes. It appears that database developers are so far just too hooked on column widths to see that they are not necessary as part of core database usage.
Ben Bryant
Saturday, August 05, 2006
 
 
"It appears that database developers are so far just too hooked on column widths to see it is not necessary as part of core database usage."

Why do you think it has anything to do with database developers?  These are the people who use databases.  Of course they know that they could have a database designed so it doesn't require specification of column widths.

It's the database _designers_ who have chosen to implement databases with column width requirements.  At least some of the reasons for this are easy to understand and have been mentioned in this thread.  I assume there are others.

By the way, there have been databases that don't impose column width requirements for a long time.  They're generally known as "free form" or "text" databases.  AskSam is a well-known example that's been around for at least twenty years.  Anyone interested can try out AskSam and see how it works for them.  My guess: they'll find that AskSam is great for a limited set of applications, but that its performance suffers when compared with conventional rdb's when used for typical database applications.
Herbert Sitz Send private email
Saturday, August 05, 2006
 
 
"there have been databases that don't impose column width requirements for a long time"

Even Oracle does not impose column widths on LONG (~32K) and CLOB fields. So you have the choice of doing what I am suggesting within an Oracle database, though it is not built for that.

Again going to the C/C++ programming analogy, in the past you were accustomed to:

struct X
{
  char szName[32];
  char szZip[11];
};

But now you often use something like:

struct X
{
  CString csName;
  CString csZip;
};

And you don't hear programmers complaining that they aren't guaranteed the strings will be less than a certain length. Yes a database often is more decoupled from the program than a structure. But if your system is open to "some doofus somewhere [to] put in a 4000-character name" then the field length is the least of your worries.

I think database users think there is no other way just because they are accustomed to it.
Ben Bryant
Saturday, August 05, 2006
 
 
>> I am not saying unrestricted name lengths, only that the database core is not the place to restrict it. It is a business logic issue, as is the fact that a phone number should contain digits in a particular format and a U.S. state should belong to a set of accepted values.


Then your argument is no different from those who say that databases should have no constraints at all -- unique, foreign key, check, etc. You just want to transfer the inconvenience of column length restrictions from one layer to another, and for some reason you think it is easier to change the maximum length allowed in every application that accesses the database, rather than in the one place common to all those environments.

It's an argument that treats databases as bit dumps, and not as the guardians of data integrity that they really are.

For some reason, something like 90% of Joel readers just don't "get" relational databases - not cool and modern enough or something, I expect. If you don't understand it, don't touch it -- leave it to the professionals.
David Aldridge Send private email
Monday, August 07, 2006
 
 
"For some reason, something like 90% of Joel readers just don't "get" relational databases "

In my opinion it's just a different kind of approach. Some companies are built with a handful of people only, not hundreds of corporate drones, thus they need to improvise more often than not. For instance, some small companies create and support several programs, despite having only 4 to 6 people in their teams of diverse backgrounds. That's very much contrary to the idea of building only one program well enough in the most professional possible way.

If you can create web applications and web sites in a small team the way the teachers teach in universities, then all the power to you.
Lostacular
Monday, August 07, 2006
 
 
"Then your argument is no different to those who say that databases should have no constraints at all"

How on earth did you come up with that? The database doesn't test the digits in your phone number, why should it check maximum length of a name? Key and table constraints and things to do with data integrity are certainly part of the database. Keeping names under 30 bytes is certainly not part of data integrity. As I pointed out, n bytes doesn't even have meaning when you involve two different text encodings. And no you aren't pushing more work on the app if you don't restrict column widths in the database -- do a use case.
Ben Bryant
Monday, August 07, 2006
 
 
"It's an argument that treats databases as bit dumps, as not as the guardians of data integrity that they really are.

For some reason, something like 90% of Joel readers just don't "get" relational databases - not cool and modern enough or something, I expect. If you don't understand it, don't touch it -- leave it to the professionals."

+1 David! You hit the nail on the head.

"The database doesn't test the digits in your phone number, why should it check maximum length of a name?"

Maybe it should. Does your database or application allow 1 digit zipcodes? How useful is a 2 digit phone number to you?

"Keeping names under 30 bytes is certainly not part of data integrity."

The size of the column itself may not play into your data integrity rules, but it sure as hell affects the database performance.

I honestly think if you took a step back and read up on how modern RDBMS's work you would understand how crazy this whole thing is.

You're making tradeoffs that no one knowledgeable about databases would make. The sad fact is that databases and their data typically have a longer lifecycle than your applications, so if you design them this way, a lot of people are going to have to deal with these poor decisions for a long, long time.
D in PHX Send private email
Monday, August 07, 2006
 
 
>>The database doesn't test the digits in your phone
>>number, why should it check maximum length of a name?"
>Maybe it should.

The problem is the hierarchy of validation involved. The database can handle simple declarative validation, but it can't handle more complex validations that require rule resolution and/or integration of disparate information sources.

Unless you move the application into the DB you get a sort of schizophrenia about what resides where, why it's where it is, and how it's implemented.

Moving validation to the rear end of the data chain either requires duplication (because a phone number on a form can't wait to hit the database for validation, and I will probably use a phone number service to validate the number anyway), or requires hitting the database for validation, which doesn't scale, or requires making the database do very little validation and moving all validation into the behaviour layers of your application.



> I honestly think if you took a step back and read up on
> how modern RDBMS's work you would understand how crazy
> this whole thing is.

Maybe people have and they don't like everything they see?
son of parnas
Monday, August 07, 2006
 
 
"It's an argument that treats databases as bit dumps, as not as the guardians of data integrity that they really are."

Could you be confusing data integrity and data validity?

"Does your database or application allow 1 digit zipcodes? How useful is a 2 digit phone number to you?"

The database doesn't stop a 1 digit zipcode in a 10 byte text column. You seem to be implying that your database is validating your data, but with varchar2(10) it is just limiting your field to 10 bytes, which might appear to help but you've still got to check or limit the length in other places like the data entry code, and do all the other validation.

"You're making tradeoffs that no one knowledgable of databases would do"

You keep saying I know nothing about databases and I should read about them and I heard you the first time. If you really feel this debate is beneath you then you've got your point across. Thank you for your responses that kept to the actual subject.
Ben Bryant
Monday, August 07, 2006
 
 
"The database can handle simple declaritive validation but it can't handle more complex validations that require rule resolution and/or integration of disparate information sources."

What complex validations? The length of a piece of data?

"Moving validation to the rear end of the data chain either requires duplication..."

No it doesn't. I could really care less about the validation that the developers tell me occurs on their end.

Having those same rules in place in the database is not duplication IMHO.

"You keep saying I know nothing about databases..."

I wasn't trying to say that you know nothing about them. But it appears that you don't understand how the databases are handling the data as you would then see the need to define the types.

Nothing personal, honest. :)
D in PHX Send private email
Monday, August 07, 2006
 
 
>> The database doesn't test the digits in your phone number, why should it check maximum length of a name? <<

It does if you're serious about data integrity. You put phone numbers into a char(12) or a varchar(12), and do something like ...

Check (translate(phone_number,'0123456789','9999999999') = '999 999-9999')

... for example. You do not allow people to put in something like "customer didn't know it".
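
David's TRANSLATE trick maps every digit to '9' and then compares against the template, so it is effectively a pattern test. In application code the equivalent check might look like this (an editorial sketch; `valid_phone` is a made-up name, and the "999 999-9999" template is taken from his example):

```python
import re

# Equivalent of:
#   CHECK (translate(phone_number,'0123456789','9999999999') = '999 999-9999')
# i.e. exactly three digits, a space, three digits, a dash, four digits.
def valid_phone(phone):
    return re.fullmatch(r"\d{3} \d{3}-\d{4}", phone) is not None

print(valid_phone("206 555-1212"))             # True
print(valid_phone("customer didn't know it"))  # False
print(valid_phone("206-555-1212"))             # False, wrong separator
```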

>> Key and table constraints and things to do with data integrity are certainly part of the database. Keeping names under 30 bytes is certainly not part of data integrity.<<

How about keeping them under 500 bytes, or 80 bytes, or one million bytes?

>> As I pointed out, n bytes doesn't even have meaning when you involve two different text encodings.<<

So use a database that handles extended character sets properly.

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14225/ch1overview.htm#sthref38

I'm sorry if you've had bad experiences in the past from designers underestimating the appropriate maximum length, but that is a design problem, not one relating to the underlying technology.
David Aldridge Send private email
Monday, August 07, 2006
 
 
>> Could you be confusing data integrity and data validity? <<

The database is a model of a real-world scenario. The integrity/validity checks all serve the purpose of ensuring that the model cannot deviate from the real world -- for example, a person cannot have a date of death prior to their date of birth, and cannot have a name longer than the Declaration of Independence. It is all to do with the accuracy of the model.

>> The database doesn't stop a 1 digit zipcode in a 10 byte text column. You seem to be implying that your database is validating your data, but with varchar2(10) it is just limiting your field to 10 bytes, which might appear to help but you've still got to check or limit the length in other places like the data entry code, and do all the other validation. <<

The database can indeed stop a 1-digit zipcode, and any designer worth their salt will put in-place constraints to ensure that the zip code is either constrained against a list of valid entries, or of the format 99999 or 99999-9999, or even both.
David Aldridge Send private email
Monday, August 07, 2006
 
 
> I could really care less about the validation that the developers tell me occurs on their end.

Perhaps that you could care less is the problem?
son of parnas
Monday, August 07, 2006
 
 
Actually I couldn't care less as I don't trust it to maintain the integrity and validity of the data in the database.
D in PHX Send private email
Monday, August 07, 2006
 
 
"Actually I couldn't care less as I don't trust it to maintain the integrity and validity of the data in the database."

+1

But I'd add that the constraints that we place in the database can also be an important source of metadata that the optimizer can use ... for example, having a constraint ...

CHECK (NAME = UPPER(NAME))

... can allow a predicate of ...

WHERE UPPER(NAME) = 'GEORGE'

... to be optimized to ...

WHERE NAME = 'GEORGE'

... which would allow an index on NAME to be used.
David Aldridge Send private email
Monday, August 07, 2006
 
 
Yes, constraints can definitely help the optimizers. I think a lot of people that have 'application' databases actually have 'integration' databases and don't realize it.
D in PHX Send private email
Monday, August 07, 2006
 
 
> I don't trust it to maintain the integrity and
> validity of the data in the database.

Yet you trust it a lot. Most of the validation that happens is in the application, not the database. In the example here, the database has only the most trivial validations of a phone number.

Many of us see the application as an entire stack, not just a database, or a this, or a that, so the zeal with which the true defenders of the database argue for relatively minor integrity checks seems misplaced.
son of parnas
Monday, August 07, 2006
 
 
No, I think you assume a lot.

My posting in this thread is not meant to be an exhaustive step by step guide to database design, validation and constraints.

Are there applications out there that do the majority of the validation instead of enforcing such things in the db?

Yeah, but that doesn't make it right in the slightest.

As for "Many of us see the application as an entire stack..." I direct you to my comment about application and integration databases.

Most databases (and their data) have a longer lifecycle than applications. YMMV, but from my experience, that's just the way it is.

Making sure that data in the database is squeaky clean isn't a high profile position, but it is really, really important IMHO.
D in PHX Send private email
Monday, August 07, 2006
 
 
"It does if you're serious about data integrity. You put phone numbers into a char(12) or a varchar(12), and do something like ... Check (translate(phone_number,'0123456789','9999999999') = '999 999-9999')"

This is validity to be precise, not integrity. If you are implementing validation in the database and the logic changes for certain international conventions then you will need to synchronize your changes in both the database and the applications involved. If you are explicitly validating, then field length can also be tested as part of these checks. I know you think the data schema column width is providing a great deal of help to you, but I am just pointing out for the sake of this debate that if you are doing these other checks, the length can be checked too.

"Having those same rules in place in the database is not duplication IMHO"

Okay but you are doing the check in 2 places. If you need the database to be the last defense against invalid data, the redundancy is useful, but it is still odd to claim it is not duplication.
Ben Bryant
Monday, August 07, 2006
 
 
> Making sure that data in the database is squeaky
> clean isn't a high profile position

You must be cleaning with a non-steroidal cleanser then. The number of constraints a database can enforce is very small, and they tend to be trivial compared to the vast network of external semantic relationships the data participates in.
son of parnas
Monday, August 07, 2006
 
 
Ben,

I may be checking the same thing in two places, but it's not the same check.

I look at data in the app (variables, object properties, etc) as something distinctly different than data in the database.

Everything the database does it does to ensure the data is correct for anything that requests it from the db (generalization but still).

The app only cares about the state of its variables or objects. That's its scope. And that is a good thing.

That limited scope is what makes the app. effective, as it does very defined things very well.

The database meets the demands of many apps at once and therefore fundamentally cannot work the same as the app.

BTW, I may sound like I'm down on developers (and applications) but I'm really not, I used to do that stuff. And I'm glad we can disagree and still keep it civil.
D in PHX Send private email
Monday, August 07, 2006
 
 
"The number of contraints a database can enforce is very small and they tend to be trivial compared to the vast network of the external semantic relationships data participate in"

Ah, if you don't use the seat belt in the car you're riding in, is it still there?

Although I don't believe that all business rules should reside in the database, on the other hand there are a lot of constraints that should be in the db that are miscategorized as business rules.

This manifests itself in strange, sometimes hard-to-find data anomalies that make it harder and harder for other apps to use the data down the road.
D in PHX Send private email
Monday, August 07, 2006
 
 
I just want to clarify that my proposal is not for or against doing validity checks in the database. Field length constraints can be implemented as checks rather than as data schema column widths.
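
For instance, a length rule can live in an ordinary CHECK constraint instead of the declared column width. An editorial sketch using SQLite from Python (table and limit are made up; SQLite enforces CHECK constraints even though it ignores declared widths):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No declared width on NAME: the length rule is an ordinary CHECK
# constraint, which could be altered or dropped without touching
# the storage format.
conn.execute("""
    CREATE TABLE customer (
        custid INTEGER,
        name   TEXT CHECK (length(name) <= 50)
    )
""")

conn.execute("INSERT INTO customer VALUES (1, ?)", ("Ben",))  # fine
try:
    conn.execute("INSERT INTO customer VALUES (2, ?)", ("x" * 60,))
except sqlite3.IntegrityError:
    print("over-length insert refused")  # the CHECK rejects it
```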

The criticisms of my proposal seem to be:

A) We need to make sure there aren't too many bytes stored in a phone number field and the data schema column width is the way to do that.
B) The performance of my proposal would be bad.

I don't think there is all that much else here. I disagree with both A and B, but I've enjoyed the debate. Thanks.
Ben Bryant
Monday, August 07, 2006
 
 
"This is validity to be precise, not integrity."

No, if you're going to get technical about it, then you're assuming that integrity constraints as a whole amount to nothing more than referential integrity constraints.  Actually, referential integrity constraints are a subset of all integrity constraints.

Other posters have given examples of SQL constraints established with the CHECK clause.  These are integrity constraints, despite the fact that they have nothing to do with referential integrity.  If they are violated, the integrity of the data in the table is damaged.  E.g., if you let in a phone number that has only two digits, the integrity of that record is violated.  Letting users save invalid data violates the integrity of the record, the table, and the database as a whole.
Herbert Sitz Send private email
Monday, August 07, 2006
 
 
"These are integrity constraints" Yes CHECK constraints are called "integrity constraints" though they provide validation.
Ben Bryant
Monday, August 07, 2006
 
 
What is the business cost of not being able to capture long user names?  How much money would be lost by this system constraint?

What is the price of surplus storage and processing capacity to hold fixed length fields that are mostly empty?

What is the price of computation to discover the length of variable length fields at run time?

Would any of these above costs not be incurred until the company is larger and better able to afford a bit of extra spending on IT?

What is the current cost of a developer's time to ponder, research, and implement a more concise data structure?

Without answers to all of these questions, there's no evidence that the investment in variable-length user names will be a reasonable use of business resources.

"Premature optimization," that's the term I'm looking for.
Flow
Monday, August 07, 2006
 
 
What I find curious is that nobody has yet pointed out that most implementations of VARCHAR are specified in characters, not bytes.

Bytes != characters.  A character is a letter or a symbol whose size in bytes depends on its encoding (UTF-8, ASCII, etc).  You can have 8 characters of Arabic, or 8 letters of the English alphabet.  Doesn't matter.

Interestingly, optimization and data integrity issues aside (of which I take no issue with), I too find it annoying that I have to specify a string length for things like a first name or last name.  I mean, what the hell is a good max length for a name?
Cory R. King
Tuesday, August 08, 2006
 
 
"most implmentations of VARCHAR are specified in characters" They may use the base character size, a byte for multi-byte or a word for UTF-16, but in both cases a character can be more than one of them. It is because ultimately you are talking about the storage, not the characters, as far as the database is concerned.
Ben Bryant
Tuesday, August 08, 2006
 
 
>> I mean, what the hell is a good max length for a name?

http://everything2.com/index.pl?node_id=1534419&lastnode_id=0

:D
David Aldridge Send private email
Tuesday, August 08, 2006
 
 

This topic is archived. No further replies will be accepted.
