A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
I'm currently working on planning a move from an Access-based integration solution to an XML one - my company's supplier currently provides us quarterly updates via an Access MDB file, and starting next year they'll be discontinuing it in favor of an XML-based system. I'd like to get a head start on it, and they've provided me a sample of what the new format will contain so I can look into it. As it stands, I am the sole IT person, and in charge of all of our IT operations.
I'm fairly competent when it comes to database design, but the snag I'm running into involves the way data is categorized in the new format - items have a minimum of two categories, but can have as many as six. The category data appears to be structured as part of the product's schema, as opposed to the old Access format where there were seperate tables and only three levels of detail.
The produce file references categories in a manner similar to this example:
<Classification type="DBV2" >
<Code sequence="1" listName="Level1" >Category1</Code>
<Code sequence="2" listName="Level2" >Category2</Code>
<Code sequence="3" listName="Level3" >Category3</Code>
<Code sequence="4" listName="Level4" >Category4</Code>
<Code sequence="5" listName="Level5" ></Code>
<Code sequence="6" listName="Level6" ></Code>
And, as stated before, the number of categories vary by product; there will always be at least the first two, but the other 4 (possibly more in the future - our supplier gives the caveat that they might expand categories later) depends on the specific item. I've seen some that have all six - most seem to have three or four.
The issue I'm running into is coming up with a good schema to import this data into a SQL Server 2005 (Standard) back-end. I'm *really* hoping that our supplier has an XML file that ONLY contains the categories, otherwise I'll have to sort through 40,000+ products to find the distinct classifications (as I said, the old format had tables that contained the information, and were related to the products table. The new file, as near as I can tell, lists a product's categories as nodes of itself). They have a sample site that demonstrates how the navigate the products, and I've been spending the past day or so playing around with it, trying to figure out possible ways that they could have set it up (however, their sample site doesn't perform any transactions, so for all I know it could just be parsing the XML file directly).
I'm at a loss for figuring out how best to tackle this problem - I had hoped to seperate out and normalize the tables, having Category, Section, Subsection, Group, and the like. Given that the structure is basically recursive, I'm thinking that the better (only?) choice would be to have a large Category table that links to itself via a Parent/Child ID relationship.
Am I on the right track with this, or is there a better way that I'm overlooking? The supplier seems to think (from what I can tell by looking at the data structure) that we should read the XML files directly (seeing as they provide individual item files as well as larger files that contain multiple products) - but wouldn't this be ridiculously slow compared to a database? My experience with XML has never been with very large amounts, so I'm not sure how it fares in a situation like this. If performance wouldn't be affected (and it's probably something I'll have to run some tests on to get an accurate idea), then realistically I could pull the product/category data from XML (since this is exactly what is provided by our supplier), and then our business-specific information for that product (its price, for example) from the database. But doing that means that everything which now looks to the database for product information would need to look to XML, and for some of our orders that would mean several round-trips (once to read product information from XML, once to get price information for that product from the DB, once to pull the actual order information from the DB), which again seems to be a huge performance hit.
Any suggestions/recommendations/advice would be greatly appreciated!
Thursday, February 14, 2008
It seems from a quick read that you have 2 issues: one, the new data model is questionable, and two, there is a large difficulty in dealing with XML, especially large files (it doesn't scale well at all).
As for the first issue, I can't speak to, but as far as dealing with XML, my recommendation is to immediately move it to a real relational database like SQL Server or Oracle and convert it into real relational tables.
Friends don't let friends use XML.
Haha that's what I was thinking, as well (i.e. move it to a real RDBMS). Unfortunately I have no control over the XML data model, as its provided as-is by our supplier and we have to make use of it.
I'll have to dig deeper into it and come up with a good database schema for it.
The little I have done with xml would suggest that you will want to write an app to import it into a real database and export it again if necessary. On a 15 meg file java's parser took under a min, fast but not fast enough to pretend it is a database.
Thursday, February 14, 2008
I don't think you should be trying to convert the XML format into a database schema. The XML is just the message or transaction. You should have a database schema that you need for your application, and use the XML only as the go-between, the data-interchange, rather than designing a database schema to mirror the XML. You can archive the XML as-is in case you need to mine transactions or do unforseen reports on transactions later.
>quarterly updates via an Access MDB file
If each MDB file is essentially the entire database, replacing the previous one, then once they switch to XML, you will just need to generate a relational database on your side that will look a lot like the MDB file did. Basically stick with the schema of the MDB file, don't try to mirror the XML. You will need to understand the XML in order to import it into your database, that's all.
>sort through 40,000+ products to find the distinct classifications
If they don't give you the distinct list separately, this should still only be a quick step. Don't sort; use a map (hash table) to get a unique list on a once through. Should only be a couple seconds on 15MB of XML.
I should mention, in regards to speed, that the sample file I have has about 5,800 products, and weighs in at 60MB. I think I'm going to have to speak with our supplier's IT people and find out what their assumptions are about the data, and how it'll be provided (they might not give a file that has all the products, for all I know, and expect me to use individual files for each product). What I'm working with is a sample that's several months old, which they sent out as a "preview").
Sorry for a triple post, but one more thing:
Somewhat related but not quite, does anyone know if there's a way to get SQL Server 2005 to understand the substitutionGroup XSD property? Our supplier's schema is based off something called OAGIS (which I looked up, and really seems interesting), and they're using substutitionGroup for the section - I'm trying to set up a job in SSIS to extract the data so I can put it into a table, and it's erroring out saying that Substitution Groups aren't supported - It looks like I'm going to get screwed one way or another, if I can't extract the data from all the places I need to in order to prepare our ERP system. I mean, I could write a program to parse the XML (after I brush up on my XPath, that is), but I thought the point of SSIS was to--you know--INTEGRATE various heterogeneous data sources. I'd love to have it all integrated to perform the various aggregations and manipulations that we require.
>really seems interesting
Don't you just love when a software "solution" becomes a little project of its own?
See, the issue is that our current schema doesn't follow any rhyme or reason - data is just thrown into a table (it looks like it was just imported from Access or Excel - nearly everything is nvarchar(255) and allows nulls) that bears some resemblance to a business object.
The new format (the XML one) reads to me like a redesign on the part of our supplier - they changed product categories around, added new ones, etc. Given that it's a radical departure from the Access format, and that our current situation is ugly, I'm using it as the catalyst to look at designing a proper data model from the ground up. The goal isn't to make a database schema based 100% on the XML data (as it contains a lot of information we don't need), but to make a data model that fits more snugly around the data we're provided - the previous developer, who was let go when I was hired, seemed to do everything in Access and basically just grab all of these tables, put them into a new table; grab this field from another table, join it with the first one, put it in a new table; I would run four or five of the queries he left behind (poorly documented, of course) and the size of the database would swell from about 300MB to just shy of the 2GB limit (close enough that Access would start to become corrupted).
For example, I'd like to take advantage of the new categorization that our supplier has provided in the XML (like the sample code I posted above) - it's much richer than the old format (which they still include) and IMO would provide for a better buying experience for the customer.
The big issue is that the current data model is very poor, so at some point I'm going to need to take data from the XML file and use it to populate the redesigned data model; this is the main thing I'm trying to work on, although given the amount of data our supplier provides, and how it's organized, I really need to find out from them what their assumptions are, or if they can provide some guidance on how to make use of it (we're one of their best customers, so we should be able to swing some special treatment) - their sample site is very impressive, but I'm not sure how they're pulling the data in for display.
This looks like a perfect task to stress test SQL 2005 XML datatype.
My first impulse would be to try and load all the raw XML input in one field, than "data mine" it with XQuery to populate application tables, and optionally keep this input for archiving purposes or if it may have some data not yet in the regular tables. If this works, it should be relatively painless to make changes to the tables and "refill" them at any later point.
Friday, February 15, 2008
You may be able to use the SQLXML Bulk import tool, and design an import schema (using sql annotations) then you just bulk load it into relational structure.
Once it is shredded, you could then stage that into your other data model relativly easy, this assumes ofcourse SQLXL import will support the substitution group, or allow you to hack around it.
Thanks for that, I'll take a look. I asked our supplier and it seems like the current MDB version won't be phased out anytime soon, so I'm no longer in a rush to port things to XML - although I still think the XML format is a lot richer and more descriptive than the current information, and I'm still going to evaluate using it (at least in part).
I would like to take advantage of that OAGIS commonality for business data, but I'm not sure if all of the entities we deal with (government, specifically) would support it. More research is in order, I think - I admit that I'm not that familiar with XML for uses other than configuration files. Maybe I should hold off on it for now, since I'd probably need to look at something like BizTalk server, and I can't requisition the budget to purchase that anytime soon.
This shows why XML is completely stupid and should die!!!!
The file is going to be 20 times larger just to send the same information without any additional value. Actually, its less valueable since all the relationships are lost with the format. Support people spend a huge amount of time trying to understand the data because its hidden in all the definition tags. Nine times out of ten the tags mean nothing or at best are the raw field name from the sending system causing endless emails about what the field is what it does etc...
If I were his boss, I would be pissed that my valuable employees time is spent analyzing this instead of helping me to run my business better. I might even send the vendor a bill for my empoyees analysis time for doing something that brings zero value to me.
There is nothing wrong with sending the MDB format or even a simple tabular file.
Sick of changes for changes sake. ARGH!!!!
Wednesday, February 20, 2008
"There is nothing wrong with sending the MDB format or even a simple tabular file."
Nobody said there was anything wrong with it. Our supplier is the one who said at first (although they seem to have changed it now) that they would discontinue the MDB version by the year's end in favor of the XML one, and they're the ones who keep hyping the new format by saying how great it is and how much better the content is (which, sadly, is true. I've compared the two and the XML descriptions are much more detailed and compelling than the ones in the MDB).
What I'm finding is that while it's better content, the thing is near impossible to extract to put into a real database format, so I really have no clue how in the hell they expect it to work. Maybe they really DO think you'll parse the XML directly for display.
FORGET ABOUT THE XML FORMAT/SCHEMA.
DON'T USE A XML FOR STORING YOUR DATABASE, use a database with an internal/binary format, that allows you to import/export in XML to other database's formats and providers.
Long Answer: There are several LONG, COMPLICATED answers to your question. You asked things about how the design of the file, and got different answers, similar to this one.
(1) There are databases software that:
(a) store their contents in binary format
(b) store their contents in XML format
(c) databases that store their contents in binary format,
BUT ALLOW YOU TO IMPORT & EXPORT their data in XML format, so you can easy migrate to other database's software providers
Right answer: (c)
(2) You ALSO asked about the XML FORMAT & DESIGN of the file. Maybe you post the wrong post's title. I suggest repost your question, AGAIN, like this:
"Am I using the right XML schema to migrate our database?"
"My provider is using XML files to import/export data from their database, to our database, and they are using the following XML file that..."
Just my 2 cents
Marco Aurelio Ramirez Carrillo
Friday, February 22, 2008
I wish this was my project, its the kind of thing I specialize in. Drop me a line if you want a consultation/free advice. I can probably whip up a script to look at ways of processing querying transforming your XML to better plan how to put it in your database.
You should try the firstobject editor for browsing and processing your XML fast (its free, and made for large files):
Friday, February 29, 2008
This topic is archived. No further replies will be accepted.Other recent topics
Powered by FogBugz