The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Excel data for DW

We're developing a DW solution that needs data from various sources. Some of these sources are actual applications that have back-end databases where standard ETL techniques can be applied. But a lot of data that we need to make proper aggregations resides in personal Excel files that various people use to keep lists. These lists are updated semi-frequently, but the updates need to get to DW database.

There is little chance that those Excel files will be implemented as a more meaningful system in the near future, so we have little choise other than getting the data from them.

A point that complicates the matter still is the lack of control on the structure of those Excel sheets. They are owned and maintained by users that have no restrictions on the structure of the files,

We've contemplated two possible solutions : one is a VBA script that would be implanted into each of those files and have the data transferred to our DW. The other is an upload web form that the users will use to upload the file to a web server that will analyze it and store its data into the DW.

To reduce maintenance we are thinkin of letting the users map columns in the Excel sheet to columns in the DW tables before uploading and show them a preview of what the data will look like when its in the DW before they commit.

Still, I don't feel very comfortable with any of these solutions. Do you have any thoughts on the matter? Has anyone tried to implement something similar?
Eli Golovinsky
Monday, February 07, 2005
One of my minor peeves is people who use acronyms in open posts without explaining the acronym the first time it is used.

Thus: What's a DW?  What's an ETL? 

Excel and VBA you don't have to explain.
Monday, February 07, 2005
With no restrictions on the structure of the files, it is pointless to try and do this.

Really, what magic could you invoke to extract data from a spreadsheet it you have no idea what the spreadsheet looks like?

*If* you can get the users to follow any kind of rules, such as labeling the cells or putting them in a certain spot, then you can extract based on that.  If not, you are wasting your time.
John Senior
Monday, February 07, 2005
DW = Data Warehouse
ETL = Extract, Transform and Load

I never use the term ETL but some of my programming pals who are in need of serious psycho-therapy do.
Mike Drips Send private email
Monday, February 07, 2005
Switching the users over from Excel to Access is usually simple first step. If they are just maintaining lists then a desktop database gives generally enough control for them and enough restrictions for you.

Excel would be a nightmare. This DW is for their benefit, not yours, and they need to start making some useful contribution.
David Aldridge Send private email
Monday, February 07, 2005
How about inserting a row with restricted choices in each column?  They could choose which kind of data the column refers to in this row.  You still have lots of problems in checking data, but this is a pretty simple way to map.

Bankstrong Send private email
Monday, February 07, 2005
I'd recommend writing a VBA mini-app that first lets the user map columns to what you want, then preps the data as you want.

The bottom line is users being users and Excel being Excel, you need to invest some resources here. Been there, done that.
Bob Walsh Send private email
Monday, February 07, 2005
I did something like this for the Bond Traders at Deutsche Bank a few years ago. They tracked their positions on Excel spreadsheets and needed to "share" with their collegues around the world each night.

These guys had spent years building up their spreadsheets and the certainly weren't going to entertain the thought of anyone changing their way of working. So much money depended upon these unbelievable complex Excel files.

The only solution acceptable to them was to write some VBA code to scan the sheets and upload the data. This was very scary, as traders have a (well-deserved) reputation for not accepting problems willingly, but it worked.

Of course, in an ideal world you'd just put in a trading system that allowed them to do all this too, but the systems they had didn't work in the way that they wanted.

I enjoyed working with those guys, it was certainly different. I'd recommend that anyone who wants to do IT as a career should work on the trading desk of an investment bank, to gain a useful perspective ;-)
Nemesis Send private email
Tuesday, February 08, 2005

The problem you have described is pretty much exactly what we do. We don't use vba to scan the file for information, but we do have a way of aggregating data from various sources turn it into xml and delivery both the content and subsequent documents form that xml to pretty much any where based on the initial content.

we are here

drop me a line if you need more info

Rob Stote
Rob Stote Send private email
Tuesday, February 08, 2005
If this is Excel 2003 you can look into Windows Sharepoint Services to make the lists you need inside the Excel files shared and published automatically on a WSS site. You can then aggregate your data through the site.
Just me (Sir to you) Send private email
Tuesday, February 08, 2005
I'd say let them have their Excel, but put in code that refuses to use their worksheets if they've made unauthorized modifications.  You _could_ use named ranges and so forth, but Excel just gives you way too many unknowns in terms of what people can change.  You can also lock things, but that gets really annoying to the users after the 20th "This cell is locked" popup.

I worked on an Excel/VBA-based system at one place; the system existed before I came along, and their accountants liked Excel, so it stayed.  But the ground rule (which also existed before I was there), understood by users and bosses alike, was that if you dared modify a worksheet on your own, you had no business complaining when things didn't work properly.

If you can get agreement on that, and refuse to load from worksheets that are off-kilter, you'll probably be fine.
Kyralessa Send private email
Tuesday, February 08, 2005
We frequently do this. The secret is to write a job to *verify* the excel data.

Basically, vendors for a client upload data to our client via an FTP site. We have a job that opens each spreadsheet, and scans it for violations of the spec. If there are violations, the file is rejected and not forwarded to our load job that pulls the data into the database.

We actually have 2 options depending on the errors and or severity of violations of the spec. Sometimes we reject the entire file, sometimes we reject only certain records. No need to reject 30,000 records when there's only one record that's in violation. Depends on the situation and the business rules on what action you choose: reject entire spreadsheet file, or only reject specific records within the file.

BTW -- this is completely automated. It logs the status on what is rejected and when, sends emails to the folks submitting the files letting them know acceptance/rejection status, sends emails to the operators, populates our "kickout" tables with the rejected records, and forwards the "cleaned" files to our load processing engine to get loaded into the database. The load engine assumes all files are scrubbed first, so there was no need to do any checking there. Just send 'em to the FTP site and they're automagically scrubbed before they even get to the dataload engine.

Face it, business folks *love* Excel, and wouldn't think of submitting, say, an XML file (or even a text file). They want to submit an xls file, they're going to submit an xls file. There's no getting around it. All you can do is write an automated gatekeeper to keep the garbage out.

Works for us.
Sgt. Sausage
Wednesday, February 09, 2005
I feel much better now that I know we are not alone. Users do love Excel, and the ones that need the aggregated data (management) are usually not the ones the hold the details in their Excle sheets.

We are going to build a validation job server-side. I would rather not add VBA to the current sheets.

Thanks folks!
Eli Golovinsky
Thursday, February 10, 2005
In the work we do for our client (GE Infrastructure), we've got support from the IT & business managers to digitize Excel data in Oracle, and maintain it via a web front-end. 

However, as another user suggested, Access is a very good intermediate solution.  Since datatypes in an Excel column are not necessarily the same, your ETL load always runs the risk of hitting a "bad" row.
Arun Philip Send private email
Tuesday, February 22, 2005

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
Powered by FogBugz