The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Programmatically modifying Excel file before importing

Okay, I'm sure this can be done but Excel isn't my forte.  My company receives a spreadsheet of product data from a third-party processor.  This data then needs to be loaded into a database table for further analysis.  The problem is that the data supplied by the third-party has things like extraneous header information, column names spaced out across multiple rows, and the like, something like:

COMPANY NAME

(My Company's Account Number)

Field    Field
No. 1    No. 2
-----    -----
<data>  <data>

Not sure if it will appear properly, but the point is that I can't import the data as-is: It needs to be cleaned prior, the extra information needs to be removed, and the column names need to be modified to only be on one line instead of several and changed to fit the table it will be imported to.  The number of rows in the spreadsheet change depending on the amount of data, but there is always a header and footer, as well as a static number of columns.

Right now the cleanup process is done manually, and I would like to see if I can automate it via a script or something, which can then be called from a batch file as part of processing the data.  It would then be a simple matter of placing the Excel spreadsheet in a particular folder, and executing the batch script which would modify the file and import it into SQL Server.

The import part I have already, for the most part.  It's figuring out how to externally modify the Excel sheet prior to loading the data that's stumping me.

Is there an easy way of handling this, via VBScript or similar?  Keep in mind it needs to be done external to the Excel file for automation purposes.

Any help would be appreciated!
WayneM. Send private email
Tuesday, August 19, 2008
 
 
a different approach would be to

1. import into a temporary table
2. do you adjustments in the temporary table
3. copy required data into final table

no excel integration required
bumperbox
Tuesday, August 19, 2008
 
 
Excel has a decent automation API available to COM or .Net, as well as from VBA in its own designer.  You could also look at something like SpreadsheetGear for .Net, which I've found to be a little faster and more reliable.  Once upon a time, they were giving away a copy of this control if you downloaded and registered any of the Express editions of .Net 2008, but I don't know if this is still the case.
D. Lambert Send private email
Tuesday, August 19, 2008
 
 
When I have this situation, I do the same as bumperbox suggested. I import it into an intermediary table. Run queries that do any necessary cleanup and error detection and then import into the destination table.
Rich
Tuesday, August 19, 2008
 
 
The issue is I can't load it into SQL because of the way the spreadsheet is set up (all the extra info).  The extra info screws up the import (wrong info in wrong columns, etc.), so its imperative that it be removed prior to doing the import.
WayneM. Send private email
Tuesday, August 19, 2008
 
 
Wayne,

You can use OLE Automation to programmatically work with Excel files.

The easiest way to get started is to just record a macro in VBScript from within Excel itself; that gives you the properties and methods you'd need from the VBScript.

Alternately, if you're working in Delphi or a .NET language, you can buy a component package that allows you to programmatically create and/or edit Excel spreadsheets without even involving Excel (it's actually a lot faster that way). The package is XLSReadWriteII, from Axolot ( http://www.axolot.com/ ).
Ken White Send private email
Tuesday, August 19, 2008
 
 
Thanks, Ken.  I'll check that out.
WayneM. Send private email
Tuesday, August 19, 2008
 
 
Hi!  I have written up quite a bit of stuff like this on my blog.  See here:

http://nerds-central.blogspot.com/2008/01/excel-vbscript-index-page.html

Hope it helps - AJ
Alex Turner Send private email
Tuesday, August 19, 2008
 
 
Start by making the manual effort simple:
1. Delete all the offending rows except the last one before the data starts
2. Type in the headings you want in each column in that last undesired row
3. Import that range of headings into Custom Lists (see Help for your version of Excel)

You are now set up so that the Fill Series operation will work with your custom list and you have one import ready to go. For your next import:
1. Delete all offending rows except the last one before the data starts
2. Type in the first column name in the first column in the row immediately above the data
3. Drag the 'fill handle' to the right as far as necessary to hit all the columns. This will fill your custom list as headings across the data.

If that needs to be automated, then you can start by recording a macro as you go through that second set of steps. If even that leaves too much work for somebody, you'll at least have a very nice 'target' for any other automation you decided to implement.
Ron Porter Send private email
Tuesday, August 19, 2008
 
 
Your objection to the method bumperbox suggested is not sustainable.  any collection of data can be imported into a temporary table, if you are willing to  put up with the wrong data appearing in the worng column.

The SQL needed to make sense out oft he poorly structured data,  and convert it into something resembling a normal form  (pun intended)  could turn out to be quite difficult, or it could turn out to be easier than doing the same thing in Excel.

What made you choose Excel as the intermediate platform?
Walter Mitty Send private email
Tuesday, August 19, 2008
 
 
I find Python with the package win32com.client quite handy for this sort of thing.
George Jansen Send private email
Wednesday, August 20, 2008
 
 
The biggest problem I would find is that the column headings contain duplicates or other illegal column names.

I guess I would create a temporary table with fake column names (c1, c2, c3 ,....c127) and skip the first line, if necessary, in that case.
Steve Hirsch Send private email
Wednesday, August 20, 2008
 
 
If I had to that I would ask the other party for a CSV version of the file. If not possible I would do the Excel --> CSV conversion myself. Once the file is in CSV format you can do pretty much anything you want, just a matter of string manipulation.

Basically I would stay away as much as possible from integration with Excel.
nick
Thursday, August 21, 2008
 
 
There are free products that let you modify Excel files without needing to involve Excel, too.  For example, Perl has the Spreadsheet::ParseExcel and Spreadsheet::WriteExcel libraries -- just one reason why it's such a good ETL tool.
Iago
Saturday, August 23, 2008
 
 
Script it

We encounter this a lot with data coming in a bunch of formats that we need to pick out the common bits, make sure users have not broken the spreadsheet format (inserting columns or changing cell formats being a favourite).

I found writing small TAWK .exe's fast & easy, but Python or C# script would be fine as well; have your load process call the script step to perform some basic validation and cleaning, to get the spreadsheet into a form ready to load into an import table.
Grant Black Send private email
Monday, August 25, 2008
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz