The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

File format conversion: need advice on design approaches?

Hi all,

I'm a beginner, and this is the first time I will have to do a file format conversion.
I've been wondering about the solution, but I'm still not sure how I should go about doing it. Here is what I have in mind.
I know what both formats look like. For example, say I have to convert file A to file B. File A might contain a label like [link], and I will map it to generate an HTML link in file B - just an example.

So I intend to write a desktop program where the user opens a file-open dialog and selects file A. Once file A is opened, I enter a _while loop_ (which ends once we reach the last character of the file), and within that loop I read chunks of file A and convert labels like [link] into HTML links using if-else structures.
This might sound correct, but I'm really looking for a more elegant approach. To the experienced programmers who've done file conversion in the past:
Do you really do all that stuff in a while loop using if-else structures? Can other data structures be used in file-conversion utilities, e.g. hash tables, trees, etc.?

I'm not a native speaker, but I hope I've made things clear. If something is still vague I will try to re-phrase my question, but I'm mainly looking for a more elegant approach using some nice data structures.
Oltmans
Saturday, January 12, 2008
Three obvious approaches.
1. Traditional: use yacc/lex to parse the file and automatically generate code to convert it.
2. Modern: convert file A into an XML representation, then use XSLT to generate the HTML from it.
3. Bloody stupid: write code in a big loop that specifically converts A to B.
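As a rough sketch of the "modern" approach, assuming a hypothetical [link]...[/link] marker format: lift the input into an intermediate XML tree first, then walk that tree to emit HTML. (Python's standard library has no XSLT processor, so a plain tree walk stands in for the XSLT step here; the two-stage structure is the point, not the specific code.)

```python
# Sketch of approach 2: parse a made-up "[link]...[/link]" input format into
# an intermediate XML tree, then serialize that tree as HTML.
# A real project might use lex/yacc for the first stage and XSLT for the second.
import re
import xml.etree.ElementTree as ET

def to_xml(text):
    """Turn '[link]target[/link]' markers into an XML <doc> tree."""
    root = ET.Element("doc")
    pos = 0
    for m in re.finditer(r"\[link\](.*?)\[/link\]", text):
        if m.start() > pos:
            # plain text before the marker
            ET.SubElement(root, "text").text = text[pos:m.start()]
        ET.SubElement(root, "link").text = m.group(1)
        pos = m.end()
    if pos < len(text):
        ET.SubElement(root, "text").text = text[pos:]
    return root

def to_html(root):
    """Walk the intermediate tree and emit HTML (stand-in for the XSLT step)."""
    parts = []
    for node in root:
        if node.tag == "link":
            parts.append('<a href="%s">%s</a>' % (node.text, node.text))
        else:
            parts.append(node.text)
    return "".join(parts)

print(to_html(to_xml("see [link]http://example.com[/link] for more")))
```

The advantage over approach 3 is that the parsing and the output generation are decoupled: a new output format only needs a new tree walker.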
Martin
Saturday, January 12, 2008
Thanks, Martin.

Well, I actually took A and B just as an example. More importantly, I'm looking for design approaches applicable to almost all kinds of file-conversion routines, for most file formats.
Oltmans
Saturday, January 12, 2008
Taking Microsoft Word as an example, I would guess that they open the file and figure out what it is (text, RTF, Word 97, Word 2007, etc).  Then there is a special routine for loading that particular format in and converting it to their current internal data structures.  When it's time to write back out they probably have a few different save routines.
Saturday, January 12, 2008
If it's a simple set of tags, you could (sloppily) do the replacing with regexps.
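For instance, assuming a hypothetical [link]...[/link] marker format, the quick-and-sloppy regex version might look like this:

```python
# Sloppy-but-quick tag replacement with a regular expression.
# The [link]...[/link] marker format is a made-up example, not a real standard.
import re

def convert(text):
    # \1 in the replacement refers back to the captured marker contents
    return re.sub(r"\[link\](.*?)\[/link\]",
                  r'<a href="\1">\1</a>',
                  text)

print(convert("read [link]http://example.com[/link] today"))
```

This falls apart as soon as markers nest or span lines, which is why it is the sloppy option.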
Totally Agreeing
Sunday, January 13, 2008
It depends on how different format A and B are. If they are merely syntactically different, you could use a simple transformer and pipe data into one end. But often, different formats have totally different organisation of data. Then, you might need to read the input format into an in-memory structure, before writing it out as the output format.
Troels Knak-Nielsen
Sunday, January 13, 2008
I have done a lot of file conversions, and even though the suggested solutions with yacc/lex and XML structures sound nice, they have never worked for me - mostly because the input file format wasn't quite as structured as promised.

What I usually do is: first I read the input file into a fitting memory structure - if it is an XML file, I read it into a DOM; if it is a typical [config]-style file, I read it into a hashmap of hashmaps; and if it is something else entirely, I define a structure specifically for holding that type of file. This could be compared to tokenizing the file.
Then I forget about the file and convert the memory structure into something that resembles the output file format. This conversion can be done in many ways, but usually I write a parser that knows what to expect and how to handle mistakes and errors in the input format. Finally I write the output file, usually by simply walking through the new memory structure - yes, with a while loop - and writing everything to disk.

This might seem a little over-engineered, or over-architected, but when the input files are created by humans (as most of those I have seen are), they are riddled with formatting errors, and I don't think anything but a specialized parser can take that into account.

So that's how I do it.
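A minimal sketch of this read-then-convert flow, assuming a simple INI-style [config] input (the section and key names below are invented for illustration):

```python
# Sketch of the two-stage approach: tokenize an INI-style file into a
# dict of dicts, forget the file, then walk the structure to write the output.
def parse_ini(lines):
    sections = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = {}
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            sections[current][key.strip()] = value.strip()
        # anything else is a formatting error; a specialized parser
        # would try to recover here instead of silently dropping it
    return sections

def write_html(sections):
    # walk the in-memory structure and emit the output format
    out = []
    for name, pairs in sections.items():
        out.append("<h2>%s</h2>" % name)
        for key, value in pairs.items():
            out.append("<p>%s: %s</p>" % (key, value))
    return "\n".join(out)

data = parse_ini(["[server]", "host = example.com", "port = 8080"])
print(write_html(data))
```

The output side never touches the input file, so swapping either format only touches one of the two functions.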
Peter L.
Sunday, January 13, 2008
Here is one implementation of Peter's design in an OO language:

Format - The base class for all the in-memory representations of the formats. You need to derive one class for each different format.

FormatFactory - The class responsible for figuring out which of the Format-derived classes to create, based on the format.

InputMedia - Base class for all input types. You need to derive from it for specific types of input, such as a file system, an FTP site, a backup tape, etc.

InputMediaFactory - The class responsible for determining the type of InputMedia to create, based on the name of the input.

OutputMedia - Base class for all the different output media, just like above.

You can put common functionality in a common base class called Media.

OutputMediaFactory - The class responsible for determining the type of OutputMedia to create, based on the name of the output.

View - The base class responsible for asking for the input, showing progress, perhaps displaying the final result, etc. You have to derive classes from this for each view type that you support - HTML, console, etc.

Controller - The class which ties everything together. One implementation might be:
  The view waits in a loop (a message loop, if it is a GUI program).
  The view calls the controller with the input file moniker and the output file moniker.
  The controller passes the input and output monikers to the InputMediaFactory and OutputMediaFactory, and gets back pointers to the appropriate InputMedia and OutputMedia objects.
  It then calls the function to read the input file.
  It passes the InputMedia object to the FormatFactory. The FormatFactory creates the appropriate Format-derived object and passes it back to the controller.
  Calls inputformat << inputmedia.
  Calls outputformat << inputformat - this is where all the meat is. This is also the place to show progress by calling the view.
  Calls outputmedia << outputformat.

Every reason to use an object-oriented design comes down to ease of maintenance/enhancement. So forget this if it is just a one-time thing.

If this is a long-term thing where you want to add several file formats, input types, user interfaces, etc., then you should consider something like this. In the above, adding another format, for example, involves only 2 steps, and you have to touch very little of the existing code:

1. Derive a new class from Format and implement all the required functions.
2. Add logic to the FormatFactory so that it knows to create this type of object when it sees a certain type of format.
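A compressed sketch of just the Format/FormatFactory half of that design (the class names, methods, and extension-based lookup here are illustrative assumptions, not from any real library):

```python
# Sketch of the Format / FormatFactory pair: each file format gets its own
# subclass, and the factory picks one via a simple extension registry,
# so adding a format means one new class plus one registry entry.
class Format:
    def read(self, text):
        raise NotImplementedError

class PlainTextFormat(Format):
    def read(self, text):
        return {"body": text}

class CsvFormat(Format):
    def read(self, text):
        return {"rows": [line.split(",") for line in text.splitlines()]}

class FormatFactory:
    # step 2 from the list above: register the new class here
    registry = {".txt": PlainTextFormat, ".csv": CsvFormat}

    @classmethod
    def create(cls, filename):
        for ext, format_cls in cls.registry.items():
            if filename.endswith(ext):
                return format_cls()
        raise ValueError("unknown format: " + filename)

fmt = FormatFactory.create("notes.csv")
print(fmt.read("a,b\nc,d"))
```

The InputMedia/OutputMedia factories would follow the same registry pattern, keyed on the moniker instead of the extension.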
Sunday, January 13, 2008
Thank you, everyone.
Oltmans
Tuesday, January 15, 2008
I'd go about this with a filter pipeline structure. For your case I'd design 4 classes:
- one in-memory data structure to hold the data contained in the two formats
- one input filter class that's responsible for reading format A and constructing the memory structure
- one output filter class that takes the in-memory data and writes it out in format B
- one pipeline class that plugs things together, which will look something like this:
class ConvertAToBPipeline {
    // just a quickly written proof of concept -
    // not to be taken literally; it's a pretty dumb
    // implementation of the pipeline idea
    void Convert(File a, File b) {
        Filter inputFilter = new InputFilterA(a);
        Data data = inputFilter.Process();
        Filter outputFilter = new OutputFilterB(data);
        b = outputFilter.Process();
    }
}

I like it because it's a fairly simple design, but it is easily extensible. Instead of hardcoding filtering sequences,
you could read them from a file to enable multiple conversions between multiple formats later.
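One way to avoid the hardcoding (the filter names and registry below are invented for illustration) is a small registry keyed by name, with the chain itself coming from data rather than code:

```python
# Sketch of a data-driven filter chain: the sequence of filters is a plain
# list - which could just as well be loaded from a config file - instead of
# being hardcoded inside a pipeline class.
FILTERS = {
    "strip": lambda data: data.strip(),
    "upper": lambda data: data.upper(),
    "bracket": lambda data: "[" + data + "]",
}

def run_pipeline(data, filter_names):
    for name in filter_names:
        data = FILTERS[name](data)
    return data

# the chain could come from a config file; here it's inline for brevity
chain = ["strip", "upper", "bracket"]
print(run_pipeline("  hello  ", chain))
```

Supporting a new A-to-C conversion then means adding filters to the registry and shipping a new chain description, not editing the pipeline.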
Thursday, January 17, 2008
That is not really a pipeline. Presumably, you read the entire input dataset into an in-memory structure first, then hand it to an output processor. If the dataset is large, that design would consume a lot of memory.
If at all possible, I would suggest a design which reads one record (a line, or whatever) at a time, piping it through a filter and writing it out. That is what I think of as a filter pipeline. E.g.:

Reader reader = new FormatAReader("filename.a");
Writer writer = new FormatBWriter("filename.b");
Filter filter = new AtoBFilter();
while (reader.valid()) {
    writer.write(filter.apply(reader.next()));
}
writer.close();
Troels Knak-Nielsen
Friday, January 18, 2008
@Troels Knak-Nielsen
True about the possible memory waste on large files. But then again, I can't make a good trade-off in the design unless I know all the details of my domain. My example was intended to be simple and valid in the majority of cases, not a "fits all cases" solution. And your solution is not viable unless there is a linear mapping between the formats. What if format B has to start with data found at the end of format A, or there is a non-trivial transformation involved in the processing - maybe something that requires interpreting the entire context of format A? You would still end up reading the entire data set into memory. But then again, as I said, those kinds of decisions can't be made unless you know the details.
For the general case, I still stand by my initial solution.

"Premature optimization is the root of all evil"
Monday, January 21, 2008

This topic is archived. No further replies will be accepted.
