The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

The "Design of Software" discussion group has been merged with the main Joel on Software discussion group.

The archives will remain online indefinitely.

Pattern for variable type/size struct array in C++?

I need to load and process a bunch of data (records) in memory (C++). The type of the data is not known at compile time, so I cannot define fixed structs/classes for it. Additionally, the number of rows could easily top 200K-300K records.

Is there a pattern for how to do this?
Jonas
Saturday, May 10, 2008
 
 
Well, you must know *something* about the data you're going to process. If the records have fields of well-known types, you could do something like this:

enum field_type {INT, FLOAT, STRING};

typedef union
{
    int intvalue;
    float floatvalue;
    char *stringvalue;
} field_value;

typedef struct
{
    field_type type;
    field_value value;
} field;

/* Then just define your "record" as a std::vector of fields. */
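
A minimal sketch of a record built that way (the std::vector usage and the sample values are my own, not a fixed API):

#include <vector>

/* Reuses the field_type, field_value, and field definitions above. */
typedef std::vector<field> record;

int main()
{
    record row;

    field f;
    f.type = INT;
    f.value.intvalue = 42;
    row.push_back(f);

    f.type = STRING;
    f.value.stringvalue = (char *)"hello";  /* points at a literal here */
    row.push_back(f);

    return 0;
}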
Mark Bessey
Saturday, May 10, 2008
 
 
Depending on what you need to do to them and how much performance you need, you could just treat each entire record as a string.
On read, parse the record and store an index of the split-character positions (assuming the fields aren't fixed-length).

Then your class just needs to store
{
    char *data;
    int index[SOME_LARGE_NUMBER];
}

Or split the record on read and store each field in a vector of char*.

Then add a bunch of functions to return the data in field 'n' as a string/int/float etc.
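
A rough sketch of the index-based version (the class and method names are my own invention, and I'm assuming a single-character delimiter):

#include <cstdlib>
#include <string>
#include <vector>

class Record {
public:
    explicit Record(const std::string &line, char sep = ',') : data(line) {
        // Index the start of each field so we can slice on demand.
        offsets.push_back(0);
        for (size_t i = 0; i < data.size(); ++i)
            if (data[i] == sep)
                offsets.push_back(i + 1);
        offsets.push_back(data.size() + 1);   // sentinel past the end
    }

    std::string asString(size_t n) const {
        return data.substr(offsets[n], offsets[n + 1] - offsets[n] - 1);
    }
    int    asInt(size_t n)   const { return std::atoi(asString(n).c_str()); }
    double asFloat(size_t n) const { return std::atof(asString(n).c_str()); }

private:
    std::string data;             // the raw record, kept whole
    std::vector<size_t> offsets;  // start position of each field
};

With that, Record r("42,3.5,hello"); r.asInt(0) gives 42 and r.asString(2) gives "hello".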
Martin
Saturday, May 10, 2008
 
 
Is there some reason you can't just use a hash table? (Or, really, a list of hash tables.)
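
A sketch of that idea (std::map standing in for a hash table here, since hash containers aren't in the 2003 standard; std::tr1::unordered_map or a Boost container would be the real thing):

#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Row;  // field name -> raw value
typedef std::vector<Row> Table;                  // a list of hash tables

int main()
{
    Table table;
    Row row;
    row["id"] = "42";
    row["name"] = "widget";
    table.push_back(row);
    return 0;
}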
my name is here
Saturday, May 10, 2008
 
 
I have done this at a couple of jobs. In the cases of the data I worked with, it was stored on disk as ASCII data (binary values converted into string representations) with fixed-width fields (no parsing needed to separate the fields, though I did do some parsing to ensure that the data wasn't corrupt). Each record could be one of several predefined formats (each kind of file had its own set of predefined formats), but we didn't know what format a particular record was in until we read the record from the file. Records were delimited by newlines.

We would read each record in from the file and check the first few bytes of the record to determine what format the record was in. Then we would marshal the individual fields into structures that 'matched' the predefined record formats. The structures were just a bunch of char*s: since we weren't doing much manipulation of the data, just moving the fields around into a new format for output, there was no need to convert the fields back to their binary forms.

If you don't even know the record formats until you read the file, it gets a little more involved, but the outline should be pretty much the same. If you need to manipulate the fields in a way that requires their binary values, your code will run a bit slower because of all the decoding being done between string and binary formats (if the on-disk format is all binary, you might still need to convert between the on-disk binary format and your machine's native binary format, which can be painful).

Since you haven't told us anything substantial about the data you are working with, it is very hard to make any concrete recommendations, but the general solution will look something like this:

    typedef union {
          int int_field;
          char char_field;
          char *str_field;
          double float_field;
          ...
    } Field_Type_Union;

    typedef enum {
          int_type,
          char_type,
          str_type,
          float_type,
          ...
    } Field_Type_Enum;

    Field_Type_Enum field_type[MAX_FIELDS];
    Field_Type_Union rec_fields[MAX_FIELDS];
    int field_count;

Now, when you read in a record from the file, you do whatever you need to do to determine what the record structure is, then fill in the field_type array (with the type of each field), the rec_fields array (with the values from each field), and field_count (with the number of fields in the record). All other routines that operate on the input records key off field_count and the entries in the field_type array. For extra credit you can dynamically allocate the field_type and rec_fields arrays.
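
A sketch of that fill step for a record known to contain an int followed by a string (the offsets and field layout here are hypothetical; this assumes the declarations above plus <stdlib.h> and <string.h>, and strdup, which is POSIX rather than ANSI C):

    /* buf holds one newline-delimited record read from the file */
    field_count = 2;

    field_type[0] = int_type;
    rec_fields[0].int_field = atoi(buf);          /* decode field 1 */

    field_type[1] = str_type;
    rec_fields[1].str_field = strdup(buf + 10);   /* field 2 at fixed offset 10 */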

This, of course, is the non-C++ solution (straight C). For C++ you would replace the Field_Type_Union with a root class from which all actual field types descend. You would then dispense with the Field_Type_Enum and the field_type array and simply use polymorphism to get the appropriate behavior for each field type. How to get the input records interpreted in an object-oriented manner is left as an exercise for the reader (largely because I don't see any good way to do it in C++, mostly due to the limitations of an early-bound language).
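
A minimal sketch of that root-class approach (the class names and the print() method are my own, not a canonical design):

    #include <cstdio>
    #include <string>
    #include <vector>

    class Field {                         // root class for all field types
    public:
        virtual ~Field() {}
        virtual void print() const = 0;   // polymorphic behavior per type
    };

    class IntField : public Field {
    public:
        explicit IntField(int v) : value(v) {}
        void print() const { std::printf("%d", value); }
    private:
        int value;
    };

    class StrField : public Field {
    public:
        explicit StrField(const std::string &v) : value(v) {}
        void print() const { std::printf("%s", value.c_str()); }
    private:
        std::string value;
    };

    typedef std::vector<Field*> Record;   // caller owns and must delete the Fields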
Jeffrey Dutky
Saturday, May 10, 2008
 
 
Use the Boost library.

If the elements have a fixed set of possible types, use boost::variant. Otherwise use boost::any, which can hold a value of any type.

#include <vector>
#include <boost/any.hpp>

std::vector<boost::any> anyVec;
anyVec.push_back(1.0f);                 // stored as float
anyVec.push_back("this is a string");   // stored as const char*
...
float f = boost::any_cast<float>(anyVec[0]);
Glitch
Sunday, May 11, 2008
 
 
Store each field separately as an __int64 array. Each field will take 300K * 8 bytes, or just 2.4 MB per field, which is no big deal these days. Of course, string fields will take more heap.
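
A sketch of that column-wise layout (the typedefs and sizes are assumptions; __int64 itself is MSVC-specific, so long long stands in for it here):

#include <vector>

// One contiguous array per field; row i's value for field f is cols[f][i].
typedef long long int64;
typedef std::vector<int64> Column;

int main()
{
    const size_t kRows = 300000;
    std::vector<Column> cols(5, Column(kRows, 0));  // 5 fields x 300K rows
    cols[0][0] = 42;                                // field 0, row 0
    // 300000 rows * 8 bytes = 2.4 MB per field, 12 MB for all 5 fields
    return 0;
}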
dd
Monday, May 12, 2008
 
 
Since you are in C++, why not use C++?  Templates are not a bad way to go these days, even in an embedded program.  You can overload ostreams and istreams for input and output, and make functions that behave like the standard algorithms, a la std::copy(vctr.begin(), vctr.end(), std::back_inserter(newvec)) or whatever.
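
For example, that copy call spelled out (vctr and newvec are just placeholders):

#include <algorithm>
#include <iterator>
#include <vector>

int main()
{
    std::vector<int> vctr(3, 7);
    std::vector<int> newvec;
    // Copy vctr into newvec, growing it as needed via push_back.
    std::copy(vctr.begin(), vctr.end(), std::back_inserter(newvec));
    return 0;
}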
Erich Weiss
Friday, May 30, 2008
 
 

This topic is archived. No further replies will be accepted.
