The Design of Software (CLOSED)

A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.

Inconsistent databases

I am developing data cleansing software (C++, STL, Qt) that performs some simple data cleansing and data reconciliation operations not available in existing data cleansing, ETL, and data warehouse engines. My software is intended to clean numerical data sets (such as census, bioinformatics, DSS, and sales data) and to repair constraint violations in them. I am trying to evaluate my algorithms and programs, but I do not have enough "real data".

There are plenty of large sample relational data sets that can be used to evaluate algorithms (see, for example, http://www.imdb.com/interfaces#plain), but all of them contain cleaned and verified data.


I am searching for a sample inconsistent ("dirty") database.
Of course, it is easy to generate one, or to take any existing database
and define a set of constraints such that the initially consistent database
becomes inconsistent, but I am interested in a real-world example to test
some particular heuristics for data cleansing.
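
To illustrate the point about making a consistent data set inconsistent simply by declaring constraints on it, here is a minimal Python sketch. The column names (`age`, `income`, `tax`), the sample rows, and the three rules are all hypothetical, made up for illustration; they do not come from any real data set or from the software described above.

```python
# Hypothetical rows: nothing is "wrong" with them until rules are declared.
rows = [
    {"id": 1, "age": 34, "income": 52000, "tax": 9000},
    {"id": 2, "age": 17, "income": 12000, "tax": 0},
    {"id": 3, "age": 61, "income": 30000, "tax": 31000},
]

# CHECK-style constraints written as (name, predicate) pairs.
# These rules are assumptions chosen only to produce some violations.
constraints = [
    ("age_range", lambda r: 0 <= r["age"] <= 120),
    ("tax_le_income", lambda r: r["tax"] <= r["income"]),
    ("adult_if_income", lambda r: r["income"] == 0 or r["age"] >= 18),
]


def find_violations(rows, constraints):
    """Return a (row id, constraint name) pair for every failed check."""
    return [
        (r["id"], name)
        for r in rows
        for name, check in constraints
        if not check(r)
    ]


violations = find_violations(rows, constraints)
print(violations)
```

With only the first rule declared the rows are consistent; adding the other two makes rows 2 and 3 violate a constraint, which is the kind of synthetic inconsistency that is easy to produce but may not resemble real-world errors.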

Does anyone know of sample data sets (any format: CSV, XML, SQL; I
can convert them into Oracle loader format or plain SQL in 30 minutes), preferably numerical,
which fail to satisfy a set of real-world
constraints (CHECK constraints are more important than key or foreign
key constraints), or even constraints that are not expressible in a DBMS? I need data sets with more than 10,000 rows, and preferably 1-50 million rows.
I would be very grateful for any help.
Andrei Lopatenko
Saturday, March 11, 2006
 
 
Can you not just generate them?

For example, here's a million rows where every twentieth row is a number with ABC concatenated to the end.

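-- Oracle: generate a million rows from dual with a hierarchical query.
-- my_bad_num is rownum||'ABC' on every 20th row and NULL on all others
-- (decode has no default here).
create table bad
as
select rownum my_num,
       decode(mod(rownum,20),0,rownum||'ABC') my_bad_num
from   dual
connect by level <= 1000000;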

Look at the DBMS_RANDOM package also.
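
For anyone without an Oracle instance to hand, a rough Python analogue of the same idea might look like the sketch below. The function names `make_dirty_rows` and `is_numeric`, their parameters, and the extra `amount` column are made up for illustration. Python's `random` module stands in only loosely for DBMS_RANDOM here, and unlike the query above, which leaves my_bad_num NULL on the other rows, this version fills them with the plain number.

```python
import random


def make_dirty_rows(n, every=20, suffix="ABC", seed=0):
    """Yield (my_num, my_bad_num, amount) tuples.

    Every `every`-th my_bad_num gets `suffix` appended, so it no longer
    parses as a number. `amount` is random filler data.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    for i in range(1, n + 1):
        bad = f"{i}{suffix}" if i % every == 0 else str(i)
        amount = round(rng.uniform(0, 1000), 2)
        yield i, bad, amount


def is_numeric(text):
    """True if `text` parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


rows = list(make_dirty_rows(1000))
dirty = [r for r in rows if not is_numeric(r[1])]
print(len(rows), len(dirty))
```

Raising `n` to 1000000 gives the same row count as the query; writing the tuples out as CSV would make them loadable with the usual bulk-load tools.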
David Aldridge
Tuesday, March 14, 2006
 
 

