A public forum for discussing the design of software, from the user interface to the code architecture. Now closed.
I am developing data cleansing software (C++, STL, Qt) which performs some simple data cleansing and data reconciliation operations not available in existing data cleansing, ETL, and data warehouse engines. My software is intended to clean and repair constraint violations in numerical data sets (such as census, bioinformatics, DSS, or sales data). I am trying to evaluate my algorithms/programs, but I do not have enough "real data"
There are plenty of large sample relational data sets which can be used to evaluate algorithms (<a href="http://www.imdb.com/interfaces#plain">see, for example</a>), but all of them represent cleaned and verified data
I am searching for a sample inconsistent "dirty" database.
Of course, it is easy to generate one, or to take any other data set and define a
set of constraints such that the initially consistent database becomes
inconsistent, but I am interested in having a real-world example, to test
some particular heuristics for data cleansing
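To make the "define constraints so a clean set becomes dirty" idea concrete, here is a minimal Python sketch (the age column, its bounds, and the corruption values are all made-up assumptions, not from any real data set):

```python
import random

# Hypothetical clean numeric data set: (id, age) pairs, all ages valid.
clean = [(i, random.randint(0, 120)) for i in range(1, 1001)]

# A CHECK-style constraint the data currently satisfies ...
def check_age(row):
    return 0 <= row[1] <= 120

# ... then perturb a sample of rows so the constraint starts to fail,
# giving a controlled "dirty" data set with a known set of violations.
dirty = list(clean)
violating_indices = random.sample(range(len(dirty)), 50)
for i in violating_indices:
    row_id, _ = dirty[i]
    dirty[i] = (row_id, random.choice([-5, 130, 999]))  # out-of-range ages

violations = [r for r in dirty if not check_age(r)]
print(len(violations))  # 50
```

The limitation, as noted above, is that the violations are synthetic: their distribution and structure are chosen by the generator, not by real-world data entry errors.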
Does someone know of sample data sets (any format - csv, xml, sql - I
can convert them into Oracle loader format or plain SQL in 30 minutes), numerical data sets preferred,
around 1-50 million tuples, which fail to satisfy a set of real-world
constraints (CHECK constraints are more important than key or foreign
key constraints, or even constraints which are not expressible in a DBMS)? I need data sets with more than 10,000 rows, better 1-50 million rows
I would be very grateful if you could help me
Saturday, March 11, 2006
Can you not just generate them?
For example, here's a million rows where every twentieth row is a numeric with ABC concatenated to the end.
CREATE TABLE bad AS
SELECT ROWNUM AS my_num,
       CASE WHEN MOD(ROWNUM, 20) = 0
            THEN TO_CHAR(ROWNUM) || 'ABC'
            ELSE TO_CHAR(ROWNUM)
       END AS my_val
FROM dual
CONNECT BY level <= 1000000;
Look at the DBMS_RANDOM package also.
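In the same spirit as DBMS_RANDOM, the random corruption can also be done outside the database, writing straight to CSV (which the original poster accepts). A Python sketch, with hypothetical column names and an assumed ~5% dirty-row rate:

```python
import csv
import io
import random

random.seed(42)

# Write rows where a small random fraction get a corrupted value:
# a numeric with junk text appended, violating a numeric CHECK constraint.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["my_num", "my_val"])
for n in range(1, 10001):
    if random.random() < 0.05:       # ~5% of rows are dirty
        val = str(n) + "ABC"
    else:
        val = str(n)
    writer.writerow([n, val])

rows = buf.getvalue().splitlines()
dirty = [r for r in rows[1:] if r.endswith("ABC")]
print(len(rows) - 1, "rows,", len(dirty), "dirty")
```

Scaling the loop bound up gives data sets in the 1-50 million row range the question asks for, though the violations remain synthetic rather than real-world.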
Tuesday, March 14, 2006
This topic is archived. No further replies will be accepted.