databaseSQL2021-08-175108 chars
Keeping Database Histories
We want to compare two different patterns of preserving data for machine learning. One is to design the SQL table as insert-only, so that any update will result an extra row; another is to create a separate history table to record the history of changes.
Insert-Only Example
As a starting point, we have 1 product in our products table.
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Hamburger |
1000 |
2021-08-17T18:40:00.000Z |
null |
When we want to update the name to "Cheese Hamburger".
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Hamburger |
1000 |
2021-08-17T18:40:00.000Z |
2021-08-17T18:41:00.000Z |
2 |
Cheese Hamburger |
1000 |
2021-08-17T18:41:00.000Z |
null |
When we want to update the price to "1500".
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Hamburger |
1000 |
2021-08-17T18:40:00.000Z |
2021-08-17T18:41:00.000Z |
2 |
Cheese Hamburger |
1000 |
2021-08-17T18:41:00.000Z |
2021-08-17T18:42:00.000Z |
3 |
Cheese Hamburger |
1500 |
2021-08-17T18:42:00.000Z |
null |
We only keep one table, but the table size grows on every update we make. And we need to manage the changing id of the same item.
History Table Example
Same starting point, but we have 2 extra tables to record the history of name column and price column.
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Hamburger |
1000 |
2021-08-17T18:40:00.000Z |
null |
productId |
name |
timeUpdated |
1 |
Hamburger |
2021-08-17T18:40:00.000Z |
productId |
price |
timeUpdated |
1 |
1000 |
2021-08-17T18:40:00.000Z |
When we want to update the name to "Cheese Hamburger".
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Cheese Hamburger |
1000 |
2021-08-17T18:40:00.000Z |
null |
productId |
name |
timeUpdated |
1 |
Hamburger |
2021-08-17T18:40:00.000Z |
1 |
Cheese Hamburger |
2021-08-17T18:41:00.000Z |
productId |
price |
timeUpdated |
1 |
1000 |
2021-08-17T18:40:00.000Z |
When we want to update the price to "1500".
id |
name |
price |
timeCreated |
timeDeleted |
1 |
Cheese Hamburger |
1500 |
2021-08-17T18:40:00.000Z |
null |
productId |
name |
timeUpdated |
1 |
Hamburger |
2021-08-17T18:40:00.000Z |
1 |
Cheese Hamburger |
2021-08-17T18:41:00.000Z |
productId |
price |
timeUpdated |
1 |
1000 |
2021-08-17T18:40:00.000Z |
1 |
1500 |
2021-08-17T18:42:00.000Z |
By keeping history tables, we only keep the live and active data in products table, it keeps the id stable, and we can query/update as usual. The changes are simply recorded in the product name history and product price history tables.
Question
Which pattern would you like to adopt and what are your thoughts?