Persistent cache #130

bloep · 2024-03-28T09:24:05Z

Is your feature request related to a problem? Please describe.

Scenario 1:

We have two databases that contain partially the same data because the corresponding systems communicate with each other via APIs and exchange this data (e.g. names and addresses).

Example goal: Person X with the name "Richard Roe" becomes "John Doe" in both dumps.

Scenario 2:

A dump is automatically created every night for development. The same original values should not receive different anonymized values every night.

Example goal: Person X with the name "Richard Roe" becomes "John Doe" today and "John Doe" again tomorrow, not "Foo Bar".

Currently, in both cases, the same input produces different anonymized output.

Describe the solution you'd like
An option to persist the cache entries in a cache file. A subsequent execution would then find the cached value and return the same result for the same input.

guvra · 2024-04-04T10:02:33Z

You could implement this by adding a file parameter to the cache converter, but it's really not trivial to implement: the cache may contain billions of values when dumping a big database.
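To make the idea concrete, here is a minimal, language-neutral sketch in Python of a file-backed cache. It is not gdpr-dump code: the `PersistentCache` class, its `convert` and `save` methods, and the JSON file format are all assumptions chosen for illustration. It also loads the whole cache into memory, so it does not address the size concern raised above.

```python
import json
import os
import tempfile
from typing import Callable, Dict


class PersistentCache:
    """Hypothetical helper: remembers generated values across runs."""

    def __init__(self, path: str, generate: Callable[[str], str]) -> None:
        self.path = path
        self.generate = generate
        self.entries: Dict[str, str] = {}
        # Reuse entries written by a previous run, if the file exists.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as handle:
                self.entries = json.load(handle)

    def convert(self, original: str) -> str:
        # Generate a value only the first time an original value is seen.
        if original not in self.entries:
            self.entries[original] = self.generate(original)
        return self.entries[original]

    def save(self) -> None:
        # Write to a temporary file first, then swap it into place, so an
        # interrupted run does not leave a half-written cache behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(self.entries, handle)
        os.replace(tmp_path, self.path)
```

Note that a file like this maps original values to anonymized ones, so it is as sensitive as the source database and would need to be protected accordingly.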

staabm · 2024-04-04T10:14:49Z

A PHP-based cache file, which dumps the output of var_export and is re-read using a require, should be pretty efficient even with huge arrays (and would also avoid the need for e.g. JSON encoding/decoding or whatever format you use instead).

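Similar to what is done here:
* cache write: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L212-L219

* cache read: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L147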

guvra · 2024-04-04T10:22:02Z

> A PHP-based cache file, which dumps the output of var_export and is re-read using a require, should be pretty efficient even with huge arrays (and would also avoid the need for e.g. JSON encoding/decoding or whatever format you use instead).
>
> Similar to what is done here:
> * cache write: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L212-L219
> * cache read: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L147

But it's extremely insecure: a cache file loaded with require is executed as PHP code, so anyone who can write to that file can run arbitrary code. That's acceptable for phpstan because it's a development tool, whereas gdpr-dump can be executed in production environments.

amenk · 2024-12-12T16:13:17Z

Different approach

First of all, you should check whether it is sufficient to use
https://github.com/Smile-SA/gdpr-dump/wiki/Data-Converters#anonymizetext
which should always produce the same result for the same input.

If that is not sufficient and you want to use faker to get "nice" anonymous data, I think we could hash the original data point, use that hash as the seed, and thereby always get the same resulting name.

This could be applied to all the random generators.
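The hash-as-seed idea can be sketched as follows. This is a language-neutral illustration in Python, not gdpr-dump or faker code: the `deterministic_name` function, the `secret` parameter, and the small name lists are assumptions made for the example. A keyed hash (HMAC) is used here instead of a plain hash so that the mapping cannot be recomputed without the secret.

```python
import hashlib
import hmac
import random

# Tiny stand-in for a faker-style name provider (illustrative data only).
FIRST_NAMES = ["John", "Jane", "Alex", "Maria", "Sam"]
LAST_NAMES = ["Doe", "Smith", "Miller", "Brown", "Jones"]


def deterministic_name(original: str, secret: str = "change-me") -> str:
    """Derive a replacement name that depends only on `original` and `secret`."""
    digest = hmac.new(
        secret.encode("utf-8"), original.encode("utf-8"), hashlib.sha256
    ).digest()
    # Use the first 8 bytes of the digest as the seed of a private generator,
    # so the global random state is left untouched.
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


# The same input yields the same name on every call and every run,
# without storing anything on disk.
print(deterministic_name("Richard Roe") == deterministic_name("Richard Roe"))
```

Unlike a cache, this approach needs no storage, but two different original values can end up with the same replacement name, and changing the secret or the name lists changes every result.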
