Persistent cache #130

Open
bloep opened this issue Mar 28, 2024 · 4 comments
Comments


bloep commented Mar 28, 2024

Is your feature request related to a problem? Please describe.

Scenario 1:

We have two databases that partly contain the same data, because the corresponding systems communicate with each other via APIs and exchange this data (e.g. names and addresses).

Example goal: Person X named "Richard Roe" becomes "John Doe" in both dumps.

Scenario 2:

A dump is created automatically every night for development. The same original values should not get different anonymized values every night.

Example goal: Person X named "Richard Roe" becomes "John Doe" today and "John Doe" again tomorrow, not "Foo Bar".

In both cases, the resulting data for the same input is currently different.

Describe the solution you'd like
An option to persist the cache entries in a cache file. The next execution would then find the cached value and return the same result for the same input.

Collaborator

guvra commented Apr 4, 2024

You could implement this by adding a file parameter to the cache converter, but it's really not trivial to implement: the cache may contain billions of values when dumping a big database.

Contributor

staabm commented Apr 4, 2024

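A PHP-based cache file, which dumps the output of var_export and re-reads it using a require, should be pretty efficient even with huge arrays (and would also avoid the need for e.g. JSON encoding/decoding or whatever format you use instead).

Similar to what is done here:

* cache write: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L212-L219

* cache read: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L147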
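
For illustration, the var_export / require idea could look roughly like the sketch below. This is not gdpr-dump or phpstan-dba code: `writeCacheFile`, `readCacheFile` and the file path are hypothetical names chosen for the example, and the write-then-rename step is an added assumption to avoid readers seeing a half-written file.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: stores the cache array as a PHP file that returns it.
function writeCacheFile(string $path, array $cache): void
{
    $code = '<?php return ' . var_export($cache, true) . ';' . PHP_EOL;

    // Write to a temporary file first, then rename it into place.
    $tmp = $path . '.' . uniqid('tmp', true);
    if (file_put_contents($tmp, $code, LOCK_EX) === false) {
        throw new RuntimeException("Unable to write cache file: $tmp");
    }
    if (!rename($tmp, $path)) {
        throw new RuntimeException("Unable to move cache file to: $path");
    }
}

// Hypothetical helper: loads the array back, or an empty array if no file exists yet.
function readCacheFile(string $path): array
{
    if (!is_file($path)) {
        return [];
    }

    // require executes the file as PHP code, so the path must be fully trusted.
    $cache = require $path;

    return is_array($cache) ? $cache : [];
}

// Usage: the second run of a dump would start from the values of the first run.
$path = sys_get_temp_dir() . '/gdpr_dump_cache_example.php';
writeCacheFile($path, ['Richard Roe' => 'John Doe']);
$cache = readCacheFile($path);
echo $cache['Richard Roe'], PHP_EOL; // prints "John Doe"
```

Note that the whole array is held in memory on both write and read, so this sketch does nothing about very large caches.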

Collaborator

guvra commented Apr 4, 2024

> a php-based cache file, which dumps the output of var_export and re-reads it using a require should be pretty efficient even with huge arrays (and would also prevent the need of e.g. json encoding/decoding or whatever format you use instead).
>
> similar as done here:
>
> * cache write: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L212-L219
>
> * cache read: https://github.com/staabm/phpstan-dba/blob/e86594d4e0d7c868c9dfbdbeda5805b93a4ca6ce/src/QueryReflection/ReflectionCache.php#L147

But it's extremely insecure. That's okay for phpstan because it's a development tool, but gdpr-dump can be executed on production environments.
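
To make the concern concrete: a file loaded with require is executed as PHP, so anyone who can modify the cache file can run code in the dump process. A data-only format avoids that, because the file is parsed rather than executed. The sketch below uses JSON as one such format. It is only an illustration: `saveCacheAsJson` and `loadCacheFromJson` are hypothetical names, nothing like this exists in gdpr-dump as far as this thread shows, and it still loads the whole cache into memory.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: stores the cache as plain JSON data.
function saveCacheAsJson(string $path, array $cache): void
{
    $json = json_encode($cache, JSON_THROW_ON_ERROR);
    if (file_put_contents($path, $json, LOCK_EX) === false) {
        throw new RuntimeException("Unable to write cache file: $path");
    }
}

// Hypothetical helper: parses the file as data; nothing in it is executed.
function loadCacheFromJson(string $path): array
{
    if (!is_file($path)) {
        return [];
    }

    $json = file_get_contents($path);
    if ($json === false) {
        throw new RuntimeException("Unable to read cache file: $path");
    }

    // A tampered file can at worst yield wrong values or a JsonException.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    return is_array($data) ? $data : [];
}

$path = sys_get_temp_dir() . '/gdpr_dump_cache_example.json';
saveCacheAsJson($path, ['Richard Roe' => 'John Doe']);
echo loadCacheFromJson($path)['Richard Roe'], PHP_EOL; // prints "John Doe"
```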

Contributor

amenk commented Dec 12, 2024

Different approach

First of all, you should check whether it is sufficient to use
https://github.com/Smile-SA/gdpr-dump/wiki/Data-Converters#anonymizetext
which should always lead to the same result for the same input.

If that is not sufficient and you want to use faker to get "nice" anonymous data, I think we could take a hash of the original data point, use it as the seed, and should then always get the same resulting name.

This could be applied to all the random generators.
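
A minimal sketch of the hash-as-seed idea, using only PHP built-ins instead of faker so that it runs on its own. The name lists and the functions `seedFromValue` and `pseudonymize` are made up for the example; they are not part of gdpr-dump.

```php
<?php
declare(strict_types=1);

// Made-up sample data standing in for a faker name provider.
const FIRST_NAMES = ['John', 'Jane', 'Alex', 'Maria', 'Sam'];
const LAST_NAMES = ['Doe', 'Smith', 'Miller', 'Brown', 'Jones'];

// Derives an integer seed from the original value.
function seedFromValue(string $value): int
{
    $hash = hash('sha256', $value);

    // 7 hex digits are 28 bits, which fits in a positive int on any platform.
    return (int) hexdec(substr($hash, 0, 7));
}

// Returns the same fake name every time it is given the same original value.
function pseudonymize(string $value): string
{
    // Reseed before every value so the result depends only on the input,
    // not on how many random numbers were drawn earlier.
    mt_srand(seedFromValue($value));

    $first = FIRST_NAMES[mt_rand(0, count(FIRST_NAMES) - 1)];
    $last = LAST_NAMES[mt_rand(0, count(LAST_NAMES) - 1)];

    return $first . ' ' . $last;
}

$today = pseudonymize('Richard Roe');
$tomorrow = pseudonymize('Richard Roe');
var_dump($today === $tomorrow); // prints bool(true)
```

No cache file is needed with this approach, so the size concern raised above does not apply. With faker, the analogous step would presumably be calling the generator's `seed()` method before each value. Two caveats: different inputs can map to the same output, and an unkeyed hash lets anyone who can guess an input recompute its output; a keyed hash such as `hash_hmac` would be one way to avoid the latter.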
