Change versions table layout for performance #1457

Merged
merged 1 commit into paper-trail-gem:master from optimize-table-layout
May 28, 2024

fatkodima
Contributor

We currently use paper_trail and have billions of rows in the versions table, which makes the table huge.

One easy improvement I noticed is the versions table layout. The current layout is not optimal and causes unnecessary padding inside each row. There are good articles on the topic, like one and two. Basically, the fixed-size fields should come first in the table, ordered so that padding between them is minimized.

With the currently implemented layout, if we consider that the user decides to use bigint for whodunnit (see #1456), then whodunnit, item_id and created_at must each start on an 8-byte boundary (because each of them is 8 bytes in size), so padding may be added after the fields that precede them to make this happen. This can be as much as 7 bytes of padding per field.
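The padding effect described above can be illustrated with a small calculation. This is a simplified sketch, not PostgreSQL's actual on-disk format: the column names, sizes, and alignments below are hypothetical, row headers are ignored, and short strings are modeled as byte-aligned values of a fixed example length.

```python
def padded_row_size(columns):
    """Return (total_bytes, padding_bytes) for columns laid out in order.

    Each column is a (name, size, alignment) tuple. A column starts at the
    next offset that is a multiple of its alignment, so any gap before it
    counts as padding.
    """
    offset = 0
    padding = 0
    for _name, size, align in columns:
        gap = (-offset) % align  # bytes needed to reach the next boundary
        padding += gap
        offset += gap + size
    return offset, padding


# Hypothetical layout: 8-byte columns interleaved with short strings.
interleaved = [
    ("id", 8, 8),
    ("item_type", 5, 1),
    ("item_id", 8, 8),
    ("event", 7, 1),
    ("whodunnit", 8, 8),
    ("object", 30, 1),
    ("created_at", 8, 8),
]

# Same columns, with the 8-byte columns grouped first.
packed = [
    ("id", 8, 8),
    ("item_id", 8, 8),
    ("whodunnit", 8, 8),
    ("created_at", 8, 8),
    ("item_type", 5, 1),
    ("event", 7, 1),
    ("object", 30, 1),
]

if __name__ == "__main__":
    print("interleaved:", padded_row_size(interleaved))
    print("packed:     ", padded_row_size(packed))
```

With these example sizes, the interleaved order needs padding before every 8-byte column that follows a string, while the packed order needs none. The real saving depends on the actual string lengths in each row.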

For example, if we have 4 billion records in the database and each row has 21 bytes of wasted padding, then we can save 4 * 10^9 * 21 / 10^9 = 84 GB (on the order of 100 GB) 🔥 of storage just by making this simple table layout change.
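The estimate above can be checked with a couple of lines of arithmetic. The row count and per-row padding are the figures from this description; the function name is just for illustration.

```python
def padding_savings_bytes(row_count, padding_per_row):
    """Total bytes spent on padding across all rows."""
    return row_count * padding_per_row


if __name__ == "__main__":
    saved = padding_savings_bytes(4_000_000_000, 21)
    print(f"{saved / 10**9:.0f} GB")    # decimal gigabytes
    print(f"{saved / 2**30:.1f} GiB")   # binary gibibytes
```

The exact product is 84 GB, which is the figure rounded to roughly 100 GB in the text.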

Also, afaik, Postgres precalculates offsets for the statically sized columns at the start of the row (that is, for the prefix of columns with static types) and can then jump directly to a specific column using those offsets when reading the row, instead of walking through the row past dynamically sized columns to reach the needed one. So this will also speed up reading the whodunnit, item_id and created_at columns.

I believe this will improve the situation for MySQL too.

  • Wrote good commit messages.
  • Feature branch is up-to-date with master (if not - rebase it).
  • Squashed related commits together.
  • Added tests.
  • Added an entry to the Changelog if the new
    code introduces user-observable changes.
  • The PR relates to only one subject with a clear title
    and description in grammatically correct, complete sentences.

@jonatas

jonatas commented Apr 12, 2024

Hey @fatkodima, that looks so cool! Have you considered adding TimescaleDB to also partition the data by time? It could yield massive storage gains by compressing all the repeated values with dictionary algorithms.

Happy to have a chat and introduce it.

@jaredbeck
Member

Nice analysis, @fatkodima! Thanks for the contribution.

@jaredbeck jaredbeck merged commit 67a1ec2 into paper-trail-gem:master May 28, 2024
7 checks passed
@fatkodima fatkodima deleted the optimize-table-layout branch May 28, 2024 17:11
@franzliedke

Is it recommended to adopt this new table layout for existing apps? 🤔 I can imagine that would be quite a blocking DB migration...

@fatkodima
Contributor Author

It depends on how much space is expected to be saved, so it may be worth it. This can be done in a non-blocking way by creating a separate table with the proper layout, copying the data, and making the switch.
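As a rough illustration of the copy-and-switch approach described above, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in database. Everything in it is an assumption for demonstration: the function name, the `versions_new` and `versions_old` table names, the reduced column set, and the batch size. It is not a migration shipped with paper_trail, and it does not handle rows written to the old table while the copy is running, which a production migration would need to cover (for example with triggers or dual writes).

```python
import sqlite3


def copy_and_switch(conn, batch_size=1000):
    """Copy `versions` into a re-ordered table in batches, then swap names."""
    # New table with the fixed-size columns first (hypothetical column set).
    conn.execute(
        """CREATE TABLE versions_new (
               id INTEGER PRIMARY KEY,
               item_id INTEGER NOT NULL,
               created_at TEXT,
               item_type TEXT NOT NULL,
               event TEXT NOT NULL
           )"""
    )

    # Copy in id-ordered batches so no single statement touches every row.
    last_id = 0
    while True:
        conn.execute(
            """INSERT INTO versions_new (id, item_id, created_at, item_type, event)
               SELECT id, item_id, created_at, item_type, event
               FROM versions
               WHERE id > ?
               ORDER BY id
               LIMIT ?""",
            (last_id, batch_size),
        )
        conn.commit()
        new_last = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM versions_new"
        ).fetchone()[0]
        if new_last == last_id:  # nothing was copied in this pass
            break
        last_id = new_last

    # The switch: keep the old table under another name, promote the new one.
    conn.execute("ALTER TABLE versions RENAME TO versions_old")
    conn.execute("ALTER TABLE versions_new RENAME TO versions")
    conn.commit()
```

Keeping the old table around under a different name leaves a way back until the new table has been verified.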

4 participants