Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/stats: inaccurate avg_size in statistics for column in single-column column family #137713

Open
michae2 opened this issue Dec 18, 2024 · 2 comments
Labels
A-sql-table-stats Table statistics (and their automatic refresh). branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@michae2
Copy link
Collaborator

michae2 commented Dec 18, 2024

Single-column column families are encoded using legacy marshaling. When we collect statistics on these single-column column families, we estimate their size using Datum.Size() instead of the actual on-disk size, and this can be very inaccurate. Here's an example:

CREATE TYPE e AS ENUM ('hello', 'howdy', 'hi');
CREATE TABLE abc (a e NOT NULL, b e NOT NULL, c e NOT NULL, FAMILY (a, b), FAMILY (c));
INSERT INTO abc VALUES ('hi', 'hi', 'hi');
ANALYZE abc;
SHOW STATISTICS FOR TABLE abc;

The avg_size of both a and b is 3 bytes, but avg_size of c is 48 bytes because DEnum.Size() includes a bunch of metadata that isn't written to disk.

[email protected]:26257/demoapp/defaultdb> SHOW STATISTICS FOR TABLE abc;
  statistics_name | column_names |            created            | row_count | distinct_count | null_count | avg_size | partial_predicate |    histogram_id     | full_histogram_id
------------------+--------------+-------------------------------+-----------+----------------+------------+----------+-------------------+---------------------+--------------------
  NULL            | {a}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        3 | NULL              | 1030462984780972033 |                 0
  NULL            | {b}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        3 | NULL              | 1030462984784281601 |                 0
  NULL            | {c}          | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |       48 | NULL              | 1030462984787001345 |                 0
  NULL            | {rowid}      | 2024-12-18 17:25:46.427418+00 |         1 |              1 |          0 |        9 | NULL              | 1030462984777924609 |                 0
(4 rows)

Jira issue: CRDB-45715

@michae2 michae2 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-table-stats Table statistics (and their automatic refresh). T-sql-queries SQL Queries Team labels Dec 18, 2024

This comment was marked as resolved.

@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Dec 18, 2024
@michae2 michae2 added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 labels Dec 18, 2024
@michae2
Copy link
Collaborator Author

michae2 commented Dec 19, 2024

One idea for a fix is: #137731

@mgartner mgartner moved this from Triage to Backlog in SQL Queries Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-table-stats Table statistics (and their automatic refresh). branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 branch-release-24.1 Used to mark GA and release blockers, technical advisories, and bugs for 24.1 branch-release-24.2 Used to mark GA and release blockers, technical advisories, and bugs for 24.2 branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

1 participant