
Parquet multi-chunk upload (split to RowGroups) #591

Open
arthurpassos opened this issue Jan 14, 2025 · 9 comments

Comments

@arthurpassos
Collaborator

Mimic MergeTree blocks as Parquet RowGroups. That will reduce memory usage.

Consider using min_insert_block_size_rows/bytes as well
Consider output_format_parallel_formatting

@arthurpassos
Collaborator Author

Raised by Alexander

@arthurpassos
Collaborator Author

From Alexander:

So, basically, we need it for two use cases:

  1. Insert into a Parquet file. We need to make sure it does not blow up. The default maximum insert block size is 1M rows, so if columns are big, it may take quite a lot of RAM.
  2. Move from a MergeTree part to Parquet (syntax TBD). A part can be big (150 GB), so we need to make sure we can process it by MergeTree blocks, so that every block is written as a RowGroup.

The second use case is more important. (edited)

@arthurpassos
Collaborator Author

arthurpassos commented Jan 14, 2025

ParquetBlockOutputFormat is the class responsible for transforming ClickHouse chunks into parquet row groups. Eventually, it becomes a parquet file.

Chunks are grouped together into row groups.

Both output_format_parquet_row_group_size and output_format_parquet_row_group_size_bytes control the number of rows/bytes that go into a parquet row group. If a ClickHouse chunk contains fewer rows/bytes than the settings dictate, it is either put on hold until more chunks arrive or flushed if it is the last chunk.

The number of rows in a chunk is not controlled by this class, so a chunk may arrive that overflows the number of rows that go into a row group. For that case, there is logic in place that splits the chunk into multiple row groups when it is bigger than row_group_rows * 2.

Aside from that, in case the ClickHouse native encoder is being used and parallel processing is enabled, there is some extra logic to avoid overuse of resources. It basically stalls the process a bit until bytes_in_flight <= format_settings.parquet.row_group_bytes * 4 && task_queue.size() <= format_settings.max_threads * 4 holds.
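
To make that splitting behaviour easier to follow, here is a minimal standalone sketch. It is not the actual ClickHouse implementation: the function name and the exact split arithmetic are assumptions, and only the "split when the chunk holds at least roughly 2x row_group_rows" rule is taken from the description above.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Return the row counts of the row groups a single chunk would be cut into,
    // assuming the "only split when the chunk holds at least 2x row_group_rows"
    // guard described above. The exact split arithmetic in ClickHouse may differ.
    std::vector<size_t> splitChunkByRows(size_t chunk_rows, size_t row_group_rows)
    {
        if (chunk_rows < row_group_rows * 2)
            return {chunk_rows};   // below the guard: one row group, up to 2x - 1 rows

        // At or above the guard: cut into roughly row_group_rows-sized pieces.
        size_t num_groups = chunk_rows / row_group_rows;
        size_t base = chunk_rows / num_groups;
        size_t remainder = chunk_rows % num_groups;

        std::vector<size_t> groups;
        for (size_t i = 0; i < num_groups; ++i)
            groups.push_back(base + (i < remainder ? 1 : 0));
        return groups;
    }

    int main()
    {
        const size_t row_group_rows = 1'000'000;   // default output_format_parquet_row_group_size

        for (size_t chunk_rows : {65'409, 1'999'999, 2'000'000, 5'000'000})
            std::cout << chunk_rows << " rows -> "
                      << splitChunkByRows(chunk_rows, row_group_rows).size()
                      << " row group(s)\n";
    }

Assuming the default output_format_parquet_row_group_size of 1,000,000 rows, a 1,999,999-row chunk still lands in a single row group, which is where the "up to 2x - 1 rows" bound in the next comment comes from.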

@arthurpassos
Collaborator Author

arthurpassos commented Jan 14, 2025

Essentially, the number of row groups that can be written in parallel is controlled and guaranteed by format_settings.parquet.parallel_encoding && format_settings.max_threads.

Each row group can contain up to (output_format_parquet_row_group_size * 2) - 1 rows.

The number of bytes per row group only comes into play on two occasions:

  1. With large columns, the row group can be flushed even if the required number of rows hasn't been reached.
  2. It caps the number of bytes that can be processed in parallel (bytes in flight).

Therefore, it seems theoretically possible to crash ClickHouse by using large columns.

The issue could arise in cases where the columns are large and the number of rows in the chunk is < row_group_rows * 2.
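
A rough back-of-the-envelope illustration of that risk (the per-row widths below are made-up example numbers, not measurements): a row group holding the worst-case (output_format_parquet_row_group_size * 2) - 1 rows has no byte-based limit applied to it, so the buffered size scales directly with column width.

    #include <cstddef>
    #include <iostream>

    int main()
    {
        const size_t row_group_rows = 1'000'000;                    // output_format_parquet_row_group_size
        const size_t max_rows_per_group = row_group_rows * 2 - 1;   // worst case discussed above

        // Hypothetical average row widths; real values depend entirely on the schema.
        for (size_t bytes_per_row : {100, 1'000, 10'000, 100'000})
        {
            double gib = double(max_rows_per_group) * double(bytes_per_row)
                         / (1024.0 * 1024.0 * 1024.0);
            std::cout << bytes_per_row << " B/row -> up to ~" << gib
                      << " GiB buffered for a single row group\n";
        }
    }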

@arthurpassos
Collaborator Author

Looking at the sources that generate ClickHouse chunks, most of them, if not all, respect max_block_size, which defaults to 65409 rows.

That says nothing about how many bytes are being loaded, though.

@arthurpassos
Collaborator Author

Ok, I also need to look into max_insert_block_size and how these two play together. The default for this one is 1 million rows.

@arthurpassos
Collaborator Author

arthurpassos commented Jan 15, 2025

Summary of current situation:

As far as I could tell, max_insert_block_size does not play a role in this case.

The size of the blocks that are passed to the "ParquetWriter" is controlled by max_block_size. How many blocks go into a parquet row group is controlled by output_format_parquet_row_group_size and output_format_parquet_row_group_size_bytes. Bear in mind these parquet settings act like a minimum, not a maximum. There is an extra safeguard in place that checks if a block is greater than output_format_parquet_row_group_size * 2. In that case, it splits the block into N row groups.

There isn't a similar safeguard for bytes. So, theoretically, one could have a data source that generates blocks with fewer than output_format_parquet_row_group_size * 2 rows but very large columns. That could become a problem, and there is no safeguard in place for it.

@arthurpassos
Collaborator Author

Possible solution: implement a similar safeguard that splits chunks into multiple row groups when they are far too big in bytes (sketched below).

Imho, I would not waste time implementing this unless we have evidence that this is a problem.
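
To make the proposal concrete, here is a hypothetical sketch. ChunkSize and numRowGroupsFor are made-up names, the thresholds are example values rather than ClickHouse defaults, and the idea is simply to mirror the existing rows-based 2x guard with a bytes-based one and take the larger of the two split counts.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    // Sizes of a chunk as the output format would see them (made-up helper type).
    struct ChunkSize
    {
        size_t rows = 0;
        size_t bytes = 0;   // total allocated bytes of all columns in the chunk
    };

    // Decide how many row groups to cut a chunk into: the existing rows-based
    // guard plus the proposed (hypothetical) bytes-based one.
    size_t numRowGroupsFor(const ChunkSize & chunk, size_t row_group_rows, size_t row_group_bytes)
    {
        size_t by_rows = 1;
        if (row_group_rows > 0 && chunk.rows >= row_group_rows * 2)
            by_rows = chunk.rows / row_group_rows;

        size_t by_bytes = 1;
        if (row_group_bytes > 0 && chunk.bytes >= row_group_bytes * 2)
            by_bytes = chunk.bytes / row_group_bytes;

        return std::max(by_rows, by_bytes);
    }

    int main()
    {
        // 200k rows but ~4 GiB of data: the rows-based guard alone would keep this in one row group.
        ChunkSize chunk{200'000, 4ull * 1024 * 1024 * 1024};
        std::cout << numRowGroupsFor(chunk, 1'000'000, 512ull * 1024 * 1024)   // 512 MiB as an example byte target
                  << " row group(s)\n";                                        // prints 8
    }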

@arthurpassos
Collaborator Author

arthurpassos commented Jan 17, 2025

QA will take over

@Selfeer
