
Parquet multi-chunk upload (split to RowGroups) #591

Open
arthurpassos opened this issue Jan 14, 2025 · 9 comments

Comments

@arthurpassos
Collaborator

Mimic MergeTree blocks as Parquet RowGroups. That will reduce memory usage.

Consider using min_insert_block_size_rows/bytes as well
Consider output_format_parallel_formatting

@arthurpassos
Collaborator Author

Raised by Alexander

@arthurpassos
Collaborator Author

From Alexander:

So, basically, we need it for two use cases:

  1. Insert into a Parquet file. We need to make sure it does not blow up. The default maximum insert block size is 1M rows, so if columns are big, it may take quite a lot of RAM.
  2. Move from a MergeTree part to Parquet (syntax TBD). A part can be big (150 GB), so we need to make sure we can process it by MergeTree blocks, so that every block is written as a RowGroup.

The second use case is more important. (edited)

@arthurpassos
Collaborator Author

arthurpassos commented Jan 14, 2025

ParquetBlockOutputFormat is the class responsible for transforming ClickHouse chunks into parquet row groups. Eventually, it becomes a parquet file.

Chunks are grouped together into row groups.

Both output_format_parquet_row_group_size and output_format_parquet_row_group_size_bytes control the number of rows/bytes that go into a parquet row group. If a ClickHouse chunk contains fewer rows/bytes than the settings dictate, it is either put on hold until more chunks arrive or flushed if it is the last chunk.

The number of rows in a chunk is not controlled by this class, so a chunk may arrive that overflows the number of rows that go into a row group. For that case, there is logic in place that splits the chunk into multiple row groups when it is bigger than row_group_rows * 2.

Aside from that, in case the ClickHouse native encoder is being used and parallel processing is enabled, there is some extra logic to avoid overuse of resources. It basically stalls the process a bit until bytes_in_flight <= format_settings.parquet.row_group_bytes * 4 && task_queue.size() <= format_settings.max_threads * 4 holds.
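
To make that splitting behaviour easier to follow, here is a minimal standalone sketch. It is not the actual ClickHouse implementation: the function name and the exact split arithmetic are assumptions, and only the "split when the chunk holds at least roughly 2x row_group_rows" rule is taken from the description above.

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Return the row counts of the row groups a single chunk would be cut into,
    // assuming the "only split when the chunk holds at least 2x row_group_rows"
    // guard described above. The exact split arithmetic in ClickHouse may differ.
    std::vector<size_t> splitChunkByRows(size_t chunk_rows, size_t row_group_rows)
    {
        if (chunk_rows < row_group_rows * 2)
            return {chunk_rows};   // below the guard: one row group, up to 2x - 1 rows

        // At or above the guard: cut into roughly row_group_rows-sized pieces.
        size_t num_groups = chunk_rows / row_group_rows;
        size_t base = chunk_rows / num_groups;
        size_t remainder = chunk_rows % num_groups;

        std::vector<size_t> groups;
        for (size_t i = 0; i < num_groups; ++i)
            groups.push_back(base + (i < remainder ? 1 : 0));
        return groups;
    }

    int main()
    {
        const size_t row_group_rows = 1'000'000;   // default output_format_parquet_row_group_size

        for (size_t chunk_rows : {65'409, 1'999'999, 2'000'000, 5'000'000})
            std::cout << chunk_rows << " rows -> "
                      << splitChunkByRows(chunk_rows, row_group_rows).size()
                      << " row group(s)\n";
    }

Assuming the default output_format_parquet_row_group_size of 1,000,000 rows, a 1,999,999-row chunk still lands in a single row group, which is where the "up to 2x - 1 rows" bound in the next comment comes from.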

@arthurpassos
Collaborator Author

arthurpassos commented Jan 14, 2025

Essentially, the number of row groups that can be written in parallel is controlled and guaranteed by format_settings.parquet.parallel_encoding && format_settings.max_threads.

Each row group can contain up to (output_format_parquet_row_group_size * 2) - 1 rows.

The number of bytes per row group only comes into play on two occasions:

  1. With large columns, the row group can be flushed even if the required number of rows hasn't been reached.
  2. It caps the number of bytes that can be processed in parallel (bytes in flight).

Therefore, it seems theoretically possible to crash ClickHouse by using large columns.

The issue could arise in cases where the columns are large and the number of rows in the chunk is < row_group_rows * 2.
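
A rough back-of-the-envelope illustration of that risk (the per-row widths below are made-up example numbers, not measurements): a row group holding the worst-case (output_format_parquet_row_group_size * 2) - 1 rows has no byte-based limit applied to it, so the buffered size scales directly with column width.

    #include <cstddef>
    #include <iostream>

    int main()
    {
        const size_t row_group_rows = 1'000'000;                    // output_format_parquet_row_group_size
        const size_t max_rows_per_group = row_group_rows * 2 - 1;   // worst case discussed above

        // Hypothetical average row widths; real values depend entirely on the schema.
        for (size_t bytes_per_row : {100, 1'000, 10'000, 100'000})
        {
            double gib = double(max_rows_per_group) * double(bytes_per_row)
                         / (1024.0 * 1024.0 * 1024.0);
            std::cout << bytes_per_row << " B/row -> up to ~" << gib
                      << " GiB buffered for a single row group\n";
        }
    }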

@arthurpassos
Collaborator Author

Looking at the sources that generate ClickHouse chunks, most of them, if not all, respect max_block_size, which defaults to 65409 rows.

That says nothing about how many bytes are being loaded, though.

@arthurpassos
Collaborator Author

Ok, I also need to look into max_insert_block_size and how these two play together. The default for this one is 1 million rows.

@arthurpassos
Collaborator Author

arthurpassos commented Jan 15, 2025

Summary of current situation:

As far as I could tell, max_insert_block_size does not play a role in this case.

The size of the blocks that are passed to the "ParquetWriter" is controlled by max_block_size. How many blocks go into a parquet row group is controlled by output_format_parquet_row_group_size and output_format_parquet_row_group_size_bytes. Bear in mind these parquet settings act like a minimum, not a maximum. There is an extra safeguard in place that checks if a block is greater than output_format_parquet_row_group_size * 2. In that case, it splits the block into N row groups.

There isn't a similar safeguard for bytes. So, theoretically, one could have a data source that generates blocks with fewer than output_format_parquet_row_group_size * 2 rows but very large columns. That could become a problem, and there is no safeguard in place for it.

@arthurpassos
Collaborator Author

Possible solution: implement a similar safeguard that splits chunks into multiple row groups when they are far too big in bytes (sketched below).

Imho, I would not waste time implementing this unless we have evidence that this is a problem.
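
To make the proposal concrete, here is a hypothetical sketch. ChunkSize and numRowGroupsFor are made-up names, the thresholds are example values rather than ClickHouse defaults, and the idea is simply to mirror the existing rows-based 2x guard with a bytes-based one and take the larger of the two split counts.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>

    // Sizes of a chunk as the output format would see them (made-up helper type).
    struct ChunkSize
    {
        size_t rows = 0;
        size_t bytes = 0;   // total allocated bytes of all columns in the chunk
    };

    // Decide how many row groups to cut a chunk into: the existing rows-based
    // guard plus the proposed (hypothetical) bytes-based one.
    size_t numRowGroupsFor(const ChunkSize & chunk, size_t row_group_rows, size_t row_group_bytes)
    {
        size_t by_rows = 1;
        if (row_group_rows > 0 && chunk.rows >= row_group_rows * 2)
            by_rows = chunk.rows / row_group_rows;

        size_t by_bytes = 1;
        if (row_group_bytes > 0 && chunk.bytes >= row_group_bytes * 2)
            by_bytes = chunk.bytes / row_group_bytes;

        return std::max(by_rows, by_bytes);
    }

    int main()
    {
        // 200k rows but ~4 GiB of data: the rows-based guard alone would keep this in one row group.
        ChunkSize chunk{200'000, 4ull * 1024 * 1024 * 1024};
        std::cout << numRowGroupsFor(chunk, 1'000'000, 512ull * 1024 * 1024)   // 512 MiB as an example byte target
                  << " row group(s)\n";                                        // prints 8
    }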

@arthurpassos
Collaborator Author

arthurpassos commented Jan 17, 2025

QA will take over

@Selfeer
