Parquet multi-chunk upload (split to RowGroups) #591
Raised by Alexander.

From Alexander:
Chunks are grouped together into row groups. The number of rows in a chunk is not controlled by this class, so a chunk may arrive with more rows than fit into a single row group. For that case there is logic in place that splits the chunk into multiple row groups when it is bigger than the configured row-group size. Aside from that, when the ClickHouse native encoder is being used and parallel processing is enabled, there is some extra logic to avoid overuse of resources: it briefly stalls the process in that situation.
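For illustration, here is a minimal sketch of that row-based splitting. It is not the project's own `ParquetWriter`; it uses pyarrow, and `ROW_GROUP_ROWS` is a stand-in for whatever the configured row-group row limit is.

```python
import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP_ROWS = 100_000  # stand-in for the configured row-group row limit

def write_chunk(writer: pq.ParquetWriter, chunk: pa.Table) -> None:
    """Write `chunk`, splitting it into several row groups if it has too many rows."""
    offset = 0
    while offset < chunk.num_rows:
        piece = chunk.slice(offset, ROW_GROUP_ROWS)  # at most ROW_GROUP_ROWS rows
        writer.write_table(piece)                    # each call flushes its own row group(s)
        offset += piece.num_rows

table = pa.table({"id": list(range(250_000))})
with pq.ParquetWriter("example.parquet", table.schema) as writer:
    write_chunk(writer, table)
# example.parquet now has three row groups: 100k, 100k and 50k rows
```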
Essentially, the number of row groups that can be written in parallel is controlled and guaranteed by a dedicated setting, and each row group can contain up to a configured number of rows. The number of bytes per row group only comes into play on two occasions.
Therefore, it seems theoretically possible to crash ClickHouse by using large columns. The issue could arise in a case where the columns are large and the number of rows in the chunk is < row_group_rows * 2.
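To make the concern concrete, a back-of-the-envelope calculation with assumed numbers (the row limit and the per-row size below are illustrative, not measured):

```python
# Assumed numbers only, to illustrate the scale of the problem:
row_group_rows = 1_000_000       # hypothetical row-group row limit
rows_in_chunk = 1_000_000        # within the limit, so no row-based split occurs
avg_row_bytes = 20 * 1024        # a ~20 KiB string/array column per row (assumed)

buffered = rows_in_chunk * avg_row_bytes
print(f"{buffered / 1024**3:.1f} GiB buffered for a single row group")
# -> 19.1 GiB held in memory before the row group is flushed
```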
Looking at the sources that generate ClickHouse chunks, most of them, if not all, respect the row-based block size setting, which says nothing about how many bytes are being loaded.
Ok, I also need to look into that.
Summary of the current situation: as far as I could tell, the size of the blocks that are passed to the "ParquetWriter" is controlled in terms of rows, and there isn't a similar safeguard for bytes. So, theoretically, one could have a data source that generates blocks that contain rows < output_format_parquet_row_group_size * 2 together with large columns. That could become a problem, and there is no safeguard in place.
Possible solution: implement a similar safeguard that splits chunks into multiple row groups when they are way too big. Imho, I would not spend time implementing this unless we have evidence that it is an actual problem.
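A hedged sketch of what such a byte-based safeguard could look like, again using pyarrow rather than the project's own writer. `ROW_GROUP_BYTES` is a hypothetical new limit, and `Table.nbytes` is used only as a rough size estimate.

```python
import pyarrow as pa
import pyarrow.parquet as pq

ROW_GROUP_ROWS = 1_000_000        # existing row-based limit (assumed value)
ROW_GROUP_BYTES = 512 * 1024**2   # hypothetical byte-based limit (~512 MiB)

def split_chunk(chunk: pa.Table):
    """Yield slices of `chunk` that respect both the row limit and the byte limit."""
    avg_row_bytes = max(1, chunk.nbytes // max(1, chunk.num_rows))
    rows_by_bytes = max(1, ROW_GROUP_BYTES // avg_row_bytes)
    rows_per_group = min(ROW_GROUP_ROWS, rows_by_bytes)
    offset = 0
    while offset < chunk.num_rows:
        piece = chunk.slice(offset, rows_per_group)
        yield piece
        offset += piece.num_rows

def write_chunk(writer: pq.ParquetWriter, chunk: pa.Table) -> None:
    for piece in split_chunk(chunk):
        writer.write_table(piece)  # one slice -> one row group
```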
QA will take over.
Mimic MergeTree blocks as Parquet RowGroups. That will reduce memory usage.
Consider using min_insert_block_size_rows/bytes as well
Consider output_format_parallel_formatting
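For reference, the settings mentioned in the comments above can be passed per query. The snippet below is only an illustration, assuming the clickhouse-connect Python client; the table names and values are placeholders to tune for the actual workload.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

settings = {
    # cap rows per row group when the server itself writes Parquet
    "output_format_parquet_row_group_size": 100_000,
    # bound insert blocks by rows *and* bytes
    "min_insert_block_size_rows": 100_000,
    "min_insert_block_size_bytes": 256 * 1024 * 1024,
    # allow parallel formatting of output
    "output_format_parallel_formatting": 1,
}

# `target` and `source` are placeholder table names
client.command("INSERT INTO target SELECT * FROM source", settings=settings)
```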