[Feat] Update Arrow layers to support both RecordBatch and Table input #145
Comments
cc @felixpalmer as well |
No objections, just engaging in the conversation:
|
For anyone else looking at this issue, it's probably good to define some terminology:
Extract how, and would this involve CPU copies? If we have a Only allowing Only allowing (It may be that I'm constrained a bit in my thinking because I'm used to the existing binary attributes API. Perhaps there would be a way to accept
That's fair, and I originally agreed. But it's very easy to do a loop over the batches of a table and create a deck.gl layer for each one. That also shows the user that it's rendering one deck.gl layer per batch.
Correct. The existing implementation already allows this, just expects Lonboard uses this feature a lot because it allows Lonboard to move accessors to JS separately from the main data table. So when an attribute like The complexity here is that there's no guarantee that the It's easier for the JS implementation to accept only
It's not usually enough (for JS native applications) to specify an existing column within a table because there might be some operation required to derive the accessor data from the original data in the table. In the case that an existing column should be passed directly as an accessor, it's also very easy to do this in the existing API:
ScatterplotLayer({
  data: recordBatch,
  getFillColor: recordBatch.getChild("fillColor"), // accesses the relevant `Data` by name
})
Essentially yes. There's no theoretical requirement here for a primary |
Nice, I am copying this to the loaders.gl ArrowJS docs which we can link to.
Yes the trivial (initial) implementation would be to "concat" all the data into a CPU array and then pass that to deck.gl. However, making deck.gl accept an array of arrays for each attribute would not be hard, and it could then allocate a GPU buffer of the required size and just do a bunch of async GPU buffer writes. This would be part of the "deck.gl v10" overhaul (but could happen sooner of course if the direction is set). I understand that you are keen to stay true to the spirit of arrow and avoid any CPU copies, however from my point of view it would be fine to offer Table support now and just document that this currently involves CPU memory copies. People with zero-copy priorities would be able to supply RecordBatches.
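As a rough illustration of the "one allocation, many writes" idea above, here is a minimal sketch. It uses a plain Float32Array as a CPU stand-in for the GPU buffer, and the names planWrites and writeChunks are hypothetical, not part of deck.gl or luma.gl. With a real GPU buffer, each copy would presumably become a separate buffer write at the same offset.

```typescript
// Hypothetical helper: decide where each chunk lands in a single target buffer.
interface WritePlan {
  totalLength: number; // total number of elements across all chunks
  offsets: number[]; // element offset of each chunk within the target
}

function planWrites(chunks: Float32Array[]): WritePlan {
  const offsets: number[] = [];
  let totalLength = 0;
  for (const chunk of chunks) {
    offsets.push(totalLength);
    totalLength += chunk.length;
  }
  return { totalLength, offsets };
}

// CPU stand-in for the GPU side: allocate once, then copy each chunk to its offset.
function writeChunks(chunks: Float32Array[]): Float32Array {
  const { totalLength, offsets } = planWrites(chunks);
  const target = new Float32Array(totalLength);
  chunks.forEach((chunk, i) => target.set(chunk, offsets[i]));
  return target;
}
```

The plan only needs the chunk lengths, so it could be computed before any data is uploaded.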
Yes, this seems to be our biggest philosophical difference that we keep coming back to. To avoid a never-ending thread, perhaps best to discuss in person.
My concern is mainly that I don't like mixing abstractions. Either a layer takes a map of Data, or it takes a Table.
|
Note that implementing concat for arbitrary input is not entirely trivial, as for some data types (especially bitmasks) you can't just concatenate the underlying buffers. It is very cool that you could allocate one GPU buffer and copy multiple regions of source data into that one target buffer. That does make
That's fair.
I think at some level I don't know the pros/cons of having many layers in deck.gl. It seems the primary drawback is in picking, where you'd overflow the (current) max of 255 layers?
FWIW I do agree that presenting a I suppose it's not too hard to support both, especially if the table input is just concatenated into a single record batch.
if (input instanceof arrow.Table) {
  batch = concatTable(input);
} else if (input instanceof arrow.RecordBatch) {
  batch = input;
} else {
  throw new Error("unknown input");
}
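The concatTable helper in that snippet is not defined anywhere in this thread. To show why concatenation is not just appending buffers, here is a minimal sketch for validity bitmaps, assuming Arrow's bit order (least-significant bit first within each byte) and ignoring chunk offsets. BitmapChunk and concatBitmaps are hypothetical names, not Arrow JS APIs.

```typescript
// A validity bitmap plus the number of rows it describes (hypothetical shape).
interface BitmapChunk {
  bits: Uint8Array;
  length: number;
}

// Read bit `i`, least-significant bit first within each byte.
function getBit(bitmap: Uint8Array, i: number): boolean {
  return (bitmap[i >> 3] & (1 << (i & 7))) !== 0;
}

// Set bit `i` in place.
function setBit(bitmap: Uint8Array, i: number): void {
  bitmap[i >> 3] |= 1 << (i & 7);
}

// Concatenate bitmaps bit by bit. Appending the bytes would be wrong whenever
// a chunk's row count is not a multiple of 8.
function concatBitmaps(chunks: BitmapChunk[]): BitmapChunk {
  const length = chunks.reduce((sum, c) => sum + c.length, 0);
  const bits = new Uint8Array(Math.ceil(length / 8));
  let out = 0;
  for (const chunk of chunks) {
    for (let i = 0; i < chunk.length; i++, out++) {
      if (getBit(chunk.bits, i)) setBit(bits, out);
    }
  }
  return { bits, length };
}
```

For example, a 3-row chunk and a 2-row chunk each occupy one byte, but the combined 5 rows also fit in one byte, so the second chunk's bits have to be shifted into place.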
On the contrary I think using Just as the existing deck API accepts an array of JS objects into the The existing deck API doesn't require that those existing JS objects already contain the accessor information. Those accessors are defined either as function callbacks or as new buffers passed directly as attribute. This directly maps to the case of passing I forget this because I never use it myself, but the existing Arrow implementation actually allows a function callback here as well, in which case the function callback receives an arrow deck.gl-community/modules/arrow-layers/src/utils/utils.ts Lines 191 to 198 in 4450249
This is true, but separating the accessors outside of the table should make it easier for deck to see when the geometry/main table has updated vs only a single accessor. So in Lonboard I never update the main table after initial render of each layer. This means that the geometries never need to be rendered again from scratch. Passing the buffers in as separate objects shows deck that only those accessors need to be recomputed. |
Good discussion. I think we are aligned enough to proceed. Just continuing the conversation on some of the interesting topics
True. The intention of the JS accessor functions wasn't really to support using columns from different tables, but I suppose they can be used that way. The binary accessor API certainly wasn't designed or audited in a thoughtful way, we just tried to quickly expose a way to allow binary data to be passed in. The nascent mental model in my mind is that a deck.gl layer would accept a GPUTable type object (an "Arrow table style" class where columns are GPU Buffers). We'd build a layer-independent system that maps Arrow Tables, Arrow RecordBatches and Data objects into GPUTables and perhaps GPUColumns, etc. Then the layer can accept either a GPUTable or an Arrow Table, in which case it will convert to a GPUTable under the hood.
My position is that as a functional programming API, "the business of deck.gl is diffing", so if we treat Arrow Tables as first class citizens we should implement diffing that understands the internal structure of Arrow Tables and ignores any columns who have
As a side note, it would also be neat if there was a "declarative" way to specify simple accessors into an arrow table. Maybe strings with column names, which wouldn't require JS code. Then we could support arrow layers (for simple Arrow tables) in deck.gl/json, deck.gl playground, traditional pydeck etc.
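One possible reading of the "strings with column names" idea, as a minimal sketch: an accessor is either a column name to look up or a column passed directly. SimpleTable, Accessor and resolveAccessor are hypothetical stand-ins, not existing deck.gl or Arrow JS types.

```typescript
// Hypothetical minimal table shape: named columns of equal length.
type Column = Float32Array | Uint8Array;

interface SimpleTable {
  numRows: number;
  columns: Record<string, Column>;
}

// An accessor is either a column name (declarative) or a column supplied directly.
type Accessor = string | Column;

function resolveAccessor(table: SimpleTable, accessor: Accessor): Column {
  if (typeof accessor === "string") {
    const column = table.columns[accessor];
    if (column === undefined) {
      throw new Error(`Unknown column: ${accessor}`);
    }
    return column;
  }
  // Already a column: pass it through unchanged.
  return accessor;
}
```

Because a string accessor is plain data, it could in principle be serialized in a JSON layer description, which is the use case mentioned above.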
I happen to have a new PoC picking manager in luma.gl that uses WebGPU / WebGL2 techniques to remove the picking limit. However there is still a performance "limit" to how many layers deck.gl can handle. Hundreds of layers performs well, but thousands will start to tax the diffing engine and generate a lot of draw calls etc. The decline will be gradual but layers aren't completely "free". One use case I didn't think of mentioning is streaming loads via loaders.gl from non-Arrow formats into Arrow. There I want to emit RecordBatches as data comes in and since the data size is often not known a priori, it is impossible to limit the number of batches being generated (other than stop emitting batches after some limit and finish off with a "monster" batch, or perhaps do exponentially larger batches towards the end...) |
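The "exponentially larger batches" option can be sketched as follows. This is only an illustration of the sizing policy: batchSizes is a hypothetical helper, and it takes the total row count as an argument for demonstration, whereas a real streaming loader would not know it up front and would simply grow the size after each emitted batch.

```typescript
// Hypothetical policy: multiply the batch size by `factor` after every batch,
// so the number of batches grows roughly logarithmically with the row count.
function batchSizes(totalRows: number, initialSize: number, factor = 2): number[] {
  if (initialSize <= 0 || factor < 1) {
    throw new Error("initialSize must be positive and factor must be at least 1");
  }
  const sizes: number[] = [];
  let size = initialSize;
  let remaining = totalRows;
  while (remaining > 0) {
    const take = Math.min(size, remaining);
    sizes.push(take);
    remaining -= take;
    size *= factor;
  }
  return sizes;
}
```

With an initial size of 100 and doubling, 1000 rows are split into four batches of 100, 200, 400 and 300 rows.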
From my point of view, those accessor functions are creating new data. The new data is usually derived from the primary table ( So likewise with the Arrow API it's possible to enforce that the attributes are part of the original table, but that's unnecessary rigidity. By accepting a
If the attribute data is already in the table:
ScatterplotLayer({
  data: table,
  getColor: table.getChild("colors"),
})
If the attribute data is generated separately:
const colorVector = new Arrow.Vector(...);
// This assertion would be moved internal somewhere
assert(table.numRows === colorVector.length);
ScatterplotLayer({
  data: table,
  getColor: colorVector,
})
The GPU concepts should align; a |
Target Use Case
Simplify the implementation of Arrow layers by requiring `RecordBatch` input, not `Table` input.
Why:
This maps to the existing data structures supported by the deck.gl binary attributes API.
It's more reliable for the end user, as they know that a single arrow layer will always create one underlying deck.gl layer.
Remove need for internal rechunking code.
Multiple arrow `Vector`s that have the same overall length can have different chunking structures. E.g. despite column A and column B both having length `30`, column A could have two chunks (`Data` in Arrow JS terminology) of `15` rows each, and column B could have three chunks of `10` rows each. If deck.gl's Arrow support allowed Vector input, then deck.gl would have to manage rechunking the data across input objects.
In Lonboard, I don't currently hit this issue because I pre-process the data in Python, but for JS-facing APIs, I think it would significantly simplify the deck.gl implementation to accept only `RecordBatch` input, which enforces contiguous buffers. This pushes the responsibility of rechunking onto the user, if necessary. There can be multiple options for rechunking Arrow data, including pure-JS and Wasm compiled options, and the end user can choose the best option for their use case.
Proposal
Right now the arrow layers accept a `Table` for the main `data` prop and arrow `Vector` objects for any accessors. This would change these layers to accept a `RecordBatch` for the main `data` prop and arrow contiguous arrays (called `Data` in the Arrow JS implementation) for any accessors.
Details
No response
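To make the chunking example in the description concrete (column A as 15 + 15 rows, column B as 10 + 10 + 10 rows), here is a minimal sketch of the bookkeeping a rechunking step would need. It only computes the chunk lengths both columns could be sliced into; it does not touch Arrow data, and boundaries and alignedChunkLengths are hypothetical names.

```typescript
// Cumulative end offsets for a list of chunk lengths, e.g. [15, 15] -> [15, 30].
function boundaries(chunkLengths: number[]): number[] {
  const ends: number[] = [];
  let total = 0;
  for (const length of chunkLengths) {
    total += length;
    ends.push(total);
  }
  return ends;
}

// Chunk lengths that both columns can be sliced into so that no chunk
// straddles a chunk boundary of either column.
function alignedChunkLengths(a: number[], b: number[]): number[] {
  const endsA = boundaries(a);
  const endsB = boundaries(b);
  const totalA = endsA.length > 0 ? endsA[endsA.length - 1] : 0;
  const totalB = endsB.length > 0 ? endsB[endsB.length - 1] : 0;
  if (totalA !== totalB) {
    throw new Error("Columns must have the same overall length");
  }
  const lengths: number[] = [];
  let i = 0;
  let j = 0;
  let prev = 0;
  // Walk both boundary lists in order, emitting the gap to the next boundary.
  while (i < endsA.length && j < endsB.length) {
    const next = Math.min(endsA[i], endsB[j]);
    if (next > prev) lengths.push(next - prev); // skip zero-length chunks
    prev = next;
    if (endsA[i] === next) i++;
    if (endsB[j] === next) j++;
  }
  return lengths;
}
```

For the example in the description this yields chunks of 10, 5, 5 and 10 rows. Requiring `RecordBatch` input avoids this step inside deck.gl, since every column in a batch shares the same single chunk.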