normalize "in" predicate as disjunction of "==" #423

lr4d · 2021-02-24T15:41:29Z

Description:

The following two statements are equivalent:

x in [1,2,3]
(x == 1) or (x == 2) or (x == 3)

This approach simplifies the function: it calls itself once per value of an "in" predicate, instead of iterating over a for loop that essentially re-runs the code block under if op == "==": for each value.
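To illustrate the idea, here is a minimal sketch of the recursion on scalar values. The function name `evaluate` and the operator table `_OPS` are assumptions made up for this example; they are not the names or the structure used in the actual change.

```python
import operator

# Hypothetical operator table, for illustration only.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def evaluate(value, op, operand):
    """Evaluate one (op, operand) predicate against a scalar value."""
    if op == "in":
        # x in [a, b, c]  ->  (x == a) or (x == b) or (x == c)
        # The generator lets any() stop at the first match.
        return any(evaluate(value, "==", item) for item in operand)
    if op not in _OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    return _OPS[op](value, operand)
```

With this shape, `evaluate(2, "in", [1, 2, 3])` takes the same code path as three separate "==" checks, so there is only one place where equality is implemented.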

Closes #xxxx
Changelog entry

fjetter · 2021-02-25T08:36:35Z

Does any() abort evaluation as soon as it finds the first value that evaluates to True, or does it continue the evaluation? Aborting early could vastly improve best-case scenarios, but I'm not sure how the stdlib implementation deals with this (https://docs.python.org/3/library/functions.html#any doesn't tell us anything about it).

Edit:

Yes, it does:

In [4]: class Boom:
   ...:     def __bool__(self):
   ...:         raise Exception()

In [5]: b = Boom()

In [7]: any([b])
---------------------------------------------------------------------------
Exception
....

In [8]: any([True, b])
Out[8]: True
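One caveat worth spelling out: any() only stops consuming its argument early. If the argument is a list, every element has already been computed before any() sees it. A small sketch of the difference, where `check` and `calls` are helper names invented for this example:

```python
calls = []


def check(value, target):
    """Record each comparison so we can see how many were performed."""
    calls.append(value)
    return value == target


# Generator expression: comparisons run lazily, any() stops at the first hit.
calls.clear()
result_lazy = any(check(v, 1) for v in [1, 2, 3])
lazy_calls = list(calls)

# List comprehension: all comparisons run before any() is even called.
calls.clear()
result_eager = any([check(v, 1) for v in [1, 2, 3]])
eager_calls = list(calls)
```

Both calls return True, but only the generator version skips the remaining comparisons, so the early abort only helps if the per-value evaluation is passed lazily.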

fjetter · 2021-02-25T08:37:17Z

Eventually we might want to consider passing the filters directly to pyarrow, since they have implemented this by now as well. I would expect them to deal with these things much faster than we can in Python.

fjetter · 2021-02-25T08:41:24Z

Do we have any kind of benchmarks? What I'm a bit concerned about is that we now interact with the parquet reader for every element in the list. I don't know how expensive this is.

lr4d · 2021-02-25T15:55:37Z

> What I'm a bit concerned about is that we now interact with the parquet reader for every element in the list. I don't know how expensive this is.

Very valid concern. We can isolate the operation evaluation logic in a separate function and call that directly so that we don't need to bother about this
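Roughly what I have in mind, as a sketch only: `CountingReader`, `filter_column` and `_evaluate_values` are hypothetical names for this example, and the reader is a stand-in rather than the real parquet reader. The recursion happens on values that are already in memory, so the reader is touched once per predicate regardless of how many values the "in" list has.

```python
class CountingReader:
    """Stand-in for an expensive column reader (hypothetical)."""

    def __init__(self, columns):
        self._columns = columns
        self.reads = 0

    def read_column(self, name):
        self.reads += 1
        return self._columns[name]


def _evaluate_values(values, op, operand):
    """Pure evaluation on in-memory values; never touches the reader."""
    if op == "in":
        # One "==" mask per candidate value, then OR the masks row by row.
        masks = [_evaluate_values(values, "==", item) for item in operand]
        if not masks:
            return [False] * len(values)
        return [any(row) for row in zip(*masks)]
    if op == "==":
        return [v == operand for v in values]
    raise ValueError(f"unsupported operator: {op!r}")


def filter_column(reader, column, op, operand):
    values = reader.read_column(column)  # the only reader interaction
    return _evaluate_values(values, op, operand)
```

For a column holding `[1, 5, 3]`, `filter_column(reader, "x", "in", [1, 2, 3])` yields the mask `[True, False, True]` with a single `read_column` call.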

fjetter · 2021-03-01T10:14:45Z

cc @mlondschien this might interest you

xref #325

fjetter · 2021-03-01T10:15:48Z

> Very valid concern. We can isolate the operation evaluation logic in a separate function and call that directly so that we don't need to bother about this

It all depends on how complex the end state will be, since your intention is to simplify things. I think this would be ok, though.

mlondschien · 2021-03-01T10:23:27Z

Thanks @fjetter for pinging me. This does not seem to explode the complexity exponentially as we're normalizing for a single column only. So I don't think the concerns mentioned in #325 would apply here.

lr4d · 2021-03-02T09:44:31Z

> Eventually we might want to consider passing the filters directly to pyarrow, since they have implemented this by now as well. I would expect them to deal with these things much faster than we can in Python.

Good point. Do you know how "mature" this logic for pyarrow is atm? It might make more sense to invest in using their functionality directly rather than this kind of work

normalize "in" predicate as disjunction of "=="

64f89d7
