Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize "in" predicate as disjunction of "==" #423

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

lr4d
Copy link
Collaborator

@lr4d lr4d commented Feb 24, 2021

Description:

The following two statements are equivalent:

  • x in [1,2,3]
  • (x == 1) or (x == 2) or (x == 3)

This approach simplifies the function by just calling itself instead of iterating over a for loop and essentially running the code block under if op == "==": for each value of an "in" predicate

  • Closes #xxxx
  • Changelog entry

@fjetter
Copy link
Collaborator

fjetter commented Feb 25, 2021

Does the any statement abort evaluation as soon as it finds the first value which evaluates to True or does it continue the evaluation? aborting early could vastly improve best case scenarios but I'm not sure about how the stdlib implementation deals with this (https://docs.python.org/3/library/functions.html#any doesn't tell us anything about it)

Edit:

Yes it does

In [4]: class Boom:
   ...:     def __bool__(self):
   ...:         raise Exception()

In [7]: any([b])
---------------------------------------------------------------------------
Exception
....

In [8]: any([True, b])
Out[8]: True

@fjetter
Copy link
Collaborator

fjetter commented Feb 25, 2021

eventually we might want to consider passing the filters directly to pyarrow since they implemented this by now as well. I would expect them to deal with these things much faster than we are in python.

@fjetter
Copy link
Collaborator

fjetter commented Feb 25, 2021

do we have any kind of benchmarks? What I'm a bit concerned about is that we interact with the parquet reader now for every element in the list. I don't know how expensive this is

@lr4d
Copy link
Collaborator Author

lr4d commented Feb 25, 2021

What I'm a bit concerned about is that we interact with the parquet reader now for every element in the list. I don't know how expensive this is

Very valid concern. We can isolate the operation evaluation logic in a separate function and call that directly so that we don't need to bother about this

@fjetter
Copy link
Collaborator

fjetter commented Mar 1, 2021

cc @mlondschien this might interest you

xref #325

@fjetter
Copy link
Collaborator

fjetter commented Mar 1, 2021

Very valid concern. We can isolate the operation evaluation logic in a separate function and call that directly so that we don't need to bother about this

All depends on how complex the end state will be since your intention is to simplify things. I think this would be ok, though

@mlondschien
Copy link
Contributor

Thanks @fjetter for pinging me. This does not seem to explode the complexity exponentially as we're normalizing for a single column only. So I don't think the concerns mentioned in #325 would apply here.

@lr4d
Copy link
Collaborator Author

lr4d commented Mar 2, 2021

eventually we might want to consider passing the filters directly to pyarrow since they implemented this by now as well. I would expect them to deal with these things much faster than we are in python.

Good point. Do you know how "mature" this logic for pyarrow is atm? It might make more sense to invest in using their functionality directly rather than this kind of work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants