-
-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unpack jv values directly #50
Unpack jv values directly #50
Conversation
jq.pyx
Outdated
cdef int i | ||
cdef jv ik | ||
cdef jv iv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to avoid using variable names like i
. Are there more descriptive names that could be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, no problem. Changed those to idx
/idx_key
/idx_value
. Tell me if that's not clear enough.
jq.pyx
Outdated
while True: | ||
if not jv_object_iter_valid(v, i): | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would while jv_object_iter_valid(v, i):
do the same thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'oh, of course. Forgot to tidy it up after converting jv_object_foreach
to Python. Replaced.
results.append(iterator._next_string()) | ||
except StopIteration: | ||
return "\n".join(results) | ||
return "\n".join(json.dumps(v) for v in self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is using _jv_to_python
and json.dumps
faster than using jv_dump_string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tested it, but very likely not. However, it's simpler. Tell me if you'd prefer performance over simplicity here. Provided there's performance benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string
, we could have something like _next_result
which returns a jv
, and then the two different methods (text
and __next__
) can handle the result appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to throw together a little test for this, but meanwhile I pushed the change and it took me 25 lines:
diff --git a/jq.pyx b/jq.pyx
index f5642fe..54f024a 100644
--- a/jq.pyx
+++ b/jq.pyx
@@ -254,7 +254,13 @@ cdef class _ProgramWithInput(object):
return _ResultIterator(self._jq_state_pool, self._bytes_input)
def text(self):
- return "\n".join(json.dumps(v) for v in self)
+ iterator = self._make_iterator()
+ results = []
+ while True:
+ try:
+ results.append(iterator._next_string())
+ except StopIteration:
+ return "\n".join(results)
def all(self):
return list(self)
@@ -288,7 +294,25 @@ cdef class _ResultIterator(object):
return self
def __next__(self):
+ cdef jv value
+ self._next_jv(&value)
+ return _jv_to_python(value)
+
+ cdef unicode _next_string(self):
cdef int dumpopts = 0
+ cdef jv value
+ cdef jv dump
+
+ self._next_jv(&value)
+ dump = jv_dump_string(value, dumpopts)
+ try:
+ string = jv_string_value(dump).decode("utf8")
+ finally:
+ jv_free(dump)
+
+ return string
+
+ cdef int _next_jv(self, jv *presult) except 1:
while True:
if not self._ready:
self._ready_next_input()
@@ -296,7 +320,8 @@ cdef class _ResultIterator(object):
result = jq_next(self._jq)
if jv_is_valid(result):
- return _jv_to_python(result)
+ presult[0] = result
+ return 0
elif jv_invalid_has_msg(jv_copy(result)):
error_message = jv_invalid_get_msg(result)
message = jv_string_value(error_message).decode("utf8")
jq.pyx
Outdated
try: | ||
arr.append(_jv_to_python(iv)) | ||
finally: | ||
jv_free(iv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would changing the contract of _jv_to_python
to consume its input simplify some of this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I considered this before in context of memory usage, but couldn't wrap my head around it. Did some experiments and amended it. Should be better now.
Also, I love your usage of "contract" 😀 Nobody would understand me if I tried talking like that at my place 😞.
Add tests verifying various value types and combinations are preserved after passing through the program.
Even though "message-less" invalid values don't technically need "freeing", doing so is supported by the library, and could be good for uniformity and future-proofing.
4ec74ff
to
8a5769a
Compare
jq.pyx
Outdated
cdef jv idx_key | ||
cdef jv idx_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started them with idx_
to indicate those are key/value at the idx
index. As in "index'es key" and "index'es value". What would work better for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps property_name
and property_value
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. However, this way we would be assigning the return value of jv_object_iter_key()
to property_name
, which I think would be confusing. Like this:
property_name = jv_object_iter_key(value, idx)
So, for the start I renamed them to property_key
and property_value
. Tell me if you would prefer property_name
regardless, and I'll change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes too much difference. I think "name" matches the JSON spec, whereas "key" matches jq.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I see where property_
comes from. Would it be OK that we store array elements in property_value
as well?
Anyway, just thought I caught a mismatch here and tried to make it better. I think these are such minor details that it wouldn't matter in the end. I'll just put there whatever you'd like :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry, didn't notice you merged this already :D Thanks a lot 🎉!
results.append(iterator._next_string()) | ||
except StopIteration: | ||
return "\n".join(results) | ||
return "\n".join(json.dumps(v) for v in self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string
, we could have something like _next_result
which returns a jv
, and then the two different methods (text
and __next__
) can handle the result appropriately.
8a5769a
to
bf8dafe
Compare
Well, whaddaya know: nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m18.325s
user 0m17.004s
sys 0m1.353s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m30.572s
user 0m29.437s
sys 0m1.246s That's processing a (space-less) 266MB JSON file. First one is the yesterday's version, second one is the one I just pushed. |
That's not counting the effect of |
Now with a little bit of nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m20.934s
user 0m20.038s
sys 0m1.089s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m32.288s
user 0m31.270s
sys 0m1.172s This time it's a (loosely-spaced) 716KB JSON file, repeated 500 times. |
So, I think the simple version wins, but I'll keep the more complex one posted for now. Tell me what you'd prefer, and also how would you like me to call those |
Here are the results with the current master branch for reference: nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m29.568s
user 0m28.468s
sys 0m1.189s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'
real 0m31.145s
user 0m30.328s
sys 0m0.991s |
@mwilliamson, please don't hesitate to tell me if anything else is bothering you about this PR, even if it's just difficult to read/review. I'd be glad to improve it to your satisfaction 😃 I really need this stream parsing implementation to start pushing more data to the new Linux Kernel CI result database 😁 |
So using |
bf8dafe
to
30531bb
Compare
It seems so. The Python implementation has more users, I suppose, and is more optimized. I pushed the version using Do you want me to rename the |
Instead of dumping and parsing JSON, convert JQ's "jv" structures into Python values directly by recursively walking them. The naive implementation is still twice as fast.
30531bb
to
7f6137b
Compare
Thanks, merged. |
This adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.
Compared to #49, this adds some tests checking the various value types and combinations are preserved. This also names the unpacking function
_jv_to_python()
instead of_jv_unpack()
.