Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpack jv values directly #50

Merged
merged 3 commits into from
Sep 13, 2020

Conversation

spbnick
Copy link
Contributor

@spbnick spbnick commented Sep 9, 2020

This adds unpacking jv values into Python values directly, bypassing JSON re-parsing, which is roughly twice as fast.

Compared to #49, this adds some tests checking the various value types and combinations are preserved. This also names the unpacking function _jv_to_python() instead of _jv_unpack().

@spbnick spbnick mentioned this pull request Sep 9, 2020
jq.pyx Outdated
Comment on lines 66 to 68
cdef int i
cdef jv ik
cdef jv iv
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to avoid using variable names like i. Are there more descriptive names that could be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, no problem. Changed those to idx/idx_key/idx_value. Tell me if that's not clear enough.

jq.pyx Outdated
Comment on lines 96 to 98
while True:
if not jv_object_iter_valid(v, i):
break
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would while jv_object_iter_valid(v, i): do the same thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, of course. Forgot to tidy it up after converting jv_object_foreach to Python. Replaced.

results.append(iterator._next_string())
except StopIteration:
return "\n".join(results)
return "\n".join(json.dumps(v) for v in self)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is using _jv_to_python and json.dumps faster than using jv_dump_string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tested it, but very likely not. However, it's simpler. Tell me if you'd prefer performance over simplicity here. Provided there's performance benefit.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string, we could have something like _next_result which returns a jv, and then the two different methods (text and __next__) can handle the result appropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to throw together a little test for this, but meanwhile I pushed the change and it took me 25 lines:

diff --git a/jq.pyx b/jq.pyx
index f5642fe..54f024a 100644
--- a/jq.pyx
+++ b/jq.pyx
@@ -254,7 +254,13 @@ cdef class _ProgramWithInput(object):
         return _ResultIterator(self._jq_state_pool, self._bytes_input)
 
     def text(self):
-        return "\n".join(json.dumps(v) for v in self)
+        iterator = self._make_iterator()
+        results = []
+        while True:
+            try:
+                results.append(iterator._next_string())
+            except StopIteration:
+                return "\n".join(results)
 
     def all(self):
         return list(self)
@@ -288,7 +294,25 @@ cdef class _ResultIterator(object):
         return self
 
     def __next__(self):
+        cdef jv value
+        self._next_jv(&value)
+        return _jv_to_python(value)
+
+    cdef unicode _next_string(self):
         cdef int dumpopts = 0
+        cdef jv value
+        cdef jv dump
+
+        self._next_jv(&value)
+        dump = jv_dump_string(value, dumpopts)
+        try:
+            string = jv_string_value(dump).decode("utf8")
+        finally:
+            jv_free(dump)
+
+        return string
+
+    cdef int _next_jv(self, jv *presult) except 1:
         while True:
             if not self._ready:
                 self._ready_next_input()
@@ -296,7 +320,8 @@ cdef class _ResultIterator(object):
 
             result = jq_next(self._jq)
             if jv_is_valid(result):
-                return _jv_to_python(result)
+                presult[0] = result
+                return 0
             elif jv_invalid_has_msg(jv_copy(result)):
                 error_message = jv_invalid_get_msg(result)
                 message = jv_string_value(error_message).decode("utf8")

jq.pyx Outdated
Comment on lines 88 to 91
try:
arr.append(_jv_to_python(iv))
finally:
jv_free(iv)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would changing the contract of _jv_to_python to consume its input simplify some of this code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I considered this before in context of memory usage, but couldn't wrap my head around it. Did some experiments and amended it. Should be better now.

Also, I love your usage of "contract" 😀 Nobody would understand me if I tried talking like that at my place 😞.

Add tests verifying various value types and combinations are preserved
after passing through the program.
Even though "message-less" invalid values don't technically need
"freeing", doing so is supported by the library, and could be good for
uniformity and future-proofing.
@spbnick spbnick force-pushed the unpack_jv_values_directly branch from 4ec74ff to 8a5769a Compare September 9, 2020 14:04
jq.pyx Outdated
Comment on lines 67 to 68
cdef jv idx_key
cdef jv idx_value
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are indices?

Copy link
Contributor Author

@spbnick spbnick Sep 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started them with idx_ to indicate those are key/value at the idx index. As in "index'es key" and "index'es value". What would work better for you?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps property_name and property_value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. However, this way we would be assigning the return value of jv_object_iter_key() to property_name, which I think would be confusing. Like this:

            property_name = jv_object_iter_key(value, idx)

So, for the start I renamed them to property_key and property_value. Tell me if you would prefer property_name regardless, and I'll change that.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes too much difference. I think "name" matches the JSON spec, whereas "key" matches jq.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now I see where property_ comes from. Would it be OK that we store array elements in property_value as well?
Anyway, just thought I caught a mismatch here and tried to make it better. I think these are such minor details that it wouldn't matter in the end. I'll just put there whatever you'd like :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, didn't notice you merged this already :D Thanks a lot 🎉!

results.append(iterator._next_string())
except StopIteration:
return "\n".join(results)
return "\n".join(json.dumps(v) for v in self)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be useful to know what the performance implications are. I (perhaps naively!) think that it shouldn't introduce too much complexity -- instead of just _next_string, we could have something like _next_result which returns a jv, and then the two different methods (text and __next__) can handle the result appropriately.

@spbnick spbnick force-pushed the unpack_jv_values_directly branch from 8a5769a to bf8dafe Compare September 10, 2020 09:50
@spbnick
Copy link
Contributor Author

spbnick commented Sep 10, 2020

Well, whaddaya know:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'                           
real    0m18.325s
user    0m17.004s
sys     0m1.353s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m30.572s
user    0m29.437s
sys     0m1.246s

That's processing a (space-less) 266MB JSON file. First one is the yesterday's version, second one is the one I just pushed.

@spbnick
Copy link
Contributor Author

spbnick commented Sep 10, 2020

That's not counting the effect of "\n".join(...), of course.

@spbnick
Copy link
Contributor Author

spbnick commented Sep 10, 2020

Now with a little bit of join, similar results:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m20.934s
user    0m20.038s
sys     0m1.089s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m32.288s
user    0m31.270s
sys     0m1.172s

This time it's a (loosely-spaced) 716KB JSON file, repeated 500 times.

@spbnick
Copy link
Contributor Author

spbnick commented Sep 10, 2020

So, I think the simple version wins, but I'll keep the more complex one posted for now. Tell me what you'd prefer, and also how would you like me to call those idx_key and idx_value vars. Thanks :)

@spbnick
Copy link
Contributor Author

spbnick commented Sep 10, 2020

Here are the results with the current master branch for reference:

nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time cat compressed.json | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m29.568s
user    0m28.468s
sys     0m1.189s
nkondras@bard:~/projects/github.com/kernelci/kcidb-data$ time for ((i=0; i<500; i++)); do cat sample.json; done | python3 -c $'import jq, sys\njq.compile(".").input(text=sys.stdin.read()).text()'

real    0m31.145s
user    0m30.328s
sys     0m0.991s

@spbnick
Copy link
Contributor Author

spbnick commented Sep 11, 2020

@mwilliamson, please don't hesitate to tell me if anything else is bothering you about this PR, even if it's just difficult to read/review. I'd be glad to improve it to your satisfaction 😃

I really need this stream parsing implementation to start pushing more data to the new Linux Kernel CI result database 😁

@mwilliamson
Copy link
Owner

So using json.dumps is faster than jv_string_value? If so, then that seems like the right way to go.

@spbnick spbnick force-pushed the unpack_jv_values_directly branch from bf8dafe to 30531bb Compare September 13, 2020 14:28
@spbnick
Copy link
Contributor Author

spbnick commented Sep 13, 2020

So using json.dumps is faster than jv_string_value? If so, then that seems like the right way to go.

It seems so. The Python implementation has more users, I suppose, and is more optimized. I pushed the version using json.dumps now.

Do you want me to rename the idx_key and idx_value vars into something else?

Instead of dumping and parsing JSON, convert JQ's "jv" structures into
Python values directly by recursively walking them. The naive
implementation is still twice as fast.
@spbnick spbnick force-pushed the unpack_jv_values_directly branch from 30531bb to 7f6137b Compare September 13, 2020 15:04
@mwilliamson mwilliamson merged commit 006b54f into mwilliamson:master Sep 13, 2020
@mwilliamson
Copy link
Owner

Thanks, merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants