-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thoughts about API design #110
Comments
This looks clean, below are my thoughts: I think more important that adding meta data to Query and its Candidates separately. It is important to have metadata for a reranking run: PTAL at the https://github.com/castorini/rank_llm/blob/main/src/rank_llm/result.py |
Agreed. There should be a pretty straightforward mapping from json/dict to Python objects. I envision
Agreed. An example of metadata attached to a |
What should be the name of the high-level class containing Query and
Candidates, and the metadata? Currently it is Results, but maybe we can
find a better alternative.
(The batch call then, can take a list of these objects, rather two lists
for queries and candidates separately)
…On Fri, Apr 19, 2024, 5:16 p.m. Jimmy Lin ***@***.***> wrote:
This looks clean, below are my thoughts: If we want the rerank to be
compatible with many retrievers, we should add creators to Query and
Candidates classes to convert str and json/dict to instances of these
classes and expect the user to call them before calling rerank.
Agreed. There should be a pretty straightforward mapping from json/dict to
Python objects. I envision from_json "importers" and to_json "exporters".
(We can quibble about method naming later.)
I think more important that adding meta data to Query and its Candidates
separately. It is important to have metadata for a reranking run: PTAL at
the https://github.com/castorini/rank_llm/blob/main/src/rank_llm/result.py
Where we store each prompt, response, and input output token counts in an
execution info. Each rerank for a given query will have a list of this
information. This doesn't belong to a query nor its candidate, but an
execution summary for query + its candidates.
Agreed. An example of metadata attached to a Query might be its id; in
the case of TREC topics, different fields (title, narrative, description),
etc.
—
Reply to this email directly, view it on GitHub
<#110 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABNSCSUHK3FEZY5YAKLTV6TY6GCSBAVCNFSM6AAAAABGPWB4V6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGI4TGMZWGY>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
This is what we agreed on with @ronakice. New Query and Candidate classes:
Existing Classes:
Then batch_batch would look like this:
@lintool PTAL and let me know what you think. The remaining issue is the name |
Captured discussion over Slack w/ @sahel-sh ... still unresolved points. This seems uncontroversial:
Note that However, what should The obvious extension would be:
But @sahel-sh points out: ...not batch friendly: when you always need to zip two lists to iterate through them together (list of queries and list of candidates), it is a strong signal that you should have a list of pairs. I also think the execution info belongs to both query and its candidates not just candidates. An alternative would be:
But at this point, you might as well wrap
So, we have S1, S2, S3 as options... |
S3 doesn't capture what I had in mind accurately:
where each |
There are a couple of minor issues with s1 naming as well that we should resolve if we decide to adopt it:
|
I guess I'm hung up over naming of Let me propose S4 then?
Then the issue is that we don't have anywhere to stuff the metadata. |
I like s4 better than s1. Similar to s1, the metadata can go into Candidates class (not the best fit imho). I think the only advantages that s3 has over s4 are:
I think my preference ranking is s3 > s4 > s1 > s2 (s2 is my least favorite since chunking a list in batch mode to be handled with multiple threads is more natural than chunking a dictionary) |
S5: another strategy - use Python's ability to return multiple values:
Note, in this design, we don't have Or S5a
Or S6, making more prominent use of
|
Okay, after further thoughts, S7:
Commentary:
So I would envision a snippet like:
And:
|
Having slept throught most of this, I like S7 the most. Polymorphism is perfectly fine. I think we should discuss more in detail about if we want rerank to be calling rerank_batch (vllm introduces some differences) but a discussion not for this thread. |
I also like s7. It follows many best practices like separation of the input param type and output param (even if they are very similar now (they might diverge in the future). re |
Not sure how much I agree with this. At a high level, shouldn't methods take conceptual versions of the objects that abstract away from the implementation? That is, we should have a
I really mean My $0.02... but happy to discuss further though. |
Thanks for explaining, I will implement s7 as is, then |
Two more issues to sort out:
For (2), in Anserini, In Pyserini, the options would be Thoughts? Happy to consider alternate names for the options. |
Just thoughts... @ronakice @sahel-sh reactions welcome.
Two methods:
rerank
andrerank_batch
.Method signature:
Commentary:
Query
object as opposed to just astr
. This will allow us to attach additional fields to include metadata, etc.Candidates
object as opposed to just a list of json objects. Again, this will allow us to attach other metadata, such as execution traces, etc.Candidates
is a fresh object. I.e., the method is non-destructive - does not mess with the input.The,
rerank_batch
would be:rerank
can just route torerank_batch
with a list of one object.The text was updated successfully, but these errors were encountered: