Support for variably sized inputs #168

Open
kiudee opened this issue Oct 16, 2020 · 3 comments
Labels
enhancement New feature or request Priority: Low

Comments

@kiudee
Owner

kiudee commented Oct 16, 2020

cs-ranking currently requires the user to provide inputs of a fixed size.
There are various tricks for handling variably sized inputs (e.g. padding to the maximum length), but each comes with its own trade-offs.

Approaches like nested tensors could be useful in solving this issue more elegantly.

@kiudee kiudee added enhancement New feature or request Priority: Low labels Oct 16, 2020
@ddelange

Hi @kiudee,

From the docstring of fit() for X input:

> The provided queries can be of a fixed size (numpy arrays) or of
> varying sizes in which case dictionaries are expected as input.

and also the dict check in _fit():
if isinstance(X, dict):

suggest that there should already be some support for instances containing different numbers of objects.

But then the first line of fit() assumes that X is a fixed-size numpy array:

_n_instances, self.n_objects_fit_, self.n_object_features_fit_ = X.shape

Could you elaborate on the current state of things?

Or, more concretely:

  • Is there a workaround I can exploit for the moment to get FATEObjectRanker rolling for my data?
  • Or can you point me to what to fix to get support for the dict input?
  • And can the keys of the dict simply be instance indices like 0, 1, 2, 3, ...? The docstring says the dict maps from n_objects to numpy arrays, but I have multiple instances with the same n_objects, meaning I would have to pick just one of them (e.g. the first) to fit them into a dict.

> padding to maximum length

Here, do you mean simply padding both X and Y with np.zeros? What would be the tradeoff you mentioned there? Any other alternatives?

@kiudee
Owner Author

kiudee commented Mar 10, 2021

Hey @ddelange,
we are in the process of migrating the complete code base to PyTorch, which is why progress on the issue front is currently slow.

Regarding fit(): it is true that it no longer supports dict input.
_fit() expects a dict of the form:

{
    3: np.array(...),  # shape: (n_instances_with_3_objects, 3, n_features) 
    4: np.array(...),  # shape: (n_instances_with_4_objects, 4, n_features)
    ...
}

That way each np.array(...) can contain multiple instances of the same size. The _fit() method then trains on each of the sizes separately (with shared weights) and updates the weights in proportion to the number of instances present for the given set size.
So a workaround could be to call _fit() directly, but the dict support has not been tested for a while.
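
To illustrate the dict layout described above, here is a minimal sketch of how a list of differently sized instances could be grouped by their number of objects. The helpers `group_by_n_objects` and `size_shares` are hypothetical names and not part of cs-ranking; the per-size shares only illustrate the "proportional to the number of instances" idea and are not a copy of what _fit() computes internally.

```python
from collections import defaultdict

import numpy as np


def group_by_n_objects(xs, ys):
    """Group instances by their number of objects (hypothetical helper).

    xs: list of arrays, each of shape (n_objects, n_features)
    ys: list of rankings, each of length n_objects
    Returns two dicts keyed by n_objects, as in the layout above.
    """
    x_buckets = defaultdict(list)
    y_buckets = defaultdict(list)
    for x, y in zip(xs, ys):
        k = x.shape[0]
        x_buckets[k].append(x)
        y_buckets[k].append(np.asarray(y))
    # Stack each bucket into (n_instances_with_k_objects, k, n_features).
    X_dict = {k: np.stack(v) for k, v in sorted(x_buckets.items())}
    Y_dict = {k: np.stack(v) for k, v in sorted(y_buckets.items())}
    return X_dict, Y_dict


def size_shares(X_dict):
    """Fraction of all instances that falls into each set size."""
    total = sum(x.shape[0] for x in X_dict.values())
    return {k: x.shape[0] / total for k, x in X_dict.items()}


rng = np.random.default_rng(0)
n_features = 2
sizes = (3, 4, 3, 3)
xs = [rng.normal(size=(k, n_features)) for k in sizes]
ys = [rng.permutation(k) for k in sizes]

X_dict, Y_dict = group_by_n_objects(xs, ys)
shares = size_shares(X_dict)
for k in X_dict:
    print(k, X_dict[k].shape, Y_dict[k].shape, shares[k])
```

The keys are therefore the set sizes, not instance indices: all instances with the same number of objects end up stacked in one array under a single key.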

Since the line

_n_instances, self.n_objects_fit_, self.n_object_features_fit_ = X.shape

is only ever used for the fixed-size representation, the fix could be as simple as inserting an if there:

if not isinstance(X, dict):
    _n_instances, self.n_objects_fit_, self.n_object_features_fit_ = X.shape 

@timokau what do you think?


> padding to maximum length
>
> Here, do you mean simply padding both X and Y with np.zeros? What would be the tradeoff you mentioned there? Any other alternatives?

Yes, basically you would determine the maximum number of objects you want to input, let's call it n_objects_max, and then construct an array of shape (n_instances, n_objects_max, n_features), which you initialize with zeros. Then for each instance you fill in its actual objects and leave the remaining rows at zero. You can do the same thing for Y.
One trade-off is of course running time and memory, especially if the number of objects is highly variable. In that case it is useful to check how many instances with many objects there really are, and possibly discard the sizes that occur rarely. Another problem could be that the "zero objects" affect the model fit in some way, especially if you standardize the inputs.
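
The padding approach could be sketched roughly as follows. `pad_instances` is a hypothetical helper, not a cs-ranking function, and the boolean `mask` is an addition of this sketch that records which rows are real objects. Note that padding Y with zeros makes the padded slots look like rank 0, so whether that is acceptable depends on the learner. The last lines illustrate the standardization concern: statistics computed over the padded array are pulled towards zero.

```python
import numpy as np


def pad_instances(xs, ys, n_objects_max):
    """Zero-pad variable-size instances to a common size (hypothetical helper).

    xs: list of arrays, each of shape (n_objects, n_features)
    ys: list of rankings, each of length n_objects
    """
    n_features = xs[0].shape[1]
    X = np.zeros((len(xs), n_objects_max, n_features))
    Y = np.zeros((len(xs), n_objects_max), dtype=int)
    mask = np.zeros((len(xs), n_objects_max), dtype=bool)
    for i, (x, y) in enumerate(zip(xs, ys)):
        k = x.shape[0]
        if k > n_objects_max:
            raise ValueError(f"instance {i} has {k} > {n_objects_max} objects")
        X[i, :k] = x      # real objects first, zero rows afterwards
        Y[i, :k] = y      # padded slots keep the value 0
        mask[i, :k] = True
    return X, Y, mask


# Two instances with 2 and 4 objects; every real feature value is 10.0.
xs = [np.full((2, 1), 10.0), np.full((4, 1), 10.0)]
ys = [np.array([1, 0]), np.array([2, 0, 3, 1])]
X, Y, mask = pad_instances(xs, ys, n_objects_max=4)

mean_with_padding = X.mean()      # the two zero rows are included
mean_real_only = X[mask].mean()   # only the six real objects
print(X.shape, mean_with_padding, mean_real_only)
```

Computing the mean and standard deviation only over the masked (real) objects before standardizing is one way to keep the zero rows from distorting the statistics.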

@timokau
Collaborator

timokau commented Mar 12, 2021

> @timokau what do you think?

I'm not sure it would be quite that simple. For example, the _construct_models function in fate_network.py uses self.n_object_features_fit_ and is also called for variably sized inputs. That is not a problem in itself, since the number of features stays constant across set sizes anyway, but the attribute would still need to be initialized. There are probably more cases like this in the code base. Supporting the "train separately and merge weights" approach again would need a bit of work and testing.
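
As a rough illustration of the initialization point, the shape attributes could be derived in a way that works for both input types. `infer_fit_shape` is a hypothetical helper, not existing cs-ranking code, and returning None for the object count of dict inputs is an assumption of this sketch.

```python
import numpy as np


def infer_fit_shape(X):
    """Return (n_objects, n_object_features) for array or dict input.

    Hypothetical helper: n_objects is None for dict input, since the
    set size varies there, while the feature dimension must agree
    across all set sizes.
    """
    if isinstance(X, dict):
        feature_dims = {x.shape[-1] for x in X.values()}
        if len(feature_dims) != 1:
            raise ValueError("all set sizes must share one feature dimension")
        return None, feature_dims.pop()
    _n_instances, n_objects, n_features = X.shape
    return n_objects, n_features


fixed = np.zeros((5, 3, 2))
variable = {3: np.zeros((5, 3, 2)), 4: np.zeros((1, 4, 2))}
print(infer_fit_shape(fixed))     # fixed-size array input
print(infer_fit_shape(variable))  # dict input keyed by set size
```

Any other code that reads the object count would still have to cope with the None case, which is the kind of follow-up work mentioned above.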

3 participants