Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the pipelines to avoid fitting before splitting #3877

Merged
merged 6 commits into from
Dec 2, 2023

Conversation

suhaibmujahid
Copy link
Member

Resolves #3818

Note
The changes will be clearer if each commit is reviewed separately

  • Converted clf to a pipeline and moved the union step to it (the step that requires fitting)
    • Refactored the sampler to be part of the clf pipeline
      • The sampler requires X to be vectors, not dicts
      • The model train method became cleaner
  • The Files feature is the only feature that requires fit()
    • Moved the part that requires fitting to the clf pipeline
  • Replaced fit_transform() with transform()
  • Adjusted the model saving process since the xgboost model now is part of a pipeline

Train on Taskcluster: annotateignore

@suhaibmujahid suhaibmujahid changed the title Refactor the piplines to avoid fitting before spliting Refactor the pipelines to avoid fitting before splitting Dec 2, 2023
@marco-c
Copy link
Collaborator

marco-c commented Dec 2, 2023

Replaced fit_transform() with transform()

Where is this done?

bugbug/commit_features.py Outdated Show resolved Hide resolved
bugbug/commit_features.py Outdated Show resolved Hide resolved
bugbug/commit_features.py Outdated Show resolved Hide resolved
bugbug/commit_features.py Outdated Show resolved Hide resolved
@marco-c
Copy link
Collaborator

marco-c commented Dec 2, 2023

For confirmation, can you try running the annotateignore model before/after and see if the number of features is the same and if the metrics are the same?

@suhaibmujahid
Copy link
Member Author

Replaced fit_transform() with transform()

Where is this done?

Sorry, the commit was not pushed. It is done in 666fc46.

For confirmation, can you try running the annotateignore model before/after and see if the number of features is the same and if the metrics are the same?

Here are the logs for both:

After the refactoring, the shape does not reflect the number of features because the vectorization happens as part of the clf pipeline. I added a log for the number of features in a1b8f45.

The number of features has increased; it was 13572 before and is now 14465. This increase is expected since we fit on a subset of the data. For instance, during the fitting stage, the number of commits is smaller, which causes the threshold to consider a file as frequent to become lower. As a result, we ended up considering more files as features.

Regarding the rest of the metrics, they look almost the same.

Copy link
Collaborator

@marco-c marco-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not fit transformers on testing data
2 participants