Refactor the pipelines to avoid fitting before splitting #3877

suhaibmujahid · 2023-12-02T03:36:30Z

Resolves #3818

Note
The changes will be clearer if each commit is reviewed separately

- Converted clf to a pipeline and moved the union step (the step that requires fitting) into it
  - Refactored the sampler to be part of the clf pipeline
    - The sampler requires X to be vectors, not dicts
    - The model train method became cleaner
- The Files feature is the only feature that requires fit()
  - Moved the part that requires fitting to the clf pipeline
- Replaced fit_transform() with transform()
- Adjusted the model saving process, since the xgboost model is now part of a pipeline
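
For readers unfamiliar with the pattern, here is a minimal sketch of the idea behind the changes above: stateful steps (here, vectorization) live inside the classifier pipeline, so they are fitted only on the training split. It is not the bugbug code. The toy data, the step names, and the use of scikit-learn's `DictVectorizer` and `LogisticRegression` (swapped in for the xgboost model) are assumptions made for illustration, and the sampler step is left out to keep the example dependency-light.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy samples (hypothetical): each one is a dict, as a stateless
# extraction step would produce before any fitting happens.
X = [{"file": f"file_{i % 5}.py", "size": i} for i in range(40)]
y = [i % 2 for i in range(40)]

# Split first, while X is still a list of dicts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Everything that needs fit() is a pipeline step, so it only ever
# sees the training split.
clf = Pipeline(
    [
        ("vectorizer", DictVectorizer()),
        ("estimator", LogisticRegression(max_iter=1000)),
    ]
)
clf.fit(X_train, y_train)

# X.shape is not available before the pipeline runs, so the number of
# features is read from the fitted vectorizer instead.
n_features = len(clf.named_steps["vectorizer"].vocabulary_)
predictions = clf.predict(X_test)
print(f"Number of features: {n_features}")
```

Since the estimator is now one step of the pipeline, anything that used to access the model directly (saving it, for example) has to reach it through the pipeline, e.g. `clf.named_steps["estimator"]` in this sketch.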

Train on Taskcluster: annotateignore

bugbug/models/annotate_ignore.py

marco-c · 2023-12-02T11:32:13Z

> Replaced fit_transform() with transform()

Where is this done?

bugbug/commit_features.py

marco-c · 2023-12-02T11:37:39Z

For confirmation, can you try running the annotateignore model before/after and see if the number of features is the same and if the metrics are the same?

suhaibmujahid · 2023-12-02T15:13:56Z

> > Replaced fit_transform() with transform()
>
> Where is this done?

Sorry, the commit was not pushed. It is done in 666fc46.

> For confirmation, can you try running the annotateignore model before/after and see if the number of features is the same and if the metrics are the same?

Here are the logs for both:

After the refactoring, the shape of X no longer reflects the number of features, because the vectorization now happens as part of the clf pipeline. I added a log for the number of features in a1b8f45.

The number of features has increased: it was 13572 before and is now 14465. This increase is expected, since we now fit on a subset of the data rather than on all of it. For instance, fewer commits are seen during the fitting stage, which lowers the threshold for a file to be considered frequent. As a result, more files end up being used as features.
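
The threshold effect can be reproduced with a toy example. This is only a sketch under stated assumptions: the `frequent_files` helper, the commit dict layout, and the 0.25 relative threshold are hypothetical and do not mirror the actual Files feature implementation.

```python
from collections import Counter


def frequent_files(commits, min_freq):
    """Return the files touched by at least a `min_freq` share of the commits."""
    counts = Counter(path for commit in commits for path in set(commit["files"]))
    # The threshold is relative, so fewer commits means a lower absolute cutoff.
    min_count = min_freq * len(commits)
    return {path for path, count in counts.items() if count >= min_count}


# Hypothetical data: 10 commits, the last 2 of which are held out for testing.
commits = (
    [{"files": ["common.py", "rare.py"]}] * 2
    + [{"files": ["common.py"]}] * 6
    + [{"files": ["common.py"]}] * 2
)
train_commits = commits[:8]

# Fitting on everything: the cutoff is 2.5 commits, so rare.py (2 commits) is dropped.
kept_all = frequent_files(commits, min_freq=0.25)
# Fitting on the training split: the cutoff is 2 commits, so rare.py is kept.
kept_train = frequent_files(train_commits, min_freq=0.25)

print(sorted(kept_all))    # ['common.py']
print(sorted(kept_train))  # ['common.py', 'rare.py']
```

The toy data is constructed so that the borderline file never appears in the held-out commits; on real data the effect depends on how files are distributed across the split.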

Regarding the rest of the metrics, they look almost the same.

marco-c

Nice simplification!

suhaibmujahid added 2 commits December 1, 2023 01:36

Perform the column transforming step in the clf pipeline

4cc4089

Integrate the sampler as a step in the clf pipeline

5af9a8a

suhaibmujahid changed the title ~~Refactor the piplines to avoid fitting before spliting~~ Refactor the pipelines to avoid fitting before splitting Dec 2, 2023