Refactor the pipelines to avoid fitting before splitting #3877
Where is this done?
For confirmation, can you try running the annotateignore model before and after, and check whether the number of features and the metrics are the same?
Sorry, the commit was not pushed. It is done in 666fc46.
Here are the logs for both:
After the refactoring, the shape does not reflect the number of features because the vectorization happens as part of the clf pipeline. I added a log for the number of features in a1b8f45.
The number of features has increased from 13572 to 14465. This increase is expected since we now fit on a subset of the data. For instance, during the fitting stage the number of commits is smaller, which lowers the threshold for considering a file frequent. As a result, we end up considering more files as features.
The rest of the metrics look almost the same.
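To illustrate why fitting on the training subset can yield more features, here is a minimal sketch. The `FrequentFiles` class, the `min_ratio` parameter, and the toy commits are all hypothetical stand-ins, not the project's actual feature code. It only assumes that a file counts as frequent when it appears in at least a fixed fraction of the commits seen during `fit()`.

```python
from collections import Counter


class FrequentFiles:
    """Hypothetical stand-in for a feature that needs fitting.

    A file becomes a feature when it is touched by at least
    `min_ratio` of the commits seen during fit().
    """

    def __init__(self, min_ratio=0.25):
        self.min_ratio = min_ratio

    def fit(self, commits):
        counts = Counter(path for commit in commits for path in set(commit))
        # The absolute threshold scales with the number of commits seen,
        # so fewer commits means a lower threshold.
        self.threshold_ = self.min_ratio * len(commits)
        self.files_ = sorted(p for p, c in counts.items() if c >= self.threshold_)
        return self

    def transform(self, commits):
        # One binary column per frequent file.
        return [[int(p in commit) for p in self.files_] for commit in commits]


# Toy data: each commit is the list of paths it touches.
commits = [
    ["a.cpp", "b.cpp"],
    ["a.cpp"],
    ["a.cpp", "c.cpp"],
    ["b.cpp"],
    ["a.cpp"],
    ["a.cpp"],
    ["a.cpp", "d.cpp"],
    ["a.cpp"],
]
train, test = commits[:4], commits[4:]

# Before: fit on everything, then split (the test set leaks into fitting).
fit_on_all = FrequentFiles().fit(commits)

# After: split first, fit on the training commits only.
fit_on_train = FrequentFiles().fit(train)
test_vectors = fit_on_train.transform(test)

print(fit_on_all.threshold_, fit_on_all.files_)
print(fit_on_train.threshold_, fit_on_train.files_)
```

In this toy setup the threshold drops when only the training commits are seen, so a file that was too rare across the full dataset now qualifies as a feature.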
Nice simplification!
Resolves #3818
- Converted `clf` to a pipeline and moved the union step to it (the step that requires fitting)
- [...] `clf` pipeline [...] `X` to be vectors, not dicts
- [...] `Files` feature is the only feature that requires `fit()` [...] `clf` pipeline
- Replaced `fit_transform()` with `transform()`
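The general idea of moving the fitted step into the classifier pipeline can be sketched as follows. This is an illustration under assumptions, not the code from this PR: it uses scikit-learn's `DictVectorizer` as a stand-in for the union step and `LogisticRegression` as a stand-in classifier, and the feature names and data are made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy samples with hypothetical feature names.
X = [
    {"lines_added": 3, "kind": "test"},
    {"lines_added": 40, "kind": "src"},
    {"lines_added": 5, "kind": "test"},
    {"lines_added": 55, "kind": "src"},
    {"lines_added": 2, "kind": "test"},
    {"lines_added": 60, "kind": "src"},
    {"lines_added": 8, "kind": "test"},
    {"lines_added": 35, "kind": "src"},
]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Split first, so nothing is fitted on the test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The step that requires fitting lives inside the classifier pipeline.
clf = Pipeline(
    [
        ("union", DictVectorizer()),  # stand-in for the union step
        ("estimator", LogisticRegression()),  # stand-in classifier
    ]
)

# fit() fits the vectorizer and the classifier on the training split only.
clf.fit(X_train, y_train)

# predict() only calls transform() on the already fitted vectorizer.
predictions = clf.predict(X_test)

# The feature count must be read from the fitted step, not from X's shape.
n_features = len(clf.named_steps["union"].feature_names_)
print(n_features)
```

With this layout, the number of features is read from the fitted step inside the pipeline rather than from the shape of `X`, which matches the need for the extra log mentioned in the conversation.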
Train on Taskcluster: annotateignore