Ndelima/training validation #61

Merged
merged 5 commits into from
Sep 27, 2024

ndelima-ekumen
Collaborator

🎉 New feature

Closes #23

Summary

Added a validation run after each epoch that calculates the average validation loss and logs it.
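
As a rough illustration of the idea in the summary (not the code from this PR), the sketch below averages a per-batch loss over a validation set at the end of every epoch and hands the result to a logger. It is framework-agnostic plain Python. The names `average_validation_loss`, `mean_squared_error`, `run_epochs`, and the `"Loss/validation"` tag are hypothetical, and the training step itself is omitted.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical batch layout: (predictions, targets) as plain float sequences.
Batch = Tuple[Sequence[float], Sequence[float]]


def mean_squared_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error of one batch (stand-in for the real loss)."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def average_validation_loss(
    batches: Sequence[Batch],
    loss_fn: Callable[[Sequence[float], Sequence[float]], float],
) -> float:
    """Average the loss over all samples, weighting each batch by its size."""
    total, count = 0.0, 0
    for preds, targets in batches:
        n = len(preds)
        total += loss_fn(preds, targets) * n  # undo the per-batch mean
        count += n
    if count == 0:
        raise ValueError("validation set is empty")
    return total / count


def run_epochs(
    num_epochs: int,
    val_batches: Sequence[Batch],
    log: Callable[[str, float, int], None],
) -> None:
    """Run a validation pass after every epoch and log the average loss."""
    for epoch in range(num_epochs):
        # ... training step for this epoch would go here (omitted) ...
        val_loss = average_validation_loss(val_batches, mean_squared_error)
        log("Loss/validation", val_loss, epoch)


if __name__ == "__main__":
    history: List[Tuple[str, float, int]] = []
    demo_batches = [([1.0, 2.0], [1.0, 0.0]), ([3.0], [0.0])]
    run_epochs(2, demo_batches, lambda tag, value, step: history.append((tag, value, step)))
    print(history)
```

In a PyTorch pipeline the same loop would presumably run with gradients disabled and write the scalar to whatever logger backs the dashboard on localhost:6006; that wiring is an assumption here, not something shown in this PR.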

Test it

Run the training pipeline and check that a new folder is generated inside training_ws/runs. Then visualize the logs by running the training_vis profile and opening localhost:6006 in a web browser.

Checklist

  • Signed all commits for DCO
  • Added tests
  • Added example and/or tutorial
  • Updated documentation (as needed)

Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.

Member

@agalbachicar left a comment

Some minor comments. I tried it locally and it worked as expected.

Review threads: README.md (resolved), docker/docker-compose.yml (outdated, resolved)
@agalbachicar
Member

I checked out 8bd8b3b and:

  • Built everything.
  • Ran a training, which completed successfully.
  • Ran the training_test profile, with the following result:
$ docker compose -f docker/docker-compose.yml --profile training_test up
WARN[0000] Found orphan containers ([ros_debug pallet_detector ares_simulator ares_control ros_master perception_agent load_jam roudi core_tests perception_agent_component_test ares_bag_player ares_bag_recorder]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up. 
[+] Running 1/0
 ✔ Container training_test  Created                                                                                                                    0.0s 
Attaching to training_test
training_test  | Cuda is available.
training_test  | 73: ./data/png/rgb_0073.png
training_test  | Traceback (most recent call last):
training_test  |   File "/root/training_ws/eval.py", line 254, in <module>
training_test  |     main()
training_test  |   File "/root/training_ws/eval.py", line 247, in main
training_test  |     save_image(
training_test  |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
training_test  |     return func(*args, **kwargs)
training_test  |   File "/usr/local/lib/python3.10/dist-packages/torchvision/utils.py", line 151, in save_image
training_test  |     im.save(fp, format=format)
training_test  |   File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 2563, in save
training_test  |     fp = builtins.open(filename, "w+b")
training_test  | FileNotFoundError: [Errno 2] No such file or directory: '/root/training_ws/model/test_output/output_0_0.jpg'
training_test exited with code 1

This is the output of ls on the model folder:

$ ls -lash model
total 159M
208K drwxrwxr-x  3 agalbachicar agalbachicar 204K Sep 25 15:25 .
4.0K drwxrwxr-x 12 agalbachicar agalbachicar 4.0K Sep 25 15:20 ..
159M -rw-r--r--  1 root         root         159M Sep 25 15:26 model.pth
4.0K drwxr-xr-x  3 root         root         4.0K Sep 25 15:25 runs

What am I doing wrong, @ndelima-ekumen?

@ndelima-ekumen
Collaborator Author

The test_output folder is simply missing; I'll update the PR with the fix. During my testing, one unsuccessful run generated the folder for some reason and the following tests reused it, so I missed that detail.
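
For readers hitting the same FileNotFoundError, one common way to guard against a missing output folder is to create the parent directory right before writing. This is only a sketch of that pattern, not necessarily the fix that landed in this PR; the helper name `ensure_parent_dir` is hypothetical, and the path in the demo just mirrors the layout from the traceback inside a temporary directory.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path: str) -> Path:
    """Create the parent directory of file_path if needed and return the path.

    parents=True builds intermediate folders; exist_ok=True makes repeated
    calls harmless when the folder is already there.
    """
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is written to a real workspace.
    with tempfile.TemporaryDirectory() as tmp:
        target = ensure_parent_dir(f"{tmp}/model/test_output/output_0_0.jpg")
        target.write_bytes(b"")  # stand-in for the image save call
        print(target.parent.is_dir())
```

Calling such a helper before each save keeps the evaluation script independent of whatever an earlier run happened to leave on disk.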

@agalbachicar
Member

I'll give it a shot tomorrow, @ndelima-ekumen.

Member

@agalbachicar left a comment

LGTM

@agalbachicar force-pushed the ndelima/training_validation branch from ddb2686 to 05e4eab on September 27, 2024 12:02
@agalbachicar merged commit 056abe6 into main on Sep 27, 2024
1 check passed
@agalbachicar deleted the ndelima/training_validation branch on September 27, 2024 12:03
Development

Successfully merging this pull request may close these issues.

Implement a validation run after each epoch to evaluate progression of the training process
2 participants