Ndelima/training validation #61
Some minor comments; I tried it locally and it worked as expected.
I checked out 8bd8b3b and ran:
$ docker compose -f docker/docker-compose.yml --profile training_test up
WARN[0000] Found orphan containers ([ros_debug pallet_detector ares_simulator ares_control ros_master perception_agent load_jam roudi core_tests perception_agent_component_test ares_bag_player ares_bag_recorder]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
[+] Running 1/0
✔ Container training_test Created 0.0s
Attaching to training_test
training_test | Cuda is available.
training_test | 73: ./data/png/rgb_0073.png
training_test | Traceback (most recent call last):
training_test | File "/root/training_ws/eval.py", line 254, in <module>
training_test | main()
training_test | File "/root/training_ws/eval.py", line 247, in main
training_test | save_image(
training_test | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
training_test | return func(*args, **kwargs)
training_test | File "/usr/local/lib/python3.10/dist-packages/torchvision/utils.py", line 151, in save_image
training_test | im.save(fp, format=format)
training_test | File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 2563, in save
training_test | fp = builtins.open(filename, "w+b")
training_test | FileNotFoundError: [Errno 2] No such file or directory: '/root/training_ws/model/test_output/output_0_0.jpg'
training_test exited with code 1
This is the result of running ls on the model folder:
$ ls -lash model
total 159M
208K drwxrwxr-x 3 agalbachicar agalbachicar 204K Sep 25 15:25 .
4.0K drwxrwxr-x 12 agalbachicar agalbachicar 4.0K Sep 25 15:20 ..
159M -rw-r--r-- 1 root root 159M Sep 25 15:26 model.pth
4.0K drwxr-xr-x 3 root root 4.0K Sep 25 15:25 runs
What am I doing wrong, @ndelima-ekumen?
The only problem is that the test_output folder is missing; I'll update the PR with the fix. During testing, one unsuccessful run generated the folder for some reason, and the following tests reused it, so I missed that detail.
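For reference, one way to avoid the FileNotFoundError above is to create the output directory before writing into it. This is only a minimal sketch, not necessarily the fix that lands in the PR; the helper name ensure_parent_dir and the variable names in the usage comment are hypothetical.

```python
from pathlib import Path


def ensure_parent_dir(file_path: str) -> Path:
    """Create the directory that will hold file_path if it does not exist yet."""
    path = Path(file_path)
    # parents=True builds any missing intermediate folders;
    # exist_ok=True makes repeated runs safe when the folder is already there.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Hypothetical usage in eval.py, right before the call that failed
# (image_tensor is an assumed name, not taken from the PR):
#   out_path = ensure_parent_dir("/root/training_ws/model/test_output/output_0_0.jpg")
#   save_image(image_tensor, out_path)
```

Creating the folder at write time keeps the result independent of whatever earlier runs happened to leave on disk.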
I'll give it a shot tomorrow, @ndelima-ekumen.
LGTM
Signed-off-by: Nicolas de Lima <[email protected]>
ddb2686 to 05e4eab
🎉 New feature
Closes #23
Summary
Added a validation run after each epoch, calculated the average loss, and logged it.
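As an illustration of the idea only (the PR's actual code is not shown here), a per-epoch validation pass with an averaged loss could look like the sketch below. It is written framework-agnostically; every name in it (average_validation_loss, run_epochs, predict, log_scalar, the "Loss/validation" tag) is an assumption made for this example.

```python
from typing import Callable, Iterable, List, Tuple


def average_validation_loss(
    batches: Iterable[Tuple[object, object]],
    predict: Callable[[object], object],
    loss_fn: Callable[[object, object], float],
) -> float:
    """Return the mean per-batch loss over the validation set."""
    total, count = 0.0, 0
    for inputs, targets in batches:
        total += float(loss_fn(predict(inputs), targets))
        count += 1
    if count == 0:
        raise ValueError("validation set is empty")
    return total / count


def run_epochs(
    num_epochs: int,
    train_one_epoch: Callable[[int], None],
    val_batches: Iterable[Tuple[object, object]],
    predict: Callable[[object], object],
    loss_fn: Callable[[object, object], float],
    log_scalar: Callable[[str, float, int], None],
) -> List[float]:
    """Train, then validate and log the average loss after every epoch."""
    history = []
    for epoch in range(num_epochs):
        train_one_epoch(epoch)
        # val_batches must be re-iterable (a list or a data loader, not a generator).
        val_loss = average_validation_loss(val_batches, predict, loss_fn)
        log_scalar("Loss/validation", val_loss, epoch)
        history.append(val_loss)
    return history
```

In a PyTorch pipeline, the validation pass would typically also switch the model to eval mode and run under torch.no_grad(), and log_scalar could be a SummaryWriter's add_scalar method, assuming the logs are written for TensorBoard.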
Test it
Run the training pipeline and check that a new folder is generated inside training_ws/runs. Then visualize the logs by running the training_vis profile and opening localhost:6006 in a web browser.
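To make the first check less manual, a small script can report the most recently modified folder under the runs directory. This is a sketch under the assumption that each training run creates one subfolder there; the helper name newest_run_dir is hypothetical.

```python
from pathlib import Path
from typing import Optional


def newest_run_dir(runs_dir: str) -> Optional[Path]:
    """Return the most recently modified subfolder of runs_dir, or None."""
    root = Path(runs_dir)
    if not root.is_dir():
        return None
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    # default=None covers the case where no run folder exists yet.
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    print(newest_run_dir("training_ws/runs"))
```

Running it before and after training and comparing the two results shows whether a new folder appeared.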
Checklist
Note to maintainers: Remember to use Squash-Merge and edit the commit message to match the pull request summary while retaining Signed-off-by messages.