-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fs: nvs: Add recovery path for corrupted NVS #83128
fs: nvs: Add recovery path for corrupted NVS #83128
Conversation
Hello @corymayer, and thank you very much for your first pull request to the Zephyr project! |
Hi @corymayer, this is a nice addition. Thank you for the proposal. If you agree I would like to see this possibility enabled by a Kconfig option. I don't think the change in comment is required, when this problem occurs it is not by definition a corrupt NVS system, it can also be a non NVS partition. |
Sure, will update soon |
ed01058
to
9fa7b6c
Compare
subsys/fs/nvs/Kconfig
Outdated
@@ -37,6 +37,14 @@ config NVS_DATA_CRC | |||
The CRC-32 is transparently stored at the end of the data field, | |||
in the NVS data section, so 4 more bytes are needed per NVS element. | |||
|
|||
config NVS_BAD_PARTITION_RECOVERY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about "NVS_INIT_BAD_REGION" ?
I would like this option and help text to reflect what it does and not (less) why it does this. NVS has no knowledge of partitions so using "partition" should be avoided.
The help could be something like:
Enable automatic initialization of a NVS on a flash region that does not contain a valid NVS. A region containing an invalid NVS can be caused by corruption or by providing a non-empty region. This option ensures a new NVS can be created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, updated
subsys/fs/nvs/nvs.c
Outdated
#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY | ||
LOG_WRN("All sectors closed, erasing all settings..."); | ||
addr = 0U; | ||
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is erasing at the correct location (missing fs->offset).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, thought that the offset in that API was relative to the flash device. Device I tested it on happened to have an offset of 0. Fixed.
if (rc) { | ||
rc = -EIO; | ||
goto end; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please insert a space after this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line was removed
subsys/fs/nvs/nvs.c
Outdated
rc = -EIO; | ||
goto end; | ||
} | ||
i = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want:
i = fs->sector_count;
addr = ((fs->sector_count - 1) << ADDR_SECT_SHIFT) + (uint16_t)(fs->sector_size - ate_size);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking closer, I think we can remove this line altogether since that will be guaranteed by the previous loop
subsys/fs/nvs/nvs.c
Outdated
if (closed_sectors == fs->sector_count) { | ||
#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY | ||
LOG_WRN("All sectors closed, erasing all settings..."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no mention of settings here.
NVS and Settings are two different subsystems.
You should use "erase all entries" or maybe "erasing all sectors"
subsys/fs/nvs/nvs.c
Outdated
#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY | ||
LOG_WRN("All sectors closed, erasing all settings..."); | ||
addr = 0U; | ||
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use flash_flatten instead.
subsys/fs/nvs/nvs.c
Outdated
addr = 0U; | ||
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count); | ||
if (rc) { | ||
rc = -EIO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need to change the error code returned by the above function. Just keep the same error code
9fa7b6c
to
6bab748
Compare
subsys/fs/nvs/nvs.c
Outdated
if (closed_sectors == fs->sector_count) { | ||
#ifdef CONFIG_NVS_INIT_BAD_REGION | ||
LOG_WRN("All sectors closed, erasing all sectors..."); | ||
addr = 0U; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this code tested? If you enter this if
and you erase the sectors, then you also enter the if (i == fs->sector_count)
condition below, where there is rc = nvs_flash_cmp_const(fs, addr - ate_size, erase_value, sizeof(struct nvs_ate));
that assumes that addr
points to the last ATE within the last sector. Am I correct?
Please consider extending NVS unit tests with a scenario that you initialize NVS using a garbage data in all sectors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a full re-test with the latest version I pushed. Set the ATEs to 0xdeadbeef and verified the log is printed and the NVS settings are written correctly afterwards.
Can you point me to the unit tests for NVS? I don't see them in this directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. The tests are in tests/subsys/fs/nvs
directory. Though feel free to merge this PR and add the tests later :)
subsys/fs/nvs/nvs.c
Outdated
if (closed_sectors == fs->sector_count) { | ||
#ifdef CONFIG_NVS_INIT_BAD_REGION | ||
LOG_WRN("All sectors closed, erasing all sectors..."); | ||
addr = 0U; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still needed to modify addr
, after all sectors have been erased we should set it at the last sector close ate, the same as would be found for an empty region. This line should also move after the rc
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed this, we can just remove this line too right, since the previous loop will ensure addr is set to the last sector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is set to the first sector when all sectors are closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, updated to set them to the last sector as you suggested
6bab748
to
b6fe8e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@corymayer thank you for this PR.
Enable automatic initialization of a NVS on a flash region that does | ||
not contain a valid NVS. A region containing an invalid NVS can be | ||
caused by corruption or by providing a non-empty region. This option | ||
ensures a new NVS can be created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming here is confusing. You are talking about region but in the description you are referring to "bad partition recovery"
AFAIK the term "Region" doesn't seem to be used anywhere in NVS.
NVS knows only about Sectors of data
Could be NVS_INIT_BAD_SECTORS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The terminology region
was carefully chosen not to overlap with partition (as defined in dts), area (as used in flash_area_...
routines, there also are no bad sectors (which would be used by nand devices).
@corymayer, please update the bool to "Non-volatile Storage bad region recovery".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Laczen So maybe it is better to use the terminology "memory region" instead of just "region"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Memory region" is also existing terminology.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"memory region" is used to describe a part of the memory. This is what this PR is about, right ? Am I missing something ? Or do you mean that the "memory region" terminology is reserved for other purposes in zephyr ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything is existing terminology, so context matters. If something does not exist in a given context than it can be used to describe some characteristic in that context. And NAND actually uses Bad Block Management, for example winbond devices, to manage sectors in pages, which is again vendor specific nomenclature, because sectors/pages/block naming varies between devices.
So, Region or Memory region, when defined within context of NVS as a specific offset/range tuple may be fine, as long as it's relation to sectors or partitions, as used by NVS, is clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"memory region" is used in dts for linker memory regions, in user mode for memory domains... It is in no way reserved exclusively for these purposes but might be confusing to use it here as some people associate memory with RAM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I prefer "memory region" which covers (flash/ram/... regions) instead of just "region"
but that's not a blocker for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the name to "bad memory region" and added a unit test
@corymayer are you a discord user? I would like to get in touch. |
Yes, just joined as corymeta_74139 |
b6fe8e7
to
ede130b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the term "bad memory region" confusing and long to set as config variable, but I'm not going to spend more time on terminology.
Thanks for adding the test.
ede130b
to
eac7018
Compare
Fixed an off-by-one in the unit test. Now prints this in the twister log:
|
tests/subsys/fs/nvs/src/main.c
Outdated
err = flash_write(fixture->fs.flash_device, fixture->fs.offset + | ||
(fixture->fs.sector_size * (i + 1)) - | ||
sizeof(struct nvs_ate), &data, sizeof(data)); | ||
zassert_true(err == 0, "flash_write failed: %d", err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One space to much after the ",".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed these
tests/subsys/fs/nvs/src/main.c
Outdated
|
||
#ifdef CONFIG_NVS_INIT_BAD_MEMORY_REGION | ||
err = nvs_mount(&fixture->fs); | ||
zassert_true(err == 0, "nvs_mount call failure: %d", err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one space to much after ",".
tests/subsys/fs/nvs/src/main.c
Outdated
execute_long_pattern_write(TEST_DATA_ID, &fixture->fs); | ||
#else | ||
err = nvs_mount(&fixture->fs); | ||
zassert_true(err == -EDEADLK, "nvs_mount call ok, expect fail: %d", err); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One space to much after ",".
eac7018
to
54dc53e
Compare
tests/subsys/fs/nvs/src/main.c
Outdated
*/ | ||
ZTEST_F(nvs, test_nvs_init_bad_memory_region) | ||
{ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: remove this line
tests/subsys/fs/nvs/src/main.c
Outdated
/* Write bad ATE to each sector */ | ||
for (uint16_t i = 0; i < TEST_SECTOR_COUNT; i++) { | ||
data = 0xdeadbeef; | ||
err = flash_write(fixture->fs.flash_device, fixture->fs.offset + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: improve the alignment or introduce a offset
variable to make this line shorter.
@corymayer Please fix the compliance. |
f1c1789
to
33e7164
Compare
tests/subsys/fs/nvs/src/main.c
Outdated
for (uint16_t i = 0; i < TEST_SECTOR_COUNT; i++) { | ||
data = 0xdeadbeef; | ||
err = flash_write(fixture->fs.flash_device, | ||
fixture->fs.offset + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alignment should be with the start of fixture->fs.flash_device
. The complete file however is no example for correct alignment :-).
3d215ee
to
05d5839
Compare
When all sectors have non-zero data, NVS would fail to init. Add recovery path option to erase the all sectors and reinitialize. This could occur due to non-empty sectors, or corrupted data. Signed-off-by: Cory Andrew Mayer <[email protected]>
Add unit test to exercise bad memory region recovery path Signed-off-by: Cory Andrew Mayer <[email protected]>
05d5839
to
5fa5854
Compare
I was able to run the compliance check locally and verify all issues are resolved |
data = 0xdeadbeef; | ||
err = flash_write(fixture->fs.flash_device, | ||
fixture->fs.offset + (fixture->fs.sector_size * (i + 1)) - | ||
sizeof(struct nvs_ate), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT, I think this alignment might still be considered incorrect.
Hi @corymayer! To celebrate this milestone and showcase your contribution, we'd love to award you the Zephyr Technical Contributor badge. If you're interested, please claim your badge by filling out this form: Claim Your Zephyr Badge. Thank you for your valuable input, and we look forward to seeing more of your contributions in the future! 🪁 |
When all sectors of the settings partition have some data that is not 0xFF, NVS will fail to initialize settings. Add a recovery path to erase the whole settings partition and recover. Should not happen unless there's some serious flash corruption, or possibly interrupted garbage collection. Verified by recreating the scenario with flash CLIs.