fs: nvs: Add recovery path for corrupted NVS #83128

Merged

Conversation

corymayer
Contributor

When all sectors of the settings partition contain data that is not 0xFF, NVS will fail to initialize settings. Add a recovery path that erases the whole settings partition and recovers. This should not happen unless there is serious flash corruption, or possibly an interrupted garbage collection. Verified by recreating the scenario with flash CLIs.
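For context, the failure being worked around is the existing all-sectors-closed check in the NVS mount path. A minimal sketch of that check, paraphrased from the code quoted in the review threads below (surrounding loop and declarations omitted):

	if (closed_sectors == fs->sector_count) {
		/* Every sector ends in a close ATE, so the scan finds no free
		 * space and mounting currently fails with -EDEADLK.
		 */
		rc = -EDEADLK;
		goto end;
	}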


Hello @corymayer, and thank you very much for your first pull request to the Zephyr project!
Our Continuous Integration pipeline will execute a series of checks on your Pull Request commit messages and code, and you are expected to address any failures by updating the PR. Please take a look at our commit message guidelines to find out how to format your commit messages, and at our contribution workflow to understand how to update your Pull Request. If you haven't already, please make sure to review the project's Contributor Expectations and update (by amending and force-pushing the commits) your pull request if necessary.
If you are stuck or need help please join us on Discord and ask your question there. Additionally, you can escalate the review when applicable. 😊

@Laczen
Collaborator

Laczen commented Dec 17, 2024

Hi @corymayer, this is a nice addition. Thank you for the proposal.

If you agree I would like to see this possibility enabled by a Kconfig option.

I don't think the change in the comment is required; when this problem occurs it is not by definition a corrupt NVS system, it can also be a non-NVS partition.

@corymayer
Contributor Author

Hi @corymayer, this is a nice addition. Thank you for the proposal.

If you agree I would like to see this possibility enabled by a Kconfig option.

I don't think the change in the comment is required; when this problem occurs it is not by definition a corrupt NVS system, it can also be a non-NVS partition.

Sure, will update soon

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch from ed01058 to 9fa7b6c on December 17, 2024 20:29
@@ -37,6 +37,14 @@ config NVS_DATA_CRC
The CRC-32 is transparently stored at the end of the data field,
in the NVS data section, so 4 more bytes are needed per NVS element.

config NVS_BAD_PARTITION_RECOVERY
Collaborator

What do you think about "NVS_INIT_BAD_REGION"?

I would like this option and its help text to reflect what it does, and not (or less) why it does it. NVS has no knowledge of partitions, so using "partition" should be avoided.

The help could be something like:
Enable automatic initialization of a NVS on a flash region that does not contain a valid NVS. A region containing an invalid NVS can be caused by corruption or by providing a non-empty region. This option ensures a new NVS can be created.
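Spelled out as a Kconfig entry, the suggestion would look roughly like this (a sketch only; the prompt string borrows wording proposed later in this review, and the symbol is eventually renamed once more):

config NVS_INIT_BAD_REGION
	bool "Non-volatile Storage bad region recovery"
	help
	  Enable automatic initialization of a NVS on a flash region that
	  does not contain a valid NVS. A region containing an invalid NVS
	  can be caused by corruption or by providing a non-empty region.
	  This option ensures a new NVS can be created.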

Contributor Author

Makes sense, updated

#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY
LOG_WRN("All sectors closed, erasing all settings...");
addr = 0U;
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count);
Collaborator

@Laczen Laczen Dec 18, 2024

I don't think this is erasing at the correct location (missing fs->offset).

Contributor Author

My bad, I thought the offset in that API was relative to the partition rather than the flash device. The device I tested on happened to have fs->offset equal to 0. Fixed.
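For reference, the corrected call simply adds the region offset (a sketch, using the fs fields visible in the quoted diff):

	/* flash_erase() offsets are relative to the flash device, so the start
	 * of the NVS region must be added explicitly.
	 */
	rc = flash_erase(fs->flash_device, fs->offset,
			 fs->sector_size * fs->sector_count);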

if (rc) {
rc = -EIO;
goto end;
}
Collaborator

please insert a space after this line.

Contributor Author

Line was removed

rc = -EIO;
goto end;
}
i = 0;
Collaborator

I think we want:
i = fs->sector_count;
addr = ((fs->sector_count - 1) << ADDR_SECT_SHIFT) + (uint16_t)(fs->sector_size - ate_size);

Contributor Author

Looking closer, I think we can remove this line altogether since that will be guaranteed by the previous loop

if (closed_sectors == fs->sector_count) {
#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY
LOG_WRN("All sectors closed, erasing all settings...");
Contributor

There should be no mention of settings here.
NVS and Settings are two different subsystems.
You should use "erase all entries" or maybe "erasing all sectors".

#ifdef CONFIG_NVS_BAD_PARTITION_RECOVERY
LOG_WRN("All sectors closed, erasing all settings...");
addr = 0U;
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count);
Contributor

You should use flash_flatten instead.
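(For context: flash_flatten() takes the same device/offset/size arguments as flash_erase(), but also works on flash devices that have no explicit erase operation, which is presumably why it is preferred here.)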

addr = 0U;
rc = flash_erase(fs->flash_device, 0, fs->sector_size * fs->sector_count);
if (rc) {
rc = -EIO;
Contributor

Why do you need to change the error code returned by the above function? Just keep the same error code.

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch from 9fa7b6c to 6bab748 on December 18, 2024 15:03
if (closed_sectors == fs->sector_count) {
#ifdef CONFIG_NVS_INIT_BAD_REGION
LOG_WRN("All sectors closed, erasing all sectors...");
addr = 0U;
Collaborator

Was this code tested? If you enter this if and erase the sectors, then you also enter the if (i == fs->sector_count) condition below, where rc = nvs_flash_cmp_const(fs, addr - ate_size, erase_value, sizeof(struct nvs_ate)); assumes that addr points to the last ATE within the last sector. Am I correct?

Please consider extending NVS unit tests with a scenario that you initialize NVS using a garbage data in all sectors.

Contributor Author

Just did a full re-test with the latest version I pushed. Set the ATEs to 0xdeadbeef and verified the log is printed and the NVS settings are written correctly afterwards.

Can you point me to the unit tests for NVS? I don't see them in this directory.

Collaborator

Thank you. The tests are in the tests/subsys/fs/nvs directory. Though feel free to merge this PR and add the tests later :)

if (closed_sectors == fs->sector_count) {
#ifdef CONFIG_NVS_INIT_BAD_REGION
LOG_WRN("All sectors closed, erasing all sectors...");
addr = 0U;
Collaborator

It is still needed to modify addr: after all sectors have been erased we should set it to the last sector close ATE, the same as would be found for an empty region. This line should also move after the rc check.

Contributor Author

Missed this; can we just remove this line too, since the previous loop will ensure addr is set to the last sector?

Collaborator

I think it is set to the first sector when all sectors are closed.

Contributor Author

I see, updated to set them to the last sector as you suggested
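Putting the review points together, the recovery branch presumably ends up along these lines (a sketch reconstructed from the discussion, not the exact merged diff; ADDR_SECT_SHIFT and ate_size are internals of nvs.c, and the Kconfig symbol is later renamed to NVS_INIT_BAD_MEMORY_REGION):

	if (closed_sectors == fs->sector_count) {
#ifdef CONFIG_NVS_INIT_BAD_REGION
		LOG_WRN("All sectors closed, erasing all sectors...");
		rc = flash_flatten(fs->flash_device, fs->offset,
				   fs->sector_size * fs->sector_count);
		if (rc) {
			goto end;
		}

		/* Point addr at the last close ATE of the last sector, the same
		 * position the scan would report for an empty region.
		 */
		addr = ((fs->sector_count - 1) << ADDR_SECT_SHIFT) +
		       (uint16_t)(fs->sector_size - ate_size);
#else
		/* All sectors closed and recovery disabled: fail as before. */
		rc = -EDEADLK;
		goto end;
#endif
	}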

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch from 6bab748 to b6fe8e7 on December 18, 2024 16:53
Laczen
Laczen previously approved these changes Dec 18, 2024
Collaborator

@Laczen Laczen left a comment

@corymayer thank you for this PR.

@corymayer corymayer requested a review from rghaddab December 18, 2024 19:38
Enable automatic initialization of a NVS on a flash region that does
not contain a valid NVS. A region containing an invalid NVS can be
caused by corruption or by providing a non-empty region. This option
ensures a new NVS can be created.
Contributor

The naming here is confusing. You are talking about "region" but in the description you are referring to "bad partition recovery".
AFAIK the term "region" doesn't seem to be used anywhere in NVS; NVS only knows about sectors of data.
It could be NVS_INIT_BAD_SECTORS.

Collaborator

The term "region" was carefully chosen not to overlap with "partition" (as defined in dts) or "area" (as used in the flash_area_... routines); there are also no bad sectors (a term used by NAND devices).

@corymayer, please update the bool to "Non-volatile Storage bad region recovery".

Contributor

@Laczen So maybe it is better to use the terminology "memory region" instead of just "region"

Collaborator

"Memory region" is also existing terminology.

Contributor

"memory region" is used to describe a part of the memory. This is what this PR is about, right ? Am I missing something ? Or do you mean that the "memory region" terminology is reserved for other purposes in zephyr ?

Collaborator

Everything is existing terminology, so context matters. If something does not exist in a given context then it can be used to describe some characteristic in that context. And NAND actually uses Bad Block Management, for example on Winbond devices, to manage sectors in pages, which is again vendor-specific nomenclature, because sector/page/block naming varies between devices.

So, "region" or "memory region", when defined within the context of NVS as a specific offset/range tuple, may be fine, as long as its relation to sectors or partitions, as used by NVS, is clear.

Collaborator

"memory region" is used in dts for linker memory regions, in user mode for memory domains... It is in no way reserved exclusively for these purposes but might be confusing to use it here as some people associate memory with RAM.

Contributor

In that case I prefer "memory region", which covers flash/RAM/... regions, instead of just "region", but that's not a blocker for me.

Contributor Author

Updated the name to "bad memory region" and added a unit test
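Piecing together the snippets quoted further down in this conversation, the added test presumably looks roughly like this (a sketch, not the exact merged test; TEST_SECTOR_COUNT, TEST_DATA_ID and execute_long_pattern_write are taken from the quoted fragments):

ZTEST_F(nvs, test_nvs_init_bad_memory_region)
{
	int err;
	uint32_t data;

	/* Write a bogus ATE at the end of every sector so that all sectors
	 * appear closed.
	 */
	for (uint16_t i = 0; i < TEST_SECTOR_COUNT; i++) {
		data = 0xdeadbeef;
		err = flash_write(fixture->fs.flash_device,
				  fixture->fs.offset +
				  (fixture->fs.sector_size * (i + 1)) -
				  sizeof(struct nvs_ate),
				  &data, sizeof(data));
		zassert_true(err == 0, "flash_write failed: %d", err);
	}

#ifdef CONFIG_NVS_INIT_BAD_MEMORY_REGION
	/* Recovery enabled: mounting succeeds and NVS is usable again. */
	err = nvs_mount(&fixture->fs);
	zassert_true(err == 0, "nvs_mount call failure: %d", err);

	execute_long_pattern_write(TEST_DATA_ID, &fixture->fs);
#else
	/* Recovery disabled: mounting fails as before. */
	err = nvs_mount(&fixture->fs);
	zassert_true(err == -EDEADLK, "nvs_mount call ok, expect fail: %d", err);
#endif
}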

@Laczen
Collaborator

Laczen commented Dec 19, 2024

@corymayer are you a discord user? I would like to get in touch.

@corymayer
Contributor Author

@corymayer are you a discord user? I would like to get in touch.

Yes, just joined as corymeta_74139

Laczen
Laczen previously approved these changes Dec 19, 2024
Collaborator

@Laczen Laczen left a comment

I find the term "bad memory region" confusing and long as a config variable name, but I'm not going to spend more time on terminology.

Thanks for adding the test.

@corymayer
Contributor Author

corymayer commented Dec 19, 2024

Fixed an off-by-one in the unit test. Now prints this in the twister log:

2024-12-19 14:06:03,971 - twister - DEBUG - QEMU (2562101): W: All sectors closed, erasing all sectors...
2024-12-19 14:06:03,971 - twister - DEBUG - QEMU (2562100): PASS - test_nvs_full_sector in 0.624 seconds

err = flash_write(fixture->fs.flash_device, fixture->fs.offset +
(fixture->fs.sector_size * (i + 1)) -
sizeof(struct nvs_ate), &data, sizeof(data));
zassert_true(err == 0, "flash_write failed: %d", err);
Collaborator

One space too much after the ",".

Contributor Author

Fixed these


#ifdef CONFIG_NVS_INIT_BAD_MEMORY_REGION
err = nvs_mount(&fixture->fs);
zassert_true(err == 0, "nvs_mount call failure: %d", err);
Collaborator

One space too much after the ",".

execute_long_pattern_write(TEST_DATA_ID, &fixture->fs);
#else
err = nvs_mount(&fixture->fs);
zassert_true(err == -EDEADLK, "nvs_mount call ok, expect fail: %d", err);
Collaborator

One space too much after the ",".

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch from eac7018 to 54dc53e on December 20, 2024 06:23
*/
ZTEST_F(nvs, test_nvs_init_bad_memory_region)
{

Collaborator

NIT: remove this line

/* Write bad ATE to each sector */
for (uint16_t i = 0; i < TEST_SECTOR_COUNT; i++) {
data = 0xdeadbeef;
err = flash_write(fixture->fs.flash_device, fixture->fs.offset +
Collaborator

NIT: improve the alignment or introduce an offset variable to make this line shorter.
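One way to do that (a sketch; ate_offset is just an illustrative name):

		off_t ate_offset = fixture->fs.offset +
				   (fixture->fs.sector_size * (i + 1)) -
				   sizeof(struct nvs_ate);

		err = flash_write(fixture->fs.flash_device, ate_offset,
				  &data, sizeof(data));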

@de-nordic
Collaborator

@corymayer Please fix the compliance.

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch 2 times, most recently from f1c1789 to 33e7164 on December 20, 2024 15:08
for (uint16_t i = 0; i < TEST_SECTOR_COUNT; i++) {
data = 0xdeadbeef;
err = flash_write(fixture->fs.flash_device,
fixture->fs.offset +
Collaborator

The alignment should be with the start of fixture->fs.flash_device. The complete file, however, is no example of correct alignment :-).

@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch 2 times, most recently from 3d215ee to 05d5839 on December 21, 2024 14:49
When all sectors contain data that is not 0xFF, NVS would fail to init.
Add a recovery path option to erase all sectors and reinitialize.
This could occur due to non-empty sectors or corrupted data.

Signed-off-by: Cory Andrew Mayer <[email protected]>
Add unit test to exercise bad memory region recovery path

Signed-off-by: Cory Andrew Mayer <[email protected]>
@corymayer corymayer force-pushed the settings_nvs_corruption_recovery branch from 05d5839 to 5fa5854 on December 21, 2024 15:54
@corymayer
Contributor Author

@corymayer Please fix the compliance.

I was able to run the compliance check locally and verified that all issues are resolved.

data = 0xdeadbeef;
err = flash_write(fixture->fs.flash_device,
fixture->fs.offset + (fixture->fs.sector_size * (i + 1)) -
sizeof(struct nvs_ate),
Collaborator

NIT, I think this alignment might still be considered incorrect.

@kartben kartben merged commit 8094f92 into zephyrproject-rtos:main Dec 23, 2024
24 checks passed

Hi @corymayer!
Congratulations on getting your very first Zephyr pull request merged 🎉🥳. This is a fantastic achievement, and we're thrilled to have you as part of our community!

To celebrate this milestone and showcase your contribution, we'd love to award you the Zephyr Technical Contributor badge. If you're interested, please claim your badge by filling out this form: Claim Your Zephyr Badge.

Thank you for your valuable input, and we look forward to seeing more of your contributions in the future! 🪁
