Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache-less fast file search by directly reading the file system #61

Open
valaphee opened this issue Jul 5, 2023 · 3 comments
Open

Cache-less fast file search by directly reading the file system #61

valaphee opened this issue Jul 5, 2023 · 3 comments

Comments

@valaphee
Copy link

valaphee commented Jul 5, 2023

There is a way on Windows, and definitely also on Linux, to directly read the file system structure, which don't involve the operating systems file caching, search indices, etc.

And when searching for a file over the entire disk, it's way faster to just read for example the NTFS Master File Table; As a proof on concept, I created a project which currently implements this kind of file search for Windows NTFS file system: https://github.com/valaphee/curtain/blob/main/storage/examples/search.rs, and it is only limited by the speed of how fast the Master File Table can be read, and only takes about 1 second on my PC when searching on an 512GB NVMe volume with around 200k files.

This might also be combined with file system watches and caching. But without writing the cache to disk, as the load time of the cache is slower than just reading the file entries. And it also increases the disk wear level by an unnessary amount, especially as compressed data may completely change by just adding/removing one file.

The internals are little bit more complicated as it basically simulates the NTFS file system read operation and it also requires administrator privileges. One other downside is that it might be slower when limiting the search like for example searching for example by directory which has <= Total NTFS files / 2, which can be optimized by parsing the file structure, etc. with the cost of higher memory usage.

But its the technically the fastest way for searching files.

@valaphee valaphee changed the title Cache-less fast file search by directly reading the file table Cache-less fast file search by directly reading the file system Jul 5, 2023
@stysan
Copy link

stysan commented Jul 6, 2023

i second this

@grayfallstown
Copy link

I just wanted to suggest the MFT store, however, I thought to use it to build the index/cache which would speed up subdirectory search when/if implemented.

@valaphee
Copy link
Author

valaphee commented Jul 7, 2023

I just wanted to suggest the MFT store, however, I thought to use it to build the index/cache which would speed up subdirectory search when/if implemented.

But you would have to overhaul the cache, as the "path" operation is really slow, because you have to walk the entire path to the root directory,

Which can be solved by normal tree walking utilizing the index attribute, or to directly build a tree from leaf to root.
And you could linearize the tree like:

Names (Linear vec, zero terminated strings)
0 C:\
1 | helloworld.txt
2 | System32\
3 | | ..
4 | Users\
5 | | valaphee\
6 | | | helloworld.txt
7 | | | helloworld\
8 | | | | helloworld.txt
9 | | | | ..
10| | | ..
11| | anotheruser\
12| | | helloworld.txt
13| | | ..
14| | ..
15| ..
16..
(will look like "C:\\\0helloworld.txt\0System32\\\0..\0Users\\\0valaphee\\\0...")

Ranges (HashMap)
C:\ 1-15
C:\System32\ 3-3
C:\Users\ 5-13
C:\Users\valaphee\ 6-8
C:\Users\valaphee\helloworld\ 8-8
C:\Users\anotheruser\ 12-12

Searching for "helloworld.txt" in C:\Users\valaphee:
- look in ranges for "C:\Users\valaphee\" => 6-8
- 6. found helloworld.txt
- 7. push helloworld\ (\ indicates directory)
- 8. found helloworld.txt
- (9.) pop last element (e.g. helloworld, .. indicates pop)

This method is a trade-off between linear search speed (CPU cache hits, because there is no pointer chasing like a Vec of Strings), hash lookup speed and also memory occupancy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants