Split tables #218
Conversation
The PR looks great so far! Separating the keys and values into separate tables isn't just good for packing, it's also great for cache pressure, since the values of all the keys not being searched for don't pollute the cache. The savings in memory being touched for searching should far outweigh the single cache miss looking up the final value. Furthermore, LLVM should be able to figure out when the tables have the same length to eliminate some bounds checks. Have you considered generating these rsv (heh, rust separated values) files in a custom build step instead of having them checked into source control? (I can imagine generating them on build every time slows down compile times too much...)
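A rough Rust sketch of the layout being discussed (toy data, not the actual unic tables): keys and values live in parallel slices, and the binary search walks only the key slice.

```rust
// Sketch of the split-table layout: parallel sorted-key and value
// slices instead of one slice of (char, value) pairs. The search
// touches only KEYS, so cache lines aren't polluted by values of
// keys that are merely being compared against.
static KEYS: &[char] = &['a', 'e', 'i', 'o', 'u'];
static VALUES: &[u8] = &[1, 2, 3, 4, 5];

fn lookup(needle: char) -> Option<u8> {
    // One extra indexed load at the end trades for a denser search.
    KEYS.binary_search(&needle).ok().map(|i| VALUES[i])
}

fn main() {
    assert_eq!(lookup('i'), Some(3));
    assert_eq!(lookup('z'), None);
}
```

Because `KEYS` and `VALUES` are built together and always have the same length, the final `VALUES[i]` index is in bounds whenever the search succeeds, which is the bounds-check elimination opportunity mentioned above.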
Surprisingly, the rsv generation is rather quick in release mode. Computers are fast, even with the simple > efficient mindset of the generation code's ownership model. However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but it's awkward, especially given that the current generation script does all of the tables at one time to reduce redundant work. I agree that it'd be better to get the downloaded/generated files out of the git repository, but that's future work and it's just easier to commit for now. Additionally, the large number of subcrates makes it even more interesting. The end goal might be unic-gen as a library used in individual build scripts, but for now a single manual script is easier. It's already much better than the collection of Python that most repos like this run off of, as it's actually decently documented (and if it's not, that's mostly my fault).
Haven't looked at the diff yet, but just a quick reply to the conversation here...
The idea was just "Rust Value", since they can be included where a value is expected.
One main reason we have decided to commit generated tables into the repo is that we want to make sure no-std components stay no-std, even in their build step. Generating these tables in no-std would add extra complexity, which IMHO is not necessary. Eventually, if we conclude that it's not a useful feature to have, we can move the generation code into build scripts. On the source data side, we definitely don't want any build process/system to pull data over the web. It's bad for our reliability, and it's bad for the unicode.org etc. servers. I'm working on setting up official git mirrors for Unicode data files. When those are available, we can make the switch. Overall, IMHO we have more important features to focus on right now.
That merges in the changes since I started the branch. We'll see how compile time goes on Travis, and if it's abysmal then I'll try the stopgap simplification of unic/ucd/name to just a direct mapping rather than the pieces currently used; that will be fewer things for the compiler to worry about. I still plan to do the specialized table, but hopefully this PR is landable separate from that.
Timeout
I don't think I touched that crate. Of note: it's timing out on 1.22 as well, so it's not a Rust change.
It's very strange for that to be the case. The AppVeyor failure seems to be unrelated to the Travis one.
Rerunning Travis.
All builds timed out on unic_data again. I'm going to ask Travis to run a build off of master, and see if that fails as well. I was having some build time oscillation on my local machine, but still don't really see what I might have changed for that crate's build perf, and I just did a cold build for unic_data at 1m16s on my local machine. I'm scared that this is some sort of build indeterminacy being exacerbated by Travis's environment.
Master build passed. I'm going to restart Travis on this PR one more time, then if that fails across the board again I'll try and see what I can diagnose locally. @behnam if you've got a not-Windows machine, it'd be useful if you could run a cold build and see what you get. FWIW my last cold `build --all` took a total of ~21m (though I should note I built unic-data separately first).
Did one more build locally, cold.
Just curious how this compares, since it's working from a cold cache anyway.
No difference. Will remove the de-incremental tomorrow.
Okay, I've tried on my MacBook Pro and get similar results. Here's what I think is happening: to see this, you can limit the parallel builds to one. Now I'm going to take a look at the code and why the build halts.
Looks like this is back to the unknown situation where just touching the names table puts the compiler in a seemingly infinite loop. Here's an idea to move forward and figure out the names problem later: how about you keep the new trait and type definitions along with the existing data type, and migrate over anything that doesn't break, and we will dig deeper into the ucd-name component afterwards, knowing the lower types are working fine.
FYI, I got similar results locally.
Oh, I know exactly why unic/ucd/name would be taking so long: too much slice. The eventual better solution to the char->name table will hopefully make that worry a thing of the past, but for now I'm going to add in the simplified version.
I would make that a nice single commit that could be reverted, but that's a bit beyond my git-rebase skill at the moment.
gworsh, CI caught a failure in a different feature configuration! Let me fix that real quick, then this should be good for the full review @behnam.
Great! One thing that would be easy to do on top of this diff is to put all the
That would be possible, but relatively pointless, as rustc is already handling it fine at this level of slice. I'm pushing for the better suffix compression + everything else specific that the name table can give us.
What should we do about this PR, @CAD97? Would you like to rebase and see if there are still any compile problems?
I've gone with the "indirect" table for a couple reasons:

- Better packing for the binary-searched slice hopefully means better binary search performance? In any case, it does mean better packing for the payload.
- `slice::binary_search` returns an index anyway, so we have to index. It might be marginally easier to index back to where we just were, but I suspect that this isn't the case.

Given the static data size optimization, this seems the sensible default. We can switch back to inline at any point, anyway, since the contents of the map are considered private.
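The packing point can be made concrete with a small sketch (this assumes a one-byte payload for illustration; the actual unic payload types vary):

```rust
use std::mem::size_of;

fn main() {
    // Inline layout: one slice of (key, value) pairs. A char key has
    // 4-byte alignment, so with a 1-byte payload each pair is padded
    // from 5 bytes of data up to 8 bytes of storage.
    assert_eq!(size_of::<(char, u8)>(), 8);

    // Indirect layout: N keys in one slice, N values in another;
    // 4 + 1 = 5 bytes per entry, with no per-entry padding.
    let per_entry = size_of::<char>() + size_of::<u8>();
    assert_eq!(per_entry, 5);
}
```

The denser key slice also means more keys fit per cache line during the binary search, which is where the speed hopefully comes back.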
Rebased. Now waiting on CI to see if the rebase broke anything. Also testing locally to see if it works with 8d2bc79 reverted. This map structure is still a win over the current structure, though of course the name mapping will be the biggest win once that gets done. I should have a bit more time I can spend on this moving forward (though of course I can't guarantee anything), and have a few more ideas for table formats that can help in different situations -- the name-specific trie needs to be done first and then #231 gives me two separate ideas as well (one of the two is what the actual inspiration did).
Supersedes #207
Step one of refactoring the table structure for better future growth. I'll fix the merge conflict soon, but wanted to get this out.
I've been playing around with the tables ever since the thread prompted by miri. This is an incremental move towards the table design that I think is what we want.
The enum has been split into two distinct types. This skips the branch required before access.
Both of the tables are now indirect -- this means having a separate `&[char]` and `&[T]`, so everything should pack better. As `slice::binary_search` returns an index anyway, this seems like an all-around better choice for our associative slices, since space compression is what we're mostly concerned with, and the cyclomatic cost remains the same. I suspect any speed lost by not having the value next to the key should be gained in the better packing of data for the binary search. See the first commit for the full argument.

After this, the next step will be to add a table specifically for the `char->Name` translation. See the multiple places I've marked `TODO(CAD97)` for the upcoming work.

If we ever get fields with a type of `impl Trait`, it would probably be prudent to add a trait for the `CharMap` shape, so that the table structure can be changed in the future without affecting the public API surface. (Along those lines, I think there might be a public dependency in canonical_composition_mapping that we'd want to get rid of before bringing `unic-ucd` to 1.0, based on what I had to modify.)

r? @behnam
ping @CasualX; this sort of integrates some of the path you took.
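As a rough illustration of the `CharMap`-trait idea mentioned above, here is a minimal sketch; the names (`CharMap`, `find`, `SplitTable`) are hypothetical and not part of the actual unic API:

```rust
// Hypothetical sketch: a trait over the char -> value lookup shape,
// so the backing table layout can change later without breaking the
// public surface.
trait CharMap<V> {
    fn find(&self, needle: char) -> Option<V>;
}

// One possible backing store: the split/indirect layout from this PR.
struct SplitTable<'a, V: Copy> {
    keys: &'a [char], // sorted, same length as `values`
    values: &'a [V],
}

impl<'a, V: Copy> CharMap<V> for SplitTable<'a, V> {
    fn find(&self, needle: char) -> Option<V> {
        self.keys
            .binary_search(&needle)
            .ok()
            .map(|i| self.values[i])
    }
}

fn main() {
    let table = SplitTable { keys: &['a', 'b'], values: &[10u32, 20] };
    assert_eq!(table.find('b'), Some(20));
    assert_eq!(table.find('c'), None);
}
```

Callers coded against the trait would be unaffected if `SplitTable` were later swapped for, say, a trie-backed implementation.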