Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split tables #218

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Split tables #218

wants to merge 7 commits into from

Conversation

CAD97
Copy link
Collaborator

@CAD97 CAD97 commented Mar 19, 2018

Supersedes #207

Step one of refactoring the table structure for better future growth. I'll fix the merge conflict soon, but wanted to get this out.

I've been playing around with the tables ever since the thread prompted by miri. This is an incremental move towards the table design that I think is what we want.

The enum has been slit into two distinct types. This skips the branch required before access.

Both of the tables are now indirect -- this means having a separate &[char] and &[T], so everything should pack better. As slice::binary_search returns an index anyway, this seems like an all-around better choice for our associative slices, since space compression is what we're mostly concerned with, and the cyclomatic cost remains the same. I suspect any speed lost by not having the value next to the key should be gained in the better packing of data for the binary search. See the first commit for the full argument.

After this, the next step will be to add a table specifically for the char->Name translation. See the multiple places I've marked TODO(CAD97) for the upcoming work.

If we ever get fields with a type of impl Trait, it would probably be prudent to add a trait for the CharMap shape, so that the table structure can be changed in the future without affecting the public API surface. (Along those lines, I think there might be a public dependency in canonical_composition_mapping that we'd want to get rid of before bringing unic-ucd to 1.0, based on what I had to modify.)

r? @behnam
ping @CasualX; this sort of integrates some of the path you took.


This change is Reviewable

@CAD97 CAD97 requested a review from behnam March 19, 2018 03:28
@CasualX
Copy link

CasualX commented Mar 19, 2018

The PR looks great so far!

Separating the keys and values into separate tables isn't just good for packing, it's also great for cache pressure since the values of all the keys not being searched for don't pollute the cache. The savings in memory being touched for searching should far outweigh the single cache miss looking up the final value.

Furthermore LLVM should be able to figure out when the tables have the same length to eliminate some bound checks.

Have you considered generating these rsv (heh, rust separated values) files in a custom build step instead of having them checked into source control? (I can imagine generating them on build every time slows down compile times too much...)

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 19, 2018

Surprisingly, the rsv generation is rather quick in release mode. Computers are fast, even with the simple > efficient mindset of the generation code's ownership model.

However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.

I agree that it'd be better to get the downloaded/generated files out of the git repository, but that's future work and it's just easier to commit for now.

Additionally, the large number of subcrates makes it even more interesting. The end goal might be unic-gen as a library and used in individual build scripts, but for now a single manual script is easier. It's already much better than the collection of Python that most repos like this run off of, as it's actually decently documented (and if it's not, that's mostly my fault).

@behnam
Copy link
Member

behnam commented Mar 19, 2018

Haven't looked at the diff yet, but just a quick reply to the conversion here...

rsv (heh, rust separated values)

The idea was just "Rust Value", since they can be included where a value is expected.

And, I believe we have rsd, short for "Rust Definition", which contains items, to be included inside a module, trait or type definition.

However, we want to ship the tables pre-generated when used as a library. It would be theoretically possible to not commit the files, but especially given the current generation script which does all of the tables at one time to reduce redundant work.

One main reason we have decided to commit generated tables into the repo is that we want to make sure no-std components stay no-std, even in their build step. Generating these steps in no-std would add extra complexity, which IMHO is not necessary. Eventually, if we conclude that it's not a useful feature to have, we can move the generation code into build scripts.

On the source data side, we definitely don't want any build process/system to pull data over the web. It's bad for our reliability, and it's bad for unicode.org, etc servers. I'm working on setting up official git mirrors for Unicode data files. When those are available, we can make data/* directories to be just git submodules, mirroring the official repos.

Overall, IMHO we have more important features to focus on right now, and the data/ and gen/ parts of the code-base are stable enough and don't need major changes (except table format changes which @CAD97 is working on). Of course if there are any issues with the process, we can start a new issue discussion and get to the details.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 19, 2018

That merges in the changes since I started the branch.

We'll see how compile time goes on Travis, and if it's abysmal then I'll try the stopgap simplification of unic/ucd/name to just a direct mapping rather than the pieces currently used; that will be fewer things for the compiler to worry about.

I still plan to do the specialized table, but hopefully this PR is landable separate from that.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

Timeout

   Compiling unic-data v0.0.0 (file:///home/travis/build/behnam/rust-unic/data)
     Running `rustc --crate-name unic_data data/src/main.rs --crate-type bin --emit=dep-info,link -C debuginfo=2 -C metadata=dc3747200008d173 -C extra-filename=-dc3747200008d173 --out-dir /home/travis/build/behnam/rust-unic/target/debug/deps -C incremental=/home/travis/build/behnam/rust-unic/target/debug/incremental -L dependency=/home/travis/build/behnam/rust-unic/target/debug/deps --extern serde_derive=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde_derive-ea47998e28307385.so --extern serde=/home/travis/build/behnam/rust-unic/target/debug/deps/libserde-a8e442193f4c8269.rlib --extern toml=/home/travis/build/behnam/rust-unic/target/debug/deps/libtoml-fb4ac857952b9e34.rlib --extern tokio_core=/home/travis/build/behnam/rust-unic/target/debug/deps/libtokio_core-e3da18abb1df214d.rlib --extern clap=/home/travis/build/behnam/rust-unic/target/debug/deps/libclap-a12ff8fc86df6425.rlib --extern hyper=/home/travis/build/behnam/rust-unic/target/debug/deps/libhyper-4fe365a952c190d5.rlib --extern futures=/home/travis/build/behnam/rust-unic/target/debug/deps/libfutures-455c535dc7e0a786.rlib`


No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

I don't think I touched unic-data; ideas @behnam ? Last master build is still green, so I have no idea why it's timing out here.

Of note: it's timing out on 1.22 as well, so it's not a Rust change.

@behnam
Copy link
Member

behnam commented Mar 20, 2018

It's very strange for unic_data to fail on all builds, @CAD97. I got a similar time out on Travis once last week, only on one or two configs. It went well after hitting the Restart button. Maybe just try that once?

The AppVeyor failure seems to be unrelated to the Travis one. I believe serde-derive has some ongoing issues on nightly at the moment. There might be reports on that on https://github.com/rust-lang/rust already.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

All builds timed out on unic_data again. I'm going to ask Travis to run a build off of master, and see if that fails as well. I was having some build time oscillation on my local machine, but still don't really see what I might have changed for that crate's build perf, and I just did a cold build for unic_data at 1m16s on my local machine.

I'm scared that this is some sort of build indeterminacy being exacerbated by Travis's environment.

https://travis-ci.org/behnam/rust-unic/builds/355665777

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

Master build passed. I'm going to restart Travis on this PR one more time, then if that fails accross the board again I'll try and see what I can diagnose locally. @behnam if you've got a not-Windows machine, it'd be useful if you could run a cold build and see what you get. FWIW my last cold build --all took a total of ~21m (though I should note I built unic-data separately first).

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

Did one more build locally, cold build --all. Hung at unic-data at around 20% cpu utilization for >10m.

real    20m32.337s
user    0m0.031s
sys     0m0.015s

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

Just curious on how this compares since it's working from a cold cache anyway.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 20, 2018

No difference. Will remove the de-incremental tomorrow.

@behnam
Copy link
Member

behnam commented Mar 20, 2018

Okay, I've tried on my MacBook Pro and get similar results.

Here's what I think is happening: unic-ucd-name is the crate with build problem, but we see unic-data as the last item in the list because that's the one with most external dependencies. When unic-ucd-name gets stuck, build for all other components await that, but unic-data gets cleared from deps and gets build eventually.

To see this, you can limit the parallel builds to one.

Now I'm going to take a look at the code and why the build halts.

@behnam behnam mentioned this pull request Mar 21, 2018
@behnam
Copy link
Member

behnam commented Mar 21, 2018

Looks like this is back to the unknown situation where just touching the names table puts the compiler in seemingly-infinite loop.

Here's an idea to move forward and figure out the names problem later: how about you keep the new trait and type definitions along with the existing data type, and migrate over anything that doesn't break, and we will dig deeper into the ucd-name component afterwards, knowing the lower types are working fine.

@behnam
Copy link
Member

behnam commented Mar 21, 2018

FYI, I got unic-ucd-name compiled on this branch on my MacBook Pro in about 19 minutes:

$ time cargo +stable build --release --verbose --manifest-path unic/ucd/name/Cargo.toml
   Compiling unic-char-range v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/range)
   Compiling unic-common v0.7.0 (file:///Users/behnam/code/unic/unic/unic/common)
     Running `rustc --crate-name unic_char_range unic/char/range/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=f3ac911853d0e14a -C extra-filename=-f3ac911853d0e14a --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
     Running `rustc --crate-name unic_common unic/common/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 --cfg 'feature="default"' -C metadata=bbd743bcd5fc1b63 -C extra-filename=-bbd743bcd5fc1b63 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps`
   Compiling unic-ucd-version v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/version)
     Running `rustc --crate-name unic_ucd_version unic/ucd/version/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=eb58ae67be7e945f -C extra-filename=-eb58ae67be7e945f --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_common=/Users/behnam/code/unic/unic/target/release/deps/libunic_common-bbd743bcd5fc1b63.rlib`
   Compiling unic-char-property v0.7.0 (file:///Users/behnam/code/unic/unic/unic/char/property)
     Running `rustc --crate-name unic_char_property unic/char/property/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=5c7d5a21d125f65d -C extra-filename=-5c7d5a21d125f65d --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_char_range=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_range-f3ac911853d0e14a.rlib`
   Compiling unic-ucd-hangul v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/hangul)
     Running `rustc --crate-name unic_ucd_hangul unic/ucd/hangul/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=290aab54d532b223 -C extra-filename=-290aab54d532b223 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
   Compiling unic-ucd-name v0.7.0 (file:///Users/behnam/code/unic/unic/unic/ucd/name)
     Running `rustc --crate-name unic_ucd_name unic/ucd/name/src/lib.rs --crate-type lib --emit=dep-info,link -C opt-level=3 -C metadata=cdc37663d19b71d4 -C extra-filename=-cdc37663d19b71d4 --out-dir /Users/behnam/code/unic/unic/target/release/deps -L dependency=/Users/behnam/code/unic/unic/target/release/deps --extern unic_ucd_hangul=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_hangul-290aab54d532b223.rlib --extern unic_char_property=/Users/behnam/code/unic/unic/target/release/deps/libunic_char_property-5c7d5a21d125f65d.rlib --extern unic_ucd_version=/Users/behnam/code/unic/unic/target/release/deps/libunic_ucd_version-eb58ae67be7e945f.rlib`
    Finished release [optimized] target(s) in 1192.87 secs

real	19m53.444s
user	19m43.632s
sys	0m4.710s

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 25, 2018

Oh, I know exactly why unic/ucd/name would be taking so long: too much slice. The eventual better solution to the char->name table will hopefully make that worry a thing of the past, but for now I'm going to add in the simplified CharMap<&str> rather than CharMap<&[&str]> to reduce the amount of work the compiler has to do. Hopefully. Performance should be about the same, but binary size might increase temporarily. Again, the new table I'm working on will fix this properly. But a cold compile time for --all of <200s is nice to have again.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 25, 2018

I would make that a nice single commit that could be reverted, but that's a bit beyond my git-rebase skill at the moment.

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 25, 2018

gworsh, CI caught a failure in a different feature configuration! Let me fix that real quick, then this should be good for the full review @behnam.

@behnam
Copy link
Member

behnam commented Mar 25, 2018

Great!

One thing that would be easy to do on top of this diff is to put all the str slices in one big const and slice out of that. (Eventually, we can optimize the the string, but for now a simple concatenation should do the job.)

@CAD97
Copy link
Collaborator Author

CAD97 commented Mar 25, 2018

That would be possible, but relatively pointless, as rustc already is already handling it fine at this level of slice. I'm pushing for the better suffix compression + everything else specific that the name table can give us.

@behnam
Copy link
Member

behnam commented Aug 20, 2018

What should we do about this PR, @CAD97? Would you like to rebase and see if there are still any compile problems?

CAD97 added 7 commits August 20, 2018 15:53
I've gone with the "indirect" table for a couple reasons.

- Better packing for the binary-searched slice
  hopefully means better binary search performance?
  In any case, it does mean better packing for the payload.
- slice::binary_search returns an index anyway, so we have to index.
  It might be marginally easier to index back to where we just were
  but I suspect that this isn't the case.

Given the static data size optimization, this seems the sensible default.
We can switch back to inline at any point, anyway,
since the contents of the map is considered private.
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 20, 2018

Rebased. Now waiting on CI to see if the rebase broke anything. Also testing locally to see if it works with 8d2bc79 reverted.

This map structure is still a win over the current structure, though of course the name mapping will be the biggest win once that gets done.

I should have a bit more time I can spend on this moving forward (though of course I can't guarantee anything), and have a few more ideas for table formats that can help in different situations -- the name-specific trie needs to be done first and then #231 gives me two separate ideas as well (one of the two is what the actual inspiration did).

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 20, 2018

With simplified name tables
image
Non-simplified name tables caused a timeout across the board. Removing the revert.

@behnam behnam added this to the UNIC-0.9 milestone Jan 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants