ci: add job to verify binary size #475

Open

justus-camp-microsoft wants to merge 46 commits into main

Conversation

@justus-camp-microsoft (Contributor) commented Dec 12, 2024

This PR adds a job to diff the binary sizes introduced by a change. As implemented, the action runs git merge-base to find a common ancestor with main, fetches a completed build from CI (trying up to 5 commits back in case CI hasn't completed for the commit returned by git merge-base), and outputs a diff.
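Roughly, the lookup works like the following sketch (illustrative only - the helper name and the exact gh/git invocations here are not the code in this PR):

use xshell::{cmd, Shell};

// Walk back from the merge base until a commit with a completed CI run is found.
fn find_baseline_run(sh: &Shell, max_walk_back: usize) -> anyhow::Result<String> {
    // Common ancestor between the PR head and main.
    let mut commit = cmd!(sh, "git merge-base HEAD origin/main").read()?;

    for _ in 0..max_walk_back {
        // Ask GitHub for the most recent completed run on this commit.
        let run_id = cmd!(
            sh,
            "gh run list --commit {commit} -s completed -L 1 --json databaseId --jq '.[].databaseId'"
        )
        .read()?;
        if !run_id.is_empty() {
            return Ok(run_id);
        }
        // No completed run yet for this commit; step one commit further back.
        let parent = format!("{commit}~1");
        commit = cmd!(sh, "git rev-parse {parent}").read()?;
    }

    anyhow::bail!("no completed CI run found within {max_walk_back} commits of the merge base")
}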

GitHub Actions workflows with a pull_request trigger are unable to comment on PRs. As such, this implementation fails the check if the size difference is greater than a threshold. In the case where we're OK with the size increase, my understanding is that we can force the merge without the check passing.

@jstarks (Member) commented Dec 13, 2024

How is this going to be different from #458?

@justus-camp-microsoft (Contributor, Author)

I wasn't aware of that thread. I'll look into getting a baseline from a pipeline and using it for comparison.

@justus-camp-microsoft (Contributor, Author)

I took a look at FluidFramework, which I used to work on and which has a bundle size check as part of its PR workflow. From what I can tell, their approach is to traverse HEAD~n until they find a completed build and then do a size comparison against that. Their CI has a bot that leaves a comment with the comparison, but it doesn't look like it blocks merging of a PR. What do we think about that approach?

@smalis-msft (Contributor)

Oooooh, prior art, nice.

I think the commit we want to compare against is whatever the merge is based on. That would allow us to get as good a measurement of "this PR adds X bytes compared to not having it" as possible. If that commit is still running through CI maybe we just wait for it? If it fails though then walking backwards on main/release does seem like a reasonable fallback strategy.

I think for ours we'd prefer to have a gate rather than just a comment, so long as there's some way for us to then override the block and say "yes this is acceptable". But a gate would prevent anyone from merging before the bot comments, for example. We could then have a dedicated size_override reviewers group that the gate requires sign off from to override or something.

Also, I'd like to make sure we're actually storing the whole built file that we're using to compare against, not just a pre-computed summary of it. That frees us up to do more complex and involved analysis in the future.

@smalis-msft (Contributor)

Tagging #76

@justus-camp-microsoft changed the title from "ci: add job to verify binary size" to "WIP: ci: add job to verify binary size" on Jan 2, 2025
@justus-camp-microsoft marked this pull request as ready for review on January 2, 2025 22:20
@justus-camp-microsoft requested review from a team as code owners on January 2, 2025 22:20
@smalis-msft (Contributor)

Man, this is exciting to see; the prior solution has been an annoyance for so long now.

Comment on lines +91 to +93
if total_diff > 100 {
    anyhow::bail!("{} size verification failed: The total difference ({} KiB) is greater than the allowed difference ({} KiB).", self.new.display(), total_diff, 100);
}
@smalis-msft (Contributor) commented Jan 3, 2025

We'll need some way to override this check on a PR level, some way to say "Yes this size diff is acceptable". Not sure what github allows us to do here.

Contributor Author

GitHub doesn't really have a great way to do this. Ideally we could have this always be required and just hit an "override" button but afaik that's not possible. We're also unable to assign a review team through the action (as I painfully learned from trying to re-enable the unsafe reviewers assignment) because actions are scoped to the repo level and our review teams are scoped at the org level (no access).

My thought here is that we should have this action always succeed as long as it finishes all the way through and have it leave a comment on the PR with a summary of the size diff. The onus would be on the reviewer to look at the comment and make sure that the difference is acceptable.

Contributor

If that's the best we can do then it's the best we can do I guess. Maybe include some big warning text in the comment if the diff is over a threshold.

We really should figure out some way to get review groups working though. Then we could have the unsafe reviewers group back and create a new binary size reviewer group for large diffs or something.

Contributor Author

We'll need a PAT with org-level team read access and then the reviewer assignment would work. My understanding was that we don't want to deal with maintaining the PAT.

Contributor

Can we have review teams scoped to the repo instead of the org? I'm really not familiar with GitHub, so I'm just spitballing. But yeah, maintaining a PAT has bitten us in the past, and definitely isn't ideal.

Contributor Author

Hmm, good point. I can look into that and see if that's a possibility. I hadn't thought of bringing the teams down to the repo level.

Contributor Author

Looping back around to this - a GitHub action with a pull_request target is unable to comment on PRs (similar limitation to unsafe reviewers check) and as such I think our best bet here is to fail the action if it's over a threshold. In the case where it's over the threshold and we're ok with the size increase my understanding is that we can force the merge with the failing check.

Member

I think that's OK for v1. But I think the pattern here to follow for v2 would be to create an additional workflow that depends on this one/is triggered by this one but comes from the base branch. That would allow it to safely have access to add comments, etc. I think this means using pull_request_target for that workflow.

Member

Or maybe workflow_run.

Contributor Author

My understanding here is that for dependent workflows like that we would need to use workflow_run, but that the token passed when triggered has the same permissions as the one triggering it (as in, it would get the pull_request token that doesn't have comment permissions). I could definitely be wrong here as I didn't try it.

@justus-camp-microsoft changed the title from "WIP: ci: add job to verify binary size" to "ci: add job to verify binary size" on Jan 8, 2025
all_jobs.push(job.finish());

// emit openvmm verify-size job
let job = pipeline
Member

Why not use a single job?

Contributor

Indeed.

If you actually look inside _jobs::build_and_publish_openhcl_igvm_from_recipe, you'll see that it's really just a wrapper around the core build_openhcl_igvm_from_recipe node + some wiring to build multiple IGVM files simultaneously, and then publish various artifacts.

If you just peel back a layer, and have your new _jobs::check_openvmm_hcl_size node interface with that build_openhcl_igvm_from_recipe Node directly, you can sidestep all this multi-job coordination, and just use the IGVM file you built in that job.

Plus - you wouldn't be shackled to the existing openhcl-igvm + openhcl-igvm-extras artifact structure, and could instead have a new, verify-size-specific openhcl-igvm-verify-size-baseline artifact that you can then use across jobs (and which would only contain the precise artifacts the verify-size infrastructure cares about).

@@ -1793,6 +1793,14 @@ pub mod steps {

/// `github.token`
pub const GITHUB__TOKEN: GhContextVar = GhContextVar::new_secret("github.token");

/// `github.event.pull_request.head.ref`
pub const GITHUB__HEAD_REF: GhContextVar =
Contributor

I think we want to be mindful about how we choose to expose this particular class of context variables in flowey.

All other constants defined in this list are guaranteed to be valid in any pipeline run. The same cannot be said of these new github.event.pull_request constants, which should only be used in PR-triggered workflows. i.e: it seems unwise to make it "trivial" to access these variables in the context of a CI-triggered workflow via the existing get_gh_context_var API, given that the resulting loosely-typed String could sometimes be empty.

My gut feeling is that we want to have some API that would let us model these sorts of context-dependent variables in a type-safe manner, in order to give users a way to get a ReadVar<Option<PullRequestRelatedThing>>.

Consider the following modification to the existing get_gh_context_var API:

impl NodeCtx {
    fn get_gh_context_var(&mut self) -> GhContextVarReader;
}

impl GhContextVarReader {
    fn global(&mut self, var: GhContextVar) -> ReadVar<String>;
    fn event(&mut self) -> GhContextVarReaderEvent;
}

impl GhContextVarReaderEvent {
    fn pull_request(&mut self) -> GhContextVarReaderEventPullRequest;
}

// and so on...
//
// thereby enabling:

let global: ReadVar<String> = ctx.get_gh_context_var().global(GhContextVar::RUNNER__TEMP);
let pr_specific: ReadVar<Option<String>> = ctx.get_gh_context_var().event().pull_request().head().ref();

// theoretically, with this scheme, we could switch all existing `const GhContextVar` enums to 
// just hang off a `global()` object, e.g: `global().runner().temp()`

The resulting API is very fluent for end users, and to make our lives easier as implementors, we can leverage the type-state pattern to avoid an explosion of different GhContextVarReader types, and instead, simply transition between various versions of a single backing GhContextVarReader<T> type.

With this API, the leaf-nodes (e.g: pull_request().head().ref()) can then encode any necessary flowey logic to read the raw String data, and then convert it into a Option<String> if need be.
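For illustration, here's a minimal, self-contained sketch of that type-state shape (toy stand-in types only - in flowey the real NodeCtx / ReadVar / GhContextVar would take their place):

use std::marker::PhantomData;

mod state {
    pub struct Root;
    pub struct Event;
    pub struct PullRequest;
}

// One backing type; the marker parameter controls which methods are available.
struct Reader<S> {
    path: Vec<&'static str>,
    _state: PhantomData<S>,
}

impl Reader<state::Root> {
    fn new() -> Self {
        Reader { path: vec![], _state: PhantomData }
    }

    fn event(mut self) -> Reader<state::Event> {
        self.path.push("event");
        Reader { path: self.path, _state: PhantomData }
    }
}

impl Reader<state::Event> {
    fn pull_request(mut self) -> Reader<state::PullRequest> {
        self.path.push("pull_request");
        Reader { path: self.path, _state: PhantomData }
    }
}

impl Reader<state::PullRequest> {
    // Leaf node: in flowey this would return ReadVar<Option<String>>,
    // mapping an empty string (non-PR run) to None.
    fn head_ref(mut self) -> String {
        self.path.push("head.ref");
        format!("github.{}", self.path.join("."))
    }
}

fn main() {
    // Only the transitions defined above compile, so PR-only data can't be
    // reached without explicitly going through the pull_request() step.
    assert_eq!(
        Reader::new().event().pull_request().head_ref(),
        "github.event.pull_request.head.ref"
    );
}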


I know this is all somewhat orthogonal to the problem this PR is specifically trying to solve... but we must not forget that flowey is not a framework set in stone. It still has many rough edges and core abstractions that need to be reconsidered / reworked. This sort of flowey rework can and should be done as part of whatever feature work we are doing.

In this case, I don't actually think there's too much work to re-jig the API as I suggest here - there shouldn't be any "backend" flowey work, and it'd simply be some clever refactoring of the user-facing NodeContext APIs.

That said - I would suggest you split this work out into a separate PR, and then rebase this PR on-top of that one.

Contributor

theoretically, it might even make sense to have a single pull_request() -> ReadVar<Option<GhEventPullRequest>> method, which dumps the entire github.event.pull_request variable as JSON, and then flowey uses a serde defn of the corresponding object to avoid needing to manually write out methods for each field. See https://docs.github.com/en/webhooks/webhook-events-and-payloads#pull_request

It's a "big" object, but it's not a huge object, so I think that might be a viable approach - even if the pipeline only ends up using one or two fields of the object that gets parsed in.

And if it's hard to transcribe that particular object structure in its entirety into a serde type, we can always go piecemeal (given that serde will just ignore JSON fields during deserialization that it wasn't explicitly told about).
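e.g., the piecemeal version could look something like this (field names mirror the linked webhook payload docs; the structs are illustrative, not existing flowey code):

use serde::Deserialize;

// Only the fields the pipeline actually needs; serde ignores the rest of the
// (large) event payload by default.
#[derive(Debug, Deserialize)]
struct GhEventPullRequest {
    number: u64,
    head: GhEventPullRequestRef,
    base: GhEventPullRequestRef,
}

#[derive(Debug, Deserialize)]
struct GhEventPullRequestRef {
    // `ref` is a Rust keyword, hence the rename.
    #[serde(rename = "ref")]
    git_ref: String,
    sha: String,
}

fn parse_pr_event(json: &str) -> serde_json::Result<GhEventPullRequest> {
    serde_json::from_str(json)
}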

#[clap(about = "Verify the size of a binary hasn't changed more than allowed.")]
pub struct VerifySize {
    /// Old binary path
    #[clap(short, long, required(true))]
Contributor

I believe the required(true) is redundant, as these aren't using Option<PathBuf>, nor are they relying on any clap default directive.
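i.e., something like the following sketch should behave identically (standalone for brevity; field names follow the surrounding snippet):

use clap::Parser;
use std::path::PathBuf;

#[derive(Parser)]
#[clap(about = "Verify the size of a binary hasn't changed more than allowed.")]
pub struct VerifySize {
    /// Old binary path
    #[clap(short, long)]
    original: PathBuf,

    /// New binary path
    #[clap(short, long)]
    new: PathBuf,
}

A plain (non-Option, no-default) field is already treated as required by the clap derive, so dropping required(true) shouldn't change the CLI's behavior.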

Comment on lines +81 to +87
let original_elf = object::File::parse(&*original).or_else(|e| {
    anyhow::bail!(
        r#"Unable to parse target file "{}". Error: "{}""#,
        &self.original.display(),
        e
    )
})?;
Contributor

https://docs.rs/anyhow/latest/anyhow/trait.Context.html is your friend

Suggested change
- let original_elf = object::File::parse(&*original).or_else(|e| {
-     anyhow::bail!(
-         r#"Unable to parse target file "{}". Error: "{}""#,
-         &self.original.display(),
-         e
-     )
- })?;
+ let original_elf = object::File::parse(&*original).context(
+     format!(
+         r#"Unable to parse target file "{}""#,
+         &self.original.display()
+     )
+ )?;
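(Or the closure form of the same snippet, which only builds the message on the error path - a minor nicety, not something this suggestion hinges on:)

let original_elf = object::File::parse(&*original)
    .with_context(|| format!(r#"Unable to parse target file "{}""#, self.original.display()))?;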

@@ -0,0 +1,55 @@
// Copyright (c) Microsoft Corporation.
// Licensed under the MIT License.

Contributor

don't forget your module docs

Contributor

(here, and elsewhere)

let get_action_id = |commit: String| {
    xshell::cmd!(
        sh,
        "gh run list --commit {commit} -w '[flowey] OpenVMM CI' -s 'completed' -L 1 --json databaseId --jq '.[].databaseId'"
Contributor

we definitely don't want to be hard-coding OpenVMM-specific pipeline names within a flowey_lib_common node. This should be something that gets passed in as a runtime / comptime var.

sh,
"gh run list --commit {commit} -w '[flowey] OpenVMM CI' -s 'completed' -L 1 --json databaseId --jq '.[].databaseId'"
)
.env("GITHUB_TOKEN", gh_token.clone())
Contributor

is this actually necessary? I would expect GitHub Actions to already have this set ambiently?

gh_workflow_id,
} = request;

let gh_token = ctx.get_gh_context_var(GhContextVar::GITHUB__TOKEN);
Contributor

hmmm, maybe not strictly required for this PR, but do we need some API like get_gh_token_with_permissions(impl Iterator<Item = GhPermission>) -> ReadVar<String>? i.e.: in case you need the pipeline to have a particular GitHub permission, outside the context of emit_gh_step?
