Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support syncing Amazon Linux blobstore style repos (e.g. AL2, AL2022) #457

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

stewartsmith
Copy link

The yum repository layout for Amazon Linux repositiors (AL1, AL2,
AL2022, and likely future versions) have a slightly interesting layout.

All Amazon Linux releases have a mirrorlist that points to the
repository, but the repository is not in a fixed location, but instead
under a GUID. This means that content can be synced (and staged) before
the atomic (and fast) operation of writing a new mirrorlist makes this
content visible.

For Amazon Linux 1 style repositories this wasn't a problem as each of
these GUID repos that the mirrorlist pointed to was a complete copy of
the repository, packages and all. This did mean that as the "updates"
repo grew, the amount of time it took to release package updates
increased.

Starting with Amazon Linux 2, instead of having each of these GUID repos
have a full copy of the repository, the repodata contains relative paths
over to a central blob store. Thus the only data that needs to be pushed
to release a package update is only the added packages, along with a new
copy of the repo metadata.

However, as of caf28c4, reposync does
not want to write files outside of the destination directory, and this
broke being able to reposync the Amazon Linux 2 style yum repositories.

The reposync shipped in the AL2 repositories carried a patch that would
just rewrite the path it was saving the RPMs to. This did have the
effect of having the --download-metadata option save metadata that still
had the old paths (to the blobstore) and thus the resultant repository
would not be usable - you would have to run createrepo_c.

So this patch implements the ability to fully mirror an Amazon Linux
style repository. When --blobstore is passed, reposync will sync the
full remote paths into the destination directory, thus preserving the
GUID repo, and the blobstore structure, which means the paths in the
metadata (if downloaded with --download-metadata) will work.

Since the synced repo will be under the GUID and not be predictable if
you want to reliably point to the repo as the repo GUID changes each
time there is an update. So, this is where the
new --save-relative-mirrorlist option comes in. It will save a copy of
the mirror list locally, with the URLs trimmed to be relative of the
destination directory rather than absolute URLs. If used with the
corresponding --save-relative-mirrorlist-prefix, you can construct a
usable mirrorlist if you're using reposync to construct an actual
mirror.

Much like the original patch in Amazon Linux 2's reposync, if none of
the new options are specified, just the packages are fetched, and any
fetched metadata will point to the wrong place.

Looks as though https://bugzilla.redhat.com/show_bug.cgi?id=1898089 / #441 is related, although this patch ends up with a bit more functionality to the reposync of such repositories, such as being able to end up with synced metadata that is functional.

stewartsmith and others added 2 commits May 30, 2022 12:18
The yum repository layout for Amazon Linux repositiors (AL1, AL2,
AL2022, and likely future versions) have a slightly interesting layout.

All Amazon Linux releases have a mirrorlist that points to the
repository, but the repository is not in a fixed location, but instead
under a GUID. This means that content can be synced (and staged) before
the atomic (and fast) operation of writing a new mirrorlist makes this
content visible.

For Amazon Linux 1 style repositories this wasn't a problem as each of
these GUID repos that the mirrorlist pointed to was a complete copy of
the repository, packages and all. This did mean that as the "updates"
repo grew, the amount of time it took to release package updates
increased.

Starting with Amazon Linux 2, instead of having each of these GUID repos
have a full copy of the repository, the repodata contains relative paths
over to a central blob store. Thus the only data that needs to be pushed
to release a package update is only the added packages, along with a new
copy of the repo metadata.

However, as of caf28c4, reposync does
not want to write files outside of the destination directory, and this
broke being able to reposync the Amazon Linux 2 style yum repositories.

The reposync shipped in the AL2 repositories carried a patch that would
just rewrite the path it was saving the RPMs to. This did have the
effect of having the --download-metadata option save metadata that still
had the old paths (to the blobstore) and thus the resultant repository
would not be usable - you would *have* to run createrepo_c.

So this patch implements the ability to fully mirror an Amazon Linux
style repository. When --blobstore is passed, reposync will sync the
full remote paths into the destination directory, thus preserving the
GUID repo, and the blobstore structure, which means the paths in the
metadata (if downloaded with --download-metadata) will work.

Since the synced repo will be under the GUID and not be predictable if
you want to reliably point to the repo as the repo GUID changes each
time there is an update. So, this is where the
new --save-relative-mirrorlist option comes in. It will save a copy of
the mirror list locally, with the URLs trimmed to be relative of the
destination directory rather than absolute URLs. If used with the
corresponding --save-relative-mirrorlist-prefix, you can construct a
usable mirrorlist if you're using reposync to construct an actual
mirror.

Much like the original patch in Amazon Linux 2's reposync, if none of
the new options are specified, just the packages are fetched, and any
fetched metadata will point to the wrong place.

= changelog =
msg:           Support syncing Amazon Linux blobstore style repos
type:          enhancement
resolves:      https://bugzilla.redhat.com/show_bug.cgi?id=1898089
@stewartsmith stewartsmith force-pushed the mirror-blobstore-repos branch from 465880a to f9f5a07 Compare May 30, 2022 18:52
@m-blaha m-blaha self-assigned this Jun 15, 2022
if self.opts.delete:
self.delete_old_local_packages(repo, pkglist)
if not gpgcheck_ok:
raise dnf.exceptions.Error(_("GPG signature check failed."))

def repo_target(self, repo):
if self.opts.blobstore:
mirrors = repo._repo.getMirrors()
dest_path = urllib.parse.urlparse(mirrors[0]).path[1:]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repositories are not required to have mirrorlist. They can have metalink or baseurl set. In such case mirrors is an empty tuple and this line raises an IndexError exception.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't (currently at least, and don't have immediate plans to) have a metalink pointer for any of our repos... although I could sync one somewhere, create a metalink and test it that way. I'll give it a go and see if I can suitably test, or at least more gracefully bail out until such a repo exists.

Good point on the baseurl though, as that should probably work here as well, even if not the common path for Amazon Linux repos.


# If we haven't set the blobstore option (i.e. preserve full path)
# and we see references to a file outside our destination,
# then we should just store the package in the target location
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this changes the behavior of the plugin in case --download-metadata switch is used. According to the documentation of the option: "Download all repository metadata. Downloaded copy is instantly usable as a repository, no need to run createrepo_c on it.".
Now (with changed package download location) this is not true any more and user is not even informed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that in that case we should probably error out.

Logic should be something like:

  • if blobstore style repo
    • and --download-metadata but not --blobstore: error out as downloaded metadata will not be usable
    • if --blobstore and --download-metadata: download the metadata and things will work
  • if not blobstore style repo:
    • preserve existing behavior of --download-metadata or not.
      ?

pkg_download_path, repo_target))
repo_target_check = repo_target
if self.opts.blobstore:
repo_target_check = _pkgdir(pkg.repo.id if not self.opts.norepopath else '', '')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for computing repo_target_check value here? Later in code there is no usage for it.
Also these two subsequent conditions if not self.opts.blobstore and if self.opts.blobstore look a bit strange.

pkg_location = os.path.basename(pkg_location)
pkg_download_path = os.path.realpath(
os.path.join(repo_target, pkg_location))

# join() ensures repo_target ends with a path separator (otherwise the
# check would pass if pkg_download_path was a "sibling" path component
# of repo_target that has the same prefix).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commend now lost it's meaning. It should be also moved (together with os.path.join(repo_target, '') move)

action='store_true',
help=_('Save a modified copy of the repo mirror list '
'pointing to the repo data. pointing to the '
'repo downloaded. Useful with --blobstore '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pointing to the repo data. pointing to the repo downloaded. - probably remnant of previous version of the sentence.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh whoops, will fix.

@m-blaha
Copy link
Member

m-blaha commented Jun 15, 2022

@stewartsmith is it possible to post here a concrete example of AL2 repository config file? Just to be sure I've understood this GUID layout correctly.

@stewartsmith
Copy link
Author

Sure, I'll write up something that I can post publicly that has a more detailed explanation of how things work today in AL1/AL2/AL2022 and the motivation behind it, along with how the config files look today for each.

@kyle-walker
Copy link

Just to add a +1 to @m-blaha's line of inquiry. I am not familiar with this style of repository layout, and the behaviour around relative paths were locked down previously due to at least one CVE reported.

What tooling composes those repos, is it custom/internal to Amazon? Is there a design doc that specifies the layout, or is that the intent of the writeup mentioned in #457 (comment) :)

@stewartsmith
Copy link
Author

The write-up I mention in #457 (comment) is to ensure that a write-up makes it out into the open rather than sitting internally to Amazon, which isn't as useful place for it as sitting, well, literally anywhere else :)

@kyle-walker
Copy link

kyle-walker commented Sep 2, 2022

What with CVE-2018-10897, I would very much like to avoid accesses in reposync outside of the directory. If you replace the relative paths in the repository with symlinks, you end up with similar outcomes. Without needing to relax the restrictions, or add additional repository-specific flags, on the client side.

After looking at this problem for a while, I was wondering... Is there any leeway for changing the repository layout?

@mjaitly-amazon
Copy link

@Conan-Kudo
Copy link
Member

I would also prefer Amazon Linux to change its repository layout for similar reason to @kyle-walker. Deliberately creating a configuration that creates CVEs is no fun. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants