Delta updates plan #956

Open · 22 of 26 tasks
AMDmi3 opened this issue Dec 16, 2019 · 5 comments

Comments

AMDmi3 commented Dec 16, 2019

Delta updates have been planned for years, but are still not implemented, as they have proven to be too big a task to do in one go. We need a new plan to implement them gradually, in smaller steps. Here it is:

  • Move database update logic from database_update in repology_update.py into a dedicated module
  • Split the update_finish query into a number of smaller subtasks. This would also make them easier to profile
  • Implement package checksums (a sketch follows this list)
  • Revisit package deduplication to never allow multiple packages with equal checksums
    • Add check against packages with duplicate checksums (when uploading single packageset)
  • Add tracking of packageset checksums in the database
  • Compare packageset checksums against the database and stop touching unchanged packages. This should yield the first performance and disk space gains
  • Start converting derived object generation to deltas (2 top priority items first, then sorted by decreasing update time impact)
    • Problems. This, too, should yield performance gains, as it would eliminate the need to call PackageFormatter from the problems view (not doing it here)
      • Update the view in the webapp (not doing it here)
    • Events/history
      • Allows eliminating the maintainer_repo_metapackages table and the SQL code for its update, and dropping version information from the projects/metapackages table
    • Binding tables for search
    • URL relations
    • Projects (metapackages)
    • Maintainers
    • Links
    • Repositories
    • Global statistics (partial - the important part is to avoid scanning packages table to get the number of packages)
    • Redirects
  • Start to store huge Package fields externally
    • Maintainers. Store array of maintainer IDs instead of array of maintainer names
    • Homepages/Downloads. The same (IDs need to be added to links table first)
    • This also makes it possible to add more fields to Package without storing them in the database, possibly using them in other ways
  • Add foreign key constraints to the database to prevent consistency problems (no need to do it here, or probably at all)
  • Add a way to force hash invalidation (e.g. a way to affect the hashing seed) so all derived objects can be recalculated if requested
  • Reuse remaining big queries for validation (and repair) of aggregate data produced by delta updates
    • Repositories
    • Maintainers
    • Global statistics
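A minimal sketch of what checksum-based change detection could look like; the function names, the JSON serialization and the seed mechanism are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of package/packageset checksums for delta updates.
import hashlib
import json
from typing import Any, Iterable


def package_checksum(package: dict[str, Any], seed: str = '') -> str:
    # Serialize the package deterministically; changing the seed forces
    # all checksums (and thus all derived objects) to be recalculated.
    payload = json.dumps(package, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256((seed + payload).encode('utf-8')).hexdigest()


def packageset_checksum(package_checksums: Iterable[str]) -> str:
    # Order-independent: the same set of packages always hashes the same.
    combined = '\n'.join(sorted(package_checksums))
    return hashlib.sha256(combined.encode('utf-8')).hexdigest()


def changed_projects(new: dict[str, str], stored: dict[str, str]) -> set[str]:
    # Compare per-project packageset checksums against the values stored in
    # the database; only projects whose checksum differs (or which appeared
    # or disappeared) need to be touched.
    return {
        name
        for name in new.keys() | stored.keys()
        if new.get(name) != stored.get(name)
    }
```

With duplicate checksums forbidden within a packageset (as the deduplication item above requires), equal packageset checksums can be trusted to mean an unchanged project.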
AMDmi3 commented Dec 17, 2019

TODO: decide on handling circular data dependencies and asynchronous updates. For instance, links:

  • since we no longer have reliable last_seen times, there's no way to know how long a link has gone unseen. The solution is either to mark used links with a bulk query once a week or so (see the sketch below), or to keep a reference count
  • problems for links 'dead for a month' are created and removed asynchronously on link checks (when link status updates), in addition to project updates (when links change). The former case can probably also be handled with a bulk query every few days.

Other cases include:

  • related project flags
  • number of maintainers and problems for repository
  • repositories for maintainer
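For the links case above, here is a sketch of the "bulk query once a week or so" option, assuming a hypothetical schema where links has a last_seen column and packages stores homepage/download URLs as text arrays; this is not actual Repology code:

```python
# Periodic (e.g. weekly) bulk refresh of link liveness instead of
# per-update last_seen maintenance; schema is assumed for illustration.
MARK_USED_LINKS = """
    UPDATE links
    SET last_seen = now()
    WHERE EXISTS (
        SELECT 1
        FROM packages
        WHERE links.url = ANY(packages.homepages)
           OR links.url = ANY(packages.downloads)
    )
"""


def mark_used_links(db) -> None:
    # Run from a maintenance task, outside the per-project delta update path;
    # links whose last_seen is old enough can then be treated as unreferenced.
    with db.cursor() as cur:
        cur.execute(MARK_USED_LINKS)
```

The reference-count alternative would instead increment/decrement a counter on each project update, trading the periodic full scan for extra bookkeeping on the hot path.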

AMDmi3 commented Jan 10, 2020

For the record, the first phase of delta updates is currently being deployed. The development resulted in major optimizations in other places, namely in the repoproc deserializer, the repositories table update (a bad SQL execution plan led to 40x overhead, 25% of the overall database update time), and the repo_metapackages table update (thrashing due to inserting unordered items led to excess I/O and several extra minutes per update).

Unfortunately, some inevitable pessimizations were introduced too, which affect update time when everything is updated (such as the first update on an empty database). The extra overhead is about 20%: +2 minutes for hashing (the current package hash implementation, which involves JSON, may be improved) and +2 minutes for extra database queries. Of course, these are outweighed by the partial update performance improvement (~2.5x currently).

AMDmi3 commented Jan 15, 2020

Status update: the first phase of delta updates has brought the update cycle to under 50 minutes (most of the time is spent in fetch and parse).

The next step would be performance testing of the update process, which has to fetch the previous state of packages from the database; there seems to be no way to avoid that if we want to generate history properly. If it turns out to be too slow, stored procedures may be investigated.
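A sketch of fetching the previous state of packages for a batch of changed projects in a single query (hypothetical packages columns, psycopg-style placeholders); the real query and schema may differ:

```python
# Load the old packages for the projects being updated, so that history
# events can be generated by diffing old vs. new state.
FETCH_PREVIOUS_PACKAGES = """
    SELECT effname, version, versionclass
    FROM packages
    WHERE effname = ANY(%(effnames)s)
"""


def fetch_previous_packages(db, effnames: list[str]) -> dict[str, list[tuple[str, int]]]:
    previous: dict[str, list[tuple[str, int]]] = {name: [] for name in effnames}
    with db.cursor() as cur:
        cur.execute(FETCH_PREVIOUS_PACKAGES, {'effnames': list(effnames)})
        for effname, version, versionclass in cur:
            previous[effname].append((version, versionclass))
    return previous
```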

After that, we'll need to convert derived table updates without introducing a huge pessimization when most packages change. There are two distinct kinds of such derived tables:

  • tables indexed by project name (metapackages, *_metapackages)
  • tables indexed by orthogonal objects (maintainers, links)

The former do not impose any limitations and may be updated in parallel with packages, with the same granularity. The latter lead to write amplification and n² patterns when done naively, e.g. when a lot of project updates modify a single maintainer/repository/link.

The most important case is when e.g. maintainer IDs need to be known before updating binding tables or packages themselves (e.g. when we switch to storing maintainer/link IDs instead of verbatim texts).

The solution here is to cache things in memory and avoid updating referenced objects multiple times. That is, store the maintainer-to-id mapping in memory, and use it both to get IDs and to decide whether maintainers need to be created. If it grows too big for memory, we can flush it periodically (completely, or, more optimally, only the least recently used entries).
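A sketch of such an in-memory cache, assuming a hypothetical maintainers table with a unique maintainer column; the upsert and LRU flushing details are illustrative:

```python
from collections import OrderedDict


class MaintainerIdCache:
    """Maps maintainer names to database IDs, creating rows on first use."""

    def __init__(self, db, max_entries: int = 1_000_000) -> None:
        self._db = db
        self._cache: OrderedDict[str, int] = OrderedDict()
        self._max_entries = max_entries

    def get_id(self, name: str) -> int:
        if name in self._cache:
            self._cache.move_to_end(name)  # refresh LRU position
            return self._cache[name]

        with self._db.cursor() as cur:
            # Create the maintainer on first encounter, otherwise return the
            # existing id (upsert so RETURNING works in both cases).
            cur.execute(
                """
                INSERT INTO maintainers (maintainer) VALUES (%(name)s)
                ON CONFLICT (maintainer)
                DO UPDATE SET maintainer = excluded.maintainer
                RETURNING id
                """,
                {'name': name},
            )
            maintainer_id = cur.fetchone()[0]

        self._cache[name] = maintainer_id
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # drop least recently used entry
        return maintainer_id
```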

Regarding tables indexed by project name, the important thing is not to duplicate code between bulk and discrete updates. This may be achieved with a temporary table which lives for the duration of the update transaction and holds the updated project names. With it we can use the same bulk queries, but limit them to a subset of projects; if the subset is big enough, it may be ignored and a full bulk update used.
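A sketch of the temporary-table approach; the repo_metapackages columns and the changed_projects helper table are illustrative assumptions (deletion of rows for vanished projects is omitted for brevity):

```python
# The same bulk query serves full and partial updates: it is always limited
# by the transaction-scoped changed_projects table, which simply contains
# all projects during a full update.
CREATE_CHANGED_PROJECTS = """
    CREATE TEMPORARY TABLE changed_projects (effname text PRIMARY KEY)
    ON COMMIT DROP
"""

UPDATE_REPO_METAPACKAGES = """
    INSERT INTO repo_metapackages (repository_id, effname, num_packages)
    SELECT repository_id, effname, count(*)
    FROM packages
    WHERE effname IN (SELECT effname FROM changed_projects)
    GROUP BY repository_id, effname
    ON CONFLICT (repository_id, effname)
    DO UPDATE SET num_packages = excluded.num_packages
"""


def update_repo_metapackages(db, changed: list[str]) -> None:
    with db.cursor() as cur:
        cur.execute(CREATE_CHANGED_PROJECTS)
        cur.executemany(
            'INSERT INTO changed_projects (effname) VALUES (%s)',
            [(name,) for name in changed],
        )
        cur.execute(UPDATE_REPO_METAPACKAGES)
```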

AMDmi3 commented Jan 29, 2020

Binding tables done: -30% database update time. The next big time consumer is url_relations. There are ways to optimize both url_relations table construction and the updating of the has_related flag in metapackages.

By the way, the metapackages table is around 4x its minimal size (as can be seen after a vacuum), caused by multiple consecutive complete updates, which leave N dead copies of each tuple. Apart from only updating affected projects, another optimization is to add conditions to UPDATE queries so they do not needlessly rewrite every row on each pass.
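For example, a guarded UPDATE along these lines (illustrative column names) would skip rows whose flag already has the right value instead of rewriting them and leaving dead tuples behind:

```python
# Only touch rows that actually need the flag flipped; the symmetric query
# clearing the flag would be guarded the same way.
SET_HAS_RELATED = """
    UPDATE metapackages
    SET has_related = true
    WHERE effname IN (SELECT effname FROM url_relations)
      AND NOT has_related
"""
```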

Another shortcoming of the current database schema is the use of last_seen fields (in metapackages, maintainers, links). For projects and maintainers, there should instead be an orphaned_at field, set when all related packages disappear. For links, there could be a similar flag, updated weekly by discovering all unreferenced links. After some time, these could be removed.
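A sketch of the orphaned_at approach for maintainers (the same pattern would apply to projects and links); the column names and the retention interval are assumptions:

```python
# Mark maintainers that no longer have any packages, then remove them after
# a grace period; both queries are meant for a periodic maintenance task.
# A maintainer that reappears would have orphaned_at reset by the regular
# update path.
MARK_ORPHANED_MAINTAINERS = """
    UPDATE maintainers
    SET orphaned_at = now()
    WHERE orphaned_at IS NULL
      AND NOT EXISTS (
          SELECT 1
          FROM packages
          WHERE maintainers.maintainer = ANY(packages.maintainers)
      )
"""

REMOVE_STALE_MAINTAINERS = """
    DELETE FROM maintainers
    WHERE orphaned_at < now() - INTERVAL '1 month'
"""
```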

AMDmi3 added a commit that referenced this issue Feb 10, 2020
Notably, the mechanism for cleanup of unreferenced links was changed:
there's now a separate query to discover and mark orphaned links,
intended to be run weekly or so. There's no mechanism to run it
automatically yet (see #551), so for now it can be run manually.
AMDmi3 added a commit that referenced this issue Feb 10, 2020
New events are generated into a parallel table for testing; in order to
switch over completely we need a way to get a reliable release time for a
version.
AMDmi3 added a commit that referenced this issue Feb 13, 2020
AMDmi3 added a commit that referenced this issue Feb 25, 2021
- Store links as json
- Implement new update steps
  - Create link entries for new links (creation of maintainers and
    projects will be moved here as well)
  - Add a transformation step for incoming packages, which converts
    link urls to corresponding link ids

This implements some remaining parts of #956.