Delta updates plan #956
TODO: decide on handling circular data dependencies and asynchronous updates. For instance, links:
Other cases include:
For the record, the first phase of delta updates is currently being deployed. The development resulted in major optimizations in other places, namely in the repoproc deserializer. Unfortunately, some inevitable pessimizations were introduced too, which affect update time when everything is updated (such as the first update on an empty database). The extra overhead is about 20%: +2 minutes for hashing (the current package hash implementation, which involves JSON, may be improved) and +2 minutes for extra database queries. Of course, these are outweighed by the partial update performance improvement (~2.5x currently).
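The hashing overhead mentioned above comes from serializing each package before digesting it. As a rough illustration only (the real Package class and its field set are not shown here, so the names below are assumptions), a JSON-based package hash might look like this:

```python
import hashlib
import json


def package_hash(package: dict) -> str:
    """Hypothetical sketch: hash a package by serializing its fields to JSON.

    Stable key order and compact separators keep the digest deterministic;
    the field names used in the example are illustrative, not the actual schema.
    """
    serialized = json.dumps(package, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    pkg = {"name": "zsh", "version": "5.9", "maintainers": ["foo@example.com"]}
    print(package_hash(pkg))
```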
Status update: the first phase of delta updates allowed the update cycle to complete in under 50 minutes (most of the time is spent in fetch and parse). The next step would be performance testing of the update process, which has to fetch the previous state of packages from the database; there seems to be no way to avoid this if we want to generate history properly. If it turns out to be too slow, stored procedures may be investigated. After that, we'll need to convert derived table updates as well, without introducing a huge pessimization when most packages change. There are two distinct kinds of such derived tables: tables indexed by project name, and tables indexed by other entities such as maintainers, repositories and links.
The former do not impose any limitations and may be updated in parallel with packages, with the same granularity. The latter lead to write amplification and n² patterns when done naively, e.g. when a lot of project updates each modify a single maintainer/repository/link. The most important case is when maintainer IDs need to be known before updating binding tables or the packages themselves (e.g. when we switch to storing maintainer/link IDs instead of verbatim texts). The solution here is to cache stuff in memory and avoid updating referenced objects multiple times. That is, store the maintainer-to-ID mapping in memory, and use it both to look up IDs and to decide whether maintainers need to be created. If it's too big to fit in memory, we can flush it periodically (completely, or more optimally only the least recently used entries). Regarding tables indexed by project name, the important thing is not to duplicate code for bulk and discrete updates. This may be achieved by using a temporary table which lives for the duration of the update transaction and holds the updated project names. With it we can use the same bulk queries, but limit them to a subset of projects. If the subset is big enough, it may be ignored and a full bulk update used.
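A minimal sketch of the two ideas from the previous paragraph, assuming illustrative table and column names (maintainers, changed_projects and effname are placeholders, not necessarily the real schema): an in-memory maintainer-to-ID cache with a crude flush, and a transaction-scoped temporary table of updated project names that bulk queries can join against.

```python
# Sketch only: assumes a PostgreSQL DB-API cursor (e.g. psycopg2) and
# illustrative table/column names.


class MaintainerCache:
    def __init__(self, cursor, max_size: int = 1_000_000) -> None:
        self._cursor = cursor
        self._ids: dict[str, int] = {}
        self._max_size = max_size

    def get_id(self, name: str) -> int:
        if name not in self._ids:
            # create-if-missing, touching each maintainer row at most once per run
            self._cursor.execute(
                "INSERT INTO maintainers(name) VALUES (%s) "
                "ON CONFLICT (name) DO UPDATE SET name = excluded.name "
                "RETURNING id",
                (name,),
            )
            if len(self._ids) >= self._max_size:
                self._ids.clear()  # crude flush; an LRU would evict selectively
            self._ids[name] = self._cursor.fetchone()[0]
        return self._ids[name]


def prepare_changed_projects(cursor, names: list[str]) -> None:
    # temporary table living for the duration of the update transaction;
    # derived-table updates join against it instead of scanning everything
    cursor.execute(
        "CREATE TEMPORARY TABLE changed_projects(effname text PRIMARY KEY) ON COMMIT DROP"
    )
    cursor.executemany(
        "INSERT INTO changed_projects(effname) VALUES (%s)", [(n,) for n in names]
    )
```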
Binding tables done, -30% database update time. The next big time consumer is the metapackages table update. By the way, the metapackages table is around 4x its minimal size (as can be seen after vacuum), which is caused by multiple consecutive complete updates, each leaving N dead copies of every tuple. Apart from only updating affected projects, an optimization is to introduce conditions into the update queries. There is also a shortcoming in the current database schema here.
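One way such conditions could look (this is an assumption about the intent, not the actual queries): guard the UPDATE so rows whose computed value is unchanged are not rewritten, which avoids producing a dead tuple per untouched row. Table and column names below are illustrative.

```python
# Hedged example: only rewrite metapackages rows whose package count
# actually changed; schema names are placeholders, not the real one.
UPDATE_NUM_PACKAGES = """
    UPDATE metapackages m
    SET num_packages = c.num_packages
    FROM (
        SELECT effname, count(*) AS num_packages
        FROM packages
        GROUP BY effname
    ) c
    WHERE m.effname = c.effname
      AND m.num_packages IS DISTINCT FROM c.num_packages  -- skip unchanged rows
"""
```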
Notably, the mechanism for cleanup of unreferenced links was changed: there is now a separate query to discover and mark orphaned links, intended to be run weekly or so. There's no mechanism to run it automatically yet (see #551), so for now it can be run manually.
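A possible shape for the orphaned-links query described above, under an assumed schema where packages reference links through a package_links junction table and links carry an orphaned_since timestamp (the real schema may differ):

```python
# Sketch of the periodic cleanup described above; schema names are assumptions.
MARK_ORPHANED_LINKS = """
    UPDATE links
    SET orphaned_since = now()
    WHERE orphaned_since IS NULL
      AND NOT EXISTS (
          SELECT 1 FROM package_links pl WHERE pl.link_id = links.id
      )
"""


def run_weekly_cleanup(cursor) -> None:
    cursor.execute(MARK_ORPHANED_LINKS)
```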
New events are generated into a parallel table for testing; in order to switch completely, we need a way to get a reliable release time for a version.
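To illustrate how history events could be produced by comparing the previously stored state with the newly parsed one (as discussed in the status update above), here is a hedged sketch; the table names, the effname column and the event vocabulary are assumptions for illustration only:

```python
# Assumes a DB-API cursor with psycopg2-style %s placeholders.
from typing import Iterable


def diff_versions(old_versions: Iterable[str], new_versions: Iterable[str]) -> list[tuple[str, str]]:
    """Return (event, version) pairs describing what changed between states."""
    old, new = set(old_versions), set(new_versions)
    events = [("version_added", v) for v in sorted(new - old)]
    events += [("version_removed", v) for v in sorted(old - new)]
    return events


def generate_history(cursor, project: str, new_versions: list[str]) -> None:
    # fetch the previous state of the project's packages from the database
    cursor.execute("SELECT version FROM packages WHERE effname = %s", (project,))
    old_versions = [row[0] for row in cursor.fetchall()]
    for event, version in diff_versions(old_versions, new_versions):
        cursor.execute(
            "INSERT INTO project_events(effname, event, version) VALUES (%s, %s, %s)",
            (project, event, version),
        )
```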
- Store links as JSON
- Implement new update steps
- Create link entries for new links (creation of maintainers and projects will be moved here as well)
- Add a transformation step for incoming packages which converts link URLs to corresponding link ids (see the sketch below)

This implements some remaining parts of #956.
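A sketch of what that URL-to-ID transformation step could look like, assuming a DB-API cursor, a links table with a unique url column, and packages represented as plain dicts (all of these are illustrative, not the actual Package model):

```python
# Create link entries for unseen URLs and replace each package's link URLs
# with link ids; names are assumptions for illustration.


def resolve_link_ids(cursor, packages: list[dict]) -> None:
    url_to_id: dict[str, int] = {}

    for package in packages:
        ids = []
        for url in package.get("link_urls", []):
            if url not in url_to_id:
                cursor.execute(
                    "INSERT INTO links(url) VALUES (%s) "
                    "ON CONFLICT (url) DO UPDATE SET url = excluded.url "
                    "RETURNING id",
                    (url,),
                )
                url_to_id[url] = cursor.fetchone()[0]
            ids.append(url_to_id[url])
        package["link_ids"] = ids
```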
Delta updates have been planned for years, but are still not implemented, as it has proven to be too huge a task to do in one go. We need a new plan to gradually implement it in smaller steps. Here it is:
- Move `database_update` in repology_update.py into a dedicated module
- Split the `update_finish` query into a lot of smaller subtasks; this would also help to profile them (see the sketch after this list)
- This should also yield performance gains, as it would eliminate the need to call PackageFormatter from the problems view (not doing it here)
- Update the view in webapp (not doing it here)
- `maintainer_repo_metapackages` table and SQL code for its update, and do drop version information from the projects/metapackages table (count the `packages` table to get the number of packages)
- `links` table first)
- Store name changes (Introduce name variation registry #815) the same way
- Add foreign key constraints to the database to prevent consistency problems (no need to do it here, or probably at all)
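For the `update_finish` item above, a hedged sketch of how a monolithic finalization query could be split into separately timed subtasks; the step names and query constants are hypothetical:

```python
import time
from typing import Callable


def run_update_finish_steps(steps: list[tuple[str, Callable[[], None]]]) -> None:
    """Run each finalization subtask separately and report how long it took."""
    for name, step in steps:
        start = time.monotonic()
        step()
        print(f"{name}: {time.monotonic() - start:.2f}s")


# usage sketch: each callable would execute one of the smaller queries split
# out of the former monolithic update_finish query (queries not shown here)
# run_update_finish_steps([
#     ("update_metapackages", lambda: cursor.execute(UPDATE_METAPACKAGES)),
#     ("update_statistics", lambda: cursor.execute(UPDATE_STATISTICS)),
# ])
```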