Delta updates plan #956

Open · 22 of 26 tasks
AMDmi3 opened this issue Dec 16, 2019 · 5 comments

Comments

AMDmi3 commented Dec 16, 2019

Delta updates have been planned for years, but are still not implemented, as they have proven to be too big a task to do in one go. We need a new plan to implement them gradually, in smaller steps. Here it is:

  • Move database update logic from database_update in repology_update.py into a dedicated module
  • Split the update_finish query into a number of smaller subtasks. This would also make them easier to profile
  • Implement package checksums (a sketch follows this list)
  • Revisit package deduplication to never allow multiple packages with equal checksums
    • Add check against packages with duplicate checksums (when uploading single packageset)
  • Add tracking of packageset checksums in the database
  • Compare packageset checksums against the database and stop touching unchanged packages. This should yield the first performance and disk space gains
  • Start converting derived object generation to deltas (2 top priority items first, then sorted by decreasing update time impact)
    • Problems. This, too, should yield performance gains, as it would eliminate the need to call PackageFormatter from the problems view (not doing it here)
      • Update the view in the webapp (not doing it here)
    • Events/history
      • Allows eliminating the maintainer_repo_metapackages table and the SQL code for its update, and dropping version information from the projects/metapackages table
    • Binding tables for search
    • URL relations
    • Projects (metapackages)
    • Maintainers
    • Links
    • Repositories
    • Global statistics (partial - the important part is to avoid scanning packages table to get the number of packages)
    • Redirects
  • Start to store huge Package fields externally
    • Maintainers. Store array of maintainer IDs instead of array of maintainer names
    • Homepages/Downloads. The same (IDs need to be added to links table first)
    • This also makes it possible to add more fields to Package without storing them in the database, possibly using them in other ways
  • Add foreign key constraints to the database to prevent consistency problems (no need to do it here, or probably at all)
  • Add a way to force hash invalidation (e.g. a way to affect the hashing seed) so all derived objects can be recalculated if requested
  • Reuse remaining big queries for validation (and repair) of aggregate data produced by delta updates
    • Repositories
    • Maintainers
    • Global statistics
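A minimal sketch of what checksum-based change detection could look like; the function names, the JSON serialization and the seed mechanism are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch of package/packageset checksums for delta updates.
import hashlib
import json
from typing import Any, Iterable


def package_checksum(package: dict[str, Any], seed: str = '') -> str:
    # Serialize the package deterministically; changing the seed forces
    # all checksums (and thus all derived objects) to be recalculated.
    payload = json.dumps(package, sort_keys=True, separators=(',', ':'))
    return hashlib.sha256((seed + payload).encode('utf-8')).hexdigest()


def packageset_checksum(package_checksums: Iterable[str]) -> str:
    # Order-independent: the same set of packages always hashes the same.
    combined = '\n'.join(sorted(package_checksums))
    return hashlib.sha256(combined.encode('utf-8')).hexdigest()


def changed_projects(new: dict[str, str], stored: dict[str, str]) -> set[str]:
    # Compare per-project packageset checksums against the values stored in
    # the database; only projects whose checksum differs (or which appeared
    # or disappeared) need to be touched.
    return {
        name
        for name in new.keys() | stored.keys()
        if new.get(name) != stored.get(name)
    }
```

With duplicate checksums forbidden within a packageset (as the deduplication item above requires), equal packageset checksums can be trusted to mean an unchanged project.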
AMDmi3 commented Dec 17, 2019

TODO: decide on handling circular data dependencies and asynchronous updates. For instance, links:

  • since we no longer have reliable last_seen times, there's no way to know how long a link has gone unseen. The solution is either to mark used links with a bulk query once a week or so (see the sketch below), or to keep a reference count
  • problems for links 'dead for a month' are created and removed asynchronously on link checks (when link status updates), in addition to project updates (when links change). The former case can probably also be handled with a bulk query every few days.

Other cases include:

  • related project flags
  • number of maintainers and problems for repository
  • repositories for maintainer
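For the links case above, here is a sketch of the "bulk query once a week or so" option, assuming a hypothetical schema where links has a last_seen column and packages stores homepage/download URLs as text arrays; this is not actual Repology code:

```python
# Periodic (e.g. weekly) bulk refresh of link liveness instead of
# per-update last_seen maintenance; schema is assumed for illustration.
MARK_USED_LINKS = """
    UPDATE links
    SET last_seen = now()
    WHERE EXISTS (
        SELECT 1
        FROM packages
        WHERE links.url = ANY(packages.homepages)
           OR links.url = ANY(packages.downloads)
    )
"""


def mark_used_links(db) -> None:
    # Run from a maintenance task, outside the per-project delta update path;
    # links whose last_seen is old enough can then be treated as unreferenced.
    with db.cursor() as cur:
        cur.execute(MARK_USED_LINKS)
```

The reference-count alternative would instead increment/decrement a counter on each project update, trading the periodic full scan for extra bookkeeping on the hot path.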

AMDmi3 commented Jan 10, 2020

For the record, the first phase of delta updates is currently being deployed. The development resulted in major optimizations in other places, namely in the repoproc deserializer, the repositories table update (a bad SQL execution plan led to 40x overhead, 25% of the overall database update time), and the repo_metapackages table update (thrashing due to inserting unordered items led to excess I/O and several extra minutes per update).

Unfortunately, some inevitable pessimizations were introduced too, which affect update time when everything is updated (such as the first update on an empty database). The extra overhead is about 20%: +2 minutes for hashing (the current package hash implementation, which involves JSON, may be improved) and +2 minutes for extra database queries. Of course, these are outweighed by the partial update performance improvement (~2.5x currently).

AMDmi3 commented Jan 15, 2020

Status update: the first phase of delta updates has brought the update cycle to under 50 minutes (most of the time is spent in fetch and parse).

The next step would be performance testing of the update process, which has to fetch the previous state of packages from the database; there seems to be no way to avoid that if we want to generate history properly. If it turns out to be too slow, stored procedures may be investigated.
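A sketch of fetching the previous state of packages for a batch of changed projects in a single query (hypothetical packages columns, psycopg-style placeholders); the real query and schema may differ:

```python
# Load the old packages for the projects being updated, so that history
# events can be generated by diffing old vs. new state.
FETCH_PREVIOUS_PACKAGES = """
    SELECT effname, version, versionclass
    FROM packages
    WHERE effname = ANY(%(effnames)s)
"""


def fetch_previous_packages(db, effnames: list[str]) -> dict[str, list[tuple[str, int]]]:
    previous: dict[str, list[tuple[str, int]]] = {name: [] for name in effnames}
    with db.cursor() as cur:
        cur.execute(FETCH_PREVIOUS_PACKAGES, {'effnames': list(effnames)})
        for effname, version, versionclass in cur:
            previous[effname].append((version, versionclass))
    return previous
```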

After that, we'll need to convert derived table updates without introducing a huge pessimization when most packages change. There are two distinct kinds of such derived tables:

  • tables indexed by project name (metapackages, *_metapackages)
  • tables indexed by orthogonal objects (maintainers, links)

The former do not impose any limitations and may be updated in parallel with packages, with the same granularity. The latter lead to write amplification and n² patterns when done naively, e.g. when a lot of project updates modify a single maintainer/repository/link.

The most important case is when e.g. maintainer IDs need to be known before updating binding tables or packages themselves (e.g. when we switch to storing maintainer/link IDs instead of verbatim texts).

The solution here is to cache things in memory and avoid updating referenced objects multiple times. That is, store the maintainer-to-id mapping in memory, and use it both to get IDs and to decide whether maintainers need to be created. If it grows too big for memory, we can flush it periodically (completely, or, more optimally, only the least recently used entries).
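A sketch of such an in-memory cache, assuming a hypothetical maintainers table with a unique maintainer column; the upsert and LRU flushing details are illustrative:

```python
from collections import OrderedDict


class MaintainerIdCache:
    """Maps maintainer names to database IDs, creating rows on first use."""

    def __init__(self, db, max_entries: int = 1_000_000) -> None:
        self._db = db
        self._cache: OrderedDict[str, int] = OrderedDict()
        self._max_entries = max_entries

    def get_id(self, name: str) -> int:
        if name in self._cache:
            self._cache.move_to_end(name)  # refresh LRU position
            return self._cache[name]

        with self._db.cursor() as cur:
            # Create the maintainer on first encounter, otherwise return the
            # existing id (upsert so RETURNING works in both cases).
            cur.execute(
                """
                INSERT INTO maintainers (maintainer) VALUES (%(name)s)
                ON CONFLICT (maintainer)
                DO UPDATE SET maintainer = excluded.maintainer
                RETURNING id
                """,
                {'name': name},
            )
            maintainer_id = cur.fetchone()[0]

        self._cache[name] = maintainer_id
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # drop least recently used entry
        return maintainer_id
```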

Regarding tables indexed by project name, the important thing is not to duplicate code between bulk and discrete updates. This may be achieved with a temporary table which lives for the duration of the update transaction and holds the updated project names. With it we can use the same bulk queries, but limit them to a subset of projects; if the subset is big enough, it may be ignored and a full bulk update used.
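A sketch of the temporary-table approach; the repo_metapackages columns and the changed_projects helper table are illustrative assumptions (deletion of rows for vanished projects is omitted for brevity):

```python
# The same bulk query serves full and partial updates: it is always limited
# by the transaction-scoped changed_projects table, which simply contains
# all projects during a full update.
CREATE_CHANGED_PROJECTS = """
    CREATE TEMPORARY TABLE changed_projects (effname text PRIMARY KEY)
    ON COMMIT DROP
"""

UPDATE_REPO_METAPACKAGES = """
    INSERT INTO repo_metapackages (repository_id, effname, num_packages)
    SELECT repository_id, effname, count(*)
    FROM packages
    WHERE effname IN (SELECT effname FROM changed_projects)
    GROUP BY repository_id, effname
    ON CONFLICT (repository_id, effname)
    DO UPDATE SET num_packages = excluded.num_packages
"""


def update_repo_metapackages(db, changed: list[str]) -> None:
    with db.cursor() as cur:
        cur.execute(CREATE_CHANGED_PROJECTS)
        cur.executemany(
            'INSERT INTO changed_projects (effname) VALUES (%s)',
            [(name,) for name in changed],
        )
        cur.execute(UPDATE_REPO_METAPACKAGES)
```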

AMDmi3 commented Jan 29, 2020

Binding tables done: -30% database update time. The next big time consumer is url_relations. There are ways to optimize both url_relations table construction and the updating of the has_related flag in metapackages.

By the way, the metapackages table is around 4x its minimal size (as can be seen after a vacuum), caused by multiple consecutive complete updates, which leave N dead copies of each tuple. Apart from only updating affected projects, another optimization is to add conditions to UPDATE queries so they do not needlessly rewrite every row on each pass.
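For example, a guarded UPDATE along these lines (illustrative column names) would skip rows whose flag already has the right value instead of rewriting them and leaving dead tuples behind:

```python
# Only touch rows that actually need the flag flipped; the symmetric query
# clearing the flag would be guarded the same way.
SET_HAS_RELATED = """
    UPDATE metapackages
    SET has_related = true
    WHERE effname IN (SELECT effname FROM url_relations)
      AND NOT has_related
"""
```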

Another shortcoming of the current database schema is the use of last_seen fields (in metapackages, maintainers, links). For projects and maintainers, there should instead be an orphaned_at field, set when all related packages disappear. For links, there could be a similar flag, updated weekly by discovering all unreferenced links. After some time, these could be removed.
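A sketch of the orphaned_at approach for maintainers (the same pattern would apply to projects and links); the column names and the retention interval are assumptions:

```python
# Mark maintainers that no longer have any packages, then remove them after
# a grace period; both queries are meant for a periodic maintenance task.
# A maintainer that reappears would have orphaned_at reset by the regular
# update path.
MARK_ORPHANED_MAINTAINERS = """
    UPDATE maintainers
    SET orphaned_at = now()
    WHERE orphaned_at IS NULL
      AND NOT EXISTS (
          SELECT 1
          FROM packages
          WHERE maintainers.maintainer = ANY(packages.maintainers)
      )
"""

REMOVE_STALE_MAINTAINERS = """
    DELETE FROM maintainers
    WHERE orphaned_at < now() - INTERVAL '1 month'
"""
```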

AMDmi3 added a commit that referenced this issue Feb 10, 2020
Notably, the mechanism for cleanup of unreferenced links was changed:
there's now a separate query to discover and mark orphaned links,
intended to be run weekly or so. There's no mechanism to run it
automatically yet (see #551), so for now it can be run manually.
AMDmi3 added a commit that referenced this issue Feb 10, 2020
New events are generated into a parallel table for testing; in order to
switch over completely we need a way to get a reliable release time for a
version.
AMDmi3 added a commit that referenced this issue Feb 13, 2020
AMDmi3 added a commit that referenced this issue Feb 25, 2021
- Store links as json
- Implement new update steps
  - Create link entries for new links (creation of maintainers and
    projects will be moved here as well)
  - Add a transformation step for incoming packages, which converts
    link urls to corresponding link ids

This implements some remaining parts of #956.