feat: archive git repository (experimental)

See doc/git-archive.md for general Git archive specifications
See doc/repos/metadata-repo.md for info and direction related to the new Git metadata archive
Kevin Morris 2022-09-24 16:51:25 +00:00
parent ec3152014b
commit 30e72d2db5
34 changed files with 1104 additions and 50 deletions

doc/git-archive.md Normal file

@ -0,0 +1,75 @@
# aurweb Git Archive Specification
<span style="color: red">
WARNING: This aurweb Git Archive implementation is
experimental and may be changed.
</span>
## Overview
This git archive specification refers to the archive git repositories
created by [aurweb/scripts/git_archive.py](aurweb/scripts/git_archive.py)
using [spec modules](#spec-modules).
## Configuration
- `[git-archive]`
    - `author`
        - Git commit author
    - `author-email`
        - Git commit author email
See an [official spec](#official-specs)'s documentation for spec-specific
configurations.
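As an illustration only, the options above could be read with Python's standard `configparser` as sketched below. The section and option names come from this document; the author name and email are placeholder values, and `read_git_archive_author` is a hypothetical helper, not part of aurweb.

```python
import configparser

# Placeholder values; only the section and option names come from this document.
EXAMPLE_CONFIG = """
[git-archive]
author = Example Archiver
author-email = archiver@example.org
"""


def read_git_archive_author(text: str) -> tuple[str, str]:
    """Return (author, author-email) from the [git-archive] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["git-archive"]
    return section["author"], section["author-email"]


print(read_git_archive_author(EXAMPLE_CONFIG))
```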
## Fetch/Update Archives
When a client has not yet fetched any initial archives, they should clone
the repository:

    $ git clone https://aur.archlinux.org/archive.git aurweb-archive

When updating, the repository is already cloned and changes need to be pulled
from remote:

    # To update:
    $ cd aurweb-archive && git pull

For end-user production applications, see
[Minimize Disk Space](#minimize-disk-space).
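A client could automate the clone-or-pull decision roughly as follows. This is a minimal sketch, assuming `git` is on the `PATH`; the helper names `sync_command` and `sync_archive` are hypothetical, and only the repository URL comes from this document.

```python
import subprocess
from pathlib import Path

ARCHIVE_URL = "https://aur.archlinux.org/archive.git"  # URL from this document


def sync_command(dest: Path) -> list[str]:
    """Build the git command: clone on first use, pull afterwards."""
    if (dest / ".git").is_dir():
        return ["git", "-C", str(dest), "pull"]
    return ["git", "clone", ARCHIVE_URL, str(dest)]


def sync_archive(dest: Path) -> None:
    """Run the command; raises CalledProcessError if git fails."""
    subprocess.run(sync_command(dest), check=True)
```

Calling `sync_archive(Path("aurweb-archive"))` would then clone on the first run and pull on later runs.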
## Minimize Disk Space
Running `git gc` on the repository compresses revisions and removes
unreachable objects, which otherwise grow the repository considerably
with each commit. It is recommended to run the following command
after cloning the archive or pulling updates:

    $ cd aurweb-archive && git gc --aggressive

## Spec Modules
Each aurweb spec module belongs to the `aurweb.archives.spec` package. For
example: a spec named "example" would be located at
`aurweb.archives.spec.example`.
[Official spec listings](#official-specs) use the following format:
- `spec_name`
    - Spec description; what this spec produces
    - `<link to repo documentation>`
### Official Specs
- [metadata](doc/specs/metadata.md)
    - Package RPC `type=info` metadata
    - [metadata-repo](repos/metadata-repo.md)
- [users](doc/specs/users.md)
    - List of users found in the database
    - [users-repo](repos/users-repo.md)
- [pkgbases](doc/specs/pkgbases.md)
    - List of package bases found in the database
    - [pkgbases-repo](repos/pkgbases-repo.md)
- [pkgnames](doc/specs/pkgnames.md)
    - List of package names found in the database
    - [pkgnames-repo](repos/pkgnames-repo.md)


@ -70,20 +70,48 @@ computations and clean up the database:
* aurweb-pkgmaint automatically removes empty repositories that were created
within the last 24 hours but never populated.
* aurweb-mkpkglists generates the package list files; it takes an optional
--extended flag, which additionally produces multiinfo metadata. It also
generates {archive.gz}.sha256 files that should be located within
* [Deprecated] aurweb-mkpkglists generates the package list files; it takes
an optional --extended flag, which additionally produces multiinfo metadata.
It also generates {archive.gz}.sha256 files that should be located within
mkpkglists.archivedir which contain a SHA-256 hash of their matching
.gz counterpart.
* aurweb-usermaint removes the last login IP address of all users that did not
log in within the past seven days.
* aurweb-git-archive generates Git repository archives based on a --spec.
This script is a new generation of aurweb-mkpkglists, which creates and
maintains Git repository versions of the archives produced by
aurweb-mkpkglists. See doc/git-archive.md for detailed documentation.
These scripts can be installed by running `poetry install` and are
usually scheduled using Cron. The current setup is:
----
*/5 * * * * poetry run aurweb-mkpkglists [--extended]
# Run aurweb-git-archive --spec metadata directly after
# aurweb-mkpkglists so that they are executed sequentially, since
# both scripts are quite heavy. `aurweb-mkpkglists` should be removed
# from here once its deprecation period has ended.
*/5 * * * * poetry run aurweb-mkpkglists [--extended] && poetry run aurweb-git-archive --spec metadata
# Update popularity once an hour. This is done to reduce the amount
# of changes caused by popularity data. Even if a package is otherwise
# unchanged, popularity is recalculated every 5 minutes via aurweb-popupdate,
# which causes changes for a large chunk of packages.
#
# At this interval, clients can still take advantage of popularity
# data, but its updates are guarded behind hour-long intervals.
*/60 * * * * poetry run aurweb-git-archive --spec popularity
# Usernames
*/5 * * * * poetry run aurweb-git-archive --spec users
# Package base names
*/5 * * * * poetry run aurweb-git-archive --spec pkgbases
# Package names
*/5 * * * * poetry run aurweb-git-archive --spec pkgnames
1 */2 * * * poetry run aurweb-popupdate
2 */2 * * * poetry run aurweb-aurblup
3 */2 * * * poetry run aurweb-pkgmaint

doc/repos/metadata-repo.md Normal file

@ -0,0 +1,121 @@
# Repository: metadata-repo
## Overview
The resulting repository contains RPC `type=info` JSON data for packages,
split into two different files:
- `pkgbase.json` contains details about each package base in the AUR
- `pkgname.json` contains details about each package in the AUR
See [Data](#data) for a breakdown of how data is presented in this
repository, based on an RPC `type=info` response.
See [File Layout](#file-layout) for a detailed summary of the layout
of these files and the data contained within.
**NOTE: `Popularity` now requires a client-side calculation, see [Popularity Calculation](#popularity-calculation).**
## Data
This repository contains RPC `type=info` data for all packages found
in AUR's database, reorganized to be suitable for Git repository
changes.
- `pkgname.json` holds Package-specific metadata
    - Some fields have been removed from `pkgname.json` objects
        - `ID`
        - `PackageBaseID -> ID` (moved to `pkgbase.json`)
        - `NumVotes` (moved to `pkgbase.json`)
        - `Popularity` (moved to `pkgbase.json`)
- `pkgbase.json` holds PackageBase-specific metadata
    - Package Base fields from `pkgname.json` have been moved over to
      `pkgbase.json`
        - `ID`
        - `Keywords`
        - `FirstSubmitted`
        - `LastModified`
        - `OutOfDate`
        - `Maintainer`
        - `URLPath`
        - `NumVotes`
        - `Popularity`
        - `PopularityUpdated`
## Popularity Calculation
Clients intending to use popularity data from this archive **must**
perform a decay calculation on their end to reflect a close approximation
of up-to-date popularity.
Putting this step onto the client allows the server to perform
fewer popularity record updates, which dramatically improves archiving
of popularity data. The same calculation is done on the server side
when producing outputs for RPC `type=info` and package pages.
```
Let T = Current UTC timestamp in seconds
Let PU = PopularityUpdated timestamp in seconds
# The delta between now and PU in days
Let D = (T - PU) / 86400
# Calculate up-to-date popularity:
P = Popularity * (0.98^D)
```
We can see that the resulting up-to-date popularity value decays as
the exponent is increased:
- `1.0 * (0.98^1) = 0.98`
- `1.0 * (0.98^2) = 0.9604`
- ...
This decay calculation effectively ages the recorded votes by `D` days,
which accounts for the time elapsed since `PopularityUpdated`. However,
because the calculation relies on decimal (floating-point) values and
exponents, it eventually becomes imprecise. To keep this imprecision from
becoming an issue for clients, the AUR updates these records on a forced
interval and whenever a vote is added to or removed from a particular package.
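The decay formula in this section can be sketched in Python as follows. The function name `current_popularity` and its `now` parameter are illustrative choices, not part of aurweb; the constants `0.98` and `86400` come from the formula in this section.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400
DECAY_BASE = 0.98  # daily decay factor from the formula in this section


def current_popularity(
    popularity: float,
    popularity_updated: float,
    now: Optional[float] = None,
) -> float:
    """Apply P = Popularity * (0.98^D), where D is the age of the record in days."""
    if now is None:
        now = time.time()  # current UTC timestamp in seconds
    days = (now - popularity_updated) / SECONDS_PER_DAY
    return popularity * DECAY_BASE ** days
```

Passing `now` explicitly keeps the function deterministic, which is convenient when recomputing popularity for many package bases against a single timestamp.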
## File Layout
### pkgbase.json:
    {
        "pkgbase1": {
            "FirstSubmitted": 123456,
            "ID": 1,
            "LastModified": 123456,
            "Maintainer": "kevr",
            "OutOfDate": null,
            "URLPath": "/cgit/aur.git/snapshot/pkgbase1.tar.gz",
            "NumVotes": 1,
            "Popularity": 1.0,
            "PopularityUpdated": 12345567753.0
        },
        ...
    }
### pkgname.json:
    {
        "pkg1": {
            "CheckDepends": [], # Only included if a check dependency exists
            "Conflicts": [], # Only included if a conflict exists
            "Depends": [], # Only included if a dependency exists
            "Description": "some description",
            "Groups": [], # Only included if a group exists
            "ID": 1,
            "Keywords": [],
            "License": [],
            "MakeDepends": [], # Only included if a make dependency exists
            "Name": "pkg1",
            "OptDepends": [], # Only included if an opt dependency exists
            "PackageBase": "pkgbase1",
            "Provides": [], # Only included if `provides` is defined
            "Replaces": [], # Only included if `replaces` is defined
            "URL": "https://some_url.com",
            "Version": "1.0-1"
        },
        ...
    }
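A client that wants a single record per package, similar to an RPC `type=info` result, could join the two files on the `PackageBase` key. The sketch below is one possible approach, not aurweb's own code: `load_json` and `package_with_base` are hypothetical helpers, and renaming the base `ID` to `PackageBaseID` is an assumption based on the `PackageBaseID -> ID` note in [Data](#data).

```python
import json


def load_json(path: str):
    """Read one of the archive's JSON files."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def package_with_base(pkgnames: dict, pkgbases: dict, name: str) -> dict:
    """Merge a package entry with the entry of its package base."""
    pkg = pkgnames[name]
    base = pkgbases[pkg["PackageBase"]]
    # Copy base fields, exposing the base ID as PackageBaseID (assumption).
    merged = {key: value for key, value in base.items() if key != "ID"}
    merged["PackageBaseID"] = base["ID"]
    # Package fields are applied last, so the package's own ID is kept.
    merged.update(pkg)
    return merged
```

With the files from a cloned archive, usage would look like `package_with_base(load_json("pkgname.json"), load_json("pkgbase.json"), "pkg1")`.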


@ -0,0 +1,15 @@
# Repository: pkgbases-repo
## Overview
- `pkgbase.json` contains a list of package base names
## File Layout
### pkgbase.json:
[
"pkgbase1",
"pkgbase2",
...
]


@ -0,0 +1,15 @@
# Repository: pkgnames-repo
## Overview
- `pkgname.json` contains a list of package names
## File Layout
### pkgname.json:
[
"pkgname1",
"pkgname2",
...
]
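Because the file is a flat JSON array, a client can load it into a set for fast membership checks. The following is a minimal sketch; `load_names` is a hypothetical helper and the sample text stands in for the contents of `pkgname.json`.

```python
import json


def load_names(text: str) -> frozenset[str]:
    """Parse a JSON array of names into a set for fast lookups."""
    names = json.loads(text)
    if not isinstance(names, list):
        raise ValueError("expected a JSON array of names")
    return frozenset(names)


# Sample data standing in for the contents of pkgname.json.
sample = '["pkgname1", "pkgname2"]'
names = load_names(sample)
print("pkgname1" in names)
```

The same approach would apply to the other list-style repositories, since they share this array layout.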

doc/repos/users-repo.md Normal file

@ -0,0 +1,15 @@
# Repository: users-repo
## Overview
- `users.json` contains a list of usernames
## File Layout
### users.json:
[
"user1",
"user2",
...
]

doc/specs/metadata.md Normal file

@ -0,0 +1,14 @@
# Git Archive Spec: metadata
## Configuration
- `[git-archive]`
    - `metadata-repo`
        - Path to package metadata git repository location
## Repositories
For documentation on each of these repositories, follow its link,
which leads to the Markdown page for that repository.
- [metadata-repo](doc/repos/metadata-repo.md)

doc/specs/pkgbases.md Normal file

@ -0,0 +1,14 @@
# Git Archive Spec: pkgbases
## Configuration
- `[git-archive]`
    - `pkgbases-repo`
        - Path to pkgbases git repository location
## Repositories
For documentation on each of these repositories, follow its link,
which leads to the Markdown page for that repository.
- [pkgbases-repo](doc/repos/pkgbases-repo.md)

doc/specs/pkgnames.md Normal file

@ -0,0 +1,14 @@
# Git Archive Spec: pkgnames
## Configuration
- `[git-archive]`
    - `pkgnames-repo`
        - Path to pkgnames git repository location
## Repositories
For documentation on each of these repositories, follow its link,
which leads to the Markdown page for that repository.
- [pkgnames-repo](doc/repos/pkgnames-repo.md)

doc/specs/popularity.md Normal file

@ -0,0 +1,14 @@
# Git Archive Spec: popularity
## Configuration
- `[git-archive]`
    - `popularity-repo`
        - Path to popularity git repository location
## Repositories
For documentation on each of these repositories, follow its link,
which leads to the Markdown page for that repository.
- [popularity-repo](doc/repos/popularity-repo.md)

doc/specs/users.md Normal file

@ -0,0 +1,14 @@
# Git Archive Spec: users
## Configuration
- `[git-archive]`
    - `users-repo`
        - Path to users git repository location
## Repositories
For documentation on each of these repositories, follow its link,
which leads to the Markdown page for that repository.
- [users-repo](doc/repos/users-repo.md)