Merge branch 'describe-architecture-with-hugo'
Now that the site is converted to be built with Hugo and Pagefind, let's
reflect that status quo in the document describing the site's
architecture.

Signed-off-by: Johannes Schindelin <[email protected]>
dscho committed Jul 14, 2024
2 parents c5b0f93 + e64af61 commit 4724441
Showing 1 changed file with 51 additions and 115 deletions.
166 changes: 51 additions & 115 deletions ARCHITECTURE.md
@@ -1,161 +1,97 @@
# git-scm.com architecture

This document describes the general setup and architecture that runs the
git-scm.com site. The idea is to document all the moving parts that
_aren't_ checked in to this repository. That may help new people joining
the project to help out, as well as provide some continuity in case the
maintainer is hit by a bus.
git-scm.com site.

## Content

Though the site is a rails app, it can _mostly_ be thought of as serving
static content. It's just that we suck in that static content and
pre-process it using nightly scheduled jobs. We never write anything to
the database on behalf of user requests.
This site is served via GitHub Pages and is a [Hugo](https://gohugo.io/) site
with the search implemented using [Pagefind](https://pagefind.app/).

The content is a mix of:

- actual static content in this repository
- original content from this repository

- community book content brought in from https://github.com/progit;
see the `lib/tasks/book2.rake` file.
see the `script/update-book2.rb` and `script/book.rb` files.

- manpages from releases of the git project, imported and formatted
via asciidoctor; see the `lib/tasks/index.rake` task.
The content is pre-rendered and tracked in the `external/book/` directory
tree.

- manual pages from releases of the git project, imported and formatted via
AsciiDoctor, and translated versions of the manual pages from
https://github.com/jnavila/git-html-l10n/ (which itself contains
pre-rendered pages from https://github.com/jnavila/git-manpages-l10n/); see
the `script/update-docs.rb` file.

## Heroku
The pre-rendered pages are tracked in the `external/docs/` directory tree.

The app itself is served by Heroku. The app name is `git-scm` (so you
can visit it directly as https://git-scm.herokuapp.com). The site is
owned by the git-scm.com team. If you want to be involved in managing
uptime/deploys/etc, you'll need a Heroku account and request to be added
to that team.
To deploy to GitHub Pages, it is necessary to turn off the default setting to
"publish from a branch" and instead change the setting to "publish with a
custom GitHub Actions workflow":
https://docs.github.com/en/pages/getting-started-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site#publishing-with-a-custom-github-actions-workflow
With this change, the site can be tested in the fork by pushing to the
`gh-pages` branch (which will trigger the `deploy.yml` workflow) and then
navigating to https://git-scm.<user>.github.io/.
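
As a rough sketch of that fork-testing step, the push could look like the following. The remote name `fork` and the helper name `push_to_fork_pages` are assumptions made for illustration, not names used by this repository; the branch name `gh-pages` and the `deploy.yml` workflow come from the text above.

```shell
# Hypothetical helper: publish the current commit to the fork's `gh-pages`
# branch, which should trigger the `deploy.yml` workflow in that fork.
# Assumption: the fork is configured as a git remote (default name: "fork").
push_to_fork_pages() {
  remote=${1:-fork}
  # Push whatever is currently checked out (HEAD) to the remote's gh-pages branch.
  git push "$remote" HEAD:refs/heads/gh-pages
}
```

Running `push_to_fork_pages` after committing a change sends it to the fork. This is a plain (non-forced) push, so it is rejected if the fork's `gh-pages` branch has diverged from the local history.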

We use a few Heroku add-ons:
## Non-static parts

- Bonsai elasticsearch (see below)
While the site consists mostly of static content, there are a couple of
parts that are sort of dynamic.

- Heroku Postgres as the database
The search is implemented client-side, via [Pagefind](https://pagefind.app/).

- Heroku Redis for rails caching
A few scheduled GitHub workflows keep the content up to date:

- Heroku scheduler for cron jobs
- `update-git-version-and-manual-pages` and `update-download-data` (pick
up newly released git versions)

The nightly scheduled jobs are:
- `update-translated-manual-pages` (fetch and format translated manual
pages from the jnavila/git-html-l10n repository)

- `rake downloads` (pick up newly released git versions)

- `rake preindex` (pull in and format manpages for released git
versions)

- `rake remote_genbook2` (pull in and format progit2 book content,
- `update-book` (fetch and format progit2 book content,
including translations)

It should be safe to run any of those jobs more frequently. E.g., if you
know there's a new Git release out, then:

heroku run rake preindex
heroku run rake downloads

will get it on the site without waiting for the nightly run.

Merges to the `main` branch on GitHub auto-deploy to Heroku, so unless
you're doing something tricky you generally shouldn't need to manually
deploy.

Note that some of the formatting of manpages and book content happens
when they are imported by the rake tasks. So after fixing some
formatting and deploying, the rake jobs may need to be re-run with a
special flag to re-import (see the individual tasks for details).


## Cloudflare

We get enough requests that it's easy to overwhelm the single Heroku
dyno. So we have Cloudflare sitting in front of it, aggressively caching
everything. That also should make the site faster to serve to regions
far away from Heroku's servers.

The Cloudflare setup is mostly pretty simple:
These workflows are also marked as `workflow_dispatch`, i.e. they can be run
manually (e.g. to update the download links just after Git for Windows
published a new release).

- they serve DNS for the whole domain (that's where they insert the CDN
magic)

- Cloudflare provides `https://` support to the user. Obviously the
site is totally open and doesn't have any sensitive data, so this is
really more about integrity. The certificate is generated by
Cloudflare (and requires SNI on the browser side).

- the Cloudflare connection to Heroku is passed over TLS; they provide an
"internal" certificate that we ask Heroku to use, so the connection
is secured between the two (again, mostly for integrity)

- the most exotic config is that we use "page rules" to mark the whole
site to be cached aggressively, regardless of any caching headers
sent from Heroku. This is a bit of a hack, but there's very little on
the site that can't be cached (which is perhaps a sign that the rails
setup needs to be tweaked to send more reasonable caching headers,
but this has been simple and effective so far).

There are a few special page rules to lift this caching for cases
where we do server-side logic (e.g.,
https://github.com/git/git-scm.com/issues/1129#issuecomment-363067019),
but the long-term goal is to push that logic onto the client side as
much as possible.

Both domains (c.f., the section on [DNS](#DNS) below) are owned by a
Cloudflare "Team", and membership of that team is required to
administrate the domains. Similar to the Heroku setup, you can ask to
join this team if you wish to help out. The information about the team
setup is in escrow with the Git PLC at Software Freedom Conservancy.
Cloudflare provides the project with enough credits that it doesn't cost
anything (though we're not using very many features, so it's possible
that a free account would be sufficient, too).

## Bonsai Elasticsearch

The search functionality on the site is served by an elasticsearch
cluster. The index can be populated by running `rake search_index`
(manpages) and `rake search_index_book` (book) on Heroku (we only index
the manpages and book). This perhaps should be run nightly, or at least
after pulling in new content, but it currently isn't done automatically.

The elasticsearch cluster is provided by Bonsai via their Heroku plugin.
Our needs are larger than their free tier provides, but we receive
credits from them that provide the service for free.
Merges to the `gh-pages` branch on GitHub auto-deploy to GitHub Pages via the
`deploy` GitHub workflow.

Note that some of the formatting of manual pages and book content happens
when they are imported by the GitHub workflows. Therefore, whenever there are
changes to the scripts/workflows/automation that affect formatting, these
workflows may need to be triggered manually with the `force-rebuild` flag toggled
(see the individual workflows for details).
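
For illustration, such a manual run might be started from the GitHub CLI roughly as follows. This is a sketch under assumptions: the workflow file name `update-book.yml` and the value `true` for the `force-rebuild` input are guesses based on the names used in this document, and `dispatch_workflow` is a hypothetical helper. The individual workflow files are the authority on the actual input names.

```shell
# Hypothetical helper around `gh workflow run` (GitHub CLI).
# Usage: dispatch_workflow <workflow-file> [input=value ...]
# With DRY_RUN=1 the command is printed instead of executed.
dispatch_workflow() {
  workflow=$1
  shift
  args=""
  # Turn each remaining argument into a `-f key=value` workflow input.
  for input in "$@"; do
    args="$args -f $input"
  done
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "gh workflow run $workflow$args"
  else
    # Word splitting of $args is intentional: inputs are simple key=value pairs.
    gh workflow run "$workflow" $args
  fi
}
```

Under those assumptions, `dispatch_workflow update-book.yml force-rebuild=true` would request a forced re-import of the book content, while prefixing the call with `DRY_RUN=1` only prints the command that would be run.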

## DNS

The actual DNS service is provided by Cloudflare (see above). The domain
itself is registered with Gandi, and is owned by the project via
Software Freedom Conservancy. Funds for the registration are provided
from the Git project's Conservancy funds, and both the Git PLC and
Conservancy have credentials to modify the setup.
The actual DNS service is provided by Cloudflare. The domain itself is
registered with Gandi, and is owned by the project via Software Freedom
Conservancy. Funds for the registration are provided from the Git project's
Conservancy funds, and both the Git PLC and Conservancy have credentials to
modify the setup.

Note that we own both git-scm.com and git-scm.org; the latter redirects
to the former.


## Manual Intervention

The site mostly just runs without intervention:

- code merged to `main` is auto-deployed
- code merged to `gh-pages` is auto-deployed

- new git versions are detected daily and manpages and download links
- new git versions are detected daily and manual pages and download links
updated

- book updates (including translations) are picked up daily

There are a few tasks that still need to be handled by a human:

- new images added to the book have to be copied manually from
progit/progit2

- new languages for book translations need to be added to
`lib/tasks/book2.rake`
`script/book.rb`

- forced re-imports of content (e.g., a formatting fix to imported
manpages) must be triggered manually
- forced re-imports of content (e.g., when fixing formatting in the
imported manual pages) must be triggered manually with `force-rebuild`
toggled
