Releases: ArchiveTeam/wget-lua
v1.21.3-at.20231213.03
Wget-AT 20231213.03 (Wget 1.21.3-at.20231213.03) Release Notes
This release adds the recording of more information about the build process of the Wget-AT binary in use.
Wget-AT can be configured with several options, and built on and for different systems. Information about this is now written to the WARC record with WARC-Type value warcinfo, using fields starting with wget-build-*.
New warcinfo headers
The new headers in the warcinfo record are:
- wget-build-version: The version Wget was built as
- wget-build-system-host: The triplet of CPU, vendor, and operating system information (https://www.gnu.org/software/autoconf/manual/autoconf-2.68/html_node/Specifying-Target-Triplets.html) of host as available through the AC_CANONICAL_HOST macro (https://www.gnu.org/software/autoconf/manual/autoconf-2.68/html_node/Canonicalizing.html)
- wget-build-system-build: The triplet of build as available through the AC_CANONICAL_BUILD macro
- wget-build-system-target: The triplet of target as available through the AC_CANONICAL_TARGET macro
- wget-build-compilation-string: The string used to compile Wget
- wget-build-link-string: The string used to link when building Wget
- wget-build-features: The included and excluded features from the build
Example
The new wget-build-* headers in the warcinfo record look, for example, like this:
wget-build-version: 1.21.3-at.20231213.03
wget-build-system-host: x86_64-pc-linux-gnu
wget-build-system-build: x86_64-pc-linux-gnu
wget-build-system-target: x86_64-pc-linux-gnu
wget-build-compilation-string: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/usr/local/etc/wgetrc" -DLOCALEDIR="/usr/local/share/locale" -I. -I../lib -I../lib -I/usr/include/luajit-2.1 -I/usr/local/include -DHAVE_LIBSSL -I/usr/local/include -DNDEBUG -g -O2
wget-build-link-string: gcc -I/usr/local/include -DHAVE_LIBSSL -I/usr/local/include -DNDEBUG -g -O2 -L/usr/local/lib -lcares -lpcre2-8 -lidn2 -lssl -lcrypto -L/usr/local/lib -lzstd -lz -lpsl -lm -ldl -lluajit-5.1 ../lib/libgnu.a
wget-build-features: +cares +digest -gpgme +https +ipv6 +iri +large-file -metalink -nls +ntlm +opie +psl +ssl/openssl
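A reader of such a warcinfo record may want to split the wget-build-features value into included and excluded features. The following is a minimal sketch, assuming that a leading + marks an included feature and a leading - marks an excluded one, as the example value suggests. The function name parse_build_features is hypothetical and not part of Wget-AT.

```python
def parse_build_features(value):
    """Split a wget-build-features value into (included, excluded) lists.

    Assumption: tokens are whitespace separated, '+' marks an included
    feature and '-' marks an excluded feature.
    """
    included, excluded = [], []
    for token in value.split():
        if token.startswith("+"):
            included.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            raise ValueError("unexpected feature token: %r" % token)
    return included, excluded


example = ("+cares +digest -gpgme +https +ipv6 +iri +large-file "
           "-metalink -nls +ntlm +opie +psl +ssl/openssl")
included, excluded = parse_build_features(example)
print("included:", included)
print("excluded:", excluded)
```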
Minor update
The [email protected] email address, the repository URL https://github.com/ArchiveTeam/wget-lua, and the IRC channel #archiveteam-dev on hackint IRC are now noted in the output of the --version and --help commands.
v1.21.3-at.20231213.01
Wget-AT 20231213.01 (Wget 1.21.3-at.20231213.01) Release Notes
This release adds the recording of minimal SSL/TLS information about the connection used to send and receive data to the WARC. In addition, the release allows Wget-AT to keep track of the protocols used.
WARC-Cipher-Suite header
Before this release, no details of the SSL/TLS connection in use were recorded. The only information about this was stored in the URI, which would start with either https (indicating a secure connection with the SSL or TLS protocol) or http. Websites may (and it is confirmed that some will) return different HTTP responses depending on the details of the secure connection, which makes it urgent to store such information in WARCs.
There is no information in the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) on storing secure connection information. A discussion on adding support is taking place in the issue at iipc/warc-specifications#86. This issue originally suggested the use of a WARC-TLS-Cipher-Suite WARC header, but also notes WARC-Cipher-Suite as an option.
WARC records
The cipher suite WARC header is written on records with WARC-Type value request and response when request and response data for an HTTPS URL is recorded. It is written on a revisit record as well when that record is written due to deduplicating a response record.
A resource record may also have this header written to it, for example when an FTPS connection is used. Currently this is not done, as the FTP WARC record writing functionality in Wget-AT is not optimal. A major overhaul of this functionality is being worked on, at which point this header will be written for data transferred over an FTPS connection.
This header is not currently written on a metadata record in Wget-AT, although it is supported on that record type according to iipc/warc-specifications#86.
Allowed header values
The value of this field is the IANA defined cipher suite name (https://www.iana.org/assignments/tls-parameters/tls-parameters.txt) for TLS, or for SSL a name as defined in RFC 6101 for SSLv3 or "The SSL Protocol" (https://www.ietf.org/archive/id/draft-hickman-netscape-ssl-00.txt) for SSLv2 (SSL version 0.2).
The recorded value should be the cipher suite that is in use for the connection. It should not be the cipher suite presented in the client hello step of the handshake, or any other value before a cipher suite is agreed on and application data is being transferred.
Using header name WARC-Cipher-Suite over WARC-TLS-Cipher-Suite
Having TLS in the WARC header name restricts one to storing only TLS cipher suites. The obsolete SSLv3 protocol uses SSL cipher suites that have a defined name starting with SSL_*. If a TLS connection is used, the cipher suite name starts with TLS_*. Leaving TLS out of the WARC header name allows both SSL and TLS cipher suites to be stored, while it is still clear from the first three bytes of the header value whether SSL or TLS was used.
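The prefix check described above can be sketched in a few lines. This is an illustration only: the function name cipher_suite_family is hypothetical, and the sketch assumes the header value has already been extracted from the WARC record.

```python
def cipher_suite_family(value):
    """Return 'SSL' or 'TLS' based on the first three bytes of a
    WARC-Cipher-Suite value, or None if the value has another prefix."""
    prefix = value[:3]
    if prefix in ("SSL", "TLS"):
        return prefix
    return None


for name in ("TLS_DHE_DSS_WITH_DES_CBC_SHA", "SSL_CK_RC4_128_WITH_MD5"):
    print(name, "->", cipher_suite_family(name))
```

Returning None for an unknown prefix leaves room for cipher suites that are neither SSL nor TLS.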
RFC 5246 notes that "cipher suite values { 0x00, 0x1C } and { 0x00, 0x1D } are reserved to avoid collision with Fortezza-based cipher suites in SSL 3." These two cipher suite values are defined in RFC 6101 as SSL_FORTEZZA_KEA_WITH_NULL_SHA and SSL_FORTEZZA_KEA_WITH_FORTEZZA_CBC_SHA, respectively. While most SSL_* cipher suites have been assigned a similar TLS_* name for further use in the TLS protocol (for example, SSL_DHE_DSS_WITH_DES_CBC_SHA was named TLS_DHE_DSS_WITH_DES_CBC_SHA), some have not and are known only by their SSL_* defined names.
Value { 0x00, 0x1E } was defined with name TLS_KRB5_WITH_DES_CBC_SHA in RFC 2712, while it holds an entirely different name and definition in RFC 6101: SSL_FORTEZZA_KEA_WITH_RC4_128_SHA. Some cipher suites, like SSL_CK_RC4_128_WITH_MD5 (used in SSL version 0.2, see "The SSL Protocol"), do not have a TLS_* defined name at all. These examples show that not all SSL_* cipher suites can simply be represented by a TLS_* name, and why cipher suites should be written with their SSL_* defined name when the SSL protocol is used. While the SSL protocol is obsolete, it can technically still be used, and should be accounted for in the WARC headers.
In addition to the above, a "cipher suite" is well defined and widely accepted as being either a TLS or SSL cipher suite, making WARC-Cipher-Suite a more minimal representation of the type of data to store than WARC-TLS-Cipher-Suite. Any future set of cipher suites that are neither SSL nor TLS cipher suites can also be written under the WARC-Cipher-Suite header.
WARC-Protocol header
In addition to the cipher suite in use, the SSL/TLS version should be recorded, for the same reason as given in the previous section. The WARC format currently does not define a way to store this version, but the issue at iipc/warc-specifications#42 discusses this. The definition proposed in that issue is implemented in this release.
The WARC-Protocol header is allowed to be written on all records the WARC-Cipher-Suite header is allowed on.
Allowed header values
The allowed header values are as defined in the issue at iipc/warc-specifications#42 of which a subset is used in Wget-AT as of this release:
http/0.9
http/1.0
http/1.1
ssl/2
ssl/3
tls/1.0
tls/1.1
tls/1.2
tls/1.3
The value ftp is not currently in use, for the same reasons the WARC-Cipher-Suite header is not yet written on WARC records for FTPS URLs, as explained in the previous section.
As with WARC-Cipher-Suite, the value should be that of the connection actually used to transfer data, not anything used during negotiation.
Exactly one of the http/* values is always written on the request and response records of an HTTP(S) URL, while one of the ssl/* or tls/* values is written only on a record of an HTTPS URL. An example of the WARC-Protocol headers written on a record with an HTTP/1.1 payload and data transferred over a TLSv1.3 connection is
WARC-Protocol: http/1.1
WARC-Protocol: tls/1.3
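Because WARC-Protocol can appear more than once on a record, a reader has to collect all occurrences rather than keep only the last one. A minimal sketch follows, assuming the WARC header block is available as text. The function names and the sample header block are illustrative and not taken from Wget-AT.

```python
def warc_protocols(header_block):
    """Collect the values of all WARC-Protocol headers in a header block."""
    values = []
    for line in header_block.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-protocol":
            values.append(value.strip())
    return values


def split_protocols(values):
    """Separate http/* values from ssl/* and tls/* values."""
    http = [v for v in values if v.startswith("http/")]
    secure = [v for v in values if v.startswith(("ssl/", "tls/"))]
    return http, secure


# Hypothetical header block for a response record of an HTTPS URL.
sample = (
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "WARC-Protocol: http/1.1\r\n"
    "WARC-Protocol: tls/1.3\r\n"
)
http, secure = split_protocols(warc_protocols(sample))
print(http, secure)
```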
Minor features
One minor feature was added in this release:
- The Dockerfile now uses Debian bookworm instead of Debian bullseye.
Bug fixes
Three bugs have been fixed in this release:
- A bug is fixed that prevented the use of option --warc-cdx.
- The manual of Wget states that specifying a protocol of SSLv2, SSLv3, TLSv1, TLSv1_1, TLSv1_2, or TLSv1_3 to option --secure-protocol forces the use of this protocol. In practice this was not the case: the protocol would be set as the minimum version. If --secure-protocol=TLSv1_1 was given, one of TLSv1_1, TLSv1_2, or TLSv1_3 would be used after negotiation. This is now fixed to follow the manual.
- If a URL was transformed from an HTTP to an HTTPS URL due to HSTS, the HTTP version of the URL would still be written in the WARC headers, while the HTTPS URL was used for data transfer. This is now fixed.
v1.21.3-at.20220528.01
Version 1.21.3-at.20220528.01.
v1.21.3-at.20220503.02
Version 1.21.3-at.20220503.02
v1.21.3-at.20220503.01
Version 1.21.3-at.20220503.01.
v1.20.3-at.20211001.01
Version 1.20.3-at.20211001.01. Fixes an implicit conversion from off_t to int.
v1.20.3-at.20200401.01
Wget-AT 20200401.01 (Wget 1.20.3-at.20200401.01) Release Notes
This is the first official release of Wget-AT as a continuation of Wget-Lua. Wget-AT is a new direction for Wget-Lua that adds more modern features for web archiving, in addition to the already implemented Lua scripting.
This release adds support for Zstandard with dictionary compression, implements URL-agnostic deduplication and moves to version 1.1 of the WARC format.
WARC/1.1
Version 1.1 of the WARC format (https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/) adds a number of fields and corrects a number of erroneous recommendations in version 1.0 of the format.
The notable changes in version 1.1 WARCs created with 1.20.3-at.20200401.01 compared to 1.0 WARCs created with previous versions are the addition of
- the WARC-Refers-To-Target-URI header and
- the WARC-Refers-To-Date header
for WARC revisit records. The version noted in the WARC records is now WARC/1.1 instead of WARC/1.0.
Zstandard with dictionary
Normally, according to the standard for WARC/1.1, WARC records are compressed using Zlib, creating .warc.gz files. Every record is compressed individually. If many webpages with overlapping content are stored in a WARC file, the overlap is stored again in every compressed record, because individually compressed records cannot reference each other's data. With the use of a dictionary in which these overlapping parts can be referenced, the overlapping parts can be largely compressed away, resulting in much smaller records when they are compressed with Zstandard with a dictionary.
Implementation
The implementation of Zstandard with dictionary compression has been created in cooperation with Internet Archive to allow playback of Zstandard compressed WARCs through the Wayback Machine. WARCs created with Zstandard compression have the extension .warc.zst, similar to .warc.gz when Zlib compression is used.
Zstandard can be used both with and without a dictionary. Without a dictionary, Zstandard has been shown to perform better than many other compression algorithms, such as the Zlib compression normally used for WARC records. The additional use of a dictionary allows records to be compressed to smaller sizes, and allows overlapping data between records to be compressed away given a well-trained dictionary.
Zstandard allows for skippable frames, which allow arbitrary user data to be added between frames in an additional frame. This frame is normally skipped by software handling Zstandard compressed files. The skippable frame (see https://facebook.github.io/zstd/zstd_manual.html for details) consists of, in listed order,
- the skippable frame ID with a value between 0x184D2A50 and 0x184D2A5F, in little endian format,
- the frame size in 4 bytes, in little endian format, and
- the content of the frame.
The dictionary used can be stored in a skippable frame with frame ID 0x184D2A5D as the very first frame of the WARC file. By default the Zstandard dictionary is compressed with Zstandard before being added as content of the skippable frame, unless option --warc-zstd-dict-no-compression is given to prevent compression of the dictionary before storing it. To prevent the dictionary from being included at the start of the resulting WARC file, option --warc-zstd-dict-no-include should be used.
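The frame layout described above can be sketched with Python's struct module. This is an illustration under stated assumptions, not the Wget-AT implementation: the function names are hypothetical, and the sketch returns the frame content as stored, without decompressing a dictionary that was compressed with Zstandard (that would need a third-party Zstandard library).

```python
import io
import struct

SKIPPABLE_MIN = 0x184D2A50
SKIPPABLE_MAX = 0x184D2A5F
DICT_FRAME_ID = 0x184D2A5D  # frame ID noted above for the dictionary


def write_skippable_frame(frame_id, content):
    """Build a skippable frame: 4-byte ID and 4-byte size, both little
    endian, followed by the content."""
    if not SKIPPABLE_MIN <= frame_id <= SKIPPABLE_MAX:
        raise ValueError("frame ID outside the skippable range")
    return struct.pack("<II", frame_id, len(content)) + content


def read_skippable_frame(stream):
    """Read one skippable frame from a binary stream and return
    (frame_id, content)."""
    header = stream.read(8)
    if len(header) != 8:
        raise ValueError("stream too short for a skippable frame header")
    frame_id, size = struct.unpack("<II", header)
    if not SKIPPABLE_MIN <= frame_id <= SKIPPABLE_MAX:
        raise ValueError("not a skippable frame: 0x%08X" % frame_id)
    content = stream.read(size)
    if len(content) != size:
        raise ValueError("truncated skippable frame")
    return frame_id, content


frame = write_skippable_frame(DICT_FRAME_ID, b"dictionary bytes")
frame_id, content = read_skippable_frame(io.BytesIO(frame))
print(hex(frame_id), content)
```

To inspect a real file, the same reader could be applied to a .warc.zst file opened in binary mode, provided the dictionary was included as the first frame.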
--warc-compression-use-zstd
Use Zstandard instead of Zlib compression for compressing WARC records. To use a Zstandard dictionary as well, use option --warc-zstd-dict=FILENAME.
--warc-zstd-dict=FILENAME
The Zstandard dictionary to use for compression. Option --warc-compression-use-zstd needs to be used in order to use this option.
The dictionary is by default compressed with Zstandard and included at the beginning of the WARC file, unless options --warc-zstd-dict-no-compression or --warc-zstd-dict-no-include, respectively, are used.
--warc-zstd-dict-no-include
Prevent the Zstandard dictionary in use from being included in a skippable frame at the start of the WARC file. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.
It can be useful to not include the dictionary if many separate WARCs are created using the same dictionary. Storing the dictionary in every WARC creates overhead in size. Instead, it may be useful to store the Zstandard dictionary separately.
--warc-zstd-dict-no-compression
Prevent the compression of the Zstandard dictionary in use with Zstandard before writing it to the skippable frame. Option --warc-zstd-dict=FILENAME needs to be used in order to use this option.
Zstandard dictionaries themselves are not compressed, and compressing one can often reduce the size of the skippable frame by tens of percent compared to storing the dictionary uncompressed. Not compressing the dictionary might improve performance, as no decompression needs to take place in order to use the dictionary.
Deduplication
With deduplication on WARC records, a response record can be converted to a revisit record if it is found to be a duplicate of another record. In accordance with version 1.1 of the WARC format, the headers
- WARC-Refers-To, referring to WARC-Record-ID of the original record,
- WARC-Refers-To-Target-URI, referring to WARC-Target-URI of the original record,
- WARC-Refers-To-Date, referring to WARC-Date of the original record,
- WARC-Profile, with value http://netpreserve.org/warc/1.1/revisit/identical-payload-digest, and
- WARC-Truncated, with value length,
are added and header WARC-Type is assigned value revisit. WARC-Block-Digest is set to the digest of the truncated data and WARC-Payload-Digest is the digest of the original payload.
With this release, URL-agnostic deduplication is supported for WARC records in a single Wget session with the --warc-dedup-url-agnostic option. URL-gnostic deduplication is used by default for WARC writing, unless disabled with --warc-dedup-disable.
--warc-dedup-url-agnostic
Allow URL-agnostic deduplication of WARC records in the same Wget session.
With URL-agnostic deduplication, a response record is converted into a revisit record when only the WARC-Payload-Digest matches that of a previously written record. Other WARC headers, like WARC-Target-URI, do not have to be equal in order for a revisit record to be written.
--warc-dedup-min-size=NUMBER
The minimum size in bytes a payload must have before it is deduplicated. The default value is 100.
When a response record is converted to a revisit record, a number of fields are added. The value of --warc-dedup-min-size is used to determine when it is 'worth it' to write a revisit record instead of the original, given the increase or decrease in size, performance, and other factors.
--warc-dedup-disable
Disables the URL-gnostic deduplication. This deduplication is turned on by default.
URL-gnostic deduplication converts a response record into a revisit record when another record was previously written with equal values for the WARC-Payload-Digest and WARC-Target-URI WARC headers.
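The difference between the two deduplication modes comes down to the lookup key. The following is a minimal sketch of that decision logic, not Wget-AT's actual code: the class and method names are hypothetical, and it assumes that payloads smaller than the minimum size are neither deduplicated nor remembered.

```python
REVISIT_PROFILE = (
    "http://netpreserve.org/warc/1.1/revisit/identical-payload-digest"
)


class Deduplicator:
    """Track written response records within one session."""

    def __init__(self, url_agnostic=False, min_size=100, enabled=True):
        self.url_agnostic = url_agnostic
        self.min_size = min_size
        self.enabled = enabled
        self.seen = {}

    def _key(self, payload_digest, target_uri):
        # URL-agnostic: digest only. URL-gnostic: digest and URI.
        if self.url_agnostic:
            return payload_digest
        return (payload_digest, target_uri)

    def check(self, record_id, target_uri, date, payload_digest, size):
        """Return revisit headers for a duplicate, otherwise remember the
        record and return None."""
        if not self.enabled or size < self.min_size:
            return None  # assumption: small payloads are ignored entirely
        key = self._key(payload_digest, target_uri)
        original = self.seen.get(key)
        if original is None:
            self.seen[key] = (record_id, target_uri, date)
            return None
        return {
            "WARC-Type": "revisit",
            "WARC-Refers-To": original[0],
            "WARC-Refers-To-Target-URI": original[1],
            "WARC-Refers-To-Date": original[2],
            "WARC-Profile": REVISIT_PROFILE,
            "WARC-Truncated": "length",
        }


dedup = Deduplicator(url_agnostic=True)
dedup.check("<urn:uuid:1>", "http://a.example/", "2020-04-01T00:00:00Z",
            "sha1:AAAA", 500)
print(dedup.check("<urn:uuid:2>", "http://b.example/",
                  "2020-04-01T00:00:05Z", "sha1:AAAA", 500))
```

With url_agnostic=False, the second call above would return None, because the target URIs differ.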
v1.20.3-lua
Wget-Lua with updated Wget version 1.20.3.