Home
You can run Wget with your Lua script with the --lua-script option:
wget-lua --lua-script YOURSCRIPT.lua URL
If you want to add URLs to the download queue with the get_urls hook, you must also enable --recursive or --page-requisites.
wget-lua --lua-script YOURSCRIPT.lua --recursive URL
wget-lua --lua-script YOURSCRIPT.lua --page-requisites URL
Your Lua script will get a wget.callbacks table. Implement your callback functions as fields of this object. Wget will call these functions during the download process. (You do not have to implement every function.)
You can define these 8 functions:
- init: called on initialization.
- lookup_host: called for DNS requests.
- write_to_warc: whether to write a WARC response record.
- download_child_p: accept/reject URLs.
- httploop_result: retry or continue on error.
- get_urls: custom URL extraction.
- finish: called before the final cleanup.
- before_exit: called before Wget exits.
The names of these callback functions correspond with the C functions where they are called.
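As a minimal sketch of how a script is laid out, the example below registers two of the callbacks and leaves the others undefined. The counter name request_count is a hypothetical choice. The fallback stub for the wget table is an assumption that only exists so the file also loads in a plain Lua interpreter; under wget-lua the real table is used.

```lua
-- Fallback stub (assumption, for standalone runs only); wget-lua supplies the real table.
local wget = wget or { callbacks = {}, actions = { NOTHING = "NOTHING" } }

-- Hypothetical counter, initialized at the top of the script.
local request_count = 0

-- Called after every HTTP request; here it only counts the requests.
wget.callbacks.httploop_result = function(url, err, http_stat)
  request_count = request_count + 1
  -- NOTHING tells Wget to follow its normal procedure for this result.
  return wget.actions.NOTHING
end

-- Called once when downloading has finished; reports the counter.
wget.callbacks.finish = function(start_time, end_time, wall_time, numurls,
                                 total_downloaded_bytes, total_download_time)
  io.stdout:write("requests seen: " .. request_count .. "\n")
end
```

Callbacks that are not assigned, such as init or get_urls here, are simply not called.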
Your script might need debugging. The table_show.lua library is very helpful if you want to inspect the parameter values or your own internal tables. The example script lua-example/print_parameters.lua uses the table.show function to print the parameters.
Called during Wget initialization.
wget.callbacks.init = function()
You can initialize counters and other Lua variables in this function, but it is often easier to place the initialization code at the top of the Lua script.
Called during DNS hostname lookups.
wget.callbacks.lookup_host = function(host)
Return a string containing the resolved IP address, a new hostname string, or nil to use the original Wget behavior.
- host is the hostname to be resolved.
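One way to use this hook is a small override table, sketched below. The table name host_overrides and its entries are hypothetical, and the wget stub is an assumption that only matters when the file is loaded outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or { callbacks = {} }

-- Hypothetical override table: hostname -> IP address or replacement hostname.
local host_overrides = {
  ["www.example.com"] = "192.0.2.10",      -- address from the documentation range
  ["old.example.com"] = "new.example.com", -- resolve a different hostname instead
}

wget.callbacks.lookup_host = function(host)
  -- A missing key yields nil, which keeps Wget's normal resolution.
  return host_overrides[host]
end
```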
Called before writing WARC response records for individual HTTP/S requests. Determines whether to skip writing the record.
wget.callbacks.write_to_warc = function(url, http_stat)
Return true to write the record, false to not write it.
- url is a url structure, as is also used in httploop_result.
- http_stat is an http_stat structure, as is also used in httploop_result.
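The following sketch shows one possible rule: skip the WARC response record for server errors and write everything else. The choice of the 5xx range is an illustrative assumption, and the wget stub only exists so the file loads outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or { callbacks = {} }

wget.callbacks.write_to_warc = function(url, http_stat)
  local status = http_stat["statcode"]
  -- Hypothetical rule: do not record 5xx responses.
  if status and status >= 500 and status < 600 then
    return false
  end
  return true
end
```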
Called at the end of Wget's accept/reject process. Define this function to add custom accept/reject rules.
wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)
Return true to download, false to skip the current URL.
Most of the parameters to this function are tables with many fields. A selection:
urlpos is the URL from the Wget queue that is being considered for download:
- urlpos["url"]["url"] is the actual URL.
- urlpos["link_expect_html"] is 1 for HTML links (<a href="...">) and 0 for other links.
- urlpos["link_expect_css"] is 1 for CSS links (<link rel="stylesheet">) and 0 for other links.
- urlpos["link_inline_p"] is 1 for inline links, the page requisites (images, CSS etc.), and 0 for other links.
parent is the parent URL that pointed to this URL:
- parent["url"] is the actual URL.
depth is the depth of the current URL: the number of hops from the initial URL.
start_url_parsed is the URL where Wget started (the URL from the command line or URL-list input file):
- start_url_parsed["url"] is the actual URL.
iri gives Wget's URI encoding settings for this URL.
verdict is Wget's decision for this URL:
- verdict == true if Wget wants to download this URL.
- verdict == false if one or more accept/reject rules rejected this URL.
reason is the reason for Wget's rejection:
- reason == nil if Wget accepted this URL.
- reason == "ALREADY_ON_BLACKLIST": Wget has already downloaded this URL.
- reason == "NON_HTTP_SCHEME": this is not an HTTP URL.
- reason == "NOT_A_RELATIVE_LINK": rejected by --relative.
- reason == "DOMAIN_NOT_ACCEPTED": rejected by --domains or --exclude-domains.
- reason == "IN_PARENT_DIRECTORY": rejected by --no-parent.
- reason == "DIRECTORY_EXCLUDED": rejected by --include-directories or --exclude-directories.
- reason == "REGEX_EXCLUDED": rejected by --accept-regex or --reject-regex.
- reason == "PATTERN_EXCLUDED": rejected by --accept or --reject.
- reason == "DIFFERENT_HOST": rejected by (the absence of) --span-hosts.
- reason == "ROBOTS_TXT_FORBIDDEN": rejected by a robots.txt file.
download_child_p = {
  ["urlpos"] = {
    ["url"] = {
      ["url"] = "http://www.gnu.org/graphics/bullet.gif";
      ["scheme"] = "SCHEME_HTTP";
      ["host"] = "www.gnu.org";
      ["port"] = 80;
      ["path"] = "graphics/bullet.gif";
      ["dir"] = "graphics";
      ["file"] = "bullet.gif";
    };
    ["link_expect_html"] = 0;
    ["link_expect_css"] = 0;
    ["link_base_p"] = 0;
    ["link_complete_p"] = 0;
    ["link_css_p"] = 1;
    ["link_inline_p"] = 1;
    ["link_refresh_p"] = 0;
    ["link_relative_p"] = 0;
    ["ignore_when_downloading"] = 0;
  };
  ["parent"] = {
    ["url"] = "http://www.gnu.org/layout.css";
    ["scheme"] = "SCHEME_HTTP";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["path"] = "layout.css";
    ["dir"] = "";
    ["file"] = "layout.css";
  };
  ["depth"] = 1;
  ["start_url_parsed"] = {
    ["url"] = "http://www.gnu.org/software/wget/";
    ["scheme"] = "SCHEME_HTTP";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["path"] = "software/wget/";
    ["dir"] = "software/wget";
    ["file"] = "";
  };
  ["iri"] = {
    ["uri_encoding"] = "utf-8";
    ["utf8_encode"] = false;
  };
  ["verdict"] = true;
  ["reason"] = "ALREADY_ON_BLACKLIST";
};
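Using the parameters described above, a custom accept/reject rule might look like the sketch below. Both rules (rejecting logout links, and accepting page requisites that were rejected only for being on another host) are hypothetical examples. The wget stub is an assumption for loading the file outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or { callbacks = {} }

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed,
                                           iri, verdict, reason)
  local url = urlpos["url"]["url"]

  -- Hypothetical rule 1: never follow logout links (plain substring match).
  if string.find(url, "/logout", 1, true) then
    return false
  end

  -- Hypothetical rule 2: fetch page requisites even when they live on another host.
  if urlpos["link_inline_p"] == 1 and reason == "DIFFERENT_HOST" then
    return true
  end

  -- Otherwise keep Wget's own decision.
  return verdict
end
```

Returning verdict as the default keeps all of Wget's command-line accept/reject options in effect, so the script only has to describe the exceptions.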
This function is called immediately after Wget finishes an HTTP request, before it handles any errors.
wget.callbacks.httploop_result = function(url, err, http_stat)
Return one of the following wget.actions:
- wget.actions.NOTHING: follow the normal Wget procedure for this result.
- wget.actions.CONTINUE: retry this URL.
- wget.actions.EXIT: finish this URL (ignore any error).
- wget.actions.ABORT: Wget will abort() and exit immediately.
The url and http_stat parameters are tables with many fields. A selection:
url is the URL for this request:
- url["url"] is the actual URL.
err is Wget's status code for the response. It is one of the following strings:
- NOCONERROR, HOSTERR, CONSOCKERR, CONERROR, CONSSLERR, CONIMPOSSIBLE, NEWLOCATION, NOTENOUGHMEM, CONPORTERR, CONCLOSED, FTPOK, FTPLOGINC, FTPLOGREFUSED, FTPPORTERR, FTPSYSERR, FTPNSFOD, FTPRETROK, FTPUNKNOWNTYPE, FTPRERR, FTPREXC, FTPSRVERR, FTPRETRINT, FTPRESTFAIL, URLERROR, FOPENERR, FOPEN_EXCL_ERR, FWRITEERR, HOK, HLEXC, HEOF, HERR, RETROK, RECLEVELEXC, FTPACCDENIED, WRONGCODE, FTPINVPASV, FTPNOPASV, CONTNOTSUPPORTED, RETRUNNEEDED, RETRFINISHED, READERR, TRYLIMEXC, URLBADPATTERN, FILEBADFILE, RANGEERR, RETRBADPATTERN, RETNOTSUP, ROBOTSOK, NOROBOTS, PROXERR, AUTHFAILED, QUOTEXC, WRITEFAILED, SSLINITFAILED, VERIFCERTERR, UNLINKERR, NEWLOCATION_KEEP_POST, CLOSEFAILED, WARC_ERR, WARC_TMP_FOPENERR, WARC_TMP_FWRITEERR
http_stat contains many useful properties of the response, among others:
- http_stat["statcode"]: the HTTP status code
httploop_result = {
  ["url"] = {
    ["path"] = "software/wget/";
    ["dir"] = "software/wget";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["file"] = "";
    ["scheme"] = "SCHEME_HTTP";
    ["url"] = "http://www.gnu.org/software/wget/";
  };
  ["err"] = "RETRFINISHED";
  ["http_stat"] = {
    ["restval"] = 0;
    ["dltime"] = 0;
    ["local_file"] = "tmp/www.gnu.org/software/wget/index.html";
    ["orig_file_size"] = 15194;
    ["existence_checked"] = true;
    ["res"] = 0;
    ["rd_size"] = 0;
    ["orig_file_name"] = "tmp/www.gnu.org/software/wget/index.html";
    ["statcode"] = 200;
    ["message"] = "OK";
    ["contlen"] = -1;
    ["len"] = 0;
    ["error"] = "OK";
    ["timestamp_checked"] = false;
  };
};
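A common use of this hook is a bounded retry on server errors. The sketch below assumes a limit of three attempts per URL (MAX_TRIES) and keeps a hypothetical tries table keyed by URL; neither is part of wget-lua. The stub values for wget.actions are placeholders used only outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only); the real values come from wget-lua.
local wget = wget or {
  callbacks = {},
  actions = { NOTHING = "NOTHING", CONTINUE = "CONTINUE", EXIT = "EXIT", ABORT = "ABORT" },
}

local MAX_TRIES = 3 -- hypothetical limit on attempts per URL
local tries = {}    -- URL string -> number of failed attempts seen so far

wget.callbacks.httploop_result = function(url, err, http_stat)
  local key = url["url"]
  local status = http_stat["statcode"]

  if status and status >= 500 and status < 600 then
    tries[key] = (tries[key] or 0) + 1
    if tries[key] < MAX_TRIES then
      return wget.actions.CONTINUE -- retry this URL
    end
    return wget.actions.EXIT -- give up on this URL and move on
  end

  -- Any other result: forget the counter and let Wget handle it normally.
  tries[key] = nil
  return wget.actions.NOTHING
end
```

A real script would probably also wait between attempts; that is left out here because it depends on facilities outside the Lua standard library.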
Called during the URL extraction for a downloaded file.
wget.callbacks.get_urls = function(file, url, is_css, iri)
Return a table of URLs that should be added to the download queue. The table is a list with one item per URL, with the following fields:
- "url": the absolute URL to enqueue (mandatory).
- "link_expect_html": 1 if the result should be parsed as an HTML file.
- "link_expect_css": 1 if the result should be parsed as a CSS file.
- "post_data": a parameter string of application/x-www-form-urlencoded data to be posted in a POST request.
- "body_data": the request body. Unlike "post_data", this does not set the method.
- "method": the HTTP method.
- "headers": a table specifying custom headers to insert, mapping from header names to header values.
Example:
local urls = {}
-- a normal web page
table.insert(urls, { url="http://example.com/", link_expect_html=1 })
-- a css page
table.insert(urls, { url="http://example.com/style.css", link_expect_css=1 })
-- an image (do not extract links)
table.insert(urls, { url="http://example.com/image.png" })
-- sending a POST request
table.insert(urls, { url="http://example.com/login", post_data="username=test&password=test" })
file is the local filename of the downloaded file. You can read the contents of this file to implement your own URL extractor.
url is the URL for this request.
is_css is true if this is parsed as a CSS file, false otherwise.
iri gives Wget's URI encoding settings for this URL.
get_urls = {
  ["file"] = "tmp/www.gnu.org/software/wget/index.html";
  ["url"] = "http://www.gnu.org/software/wget/";
  ["is_css"] = false;
  ["iri"] = {
    ["content_encoding"] = "utf-8";
    ["uri_encoding"] = "ANSI_X3.4-1968";
    ["utf8_encode"] = false;
  };
};
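As a sketch of a custom extractor, the example below reads the downloaded file and queues absolute URLs found in data-src attributes, which a standard HTML link extractor may not look at. The helper name extract_data_src and the choice of attribute are hypothetical, and returning an empty table when there is nothing to add is an assumption about what wget-lua accepts. The wget stub only matters outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or { callbacks = {} }

-- Hypothetical helper: collect absolute http(s) URLs from data-src="..." attributes.
local function extract_data_src(html)
  local found = {}
  for link in string.gmatch(html, 'data%-src="(https?://[^"]+)"') do
    table.insert(found, { url = link })
  end
  return found
end

wget.callbacks.get_urls = function(file, url, is_css, iri)
  if is_css then
    return {} -- this sketch only looks at HTML
  end
  local handle = io.open(file, "rb")
  if not handle then
    return {} -- file missing or unreadable: add nothing
  end
  local html = handle:read("*a")
  handle:close()
  return extract_data_src(html)
end
```

Relative links are ignored here on purpose, because the "url" field must be absolute and resolving relative references correctly needs more than a pattern match.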
This function is called when Wget has finished downloading, just after it prints the "FINISHED" summary.
wget.callbacks.finish = function(start_time, end_time, wall_time, numurls, total_downloaded_bytes, total_download_time)
start_time indicates when downloading began (clock time in seconds).
end_time indicates when downloading finished (clock time in seconds).
wall_time is the total time in seconds (end_time - start_time).
numurls is the number of URLs downloaded.
total_downloaded_bytes is the number of bytes downloaded (as a floating-point number).
total_download_time is the download time in seconds.
finish = {
  ["start_time"] = 2.51e-07;
  ["end_time"] = 10.670458281;
  ["wall_time"] = 10.67045803;
  ["numurls"] = 2;
  ["total_downloaded_bytes"] = 7682;
  ["total_download_time"] = 0.000822633;
};
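These values are enough to print a short summary of your own. The sketch below computes an average rate over the wall time; the helper name summary_line and the output format are hypothetical, and the wget stub is an assumption for loading the file outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or { callbacks = {} }

-- Hypothetical helper: format a one-line summary with the average rate.
local function summary_line(numurls, total_downloaded_bytes, wall_time)
  local rate = 0
  if wall_time > 0 then
    rate = total_downloaded_bytes / wall_time
  end
  return string.format("%d URLs, %.0f bytes, %.1f bytes/s",
                       numurls, total_downloaded_bytes, rate)
end

wget.callbacks.finish = function(start_time, end_time, wall_time, numurls,
                                 total_downloaded_bytes, total_download_time)
  io.stdout:write(summary_line(numurls, total_downloaded_bytes, wall_time) .. "\n")
end
```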
This function is called before Wget exits. Implement this function to change the exit status.
wget.callbacks.before_exit = function(exit_status, exit_status_string)
This method should return an integer exit code. Return exit_status or use a custom number. For convenience, wget.exits provides the following constants:
wget.exits.SUCCESS
wget.exits.IO_FAIL
wget.exits.NETWORK_FAIL
wget.exits.SSL_AUTH_FAIL
wget.exits.SERVER_AUTH_FAIL
wget.exits.PROTOCOL_ERROR
wget.exits.SERVER_ERROR
wget.exits.UNKNOWN
exit_status is the exit status that Wget will return.
exit_status_string is a text version of the exit status. It is one of:
- SUCCESS, IO_FAIL, NETWORK_FAIL, SSL_AUTH_FAIL, SERVER_AUTH_FAIL, PROTOCOL_ERROR, SERVER_ERROR, UNKNOWN
before_exit = {
  ["exit_status"] = 8;
  ["exit_status_string"] = "SERVER_ERROR";
};
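For example, a job that expects some error responses could report success anyway. The sketch below maps SERVER_ERROR to wget.exits.SUCCESS and passes every other status through; treating server errors as success is a hypothetical policy. The stub values 0 and 8 follow GNU Wget's documented exit codes and are only used outside wget-lua.

```lua
-- Fallback stub (assumption, for standalone runs only).
local wget = wget or {
  callbacks = {},
  exits = { SUCCESS = 0, SERVER_ERROR = 8 },
}

wget.callbacks.before_exit = function(exit_status, exit_status_string)
  -- Hypothetical policy: error responses from the server do not fail the job.
  if exit_status_string == "SERVER_ERROR" then
    return wget.exits.SUCCESS
  end
  -- Everything else keeps the status Wget was going to return.
  return exit_status
end
```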