Gwtar: a static efficient single-file HTML format
#Infrastructure


Gwtar is a new polyglot HTML archival format which provides a single, self-contained HTML file that can still be efficiently lazy-loaded by a web browser. This is done by JavaScript in the file's header making HTTP Range requests. It is used on Gwern.net to serve large HTML archives.


Background

Archiving HTML files faces a trilemma: it is easy to create an archival format which is any two of static (self-contained, ie. all assets included, no special software or server support), a single file (when stored on disk), and efficient (lazy-loads assets only as necessary to display to a user), but no known format allows all 3 simultaneously. We introduce a new format, Gwtar (pronounced “guitar”; .gwtar.html extension), which achieves all 3 properties at once.

A Gwtar is a classic fully-inlined HTML file, which is then processed into a self-extracting concatenated file of an HTML + JavaScript header followed by a tarball of the original HTML and assets. The HTML header's JS stops web browsers from loading the rest of the file, loads just the original HTML, and then hooks requests and turns them into range requests into the tarball part of the file. Thus, a regular web browser loads what seems to be a normal HTML file, and all assets download only when they need to. In this way, a static HTML page can inline anything—such as gigabyte-size media files—but those will not be downloaded until necessary, even while the server sees just a single large HTML file it serves as normal. And because it is self-contained in this way, it is forwards-compatible: no future user or host of a Gwtar file needs to treat it specially, as all functionality required is old standardized web browser/server functionality.

Gwtar allows us to easily and reliably archive even the largest HTML pages, while still being user-friendly to read.

Example pages: “The Secret of Psalm 46” (vs original SingleFile archive—warning: 286MB download).

HTML Trilemma

One of the biggest challenges for long-term websites is archiving. Gwern.net makes an effort to solve this; and due to quality problems and link rot, simply linking to the Internet Archive is not enough, so I try to create & host my own web page archives of everything I link.

There are 3 major properties we would like of an HTML archive format, beyond the basics of actually capturing a page in the first place: it should be static, not depending in any way on the original web page, because then it is not an archive and will inevitably break; it should be a single file, easy to manage and store, so you can scalably create archives and keep them for the long run; and it should be efficient, which for HTML largely means that readers should be able to download only the parts they need in order to view the current page.

No current format achieves all 3. The built-in web browser save-as-HTML format achieves single & efficient, but not static; save-as-HTML-with-directory achieves static (partially) & efficient, but not single; SingleFile, MHTML, & zstd-compressed MHTML achieve static & single, but not efficient; WARCs/WACZs achieve static & efficient, but not single (because while a WARC is a single file, it relies on a complex software installation like Webrecorder/ReplayWeb.page to display).

An ordinary ‘save as page HTML’ browser command doesn't work because “Web Page, HTML Only” leaves out most of a web page; even “Web Page, Complete” is inadequate because a lot of assets are dynamic and only appear when you interact with the page—especially images. If you want a static HTML archive, one which has no dependency on the original web page or domain, you have to use a tool specifically designed for this. I usually use SingleFile.

SingleFile produces a static snapshot of the live web page, making sure that images are loaded first so they are included in the snapshot; it usually produces a useful one. It also achieves another nice property: the snapshot is a single file, just one simple .html file, which makes life so much easier in terms of organizing and hosting. Want to mirror a web page? SingleFile it, and upload the resulting single file to a convenient directory somewhere, boom—done forever. Being a single file is important on Gwern.net, where I must host so many files, run so many lints, checks, & automated tools, track metadata etc., and where other people may rehost my archives.

However, a user of SingleFile quickly runs into a nasty drawback: snapshots can be surprisingly large. In fact, some snapshots on Gwern.net are over half a gigabyte! For example, the homepage for the research project is 485MB after size optimization, while the raw HTML is 0.6MB. It is common for an ordinary somewhat-fancy Web 2.0 blog post to be >20MB once fully archived. This is because such web pages wind up importing a lot of CSS, JS, widgets, icons etc., all of which must be saved to ensure the snapshot is fully static; and then there is additional wasted space from Base64-encoding binary assets into text (Base64 turns every 3 bytes into 4 characters, roughly a third more space). This is especially bad because, unlike the original web page, anyone viewing a snapshot must download the entire thing. The live 500MB web page is possibly OK because a reader downloads only the images that they are looking at; but a reader of the archived version must download everything. A web browser has to download the entire page, after all, to display it properly; and there is no lazy-loading or ability to optionally load ‘other’ files—there are no other files ‘elsewhere’, that was the whole point of using SingleFile! Hence, a SingleFile archive is static, and a single file, but it is not efficient: viewing it requires downloading unnecessary assets.

So, for some archives, we ‘split’ or ‘deconstruct’ the static snapshot back into a normal HTML file and a directory of asset files, using deconstruct_singlefile.php (which incidentally makes it easy to re-compress all the images, which produces large savings as many websites are surprisingly bad at basic stuff like PNG/JPG/GIF compression); then we are back to a static, efficient, but not single file, archive. This is fine for our static page archives because they are stored in their own directory tree which is off-limits to most Gwern.net infrastructure (and off-limits to search engines & agents or off-site hotlinking), and it doesn't matter too much if they litter tens of thousands of directories and files. It is not fine for HTML archives I would like to host as first-class citizens, and expose to Google, and hope people will rehost someday when Gwern.net inevitably dies.

So, we could either host a regular SingleFile archive, which is static, single, and inefficient; or a deconstructed archive, which is static, multiple, and efficient, but not all 3 properties. This issue came to a head in January 2026 when I was archiving the Internet Archive snapshots of Brian Moriarty's famous lectures, since I had noticed while writing that his whole website had sadly gone down. I admire them and wanted to host them properly so people could easily find my fast reliable mirrors (unlike the slow, hard-to-find, unreliable IA versions), but realized I was running into our long-standing dilemma: they would be efficient in the local archive system after being split, but unfindable; or if findable, inefficiently large and reader-unfriendly. Specifically, the video of “Who Buried Paul?” was not a problem because it had been linked as a separate file, so I simply mirrored the file and edited the link; but “The Secret of Psalm 46” turned out to inline the OGG/MP3 recordings of the lecture, and abruptly increased from <1MB to 286MB. I discussed it with a collaborator, and he began developing a fix.

Trisecting

To achieve all 3, we need some way to download only part of a file, and selectively download the rest. This lets us have a single static archive of potentially arbitrarily large size, which can safely store every asset which might be required. HTTP already supports selective downloading via the ancient Range request, which allows one to query for a precise range of bytes inside a URL. This is mostly used for things like resuming downloads, but you can also do cleverer things, like running databases in reverse: a web browser client can run a database application locally which reads a database file stored on a server, because Range queries let the client download only the exact parts of the database file it needs at any given moment, rather than the entire thing (which might be terabytes in size). This is how formats like WARC can render efficiently: host a WARC as a normal file, and then simply range-query the parts displayed at any moment.
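For concreteness, here is a minimal browser-side sketch (not the Gwtar loader itself) of issuing a Range request and confirming the server honored it; the URL and byte range are arbitrary:

async function fetchRange(url, start, end) {
    // Ask for an exact byte range; a compliant server answers with HTTP 206
    // ("Partial Content") and only those bytes.
    const response = await fetch(url, { headers: { "Range": `bytes=${start}-${end}` } });
    if (response.status !== 206) {
        // A 200 means the Range header was ignored and the whole file is coming.
        throw new Error(`Range requests unsupported: got HTTP ${response.status}`);
    }
    return new Uint8Array(await response.arrayBuffer());
}

// eg. read only the first kilobyte of a multi-gigabyte file:
// fetchRange("/doc/example.gwtar.html", 0, 1023).then(bytes => console.log(bytes.length));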

The challenge is the first part: how do we download only the original HTML and subsequently only the displayed assets? If we have a single HTML file and then a separate giant archive file, we could easily just rewrite the HTML using JS to point to the equivalent ranges in the archive file (or do something server-side), but that would achieve only static and efficiency, not single file. If we combine them, like SingleFile, we are back to static and single file, but not efficiency.

The simplest solution here would be to complicate the server itself and do the equivalent of deconstruct_singlefile.php on the fly: HTML requests, perhaps detected by some magic string in the URL like .singlefile.html, are handed to a proxy process, which splits the original single HTML file into a normal HTML file with lazy-loaded references. The client browser sees a normal multiple efficient HTML, while everything on the server side sees a static single inefficient HTML. While this solves the immediate Gwern.net problem, it does so at the permanent cost of server complexity, and does not do much to help anyone else. (It is unrealistic to expect more than a handful of people to modify their servers this invasively.) I also considered taking the WARC red pill and going full Webrecorder, but quailed.

Concatenated Archive Design

A Gwtar is an HTML file consisting of an HTML + JS + JSON header followed by a tarball (and optionally further trailing data such as PAR2 blocks). (A Gwtar could be seen as almost a polyglot file, ie. a file valid as more than one format—in this case, a .html file that is also a .tar archive, and possibly a .par2. But strictly speaking, it is not.)

CREATION

We provide a reference PHP script, deconstruct_singlefile.php, which creates Gwtars from SingleFile HTML snapshots. It additionally tries to recompress JPGs/PNGs/GIFs before storing them in the Gwtar, and then appends the optional PAR2 forward-error-correction data. Example command to replace the original snapshot with a Gwtar including PAR2 FEC data:

php ./static/build/deconstruct_singlefile.php --create-gwtar --add-fec-data 2010-02-brianmoriarty-thesecretofpsalm46.html

IMPLEMENTATION

The first line of the header is a magic HTML string, and the final line is the magic HTML string <!-- GWTAR END. (Additional metadata like the original input filename/hash/date may be included in comments.) The header stores a JSON dictionary of filenames/sizes/types/SHA-256 hashes for the real HTML (always first), followed by all of its assets (basename-asset-N.ext). There is always an HTML file and at least one asset. All of these are stored in the tarball immediately following the header (which does not necessarily contain only these assets).

Example (with whitespace added for readability):
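A hypothetical reconstruction of such an index, for a page with its HTML plus two assets (all filenames, sizes, & hashes here are made up for illustration, and do not show the exact Gwtar schema):

{
  "2010-02-brianmoriarty-thesecretofpsalm46.html":
    { "size": 612034, "type": "text/html", "sha256": "9f8e…" },
  "2010-02-brianmoriarty-thesecretofpsalm46-asset-1.mp3":
    { "size": 151234567, "type": "audio/mpeg", "sha256": "1a2b…" },
  "2010-02-brianmoriarty-thesecretofpsalm46-asset-2.png":
    { "size": 45678, "type": "image/png", "sha256": "c3d4…" }
}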

(Unfortunately, the assets are not necessarily valid for their mime-type—SingleFile passes through invalid or mismatched images and doesn't guarantee much about the validity of its generated HTML. So Gwtar cannot require or guarantee much about the input or output HTML and is best-effort.)

window.gwtar is attached to the window object and does nothing initially. Finally, window.stop() is called at the end of the header, before the appended tarball begins; this stops the web browser from loading any more data, and then the main JS is free to run.
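A minimal sketch of what the tail of that header JS might look like (names & structure illustrative, not the actual implementation):

// Loader object; inert until explicitly started.
window.gwtar = {
    index: {},          // the JSON filename/size/type/hash dictionary from the header
    start: function () { /* begin Range-request loading of the real HTML */ },
};

// Last statements of the header: abort downloading of everything that follows
// (ie. the appended tarball), then let the main loader JS take over.
window.stop();
window.gwtar.start();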

The main header JS then uses Range requests to load the real HTML first, and then it watches requests for resources; the resources have been rewritten to be deliberately-broken 404 errors (requesting from localhost, to avoid polluting any server logs), so when they fail, the JS rewrites them into working Range requests into the tarball, and retries them.
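A simplified sketch of that intercept-and-retry idea, assuming the header index also records each asset's byte offset & size within the file (the offset field, and handling only <img> elements, are illustrative assumptions):

document.addEventListener("error", async (event) => {
    const el = event.target;
    if (!(el instanceof HTMLImageElement)) return;      // images only, for brevity
    const name = new URL(el.src).pathname.split("/").pop();
    const entry = window.gwtar.index[name];
    if (!entry) return;                                  // not one of our archived assets
    // Re-fetch the asset as a byte range out of this very file...
    const response = await fetch(location.href, {
        headers: { "Range": `bytes=${entry.offset}-${entry.offset + entry.size - 1}` },
    });
    const bytes = await response.arrayBuffer();
    // ...and point the element at the recovered bytes.
    el.src = URL.createObjectURL(new Blob([bytes], { type: entry.type }));
}, true); // capture phase: resource 'error' events do not bubble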

Details

The simple approach would be to download the binary assets, encode them into Base64 text, and inject them into the HTML DOM as data: URIs. This is inefficient in both compute and RAM, because the web browser must immediately reverse the encoding to get binary data it can work with. So we instead use Blobs, a browser optimization which lets us pass the binary asset straight to the browser.
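To illustrate the difference, a sketch of the two approaches given the bytes returned by a Range request (function names are made up):

// 1. Naive: re-encode to Base64 and build a data: URI. The browser must decode
//    this text back into binary, wasting CPU & RAM (and String.fromCharCode
//    with spread fails outright on large assets).
function toDataURI(bytes, mime) {
    const base64 = btoa(String.fromCharCode(...new Uint8Array(bytes)));
    return `data:${mime};base64,${base64}`;
}

// 2. Blob: wrap the binary directly and mint an object URL; the browser
//    consumes the bytes as-is, with no round-trip through text.
function toBlobURL(bytes, mime) {
    return URL.createObjectURL(new Blob([bytes], { type: mime }));
}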

A tricky bit is that inline JS can depend on “previously loaded” JS files, which may not have actually loaded yet because the first attempt failed (of course) and the real Range request is still racing. We currently solve this by just downloading all JS before rendering the HTML, at some cost to responsiveness.
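A sketch of that workaround, under the same assumption of per-asset offsets/sizes in the index: fetch every script in parallel, then inject them synchronously in their original order before rendering the archived HTML.

async function loadAllScriptsFirst(index) {
    // Pick out every JS asset recorded in the header's index.
    const scripts = Object.entries(index).filter(([name]) => name.endsWith(".js"));
    // Fetch all of their byte ranges in parallel...
    const sources = await Promise.all(scripts.map(async ([, entry]) => {
        const r = await fetch(location.href, {
            headers: { "Range": `bytes=${entry.offset}-${entry.offset + entry.size - 1}` },
        });
        return new TextDecoder().decode(await r.arrayBuffer());
    }));
    // ...then execute them in index (dependency) order, so no inline script
    // ever runs before the libraries it expects.
    for (const source of sources) {
        const el = document.createElement("script");
        el.textContent = source;
        document.head.appendChild(el);
    }
}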

So, a web browser will load a normal web page; the JS will halt its loading; a new page loads, and all of its requests initially fail but get repeated immediately and work the second time; and the entire archive never gets downloaded unless required. All assets are provided; there is a single Gwtar file; it is efficient; it doesn't require JS for archival integrity, since the entire archive simply downloads if the JS is not executed; and it is cross-platform and standards-compliant, requires no server-side support nor for future users/hosts to do anything whatsoever, and is a transparent, self-documenting file format which can be easily converted back to a ‘normal’ multiple-file HTML (cat foo.gwtar.html | perl -ne'print $_ if $x; $x=1 if /<!-- GWTAR END/' | tar xf -), or a user can just re-archive it normally with tools like SingleFile.

FALLBACK

In the event of JS problems, gwtar_fallback.html explains what the Gwtar format is and why it requires JS, and links to this page for more details. The loader also detects whether Range requests are supported; if not, it downgrades to requesting the entire file and starts rendering it as it arrives. This is not as slow as it seems, because we can benefit from connection-level compression like Brotli or gzip; and because our preprocessing linearizes the assets in dependency order, the bytes arrive in order of page appearance, so in this mode the “above the fold” images and such will still load first and quickly. (Compare the usual SingleFile snapshot, where every single asset must arrive before the page is done, which may be slower.)
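The feature-test itself can be a one-shot probe along these lines (a sketch, not the fallback's actual code):

async function rangeRequestsWork(url) {
    // Ask for the first 100 bytes; HTTP 206 means ranges are honored, while
    // HTTP 200 means the whole file is coming and should be rendered as it streams.
    const response = await fetch(url, { headers: { "Range": "bytes=0-99" } });
    return response.status === 206;
}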

COMPRESSION

Gwtar does not directly support deduplication or compression. Gwtars may overlap and have redundant copies of assets, but because those copies are stored bit-identical inside the tarballs, a filesystem can transparently remove most of that redundancy. Media assets like MP3s or JPEGs are already compressed, and can be further optimized during the build phase by a Gwtar implementation. The HTML text itself could be compressed; it is currently unclear to me how Gwtar's Range requests interact with transparently-negotiated compression like Brotli (which for Gwern.net was as easy as enabling one option in Cloudflare). RFC 7233 doesn't seem to give a clear answer about this, and the available discussion seems to indicate that the range would have to be interpreted relative to the compressed version rather than the original, which is useful for the core use-case of resuming downloads but not for ours. So I suspect that Cloudflare would probably either disable Brotli, or downgrade to sending the entire file instead. It is possible that “transfer-encoding” compression solves this, but as of 2018, Cloudflare didn't support it, making it useless for us and suggesting little support in the wild. If this turns out to be a serious problem, it may be possible to compress the HTML during the Gwtar generation phase and adjust the JS accordingly.

LIMITATIONS

Local Viewing

Strangely, the biggest drawback of Gwtar turns out to be local viewing of HTML archives. SingleFileZ encounters the same issue: in the name of security (sandboxing of local file:// pages), browsers will not execute certain requests in local HTML pages, so a Gwtar breaks, as it is no longer able to issue requests against itself. We regard this as unfortunate but an acceptable tradeoff, as for local browsing, the file can be easily converted back to the non-JS-dependent multiple-file/single-file HTML formats.

Range Request Support

Range requests are old, standardized, and important for resuming downloads or viewing large media files like video, and every web server should, in theory, support them by default. In practice, there may be glitches, and one should check. An example command, which should return an HTTP 206 (not 200) response if range requests are working correctly:

curl --head --header "Range: bytes=0-99" 'https://gwern.net/doc/philosophy/religion/1999-03-17-brianmoriarty-whoburiedpaul.gwtar.html'

HTTP/2 206
date: Sun, 25 Jan 2026 22:20:57 GMT
content-type: x-gwtar
content-length: 100
server: cloudflare
last-modified: Sun, 25 Jan 2026 07:08:33 GMT
etag: "6975c171-7aeb5c"
age: 733
cache-control: max-age=77760000, public, immutable
content-disposition: inline
content-range: bytes 0-99/8055644
cf-cache-status: HIT
...

Servers should serve Gwtar files as text/html if possible. This may require some configuration (eg. in nginx), but should be straightforward.

Cloudflare Is Broken

However, Cloudflare has an undocumented, hardwired behavior: its proxy (not cache) will strip Range request headers from text/html responses regardless of cache settings. This does not break Gwtar rendering, of course, but it does break efficiency and defeats the point of Gwtar for Gwern.net. As a workaround, we serve Gwtars with the MIME type x-gwtar—web browsers like Firefox & Chromium will content-sniff the opening tag and render correctly, while Cloudflare passes Range requests through for unrecognized types. (This is not ideal, but a more conventional MIME type like application/... results in web browsers downloading the file without trying to render it at all; and using a MIME-type trick is better than alternatives like trying to serve Gwtars as MP4s, using a special-case subdomain just to bypass Cloudflare completely, using complex tools like Service Workers to try to undo the header removal, etc.)

OPTIONAL TRAILING DATA

The appended tarball can itself be followed by additional arbitrary binary data, which can be large since it will usually not be downloaded. (While the exact format of each appended file is up to the user, it's a good idea to wrap them in tarballs if you can.) This flexibility is intended primarily to allow ad hoc metadata extensions or forward error correction (FEC, eg. PAR2).

Metadata

A Gwtar is served with a text/html MIME type; where necessary to work around Cloudflare, it is served as x-gwtar instead.

IP

This documentation and the Gwtar code are licensed under the copyright license. We are unaware of any relevant software patents.

Further Work

Gwtar v1 could be improved with:

  • Validation tool
  • Checking of hashsums when rendering (possibly async or deferred)
  • More aggressive prefetching of assets
  • Integration into SingleFile (possibly as a “SingleFileZ2” format?)
  • Testing: corpus of edge-case test files (inline SVG, srcset, CSS @import chains, web fonts, data URIs in CSS…)

A Gwtar v2 could add breaking changes like:

  • a format providing more rigorous validation/checking of HTML & assets: require HTML & asset validity, that all assets decode successfully, etc.
  • standardize appending formats
  • require FEC
  • built-in compression with Brotli/gzip for formats not already compressed
  • multi-page support: this would try to replace MAFF's capability of creating sets of documents which are convenient to link/archive and can automatically share assets for de-duplication (eg. a page selected by a built-in widget, or perhaps by a hash-anchor like archive.gwtar.html#page=foo.html? Could an initial web page open new tabs of all the other web pages in the archive?)
  • better de-duplication, eg. content-addressed asset names (hash-based) enabling deduplication across multiple Gwtars

Gwtar may at some point check hash integrity during viewing; the hashes are currently provided for forwards-compatibility and offline integrity-checking. We use SHA-256 specifically because it is familiar, easy to use on the command line or in JavaScript, my default hash, provides long-term security, and is fast enough to never plausibly be a bottleneck in the Gwtar context.
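If in-browser checking is ever added, the standard Web Crypto API suffices; a sketch of verifying one asset against its recorded hash (a hypothetical helper, not current Gwtar behavior):

async function verifySHA256(bytes, expectedHex) {
    // Hash the asset's raw bytes and compare against the hex digest from the header index.
    const digest = await crypto.subtle.digest("SHA-256", bytes);
    const hex = Array.from(new Uint8Array(digest))
        .map(b => b.toString(16).padStart(2, "0"))
        .join("");
    return hex === expectedHex.toLowerCase();
}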

nginx: a server I increasingly regret using on Gwern.net rather than Apache, as it seems nothing is ever simple or sane in it.

Or if that doesn't work, perhaps a tag combined with a zero width/height size so readers don't see a broken image placeholder. (Use of display: none risks browsers/SingleFile deciding that it can be omitted entirely.)

Strictly speaking, the tarball wrapper is unnecessary because par2 can scan & repair inside arbitrary bitstreams, but it's tidier and friendlier.

A longer demonstration of using PAR2:

mkdir -p ./tmp && cd ./tmp
wget 'https://gwern.net/doc/philosophy/religion/1999-03-17-brianmoriarty-whoburiedpaul.gwtar.html'
cp 1999-03-17-brianmoriarty-whoburiedpaul.gwtar.html archive.gwtar

## record size + hash of the protected payload
ORIGSIZE=$(stat -c %s archive.gwtar)
sha256sum archive.gwtar > archive.gwtar.sha256

## create 25% redundancy for 'archive.gwtar'
par2create -r25 archive.gwtar
tar cf archive.gwtar.par2 archive.gwtar.vol*.par2

## build one-file carrier (payload + tarball of par2 packets)
cat archive.gwtar archive.gwtar.par2 > archive.withfec.html

## simulate "only the one-file carrier survived"
rm -f archive.gwtar archive.gwtar.par2 archive.gwtar.vol*.par2

## corrupt some bytes in the protected region only (0..ORIGSIZE−1)
cp archive.withfec.html broken.html
for i in {1..32}; do
    OFFSET=$(( (RANDOM << 15 | RANDOM) % ORIGSIZE ))
    printf '\x00' | dd of=broken.html bs=1 seek="$OFFSET" count=1 conv=notrunc status=none
done

## IMPORTANT: give par2repair (1) a .par2 filename, (2) an extra file to scan
ln -sf broken.html broken.html.par2
par2repair broken.html.par2 broken.html
# Loading "broken.html.par2".
# Loaded 503 new packets including 499 recovery blocks
# Loading "broken.html.par2".
# No new packets found
#
# There are 1 recoverable files and 0 other files.
# The block size used was 2472 bytes.
# There are a total of 1997 data blocks.
# The total size of the data files is 4936480 bytes.
#
# Verifying source files:
#
# Target: "archive.gwtar" - missing.
#
# Scanning extra files:
#
# Opening: "broken.html"
# File: "broken.html" - found 1968 of 1997 data blocks from "archive.gwtar".
#
# Repair is required.
# 1 file(s) are missing.
# You have 1968 out of 1997 data blocks available.
# You have 499 recovery blocks available.
# Repair is possible.
# You have an excess of 470 recovery blocks.
# 29 recovery blocks will be used to repair.
#
# Computing Reed Solomon matrix.
# Constructing: done.
# Solving: done.
#
# Wrote 4936480 bytes to disk
#
# Verifying repaired files:
#
# Opening: "archive.gwtar"
# Target: "archive.gwtar" - found.
#
# Repair complete.

## verify reconstructed file
sha256sum -c archive.gwtar.sha256
# archive.gwtar: OK
