Hypertext Markup Language

In short

Universal language of the Web; deals with very rich layout; comparable to PDF; works together with CSS to achieve subtle results; looks perfect from the outside, may look horrible from the inside.

item: info
types: Markup, Text (plain)
preferred: ⚠️ under conditions
extensions: .htm, .html
related formats: CSS, CSV, JavaScript, JSON, JSX, Markdown, SGML, SQL, TeX, Text, XHTML, XML, XSLT, YAML
wikipedia: HTML


HTML is the core web technology: it turned the pre-WWW internet of 1991 into the World Wide Web.

It is basically a document markup language, in which you can store text plus formatting plus layout.

In earlier times, the layout and format options were basic. Nowadays both are as sophisticated as those of the printed page.

In order to achieve the richest formatting, HTML works together with another standard for styles: CSS.

What sets HTML apart from printed work is dynamics: users can interact with a web page, and defining this interaction is delegated to yet another technology, JavaScript, which is a programming language.

The ultimate reference for all these technologies is the Mozilla Developer Network (MDN).

HTML files may contain CSS styles inside, or they may refer to other files for their styling.

Likewise, they may contain JavaScript inside, or they may refer to other files for their scripting.
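To make the distinction between embedded and referenced styles and scripts concrete, here is a minimal Python sketch that counts both kinds in an HTML string, using only the standard library's `html.parser`. The names `StyleScriptCounter`, `count_styles_and_scripts`, and the sample markup are invented for this illustration. The sketch ignores `style` attributes on individual elements and does not look at the `type` of a `script` element, so treat its counts as a rough indication only.

```python
from html.parser import HTMLParser


class StyleScriptCounter(HTMLParser):
    """Count embedded versus referenced CSS and JavaScript (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {
            "embedded_css": 0,
            "external_css": 0,
            "embedded_js": 0,
            "external_js": 0,
        }

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "style":
            # <style>...</style> carries the CSS inside the HTML file itself.
            self.counts["embedded_css"] += 1
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "").lower().split():
            # <link rel="stylesheet" href="..."> points to a separate CSS file.
            self.counts["external_css"] += 1
        elif tag == "script":
            # A script with a src attribute lives in another file.
            key = "external_js" if attrs.get("src") else "embedded_js"
            self.counts[key] += 1


def count_styles_and_scripts(html_text):
    """Return a dict with the four counts for the given HTML text."""
    parser = StyleScriptCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Made-up sample page with one of each kind.
SAMPLE = """<!DOCTYPE html>
<html>
<head>
  <style>p { color: navy; }</style>
  <link rel="stylesheet" href="site.css">
  <script src="app.js"></script>
</head>
<body>
  <p>Hello</p>
  <script>console.log("inline");</script>
</body>
</html>"""

print(count_styles_and_scripts(SAMPLE))
```

A file with only embedded styles and scripts is self-contained; any external count above zero means there are other files that belong with it.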


The World Wide Web is one of the most dynamic corners of IT. Standards are always on the move, and new patterns of organizing web content replace the best practices of yesterday.

HTML files may contain complete programs, documentation, and installation procedures on the one hand, or they may consist of long stretches of text with horribly redundant layout codes.

Most HTML in the world has not been typed by humans, but has been generated by software.

HTML files can be almost one-liners that call a big JavaScript program that does all the rest, or they may contain everything inside.

HTML in the archive

One scenario in which HTML files may enter the archive is when a website gets archived. In that case, it is not the individual HTML files themselves that should be judged for their long-term preservability, but rather the website as an integral system.

In this scenario, it is preferable to archive the source code of the website as well, not only the end result.

Nowadays, there are many popular ways to write HTML documentation as generated static pages. The source code resides in a software repository, and the generated pages are served by a service such as Read the Docs or GitHub Pages.

Another scenario is when HTML files with substantial content have been captured from a legacy system, or from other sources. If possible, scan each file for references to external files, rescue those files as well, and store them in a directory layout that matches the way they are referenced. Then preserve the resulting directory in a TAR file.
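The rescue procedure described above could be sketched in Python along the following lines, using only the standard library. This is an illustration under assumptions, not a complete tool: the helper names (`ReferenceCollector`, `local_references`, `pack_directory`), the set of attributes inspected, and the sample markup are all made up for this example. It does not follow `srcset` attributes or `url(...)` references inside CSS, and it skips references that start with `/`, because those cannot be resolved without knowing the root of the original site.

```python
import tarfile
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse

# Attributes that commonly point at other files (an assumption, not a full list).
REFERENCE_ATTRS = {"href", "src", "poster", "data"}


class ReferenceCollector(HTMLParser):
    """Collect the raw values of reference-like attributes."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in REFERENCE_ATTRS and value:
                self.references.append(value)


def local_references(html_text):
    """Return relative file paths referenced by the HTML, without duplicates."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    result = []
    for ref in collector.references:
        parts = urlparse(ref)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # remote URL, mailto:, data:, or a pure "#fragment"
        if parts.path.startswith("/"):
            continue  # site-root-relative: the site root is unknown here
        path = unquote(parts.path)  # "%20" becomes a space, and so on
        if path not in result:
            result.append(path)
    return result


def pack_directory(directory, tar_path):
    """Store the directory, including everything below it, in an uncompressed TAR file."""
    directory = Path(directory).resolve()
    with tarfile.open(tar_path, "w") as tar:
        tar.add(directory, arcname=directory.name)


# Made-up sample page with a mix of local and remote references.
SAMPLE = """<html>
<head>
  <link rel="stylesheet" href="css/site.css?v=2">
  <script src="https://cdn.example.org/lib.js"></script>
</head>
<body>
  <a href="#top">Top</a>
  <img src="images/logo%20small.png">
  <a href="mailto:someone@example.org">Mail</a>
</body>
</html>"""

print(local_references(SAMPLE))
```

The list returned by `local_references` tells you which files to look for next to the HTML file. Once they are in place under the same relative paths, `pack_directory("rescued-site", "rescued-site.tar")` would bundle the directory; write the TAR file outside the directory being packed.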

See also