Markup

Plain text with layout and formatting

In short

Special syntax separates content from markup.

item           info
formats        CSS, HTML, JavaScript, JSX, Markdown, SGML, TeX, XHTML, XML, XSLT
extensions     css, es, htm, html, js, jsx, md, sty, tex, xml, xsl, xslt
related types  Data (container), Text (formatted), Text (plain)

Description

Markup is a mechanism to annotate content with layout, formatting, and additional semantics.

SGML, HTML, XML, and XHTML

XML, HTML and SGML are common and suitable markup language formats, provided the file formats are valid and complete (see paragraph below). Apart from these formats there are XML-based or SGML-based formats that can only be read by special software. Such files cannot be accepted without further verification; please check with DANS.

Connections between SG/X/HT/XHT-ML

SGML is a classic, which has been used extensively to produce large bodies of documentation.

Slightly newer is HTML, the defining technology of the World Wide Web.

XML arose as an attempt to make SGML simpler and to subsume HTML in the process.

Technically, XML is an application profile of the ISO standard SGML: all XML files are SGML files. Because XML has a much stricter syntax, it is easier to validate. XML has since evolved further: there are various schema languages and ways of validating XML, and it contains elements that are not defined in the same way in SGML, such as the encoding attribute. All text in an XML document is by definition in Unicode.

HTML is another variant of SGML; it is primarily intended for the presentation of rich text (and layout) and hyperlinks to other documents.

In addition to “regular” HTML there is also XHTML, which is HTML under the stricter rules of XML.

SGML and XML are hardly being further developed.

Quirks in the history of HTML

The story of HTML is more intriguing. Whereas the W3C tried for a long time to integrate HTML with XML, the browser vendors, united in the WHATWG, consider HTML itself the technology to develop further and have defined HTML as a Living Standard (https://html.spec.whatwg.org). The ways have not really parted, because W3C and WHATWG have signed a memorandum of understanding.

Markdown, YAML and JSON

These three formats have become very popular in a wide range of contexts. They are not as strictly defined as XML and HTML. Markdown especially has many alternative syntaxes, extensions, and variations. Markdown essentially translates to HTML, and most Markdown processors allow interspersed HTML. That said, Markdown processors also tend to strip such HTML if it is considered dangerous.

Raison d'être

The richness of XML and HTML comes at a price: they are unsuitable for being typed by a human.

In scenarios where people have to write extensive documentation and configuration, new lightweight conventions have evolved for typing formatted text with minimal markup overhead. The visual layout inside a plain text file, by means of white space and newlines, plus a few ASCII patterns with a special meaning, serves as markup.

Out of this came Markdown as a format for writing rich text, and YAML as a format for writing nested structures.

JSON is a strict format from the JavaScript world. It is a way to serialize structured data, very much like YAML, but with a syntax that is very close to JavaScript. JSON has largely taken over a particular role of XML: exchanging data between computers over a network.
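
As an illustration of JSON as a serialization format, the following minimal Python sketch writes a nested structure to a JSON string and reads it back. The record itself is made up for this example; only the standard library json module is used.

```python
import json

# Hypothetical nested record, made up for illustration only.
record = {"title": "Field notes", "year": 2020, "tags": ["survey", "draft"]}

# Serialize to a JSON string, as one would before sending it over a network.
text = json.dumps(record, sort_keys=True)

# Parse the string back into Python objects.
restored = json.loads(text)
```

The round trip restores a structure equal to the original, which is the property that makes JSON usable for data exchange.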

TeX and LaTeX

Both are used extensively in science and scholarship for writing and publishing papers, mostly in the exact sciences where mathematical formulas are prominent, but also in the humanities where non-alphabetic symbols are used, e.g. for music and exotic languages.

Many documentation tools use TeX as an intermediate format from which they generate PDF.

History

One of the first uses of markup for professional type-setting is the classic TeX. Here text is interspersed with macros and commands, most of which start with a backslash. Whereas TeX provides an extremely powerful base for doing typographical computations, LaTeX has been developed as a system to limit the power of raw TeX to more friendly and higher level commands.

Recommendations

With markup there are two issues that matter greatly for long-term preservation:

  • has the markup been applied in a valid way?
  • is the marked up file complete, i.e. self-contained?

Validity

Validity of a marked up file means that it can be parsed, which is almost always the first step one has to take in order to make use of such a file.

For TeX the criterion is whether the TeX processor can handle the file. Unfortunately, there is no rigorous way to test that other than running TeX.

For JSON, Markdown, YAML, HTML, and XHTML the notion of validity is simple: either the file parses, or it does not.
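
For JSON, a parse-or-fail check could look like the sketch below. The function name parses_as_json is chosen for this example and is not part of any library; it relies only on Python's standard json module.

```python
import json


def parses_as_json(text: str) -> bool:
    """Return True if `text` can be parsed as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

For example, parses_as_json('{"a": [1, 2]}') succeeds, while a trailing comma as in '{"a": 1,}' or single-quoted keys make the parse fail.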

For SGML and XML the notion of validity has been divided into well-formedness and validity.

Well-formed and valid

Well-formed documents have a correct mixture of markup and text, i.e. the markup elements are properly shaped, start tags and end tags match, all characters are defined. It is a largely cosmetic, but essential requirement.
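
A well-formedness check for XML can be sketched with Python's standard library parser. The function name is_well_formed is an assumption made for this example. Note that the check says nothing about validity against a schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if `xml_text` parses as XML, i.e. is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Mismatched tags, unclosed elements, illegal characters, etc.
        return False
    return True
```

Here "<a><b>text</b></a>" passes, whereas "<a><b>text</a></b>" fails because the start tags and end tags do not match.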

Valid markup language documents are well-formed and moreover comply with an extra set of rules about how the marked up elements hang together.

Through schemas, entirely new “file formats” can be defined, such as

format                          description
SVG                             vector graphics
Text Encoding Initiative (TEI)  format and annotate text
MathML                          mathematical formulas
SMIL                            synchronized multimedia

The World Wide Web Consortium (W3C) manages the specifications for HTML and XML, and provides a Markup Validator that can validate both XHTML and HTML and additional XML-based formats.

The rules that govern the content of a markup document can be described with the help of various substandards, such as a DTD, a W3C XML Schema, or a RELAX NG schema.

At the top of XML and XHTML documents is a declaration that may refer to such a schema.

Whether a file can be validated depends on the presence of such a schema. And that is, in itself, a matter of completeness.

Completeness

Markup may be based on the use of multiple files in additional file formats.

For JSON and YAML this does not play a role.

For TeX this is about style files and package files. By studying the log file of a TeX run one can see which files are needed for that run.
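
A TeX log marks each file it opens with an opening parenthesis followed by the file name. The sketch below extracts such names with a regular expression. It is a rough heuristic, not a complete log parser: the list of extensions is an assumption, the sample log is made up, and long paths that TeX wraps across log lines will be missed.

```python
import re

# Assumed set of extensions of interest; extend as needed.
FILE_PATTERN = re.compile(r"\(([^\s()]+\.(?:tex|cls|sty|clo|def|cfg))")


def files_in_log(log_text):
    """Return the files a TeX log reports as opened, in order, without repeats."""
    seen = []
    for path in FILE_PATTERN.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


# Hypothetical log excerpt, made up for illustration.
sample_log = """(./paper.tex
(/opt/texmf/article.cls
Document Class: article
(/opt/texmf/size10.clo))
(./local/mystyle.sty)
(./paper.aux) )
"""

needed = files_in_log(sample_log)
```

Files outside the document's own directory, such as the class file in this excerpt, are the ones to check when judging whether a deposit is self-contained.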

For SGML, XML, and XHTML it is a matter of style sheets (e.g. CSS, XSLT), scripts (e.g. JavaScript), fonts (e.g. WOFF), and graphics (e.g. PNG) and the schemas by which the validity is checked.
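
To get a first impression of which additional files an XHTML document depends on, one could list the references in its link, script, and img elements. The following is a minimal sketch under the assumption that the document is well-formed XHTML in the standard XHTML namespace. The function name, the table of attributes, and the sample document are made up for this example, and other kinds of references (for instance inside CSS) are not covered.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Element name -> attribute that holds the reference (illustrative subset).
REFERENCE_ATTRS = {"link": "href", "script": "src", "img": "src"}


def referenced_files(xhtml_text):
    """Return the href/src values of link, script, and img elements."""
    root = ET.fromstring(xhtml_text)
    found = []
    for element in root.iter():
        # Strip the XHTML namespace to get the bare element name.
        name = element.tag.replace(XHTML_NS, "")
        attr = REFERENCE_ATTRS.get(name)
        if attr and element.get(attr):
            found.append(element.get(attr))
    return found


# Hypothetical document, made up for illustration.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Example</title>
    <link rel="stylesheet" href="style.css"/>
    <script src="app.js"></script>
  </head>
  <body><img src="figure.png" alt="A figure"/></body>
</html>"""

dependencies = referenced_files(sample)
```

Each entry in the resulting list is a file that has to be archived along with the document.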

Schemas

Last, but not least, the presence of a schema is a matter of completeness. There are four possibilities about where the schema can be found:

  1. inside the marked up file;
  2. on the local file system relative to the marked up file;
  3. on the local file system, absolutely addressed;
  4. online.

Ad 1. This is the easiest case. The schema will always be available.

Ad 2. Care must be taken that the schema is archived as well, and in the same relative position to the marked up file.

Ad 3. Care must be taken that the schema is archived as well, and possibly the marked up file must be changed so that the location of the schema is given in a relative way.

Ad 4. Ideally, a local copy of the schema should be attached, unless it is available at a reliable public service.
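
The cases above can be told apart automatically for schemas that are referenced as a DTD in a DOCTYPE declaration. The sketch below uses the expat parser from the Python standard library to read the declaration and classifies its system identifier. The function name and the returned labels are assumptions made for this example. It does not look at other mechanisms such as xsi:schemaLocation, and it recognizes only POSIX-style absolute paths and file: URLs as absolute.

```python
import xml.parsers.expat
from urllib.parse import urlparse


def doctype_location(xml_text):
    """Classify where the DTD named in the DOCTYPE declaration lives."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["system_id"] = system_id
        info["internal"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)

    if not info:
        return "no DOCTYPE"
    system_id = info["system_id"]
    if system_id is None:
        return "internal" if info["internal"] else "no external reference"
    scheme = urlparse(system_id).scheme
    if scheme in ("http", "https"):
        return "online"
    if system_id.startswith("/") or scheme == "file":
        return "absolute"
    return "relative"
```

For instance, a declaration such as <!DOCTYPE note SYSTEM "note.dtd"> is reported as "relative", which corresponds to case 2: the file note.dtd must be archived next to the marked up file.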

If a non-standard schema is used, the data depositor should consult DANS beforehand.