Plain text with layout and formatting
Special syntax separates content from markup.
|related types||Data (container), Text (formatted), Text (plain)|
Markup is a mechanism to annotate content with layout, formatting, and additional semantics.
SGML, HTML, XML, and XHTML
XML, HTML and SGML are common and suitable markup language formats, provided the file formats are valid and complete (see paragraph below). Apart from these formats there are XML-based or SGML-based formats that can only be read by special software. Such files cannot be accepted without further verification; please check with DANS.
Connections between SG/X/HT/XHT-ML
SGML is a classic that has been used extensively to produce large bodies of documentation.
Slightly newer is HTML, the defining format of the World Wide Web.
Technically, XML is an application profile of the ISO standard SGML: all XML files are SGML files. Because XML has a much stricter syntax, it is easier to validate. XML has since evolved further: there are various schema definition languages and ways of validating XML, and it contains elements that are not defined in the same way in SGML, such as the encoding attribute. All text in an XML document is by definition in Unicode.
HTML is another variant of SGML; it is primarily intended for the presentation of rich text (and layout) and hyperlinks to other documents.
Quirks in the history of HTML
The story of HTML is more intriguing. Whereas the W3C tried for a long time to integrate HTML with XML, the browser vendors, united in the WHATWG, consider HTML itself the technology to develop further and have defined HTML as a Living Standard (https://html.spec.whatwg.org). The ways have not really parted, because W3C and WHATWG have signed a memorandum of understanding.
Markdown, YAML and JSON
These three formats have become very popular in a wide range of contexts. They are not as strictly defined as XML and HTML. Markdown especially has many alternative syntaxes, extensions, and variations. Markdown essentially translates to HTML, and most Markdown processors allow interspersed HTML. That said, Markdown processors also tend to strip such HTML if it is considered dangerous.
In scenarios where people have to write extensive documentation and configuration, new lightweight conventions have evolved for typing formatted text with minimal markup overhead. The visual layout inside a plain text file, by means of white space and newlines, plus a few ASCII patterns with a special meaning, serves as markup.
TeX and LaTeX
Both are used extensively in science and scholarship for writing and publishing papers, mostly in the exact sciences, where mathematical formulas are prominent, but also in the humanities, where non-alphabetic symbols are used, e.g. for music and exotic languages.
One of the first uses of markup for professional type-setting is the classic TeX. Here text is interspersed with macros and commands, most of which start with a backslash. Whereas TeX provides an extremely powerful base for doing typographical computations, LaTeX has been developed as a system to limit the power of raw TeX to more friendly and higher level commands.
With markup there are two issues that matter greatly for long-term preservation:
- has the markup been applied in a valid way?
- is the marked up file complete, i.e. self-contained?
Validity of a marked up file means that it can be parsed, which is almost always the first step one has to take in order to make use of such a file.
For TeX the criterion is: can the TeX processor handle the file? Unfortunately, there is no rigorous way to test that other than ... running it.
Well-formed and valid
Well-formed documents have a correct mixture of markup and text, i.e. the markup elements are properly shaped, start tags and end tags match, and all characters are defined. It is a largely cosmetic but essential requirement.
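As a minimal sketch of a well-formedness check for XML, the Python standard library parser can be asked to read the text; a parse error means the document is not well-formed. The function name `is_well_formed` and the sample strings are illustrative assumptions, not part of any DANS tooling.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<doc><p>text</p></doc>"))  # start and end tags match
print(is_well_formed("<doc><p>text</doc>"))      # <p> is never closed
```

This only tests whether the file can be parsed; it says nothing about whether the elements obey the rules of a schema.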
Valid markup language documents are well-formed and moreover comply with an extra set of rules about how the marked up elements hang together.
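The difference between well-formed and valid can be illustrated with a small sketch. It does not use a real schema language; instead it hand-codes one hypothetical rule ("every book element has exactly one title child") as a stand-in for the kind of constraint a schema would express. The element names and the function `violations` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def violations(xml_text: str) -> list[str]:
    """Check one hypothetical rule: every <book> has exactly one <title>."""
    # Parsing comes first: a document that is not well-formed raises ParseError.
    root = ET.fromstring(xml_text)
    problems = []
    for index, book in enumerate(root.iter("book"), start=1):
        count = len(book.findall("title"))
        if count != 1:
            problems.append(f"book {index}: expected 1 title, found {count}")
    return problems


# Both documents are well-formed, but only the first obeys the rule.
print(violations("<library><book><title>A</title></book></library>"))
print(violations("<library><book/></library>"))
```

In practice such rules are written down in a schema and checked by a validating parser rather than by ad hoc code.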
Through schemas, entirely new “file formats” can be defined, such as
|Text Encoding Initiative (TEI)||format and annotate text|
Whether a file can be validated depends on the presence of such a schema, and that is in itself a matter of completeness.
Markup may be based on the use of multiple files in additional file formats.
For TeX this concerns style files and package files. By studying the log file of a TeX run, one can see which files are needed for that run.
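A rough sketch of how such a log could be scanned: TeX writes an opening parenthesis followed by the path each time it opens a file, so a regular expression can collect those paths. The pattern below assumes POSIX-style paths without spaces, and the sample log is made up for illustration. Real logs wrap long lines, which can cut a path in two, so the result should be treated as a starting point rather than a complete list.

```python
import re

# TeX logs show "(" followed by the path of every file that is opened.
# Assumption: paths start with "/", "./" or "../" and contain no spaces.
FILE_PATTERN = re.compile(r"\((\.{0,2}/[^\s()]+)")


def files_in_log(log_text: str) -> list[str]:
    """Return the distinct file paths mentioned in a TeX log, in order."""
    seen = []
    for path in FILE_PATTERN.findall(log_text):
        if path not in seen:
            seen.append(path)
    return seen


# A shortened, invented log fragment.
sample_log = """\
(./paper.tex
LaTeX2e <2023-11-01>
(/usr/share/texmf/tex/latex/base/article.cls
Document Class: article
(/usr/share/texmf/tex/latex/base/size10.clo))
(./mystyle.sty) (./paper.aux) )
"""
for path in files_in_log(sample_log):
    print(path)
```

Files with a relative path, such as `./mystyle.sty` in the sample, are the ones most likely to be missing from a deposit.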
Last, but not least, the presence of a schema is a matter of completeness. There are four possibilities about where the schema can be found:
- inside the marked up file;
- on the local file system relative to the marked up file;
- on the local file system, absolutely addressed;
- at an online location, addressed by a URL.
Ad 1. This is the easiest case. The schema will always be available.
Ad 2. Care must be taken that the schema is archived as well, and in the same relative position to the marked up file.
Ad 3. Care must be taken that the schema is archived as well, and possibly the marked up file must be changed so that the location of the schema is given in a relative way.
Ad 4. Ideally, a local copy of the schema should be attached, unless it is available at a reliable public service.
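To see which of these cases applies to an XML file, one can inspect the schema references on its root element. The sketch below reads the standard `xsi:schemaLocation` and `xsi:noNamespaceSchemaLocation` attributes and labels each location. The function names, the labels, and the example URL are assumptions for illustration; the code assumes POSIX-style paths and does not detect schemas embedded in the file itself (such as an internal DTD subset).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XSI = "http://www.w3.org/2001/XMLSchema-instance"


def classify(location: str) -> str:
    """Label a schema location as 'online', 'absolute', or 'relative'."""
    parsed = urlparse(location)
    if parsed.scheme in ("http", "https"):
        return "online"
    if parsed.scheme == "file" or location.startswith("/"):
        return "absolute"
    return "relative"


def schema_locations(xml_text: str) -> list[tuple[str, str]]:
    """List (location, kind) pairs for schemas referenced on the root element."""
    root = ET.fromstring(xml_text)
    # xsi:schemaLocation holds whitespace-separated "namespace location" pairs,
    # so every second token is a location.
    tokens = root.get(f"{{{XSI}}}schemaLocation", "").split()
    locations = tokens[1::2]
    single = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if single:
        locations.append(single)
    return [(loc, classify(loc)) for loc in locations]


example = (
    '<record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="urn:a https://example.org/a.xsd urn:b schemas/b.xsd"/>'
)
for location, kind in schema_locations(example):
    print(kind, location)
```

A location labelled "absolute" or "relative" signals that the schema file itself has to be included in the deposit.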
If a non-standard schema is used, the data depositor should consult DANS beforehand.