Text (plain)
Readable text without special formatting codes
In short
Text content that is readable without special software.
item | info |
---|---|
formats | CSS, CSV, HTML, JSON, Markdown, SQL, TeX, Text, XML, YAML |
extensions | cfg, css, csv, htm, html, ini, json, log, lst, md, sql, sty, tex, text, tsv, txt, xml, yaml |
related types | Markup, Program code, Text (formatted) |
Description
Text is a broad category of data. For some, text is something that you type with Microsoft Word, for others it is what they read in a PDF, and for yet others it is the program code they hack in a text editor.
For the purposes of archiving we divide the text data type into Text (formatted) and Text (plain).
Plain text consists of text strings without any particular layout information other than that which can be achieved by spaces, tabs and newlines.
Plain text is mostly used for IT purposes:
- writing quick notes, using a simple program like Notepad, often with extension .txt; note that Markdown files are themselves plain text, but they are used to represent Text (formatted) as well;
- writing software code (programs), using a text editor or an IDE (integrated development environment); see also Program code;
- storing data with formal characteristics, in formats such as JSON, CSV, XML and SQL.
Information representation
Computer files are all sequences of bits (1 and 0). Binary files have no further general interpretation, whereas text files are interpreted as sequences of characters, divided into lines by line breaks.
Text files versus binary files
There are several notions of what a character is and what a line break is. A Windows line break (carriage return followed by line feed, CR LF) is different from a Unix/Linux/macOS line break (LF alone), and line breaks on classic Mac OS, up to version 9, are different again (CR alone).
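The three line-break conventions can be told apart by looking at the raw bytes. The following is a minimal sketch, not a complete tool; the function names `count_line_breaks` and `normalize_line_breaks` are made up for this illustration.

```python
def count_line_breaks(data: bytes) -> dict:
    """Count each line-break convention in raw bytes."""
    crlf = data.count(b"\r\n")
    # A CR or LF that is part of a CR LF pair must not be counted twice.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": lf, "cr": cr}


def normalize_line_breaks(data: bytes) -> bytes:
    """Convert every line break to LF.

    CR LF is replaced first, so that it does not turn into two line breaks.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# One line break of each kind: Windows, Unix, classic Mac OS.
sample = b"first\r\nsecond\nthird\rfourth"
print(count_line_breaks(sample))
print(normalize_line_breaks(sample))
```

A file that mixes conventions, as the sample does, is usually a sign that it has been edited on several systems.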
Character encoding
What a character is, is determined by an encoding, which is a system to map characters to sequences of bits.
The most ubiquitous character encoding is ASCII. It encodes a set of 128 characters, using 7 bits for each. This is a basic set consisting of unaccented Latin letters, uppercase and lowercase, digits, punctuation, arithmetical symbols, the dollar sign, space, tab, newline, carriage return, and a few others.
Later came the extensions for letters with accents and for other scripts such as Cyrillic and Greek. An early example was IBM's CP437. These extension sets were defined by code pages, each of which defined a limited supply of non-ASCII characters.
Windows had its own family of code pages, numbered 125x, for example Windows-1252 for Western European languages.
All this was common before Unicode. Text files from this era pose the difficulty that nothing in the file itself declares which code page is being used. It is a matter of trial and error to determine the right code page, and sometimes it is impossible.
This problem is carried over to older text-based formats such as CSV and SQL.
While the structure of SQL and CSV files is usually well-defined, the use of undeclared code pages remains a liability.
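The trial and error involved in guessing a code page can be illustrated as follows. This is only a sketch: the helper `decode_candidates` and the list of candidate code pages are assumptions made for the example, and in practice a human still has to judge which result looks right.

```python
def decode_candidates(raw: bytes, codecs_to_try) -> dict:
    """Decode the same bytes under several code pages.

    Returns a mapping from codec name to decoded text, or to None
    when the bytes are not valid in that code page.
    """
    results = {}
    for name in codecs_to_try:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# Four bytes: "caf" in ASCII, followed by one byte above 127.
raw = bytes([0x63, 0x61, 0x66, 0xE9])

candidates = decode_candidates(raw, ["cp437", "cp1252", "cp1251"])
for name, text in candidates.items():
    print(name, repr(text))
```

All three decodings succeed, but only one of them yields the intended word. Nothing in the bytes themselves says which one that is.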
Unicode
When Unicode arrived, it had the promise to tidy up most character issues. The Unicode standard is a major achievement. It not only maps nearly every written character onto a unique number (its code point), it also defines the notions of upper case and lower case intelligently, and it defines types of characters, such as letters, numerals, punctuation, and much more.
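The character properties mentioned above can be inspected with Python's standard `unicodedata` module. The sketch below is illustrative only; the helper name `describe` is an assumption of this example.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",            # the unique number
        "name": unicodedata.name(ch, "<unnamed>"),   # official character name
        "category": unicodedata.category(ch),        # e.g. Ll = lowercase letter
        "upper": ch.upper(),                         # case mapping
    }


for ch in ["é", "ß", "\u0663", "!"]:
    print(describe(ch))
```

The case mapping is not always one character to one character: the uppercase form of ß is the two-letter string SS. The character U+0663, an Arabic-Indic digit, is classified as a decimal digit (category Nd) just like the ASCII digits.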
Last but not least, associated with Unicode are several encodings to map the unique numbers to streams of bits in efficient ways.
In today's world, UTF-8 is very common, and especially suited to Western languages, because it encodes the ASCII characters exactly as ASCII does, in a single byte each.
Other encodings are UTF-16 and UTF-32.
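To make the difference between these encodings concrete, the sketch below counts how many bytes each one needs for a few characters. The helper name `encoded_sizes` is made up for this example, and the little-endian variants are chosen only so that no Byte Order Mark is added to the output.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes needed for the text in each Unicode encoding."""
    # The "-le" variants are used so that no Byte Order Mark is prepended.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for ch in ["A", "é", "€", "😀"]:
    print(repr(ch), encoded_sizes(ch))

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print("plain text".encode("utf-8") == "plain text".encode("ascii"))
```

UTF-8 and UTF-16 are variable-length encodings, while UTF-32 always uses four bytes per character.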
Specifying a Unicode encoding
File formats such as XML make it clear which Unicode encoding is being used, for instance through the encoding given in the XML declaration at the start of the file.
But in general, file types for plain text do not specify the encoding in a standard way.
The recommendation is to let a file in a Unicode encoding other than UTF-8 start with a special character, the Byte Order Mark (BOM), from which most applications can deduce the encoding that is being used.
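Deducing the encoding from a BOM amounts to comparing the first bytes of the file with a handful of known byte sequences. The following sketch uses the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` is an assumption of this example, and real applications may apply further heuristics.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
# Consequently, a UTF-16 file whose first character is U+0000 would be
# misread as UTF-32; this sketch ignores that rare case.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding announced by a BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


with_bom = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
print(sniff_bom(with_bom))
print(sniff_bom(b"hello"))
```

When no BOM is found, the function returns None and the encoding remains unknown. UTF-8 is then a common guess, but only a guess.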