Unicode Space Characters

by David Jones

A Survey

Unicode defines the codepoint U+0020 SPACE for a space character (from ASCII).

U+0020 is a sort of general purpose generic space; the one you get if you’re typing ordinary text into an ordinary program and you press the space bar.

Unsurprisingly, there are quite a few more spaces.

Perhaps next most popular is U+00A0 NO-BREAK SPACE. A NO-BREAK (or non-breaking) space is used where it would be unwise to break the line: in titles like Mx Jones, dates, placenames; and following the SI system recommendation of putting a space between the number and the unit in weights and measures.

Unicode has a whole bunch more spaces in the General Punctuation Block at U+2000 to U+200A (as well as a few more hiding elsewhere). Many of these are based on fractional divisions of a whole Em.
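One way to find all of them, including the ones hiding elsewhere, is to ask for every character in the Unicode General Category Zs (Space Separator). This is a minimal sketch using Python's standard unicodedata module; the function name space_separators is made up for illustration, and the exact list depends on the Unicode version bundled with the Python in use.

```python
import sys
import unicodedata


def space_separators():
    """Return (codepoint, name) pairs for every character in category Zs."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Zs":
            found.append((cp, unicodedata.name(ch)))
    return found


# List each space separator in the style used in this article.
for cp, name in space_separators():
    print(f"U+{cp:04X} {name}")
```

Note that U+200B ZERO WIDTH SPACE is a format character rather than a Space Separator, so it does not appear in this list.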

With notes from real fonts, based on an informal survey (of at least Arial, Tahoma, and Times New Roman as installed on my macOS computer), i can say the following:

There is also U+202F NARROW NO-BREAK SPACE which is not commonly implemented; Unicode says it is typically the width of a THIN or a mid (FOUR-PER-EM), that is ⅕ or ¼ Em. Despite that, in Microsoft Sans Serif it is a half of U+0020 SPACE, making it significantly narrower than ⅕ Em.

The Em is the size of the font in use. In metal typesetting it is the height of the metal body. In digital typesetting it is the selected point size; for example, 1 Em is 12 points in a 12-point font.

Given that the actual widths of these spaces are defined by the font (sometimes), it’s not clear to me that the EM, EN (a half EM), THREE-PER-EM, FOUR-PER-EM, and SIX-PER-EM spaces have much use in digital typography. In any particular font they might not be defined, or might not be defined to be the expected fractions of the Em. Though i suppose that within a particular system with a fixed set of engineered fonts (for example, system-supplied fonts in Windows or macOS), they might be useful for directly controlling space in user interfaces.

The spaces not directly based on fractions of the Em are THIN and HAIR space; a THIN is used in traditional French typography which puts a THIN space inside quotation marks, and before punctuation marks like « ; » and « : ».

Typesetting SI weights and measures, like 68 kg, might use a THIN SPACE between the number and the unit; possibly a NARROW NO-BREAK SPACE.

Some recommendations for typesetting large numbers have spaces in between blocks of 3 digits: 1 000 000. It may be sensible to use NARROW NO-BREAK SPACE U+202F for this, so that the entire number is considered as a single word, according to the Unicode UAX #29 word boundary algorithm.
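Grouping digits this way is easy to do programmatically. Here is a minimal sketch in Python, assuming integer input; the name group_digits is hypothetical, and it leans on the built-in "," format specifier to do the grouping before swapping in U+202F.

```python
NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE


def group_digits(n: int) -> str:
    """Format an integer with U+202F between blocks of 3 digits."""
    # The "," format specifier groups digits in threes; replace each
    # comma with the narrow no-break space.
    return f"{n:,}".replace(",", NNBSP)


print(group_digits(1000000))
```

Whether the result looks good depends on the font actually having a sensible U+202F, which, as noted above, is not a given.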

There are some other spaces: IDEOGRAPHIC SPACE, used in ideographic script, and an OGHAM SPACE MARK, used in Ogham. The last one, the OGHAM SPACE MARK, has the distinction of being the only space character in Unicode that has any printed marks; Ogham is typeset on a horizontal rule, and the OGHAM SPACE MARK is a certain length of this line with no other strokes on it.

Illustrations

Typically the space and no-break space are the same width:

█ █ U+0020 SPACE

█ █ U+00A0 NO-BREAK SPACE

Unicode has an em quad and an em space; EM QUAD decomposes to EM SPACE, meaning they are in some sense equivalent. And lo they seem to be the same size:

█ █ U+2001 EM QUAD

█ █ U+2003 EM SPACE

An en space is a smaller space, nominally one-half em; again, EN QUAD decomposes to EN SPACE.

█ █ U+2000 EN QUAD

█ █ U+2002 EN SPACE
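The decompositions can be inspected with Python's standard unicodedata module. This sketch prints the decomposition field for the quads and spaces, and checks that NFC normalization maps EM QUAD to EM SPACE; the helper name same_after_nfc is made up for illustration.

```python
import unicodedata

# Show the decomposition mapping recorded for each character.
for ch in ("\u2000", "\u2001", "\u2002", "\u2003", "\u00a0"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->",
          unicodedata.decomposition(ch))


def same_after_nfc(a: str, b: str) -> bool:
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_after_nfc("\u2001", "\u2003"))  # EM QUAD vs EM SPACE
print(same_after_nfc("\u2003", "\u0020"))  # EM SPACE vs SPACE
```

The quads have a canonical decomposition (no tag in the decomposition field), so they normalize away under NFC. EM SPACE and NO-BREAK SPACE only have tagged compatibility decompositions to U+0020, so they survive NFC and are only folded to an ordinary space by NFKC or NFKD.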

We can arrange most of the spaces by decreasing size (approximately, as we can’t know what the widths are in the fonts used for display):

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2001 EM QUAD

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2000 EN QUAD

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2007 FIGURE SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2004 THREE-PER-EM SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2005 FOUR-PER-EM SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2009 THIN SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2006 SIX-PER-EM SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+200A HAIR SPACE

The Unicode Spec has this to say for FIGURE, PUNCTUATION, THIN, and HAIR, whose sizes are not rigidly determined to be fractions of an Em:

U+2007 figure space has a fixed width, known as tabular width, which is the same width as digits used in tables. U+2008 punctuation space is a space defined to be the same width as a period. U+2009 thin space and U+200A hair space are successively smaller-width spaces used for narrow word gaps and for justification of type.

In Glyphs, the default for U+2007 FIGURE SPACE is to make it the same width as 0 (zero).

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2007 FIGURE SPACE

█0█0█0█0█0█0█0█0█0█0█0█0█ U+0030 DIGIT ZERO
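If FIGURE SPACE really is the width of a digit, it can be used to right-align numbers in a column without switching to a monospaced font. A minimal sketch in Python, assuming the display font has tabular digits and a matching FIGURE SPACE (not guaranteed); pad_number is a hypothetical name.

```python
FIGURE_SPACE = "\u2007"  # U+2007 FIGURE SPACE


def pad_number(n: int, digits: int) -> str:
    """Right-align n in a field of the given number of digit widths."""
    return str(n).rjust(digits, FIGURE_SPACE)


# Each line has the same number of digit-width cells.
for value in (7, 42, 1234):
    print(pad_number(value, 5))
```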

The other non-breaking space, NARROW NO-BREAK SPACE, seems to be the same size as THIN SPACE.

█ █ █ █ █ █ █ █ █ █ █ █ █ U+202F NARROW NO-BREAK SPACE

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2009 THIN SPACE

PUNCTUATION SPACE, according to The Unicode Spec, is “a space defined to be the same width as a period” (PERIOD in Unicode is now called U+002E FULL STOP). No idea what MEDIUM MATHEMATICAL SPACE should be used for.

█ █ █ █ █ █ █ █ █ █ █ █ █ U+2008 PUNCTUATION SPACE

█.█.█.█.█.█.█.█.█.█.█.█.█ U+002E FULL STOP

█ █ U+205F MEDIUM MATHEMATICAL SPACE

There is also a space for use with Chinese, Japanese, and Korean ideographic scripts. It is designed to be the same width as ordinary ideograms (ideographic fonts typically have all characters the same width).

█ █ U+3000 IDEOGRAPHIC SPACE

Fractions

We can bunch-up multiples of particular spaces to see if they add up to the Em space.

Here's 2 EN SPACE against an EM SPACE, 3 THREE-PER-EM SPACE, and so on.

█ █ █ █ █ U+2003 EM SPACE

█  █  █  █  █ U+2002 EN SPACE

█   █   █   █   █ U+2004 THREE-PER-EM SPACE

█    █    █    █    █ U+2005 FOUR-PER-EM SPACE

█      █      █      █      █ U+2006 SIX-PER-EM SPACE

█ █ █ █ █ U+2003 EM SPACE

Reflections and Divisions

At least in the font I am viewing this document in right now, the 3-, 4-, and 6-per-em spaces are indeed the expected fractions of the Em.

The en space is notionally half an em.

So if we divide an Em into 12ths, and use those fractions we would have spaces of 2/12, 3/12, 4/12, 6/12 (en), 12/12 (em). We could put a hair space at 1/12 (as it is in the fonts i found it in).

It seems somewhere between possible and likely that in metal typesetting, the metal bodies carrying the types of a font would come in widths that were multiples of a twelfth of an em. These spaces (1, 2, 3, 4, 6) are not only useful in their own right, but allow every multiple of 1/12 below 11/12 to be made with at most two spaces, with only 11/12 taking three. Actually we know that the Monotype casting system used a system of 18ths, so perhaps take all that with a grain of salt (but it also seems that the Monotype could control the width of spaces with more precision than other elements).
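The claim about combining spaces is easy to check by brute force. This sketch in Python measures everything in twelfths of an Em and assumes the hair space is 1/12 (which is the supposition above, not something Unicode specifies); fewest_spaces is a made-up helper that tries ever larger combinations until one adds up.

```python
from itertools import combinations_with_replacement

# Widths in twelfths of an Em: hair (assumed), six-per-em,
# four-per-em, three-per-em, en.
SPACES = (1, 2, 3, 4, 6)


def fewest_spaces(target, widths=SPACES, limit=4):
    """Return a shortest tuple of widths summing to target, or None."""
    for count in range(1, limit + 1):
        for combo in combinations_with_replacement(widths, count):
            if sum(combo) == target:
                return combo
    return None


for twelfths in range(1, 12):
    print(f"{twelfths}/12:", fewest_spaces(twelfths))
```

Every width from 1/12 to 10/12 comes out as one or two spaces; 11/12 is the only one that needs three (for example 1 + 4 + 6).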

If the HAIR SPACE is set to 1/12th Em, and we use a 12 pt font (that is, 1 Em = 12 pt), then at that font size the hair becomes the same as the point: 72 hairs to an inch (exactly, in the digital world). I don’t know if this is important or convenient, but it seems like a nice fun-fact.

Addendum on meti

Much of the poking about with font files was done with my own meti tool from Font 8; short for Metric Information (metric meaning widths).

Here’s an example output, using the Microsoft Sans Serif font that comes with macOS:

; meti '/System/Library/Fonts/Supplemental/Microsoft Sans Serif.ttf' | egrep ',(period|space|zero|uni200.|uni20[25]F|uni00A0),'
3,space,hmtx,width,544,lsb,0
17,period,hmtx,width,569,lsb,182
19,zero,hmtx,width,1139,lsb,92
670,uni200C,hmtx,width,0,lsb,-36
671,uni200D,hmtx,width,0,lsb,-219
672,uni200E,hmtx,width,0,lsb,-36
673,uni200F,hmtx,width,0,lsb,-431
2696,uni2000,hmtx,width,1024,lsb,0
2697,uni2001,hmtx,width,2048,lsb,0
2698,uni2002,hmtx,width,1024,lsb,0
2699,uni2003,hmtx,width,2048,lsb,0
2700,uni2004,hmtx,width,683,lsb,0
2701,uni2005,hmtx,width,512,lsb,0
2702,uni2006,hmtx,width,341,lsb,0
2703,uni2007,hmtx,width,1139,lsb,0
2704,uni2008,hmtx,width,569,lsb,0
2705,uni2009,hmtx,width,410,lsb,0
2706,uni200A,hmtx,width,171,lsb,0
2707,uni200B,hmtx,width,0,lsb,0
3032,uni202F,hmtx,width,272,lsb,0
3033,uni205F,hmtx,width,455,lsb,0

This is an old-school TTF font, so it has its UPEM (units per Em) set to 2048 (traditional for TrueType; an OpenType font with CFF outlines would more likely use 1000). The EM SPACE, uni2003, is indeed set to 2048. We can see that FIGURE SPACE, uni2007, is indeed the same as zero; PUNCTUATION SPACE, uni2008, is the same width as period; and so on.
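The fractions can be read straight off those numbers. Here is a small sketch in Python that takes a few of the lines above (copied by hand into a string) and divides the UPEM by each width. The field positions are an assumption based on how the output looks (glyph name second, advance width fifth), and the helper names parse_widths and per_em are made up for this illustration; none of this is part of meti.

```python
UPEM = 2048  # units per Em for this font, as noted above

# A few lines copied from the meti output above.
METI_LINES = """\
2699,uni2003,hmtx,width,2048,lsb,0
2698,uni2002,hmtx,width,1024,lsb,0
2700,uni2004,hmtx,width,683,lsb,0
2701,uni2005,hmtx,width,512,lsb,0
2705,uni2009,hmtx,width,410,lsb,0
2702,uni2006,hmtx,width,341,lsb,0
2706,uni200A,hmtx,width,171,lsb,0
"""


def parse_widths(text):
    """Map glyph name to advance width (assumed field layout)."""
    result = {}
    for line in text.splitlines():
        fields = line.split(",")
        result[fields[1]] = int(fields[4])
    return result


def per_em(width, upem=UPEM):
    """How many of this width fit in one Em."""
    return upem / width


for name, width in parse_widths(METI_LINES).items():
    print(f"{name}: {width} units, about 1/{per_em(width):.2f} Em")
```

In this font the THREE-, FOUR-, and SIX-PER-EM spaces are the expected fractions to within rounding to whole units, THIN SPACE (uni2009) is very close to 1/5 Em, and HAIR SPACE (uni200A) is very close to 1/12 Em.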

END