Came upon this article at Macleans.ca about online gambling in Kahnawake, and noticed what appeared to be a strange typo in the headline:
As of this writing, it’s still not corrected, which I guess means that nobody at Maclean’s checks articles once they’ve gone online.
Here’s how the end of that headline appears in the HTML code:
Even if it means starting a ï¬ght.
So it’s not my browser. It explicitly says “lowercase i with umlaut, mathematical negation symbol, and non-existent character with code #129.” My browser just did what it was told.
But why did this happen? For that we have to delve into two technical subjects I’ll do my best to explain: Unicode and ligatures.
Unicode is a solution to a giant computer problem. Back in the day, when ASCII gave us a proper standard for how to represent basic characters in binary computer form, it consisted of just the basics. Uppercase and lowercase letters, digits 0-9, and the punctuation you’d find on a standard U.S. 101-key keyboard.
Unfortunately, no provision was made for characters with accents, nor any non-Latin alphabets. So individual manufacturers began creating their own incompatible, extended versions of ASCII to include those characters. This led to huge compatibility headaches for anyone working in a non-English language, some of which are still with us today.
Unicode was designed as the Great Bible of characters. Instead of limiting itself to 256 of them (the number of different combinations of eight ones and zeroes), it brought in even the most remote of languages and bizarre of mathematical symbols, and has over 100,000 characters (each assigned its own numeric code point) in its massive database.
Unicode has been brought into our daily Internet lives through UTF-8, a character encoding that keeps compatibility with ASCII (and therefore doesn’t break too much). It’s variable-length, from one to four 8-bit bytes. The first 128 characters (which are identical to ASCII) require only one byte, but everything else needs more. That’s where the trouble starts. If you encode a special character as UTF-8 and decode it as something else (like ISO-8859-1, which was the standard Latin encoding before UTF-8 took over), each byte gets interpreted as its own separate character, and you see gibberish instead of the one character you want.
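The round trip is easy to reproduce in Python (my choice of language here, not anything Maclean’s uses): encode one accented character as UTF-8, then decode the bytes as Latin-1.

```python
# Encode a single non-ASCII character as UTF-8, then (wrongly) decode
# the resulting bytes as ISO-8859-1/Latin-1 -- the classic mojibake recipe.
text = "é"                        # U+00E9, a single character
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                 # b'\xc3\xa9' -- two bytes, not one
garbled = utf8_bytes.decode("latin-1")
print(garbled)                    # 'Ã©' -- each byte became its own character
```

One character in, two characters out: exactly the kind of mess we’re about to see with the headline.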
But, I hear you nerd-types say, “f” and “i” are ASCII letters. Why are they causing problems?
That brings us to ligatures. Those are what you get when you combine two letters into one glyph. “Æ” (AE) is a perfect example, but many hoity-toity printing types (like, apparently, Maclean’s) use “stylistic” ones that combine letters to make them look nicer. Professional desktop publishing programs are even designed to substitute them automatically as they’re typed, to make our lives easier. The most commonly mentioned stylistic ligature is “fi → ﬁ”, which merges the dot of the i with the top of the f to make it look nicer.
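You can poke at the ligature itself from Python’s standard library (again, my sketch, not anything from the article) and see the exact three UTF-8 bytes that will matter below:

```python
import unicodedata

lig = "\ufb01"                             # the "fi" ligature, U+FB01
print(unicodedata.name(lig))               # LATIN SMALL LIGATURE FI
print(lig.encode("utf-8"))                 # b'\xef\xac\x81' -- three bytes
# Compatibility normalization splits it back into plain "f" + "i":
print(unicodedata.normalize("NFKC", lig))  # fi
```

That last line is the two-second fix the Maclean’s uploader never made, done by machine.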
So that’s what happened here:
- The desktop publishing software changed “fi” to the ligature “ﬁ” in the print version
- The copy was entered into a computer database as a three-byte UTF-8 character 0xEF, 0xAC, 0x81 (as it should be), but came out as three ISO-8859-1 characters EF, AC, 81, which correspond to ï, ¬ and a non-existent character (which is translated in browsers to “?”)
- A program designed to ensure that special characters are properly encoded ironically converts these bytes into three HTML character entities (&#239; &#172; &#129;).
- My browser, reading the HTML as it’s supposed to be read, does what it says and displays three nonsense characters.
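The whole chain above can be replayed in a few lines of Python (my reconstruction of the steps, not Maclean’s actual pipeline):

```python
lig = "\ufb01"                    # step 1: "fi" ligature, as typeset in print
utf8 = lig.encode("utf-8")        # the correct three-byte UTF-8 form
# Step 2: the bytes get misread as Latin-1. (Python's latin-1 codec maps
# every byte 0x00-0xFF straight to U+0000-U+00FF, including the
# control-character range where 0x81 lives.)
mangled = utf8.decode("latin-1")  # 'ï', '¬', and invisible U+0081
# Step 3: an over-eager sanitizer turns each character into an HTML entity
entities = "".join(f"&#{ord(c)};" for c in mangled)
print(entities)                   # &#239;&#172;&#129;
```

Those are exactly the three entities sitting in the headline’s HTML.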
Of course, all this technical stuff is irrelevant. Whoever uploaded this story at Maclean’s would have seen that the incorrect characters were being displayed, and it would have been a two-second fix to retype “fi” into the headline.
But this wasn’t done. Because when traditional media go online, blindingly obvious typos in headlines aren’t important enough for anyone to pay attention to them. It’s just copy, paste, and forget about it.