Python 2 & 3, Software Engineering, Internationalisation, and Welsh

Posted 19 Dec 2021

I’ve programmed for a while, but when I first started using the Python programming language, it was at the very dawn of an unprecedented schism between Python 2 and Python 3.

In the future, or even the recent present, this may sound trivial or even silly. But in certain software engineering circles it was twelve years of bitter trench-warfare.

Python 3 had major potentially breaking changes. No software engineer – or any engineer – likes a foundation changing underneath them. There must be a compelling benefit. People invest a lot of work and there is a lot of inertia. Countless hours of free, open source, labour is donated to the commons and a whole programming “ecosystem” must be carried forward.

The main break was due to how Python 2 handled strings of text – today standardised as “Unicode” – versus raw bytes of data.

Whether this was a fault of the language itself, or an indirect fault in how it encouraged utility libraries and frameworks to be written, Python 2 web frameworks and applications commonly choked and crashed as soon as certain letters appeared – such as a common Welsh-alphabet letter like “ŷ” (Latin y with a circumflex).

Every Welsh child learns this letter in school - without it you cannot “ga i fynd i’r tŷ bach” (“go to the toilet”). This sentence was one of my test cases for my bakefont3 Python3/C/OpenGL text-rendering library.

“Ŷ” is a character outside both 7-bit ASCII and the historical 8-bit “Latin-1” (ISO/IEC 8859-1) character encoding, which were often used in the past “by default” (in an Anglocentric way) to interpret binary digits (the “ones and zeroes” used by computers) as text that (English) humans can understand.

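Latin-1’s 256 characters map directly onto the first 256 Unicode code points, but its bytes are not compatible with UTF-8 (the dominant Unicode encoding) beyond the ASCII range - confusing the two leads to the characteristic “Mojibake” errors or, in Python 2’s case, often outright crashes.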

There are many languages other than Welsh where Python 2 had this problem, but that’s how I immediately experienced this design flaw within my first hours of evaluating Python 2. This led me to instead immediately adopt and bet on the upstart Python 3 – despite a much weaker software ecosystem at the time.

Even though it was possible to handle any written language in either version, it was only in Python 3 that this was “easy by default”.
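As a rough sketch of what “easy by default” means in practice, the Python 3 snippet below (an illustration of mine, with arbitrary variable names) shows that `str` is always Unicode text, that `bytes` is a separate type, and that the two cannot be mixed by accident. In Python 2, by contrast, combining a Unicode string with a byte string triggered an implicit ASCII decode, which is one way a single “ŷ” could raise an exception far from where the data came in.

```python
text = "tŷ"                   # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: the encoded form, made explicitly

char_count = len(text)        # counts characters
byte_count = len(data)        # counts bytes ("ŷ" takes two in UTF-8)
upper = text.upper()          # case mapping understands "ŷ"

# Python 3 refuses to guess an encoding when text and bytes meet.
try:
    text + data
    mixing_rejected = False
except TypeError:
    mixing_rejected = True

print(char_count, byte_count, upper, mixing_rejected)  # 2 3 TŶ True
```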

At Tawesoft, as a newly formed independent software engineering and website development company with a social conscience, we recognised that bilingual Welsh/English websites were a legal requirement in Welsh public sector contracts. Although it was not legally required in the commercial sector, we also bet on Welsh language support being not only a competitive advantage there but also a minimal ethical duty.

Countless webpages and tutorial bytes have been spilt on the Python 2 / 3 Unicode & bytes problem. Most of it is rightfully lost to history.

Despite its importance in the Python ecosystem ten-plus years ago, Unicode support is now – thanks to language designers – taken for granted in better-designed languages (Python 3 and others).

Even older languages, like C, through the dominance of “8-bit clean” UTF-8, can be considered to support Unicode – although this must be carefully considered at interface boundaries.
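One example of the care needed at an interface boundary is a fixed-size byte buffer, which is common in C APIs. The sketch below uses Python only to demonstrate the byte-level behaviour; `truncate_utf8` is a hypothetical helper of my own, not a function from any library. It assumes the input is already valid UTF-8 and relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte (10xxxxxx), the cut
    # falls inside a character, so back up to the start of that character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


encoded = "tŷ".encode("utf-8")   # b"t\xc5\xb7": 2 characters, 3 bytes

# A naive slice to 2 bytes would keep 0xC5 and drop 0xB7, leaving an
# invalid sequence. The helper drops the whole character instead.
safe = truncate_utf8(encoded, 2)
print(safe)  # b't'

# UTF-8 never uses a zero byte except for U+0000 itself, which is why
# NUL-terminated C strings can carry it unchanged.
has_nul = 0 in encoded
print(has_nul)  # False
```

The same boundary check translates directly to C, where byte counts and character counts must also be kept distinct.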

Part two (of two) of this article will have an in-depth look at programming libraries for localisation, with a particular focus on the C and Go programming languages, Welsh and Celtic consonant mutation, gendered Romance linguistics, and Welsh & Japanese honorifics, with the massive disclaimer that despite my best efforts I am terrible at learning languages.

As a special personal appeal, if you have any useful insights related to Sign languages that you think might inform this study, please do get in touch.

Also please do email me if you specifically have an interest in software localisation and want a notification for part two.

I can be reached at golightly.ben@googlemail.com.