GFX::Monk - Ruby's unicode treatment

I recently came across this enlightening post on the changes to strings and encodings in ruby 1.9. As a python lover who has only used ruby 1.8 so far, it’s interesting to see the different approaches to very similar problems in python 3 and ruby 1.9.

I may be biased, but ruby’s implementation sounds like it will lead to a lot of pain and bugs, while python’s implementation will lead to a little more pain as you are forced to learn about encodings, and a lot fewer bugs (as you are forced to learn about encodings). Let me explain why:

Ruby conflates the concept of strings and bytes

This has been the biggest problem with encodings in most languages for a long time. It saddens me to see that ruby is still not fixing it. In python 3, there are bytes and there are strs. Trying to combine the two (without converting one to the other by explicitly encoding or decoding) will result in a TypeError, whatever the data happens to contain. In ruby, doing the wrong thing will only sometimes raise an error.
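A minimal sketch of the python 3 side of that claim (the variable names are mine, purely for illustration). Note that the data here is plain ASCII, and the mix-up is still rejected:

```python
text = "hello"     # str: a sequence of unicode code points
data = b"hello"    # bytes: raw binary data that merely looks the same

error = None
try:
    text + data    # mixing str and bytes without converting
except TypeError as err:
    error = err
    print("TypeError:", err)

# Converting explicitly names the encoding and makes the intent visible.
combined = text + data.decode("ascii")
print(combined)
```

The failure depends only on the types involved, not on whether the input happens to be ASCII, so even an english-only test suite will hit it.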

Yes, both languages are dynamically typed - but the difference is still important. In python, you will get an error as soon as you run bad code. In ruby, you will get an error as soon as you run bad code with input data that is non-ASCII (and sometimes only if you use multiple encodings). And how much of your test data is ASCII? If your native tongue is English, probably most of it. That means your test cases will not catch encoding errors - your users will.

Note that neither language can prevent bugs that are due to incorrectly specifying the encoding of a given string. But in python, you have to specify this a lot less often (see my next point) and the conversion is represented at a type level, therefore it should be easier to spot and a more obvious test candidate.
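To illustrate the kind of bug that neither language can prevent, here is a small python 3 sketch. The function `read_name` is hypothetical and deliberately wrong: its input is really UTF-8, but it decodes it as Latin-1.

```python
def read_name(raw: bytes) -> str:
    # Hypothetical boundary function with a deliberate bug:
    # the bytes are UTF-8, but we claim they are Latin-1.
    return raw.decode("latin-1")

ascii_input = "Bob".encode("utf-8")
accented_input = "José".encode("utf-8")

ascii_ok = read_name(ascii_input) == "Bob"         # ASCII data hides the bug
accented_ok = read_name(accented_input) == "José"  # non-ASCII data exposes it

print(ascii_ok)
print(accented_ok)
print(read_name(accented_input))
```

No exception is raised in either case; the second name simply comes out garbled. The one consolation is that the mistake sits in a single, visible `decode` call, which makes it an obvious candidate for a test with non-ASCII input.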

I liken this approach to weak typing. In perl, 1 is less than 2. And 9 is less than 10. Unless one of your arguments is actually a string, at which point perl silently converts them both to strings. What’s worse, you will still think it works while you happen to use values 1 through 9 - only once you reach 10 will perl happily declare “10” to be less than “9”.

Ruby and python are both strongly typed and will not allow you to compare a number to a string¹, because that doesn’t make sense. The same goes for strings and bytes. The distinction is easier to miss because in the most common case (ASCII) the two look identical, but combining them in any operation (without explicit conversion) simply does not make sense. It may happen to work in a bunch of common cases, but that’s not how you write robust software.
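The number-versus-string analogy is easy to check in python 3. This is only a sketch of the comparison behaviour, with variable names of my own choosing:

```python
numbers_ordered = 9 < 10       # numeric comparison
strings_ordered = "10" < "9"   # string comparison goes character by character: "1" sorts before "9"

error = None
try:
    9 < "10"                   # mixing the two types is refused outright
except TypeError as err:
    error = err

print(numbers_ordered)
print(strings_ordered)
print(type(error).__name__)
```

Both comparisons are well defined on their own terms; it is only the silent switch from one to the other that causes trouble, which is why refusing the mixed comparison is the safer behaviour.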

Ruby makes you think about the encoding of every single string

If I read that article correctly, there is no String-like type in ruby that allows you to arbitrarily concatenate two strings, regardless of their original encoding. Rather the normal concatenation functions will happen to work² if the two strings are of the same encoding, or if one value happens to be only ASCII. If you want to properly concatenate two strings, you either have to:

1. check the encoding at each concatenation operation (incredibly cumbersome), or
2. ensure your entire application (or at least the part that you wrote) converts strings to your preferred encoding at every boundary. This includes not only your interfaces to the outside world (HTTP, files on disk) but also potentially any string that you receive from a library.

And if you’re writing a library, you’ll have to resort to option (1), since you don’t get to control the rest of the application.

In python, the only times you have to deal with encodings are when going to or from bytes, which is exactly when you need to know about encodings. So you still need to specify encodings for network traffic, or data on disk. But once you have a string, you never need to know its encoding, and it will combine just fine with any other string. What’s more, you will never be under the mistaken impression that your data is a string, when in fact it is really just binary data (or at least, you’ll find out as soon as you run that code).
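As a rough sketch of what that looks like in python 3, assume two byte payloads arrive in different encodings and that we know which encoding each one uses (the payloads and names below are made up for the example):

```python
# Bytes as they might arrive from two different sources.
latin1_bytes = "naïve".encode("latin-1")
sjis_bytes = "日本語".encode("shift_jis")

# Decode once, at the boundary, naming the encoding of each source.
first = latin1_bytes.decode("latin-1")
second = sjis_bytes.decode("shift_jis")

# From here on these are plain strs: no encoding checks are needed to combine them.
combined = first + " " + second

# Encode once more on the way out, again naming the encoding.
output = combined.encode("utf-8")

print(combined)
print(len(combined), len(output))
```

The encodings appear only in the `decode` and `encode` calls; the concatenation in the middle neither knows nor cares where the text came from.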

Python libraries that don’t need to encode or decode data to or from bytes simply deal with the unicode str type, and never have to care about encodings. By contrast, it sounds like erb and other ruby templating libraries will either have to do a lot of encoding checks / conversions, or will only work on strings with the correct encoding.

A caveat: perhaps unicode is not sufficient?

According to the original post, ruby cannot use a single unicode type because some encodings (specifically Shift-JIS) cannot be losslessly converted to unicode. This is the first I’ve heard of it, and information is scarce on the matter (and the article itself provides no further clues). I did eventually find this thread on Hacker News, which suggests the article is referring to Han unification. I don’t know enough about Asian character sets to know how big an issue this actually is, but it’s an interesting one (and new to me). It would be a huge shame if unicode were not sufficient to represent some character sets, given that universality was the entire point of the unicode effort. Even so, I’m sure there are ways to deal with this situation that don’t have the unfortunate effects I’ve listed above - there is still no excuse for failing to properly separate bytes from strings, for example.

¹ Apparently python 2 does not forbid this due to a weird implementation detail (although it at least gives less misleading results; a string is always greater than an integer). Thankfully python 3 will raise a TypeError.
² Note that things which merely happen to work will almost universally stop doing so in painfully inconvenient (and often confusing) ways.
