Quick hand count: who knows what String.split() does?

Most developers probably do. Python? Easy. JavaScript? Probably. But if you’re a Ruby developer, chances are close to nil. I’m not trying to imply anything about the intelligence or skill of Ruby developers; it’s just that the odds are stacked against you.


So, what does String.split() do?

In the simple case, it takes a separator string. It returns an array of substrings, split on the given string. Like so:

py> "one|two|three".split("|")
["one", "two", "three"]

Simple enough. As an extension, some languages let you pass a second argument that limits the splitting (I’ll call it num_splits). In Python, the string is split at most this many times, like so:

py> "one|two|three".split("|", 1)
["one", "two|three"]

Ruby is similar, although you have to add one to the second argument: it specifies the maximum number of returned components rather than the number of splits performed.
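
To make the off-by-one concrete, here is a minimal sketch of the translation. The helper name split_n_times is made up for illustration, and it assumes a non-negative num_splits:

```ruby
# Hypothetical helper: turn a Python-style "number of splits" into
# Ruby's "maximum number of pieces" by adding one.
# Assumes num_splits is non-negative.
def split_n_times(str, sep, num_splits)
  str.split(sep, num_splits + 1)
end

p split_n_times("one|two|three", "|", 1)  # => ["one", "two|three"]
```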

JavaScript is a bit odd, in that it discards the rest of the string once the limit is reached:

js> "one|two|three".split("|", 2)
["one", "two"]

I don’t like the JavaScript way, but these are all valid interpretations of split. So far. And that’s pretty much all you have to know for Python and JavaScript. But Ruby? Pull up a seat.


Special cases.

a.k.a. surprises.

irb> "one||two||three".split("|")
=> ["one", "", "two", "", "three"]

Seems reasonable. But:

irb> "one  two  three".split(" ")
=> ["one", "two", "three"]

Note the double spaces in the input: Ruby’s split behaves differently when the separator is a single space character, treating any run of whitespace as one separator!

But if you don’t like that, you can use a regex!

irb> "one  two  three".split(/ /)
=> ["one", "", "two", "", "three"]

Theoretically this has the same meaning, but you get different behaviour!

The stupid thing about this example is that there is absolutely no need to treat ' ' as a special case. If you actually want that behaviour, you can get it explicitly by using /\s+/ as the separator.
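
As a rough sketch of what "explicitly" could look like: the helper name split_words is hypothetical, and the strip call is there because a pattern alone leaves an empty first element when the input starts with whitespace.

```ruby
# Hypothetical helper: spell out what split(" ") does implicitly.
# Trim both ends first, then split on runs of whitespace.
def split_words(str)
  str.strip.split(/\s+/)
end

p split_words("one  two  three")  # => ["one", "two", "three"]
p split_words("  a  b  c  ")      # => ["a", "b", "c"]
```

Anyone reading a call to split_words can see the trimming and the whitespace pattern, rather than having to know what a lone space means to split.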

And how about:

irb> "||one||two||three||".split("|")
=> ["", "", "one", "", "two", "", "three"]

Ruby’s split treats leading blanks differently from trailing blanks: leading empty strings are kept, while trailing ones are silently dropped.

Unless, that is, you pass a negative number as the num_splits argument. In that case its magnitude has no effect, but its sign changes the behaviour of split to include trailing blanks:

irb> "||one||two||three||".split("|", -2)
=> ["", "", "one", "", "two", "", "three", "", ""]

(note that a positive limit keeps trailing blanks as well, so they are only dropped when you pass no limit at all).

I can see no useful purpose for this special case. And again, if I actually wanted this behaviour, it would be easy enough to implement. In fact, I would still implement it myself to convey intent correctly, rather than rely on obscure edge cases of the Ruby standard library.
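
For instance, a minimal sketch of doing it by hand might look like this. The helper name is invented for illustration, and it still uses the negative limit internally, but only to get every field back before trimming on purpose:

```ruby
# Hypothetical helper: keep every field, then drop trailing empty
# strings explicitly so the intent is visible at the call site.
def split_without_trailing_blanks(str, sep)
  parts = str.split(sep, -1)        # negative limit keeps trailing blanks
  parts.pop while parts.last == ""  # remove them deliberately
  parts
end

p split_without_trailing_blanks("||one||two||three||", "|")
# => ["", "", "one", "", "two", "", "three"]
```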

By the way, these are not accidents - they are documented.

For bonus points, the two special cases outlined here can interact in fun ways. I didn’t mention it above, but splitting on a single space also ignores leading and trailing whitespace (also known as strip, if you actually want that behaviour). If you wanted to preserve those, you might try the negative number trick from above. But what does that give you?

irb> "  a  b  c  ".split(" ", -1)
=> ["a", "b", "c", ""]

What a mess.
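
If the goal really is to keep every blank, leading and trailing, one approach that seems to work is to avoid the single-space string altogether and pair a regex separator with the negative limit. The helper name below is hypothetical:

```ruby
# Hypothetical helper: split on each single space and keep every
# empty field. The regex separator sidesteps the " " special case,
# and the negative limit keeps trailing empty strings.
def split_on_each_space(str)
  str.split(/ /, -1)
end

p split_on_each_space("  a  b  c  ")
# => ["", "", "a", "", "b", "", "c", "", ""]
```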


So, have you got all that? Are you going to remember it next time you use split? I doubt it (I just wrote about it in depth, and I’ll still probably forget). And don’t forget to think about how these special cases could interact with the myriad other special cases and surprising behaviours in Ruby.

You might think that I’m just picking on Ruby, and that this could happen in any language. But one of the reasons I like Python so much is that not only is the language itself simple and elegant, but the community largely agrees upon some very fundamental ideas: the Zen of Python. These are more guiding principles than laws, but they are usually respected unless you have a good reason not to (and you probably don’t).

These points in particular all seem pretty relevant:

  • Simple is better than complex.
  • Special cases aren’t special enough to break the rules.
  • If the implementation is hard to explain, it’s a bad idea.

But I guess if you disagree with those, Ruby might well be for you.