Chad Perrin: SOB

2 October 2007

Python/Ruby string concatenation discussion

Filed under: Geek — apotheon @ 11:51

The discussion in response to my “I learn something new every day — this time about Python and Ruby” has gotten a bit lengthy. 25 comments is too many to really address everything properly within further comments — it’s rapidly approaching the point where I might just let others discuss the matter and not get involved. Rather than ignore it, though, I’ve decided to tackle it here, in a new “top level” post.

Jeremy Bowers made a good point about various languages’ implementations of a += operator. This is sort of an implementation detail, though, rather than a common linguistic design choice — and tends to be more suited to a compiler (in Java’s case, a bytecode compiler) than to an interpreter. Python provides the ability to compile source code to bytecode, as I understand it, and Perl does a JIT parse tree compilation every time you run it — to some extent, such an implementation might be appropriate in both cases, though it may also impose some new limitations on how the language itself will be designed in the future. For this reason, even though VM implementations of Ruby are starting to appear, I don’t know that such an implementation of the += operator (or method, in this case) would be a good choice for Ruby. Luckily, Ruby has << as well, and a collect-before-joining approach can be otherwise implemented by the programmer if so desired.

Jeremy’s also right, as far as I’m aware, about Python “generally” outperforming Ruby on an algorithm-by-algorithm basis, for current stable releases of Python and Ruby. I certainly didn’t mean to suggest that Ruby outperforms Python consistently — only that certain language design decisions limit the ability to eke greater performance out of certain types of operations, even when the design decision in question doesn’t seem directly related. In this case, it results in a greater performance benefit to Ruby for a particular type of operation than for Python, for a roughly comparable algorithm.

The heart of his final statement:

If this post says anything, it speaks to the dangers of how high-level languages can obscure the underlying algorithms, and therefore obscure the performance implications of them.

It’s true. On the other hand, unless you’re a core maintainer for one of these languages, the part that’s of most interest to you is likely to be how this affects the way you program. Since I’m not a core maintainer for any programming language at present, and have only ever directly contributed at all to a language by doing some scut-work for a C compiler project completely unrelated to these languages in particular, my focus was on how the algorithms used in implementing these languages affect the algorithms I’ll use when writing code.

A lot of attention was given to the choice of algorithm and how idiomatic it is to Python, of course. For instance, someone using the name “nirs” said:

The idiomatic Python code is:

s = []
for line in lines:
    s.append(line)
s = ''.join(s)

Similarly, Paddy3118 said:

In Python one is taught not to concatenate strings using += but to use the join method instead.

Going back to Jeremy for a moment, he made the salient point that needs to be made here:

When you use two different algorithms in two different languages, all bets are always off; language differences are generally swamped by the differences in algorithms.

To be fair, I did ask (toward the end of my original post about string concatenation in Python and Ruby) for better ways to improve string concatenation performance in Python. On the other hand, several responses seemed to be offering counterarguments rather than improvements. These two in particular are doing something completely different from what I originally addressed — string concatenation. Instead, they say that you can get similar results with better performance by doing something else. All this means, in the end, is that when you tell your doctor “It hurts when I do this,” he should tell you “Don’t do that.” Sometimes, it is unfortunately the fact that it would be nice to be able to do that.
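For concreteness, here is a rough sketch of the two approaches being contrasted: repeated concatenation with += and the collect-then-join idiom the commenters recommend. The function names, the sample input, and the `time_both` helper are all made up for illustration; none of this is the code from the original post, and no timing results are claimed.

```python
import timeit


def concat_plus_equals(lines):
    """Build the result by repeated string concatenation."""
    s = ""
    for line in lines:
        # str is immutable, so this conceptually builds a new string each
        # time (CPython may optimize the operation in some cases).
        s += line
    return s


def concat_join(lines):
    """Collect the pieces in a list, then join them once at the end."""
    parts = []
    for line in lines:
        parts.append(line)
    return "".join(parts)


# Hypothetical sample input, not data from the original benchmark.
sample_lines = ["line %d\n" % i for i in range(1000)]

# Both approaches produce the same string; only the algorithm differs.
same_output = concat_plus_equals(sample_lines) == concat_join(sample_lines)


def time_both(lines, repeat=100):
    """Return (seconds for +=, seconds for join) over `repeat` runs each."""
    return (
        timeit.timeit(lambda: concat_plus_equals(lines), number=repeat),
        timeit.timeit(lambda: concat_join(lines), number=repeat),
    )
```

Calling `time_both(sample_lines)` gives a pair of timings to compare on a given machine and interpreter version. The numbers will vary, which is rather the point about microbenchmarks.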

Of course, I just picked two names out of a hat. The same point was made as well by Simon Willison, metapundit, JamesH, Brandon Corfman, DDP, Vincent Foley, someone identified as “Anonymous”, and somewhat rudely by someone using the name Masklinn.

Someone else identified only as “Anonymous” posted the words “Python does have mutable strings. They’re called ‘lists’,” which points out that the same problem might be solved differently (as have the above-noted users of lists to avoid string concatenation), while simultaneously making a strictly incorrect statement.
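A minimal sketch of why that statement is strictly incorrect, using a throwaway example string: a Python str rejects in-place modification, while a list of characters accepts it but is still a list rather than a string until it is joined.

```python
text = "hello"

# A str cannot be modified in place: item assignment raises TypeError.
try:
    text[0] = "J"
    str_is_mutable = True
except TypeError:
    str_is_mutable = False

# A list of characters can be modified in place...
chars = list(text)
chars[0] = "J"
chars.append("!")

# ...but it is a list, not a string, until it is joined back together.
list_is_str = isinstance(chars, str)
rebuilt = "".join(chars)
```

The list does solve the same practical problem, which is the part of the comment that holds up. It just does so by not being a string.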

I was somewhat impressed with Spacebat’s response, in that it both suggested three different approaches to speeding up the operation in Python and discussed the downsides of each, rather than simply pretending that if you want a faster program you shouldn’t want the output you actually set out to get in the first place. It shouldn’t seem impressive that someone treated the matter reasonably, but it does.

Mark Thomas, meanwhile, pointed out that what I posted wasn’t idiomatic Ruby either — something the people complaining about the lack of Python “smarts” in the original example probably never thought to consider. I’m not really sure what qualifies as “idiomatic” Python style in this case: Mark suggests that the original example for Python was idiomatic, and I’m inclined to agree that it’s idiomatic Python for string concatenation operations, distinct from idiomatic “avoid string concatenation operations” Python. Mark’s example of idiomatic Ruby style is on the money, but would not have served as an effective comparison example to demonstrate the differences in the two languages’ handling of string concatenation (going back to the “different algorithms make a bigger difference than different languages” idea, again). Interesting (to me, at least) is the fact that, despite being visually quite different from the original example and leveraging the beautification and iteration capabilities of Ruby to positive effect, the execution-optimizing piece of the code is exactly the same — namely, use of << in place of +=. Its execution performance is equivalent to the original “optimized” version as well.

Finally, there were several references to the idea of using an IO library call (by Chadwick Morris, Smel, Troy Kruthoff, one of Spacebat’s suggestions, and someone called Tom). While that’s useful to anyone thinking about writing code that behaves similarly, it also obscures the issue somewhat — after all, once we start doing library calls like that to make up for performance bottlenecks we lose any ability to compare the features of the core language. There are, I’m sure, at least a dozen Ruby libraries out there I could similarly use to change the performance characteristics of my examples.
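For readers who haven’t seen the IO-library approach, a rough sketch follows. It assumes Python 3’s `io.StringIO`; the commenters at the time would have been referring to the Python 2 era StringIO and cStringIO modules, and the function name and sample data here are made up for illustration.

```python
import io


def concat_stringio(lines):
    """Accumulate text in an in-memory buffer and read it back once."""
    buffer = io.StringIO()
    for line in lines:
        buffer.write(line)
    return buffer.getvalue()


# Hypothetical sample input.
result = concat_stringio(["alpha\n", "beta\n", "gamma\n"])
```

Note that the work of assembling the string has moved out of the core language’s string operations and into a library object, which is why it makes for a poor comparison of the languages themselves.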

Additional notes:

  1. Justin James made some interesting comments as well, but I think they deserve their own response.
  2. In the future, people posting code here might want to take note of the note above the comment text entry box that reads “Markdown: You can also format text using Markdown syntax.” In particular, indenting every line of code by four spaces in addition to any other spaces your code needs for indentation should provide the code formatting you need.
  3. There’s also a note below that text box, just above the “Preview” button, that reads “Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.” I’m afraid I didn’t get back to checking on this post’s comment activity for a couple days, and as such a bunch of people ended up saying roughly redundant things. This is not their fault, but the fault of the necessity of comment moderation and my own slowness to get around to dealing with comment moderation.

8 Comments

  1. I know what you’re getting at by looking at “what the language provides” for doing this particular operation.

    The thing is, concats are a relatively complex operation, or at least, complex enough that there’s no one perfect algorithm for all situations. Python and Ruby both provide multiple means – the syntax required varies of course, but what they provide with “+=” is considered a sensible default.

    The problem is that what that default could be is entirely subjective. In effect you pull random algorithms out of a hat and call one a winner. It’s no surprise, then, that you were jumped all over in the original thread.

    Comment by James H — 3 October 2007 @ 09:37

  2. That’s not the whole story, though — and it’s this factor that was the focus of my commentary:

    Certain decisions in language design will exclude certain choices of algorithm.

    As such, the default is not entirely subjective, and the work-arounds for that default when it doesn’t suit one’s needs are similarly not entirely subjective. The “all strings are immutable” design decision for Python makes this quite clear in the case of string concatenation operations.

    Ironically, part of my motivation for bringing this up was to point out that performance benchmarking is not as straightforward and all-important as some people like to think — but the Pythonistas who responded have, for the most part, taken my statements as a challenge to Python’s performance capabilities as though it were the straightforward, all-important metric of language quality that it isn’t. As such, they were willing — even eager — to ignore simple concepts like relevance of a given algorithm to my point in their attempts to prove Python’s fast.

    Comment by apotheon — 3 October 2007 @ 11:37

  3. += IS NOT A RUBY METHOD!

    Comment by Steven — 4 October 2007 @ 11:08

  4. Technically, you’re right: += is not a method. + is a method, though.

    Comment by apotheon — 5 October 2007 @ 02:19

  5. Certainly, the fact that Python has immutable strings creates some trade-offs, one of which is the slowness of huge numbers of string concatenations.

    That’s why “pythonic” code avoids concatenating strings. The idiomatic equivalents are, as noted in previous comments, (c)StringIO and list concatenation.

    Regarding comparing speeds: I think it’s a lot more useful to compare the same task — e.g. extracting all log entries starting with “warning” to a separate file — using idiomatic code in both languages. Why profile code that no one would write?

    Comment by Ryan Ginstrom — 8 October 2007 @ 07:26

  6. Why profile code that no one would write?

    I’m not suggesting anyone should use the execution speed of the provided code as an indicator of what language to use. In fact, my point was in some respects effectively the opposite — other things are more important than microbenchmarks when deciding between languages in the “dynamic high-level language” category.

    Comment by apotheon — 11 October 2007 @ 10:15

    In fact, my point was in some respects effectively the opposite — other things are more important than microbenchmarks when deciding between languages in the “dynamic high-level language” category.

    Excellent point. I also want to clarify that I don’t really think comparing ruby/python speed should be used to choose at all.

    Choosing a language solely for speed is a premature optimization, IMO :)

    Comment by Ryan Ginstrom — 11 October 2007 @ 03:35

  8. In general, I agree. There are extreme examples of exceptions to the rule that choosing solely for speed is premature optimization — and in cases where it isn’t, you should be using assembly language or a hex editor.

    Comment by apotheon — 12 October 2007 @ 04:20

All original content Copyright Chad Perrin: Distributed under the terms of the Open Works License