It may work, but it doesn’t mean it is correct

I wrote some time ago about a simple program of mine that saves entries from RSS feeds as messages in an IMAP store.

Recently I took the time to update the code to Python 3, hoping among other things that this could help me spot a few bugs. Python 3 has better Unicode support: in particular, you have either an encoded array of bytes or a Unicode string, so you can’t mistake one for the other like you can in Python 2.

In fact I found a couple of those bugs (only two? I’m either good at programming, or bad at bug hunting, you guess :D), but one in particular made me lose some time.

The code was the following (abridged for brevity):

from email.mime.text import MIMEText
# ...
msg = MIMEText( body, subtype, 'utf-8' )
msg["Subject"] = entry.title
# ...
msgText = msg.as_string()

This worked just fine in Python 2, but in Python 3 I sometimes got errors like

UnicodeEncodeError: 'ascii' codec can't encode character '\xbb' in position 0: ordinal not in range(128)

This would happen with non-ASCII characters in the Subject… why the hell wouldn’t MIMEText encode it in utf-8 as I asked? “I told you to use utf-8 over there, why the hell do you use the ascii codec?” The fact that the same code worked in Python 2 added to the frustration.

I then realized that the encoding of an email’s headers is quite a different matter from the encoding of its body; in fact it is handled in an RFC of its own, RFC 2047 (and then some others).

This means that a nice text with only some accented characters, like “Thìs strànge sentènce”, becomes the ugly “=?utf-8?b?VGjDrHMgc3Ryw6BuZ2Ugc2VudMOobmNl?=” (ouch!).
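To see that transformation in action, here is a minimal sketch using the standard library's email.header module. The variable names (title, encoded, decoded) are just illustrative and not taken from my program; the exact encoded form may differ depending on which encoding the library picks, so the sketch only checks that it round-trips.

```python
from email.header import Header, decode_header, make_header

# Sample text with accented characters (illustrative value).
title = "Thìs strànge sentènce"

# Encode the text as an RFC 2047 "encoded word", safe to use in a header.
encoded = Header(title, "utf-8").encode()
print(encoded)  # ASCII only, starts with =?utf-8?

# Decode it again: decode_header() yields (bytes, charset) pairs,
# and make_header() joins them back into a single Header object.
decoded = str(make_header(decode_header(encoded)))
print(decoded == title)
```

The encoded form is ugly for humans, but it contains only ASCII characters, which is what header lines are expected to carry.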

Anyway, there are plenty of libraries that handle this for us. In Python this means that one shouldn’t use plain strings directly when setting the headers, but the purpose-built Header class. So the correct code becomes

from email.mime.text import MIMEText
from email.header import Header
# ...
msg = MIMEText( body, subtype, 'utf-8' )
msg["Subject"] = Header(entry.title,'utf-8')
# ...
msgText = msg.as_string()
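For anyone who wants to try this without the rest of my program, here is a self-contained sketch of the same idea. The title and body values are made up for the example (in the real code they come from the feed entry), and the parsing step at the end is only there to check that the message survives a round trip.

```python
import email
from email.header import Header, decode_header, make_header
from email.mime.text import MIMEText

# Hypothetical stand-ins for entry.title and the entry body.
title = "Thìs strànge sentènce"
body = "A bòdy with accènted characters too.\n"

# The body is encoded according to the charset given to MIMEText...
msg = MIMEText(body, "plain", "utf-8")
# ...while the header needs its own, explicit encoding via Header.
msg["Subject"] = Header(title, "utf-8")

msg_text = msg.as_string()
print(msg_text)  # the whole serialized message is plain ASCII

# Parse the message back and decode both parts to check the round trip.
parsed = email.message_from_string(msg_text)
subject = str(make_header(decode_header(parsed["Subject"])))
payload = parsed.get_payload(decode=True).decode("utf-8")
print(subject == title, payload == body)
```

The point of the check at the end is that a client following the RFCs can recover the original text without having to guess anything.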

But then, why did the original code work with Python 2? Well, probably the library didn’t really care about the encoding of the message, the IMAP server didn’t care either, and the mail client, being a nice guy, realized that the header was a wrongly encoded string and guessed how to interpret it. The fact that the message didn’t travel through an SMTP server probably helped.

A nice example of “It may work, but it doesn’t mean it is correct”.