Unicode in Your Domain Names

January 24, 2010

Got a few questions about how I set up ❺➠.ws (which is powered by Shorty, my Scala-based URL shortener), so I thought I'd write up how I got it working. Short answer is that it was pretty easy.

I got the idea from John Gruber, who made a similar thing for his site to post entries on @daringfireball. The trickiest part was figuring out what this was called so I could find out who could sell me a domain with Unicode characters in it.

It turns out this is called an IDN (short for Internationalized Domain Name), and not every registrar will sell you one. Couple that with the need for a non-.com domain, and I had to hunt around for a while.

I ended up going with DynaDot, as they could provide the wacky hostname that I wanted as well as a .ws TLD registration. I was amazed at the number of domain registrars whose web forms could not handle Unicode. It's been almost 7 years since Joel Spolsky wrote his screed on dealing with Unicode, so I don't know what the deal is.

At any rate, the tricky bit is actually using the domain, because a) entering ❺➠ into vim is nontrivial, and b) I doubt that Apache's config file would work with Unicode characters in it. Enter Punycode, which is an ASCII-only encoding of an IDN: each non-ASCII label becomes "xn--" followed by plain letters and digits. Fortunately, the domain host provides the Punycode for your IDN, so configuring Apache was a matter of:

<VirtualHost XXXXX>
  ServerName xn--dfi5d.ws
  ServerAlias www.xn--dfi5d.ws
  DocumentRoot /home/webadmin/xn--dfi5d.ws/html
  # whatever else goes here
  JkMount /s* ajp13
  JkMount /s ajp13
  <Directory /home/webadmin/xn--dfi5d.ws/html>
    Options Includes FollowSymLinks
    AllowOverride All
  </Directory>
</VirtualHost>
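
If your registrar doesn't hand you the Punycode, you can work it out yourself. Here's a minimal sketch using Python's built-in punycode codec; the helper names to_ace_label and from_ace_label are mine, not part of any library. Note that it only does the raw encoding and skips the normalization step a registrar would apply first, so treat it as a sanity check rather than the final word.

```python
# Sketch: convert a single domain label to and from its ASCII (Punycode) form.
# Helper names are hypothetical; only the standard-library codec is used.

UNICODE_LABEL = "❺➠"  # the label from this post
TLD = "ws"


def to_ace_label(label: str) -> str:
    """Return the ASCII-compatible form of one label (no dots)."""
    if label.isascii():
        return label  # plain ASCII labels are left untouched
    # Raw Punycode, plus the "xn--" prefix that marks an encoded label.
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace_label(label: str) -> str:
    """Reverse of to_ace_label for one label."""
    if label.lower().startswith("xn--"):
        return label[4:].encode("ascii").decode("punycode")
    return label


ascii_host = f"{to_ace_label(UNICODE_LABEL)}.{TLD}"
print(ascii_host)  # prints xn--dfi5d.ws
```

The ASCII form is what goes into ServerName, and it is also the safest thing to hand to command-line tools that may not convert the Unicode name for you.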

At this point, it pretty much worked, although it was sometimes difficult to get curl to work with the non-punyied name.

One thing that was weird was that a lot of the domains I wanted to try were taken or not available (with no explanation). Often it seemed like the Punycode version was a normal-looking URL that was already taken; I tried several IDNs that had a Unicode character with a weird "5", and they punyied to an ASCII 5. Not sure what the deal is there, but I eventually found the one I wanted.
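
One possible explanation, and this is a guess rather than something I confirmed with the registrar: before a name is Punycode-encoded, IDN processing (Nameprep) runs it through Unicode NFKC normalization, which folds many "fancy" characters into their plain equivalents. The sketch below approximates that step with the standard unicodedata module; nameprep_like is a hypothetical helper and is not a full Nameprep implementation.

```python
# Sketch: approximate the normalization applied to IDN labels before encoding.
# This is NFKC plus lowercasing only, not the complete Nameprep profile.
import unicodedata


def nameprep_like(label: str) -> str:
    """Fold compatibility characters roughly the way IDN processing would."""
    return unicodedata.normalize("NFKC", label).lower()


# A circled five, a fullwidth five, and the dingbat five used in this post.
for ch in ["⑤", "５", "❺"]:
    mapped = nameprep_like(ch)
    status = "folds to plain ASCII" if mapped.isascii() else "stays as-is"
    print(f"U+{ord(ch):04X} {ch!r} -> {mapped!r} ({status})")
```

If a candidate character folds to a plain "5", the resulting name is just an ordinary ASCII domain, which would explain why it shows up as already taken.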