Nginx: Matching URL-Encoded Stuff in URLs (Umlauts, UTF-8)

In my quest to whitelist stupid Drupal and WordPress sites, I've run into a little problem with umlauts in URLs.
Nowadays those get encoded as UTF-8 bytes and then converted to URL-encoded (percent-encoded) format, if your page is also in UTF-8.
If you have an old browser and copy & paste a URL, or the page is not in UTF-8, the URL gets sent in the encoding of your operating system (for my Windows that would be something like ISO-8859-15).

When nginx gets that URL, it first decodes the URL encoding, and that decoded representation is what gets used in location matches and so on. The result is just raw bytes in no known encoding, because HTTP was (NOT) well designed: requests have no character set option for the URL.

E.g. you have a URL with German umlauts that looks something like “/%C3%B6ffentliche-b%C3%BCcherei”.
nginx will represent that as unreadable (and most probably unwritable) bytes mixed into the URL, two for each umlaut, and without further hints the regex engine treats every byte as a character of its own.
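
To make that byte view concrete, here is a small sketch of my own (not taken from the documentation, and untested): without any further hint, matching the example URL means spelling out the decoded UTF-8 bytes one by one with \x escapes. The empty location body is only a placeholder.

    # ö = \xC3\xB6, ü = \xC3\xBC (the UTF-8 bytes left after nginx has decoded the %XX escapes)
    location ~ "^/\xC3\xB6ffentliche-b\xC3\xBCcherei$" {
        # placeholder: your handling goes here
    }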

Fortunately there's a very easy fix to get nginx to match that correctly.
It is quite hard to find, though: it should be in the documentation, but I didn't find it there.

So to match the mentioned URL, just put a “(*UTF8)” in front of your regex:

    location ~* (*UTF8)^/[öüäÖÜÄßA-Za-z0-9-]*$ {

Also put a version without the (*UTF8) in there, as some browsers might send the URL in an ISO-8859 encoding.
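
As a rough sketch of what the two variants could look like side by side (my assumption, not a tested config): the location bodies are placeholders, and the ISO pattern spells the ISO-8859-1/15 umlaut bytes as \x escapes so that it does not depend on the encoding of the config file. The ISO variant is listed first because nginx tries regex locations in the order they appear, so those requests are handled before the UTF-8 pattern ever sees bytes that are not valid UTF-8.

    # ISO-8859-1/15 requests, e.g. /%F6ffentliche-b%FCcherei
    # \xF6 \xFC \xE4 = ö ü ä, \xD6 \xDC \xC4 = Ö Ü Ä, \xDF = ß
    location ~* "^/[\xF6\xFC\xE4\xD6\xDC\xC4\xDFA-Za-z0-9-]*$" {
        # placeholder: your handling goes here
    }

    # UTF-8 requests, e.g. /%C3%B6ffentliche-b%C3%BCcherei
    # (the config file itself has to be saved as UTF-8 for the literal umlauts to work)
    location ~* "(*UTF8)^/[öüäÖÜÄßA-Za-z0-9-]*$" {
        # placeholder: same handling as above
    }

This assumes nginx is built against a PCRE library with UTF-8 support; builds linked against PCRE2 may want (*UTF) instead of (*UTF8).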
