Subject: Unicode and IDS evasion
From: Eric Hacker (hackerVUDU.NET)
Date: Sun Oct 29 2000 - 22:06:39 CST


Unicode and IDS evasion.

Everyone by now is aware of the IIS Unicode vulnerability [1]. Robert Graham
had mentioned in a previous post [2] that his company had a UTF8 parser
coded and released within hours of the announcement to catch this attack and
its variations.

I got to thinking about that. Wow, a UTF8 parser in a couple of hours? That
can’t be right. With all the different language options and whatnot, I doubt
I’d even be able to make a list of things to check in a few hours. Bruce
Schneier has warned about the complexities in Unicode processing and what
they might mean for security [3], and now we’ve seen a vulnerability. To get
it right in a few hours seems very difficult.

So I took a look at the write-up by Network ICE on what they were looking
for [4]. They say:

   UTF8 is a multibyte character set, which means it can
   use one, two, or more bytes to represent a single
   character. It is used to represent non-English
   characters beyond the traditional 7-bit ASCII. In
   particular, it is used for far-east characters such
   as Chinese, Japanese, and Korean. In all, UTF8 can
   represent over 30,000 different characters.

Wow. 30,000 different characters! What does IIS do with all those? I started
looking for documentation from Microsoft to answer that, but still haven’t
come up with anything clear. I read on:

   By using multiple bytes to represent a traditional
   7-bit ASCII character, an intruder can evade an
   intrusion detection system (IDS) or compromise a web
   server by evading cononicalization/normalization.

Ah yes, evade an IDS. That was what I was thinking. Why, everyone well
schooled in Ptacek and Newsham [5] or, more recently, Hoglund and Gary [6]
will realize the difficulty in keeping an IDS synchronized with the server.
Unicode doesn’t make that any easier. I read on:

   Cononicalization/normalization is a process whereby
   a web server strips off the "backtracking"
   subdirectory of "../". By URL encoding the backtracking
   subdirectories, an intruder could bypass this process,
   and thereby access any file on the system. ...

   With UTF8 encoded backtracking, it might look like:
   http://networkice.com/something/%C0%AF%C0%AF/default.htm
   This alert triggers when such an attempt has been made.

First, I thought %C0%AF was a ‘/’ and %C0%AE was the ‘.’, but I could have
crossed my test results. What really concerns me, though, is that, as I read
it, the above means they alert only on UTF8 encoded backtracking attempts.
That doesn’t help much on the evasion front. So then I thought maybe it
wasn’t relevant for other characters, because there was no other way to
encode them.
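
To make the encoding question concrete, here is a minimal Python sketch of
how a decoder that does not reject overlong two-byte sequences would read
those bytes. The function lenient_utf8_decode is hypothetical and written
only for illustration; it is not IIS’s actual parser, whose behavior I have
not seen documented. A strict decoder, such as Python’s own, refuses the
same bytes.

```python
from urllib.parse import unquote_to_bytes


def lenient_utf8_decode(data: bytes) -> str:
    """Hypothetical decoder that accepts overlong 2-byte forms (unsafe on purpose)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain 7-bit ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Two-byte form: 5 payload bits from the lead byte, 6 from the next.
            # A strict decoder would reject 0xC0 and 0xC1 here, because any
            # value they can carry fits in a single byte. This one does not,
            # and it does not validate the continuation byte either.
            code_point = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            out.append(chr(code_point))
            i += 2
        else:
            # Longer sequences are out of scope for this sketch.
            out.append("?")
            i += 1
    return "".join(out)


raw = unquote_to_bytes("%C0%AE%C0%AE%C0%AF")
print(lenient_utf8_decode(raw))  # prints ../

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("strict decoder rejects the overlong bytes")
```

Under this reading, %C0%AF comes out as ‘/’ and %C0%AE as ‘.’, which agrees
with what I thought I had seen in my tests.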

Lacking a clear table indicating how IIS interprets UTF8, I did some
testing. I ran through some potential UTF8 codes on my unpatched W2K IIS
test server. I examined the logs to determine what IIS thought the URL was.
I found thirteen representations for the letter ‘a’. I tested all of these
and successfully retrieved a URL (http://myserver/a.txt) encoded with them.

That doesn’t bode well for IDS trying to monitor for hostile activity. If
a.txt was a vulnerability, would any IDS vendor have caught it?
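
For illustration, the following Python sketch generates the overlong UTF8
forms of an ASCII character that follow directly from the UTF8 bit layout.
These are not necessarily the thirteen representations I found; some of
those may come from the server’s code page mapping rather than from overlong
encoding, and I have not confirmed which of the forms below IIS accepts. The
names overlong_forms and percent_encode are made up for this example.

```python
def overlong_forms(ch: str, max_len: int = 4) -> list[bytes]:
    """Return the overlong UTF-8 encodings of one ASCII character, 2..max_len bytes."""
    code_point = ord(ch)
    if code_point >= 0x80:
        raise ValueError("this sketch only handles 7-bit ASCII characters")
    forms = []
    for n in range(2, max_len + 1):
        # Continuation bytes carry 6 payload bits each, lowest bits last.
        value = code_point
        continuation = []
        for _ in range(n - 1):
            continuation.append(0x80 | (value & 0x3F))
            value >>= 6
        # Lead byte: n one-bits, a zero bit, then whatever payload remains.
        lead = ((0xFF << (8 - n)) & 0xFF) | value
        forms.append(bytes([lead] + continuation[::-1]))
    return forms


def percent_encode(raw: bytes) -> str:
    """Render bytes as URL percent-escapes."""
    return "".join(f"%{b:02X}" for b in raw)


for form in overlong_forms("a"):
    print(percent_encode(form))
# prints:
# %C1%A1
# %E0%81%A1
# %F0%80%81%A1
```

A signature written for the literal string a.txt matches none of these.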

A search of various (but not all) IDS vendors’ web sites does not bring up
any declaration of full Unicode support. Other than Network ICE, I really
didn’t see mention of it. It is unclear to me exactly which UTF8 characters
Network ICE detects. I feel fairly confident that UTF8 encoding of an attack
would bypass most if not all network IDS today.

Do you want a code page with that?

I’m no IIS or W2K wizard, but from what I can tell, it seems that when W2K
is set up for different languages, the interpretation of UTF8 characters
will be different. I found this in the IIS documentation [7] regarding code
pages:

   A code page can be represented in a table as a mapping
   of characters to single-byte values or multibyte
   values. Many code pages share the ASCII character set
   for characters in the range 0x00 - 0x7F.

If one thinks synchronizing the IDS to the way overlapping fragments are
dealt with by the TCP/IP stack of various OSs is difficult, try matching the
code pages the web server is using. This is sure to be a major hassle for
IDS monitoring of non-English versions of Microsoft IIS.

A light at the end of the tunnel?

All is not lost, however. Refocusing the IDS, together with standard web
server coding practices, should alleviate much of this problem. I call it
client side normalization. I introduced it in my paper [8]; here I will
apply it to the Unicode parsing problem.

A URL may consist of a resource location and perhaps data being submitted.
The data is typically preceded by a ‘?’. The resource location portion of a
URL is almost always generated by the web server owner. Web browsers do not
change the form of the resource location when submitting data. In this way I
claim that web browsers represent client side normalization.

Certainly this cannot be trusted. There are many ways for one to send any
URL one wants to a web server. However, the normal URLs being submitted will
be those previously generated by the server itself and sent back by normal
web browsers. If a URL is not in this set, it is probably an event of
interest.

Web server owners should not create URLs that require reduction or
canonicalization. Why use a multi-byte character to represent ‘a’?
Therefore, when reduction or canonicalization takes place in the resource
location, it is an event worth noting. I repeat: the presence of reduction
or canonicalization within the resource location part of a URL is itself
anomalous, likely malicious, and worthy of a NIDS alert.
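
As a rough sketch of what such a check could look like, the Python below
flags a resource location that would need reduction or canonicalization
before use. The function name path_anomalies and its three rules are my own
assumptions for illustration, not any vendor’s implementation. "Unreserved"
here means letters, digits, and the characters - . _ ~, which never need
escaping in a URL.

```python
import re
from urllib.parse import unquote_to_bytes

# Byte values of characters that never need percent-escaping in a URL.
UNRESERVED = set(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def path_anomalies(request_target: str) -> list[str]:
    """Return reasons the resource location (the part before '?') looks non-canonical."""
    path = request_target.split("?", 1)[0]
    reasons = []

    # Rule 1: an escape that hides a character which needs no escape.
    for match in ESCAPE.finditer(path):
        if int(match.group(1), 16) in UNRESERVED:
            reasons.append(f"needless escape {match.group(0)}")

    # Rule 2: escaped bytes that a strict UTF-8 decoder refuses, which
    # includes overlong forms such as %C0%AF.
    try:
        unquote_to_bytes(path).decode("utf-8")
    except UnicodeDecodeError:
        reasons.append("escaped bytes are not well-formed UTF-8")

    # Rule 3: dot segments that the server would have to reduce.
    if any(segment in (".", "..") for segment in path.split("/")):
        reasons.append("dot segment requires path reduction")

    return reasons


print(path_anomalies("/a.txt?x=%61"))                   # prints []
print(path_anomalies("/%61.txt"))                       # prints ['needless escape %61']
print(path_anomalies("/scripts/%C0%AE%C0%AE%C0%AF/x"))  # prints ['escaped bytes are not well-formed UTF-8']
```

Note that rule 2 would also flag legitimate escapes on a server whose URLs
use a non-UTF8 national encoding, so the code page problem described above
does not go away.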

It is highly likely that the same can be said for the data portion of the
URL. Client side processing can be used to validate data within an accepted
character set. Again, this does not provide assurance. It does identify data
that contains non-standard encoding as anomalous.

Thus, by looking for reduction or canonicalization itself, one can generate
alerts for likely malicious traffic and solve much of the Unicode problem.
Perhaps by using an overly inclusive character set that includes all
reduction possibilities from all languages one can avoid the language code
page problem. Otherwise the code page problem will require extensive
configurability within the IDS to protect international Unicode-capable
services.

Is it pattern matching or is it protocol analysis?

I have an interesting tidbit to add to all this. The other day I was reading
a draft white paper that was given to an acquaintance. The paper was
supposedly authored by someone at Network ICE and argued that third
generation IDS that perform protocol analysis are much better than earlier
IDS that just do pattern matching. The acquaintance said, "It all boils down
to pattern matching in the end."

I read the paper. I had to agree. There was nothing in that paper that
identified any particular advantage of protocol analysis other than that it
was a better way to reduce data before pattern matching. Yes, it is a better
way to reduce data, but much of what the paper called protocol analysis is
done by Snort already.

Here, however, is a clear advantage to protocol analysis. There is no way a
standard pattern matching IDS can perform Unicode reduction. Sure, someone
could throw a protocol pre-processor into Snort to handle Unicode, but
detecting this type of attack can’t be done with a reasonably sized pattern
set. I can see no way of writing a pattern matching signature that will
detect Unicode within a port 80 stream.
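
A back-of-the-envelope count shows why the pattern set gets out of hand. The
Python sketch below assumes, purely for illustration, that every character
in a target string has the same number of accepted representations as the
thirteen I found for ‘a’. I only measured ‘a’, so the real figure for other
characters may well differ. The function name is hypothetical.

```python
def literal_patterns_needed(target: str, forms_per_char: int) -> int:
    """Count the literal byte strings needed to match every encoding of target.

    Each character can independently take any of its forms, so the count
    is forms_per_char raised to the length of the target.
    """
    return forms_per_char ** len(target)


print(literal_patterns_needed("a.txt", 13))    # prints 371293
print(literal_patterns_needed("cmd.exe", 13))  # prints 62748517
```

Reducing the URL once and then matching a single pattern avoids the blow-up,
which is the case for doing the decoding in a protocol-aware stage.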

I welcome any discussion on the ideas presented here. I recognize that my
limited experience does not encompass all of reality and that my judgments
may therefore be wrong. If you have evidence or ideas that suggest such,
please discuss.

Eric Hacker, GCIA, MCSE, CCSE
Lucent NPS, Security Practice

[1] http://www.securityfocus.com/bid/1806
[2] http://www.securityfocus.com/archive/96/140752
[3] http://www.counterpane.com/crypto-gram-0007.html#9
[4] http://www.networkice.com/advice/intrusions/2000639/default.htm
[5] http://www.robertgraham.com/mirror/Ptacek-Newsham-Evasion-98.html
[6] http://www.securityfocus.com/focus/ids/articles/desynch.html
[7] http://windows.microsoft.com/windows2000/en/server/iis/
[8] http://www.securityfocus.com/focus/ids/articles/resynch.html