Re: Canonicalization Of User Input In PHP

From: Paul Johnston (paulwestpoint.ltd.uk)
Date: Wed Jan 19 2005 - 08:16:54 CST


Hi,

In general, I feel that trying to develop a generic "sanitize_input"
function is not fruitful. The set of dangerous characters depends on
where the string is used. For example, I just audited some code that
had such a function, "safe_io", which called the MySQL and HTML escaping
functions. It was rigorously applied to inputs; however, I found some
places where variables protected in this way were passed to the shell.
Also, such functions can very easily corrupt data.
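
To illustrate, here is a minimal PHP sketch. The name safe_io_like is
hypothetical: it only stands in for the kind of generic helper described
above and is not the audited code.

```php
<?php
// Hypothetical stand-in for a generic "sanitize everything" helper:
// it applies SQL-style and HTML escaping and nothing else.
function safe_io_like(string $s): string
{
    return htmlspecialchars(addslashes($s), ENT_QUOTES, 'UTF-8');
}

$input = 'report.txt; rm -rf ~';

// None of the shell metacharacters are touched by the generic helper.
$generic = safe_io_like($input);

// Escaping for the actual destination quotes the whole argument instead.
$forShell = escapeshellarg($input);

// The same helper also alters legitimate data such as a surname.
$name = safe_io_like("O'Reilly & Sons");

echo $generic, "\n", $forShell, "\n", $name, "\n";
```

The first value comes back unchanged, semicolon and all, while the
harmless name no longer matches what the user typed.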

For escaping dangerous characters, I advocate escaping very close to
where the string will be reparsed, e.g. system("program " +
escape_shell(args)). Applying this principle to cross-site scripting
means escaping HTML as it is generated. A consequence of this is that
your database may contain HTML special characters. However, as you
follow this principle, you are increasingly encouraged to make
everything binary safe and sidestep the dangerous characters problem.
For SQL queries, most interfaces support some kind of parameterised
query that is binary safe. As for passing to the shell, it usually
turns out that the only good policy is to avoid it at all costs. I
haven't mentioned string length; most of the time a long string is not
dangerous.
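
A rough PHP sketch of "escape where the string is reparsed" might look
like the following. The function names and the "comments" table are
assumptions made for illustration; PDO is used only as one example of an
interface with parameterised queries.

```php
<?php
// HTML escaping helper, called only at the point where HTML is generated.
function h(string $s): string
{
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

// The stored values may contain <, > or &; they are escaped here, on output.
function render_comment(string $author, string $body): string
{
    return '<p><b>' . h($author) . '</b>: ' . h($body) . '</p>';
}

// Parameterised query: the values travel separately from the SQL text,
// so no SQL escaping is applied. The "comments" table is hypothetical.
function store_comment(PDO $db, string $author, string $body): void
{
    $stmt = $db->prepare('INSERT INTO comments (author, body) VALUES (?, ?)');
    $stmt->execute([$author, $body]);
}

echo render_comment('Bob', '<script>alert(1)</script>'), "\n";
// <p><b>Bob</b>: &lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Note that store_comment() keeps the raw text, so the same row can later be
escaped differently for HTML, a plain-text email or a CSV export.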

One thing to note: in many situations the programmer should be able to
get a complete list of dangerous characters, because they control the
code that reparses the string. The most notable exception is HTML. Here
the client's browser does the parsing, and the programmer has no
control over it. Various browser-specific features mean that more
characters need protecting.

Now, escaping bad characters is just one part of the puzzle. As a major
second line of defence, every input value should be whitelist validated
as early as possible. Any input encoding (e.g. URL encoding) must be
decoded before this validation. Handling UTF-8 requires some
consideration here. Validation is a major defence against the
possibility that you've missed a character from your dangerous character
list. It is also a good place to put sensible length limits. However,
for many inputs quite permissive validation is the only acceptable
option; a regex I often use is ^[\x20-\x7e]*$. While helpful, this does
nothing to protect against HTML or SQL special characters, so it is not
a defence by itself.
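
As a sketch of such whitelist validation in PHP: the function names, the
255-byte limit and the username rule below are assumptions for
illustration. In PCRE (as used by preg_match) a bare "$" also matches
just before a trailing newline, so the sketch anchors with \A and \z
instead.

```php
<?php
// Permissive whitelist: printable ASCII only, plus a length cap.
function is_printable_ascii(string $value, int $maxLen = 255): bool
{
    return strlen($value) <= $maxLen
        && preg_match('/\A[\x20-\x7e]*\z/', $value) === 1;
}

// Stricter whitelist for a field with a known shape (hypothetical rule).
function is_username(string $value): bool
{
    return preg_match('/\A[a-z0-9_]{3,20}\z/', $value) === 1;
}

var_dump(is_printable_ascii('hello world'));   // bool(true)
var_dump(is_printable_ascii("abc\n"));         // bool(false)
var_dump(is_printable_ascii('<b>'));           // bool(true): still needs escaping
var_dump(is_username("paul'; --"));            // bool(false)
```

The third call shows the limitation mentioned above: the permissive
check accepts HTML special characters, so output escaping is still needed.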

So, the overall sequence is:

    input -> unescape -> validate -> [do stuff] -> escape -> output
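
The sequence above can be sketched in PHP roughly as follows. The
function name and the 64-byte limit are assumptions. The explicit
rawurldecode() call is there only to make the "unescape" step visible:
PHP has already URL-decoded the values in $_GET, so decoding those a
second time would be a mistake.

```php
<?php
// input -> unescape -> validate -> [do stuff] -> escape -> output
function greet_from_raw(string $rawName): string
{
    // unescape: undo the URL encoding of a raw, still-encoded value
    $name = rawurldecode($rawName);

    // validate: printable ASCII and a sensible length
    if (strlen($name) > 64 || preg_match('/\A[\x20-\x7e]*\z/', $name) !== 1) {
        return 'Invalid name';
    }

    // do stuff: application logic on the validated but still untrusted value
    $name = trim($name);

    // escape: for the HTML context, as the output is generated
    return 'Hello, ' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '!';
}

echo greet_from_raw('Tom%20%26%20Jerry'), "\n";   // Hello, Tom &amp; Jerry!
echo greet_from_raw('evil%0Aname'), "\n";         // Invalid name
```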

I haven't mentioned "canonicalisation" but that is implicit in the above
sequence. Of course, all this only protects you against one class of
attacks. While the inputs are now validated, they remain untrusted. You
still have to design your logic correctly. Also, you need to consider
carefully what your inputs are. For example, it would be easy to protect
all form input fields but forget to apply the same validation to cookies.
For network applications it is usually clear what the inputs are (but
take care with things like reverse DNS lookups). A traditional Unix
situation where the inputs are many and varied is a setuid executable,
something very difficult to secure.
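
One way to avoid forgetting a source such as cookies is to run every
input array through the same check. The sketch below is only an
illustration: find_invalid_inputs and the validation rule are
hypothetical, and the sample arrays stand in for $_GET, $_POST and
$_COOKIE in a real request.

```php
<?php
// Return "source:field" for every field, from every source, that fails validation.
function find_invalid_inputs(array $sources, callable $isValid): array
{
    $bad = [];
    foreach ($sources as $sourceName => $fields) {
        foreach ($fields as $key => $value) {
            // Array-valued parameters (e.g. a[]=1) are rejected in this sketch.
            if (!is_string($value) || !$isValid((string) $key) || !$isValid($value)) {
                $bad[] = $sourceName . ':' . $key;
            }
        }
    }
    return $bad;
}

// Assumed rule: up to 255 printable ASCII characters.
$isValid = function (string $s): bool {
    return preg_match('/\A[\x20-\x7e]{0,255}\z/', $s) === 1;
};

// In a real request these would be $_GET, $_POST and $_COOKIE.
$bad = find_invalid_inputs(
    ['get' => ['q' => 'hello'], 'cookie' => ['session' => "abc\x00def"]],
    $isValid
);
print_r($bad);   // lists only "cookie:session"
```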

Regards,

Paul

warningsenvisagement.com wrote:

> I am working on implementing a basic PHP user input validation scheme
> and have come across several references to canonicalizing input before
> performing validation. After researching this topic on the net I have
> finally reached a point where I feel okay asking for help.
>
> At this point I have found a few basic functions related to this
> subject, but I am getting lost in alphabet soup (UTF-8, RFC 2279,
> ISO 10646, ...) and I am reaching a momentary saturation point where
> I am finding the learning curve is only getting steeper with the more
> I learn.
>
> For the basic validation I have found the following set of PHP filters
> via the owasp.org site.
>
> http://www.owasp.org/software/labs/phpfilters.html
> // sanitize.inc.php
> // Sanitization functions for PHP
> // by: Gavin Zuchlinski, Jamie Pratt, Hokkaido
> // webpage: http://libox.net
> // Last modified: December 21, 2003
>
> Now these functions are fairly clear and easy to understand and have
> generally validated what I have come to understand as best practices,
> as I have experience with fault tolerant coding, just not security.
> But the issue I am having trouble coming to terms with is
> canonicalization of the data. Beyond the above routines, I have also
> found the urldecode() function in the PHP manual.
>
> At this point I feel (weakly, not securely) that one should use the
> following to canonicalize the data prior to validating any input.
>
> reset($_GET);
> foreach ($_GET as $key => $value) {
>     // Transform to canonical form.
>     $ckey = my_utf8_decode(urldecode($key));
>     $cvalue = my_utf8_decode(urldecode($value));
>     if ($ckey != sanitize_paranoid_string($ckey) ||
>         $cvalue != sanitize_paranoid_string($cvalue)) {
>         header('location:www.somesight.net/index.php');
>     }
> }
>
> I understand this example is simplistic, but is this a proper way
> to canonicalize the input values? Or am I missing something here?
>
> Should I be looking at the following too?
>
> $_SERVER['CONTENT_TYPE'] == 'application/x-www-form-urlencoded'
>
> Is this data even trustworthy? I would at first guess think it could
> be forged in the header data.
>
> Any input would be appreciated.
>
> thanks,
>
> Sean
>

--
Paul Johnston, GSEC
Internet Security Specialist
Westpoint Limited
Albion Wharf, 19 Albion Street,
Manchester, M1 5LN
England
Tel: +44 (0)161 237 1028
Fax: +44 (0)161 237 1031
email: paulwestpoint.ltd.uk
web: www.westpoint.ltd.uk