Effectively filtering user input is one of the best ways to prevent a great many web application vulnerabilities. There are several ways to approach this, each with its own pros and cons, so I’ll run through them here and then you can think about the best way to combine them for your context. It’s important to remember, though, that filters are context specific: no single filter will work for a whole application, and that’s what can make writing an effective filter tricky.
The first approach is to reject, or strip, known bad input (blacklisting). The problem with this is twofold. The first issue is that the attacker likely has a lot of flexibility in their choice of payload. For example, if you block the string <script> to prevent a simple cross-site scripting attack and you choose to strip that dangerous input, an attacker has a few options. They could change the input to <sCrIpT>, which has the same effect but might sneak past a naive, case-sensitive filter. Alternatively, the attacker could use the payload <scr<script>ipt>, so that when the filter is applied and the <script> string is removed, the two pieces either side (“<scr” and “ipt>”) are joined together, rebuilding <script> and effectively bypassing the filter.
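Both bypasses are easy to reproduce. The following is a minimal Python sketch, not a production filter; the function names naive_strip and recursive_strip are made up for illustration. It shows a single-pass, case-sensitive strip being defeated, and a case-insensitive strip that repeats until the input stops changing.

```python
import re


def naive_strip(value: str) -> str:
    # Single pass, case-sensitive removal of the literal "<script>".
    return value.replace("<script>", "")


def recursive_strip(value: str) -> str:
    # Case-insensitive removal, repeated until a pass changes nothing.
    pattern = re.compile(r"<script>", re.IGNORECASE)
    previous = None
    while previous != value:
        previous = value
        value = pattern.sub("", value)
    return value


# Mixed case is not matched by the naive filter at all.
print(naive_strip("<sCrIpT>"))
# Removing the inner tag joins "<scr" and "ipt>" into a fresh "<script>".
print(naive_strip("<scr<script>ipt>"))
# The repeated, case-insensitive version removes the rebuilt tag as well.
print(repr(recursive_strip("<scr<ScRiPt>ipt>")))
```

Even the repeated version only knows about one string, which is the wider weakness of blacklisting.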
The second issue is that an attacker may use encodings to bypass character filters. For example, with URI encoding it may be possible to have the string %3Cscript%3E interpreted in the same way as <script> would be. There are often several encoding types available, and it’s not always possible to know all potential payloads ahead of time, so blacklisting is rarely completely effective when deployed alone.
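As a rough illustration of the encoding problem, the Python sketch below uses the standard library's urllib.parse.unquote. The helper contains_script_tag is a hypothetical blacklist check, not taken from any real filter. The check misses the URI-encoded payload until the input is decoded, and a double-encoded payload needs more than one decoding pass.

```python
from urllib.parse import unquote


def contains_script_tag(value: str) -> bool:
    # Hypothetical blacklist check: case-insensitive search for "<script>".
    return "<script>" in value.lower()


encoded = "%3Cscript%3E"
print(contains_script_tag(encoded))           # the raw input looks harmless
print(contains_script_tag(unquote(encoded)))  # the decoded input does not

# Double encoding: one decoding pass only peels off the outer layer.
double_encoded = "%253Cscript%253E"
print(unquote(double_encoded))
print(unquote(unquote(double_encoded)))
```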
It follows, then, that if rejecting known bad is ineffective, the opposite approach of accepting only known good input (whitelisting) might be effective. The short answer is that it is; the problem is implementing an effective filter of this nature in the real world. Take, for example, an input requesting a user’s mobile phone number: the character set could be restricted to just digits, with a maximum of, say, 15 characters. Now consider an advertisement placed on an online auction site; that’s a much harder input to whitelist. If a blacklist is difficult to write because it’s not possible to know all potential payloads, a whitelist is difficult to write because it may not be possible to know all potential good inputs.
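The phone number case is simple enough to sketch in a few lines of Python. The pattern and the function name is_valid_phone are illustrative assumptions, and the 15-digit limit is the figure used above. A real form would probably also need to decide how to treat spaces or a leading plus sign.

```python
import re

# Whitelist: between 1 and 15 ASCII digits and nothing else.
# [0-9] is used rather than \d, which also matches non-ASCII digits.
PHONE_PATTERN = re.compile(r"[0-9]{1,15}")


def is_valid_phone(value: str) -> bool:
    # fullmatch requires the whole string to match, not just a prefix.
    return PHONE_PATTERN.fullmatch(value) is not None


print(is_valid_phone("07700900123"))
print(is_valid_phone("0770<script>"))
print(is_valid_phone("1234567890123456"))  # 16 digits, one too many
```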
A third option, then, is to accept input with potentially dangerous characters and simply encode them so that they become benign. In most cases HTML entity encoding is the best option for web applications. Consider the original example payload, <script>: it can be changed so that the potentially dangerous characters < and > are entity encoded into the forms &lt; and &gt; respectively. That way, if they are rendered to the screen they will display correctly to the user but cannot be used for a cross-site scripting attack. Another example is the SQL injection payload ' OR 1=1 --, which could be represented as &apos; OR 1=1 -- so that it cannot be interpreted successfully within a SQL query, defending against the attack. A benign input such as Paul O’brien-Shea would be processed as Paul O&apos;brien-Shea but would still be rendered “correctly” to the user within a web browser.
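In Python, the standard library's html.escape is one way to do this kind of entity encoding. The short sketch below only shows the transformation. Note that html.escape emits the numeric form &#x27; for a single quote rather than a named entity, which a browser renders the same way.

```python
import html

# The angle brackets become entities, so the tag is displayed, not executed.
print(html.escape("<script>"))

# Quotes are encoded too (quote=True is the default).
encoded_name = html.escape("Paul O'brien-Shea")
print(encoded_name)

# Decoding the entities, as a browser does when rendering, restores the text.
print(html.unescape(encoded_name))
```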
Bear in mind, however, that the above options do not have to be implemented independently and a hybrid approach could be adopted. A possibly effective approach could be something like:
- Decode input recursively until no further decoding is possible
- Recursively strip known malicious sequences; if a strip occurs, then repeat step 1
- Encode potentially dangerous characters
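The three steps above can be sketched in Python as follows. This is one possible reading under stated assumptions, not a definitive implementation. The names BAD_SEQUENCES, DANGEROUS, decode_fully, encode_dangerous and filter_input are all hypothetical. The blacklist only contains script tags. "Decoding" is taken to mean URI decoding plus HTML entity decoding. The encoded characters are the minimum set this article lists as dangerous. A cap on decoding rounds is added so that hostile input cannot force unbounded work.

```python
import html
import re
from urllib.parse import unquote

# Assumed blacklist for illustration: opening and closing script tags.
BAD_SEQUENCES = re.compile(r"</?script>", re.IGNORECASE)

# The minimum set of dangerous characters given in this article.
DANGEROUS = "\"';-<>="


def decode_fully(value: str, max_rounds: int = 10) -> str:
    # Step 1: decode repeatedly until a pass leaves the value unchanged.
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            return value
        value = decoded
    raise ValueError("input has too many layers of encoding")


def encode_dangerous(value: str) -> str:
    # Step 3: replace each dangerous character with a numeric HTML entity.
    return "".join(f"&#{ord(ch)};" if ch in DANGEROUS else ch for ch in value)


def filter_input(value: str) -> str:
    while True:
        value = decode_fully(value)
        # Step 2: strip known bad sequences, ignoring case.
        stripped = BAD_SEQUENCES.sub("", value)
        if stripped == value:
            break
        # Something was stripped, so go back to step 1.
        value = stripped
    return encode_dangerous(value)


print(filter_input("%253CsCrIpT%253Ealert(1)"))
print(repr(filter_input("<scr<script>ipt>")))
print(filter_input("Paul O'brien-Shea"))
```

One side effect of decoding first is that input the user genuinely typed as an entity, such as &lt;, is decoded and then re-encoded in numeric form. That is harmless for display but worth knowing about.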
If case is handled by the above (i.e. script and ScRiPt are both caught), then this approach will prevent an attacker from using encoding to bypass the filter, and it will prevent an attacker from using the stripping function of the filter to build up malicious payloads (as the filter runs recursively). Finally, if a genuine input includes potentially dangerous characters and an encoding type such as HTML entity encoding is chosen, then the output will still be rendered correctly to the user. Dangerous characters should include the list below; however, you could consider encoding all non-alphanumeric characters.
Minimum Characters to be considered dangerous:
" ' ; - < > =