Security Design Principles - Input/Data Validation
- J.D. Meier, Alex Mackman, Michael Dunner, Srinath Vasireddy, Ray Escamilla and Anandha Murukan
Assume All Input Is Malicious.
Input validation starts with a fundamental supposition that all input is malicious until proven otherwise. Whether input comes from a service, a file share, a user, or a database, validate your input if the source is outside your trust boundary. For example, if you call an external Web service that returns strings, how do you know that malicious commands are not present? Also, if several applications write to a shared database, when you read data, how do you know whether it is safe?
Centralize Your Approach.
Make your input validation strategy a core element of your application design. Consider a centralized approach to validation, for example, by using common validation and filtering code in shared libraries. This ensures that validation rules are applied consistently. It also reduces development effort and helps with future maintenance.
In many cases, individual fields require specific validation, for example, with specifically developed regular expressions. However, you can frequently factor out common routines to validate regularly used fields such as e-mail addresses, titles, names, postal addresses including ZIP or postal codes, and so on.
Do Not Rely on Client-Side Validation.
Be Careful with Canonicalization Issues.
Data in canonical form is in its most standard or simplest form. Canonicalization is the process of converting data to its canonical form. File paths and URLs are particularly prone to canonicalization issues and many well-known exploits are a direct result of canonicalization bugs. For example, consider the following string that contains a file and path in its canonical form.
The following strings could also represent the same file.
somefile.dat c:\temp\subdir\..\somefile.dat c:\ temp\ somefile.dat ..\somefile.dat c%3A%5Ctemp%5Csubdir%5C%2E%2E%5Csomefile.dat
In the last example, characters have been specified in hexadecimal form:
%3A is the colon character. %5C is the backslash character. %2E is the dot character.
You should generally try to avoid designing applications that accept input file names from the user to avoid canonicalization issues. Consider alternative designs instead. For example, let the application determine the file name for the user.
If you do need to accept input file names, make sure they are strictly formed before making security decisions such as granting or denying access to the specified file.
Constrain, Reject, and Sanitize Your Input.
The preferred approach to validating input is to constrain what you allow from the beginning. It is much easier to validate data for known valid types, patterns, and ranges than it is to validate data by looking for known bad characters. When you design your application, you know what your application expects. The range of valid data is generally a more finite set than potentially malicious input. However, for defense in depth you may also want to reject known bad input and then sanitize the input.
To create an effective input validation strategy, be aware of the following approaches and their tradeoffs:
- Constrain input.
- Validate data for type, length, format, and range.
- Reject known bad input.
- Sanitize input.
Constraining input is about allowing good data. This is the preferred approach. The idea here is to define a filter of acceptable input by using type, length, format, and range. Define what is acceptable input for your application fields and enforce it. Reject everything else as bad data.
Constraining input may involve setting character sets on the server so that you can establish the canonical form of the input in a localized way.
Validate Data for Type, Length, Format, and Range
Use strong type checking on input data wherever possible, for example, in the classes used to manipulate and process the input data and in data access routines. For example, use parameterized stored procedures for data access to benefit from strong type checking of input fields.
String fields should also be length checked and in many cases checked for appropriate format. For example, ZIP codes, personal identification numbers, and so on have well defined formats that can be validated using regular expressions. Thorough checking is not only good programming practice; it makes it more difficult for an attacker to exploit your code. The attacker may get through your type check, but the length check may make executing his favorite attack more difficult.
Reject Known Bad Input
Deny "bad" data; although do not rely completely on this approach. This approach is generally less effective than using the "allow" approach described earlier and it is best used in combination. To deny bad data assumes your application knows all the variations of malicious input. Remember that there are multiple ways to represent characters. This is another reason why "allow" is the preferred approach.
While useful for applications that are already deployed and when you cannot afford to make significant changes, the "deny" approach is not as robust as the "allow" approach because bad data, such as patterns that can be used to identify common attacks, do not remain constant. Valid data remains constant while the range of bad data may change over time.
Sanitizing is about making potentially malicious data safe. It can be helpful when the range of input that is allowed cannot guarantee that the input is safe. This includes anything from stripping a null from the end of a user-supplied string to escaping out values so they are treated as literals.
Another common example of sanitizing input in Web applications is using URL encoding or HTML encoding to wrap data and treat it as literal text rather than executable script. HtmlEncode methods escape out HTML characters, and UrlEncode methods encode a URL so that it is a valid URI request.
The following are examples applied to common input fields, using the preceding approaches:
- Last Name field. This is a good example where constraining input is appropriate In this case, you might allow string data in the range ASCII A–Z and a–z, and also hyphens and curly apostrophes (curly apostrophes have no significance to SQL) to handle names such as O'Dell. You would also limit the length to your longest expected value.
- Quantity field. This is another case where constraining input works well. In this example, you might use a simple type and range restriction. For example, the input data may need to be a positive integer between 0 and 1000.
- Free-text field. Examples include comment fields on discussion boards. In this case, you might allow letters and spaces, and also common characters such as apostrophes, commas, and hyphens. The set that is allowed does not include less than and greater than signs, brackets, and braces.
Some applications might allow users to mark up their text using a finite set of script characters, such as bold, italic, or even include a link to their favorite URL. In the case of a URL, your validation should encode the value so that it is treated as a URL.
An existing Web application that does not validate user input. In an ideal scenario, the application checks for acceptable input for each field or entry point. However, if you have an existing Web application that does not validate user input, you need a stopgap approach to mitigate risk until you can improve your application's input validation strategy. While neither of the following approaches ensures safe handling of input, because that is dependent on where the input comes from and how it is used in your application, they are in practice today as quick fixes for short-term security improvement:
- HTML-encoding and URL-encoding user input when writing back to the client. In this case, the assumption is that no input is treated as HTML and all output is written back in a protected form. This is sanitization in action.
- Rejecting malicious script characters. This is a case of rejecting known bad input. In this case, a configurable set of malicious characters is used to reject the input. As described earlier, the problem with this approach is that bad data is a matter of context.
Validate all values sent from the client.
Restrict the fields that the user can enter and modify and validate all values coming from the client. If you have predefined values in your form fields, users can change them and post them back to receive different results. Permit only known good values wherever possible. For example, if the input field is for a state, only inputs matching a state postal code should be permitted.
Do not trust HTTP header information.
HTTP headers are sent at the start of HTTP requests and HTTP responses. Your Web application should make sure it does not base any security decision on information in the HTTP headers because it is easy for an attacker to manipulate the header. For example, the referer field in the header contains the URL of the Web page from where the request originated. Do not make any security decisions based on the value of the referer field, for example, to check whether the request originated from a page generated by the Web application, because the field is easily falsified.