When application security was still in it’s infancy, there were discussions on how to protect applications from newly discovered injection vulnerabilities. "Sanitize Input" was a popular solution that rolled off the tongue nicely and was not overly complicated to explain. It was also, a very generic solution that would (hopefully) be part of a more complete approach.
As much as "Sanitizing Input" makes sense, so does writing your code in a way which, allows you to handle failure safely. This way, when the unexpected does happen, an entire operation doesn't fall down, introduce a bug or propagate unsafe data.
The OWASP Top 10 categorizes XSS and SQL Injection separately. As an attacker, you are injecting data that is handled insecurely by application code. In this way, it is really just another form of injection. On that note, let’s discuss two manifestations of injection. SQL Injection and HTML Injection (XSS). I'd like to demonstrate other ways to think about or handle data beyond just "Sanitize Input". If you take away nothing more from this article, I'd like it to be that applications are unique, there is a level of complexity to design choices and solutions and there are more options than "Sanitize Input" available.
SQL Injection: "Save your one-liners for the bar". Parametrization of database queries is a classic method for handling queries safely and in many cases more efficiently. From a security standpoint, parametrized queries help to solidify the boundaries between user data and SQL statements. It ensures that data submitted by the user will be separated from the actual database query and won’t interfere with the SQL code and ultimately the database.
Example of lazy code....
$uname = $_GET['user_name']
....and this is the classic example everyone shows, nothing new here, that illustrates a SQL Injection flaw where the data ($uname) is actually included in the SQL statement.
"SELECT user_id from users where username = $uname;"
This programming flaw has destroyed the boundary between the SQL command and user-supplied input. Because the user data is now cast as a string-- it is no longer clear to the SQL server what part was supplied by the developer and what was supplied by the user. The whole query can fall apart by appending double quotes. This is not just a vulnerability, this is bad programming. Sure, it takes one line to write the query but there is no further sanity checking here. The string is formed, sent to the server, and executed as SQL. How are parametrized queries different?
Parametrized queries separate the data from the query so that we as coders don’t miscommunicate our intentions to the database server. How does it work? The majority of the query is sent to the server MINUS the actual user submitted data. So, the query is prepared (meaning sent to the server), a response comes back with a token (minus MySQL as I understand it), and THEN, the variable is sent to the server with the token and a SQL query executes. This means the expected query and actual data that we've gathered from the user are separated prior to execution.
Lets provide a visualization
// Pass in db credentials as well as the host it is located on and the database we'd like to connect to
$conn = new PDO("mysql:host=$dbhost;dbname=$dbname",$dbuser,$dbpass);
// Prepare the statement
$sql = "SELECT user_id from users where username = ?";
//Execute the query, taking in the variable data ($uname) from the user
$q = $conn->prepare($sql);
$object = $q->fetchColumn();
As you can see, the $sql statement is prepared and the server knows exactly what it should look like. Next, the SQL statement is executed, passing in the variable value in place of the "?" (shown above). By specifying that question mark, you tell the db, this is my statement but I don't know what the value will be.....I'll give you that on the next call.
Lets examine XSS. Again, something I hear a lot is "Sanitize your input". Some people even go as far as "Whitelist" versus "Blacklist". Okay, great, that is not extensible and ultimately context matters. What do I mean? It is a very one-sided approach with a lot of assumptions. Let me draw a picture for you. The understanding, as of right now, is the data comes in one place and is potentially echoed in another. So the model looks something like this: