Tuesday, February 23, 2010

Data validation musings

As I take on the responsibility of supporting more and more web applications I've come to realize that the most difficult and time consuming bugs to not only eradicate, but clean up after, are those which leak bad data into the backing databases.  As a result I find myself spending generous amounts of time working on data validation aspects in the new applications I create.

In web applications there are usually several layers of validation working in concert:

  1. Client validation.  Read: Javascript.  This form of validation became the norm in web applications over the past decade as Javascript has gained wider acceptance in the browser.  Since web applications are largely about the user experience in many ways, these weak checks can be quite important for a variety of reasons, the least of which relates to the basic integrity of a to-be-persisted piece of data.  From my perspective as data defender, I don't really consider Javascript validation to be, well, validation, at all.  It's 100% user experience in my view.  That's not to say that it isn't a key player in maintaining the correctness of data; after all, since users of web applications tend to interact with other players indirectly via the database, the means by which they see and interact with the data can have a serious impact on how they subsequently manipulate it.  AJAX helps web applications respond to and keep the users informed of events that have changed or which cannot be ascertained at the time of the initial request.
  2. Application code validation.  I've seen a lot of programs limp along with all validation being done at the server-side/application code level.  This is traditionally where most of the business logic lives.  And what are validations if not pure, codified business rules?
  3. Database validation.  This is where the edge cases are caught.  The place where the scenario of two people hitting a button at the exact same time is either caught and dealt with, or blissfully ignored at the cost of data integrity.  Functional tests can easily miss these cases, as they're usually the hardest to replicate.  Stress testing probably unearths more of them, but in a lot of production applications there are usually some cracks that, while small and unlikely to show themselves, exist, and which seem to be the most troublesome to debug.
The enumerations above are in practice not really as distinct as the numbered list would have you believe.  AJAX can blur the line between client-side validation and server-side validation, acting as a mere messenger rather than doing any work itself.  And the line between application code and database constraints is even more blurry, with transactions and vendor-specific features like stored procedures living in the fuzzy ether between the two.

While the validation of data is typically driven by business rules explicitly or implicitly, the method by which validations are written to enforce those rules varies quite a bit.  For example, a rule that users must enter their first and last names prior to registration is rather simple to enforce, as the presence or lack of values is itself enough information to determine whether the requirement has been met.  To contrast this, it becomes more complex when one must consult existing stored values in order to validate some new entry; as an example, a bank must check an account balance before determining whether a debit of $500 is allowed.  The difference between these two validations may seem minimal, especially when they sit side by side as brothers in a requirements document, but when the code hits the pavement the second example is many times more difficult to enforce absolutely.

In the first example we could simply slap a NOT NULL constraint on the relevant columns in the database and it would be enough to ensure the validity of our persistent data store.  It may not be handled in the most friendly manner, but there's no way that record will get inserted until values are provided.  If our application is the single point of access to the database and we enforce this requirement in every place in the application code which inserts or updates records, we could theoretically shift this check from the database to the application code and happily chug along for years in production with not a single problem.

The second example is not so simple.  Not only do we need to examine existing persisted data to determine whether the account has enough funds to cover the debit, we also need to serialize each transaction (overloaded term in this context) such that a situation cannot arise where simultaneous debits pass the balance check before either of them posts to the account, with the end result being a negative balance.  We cannot live in our own little bubble anymore; we must be fully aware of our environment and the other potential players.

To truly fulfill the basic integrity and correctness of data in modern day web applications, one typically must be competent in SQL, the vendor-specific quirks and features (such as PL/SQL or T-SQL) of the database being used, a server-side application programming language like C#, Java, Ruby, PHP or ASP, and most likely Javascript for the browser.  But simply "knowing" (whatever depth of comprehensive understanding that happens to convey) the appropriate languages is not enough.  One must be able to understand all the moving parts so that validations can be written in a way that all points at which these languages and tiers of the system interact with each other are conducted in such a fashion that things can't break regardless of the data being affected by simultaneous executions of the same code by other users and processes.

No comments:

Post a Comment