Learnt some valuable lessons from https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
The original article is well worth reading in full; however, its examples are in Haskell, so here I rephrase it, based on my own understanding, in TypeScript.
Consider a function:
function head<T>(a: T[]): T {
  return a[0];
}
This function is partial - it does not handle all possible inputs. Given an empty array, a[0] is undefined at run time even though the signature promises a T. (In Haskell, the equivalent definition would at least draw a non-exhaustive pattern warning from the compiler.) So something must be done.
function head<T>(a: T[]): T | undefined {
  return a.length ? a[0] : undefined;
}
This function is total: whatever the input, it returns a value - undefined when the array is empty.
Problem:
hidden cost: the burden now falls on the consumer of the returned value to handle the possibility of undefined
even if a caller has already checked that the input is non-empty, it still has to check (or at least assert) that the return value is not undefined, despite having just established that
so it is annoying, carries a (however small) performance cost, and,
worst of all, it is "a bug waiting to happen": if the code that checks the input for non-emptiness ever stops checking, the code that uses the return value still believes (unless modified in sync) that it is not undefined - even worse when the caller of the function and the consumer of its result are not in the same place. See the sketch below.
function head<T>(a: NonEmptyArray<T>): T {
  return a[0];
}
Note: TypeScript cannot enforce non-emptiness without a run-time check, so a type guard must be used to turn a plain T[] into a NonEmptyArray<T>. One workable encoding is sketched below; see https://stackoverflow.com/questions/56006111/is-it-possible-to-define-a-non-empty-array-type-in-typescript
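A minimal sketch of the encoding, using the tuple-with-rest trick from the linked question:
// A non-empty array is a tuple with at least one element.
type NonEmptyArray<T> = [T, ...T[]];

// The type guard performs the run-time check and narrows T[] to NonEmptyArray<T>.
function isNonEmpty<T>(a: T[]): a is NonEmptyArray<T> {
  return a.length > 0;
}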
This forces the caller to establish, up front, that the array is non-empty. The function is total.
Now:
no redundant checks by the consumer of the returned value
no "bug waiting to happen"
function validateNonEmpty<T>(a: T[]): T[] {
  if (a.length === 0) {
    throw new Error();
  }
  return a; // same type out as in: the knowledge gained by the check is discarded
}

function parseNonEmpty<T>(a: T[]): NonEmptyArray<T> {
  if (a.length === 0) {
    throw new Error();
  }
  return a as NonEmptyArray<T>; // the result of the check is preserved in the type
}
The two functions are identical in run-time behaviour. However (quoting the original article):
the difference between validation and parsing lies almost entirely in how information is preserved … parseNonEmpty returns NonEmpty a, a refinement of the input type that preserves the knowledge gained in the type system. Both of these functions check the same thing, but parseNonEmpty gives the caller access to the information it learned, while validateNonEmpty just throws it away … validateNonEmpty obeys the typechecker well enough, but only parseNonEmpty takes full advantage of it … parsers are an incredibly powerful tool: they allow discharging checks on input up-front, right on the boundary between a program and the outside world, and once those checks have been performed, they never need to be checked again!
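In TypeScript terms, continuing with the hypothetical scores from above:
const validated = validateNonEmpty(scores); // still number[]: the proof is thrown away
const parsed = parseNonEmpty(scores);       // NonEmptyArray<number>: the proof is kept

head(parsed);      // fine
// head(validated); // would not compile: the type no longer carries the knowledge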
The approach is:
parse (rather than merely validate) everything at the boundary between the system (or library) and the outside world
once those checks have been performed, they never need to be performed again
preserve the parse result as type information - values can be parsed long before they are actually used, and if the boundary changes (for example, it now accepts empty arrays as well), the code that uses the values (somewhere far from the boundary) fails to compile instead of failing at run time. See the sketch below.
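A sketch of this shape (atBoundary and deepInside are hypothetical names):
// The boundary: raw input is refined exactly once; parse failure can only happen here.
function atBoundary(input: number[]): number {
  const parsed = parseNonEmpty(input); // may throw, but only at the edge
  return deepInside(parsed);
}

// Far from the boundary: the type carries the proof, so no re-checking.
function deepInside(xs: NonEmptyArray<number>): number {
  return head(xs);
}

If the boundary later changes to allow empty input, atBoundary can no longer produce a NonEmptyArray<number>, and the call to deepInside fails to compile rather than misbehaving at run time.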
Up-front validation, by contrast, keeps the "bug waiting to happen" problem:
if the validation changes (for example, it now accepts empty arrays as well)
and the code that uses the value is not updated in sync, a bug occurs
Ad-hoc validation (validating right before use, at the place where the value is used):
(quote) Ad-hoc validation leads to a phenomenon that the language-theoretic security field calls shotgun parsing
(quote) Shotgun parsing is a programming antipattern whereby parsing and input-validating code is mixed with and spread across processing code—throwing a cloud of checks at the input, and hoping, without any systematic justification, that one or another would catch all the “bad” cases.
(quote) Shotgun parsing necessarily deprives the program of the ability to reject invalid input instead of processing it. Late-discovered errors in an input stream will result in some portion of invalid input having been processed, with the consequence that program state is difficult to accurately predict.
(quote) Parsing avoids this problem by stratifying the program into two phases—parsing and execution—where failure due to invalid input can only happen in the first phase. The set of remaining failure modes during execution is minimal by comparison, and they can be handled with the tender care they require.
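A sketch of that two-phase stratification (parseConfig and serve are hypothetical names):
// Phase 1 - parsing: every invalid-input failure happens here, at the edge.
function parseConfig(raw: string): { port: number } {
  const port = Number(raw);
  if (!Number.isInteger(port) || port <= 0 || port > 65535) {
    throw new Error(`invalid port: ${raw}`);
  }
  return { port };
}

// Phase 2 - execution: this code can no longer fail because of invalid input.
function serve(config: { port: number }): void {
  console.log(`listening on ${config.port}`);
}

serve(parseConfig("8080"));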
(quote)
Use a data structure that makes illegal states unrepresentable. Model your data using the most precise data structure you reasonably can. If ruling out a particular possibility is too hard using the encoding you are currently using, consider alternate encodings that can express the property you care about more easily. Don’t be afraid to refactor.
Push the burden of proof upward as far as possible, but no further. Get your data into the most precise representation you need as quickly as you can. Ideally, this should happen at the boundary of your system, before any of the data is acted upon.
If one particular code branch eventually requires a more precise representation of a piece of data, parse the data into the more precise representation as soon as the branch is selected. Use sum types judiciously to allow your datatypes to reflect and adapt to control flow.
In other words, write functions on the data representation you wish you had, not the data representation you are given.
Let your datatypes inform your code, don’t let your code control your datatypes. Avoid the temptation to just stick a Bool in a record somewhere because it’s needed by the function you’re currently writing. Don’t be afraid to refactor code to use the right data representation—the type system will ensure you’ve covered all the places that need changing, and it will likely save you a headache later.
Don’t be afraid to parse data in multiple passes. Avoiding shotgun parsing just means you shouldn’t act on the input data before it’s fully parsed, not that you can’t use some of the input data to decide how to parse other input data. Plenty of useful parsers are context-sensitive.
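To ground "illegal states unrepresentable" and "sum types" in TypeScript, a minimal sketch using a discriminated union (RemoteData and its variants are illustrative):
// Instead of { loading: boolean; data?: string; error?: string }, where illegal
// combinations (e.g. loading together with an error) are representable, give each
// legal state its own variant:
type RemoteData =
  | { state: "loading" }
  | { state: "loaded"; data: string }
  | { state: "failed"; error: string };

function render(r: RemoteData): string {
  switch (r.state) {
    case "loading":
      return "loading...";
    case "loaded":
      return r.data; // data exists only in this variant
    case "failed":
      return r.error; // error exists only in this variant
  }
}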
Keep a single source of truth for data. If denormalization is absolutely necessary, hide it behind encapsulation.
Use an abstract "newtype" with a constructor / type guard to fake a parser, as sketched below.
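One common way to emulate a newtype in TypeScript is a branded type whose only sanctioned constructor performs the run-time check (Email and parseEmail are illustrative):
// The brand exists only at the type level; at run time an Email is just a string.
type Email = string & { readonly __brand: "Email" };

function parseEmail(s: string): Email {
  if (!s.includes("@")) {
    throw new Error(`not an email: ${s}`);
  }
  return s as Email; // the only cast lives here, inside the constructor
}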
Consider the principles in this blog post ideals to strive for, not strict requirements to meet. All that matters is to try.
Also read: