The primary underlying principles behind my research are:
Data encodes the system that generated it. This implies that if you look at a data stream correctly, you can decode the "grammar" used to generate that stream. Initially, I explored the domain of natural language. There the problem is that "words" are used to determine a grammar about "words," which creates a bootstrapping problem that can nevertheless be overcome. A second project in this research area was a study of the security context at SDSU Health Services. In that case, the domain consisted of computer processes, internet ports, internet addresses, and internet connection states, resulting in a far more complex set of values being used to determine a grammar for "acceptable internet behavior." This work led to a monitoring system that detects abnormal internet behavior with very few false positives. The next domain of study will be Clinical Decision Support for the Medicat EMR system currently in use at SDSU Health Services.
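To make the idea concrete, here is a minimal sketch (not the actual monitoring system) of how a "grammar" can be decoded from a data stream: observe which event-to-event transitions occur during normal operation, then flag transitions that fall outside the learned grammar. The event labels and data below are hypothetical.

```python
# Minimal, hypothetical sketch of "decoding the grammar" of a data stream:
# learn which transitions between events occur during a training period,
# then flag transitions that were never observed as potentially abnormal.
from collections import defaultdict

def learn_transitions(stream):
    """Build a simple 'grammar': the set of observed event-to-event transitions."""
    grammar = defaultdict(set)
    for prev, curr in zip(stream, stream[1:]):
        grammar[prev].add(curr)
    return grammar

def flag_abnormal(stream, grammar):
    """Report transitions never seen during training."""
    return [(prev, curr) for prev, curr in zip(stream, stream[1:])
            if curr not in grammar.get(prev, set())]

# Hypothetical training data: process/port/connection-state events reduced to labels.
training = ["dns_query", "http_connect", "http_established",
            "dns_query", "smtp_connect", "smtp_established"]
observed = ["dns_query", "http_connect", "irc_connect"]

grammar = learn_transitions(training)
print(flag_abnormal(observed, grammar))   # [('http_connect', 'irc_connect')]
```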
[Figure: A power law graph illustrating ranking by popularity. To the right is the long tail; to the left are the few that dominate (also known as the 80-20 rule). © Wikipedia.org]
Machine Learning. During my PhD course work at UCSD, I spent a significant amount of time working with its variety of neural network model, Parallel Distributed Processing (PDP). This model uses a training regime and a feedback mechanism to identify the "features" of a data set. Given sufficient training data, the system does a good job of approximating the "grammar" that generated the data. But the actual learning mechanism is very slow and doesn't accurately simulate "real world" learning. In particular, PDP (and most other probabilistic learning mechanisms) performs poorly on infrequently occurring data because of the paucity of training examples. Those data points fall into the "Long Tail" of occurrences, which are often dismissed as "noise." In my review of security monitoring techniques, the items in the "Long Tail" tend to be reported as "false positives." Yet in many domains, the more interesting items that need to be identified and studied fall precisely into the "Long Tail." To overcome these issues, I developed an alternative approach that I labeled "Occurrence-Based" processing. This technique ignores an item's frequency and simply looks at its key defining features. Items are then grouped using data clustering and an appropriate "distance" function. The key defining features are identified for each data cluster and form its "defining context." A new item with only one occurrence can be classified if it occurs in a recognizable context. This led to much better word classification than extant methods (circa 1993). In the security environment project, it led to a very low number of false positives with no identified false negatives. In the medical project, we expect similarly positive results from this approach.
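The following is a minimal sketch of the occurrence-based idea, assuming set-valued features, a Jaccard distance, and a simple greedy clustering pass; it is not a reproduction of the actual distance function or clustering details I used. Feature names are hypothetical.

```python
# Minimal, hypothetical sketch of "Occurrence-Based" processing: items are
# described only by their defining features (frequency is ignored), clustered
# with a simple distance function, and a brand-new item is classified by the
# cluster whose "defining context" it best matches.

def jaccard_distance(a, b):
    """Distance between two feature sets (0 = identical, 1 = disjoint)."""
    return 1.0 - len(a & b) / len(a | b)

def cluster(items, threshold=0.6):
    """Greedy single-pass clustering; each cluster keeps a defining context
    (the features shared by all of its members)."""
    clusters = []   # list of (defining_context, members)
    for features in items:
        for i, (context, members) in enumerate(clusters):
            if jaccard_distance(features, context) <= threshold:
                members.append(features)
                clusters[i] = (context & features, members)  # tighten the context
                break
        else:
            clusters.append((set(features), [features]))
    return clusters

def classify(new_item, clusters):
    """Find the cluster whose defining context is nearest to a single-occurrence item."""
    return min(clusters, key=lambda c: jaccard_distance(new_item, c[0]))

# Hypothetical items: sets of contextual features, not frequency counts.
items = [{"outbound", "port80", "browser"},
         {"outbound", "port443", "browser"},
         {"inbound", "port22", "shell"}]
groups = cluster(items)
nearest_context, _members = classify({"outbound", "port8080", "browser"}, groups)
print(nearest_context)   # the 'outbound browser' context, despite the unseen port
```

Even though the new item occurs only once and uses a port never seen before, it lands in a recognizable context, which is the point of ignoring frequency.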
Complex Dynamic Systems (CDS). John Holland, the father of the Genetic Algorithm, has developed a theory of Complex Adaptive Systems (CAS) that he describes in his book Hidden Order: How Adaptation Builds Complexity (1995). There he describes a two-stage model: (1) interactions of complex dynamic systems, and (2) reproduction of those systems (with mutations, an evolution of his original genetic algorithm work). I have found that the former provides a good working framework for what I have been observing in my security environment monitoring work. He describes a number of features of such systems that have direct analogs in the monitoring system I developed as part of the security project. Thus, CDS has provided a theoretical framework for that work. Note that the CDS framework also provides some justification for the mounting problems in testing large, complex software systems.
Orders of Ignorance. Phillip G. Armour laid out his concept of the "Five Orders of Ignorance" in the October 2000 issue of the Communications of the ACM (Vol. 43, No. 10, pp 17-20). The five orders of ignorance are:
0. Lack of Ignorance: "I know something and I can demonstrate my lack of ignorance..." In this case I know the answer to my problem.
1. Lack of Knowledge: "I don’t know something and can readily identify that fact." I know the question. And having a good question, I should be able to find the answer to my problem.
2. Lack of Awareness: "I don't know enough to know that I don't know enough." This is a real problem because we don't even know what questions to ask. Most software development starts here, especially when the domain of work is new. Generally, we can fall back on our set of skills and knowledge to help us discover that there are things we don't know. As the problem areas are discovered, appropriate questions can be developed that may finally lead to an answer.
3. Lack of Process: "I don't know of a way to find out there are things I don't know that I don't know." A major issue arises when our skills and knowledge are not adequate to allow us to discover that new questions need to be developed.
4. Meta Ignorance: "I don't know about the Five Orders of Ignorance." Knowledge of this hierarchy helps in avoiding the impasse described for 3rd Order Ignorance. This involves continuing to cultivate skills and knowledge to: (a) discover new problems, (b) provide flexible solutions that can "grow" to solve unforeseen problems, and (c) apply solutions in small increments so that the impact of 3rd Order Ignorance is minimized.
The primary underlying principles behind my development work are:
Dynamic Programming Languages. These languages are interpretive in nature: the code is not compiled in advance but "interpreted at the time of execution." The flexibility and rapid prototyping that these languages provide make them ideal for the type of "research related" programming I have pursued throughout my career. I started with Lisp in the mid 80's, and then became highly proficient in Mumps (M) during the late 80's and early 90's. In the mid 90's I began working with Web Applications. Since that time, I have worked almost exclusively in server-side VBScript and client-side ECMAScript (JavaScript). When server-based agents are necessary to perform off-line processing, I use Visual Basic 6. This set of tools still provides adequate computational power to meet my development needs.
Web Applications. In the Fall of 1997, I began building my first web application. Since then, I have built all of my software using this model. I particularly like the flexibility of this type of development, which greatly facilitates building "rapid prototypes." Other important aspects are the extremely lightweight client, the potential to isolate business logic in a middle tier (the application/web server) rather than in the database, and the ability to rapidly distribute software updates (simply by updating the application/web server). Generally, the principle I have been following is that any Windows heavy-client application can be replaced by a web application. The challenge has been to provide a comparable user interface.
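A minimal sketch of the middle-tier principle follows, written in Python (rather than the VBScript/ASP stack described above) only to keep it self-contained; the route and the business rule are hypothetical. The point is that the rule lives in one server-side function, not in the database and not in the thin browser client, so it can be changed and redeployed in one place.

```python
# Minimal, hypothetical sketch of the middle-tier principle: the business rule
# lives in a plain function on the application/web server, not in the database
# and not in the (thin) browser client. Names and the rule itself are
# illustrative only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def copay_due(visit_type, insured):
    """Business logic isolated in the middle tier: easy to update and redeploy."""
    base = {"routine": 20, "urgent": 50}.get(visit_type, 35)
    return 0 if insured else base

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Thin client just asks, e.g. GET /copay?visit=routine&insured=0;
        # all decision making stays on the server.
        query = parse_qs(urlparse(self.path).query)
        amount = copay_due(query.get("visit", [""])[0],
                           query.get("insured", ["0"])[0] == "1")
        body = json.dumps({"copay": amount}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()
```

Updating the rule means editing one function on the server; every client sees the change on its next request, with nothing to install or redistribute.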
"Slanty Design." Russell Beale laid out this concept in the January 2007 issue of the Communications of the ACM (Vol. 50, No. 1, pp 21-24). Standard practice is to design user interfaces (programs) to meet user’s needs. What "slanty design" does is to look at both what the user needs and doesn’t need. The idea is to make the “needs” easy and the undesirable stuff hard. Thus, the goal is to make the user more efficient doing their job by both minimizing the effort to do the right things, and minimizes the potential for error. The nice thing about Web Applications is that it is extreemely easy (and fast) to modify the user interface to help pevent errors as soon as problems surface. I have been incorporating this design strategy since the iChart Medical Practice Management System went live in the Fall of 2000.
"There is no best practice." Bob Lewis provides a discussion of best practices, that closely parallels my practice, in the April 10, 2006 edition of his KJR newsletter ("Keep the Joint Running;" IS Survivior Publishing). In particular, I have always had a problem with the term "best practices." If I had followed best practices for web development/deployment, then iChart would not have existed ("best practices" couldn't provide the performance required). In Fall 2005, we had an issue with performance on the iChart DB server. Best practices would have called for increasing the RAM on that server. But, my experience with other web applications indicated that the particular phenomenon I was seeing could not be solved by adding more memory (no matter how much memory was used, a similar problem kept recurring). Instead the correct solution to the problem was to tune the SQL Server to use "less memory." This is a very counter-intuitive solution that does actually have some theoretical justification. I have found that a lot of “best practices” are essentially intuitive solutions with only anecdotal support. As Bob states: "... much of what the industry calls 'best practices' are nothing of the sort. ... Many are descriptions of what one or two large corporations do and like, applied as prescriptions for every company regardless of whether they fit the circumstances or not. They're one-size-fits-nobody recommendations. ..." And, he states as his first core principle of IT Management:
"0. There is no best practice. There are practices that fit best. Different situations call for different solutions -- form follows function."