This study identifies methods for eliciting knowledge from experts with minimal bias and evaluates their applicability to information security risk assessment, decision-making, and day-to-day operations. Decision makers rely on expert estimates in many fields, including information security, typically in the form of the probability of a security event or its potential impact, but also operational estimates like project duration. Research, however, shows no consistent relationship between the estimation accuracy of experts and their years of experience, publication record, or area of expertise (even if that area is statistics or judgment and decision-making psychology!). Researchers regularly observe that bias-reducing methods improve estimates across industries and subjects. Such methods include
formatting questions and available data in ways shown to ensure clarity and comprehension by experts,
calibration training to reduce overconfidence,
integration of estimates collected from multiple experts,
integration of empirical data with expert estimates,
creation of simulation models that factor in not only expert estimates but also the experts' own uncertainty and the irreducible uncertainties of the world that we lack the resources to resolve,
updating of simulation models when new information becomes available and when threats and opportunities change.
These methods are applicable to high-level information security risk assessment and decision-making processes, as well as low-level technical SOC and CIRT daily operations.
Research by Nobel Laureate Daniel Kahneman, Amos Tversky, and other judgment and decision-making (JDM) psychologists found that humans are poor estimators of uncertainty. Their studies also found this to be true regardless of the field of work or the level of experience (Kahneman et al., 1972). Later studies confirmed these findings (Onkal et al., 2003; Soll & Klayman, 2004; Speirs-Bridge et al., 2010). Researchers found that experience and level of training were only weakly related to performance (Camerer & Johnson, 1991; Burgman et al., 2011). Reliance on experts for decision making in the presence of uncertainty is common in many fields, including ecology (McBride et al., 2012), weather forecasting (Murphy & Winkler, 1984), accounting (Ashton, 1974), finance (Onkal et al., 2003), clinical medicine (Christensen-Szalanski et al., 1982), psychiatry (Oskamp, 1965), engineering (Jorgensen et al., 2004), and information security (Kouns & Minoli, 2010). Surveys by Hilborn and Ludwig found that managers do not know whether their successes and failures are a result of their experts' guidance (Hilborn & Ludwig, 1993; Sutherland, 2006; Roura-Pascual et al., 2009). The observations made in these studies suggest that the use of experts in risk assessment may provide a false measurement of risk.
Methods for eliciting knowledge from experts with minimally biased results have been developed and tested in multiple disciplines. These methods reduced bias and enabled management to measure the accuracy and precision of their experts (McBride et al., 2012; Bolger & Onkal-Atay, 2004; Lichtenstein et al., 1982; Clemen & Winkler, 1999). The objective of this document is to identify and evaluate expert knowledge elicitation methods that address the fallibility of human estimation. What methods are available to treat this observed human error? What are the criticisms of these methods? Can methods used in other disciplines be applied to the field of information security?
Periodic assessments of risk to data handling assets are standard practice for regulated organizations such as insurance companies, financial institutions, and medical offices (Kouns & Minoli, 2010). In the United States, banks may be examined by the FDIC, credit unions by the NCUA, medical record handling organizations by OCR under HIPAA, and credit card information handling organizations by virtue of PCI-DSS certification requirements. Organizations fund whole departments to keep up with these requirements, and fines alone for violations can be catastrophic to an institution. Compliance, risk management, and governance personnel support some form of regular risk assessment (Kouns & Minoli, 2010). Risk assessments aim to identify threats and then systematically consider the probability that each threat will be exploited and the impact if it is (Kouns & Minoli, 2010). Where scientific data to make these determinations is difficult to obtain, organizations instead request estimates from experts (Kuhnert et al., 2010). These experts are sometimes called advisors, consultants, senior analysts, or subject matter experts. Studies continually show that estimates provided by experts are inaccurate, largely due to biases: experts expressed 80% confidence that their estimates were true, but when validated their estimates captured the truth only 49-65% of the time (Kahneman, Slovic, & Tversky, 1982; McBride, Fidler, & Burgman, 2012; Onkal et al., 2003; Soll & Klayman, 2004; Speirs-Bridge et al., 2010). McBride et al. compared expert estimates to the amateur estimates of students aspiring to the same fields of work. The surprising result was that students, unlike experts, displayed a near-perfect awareness of their uncertainty (McBride, Fidler, & Burgman, 2012), meaning students would provide more reliable input for decision makers. In other words, students demonstrated a better awareness of their own ignorance. The research showed no consistent relationship between performance and years of experience, publication record, or self-assessed expertise (McBride, Fidler, & Burgman, 2012). McBride et al.'s experiments with bias-reducing methods showed that experts needed special training in order to accurately communicate their knowledge (McBride, Fidler, & Burgman, 2012). This suggests the possibility that even critical decisions being made by military, government, healthcare, financial, and critical infrastructure leaders are subject to the same error.
Minimal research is available that evaluates the different methods of collecting and processing expert estimates in the information technology industry, or even in the academic literature on informatics. Industry standards organizations provide guidance on information security risk assessment by experts in the form of estimates, but none of the guidance provides instruction on how to address known human biases, or explains why the guidance should result in effective methods for measuring or managing risk.
Executive decision makers, risk management personnel, and information security leaders and analysts may benefit from this study, as it reviews methods and criticisms of practices that they, or at least their risk departments, may be using with confidence when the research suggests that they should not be. Experts interested in accurately communicating their knowledge may also benefit from this research, since methods of increasing accuracy are discussed. Finally, individuals interested in human judgment and decision-making psychology may find this to be a valuable collection of peer-reviewed articles on systematic reduction of human bias.
This section provides readers with the background information necessary to understand the research problem: minimal guidance is available on how to reduce the impact of expert bias on information technology risk assessment, decision making, and operations. The foundational judgment and decision-making research and terms established by Kahneman et al. are summarized, along with other uncommon terms used in the remainder of this document. The section then describes what options are available to address the issue of expert bias. Studies have found that special measures are required when eliciting knowledge from experts; experts may be providing incorrect information without being aware of it. Specifically discussed are the options available for optimizing the use of experts and the peer-reviewed criticisms of each. The literature reviewed includes methods of expert knowledge elicitation from other fields of work that may be applicable to information security risk management, decision making, and daily operations such as Security Operation Centers (SOC) and Critical Incident Response Teams (CIRT).
Heuristics and biases. Kahneman et al. published research into the challenges of accurately eliciting knowledge from humans. Kahneman et al. called the cognitive cause of these challenges heuristics. Heuristics are simplifications of our environment that the human mind presumably makes in order to solve problems more quickly, at the expense of considering all of the particular details of a decision (Kahneman et al., 1974). The heuristics discussed include what they called Representativeness, Availability, and Anchoring and Adjustment. Biases that resulted from these heuristics were divided into Miscalibration, the Conjunction Fallacy, and Base Rate Neglect. Representativeness was the heuristic seen when a person assumes that two things belong to the same group because they present themselves similarly. Availability was the heuristic seen when a person assumes that an event occurs frequently because specific examples can be recalled from memory so readily. Anchoring and Adjustment was the two-part heuristic seen when an expert provided estimates that adhered closely to an example provided by the person eliciting a response. These heuristics help humans rapidly solve problems in simple situations, but such mental shortcuts result in bias when applied to more complex problems (Kahneman et al., 1974). The terms coined by Kahneman et al. are used in many research articles, but their conclusion on human estimation capacity has been challenged.
Criticisms of Kahneman et al. In "The Rhetoric of Irrationality," Lopes criticizes one of the conclusions Kahneman et al. drew from their findings. Her interpretation of their research was that people use heuristics instead of probability theory when making decisions most of the time, not that humans are simply poor estimators. Lopes pointed out that data were available showing that people correctly estimated value and risk in gambling situations (Anderson & Shanteau, 1970; Shanteau, 1974; Tversky, 1967) and also in assessing the likelihood of fairly complex joint events (Beach & Peterson, 1966; Lopes, 1976; Shuford, 1959). In that research, subjects produced the same probabilities produced by normative calculation using expected utility and compound probability multiplication. Neither of these mathematical methods was common knowledge among the subjects, yet the subjects produced equally correct answers.
In "The 'Heuristics and Biases' Bias in Expert Elicitation," Kynn found a disproportionate bias toward citing Kahneman et al.'s research in the literature. Studies that showed poor performance by human estimators were cited six times as often as research showing good performance (Kynn, 2008). Kynn pointed out that, regardless of the strong arguments made by authors like Lopes, most citations do not acknowledge these criticisms of Kahneman et al.'s work. Research by Gigerenzer et al. found that humans are more Bayesian thinkers and intuitive statisticians than the works of Kahneman et al. expressed, so long as information is communicated to experts in a frequency format (1995). Frequency formats communicate information in a form that more closely resembles the natural sampling observed in animal foraging and neural networks. What is 1% in standard format would be "10 out of every 1,000" in frequency format. The use of visuals was also a means of presenting data in a frequency format.
Bayesian inference. Bayesian inference uses subjective estimates of knowledgeable persons and improves upon those estimates using statistical methods (Vose, 2008). The process may involve an expert communicating prior knowledge as a probability distribution, which is then combined with the likelihood of observed historical data to produce an updated estimate. These estimates and values take the form of functions or distributions instead of point estimates. Scientists and statisticians may instead take the "classical" approach, which is considered more objective than Bayesian inference because it leaves less room for human error. The classical approach involves experimentation and identical independent trials instead of subjective estimates provided by experts.
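Formally, this combination follows Bayes' theorem: the posterior belief about an unknown quantity $\theta$, given data $D$, is the expert's prior reweighted by the likelihood of the data:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
```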
Distribution. Refers to a cumulative distribution function (CDF), distribution function, cumulative frequency function, cumulative probability function, or frequency distribution shape. A distribution function describes the probability that a random variable is less than or equal to some value. In the case of frequency distributions, this may take the form of a normal distribution. The normal distribution shows that most of the random values generated are closer to the mean than to either end of the range provided, forming the familiar bell curve shape.
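In standard notation, the CDF $F_X$ of a random variable $X$ gives the probability of observing a value at or below $x$:

```latex
F_X(x) = P(X \le x)
```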
Variability and uncertainty. A common criticism of probability is that there is simply too much luck or chance involved with certain events to measure probability. The academic community addresses this criticism with the concepts of variability and uncertainty. Vose posits that the human inability to predict future events is due to a combination of variability and uncertainty (2008). If the uncertainty factor is excluded from a risk assessment, the results may be overconfident, meaning the ranges are unrealistically narrow; imagine a forecaster saying that there is between a 60% and 61% chance of rain tomorrow. The assessor may be either unaware of or deliberately overconfident in their estimate. Conversely, the exclusion of variability will widen ranges so much that the results may be useless, such as saying that there is between a 1% and 100% chance it will rain tomorrow. Variability is sometimes called chance: specifically, chance that is inherent in and irreducible from what is being observed. For example, the chance of a fair coin landing heads or tails cannot be controlled or reduced from 50%. Similarly, if the coin is tossed twice, there is a 25% chance of each particular sequence: heads-heads, heads-tails, tails-heads, or tails-tails (Vose, 2008). Since we cannot further reduce our uncertainty about what the coin will show after a flip, the results contain an irreducible factor of randomness. A more complex example that Vose gives is the stock market (2008): stock prices are affected by a potentially infinite number of factors and so cannot easily be predicted. The EPA describes variability as "true heterogeneity or diversity in a population or exposure parameter" and, unlike Vose, specifies that such randomness is irreducible only most of the time, as opposed to absolutely (1997, p. 9). The EPA's response to high variability is to better characterize the diversity in the population or exposure parameter (1997, p. 9). By doing so, inferences can still be made from the sample being observed. Uncertainty, unlike variability, can be reduced and is inherent in all estimates. It may be reduced by collecting more information or by finding more knowledgeable subject matter experts. Vose posits that experts can be made to provide subjective estimates that are as objective as possible (Vose, 2008). They do so by following a logical path of reasoning that excludes prior, non-quantitative information about what they are assessing. This resembles the calibration technique described in the next section of this document.
The methods that will be evaluated include single-point scoring as proposed in many industry standards; calibration of estimators as proposed by Lichtenstein et al. (1982); aggregating data with expert opinion using Bayes' theorem by Yang and Berger (1997); Bayesian reasoning with frequency formats as proposed by Gigerenzer et al. (1995); aggregating estimates from multiple experts by Yaniv (2004), Harvey et al. (1997), Lim et al. (1995), Johnson et al. (2001), and Lele and Allen (2006); and Monte Carlo simulation modeling as described by the EPA (1997) and Vose (2008). Each method is followed by criticisms where criticisms were available.
There are a variety of expert knowledge elicitation methods that involve communicating estimates in the form of single-point estimates, or scores. Methods for assessing risk that elicit estimates as single-point figures, as opposed to ranges, are called "single-point subjective scoring" throughout this paper. Official standards organizations that publish guidance for eliciting expert estimates in the form of single points include: the National Institute of Standards and Technology (NIST SP 800-30); the International Organization for Standardization (ISO 31000); organizations that use ISO 31000 like the British Standards Institute (BSI) and the Australian/New Zealand Standard (AS/NZS); the Information Security Forum (ISF); SANS Institute; Global Information Assurance Certification (GIAC); the Federal Financial Institutions Examination Council (FFIEC); the Information Systems Audit and Control Association (ISACA); the Project Management Institute (PMI); and the Computing Technology Industry Association (CompTIA), among many others. These organizations provide guidance on measuring organizational risk. In practice, a subject matter expert rates the probability and impact of events. Experts are asked to provide their estimates in the form of scores like 1, 2, or 3; the words high, medium, or low; or the colors green, yellow, and red. In each case the answers are points on a spectrum. The first round of questions asks for the probability of an event; the second round asks for the impact. One end of each of these spectrums may be labeled "very low" and the other end "very high." In some cases, these descriptors are used explicitly instead of colors or numbers. Some methods use percentages or decimals between zero and one in place of the same aforementioned spectrum. Once ratings are established, they are sometimes fit into a grid or risk matrix. Some methods add the step of multiplying the probability and impact numbers together, creating a single value believed to represent both the probability and impact of the event or condition being assessed (see the sketch below). No peer-reviewed publications are available that test the mathematical validity or success of these methodologies.
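As an illustrative sketch (the scales and the multiplication step below are typical of such methods, not taken from any one standard), consider how multiplying ordinal ratings collapses very different risks into the same score:

```python
# Illustrative sketch (hypothetical scales): single-point ordinal scoring.
likelihood_scale = {"low": 1, "medium": 2, "high": 3}
impact_scale = {"low": 1, "medium": 2, "high": 3}

def risk_score(likelihood: str, impact: str) -> int:
    """Multiply ordinal ratings, as many scoring methods prescribe."""
    return likelihood_scale[likelihood] * impact_scale[impact]

# A frequent, minor event and a rare, severe event receive identical scores,
# even though their underlying probabilities and dollar impacts may differ
# by orders of magnitude.
print(risk_score("high", "low"))   # 3
print(risk_score("low", "high"))   # 3
```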
Criticisms of single-point subjective scoring. Hubbard and Evans identified four problems with scoring methods. The conclusion of their study favored risk assessment methods that measure risk in terms of mathematical probability using methods like Monte Carlo simulation (Hubbard & Evans, 2010). Organizations measuring their risk of experiencing rare events like flu pandemics or data breaches would have to systematically track the occurrence of those rare events over a long period of time to determine the accuracy of their forecasts. The extended period would require the organization to trust its score-based forecasts until a sufficient number of rare and harmful events occurred, or did not occur, to evaluate their effectiveness (Hubbard & Evans, 2010). In studies involving single-point scoring methods that use labels in place of scores, such as high, medium, and low probability, interpretations varied considerably between assessors (Hubbard & Evans, 2010). This disparity impacted assessment output in most cases. Similar studies have shown assessors defining terms differently even when provided explicit definitions (Budescu, Broomell, & Por, 2009; Heuer, 2005). When multiple risks are considered in an assessment, dependencies between them that may change risk ratings are not factored into any of the observed single-point methods (Hubbard & Evans, 2010). Hubbard and Evans proposed the use of percentages to communicate probability, dollar amounts to communicate impact, bias-reducing techniques like calibration, and Monte Carlo simulation to integrate probability and impact into a form on which mathematical methods can operate (Hubbard & Evans, 2010).
Risk Matrices. Risk matrices fail to communicate the disparity between different risk events. Two highly disparate risks may sit visually side by side on a risk matrix, giving the illusion that addressing either will mitigate a similar amount of risk. Cox explored the mathematical properties of risk matrices using matrices developed by the Federal Highway Administration, the Federal Aviation Administration, the California Department of Transportation, and a General Accounting Office report on "Combatting Terrorism" (2008). Cox provides a list of conditions required for a risk matrix to produce accurate results. When data is unavailable, subjective ratings are produced by experts and then mapped onto a risk matrix. Cox found that high-quality data, as opposed to subjective estimates, was a minimum requirement for effective matrix use (Cox, 2008). In most cases, a risk matrix will not provide what decision makers are looking for, even under ideal conditions. In all cases, something will be lost, and errors in matrix construction are often invisible without time-consuming evaluation of the individual matrices by mathematics experts. Cox observed that risk matrices suffered from poor resolution, errors, suboptimal resource allocation, and ambiguous inputs and outputs (Cox, 2008). When hazards given single-point risk ratings were mapped on risk matrices, they appeared to have identical risk even when their underlying risk values were highly disparate.
Analysts may provide decision makers with estimates of the likelihood or impact of future events. They may also provide estimates for past events when sufficient historical data is unavailable. Estimates may be produced by processing data from a sufficiently reliable source. For the purposes of this research, a sufficiently reliable source includes any instrument that provides verifiably accurate measurements. Verifiably accurate and precise measurements have been provided by trained experts (Lichtenstein & Fischhoff, 1980). An example of a non-human instrument is a correctly calibrated pH probe. Such an instrument provides verifiably accurate data and so is trusted as providing factual data, but only after it has been calibrated. In the same way, a calibrated human provides trustworthy information; some researchers go so far as to call this elicited information data. Calibration training requires interval estimates instead of single-point estimates. Interval estimates take the form of ranges rather than single-point answers. For example, a questionnaire may ask: In what year was Benjamin Franklin born? The instructions may request a range of years instead of a single number. An expert taking such a questionnaire may be asked to provide a range that they are 90% confident contains the correct answer, e.g., 1650-1750. That range is referred to as their confidence interval (CI) for that particular estimate. Experts are calibrated if the probability that they assign to an event turns out to be true on most occasions over time (Lichtenstein & Fischhoff, 1980). This ability may be improved through what they call calibration training. Training involves a trainer asking a trainee many questions to which the answers are known to the trainer. The trainee is asked to provide answers in the form of interval estimates. The objective of calibration is to get experts into the habit of recognizing their uncertainty when providing estimates. This skill is independent of subject matter and may be learned by most people who take the time (Kynn, 2008). The ranges provided should be sufficiently large to contain the answer but not so large that the expert is not providing useful information. That being said, if the expert provides only wide ranges, they may not have sufficient expertise in the subject matter.
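A minimal sketch of how a trainer might score such an exercise, assuming hypothetical trivia questions with known answers and 90% confidence intervals:

```python
# Each tuple is (low, high, truth) for one 90% confidence interval question.
intervals = [
    (1650, 1750, 1706),   # Benjamin Franklin's birth year
    (2000, 9000, 6650),   # length of the Nile in km
    (50, 80, 92),         # a miss: the truth falls outside the stated range
]

hits = sum(low <= truth <= high for (low, high, truth) in intervals)
hit_rate = hits / len(intervals)

# A well-calibrated expert's hit rate converges toward the stated confidence
# level (here 0.90) over many questions; a lower rate indicates overconfidence.
print(f"hit rate: {hit_rate:.2f} (stated confidence: 0.90)")
```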
Criticisms of calibration. One of the findings in Calibration of Probabilities: The State of the Art to 1980 was that training can improve calibration only to a limited extent (Lichtenstein et al., 1981). Lichtenstein et al. recommended the continued development of technology and bias-reducing methods (Lichtenstein et al., 1981).
Methods that use Bayes' theorem to integrate the subjective opinions of experts with real data are referred to as subjective Bayesian methods in this document. Real data refers to data obtained using instruments that maintain reliable accuracy. In most cases examined by Yang and Berger, opinions of experts were elicited in the form of their prior beliefs (1997). These elicited priors were then used to help form a statistical model: once quantified into a probability distribution, the priors made up the parameters, typically the minimum and maximum possible outcomes, of the statistical model. The resulting distribution, referred to as the "posterior distribution" (Yang & Berger, 1997), can then be used to make probabilistic predictions.
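A minimal sketch of such an update, assuming a hypothetical Beta prior elicited from an expert and binomially distributed incident counts (a common conjugate-pair choice for illustration, not one prescribed by Yang and Berger):

```python
from scipy import stats

# Expert's prior belief about a monthly breach probability, quantified as a
# Beta(2, 8) distribution (mean 0.2, reflecting moderate uncertainty).
prior_a, prior_b = 2, 8

# Observed "real" data: 3 breach-months out of 36 months monitored.
events, trials = 3, 36

# Conjugate update: the posterior is Beta(a + events, b + (trials - events)).
post_a, post_b = prior_a + events, prior_b + (trials - events)
posterior = stats.beta(post_a, post_b)

print(f"posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```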
Criticisms of subjective Bayesian. Yang and Berger experienced difficulty quantifying expert opinion in the form of a prior distribution. Challenges they described included quantifying the informativeness of an expert, justifying the costs of training experts to provide subjective Bayesian estimates, eliciting minimally biased prior distributions, and prioritizing which factors to count and discount based on their perceived importance (1997). Methods for addressing these challenges are proposed by Lele and Allen later in this document.
Communicating conditions in the form of frequency formats is a solution proposed by Gigerenzer et al. in How to Improve Bayesian Reasoning Without Instruction: Frequency Formats (1995). People were more likely to understand what was being asked of them and to communicate their expert estimates effectively when the information was presented in a frequency format. Estimation with Bayesian reasoning using frequency formats was proposed in response to the difficulty people appeared to have when assessing risk using other formats, such as tables of purely numerical information. The difficulty is usually due to varying levels of expertise and comprehension of the mathematical concepts behind the questions. Gigerenzer et al. observed the assessments animals appeared to make when faced with risk: animals seemed to comprehend risks that presented themselves visually or as events in a frequency format. In their research, human subjects provided with frequency formats produced significantly more problem-solving methods that followed the equivalent Bayesian algorithms than those who received standard formats, regardless of education in Bayes' theorem. So long as the format presented was a frequency format, the subjects tested intuitively used Bayes' theorem; the standard probability format produced inverse results. The analogy used was feeding binary-formatted numbers (combinations of 0 and 1) into a modern calculator that only understands the numbers 0 to 9: the calculator is exceptional at calculating values, but if the format of the information is unusual, it cannot calculate accurately, if at all. This way of formatting problems had additional benefits, such as reducing symptoms of the conjunction fallacy (Tversky & Kahneman, 1983), overconfidence bias, and base rate neglect (a product of representativeness). The formats evaluated included the standard probability format, standard frequency format, short probability format, and short frequency format. The standard probability format consisted of presenting probabilities in the form of percentages.
A series of known probabilities was presented to some of the subjects. Using those percentages, the subjects were then asked for the probability of an event or condition based on the probabilities presented. Their answers were also requested in percentage form.
Other subjects were presented the same probabilities but as fraction statements like "10 out of every 1,000" (frequency format) instead of 1% (standard probability format). Their answers were requested in the same fractional statement (frequency) format.
Probabilities presented in the frequency format caused subjects to use implicit mental calculations that resembled Bayesian formulas. The result was more accurate estimates, more often, even from subjects with minimal statistics background (Gigerenzer et al., 1995).
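A sketch of the natural-frequency computation the authors describe, recast here with hypothetical numbers as a security alerting problem; counting cases sidesteps the explicit Bayes' theorem algebra:

```python
population = 10_000          # imagine 10,000 observed events
true_attacks = 100           # 1% base rate -> "100 out of 10,000"
hit_rate = 0.90              # fraction of attacks that trigger an alert
false_alarm_rate = 0.05      # fraction of benign events that still alert

# Natural frequencies: count cases instead of multiplying probabilities.
attacks_alerted = true_attacks * hit_rate                        # 90
benign_alerted = (population - true_attacks) * false_alarm_rate  # 495

# "Of the 585 events that raise an alert, 90 are real attacks."
p_attack_given_alert = attacks_alerted / (attacks_alerted + benign_alerted)
print(f"P(attack | alert) = {p_attack_given_alert:.3f}")  # ~0.154
```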
Criticisms of the frequency format methodology. Although not a direct criticism, Paul Meehl et al. found that most risks are not single-factor problems and so are not consistently calculable by the human mind. Meehl et al. specified that computers should be used whenever possible to combine single-factor probabilities into a risk model that reflects reality more accurately (Meehl et al., 1986).
The opinions of experts are used to provide decision makers with guidance when hard data, the skills required to process the data, or the funds for either are unavailable. Common methods used by decision makers for aggregating the opinions of multiple experts include the following (a brief sketch follows the list):
simple averages
weighting
trimming
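The following sketch shows each heuristic applied to hypothetical point estimates from five experts; the weights are illustrative and would, in practice, come from past accuracy or credibility:

```python
import statistics

estimates = [120, 135, 128, 140, 450]   # one extreme outlier
weights   = [1.0, 1.0, 2.0, 1.0, 0.5]   # illustrative credibility weights

# Simple average: every judgment counts equally.
simple = statistics.mean(estimates)

# Weighting: more credible experts contribute more to the aggregate.
weighted = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Trimming: drop the k most extreme judgments on each end before averaging.
k = 1
trimmed = statistics.mean(sorted(estimates)[k:-k])

print(simple, round(weighted, 1), trimmed)
```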
Yaniv cited Kahneman, Slovic, and Tversky's (1982) research but proposed further research into when heuristics could be an effective tool for decision making (Yaniv, 1997). Yaniv confirmed that decision makers tend to use heuristics, like weighting and trimming, when performing complex tasks, such as aggregating highly disparate opinions. Combining estimates provided by multiple experts improved accuracy in many studies (Ashton & Ashton, 1985; Sniezek & Buckley, 1995; Sorkin et al., 2001; Winkler & Poses, 1993; Yaniv, 1997; Yaniv & Hogarth, 1993; Zarnowitz, 1984).
Yaniv performed two studies, one in which computers were tasked with aggregating expert opinions, and another in which humans were tasked to do the same using weighting, trimming, and a combination of the two. Each of these methods was applied to a variety of sample sizes to identify whether the heuristics provided accurate aggregates as sample size varied. The conclusion of the study was that the heuristics of weighting and trimming, when performed by humans, are justifiable measures that increased the accuracy of estimation, depending on the properties of the estimates being aggregated.
Weighting
Yaniv published research in 2004 that examined how people weight advice received from others. Yaniv's research evaluated the influence of advisors on advisees and the extent to which advice can improve judgment accuracy. Advisees tended to adjust their initial estimates toward the estimates provided by advisors, though only slightly, typically weighting their own opinion higher. Advisees who were knowledgeable in the subject matter adjusted their estimates less, while advisees who recognized their lack of knowledge discounted advice less. The more disparate the advisor's estimate was from the advisee's initial estimate, the less adjustment tended to occur. Even with this conditional discounting of advisor opinion, seeking advice tended to improve accuracy significantly, though suboptimally: advisees tended to put too much confidence in their own knowledge and estimates, discounting advice to the point of decreasing the accuracy of the resulting estimate. Yaniv demonstrated that integrating advice can be optimized to provide more precision. In Yaniv's 2004 study, subjects received bonuses for making accurate judgments. Subjects could adjust their initial estimates after viewing estimates made by other subjects. The estimates of other subjects were provided along with computer-generated estimates known to be inaccurate, to test whether wrong estimates were just as influential. Estimates were provided in the form of ranges instead of single points, with a requested confidence interval (CI) of 95%. Weighting by decision makers was inferred by measuring the variation in estimates before and after advice was given. The two ends of the weight spectrum are 0% discounting and 100% discounting: if the decision maker discounts the advisor's advice completely, that is 100% discounting and is given a weight of 0; in contrast, 0% discounting of advice is given a weight of 1.0. Harvey and Fischer observed estimates by decision makers shifting 20-30% toward advisor estimates, even when the decision makers were aware that their advisor had less training in the subject matter than they did (Harvey & Fischer, 1997). Research by Lim and O'Connor showed that advice in the form of calculated reports, such as those created using statistical measures, was discounted by advisees most of the time; subjects tended to give their initial forecasts twice the weight they gave to statistical models meant to inform their decisions (Lim & O'Connor, 1995). Yaniv evaluated methods for aggregating expert judgments under uncertainty using weighting and trimming. These methods were meant to offset challenges arising from conflicting subjective estimates and varying levels of uncertainty between experts.
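One common way such shifts are quantified in the advice-taking literature is a weight-of-advice ratio; the sketch below uses hypothetical numbers and a conventional formula, not necessarily the exact measure Yaniv used:

```python
def weight_of_advice(initial: float, advice: float, final: float) -> float:
    """0.0 = advice fully discounted; 1.0 = advice fully adopted."""
    if advice == initial:
        return 0.0  # no measurable shift is possible
    return (final - initial) / (advice - initial)

# An advisee moves 25% of the way from their initial estimate to the advice,
# i.e., 75% discounting of the advisor's opinion.
print(weight_of_advice(initial=100, advice=180, final=120))  # 0.25
```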
Trimming
Judgments provided by some experts may be given more weight than others when aggregating multiple estimates. Judgments that are relative outliers may also be trimmed from the aggregate product. Such judgments may be outliers in terms of the expert's unusually high or low level of confidence, or because the judgment is an extreme relative to the average judgment among the experts. The benefit of trimming is that it preserves the central tendency of the judgments and prevents any one expert from influencing the set of judgments excessively. The drawback is that if most of the experts are wrong and the outlying judgments were right, the accurate minority is excluded.
Yaniv found that methods which involved selecting overlapping judgments became ineffective as sample size increased, due to disagreement among judges. Yaniv's research also found that estimates with confidence intervals stated by experts to be at 95% contained the true answer only 43% of the time. To remedy this, Yaniv found that weighting judgments by the inverse of their interval width improved accuracy.
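A minimal sketch of inverse-width weighting, assuming hypothetical interval judgments from three experts:

```python
intervals = [(90, 110), (80, 160), (100, 130)]  # (low, high) per expert

midpoints = [(lo + hi) / 2 for lo, hi in intervals]
weights = [1.0 / (hi - lo) for lo, hi in intervals]  # narrower -> heavier

# Confidence expressed as a narrow interval earns a larger say in the result.
aggregate = sum(w * m for w, m in zip(weights, midpoints)) / sum(weights)
print(round(aggregate, 1))
```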
Weighting and trimming
Weighting and trimming combined provided more accurate judgments than trimming or weighting alone, and also proved more effective at yielding accurate answers than simple averaging. Yaniv's 2004 publication found that simply aggregating one additional person's opinion into an estimate improved accuracy by 20% most of the time. This effect does not require experts more knowledgeable than the decision makers requesting the estimates. The estimates must come from advisors who work independently of each other, although some dependence has still produced useful data (Johnson, Budescu, & Wallsten, 2001).
Lele and Allen found that subjective Bayesian methods provided excessively wide confidence intervals. They proposed eliciting data instead of priors to combine subjective expert estimates with hard data in order to increase the precision of statistical analyses (Lele & Allen, 2006). Their method also involved distinguishing between more and less useful experts and weighting their estimates accordingly: useful experts are those whose estimates provide information beyond the available hard data rather than estimates that stray from the truth. This is opposed to rating expert informativeness on traditional factors like the expert's experience, fame, or other qualitative characteristics.
Lele and Allen attempted to remedy some of the challenges that came with the logistic regression approach to probability (2006). This was in response to Fleishman et al.'s observations on the difficulty of estimating parameters in logistic regression when the number of important covariates is large compared to the number of observations (Fleishman et al., 2001). Where the standard error was excessively large, as was the case with rare events, logistic regression provided little value (Lele & Allen, 2006). Lele and Allen's study attempted to use expert knowledge to increase the accuracy of logistic regression, which in turn would provide more accurate predictions.
In Lele and Allen's study, an expert was considered useful if they could improve accuracy, provide information that decreased standard error, and narrow the width of the confidence intervals previously established by analysts on data alone. The example problem used to test their hypothesis was to estimate a logistic regression model relating the probability of species habitation to habitat covariates at select locations (Lele & Allen, 2006). They were specifically attempting to predict the presence of the masked shrew, believed to be a good indicator of certain environmental factors of interest in their research. The subjective information they collected in frequency form was used to supplement the logistic-regression-derived prediction.
Lele and Allen found that elicitation of priors was difficult because the statistical format was not intuitive to most scientists (2006). Although communicating the statistical format to scientists was possible, it was not often practical. Their research concluded that it was easier to elicit expert estimates in the form of the probability of events occurring than as prior distributions on the parameters of the statistical model.
Lele and Allen also evaluated the effectiveness of measuring the differences in informativeness between multiple experts, where multiple experts are employed for the elicitation. The statistical approach relies on hierarchical models commonly used to measure model error. Lele and Allen provide a mathematically precise formula and instruction on how to quantify the informativeness of experts (2006). They also measure the value of combining multiple expert opinions.
After these methods and functions had been established, Lele and Allen tested them on species occurrence data observed and elicited from experts (2006). They found that experts were not equally useful, that an expert's usefulness varied with the species in question, and that a well-calibrated but non-useful expert did not negatively affect the overall analysis. Eliciting information on the observable scale, and using their regression model to relate the elicited data to the observed data, allowed calibration of the expert's opinion in the form of elicited data.
Criticisms of aggregating estimates from multiple experts. Another conclusion of Yaniv's studies was that the number of judgments required for accurate answers was fewer than suspected. Increasing the number of experts providing estimates was minimally beneficial beyond two, and up to eight, experts. This finding that few opinions are required for accurate estimates, and that more subjective data is not necessarily valuable, was also observed by several other researchers (Ashton & Ashton, 1985; Hogarth, 1978).
Monte Carlo simulation modeling is a collection of tools for simulation using computational algorithms and repeated random sampling (Metropolis & Ulam, 1949). In a Monte Carlo simulation, a set of sample data is created using random numbers generated within the constraints of user-specified parameters. Analysts generate random values that fall within the range (interval estimate) provided by the expert. The random numbers generated within that range may be characterized further by having the random number generator produce only values that follow a specified frequency distribution. Monte Carlo simulation is widely regarded as a valid technique, and the mathematics required to create a simulation can be very simple (Vose, 2008). Monte Carlo simulations are considered an upgrade from methods like what-if analyses: what-if analyses produce wider ranges limited exclusively to probabilities or to impacts, whereas Monte Carlo simulations can take both variables into account.
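A minimal Monte Carlo sketch combining both variables, assuming a hypothetical expert-estimated incident frequency and a lognormal per-incident loss whose parameters were chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # iterations; more iterations give more stable tail estimates

# Annual number of incidents: Poisson with a (hypothetical) expert mean of 3.
incident_counts = rng.poisson(lam=3.0, size=N)

# Loss per incident: lognormal roughly spanning about $7k-$350k (90% range);
# the parameters are illustrative, not elicited from any real expert.
annual_losses = np.array([
    rng.lognormal(mean=10.8, sigma=1.2, size=n).sum() for n in incident_counts
])

print(f"mean annual loss: ${annual_losses.mean():,.0f}")
print(f"95th percentile:  ${np.percentile(annual_losses, 95):,.0f}")
```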
Many probability distributions may be used in Monte Carlo simulations. Common distributions that may be familiar to readers include the Triangle, Uniform, Beta, Cumulative Ascending, and Poisson distributions. Distributions can also be defined manually in any number of ways. Common methods for generating samples used in Monte Carlo simulation are Latin Hypercube Sampling and simple random number generation; the former is employed to reduce the computational resources required to produce sufficiently random samples (EPA, 1997, p. 7).
Selecting a distribution to use for a Monte Carlo simulation requires elementary statistical methods, the use of distribution-fitting software, or prior knowledge of how the sample domain being simulated is distributed. When data is available, distribution-fitting software provides the name of, and formula for, the distribution that best fits the sample data entered into it. The data entered into the software may be sample data derived from events that sufficiently characterize the kind of events being estimated by the expert. The quality of the data will depend on factors including, but not limited to, collection methods, bias of the organization providing the data, and the scope of the data available. Vose prescribes the use of maximum likelihood estimators and optimization of goodness-of-fit, in addition to analyzing the properties of the data being used. Vose divides the techniques available for determining a sample-derived distribution into first-order parametric, second-order parametric, and nonparametric. There are also systematic and non-systematic errors, sample size, and sample dispersion to consider. Fitting distributions to data is not a new science and can be learned and practiced with an elementary statistical education or the use of best-fit tools like those listed previously (Vose, 2008).
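A brief sketch of this workflow, fitting by maximum likelihood and checking goodness of fit; the sample below is synthetic stand-in data, since no real loss data accompanies this discussion:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=10.0, sigma=1.0, size=500)  # stand-in data

# Maximum likelihood fit of a lognormal distribution to the sample.
shape, loc, scale = stats.lognorm.fit(sample, floc=0)

# Kolmogorov-Smirnov test of the fitted distribution against the sample.
stat, p_value = stats.kstest(sample, "lognorm", args=(shape, loc, scale))
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3f}")
```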
The Risk Assessment Forum of the U.S. Environmental Protection Agency (EPA) published the Guiding Principles for Monte Carlo Analysis in March 1997. The document discussed the objectives, challenges, and potential value of Monte Carlo analysis when applied to EPA efforts. The document states that the guidance provided serves only as a minimum set of principles and that innovative methods are encouraged where scientifically defensible (EPA, 1997, p. 3). Other EPA documents have emphasized the importance of using probabilistic techniques in risk assessments for adequately characterizing variability and uncertainty. These included the:
1986 Risk Assessment Guidelines
1992 Risk Assessment Council (RAC) Guidance (the Habicht memorandum)
1992 Exposure Assessment Guidelines
1995 Policy for Risk Characterization (the Browner memorandum) (EPA, 1997, p. 1)
Introduction. The basic goal of Monte Carlo analysis in the guidance is to quantitatively characterize the uncertainty and variability in estimates of exposure or risk, to identify key sources of variability and uncertainty, and to quantify their impact on risk model output. The EPA explicitly states in this section that the guidance is not intended to provide technical guidance on conducting or evaluating variability and uncertainty analysis.
Limit of review for EPA guidance. Instead, the document is meant to provide a discussion of principles of good practice for Monte Carlo simulation as applied to environmental assessments (EPA, 1997, p. 11). The report also describes the benefits of Monte Carlo simulation and outlines methods for adequately communicating the often unfamiliar concepts of variability and uncertainty in terms of risk. The guidance included a set of standard conditions for satisfactory probabilistic risk assessment methods, based on principles relating to "good scientific practices of clarity, consistency, transparency, reproducibility, and the use of sound methods" (EPA, 1997, p. 1). Each of these points is discussed in more detail in the 1997 report. The relevant principles may be summarized as:
Purpose and scope clearly articulated, including a full discussion of highly exposed or susceptible subpopulations.
Methods of analysis clearly documented, data representativeness clearly defined, all models and software documented, and information about the whole analysis sufficient for independent parties to reproduce the research.
Sensitivity analysis discussed, and probabilistic techniques applied to compounds, pathways, and factors of importance.
Correlations or dependencies between input variables incorporated into analysis and clear descriptions mapping their effects on the output distribution.
Input and output distribution information documented including tabular and graphical representations, locations of any point estimates of interest, rationale behind distribution selection, and variability and uncertainty differentiated.
Numerical stability of central tendency and higher end of output distribution are discussed.
Exposures and risks using deterministic methods are documented, allowing comparison to probabilistic analysis and other historical assessments. Similarities and differences between probabilistic and other methods in terms of data, assumptions, and models documented.
Fixed assumption metrics are documented and output distributions are aligned to them.
Determining the value of a quantitative variability and uncertainty analysis. Risk assessors, managers, and other stakeholders may establish whether a Monte Carlo simulation is necessary by considering:
Whether or not a quantitative analysis of uncertainty and variability will improve the risk assessment
What the major sources of variability and uncertainty are
Whether or not variability and uncertainty will be kept separate in the analysis
Whether or not there are sufficient resources to complete the analysis
Whether or not the project warrants the level of effort required
Whether or not a quantitative estimate of uncertainty will improve decision making
How the regulatory decision may be affected by the variability and uncertainty output
What skills and experience are necessary to perform the analysis
What the strengths and weaknesses of the analysis methods considered are
How variability and uncertainty analysis itself will be communicated to stakeholders and other interested parties
What the EPA calls preliminary "screening calculations" may show that a quantitative characterization of variability and uncertainty is unnecessary. Such a calculation may show that:
risk exposure is clearly below the concern of decision makers, or
the cost to remedy the potential risk is sufficiently low that spending additional resources on analysis is deemed unnecessary.
In contrast, such preliminary screenings may provide sufficient reason to perform a quantitative characterization of variability and uncertainty:
The screening may yield point estimates above decision makers' risk appetite
There may be indications of bias in the expert estimates or data
The cost of remediation may be very high while exposure is seen as marginal
The potential impact of risk events is too high not to plan for
Defining the assessment questions. Begin an exposure assessment by clearly defining the purpose and scope of the assessment. The simplicity of the assessment should be balanced against including all important risk exposures. The sophistication of the analysis should increase only if doing so will increase value.
Selection and Development of the Conceptual and Mathematical Models. Selection criteria should be established for each assessment question. Criteria may consider the varying exposure of populations examined, significant assumptions, uncertainties, and the degree of variation in output if alternative models were used.
Selection and Evaluation of Available Data. Evaluate data quality and representativeness of the data to the population being examined.
Selecting Input Data and Distributions for Use in Monte Carlo Analysis. A preliminary analysis should be performed to determine the model structure, exposure pathways, model input assumptions, and the parameters most influential to the assessment's output, variability, and uncertainty. This information helps prevent spending resources on collecting data or performing analysis on unimportant parameters, as well as identifying dependencies and correlations between models. Identifying where correlations and dependencies exist informs the analyst that whatever models they choose must be compatible with each other. Preliminary analysis methods include what-if scenarios, numerical experiments, and systematic sensitivity studies.
Correlations and dependencies must be documented, along with any parameters or pathways excluded from the analysis and the reasons why they were excluded. Distribution width should be commensurate with the available knowledge and certainty. The EPA emphasizes the importance of not employing probabilistic assessment on insignificant pathways or parameters, since the process may be a significant undertaking and costly to perform. If distribution shapes change over time, the reason for the change and the rationale for the new shape should be documented. Selecting input distributions should involve the qualitative and quantitative information available, but both should undergo the same scrutiny. Considerations when evaluating the quality of information include:
Availability of a mechanistic basis for choosing the distribution family.
What mechanisms dictate the shape of the distribution.
Whether the variable is discrete or continuous.
Bounds of the variable.
Skew or symmetry of the distribution.
A qualitative estimate of the degree and direction of the distribution's skew.
Any other known factors affecting the shape of the distribution.
When data representing the true distribution of the population being assessed is unavailable, surrogate data may be used that justifiably resembles the expected distribution shape, but the rationale must be defensible and documented.
The EPA guidance goes on to describe environment-specific examples of scenario analysis and of addressing the quality of information at the tails of distributions. The guidance emphasizes the importance of highlighting and differentiating the distributions and data provided by expert judgment from real sample data. The appreciable effect of each on the outcome of the analysis should also be clearly communicated; in other words, the degree to which the assessment is based on real data versus expert estimates.
Evaluating Variability and Uncertainty. The EPA guidance recommends establishing formal approaches for distinguishing between and evaluating variability and uncertainty in the end report. The issues that should be considered include:
That variability depends on averaging time, averaging space, and other dimensions in which the data are aggregated.
That standard data analysis typically understates uncertainty in the form of human error while overstating variability in the form of measurement error.
That model error may represent a significant source of uncertainty.
That accuracy of variability is significantly dependent on the representativeness of the data.
Numerical stability of the moments and tails of distributions is important in evaluating variability and uncertainty. Numerical stability can be assessed by watching for changes in the mean, variance, or percentiles of a Monte Carlo simulation's output as the number of iterations increases. Some models require more iterations than others to stabilize, so the iteration count should be large enough to ensure stabilization has been reached.
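A sketch of one way to check this, watching the running mean and a tail percentile as the iteration count grows (the heavy-tailed output below is synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
draws = rng.lognormal(mean=10.0, sigma=1.5, size=200_000)

for n in (1_000, 10_000, 100_000, 200_000):
    subset = draws[:n]
    print(f"n={n:>7}: mean={subset.mean():,.0f}  "
          f"p95={np.percentile(subset, 95):,.0f}")
# When successive rows stop changing appreciably, the statistics of interest
# have stabilized; heavy-tailed outputs typically need more iterations.
```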
Areas of uncertainty should be included in the analysis either quantitatively or qualitatively; some level of uncertainty is unavoidable. When quantitative measures are unavailable, the EPA recommends relative ranking of the sources of uncertainty, or the use of Bayesian methods that correct for subjective estimates, so long as the analysis distinguishes between variability and uncertainty.
Presenting the Results of a Monte Carlo Analysis. Throughout the guidance there is an emphasis on clearly defining the limitations of the methods and results of Monte Carlo analysis. This includes previously described efforts like detailing the reasons why a particular input distribution was selected, any significant variability and/or uncertainty, and goodness-of-fit statistics on the model shape. In addition to thorough documentation, the EPA recommends visuals such as graphs and charts. Visuals that typically provide value are graphs of the probability density function (PDF) and cumulative distribution function (CDF). The PDF graph communicates the relative probability of values, the most likely values, the distribution shape, and any small changes in probability density. The CDF graph communicates fractiles (like the median), probability intervals (like confidence intervals), stochastic dominance, and mixed, continuous, and discrete distributions.
The EPA describes the importance of evaluating the hypothesis that a set of sampled observations was drawn independently from the chosen distribution. Methods of evaluation include goodness-of-fit tests like the chi-square, Kolmogorov-Smirnov, and Anderson-Darling tests, and, for normality and lognormality, Lilliefors', Shapiro-Wilks', or D'Agostino's tests. Alternative methods of testing should also be employed due to problems with the effectiveness of these tests at certain sample sizes. Assessing fit based on graphical comparison of the experimental data and the fitted distribution with probability-probability (P-P) or quantile-quantile (Q-Q) plots are two such methods. The guidance specifies that these methods are effective at ruling out poor fits but cannot confirm a perfect fit.
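A sketch of two of the named checks, an Anderson-Darling test and a Q-Q plot, applied to synthetic stand-in data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=300)  # stand-in observations

# Anderson-Darling test against the normal family: compare the statistic
# to the critical values at each significance level.
result = stats.anderson(sample, dist="norm")
print(result.statistic, result.critical_values)

# Q-Q plot: points near the diagonal suggest (but cannot confirm) a good fit.
stats.probplot(sample, dist="norm", plot=plt)
plt.savefig("qq_plot.png")
```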
Scales should be identical wherever possible in the report to prevent inaccurate or deceptive communication of data, and each graph should be accompanied by a summary table of the relevant data. The guidance goes on to describe optimal formatting and style of graphs, as well as limitations on how much data to communicate in a single graph.
Monte Carlo simulation is criticized as being too approximate, though this can be remedied by increasing the number of iterations that the model simulates (Vose, 2008).
Another criticism of Monte Carlo simulation pertains to the common use of spreadsheets to perform it. Spreadsheet software works for simple Monte Carlo simulations but becomes problematic for more complex assessments (Vose, 2008). In Proceedings of the 13th Hawaii International Conference on Systems Sciences, T.S.H. Teo et al. presented data showing that spreadsheet errors were beyond what was expected by most organizations (1997). In Hitting the Wall: Errors in Developing and Code Inspecting a 'Simple' Spreadsheet Model, R.R. Panko et al. found software code errors in popular spreadsheet applications that substantially impacted outcomes. Further, the errors were not immediately evident to the user of the spreadsheet software and occurred primarily when handling large amounts of data (Decision Support Systems, 22(4), April 1998, 337-353).
Research by Kahneman and Tversky suggested that humans performed poorly at providing their expertise due to heuristics and the biases they caused. Corrective methods have been successful in fields including ecology, finance, sociology, and psychology, but there is minimal guidance or research available on how to reduce the impact of expert bias on information technology risk assessments.
Researchers sometimes count the number of peer-reviewed articles available on a topic in order to identify which methods deserve focus. Lopes demonstrated in The Rhetoric of Irrationality that a disproportionate number of articles assume the conclusions of Kahneman and Tversky's 1972 article on judgement and decision making. Though their works have merit and are widely cited, this document does not assume their claim that "humans are poor estimators." This work instead remains open to Lopes' conclusion that humans can estimate poorly under certain conditions, and it considers the complementary findings by Gigerenzer et al. showing that humans can be observably Bayesian if questions are presented to them in a particular way, resulting in accurate estimations of future outcomes.
The U.S. Environmental Protection Agency's Monte Carlo Analysis Guidance was included as a source because it serves as a good example of the kind of guidance missing for other expert knowledge elicitation methods. The format of the guidance is directed at environmental work but can be readily applied to other fields such as information security.
Many different methods for eliciting expert knowledge have been examined across many fields, but none were found that applied to the information technology industry or informatics research. In contrast, there were hundreds of studies on such methods as used in other fields, e.g., ecology. Studies on methods from other fields were evaluated for their effectiveness in that particular field, e.g., calibration of experts for sub-species population predictions, but also critically evaluated on the logical soundness of the method itself.
The list of expert knowledge elicitation methods reviewed here is, of course, incomplete. Determining whether these methods could be effective in information security may require collecting data from real organizations practicing them.
The terms and definitions used across the fields of work studied varied. In order to maintain cohesion, a common set of terms was used in this meta-study. Though the Vose, Kouns et al., and Segal citations in this work are not from peer-reviewed research journals, the terms, definitions, and categorization described by these authors were used to establish the common language of this paper. Their terms and definitions for the subjects of risk management and statistics were compatible with the varying glossaries of the peer-reviewed articles included. In cases where no single term was satisfactory, each term was listed out.
Kahneman et al. observed that humans simplify their environment in order to solve problems more quickly. The same simplification, when applied to complex situations, causes humans to provide inaccurate estimates. Lopes interpreted the research differently, finding that humans provide accurate estimates in complex situations given the right conditions. Gigerenzer et al. also found this to be the case, most often when humans were given information in a format that resembles natural information acquisition, as in animal foraging and neural network learning: a frequency statement, as opposed to a percentage or fraction, to communicate the anticipated probability or likelihood of an event (1995).
Single-point subjective ordinal scoring methods appear to be the industry standard in information security. Hubbard et al. found that verifying the effectiveness of these methods is most often not practical or possible. They found that how verbal subjective ordinal scores like high, medium, and low were defined by individuals varied widely with little consistency. Variability in definitions was observed between experts as well as within individual experts over time, even when they were asked within the same 24-hour period.
Single-point subjective ordinal scores presented in the format of risk matrices and heatmaps were found to be not only wholly inaccurate but deceptive in their implied soundness. Cox’s mathematical evaluation of such matrices for uses ranging from highway maintenance to combating terrorism revealed numerous flaws. As it turns out, accurately communicating risk in matrix or heatmap form requires a per-case rigorous mathematical evaluation, high quality non-subjective data, and the resources to do so in a reasonable period of time.
Yaniv found that decision makers intuitively weight estimates from experts and improve overall estimate accuracy, but do not do so optimally. The methods proposed for optimizing the process may not be applicable since, as Yaniv explicitly states in the study's limitations, qualitative advice and opinions were not evaluated in the study; only quantitative factual estimates, like the dates of events, were elicited. Answers in the form of ranges with confidence intervals follow the format used in calibration training and ensure that uncertainty is communicated along with the estimate value. Since the format is a range, the expert is only obligated to provide a range wide enough to contain the estimated value and to communicate their uncertainty with the width of that range. Ranges can also be used as the parameters of Monte Carlo simulation models created with the elicited information, allowing further processing and correction while preserving the uncertainty expressed by the experts.
Gigerenzer et al. suggested that questions be formatted in a frequency format (1995), such as fractional statements instead of percentages, and also suggested that the subject's responses be provided in a frequency format so as to be communicated with minimal error.
Lichtenstein et al. found that overconfidence was a markedly prevalent bias across studies. As one might expect, the degree of overconfidence increased as the relative difficulty of the task increased. Fortunately, they found that calibration training could be used to increase a person's ability to provide accurate estimates. The reduction in bias was also measurable and repeatable. Kynn found that most people can be calibrated given the time, and that a person's calibration, i.e., performance in providing accurate estimates, carries over to estimates provided for content outside of the calibration training, such as the person's field of work. Lichtenstein et al. found that such calibration could only improve accuracy to an extent and suggested the use of corrective technologies in addition to calibration of experts.
Yaniv describes corrective technologies as facilitating the use of statistical methods like regression to integrate data with expert estimates. Regression modeling may be used to determine how much an expert's opinion needs to be adjusted based on their measurable bias. Measures of bias may be obtained with methods like calibration training or by reviewing the accuracy of an expert's predictions over time. In this way, the data about the expert is integrated with the estimates they provide, and data about the subject matter itself can further be used to gauge potential model performance.
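A minimal sketch of this idea follows, with an entirely hypothetical track record: realized outcomes are regressed on an expert's past estimates, and the fitted line is then used to debias new estimates. This is an illustration of the approach, not Yaniv's own procedure.

```python
# Debiasing an expert's estimates with a simple least-squares regression.
import numpy as np

# Hypothetical history: the expert's stated probabilities vs. what occurred.
estimates = np.array([0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.9])
outcomes  = np.array([0.7, 0.6, 0.70, 0.6, 0.75, 0.5, 0.60, 0.8])

# Fit: outcome = slope * estimate + intercept.
slope, intercept = np.polyfit(estimates, outcomes, deg=1)

def debias(new_estimate: float) -> float:
    """Adjust a new estimate using the expert's measured bias."""
    return float(np.clip(slope * new_estimate + intercept, 0.0, 1.0))

print(debias(0.85))  # the expert's 85% claim, corrected for overconfidence
```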
Aggregating the opinions of experts improved the accuracy of estimates in most trials across multiple studies. Yaniv found that when weighting and trimming were used by humans to aggregate expert opinions, accuracy improved in most cases. Like the benefits of calibration, the benefit of additional expert input had a point of diminishing returns. Yaniv demonstrated the use of a mathematical formula to measure the value of combining multiple expert opinions and find the optimal number required for peak value.
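One simple aggregation rule in this spirit, sketched below with hypothetical expert values, is the trimmed mean: the most extreme opinions are discarded and the remainder averaged. This illustrates trimming generally; it is not the specific formula Yaniv demonstrated.

```python
# Trimmed-mean aggregation of several experts' probability estimates.
from scipy import stats

expert_probs = [0.10, 0.25, 0.30, 0.35, 0.90]  # five experts, one outlier

# Trim 20% from each tail (here, the single lowest and highest estimates).
aggregate = stats.trim_mean(expert_probs, proportiontocut=0.2)
print(f"Trimmed-mean aggregate: {aggregate:.2f}")  # 0.30
```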
Lele et al. found elicitation of priors from experts difficult and impractical. The effective alternative they discovered was asking experts for the probability that events may occur, instead of for prior distributions representing the event occurring or not occurring.
Like Lele et al., Yang et al. experienced difficulty quantifying expert opinion in the form of prior distributions. They also found it difficult to measure the informativeness of an expert and to justify the costs of training experts to provide Bayesian estimates.
Lele and Allen observed that using regression, real data could be integrated with expert opinion and that the performance of the expert could contribute to the weight of their subsequent estimates.
Data should be used whenever possible before eliciting subjective estimates. Computer processing can reduce the bias of human estimates by factoring in more information simultaneously than the human mind can unaided, or by performing the equivalent probability calculus that would otherwise be done by hand. Monte Carlo simulation factors variability and uncertainty into an estimate where methods like what-if analysis and unassisted human reasoning, i.e., heuristics, cannot. Certain risk assessments, even when empirical data are used, exclude these factors and may produce inaccurate results, or results so imprecise that they do not provide value or inform decision making.
Kuhnert et al. described how to elicit information from experts and incorporate it into Bayesian models: using data wherever available, establishing whether eliciting such estimates would provide sufficient value to justify the resources, identifying how to measure the expert's uncertainty, clearly presenting the question to the expert, using graphical aids and discussion with the expert instead of one-way questionnaires, and assessing the impact of Bayesian priors. Kuhnert does not discuss the consideration of variability alongside uncertainty as Vose proposes.
Martin et al. presented methods for ensuring that uncertainty is adequately communicated by experts. The elicitation involved single-point scoring and the use of computers where possible. They emphasize the importance of treating subjective elicitations as snapshots of the truth and encourage comparing elicitations to available data wherever possible. They also recommend the use of multiple expert judgments combined mathematically with linear opinion pooling methods. They used weighted averages as data and addressed strong variations in the data with more complex mathematical methods.
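A minimal linear opinion pool might look like the sketch below; the expert estimates and the calibration-derived weights are hypothetical, and the weighting scheme is one plausible choice rather than the exact method Martin et al. used.

```python
# Linear opinion pool: a weighted average of expert probability judgments.
import numpy as np

expert_estimates = np.array([0.20, 0.35, 0.50])  # P(event) from three experts
calibration_weights = np.array([0.5, 0.3, 0.2])  # hypothetical; must sum to 1

pooled = float(np.dot(calibration_weights, expert_estimates))
print(f"Pooled probability: {pooled:.3f}")  # 0.5*0.20 + 0.3*0.35 + 0.2*0.50 = 0.305
```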
Tara et al. cite both Kynn's and McBride's works and add research by O'Hagan et al., who confirm the calibration literature's finding that experts systematically underestimate their uncertainty (2006). They also acknowledge findings made 40 years after Kahneman et al.'s works (Kahneman et al. 1972) that some studies found people with minimal bias even without calibration training. It may be the case that these studies satisfied Gigerenzer et al.'s frequency format (1995), or that some professionals are well calibrated without training. Using the calibration exercises described by Kynn, the natural calibration of a professional can be tested.
McBride et al. experimented with a structured approach to preventing bias in expert elicitation using email. They found that emailed questionnaires were effective even for contentious issues (2012). The downside of the method was that it took months, as opposed to days for traditional methods, to collect all of the experts' responses.
Peer reviewed publications specifically discussing expert elicitation methods applied to Information Security Risk Assessment were not available.
Judgement and Decision Making Psychology literature showed the fallibility of human experts when providing estimates. Information Security decision makers use expert estimates to drive security efforts, yet minimal guidance is available for eliciting knowledge from experts in the field of information security in a way that corrects for this human error. Eliciting knowledge from experts with optimal accuracy and precision has been researched in other fields of work. These methods appear independent of subject matter and may be applied to other fields, including Information Security / Cybersecurity. How might we use these methods in the office? Example applications of the expert knowledge elicitation methods discussed are outlined below. These methods can inform decision makers for long-term, high-level management decisions, but also low-level day-to-day decisions such as which events and alerts SOC and CIRT analysts should investigate first. Such prioritization is key to optimal cybersecurity decision-making and operations where there is both uncertainty and limited resources.
To briefly review the findings of this study: questions and available information can be formatted in a way that ensures clarity and comprehension by experts. Responses elicited from experts, if requested in the form of dollars and percentages, can effectively avoid human heuristic tendencies and their resulting biases. Experts who undergo calibration training can provide estimates minimally influenced by their overconfidence or underconfidence. Combining the opinions of multiple experts can improve both the accuracy and precision of estimates. Integrating available data with expert estimates can improve accuracy and precision further. Simulation models can decrease bias and take into account the uncertainty expressed by experts, the irreducible uncertainty of the threat environment (variability), highly complex scenarios, available data, and data as it later becomes available or is learned from observations over time in an environment. Although these methods can be used in a variety of ways, a few example application scenarios are listed below.
The success of frequency formats observed by Gigerenzer et al. suggests that the use of visual aids such as scatter plots and time-series graphs, which present data in a frequency format, should improve expert comprehension (1995). Lopes' studies showed that presenting questions and data in the form of dollars and percentages should also increase comprehension of information for expert processing (1976). Presenting data in the frequency format takes advantage of a human's natural ability to estimate the likelihood and impact of events, further increasing accuracy and precision. It may be beneficial to present information in a frequency format whenever possible and practical in order to help reduce bias, along with the other measures discussed in this study.
Cybersecurity Decision Making. Converting available data to frequency formats may help improve the accuracy of estimates provided by experts. This can be done by replacing percentages with frequency statements; that is, 1% becomes 10 out of every 1,000. Visuals that present probability in a frequency format, such as a pie chart showing the fraction equivalent of the percentage, may also help experts understand the relative probability of the data they are viewing.
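A trivial helper, included only to make the conversion concrete (the function name and default population are my own), is shown below.

```python
# Express a percentage as a frequency statement ("X out of every N").
def to_frequency(percentage: float, population: int = 1000) -> str:
    count = percentage / 100 * population
    return f"{count:g} out of every {population:,}"

print(to_frequency(1))    # "10 out of every 1,000"
print(to_frequency(0.5))  # "5 out of every 1,000"
```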
Cybersecurity Operations. The presentation of events with probability ratings may be best performed with a frequency format rather than a standard percentage. How SOC analysts can use probability values is explained in the Monte Carlo section below. Visuals that present probability in a frequency format, such as a pie chart showing the proportional equivalent of the percentage may also help analysts understand the relative probability of the events they are viewing. Such a visual may be present in a SOC visual dashboard.
Expert responses in the form of interval estimates avoid the many observed flaws of single-point scoring. Experts asked to provide a value for the probability of a particular event can provide a range instead of a single percentage, e.g. less than 50%, or greater than 80%. Impact values could be given in ranges such as $1-2 million.
By providing ranges, experts are not pressured to provide unrealistically precise responses, which may create bias. Ranges communicate the estimated value but also the expert's uncertainty, a factor wholly absent from the subjective scoring methodologies that make up most industry-standard methods. Monte Carlo simulation models use high and low bounds, like the interval estimates experts would provide. Finally, ranges allow decision makers to gauge the value of their experts by keeping a record of each expert's precision, observing how narrow or wide the provided ranges are. Since Gigerenzer et al.'s research showed the benefits of frequency formats (1995), experts may provide their estimates in the frequency format instead of standard percentage form.
Cybersecurity Decision Making. Risk management analysts or managers looking to their experts for advice may present them with a written or verbal questionnaire. The manager may be asking internal or contracted experts about a recent cyber threat that appeared in a threat intelligence feed. The risk analyst may be performing a routine assessment of the risk posed to company equipment. In either case, the experts could be asked to provide their answers in the form of ranges. The estimated probability of an event can be presented as a percentage range, and the estimated impact as a low dollar amount and a high dollar amount. The expert could also be asked for frequency distribution information or sources of data. An example risk event could be a DDoS attack on the company website. The conversation that follows should involve specifying exactly what the questioner is looking for. After the manager or analyst and the expert(s) put all assumptions on the table and define the terms and intended use of the requested estimate, they may all agree that what they are wondering is:
How much money do we lose per hour that the site is down?
How much money would we lose if the fact a DDoS attack on us was successful goes public?
How much money would we lose if the DDoS attack weakened the web server and allowed data to be exfiltrated from it?
How much money would we lose if the DDoS attack made our website vulnerable and allowed attackers to take control of it, post content, and cause reputational damage?
The manager or analyst is likely asking these questions because they want to know how much risk they are taking. They may also be wondering whether the cost of protective controls would yield a net benefit and, if so, which controls would be optimal and not cost more than the negative event itself. The expert may then provide probability and impact estimates in the form of ranges for each of these questions.
“But we cannot think of everything! So what’s the point?”
The number of potential scenarios that may result from any event is rarely finite. With or without risk assessment, and regardless of assessment style, there will always be possible outcomes not considered or assessed. By performing these assessments regularly and pulling key risks from assessments performed by other organizations, a risk assessment model grows and, if done correctly, reduces uncertainty. The alternative is creating deceptive models that provide the illusion of visibility (see subjective single-point scoring and heat maps) or not measuring risk at all. By not measuring risk at all, you lose visibility into whether or not your decisions were good ones. You would not know that there were near misses during the year, or that your successes were more chance than skill or effort.
Cybersecurity Operations. Cybersecurity analysts, such as those who work in a CIRT or SOC, can provide the conclusions of their investigations in the form of ranges. The exception would be any work that requires analysts to suspend their judgement and simply record observations, as may be the case for a forensic analyst or low-level SOC analyst; in those cases their observations would be passed on to someone who draws a conclusion based on them. The structure varies depending on how an organization handles cybersecurity, but the use of ranges instead of single-point estimates captures the benefits documented in the literature on estimates. Since resources are finite and network events are often overwhelming, prioritizing those events for CIRT analysts by their probability and impact can help keep higher-risk events at the top of the queue for investigation.
But what if the alerts are wrong? Then we ignore a crucial event!
Consider how this approach compares to the method an organization is currently using. If you discover that the alerting mechanism's risk ratings are not appropriate, improve the mechanism. The alternative would be a human rating the probability and impact of every event, based on, say, metadata, before performing a full investigation. The only difference is that the automated process would be less error-prone, especially over time and given the numerous events assessed. Cursory prioritization like this is best performed by a computer. The approach is meant to be updated on a regular basis in response to environmental changes. The Cybersecurity Decision Making level risk assessment can inform this automated mechanism, leaving prioritization of risks up to management. The organization will still get through all of the risks in the queue; it is just increasing the likelihood that the events with the highest risk are investigated first.
The ranges used by the manual SOC analyst or automated system may look something like this (if made tabular for presentation):
The table above shows two different events and how the SOC analyst or automated system may adjust probability based on certain factors. The impact would, of course, remain the same throughout. The impact range would be generated by decision makers a priori, in the same way shown in the previous section of this document.
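In code, the prioritization such a table supports might look like the sketch below; the events, ranges, and the midpoint-based expected-loss rule are all hypothetical illustrations rather than a prescribed algorithm.

```python
# Sorting a SOC event queue by expected loss computed from probability and
# impact ranges. All events and values below are hypothetical.
from dataclasses import dataclass

@dataclass
class QueuedEvent:
    name: str
    p_low: float       # probability range (as fractions)
    p_high: float
    impact_low: float  # impact range (dollars)
    impact_high: float

    def expected_loss(self) -> float:
        p_mid = (self.p_low + self.p_high) / 2
        impact_mid = (self.impact_low + self.impact_high) / 2
        return p_mid * impact_mid

queue = [
    QueuedEvent("Odd outbound DNS volume", 0.05, 0.15, 10_000, 50_000),
    QueuedEvent("Possible DDoS ramp-up", 0.20, 0.40, 500_000, 2_000_000),
    QueuedEvent("Single failed VPN login", 0.01, 0.05, 1_000, 10_000),
]

# Highest expected loss first: the analyst works the queue top-down.
for event in sorted(queue, key=QueuedEvent.expected_loss, reverse=True):
    print(f"${event.expected_loss():>12,.0f}  {event.name}")
```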
Calibrating experts allows decision makers to quantify the bias of their experts and monitor their progress in reducing it through calibration training. The ranges provided by calibrated experts are more likely to contain the correct probability and impact values, increasing estimate accuracy.
Trusting the opinion of experts can be risky for decision makers, especially when those experts are from outside of the organization, such as would be the case with consultants.
Calibration training provides a kind of grading system that shows an expert's performance in providing estimates of all kinds, regardless of content. This ensures that decision makers and mathematical models weight the expert's advice correctly and in a repeatable way.
Multiple experts can be used to increase accuracy in a repeatable and verifiable fashion. Using regression models the calibration grades of experts can be used to weight their advice respective to their observed bias. Additionally, regression models can be used to determine when additional expert opinions will not provide value commensurate with the cost of hiring them or taking them away from their existing cycles.
To take advantage of a person's natural Bayesian tendencies, calibration questions and responses could take on the frequency format discussed previously. For calculating the performance of the expert via standard grading (percent correct vs. incorrect), frequency formats could be converted to standard percentages.
Cybersecurity Decision Making. Experts who will be providing estimates for risk assessment could undergo calibration training. Questions that resemble the kinds the expert will be answering, but for which the answers are known, could be drafted by the person interviewing the expert. The questions used for calibration should request numerical values, such as the probability of certain events based on environmental factors, or impact values based on environmental factors. The answers requested from the expert should be in the form of ranges. Most experts are not well calibrated, as the literature shows; for this reason, methods for improving their estimation abilities should be employed as the training progresses. If the training is effective, the percentage of ranges provided by the expert that contain the correct answer should increase. Regardless of the content they are asked about, the expert's ability to assess their uncertainty should be reflected in their consistently widening the ranges of their responses. The number of questions asked may depend on the expert's progress in becoming calibrated. More questions and feedback on how to adjust their responses may be necessary even for the most brilliant of experts. Calibration questions may be outside the expert's field, like asking: what year was Benjamin Franklin born? To which they may respond with a range like 1650-1750.
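Scoring such an exercise is straightforward. The sketch below, with made-up questions and stated 90% confidence ranges, computes the hit rate that would be compared against the 90% target.

```python
# Scoring calibration exercises: how often do the expert's 90% ranges
# actually contain the truth? Questions and ranges here are hypothetical.
ranges = [
    (1650, 1750, 1706),  # "What year was Benjamin Franklin born?" -> 1706 (hit)
    (1900, 1920, 1912),  # "What year did the Titanic sink?" -> 1912 (hit)
    (30, 60, 81),        # "How many known moons does Jupiter have?" (miss)
]

hits = sum(low <= truth <= high for low, high, truth in ranges)
hit_rate = hits / len(ranges)
print(f"Hit rate: {hit_rate:.0%} (target for 90% ranges: 90%)")
# A persistent hit rate below the target signals overconfidence:
# the expert should widen their ranges.
```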
Cybersecurity Operations. Calibration training for analysts would be no different from that just described. It may help to include questions for which the analysts can provide narrow ranges, to see how they adjust their answers for things they are more confident about.
Aggregation of multiple expert opinions is yet another layer of increasing estimate accuracy. Expertise that is readily available can be optimally used, and unnecessary spending can be avoided with regression modeling. The same modeling can be used to integrate real data with the opinions of experts, further increasing estimate accuracy and precision.
Simulation with methods like Monte Carlo allows decision makers to take advantage of all of the different information made available to them and combine it into single values that are more understandable and on which further mathematical analysis can be performed. For scenarios that have too many factors for an expert to consider simultaneously, or for which there are many interacting and dependent variables, simulation can tie together all of the separate probabilities and impacts that the expert can provide.
Cybersecurity Decision Making. Decision makers may choose to ask multiple experts for their probability and impact estimates of key risk events occurring. The literature suggests that doing so may increase the accuracy of the final estimate so long as each estimate is considered. If there is strong disparity between the experts' estimates, the decision maker may benefit further by engaging the experts in conversation to see if a consensus can be reached, or at the very least to hear their arguments for their ranges. The literature also suggests that the value of involving more experts diminishes quickly after only a few have provided their estimates. For this reason, decision makers can feel confident in their estimates without spending additional resources on more experts. More technical methods like regression can be employed to aggregate the estimates of experts if the expertise is available.
Cybersecurity Operations. Analysts may investigate anomalous activity that appears in their queue and conclude that the activity was normal or should be escalated to CIRT or another department. Aggregating the estimates of multiple analysts could be performed by having some level of redundancy in tickets or by having analysts review each other’s work before making escalations. If there is a disparity between their conclusions and a consensus cannot be reached, that event may be suitable for CIRT analyst investigation simply because there is much uncertainty.
With the above methods, experts provide probability (as a percentage), impact (as dollars), and certainty (as range width). Monte Carlo simulation integrates each of these and also allows consideration of the frequency distribution: the expert can not only provide a range but also indicate whether they believe the high or low end is more likely. Alternatively, sample data may be sufficient to evaluate for the best-fitting distribution. By using sample data in this way, Monte Carlo simulation can also assist in integrating data with expert opinion.
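Where sample data exists, the fitting step might be sketched as below: several candidate distributions from scipy are fit to a hypothetical sample of past incident costs, and the one with the lowest Kolmogorov-Smirnov statistic is preferred. This is one common selection heuristic, not the only valid one.

```python
# Letting sample data choose the distribution instead of the expert.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=11, sigma=0.5, size=300)  # hypothetical incident costs

candidates = {"norm": stats.norm, "lognorm": stats.lognorm, "gamma": stats.gamma}
for name, dist in candidates.items():
    params = dist.fit(sample)                       # maximum-likelihood fit
    ks_stat, _ = stats.kstest(sample, name, args=params)
    print(f"{name:>8}: K-S statistic = {ks_stat:.4f} (lower fits better)")
```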
Cybersecurity Decision Making. The information provided by experts using the above methods can be used in simulation models built with Monte Carlo techniques. Probability and impact ranges can make up the minimum and maximum values for the simulated probability and impacts in your model. If the expert has additional information about the probability or impact of an event, such as which end of their range is more likely, that too can inform the simulation model via distribution selection. If data is available on an event that closely resembles the one you are attempting to assess, distribution fitting methods may be used; the best-fit distribution can be used instead of one provided by the expert, so long as the data set had largely the same factors influencing the distribution. Estimates could be provided in the frequency format proposed by Gigerenzer et al. (1995) but will have to be mapped to their standard-format equivalents for the calculations to occur. The results could then be converted back to frequency format to ensure optimal comprehension. An example of how the previously elicited estimates can be fit into a Monte Carlo simulation is shown below.
Estimates by experts for the “DDoS attack succeeds” event have been isolated into the above table for this example.
The following table shows a Monte Carlo simulation that uses these values.
Each white row shows an iteration, that is, a simulated year where the event occurs or not. In this case the expert and/or available data indicated a Normal distribution, which in Excel can be represented with the NORMINV function. The second column shows binary outputs, 1 or 0, to represent the event occurring or not, which can be generated in Excel with IF(RAND()<0.75,1,0) if the probability of occurrence is 75%. This is just a snapshot of the many factors that may be involved in a simulation. Any number of factors could be represented with rows and columns in the same way, using discrete mathematical functions to connect them.
In this case analysts were interested in generating example costs of the risk event occurring based on the probabilities and impacts provided. This would likely be used as part of a larger simulation with more factors. The far-right column, "Successful attack with outage," contains values that may be treated as sample data. You could, for example, highlight all numerical values in that column and generate averages or a graph. You could also compute the minimum and maximum values to double-check your work by comparing them to the minimum and maximum values provided by the expert. This column could also be modified by including events that affect the probability or impact of the risk event occurring. In the end, all factors considered, you could calculate probabilities based on the sample data. This example is just meant to show how estimates would be included in a Monte Carlo simulation.
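For readers who prefer code to spreadsheets, the same example can be re-expressed as the hedged Python sketch below. The 75% probability is taken from the example; the Normal parameters stand in for values the expert's range would imply.

```python
# The Excel model above, re-expressed in Python: each iteration is a
# simulated year; the event occurs with probability 0.75 and, when it does,
# the impact is drawn from a Normal distribution. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(123)
iterations = 100_000
p_occurrence = 0.75
impact_mean, impact_sd = 1_500_000, 250_000  # assumed from the expert's range

occurred = rng.random(iterations) < p_occurrence         # Excel: IF(RAND()<0.75,1,0)
impact = rng.normal(impact_mean, impact_sd, iterations)  # Excel: NORMINV(RAND(), mean, sd)
annual_loss = occurred * impact                          # "Successful attack with outage"

print(f"Mean annual loss:   ${annual_loss.mean():,.0f}")
print(f"95th percentile:    ${np.percentile(annual_loss, 95):,.0f}")
print(f"P(any loss at all): {occurred.mean():.2%}")
```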
Cybersecurity Operations. Decision makers may identify which scenarios are the highest risk using all of the methods presented in this recommendations section. One of the controls that may reduce the probability or impact of such an event could be monitoring security logs for indicators. For example, for the risk event of a DDoS attack, one of the recommended controls may have been to configure logging and alerting to notify the SOC of any indications of a DDoS attack beginning to occur. The experts may have provided a list of indicators to include in logging and alerting as a result of this assessment. One such indicator may be an uncharacteristically fast increase in users loading resources from the company's website, such as the home page. The experts may include a few real-world caveats: sudden popularity of the website may be due to a successful marketing campaign; a fluke event, like a popular YouTube video having the company logo in the background, may cause users to research the organization out of curiosity; or company computers may have the company website as their homepage and a connection issue may cause everyone to reload their browsers at the same time. For this reason, it may be valuable to take more factors into consideration before creating such an alert or bringing it to the top of the analyst's events-to-investigate queue. In the same way that ranges were provided for a Monte Carlo simulation, ranges can be provided to weigh the risk of particular events occurring in real time. When asked what factors they would look for to confirm a DDoS attack, the experts may recommend monitoring the webserver's connection logs for a variety of conditions, such as any single IP address initiating an above-average number of connections to the web server, or any user agent that is not that of a common web browser making the connections.
In either case, there is always the possibility that such traffic is normal, otherwise automated controls could simply prevent the connection attempts. What you instead want to know is the probability that such a traffic pattern is malicious. The impact of such an event has already been established by the risk assessment performed by decision makers in the previous section. Like the risk assessments performed by decision makers at a higher level, both probability and impact factors could be included in the model.
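One simple way to turn such factors into a probability is Bayes' rule, sketched below with hypothetical rates that experts or historical data would need to supply.

```python
# Estimating P(malicious | indicator) with Bayes' rule.
# All rates below are hypothetical placeholders.
p_malicious = 0.02               # base rate: fraction of traffic spikes that are attacks
p_indicator_given_mal = 0.90     # spike from one IP with an odd user agent, if attack
p_indicator_given_benign = 0.05  # same pattern from benign causes (marketing, flukes)

p_indicator = (p_indicator_given_mal * p_malicious
               + p_indicator_given_benign * (1 - p_malicious))
p_mal_given_indicator = p_indicator_given_mal * p_malicious / p_indicator

print(f"P(malicious | indicator) = {p_mal_given_indicator:.1%}")  # about 26.9%
```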
The same methods can be applied to measure the probability and impact of other threat scenarios that analysts are tasked with investigating. With probability and impact values established, analysts can sort the various events in their queue by both the probability and the impact of the risk event they are meant to help identify. By doing this, analysts analyze the events that are not only most probable but also most impactful first.
Excel was used in these examples because both business-level executives and technical-level analysts are familiar with its capabilities and format. Since the methods discussed consist of discrete mathematics, the same operations can be recreated in other software or languages. In the criticisms of Monte Carlo section of the Literature Review, some known drawbacks of spreadsheet software were presented. Research into other software, especially software specifically built for statistical methods like Monte Carlo simulation, may be valuable, particularly with large data sets or simulations requiring exceptionally large numbers of iterations.
Each of these methods is applicable to high-level cybersecurity risk assessment and decision-making processes, as well as low-level technical SOC and CIRT operations and prioritization. This translation of strategic and tactical risk into the common language of probabilities and dollar amounts also establishes a transparent linkage between the two decision-making realms that may facilitate more holistic and defensible decisions.
Research by Daniel Kahneman, Amos Tversky, and other judgment and decision-making (JDM) psychologists found humans are poor estimators of uncertainty regardless of expertise or experience. Experience and level of training relate only weakly to performance (Camerer & Johnson, 1991; Burgman et al., 2011), and reliance on experts for decision making in the presence of uncertainty is common in a number of fields (Ashton, 1974; Christensen-Szalanski et al., 1982; Jørgensen et al., 2004; McBride et al., 2012; Murphy and Winkler, 1984; Önkal et al., 2003; Oskamp, 1965). Decision makers have increasingly relied on the input of experts in the field of information security (Kouns and Minoli, 2010), but research found managers do not know whether their successes and failures are attributable to their experts' guidance (Hilborn & Ludwig, 1993; Sutherland, 2006; Roura-Pascual et al., 2009). These studies suggest that the use of experts for risk assessment may provide a false measurement of risk, and further that executives and experts may be wholly unaware of it.
Standards organizations provide guidance on information security risk management, but few include guidance on how to address human bias or even explain why they believe their methods work. With no research verifying the effectiveness of the methods applied in the information security profession, it is not surprising that standards organizations, decision makers, and analysts continue to use intuitive but non-verifiable methods.
Methods of reducing bias and validating the estimation accuracy and precision of experts have been developed and tested in multiple disciplines (McBride et al., 2012; Bolger & Önkal-Atay, 2004; Lichtenstein et al., 1982; Clemen & Winkler, 1999), but not in information technology. The purpose of this study was to identify methods for eliciting knowledge from experts with minimal bias in order to record measurements of risk that accurately represent reality in a way that can inform information security decisions. Drawing on previous literature, I grouped such methods into the categories of
formatting of questions and answers,
calibration of experts,
aggregation of expert opinions,
integration of data with expert opinions, and
simulation modeling with available data including estimates.
These methods were evaluated for potential application to information security risk measurement. Further research could be performed evaluating the effectiveness of these methods, specifically as applied to the information security data collection and decision making lifecycle.
Additionally, much of the available research assumed the conclusions of Kahneman et al. were correct when they may not have been. People have been observed correctly estimating value and risk (Anderson & Shanteau, 1970; Shanteau, 1974; Tversky, 1967) and assessing the likelihood of fairly complex events (Beach & Peterson, 1966; Lopes, 1976; Shuford, 1959). One interpretation of these findings is that people principally use heuristics instead of probability theory when making decisions (Lopes, 1976). Researchers found a disproportionate citation bias toward Kahneman et al.'s research: the literature cited studies showing poor performance by human estimators over studies showing good performance at a ratio of 6:1 (Kynn, 2008). Researchers who took this into consideration found humans to be Bayesian thinkers and intuitive statisticians, but only so long as information was communicated to them in a frequency format (Gigerenzer et al., 1995) as opposed to a percentage or fraction format.
The following are promising big thinkers who sell a product. I am not advocating the use of their methods; no peer-reviewed academic research is available proving that they work. I list them here because they are promising but as yet unverified and unvetted, though they contain components closely resembling the scientific literature.
Factor Analysis of Information Risk (FAIR)
FAIR is the only foundation and framework I have discovered that is compatible with my research findings to an extent. Each of the recommendations is feasible, and there are books, trainings, communities, seminars, and certifications that teach real-world application.
Criticisms of FAIR (my interpretation of Vose's points, below):
David Vose: https://www.linkedin.com/pulse/fair-style-cybersecurity-risk-assessment-spreadsheet-david-vose/
A simplification of the operations-research method known as the Loss Distribution Approach (LDA); branding a taxonomy is unnecessary.
A marketing tool, from CXOWARE to RiskLens to an industry standard.
A recipe for thinking rather than a tool.
The vulnerability estimation portion (Tcap and Diff) is too abstract and cannot be validated due to its format.
An ontology that artificially isolates and frames risk decision processes around cybersecurity, without the enterprise risks (strategy, physical safety, etc.) that would factor into the same model.
David Vose's publications and software
The textbook published by Vose offers a well-documented guide to creating, evaluating, and updating probability models, including the many variations of Monte Carlo simulation. The mathematical underpinnings of the methods are well explained in the book, on the website, and within the software.
http://www.vosesoftware.com/david.php
Doug Hubbard's publications and software
The easy-to-read publications by Douglas Hubbard provide example uses of quantitative, value-based risk management and a guide for quantifying and measuring intangibles. Hubbard calls his solution Applied Information Economics (AIE), which helps users measure the value of obtaining additional information, such as when generated probabilities are not sufficiently precise.
Hubbard's recent book, co-authored with General Electric Healthcare's Richard Seiersen, How to Measure Anything in Cybersecurity Risk, focuses specifically on measuring the intangibles of low-level security operations and executive-level budget justification.
Hubbard also offers consulting and calibration, AIE, and quantification/probability modeling training.
Anderson, N. H., & Shanteau, J. C. (1970). Information integration in risky decision- making. Journal of Experimental Psychology, 84(3), 441-451. doi: 10.1037/h0029300
Ashton, R. H. (1974). An Experimental Study of Internal Control Judgements. Journal of Accounting Research, 12(1), 143. doi: 10.2307/2490532
Ashton, A. H. (1985). Aggregating Subjective Forecasts: Some Empirical Results. Management Science, 31(12), 1499-1508. Retrieved March 12, 2015, from http://www.jstor.org/stable/2631790
Beach, L. R., & Peterson, C. R. (1966). Subjective probabilities for unions of events. Psychonomic Science, 5(8), 307-308. doi: 10.3758/BF03328412
Bolger, F., & Önkal-Atay, D. (2004). The effects of feedback on judgmental interval predictions. International Journal of Forecasting, 20(1), 29-39. doi: 10.1016/S0169-2070(03)00009-8
Budescu, D. V., Broomell, S., & Por, H. (2009). Improving Communication of Uncertainty in the Reports of the Intergovernmental Panel on Climate Change. Psychological Science, 20(3), 299-308. doi:10.1111/j.1467-9280.2009.02284.x
Christensen-Szalanski, J. J., Diehr, P. H., Bushyhead, J. B., & Wood, R. W. (1982). Two studies of good clinical judgment. Medical Decision Making, 2(275). doi: 10.1177/0272989X8200200303
Clemen, R. T., & Winkler, R. L. (1999). Combining Probability Distributions From Experts in Risk Analysis. Risk Analysis, 19(2), 187-203. doi: 10.1111/j.1539-6924.1999.tb00399.x
Cox, L.A. (2008), What’s Wrong with Risk Matrices?. Risk Analysis, 28: 497–512. doi:10.1111/j.1539-6924.2008.01030.x
Fleishman, E., MacNally, R., Fay, J. P., & Murphy, D. D. (2001). Modelling and predicting species occurrence using broad-scale environmental variables: An example with butterflies of the great basin. Conservation Biology, 15, 1674-1685.
Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102(4), 684-704. doi: 10.1037/0033-295X.102.4.684
Harvey, N., & Fischer, I. (1997). Taking Advice: Accepting Help, Improving Judgment, and Sharing Responsibility. Organizational Behavior and Human Decision Processes, 70(2), 117-133. doi:10.1006/obhd.1997.2697
Heuer, R. J., Jr. (2005). Limits of Intelligence Analysis. Orbis, 49(1), 75-94. doi:10.1016/j.orbis.2004.10.007
Hogarth, R. M. (1978). A note on aggregating opinions. Organizational Behavior and Human Performance, 21(1), 40-46. doi:10.1016/0030-5073(78)90037-5
Hubbard, D. W. (2009). The failure of risk management: Why it’s broken and how to fix it. Hoboken, NJ: Wiley.
Hubbard, D., & Evans, D. (2010). Problems with scoring methods and ordinal scales in risk assessment. IBM Journal of Research and Development, 54(3), 2:1-2:10. doi:10.1147/JRD.2010.2042914
Jørgensen, M., Teigen, K. H., & Moløkken, K. (2004). Better sure than safe? Over-confidence in judgement based software development effort prediction intervals. Journal of Systems and Software, 70(1-2), 79-93. doi: 10.1016/S0164-1212(02)00160-7
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment under uncertainty: Heuristics and biases. Cambridge: Cambridge University Press.
Kouns, J., & Minoli, D. (2010). Information technology risk management in enterprise environments: A review of industry practices and a practical guide to risk management teams. Hoboken, NJ: Wiley.
Kuhnert, P. M., Martin, T. G., & Griffiths, S. P. (2010). A guide to eliciting and using expert knowledge in Bayesian ecological models. Ecology Letters, 13(7), 900-914. doi:10.1111/j.1461-0248.2010.01477.x
Lele, S. R., & Allen, K. L. (2006). On using expert opinion in ecological analyses: A frequentist approach. Environmetrics, 17(7), 683-704. doi:10.1002/env.786
Lichtenstein, S., & Feeney, G. J. (1968). The importance of the data-generating model in probability estimation. Organizational Behavior and Human Performance, 3(1), 62-67. doi: 10.1016/0030-5073(68)90027-5
Lichtenstein, S., & Fischhoff, B. (1980). Training for calibration. Organizational Behavior and Human Performance, 26(2), 149-171. doi: 10.1016/0030-5073(80)90052-5
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1976). Calibration of Probabilities: The State of the Art. Ft. Belvoir: Defense Technical Information Center.
Lim, J. S., & O'Connor, M. (1995). Judgmental adjustment of initial forecasts: Its effectiveness and biases. Journal of Behavioral Decision Making, 8, 149-168.
Lopes, L. L. (1976). Model-based decision and inference in stud poker. Journal of Experimental Psychology: General, 105(3), 217-239. doi: 10.1037/0096-3445.105.3.217
Lopes, L. L. (1991). The Rhetoric of Irrationality. Theory & Psychology, 1(1), 65-82. doi: 10.1177/0959354391011005
Martin, T. G., Burgman, M. A., Fidler, F., Kuhnert, P. M., Low-Choy, S., McBride, M., & Mengersen, K. (2012). Eliciting expert knowledge in conservation science. Conservation Biology, 26(1), 29-38. doi: 10.1111/j.1523-1739.2011.01806.x
McBride, M. F., Fidler, F., & Burgman, M. A. (2012). Evaluating the accuracy and calibration of expert predictions under uncertainty: Predicting the outcomes of ecological research. Diversity and Distributions, 18(8), 782-794. doi: 10.1111/j.1472-4642.2012.00884.x
Merkle, E. C. (2008). Review of O'Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E., & Rakow, T. (2006), Uncertain judgements: Eliciting experts' probabilities. Hoboken, NJ: Wiley. Psychometrika, 73(1), 163. doi: 10.1007/s11336-007-9036-x
Metropolis, N., & Ulam, S. (1949). The Monte Carlo Method. Journal of the American Statistical Association, 44(247), 335. doi: 10.2307/2280232
Neimark, E. D., & Shuford, E. H. (1959). Comparison of predictions and estimates in a probability learning situation. Journal of Experimental Psychology, 57(5), 294-298. doi: 10.1037/h0043064
Önkal, D., & Muradoglu, G. (1996). Effects of task format on probabilistic forecasting of stock prices. International Journal of Forecasting, 12(1), 9-24. doi: 10.1016/0169-2070(95)00633-8
Önkal, D., Yates, J., Simga-Mugan, C., & Öztin, Ş. (2003). Professional vs. amateur accuracy: The case of foreign exchange rates. Organizational Behavior and Human Decision Processes, 91(2), 169-185. doi: 10.1016/S0749-5978(03)00058-X
Oskamp, S. (1965). Overconfidence in case-study judgments. Journal of Consulting Psychology, 29(3), 261-265. doi: 10.1037/h0022125
Panko, R. R., & Sprague, R. H. (1998). Hitting the wall: Errors in developing and code inspecting a 'simple' spreadsheet model. Decision Support Systems, 22(4), 337-353. doi: 10.1016/S0167-9236(97)00038-9
Savage, L. J. (1954). The foundations of statistics. New York: Wiley.
Shanteau, J. (1974). Component processes in risky decision making. Journal of Experimental Psychology, 103(4), 680-691. doi: 10.1037/h0037157
Sniezek, J. A., & Buckley, T. (1995). Cueing and Cognitive Conflict in Judge-Advisor Decision Making. Organizational Behavior and Human Decision Processes, 62(2), 159-174. doi:10.1006/obhd.1995.1040
Soll, J. B., & Klayman, J. (2004). Overconfidence in Interval Estimates. Journal of Experimental Psychology: Learning, Memory, and Cognition, 30(2), 299-314. doi: 10.1037/0278-7393.30.2.299
Sorkin, R. D., Hays, C. J., & West, R. (2001). Signal-detection analysis of group decision making. Psychological Review, 108(1), 183-203. doi:10.1037//0033-295X.108.1.183
Speirs-Bridge, A., Fidler, F., Mcbride, M., Flander, L., Cumming, G., & Burgman, M. (2010). Reducing Overconfidence in the Interval Judgments of Experts. Risk Analysis, 30(3), 512-523. doi: 10.1111/j.1539-6924.2009.01337.x
Teo, T. S., & Tan, M. (1997). Quantitative and qualitative errors in spreadsheet development. Proceedings of the Thirtieth Hawaii International Conference on System Sciences, 3, 149-155. doi: 10.1109/HICSS.1997.661583
Tversky, A., & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science, 185(4157), 1124-1131. doi: 10.1126/science.185.4157.1124
Tversky, A. (1967). Utility Theory and Additivity Analysis of Risky Choices. Journal of Experimental Psychology, 75(1), 27-36. doi: 10.1037/h0024915
Winkler, R. L. (1993). Evaluating and Combining Physicians' Probabilities of Survival in an Intensive Care Unit. Management Science, 39(12), 1526-1543. Retrieved March 12, 2015, from http://www.jstor.org/stable/2633069
Yang, R., & Berger, J. (1997). A catalogue of noninformative priors. Institute of Statistics and Decision Science, 97-42.
Yaniv, I. (1997). Heuristics for Aggregating Judgments under Uncertainty. Organizational Behavior and Human Decision Processes, 69(3), 237-249. doi:10.1006/obhd.1997.2685
Yaniv, I. (2004). Receiving other people’s advice: Influence and benefit. Organizational Behavior and Human Decision Processes, 93(1), 1-13. doi:10.1016/j.obhdp.2003.08.002
Yaniv, I., & Hogarth, R. M. (1993). Judgmental versus Statistical Prediction: Information Asymmetry and Combination Rules. Psychological Science, 4(1), 58-62. Retrieved from http://www.jstor.org/stable/40062505
Zarnowitz, V. (1984). The accuracy of individual and group forecasts from business outlook surveys. Journal of Forecasting, 3(1), 11-26. doi:10.1002/for.3980030103