Even while data was still coming in, we used it to create possible versions of the skills. We first gathered input from the staff, knowing that we could go back to them for different information if necessary. As the first staff data came in, we tested how well the AI tools could make sense of it. The formatting proved difficult for the AI to manage: the data arrived as a spreadsheet, which the AI struggled to interpret correctly. We solved this by converting the data into bulleted lists of responses for each question. Later, as student, family, and alumni data came in, we continued adding it to the test data and checking the tools' ability to interpret the information correctly.
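For teams comfortable with a short script, the spreadsheet-to-bullets conversion is easy to automate. The sketch below is one possible approach, assuming the survey exports to a CSV file with one column per question; the file and column names are hypothetical stand-ins, not the ones we used.

```python
# Minimal sketch: convert a survey spreadsheet (CSV export) into bulleted
# lists of responses grouped by question, a format chat-based AI tools tend
# to read more reliably than tabular data. File names are hypothetical.
import csv
from collections import defaultdict

def spreadsheet_to_bullets(csv_path: str, output_path: str) -> None:
    responses_by_question = defaultdict(list)

    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # each column header is a survey question
        for row in reader:
            for question, answer in row.items():
                if answer and answer.strip():
                    responses_by_question[question].append(answer.strip())

    with open(output_path, "w", encoding="utf-8") as out:
        for question, answers in responses_by_question.items():
            out.write(f"{question}\n")
            for answer in answers:
                out.write(f"- {answer}\n")
            out.write("\n")

if __name__ == "__main__":
    spreadsheet_to_bullets("staff_input.csv", "staff_input_bullets.txt")
```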
At one point, as the amount of data increased, we noticed inconsistencies when we asked ChatGPT (the first AI we tried) to provide an exact breakdown of how often skills appeared in the stakeholder input. The replies did not match some of our own observations of the data. We asked it to count the occurrences of certain ideas, and its numbers were clearly incorrect. We told it about the problem; it said it had checked itself and verified it was correct. It went into deep analysis mode and still held firm. Finally, when we supplied a quote from the document that it had not counted, it admitted an error, but it added only that single instance. After much investigation, we realized that this particular AI tool was lazy: it would read the first part of a document, but not all of it. That was unacceptable for this application. Some online investigation suggested that Claude might be a better alternative. We repeated the exercise that had revealed ChatGPT's laziness and met with great success. Not only did Claude come up with a complete list of skills suggested by the input, it also volunteered how often each occurred in the data. Most importantly, it was accurate. At that point, we switched to Claude as our primary AI tool for this project.
At various stages, we experimented with the set of documents we provided for the AI to weigh against one another. We started by erring on the side of providing too much information, so we supplied the following documents to the AI:
Staff input
Student input
Family input
Alumni input
Academic practices (cross-cutting themes) for core subject areas
Washington State's portrait of a graduate
Existing ILHS Skills
Research summary
We ended up combining all of the inputs into a single document, which made file management easier. We decided against using Washington's portrait of a graduate since it was redundant, and we did not include our existing skills since the current input mattered far more than previous decisions we had made. We had initially included a separately prepared summary of the research, but that proved constricting: it kept us from drawing anything new out of the research. Once we verified that the same essential information was being used without the summary being provided separately, we discarded the extra document.
In the end, we only provided three documents to the AI:
Stakeholder input
Academic practices
Instructions
Once all of the data collection deadlines had passed, we scrubbed all input data clean of any personally identifiable information. Each individual's submission was also given a new identifier that retained information about which group submitted the response, such as "Student Response 1" or "Staff Response 5." When the responses were collected into a single document for simpler processing, even those identifiers were removed, and we differentiated responses just by having a separate section for each stakeholder group.
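The scrubbing itself was manual work, but the relabeling and combining steps lend themselves to a small script. Below is a rough sketch of how that might look, assuming personally identifiable information has already been removed by hand and that each group's cleaned responses sit one per line in a text file; every file name here is a hypothetical stand-in.

```python
# Minimal sketch of the relabeling and combining steps. It assumes PII has
# already been removed (that is hard to automate reliably) and that each
# group's cleaned responses live one per line in a text file. All file names
# are hypothetical.
GROUP_FILES = {
    "Student": "student_input_clean.txt",
    "Staff": "staff_input_clean.txt",
    "Family": "family_input_clean.txt",
    "Alumni": "alumni_input_clean.txt",
}

def load_responses(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def relabel_and_combine(combined_path="stakeholder_input.txt"):
    with open(combined_path, "w", encoding="utf-8") as combined:
        for group, path in GROUP_FILES.items():
            responses = load_responses(path)

            # Per-group file with neutral identifiers like "Student Response 1"
            with open(f"{group.lower()}_labeled.txt", "w", encoding="utf-8") as labeled:
                for i, response in enumerate(responses, start=1):
                    labeled.write(f"{group} Response {i}: {response}\n")

            # Combined document keeps only a section per group, no identifiers
            combined.write(f"=== {group} Responses ===\n")
            for response in responses:
                combined.write(f"- {response}\n")
            combined.write("\n")

if __name__ == "__main__":
    relabel_and_combine()
```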
Along with providing the data to the AI, we described the input collection process in our instructions. We told the AI how the data should be weighted: whether to weight each individual response equally, to weight each group's collective responses equally, or to give disproportionate weight to a particular group. Each person tends to have inbuilt biases that would weight one group over another, so making this choice explicit from the outset helped insulate the results from those biases. In our case, we chose to weight each group's collective responses equally. In this way, our students, staff, families, and alumni each had equal input into the skills regardless of the different sample sizes between the groups.
We used the data as it came in to test the ability of the AI to create the skills we wanted. Each iteration helped us to learn more about how to use the tools and what other information we might need.
One of the first things we learned was that we would need a set of instructions we could provide to the AI. Otherwise, it was too tedious to tell it what we wanted for each new iteration. The instructions also became the central place to record our guidelines for the new skills and to track the decisions we had made about what we wanted out of them. The instructions started very open-ended, which kept the process flexible while letting us tighten specific areas as needed.
The original instructions provided context for the other files (stakeholder input and our existing skills), described broadly our desire to update the skills using the input, and asked for a broad examination of what research said about the same questions we had asked our stakeholders. The results verified that the process could work, but also that the instructions needed to be much more thorough. The first skills created by the AI had the right idea, but they were far too broad, full of jargon, and often not applicable to high school. We went through an iterative process in which we would update the instructions document to provide more thorough or accurate information and then run the process to see what else should be improved.
When we had the opportunity, we shared the very rough drafts with others experienced in schoolwide competencies. This gave us a chance to get early feedback and make fast adjustments. The structure of the final output changed many times as a result, starting out very different from our original skills and eventually settling into a very similar structure. These conversations also shone a spotlight on the fact that we were the only school they were aware of creating schoolwide competencies to be used alongside traditional academic courses and standards.
Between the first draft of our instructions to the AI and the last, we made many refinements:
We originally provided instructions for how the final output should be formatted. The AI's attempts to match that format missed in ways that made the output harder to use. By removing these specifications, we could work with the AI's own formatting more easily. Once we had the final skills created, we asked for different ways of formatting them, and by then it was easier to focus on that aspect alone.
The original instructions included our school's mission statement without context, and the AI did nothing with it. We added context stating that the skills should align with our mission. It was a good reminder that a person would infer that everything should align with the mission simply because it was included, but the AI made no such inference.
The original instructions provided a lot of flexibility for the organization of our skills. We specified ranges for how many domains and skills there should be, but that was about it. The trial runs let us try out many different structures. After looking at those preliminary results, we decided we wanted to maintain the "4Cs" (communication, collaboration, critical thinking, and creativity) as domain names. We added this specification to the instructions but deliberately left the names of any additional domains open for change.
We added instructions for the AI to provide a paragraph of details explaining what each skill entails. This was originally intended just to help us understand the intent of a skill so we could focus our refinements without changing its meaning. However, it proved very useful beyond that. These descriptions helped those not involved in the project understand the scope of the skills and see how each skill could have many aspects without listing them all out. As others used the descriptions, though, we realized that some might read the long descriptions as a list of requirements for demonstrating the skill rather than a list of possible ways it could be demonstrated. So we had the AI revise the descriptions to make that distinction clear.
We added instructions for the AI to provide a paragraph of justification for each skill's existence based on our stakeholder input and research. This proved invaluable, especially in the early stages, as it helped us know the basis for the skill. Toward the end of the process, however, we noticed that the AI could justify almost anything, even with a very tenuous basis in the input or research, so we had to be careful to double-check the relative prominence of the skills. The AI was very good at comparing proposed skills and identifying which were most and least supported by the input.
We originally attached the Washington State Profile of a Graduate and included instructions for the AI to align our skills with that profile. However, after conversations with a few people close to the project and a look at the outputs aligned to it, we found that its structure did not suit us. Since we were already clearly on a path that included the skills described in the Profile of a Graduate, we simply stopped including it in the considerations.
We loosened the guidance about the research we asked the AI to do. In our early drafts, we specified that we wanted it to find and use meta-analyses and peer-reviewed research. Unfortunately, we discovered (and verified separately) that such sources relevant to this project were rare: there was not enough data in them about skills proven to lead to success in school, career, and/or life. So we allowed the AI to do broad internet research instead. We tested this aspect on its own, found that the results matched what we already knew to be true and that the rest fit with reason, so we proceeded with this approach.
We fleshed out guidance about how the skills would be assessed. Early drafts of the new skills included many skills that would be almost impossible to assess. They were often traits, not skills. We added instructions that skills must be observable and demonstrable through evidence. We ended up pushing the limits of this in the later stages of development, but it helped make the resulting skills much more usable.
It was easy for the proposed skills to become too simplistic as we tried to make the language easy to understand. As a result, we added specifications that the goal was to maintain a high depth of knowledge whenever possible.
In practical use, we also didn't use the full instructions all at once. We spent months generating and honing our foundational skills and just weeks on the advanced skills, since they were dependent on the foundational skills. As a result, most of our AI instructions began with directions to provide only the foundational skills at that time. We could have removed the information about the advanced skills for much of the process. However, it still seemed beneficial to keep that future direction included in case the AI would consider future ramifications (though we never tested whether it did), so we left everything in one document. Others engaging in a similar process could consider creating a separate set of instructions for each phase of the project.
Along the way, we discovered some of the limitations of the tools. Often, the tool seemed forgetful or lazy. Some of the challenges we had with the AI:
The AI loves lists. It tried to include everything.
The AI defaulted to complex language. It would be technically correct, but difficult to understand. Jargon was common (lots of "synergy").
It ignored some instructions when they proved difficult to implement.
Some models would read only part of the data (ChatGPT).
In the vast majority of the cases above, we would tell the AI to make the relevant change and it would do so well. When the AI did not follow the given directions as closely as we wanted, it was generally best to take that output and then have it make sweeping changes to correct what we did not like. If the issue was not something we had already covered in our instructions, that behavior told us to add it.
We focused on the content of the skills through most of the development process. But, as soon as the output needed to be shared with others, we would have the AI clean up the language. Before finalizing the skills, we also double- and triple-checked the results. So, there were many versions of the skills that were progressively improved by asking the AI to check that it did what we asked.
Checks for the AI to run on the proposed skills:
Check the wording of the skills:
Student-friendly wording
Avoid jargon
Look for wording at a high level of the depth-of-knowledge scale
Avoid clunky wording
Eliminate unnecessarily long descriptions
Consistent tense among all the skills
Eliminate ambiguous wording, especially subjective adjectives and adverbs
Look for overlaps between skills (there should be none)
Verify each skill could be assessed through evidence
What implicit biases are evident from the skills?
Foundational: Would any be hard to extend to advanced?
Final checks asked of AI:
After all the changes, did the description, details, and justification sections still align?
Evaluate the final product against our original goals
Some final thoughts on the most important steps toward getting quality output from the AI:
Experiment in the early stages.
Spend the time to create a detailed set of instructions. This is a worthy investment.
Trust, but verify. Test whether the AI is using all data. Ask it to count occurrences of data in provided documents. Ask it to explain its sources. (A sketch of one such cross-check follows this list.)
Tell the AI what you like and do not like about its output. Let it fix it. Keep a running conversation with it.
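As one example of the "trust, but verify" step above, a short script can cross-check the AI's tallies against the source document itself. The sketch below counts how often a handful of phrases appear in the combined stakeholder file so the counts can be compared with what the AI reports; the file name and phrases are hypothetical examples, not our actual data.

```python
# Minimal sketch of a "trust, but verify" cross-check: count how often a few
# phrases of interest appear in the combined stakeholder document and compare
# the tallies with the AI's reported counts. File name and phrases are
# hypothetical examples.
import re

PHRASES = ["communication", "collaboration", "critical thinking", "creativity"]

def count_phrases(path: str, phrases=PHRASES) -> dict:
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Use word boundaries so shorter phrases are not counted inside longer words.
    return {
        phrase: len(re.findall(r"\b" + re.escape(phrase) + r"\b", text))
        for phrase in phrases
    }

if __name__ == "__main__":
    for phrase, count in count_phrases("stakeholder_input.txt").items():
        print(f"{phrase}: {count}")
```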