On this page, we present the details of the examples mentioned in the paper.
Example 1 in RQ2 (GPT-2 on eli5_category dataset):
The LLM misunderstands the question and picks up only the words "scientists" and "commonly". Consequently, it drifts into a discussion of the human body parts that scientists commonly possess, producing responses such as "We all have the same DNA, we all have the same genes, we all have the same ears...". In this case, the LLM consistently generates similar responses about body parts even though stochasticity is introduced through perturbation. As a result, the LLM still yields a low uncertainty score even though the responses are completely incorrect.
Input: Why do scientists research commonly known stuff?
Perturbed Answer 1: The answer is that scientists are interested in the stuff that is common to all of us. We all have the same DNA, we all have the same genes, we all have the same brain, we all have the same body, we all have the same skin [Perturbed point], we all have the same hair, we all have the same eyes, we all have the same ears, we all have the same nose, we all have the same mouth, we all have the same teeth, we all have the same skin,
Perturbed Answer 2: The answer is that scientists are interested in the stuff that is common to all of us. We all have the same DNA, we all have the same genes, we all have the same brain, we all have the same body, we all have the same organs [Perturbed point], we all have the same skin, we all have the same hair, we all have the same eyes, we all have the same ears, we all have the same nose, we all have the same mouth, we all have the same teeth,
Perturbed Answer 3: The answer is that scientists are interested in the stuff that is common to all of us. We all have the same DNA, we all have the same genes, we all have the same brain, we all have the same body, we all have the same body [Perturbed point] parts, we all have the same organs, we all have the same organs, we all have the same organs, we all have the same organs, we all have the same organs, we all have the same organs, we all have the same organs
Perturbed Answer 4: The answer is that scientists are interested in the stuff that is common to all of us. We all have the same DNA, we all have the same genes, we all have the same brain, we all have the same body, we all have the same senses [Perturbed point], we all have the same brain, we all have the same body, we all have the same senses, we all have the same brain, we all have the same body, we all have the same senses, we all have the same brain,
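The consistency seen in Example 1 can be illustrated with a minimal sketch. It assumes a simple stand-in measure, one minus the mean pairwise Jaccard similarity of the word sets, which is not necessarily the uncertainty metric used in the paper. The function names (`tokenize`, `jaccard`, `dispersion_score`) are hypothetical, and the inputs are short excerpts of the four perturbed answers above with the "[Perturbed point]" marker removed.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dispersion_score(responses):
    """Return 1 - mean pairwise Jaccard similarity of the responses.

    This is an illustrative stand-in, not the paper's metric.
    A value near 0 means the responses share almost the same vocabulary.
    """
    token_sets = [tokenize(r) for r in responses]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    mean_similarity = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Short excerpts taken right around the perturbed point of each answer above.
perturbed = [
    "we all have the same skin, we all have the same hair, we all have the same eyes",
    "we all have the same organs, we all have the same skin, we all have the same hair",
    "we all have the same body parts, we all have the same organs",
    "we all have the same senses, we all have the same brain, we all have the same body",
]

# Unrelated toy responses, for contrast only.
unrelated = ["alpha beta", "gamma delta", "epsilon zeta"]

print("perturbed answers:", round(dispersion_score(perturbed), 3))
print("unrelated answers:", round(dispersion_score(unrelated), 3))
```

The perturbed answers score far lower than the unrelated ones because they keep repeating the same pattern, which is how a consistency-based measure can report low uncertainty for an answer that is wrong.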
Example 2 in RQ2 (Gemma2 on CNN DailyMail dataset, #767):
Input: (CNN)It takes a village to raise a triathlete. Since returning home from our kickoff weekend where we got to experience life as a full-time triathlete, I've been juggling training life with a full-time job, hectic commute and time with friends and family. With my Fit Nation teammates spread far and wide, this was a solo effort. Throw in a notorious Chicago winter that refused to go gently into that good night, and I'll admit it: I was not very diligent in getting all of my training done. "Partner up," the Facebook status read. Jae Rockwell, the founder of my local women's fitness groups, Women RUN the World, posted this mantra to help us keep ourselves accountable. And so that's what I did. I'm lucky to count a handful of both experienced and aspiring triathletes among my friends. So, I reached out to them to ask what they were up to. Annastasia W., who's training for her first Ironman half-triathlon this summer, suggested a group brick workout at our local gym. She invited several other women, including complete newbies, through the SoleTri Sisters Facebook group and 10 of us met up on a cold Chicago morning to swim for 15 minutes, bike for 30 minutes and run for 15 minutes. That was great motivation to keep up with my workouts for the next week. A few weeks later, I traveled to Little Rock, Arkansas, to run a marathon. I traveled to Little Rock with three phenomenal "fitspirations": Samantha Goga, Natali Villarruel and Umber Rahman -- who are members of an obstacle course race group called the Midwest Vikings. Seeing how much fun they've had participating in some really daunting physical challenges helped me push through some of the more difficult moments of running my first marathon! Swimming remains my weakest link. I just learned to swim last summer. Sometimes I forget that and don't cut myself any slack or recognize how far I've come. Once my formal swim class ended in February I found myself seriously slacking on my swim training.
Enter my swim angels, Lovie Twine and Eanista Bailey. They took me out into Lake Michigan last year after my first swim class was complete for a celebratory "swim" (which was really just wading). It was great to have them join me for a swim workout. I appreciated their perspective and gentle coaching. And you know what was happening while I was partnering up? Spring was on the way! And that meant I could finally break free of my Computrainer bike sessions and hit the road. I was lucky enough to connect with the Major Taylor Cycling Club of Chicago (MTC3) for a series of novice rides they've been hosting on local trails around Chicagoland. With two falls behind me (hey, those clips take some getting used to), we got in nearly 23 miles and a new addiction was born. Riding outside with the wind whipping your face, chasing your friends and breathing fresh air is exhilarating and I'm looking forward to many more rides outdoors. I want to say thank you to the fellowship of fitness that I've been lucky enough to find. There are so many others we keep me inspired, accountable and motivated. I'm so grateful that they keep me moving in the right direction and I hope that I can help others do the same. So when times get tough, partner up! TL;DR:
Perturbed Answer 1: I'm still training for my first triathlon. I'm still not swimming as much as I should. I'm still not running as much as I should. I'm still not biking as much as I should. I'm still not eating as well as I should. I'm still not sleeping as much as I should. I'm still not drinking as much water as I should. I'm still not doing as much yoga as I should. I'm still not doing as much stretching as I should. I'm still not doing as much foam rolling as I should. I'm still not doing as much core work as I should. I'm still not doing as much strength training as I should. I'm still not doing as much cross training as I should. I'm still not doing as much cardio as I should. I
Perturbed Answer 2: I'm still training for my first triathlon. I'm still not swimming as much as I should. I'm still not running as much as I should. I'm still not biking as much as I should. I'm still not eating as well as I should. I'm still not sleeping as much as I should. I'm still not drinking as much water as should. I'm still not doing as much yoga as I should. I'm still not doing as much stretching as I should. I'm still not doing as much foam rolling as I should. I'm still not doing as much core work as I should. I'm still not doing as much strength training as I should. I'm still not doing as much cross training as I should. I'm still not doing as much cardio as I should. I'm
Perturbed Answer 3: I'm still training for my first triathlon. I'm still not swimming as much as I should. I'm still not running as much as I should. I'm still not biking as much as I should. I'm still not eating as well as I should. I'm still not sleeping as much as I should. I'm still not drinking as much water as body needs. I'm still not doing as much yoga as I should. I'm still not doing as much stretching as I should. I'm still not doing as much foam rolling as I should. I'm still not doing as much core work as I should. I'm still not doing as much strength training as I should. I'm still not doing as much cross training as I should. I'm still not doing as much cardio as I should. I'
Perturbed Answer 4: I'm still training for my first triathlon. I'm still not swimming as much as I should. I'm still not running as much as I should. I'm still not biking as much as I should. I'm still not eating as well as I should. I'm still not sleeping as much as I should. I'm still not drinking as much water. I'm still not doing as much yoga as I should. I'm still not doing as much stretching as I should. I'm still not doing as much foam rolling as I should. I'm still not doing as much core work as I should. I'm still not doing as much strength training as I should. I'm still not doing as much cross training as I should. I'm still not doing as much of anything as I should. But I
Example 3 in RQ2 (GPT-3.5 on CNN DailyMail dataset):
Conversely, a higher degree of uncertainty does not necessarily imply that an LLM's prediction is incorrect. We show an example in which GPT-3.5 summarizes a news article from the CNN/Daily Mail dataset.
In this case, the ground truth summary includes two details: a fire occurrence at a park and the absence of injuries.
For sample-based uncertainty measurement, all five generated samples incorporate these two pieces of information but also add varied details, such as the park's owner and its intended use. This extra information leads to a higher variance in the embeddings of the generated samples, resulting in a high uncertainty score even though the LLM's prediction is reliable.
Input: Summarize the following sentences: (CNN)A mammoth fire broke out Friday morning in a Kentucky industrial park, sending plumes of thick smoke over the area as authorities worked to contain the damage. The blaze began shortly before 7 a.m. at the General Electric Appliance Park in Louisville, according to Mike Weimer from the city's emergency management agency. He said that there were no reports of anyone injured or trapped. Video showed both smoke and bright orange flames. Firefighters took up positions around the affected buildings, spraying water from the periphery. Weimer told CNN that authorities didn't know what had caused the fire, which had gone to at least four alarms. According to a GE website, its facility in the Louisville Appliance Park is "revitalizing manufacturing in the United States." The park is large, such that 34 football fields could fit in one of its warehouses in the facility. TL;DR:
Sample response 1: A large fire broke out Friday morning in a Kentucky industrial park, with no reported injuries or casualties. Firefighters worked to contain the flames and smoke, and the cause is still unknown. The affected park is owned by the General Electric Appliance Park and is the size of 34 football fields.
Sample response 2: A major fire broke out in an industrial park in Louisville, Kentucky on Friday morning. It was reported to be at least four alarms, and no injuries or people trapped were reported. Firefighters attempted to contain the blaze, and the facility is used for revitalizing manufacturing. It is so large that 34 football fields could fit in one of its warehouses.
Sample response 3: A large fire broke out in a Kentucky industrial park on Friday morning with no reported injuries. Firefighters fought the blaze from the outside as the cause is unknown. The facility is the size of 34 football fields and is operated by General Electric.
Sample response 4: On Friday morning, a large fire broke out in a Kentucky industrial park. Video showed both smoke and bright orange flames. No one was injured or trapped. Firefighters worked to contain the damage, but the cause of the fire is unknown. The industrial park is a large facility owned by General Electric and is meant to "revitalize manufacturing in the United States".
Sample response 5: A massive fire started at 7am in a Louisville industrial park, with orange flames and thick smoke seen in videos. No injuries or trapped people reported. The cause is unknown and the park is reportedly used for revitalizing US manufacturing. It is so large that 34 football fields could fit in one of its warehouses.
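The effect described in Example 3 can be sketched as follows. This is a minimal illustration under stated assumptions: it uses normalized word-count vectors as a stand-in for the sentence embeddings (the actual embedding model is not specified on this page), and it measures variance as the mean squared distance to the centroid. The names `bow_embedding` and `embedding_variance` are hypothetical, and the toy strings are shortened paraphrases of the samples above, not the real model outputs.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase word tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_embedding(text, vocab):
    """Stand-in embedding: L2-normalized word-count vector over a shared vocabulary."""
    counts = Counter(words(text))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def embedding_variance(samples):
    """Mean squared distance of each sample embedding from the centroid."""
    if not samples:
        return 0.0
    vocab = sorted({w for s in samples for w in words(s)})
    embs = [bow_embedding(s, vocab) for s in samples]
    dim = len(vocab)
    centroid = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    total = sum(sum((e[i] - centroid[i]) ** 2 for i in range(dim)) for e in embs)
    return total / len(embs)


# Both facts from the ground-truth summary: the fire and the absence of injuries.
core = "A large fire broke out in a Kentucky industrial park with no reported injuries."

# Varied extra details, paraphrased from the sample responses above.
extras = [
    "The park is owned by General Electric.",
    "The facility is used for revitalizing manufacturing.",
    "It is so large that 34 football fields could fit in one warehouse.",
]

concise = [core, core, core]
detailed = [core + " " + extra for extra in extras]

print("variance, core facts only:     ", embedding_variance(concise))
print("variance, core + varied extras:", embedding_variance(detailed))
```

Every string in `detailed` still contains both ground-truth facts, yet its variance is higher than that of `concise`, which mirrors how correct but differently detailed summaries can receive a high sample-based uncertainty score.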