You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the description {app_description}. Please extract the description metadata of the app, including genres, VR device support, themes, interaction mechanisms, and app language.
--=-- Output Format --=--
Output the inferred app description metadata in JSON format:
The description metadata of the app:
{
"app_description_metadata": {
"app_genre": "",
"app_vr_device_support: "",
"app_theme: "";
"app_interaction_mechanisms": "",
"app_language: ""
}
}
E.g., {demonstrations}.
DO NOT OUTPUT any other content besides JSON.
💡 Output: {app_description_metadata}.
Note: We randomly collect 30 samples from real-world applications. We analyze them and write chain-of-thought reasoning steps and results as demonstrations in the template for LLM. Orienter will randomly select three of them as demonstration examples during runtime.
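For concreteness, the template above could be instantiated at runtime roughly as follows. This is a minimal sketch, not Orienter's implementation: `call_lmm` stands for any callable that sends a prompt to the model and returns its raw text reply, and `DEMO_POOL` is a placeholder for the 30 demonstration samples.
```python
import json
import random

# Placeholder standing in for the 30 analyzed demonstration samples (assumption).
DEMO_POOL = [f"<demonstration {i}>" for i in range(1, 31)]

TEMPLATE = (
    "You are a virtual reality game player. Currently, you are playing a VR game "
    "named {app_name} with the description {app_description}. Please extract the "
    "description metadata of the app, including genres, VR device support, themes, "
    "interaction mechanisms, and app language. "
    "Output the inferred app description metadata in JSON format. "
    "E.g., {demonstrations}. DO NOT OUTPUT any other content besides JSON."
)

def build_metadata_prompt(app_name, app_description):
    # Orienter randomly selects three demonstrations from the pool at runtime.
    demonstrations = "; ".join(random.sample(DEMO_POOL, 3))
    return TEMPLATE.format(app_name=app_name,
                           app_description=app_description,
                           demonstrations=demonstrations)

def extract_app_metadata(app_name, app_description, call_lmm):
    """call_lmm: any callable mapping a prompt string to the model's raw text reply."""
    reply = call_lmm(build_metadata_prompt(app_name, app_description))
    # The template forbids any output besides JSON, so the reply should parse directly.
    return json.loads(reply)["app_description_metadata"]
```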
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}, and you are given the uploaded screenshot {VR_scene_under_analysis}. Please describe all GUI elements in the screenshot.
💡 Output: {app_GUI_scene}.
--=-- Context and Task --=--
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}. Please extract information and infer what GUI elements you can see currently based on the current view and the app description. Please provide detailed object descriptions.
--=-- Chain-of-Thought Instructions and Demonstrations --=--
{CoT&demonstrations}
--=-- Output Format --=--
Output the inferred GUI elements in JSON format: {"GUI_elements": ["element1", "element2"]}. E.g., {demonstrations}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {IGE_candidates}.
We refer to "element1" and "element2" as {IGE_candidate_semantic_name} in the remaining content.
Note: We randomly collect 30 samples from real-world applications. We analyze them and write chain-of-thought reasoning steps and results as demonstrations in the template for LMM. Orienter will randomly select three of them as demonstration examples during runtime.
--=-- Context and Task --=--
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}, as shown in the uploaded screenshot {VR_scene_under_analysis}. Specifically, you can see the following GUI elements: {IGE_candidates}. For each GUI element, please extract the dimensions of characteristics that can distinguish it from other GUI elements in the GUI scene.
--=-- Chain-of-Thought Instructions and Demonstrations --=--
{CoT&demonstrations}
--=-- Output Format --=--
Output the inferred distinguishable characteristic dimensions in JSON format: {"GUI_element_characteristics": {"element1": ["characteristic1", "characteristic2"], "element2": ["characteristic3", "characteristic1", "characteristic4"]}}. E.g., {demonstrations}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {IGE_candidate_characteristics}.
Note: We randomly collect 30 samples from real-world applications. We analyze them and write chain-of-thought reasoning steps and results as demonstrations in the template for LMM. Orienter will randomly select three of them as demonstration examples during runtime.
--=-- Context and Task --=--
You are a virtual reality game player. Currently, you are playing a VR game, and you can see GUI elements {IGE_candidates} with multiple dimensions of characteristics {IGE_candidate_characteristics}. For each dimension of characteristics of each GUI element, please generate a question asking for the values of these characteristics.
--=-- Chain-of-Thought Instructions and Demonstrations --=--
{CoT&demonstrations}
--=-- Output Format --=--
Output the questions in JSON format: {"GUI_element_characteristic_questions": {"element1": {"characteristic1": "question1", "characteristic2": "question2"}, "element2": {"characteristic3": "question3", "characteristic1": "question4", "characteristic4": "question5"}}}. E.g., {demonstrations}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {IGE_candidate_characteristic_questions}.
Note: We randomly collect 30 samples from real-world applications. We analyze them and write chain-of-thought reasoning steps and results as demonstrations in the template for LLM. Orienter will randomly select three of them as demonstration examples during runtime.
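The loop below visits one question per (GUI element, characteristic) pair. A small sketch of how the nested question structure from the previous output could be flattened into such items (the names and the example content are hypothetical):
```python
def flatten_questions(characteristic_questions):
    """Turn the nested {element: {characteristic: question}} mapping into a flat
    list of (element, characteristic, question) items for the per-question loop."""
    items = []
    for element, questions in characteristic_questions.items():
        for characteristic, question in questions.items():
            items.append((element, characteristic, question))
    return items

# Hypothetical example input and usage.
nested = {"start button": {"color": "What color is the start button?",
                           "position": "Where is the start button located in the scene?"}}
for element, characteristic, question in flatten_questions(nested):
    print(element, characteristic, question)
```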
Loop:
-----
For each question {IGE_candidate_characteristic_question_item} in {IGE_candidate_characteristic_questions}......
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}, as shown in the uploaded screenshot {VR_scene_under_analysis}. Please find out: {IGE_candidate_characteristic_question_item}.
💡 Output: {IGE_candidate_characteristic_value_item}.
-----
After iterations, we get an {IGE_candidate_characteristic_value_item} for each {IGE_candidate_characteristic_question_item}, forming {IGE_candidate_characteristic_values}, which consists of the values of all characteristics.
Then, we form referring expressions based on {IGE_candidate_characteristic_values}.
Loop:
-----
For the characteristic-value pairs {IGE_candidate_characteristic_values_for_individual_element} of each GUI element {IGE_candidate_semantic_name} in {IGE_candidate_characteristic_values}......
You are a virtual reality game player. Currently, you are playing a VR game and see a GUI element {IGE_candidate_semantic_name} with characteristics {IGE_candidate_characteristic_values_for_individual_element}. Please form a concise referring expression to describe and locate this GUI element {IGE_candidate_semantic_name}.
💡 Output: {IGE_referring_expression_item}.
-----
After iterations, we get an {IGE_referring_expression_item} for each {IGE_candidate_semantic_name}, forming {IGE_referring_expressions}, which consists of the referring expressions for all GUI element candidates.
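The two loops above could be driven as sketched below. This is illustrative only: `call_lmm` is an assumed callable that returns the model's text reply, and `scene_prompt_prefix` stands for the context-and-task preamble shared by the prompts in this stage.
```python
from collections import defaultdict

def mine_characteristic_values(question_items, scene_prompt_prefix, call_lmm):
    """Loop 1: ask one question per (element, characteristic) pair and collect the values."""
    values = defaultdict(dict)  # element -> {characteristic: value}
    for element, characteristic, question in question_items:
        answer = call_lmm(f"{scene_prompt_prefix} Please find out: {question}")
        values[element][characteristic] = answer.strip()
    return values

def form_referring_expressions(characteristic_values, call_lmm):
    """Loop 2: condense each element's characteristic-value pairs into one referring expression."""
    referring_expressions = {}
    for element, pairs in characteristic_values.items():
        pair_text = "; ".join(f"{c}: {v}" for c, v in pairs.items())
        prompt = (f"You are a virtual reality game player. Currently, you are playing a VR game "
                  f"and see a GUI element {element} with characteristics {pair_text}. "
                  f"Please form a concise referring expression to describe and locate "
                  f"this GUI element {element}.")
        referring_expressions[element] = call_lmm(prompt).strip()
    return referring_expressions
```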
Detect the {IGE_candidates} based on their corresponding {IGE_referring_expressions}.
💡 Output from Module II.2:
Successfully detected referring expressions: {detected_RE_IGE_candidates_with_locations} (the locations are given as exact bounding boxes),
Unsuccessfully detected referring expressions: {unsuccessfully_detected_REs},
Visualized IGE detection results in the {VR_scene_under_analysis}: {VR_scene_under_analysis_with_visualized_detection_results}, where the detected REs are boxed according to their exact bounding boxes and each RE is labeled at the top-left corner of its bounding box.
Crop the successfully detected regions according to the locations in {detected_RE_IGE_candidates_with_locations} to obtain {zoomed_RE_regions}, forming tuples of <{VR_scene_under_analysis}, {zoomed_RE_region}, {detected_RE_IGE_candidate}> as {mirroring_RE_regions_and_original_scenes}.
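A minimal sketch of the visualization and cropping steps, assuming the detected locations are pixel-coordinate bounding boxes and using Pillow; the function and variable names are ours, not Orienter's.
```python
from PIL import Image, ImageDraw

def visualize_and_crop(scene_path, detected_re_locations):
    """detected_re_locations: {referring_expression: (x1, y1, x2, y2)} (assumed format).
    Returns the annotated scene plus the <scene, zoomed region, candidate> triples."""
    scene = Image.open(scene_path).convert("RGB")
    annotated = scene.copy()
    draw = ImageDraw.Draw(annotated)
    mirroring_triples = []
    for referring_expression, (x1, y1, x2, y2) in detected_re_locations.items():
        # Box the detected region and label the RE at the top-left corner of the box.
        draw.rectangle((x1, y1, x2, y2), outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), referring_expression, fill="red")
        zoomed_region = scene.crop((x1, y1, x2, y2))
        mirroring_triples.append((scene, zoomed_region, referring_expression))
    return annotated, mirroring_triples
```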
Loop:
-----
For each {mirroring_RE_region_and_original_scene} in {mirroring_RE_regions_and_original_scenes}......
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}, as shown in the uploaded screenshot {VR_scene_under_analysis}. Please verify whether the uploaded figure of zoomed region {zoomed_RE_region} in the current VR scenario is {detected_RE_IGE_candidate}.
Output in JSON format: {"successful_judgement (yes/no)": "explanation"}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {detected_RE_IGE_candidates_verification}.
-----
After iterations, we get a {detected_RE_IGE_candidates_verification} for each {mirroring_RE_region_and_original_scene}, forming {detected_RE_IGE_candidates_verifications}, which consists of the verifications of all successfully detected referring expressions.
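The verification loop might be driven as follows, assuming a hypothetical `call_lmm` callable that accepts a prompt together with a list of images and returns the model's raw JSON reply; none of these identifiers come from Orienter.
```python
import json

def verify_detected_regions(mirroring_triples, scene_prompt_prefix, call_lmm):
    """mirroring_triples: iterable of (original_scene, zoomed_region, detected_candidate).
    Returns one verification verdict per successfully detected RE candidate."""
    verifications = {}
    for original_scene, zoomed_region, candidate in mirroring_triples:
        prompt = (f"{scene_prompt_prefix} Please verify whether the uploaded figure of the "
                  f"zoomed region in the current VR scenario is {candidate}. "
                  'Output in JSON format: {"successful_judgement (yes/no)": "explanation"}. '
                  "DO NOT OUTPUT any other content besides JSON.")
        reply = call_lmm(prompt, images=[original_scene, zoomed_region])
        verifications[candidate] = json.loads(reply)
    return verifications
```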
Loop:
-----
For each {unsuccessfully_detected_RE} in {unsuccessfully_detected_REs}......
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}. You boxed all the already-detected GUI elements, as shown in the uploaded figure {VR_scene_under_analysis_with_visualized_detection_results}. Please check whether the detection missed {unsuccessfully_detected_RE}, i.e., whether {unsuccessfully_detected_RE} exists in the scene but has not been boxed with a bounding box.
Output in JSON format: {"unsuccessful_judgement (yes/no)": "explanation"}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {undetected_RE_IGE_candidates_reflection}.
-----
After iterations, we get an {undetected_RE_IGE_candidates_reflection} for each {unsuccessfully_detected_RE}, forming {undetected_RE_IGE_candidates_reflections}, which consists of the reflections on all unsuccessfully detected referring expressions.
After 3.(a).(2) and 3.(b).(1), we get {detected_RE_IGE_candidates_verifications} and {undetected_RE_IGE_candidates_reflections}.
Merging the verifications and reflections and reformulating the data per possible GUI element, we form {RE_IGE_candidates_judgements} for all possible GUI elements.
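One way the merge could be expressed, assuming the two loops above store their verdicts keyed by candidate (an assumption about the bookkeeping, not a prescribed data layout):
```python
def merge_judgements(verifications, reflections):
    """verifications: {candidate: {"successful_judgement (yes/no)": explanation}}
    reflections:    {candidate: {"unsuccessful_judgement (yes/no)": explanation}}
    Returns one judgement record per possible GUI element."""
    judgements = {}
    for candidate, verdict in verifications.items():
        judgements[candidate] = {"source": "detected", "verdict": verdict}
    for candidate, verdict in reflections.items():
        judgements[candidate] = {"source": "undetected", "verdict": verdict}
    return judgements
```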
You are a virtual reality game player. Currently, you are playing a VR game named {app_name}. Now in your field of view from the VR headset, you can see the current VR scenario. You want to find all GUI elements in the current VR scenario. In the last round, you thought there were {IGE_candidates}, and the detection and verification results are {RE_IGE_candidates_judgements}. Can you decide which GUI element candidates have already been successfully verified (candidates from both the successfully detected and the unsuccessfully detected groups count), and which GUI element candidates may need further checking and verification?
Output in JSON format: {"successfully_verified_GUI_elements": [], "GUI_elements_that_need_further_checking": []}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {successfully_verified_GUI_elements}, and {GUI_elements_that_need_further_checking}.
Initialize {validated_IGE_candidates} as an empty pool.
If {GUI_elements_that_need_further_checking} is not NULL:
-----
For {GUI_elements_that_need_further_checking}, go back to II.1 (Multi-Perspective Characteristics Mining) so that they are refined and iterated over again.
For {successfully_verified_GUI_elements}, append them to the pool {validated_IGE_candidates}, where they await the candidates from subsequent iterations.
-----
Else if {GUI_elements_that_need_further_checking} is NULL:
-----
For {successfully_verified_GUI_elements}, append them to the pool {validated_IGE_candidates}.
-----
After several iterations, when the LMM cannot find any GUI elements that need further checking, we obtain the complete {validated_IGE_candidates}.
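The control flow of this validation loop can be sketched as follows; `run_characteristics_mining` stands for re-running Module II.1 onward for the given candidates, and `decide` for the prompt that splits candidates into verified and to-be-rechecked groups. Both callables, and the round cap, are our assumptions.
```python
def iterate_until_validated(ige_candidates, run_characteristics_mining, decide, max_rounds=5):
    validated_ige_candidates = []      # initialize the pool as empty
    pending = list(ige_candidates)
    for _ in range(max_rounds):        # safety cap; conceptually, loop until nothing is pending
        if not pending:
            break
        judgements = run_characteristics_mining(pending)
        verified, need_checking = decide(pending, judgements)
        validated_ige_candidates.extend(verified)
        pending = need_checking
    return validated_ige_candidates
```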
--=-- Context and Task --=--
You are a virtual reality game player. Currently, you are playing a VR game named {app_name} with the following relevant {app_description_metadata}. Now in your field of view from the VR headset, you can see the current VR scenario: {app_GUI_scene}. In the current VR scenario, you find the following GUI elements: {validated_IGE_candidates}. Please perform reasoning to identify the user-interactable objects, i.e., objects with which users can interact through VR devices such as handheld controllers, in the current VR scene screenshot. Base your reasoning on the app description and the current VR scenario.
--=-- Chain-of-Thought Instructions and Demonstrations --=--
{CoT&demonstrations}
--=-- Output Format --=--
Output the interactable GUI elements in JSON format: {"elements": ["element1", "element2"]}. DO NOT OUTPUT any other content besides JSON.
💡 Output: {IGEs}.
Note: We randomly collect 30 samples from real-world applications. We analyze them and write chain-of-thought reasoning steps and results as demonstrations in the template for LLM. Orienter will randomly select three of them as demonstration examples during runtime.
💡 Final output: Interactable GUI elements with their semantics and locations.
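As a final illustration, the interactable elements returned by the last prompt might be paired with their referring expressions and detected bounding boxes like this (all identifiers and the mapping shapes are assumptions):
```python
def assemble_final_output(iges, referring_expressions, detected_re_locations):
    """iges: interactable element names; referring_expressions: element -> RE;
    detected_re_locations: RE -> bounding box (assumed shapes from the earlier modules)."""
    final_output = []
    for element in iges:
        referring_expression = referring_expressions.get(element)
        final_output.append({
            "semantics": element,
            "referring_expression": referring_expression,
            "location": detected_re_locations.get(referring_expression),
        })
    return final_output
```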