This report investigates an Artificial Intelligence (AI) Assistant paradigm as a means to enhance the accessibility and inclusivity of Information and Communication Technologies (ICTs) for Blind and Low Vision Individuals (BLVIs). Through longitudinal co-design workshops and prototype testing, key barriers to ICT use were identified, such as information overload, inefficient navigation, and high cognitive load. The AI Assistant paradigm, utilizing AI and Large Language Models (LLMs), aims to respond to these challenges by enabling dialogical interaction, curating information for relevance, and customizing output to individual needs. However, limitations in AI's understanding of spatial relations currently hinder accurate visual translation. Future work should focus on training AI/LLM interfaces to better interpret spatial relations, explore the use of multiple coordinated AI agents to address trust and bias issues, and refine natural language interactions for optimal information delivery. Despite the limitations, the AI Assistant paradigm presents a promising direction for making ICTs more accessible and inclusive for diverse user groups.
AI/LLM interfaces for ICTs provide the opportunity for dialogical interaction: interaction with the technology through successive prompts and responses that mimic human conversation. Through this kind of interaction, the controls, representations, descriptions, and methods of navigation in an ICT interface become refined and highly relevant, as parameters are adjusted in response to prompts that establish the user's preferences and details that the system can act on productively.
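The parameter-refinement loop described above can be sketched as follows. This is a minimal illustration; the class, parameter names, and trigger phrases are all assumptions for demonstration, not an existing API.

```python
# Hypothetical sketch: a dialogical session that refines description
# parameters as the user states preferences across successive turns.

class DialogSession:
    def __init__(self):
        # Parameters the system adjusts in response to user prompts.
        self.preferences = {"detail": "medium", "output": "speech"}

    def handle_prompt(self, prompt: str) -> dict:
        """Refine stored preferences from a user utterance and return
        the parameters that will shape the next response."""
        text = prompt.lower()
        if "more detail" in text:
            self.preferences["detail"] = "high"
        elif "less detail" in text:
            self.preferences["detail"] = "low"
        if "as text" in text:
            self.preferences["output"] = "text"
        return dict(self.preferences)

session = DialogSession()
session.handle_prompt("Give me more detail about this menu.")
session.handle_prompt("Show your responses as text, please.")
# Later turns inherit the refined parameters.
```

Because the session object persists across turns, each response is shaped by everything the user has established so far, which is what makes the interaction feel conversational rather than transactional.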
1. Curation of content in feedback loops for relevance
As preferences for the level of detail in descriptions vary significantly between users (due to individual differences), content must be curated in consistent feedback loops (e.g., in near real time, or through predictable phases of interactivity separated by gaps in time) to optimize accessibility and user experience. Without this curation capability, information overload is common; descriptions of UI elements (in picture form) must therefore allow the user to specify which details to include and which to exclude.
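A hedged sketch of this include/exclude curation follows; the element fields and function name are assumptions chosen for illustration, not part of any described system.

```python
# Illustrative sketch: curating a UI-element description so the user
# controls which details are spoken, preventing information overload.

def curate(element: dict, include=None, exclude=()) -> str:
    """Return a description containing only the details the user asked for."""
    parts = []
    for key, value in element.items():
        if include is not None and key not in include:
            continue  # user asked only for specific details
        if key in exclude:
            continue  # user asked to omit this detail
        parts.append(f"{key}: {value}")
    return "; ".join(parts)

button = {"role": "button", "label": "Submit", "color": "blue",
          "position": "bottom right"}

# A user overwhelmed by detail hears only role and label:
curate(button, include={"role", "label"})
# Another user wants everything except color:
curate(button, exclude={"color"})
```

In a real feedback loop, the `include` and `exclude` sets would be updated from the user's prompts rather than passed explicitly.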
2. Provide feedback on system status
Ensure that the system is responsive in providing information on possible actions (e.g., when a menu is present and can be opened to access more options) and on completed actions (e.g., when a feature has been activated). In a dialogical interface, this feedback should be delivered as speech unless there is an obvious alternative with other benefits (e.g., the user is holding a controller that can vibrate).
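The channel-selection logic above can be sketched simply. The channel names and routing rule are assumptions for illustration: speech is the default, with a haptic channel preferred when one is available.

```python
# Illustrative sketch: routing system-status feedback to speech by
# default, or to a non-speech channel when one with clear benefits
# is present (e.g., a controller that can vibrate).

def status_feedback(message: str, channels: set) -> tuple:
    """Pick the feedback channel for a status message."""
    if "haptic" in channels:
        # A short pulse can confirm a completed action without
        # interrupting ongoing speech output.
        return ("haptic", "short vibration: " + message)
    return ("speech", message)

status_feedback("Menu opened, 5 options available.", {"speech"})
status_feedback("Feature activated.", {"speech", "haptic"})
```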
The personalization and preferences outlined in the section above also apply to all perceptual feedback and information produced by the system. If a user prefers more detail in descriptions, prefers less, or considers specific information more important than the rest, the system must be capable of curating its output to reflect that priority. Some situations may call for combining AI/LLM interfaces with other paradigms.
1. Creation of relevant output for diverse audience needs via text and speech
Participants in our co-design workshops responded extremely positively when dialogical interaction with a “personal assistant” produced personalized descriptions of information based on their prompts. Thus, for example, the system should learn through interaction that the user is blind, and should automatically craft its speech output to be as “sensory-grounded” as possible in response, describing concrete perceptual information as specifically as needed whenever possible (e.g., directions in terms of “right” and “left”, spatial relations in “o’clock” format or as simple shapes the user can easily imagine). As systems become more tuned to preferences, they can adaptively customize information, as was observed during a testing session in which two participants (one blind and one with low vision) navigated the same virtual space but received different descriptions. The AI/LLM tool described sounds and tactile elements that could help the first participant navigate the space, and provided relevant visual information (i.e., high-contrast signs above a storefront) to the low vision participant.
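The two-participant session above can be sketched as profile-aware curation. The scene fields and profile labels below are assumptions modeled loosely on that session, not data from it.

```python
# Sketch of profile-aware description: the same scene yields different
# output depending on which details serve the user best.

def describe(scene: dict, profile: str) -> list:
    """Select the scene details most useful for a given user profile."""
    if profile == "blind":
        # Prioritize sound and tactile cues for navigation.
        return [scene["audio"], scene["tactile"]]
    if profile == "low_vision":
        # Prioritize salient, high-contrast visual information.
        return [scene["visual_high_contrast"]]
    return list(scene.values())

storefront = {
    "audio": "water fountain trickling at your 2 o'clock",
    "tactile": "carpeted path underfoot leading forward",
    "visual_high_contrast": "high-contrast sign above the storefront",
}
describe(storefront, "blind")       # sound and tactile cues
describe(storefront, "low_vision")  # relevant visual information
```

In practice the profile would itself be established dialogically, as described above, rather than passed as a fixed label.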
While AI/LLM interfaces are effective at translating visual information to text (even including contextual details, if properly trained to do so), numerous constraints on this approach suggest recommendations for achieving these outcomes successfully. Tools for producing these descriptions must be configured to capture visual information that an AI/LLM agent can decode for context, and effective capture of context through pictures often depends on how much of the content and context the tool can “see”.
Tools that can accomplish this task should not be regarded as literal replications of the human visual system. For visual translation, the information (pictures of an interface, a space, etc.) must be in a form that facilitates a complete translation of the visual content. For example, visual content (items or interface elements) must not be positioned so that elements obscure one another from the perspective of the “camera”, and the content must be positioned at a “descriptive angle” (e.g., the Statue of Liberty pictured from the front produces a salient description; when pictured from above, however, the AI/LLM agent used in prototype usability testing described it as a “crumpled sheet of aluminum foil”).
A key limitation of AI-based visual translation tools is the difficulty of providing the data for visual translation via an articulated model, as opposed to data laden with assumptions inherited from replicating the human visual perceptual system; AI visual translations tend to struggle with the angles, perspective, and context of pictures of real scenes. The use of virtual reality scenes may address this issue by allowing the system to access information about the objects in the scene that can be described more appropriately (so the Statue of Liberty can be described as such, even when viewed from a distance or from an angle that does not reveal the full figure). Generally speaking, the more the tool is able to articulate from this type of descriptive visual information (especially in the more controlled environment of a VR scene), the more accurate and usable the resulting descriptions will be.
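The advantage of articulated scene data can be made concrete with a small sketch. The field names below are assumptions: the point is that each object carries its identity and spatial relations explicitly, so the description does not depend on the camera angle.

```python
# Sketch: describing an object from scene metadata rather than pixels.
# A pixel-based tool misread a top-down view of the statue as foil;
# the scene record still identifies it unambiguously from any angle.

def describe_object(obj: dict) -> str:
    """Build a sensory-grounded description from articulated scene data."""
    return (f"{obj['name']}, about {obj['distance_m']} meters away "
            f"at your {obj['bearing']}")

statue = {"name": "the Statue of Liberty", "distance_m": 500,
          "bearing": "11 o'clock"}
describe_object(statue)
```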
By allowing users to provide their own speech as input, processed with natural language models, the system should give them the opportunity to ask questions to generate specific descriptions and obtain system status information (e.g., whether the microphone is on or off in a virtual meeting, where to navigate to find specific information, etc.). This type of input and interaction should be optimized for rapid clarification of prompts and commands (e.g., a system that quickly responds with statements such as “I don’t understand what you mean by [x], please specify” or “Did you mean that I should do [y] or [z]?”).
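The rapid-clarification pattern can be sketched as follows. The action names and matching rule are illustrative assumptions; the point is that ambiguous or unknown commands trigger an immediate follow-up question rather than a guess.

```python
# Illustrative sketch of rapid clarification for voice commands:
# zero matches asks the user to specify, multiple matches asks
# which was meant, and a single match is confirmed and executed.

def interpret(command: str, actions: list) -> str:
    matches = [a for a in actions if command.lower() in a.lower()]
    if not matches:
        return f"I don't understand what you mean by '{command}', please specify."
    if len(matches) > 1:
        options = " or ".join(matches)
        return f"Did you mean that I should {options}?"
    return f"OK, I will {matches[0]}."

actions = ["mute the microphone", "unmute the microphone",
           "open the chat panel"]
interpret("microphone", actions)          # ambiguous: asks which action
interpret("open the chat panel", actions) # unambiguous: confirms
```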
Control of systems via voice interaction also carries the benefit of freeing the user from the visual constraints of a keyboard or touchscreen. This can minimize delays in input and responses caused by being unable to see the letters on keys and having to forage through audio labels to confirm them, or by not perceiving visual prompts indicating where on a touchscreen, and in what direction, to swipe to reach the next step in an interaction.
The convergence of AI/LLM tools with visual representations in VR is somewhat awkward, intended as a medium-term solution to some of the pain points identified above (especially the lack of complete models or information available to visual translation tools, and the resulting risk of hallucinations). A more effective solution, not available as of the writing of this report, would be AI/LLM models extensively trained to interpret spatial relations in visual media. Most models at this time focus on analyzing the 2D topography of images and text and producing descriptions from that, rather than deciphering a 3D layout from a 2D image by recognizing perspective and distance cues, for example. Such training would free the tools from needing the reference point of a physical or virtual space (in pictures or VR interfaces), since spatial relations could then be inferred independently. It would also circumvent the pitfalls of replicating the human visual perception system for this purpose, such as the aforementioned obscuring of items viewed by the “camera” in line with one another, or a top-down view of an item producing an inaccurate description.
Another limitation of note concerns the concept of a “language navigable diagram”, which came up in the grocery store AI Assistant co-design workshop. While it was possible to navigate the store entirely through natural language interactions, without a model that actually understands and has translated a physical layout, with spatial relations reflecting real-world architecture and environments, the tool was prone to hallucinations. For example, when participants reviewed the description of the store’s layout, they noticed that no aisle was mentioned for toiletries. They asked the system where to find toilet paper, and the response was not “this store does not have toilet paper” or a similarly accurate answer based on a complete model; instead, it indicated a location for toilet paper that did not exist prior to the prompt: it invented an aisle in response. Further work is needed on building models that contain complete spatial relations and can interpret them accurately, and applications such as the grocery store could simply be built from an accurate inventory of the store from the beginning, to avoid hallucination issues and remain useful.
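The inventory-grounded approach recommended above can be sketched in a few lines. The data values are assumptions; the essential behavior is that a missing item yields an honest absence rather than an invented aisle.

```python
# Sketch of grounding location answers in a complete inventory, so the
# system reports honest absence instead of hallucinating an aisle.

def locate(item: str, inventory: dict) -> str:
    aisle = inventory.get(item.lower())
    if aisle is None:
        # The honest answer the workshop participants expected.
        return f"This store does not carry {item}."
    return f"{item} is in aisle {aisle}."

inventory = {"bread": 3, "milk": 7, "pasta": 5}
locate("toilet paper", inventory)  # honest absence, no invented aisle
locate("milk", inventory)
```

Because every answer is a lookup against the inventory rather than a generated continuation, the system cannot name a location that the model of the store does not contain.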