On-call application developers need tools to help resolve issues in large distributed software systems. The first priority is to restore service to customers so they can complete their tasks. The next step is to figure out what happened, so you can find the root cause and fix the issue.
Logs are a tried-and-true way to get insight into what happened with your application, but searching through hundreds, thousands, or millions of log lines can be daunting and often requires a user to learn a query language to do it efficiently.
Developers on call may be looking for the reason for a slow response, a downed service, or a software bug that made its way into the wild. Explore Logs provides a clickable UI that allows users to slice and dice their log data, find patterns, drill in, sort, and filter without incurring the overhead of learning a query language, making it possible for even non-engineers to find issues in logs.
Generative Research and Planning
We started this project with a high-level assumption about our users, based on evidence from our public forums and customer calls. We set out to understand who our user was (an Application Developer or Tech Supporter in a large company) and what they were trying to do (find the source of an error in the system), and to capture any pain points along the way.
My responsibility was to source evidence to validate or discredit our assumptions. I did this by using internal systems including, but not limited to, bug tracking databases and customer support transcripts. I also interviewed customer-facing Grafana employees, and explored our public community forum and Slack channel.
We were fortunate in that our own application developers fit the description of our target users. Developers at Grafana are responsible for monitoring their own code on cloud deployments and participate in the on-call rotation to respond to issues that arise in their systems.
Conducting bi-weekly focus groups with our internal dev and tech support groups helped me map workflows and understand the steps they go through to resolve an incident, often uncovering challenges with the tooling. These group meetings occurred throughout the design phase of the product.
An important part of the design journey was to establish a set of design principles that key stakeholders, developers, and users could all agree on. These principles became a measure for evaluating our design.
Queries are implicit and composed from interactions with the UI
Users should never start on an empty screen
Focus on Observability use cases, not BI, IoT, or Security use cases
Make it easy for users to transition to other telemetry in the system
An important part of the design process was benchmarking our apps against other apps. These might be competitor tools or apps that our users are familiar with and use daily. I watched many YouTube videos of our competitors' tools and installed developer tools to experience them firsthand.
Screenshots and explanations were presented on a Miro board gallery where team members could leave impressions on stickies.
I used Miro's beta feature "Talktrack" to record a guided tour of the Miro board for people I could not meet with in person.
Design Process
I assembled a cross-functional team for a 3-day design sprint, using Jake Knapp's Design Sprint template for Miro. It worked well. I led the team through exercises that would help them align around a common challenge, explore design alternatives, and decide what area of the workflow to improve.
We created a low-fidelity prototype using Google Slides and used internal users to test our initial designs. I even encouraged a willing senior manager to facilitate the usability study.
Design Iterations
Producing low-fidelity wireframes with Balsamiq, I was able to iterate over several design ideas quickly (I lost track after about 50). The Product Manager and I worked in 2-week sprints. We met with our focus group and stakeholders on alternating weeks. We utilized Slack to socialize and discuss design ideas mid-sprint. At the end of each sprint, we demoed our work to our stakeholders and set the direction for the next sprint. This turned out to be an effective way to move fast.
High Fidelity Mockups
After several rounds with low-fidelity wireframes, I translated design ideas into high-fidelity mockups using Figma. Figma is an effective tool for communicating and collaborating on designs with developers, but producing high-fidelity mockups can be slow. We prioritized which high-fidelity mockups I would produce so they would be in lock-step with development sprints. Not all designs needed to be translated into high fidelity. At Grafana, Product Designers are embedded in development squads and frequently need to redesign or pivot in response to technical constraints. Developers are always welcome to offer alternative design solutions. You never know where a good idea will come from.
This was a design iteration to tackle the issue of too many filters, or filter values that are too long.
A histogram showing log volumes needed to have several variations of tooltips.
Showing the distribution of common patterns recognized in log lines so that users can eliminate noise from their search.
Deployment and Follow-up Research
Explore Logs had gone through several rounds of internal reviews and testing, but we wanted to get some unbiased opinions. We ran a round of usability testing with external users, most of whom had never used Grafana.
The results were as expected: they validated our design choices and uncovered even more opportunities to build on the progress we had made.
Deployment and Measurement
Explore Logs was released in a private preview to 10 customers. We engaged with the account team and met with customers every week. I interviewed the users who had used the app firsthand. Establishing this feedback loop enabled me and the team to develop important relationships with users that will continue to help us in the future.
We later released Explore Logs in a public preview to all of our customers. We added an in-app Google survey that helped us establish a base of users I could reach out to for interviews and participation in future usability studies.
Lastly, I created a FullStory dashboard to capture insights on usage. Monitoring FullStory sessions and replaying them to developers was very enlightening.
It was fun working on this project from inception to launch, and it continues to be a highlight of my work week. It has helped me grow the most at Grafana because it requires a lot of communication and collaboration.
One fun and challenging aspect of the project was finding ways to communicate with a diverse group of stakeholders and developers, many of whom were located in other time zones around the globe. Finding overlapping time in calendars was nearly impossible. Creating short videos to communicate design intent and posting them in Slack channels was a productive way to overcome those scheduling challenges.
Share quickly and often, and overshare always. Many stakeholders were involved in this project.
Establish your feedback loop early. The fidelity of the design made no difference when we were trying to get feedback on concepts. Because we have a Design System in place, it was easy to translate low-fi into high-fi. Sticking to low-fi in the discovery phase allowed us to move fast.
Perfect is the enemy of done. The concepts and designs we produced would keep development busy for several release cycles. The Project Manager and Engineering Managers had no problem calling me out when I spent too much time polishing a design or skipped ahead too early to get to the next best thing.