A Redesign of Metaplane's Incidents
In a perfect world your data environment would never have issues. Systems would sync without interruption. Data would forever be complete and accurate. Your product team would always let you know before they deleted records or renamed columns. (Sorry, product team. You know it happens!)
Unfortunately, we live in a less-than-perfect world, and things like these do go wrong from time to time. Each time a Metaplane monitor spots a data issue (or several related issues at once), Metaplane creates an incident to help you understand the scope of the anomaly and resolve it as quickly and easily as possible.
Today we’re excited to announce that we’ve redesigned the incident experience from the ground up, incorporating feedback from our customers and research. Our improvements fall into three buckets: more context, better root cause analysis, and improved communication.
Context and actions where you need them most
When Metaplane surfaces a data anomaly, the last thing you want to do is dig for information about what’s going wrong. That’s why we’ve added a new Overview tab to the incidents page in Metaplane. The tab collects the most important information in a single place, including:
- the status of the incident
- which tables or columns are affected
- which tables, columns, or BI dashboards are downstream of the anomaly
Instead of going to the monitor page to dig into a monitor’s history, you can now click on any failing monitor to take a closer look at its recent observations without leaving the incident page. The same view also displays deeper context on each table or column, such as usage and lineage.
When a monitor fails as part of an incident, there are several things you might want to do about it. Perhaps the failure isn’t an anomaly at all, and that data point should be incorporated into the machine learning model for future observations. Perhaps the current state of your data is completely different from what it was when Metaplane initially trained its machine learning model, and you’d like to reset the model and train it on the new state of the world. Perhaps it turns out that the monitor just isn’t very useful to your team, and you’d like to disable it. All of those actions are now available on the incident page, with helpful guidance for when you might want to do each.
Better root cause analysis
Data anomalies are not always easy to diagnose. Sometimes the origin of an anomaly can be a mystery (and not the fun kind). When something is going wrong, especially if that data is important to stakeholders or used in critical business logic, it can be a scramble to identify what might have caused the issue in the first place.
That’s why Metaplane’s new Root Cause tab pulls together several pieces of data that might give insight into the origin of the incident. Metaplane collects recent pull requests that might be related to the affected tables and columns and displays them together. Similarly, Metaplane identifies and displays recent queries run in your data environment. Metaplane also pulls in similar incidents so that you can explore recent anomalies that might be related. Finally, other helpful pieces of context, such as upstream lineage and a timeline of the incident, have a new home on the Root Cause tab.
Keep everyone in the loop when things go wrong
For larger data teams, or data teams that frequently collaborate with stakeholders on data incidents, we noticed that communication and collaboration were key parts of incident resolution. Metaplane’s incident improvements help support that communication and collaboration both inside of the Metaplane app and outside of Metaplane in tools like Slack.
Based on our research, there’s one common piece of information that teams frequently want to communicate to their stakeholders: they’ve noticed the incident, and they’re working to fix it. Metaplane’s new Acknowledge feature does just that. It allows you to acknowledge the incident for a set period of time, which mutes future notifications for that incident and clearly displays who on the team acknowledged it and for how long.
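To make the acknowledge behavior concrete, here is a minimal Python sketch of the idea: an acknowledgement records who acknowledged and for how long, and notifications stay muted until that window expires. This is an illustration only, not Metaplane's implementation or API. The `Incident` and `Acknowledgement` classes, their fields, and the method names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Acknowledgement:
    """Hypothetical record of who acknowledged an incident and for how long."""
    user: str
    acknowledged_at: datetime
    duration: timedelta

    @property
    def expires_at(self) -> datetime:
        return self.acknowledged_at + self.duration


@dataclass
class Incident:
    """Hypothetical incident model, used only to illustrate the muting logic."""
    name: str
    acknowledgement: Optional[Acknowledgement] = None

    def acknowledge(self, user: str, duration: timedelta, now: datetime) -> None:
        # Record who acknowledged the incident and for how long.
        self.acknowledgement = Acknowledgement(user, now, duration)

    def should_notify(self, now: datetime) -> bool:
        # Notifications are muted only while an acknowledgement is active.
        ack = self.acknowledgement
        return ack is None or now >= ack.expires_at

    def status_line(self, now: datetime) -> str:
        # Summarize who acknowledged the incident and until when.
        ack = self.acknowledgement
        if ack is None or now >= ack.expires_at:
            return f"{self.name}: unacknowledged"
        return f"{self.name}: acknowledged by {ack.user} until {ack.expires_at:%H:%M}"
```

In this sketch, acknowledging an incident for two hours makes `should_notify` return `False` during that window and `True` again once it has passed, so a forgotten incident resurfaces on its own.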
We also noticed that the incident thread in Slack was a popular place for teams to troubleshoot and problem-solve. Because we believe information is most helpful when it’s at your fingertips, we’ve pulled those threaded conversations into the incident page in Metaplane. Now, you don’t need to go searching for your teammate’s helpful comment in Slack when you’re in Metaplane.
We’ve also made several additional improvements to make incidents more intuitive and more performant:
- If an incident has been marked as normal or acknowledged in Metaplane, we now sync that information to Slack.
- We’ve improved our machine learning models across the board to help ensure that they learn more quickly when you mark a data point as normal.
- We now group failing monitors more intuitively to prevent long-running incidents.
- Now, when a failing monitor is part of an incident, we provide a link to that incident from the monitor page.
- We’ve polished up the UI, including adding a new sidebar that gives more breathing room to important information.
- We’ve made performance upgrades so that incident loading is snappier and more reliable.
These features are live for every Metaplane account today. While we can’t live in a perfect world with no data issues (and if we find a portal to one, we’ll be sure to let you know), we hope that these improvements make resolving data issues a breeze.