November 12, 2021
First things first: SLA stands for Service Level Agreement. It is an agreement between a customer and a service provider that defines the level of service the customer can expect. An SLA includes SLOs (Service Level Objectives), the targets the provider commits to, and SLIs (Service Level Indicators), the metrics that measure specific aspects of performance against those targets. These metrics are usually tightly connected to business performance, since the primary goal of an SLA is to measure and maintain high service standards.
SLAs were first introduced by telecom companies back in the 1980s and were later adopted by giants like Google and Amazon. Companies like HubSpot went even further and popularized SLAs for data sharing between teams such as marketing and sales.
A Data SLA promises a certain level of data quality and a time frame for data delivery to the data consumers.
The key here is the data consumer. In general, there can be two types of data consumers — internal and external ones. In the first case, the data consumer is the internal team, such as marketing, finance, or sales. In cases where data is a result of a product or service, the receiver of the product or service is an external data consumer. In both scenarios, the data SLA is vital because it defines the expected quality of data.
Data SLAs help ensure that all decision makers have reliable and trustworthy data to direct their decisions. Moreover, proper health metrics ensure that data teams are the first to know if data breaks, and can stay confident that they provide top-quality data on their end.
A Data SLA is meant to remove ambiguity between data providers and data consumers. There are a few things to think through and iterate on before you kick off your first Data SLA:
1. Know your data consumers
Knowing and understanding your target audience is key for the business. This may sound like a marketing buzzword, but it works. Understanding the wants and needs of your data users will help you organize your workflow and prioritize the KPIs that matter most.
2. Define ‘Good’ data quality
Once you figure out who your data consumers are, find out what they mean by ‘good data’.
Ambiguity and uncertainty can create a lot of tension among stakeholders. Sitting down with data consumers and letting them share their expectations builds mutual trust and understanding among all parties concerned. Make sure you also go through what your consumers consider ‘bad’ data.
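Once ‘good’ and ‘bad’ are agreed on, it helps to write them down as explicit thresholds. Below is a minimal sketch of what that could look like, assuming the consumers care about freshness and completeness. The class `DataSlaTarget`, the function `check_sla`, and the threshold values are hypothetical names and numbers chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataSlaTarget:
    """Hypothetical thresholds agreed on with a data consumer."""
    max_staleness: timedelta    # how old the most recent load may be
    max_null_fraction: float    # share of missing values tolerated in a key column


def check_sla(target, last_loaded_at, null_count, row_count, now=None):
    """Return a list of violations; an empty list means the data counts as 'good'."""
    now = now or datetime.now(timezone.utc)
    violations = []

    staleness = now - last_loaded_at
    if staleness > target.max_staleness:
        violations.append(f"stale data: last load was {staleness} ago")

    # An empty table is treated as fully missing rather than as clean.
    null_fraction = null_count / row_count if row_count else 1.0
    if null_fraction > target.max_null_fraction:
        violations.append(f"too many nulls: {null_fraction:.1%} of rows")

    return violations


# Example: consumers expect data no older than 6 hours and at most 1% nulls.
target = DataSlaTarget(max_staleness=timedelta(hours=6), max_null_fraction=0.01)
last_load = datetime.now(timezone.utc) - timedelta(hours=8)
for problem in check_sla(target, last_load, null_count=50, row_count=1000):
    print(problem)
```

Writing the definition down this way gives both sides something concrete to review, and the same thresholds can later feed alerts.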
3. Secure your infrastructure
Identify what your infrastructure may be lacking to deliver what the data consumers expect. Let’s be honest, data outages can happen no matter what. A 99.999% data uptime SLA leaves room for only about 5 minutes of downtime in an entire year. To achieve that, you’ll probably need more people working round the clock, more servers, more memory, and so on.
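The downtime budget implied by an uptime target is simple arithmetic, and it is worth running the numbers before committing to one. Here is a minimal sketch, assuming a non-leap year of 365 days; the function name `allowed_downtime_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, implied by an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_minutes * (1 - uptime_percent / 100)


# Compare a few common targets side by side.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows {minutes:.2f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why the cost of meeting the SLA climbs so quickly.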
4. Know your data owners
You might not have to make it public, but you definitely need to know where data is coming from and who will be in charge if things happen to fall apart. Remember, “anything that can go wrong will go wrong.” People resign, get sick, go on vacations. Just be ready for that.
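One lightweight way to be ready is to keep an ownership record per dataset that names a source, a primary owner, and backups. The sketch below is only an illustration: `DatasetOwnership`, `responsible_owner`, and all names in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DatasetOwnership:
    """Hypothetical record of where a dataset comes from and who answers for it."""
    dataset: str
    source: str
    primary: str
    backups: List[str] = field(default_factory=list)


def responsible_owner(ownership: DatasetOwnership, unavailable: Set[str]) -> Optional[str]:
    """Return the first available person in escalation order, or None if nobody is."""
    for person in [ownership.primary, *ownership.backups]:
        if person not in unavailable:
            return person
    return None


orders = DatasetOwnership(
    dataset="orders_daily",
    source="checkout service export",
    primary="alice",
    backups=["bob", "carol"],
)

# Alice is on vacation, so the issue falls to the first backup.
print(responsible_owner(orders, unavailable={"alice"}))
```

A `None` result is the case to watch for: it means a dataset has no one left to answer for it.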
5. Track data issues with a ticket management system
This is one of the best practices you can borrow from engineering and DevOps teams. Naturally, Data SLA issues should be the #1 priority in the issue management process.
Note: if Data SLA tickets end up anywhere but No. 1 on your priority list, it is a clear sign that something is wrong with the metrics you use to measure data quality.
6. Check the level of service the data team relies on
The best example here would be the use of AWS Spot Instances. Although they are much cheaper than on-demand cloud instances, AWS can reclaim them at any moment with only a two-minute interruption notice, leaving you with very little time to react.
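A quick way to see why upstream service levels matter: if every step of a pipeline must be up for data to arrive, the pipeline can be no more available than the product of its dependencies' availabilities. The sketch below assumes serial dependencies that fail independently, which is a simplification. The availability figures are hypothetical and are not published numbers for any real service.

```python
from math import prod


def composite_availability(dependency_availabilities):
    """Availability of a chain of serial dependencies, assuming independent failures."""
    return prod(dependency_availabilities)


# Hypothetical availabilities for storage, compute, and an upstream API.
dependencies = [0.9999, 0.999, 0.995]
overall = composite_availability(dependencies)
print(f"Best achievable pipeline availability: {overall:.4%}")
```

The result is always lower than the weakest link, so a Data SLA stricter than what the underlying services offer cannot be met without adding redundancy.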
7. Set up alerts on metrics for data SLA
Having alerts is vital. There are many tools that can alert data teams to data issues and help manage those alerts, including paid tools like PagerDuty and free ones like Grafana. Just make sure you pack your alerts with as many details as possible: what broke, where, by how much, and who owns it. This will help resolve the problem faster.
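As an illustration of a detail-rich alert, the sketch below builds a plain JSON payload. It does not call any real alerting API. The function `build_alert`, the field names, and the runbook URL are all hypothetical and would need to be mapped to whatever format your alerting tool accepts.

```python
import json
from datetime import datetime, timezone


def build_alert(dataset, metric, observed, threshold, owner, runbook_url, detected_at=None):
    """Assemble a self-explanatory alert payload (field names are illustrative)."""
    detected_at = detected_at or datetime.now(timezone.utc)
    return {
        "title": f"Data SLA breach: {metric} on {dataset}",
        "dataset": dataset,
        "metric": metric,
        "observed": observed,
        "threshold": threshold,
        "owner": owner,
        "runbook_url": runbook_url,
        "detected_at": detected_at.isoformat(),
    }


alert = build_alert(
    dataset="orders_daily",
    metric="freshness_hours",
    observed=8.0,
    threshold=6.0,
    owner="data-platform-team",
    runbook_url="https://wiki.example.com/runbooks/orders-freshness",  # placeholder URL
)
print(json.dumps(alert, indent=2))
```

Including the observed value next to the threshold, along with the owner and a runbook link, lets whoever receives the alert start working on it without first hunting for context.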
8. Have an incident response playbook for your data team
Make sure that everyone involved has a comprehensive understanding of how to address and mitigate any data quality issue that may arise. This will help you avoid situations when everyone assumes someone else will take care of the problem.
9. Communicate the system status to data users
Make sure you know which communication channels work best for your data consumers. Netflix, for instance, uses Twitter as a customer support channel. Keep in mind that notification channels vary for each business and each data consumer segment. Communicating issues proactively will save you a lot of time otherwise spent answering questions about what happened to a data consumer’s dashboard.
10. Make your Data SLA public for the data consumers
Once you align all metrics and define the main points of your Data SLA, publish it on your company’s wiki or on your website and hold yourself to it.
Note: publishing the Data SLA does not on its own create team commitment to it. Make sure to engage with everyone who has a stake in the success of your agreement. Brainstorming, sharing concerns, and understanding what is vital for the business creates buy-in from both stakeholders and the team. Everyone on the team should be on the same page regarding the importance of the Data SLA, which in turn drives motivation to follow it.
Finally, treat your Data reliability SLA as a marathon, not a sprint. Naturally, the system around data evolves and transforms, and so does the data produced by this system. An SLA is meant to serve the business and not vice versa. As businesses shift their priorities, their data organization structure is also bound to change, which means they need to review and adjust their SLAs to accommodate the change. Make sure your Data SLA metrics are well maintained and aligned with the business priorities.