There’s endless debate about what constitutes real-time versus micro-batch versus batch processing. I used to think I understood real-time data—working with clients who needed data inserted within 2 minutes and were perfectly fine with anomaly detection over 15-minute windows. But then we started working with an organization handling telemetry data, and that experience completely transformed my perspective on streaming.
This organization has a typical enterprise stack in place: Google Cloud, BigQuery, Cloud Run, Pub/Sub—the whole nine yards, a stack we’re very familiar with at Masthead. But here’s the real challenge: we’re talking about ingesting 30,000 events per table in BigQuery every single minute, totaling over 2 million events in 24 hours across dozens of tables. And here’s the critical part—they needed genuine real-time anomaly detection to identify issues within that same minute, whether it was duplicate data, dropped events, or a complete interruption in data ingestion.
Sounds straightforward at first glance, right? But here’s the complexity: each table receives data every 20 seconds (give or take), and every single table has its own unique data patterns with different expected ranges. This meant we had to completely reimagine our algorithm to handle these much shorter intervals and adapt rapidly to data patterns that are far more volatile and demanding than those of batch insertion.
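To make that concrete, here is a minimal sketch (in Python, not Masthead’s actual algorithm) of what a per-table adaptive baseline over ~20-second intervals can look like. The rolling window size and the median/MAD band are illustrative assumptions.

```python
from collections import defaultdict, deque
import statistics


class PerTableBaseline:
    """Illustrative rolling baseline: each table keeps its own recent history
    of per-interval row counts, so expected ranges adapt per table."""

    def __init__(self, window: int = 90, k: float = 4.0):
        # ~90 intervals of ~20 seconds is roughly the last 30 minutes.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.k = k  # width of the tolerated band, in MAD units (assumed)

    def observe(self, table: str, rows_inserted: int) -> bool:
        """Record one interval's row count; return True if it looks anomalous."""
        hist = self.history[table]
        anomalous = False
        if len(hist) >= 15:  # wait for a few minutes of history per table
            median = statistics.median(hist)
            mad = statistics.median(abs(x - median) for x in hist) or 1.0
            anomalous = abs(rows_inserted - median) > self.k * mad
        hist.append(rows_inserted)
        return anomalous
```

Because every table carries its own history, a quiet lookup table and a firehose telemetry table end up with different expected ranges without any manual configuration.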
Now, you might say, “It’s just an algorithm—what’s the big deal?” But the real challenge wasn’t just the algorithm—it was the speed at which Masthead needed to receive, process, and visualize the data. The only reason we could support this client’s needs was Masthead’s unique architecture. Real-time anomaly detection requires understanding data behavior patterns as they happen, not minutes later. This is where Masthead shines: we built it more like traditional software observability tools, triggered by logs rather than by queries.
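As an illustration of what “triggered by logs” can look like on Google Cloud—a sketch under assumptions, not Masthead’s implementation: a Cloud Logging sink routes BigQuery audit log entries to a Pub/Sub topic, and a subscriber counts inserted rows per table as the entries arrive. The subscription name and the exact audit-log field paths below are assumptions and vary by ingestion method.

```python
import json
from google.cloud import pubsub_v1

# Assumption: a Cloud Logging sink routes BigQuery audit log entries
# (BigQueryAuditMetadata) to this Pub/Sub subscription.
SUBSCRIPTION = "projects/my-project/subscriptions/bq-audit-logs"  # placeholder


def on_log_entry(message: pubsub_v1.subscriber.message.Message) -> None:
    entry = json.loads(message.data)
    # Field paths are indicative; they differ between the Storage Write API
    # and legacy streaming inserts.
    metadata = entry.get("protoPayload", {}).get("metadata", {})
    rows = int(metadata.get("tableDataChange", {}).get("insertedRowsCount", 0))
    table = entry.get("protoPayload", {}).get("resourceName", "unknown")
    if rows:
        # Feed a per-table baseline (like the sketch above) the moment the
        # log entry arrives, instead of querying the warehouse after the fact.
        print(f"{table}: +{rows} rows")
    message.ack()


subscriber = pubsub_v1.SubscriberClient()
streaming_pull = subscriber.subscribe(SUBSCRIPTION, callback=on_log_entry)
streaming_pull.result()  # blocks; call streaming_pull.cancel() to stop
```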
This approach enables us to deliver streaming pipeline anomaly detection that can actually keep up with intensive telemetry streams. Consider this: it’s virtually impossible to maintain that detection speed with SQL-first data observability tools that query INFORMATION_SCHEMA to understand ingestion rates—polling simply won’t work for real-time operations.
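For contrast, the polling approach looks roughly like this (dataset and table names are placeholders). Every check is a full query job that takes seconds to schedule and run, and the metadata view may not yet reflect rows still in the streaming buffer, so minute-level detection is structurally out of reach.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Polling table metadata for ingestion rates: each check is a query job,
# and the view can lag behind streaming inserts.
POLL_QUERY = """
SELECT table_name, SUM(total_rows) AS total_rows
FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`  -- placeholder
GROUP BY table_name
"""

for row in client.query(POLL_QUERY).result():
    print(row.table_name, row.total_rows)
```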
The best part? The algorithm works completely automatically. Clients don’t need to provide any expected thresholds—we extract everything we need directly from the logs. And we deliver results within just 5-6 minutes of a streaming pipeline starting to run: thresholds are identified automatically, and alerts go out to Slack whenever inserted data falls out of range.
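Sketching just that last step (the webhook URL is a placeholder; the real thresholds come from the log-derived history, as above): once an interval’s count falls outside the learned range, a Slack message goes out.

```python
import json
import statistics
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def learn_range(first_counts: list[int], k: float = 4.0) -> tuple[float, float]:
    """Bootstrap an expected range from the first few minutes of observed counts."""
    median = statistics.median(first_counts)
    mad = statistics.median(abs(c - median) for c in first_counts) or 1.0
    return median - k * mad, median + k * mad


def alert(table: str, observed: int, low: float, high: float) -> None:
    text = (f":rotating_light: {table}: {observed} rows/interval, "
            f"outside expected range [{low:.0f}, {high:.0f}]")
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```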
All in all, for the data team building and supporting ingestion at this scale, it was important to know exactly what failed so they could troubleshoot it in a timely manner. Otherwise, they would have to spend additional time fetching the data after the streaming event, which is not always possible. Moreover, reingesting streaming data is not a trivial task: it typically means extra work to ensure there are no duplicates and no missing records.
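To illustrate why reingestion is painful, here is the kind of cleanup it tends to require after a replay (table and column names are placeholders; the deduplication key depends entirely on the schema):

```python
from google.cloud import bigquery

client = bigquery.Client()

# After replaying a window of events, duplicates have to be removed without
# dropping legitimate repeats; event_id and ingestion_time are assumed columns.
DEDUP_QUERY = """
CREATE OR REPLACE TABLE `my_project.my_dataset.events_clean` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY event_id          -- assumed unique business key
           ORDER BY ingestion_time DESC   -- keep the latest copy
         ) AS rn
  FROM `my_project.my_dataset.events_raw`
)
WHERE rn = 1
"""

client.query(DEDUP_QUERY).result()
```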
Another crucial consideration is visualization and troubleshooting—after all, building a UI that can smoothly handle thousands of events isn’t trivial. Even the most sophisticated algorithm loses value if we can’t communicate its insights clearly and effectively to clients. That’s why we developed dynamic visualizations showing raw data insertion, displaying both individual data points and aggregated values at each step. This helps data teams spot anomalies in real time, prevent potential issues, and maintain robust data reliability.

In conclusion, if your team has ever been stuck troubleshooting incorrect streaming data insertions (and let’s face it, who hasn’t?), we can ensure that never happens again.
No more valuable time lost building monitoring systems or wrestling with data reprocessing—which we all know can be a significant operational challenge.