Industry insight 16 March 2026 · 7 min read

The economics of fault-detection latency: turning days into minutes

Every hour a string sits offline has a price. The interesting question isn't whether your O&M tool catches faults — it's how quickly. The arithmetic is steeper than most operators expect.

Josh Powell
Founder, InspireGreen

There’s a number that O&M conversations rarely put on the table: time-to-detection. The marketing material on most platforms talks about whether they catch faults. The interesting question is how long they take to catch them.

The arithmetic on this is steeper than most operators expect. We’ve been collecting the figures for our own fleet for the last eighteen months, and the gap between a platform that catches a fault in fifteen minutes and one that catches it in fifteen days is, financially, larger than the platform’s annual cost by a comfortable margin.

Worth thinking about in some detail.

A simple model

Pick a representative UK rooftop: 500 kWp, decent orientation, average UK irradiance profile, roughly 450 MWh of annual generation. Assume an average export rate of about 11p/kWh — somewhere in the middle of what’s typical across UK C&I export arrangements in 2026 between fixed PPAs, half-hourly export and self-consumption displacement.

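A fault that takes out 5% of generation removes about 22.5 MWh a year, which costs the owner somewhere between £6.20 and £6.80 per day on average, depending on the exact export rate; the per-fault figures below use £6.20. In peak summer that figure rises to closer to £10 per day; in deep winter it drops below £3.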
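The daily figure is just the annual loss spread evenly over the year. A minimal sketch of that arithmetic, using the rounded inputs quoted above (the function name is ours, for illustration only):

```python
# Inputs taken from the representative site described above (all rounded).
ANNUAL_GENERATION_KWH = 450_000   # roughly 450 MWh a year
EXPORT_RATE_GBP_PER_KWH = 0.11    # about 11p/kWh
FAULT_LOSS_FRACTION = 0.05        # the fault removes 5% of generation


def average_daily_loss(annual_kwh: float, rate_gbp: float, loss_fraction: float) -> float:
    """Average revenue lost per day, assuming the loss is spread evenly over 365 days."""
    return annual_kwh * loss_fraction * rate_gbp / 365


daily_loss = average_daily_loss(
    ANNUAL_GENERATION_KWH, EXPORT_RATE_GBP_PER_KWH, FAULT_LOSS_FRACTION
)
print(f"Average loss: £{daily_loss:.2f} per day")
```

With these rounded inputs the result comes out slightly under £7 a day, in the same range as the figure above. The flat average hides the seasonal swing, so treat it as an order-of-magnitude check rather than a forecast.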

Now consider three detection scenarios:

| Scenario | Time to detect | Loss per fault |
| --- | --- | --- |
| Real-time, derived detection | Under 30 minutes | < £0.50 |
| Daily report + portal check | 24-72 hours | £6-£19 |
| Monthly review or quarterly site walk | 14-90 days | £87-£560 |

Those are per-fault figures. The portfolio question is how many faults per site per year, and that depends on site age, install quality and luck. From our own data, somewhere between 1.5 and 3 detectable faults per site per year is a reasonable working figure for a fleet over three years old. A “fault” here means anything from a tripped string to a degraded optimiser to an inverter derating event; not all of them are catastrophic.

For a portfolio of forty sites at that fault rate, the difference between the first and third row of that table is between roughly £5,200 and £67,000 a year in unrecovered generation. Same fleet. Same sites. Different tooling.
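For anyone who wants to rerun the portfolio figure with their own numbers, here is a rough sketch. It takes the per-fault losses from the table and the fault rate from our fleet, and treats the real-time loss as its £0.50 upper bound. The function and variable names are illustrative.

```python
SITES = 40
FAULTS_PER_SITE_PER_YEAR = (1.5, 3.0)    # low and high working figures
LOSS_PER_FAULT_MONTHLY = (87.0, 560.0)   # GBP, 14 to 90 days to detect
LOSS_PER_FAULT_REALTIME = 0.50           # GBP, upper bound for sub-30-minute detection


def annual_gap(sites: int, faults_per_site: float, slow_loss: float, fast_loss: float) -> float:
    """Extra unrecovered revenue per year caused by the slower detection process."""
    return sites * faults_per_site * (slow_loss - fast_loss)


low = annual_gap(SITES, FAULTS_PER_SITE_PER_YEAR[0],
                 LOSS_PER_FAULT_MONTHLY[0], LOSS_PER_FAULT_REALTIME)
high = annual_gap(SITES, FAULTS_PER_SITE_PER_YEAR[1],
                  LOSS_PER_FAULT_MONTHLY[1], LOSS_PER_FAULT_REALTIME)
print(f"Annual gap: £{low:,.0f} to £{high:,.0f}")
```

Swap in your own site count, fault rate and per-fault losses; the range moves linearly with each of them.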

Why the gap is so wide

There’s nothing magical about the difference. It’s the compounding effect of three things:

  1. Sub-daily faults are invisible to daily checks. A string that fails at 11am goes unseen until the 9am review the next morning. By the time anyone spots it, it has been bleeding revenue for 22 hours.
  2. The portal-level traffic-light is too coarse. A 5% loss is well inside what a site-level green/amber/red indicator considers normal. Operators relying on the portal’s own dashboard for triage will routinely miss faults that the underlying data could surface in minutes.
  3. Monthly reviews catch the trend, not the event. By the time a fault shows up as a PR shortfall in a monthly report, it has, by definition, been live for most of a month. The detection is technically happening; the financial damage is already done.

Each of these compounds the others. If your platform reports daily, your portal is coarse, and your reviews are monthly, you are running on a three-week effective detection latency even for faults the data could surface immediately.

What “minutes” actually requires

Sub-hour fault detection isn’t a vendor-marketing claim — it’s a data-pipeline constraint. To detect a fault in minutes, four things have to be true:

  • Polling has to be frequent. A platform that polls inverters once an hour cannot detect faults faster than its polling interval. We poll Solis Cloud every 5 minutes and SolarEdge every 15, because that’s the most aggressive interval each upstream API permits without rate-limiting. That upstream limit is a floor for every platform: SolarEdge’s standard monitoring tier permits a 15-minute interval, so no vendor can surface a SolarEdge fault faster than that without access to a higher API tier. It’s worth understanding where any quoted detection figure sits relative to that limit.
  • Detection has to run on each poll. Pulling the data is only half the work; the rules have to fire against it. We run six detection rules (string drop, persistent zero, partial offline, comms stale, inverter offline, optimiser drop) on every polling tick.
  • The detection has to be string-aware. Site-level rules can’t catch sub-array faults. The platform has to be comparing strings against their siblings, MPPTs against the inverter mean, and inverters against the site mean.
  • The output has to reach a human or an automated action within minutes. A fault detected at 11:04 that hits a queue nobody looks at until Friday is no better than monthly review.

The last point is where we see most platforms fall down. Detection is technically immediate but operational latency is days. The chain from “the data is in” to “a human acts” is what needs measuring, not just the data-pipe.
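To make “string-aware” concrete, here is a minimal sketch of a sibling comparison for one inverter on one polling tick. It is not our production rule: the function name, the 50% drop ratio and the 200 W low-light cutoff are all assumptions chosen for illustration.

```python
from statistics import median


def find_string_drops(string_power_w: dict[str, float],
                      drop_ratio: float = 0.5,
                      min_sibling_w: float = 200.0) -> list[str]:
    """Return the strings producing well below the median of their siblings.

    drop_ratio and min_sibling_w are illustrative thresholds, not tuned values.
    """
    flagged = []
    for name, power in string_power_w.items():
        siblings = [p for other, p in string_power_w.items() if other != name]
        if not siblings:
            continue  # a lone string has nothing to be compared against
        baseline = median(siblings)
        if baseline < min_sibling_w:
            continue  # too little light to judge (night, heavy cloud)
        if power < drop_ratio * baseline:
            flagged.append(name)
    return flagged


# One polling tick of per-string DC power, in watts (made-up readings).
reading = {"s1": 3100.0, "s2": 3050.0, "s3": 40.0, "s4": 2980.0}
print(find_string_drops(reading))  # ['s3']
```

Comparing against the sibling median rather than a fixed wattage means the check follows irradiance automatically: when a cloud dims every string together, none of them is flagged.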

The market is moving on this

There’s an emerging consensus across the UK O&M community that detection latency is the metric that actually distinguishes platforms. The conversation has shifted from “do you catch string faults” — most credible platforms now do — to “what’s your mean time to detection”. We expect to see that figure on more procurement RFPs in 2026 than it appeared on in 2025.

Solar Energy UK’s reporting on commercial O&M (solarenergyuk.org) increasingly emphasises operational metrics over headline capacity numbers. The industry knows that “70 GW of UK PV by 2035” depends on the existing fleet being maintained well, and “maintained well” is increasingly defined in terms of detection latency rather than annual inspection box-checking.

How to actually measure this on your own fleet

If you’re an operator wondering where your platform sits on this spectrum, the question to ask is straightforward: pick the last five real faults you found and write down the timestamps. When did the fault start? When did your platform raise a case? When did a human act on it?

We do this internally every quarter. Each fault gets three timestamps: fault start (inferred from telemetry), case opened (when the platform raised it), and case acknowledged (when an engineer first acted). The two intervals — start-to-open and open-to-acted — are the two latencies that matter.

The mechanics of start-to-open are knowable in advance. On SolarEdge the upstream API exposes 15-minute granularity, so a string-level fault is detectable within one or two polling windows of when it starts. On Solis the cloud polls more aggressively and our adapter pulls every five minutes, so the same fault is detectable sooner. Open-to-acted is a human variable: whoever is on-call, whatever else they’re doing, whether the case landed in business hours. The platform can only do its half.

You can measure both. A platform that can’t tell you what its own detection latency is on your own fleet is one that hasn’t been built to be measured.
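If the three timestamps are sitting in a spreadsheet, the two latencies fall out with a few lines of code. A rough sketch, assuming a hypothetical `FaultRecord` structure and made-up example timestamps:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class FaultRecord:
    fault_start: datetime        # inferred from telemetry
    case_opened: datetime        # when the platform raised the case
    case_acknowledged: datetime  # when an engineer first acted


def median_latencies_minutes(records: list[FaultRecord]) -> tuple[float, float]:
    """Return (median start-to-open, median open-to-acted), both in minutes."""
    start_to_open = [
        (r.case_opened - r.fault_start).total_seconds() / 60 for r in records
    ]
    open_to_acted = [
        (r.case_acknowledged - r.case_opened).total_seconds() / 60 for r in records
    ]
    return median(start_to_open), median(open_to_acted)


# Illustrative records only; these are not real faults from any fleet.
records = [
    FaultRecord(datetime(2026, 3, 2, 11, 4), datetime(2026, 3, 2, 11, 20),
                datetime(2026, 3, 2, 13, 20)),
    FaultRecord(datetime(2026, 3, 5, 9, 0), datetime(2026, 3, 5, 9, 30),
                datetime(2026, 3, 5, 9, 50)),
    FaultRecord(datetime(2026, 3, 9, 14, 10), datetime(2026, 3, 9, 14, 20),
                datetime(2026, 3, 10, 9, 20)),
]
detect_min, act_min = median_latencies_minutes(records)
print(f"start-to-open: {detect_min:.0f} min, open-to-acted: {act_min:.0f} min")
```

The median is used rather than the mean so that one case left overnight doesn’t swamp the figure, though it is worth looking at the worst case separately for exactly that reason.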

The flip side: alert fatigue

There’s a real failure mode to sub-hour detection, and it’s worth naming. If the platform raises every minor blip as a case, the human end of the pipeline collapses. We’ve all worked with monitoring systems that emit a dozen “warning” emails a day; the result is that nobody reads any of them, and a genuine fault gets buried.

So the platform also has to be careful about what it raises. Our rule of thumb: a detection should fire only when the data is consistent with a real, physically meaningful, actionable fault. Not when a single five-minute sample looks odd. Not when a cloud passes. Not when an inverter clips on a hot afternoon.

That’s why our detections require a fault state to persist across at least two polling intervals, why we weather-normalise the comparisons, and why we suppress detections during expected derating events. The point isn’t to maximise alerts. It’s to minimise the time between a real fault and a real human seeing it.
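The persistence requirement is the easiest of those three to show. A minimal sketch, assuming a hypothetical `Debouncer` class keyed by string or inverter ID; weather normalisation and derating suppression are deliberately left out.

```python
class Debouncer:
    """Confirm a fault only after it has been seen on consecutive polls."""

    def __init__(self, required_polls: int = 2):
        self.required_polls = required_polls
        self._streak: dict[str, int] = {}  # consecutive faulty polls per key

    def update(self, key: str, fault_seen: bool) -> bool:
        """Record one poll; return True only on the poll that confirms the fault."""
        if not fault_seen:
            self._streak[key] = 0  # any healthy poll resets the streak
            return False
        self._streak[key] = self._streak.get(key, 0) + 1
        # Fire exactly once, on the poll where the streak reaches the threshold.
        return self._streak[key] == self.required_polls


debouncer = Debouncer(required_polls=2)
polls = [True, False, True, True, True]  # raw per-poll detections for one string
raised = [debouncer.update("s3", seen) for seen in polls]
print(raised)  # [False, False, False, True, False]
```

The single odd sample on the first poll never raises a case, and the sustained fault raises exactly one. The cost is one extra polling window: with a 15-minute interval and two required polls, a fault that starts just after a poll is confirmed up to about 30 minutes later, and with a 5-minute interval up to about 10.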

What this means for procurement

If you’re scoping an O&M platform, three questions get to the heart of the matter:

  1. What is your polling frequency for SolarEdge / Solis / [your inverter brand]? Ask for a specific number, and ask how it relates to the upstream API’s own limit — that tells you the real floor on detection speed.
  2. What is the median time-to-detection for a string-level fault, and can you show it on my own fleet? A useful answer is a measured range with the methodology behind it, not a single headline figure.
  3. What is the rate of false positives? A platform that detects everything in minutes but raises ten false fires per site per week has solved the wrong problem. Ask how detections are debounced and weather-normalised.

We’ve written about string-level fault detection in more technical detail. The economics here sit on top of that detection logic.

The bit nobody wants to say out loud

A lot of the financial argument for sub-hour detection assumes the operator will actually do something quickly once the alert fires. If the case opens at 11:04 and sits in a queue for nine days because the engineering team is busy, the platform’s detection latency didn’t help anyone.

The platform and the operations have to move together. A faster platform with a slow operation is wasted money. A fast operation with a slow platform is doing heroic work to compensate for the wrong tooling.

The right answer is to align both. Detect in minutes. Act within business hours. Resolve within the SLA. And track the time at each stage so you can talk about it honestly with the owner who’s paying the bill.




Josh runs InspireGreen, a solar installer based in Cardiff, and builds SolarFleet — the O&M platform we use to monitor our own sites. Most posts here come straight out of the work: a case we dealt with, a feature we shipped, or a thing we wish we'd known earlier.