By bert hubert / firstname.lastname@example.org
This post is an excerpt of a far longer post on Galileo, its structures and the cause of the outage. Here we’ll only focus on the outage - the potential underlying reasons behind it are described in the full article.
Since the week-long outage in July I’ve been fascinated by Galileo and, together with a wonderful crew of developers, experts and receiver operators, have learned so much about what I now know are called ‘Global Navigation Satellite Systems’ or GNSS. This has led to the galmon.eu project, which monitors the health and vital statistics of GPS, Galileo, BeiDou and GLONASS. More about the project can be read in the full article.
Earlier I wrote a technical summary in a post called GPS, Galileo & More: How do they work & what happened during the big outage? which dwells more on what the outage looked like. This article is on what went wrong and the problems behind the outage.
The daily work of running a GNSS constellation
Running Galileo is 24/7 work. As explained in my earlier technical article, for a GNSS to function, the performance of the atomic clocks in space must be modeled and tracked closely. Simultaneously, the precision on earth will never be better than how precisely we know where the satellites in space are.
Extremely simplified, a GNSS satellite is a machine that beeps exactly once a second, while broadcasting its orbit and atomic clock to decimeter & nanosecond precision.
Achieving such precision is hard work - if the atomic clock of a satellite is behind by 3 nanoseconds, this can look like it being 1 meter further away from the receiver. So which is it? Is the orbit not known precisely enough or is the clock off? Or, heaven forbid, has the timing facility on earth lost the plot - making everything look wrong?
Note: this is what actually happened during the week-long outage; more details below.
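The "3 nanoseconds looks like 1 meter" rule of thumb follows from the speed of light. A minimal sketch of that conversion, where the function name is made up for illustration:

```python
# Speed of light in vacuum, in meters per second (exact by SI definition).
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def clock_error_to_range_error_m(clock_error_ns: float) -> float:
    """Apparent range error in meters caused by a clock error in nanoseconds."""
    return clock_error_ns * 1e-9 * SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    for ns in (1, 3, 10):
        print(f"{ns} ns -> {clock_error_to_range_error_m(ns):.2f} m")
```

A 3 ns clock error works out to roughly 0.9 m, which the text rounds to 1 meter.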
Determining the interlocking behaviour of constellation orbits, space clocks and ground clocks requires iteration & convergence - after an extended bit of the orbit it becomes clear which is off more, the clock or the ‘ephemeris’ that describes the orbit.
To achieve the very high accuracy Galileo has promised requires uploading a new ephemeris frequently, as often as a few times per hour.
Currently we often see Galileo satellites broadcast that their position is not known precisely. When we later compare the orbit and clock performance against what independent observers measured, we find that the satellites were mostly just fine - it was something on the ground that lost track of what was going on, leading the satellite to be declared ‘No Accuracy Prediction Available’ (NAPA), effectively disabling its use by receivers (like phones and cars).
The current status of Galileo is not that great. There are 26 satellites in space, 2 of them in botched (elliptical) orbits, 2 of them unavailable, 1 unavailable until further notice. This makes for 21 functional satellites, which is less than the “healthy” number of 24. Anything below 24 causes gaps in coverage, or at least, areas where the positioning accuracy will not be good enough.
As NASA, ESA and others are fond of saying, “space is hard”, and they are not wrong. Space is a harsh environment. GNSS vehicles consist of very delicate components that all have to work just right, and from time to time there are problems. This means that lately, on average, around 5% of the Galileo satellites are unavailable because of minor problems (as described above). Often these problems appear to be ground-based and have to do with software on earth and not with things being broken in space.
It appears this software was originally procured by ESA and that any changes have to flow from the original developer, to ESA, to the GSA and thence to Spaceopal for deployment. This may mean that fixing software will take some time, but I am sure they’ll eventually get there.
The week long outage
During and after the outage, the GSA and others referred to the Service Definition Document (SDD), which specifies the Minimum Performance Levels (MPLs) we can expect from Galileo during the period of initial services. As noted in the article “Lessons to be learned from Galileo Signal Outage” on the Inside GNSS website, this rings hollow:
Eventually, some days into the event, the GSA did post a statement on its website, taking responsibility and even apologizing for the failure. But then it attempted to minimize its import by arguing that, after all, Galileo is still in its “initial services” phase and therefore should not be expected to work without interruption. Among many Galileo’s supporters, that message fell flat.
A major problem during the outage was in fact the communication of what was (not) going on. As the Inside GNSS article correctly notes, no one within Galileo other than the European Commission itself was allowed to communicate anything. And the EC is in effect in the worst position to communicate because it is so far removed from operations.
The defense that Galileo was operating as specified in the SDD may in fact be correct. Within any 30 day period, there can be a full 6.9 day outage and Minimum Performance Levels can still be met.
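The 6.9 day figure appears to follow directly from the “better than 77%” availability quoted in the highlights further down. A small sketch of that arithmetic, assuming availability is simply averaged over the 30 day window (the SDD's precise definition may differ, and the function name is made up for illustration):

```python
def max_outage_days(window_days: float, min_availability: float) -> float:
    """Longest total downtime in a window that still meets a minimum availability."""
    return window_days * (1.0 - min_availability)


if __name__ == "__main__":
    # 30 day window with the 77% availability figure quoted in this article.
    print(f"{max_outage_days(30, 0.77):.1f} days")
```

Under that assumption, 23% of 30 days is about 6.9 days of permitted downtime.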
In September an Independent Inquiry Board was formed to study the week long Galileo outage. The plan was to release preliminary findings in October with final recommendations by the end of 2019.
From rumors, we know a report has been written and that it has reached some firm conclusions. It appears a lot was wrong, but the findings are currently EU classified.
Although October has come and gone, the independent board has not published any initial findings so far. As an aside, an industry insider has told me that the best way to tell how well Galileo is actually doing is to see how late the performance reports are.
Meanwhile, earlier this week Pierre Delsaux, European Commission Deputy Director General in charge of Galileo, reportedly blamed the outage on the mistakes of a single person. Mr Delsaux then went on to compound this terrible statement by adding that the EC had been transparent because the reasons for the outage had been discussed at a conference earlier this year.
It is indeed true that a presentation was held in Florida where details were shared with that audience, and by paying $24 we can download the presentation that was held there. From the slides, we learn that the outage stemmed from a failure in the system that determines the satellite orbits and clock parameters, which are normally uploaded to the satellites many times per day.
The outage in the ephemeris provisioning happened because simultaneously:
- The backup system was not available
- New equipment was being deployed and mishandled during an upgrade exercise
- There was an anomaly in the Galileo system reference time system
- Which was then also in a non-normal configuration
In a way this is very good news - if a major outage needed many things to go wrong at the same time, that suggests it was not simply an accident waiting to happen. We can of course wonder why upgrade work was happening while the backup site was not available. In general, rarely used backup facilities are often not immediately able to take over in the face of failure.
After the incident started, it took a while to determine what was going on before operations could be restarted, but by that time, the constellation had already drifted too far from a known state for the status of the orbits and clocks to be converged upon quickly. If the backup site had been live, it would have been a great place to restart from, since it presumably would have been in a converged state already.
As noted earlier on this page, determining the orbits and clock configurations is a complex task, where errors in the clocks, or the individual clock drift rates, can look just like errors in the orbit details. Over time, by tracking orbits and clocks, such ambiguities can be resolved, and the system then converges on a known state. From that point on, that state can then be tracked closely.
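To get a feel for how alike a clock error and an orbit error can look, here is a toy model. It is a sketch under strong assumptions and not how the Galileo ground segment works: it considers only a radial orbit error and a clock offset, ignores everything else, and the off-nadir angles are hypothetical values picked for illustration.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by SI definition


def apparent_range_error_m(radial_orbit_error_m: float,
                           clock_behind_s: float,
                           off_nadir_deg: float) -> float:
    """Toy model of the range error a receiver sees.

    A satellite clock that is behind makes the satellite look farther away
    by the same amount for every receiver. A radial orbit error (satellite
    farther out than broadcast) is projected onto each receiver's line of
    sight, so it shrinks slightly with the off-nadir angle at the satellite.
    """
    projected_orbit = radial_orbit_error_m * math.cos(math.radians(off_nadir_deg))
    return projected_orbit + clock_behind_s * SPEED_OF_LIGHT_M_PER_S


if __name__ == "__main__":
    # Two competing explanations for the same error seen straight below the satellite.
    clock_only = dict(radial_orbit_error_m=0.0, clock_behind_s=3e-9)
    orbit_only = dict(radial_orbit_error_m=3e-9 * SPEED_OF_LIGHT_M_PER_S,
                      clock_behind_s=0.0)
    for angle in (0, 6, 12):  # hypothetical off-nadir angles, in degrees
        a = apparent_range_error_m(off_nadir_deg=angle, **clock_only)
        b = apparent_range_error_m(off_nadir_deg=angle, **orbit_only)
        print(f"{angle:2d} deg: clock-only {a:.3f} m, orbit-only {b:.3f} m, "
              f"difference {abs(a - b) * 100:.1f} cm")
```

In this toy setup the two explanations differ by only a couple of centimeters across the assumed angles, which hints at why telling them apart takes precise tracking over an extended stretch of orbit.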
After sufficient downtime however, such convergence is not a given, and it appears the Galileo constellation had to perform a ‘cold boot’ from first principles. The presentation at the ION conference also notes that due to quality concerns, no shortcuts are taken in such a cold boot.
GNSS ephemeris and clock convergence is a dark art only practiced by a few systems on the planet. Despite the fact that this should all never have happened, I am quite impressed that it was possible to restart the system all over again in only a few days.
Also please note that while this presentation does confirm an operator error was involved, Pierre Delsaux’s reported claim that the incident was due to a single person making a mistake vastly understates the many reasons for the prolonged outage.
So why did this happen?
On one hand, anyone can have a failure. However, seen another way, Galileo as a project does have problems.
The operation of Galileo is spread out over a large number of organizations and companies.
There is a complicated arrangement between the EU, the European Space Agency and industry to keep Galileo going. Much of this is due to the complicated history of the project.
It seems likely that the reliability of the project is not improved by having such a complicated and spread out constellation of companies and agencies - and that this may especially slow down the resolution of the noted ground-based software problems.
To more fully understand what is going on, I heartily recommend reading the longer version of this article, but here are some highlights:
- The Galileo satellite constellation is underpowered to deliver the current performance claims (‘When close isn’t enough, use Galileo’).
- The official status of the project (‘initial services’) says that availability will be “better than 77%”, which belies the ‘better than GPS’ claims.
- GPS and Galileo are engineered to be used in combination, and this combination is indeed superb.
- The EU GNSS Agency (GSA), which is in charge of operating Galileo, relies heavily on industry. The GSA is a hub of contracts, but is not itself a powerhouse of Galileo operational expertise - since all this has been outsourced to large defense contractors.
- Within the Galileo project there is the Galileo Reference Centre, which was designed to be an independent monitor of Galileo performance. This GRC is in reality run by the Spanish space, IT and defense company GMV.
- Currently, around 5% of the Galileo capacity is lost to software problems likely in the Orbit Synchronization Processing Facility (OSPF), run by GMV.
- There is a culture of secrecy around Galileo which is not productive.
- The EU and ESA are currently battling it out over the creation of an additional EU space agency. Brexit also has an impact on Galileo.
- The Galileo operational constellation will likely not be expanded until after 2020, and may in fact decrease in size before that time.
To read on, head to the full “State of Galileo” article.