Embracing CloudKit: Part 8

Posted by Stuart Wheelwright on June 19, 2023 · 16 mins read

Part 8: When Things Go Wrong

This is the last in an eight-part series on implementing data sharing in Shopping UK using CloudKit.

Shopping UK is a smart shopping list for UK shoppers. It knows almost every product in the supermarket and will arrange them by aisle. Lists can be shared with family or friends.

Last week, we looked at how to change or stop a share, what happens when a list is deleted, and background maintenance. Today, we’ll wrap up the series with a look at error handling, merging data, and diagnosing problems.

The Path of Sadness

Happy and sad paths

Throughout this series, the focus has been on the happy path — when everything goes to plan — but a synchronisation solution is only as good as its ability to handle unexpected situations.

We’ll start with a look at what can go wrong before examining how Shopping UK handles things.

I can’t guarantee I’ve thought of all possible problematic scenarios, but I will share what I built, what I learned from it, and some techniques I used to help diagnose problems when they do arise.

Why do things go wrong?

Synchronising data is hard for two reasons:

  • Networking: data must be moved between physically separated systems — the user’s device and iCloud — using an unreliable network.
  • Merging Data: data must be combined from different sources — between each user’s device and iCloud — while maintaining data integrity and striving to keep data consistent across all devices.

The problem with networks

CloudKit provides a clean interface for moving data between a device and iCloud, but it cannot change the nature of networks:

  • The network is never reliable.
  • Latency is never zero.
  • Bandwidth is never infinite.
  • Transport cost is never zero.
  • The network is never homogeneous.

Every request sent from your app to iCloud must travel from the user’s device to iCloud across many networks. Each message will start on a Wi-Fi or a mobile (cellular) link before finding larger internet backbones, and eventually arriving at the datacenter that hosts Apple’s iCloud servers. The path taken to iCloud will be different for each user, and possibly for each message.

Wi-Fi will not always be available, mobile (cellular) data may be disabled (e.g. flight mode enabled or data roaming disabled) or have patchy reception, iCloud may be down or busy, and messages may be lost.

Nothing happens instantly. Each message will take several milliseconds, or longer, to arrive — this is millions of times slower than a message can be processed on the device. Many messages could be in-flight at the same time. Many responses may arrive at the same time.

Everything has a cost. Larger messages need more processing, use more iCloud storage, and eat up more of the user’s mobile (cellular) data plan.

Everyone will have a different experience. Some users have ultra fast broadband, some are limited to a weak or slow data signal. Some parts of the world are closer to a datacenter than others.

The problem with data merging

The aim of a synchronisation solution is to move data between devices to give each user the same up-to-date view of the world.

But the network won’t always be available, messages take time to move between devices, and users expect their app to continue to work when their mobile signal is patchy. The best we can hope for is for all devices to eventually show a consistent view of the data.

And this means we must consider what to do when the same data is changed by two people at the same time.

How should conflicting requests be merged?

How should this be resolved?

  1. Should the changes be applied in order?
  2. Should they be somehow merged?
  3. Should they both be rejected?

There isn’t a “right” answer to this — it depends on the nature of the app, the meaning of the data, and the expectation of the users.

Taming the network

When designing the sharing strategy for Shopping UK, I had several guiding principles in mind:

  1. Synchronise changes with others as soon as possible.
  2. Don’t lose any changes.
  3. Minimise the user’s data costs.
  4. Use the battery efficiently.

These are not universal principles that work for all apps. If your app shares the user’s location in real-time, it may be ok for a single update to be lost. But lost items are not a good look for a shopping list app.

Because the network is unreliable, I chose a queue-based solution.

When the user adds an item, the app creates a JournalEntry record to represent the change. This is saved to a local queue and then the queue is uploaded to iCloud (using CloudKit).

If the network goes down, nothing is lost: the change stays in the local queue until it can be uploaded.

Changes are added to an upload queue before attempting upload

The same thing happens when fetching changes from iCloud. Newly received JournalEntry records are saved to a local queue, then the queue is processed and each change is applied to the list in turn.

After a local change is uploaded, it is removed from the upload queue.

After a remote change is applied to the device, the entry is removed from the to-apply queue.
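
The queue behaviour described above can be sketched in a few lines of Swift. This is a minimal illustration, not Shopping UK’s actual code: the fields of `JournalEntry`, the `UploadQueue` type, and the `upload` closure (which stands in for the CloudKit save) are all assumptions made for the example.

```swift
import Foundation

// Hypothetical model of a change; the real JournalEntry record will differ.
struct JournalEntry: Equatable {
    let id: UUID
    let operation: String   // e.g. "add", "rename"
    let payload: String
}

// An entry only leaves the queue once the uploader reports it as saved.
struct UploadQueue {
    private(set) var pending: [JournalEntry] = []

    mutating func enqueue(_ entry: JournalEntry) {
        pending.append(entry)   // a real app would also persist this to disk
    }

    // `upload` stands in for the CloudKit save and returns the entries it saved.
    mutating func flush(using upload: ([JournalEntry]) throws -> [JournalEntry]) {
        guard !pending.isEmpty else { return }
        do {
            let saved = try upload(pending)
            let savedIDs = Set(saved.map { $0.id })
            pending.removeAll { savedIDs.contains($0.id) }
        } catch {
            // Upload failed: leave everything queued so the next flush retries it.
        }
    }
}

struct Offline: Error {}

var queue = UploadQueue()
queue.enqueue(JournalEntry(id: UUID(), operation: "add", payload: "bread"))

queue.flush { _ in throw Offline() }     // network down: nothing is removed
let stillQueued = queue.pending.count

queue.flush { entries in entries }       // network back: the entry is uploaded and removed
```

The to-apply queue for remote changes could follow the same pattern, with "apply to the list" taking the place of "upload".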

An upload may fail for many reasons. See the CloudKit error reference (CKError.Code) for the full list.

What does the app do when it receives an error?

Generally, Shopping UK will simply show a friendly form of the error in the Activity Log to let the user know synchronisation may be delayed.

For example, when Airplane Mode is enabled, a NetworkUnavailable error will be returned:

App shows the details of the error and the affected items

Is this what happens for all errors?

Nope, just errors that cannot be fixed by the app.

CloudKit protects itself from huge messages and from bursts of requests.

  • If a request is too large or contains too many records, CloudKit will reject it with a LimitExceeded error.

  • If requests are made too quickly, CloudKit may reject some with a RequestRateLimited error. I’ve also occasionally seen a PartialFailure error with the message “will not be saved but can be retried as is”. This appears to happen when iCloud is under heavy load.
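
One way to picture this split between "errors the app can fix" and "errors it can only report" is a small classification function. The sketch below is an assumption-laden illustration: `SyncError` is a hypothetical stand-in for the CloudKit error codes named above (in Swift these appear as `CKError.Code` cases such as `limitExceeded` and `requestRateLimited`), and `SyncAction` and `action(for:)` are invented names.

```swift
// Hypothetical stand-in for the CloudKit error codes mentioned in the text.
enum SyncError: Error {
    case networkUnavailable
    case limitExceeded
    case requestRateLimited
    case retryableAsIs          // the "can be retried as is" partial failure
    case other(String)
}

// What the app does next.
enum SyncAction: Equatable {
    case splitAndResend         // request was too big
    case retryLater             // iCloud is busy or rate limiting
    case reportToUser           // nothing the app can do by itself
}

func action(for error: SyncError) -> SyncAction {
    switch error {
    case .limitExceeded:
        return .splitAndResend
    case .requestRateLimited, .retryableAsIs:
        return .retryLater
    case .networkUnavailable, .other:
        return .reportToUser
    }
}
```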

The app knows what to do about this type of error:

  • When a request contains too many records, Shopping UK splits the list in half and sends each half separately. If either half fails, it is split in half again. This halving continues until everything has been sent successfully.

  • When the request is rejected due to rate limiting or when the “will not be saved but can be retried as is” message is returned, Shopping UK will wait a while and try again.
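
Both recovery strategies are short to express. The following is a sketch under stated assumptions rather than the app’s real implementation: `TooManyRecords` is a hypothetical error standing in for LimitExceeded, `send` stands in for the CloudKit save, and the exponential delay in `retryDelay` is just one reasonable reading of "wait a while". CloudKit errors of this kind can also carry a suggested delay (`retryAfterSeconds` on `CKError`), which would be the better value to use when it is present.

```swift
import Foundation

// Hypothetical error standing in for CloudKit's LimitExceeded.
struct TooManyRecords: Error {}

// Sends `batch`; if it is rejected as too large, splits it in half and sends
// each half, halving again as often as needed.
func sendSplitting<T>(_ batch: [T], send: ([T]) throws -> Void) throws {
    guard !batch.isEmpty else { return }
    do {
        try send(batch)
    } catch is TooManyRecords {
        // A single record that is still too large cannot be split any further.
        guard batch.count > 1 else { throw TooManyRecords() }
        let middle = batch.count / 2
        try sendSplitting(Array(batch[..<middle]), send: send)
        try sendSplitting(Array(batch[middle...]), send: send)
    }
}

// Delay before retry number `attempt` (1-based): 2, 4, 8 ... seconds, capped at `cap`.
func retryDelay(attempt: Int, base: Double = 2, cap: Double = 60) -> Double {
    return min(cap, base * pow(2, Double(attempt - 1)))
}
```

Halving keeps the number of extra requests small: a batch that is twice the limit costs one failed request plus two successful ones.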

Taming the merge

Is a merge always necessary?

No. It can be avoided by never caching anything on device, and treating iCloud as the single authoritative source:

  • When your app needs data, it fetches it from iCloud.
  • To change data, your app immediately uploads the changes to iCloud.
  • If the upload fails because another device made changes at the same time, a fresh copy is fetched, and the user makes the changes again.
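
As a rough illustration of that no-cache approach, here is a toy model of a server that rejects changes based on a stale copy. Everything in it is hypothetical (`Server`, `Conflict`, the version counter); CloudKit’s own equivalent is the record change tag, with a stale save failing with a `serverRecordChanged` error under the default save policy.

```swift
struct Conflict: Error {}

// Hypothetical server holding one versioned list.
final class Server {
    private(set) var version = 0
    private(set) var items: [String] = []

    func fetch() -> (version: Int, items: [String]) {
        return (version, items)
    }

    // Accepts the save only if the client based its change on the latest version.
    func save(items newItems: [String], basedOn clientVersion: Int) throws {
        guard clientVersion == version else { throw Conflict() }
        items = newItems
        version += 1
    }
}

let server = Server()
let alice = server.fetch()
let bob = server.fetch()

// Alice saves first, so her change is accepted.
try server.save(items: alice.items + ["bread"], basedOn: alice.version)

do {
    try server.save(items: bob.items + ["eggs"], basedOn: bob.version)
} catch is Conflict {
    // Bob's copy was stale: fetch a fresh copy and make the change again.
    let fresh = server.fetch()
    try server.save(items: fresh.items + ["eggs"], basedOn: fresh.version)
}
```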

This is the way websites worked in the very early days.

This may be ok for some types of app, but it doesn’t work for a shopping list. If you’re in a rural shop with no data signal it would be frustrating if your shopping list was offline and could not be viewed.

So we’re stuck with the reality that data on the device and the data in iCloud can diverge. And this means merges are necessary.

When do we merge?

The beauty of using a write-only journal is that merges are never needed when uploading changes to iCloud. Each change is simply appended to the set of JournalEntry records in iCloud.

Changes are appended to iCloud

But this doesn’t eliminate the merge, it just moves it elsewhere! This is why I love design. Nothing is free; everything is a trade-off :)

For Shopping UK, the merge happens when we need to apply the changes to the user’s device.

Here are some examples.

Adding while offline

Alice and Bob are both offline, with no phone reception.

Alice adds ‘bread’ below ‘milk’ moments before Bob adds ‘eggs’ below ‘milk’.

Add while offline

What happens when their network returns — how do we resolve this conflict?

Shopping UK will apply the changes in order. This can result in Alice and Bob’s lists appearing in a slightly different order:

Merge when online

Remember this is an unusual event. It is not often that both users would add items to their shared list while both were offline. And, although each user has a different ordering, it won’t affect how the lists are used. Items are never lost, and when Alice and Bob start shopping, their lists will be automatically categorised and ordered to match the aisles in the supermarket.
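
To see how the two lists can end up in a different order, here is a small sketch. It rests on an assumption that the post does not spell out: each device applies its own change immediately and the other person’s change when it arrives. The `AddBelow` type and `apply` function are hypothetical.

```swift
// Hypothetical journal operation: insert `item` directly below `anchor`.
struct AddBelow {
    let item: String
    let anchor: String
}

func apply(_ op: AddBelow, to list: inout [String]) {
    if let index = list.firstIndex(of: op.anchor) {
        list.insert(op.item, at: index + 1)
    } else {
        list.append(op.item)   // anchor missing: fall back to the end of the list
    }
}

let alicesChange = AddBelow(item: "bread", anchor: "milk")
let bobsChange = AddBelow(item: "eggs", anchor: "milk")

// Alice's device: her own change first, then Bob's when it arrives.
var alicesList = ["milk"]
apply(alicesChange, to: &alicesList)
apply(bobsChange, to: &alicesList)     // milk, eggs, bread

// Bob's device: his own change first, then Alice's when it arrives.
var bobsList = ["milk"]
apply(bobsChange, to: &bobsList)
apply(alicesChange, to: &bobsList)     // milk, bread, eggs
```

Both lists contain the same items; only the position of the two new items differs.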

Arguments could be made for a better approach, but in my mind, this is a reasonable compromise that has the benefit of allowing the app to continue to be used offline.

Renaming or recolouring the list while offline

Some merge situations are a lot simpler to resolve.

If two users were to change their shared shopping list’s colour or name while offline, Shopping UK would simply apply the changes in the order they were made — the last one wins.

For example, Alice and Bob are both offline. Alice changes the list’s name to “Weekly Shop” a second before Bob renames the list to “Alice and Bob’s List”. When the devices re-join the network, each device will upload the “rename” change, and both devices will fetch the other person’s “rename”. On both devices, the operations will be applied in the same order. This will result in the list being called “Alice and Bob’s List”, because Bob made his change after Alice.
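
A last-one-wins rule like this can be sketched as "sort by the time the change was made, then apply in that order". The `Rename` type, its `madeAt` timestamp, and `resolveName` are hypothetical names for illustration; the post does not say how the app records the order of changes.

```swift
import Foundation

// Hypothetical rename operation with the time it was made on the device.
struct Rename {
    let newName: String
    let madeAt: Date
}

// Applies renames oldest first, so the most recent change is the one left standing.
func resolveName(current: String, renames: [Rename]) -> String {
    var name = current
    for rename in renames.sorted(by: { $0.madeAt < $1.madeAt }) {
        name = rename.newName
    }
    return name
}

let start = Date()
let alices = Rename(newName: "Weekly Shop", madeAt: start)
let bobs = Rename(newName: "Alice and Bob’s List", madeAt: start.addingTimeInterval(1))

// The order of arrival does not matter; both devices reach the same name.
let onAlicesDevice = resolveName(current: "Shopping", renames: [alices, bobs])
let onBobsDevice = resolveName(current: "Shopping", renames: [bobs, alices])
```

Ordering by timestamp depends on the two device clocks roughly agreeing, which is usually good enough for a list name.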

Easily Diagnosing problems

Things often went wrong while I was building the synchronisation solution — due to my misunderstanding, bugs, or undocumented iCloud behaviour.

Diagnosing the cause of the issue was not always easy.

Here are some things that helped me.

Logging to Console

Having detailed logs available was essential for diagnosing problems. Many bugs cannot be found simply by capturing the current state of the system. It is often necessary to see how the state changed over time. Most of CloudKit’s calls are asynchronous and this adds to the complexity.

When writing code, I added “write-to-log” messages everywhere — for almost every path through the code. In many cases, these log messages became a substitute for code comments.

Add logging liberally

My logging framework has configurable verbosity. While developing, I set it to DEBUG mode, to show every message in the console. This meant I could see exactly what was happening internally, and in real-time. The logs showed every decision, the contents of every request, the contents of every response, error messages, everything.

Example of console logging

For the version released to the App Store, only ERROR messages and critical information are logged. This means logging won’t slow down normal operation.
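
A logger with configurable verbosity needs very little code. The sketch below is not the framework used in Shopping UK; `AppLogger`, `LogLevel`, and the `sink` closure are hypothetical. It shows one way to keep liberal logging cheap in release builds: with `@autoclosure`, the message string is only built when it will be written.

```swift
enum LogLevel: Int, Comparable {
    case debug = 0, info, warning, error

    static func < (lhs: LogLevel, rhs: LogLevel) -> Bool {
        return lhs.rawValue < rhs.rawValue
    }
}

struct AppLogger {
    var minimumLevel: LogLevel
    var sink: (String) -> Void = { print($0) }   // where log lines are written

    // The message expression is not evaluated when the level is below the minimum.
    func log(_ level: LogLevel, _ message: @autoclosure () -> String) {
        guard level >= minimumLevel else { return }
        sink("[\(level)] \(message())")
    }
}

// Development build: everything is shown.
let debugLogger = AppLogger(minimumLevel: .debug)
debugLogger.log(.debug, "Uploading 3 journal entries")

// App Store build: only errors get through.
let releaseLogger = AppLogger(minimumLevel: .error)
releaseLogger.log(.debug, "This message is never built or printed")
```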

Real Time Journal Diagnostics

In addition to console logging, I found it useful to add a view to show the state of the upload and to-apply queues. This floating window shows how many of each type of operation are waiting to be sent to iCloud and waiting to be applied to the device.

Tapping on the window will show the entries in the queue, and tapping an entry shows further detail.

This view is hidden for the App Store version of the app, but it is incredibly helpful while developing.
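
The numbers shown in a window like this boil down to a count per operation type. A minimal sketch, assuming a hypothetical `QueuedEntry` type and made-up operation names:

```swift
// Hypothetical queue entry; only the operation type matters for the summary.
struct QueuedEntry {
    let operation: String   // e.g. "add", "remove", "rename"
}

// Counts queued entries by operation type.
func summary(of queue: [QueuedEntry]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for entry in queue {
        counts[entry.operation, default: 0] += 1
    }
    return counts
}

let uploadQueue = [
    QueuedEntry(operation: "add"),
    QueuedEntry(operation: "add"),
    QueuedEntry(operation: "rename"),
]
let waitingToUpload = summary(of: uploadQueue)
```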

Diagnostic Log

Finally, I’ve extended the information included in the diagnostic log users can send me when they encounter a problem.

Send Diagnostic Log when reporting a problem

This log now contains details of the CloudKit configuration, and the contents of the upload and to-apply queues.

Diagnostic Log example

The information in this report should prove invaluable for tracking down the root cause if problems are reported by users.

The End

And that concludes the series. I hope it has been useful. Please send any errors, omissions, or feedback to @wheelies

Before I go, I’d like to share some CloudKit articles I found to be incredibly useful while adding CloudKit support to Shopping UK:

*Shopping UK* App Screenshots

If you get a chance, please try Shopping UK and let me know what you think at @wheelies