Engineering for Offline Workflows

Mobile devices have revolutionized how people use the internet.  Pulling some numbers from Smart Insights' mobile marketing statistics, we see that, measured globally, more minutes are now spent on mobile devices than on traditional desktops.

However, growing up in a rural-ish area, I've noticed that data coverage is not necessarily commensurate with data requirements.  Even when it is, the cost of the data is often unaffordable.  Even in relatively wealthy countries with modern infrastructure, such as the United States, data rates can be quite high.

This introduces an interesting and important problem for engineers building applications that target the mobile space.  Users strongly desire to use their applications on mobile devices, yet using those devices as intended is often impossible or unaffordable.  The challenge for the engineer is creating a solution that lets users work seamlessly despite poor connectivity, synchronizing during "windows of affordability" - periods in which the user is connected to cheaper internet, like home or business wifi.

This blog post is targeted at engineers trying to solve this problem.  How can apps be designed to meet these requirements?  How should the data model be designed to support offline workflows?  The proposed solution is not universal - nothing ever is - but it is intended to form the basis of a simple CRUD app.  The recommendations will obviously need to be tailored to individual use cases.

The hardest problems in computer science


For the purposes of examination, I'll assume that offline workflows must support querying, updates, creation, and deletion of entities.  These entities can even contain references to each other.  How do we support these operations locally and make it look like they are applied globally?


There are only two hard things in Computer Science: cache invalidation and naming things. - Phil Karlton

Mobile applications that support offline workflows are all about caching important data and knowing when to synchronize and invalidate.  I like to think the above quotation can be amended to:


There are many hard things in Computer Science: cache invalidation and naming things and I have the rest written down somewhere else.

This more accurately reflects the distributed nature of this problem.  Engineers designing offline clients have to think about invalidating caches in the presence of constant network partitioning.  It's the CAP theorem with P on steroids.  If you are unfamiliar with CAP, I recommend getting up to speed on the theorem before continuing.

Essentially, the design here is going to optimize for availability.  We are assuming that the application must always appear to work, even if the user isn't connected to the internet and hasn't been for some time.  In CAP parlance, this will be an AP system in which we place consistent rules around partition healing.

The design


The database

It's tempting to use a traditional RDBMS here and rely on the client to initiate and commit transactions involving batches of entities.  Doing so allows us to verify referential constraints and make the batching atomic.  However, this conflicts with the basic premise of an AP system intended for marginal connectivity.  How many dangling transactions in this system are acceptable?  If a transaction fails due to a referential constraint, what logic is required to pull the offending entities out of the batch and retry?  This path is going to require a great deal of communication, something upon which we can't rely.

The remainder of this post assumes the database is a NoSQL database with limited transaction support.  This might mean we can only serialize mutations to a single row (entity) via a CAS operation like Cassandra's lightweight transactions or Google Datastore's transaction semantics.  In either case, there is no support for foreign keys or referential constraints.

The benefits of using such a system are that it can be quite cost-effective and scalable and, in the case of Google's Datastore, requires no operational overhead.  I highly recommend Google's Datastore for the application I describe below, as it is cheap and well suited to this use case.


Storage

Modern browsers and applications make it quite easy to cache data locally.  This is true whether you are building a traditional web app using local storage (somewhat limited in size) or a mobile app using something like React Native's AsyncStorage.  This data can persist beyond the lifetime of the application process, which is very important during long partitions.

Caching

Any local storage is going to have a finite amount of space, imposed either by hardware constraints or by the application developer's tolerance for perceived data loss.  The key here will be to separate caching into two partitions: one containing locally mutated entities and one containing entities from the server.  These entities might be a list of appointments in an appointment book, for instance.  In order to keep cache usage reasonable, these will often be ordered by time descending.  This might not always be the best order, but in the appointment book case we can assume the user is more interested in upcoming and recently completed appointments than in appointments years in the past.  For simplicity's sake, I'm going to ignore appointments that might be years in the future.
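
As a rough sketch of this split - with an in-memory stand-in for whatever local storage the platform actually provides, and with CachedEntity and the capacity numbers invented for illustration - the cache might look like this:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real entities defined in the next section.
public class CachedEntity {
    public Guid Id;
    public DateTime Time; // the relevance measure: newer is more relevant
}

public class OfflineCache {
    private const int ServerCapacity = 500;  // tune to available local storage
    private const int MutatedCapacity = 100; // sized so running out is rare

    private readonly Dictionary<Guid, CachedEntity> _server = new Dictionary<Guid, CachedEntity>();
    private readonly Dictionary<Guid, CachedEntity> _mutated = new Dictionary<Guid, CachedEntity>();

    // Server partition: evict the least relevant (oldest) entity when full.
    public void CacheServerEntity(CachedEntity entity) {
        if (_server.Count >= ServerCapacity && !_server.ContainsKey(entity.Id)) {
            var leastRelevant = _server.Values.OrderBy(e => e.Time).First();
            _server.Remove(leastRelevant.Id);
        }
        _server[entity.Id] = entity;
    }

    // Mutated partition: refuse new mutations rather than evict, so local
    // edits are never silently dropped.
    public bool TryCacheMutation(CachedEntity entity) {
        if (_mutated.Count >= MutatedCapacity && !_mutated.ContainsKey(entity.Id))
            return false; // surface "local storage is full" to the user
        _mutated[entity.Id] = entity;
        return true;
    }
}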

The workflow we have so far, then, resembles the following:


  1. The user opens the application or manually initiates some synchronization process.
  2. The application phones home to the server and pulls the most relevant server state.  In our case, this is a list of appointments ordered by date, but it could be ordered by any relevance measure.  This might also involve several types of entities, such as appointment entities and appointment book entities, which are simply entities that contain a list of appointments.
  3. The client caches this data in local storage of some sort.  If there is not enough space, the least relevant data is excluded.

Curious readers will take issue with step 1: what if the user is offline for the first sync?  Unfortunately, not all cases can be covered - sometimes we simply can't do what the user asks.  However, we do have the advantage that users who require offline workflows often begin and end their days within range of cheap internet.
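
To make steps 2 and 3 concrete, here is a minimal sketch of the pull; IAppointmentApi and ILocalCache are hypothetical interfaces, and the three-month window and page size are arbitrary:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class Appointment {
    public DateTime Time; // stand-in for the model defined below
}

public interface IAppointmentApi {
    // Most relevant first: appointments ordered by date descending.
    Task<List<Appointment>> GetAppointmentsAsync(DateTime since, int limit);
}

public interface ILocalCache {
    void CacheServerEntity(Appointment appointment); // evicts least relevant when full
}

public static class SyncService {
    public static async Task PullAsync(IAppointmentApi api, ILocalCache cache) {
        // Pull a bounded window of the most relevant server state.
        var appointments = await api.GetAppointmentsAsync(
            since: DateTime.UtcNow.AddMonths(-3), limit: 200);

        // Cache locally; the least relevant entities fall out on overflow.
        foreach (var appointment in appointments)
            cache.CacheServerEntity(appointment);
    }
}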

Mutated entities will go into the mutated partition.  When entities are displayed, they are munged with the cached server entities to present a consistent view of the world to the user.  If the mutated cache runs low on space, further mutations can be prevented.  Its size should be set so that this is a rare case, while still preventing the mutated cache from growing indefinitely.
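
A minimal sketch of that munge, assuming dictionary-shaped partitions keyed by entity ID (the Appointment stand-in is fleshed out in the next section):

using System;
using System.Collections.Generic;
using System.Linq;

public class Appointment {
    public Guid Id;
    public DateTime Time;
}

public static class CacheMerge {
    public static List<Appointment> MergedView(
        IReadOnlyDictionary<Guid, Appointment> serverCache,
        IReadOnlyDictionary<Guid, Appointment> mutatedCache) {
        var view = serverCache.ToDictionary(kv => kv.Key, kv => kv.Value);

        // Local, unsynchronized edits shadow the cached server copies.
        foreach (var kv in mutatedCache)
            view[kv.Key] = kv.Value;

        // Present in relevance order: most recent first.
        return view.Values.OrderByDescending(a => a.Time).ToList();
    }
}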

The model

In order to better illustrate what needs to be done, I'm going to outline this post's specific models so we can more easily keep track of what's required.

Appointment book:
public class AppointmentBook {
    public DateTime LatestAppointment; // time of latest appointment in the book
}

Appointment:
public class Appointment {
    public DateTime Time; // how is this relevant to the user?
}


In both of these C# models, we use a DateTime field to determine relevance.  This field will be useful in querying.  We'll be adding to these models as we add capability.

Creation

This problem is fairly simple if entities cannot reference other entities.  In that case, the client would simply need a temporary identifier, or none at all, until it successfully saves the entity and receives a server-assigned identifier.  However, almost all applications in the real world maintain relationships between entities.  This could be adding a relationship between an old and a new entity, two new entities, or two old entities.  This means that clients will need to generate identifiers that are unique across themselves and all other clients, typically a UUID.

Generating a UUID in JavaScript may not be as simple as it first seems.  Most native random generation is good, but not cryptographically secure.  If this makes you uncomfortable, don't reach for a straight timestamp as a source of randomness: most machines run some sort of NTP service and may accidentally generate values from the same timestamp.  A better option might be to grab a UUID from the server and combine it with a local counter to guarantee uniqueness.  The best bet, though, is probably a library like uuid that uses stronger random generation where available.
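
To illustrate the server-issued option, here is the UUID-plus-counter scheme, sketched in C# for consistency with the models in this post (a JavaScript client would apply the same idea); ClientIdFactory is an invented name:

using System;
using System.Threading;

public class ClientIdFactory {
    private readonly Guid _sessionId; // handed out once by the server
    private int _counter;

    public ClientIdFactory(Guid serverIssuedId) {
        _sessionId = serverIssuedId;
    }

    public string NextId() {
        // The GUID makes the ID globally unique; the counter distinguishes
        // entities created within this session.
        int n = Interlocked.Increment(ref _counter);
        return $"{_sessionId:N}-{n}";
    }
}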

It's also a bit unusual for many developers to accept entity IDs from the client.  While it seems a bit odd at first, it makes supporting partitions much easier.  It also allows the client to create relationships between entities without having to be concerned about reconciling IDs in the future, a process fraught with hazard.  It makes saving order-independent as long as a few rules are followed:


  1. Consistent rules are followed on clients when a related entity cannot be found.  For instance, it's possible for client 1 to save an appointment book but lose its connection before the appointment is saved.  This leaves a dangling relationship.  Code needs to handle this case consistently.
  2. Corollary to 1: no referential constraints.  Client calls to save entities will only fail on a server error, not for entity state.  Saving an appointment book entity that references an appointment that does not yet exist is permissible.  This will seem foreign in SQL environments, but it is a common pattern in NoSQL environments.  Trading consistency for scale often involves moving toward a "python-like" paradigm... better to ask forgiveness than permission.

Adding client-side IDs results in the following entities.

Appointment book:

public class AppointmentBook {
    public Guid Id; // client-set globally unique identifier
    public DateTime LatestAppointment; // time of latest appointment in the book
}

Appointment:

public class Appointment {
    public Guid Id; // client-set globally unique identifier
    public DateTime Time; // time of the appointment; determines relevance
    public Guid AppointmentBookId; // reference to an appointment book, globally unique
}
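
To see why this helps, here is a sketch of creating a related pair of entities with the models above; IEntityApi is a hypothetical client-to-server interface:

using System;
using System.Threading.Tasks;

public interface IEntityApi {
    Task SaveAsync(AppointmentBook book);
    Task SaveAsync(Appointment appointment);
}

public static class CreateAppointmentFlow {
    public static async Task CreateAsync(IEntityApi api, DateTime time) {
        var book = new AppointmentBook {
            Id = Guid.NewGuid(), // client-assigned, no server round trip
            LatestAppointment = time
        };

        var appointment = new Appointment {
            Id = Guid.NewGuid(), // client-assigned
            Time = time,
            AppointmentBookId = book.Id // the relationship is valid immediately
        };

        // No referential constraint: either save may land first, and one may
        // fail and be retried later without reconciling IDs.
        await api.SaveAsync(appointment);
        await api.SaveAsync(book);
    }
}

If the book's save never arrives, rule 1 above kicks in: the appointment holds a dangling reference that clients must handle consistently.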

Updates

Updates are quite a bit harder.  We know that two clients cannot simultaneously create conflicting entities, but they may simultaneously edit the same entity.  How do the server and client manage this situation?  A common, simple solution is a revision field: a monotonically increasing integer that gets incremented every time the entity is persisted, allowing the server to detect collisions.  Most NoSQL data stores support row-level transactions that can enforce this type of consistency.

When the client downloads entities from the server, it maintains but ignores these revision values.  When saving, the original revision values are simply sent back to the server.  If the entity has been saved by someone else in the meantime, the revision value will alert the server to this fact.  But what should be done?

Remember, the server cannot and should not prevent a save here.  A good solution is to take a diff of the old and new entities and log the changes.  This immutable log can go into an OLAP database like BigQuery, or into the database you are using for the entities themselves.  The important thing is that the collision is logged and administrators can see what happened at every step to arrive at the current state.  Remember: ask forgiveness, not permission.

Appointment:

public class Appointment {
    public Guid Id; // client-set globally unique identifier
    public DateTime Time; // time of the appointment; determines relevance
    public int Revision; // detects collisions
    public Guid AppointmentBookId; // reference to an appointment book
}
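
Here is a sketch of the server-side save under this scheme; IEntityStore and IConflictLog are hypothetical, and the datastore is assumed to run the read-check-write within a row-level transaction (a Datastore transaction or a Cassandra lightweight transaction, for instance):

using System;
using System.Threading.Tasks;

public interface IEntityStore {
    Task<Appointment> GetAsync(Guid id);
    Task PutAsync(Appointment appointment);
}

public interface IConflictLog {
    // Append-only log of diffs, e.g. in BigQuery, for later audit.
    Task LogCollisionAsync(Appointment current, Appointment incoming);
}

public static class SaveHandler {
    public static async Task SaveAsync(IEntityStore store, IConflictLog log, Appointment incoming) {
        var current = await store.GetAsync(incoming.Id);

        // Collision: someone else saved since this client last synced.
        // Ask forgiveness, not permission: log the diff, never reject.
        if (current != null && current.Revision != incoming.Revision)
            await log.LogCollisionAsync(current, incoming);

        incoming.Revision = (current?.Revision ?? 0) + 1;
        await store.PutAsync(incoming);
    }
}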

Deletes

Similar to updates, all cascades and referential constraints are removed.  All deletes become soft deletes, which means deletion can share a codepath with updates, with only one specific field being modified.  This might mean that some clients, holding stale entities, create relationships that aren't technically possible.  These situations are allowed, but logged for future correction.

Clients maintain their cache of updated entities and filter out, at the UI, any entities that have the deleted flag set.  The server also filters on this flag before returning entities to the client.

Because it's also technically possible to update a deleted entity, we'll make the deleted marker a timestamp instead of using a more generic modified timestamp.  This gives us a record, at any point after deletion, of when the entity was deleted.


Appointment:

public class Appointment {
    public Guid Id; // client-set globally unique identifier
    public DateTime Time; // time of the appointment; determines relevance
    public int Revision; // detects collisions
    public Guid AppointmentBookId; // reference to an appointment book
    public DateTime Deleted; // if set, indicates this entity is deleted
}
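
A sketch of how deletion rides the update path and how both tiers filter; SoftDeletion is an invented name:

using System;
using System.Collections.Generic;
using System.Linq;

public static class SoftDeletion {
    public static Appointment MarkDeleted(Appointment appointment) {
        appointment.Deleted = DateTime.UtcNow; // records *when* it was deleted
        return appointment;                    // then save via the normal update path
    }

    // Applied on the server before returning results, and again in the client
    // UI, since a stale cache may still hold entities deleted elsewhere.
    public static IEnumerable<Appointment> FilterDeleted(IEnumerable<Appointment> appointments) {
        return appointments.Where(a => a.Deleted == default(DateTime));
    }
}

Because deletion goes through the same save path, it also gets the revision check and conflict logging above for free.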

Putting it all together

Using the above models as a guide, a simple framework can be built to support offline clients in an AP model.  The fields support a few basic rules:

  1. Fail forward - entities can always be saved.  If business invariants are violated, or conflicts encountered, they are repaired instead of prevented.
  2. Saves are small, contained units.  It's tempting to allow clients to commit large batched transactions, but we expect spotty connectivity, and the risk of dangling transactions is high.  Instead, learn to live in a world without constraints.
  3. Log changes.  The consistency model is generally last writer wins, but if we log how the entity arrived at its current state, we can always perform an audit and correct as necessary.
  4. Measure relevancy.  Cache sizes are limited, so we maintain only as much state as required to provide a good experience, and the user is interested in what's relevant.

Following these rules won't cover all cases; some may require more complex solutions (CRDTs when mutating lists, for instance).  This is simply meant as a primer on building a basic CRUD application that works offline, something very valuable in a world where users are mobile and data is expensive.
