
Fault tolerance

Given the millions of objects found in a typical Microsoft 365 tenant, Corso is optimized for high-performance processing, hardened to tolerate transient failures and, most importantly, able to restart backups.

Corso’s fault-tolerance architecture is motivated by the variable performance and throttling of Microsoft’s Graph API. Corso follows Microsoft’s recommended best practices (for example, correctly decorating API traffic) and, in addition, implements a number of optimizations to improve backup and restore reliability.

Recovery from transient failures

At the HTTP layer, Corso retries requests (after an HTTP timeout, for example) and respects Graph API directives, such as the Retry-After header, to back off when needed. This allows backups to succeed in the face of transient failures.
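
To make the retry behavior concrete, here is a minimal sketch of a bounded retry loop that honors a Retry-After value and otherwise falls back to exponential backoff. It is an illustration only, not Corso's implementation: the function name `send_with_retry`, the set of retryable status codes, the attempt limit, and the delay values are all assumptions, and only the delay-in-seconds form of Retry-After is handled.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# A response is modelled as (status_code, headers); a simplification for this sketch.
Response = Tuple[int, Mapping[str, str]]

# Assumed set of status codes worth retrying (throttling and server-side hiccups).
RETRYABLE_STATUSES = {429, 503, 504}


def send_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        status: Optional[int]
        try:
            status, headers = send()
        except TimeoutError:
            # Treat a timeout like a retryable failure with no server hint.
            status, headers = None, {}

        if status is not None and status not in RETRYABLE_STATUSES:
            return status, headers

        if attempt == max_attempts:
            break

        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            # The server told us how long to wait; respect it.
            delay = float(retry_after)
        else:
            # No usable hint: exponential backoff (1s, 2s, 4s, ...).
            delay = base_delay * 2 ** (attempt - 1)
        sleep(delay)

    raise RuntimeError(f"request failed after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Because the loop is bounded, a failure that outlasts the retries surfaces as an error instead of stalling the backup.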

Restarting from permanent API failures

The Graph API can, for internal reasons, exhibit extended periods of failures for particular Graph objects. In this scenario, bounded retries will be ineffective. Unless invoked with the fail fast option, Corso will skip over these failing objects. For backups, it will move forward with backing up other objects belonging to the user and, for restores, it will continue trying to restore any remaining objects. If a multi-user backup is in progress (via * or by specifying multiple users with the --user argument), Corso will also continue processing backups for the remaining users. In both cases, Corso will exit with a non-zero exit code to reflect incomplete backups or restores.
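
The skip-and-continue behavior described above can be sketched roughly as follows. This is a simplified model and does not reflect Corso's internals: `process_all`, `process_one`, and the `fail_fast` parameter are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def process_all(
    items: Iterable[str],
    process_one: Callable[[str], None],
    fail_fast: bool = False,
) -> Tuple[int, List[Tuple[str, str]]]:
    """Process every item, collecting failures instead of stopping.

    Returns (exit_code, failures), where exit_code is non-zero if
    any item could not be processed.
    """
    failures: List[Tuple[str, str]] = []
    for item in items:
        try:
            process_one(item)
        except Exception as err:
            if fail_fast:
                # Fail-fast mode: abort on the first failing object.
                raise
            # Default mode: record the failure and keep going.
            failures.append((item, str(err)))

    exit_code = 1 if failures else 0
    return exit_code, failures
```

A caller could pass the returned code to `sys.exit` so that scripts and schedulers can detect an incomplete run while still benefiting from the objects that were processed.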

On subsequent backup attempts, Corso will try to minimize the work involved. If the previous backup was successful and Corso’s stored state tokens haven’t expired, it will use delta queries, wherever supported, to perform incremental backups.
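
As a rough illustration of the choice between incremental and full enumeration, the sketch below tries a delta query when a stored token exists and falls back to a full listing if the token is rejected. All names here (`TokenExpired`, `enumerate_changes`, `list_all`, `list_delta`, and the `delta_token` key) are assumptions made for the example; how Corso stores and validates its state tokens isn't specified here.

```python
from typing import Callable, Dict, List, Optional, Tuple


class TokenExpired(Exception):
    """Hypothetical error raised when the service rejects a stale token."""


Listing = Tuple[List[str], str]  # (item ids, token for the next delta query)


def enumerate_changes(
    state: Dict[str, Optional[str]],
    list_all: Callable[[], Listing],
    list_delta: Callable[[str], Listing],
) -> Tuple[str, List[str]]:
    """Return ("incremental" | "full", items) and refresh the stored token."""
    token = state.get("delta_token")
    if token is not None:
        try:
            items, new_token = list_delta(token)
            state["delta_token"] = new_token
            return "incremental", items
        except TokenExpired:
            # Stale token: fall through to a full enumeration.
            pass

    items, new_token = list_all()
    state["delta_token"] = new_token
    return "full", items
```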

If the previous backup for a user resulted in a failure, Corso uses a variety of fallback mechanisms to reduce the amount of data downloaded and the number of objects enumerated. For example, with OneDrive, Corso won't redo downloads of data from Microsoft 365 or uploads of data to the Corso repository if it successfully backed up that OneDrive file as part of a previous incomplete or failed backup. Even when the Graph API doesn't allow Corso to skip downloading data, Corso can still skip uploading it to the repository again.
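
One way to picture these two levels of skipping is with a content-addressed store, sketched below. This is a hypothetical model, not Corso's actual repository format: the `Repo` class, the `seen` index keyed by file ID and version, and the use of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict, Tuple


class Repo:
    """Toy content-addressed repository: identical data is stored once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.uploads = 0  # counts real uploads, for demonstration

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            # Only upload data the repository does not already hold.
            self.blobs[key] = data
            self.uploads += 1
        return key


def backup_file(
    repo: Repo,
    seen: Dict[Tuple[str, str], str],
    file_id: str,
    version: str,
    download: Callable[[str], bytes],
) -> str:
    """Back up one file, skipping download and upload where possible."""
    cached = seen.get((file_id, version))
    if cached is not None and cached in repo.blobs:
        # Same file version was stored by an earlier (even failed) run:
        # skip both the download and the upload.
        return cached

    # The download cannot be skipped, but put() still avoids a second
    # upload if the content is already in the repository.
    data = download(file_id)
    key = repo.put(data)
    seen[(file_id, version)] = key
    return key
```

In this model, losing the `seen` index costs a repeat download but not a repeat upload, which mirrors the fallback described above.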