Sporadic slow Docker push towards AWS ECR

Incident Report for Semaphore CI

Resolved

We are happy to report that the issue with slow pushes towards AWS ECR is resolved.

With great help from the AWS networking team, we concluded that the cause was ISP-to-ISP congestion resulting from a damaged fiber-optic cable.

With the new routing in place, users shouldn’t experience any unexpectedly slow pushes to AWS ECR.

Posted Apr 23, 2021 - 12:13 UTC

Identified

The AWS networking team has identified the issue, which affects Availability Zones us-east-1d and us-east-1e between 22:30 and 03:30 UTC. The AWS team is working with their upstream connectivity providers to resolve it. We will communicate progress as we receive updates from AWS.

Posted Apr 15, 2021 - 10:33 UTC

Update

We are continuing to investigate this issue.

Posted Apr 09, 2021 - 12:29 UTC

Update

We are continuing to work with the AWS networking department on resolving sporadic connectivity issues with the AWS us-east-1 region.
The problem has been narrowed down to two specific Availability Zones, us-east-1d and us-east-1e, between 22:30 and 03:30 UTC, and we are hoping that a resolution is near.
We apologize for the inconvenience this is causing.

Posted Mar 31, 2021 - 12:02 UTC

Update

Issue: Slow Docker push towards AWS
We continue to work on this issue. The problem seems to be less severe than when originally detected and it is happening less frequently.

We noticed a pattern: the slowdown usually happens between 22:30 and 03:30 UTC.
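
If you want to check whether your own slow jobs fall into this window, note that it crosses midnight, so a simple "start <= t <= end" comparison does not work. Below is a minimal Python sketch of such a check. The function name `in_slowdown_window` is a hypothetical helper introduced for illustration; it is not part of Semaphore or of the troubleshooting script.

```python
from datetime import datetime, time, timezone

# Window reported in this incident: 22:30 - 03:30 UTC.
WINDOW_START = time(22, 30)
WINDOW_END = time(3, 30)


def in_slowdown_window(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the window."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Convert to UTC first, since the window is expressed in UTC.
    t = ts.astimezone(timezone.utc).time()
    # The window crosses midnight, so it is the union of
    # [22:30, 24:00) and [00:00, 03:30).
    return t >= WINDOW_START or t < WINDOW_END
```

A job that started at 23:00 UTC or 01:15 UTC would count as inside the window, while one that started at 10:14 UTC would not.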

With the insights that we have gathered so far, we have prepared an improved version of the script that collects additional metrics on docker commands.

If you are interested in helping our team by sending us more data, you can include this script in the builds of the projects affected by this issue.

Instructions on how to use it can be found here: https://github.com/renderedtext/snippets/blob/master/docs/slow_docker_push_troubleshoot.md
If you want to review what data is collected, you can find the script itself here: https://github.com/renderedtext/snippets/blob/master/push.sh
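
For readers curious about what timing a docker command involves in general, here is a minimal Python sketch. It is not the push.sh script linked above and does not reproduce what that script collects; the function name `timed_run` and the returned fields are assumptions made for illustration only.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def timed_run(cmd):
    """Run a command and return its UTC start time, duration, and exit code."""
    started_at = datetime.now(timezone.utc)
    t0 = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    result = subprocess.run(cmd, capture_output=True, text=True)
    duration = time.monotonic() - t0
    return {
        "command": " ".join(cmd),
        "started_at_utc": started_at.isoformat(),
        "duration_seconds": round(duration, 3),
        "exit_code": result.returncode,
    }
```

In a build, this could wrap a push with a call such as `timed_run(["docker", "push", image])`, where `image` is a placeholder for your own repository and tag. Recording the UTC start time alongside the duration makes it easy to compare slow pushes against the time window mentioned above.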

If you are affected by this problem, we urge you to reach out to our support team at support@semaphoreci.com and provide a link to the affected jobs.

Posted Mar 17, 2021 - 10:14 UTC

Update

Our engineering team is continuing to work on this problem. At this moment, this is a top priority for our infrastructure team.
We have had an open communication channel with our upstream provider since March 2nd and have managed to identify two separate issues, one of which was resolved on March 5th (details below).
_______________________________________________________________________________________
Issue 1: Connectivity issues with GCP storage, AWS S3, and Docker pull, overall network connectivity issues.

Status: Resolved

Summary: Urgent unscheduled maintenance on our upstream provider's network started on March 2nd and continued until March 5th. On the Semaphore end, this manifested as general network instability and increased download and upload times towards resources outside of the Semaphore environment.
These issues were resolved on March 5th.
If you're still experiencing this problem, please reach out to our support team at support@semaphoreci.com.
_______________________________________________________________________________________
Issue 2: Slow Docker push towards AWS

Status: Ongoing

Summary: On a small number of projects, docker push runs drastically slower than expected. This issue is sporadic and so far has been limited to pushes to AWS ECR and DockerHub (also hosted on AWS). We are still working with the upstream providers to pinpoint the exact cause of this behavior.

Any insights from our users that are affected by this will help us build a more statistically relevant sample of the issue and bring us closer to resolving it.

If you are affected by this problem, we urge you to reach out to our support team at support@semaphoreci.com and provide a link to the affected jobs.

Our engineers are preparing a docker-debug script that collects additional metrics, and next week we will send it to all affected parties who are interested in helping us get to the bottom of this.

Posted Mar 12, 2021 - 14:45 UTC

Update

We are continuing to investigate this issue.

Posted Mar 05, 2021 - 10:24 UTC

Update

We are continuing to investigate this issue.

Posted Mar 04, 2021 - 12:39 UTC

Investigating

We are seeing sporadic packet loss between our build cluster and the AWS us-east-1 region.
Our team is actively working with the upstream provider to pinpoint and resolve the issue.

Posted Mar 03, 2021 - 10:56 UTC

This incident affected: Semaphore.