Paul Vitty, Lead DevOps Engineer, BRS GOLF by GOLFNOW
On 10 May, I – like most of the nation – watched Boris Johnson announce the phased reopening of the country. Midway through, my phone started dinging. I expected it to be friends critiquing government plans… instead it was Slack – “#alerts_international: Abnormal increase in traffic on BRS GOLF Production”. What followed was one of the most interesting and challenging periods of my career.
A few weeks prior, the Operations Team (spanning Orlando, Chicago and Belfast) had seen this before: when golf returned in the US, the team had dealt with phenomenal demand on nearly every system, far above previous years' records and beyond what anybody had forecast.
The story behind handling this demand starts two years and six days prior to the PM's announcement, with the git commit "Initial commit". The Product & Technology Team (P&T) made a strategic decision to start decoupling the UIs – including member and visitor booking – from the core codebase. These UIs communicate back to the core over RESTful APIs, originally built to support our mobile app, and have no direct attachment to persistent datastores.
The DevOps team took this opportunity to rethink how we build, deploy and operate applications. GOLFNOW was an early adopter of Kubernetes, having run high-transaction production applications on it for over two years, so we decided to host this new UI on a cluster in Google Cloud in London. We also built a CI/CD pipeline, empowering developers to make changes and get fast feedback across multiple environments. In the intervening period this pipeline matured, enabling developers to deploy feature branches to an ephemeral environment, which lets Product iterate on changes with a developer prior to merging into the stable branch. With this decoupling, fast feedback and test automation, we moved to a release-when-ready deployment – freeing the UI from the bi-weekly release cadence – something that proved critical.
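A branch-per-environment pipeline like that can be sketched roughly as follows (a minimal illustration assuming GitLab CI and Helm; the job names, chart path and `app-ui` naming are hypothetical, not our actual pipeline):

```yaml
# Hypothetical GitLab CI jobs: deploy each feature branch to its own
# short-lived namespace so Product can review changes before merge.
deploy-ephemeral:
  stage: review
  environment:
    name: review/$CI_COMMIT_REF_SLUG      # one environment per branch
    on_stop: stop-ephemeral               # torn down when review ends
  script:
    - helm upgrade --install "app-ui-$CI_COMMIT_REF_SLUG" ./chart
        --namespace "review-$CI_COMMIT_REF_SLUG" --create-namespace
        --set image.tag="$CI_COMMIT_SHORT_SHA"
  rules:
    - if: '$CI_COMMIT_BRANCH != "main"'

stop-ephemeral:
  stage: review
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  script:
    - helm uninstall "app-ui-$CI_COMMIT_REF_SLUG"
        --namespace "review-$CI_COMMIT_REF_SLUG"
  when: manual
```

Because each branch gets its own namespace and release name, reviews never collide, and tearing the environment down is a single `helm uninstall`.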
In the months leading up to, and throughout, lockdown, P&T and DevOps engaged in a project to upgrade our primary MySQL datastore between major versions. The new version brought significant performance enhancements, and we migrated to hardware with 4x the RAM and faster SSDs.
BRS GOLF (GOLFNOW's B2B tee sheet management offering) is architected as a multi-tenant application, which allowed us to upgrade the datastore in a phased rollout to manage risk. The DevOps team developed Ansible automation to migrate customers' data from our old MySQL cluster to the new one. We were part-way through this process when Covid-19 hit.
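A phased, one-tenant-at-a-time migration of that shape can be sketched as an Ansible play (purely illustrative: the host names, database naming scheme and the `app-ctl` cutover command are assumptions, not our actual playbook):

```yaml
# Hypothetical play: move one tenant's schema from the old MySQL
# cluster to the new one, a tenant at a time to limit blast radius.
- name: Migrate a single tenant to the new MySQL cluster
  hosts: migration_controller
  vars:
    tenant_db: "tenant_{{ tenant_id }}"   # tenant_id supplied per run
  tasks:
    - name: Dump the tenant schema from the old cluster
      ansible.builtin.shell: >
        mysqldump --host=old-mysql --single-transaction
        {{ tenant_db }} > /tmp/{{ tenant_db }}.sql

    - name: Load the dump into the new cluster
      ansible.builtin.shell: >
        mysql --host=new-mysql {{ tenant_db }} < /tmp/{{ tenant_db }}.sql

    - name: Point the application at the new cluster for this tenant
      ansible.builtin.command: >
        app-ctl set-tenant-datastore {{ tenant_id }} new-mysql
```

Running the play per tenant means a failed migration affects one customer, not the whole platform, and the rollout can pause at any point.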
At 7pm on 10 May, BRS GOLF went from practically no traffic, owing to the suspension of golf, to nearly our summer 2019 peak. News had broken that golf could resume; courses were gearing up for the return, and impatient golfers were hunting for tee times. Throughout this first week, strategic decisions made months and years prior proved pivotal to our success, affording us levers to pull to help deal with traffic.
Working off NewRelic, AppDynamics, Datadog and Splunk metrics, we started to pull these levers. In December we had migrated all our properties to Cloudflare for its WAF; an instant win was enabling caching at the edge, which gave end-users an almost immediate perceived performance boost and reduced load on our infrastructure by 30%. Database performance was running at an all-time best, even under far-above-normal QPS.
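Edge caching of this kind generally hinges on the origin marking responses as cacheable; a sketch of what that looks like at the origin (the paths and TTLs here are illustrative, not BRS GOLF's actual rules):

```nginx
# Illustrative origin config: mark static assets as cacheable so the
# CDN edge (Cloudflare in our case) can serve them without hitting
# the origin at all.
location ~* \.(css|js|png|jpg|svg|woff2)$ {
    add_header Cache-Control "public, max-age=86400";   # cache for 24h
}

# Dynamic booking pages stay uncached so availability is always fresh.
location /booking/ {
    add_header Cache-Control "no-store";
}
```

The win is twofold: cached assets are served from a point of presence near the golfer (perceived speed), and every edge hit is a request the origin never sees (reduced load).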
Next we started tuning and scaling up our infrastructure. First, we noticed contention on our HAProxy load balancer, so we iteratively tuned it to reach our current setup; this involved some esoteric Linux kernel tweaks around the TCP/IP stack.
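Tuning of that sort typically lands in a sysctl drop-in file. The parameters below are common TCP/IP knobs for a high-connection-rate load balancer, shown as an illustration rather than the exact values we settled on:

```conf
# Illustrative /etc/sysctl.d/99-haproxy.conf - common knobs for a
# busy load balancer, not our exact production values.
net.core.somaxconn = 65535                  # deeper accept queue for bursts
net.ipv4.tcp_max_syn_backlog = 65535        # absorb SYN spikes at release time
net.ipv4.ip_local_port_range = 1024 65535   # more ephemeral ports to backends
net.ipv4.tcp_tw_reuse = 1                   # reuse TIME_WAIT sockets outbound
net.core.netdev_max_backlog = 65535         # larger per-CPU packet backlog
```

Changes like these are best made one at a time under load testing; several of them trade memory for connection headroom.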
Next we focused on the responsive member booking site in Kubernetes. The way courses release tee times is akin to Ticketmaster: a massive flood of requests inside a 5-minute window, generally on the hour. To manage this we significantly scaled up the GKE cluster, enforced increased resource limits on pods, and increased the number of running pods by 300%. This allowed us to absorb more traffic. Soon we'll start work on predictive auto-scaling of our pods, using historical APM data to scale the number of nodes and pods up and down for cost efficiency.
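The predictive piece is still ahead of us, but the reactive maths behind Kubernetes' Horizontal Pod Autoscaler is simple to sketch: desired replicas scale with the ratio of the observed per-pod metric to its target (a minimal illustration; the request rates below are made up):

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA-style scaling: grow or shrink the replica count by the
    ratio of the observed metric (e.g. requests/sec per pod) to the
    per-pod target, rounding up so capacity is never under-provisioned."""
    return math.ceil(current_replicas * (current_metric / target_metric))


# An on-the-hour tee-time release triples per-pod load...
print(desired_replicas(current_replicas=10,
                       current_metric=300.0,   # observed req/s per pod
                       target_metric=100.0))   # target req/s per pod
# → 30
```

A predictive scaler would feed forecast load (from historical APM data) into the same calculation ahead of the hour, so the pods are already running when the flood arrives.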
It wasn’t all about load though. Our Product team wanted to change some UX in response to feedback, and again, previous decisions helped here. Developers paired with the Product manager to quickly make changes on a feature branch in an ephemeral environment, and with the decoupling, we could quickly roll this out to production with zero downtime.
We continue to see far-above-normal load across our platforms (reaching 7x our previous year's record). Fortunately, due to the team's hard work, this isn't considered an event anymore; it's just the new normal. Although we have been remote since March, this crazy period of growth actually brought the teams closer together, focused on our core principle of Connecting the World to Golf.