Continuously Effecting the Product
I originally wrote this entry for the Roomkey tech blog in 2016. That site has been shut down following the coronavirus-triggered demise of Roomkey in 2020. I'm reproducing it here, lightly edited, for archival purposes.
"There have been N days with no workplace accident"
This article examines an initiative at Room Key to lower the latency between identifying a desired change to code and deploying code to production. It assumes some familiarity with git, Jenkins and Amazon Web Services.
Some History
Room Key's primary product is a website delivered as a single-page JavaScript application that queries an AWS-hosted Clojure web service. We use two primary git repositories: one for the front-end app and one for the back-end web service. From our formation in 2010 until very recently, we practiced a form of agile development with a three-week sprint. Once per sprint we would cycle through the same plan-code-stage-qa-deploy process.
Plan
During planning, desired features were broken down into feature stories, and priority bug fixes were scheduled as bug stories. The stories were written or adjusted so as to be achievable within the upcoming sprint. The team agreed to a three-week plan, which was then communicated to external stakeholders.
Code
Once the plan was laid out in the form of stories for the current sprint, developers coded the features and bug fixes. Feature branches in our git repos isolated complex stories from the master branch. On every commit to the master branch, a Jenkins continuous integration server ran the unit tests for both primary repositories.
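To make that testing step concrete, here is what a unit test picked up by such a CI run might look like in Clojure. The namespace and the function under test are invented for this sketch, not Room Key's actual code.

```clojure
(ns example.rates-test
  (:require [clojure.test :refer [deftest is testing]]))

;; Hypothetical pure function under test -- invented for this sketch,
;; not Room Key's actual code.
(defn normalize-stay
  "Swap check-in and check-out when they arrive out of order."
  [{:keys [checkin checkout] :as stay}]
  (if (and checkin checkout (pos? (compare checkin checkout)))
    (assoc stay :checkin checkout :checkout checkin)
    stay))

(deftest normalize-stay-test
  (testing "inverted date ranges are swapped"
    (is (= {:checkin "2016-05-01" :checkout "2016-05-03"}
           (normalize-stay {:checkin "2016-05-03" :checkout "2016-05-01"}))))
  (testing "well-ordered stays pass through unchanged"
    (is (= {:checkin "2016-05-01" :checkout "2016-05-03"}
           (normalize-stay {:checkin "2016-05-01" :checkout "2016-05-03"})))))
```

Jenkins simply ran each repository's test suite (e.g., via a command like `lein test`) and failed the build on any failing assertion.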
Stage
Towards the end of the second week of the sprint, developers would wrap up their work and merge their commits into the master branch. A front-end release candidate was identified and staged into the back-end repository as a static asset. A back-end release candidate was then staged to a QA server.
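Mechanically, that staging step boils down to copying a built front-end bundle into the back-end repository, then committing and tagging the result. A rough sketch of that kind of script follows; the paths, file names, and tag format are all invented for illustration.

```clojure
(ns example.stage
  (:require [clojure.java.io :as io]
            [clojure.java.shell :refer [sh]]))

(defn stage-front-end!
  "Copy a compiled front-end bundle into the back-end repo as a static
  asset, then commit and tag the release candidate. All paths and the
  tag format here are hypothetical."
  [build-dir backend-repo rc-tag]
  (io/copy (io/file build-dir "app.js")
           (io/file backend-repo "resources/public/app.js"))
  (sh "git" "-C" backend-repo "add" "resources/public/app.js")
  (sh "git" "-C" backend-repo "commit" "-m" (str "Stage front-end as " rc-tag))
  (sh "git" "-C" backend-repo "tag" rc-tag))

;; e.g. (stage-front-end! "frontend/dist" "../backend" "rc-2016-05-01")
```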
QA
Once the release candidate was staged, the QA team had a budget of roughly four days to perform regression and feature tests. Most of the feature tests were performed manually against the acceptance criteria outlined in the feature stories. The regression tests were partially automated and ran in the browser.
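For a flavor of what the automated portion of such in-browser regression tests can look like, here is a browser-driven smoke test in the style of clj-webdriver's taxi API; the URL and CSS selectors are placeholders, not our actual suite.

```clojure
(ns example.regression-test
  (:require [clojure.test :refer [deftest is use-fixtures]]
            [clj-webdriver.taxi :as taxi]))

;; Start and stop a real browser around the whole test run.
(use-fixtures :once
  (fn [run-tests]
    (taxi/set-driver! {:browser :firefox})
    (try (run-tests)
         (finally (taxi/quit)))))

(deftest search-smoke-test
  ;; URL and selectors are placeholders for this sketch.
  (taxi/to "https://staging.example.com/")
  (taxi/input-text "#destination" "Chicago")
  (taxi/click "button#search")
  (taxi/wait-until #(taxi/exists? ".hotel-result") 10000)
  (is (taxi/exists? ".hotel-result")))
```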
Deploy
Once QA signed off on the release candidate, we deployed the artifacts to AWS by adjusting a CloudFormation template.
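One plausible form of that adjustment is pointing the template at the new release-candidate artifacts before updating the stack. A minimal sketch of the template edit, assuming a JSON template and the cheshire JSON library; the ArtifactVersion parameter is invented for illustration.

```clojure
(ns example.deploy
  (:require [cheshire.core :as json]))

(defn point-template-at-release
  "Rewrite the default value of a (hypothetical) ArtifactVersion
  parameter in a CloudFormation template to the new release candidate."
  [template-path rc-version]
  (let [template (json/parse-string (slurp template-path))
        updated  (assoc-in template
                           ["Parameters" "ArtifactVersion" "Default"]
                           rc-version)]
    (spit template-path (json/generate-string updated {:pretty true}))))

;; e.g. (point-template-at-release "prod-stack.json" "rc-2016-05-01")
```

The updated template was then applied to the running stack with the standard AWS tooling.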
Problems
Our approach was appealing in many ways: it was easy to explain; it provided regular feedback; it was efficient compared to old-school waterfall planning methodologies; it ticked a lot of the "agile" boxes; and it promised the ability to update our product every three weeks. But it had some painful failure modes that were exposed all too often, because our reality was messy.
Our reality was that:
- Bugs from the previous sprint ate up wildly varying amounts of our time each sprint.
- Stories changed in scope, and new high-priority stories were added after the sprint started.
- Available resources changed due to sickness or other personal reasons.
- The QA team found bugs of wildly varying severity in the new features.
- Deployment issues occasionally arose.
If these problems were encountered early in the sprint, we could usually recover. Problems uncovered late in the sprint had more severe repercussions. Obviously, there was less time to recover without missing the deployment schedule. But we also found that the further into the development cycle a failure appeared, the more likely it was to block every in-progress story. Why? Because we coupled together all the stories and commits in a sprint... in essence, we willingly donned a straitjacket.
This coupling phenomenon, in which one story (or even one commit) can torpedo every story in a sprint, can be explained by this observation:
A primary goal of planning is to decompose high-level deliverables so that developers can work independently. But for QA to test the deliverables, the independent pieces need to be reconstituted into a recognizable feature. And for deployment, all scheduled features need to be reconstituted into a single artifact.
This is where the original draft of the blog post ended. Shortly after I drafted it, I became the CTO of Roomkey and addressed the issues above by adopting a continuous deployment model. It was not easy to get from "here" to "there", but the payoff was worth it. We parallelized and decoupled stories, invested in automated testing to find problems early, and changed our processes fundamentally. Our approach probably will not work for most companies.
The biggest challenge was getting developers to buy into their responsibility for finding bugs early. Most developers accept responsibility for the bugs that they introduce. Fewer, but still many, accept responsibility for proactively finding bugs in the code for which they are responsible. Too few accept the collective responsibility for finding bugs, before delivery to QA, that result from integrating their code with the code of other developers. At Roomkey, this last challenge was a struggle. In a mature code base with many authors, I believe that a robust test suite is one of the best ways to reduce the number of regressions.