Things that keep me awake at night – DevOps pipeline Quality Assurance

I’m gathering my thoughts as I organize and explore this topic. Suggestions on where to learn more are welcome.

I’ll be simplifying things initially. I plan to start by focusing on the pipeline and tools used in the DevOps culture. I expect that the deeper I dive into this topic, the more precise the next posts will be.

For simplicity, let’s assume we are developing a cloud-based web application with a DevOps toolchain that includes tools like Chef, Puppet, Jenkins, Docker, Packer, AWS, New Relic, and Splunk. How do you test a deployment pipeline built on top of these?

I have to start somewhere. I know this: you can approach testing software by dividing the problem into separate areas, researching them, and executing any necessary actions, including finding and resolving issues. The result should hopefully be a high quality, or at least acceptable, product.

Let me try applying these areas to the DevOps toolchain, and list the questions/topics that emerge.

  • Is it working as expected? What does working as expected mean to you? To your stakeholders?
  • Do you have unit tests? Integration? End-to-end? How many is enough?
  • Do you need to do any manual testing after a pipeline step is executed?
Automation / Automatability / Testability
  • Are you going to automate the testing? Why yes? Why not? How much?
  • If yes – which tools will you use? Are they free? What alternatives do you have?
  • Is the toolchain automation-friendly? Was it created with automation in mind?
  • Is it testing-friendly in general? Do you have hooks / breakpoints to make it easy to test?
Usability
  • Is there a certain User Experience your DevOps tools should deliver?
  • Is the pipeline error-prone? Can somebody deploy a test build to production by mistake? Can they destroy your current production stack by clicking on a badly described button?
  • Do you need to support keyboard shortcuts? Arrow keys / tabs to navigate?
  • Does the UI support long/short inputs for build names, or components? High build numbers?
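To make the "deploy a test build to production by mistake" question concrete, here is a minimal sketch of a guard that a deploy job could call before doing anything destructive. The function name, the build types, and both rules are my assumptions for illustration, not features of any particular tool.

```python
class DeployRefused(Exception):
    """Raised when a deployment request fails a safety check."""


def check_deploy(build_type: str, target_env: str, typed_confirmation: str = "") -> None:
    """Raise DeployRefused unless the request passes the (assumed) safety rules."""
    if target_env != "production":
        return  # Non-production targets are not restricted in this sketch.
    # Assumed rule 1: only release builds may reach production.
    if build_type != "release":
        raise DeployRefused(f"{build_type!r} builds cannot go to production")
    # Assumed rule 2: the operator must type the environment name to confirm,
    # so a single click on a badly described button is not enough.
    if typed_confirmation != target_env:
        raise DeployRefused("type the environment name to confirm")
```

A guard like this is also easy to unit test, which feeds back into the testability questions above.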
User Acceptance
  • Who is your customer? What acceptance do you need from them?
  • Would you do A/B testing for your pipeline?
Installation / Integration / System
  • What do you need to integrate with? For example – would you file JIRA tickets automatically if something goes wrong?
  • Do you need a database? Which version?
  • What operating system will your toolchain run on? What OS will you support for developing it?
  • When depending on a 3rd party, are you willing to rely on their uptime? What if a critical cloud-based tool goes down when you urgently need to deploy a hotfix?
  • Will the 3rd party let you know of planned downtime? Is the downtime in a timezone suitable for you?
  • Do you have a backup?
  • What platforms should you be compatible with? AWS? OpenStack? Azure? Are you going to test all of them?
  • If your pipeline is web-based, which browsers will you support? Can bad rendering in Safari cause an error? What about strict Firefox security? What if the users are running Chrome with a JS-blocking extension?
  • Any potential compatibility issues between your tools? Should you test every new version of one tool against the others?
  • Do you have any dates or numbers showing up in the pipeline? 1.000 and 1,000 are not the same… same goes for 6/12/2016…
  • Monday is not the first day of the week for everybody. Do you care?
  • If you have user input – does it support non-ASCII characters? Does it have to?
  • Any of your users need a localized UI?
  • If some of your resources are outside your country, would you support them? What if part of the deployment needs a phone number, but it’s in an unfamiliar format from another country?
  • Are you required to meet certain requirements like SOX or HIPAA? Can your DevOps toolchain and code assure at least part of the compliance?
  • Any export regulations you might be violating with your DevOps code? What if a certain country requires that data is stored locally, but your tools deploy a server on a different continent?
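The dates-and-numbers question in the list above is easy to show in code. This is a small sketch using only the Python standard library; the helper name `parse_number` and the idea of passing separators explicitly are my own choices for illustration.

```python
from datetime import datetime
from decimal import Decimal


def parse_number(text: str, decimal_sep: str, group_sep: str) -> Decimal:
    """Parse a number using explicit separators instead of guessing them."""
    return Decimal(text.replace(group_sep, "").replace(decimal_sep, "."))


# The same characters mean different values under different conventions.
us_style = parse_number("1,000", decimal_sep=".", group_sep=",")  # one thousand
de_style = parse_number("1,000", decimal_sep=",", group_sep=".")  # one

# The same goes for dates: month-first versus day-first.
month_first = datetime.strptime("6/12/2016", "%m/%d/%Y").date()  # 12 June 2016
day_first = datetime.strptime("6/12/2016", "%d/%m/%Y").date()    # 6 December 2016

print(us_style == de_style, month_first == day_first)  # prints: False False
```

If such values show up in build metadata or reports, an unambiguous format like ISO 8601 dates sidesteps the problem.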
Stress / Load / Performance
  • Can you deploy 10 servers simultaneously? What about 10000?
  • How long does it take to deploy the infrastructure? Is 1 hour acceptable? What if 10 minutes is too long?
  • Did anyone even define these requirements?
  • Do you track any of the performance metrics?
Security
  • Do you take any user input? Can a malicious user infect other users? Steal their passwords? Admin password?
  • Do you store sensitive data in your Jenkins jobs? If so, how do you store it securely?
  • How will you prevent users from committing their AWS credentials to public repositories?
  • Do you remove all access when terminating employees?
  • Do you use access control? Do you audit user actions? Should you?
  • Who is really implementing security? Can a single engineer misconfigure the firewall on all your production servers?
  • Do you have enough logging to know why something went wrong? Do all 3rd party tools have enough logging?
  • Where are your logs?
  • Do you have alerts / notifications in place?
  • What are the configuration options for your jobs?
  • What documentation do you need? Do you have enough if somebody decides to leave abruptly or falls under a bus?
  • Any public-facing documentation you want to / have to share?
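On the question above about AWS credentials ending up in public repositories: one common approach is to scan changes before they are committed. Below is a rough sketch of such a scan. The regular expression is a widely used heuristic for access key IDs (20 characters starting with AKIA or ASIA), not an official or complete rule, and it does not catch secret access keys. The function name is hypothetical, and wiring it into a pre-commit hook is left out.

```python
import re

# Heuristic only: access key IDs are commonly matched as a known prefix
# followed by 16 uppercase letters or digits.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line number, match) pairs for anything that looks like a key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group()))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = "region = us-east-1\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n"
print(find_key_ids(sample))  # prints: [(2, 'AKIAIOSFODNN7EXAMPLE')]
```

Existing open-source scanners do this far more thoroughly; the sketch is only meant to show that the check itself is testable like any other code.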
Adoption / Metrics and Instrumentation
  • Any metrics you want to track?
  • Do you need to add instrumentation to the jobs to know where the bottlenecks are?
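As a starting point for the instrumentation question, here is a minimal sketch that times named pipeline stages and reports the slowest one. The class and stage names are made up for illustration, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations for named pipeline stages."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.durations[name] = time.perf_counter() - start

    def slowest(self):
        """Return the name of the stage that took the longest."""
        return max(self.durations, key=self.durations.get)


timer = StageTimer()
with timer.stage("build"):
    time.sleep(0.01)  # placeholder for the real build step
with timer.stage("deploy"):
    time.sleep(0.05)  # placeholder for the real deploy step
print(timer.slowest())
```

Once durations are recorded somewhere, questions like "is 1 hour acceptable?" can be checked against real numbers instead of impressions.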
Upgrade / Rollback
  • How will you test new versions of the tools? Are you ready to roll them back? Will they work after rollback?
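One way to think about the upgrade and rollback questions is as a small, testable procedure: install the candidate, check health, and go back if the check fails, then check health again because "will they work after rollback?" is a separate question. The sketch below assumes you can supply an `install` function and a `healthy` check for your tool; both are hypothetical placeholders.

```python
from typing import Callable


def upgrade_with_rollback(
    install: Callable[[str], None],
    healthy: Callable[[], bool],
    current: str,
    candidate: str,
) -> str:
    """Try the candidate version; return the version left running."""
    install(candidate)
    if healthy():
        return candidate
    # The candidate failed its health check, so go back to the known version.
    install(current)
    if not healthy():
        # The rollback itself did not restore a working state.
        raise RuntimeError(f"rollback to {current} did not restore health")
    return current
```

Because the install and health-check steps are passed in, the procedure can be exercised with fakes long before it touches a real Jenkins or Chef server.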
Rollout strategy
  • What is your must-have vs nice-to-have? What tools depend on each other?
  • Can you define phases of your DevOps toolchain deployment?
  • Have you identified all the resources you need for testing?
  • Environmental resources like hardware, and software that you need?
  • What about licenses? Any legal review of these needed?
  • Are you well staffed? Any training your engineers need?
  • Documentation, artifacts… what else do you need to deliver?
Vendor / 3rd party
  • When working with a vendor on your DevOps implementation: how much would you want them to test vs you? What is their testing strategy? How much testing overlap should happen? What do they need to deliver?
Definition of done
  • When can you tell you are happy with the testing of the DevOps toolchain?
  • Do you need to sign off? Who else signs off?


These are just some initial thoughts. What do you think of these? What’s missing?