Parallelizing Jenkins Pipelines | Medallia Engineering Blog


Parallelizing Jenkins Pipelines

Pipelines are one of the most powerful tools Jenkins offers and the standard way to define build jobs. The problem is that they are so flexible that it is easy to keep adding tasks until run times increase dramatically.

It becomes increasingly difficult to maintain a fast CI feedback cycle while running all the quality tasks: build, static checks, tests, and so on. Developers want fast feedback on incremental changes, master needs to run quickly to confirm everything works, release branches need to tag the git repository before publishing, and so on.

Overview

We will talk about our experience at Medallia dealing with a repository with over 1,000,000 lines of code, multiple Gradle subprojects, more than 15,000 integration tests that run outside of Jenkins, and over 150 developers working on it.

We will show the strategy the Test Engineering team used to parallelize the stages, which cut the run time of our feature-branch builds (40 minutes) and release-branch builds (80 minutes) roughly in half. To reach these new run times we used the parallel step in a way that makes all our stages run on the same Jenkins agent. We will go through why and how we did this, plus other alternatives that might suit your project better.


Drawing by Rodrigo Fernández.

For anyone who would like to implement this solution, we list a series of lessons to consider at the end.

When executed sequentially, the pipeline looks like this:

How to start parallelizing

Designing build blocks

First of all imagine how your pipeline should look based on everything you need to run. This is a list of considerations:

  1. Check which stages must run sequentially because one depends on another. Two examples:

    1. Fail fast: if a global stage like checkout fails, the rest of the pipeline should not run
    2. Git-tag a release commit before publishing the binary for deployment, because the tag version is used as the binary name
  2. Parallelize stages that are not dependent on one another. For example:

    1. unit tests and static code checks can run in parallel
  3. When a stage does multiple things that could be split and that have different priorities, move the lower-priority task further down the pipeline. For example:

    1. Unit test report generation is usually a separate step inside a Unit Test stage. Moving report generation to the end of the pipeline saves a few seconds on an already long stage and, more importantly, lets the stage fail faster, so developers find out what is going on sooner
  4. Make it replay-safe

    1. In a sequential pipeline, having stages ready for replay should not be a big issue.
    2. Making a parallel pipeline replay-safe isn't a lot of work, but you need to be mindful of parallel stages that might fail: when replayed, all of them will run again. This is especially sensitive for blockers. For example:
      • An upload to Artifactory can't publish the same artifact twice, so if it runs in parallel with something that failed transiently, the replay will fail because the artifact is already published

With that list done, create blocks that will run in parallel, with the right priority.
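Put together, the high-level shape of a scripted Jenkinsfile following this design is a short sequence of blocks: sequential blockers first, then parallel groups ordered by priority. A minimal sketch (the agent label and stage names here are illustrative, not our actual pipeline):

```groovy
node('builder') {             // 'builder' is a hypothetical agent label
    // Block 1: global blockers run sequentially; if checkout or the
    // build fails, nothing downstream should run at all.
    stage('Checkout') { checkoutSCM() }
    stage('Build') {
        dir('main_repo_dir') { sh './gradlew --no-daemon build' }
    }

    // Block 2: fan the workspace out to the parallel directories
    //          (failFast: true, any copy failure is a blocker).
    // Block 3: independent quality stages (failFast: false).
    // Block 4: low-priority tasks such as report generation.
}
```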

Parallel strategy

Our project is so big that we decided to run the parallel stages on the same agent in separate, nested workspaces, which reside inside the main workspace.

This has pros and cons, but since we use git history to decide what to run, we needed the whole git repo, so stashing was not an option.

Our design is to check out into a directory inside the main workspace and then copy it in parallel to the other directories used by the different stages.

How does it work?

Code checkout should be done inside a particular directory (targetDir in the example code) so the copy is self-contained.

def checkoutSCM() {
    def targetDir = 'main_repo_dir'
    checkout([
            $class                           : 'GitSCM',
            branches                         : scm.branches,
            doGenerateSubmoduleConfigurations: scm.doGenerateSubmoduleConfigurations,
            extensions                       : scm.extensions + [[$class: 'RelativeTargetDirectory', relativeTargetDir: targetDir], [$class: 'CleanBeforeCheckout']],
            submoduleCfg                     : [],
            userRemoteConfigs                : scm.userRemoteConfigs
    ])
}

After that, build the code so you have the final version to distribute anywhere. In this case we have three other directories: ‘parallel_dir1’, ‘parallel_dir2’ and ‘parallel_dir3’.

/**
* BLOCK 2
* Once the code is checked out and the gradle build has run,
* copy the full content to each of the directories
* used for the parallel stages
*/
parallel (
   copy_to_dir1: {
       dir("parallel_dir1") {
           sh 'rm -rf ./*'
           sh 'cp -Rf ../main_repo_dir/. .'
       }
   },
   copy_to_dir2: {
       dir("parallel_dir2") {
           sh 'rm -rf ./*'
           sh 'cp -Rf ../main_repo_dir/. .'
       }
   },
   copy_to_dir3: {
       dir("parallel_dir3") {
           sh 'rm -rf ./*'
           sh 'cp -Rf ../main_repo_dir/. .'
       }
   }, failFast: true
)

This code snippet shows 3 important concepts:

  • Each parallel step should be treated as a block of code in your Jenkinsfile, and the blocks should not be interchangeable, though the stages inside a block can be. That ordering comes from the main design

  • Each directory must be cleaned before running anything. If your project always starts from a clean environment you won't need this

  • The failFast flag is part of the design. In this case, if any copy fails, the pipeline must fail immediately. In other cases it is better to finish all stages before failing, as we will see next.

One important point, which you probably noticed already, is that this copy is a fixed cost that is not present in the sequential pipeline, so it should be reduced as much as possible: it is pure build-time overhead. The cost depends on the size of your project; for us it started at around 10 seconds and, due to an IO issue, once climbed to 3 minutes.
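Since this copy is pure overhead, it is worth logging its duration on every build so a regression (like the IO issue above) is noticed early. A hypothetical timing wrapper for a scripted pipeline:

```groovy
// Hypothetical helper: runs a closure and logs its wall-clock time,
// so the fixed cost of the copy block shows up in every build log.
def timed(String label, Closure body) {
    def start = System.currentTimeMillis()
    body()
    echo "${label} took ${(System.currentTimeMillis() - start) / 1000}s"
}

timed('workspace copies') {
    parallel(
        copy_to_dir1: {
            dir('parallel_dir1') {
                sh 'rm -rf ./*'
                sh 'cp -Rf ../main_repo_dir/. .'
            }
        },
        // ...copy_to_dir2 and copy_to_dir3 as above...
        failFast: true
    )
}
```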

The next block will make its stages run on the specific directories where code was copied.

/**
* BLOCK 3
* test stage1
* test stage2
* Runs unit tests
* Publishes something to S3
*/
parallel (
   test_stage1: {
       dir("parallel_dir3") {
           stage("test stage1") {
               echo "running test stage1"
           }
       }
   },
   test_stage2: {
       dir("parallel_dir1") {
           stage("test stage2") {
               echo "running test stage2"
           }
       }
   },
   unit_tests: {
       dir("parallel_dir2") {
          stage("Unit Tests") {
             //RUN TESTS
             sh "./gradlew --no-daemon --parallel --continue test"
          }
       }
   },
   s3_upload: {
       dir("main_repo_dir") {
           stage("S3 Upload") {
               retry(NUMBER_UPLOAD_RETRIES) {
                   timeout(10) {
                       //UPLOAD TO S3
                   }
               }
           }
       }
   }, failFast: false
)

The first thing to notice is… failFast: false. Everything should finish before notifying: unit tests can take a few minutes, and if test_stage1 fails you don't want the pipeline to abort immediately.

New questions might come up from this piece of code:

  • Can you add multiple stages inside each parallel closure?

    • Yes, but it's not recommended because you lose control over what happens inside it

    • Also, Blue Ocean shows the closures, not the stages, so logic would be hidden

  • Is it the same to run any stage in any of the parallel directories?

    • Yes, but we found two main reasons to keep track of which directory executes each stage

      • Optimizations are possible if similar stages run in the same directory, because some fixed costs are paid only once. For example, publishing the binary to S3 and to Artifactory from the same directory in different blocks reduces build time, as the second one reuses the code the first one packaged

      • Output from one stage may be needed by another stage; if that stage runs in a different directory, it won't see it. Example: test execution and test reporting could be stages in separate blocks, but if reporting doesn't read the files from the same directory as execution, it will fail

  • Why is there a --parallel flag in the unit test command?

    • It depends on how many threads are started both by Gradle and by Jenkins, so it is something to watch. For more information you can read Gradle Performance
  • Why does the unit test have a --continue argument?

    • To go along with failFast: false, we don't want a single failing test to stop the whole execution, but rather to let the tests run to completion and report accordingly
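The directory-affinity point is easiest to see in code. In this sketch, test execution and report generation live in different blocks but both use parallel_dir2, so the reporting stage can read the result files the test run produced (the junit result path assumes Gradle's default test-report layout):

```groovy
// Block 3: execute the tests in parallel_dir2
parallel(
    unit_tests: {
        dir('parallel_dir2') {
            stage('Unit Tests') {
                sh './gradlew --no-daemon --parallel --continue test'
            }
        }
    },
    failFast: false
)

// Block 4: report from the SAME directory; in any other directory the
// XML files written by the test run above would not exist.
parallel(
    report_results: {
        dir('parallel_dir2') {
            stage('Report Results') {
                junit 'build/test-results/test/*.xml'
            }
        }
    },
    failFast: false
)
```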

How does our pipeline look at the end?

As you can see, the pipeline shows each parallel block with its stages, and it looks right. Any change from here is part of a redesign. For example, you might want to move the test_trigger stage (which runs integration and functional testing) to the previous block so a different process can run it while the pipeline builds.

As previously mentioned, Blue Ocean hides stages inside blocks that are not part of a parallel step, as you can see here:

report_results: {                      // THIS IS WHAT YOU SEE IN THE PIPELINE
    dir("parallel_dir2") {
        stage("Report Results") {      // THIS IS HIDDEN
            //DO SOMETHING
        }

        stage("Minimum coverage") {    // THIS IS HIDDEN
            //DO SOMETHING ELSE
        }
    }
},

Alternatives to this solution

If your project is smaller and you don't depend on git data to build anything (or at least not all of it), there are a couple of other parallel options. The most common one is to run stages on different agents by stashing your binary or code and then unstashing it.

Here’s an example of stashing and unstashing on the same agent, similar to the copy shown above.

stage('build') {
    dir('main_repo_dir') {
        sh 'gradle build'
        // stash everything under this directory, using the name
        // that the unstash calls below refer to
        stash name: 'test', includes: '**/*'
    }
}

parallel (
   copy_to_dir1: {
       dir("parallel_dir1") {
           sh 'rm -rf ./*'
           unstash 'test'
       }
   },
   copy_to_dir2: {
       dir("parallel_dir2") {
           sh 'rm -rf ./*'
           unstash 'test'
       }
   },
   copy_to_dir3: {
       dir("parallel_dir3") {
           sh 'rm -rf ./*'
           unstash 'test'
       }
   }, failFast: true
)

Also, depending on what the tests and checks do, the binary can even be pushed somewhere like S3 and pulled by different downstream jobs to separate responsibilities. This has the downsides of using multiple agents, hitting network issues, and so on, but it will depend on you and the project.

Stashing the whole working directory for use elsewhere took over 10 minutes (stashing plus unstashing) for a project as big as ours, so all the gains from parallelizing would have been lost.

Another common option is to trigger multiple jobs independently. It has a similar advantage to the previous option, but a big drawback: it is hard to sync the results back from different places when that is needed.
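A sketch of that option, assuming downstream jobs named integration-tests and functional-tests exist: the build step triggers them, and wait: true with propagate: false lets the parent collect both results before deciding how to finish, which is precisely the syncing work mentioned above.

```groovy
parallel(
    integration: {
        // Wait for the downstream job but don't fail immediately,
        // so both results can be gathered and reported together.
        def run = build job: 'integration-tests', wait: true, propagate: false
        echo "integration-tests finished with ${run.result}"
    },
    functional: {
        def run = build job: 'functional-tests', wait: true, propagate: false
        echo "functional-tests finished with ${run.result}"
    },
    failFast: false
)
```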

Check the following diagram to see the various options. At Medallia we chose “Multiple Workspaces”.

Results and Conclusions

Accomplishments

Total time improved

We reduced the total time to almost half the duration of the sequential build: it became the sum of the slowest stage of each block.

The next level is optimizing those slow stages as much as possible to reduce overall time even more.

Developers got faster feedback

Because we could start the stages that depend on external tools (integration tests, functional tests, docker image builds) earlier, our overall developer life cycle shrank as well. And since the parallelization also covered our release pipeline, we now validate our binaries faster before releases.

Stages are correctly sorted

Sometimes it is hard to determine the order in which the stages of a pipeline should run, but this exercise helped us design the stages in a way that made sense to us.

Lessons Learned

Not everything is perfect. Down the road we noticed some builds taking as long as before; investigating and fixing them left us with some learnings, so this is our way of helping you implement this with as few bumps as possible.

Jenkins agents performance

Some builds started to crash badly when the parallelization ran. Why?

In our case, running multiple Gradle tasks at the same time, especially CPU- and memory-intensive ones like unit tests and static code checks, overloaded the agents. This is a problem that needs to be fixed at the infrastructure level, but it can affect the design of stages and blocks.

That is why, when making big changes to a pipeline that consumes resources, some type of performance testing is recommended to verify that the system supports the new load.

Gradle locking files

Because the build executes in one workspace, Gradle reads its cache from there, and when multiple stages run Gradle commands at the same time, the cache reads sometimes lock a file for too long and other stages time out.

This was fixed in Gradle 4.2.1, after which it happened only twice in over three months, down from a couple of times a day.

We also separated Gradle tasks such as build, check, test and publish to measure them in isolation, which affects how many times Gradle is executed.
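That split can look like the sketch below (build, check, test and publish are Gradle's standard tasks; the -x exclusions keep each invocation measuring only its own work, at the cost of paying Gradle's startup time once per stage):

```groovy
dir('parallel_dir2') {
    stage('Build')   { sh './gradlew --no-daemon build -x check -x test' }
    stage('Check')   { sh './gradlew --no-daemon check -x test' }
    stage('Test')    { sh './gradlew --no-daemon test' }
    stage('Publish') { sh './gradlew --no-daemon publish' }
}
```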

Disk space

Each build makes 4 full copies of the entire repo, with all its classes, so of course we filled the agents' disks.

The solution was simple: a post action that removes the parallel directories. We also adopted shallow clones to keep the git repo at a minimum, unshallowing only when necessary.
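In a scripted pipeline the post action can be a try/finally around the blocks; a sketch assuming the directory names used above:

```groovy
node('builder') {            // hypothetical agent label
    try {
        // ...all the blocks shown earlier...
    } finally {
        // Drop the extra copies so the agent's disk doesn't fill up;
        // main_repo_dir is kept so the next build can reuse the clone.
        sh 'rm -rf parallel_dir1 parallel_dir2 parallel_dir3'
    }
}
```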

Gradle tasks were not optimized

Since we run multiple Gradle tasks at the same time, we thought it might simply be too much for the pipeline, as mentioned earlier, but one interesting discovery was that Gradle was not working correctly and was building the whole project multiple times.

We used a plugin that shows which tasks take the longest to run to determine what was going on, and started a plan to fix it.

Troubleshooting

The classic Jenkins pipeline view is not very good at showing what is failing in a pipeline, even less so when stages run in parallel, since each stage is a different thread. That is why Blue Ocean, or the Pipeline Steps page in the classic view, helped a lot here; notification emails even send developers directly to that page. Unfortunately there is no way to tell a developer exactly which step failed.

Final thoughts

After a couple of weeks building the parallel Jenkinsfile and running it several times, we saw a big improvement in build times, but over time it became slower and slower, and that's when maintenance came into play.

There are multiple things that start to show when using parallel, so it's not only the optimization of build time but also the improvements to the build tool, process and infrastructure that really make a difference. We even had to change the way we document our pipelines.

Is it worth doing? Yes. Even though our design decisions were partly constrained by our project, parallelizing is a good way to improve the development experience and life cycle, and to use Jenkins agents to their fullest capacity.
