Ram disk

Not long ago, I was running multiple Docker containers in parallel to score code submissions.

This involved compiling the code, and running a common set of tests on the compiled executable to determine the score.

When examining the output, I noticed that some submissions were failing to compile.

Diagnosis

Though there was a 30s timeout on the compile step to prevent broken submissions from running infinitely, I didn’t expect submissions to actually time out given the relatively small scale of the project.

After performing a couple of runs, I determined that this issue was intermittent, and occurring for different submissions each time. I also noticed that my reference submission which was written in Rust was failing more frequently.

I tried to compile some of these submissions outside of Docker, but was unable to reproduce the issue.

By chance, I then found that invoking clang (with no arguments) after restarting the machine was taking more than 20 seconds. Given that it was slow on the first run but not subsequent invocations, I thought to caching.

Perhaps there was an issue or limit with Disk I/O?

I filed a ticket with the provider.

Workaround 1

After reading documentation from previous years, I noticed that the documented setup had Docker on an external volume rather than the boot volume.

I added an external volume for Docker, and the timeouts went away.

Given that I was no longer experiencing the issue, I replied to the email from the provider with my solution and they closed the ticket.

Workaround 2

A few weeks later, I was scoring a new set of submissions for a new project.

When examining the output, I found that Rust submissions with external crates were often failing to compile, and many submission were failing on a test which indirectly required reading many small files from disk.

Another observation from watching docker ps was that though I requested parallel(1) to run 16 submissions at once, it was never running the full 16, and even dropped to 4 at one point.

Since it was a weekend, I didn’t immediately create another ticket and looked for alternative solutions.

The first thing that I tried was setting up Software RAID 0, and mounting Docker on the array. While this seemed to have improved results somewhat, some tests were still failing unexpectedly.

After some thinking, it occurred to me that since Disk I/O latency may be the issue, it might be worthwhile to try putting /var/lib/docker in memory instead.

This can be achieved with a ram disk, such as:

mount -o size=16G -t tmpfs none /var/lib/docker

Surprisingly, this worked on the first attempt. The parallelism increased, and tests were now passing consistently.

Since the RAM allocation for the VM was 64 GB, this solution was feasible for my use case and I was able to complete the testing, albeit with the minor inconvenience of recreating the ram disk after every restart.

← Previous Post