How not to use Docker

Robert Mălai
Engineering Manager @ 3Pillar Global

PROGRAMMING

Besides the official Docker documentation, there are several good sources on the Internet where you can read about how to kick-start your dockerized environment on your project, including Stack Overflow Docker tag, official Docker images for MySQL, and Postgres. Unfortunately, there aren't so many places where one can read about the failures in setting up Docker in various common use cases.

You may wonder "why would anyone be interested in such a thing?" I think the value of such stories lies in the fact that we discover how such solutions work by trial and error, many times - pretty much the same method applied by a 2nd grader. Do you find yourself in a similar situation? You probably do, so the ideas from here will bear some value to you.

If this is the first time you've heard about Docker, head here first https://docs.docker.com/get-started, and then return to this article. I used Docker 1.12 on macOS, in native mode - this means with Docker managing the build in hypervisor.

Database initialization: problem definition

We all use databases. It's the most common use case in every small or large distributed service or web application. If you don't use an ORM of any sort where your entities are "magically" created by the framework, then you are probably extracting some valuable data from a database that you don't primarily own. In that case, you'll need some data initialization where you, at the very minimum, will have to create the database and the tables needed by your application.

It is important to understand the generic pattern applied by MySQL or Postgres images to achieve exactly that:

The basic recipe will install the software and any utilities needed alongside, then will "mount" the data volume and launch the init script; the init script will check if the data folder is empty or not and, depending on that, it will launch the initialization scripts from the folder docker-entrypoint-initdb.d which were mounted by the user.

If you apply this recipe to initialize your database, it will work.

However, what if you have the following use case: using a Docker container in development, you mock the real database with some large quantities of data - let's say about a million? The above initialization will work, but it's time consuming. If you're combining this with several other images in a Travis configuration where you want to run your cool suite of automated tests, then you'll find yourself waiting for the container to become available … for a long time … say 15 minutes. This may not be an inconvenience in some cases, but most of the times it is, for a number of reasons:

You are using a Docker compose configuration locally and waiting 15 minutes is too much.
Your Continuous Integration processing queue is taking too long.

Database initialization: proposed solution

In one of our projects, where we needed to mock a Redshift database in the CI/CD setup, we've come across the problem described above. In order to circumvent it, we thought to initialize the database during image construction. We needed to alter the dB initialization script to the following workflow:

The excerpt of the Dockerfile that implements the above functionality is highlighted below:

# source: https://github.com/kiasaki/
  docker-alpine-postgres
FROM alpine

RUN echo 
"@edge http://nl.alpinelinux.org/alpine/edge/main" >> /etc/apk/repositories && \
   apk update && \
   apk add curl "libpq@edge<9.7" 
   "postgresql-client@edge<9.7" "postgresql@edge<9.7"  
   "postgresql-contrib@edge<9.7" && \

 curl -o /usr/local/bin/gosu -sSL "https://github.com/tianon/gosu/releases/download/1.2/gosu-amd64" && \
    chmod +x /usr/local/bin/gosu && \
    apk del curl && \
    rm -rf /var/cache/apk/*

ENV LANG en_US.utf8
ENV PGDATA /var/lib/postgresql/data/

ENV POSTGRES_DB datastore
ENV POSTGRES_USER docker
ENV POSTGRES_PASSWORD letmein

RUN mkdir -p /opt/setup/data-scripts.d/
RUN mkdir -p /zdata/
COPY ./data-scripts.d/* /opt/setup/data-scripts.d/

WORKDIR /opt/setup/
COPY db-setup.sh /opt/setup/
COPY db-pack.sh /opt/setup/
COPY db-run.sh /opt/setup/

RUN ./db-setup.sh
RUN ./db-pack.sh

VOLUME $PGDATA

EXPOSE 5432

ENTRYPOINT [ "/opt/setup/db-run.sh" ]

CMD [ "postgres" ]

For brevity, we'll ignore the packing and unpacking scripts, though one can find them in this gist:

https://gist.github.com/robert-malai/19d139937a6b51d2709a6584702b4e5e. The remaining part is the logic baked in when the container is launched - that part is implemented in db-run.sh:

#!/bin/sh

set -e

if [ ! -f "$PGDATA/PG_VERSION" ]; then
    echo "Restoring $PGDATA ..."
    tar -xf /zdata/backup.tar -C $PGDATA
    sync
    echo "Done."
else
    echo 
      "$PGDATA was already there, skipping restore."
fi

echo "Launching command: $@ ..."
if [ "$1" = 'postgres' ]; then
    gosu postgres "$@"
else
    exec "$@"
fi

During the bootstrap of the container, the whole database is being unpacked and restored in the data mounted folder, and, at the end, the server is launched via the "postgres" mnemonic.

Conclusion

Running the above container increased the image build to about 15 minutes - the initialization scripts comprised of about 2 million records on a very simple table structure, with only 5 fields. The good part is that the container took only about 2 minutes to be bootstrapped from the base image. This greatly improved the user experience of the developers who will debug locally using a database, which is very close to what is in production.

PROs

Greatly reduced container bootstrap time
Reliable

CONs

Requires image repository
Maintenance of the database Dockerfile

The customization of the database initialization scripts proves to be of moderate difficulty (it's by no means a trivial task), but the end result was reliable, so we can recommend this approach to anyone who finds themselves in a similar situation.