class: center, middle, inverse, title-slide

# Docker
## For data analysis
### Sam Abbott
### 2018-05-24 (updated: 2018-05-24)

---
class: center, middle, inverse

#Overview

---

#Overview

- Why Docker
- What is Docker
- The basics of Docker
- Building a Dockerfile
- The basics of Docker compose
- Continuous integration and Docker

---

# Overview

- Example: RStudio Server as a development environment
- Example: Jupyter Notebooks as a development environment
- Example: Shiny app deployment
- Aside: ShinyProxy
- Example: Scheduling processes, with an example Twitter bot.
- Example: Database updating at FC, using Python
- Example: Data transformation, model fitting, predictions - Offer rate tracker (dev)
- Wrap up

---

# Why Docker

- Reproducible environments remove "works on my machine" issues. Move between machines and analysts with ease.
- Partition tasks into their own environments, securing login credentials etc.
- Run more tasks with the same resources (as Docker uses a shared operating system).
- Docker containers lend themselves to continuous integration and deployment.
- Test deployments in production conditions, with test databases etc., in the environment they will run in.
- Useful for open source projects like R, where libraries are constantly being updated, potentially breaking your analysis.

---
class: center, middle, inverse

#What is Docker

---

#[What is Docker](

*"Docker is a tool designed to make it easier to create, deploy, and run applications by using containers"*

##[What is a container](

*"A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings"*

---
class: center, middle, inverse

# The basics of Docker

---

#[ The Docker Engine](

##Server

Long running process known as a daemon.

##REST API

Specifies the interface that programs use to talk to the daemon.

##Command line interface

Uses the REST API to control the daemon.

---

#Common commands

- What images do I have locally?
```bash
docker images
```

- What containers are running?

```bash
docker ps
```

- Get a new image

```bash
docker pull seabbs/fcdashboard
```

- Run a container

```bash
docker run -p 3838:3838 seabbs/fcdashboard
```

---

#Common commands

- How do I stop this container?

```bash
docker stop <container-id>
```

- How do I remove this container?

```bash
docker rm <container-id>
```

---
class: center, middle, inverse

#Building a Dockerfile

---

#Choosing a base image

- Images can be built from scratch, but it is usually better to base one on a community-maintained image.

## Funding Circle

```dockerfile
FROM
```

---

#Choosing a base image

## [Rocker](

```dockerfile
## RStudio Server
FROM rocker/rstudio

## RStudio Server with tidyverse packages
FROM rocker/tidyverse

## Shiny server
FROM rocker/shiny

## Version controlled R image - tagged to 3.4.4 (uses MRAN)
FROM rocker/r-ver:3.4.4
```

---

#Choosing a base image

## [Jupyter](

```dockerfile
## Jupyter notebook - many other variants
FROM jupyter/datascience-notebook
```

## Other

- Docker Hub contains prebuilt, community-maintained images for many other systems. Pulling one is often the easiest way to install software such as Selenium, TensorFlow, databases etc.

---

# Installing libraries and R packages

```dockerfile
## Get libs required by packages
RUN apt-get update && \
    apt-get install -y \
    libssl-dev \
    libcurl4-openssl-dev \
    libssh2-1-dev \
    libnlopt0 \
    libnlopt-dev \
    libudunits2-dev \
    libxml2-dev \
    libgdal-dev \
    libproj-dev \
    && apt-get clean

## Install R packages - MRAN
RUN Rscript -e 'install.packages(c("pkgconfig", "irlba", "igraph", "shinydashboard"))'
RUN Rscript -e 'install.packages(c("shinyBS", "shinyWidgets", "tidyverse", "DT", "rmarkdown"))'
RUN Rscript -e 'install.packages(c( "e1071", "caret", "ggfortify", "plotly", "lubridate", "wrapr", "stringr"))'
```

---

# Adding files, setting the working directory, exposing ports, and running a command

```dockerfile
## Copy the app into the image and work from that directory
ADD . home/fcdashboard
WORKDIR home/fcdashboard

## Document the port the app listens on
EXPOSE 3838

## Run the Shiny app when the container starts
CMD Rscript -e 'shiny::runApp(port = 3838, host = "")'
```

---

#[The full dockerfile](

- Full example Dockerfile available on GitHub.

#Running the example

```bash
# If in local directory
docker build . -t fcdashboard

# If stored as an image in the cloud (here Docker Hub)
docker pull seabbs/fcdashboard

docker run -d -p 3838:3838 seabbs/fcdashboard

## Mount a local data directory with --mount (or use -v; see the docs)
docker run -d -p 3838:3838 \
  --mount type=bind,source="$(pwd)"/data/,target=/home/rstudio/fcdashboard/data \
  seabbs/fcdashboard
```

---
class: center, middle, inverse

#Docker compose

---

#Docker compose

## [What is docker compose](

*"Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration"*

## Why docker compose

- Specify ports, environment files, volumes to mount, and containers to link together.

---

#Common commands

- Build containers

```bash
docker-compose build
```

- Launch containers

```bash
docker-compose up
```

- Bring down containers

```bash
docker-compose down
```

---

## [docker-compose.yml](

```yaml
version: '3'
services:
  db:
    image: postgres:9.6
  fcadb:
    image:
  dwh_importer:
    build: .
    links:
      - fcadb
      - db
    env_file: 'docker-compose.env'
```

---

## [docker-compose.env](

```bash
FCA_HOST=fcadb
FCA_PORT=5432
FCA_DBNAME=fundingcircle
FCA_USER=wwwfc
FCA_PW=
DWH_HOST=db
DWH_PORT=5432
DWH_DBNAME=warehouse
DWH_USER=postgres
DWH_PW=
```

---
class: center, middle, inverse

# Continuous integration and Docker

---

# Container registries

- Store built images for sharing
- Some can also automatically build images from Dockerfiles
- Personal experience: Docker Hub
- Funding Circle:

---

# [Docker hub](

- Not used by Funding Circle?
- Provides a public registry of Docker images
- [Automated builds]( from GitHub repos can be set up, removing the issue of local build time.

---

## [](

- Contains all FC Docker apps
- Building handled by [circleci](

---

## [CircleCI for Docker container deployment](

- Used by FC to build containers in the cloud
- Tests are run
- If the tests pass, the container is pushed to
- Can then be scheduled using the cluster

---
class: center, middle, inverse

#Issues

---

#Issues

- Within FC it is recommended to use `` as a base image
- Alpine is often used due to its small size.
- This does not support R, Python etc. natively.
- An advantage is the ability to add secrets? Although this should be possible with other images...
- Adding version controlled R and RStudio Server is difficult. I have implemented [R]( but this is not yet available from quay.
- For analysts, using other base images would be preferable.
- Can normal users push to quay and schedule automated builds from GitHub for analysis in development?

---
class: center, middle, inverse

#Examples

---

# RStudio Server with tidyverse

```bash
docker pull rocker/tidyverse

docker run -d -p 8787:8787 -e USER=seabbs -e PASSWORD=seabbs --name rstudio rocker/tidyverse
```

---

# Jupyter notebook

```bash
docker pull jupyter/datascience-notebook

docker run -d -p 8888:8888 jupyter/datascience-notebook
```

---

# Shiny app

```bash
docker pull seabbs/fcdashboard

docker run -d -p 3838:3838 seabbs/fcdashboard
```

---

# Aside - [ShinyProxy](

- Open source alternative to Shiny Server
- Each app is packaged as a Docker container
- Manage access etc. natively
- Scale with no restrictions
- Compatible with clusters (e.g. Docker Swarm/Kubernetes)

---

# Scheduling processes in a container

##Steps:

- Write a script that pulls data, transforms it, and then pushes it
- Set up a script to launch the job at the desired time
- Write a Dockerfile with all dependencies
- Use CMD to run the scheduling script and hold the container open
- Push the Docker image to a registry
- Pull the Docker image and launch.
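
A minimal sketch of the scheduling script from the steps above, assuming a plain bash loop rather than whatever the linked example uses. The file name `schedule.sh`, the functions `run_job` and `schedule_loop`, and the placeholder job are all hypothetical.

```bash
#!/usr/bin/env bash
# schedule.sh (hypothetical name): rerun a job on a fixed interval,
# which also keeps the container's main process alive.
set -euo pipefail

# Placeholder job: swap in the real pull/transform/push script,
# e.g. Rscript update_data.R (hypothetical file).
run_job() {
  echo "job run at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# schedule_loop INTERVAL_SECONDS [MAX_RUNS]
# MAX_RUNS=0 (the default) means loop forever.
schedule_loop() {
  local interval="${1:-3600}"
  local max_runs="${2:-0}"
  local runs=0
  while true; do
    # Log a failure but keep the schedule going.
    run_job || echo "job failed, retrying at next interval" >&2
    runs=$((runs + 1))
    if [ "$max_runs" -gt 0 ] && [ "$runs" -ge "$max_runs" ]; then
      break
    fi
    sleep "$interval"
  done
}

# Start the loop only when arguments are given, e.g. `bash schedule.sh 3600`.
if [ "$#" -gt 0 ]; then
  schedule_loop "$@"
fi
```

In a Dockerfile this could be launched with `CMD ["bash", "schedule.sh", "3600"]`, which would rerun the job every hour and hold the container open.
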
##Example:

- [Example](
- [Output](

---

# Scheduling processes using containers and an external manager

- Managers include: Kubernetes, Jenkins, cron scheduling

##Steps:

- Write a script that pulls data, transforms it, and then pushes it
- Write a Dockerfile with all dependencies
- Push to a registry
- Get the job scheduled

##Example:

- [Example](

---

# Data transformation, model fitting, predictions - Offer rate tracker (dev)

##Example:

- [Example](

---

# Wrap-up

- Using Docker for all analysis ensures that work is reproducible and easy to push to production.
- Allows the analyst/data scientist to easily spin up the same environment on any compute resource.
- Makes sharing work easier and quicker.
- Makes using continuous integration and testing much easier.
- FC makes good use of Docker for production; introducing it to the analysis process might improve knowledge sharing etc.
- Several issues need ironing out before Docker is easy to use for analysis at FC.
- [`containerit`:]( R package for automatically building Dockerfiles.

---