class: center, middle, inverse, title-slide # Docker ## For data analysis ### Sam Abbott ### 2018-05-24 (updated: 2018-05-24) --- class: center, middle, inverse #Overview --- #Overview - Why Docker - What is Docker - The basics of Docker - Building a Dockerfile - The basics of Docker compose - Continuous integration and Docker --- # Overview - Example: Rstudio Server as a development environment - Example: Jupyter Notebooks as a development environment - Example: Shiny app deployment - Aside: ShinyPrioxy - Example: Scheduling processes, with an example twitter bot. - Example: Database updating at FC, using Python - Example: Data transformation, model fitting, predictions - Offer rate tracker (dev) - Wrap up --- # Why Docker - Reproducible environment, removing works on my machine issues. Move between machines and analysts with ease. - Partition tasks to their own environment, securing login credentials etc. - More task with the same resources (as docker uses a shared operating system. - Docker containers lend themsevles to continuous integration and deployment - Test deployments in production conditions with test databases etc. in the environment they will run in. - Useful for open source projects like R where libraries are constantly being updated, potentially breaking your analysis. --- class: center, middle, inverse #What is Docker --- #[What is Docker](https://opensource.com/resources/what-docker) *"Docker is a tool designed to make it easier to create, deploy, and run applications by using containers"* ##[What is a container](https://www.docker.com/what-container) *"A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings"* --- class: center, middle, inverse # The basics of Docker --- #[ The Docker Engine](https://docs.docker.com/engine/docker-overview/#the-docker-platform) ##Server Long running process known as a daemon ##REST API Specifies the interface ##Command line interface Uses the REST API to control the daemon. --- #Common commands - What images do I have locally? ```bash docker images ``` - What containers are running? ```bash docker ps ``` - Get a new container ```bash docker pull seabbs/fcdashboard ``` - Run a container ```bash docker run -p 3838:3838 seabbs/fcdashboard ``` --- #Common commands - How do I stop this container? ```bash docker stop <container-id> ``` - How do I remove this container? ```bash docker rm <container-id> ``` --- class: center, middle, inverse #Building a Dockerfile --- #Choosing a base image - Dockerfiles can be built from scratch but it is better to base a container on a community maintained one. ## Funding Circle ```bash FROM quay.io/fundingcircle/alpine-envconsul:3.7 ``` --- #Choosing a base image ## [Rocker](https://hub.docker.com/u/rocker/) ```bash ## Rstudio server FROM rocker/rstudio ## Rstudio Server with tidyverse packages FROM rocker/tidyverse ## Shiny server FROM rocker/shiny ## Version controlled R image - tagged to 3.4.4 (uses MRAN) rocker/r-ver:3.4.4 ``` --- #Choosing a base image ## [Jupyter](https://hub.docker.com/u/jupyter/) ```bash ## Jupyter notebook - many other variants FROM jupyter/datascience-notebook ``` ## Other - Docker hub contains prebuilt community maintained images for many other systems. Often the easiest method to install software/packages such as Selenium, Tensorflow, databses etc. --- # Installing libraries and R packages ```bash ## Get libs required by packages RUN apt-get update && \ apt-get install -y \ libssl-dev \ libcurl4-openssl-dev \ libssh2-1-dev \ libnlopt0 \ libnlopt-dev \ libudunits2-dev \ libxml2-dev \ libgdal-dev \ libproj-dev \ && apt-get clean ## Install R packages - MRAN RUN Rscript -e 'install.packages(c("pkgconfig", "irlba", "igraph", "shinydashboard"))' RUN Rscript -e 'install.packages(c("shinyBS", "shinyWidgets", "tidyverse", "DT", "rmarkdown"))' RUN Rscript -e 'install.packages(c( "e1071", "caret", "ggfortify", "plotly", "lubridate", "wrapr", "stringr"))' ``` --- # Add files, setting work directories, exposing ports, and running a command ```bash ADD . home/fcdashboard WORKDIR home/fcdashboard EXPOSE 3838 ## Create log file CMD Rscript -e 'shiny::runApp(port = 3838, host = "0.0.0.0")' ``` --- #[The full dockerfile](https://github.com/seabbs/fcdashboard/blob/master/Dockerfile) - Full example dockerfile available on GitHub. #Running the example ```bash # If in local directory docker build . -t fcdashboard # If stored as an image in the cloud (here docker hub) docker pull seabbs/fcdashboard docker run -d -p 3838:3838 seabbs/fcdashboard ## Mounting a volume..... or use -v to mount a volume... see docs. --mount type=bind,source=$(pwd)/data/,target=/home/rstudio/fcdashboard/data ``` --- class: center, middle, inverse #Docker compose --- #Docker compose ## [What is docker compose](https://docs.docker.com/compose/overview/) *"Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services. Then, with a single command, you create and start all the services from your configuration"* ## Why docker compose - Specify ports, environment files, volumes to mount and containers to link together. --- #Common commands - build containers? ```bash docker-compose build ``` - Launch containers ```bash docker-compose up ``` - Bring down containers ```bash docker-compose down ``` --- ## [docker-compose.yml](https://github.com/FundingCircle/uk-data-scripts/blob/master/dwh_import/docker-compose.yml) ```bash version: '3' services: db: image: postgres:9.6 fcadb: image: quay.io/fundingcircle/postgres-with-data-uat:latest dwh_importer: build: . links: - fcadb - db env_file: 'docker-compose.env' ``` --- ## [docker-compose.env](https://github.com/FundingCircle/uk-data-scripts/blob/master/dwh_import/docker-compose.yml) ```bash FCA_HOST=fcadb FCA_PORT=5432 FCA_DBNAME=fundingcircle FCA_USER=wwwfc FCA_PW= DWH_HOST=db DWH_PORT=5432 DWH_DBNAME=warehouse DWH_USER=postgres DWH_PW= ``` --- class: center, middle, inverse # Continuous integration and Docker --- # Container registries - Store built images for sharing - Some can also autmatically build images from Dockerfiles - Personal experiance: Docker Hub - Funding Circle: Quay.io --- # [Docker hub](https://www.google.com/search?q=docker+hub) - Not used by Funding Circle? - Provides a public registry of docker images - Can [automate builds](https://hub.docker.com/r/seabbs/gettbinr/) from GitHub Repos can be set up, removing issue of local build time. --- ## [Quay.io](https://quay.io/repository/) - Contains all FC docker apps - Building handled by [circleci](https://circleci.com/?utm_source=gb&utm_medium=SEM&utm_campaign=SEM-gb-200-Eng-ni&utm_content=SEM-gb-200-Eng-ni-CircleCI&gclid=EAIaIQobChMIpca5jcie2wIVS5PtCh2mxg1CEAAYASAAEgIJ2_D_BwE) --- ## [CircleCI for Docker container deployment](https://github.com/FundingCircle/uk-data-scripts/blob/master/.circleci/config.yml) - Used by FC to build containers in the cloud - Tests are run - If passed container is pushed to quay.io - Can then be scheduled using the cluster --- class: center, middle, inverse #Issues --- #Issues - Within FC recommended to use `quay.io/fundingcircle/alpine-envconsul:3.7` as a base image - Alpine often used due to size. - This does not suport R, python etc. natively. - Advantage is ability to add secrets? Although this should be possible with other images... - Adding version controlled R and rstudio server is difficult. I have implemented [R](https://github.com/FundingCircle/risk-model-dashboards/blob/offerratetracker/offerratetracker/docker-config/containers/r-alpine/Dockerfile) but this is not yet available from quay. - For analysts using other base images would be preferable. - Can normal users push to quay, schedule automated builds from GitHub for analysis in development? --- class: center, middle, inverse #Examples --- # Rstudio server with tidyverse ```bash docker pull rocker/tidyverse docker run -d -p 8787:8787 -e USER=seabbs -e PASSWORD=seabbs --name rstudio rocker/tidyverse ``` --- # Jupyter notebook ```bash docker pull jupyter/datascience-notebook docker run -d -p 8888:8888 jupyter/datascience-notebook ``` --- # Shiny app ```bash docker pull seabbs/fcdashboard docker run -d -p 8787:8787 seabbs/fcdashboard ``` --- # Aside - [ShinyProxy](https://www.shinyproxy.io) - Open source alternative to shiny server - Each app is packaged as a docker container - Manage access etc natively - Scale with no restrictions - Compatible with clusters (i.e docker swarm/kubernetes) --- --- # Scheduling processes in a container ##Steps: - Write a script that pulls data, transforms and then pushes - Set up a script to launch job at desired time limit - Write dockerfile with all dependencies - CMD to run scheduling script and hold open container - Push docker container to registry - Pull docker image and launch. ##Example: - [Example](https://github.com/seabbs/TweetRstudioCheatsheets) - [Output](https://twitter.com/daily_r_sheets) --- --- # Scheduling processes using containers and a external manager - Managers include: Kubernetes, Jenkins, CRON scheduling ##Steps: - Write a script that pulls data, transforms and then pushes - Write dockerfile with all dependencies - Push to registry - Get job scheduled ##Example: - [Example](https://github.com/FundingCircle/uk-data-scripts/tree/master/dwh_import) --- --- # Data transformation, model fitting, predictions - Offer rate tracker (dev) ##Example: - [Example](https://github.com/FundingCircle/risk-model-dashboards/tree/offerratetracker/offerratetracker) --- # Wrap-up - Using docker for all analysis ensures that work is reproducible and easy to push to production. - Allows the analyst/data scientist to easily spin up the same environment on any compute resource. - Makes sharing work easier and quicker. - Makes using continuous integration and testing much easier. - FC makes good use of docker for production, introducing it to analysis process might improve knowledge sharing etc. - Several issues to iron out before easy to use docker for analysis at FC. - [`containerit`:](http://o2r.info/2017/05/30/containerit-package/) R package for automatically building dockerfiles. ---