Reproducible ML Environments with Docker


I am Jyotiprakash, a deeply driven computer systems engineer, software developer, teacher, and philosopher. With a decade of professional experience, I have contributed to various cutting-edge software products in network security, mobile apps, and healthcare software at renowned companies like Oracle, Yahoo, and Epic. My academic journey has taken me to prestigious institutions such as the University of Wisconsin-Madison and BITS Pilani in India, where I consistently ranked among the top of my class.

At my core, I am a computer enthusiast with a profound interest in understanding the intricacies of computer programming. My skills are not limited to application programming in Java; I have also delved deeply into computer hardware, learning about various architectures, low-level assembly programming, Linux kernel implementation, and writing device drivers. The contributions of Linus Torvalds, Ken Thompson, and Dennis Ritchie—who revolutionized the computer industry—inspire me. I believe that real contributions to computer science are made by mastering all levels of abstraction and understanding systems inside out.

In addition to my professional pursuits, I am passionate about teaching and sharing knowledge. I have spent two years as a teaching assistant at UW Madison, where I taught complex concepts in operating systems, computer graphics, and data structures to both graduate and undergraduate students. Currently, I am an assistant professor at KIIT, Bhubaneswar, where I continue to teach computer science to undergraduate and graduate students. I am also working on writing a few free books on systems programming, as I believe in freely sharing knowledge to empower others.

The "Works on My Machine" Problem and How Docker Solves It

Ever uttered or heard the dreaded phrase, "But it works on my machine!"? It's a classic scenario in software development, and machine learning projects are certainly not immune. You develop a fantastic model, meticulously manage your Python libraries, and then, when a colleague tries to run your code or you attempt to deploy it, chaos ensues. Missing dependencies, conflicting library versions, different operating system quirks – the list goes on. This is where Docker steps in as a powerful solution.

What is Docker?

At its core, Docker is a platform that allows you to package your application, along with all its dependencies (libraries, system tools, code, runtime), into a standardized unit called a container. Think of a container as a lightweight, standalone, executable package that includes everything needed to run a piece of software.

Unlike traditional virtual machines (VMs), which virtualize an entire operating system on emulated hardware, containers share the host operating system's kernel and isolate applications at the process level. This makes them much more lightweight, faster to start, and less resource-intensive. You can run multiple containers on the same host machine, each isolated from the others and from the host itself, ensuring consistency across different environments.
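
This kernel sharing is easy to observe, assuming Docker is installed on your machine: the same kernel release shows up inside and outside a container.

```shell
# The host and the container report the same kernel release, because a
# container runs on the host's kernel instead of booting its own.
uname -r
docker run --rm ubuntu:22.04 uname -r
```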

Why Docker for ML?

The benefits of using Docker shine particularly bright in the realm of Machine Learning:

  • Reproducibility: This is the holy grail for ML. Docker ensures that your environment is precisely defined and can be recreated identically anywhere. Whether it's your laptop, a colleague's machine, or a cloud server, the code will run the same way, with the same package versions and configurations. This is crucial for debugging, verifying results, and ensuring scientific rigor.

  • Dependency Management: ML projects often involve a complex web of libraries (think TensorFlow, PyTorch, scikit-learn, Pandas, etc.) with specific version requirements. Dockerfiles explicitly list these dependencies, making it easy to manage and track them. No more "dependency hell"!

  • Collaboration: Share your Docker image or Dockerfile with collaborators, and they can instantly spin up an identical environment. This drastically reduces setup time and ensures everyone is on the same page.

  • Consistent Deployment: When it's time to move your ML model from development to production, Docker provides a seamless path. The same container that you used for development can be deployed to staging or production servers, guaranteeing consistency and reducing deployment-related bugs.

  • Isolation: Work on multiple projects with conflicting dependencies without them interfering with each other. Each project can live in its own isolated Docker container.

What We'll Build Today

In this blog post, we're going to walk you through setting up a comprehensive, reproducible Machine Learning environment using Docker. By the end, you'll have:

  • A Dockerfile that acts as the blueprint for our ML environment.

  • An entrypoint.sh script to initialize our container.

  • A docker-compose.yml file to easily manage and run our Docker setup.

This environment will feature a JupyterLab instance, pre-loaded with a wide array of popular and essential Python libraries for machine learning, data analysis, visualization, NLP, computer vision, and more. You'll be able to create this environment once and then run it consistently, wherever Docker is installed.

A Look Inside Our ML Toolkit

To make our Dockerized ML environment truly useful and versatile, we're pre-installing a comprehensive suite of system dependencies and Python libraries. This curated list ensures that you have the tools you need for a wide range of machine learning tasks, from data preprocessing and model training to visualization and interpretation, right out of the box.

Let's break down what's included:

A. System-Level Dependencies

First, we lay the groundwork by installing essential system packages on our Ubuntu base image. These are often prerequisites for Python libraries or provide core functionalities:

  • Python 3.10: We're using a specific and recent version of Python (python3.10, python3.10-dev for development headers, and python3.10-venv for virtual environment support) to ensure stability and access to modern language features.

  • python3-pip: The package installer for Python, crucial for managing our Python libraries.

  • build-essential: This package contains compilers (like GCC) and tools (make) necessary for building some Python packages from their source code, especially those with C/C++ extensions.

  • cmake: An open-source, cross-platform family of tools designed to build, test and package software. It's a dependency for several C++ based libraries.

  • git: The ubiquitous version control system, useful for cloning code or if some pip packages need to pull dependencies from Git repositories.

  • curl: A command-line tool for transferring data with URLs, often used for downloading files or by other scripts within package installations.

  • Graphics & Display Libraries (libgl1, libglib2.0-0, libsm6, libxext6, libxrender-dev): These provide support for graphics rendering, which can be dependencies for libraries like OpenCV and Matplotlib, even in a headless environment.

  • pkg-config: A helper tool used when compiling applications and libraries to provide correct compiler and linker flags.

  • libsentencepiece-dev: Development files for SentencePiece, a popular library for unsupervised text tokenization, frequently used in Natural Language Processing.

B. Python Packages (Installed within a Virtual Environment)

All Python packages are installed inside a dedicated virtual environment (/opt/venv) to keep them isolated and well-managed. Here’s a categorized overview:

  1. Core ML, PyTorch & Essential Compute:

    • pip, wheel, setuptools: The foundational tools for Python package management and building.

    • torch, torchvision, torchaudio: The PyTorch ecosystem for deep learning. We're installing the CPU versions for broader compatibility in this setup. torch provides core tensor functionalities, torchvision offers datasets and models for computer vision, and torchaudio does the same for audio.

    • jupyterlab: Our primary interactive development environment – a web-based interface for Jupyter notebooks, code, and data.

    • numpy: The fundamental package for numerical computing in Python, essential for array manipulation and mathematical operations.

    • pandas: The go-to library for data manipulation and analysis, providing powerful data structures like DataFrames.

    • scipy: A library for scientific and technical computing, offering modules for optimization, statistics, signal processing, and more.

    • scikit-learn: A comprehensive and easy-to-use library for traditional machine learning algorithms (classification, regression, clustering, dimensionality reduction, model selection, etc.).

    • Pillow: A powerful image processing library, a fork of PIL (Python Imaging Library).

  2. Popular ML Frameworks & PyTorch Ecosystem:

    • xgboost, lightgbm, catboost: Highly efficient and popular gradient boosting frameworks known for their performance in structured data competitions and applications.

    • pytorch-lightning, fastai, pytorch-ignite: High-level frameworks that simplify PyTorch training, making it easier to write boilerplate-free, scalable, and reproducible deep learning code. (Note that PyTorch Ignite's package on PyPI is pytorch-ignite, not ignite.)

    • einops: For flexible and powerful tensor operations, making tensor manipulations more readable and reliable.

    • skorch: A scikit-learn compatible neural network library that wraps PyTorch, allowing you to use PyTorch models with scikit-learn utilities.

    • accelerate: A library from Hugging Face to easily run your PyTorch training scripts across any distributed configuration (multi-GPU, TPU, etc.) with minimal code changes.

    • torchmetrics: A collection of performance metrics for PyTorch models, making evaluation straightforward.

    • torch-scatter, torch-sparse, torch-cluster, torch-spline-conv, torch-geometric: A suite of libraries for building and training Graph Neural Networks (GNNs) with PyTorch.

  3. Data Visualization:

    • matplotlib: The most established plotting library in Python, providing a wide variety of static, animated, and interactive visualizations.

    • seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

    • plotly, bokeh, altair: Libraries for creating interactive web-based visualizations. Plotly and Bokeh are particularly powerful for complex, interactive plots, while Altair uses a declarative approach.

    • ydata-profiling (formerly Pandas Profiling): Generates interactive HTML reports summarizing datasets, great for exploratory data analysis (EDA).

    • plotnine: A Python implementation of R's ggplot2, allowing for a grammar of graphics approach to plotting.

  4. Natural Language Processing (NLP):

    • sentencepiece: An unsupervised text tokenizer and detokenizer, particularly useful for neural network-based text generation.

    • transformers: From Hugging Face, this library provides thousands of pre-trained models for NLP tasks like text classification, question answering, translation, and more (e.g., BERT, GPT-2).

    • nltk (Natural Language Toolkit): A comprehensive library for various NLP tasks, including tokenization, stemming, tagging, parsing, and classification.

    • spacy: An industrial-strength NLP library designed for performance and ease of use, offering pre-trained models and support for deep learning integration.

    • sentence-transformers: A framework for state-of-the-art sentence, text, and image embeddings.

    • tokenizers: Fast state-of-the-art tokenizers provided by Hugging Face, used by their transformers library.

    • gensim: A library for topic modeling, document similarity analysis, and other unsupervised NLP tasks.

    • textblob: Provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and translation.

  5. Computer Vision (CV) & Audio:

    • timm (PyTorch Image Models): An extensive collection of state-of-the-art image models, backbones, and pre-trained weights for PyTorch.

    • albumentations: A fast and flexible library for image augmentation, crucial for improving the performance of computer vision models.

    • opencv-python-headless: OpenCV (Open Source Computer Vision Library) bindings for Python. The headless version is suitable for server-side applications where no GUI is needed.

    • imageio: A library for reading and writing a wide range of image data, including animated GIFs, videos, and scientific formats.

    • librosa: A Python package for music and audio analysis, providing tools for feature extraction, signal processing, and more.

    • soundfile: A library to read and write sound files.

  6. Hyperparameter Optimization (HPO) & Workflow:

    • optuna, hyperopt, scikit-optimize: Libraries for automating the process of finding the best hyperparameters for your machine learning models.

    • mlflow: An open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

    • wandb (Weights & Biases): A popular tool for experiment tracking, dataset versioning, model management, and collaboration in ML projects.

  7. Model Interpretability:

    • shap: A game-theoretic approach to explain the output of any machine learning model by assigning importance values to each feature.

    • captum: A PyTorch library for model interpretability, offering various algorithms to understand feature importance and neuron activations.

    • interpret: A library from Microsoft for training interpretable models and explaining black-box systems.

    • shapash: Creates an interactive dashboard to help data scientists easily understand their models using SHAP and other interpretability methods.

    • explainerdashboard: Allows you to quickly build interactive dashboards to explore and explain the predictions and workings of ML models.

    • fairlearn: A Python package to assess and improve the fairness of machine learning models.

    • dtreeviz: A library for visualizing decision trees in an intuitive way.

    • dowhy: A Python library for causal inference that helps answer "what if" questions.

    • lit-nlp (Language Interpretability Tool): A visual, interactive NLP model understanding tool.

    • imodels: Provides a collection of interpretable machine learning models like rule lists and decision trees.

    • aequitas: An open-source bias and fairness audit toolkit.

    • lofo-importance (Leave One Feature Out Importance): A method to calculate feature importance.

  8. Scikit-learn Utilities & Other Useful Libraries:

    • mlxtend: Contains useful tools and extensions for data science and machine learning tasks, including stacking classifiers, association rule mining, and plotters.

    • imbalanced-learn: Provides techniques to deal with imbalanced datasets in machine learning, such as oversampling and undersampling.

    • category_encoders: A collection of scikit-learn-compatible transformers for encoding categorical data.

    • statsmodels: A library for estimating and interpreting statistical models, conducting statistical tests, and exploring data.

    • hdbscan: A clustering algorithm that performs well on noisy data and can find clusters of varying shapes.

    • pyjanitor: Provides a clean API for cleaning data using method chaining, built on top of pandas.

    • streamlit, gradio: Libraries for quickly building and sharing interactive web applications for your machine learning models without extensive web development knowledge.

  9. Malware Analysis & Reverse Engineering Tools:

    • pefile: A Python module for reading and working with Portable Executable (PE) files, common in Windows malware.

    • distorm3: A powerful disassembler library for x86/AMD64 binary code.

    • flare-capa: Automatically identifies capabilities in executable files, often used in malware analysis.

    • angr: A platform-agnostic binary analysis framework, useful for automated reverse engineering and vulnerability discovery.

    • vivisect: A binary analysis and reverse engineering framework.

    • networkx: A library for creating, manipulating, and studying the structure, dynamics, and functions of complex networks (often used in conjunction with tools like angr or for graph-based analysis).

This extensive list ensures that when you launch your Docker container, you're stepping into a rich, ready-to-use environment tailored for a wide spectrum of machine learning endeavors.

The Blueprint: Understanding the Dockerfile

The Dockerfile is the heart of our reproducible environment. It's a text file that contains a series of instructions on how to build a custom Docker image. Each instruction creates a new "layer" in the image, making the build process efficient and images modular.

Let's look at the complete Dockerfile we're using and then break it down step-by-step. Name this file Dockerfile when you create it.

# Use a recent stable Ubuntu image
FROM ubuntu:22.04

# Set environment variables to prevent interactive prompts during installation
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
ENV PYTHONUNBUFFERED=1
# Set path for venv
ENV PATH="/opt/venv/bin:$PATH"

# Install system dependencies, Python3, pip, and venv
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    python3.10 \
    python3.10-dev \
    python3.10-venv \
    python3-pip \
    build-essential \
    cmake \
    git \
    curl \
    libgl1 \
    pkg-config \
    libsentencepiece-dev \
    # For OpenCV/matplotlib backends if needed beyond libgl1
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    && rm -rf /var/lib/apt/lists/*

# Create a Python virtual environment using python3.10
RUN python3.10 -m venv /opt/venv

# Activate the virtual environment and install Python libraries in batches
# This helps with readability and potentially caching layers

# --- Batch 1: Core ML, PyTorch, and essential compute libraries ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir --upgrade pip wheel setuptools && \
    # Install PyTorch and related packages from the PyTorch CPU index
    pip install --no-cache-dir \
    --index-url https://download.pytorch.org/whl/cpu \
    torch~=2.3.0 \
    torchvision~=0.18.0 \
    torchaudio~=2.3.0 && \
    # Install other core packages from PyPI
    pip install --no-cache-dir \
    jupyterlab \
    numpy \
    pandas \
    scipy \
    scikit-learn \
    Pillow

# --- Batch 2: Popular ML Frameworks & PyTorch Ecosystem ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    xgboost \
    lightgbm \
    catboost \
    pytorch-lightning \
    fastai \
    pytorch-ignite \
    einops \
    skorch \
    accelerate \
    torchmetrics \
    # PyTorch Geometric and its dependencies
    torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.3.0+cpu.html \
    torch-geometric

# --- Batch 3: Data Visualization ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    matplotlib \
    seaborn \
    plotly \
    bokeh \
    altair \
    ydata-profiling \
    plotnine

# --- Batch 4: Natural Language Processing (NLP) ---
# Install sentencepiece separately first, as it's a common build dependency
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir sentencepiece && \
    pip install --no-cache-dir \
    transformers \
    nltk \
    spacy \
    sentence-transformers \
    tokenizers \
    gensim \
    textblob && \
    # Download NLTK data and spaCy model
    python -m nltk.downloader popular && \
    python -m spacy download en_core_web_sm

# --- Batch 5: Computer Vision (CV) & Audio ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    timm \
    albumentations \
    opencv-python-headless \
    imageio \
    librosa \
    soundfile

# --- Batch 6: Hyperparameter Optimization (HPO) & Workflow ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    optuna \
    hyperopt \
    scikit-optimize \
    mlflow \
    wandb

# --- Batch 7: Model Interpretability ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    shap \
    captum \
    interpret \
    shapash \
    explainerdashboard \
    fairlearn \
    dtreeviz \
    dowhy \
    lit-nlp \
    imodels \
    aequitas \
    lofo-importance

# --- Batch 8: Scikit-learn Utilities & Other Useful Libraries ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    mlxtend \
    imbalanced-learn \
    category_encoders \
    statsmodels \
    hdbscan \
    pyjanitor \
    streamlit \
    gradio

# --- Batch 9: Malware Analysis & Reverse Engineering Tools ---
RUN . /opt/venv/bin/activate && \
    pip install --no-cache-dir \
    pefile \
    distorm3 \
    flare-capa \
    angr \
    vivisect \
    networkx

# Create a non-root user for security and a workspace directory
RUN useradd -m -s /bin/bash -u 1000 jupyteruser && \
    mkdir /workspace && \
    chown -R jupyteruser:jupyteruser /workspace

# Switch to the non-root user
USER jupyteruser

# Set the working directory
WORKDIR /workspace

# Expose the default Jupyter Notebook port
EXPOSE 8888

# Copy the entrypoint script (ensure this file exists in your build context)
COPY --chown=jupyteruser:jupyteruser entrypoint.sh /usr/local/bin/entrypoint.sh
RUN chmod +x /usr/local/bin/entrypoint.sh

# Set the entrypoint
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]

# Default command (can be overridden)
# Starts JupyterLab with no token/password for convenience in local dev
CMD ["--ip=0.0.0.0", "--port=8888", "--no-browser", "--NotebookApp.token=", "--NotebookApp.password="]

Now, let's dissect this Dockerfile:

  • FROM ubuntu:22.04

    • This is the first instruction in any Dockerfile. It specifies the base image from which we are building. We're using ubuntu:22.04, the Jammy Jellyfish Long-Term Support (LTS) release of Ubuntu, which provides a stable and well-supported foundation.
  • ENV DEBIAN_FRONTEND=noninteractive

    • This environment variable tells the Debian/Ubuntu package manager (apt-get) not to ask for any interactive input during package installations. This is crucial for automated builds.
  • ENV TZ=Etc/UTC

    • Sets the timezone for the container to UTC. This helps ensure consistency in time-related operations.
  • ENV PYTHONUNBUFFERED=1

    • This tells Python to run in unbuffered mode, which means Python will send its output (like print statements) directly to the terminal (or Docker logs) without waiting for a buffer to fill. This is helpful for debugging as you see logs immediately.
  • ENV PATH="/opt/venv/bin:$PATH"

    • This prepends the bin directory of our Python virtual environment (/opt/venv/bin) to the system's PATH environment variable. This means that when we run commands like python or pip, the versions from our virtual environment will be used by default, even before activating it explicitly in later RUN commands.
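
This is ordinary shell PATH resolution, and you can sketch the effect outside Docker with a throwaway directory standing in for /opt/venv/bin (the /tmp/fakevenv path here is purely for illustration):

```shell
# Simulate the venv's bin directory with a stub "python" and prepend it
# to PATH; lookup now resolves to the stub first, just as commands
# resolve to /opt/venv/bin inside the image.
mkdir -p /tmp/fakevenv/bin
printf '#!/bin/sh\necho venv-python\n' > /tmp/fakevenv/bin/python
chmod +x /tmp/fakevenv/bin/python
PATH="/tmp/fakevenv/bin:$PATH"
command -v python   # resolves to /tmp/fakevenv/bin/python
python              # prints: venv-python
```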
  • RUN apt-get update && \ ... && rm -rf /var/lib/apt/lists/*

    • This is a single RUN instruction that performs several actions related to system package management:

      • apt-get update: Refreshes the local package list from the repositories defined in /etc/apt/sources.list.

      • apt-get install -y --no-install-recommends \ ...: Installs all the system dependencies we listed in Section II (Python 3.10, pip, build tools, various libraries).

        • -y: Automatically answers "yes" to prompts.

        • --no-install-recommends: Installs only the main dependencies and not the "recommended" packages, which can help keep the image size smaller.

      • && rm -rf /var/lib/apt/lists/*: After the installations, this removes the downloaded package lists. Because the cleanup runs in the same RUN instruction, the lists never make it into the committed layer, keeping the image smaller; deleting them in a later RUN would not shrink the earlier layer. Chaining commands with && also ensures that if any step fails, the whole instruction fails and the build stops there.

  • RUN python3.10 -m venv /opt/venv

    • This command uses the installed python3.10 to create a Python virtual environment at /opt/venv. Virtual environments are a best practice for isolating project-specific dependencies.
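
Outside Docker, the same command produces the same self-contained interpreter tree. A minimal sketch (using /tmp, and --without-pip so it works even where Debian's python3-venv package is absent):

```shell
# Create a bare virtual environment and confirm it carries its own
# python executable plus the pyvenv.cfg marker file at its root.
python3 -m venv --without-pip /tmp/demo-venv
test -x /tmp/demo-venv/bin/python
test -f /tmp/demo-venv/pyvenv.cfg
```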
  • The Batched RUN . /opt/venv/bin/activate && pip install ... commands

    • We have multiple RUN blocks for installing Python packages. This is a strategic choice.

    • Activating the venv: Each RUN command that installs Python packages starts with . /opt/venv/bin/activate. This sources the virtual environment's activate script, ensuring that pip installs packages into /opt/venv rather than the system Python. Each RUN instruction executes in its own shell, so the activation must be repeated in every RUN where pip is used this way. (Strictly speaking, because we prepended /opt/venv/bin to PATH earlier, pip would resolve to the venv's pip anyway; sourcing activate simply makes the intent explicit.)

    • --no-cache-dir: This pip install option disables the pip cache. While the cache speeds up subsequent local builds if you're rebuilding frequently, it increases the Docker image layer size. For final images, it's good practice to use --no-cache-dir.

    • Installing packages in logical groups: We've grouped related packages into separate RUN commands (e.g., Core ML, NLP, Visualization).

      • Readability: This makes the Dockerfile easier to read and understand.

      • Docker Layer Caching: This is a key benefit. Docker caches the results of each RUN instruction as a layer. If you change a later RUN instruction (e.g., add a new package to Batch 5), Docker can reuse the cached layers from Batch 1-4, significantly speeding up the image rebuild process. If all pip install commands were in one giant RUN instruction, any change would invalidate that entire layer, forcing a reinstall of all Python packages.

    • Specific installations:

      • PyTorch: Notice torch~=2.3.0 --index-url https://download.pytorch.org/whl/cpu. We are installing PyTorch (and torchvision, torchaudio) at version 2.3.x (the ~= "compatible release" operator permits newer patch releases but not 2.4) specifically from the PyTorch CPU wheel index. This ensures we get a CPU-only build, keeping the image far smaller than one that bundles CUDA dependencies.

      • PyTorch Geometric: torch-scatter torch-sparse ... -f https://data.pyg.org/whl/torch-2.3.0+cpu.html. These packages for graph neural networks have specific dependencies tied to the PyTorch version and CPU/GPU. We use the -f flag to point pip to a specific URL where it can find compatible wheels.

    • NLTK Data and spaCy Model Downloads:

      • In "Batch 4: Natural Language Processing (NLP)", after installing the Python packages, we run:

        • python -m nltk.downloader popular: This downloads common NLTK datasets (corpora, models) so they are available within the image.

        • python -m spacy download en_core_web_sm: This downloads a small English language model for spaCy.

      • Including these downloads in the Dockerfile ensures these resources are baked into the image, avoiding runtime downloads and potential network issues when a container starts.

  • RUN useradd -m -s /bin/bash -u 1000 jupyteruser && \ mkdir /workspace && \ chown -R jupyteruser:jupyteruser /workspace

    • This command sets up a non-root user for running our application:

      • useradd -m -s /bin/bash -u 1000 jupyteruser: Creates a new user named jupyteruser with UID 1000.

        • -m: Creates a home directory for the user (e.g., /home/jupyteruser).

        • -s /bin/bash: Sets the default shell for this user to bash.

        • -u 1000: Assigns a specific User ID (UID). This is often helpful for managing file permissions with mounted volumes, as you can match this UID on your host system if needed.

      • mkdir /workspace: Creates a directory named /workspace. This will be our primary working directory inside the container for notebooks and projects.

      • chown -R jupyteruser:jupyteruser /workspace: Changes the owner and group of the /workspace directory to our new jupyteruser. This ensures the user has write permissions in this directory.

  • USER jupyteruser

    • This instruction switches the active user for subsequent Dockerfile commands (and for the running container by default) from root to jupyteruser. Running applications as a non-root user is a security best practice.
  • WORKDIR /workspace

    • Sets the working directory for any subsequent RUN, CMD, ENTRYPOINT, COPY, and ADD instructions. This means that if you run a command like ls later, it will be executed as if you were in the /workspace directory. It also means that when the container starts, the default directory for the jupyteruser will be /workspace.
  • EXPOSE 8888

    • This instruction informs Docker that the container listens on the specified network port (8888 in this case) at runtime. This is documentation; it doesn't actually publish the port. Publishing the port (mapping it to a host port) is done when you run the container (e.g., with docker run -p ... or in docker-compose.yml). JupyterLab typically runs on port 8888.
  • COPY --chown=jupyteruser:jupyteruser entrypoint.sh /usr/local/bin/entrypoint.sh

    • This copies the entrypoint.sh script from your Docker build context (the directory where you run docker build) into the image at /usr/local/bin/entrypoint.sh.

    • --chown=jupyteruser:jupyteruser: Sets the owner and group of the copied file inside the image to jupyteruser.

  • RUN chmod +x /usr/local/bin/entrypoint.sh

  • ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]

    • Configures the container to run /usr/local/bin/entrypoint.sh as its main executable when it starts. Anything specified in the CMD instruction will be passed as arguments to this entrypoint script.
  • CMD ["--ip=0.0.0.0", "--port=8888", "--no-browser", "--NotebookApp.token=", "--NotebookApp.password="]

    • This provides default arguments to the ENTRYPOINT. If you run the container without specifying any command, these arguments will be passed to our entrypoint.sh script.

    • Our entrypoint.sh script will, in turn, pass these arguments to jupyter lab.

      • --ip=0.0.0.0: Makes JupyterLab listen on all network interfaces within the container, which is necessary for accessing it from the host.

      • --port=8888: Specifies the port for JupyterLab.

      • --no-browser: Prevents JupyterLab from trying to automatically open a web browser inside the container (which wouldn't work anyway).

      • --NotebookApp.token= and --NotebookApp.password=: These disable token and password authentication for JupyterLab. This is convenient for local development but should be changed or secured if you plan to expose JupyterLab to a network. (Note that exec-form CMD is not processed by a shell, so writing ='' would pass two literal quote characters as the token; use a genuinely empty value. Recent Jupyter versions built on jupyter-server also prefer the --ServerApp.token and --ServerApp.password spellings.)

And that's our Dockerfile! Each line plays a role in crafting a robust, layered, and feature-rich ML environment. The use of multiple RUN commands for pip install is a key takeaway for optimizing build times through Docker's caching mechanism.
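
The entrypoint script itself does little more than hand the CMD arguments over to jupyter lab. A minimal sketch consistent with that contract (an assumption about its shape; the actual script may do more, such as environment checks):

```shell
#!/bin/bash
# entrypoint.sh -- minimal sketch of the container entrypoint.
# Arguments supplied by CMD (or on `docker run`) arrive as "$@" and are
# forwarded straight to JupyterLab. `exec` replaces the shell process,
# so Jupyter becomes PID 1 and directly receives signals such as the
# SIGTERM sent by `docker stop`.
set -e
exec jupyter lab "$@"
```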

The Magic of Image Creation

Now that we've dissected the Dockerfile, let's understand what happens when you tell Docker to actually build an image from it. This process, typically initiated with the docker build command (or via docker-compose up --build), is where the blueprint comes to life. It's a fascinating dance of layers, caching, and reproducibility.
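
Concretely, assuming the Dockerfile and entrypoint.sh sit in the current directory and Docker is installed, a build and a first run look like this (the ml-env tag is our own choice for illustration):

```shell
# Build the image; "." is the build context that COPY reads from.
docker build -t ml-env .

# Start a container, publishing JupyterLab's port 8888 to the host and
# mounting the current directory at /workspace so notebooks survive the
# container's removal. JupyterLab is then at http://localhost:8888.
docker run --rm -p 8888:8888 -v "$(pwd):/workspace" ml-env
```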

Here's a breakdown of the image creation process:

  1. Starting with the Base (FROM ubuntu:22.04)

    • The very first thing Docker does is look at the FROM instruction in your Dockerfile. In our case, it's FROM ubuntu:22.04.

    • Docker checks if this base image already exists locally on your machine.

    • If it doesn't, Docker pulls it from a Docker registry (by default, Docker Hub). This ubuntu:22.04 image itself is composed of its own set of read-only layers, representing a minimal Ubuntu system. This base image forms the initial layer of our custom image.

  2. Executing Instructions Sequentially in Temporary Containers

    • Docker then proceeds through the Dockerfile, one instruction at a time, in the order they are written.

    • For most instructions, especially RUN, COPY, and ADD, Docker performs the following:

      • It launches a temporary container using the image from the previous successful step (or the base image for the first instruction after FROM).

      • It executes the command specified in the instruction (e.g., apt-get update && apt-get install ... or pip install ...) inside this temporary container.

  3. Creating Layers: The Building Blocks of Images

    • Once an instruction completes successfully, Docker effectively takes a "snapshot" of the filesystem changes made by that instruction within the temporary container.

    • This set of changes is then committed as a new read-only layer on top of the previous layer(s).

    • For example:

      • The RUN apt-get update && apt-get install ... command will result in a new layer containing all the newly installed system packages and the updated package lists (before the cleanup).

      • Each RUN . /opt/venv/bin/activate && pip install ... command will create its own layer containing the Python packages installed in that specific step.

      • A COPY instruction creates a layer that adds the specified files from your build context into the image.

    • Instructions like ENV, WORKDIR, USER, EXPOSE, ENTRYPOINT, and CMD don't typically create large data layers themselves but rather add metadata to the image configuration. However, they still mark a point in the build process and can affect layer caching.

  4. Layer Caching: Speeding Up Builds

    • This layer system is incredibly powerful because of caching.

    • Before executing an instruction, Docker checks if it has already run this exact instruction (and all preceding instructions were also identical) in a previous build and if there's a cached layer for it.

    • If a valid cached layer exists, Docker reuses it instead of re-executing the command. This can dramatically speed up image rebuilds, especially if you only change later parts of your Dockerfile.

    • This is why we strategically broke down our pip install commands into multiple RUN instructions in our Dockerfile. If we only change a package in "Batch 5", Docker can reuse the cached layers for "Batch 1" through "Batch 4", saving a significant amount of time by not reinstalling all those earlier packages.

    • An instruction's cache is invalidated if the instruction itself changes or if files copied by a COPY or ADD instruction have changed.

  5. The Final Image: A Stack of Read-Only Layers

    • After all instructions in the Dockerfile have been processed, the result is a new Docker image. This final image is essentially a stack of all the read-only layers created during the build process, plus some metadata (like the default ENTRYPOINT and CMD).

    • You can think of it like a lasagne, where each layer adds something new on top of the one below it. Because these layers are read-only, they are inherently shareable and consistent.

  6. Ready for Reproducible Containers

    • This newly built image (e.g., named ml-notebook as per our docker-compose.yml or if you tagged it with docker build -t ml-notebook .) now resides in your local Docker image cache.

    • The magic is that this image contains everything needed to run our ML environment: the Ubuntu OS, specific Python version, all system and Python dependencies, our entrypoint.sh script, and the default configuration.

    • You can now use this single image to spin up multiple containers. Each container will be an identical, isolated instance of this environment.

    • You can push this image to a Docker registry (like Docker Hub or a private registry) and share it with colleagues, or deploy it to different servers, and be confident that the environment inside the container will be exactly the same, regardless of where it runs. This is the core of Docker's "build once, run anywhere" philosophy.

When a container is started from an image, Docker adds a thin, writable layer on top of the read-only image layers. Any changes the running container makes (like creating new files, modifying existing ones) are stored in this writable layer. The underlying image remains unchanged.

Understanding this layered architecture and build process helps in writing efficient Dockerfiles and appreciating the consistency and portability that Docker brings to application development and deployment, especially for complex environments like our ML setup.
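The caching behavior from step 4 is why the Dockerfile splits pip install into several RUN instructions. As a sketch (the package groupings below are illustrative, not the exact batches from our Dockerfile; some-new-package is a placeholder):

```dockerfile
# Batch 1: large, stable foundations -- this layer is reused on most rebuilds
RUN . /opt/venv/bin/activate && pip install numpy pandas
# ...batches 2-4 omitted...
# Batch 5: frequently changing packages -- editing this line rebuilds only from here down
RUN . /opt/venv/bin/activate && pip install some-new-package
```

Changing only the last RUN line leaves every earlier layer's cache intact, so the rebuild touches just the final batch.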

The Gatekeeper: What entrypoint.sh Does

In our Docker setup, the ENTRYPOINT instruction in the Dockerfile specifies a script that will be executed when any container based on our image starts. This script, entrypoint.sh, acts as the primary command or "gatekeeper" for the container. Its purpose is often to perform some initial setup or to control how the main application inside the container is launched.

Purpose of an Entrypoint Script

While you can directly specify an executable in the ENTRYPOINT (like ENTRYPOINT ["jupyter", "lab"]), using a shell script offers more flexibility:

  • Environment Configuration: It can set up environment variables or perform configurations that need to happen right before the main application starts.

  • Preparatory Steps: It can run prerequisite commands, checks, or wait for other services if needed (though our current script is simpler).

  • Argument Processing: It can process or modify the arguments passed to the container (defined by CMD in the Dockerfile or overridden during docker run) before launching the main application.

  • Wrapper Logic: It can wrap the main application command, allowing for conditional execution or logging.

Let's look at our entrypoint.sh:

#!/bin/bash
set -e

# Activate the Python virtual environment
. /opt/venv/bin/activate

# Check if the first argument is a directory or file path
# If so, assume the user wants to open JupyterLab in that specific context
# Otherwise, default to the /workspace directory
if [ -d "$1" ] || [ -f "$1" ]; then
    # If $1 is a directory or file, pass all arguments ($@) to jupyter lab
    # This allows users to specify a notebook or sub-directory to open directly
    jupyter lab "$@"
else
    # If $1 is not a directory/file (or no arguments are given),
    # start jupyter lab in the default /workspace directory.
    # Any additional CMD arguments (like --ip, --port) are still passed via "$@".
    jupyter lab --notebook-dir=/workspace "$@"
fi

Explanation of the entrypoint.sh Script:

  1. #!/bin/bash

    • This is the "shebang." It specifies that the script should be executed with /bin/bash, the Bash shell.
  2. set -e

    • This command ensures that the script will exit immediately if any command fails (returns a non-zero exit status). This is a good practice for shell scripting as it helps prevent unexpected behavior by stopping the script at the point of error.
  3. . /opt/venv/bin/activate

    • This line is crucial for our Python environment.

    • The . (dot) command is a synonym for source. It executes the activate script in the current shell's context.

    • /opt/venv/bin/activate is the script that activates our Python virtual environment located at /opt/venv.

    • Activating the virtual environment modifies the current shell's PATH and other environment variables so that commands like python, pip, and any installed Python package executables (like jupyter) refer to the versions within the virtual environment. This ensures that jupyter lab uses the correct Python interpreter and all the libraries we installed into /opt/venv.

  4. Conditional Logic for JupyterLab Startup:

    • The script then uses an if statement to determine how to start jupyter lab. The $@ variable in Bash represents all the arguments passed to the script. In our Docker setup, these arguments will initially come from the CMD instruction in the Dockerfile.

    • if [ -d "$1" ] || [ -f "$1" ]; then

      • "$1" refers to the first argument passed to the entrypoint.sh script.

      • -d "$1": This checks if the first argument is an existing directory.

      • -f "$1": This checks if the first argument is an existing regular file.

      • ||: This is the logical OR operator.

      • So, this condition checks if the first argument provided to the container (which becomes the first argument to this script) is either a directory path or a file path that exists within the container.

    • jupyter lab "$@"

      • If the condition is true (meaning the user likely provided a specific path as an argument when running the container, intending for JupyterLab to open there or open that specific file), this line is executed.

      • It starts jupyter lab and passes all the arguments ("$@") received by the script directly to it. For example, if the CMD was my_notebook.ipynb --ip=0.0.0.0 ..., JupyterLab would try to open my_notebook.ipynb.

    • else

      • If the first argument is not a directory or a file (e.g., it's an option like --ip=0.0.0.0, or no arguments are provided beyond the options), this block is executed.
    • jupyter lab --notebook-dir=/workspace "$@"

      • This is the default behavior. It starts jupyter lab and explicitly tells it to use /workspace as the root directory for notebooks.

      • "$@" again passes all arguments from the CMD (like --ip=0.0.0.0, --port=8888, --NotebookApp.token='', etc.) to the jupyter lab command.

In summary: Our entrypoint.sh script first ensures the correct Python virtual environment is active. Then, it intelligently starts JupyterLab: if a user happens to pass a specific file or directory path as the very first argument when running the container (overriding the default CMD behavior), it attempts to use that. Otherwise, it defaults to serving JupyterLab from the /workspace directory, while still respecting all other options (like port, IP binding, and token settings) provided by the CMD in the Dockerfile. This simple script adds a bit of flexibility while ensuring our core application (jupyter lab) runs correctly within the prepared Python environment.
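You can exercise the dispatch condition outside Docker with plain files. A quick sketch, with echo standing in for jupyter lab so nothing actually launches:

```shell
#!/bin/bash
# Mimic entrypoint.sh's branch logic; `echo` stands in for `jupyter lab`
launch() {
    if [ -d "$1" ] || [ -f "$1" ]; then
        echo "jupyter lab $*"
    else
        echo "jupyter lab --notebook-dir=/workspace $*"
    fi
}

tmp=$(mktemp -d)
touch "$tmp/notes.ipynb"

launch "$tmp/notes.ipynb" --port=8888    # first arg is an existing file -> opened directly
launch --ip=0.0.0.0 --port=8888          # first arg is an option -> default /workspace dir

rm -rf "$tmp"
```

The first call takes the "open this path" branch; the second falls through to the default, exactly as the container does when started with only option flags from CMD.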

Orchestrating with Ease: Understanding docker-compose.yml

While we can build our Docker image using docker build and run it with docker run, these commands can become lengthy and cumbersome to type repeatedly, especially as configurations grow. This is where Docker Compose comes in as a powerful orchestration tool.

What is Docker Compose?

Docker Compose is a tool for defining and running multi-container Docker applications. You use a YAML file (typically docker-compose.yml) to configure your application's services, networks, and volumes. Then, with a single command (docker-compose up or docker compose up for newer versions), you can create and start all the services from your configuration.

Why use Docker Compose for a Single Service?

Even though our current setup involves only a single service (our ML JupyterLab environment), using Docker Compose offers several advantages:

  • Ease of Configuration: All your container's runtime configurations (port mappings, volume mounts, container name, image to use/build) are neatly defined in one declarative YAML file. This is much cleaner than a long docker run command.

  • Readability and Maintainability: The docker-compose.yml file is easy to read, understand, and version control alongside your project code.

  • Reproducibility of Run Configuration: It ensures that you (and your collaborators) always run the container with the exact same settings.

  • Simplified Commands: Starting, stopping, and rebuilding your environment becomes as simple as docker-compose up and docker-compose down.

  • Future Scalability: If your project grows to require additional services (e.g., a separate database, a monitoring tool, or another API), you can easily add them to the same docker-compose.yml file.

Now, let's look at the docker-compose.yml file for our ML environment:

version: '3.8'

services:
  ml_notebook_service:
    build:
      context: .
      dockerfile: Dockerfile
    image: ml-notebook

    container_name: my_ml_jupyter_container

    ports:
      - "8888:8888"  # Map host port 8888 to container port 8888

    volumes:
      # Map the ./my_ml_projects directory on the host
      # to /workspace inside the container
      - ./my_ml_projects:/workspace

    # Keep STDIN open even if not attached and allocate a pseudo-TTY
    # Good for interactive processes, though JupyterLab is a server.
    # It doesn't hurt to have them.
    stdin_open: true
    tty: true

Explanation of the docker-compose.yml file:

  • version: '3.8'

    • This line specifies the version of the Docker Compose file format being used. Version '3.8' is a modern version that supports current Docker features. Different versions have slightly different syntax and capabilities. (Note: the newer Compose V2 CLI treats the top-level version key as obsolete and simply ignores it, so you may also see compose files that omit it entirely.)
  • services:

    • This is a top-level key that defines all the different services (which usually translate to containers) that make up your application. Our application currently has one service.
  • ml_notebook_service:

    • This is the custom name we've given to our service. You can name it anything descriptive. Under this key, we define the configuration for this specific service.

    • build:

      • This tells Docker Compose how to build the image for this service if it doesn't already exist or if we explicitly ask for a rebuild.

      • context: .: Specifies the build context – the directory containing the Dockerfile and any other files needed for the build. . means the current directory (where the docker-compose.yml file is located).

      • dockerfile: Dockerfile: Specifies the name of the Dockerfile to use for building the image, relative to the context path.

    • image: ml-notebook

      • This defines the name and tag for the image that will be built or used for this service. If an image with this name and tag (ml-notebook:latest by default if no tag is specified) exists locally, Compose might use it (depending on the command used). If it's built via the build: directive, it will be tagged with this name.
    • container_name: my_ml_jupyter_container

      • This sets a custom name for the container when it's created from the image. If you don't specify this, Docker will assign a random name. Having a fixed name makes it easier to refer to the container in Docker commands (e.g., docker logs my_ml_jupyter_container).
    • ports:

      • This section defines port mappings between the host machine and the container.

      • - "8888:8888": This maps port 8888 on the host machine to port 8888 inside the container. The format is "HOST_PORT:CONTAINER_PORT". Since our JupyterLab instance inside the container listens on port 8888 (as specified by EXPOSE 8888 in the Dockerfile and the CMD), this mapping allows us to access JupyterLab by navigating to http://localhost:8888 on our host machine's browser.

    • volumes:

      • This section defines volume mounts, which are used for persisting data and sharing files between the host and the container.

      • - ./my_ml_projects:/workspace: This is a crucial line for our development workflow.

        • ./my_ml_projects: This refers to a directory named my_ml_projects located in the same directory as the docker-compose.yml file on your host machine. You will need to create this directory.

        • /workspace: This is the path inside the container where the host directory will be mounted. Recall that in our Dockerfile, we set WORKDIR /workspace and our jupyteruser has ownership of this directory.

        • The effect is that any files you put in ./my_ml_projects on your host will appear inside the container at /workspace, and any files JupyterLab saves to /workspace inside the container (like new notebooks or data files) will actually be saved to ./my_ml_projects on your host. This makes your work persistent even if the container is stopped and removed.

    • stdin_open: true

      • This is equivalent to docker run -i. It keeps STDIN open even if not attached.
    • tty: true

      • This is equivalent to docker run -t. It allocates a pseudo-TTY (teletypewriter).

      • While JupyterLab runs as a web server and might not strictly require these for its main operation, they are generally good defaults for services that might have interactive aspects or for debugging. They do no harm in this setup and are often included.

By using this docker-compose.yml file, launching our entire ML environment with its specific port mappings and volume configurations becomes as simple as running docker-compose up. This significantly improves the developer experience and ensures consistency.
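If the project later grows as described above, additional services slot into the same file. A sketch (the experiment_db service, its redis:7 image, and its port are purely hypothetical additions, not part of our setup):

```yaml
services:
  ml_notebook_service:
    build:
      context: .
      dockerfile: Dockerfile
    image: ml-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./my_ml_projects:/workspace

  # Hypothetical second service, e.g. a small cache or metadata store
  experiment_db:
    image: redis:7
    ports:
      - "6379:6379"
```

Both services would then start and stop together with the same docker compose up and docker compose down commands.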

Setting Up and Using the Shared Directory

One of the most practical features of our Docker setup is the ability to work on your project files directly from your host machine while they are simultaneously accessible and modifiable by the JupyterLab environment running inside the container. This is achieved through Docker volumes, specifically a type called a "bind mount."

The Importance of Volumes for Persistent Work

Docker containers are, by default, ephemeral. This means that if you stop and remove a container, any data created inside that container's writable layer (that isn't part of the image itself) is lost. For a development environment like ours, where you'll be creating notebooks, scripts, and datasets, this is obviously not ideal.

Volumes allow you to:

  1. Persist Data: Data stored in a volume exists on the host machine, independent of the container's lifecycle. Even if you remove the container, your work in my_ml_projects remains safe.

  2. Share Files: Easily get files into and out of your container. You can use your favorite editor on your host machine to edit Python scripts, and those changes will be immediately reflected inside the container for JupyterLab to use, and vice-versa.

In our docker-compose.yml, the magic happens in the volumes section:

    volumes:
      # Map the ./my_ml_projects directory on the host
      # to /workspace inside the container
      - ./my_ml_projects:/workspace

Actionable Step: Create Your Project Directory

Before you can use this volume mapping, you need to create the directory on your host machine that will be linked to the container.

  1. Navigate to the directory where you saved your Dockerfile, entrypoint.sh, and docker-compose.yml files.

  2. Create a new directory named my_ml_projects. You can do this from your terminal:

    Bash

     mkdir my_ml_projects
    

This my_ml_projects directory is now ready to be your persistent workspace.

How the Mapping Works: ./my_ml_projects:/workspace

Let's break down this line:

  • ./my_ml_projects: This is the source path on your host machine. The . signifies that it's relative to the location of your docker-compose.yml file. So, it points to the my_ml_projects directory you just created.

  • /workspace: This is the target path inside the container. It's the directory within the container where the contents of my_ml_projects will appear.

  • The colon (:) separates the host path from the container path.

When Docker Compose starts the service, it establishes a link (a bind mount) between these two locations. Any file or folder you create, modify, or delete in my_ml_projects on your host will be mirrored inside the /workspace directory in the container, and vice-versa. Since our JupyterLab instance is configured to use /workspace as its root directory (thanks to our entrypoint.sh and the WORKDIR in the Dockerfile), any notebooks you create in JupyterLab will be saved directly into your my_ml_projects folder on your host.

Crucial Point: User Permissions – The UID/GID Dance

This is where things can sometimes get tricky. The user running the JupyterLab process inside our Docker container is jupyteruser, which we created with a User ID (UID) of 1000 and a Group ID (GID) that's typically also 1000 (its primary group).

For jupyteruser inside the container to be able to create, modify, and delete files within the mounted /workspace directory (which is actually your ./my_ml_projects on the host), it needs to have the necessary write permissions from the host operating system's perspective.

  • The Issue: If the my_ml_projects directory on your host is owned by a user with a different UID than 1000, or if its permissions are too restrictive, the jupyteruser (UID 1000) inside the container might not have permission to write to it. You might encounter errors like "Permission Denied" when trying to save a notebook or create a new file in JupyterLab.

  • Why UID 1000? On many Linux distributions, the first non-system user created typically gets UID 1000. This is a common convention, which is why we chose it in the Dockerfile (useradd -u 1000 jupyteruser).

Solutions and Recommendations:

  1. Check Ownership and Permissions (Linux/macOS):

    • The simplest scenario is when the user running the docker-compose up command on the host is the owner of the my_ml_projects directory, and their UID happens to be 1000. In this case, it often "just works."

    • You can check the ownership and permissions of my_ml_projects on your host: Bash

        ls -ld my_ml_projects
      

      This will show something like drwxr-xr-x 2 yourhostuser yourhostgroup 4096 May 14 08:00 my_ml_projects.

    • To see your host user's UID and GID: Bash

        id -u  # Shows your UID
        id -g  # Shows your GID
      
    • If your host user's UID is 1000: You are generally fine.

    • If your host user's UID is NOT 1000, but you created my_ml_projects as this user: The directory should still be writable by you. Docker's file sharing mechanisms often handle this gracefully on macOS and Windows (using Docker Desktop). On Linux, if the UIDs don't match, the other permissions bits for the directory become important (write permission for "group" or "others").

  2. Ensuring Writability (Linux/macOS):

    • If you face permission issues on Linux: The most straightforward way to ensure the container's jupyteruser (UID 1000) can write to the host directory is to change the ownership of the my_ml_projects directory on your host to UID 1000.

      • Caution: Only do this if you understand the implications for your host system. If UID 1000 doesn't correspond to your main user, you might need sudo to manage these files directly on the host afterward.

Bash

        # Be careful with sudo chown!
        sudo chown -R 1000:1000 my_ml_projects

The -R makes it recursive. The 1000:1000 sets UID to 1000 and GID to 1000.

  • Alternatively, you can grant broader write permissions, though this is less secure: Bash

      chmod -R 777 my_ml_projects # Allows read/write/execute for everyone - use sparingly.
    

    A slightly better approach might be to ensure your user is part of a group that has write access, and that the container user's GID also has access, or to use ACLs (Access Control Lists), but these are more advanced setups.

  3. Docker Desktop (Windows/macOS):

    • Docker Desktop often handles file ownership and permissions more transparently between the host and containers, especially for bind mounts from user directories. You are less likely to encounter UID/GID mismatch issues directly, but ensure the directory is writable by your host user account.

General Recommendation:

For most users, especially on macOS and Windows using Docker Desktop, creating the my_ml_projects directory as your normal host user should work fine. On Linux, if your host user's UID is 1000, it's usually seamless. If you're on Linux with a different UID and encounter permission errors inside JupyterLab, setting the ownership of my_ml_projects on the host to UID 1000 (sudo chown -R 1000:1000 my_ml_projects) is often the most direct fix for this specific Docker setup.
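A quick way to check which situation you're in on Linux is the sketch below; it only prints advice and changes nothing on disk:

```shell
#!/bin/sh
# Compare the host user's UID with the container's jupyteruser (UID 1000)
HOST_UID=$(id -u)
if [ "$HOST_UID" -eq 1000 ]; then
    echo "Host UID is 1000: it matches jupyteruser, so writes should just work."
else
    echo "Host UID is $HOST_UID: if JupyterLab reports 'Permission Denied',"
    echo "consider: sudo chown -R 1000:1000 my_ml_projects"
fi
```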

Understanding this interaction between host permissions and container user UIDs is key to a smooth experience with Docker volumes. Once set up correctly, your my_ml_projects directory becomes the seamless bridge between your host machine and your powerful, isolated ML environment running in Docker.

Bringing It All to Life: Running Your ML Environment

With our Dockerfile, entrypoint.sh, and docker-compose.yml files in place, and our shared my_ml_projects directory created, we're ready to launch our reproducible ML environment. We'll cover two ways to do this: using Docker Compose (recommended for its simplicity) and using the direct docker run command (to understand what Compose does under the hood).

A. With Docker Compose (Recommended for Ease)

Docker Compose significantly simplifies the process of building and running our container with all its configurations.

  1. Ensure Files are Together:

    First, make sure your Dockerfile, entrypoint.sh, and docker-compose.yml files are all in the same directory. Also, ensure the my_ml_projects directory you created earlier is in this same location. Your directory structure should look something like this:

     your_project_folder/
     ├── Dockerfile
     ├── entrypoint.sh
     ├── docker-compose.yml
     └── my_ml_projects/
    
  2. Launch the Environment:

    Open your terminal, navigate to your_project_folder, and run the following command:

     docker compose up --build -d
    
    • Note on command: If you have an older version of Docker Compose, the command might be docker-compose up --build -d (with a hyphen). Modern Docker versions integrate compose as a plugin.

Let's break down this command:

  • docker compose up: This is the core command to start your services as defined in docker-compose.yml.

  • --build: This flag tells Docker Compose to build the Docker image before starting the service. It will look for the build instructions in your docker-compose.yml (which points to your Dockerfile).

    • Important: You should use --build the very first time you run this command. You also need to use it again if you make any changes to your Dockerfile or entrypoint.sh script so that the image is rebuilt with your changes. If the image is already built and up-to-date, Compose is smart enough not to rebuild unnecessarily on subsequent up commands without --build.
  • -d: This stands for "detached mode." It runs the containers in the background and prints the new container's name. Without -d, the container logs would occupy your terminal.

The first time you run this, it will take a while as Docker downloads the base image and then runs all the steps in your Dockerfile to install the extensive list of packages. Subsequent builds will be much faster if only later parts of the Dockerfile are changed, thanks to layer caching.

  3. Stopping the Environment:

    When you're done working and want to stop the container(s) defined in your docker-compose.yml, navigate to the same directory in your terminal and run:

     docker compose down
    
    • (Or docker-compose down for older versions).

    • This command stops and removes the containers and networks defined in your docker-compose.yml. It does not delete the built image (add --rmi local for that) or named volumes (add -v). Your data in my_ml_projects will remain untouched because it lives on your host via the bind mount.

B. With docker run (Understanding the Nuts and Bolts)

While Docker Compose is convenient, it's useful to understand the equivalent docker run command that Compose essentially automates. This helps you appreciate what's happening behind the scenes.

  1. Step 1: Build the Image (if not already built)

    If the image ml-notebook hasn't been built yet (e.g., by a previous docker compose up --build), you first need to build it using the Dockerfile:

     docker build -t ml-notebook .
    
    • docker build: The command to build an image from a Dockerfile.

    • -t ml-notebook: Tags the image with the name ml-notebook (and a default tag of latest). This is the name we referred to in our docker-compose.yml and will use in the docker run command.

    • .: Specifies that the build context (the location of the Dockerfile and other necessary files like entrypoint.sh) is the current directory.

  2. Step 2: Run the Container

    Once the image is built and named ml-notebook, you can run a container from it:

     docker run -d -p 8888:8888 -v "$(pwd)/my_ml_projects:/workspace" --name my_ml_jupyter_container_manual ml-notebook
    

    Let's break this down:

    • docker run: The command to create and start a new container from an image.

    • -d: Run in detached mode (in the background).

    • -p 8888:8888: Publish port 8888 of the container to port 8888 on the host. Format is HOST_PORT:CONTAINER_PORT.

    • -v "$(pwd)/my_ml_projects:/workspace": Mount a volume.

      • $(pwd)/my_ml_projects: This takes the current working directory (pwd) on your host and appends /my_ml_projects to it, forming an absolute path to your shared folder. Using $(pwd) (or ${PWD} on some systems) makes the command more portable.

      • :/workspace: Maps it to the /workspace directory inside the container.

    • --name my_ml_jupyter_container_manual: Assigns a specific name to the running container. We've added _manual to distinguish it from one potentially run by Compose.

    • ml-notebook: The name of the image to use for creating the container.

As you can see, the docker run command includes all the configurations (port mapping, volume mount, container name) that we neatly defined in our docker-compose.yml file.
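You can preview exactly what the -v argument expands to before running the container:

```shell
#!/bin/sh
# Print the absolute host path that `docker run -v` will receive for the bind mount
echo "$(pwd)/my_ml_projects:/workspace"
```

Running this from your_project_folder prints the fully expanded HOST_PATH:CONTAINER_PATH pair, which is what Docker actually records for the mount.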

  3. Stopping and Removing the Manually Run Container:

    If you started the container with docker run, you'd stop and remove it with separate commands:

    • To stop it:

        docker stop my_ml_jupyter_container_manual
      
    • To remove it (after it's stopped):

        docker rm my_ml_jupyter_container_manual
      

Using Docker Compose clearly streamlines these operations into simpler up and down commands, especially as configurations become more complex or involve multiple interlinked services.

Now that your environment is up and running (hopefully using the Docker Compose method!), the next step is to access JupyterLab.

Launching JupyterLab and Starting Your Work

Once your Docker container is up and running (ideally started with docker compose up --build -d), your powerful, pre-configured ML environment is ready and waiting. Accessing JupyterLab, which is the web-based interactive development environment we've set up, is straightforward.

1. Accessing JupyterLab

  • Open your favorite web browser (like Chrome, Firefox, Safari, or Edge).

  • In the address bar, type: http://localhost:8888

  • Press Enter.

You should see the JupyterLab interface load.

Important Note on Authentication (Tokens/Passwords):

Recall the CMD instruction at the end of our Dockerfile:

CMD ["--ip=0.0.0.0", "--port=8888", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.password=''"]

The parts --NotebookApp.token='' and --NotebookApp.password='' explicitly tell JupyterLab to start without requiring a security token or password.

  • Convenience for Local Development: We've configured it this way for ease of use in a local, trusted development setting. You can jump straight into your work without needing to copy-paste tokens from Docker logs.

  • Security Consideration: This is NOT secure for environments exposed to a network or the internet. If you were to run this container on a server where others could access port 8888, anyone could access your JupyterLab instance and potentially execute code. For any non-local or shared deployment, you should remove these options from the CMD in your Dockerfile (or override the CMD when running the container) and configure proper JupyterLab security (e.g., with a token or password, or by putting it behind a reverse proxy with authentication).
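For a network-exposed deployment, one minimal hardening sketch is to change the Dockerfile's CMD so a token is required (the token string below is a placeholder; generate your own long random value):

```dockerfile
# Require a token instead of disabling auth (replace the placeholder before use!)
CMD ["--ip=0.0.0.0", "--port=8888", "--no-browser", "--NotebookApp.token='replace-with-a-long-random-token'"]
```

With this in place, the browser prompts for the token on first visit; a reverse proxy with real authentication remains the more robust option for anything shared.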

2. Navigating and Using Your Workspace

Once JupyterLab loads, you'll see a file browser panel on the left.

  • Your Workspace (/workspace which is your my_ml_projects): The file browser will be rooted at the /workspace directory inside the container. Thanks to our volume mount (- ./my_ml_projects:/workspace in docker-compose.yml), this /workspace directory is directly linked to the my_ml_projects folder on your host machine.

    • Any notebooks, Python scripts, data files, or subdirectories you create here within JupyterLab will appear in your my_ml_projects folder on your computer.

    • Conversely, if you add files to my_ml_projects from your host OS, they will show up in the JupyterLab file browser (you might need to refresh the browser view).

  • Creating a New Notebook:

    • In the "Launcher" tab (it usually opens by default, or you can open it via File > New Launcher), under the "Notebook" section, click on "Python 3 (ipykernel)" (or whatever the Python kernel is named).

    • This will create a new, untitled Jupyter notebook (.ipynb file). You can rename it, add code cells, markdown cells, and run your Python code using the vast array of libraries we've installed.

  • Opening Existing Files:

    • If you have existing notebooks or Python scripts in your my_ml_projects folder on your host, you'll see them listed in the JupyterLab file browser. Simply double-click to open them.

  • Using the Terminal:

    • JupyterLab also provides access to a terminal within the container. In the "Launcher," click on "Terminal." This will open a shell session as the jupyteruser inside the /workspace directory. You can use this to run shell commands, manage files, or execute Python scripts directly.
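Once a terminal or notebook cell is open, a quick sanity check confirms that the expected libraries actually made it into the image. The helper below is an illustrative snippet, not part of the original setup; the library names in the example are assumptions (note that sklearn is the import name for scikit-learn):

```python
import importlib.util

def check_libs(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Example: verify the core scientific stack is importable.
# An empty list means every library was found.
print(check_libs(["numpy", "pandas", "sklearn", "matplotlib"]))
```

If any name comes back in the list, that library is missing from the image and you know to revisit the Dockerfile before going further.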

You are now all set! You have a fully equipped Machine Learning environment at your fingertips, running consistently inside a Docker container, with all your work conveniently saved to your local machine. Experiment with the installed libraries, build your models, and enjoy the reproducibility this setup offers.

Conclusion and Next Steps

Congratulations! You've successfully walked through the process of setting up a comprehensive, reproducible Machine Learning environment using Docker. By leveraging a Dockerfile to define the environment, an entrypoint.sh for initialization, and docker-compose.yml for easy orchestration, you've built a powerful workspace that addresses many common development pain points.

Recap of the Benefits:

  • Reproducibility: Your ML environment is now codified. Anyone (including your future self) can recreate this exact setup by simply using these configuration files, ensuring that your code runs consistently across different machines and over time.

  • Isolation: Your ML project and its numerous dependencies are neatly contained, preventing conflicts with other projects or your system's global package installations. You can work on multiple projects with different requirements side-by-side.

  • Pre-configured Powerhouse: You have a JupyterLab instance ready to go, packed with an extensive suite of Python libraries for data science, machine learning, deep learning, NLP, computer vision, and more. No more tedious individual installations for every new project!

  • Simplified Workflow: Tools like Docker Compose make managing your environment's lifecycle (starting, stopping, rebuilding) incredibly straightforward.

  • Persistent Workspace: Through volume mapping, your valuable notebooks, scripts, and data are safely stored on your host machine, seamlessly integrated with the containerized environment.

You've effectively banished the "it works on my machine" curse for your ML projects.

Potential Next Steps for Your Dockerized ML Journey:

This setup provides a fantastic foundation, but there's always more to explore and customize:

  1. Customize Your Dockerfile Further:

    • Add or Remove Packages: Tailor the installed Python packages to your specific needs. Remove those you don't use to keep the image leaner, or add specialized libraries required for your domain.

    • Different Python or Library Versions: Pin specific versions of Python or key libraries if your project demands it.

    • Install Other Tools: You might want to add other command-line utilities or software directly into the image.
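As an illustration of pinning, a requirements.txt with exact versions keeps rebuilds deterministic. The version numbers below are placeholders, not recommendations; pin whatever versions your project has actually been tested against:

```text
# requirements.txt — illustrative pins; substitute the versions you have tested.
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.4.2
torch==2.3.0
```

In the Dockerfile, installing from this file (e.g., RUN pip install --no-cache-dir -r requirements.txt) then reproduces the same library versions on every build, on every machine.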

  2. Go GPU-Powered for Deep Learning:

    • The current setup uses a CPU-only version of PyTorch to maintain broad compatibility and a smaller image size.

    • If you're doing serious deep learning and have an NVIDIA GPU, you can adapt this setup to use a GPU-enabled base image (e.g., one of the nvidia/cuda base images) and install GPU-compatible builds of PyTorch or TensorFlow. This requires the NVIDIA Container Toolkit (the successor to the older nvidia-docker wrapper) to be installed on your host. Your Dockerfile would need a different FROM instruction and, likely, a different way of installing PyTorch.
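A rough sketch of what the GPU variant of the Dockerfile might look like follows. The image tag, CUDA version, and PyTorch index URL are assumptions for illustration; check the NVIDIA and PyTorch documentation for currently supported combinations before using them:

```dockerfile
# Illustrative sketch only — tags and versions are assumptions.
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install a CUDA-enabled PyTorch build; the index URL varies by CUDA version.
RUN pip3 install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121
```

On the Compose side, recent versions let you expose the GPU to the service with `gpus: all` (or the longer `deploy.resources.reservations.devices` form), provided the NVIDIA Container Toolkit is installed on the host.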

  3. Share Your ml-notebook Image:

    • Once you've built your image (ml-notebook), you can push it to a Docker registry like Docker Hub (public or private) or a private company registry.

    • This allows your collaborators to simply docker pull yourdockerhubusername/ml-notebook and run an identical environment without needing to go through the build process themselves.

    • Commands to tag and push:

        docker tag ml-notebook yourdockerhubusername/ml-notebook:latest
        docker push yourdockerhubusername/ml-notebook:latest
      

      (You'll need to be logged into Docker Hub using docker login).

  4. Integrate with Version Control (Git):

    • Keep your Dockerfile, entrypoint.sh, docker-compose.yml, and your my_ml_projects (or specific notebooks/scripts within it) under version control with Git. This tracks changes to your environment definition alongside your code.
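A hedged example of a .gitignore for such a project follows; the paths are illustrative assumptions, so adjust them to whatever artifacts your workflow actually generates:

```text
# .gitignore — illustrative; keep environment definitions tracked,
# but exclude bulky or regenerable artifacts.
my_ml_projects/data/
my_ml_projects/**/*.ckpt
.ipynb_checkpoints/
__pycache__/
```

The key idea is that the Dockerfile, entrypoint.sh, and docker-compose.yml stay under version control, while large datasets and model checkpoints stay out of the repository.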

  5. Explore Advanced Docker Compose Features:

    • If your project grows, you might add other services like databases (PostgreSQL, MongoDB), experiment tracking tools (a self-hosted MLflow server), or APIs, all managed within the same docker-compose.yml.
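For instance, a database could be added alongside the notebook service like this. This is a minimal sketch, not part of the original setup; the service names, volume name, and credentials are all placeholders:

```yaml
# Illustrative sketch: extending docker-compose.yml with a database service.
services:
  notebook:
    # ... your existing JupyterLab service ...
    depends_on:
      - db        # start the database before the notebook
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: change-me   # placeholder — use a real secret
      POSTGRES_DB: experiments
    volumes:
      - pgdata:/var/lib/postgresql/data   # persist data across restarts
volumes:
  pgdata:
```

Inside the notebook container, the database would then be reachable at the hostname db on Compose's default network.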

The world of Docker is vast, and this is just the beginning. By embracing containerization, you're adopting a best practice that will make your machine learning workflows more robust, collaborative, and efficient.

Happy coding, and may your environments always be reproducible!
