Content from Introduction


Last updated on 2024-10-03

Estimated time: 25 minutes

Overview

Questions

  • What are virtual machines and containers?
  • How are they used in open science?

Objectives

  • Explain the main components of a virtual machine and a container and list the major differences between them
  • Explain at a conceptual level how these tools can be used in research workflows to enable open and reproducible research

Introduction


You might have heard of containers and virtual machines in various contexts. For example, you might have heard of ‘containerizing’ an application, running an application in ‘Docker’, or ‘spinning up’ a virtual machine in order to run a certain program.

This lesson is intended to provide a hands-on primer to both virtual machines and containers. We will begin with a conceptual overview. We will follow it by hands-on explorations of two tools: VirtualBox and Docker.

If you forget what a particular term means, refer to the Glossary.

Prerequisite

This lesson assumes no prior experience with containers or virtual machines. However, it does assume basic knowledge of computer and networking concepts. These include:

  • Ability to install software (and to obtain elevated/administrative rights to do so)
  • Basic knowledge of the components of a computer and what they do (CPU, network, storage)
  • Knowledge of how to navigate your computer’s directory structure (either graphically or via the command line)

Prior exposure to using command line tools is useful but not required.

What are virtual machines and what do they do?


Lead off script: We’re all familiar with computers like Macs and PCs. They run an operating system like Windows or macOS, and we can run programs on that computer like web browsers and word processors. The operating system controls all of the physical resources.

Explain the diagram

  • Physical resources, and how the OS controls access to them
  • The VMM is a program that gets a slice of those resources and presents them as if they were physical resources to virtual machines
  • Define the relationship between the host and guest OS.
  • Use the concept of a computer-within-a-computer
diagram showing boxes with hardware resources, applications, and the operating system
An ordinary (unvirtualized) system. The operating system has complete control of the physical hardware resources and is responsible for executing individual applications and allocating those resources to them.

Normally, computers run a single operating system with a single set of applications. Sometimes (for reasons we’ll discuss soon), people might want to run a totally separate operating system with a different set of applications. One way to do that is to split up the physical resources like CPU, RAM, etc. and allocate them to that second operating system. The concept of splitting up these resources so that only this second operating system can access them is the idea behind a virtual machine. The operating system inside the virtual machine only sees the virtual resources allocated to it (not the real, physical resources).

diagram showing boxes with hardware resources, applications, the operating system, and how virtualization shares resources
The physical hardware resources are divided between the host operating system and any virtual machines. The virtual machine manager (VMM) takes care of managing the virtualized resources. Each virtual machine only sees the resources allocated to it.

At its core, a virtual machine (referred to as VM from now on) is the self-contained set of operating system, applications, and any other needed files that run on a host machine. The VM’s files are usually contained inside a single file called a VM image. The file you downloaded during the lesson setup is an example of a VM image.

Callout

One way to think of the concept is that a VM is a computer that runs inside your computer. All the programs that run inside this mini computer can’t “see” anything running inside the host (or other VMs).

Why would we want to run a mini computer inside of our main computer? VMs are commonly used to

  • Make managing and deploying complex applications easier.
  • Run multiple applications and operating systems on the same hardware.

For example, a web application (like your bank’s online portal) needs to stay running even when there is a lot of traffic or when a network or server goes down. One way to achieve this is to run several identical servers spread out geographically. While one could install all the software on each server, doing so is complicated by the fact that the physical hardware and software present on each server might be slightly different. Also, we might want to run multiple complex applications on the same server, which could result in conflicting software dependencies. VMs help solve these problems.

VMs are also commonly used in academic research. They can help with the problem of research reproducibility by packaging all data and code together, so that others can easily re-run the same analysis without having to install and configure the environment in the same way as the original researcher. They also help optimize the usage of the computing resources owned by the institution.

Callout

Benefits of VMs

  • Helps with distributing and managing applications by including all needed dependencies and configurations.
  • Increases security by isolating applications from each other.
  • Maximizes the use of physical hardware resources by running multiple isolated operating systems at the same time.

What are containers and what do they do?


diagram showing boxes with hardware resources, applications, containers, and the operating system
Containers are self-contained environments that run using the host operating system’s resources. Programs running inside a container only see resources allocated to it via the container manager. For example, from the point of view of the host system, App1’s files live in a normal directory somewhere in the file system. From App1’s perspective, the only directories that it can directly access on the host are controlled by the container manager.

Containers are conceptually similar to VMs in that they also encapsulate applications and their dependencies into packages that can be easily distributed and run on different physical machines. The big difference is that hardware is not virtualized. This means that applications running in a container must be compatible with the host OS and its hardware. In more technical terms, applications running in a container share the host’s kernel and therefore must be compatible with the host’s architecture.

Containers are generally more lightweight in terms of disk space and memory requirements than comparable VMs. However, this comes at the cost of portability. Applications inside a container can only run on a compatible host, unlike VMs, which can run applications from differing operating systems or hardware architectures.

A core concept is that containers should be ephemeral and all user data and configurations should live outside the container. This means that containers can be created and destroyed quickly and easily without affecting the data that the container depends on. This separation is what enables some of the use cases below.

Examples:

  • Web applications (e.g., a web front-end with a database backend – each running in its own container)
  • Data science and machine learning software stacks

Additional characteristics:

  • Containers can contain applications from various OSs like Linux or Windows, but containers based on Unix-like OSs (e.g., Linux) are the most common.
  • Containers are generally console based. If they have a graphical interface, the main way containers present it is via a web browser.
  • Software to create and manage containers is varied. Docker is the most popular one.

Callout

Benefits of containers

  • A lighter application footprint (mainly around lower CPU and memory requirements) compared to VMs.
  • Quickly and easily update a complex application without affecting any user data or causing issues with conflicting dependencies in the host OS.
  • Quickly and easily scale applications. For example, when there is a need to dynamically run multiple instances of an application across a cluster of servers to handle increased demand.
  • Robustness of an application stack. If an application is made up of smaller applications that talk to each other via standard mechanisms (e.g., web APIs), it is easier to pinpoint issues and update individual applications.

Comparing virtual machines and containers


| Feature | Virtual Machines | Containers |
|---|---|---|
| Contains all the dependencies needed to run an application | Yes | Yes |
| Isolates an application from the host OS | Yes | Yes |
| Ease of distribution | Very easy | Easy/hard (depending on complexity and hardware compatibility) |
| Disk space, CPU, and memory requirements | Larger | Smaller |
| Presents virtual versions of real hardware like CPUs, disks, etc. | Yes | No |
| Scaling based on computing needs | More difficult | Easier |
| Able to run applications from one operating system on another | Yes | No* |
| Able to run applications from one CPU architecture (e.g., 32 bit x86) on another (e.g., 64 bit ARM) | Yes (via emulation) | No |

Challenge 1:

If you are running a web browser inside a VM, can the host OS see the web pages you’re visiting? What if your application is making web requests from inside a container? Can the host see the IP addresses your application is connecting to?

In both cases, the host can usually see what sites (domain only if using HTTPS) or IP addresses the guest OS or container is connecting to. In the VM case, even though the network hardware is virtualized, the actual data still has to go through the real hardware at some point. For containers, the container already uses the real hardware as if it were running natively in the host and the effect is the same.

Key Points

  • A virtual machine is a separate computer that runs with its own operating system and applications inside of a host operating system.
  • Containers are like lightweight virtual machines with some subtle but consequential differences.
  • Containers and virtual machines can address many of the same use cases.
  • Both virtual machines and containers are commonly used in academic research but containers are more popular.

Content from Virtual machines using VirtualBox


Last updated on 2024-08-23

Estimated time: 12 minutes

Overview

Questions

  • How do you write a lesson using Markdown and sandpaper?

Objectives

  • Explain how to use markdown with The Carpentries Workbench
  • Demonstrate how to include pieces of code, figures, and nested challenge blocks

Introduction


This is a lesson created via The Carpentries Workbench. It is written in Pandoc-flavored Markdown for static files and R Markdown for dynamic files that can render code into output. Please refer to the Introduction to The Carpentries Workbench for full documentation.

What you need to know is that there are three sections required for a valid Carpentries lesson:

  1. questions are displayed at the beginning of the episode to prime the learner for the content.
  2. objectives are the learning objectives for an episode displayed with the questions.
  3. keypoints are displayed at the end of the episode to reinforce the objectives.

Inline instructor notes can help inform instructors of timing challenges associated with the lessons. They appear in the “Instructor View”

Challenge 1: Can you do it?

What is the output of this command?

R

paste("This", "new", "lesson", "looks", "good")

OUTPUT

[1] "This new lesson looks good"

Challenge 2: how do you nest solutions within challenge blocks?

You can add a line with at least three colons and a solution tag.

Figures


You can use standard markdown for static figures with the following syntax:

![optional caption that appears below the figure](figure url){alt='alt text for accessibility purposes'}

Blue Carpentries hex person logo with no text.
You belong in The Carpentries!

Callout

Callout sections can highlight information.

They are sometimes used to emphasise particularly important points but are also used in some lessons to present “asides”: content that is not central to the narrative of the lesson, e.g. by providing the answer to a commonly-asked question.

Math


One of our episodes contains \(\LaTeX\) equations when describing how to create dynamic reports with {knitr}, so we now use mathjax to describe this:

$\alpha = \dfrac{1}{(1 - \beta)^2}$ becomes: \(\alpha = \dfrac{1}{(1 - \beta)^2}\)

Cool, right?

Key Points

  • Use .md files for episodes when you want static content
  • Use .Rmd files for episodes when you need to generate output
  • Run sandpaper::check_lesson() to identify any issues with your lesson
  • Run sandpaper::build_lesson() to preview your lesson locally

Content from Basics of Containers with Docker


Last updated on 2024-10-03

Estimated time: 20 minutes

Overview

Questions

  • What is a Docker image?
  • What is a Docker container?
  • How do you start and stop a container?
  • How do you retrieve output from a container to a local machine?

Objectives

  • Explain the difference between a Docker image and a Docker container
  • Retrieve a Docker image from the cloud
  • Start a Docker container running on a local machine
  • Use the command line to check the status of the container
  • Clean the environment by stopping the container

Introduction


TODO Flavor text introducing containers. Why we use them. Include at least a couple use cases. If necessary, provide high level distinction from Virtual Machines.

Instructors should feel free to add their own examples in the introduction to help learners appreciate the utility of containers. Providing your own use case for containers lends authenticity to the lesson.

Images versus containers

There are two big pieces of the container world: images and containers. They are related to one another, but they are not synonymous. Briefly, images provide the plans for making a container, and a container is similar to a virtual machine in that it is effectively another computer running on your computer. To use an analogy from architecture, images are the blueprints and containers are the actual building.

Callout

If you are a fan of philosophy, images are for Platonists and containers are for nominalists.

Considering the differences between images and containers…

Images

  1. Are read-only
  2. Are built from instructions (in a file called a “Dockerfile”; we talk about Dockerfiles later in the lesson)
  3. Do not actually “do” anything

Containers

  1. Are modifiable (while running)
  2. Can include files and programs (like your computer!)
  3. Can run analyses or web applications (and more)

TODO: Anything the instructor should be aware of. Maybe here’s a point for an image of some sorts.

Challenge 1: Images versus containers

Your instructor introduced one analogy for explaining the difference between a Docker image and a Docker container. What is another way to explain images and containers?

Several analogies exist, and here are a few:

  • An image is a recipe, say, for your favorite curry, while the container is the actual curry dish you can eat.
  • “Think of a container as a shipping container for software - it holds important content like files and programs so that an application can be delivered efficiently from producer to consumer. An image is more like a read-only manifest or schematic of what will be inside the container.” (from Jacob Schmitt)
  • If you are familiar with object-oriented programming, you can think of an image as a class and a container as an object of that class.

Working with containers

One thing to note right away is that a lot of the work of running containers happens through the command line interface. That is, we do not have a graphical user interface (GUI) with menus to work with. Instead, we type commands into a terminal for starting and stopping containers.

For the purposes of this lesson, we are going to use a relatively lightweight workflow of using a container. Briefly, the steps of using a container are:

  1. Retrieve the image we would like to use from an online repository.
  2. Start the container running (like turning on a computer).
  3. Interact with the container, if the container has such functionality (some containers are just programmed to run without additional interaction from users).
  4. Check the status of the container.
  5. Upon completion of whatever task we are using the container for, stop the container (like turning off the computer).

Steps 1, 2, 4, and 5 are each associated with a specific docker command:

  • Step 1, retrieve image: docker pull
  • Step 2, start container: docker run
  • Step 4, check status: docker ps
  • Step 5, stop container: docker stop
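
The four commands can be strung together into one short session. The following is a minimal sketch rather than a definitive script. The `step` helper and the `RUN_DOCKER` variable are hypothetical names introduced here so that the commands are only printed unless you opt in; the image is the OpenRefine image used later in this episode.

```shell
#!/bin/sh
# Image used later in this episode.
IMAGE="felixlohmeier/openrefine"

# Hypothetical helper: print each command, and run it only when
# RUN_DOCKER=1 is set, so reading the sketch has no side effects.
step() {
  echo "+ $*"
  if [ "${RUN_DOCKER:-0}" = "1" ]; then
    "$@"
  fi
}

step docker pull "$IMAGE"               # step 1: retrieve the image
step docker run -p 3333:3333 "$IMAGE"   # step 2: start the container
step docker ps                          # step 4: check status
# "<container ID>" is a placeholder; replace it with the ID shown by docker ps.
step docker stop "<container ID>"       # step 5: stop the container
```

When the commands are run for real, `docker run` as written here stays in the foreground, so steps 4 and 5 are usually typed in a second terminal.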

The instructions included in the two episodes on containers assume that learners are using the virtual machines described in prior episodes. However, the following Docker instructions can all be run on any computer that has an internet connection and has Docker installed. You can find more information about installing Docker at the Carpentries’ Containers lesson.

Retrieving images

The first step of using containers is to download a copy of the image you would like to use. For Docker images, there are multiple sites on the internet that serve as sources for Docker images. Two common repositories are DockerHub and GitHub’s Container Registry; for this lesson, we will be downloading from DockerHub. The nice thing is that we do not have to open a web browser and manually download a file - instead we can use the Docker commands to do this for us. For downloading images, the syntax is:

docker pull <image creator>/<image name>

Here, we replace <image creator> with the username of the person or organization responsible for the image and <image name> with the name of the image. For this lesson, we are going to use an image that includes the OpenRefine software. OpenRefine is a powerful data-wrangling tool that runs in a web browser.

Callout

Want to learn more about OpenRefine? Check out the Library Carpentry lesson on OpenRefine.

TODO docker pull

$ docker pull felixlohmeier/openrefine

Starting a container

TODO docker start

$ docker run -p 3333:3333 felixlohmeier/openrefine
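
The value passed to `-p` has the form `<host port>:<container port>`: it publishes a port inside the container on a port of the host. The sketch below only takes that value apart to show which half is which. The `describe_port_map` function is a hypothetical helper, and the `localhost` address assumes your web browser runs on the same machine as Docker.

```shell
#!/bin/sh
# Hypothetical helper: split a -p value into its two halves.
describe_port_map() {
  host_port="${1%%:*}"        # text before the colon: port on the host
  container_port="${1##*:}"   # text after the colon: port inside the container
  echo "http://localhost:${host_port} -> container port ${container_port}"
}

# The mapping used in this lesson.
describe_port_map "3333:3333"
# An illustrative alternative: a different host port, same container port.
describe_port_map "8080:3333"
```

With the lesson's mapping, both halves happen to be the same number; the second call shows that only the host side changes the address you would type into the browser.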

Status check

TODO docker ps

$ docker ps

Using the container

TODO Do something in OpenRefine

Stopping the container

TODO docker stop (after docker ps)

$ docker ps
$ docker stop <container ID>

Challenge 2: Checking the status of containers

We saw before that we could check the status of running containers by using the command docker ps. What happens when you run the same command now? What about when you run the same command with the -a flag?

  • docker ps will show the status of all running containers. If you have no containers running, and you probably do not at this point of the lesson, you should see an empty table, like:
$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
$ 
  • docker ps -a will show all containers that are running or have been run on the machine. This includes the container that we stopped earlier.
$ docker ps -a
CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS                       PORTS     NAMES
906072ff88f6   felixlohmeier/openrefine   "/app/refine -i 0.0.…"   2 days ago      Exited (143) 2 days ago                determined_torvalds

$

Note that the container ID, the date information (in the CREATED and STATUS fields), and the container name (the NAMES field) will likely be different on your machine.
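
Because the container ID is always the first column of this table, it can be picked out with standard text tools. The following is a small sketch that works on the sample output shown in the solution, not on a live Docker installation; `container_ids_for_image` is a hypothetical helper name, and on a real machine you would pipe `docker ps -a` into it instead of the sample text.

```shell
#!/bin/sh
# Sample `docker ps -a` output copied from the solution above.
sample_output='CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS                       PORTS     NAMES
906072ff88f6   felixlohmeier/openrefine   "/app/refine -i 0.0.…"   2 days ago      Exited (143) 2 days ago                determined_torvalds'

# Hypothetical helper: skip the header row, then print the ID (column 1)
# of every container whose IMAGE column (column 2) matches the first argument.
container_ids_for_image() {
  awk -v image="$1" 'NR > 1 && $2 == image { print $1 }'
}

printf '%s\n' "$sample_output" | container_ids_for_image "felixlohmeier/openrefine"
# prints 906072ff88f6
```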

Challenge 3: Order of operations

Rearrange the following commands to (in the following order) (1) start the OpenRefine container, (2) find the container ID of the running OpenRefine container, and (3) stop the OpenRefine container.

docker stop <container ID>
docker run -p 3333:3333 felixlohmeier/openrefine
docker ps

docker run -p 3333:3333 felixlohmeier/openrefine
docker ps
docker stop <container ID>

Callout

TODO Add any notes that may be relevant, but not necessary for lesson?

Key Points

  • Containers are a way to provide a consistent environment for reproducible work.
  • Use docker pull to copy an image to your machine
  • Use docker run to start running a container
  • Use docker ps to check the status of running containers
  • Use docker stop to stop running a container

Content from Creating Containers with Docker


Last updated on 2024-09-20

Estimated time: 12 minutes

Overview

Questions

  • How do you create new Docker images?

Objectives

  • Explain how a Dockerfile is used to create Docker images
  • Create a Dockerfile to run a command
  • Use docker build to create a new image
  • Update a Dockerfile to run a Python script

Introduction


TODO Dockerfiles -> Images -> Containers

TODO Might be a good spot for a visual.

TODO How do we extend analogies that we presented in episode 1?

TODO Anything instructors should be aware of for this episode?

Dockerfile gross anatomy

TODO Cover two commands: FROM and CMD (or maybe ENTRYPOINT instead of CMD?)
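
As a rough illustration of these two instructions, the sketch below writes a two-line Dockerfile from the terminal. It is an assumption about what such a file could look like, not the lesson's official example: the `python-container` directory name is hypothetical, the `python:3.9` base image is the one mentioned in Challenge 1, and the `CMD` line is an illustrative choice.

```shell
#!/bin/sh
# Hypothetical working directory for the build.
mkdir -p python-container

# Write a two-line Dockerfile:
#   FROM names the base image to build on.
#   CMD  gives the command that runs when a container starts.
cat > python-container/Dockerfile <<'EOF'
FROM python:3.9
CMD ["python", "--version"]
EOF

# Show the file that was written.
cat python-container/Dockerfile
```

A container started from an image built with this file would run the `CMD` line once and then exit.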

Creating images from Dockerfiles

TODO Explain commands. Note how no new file is created in our directory, it gets created…somewhere, though?

docker build -t ...

docker image ls

Starting containers

TODO This is review.

docker run ...

Confirm it ran and quit

docker ps -a

Challenge 1: Update base image

  • Update the Dockerfile to have a base image that includes Python version 3.12 (instead of Python version 3.9)
  • Build the image
  • Start the container to confirm it is using Python version 3.12

To change the base image, update the information passed to the FROM command. That is, open the Dockerfile and change this line:

FROM python:3.9

to

FROM python:3.12

Build the image

In the terminal, use docker build to create a new version of the image.

BASH

docker build -t <username>/python-container .

This command will over-write the previous version of the image. TODO Need to test this statement.

Verify image was updated

In the terminal, use docker run to start a container based on the updated image.

BASH

docker run <username>/python-container

Copying files into the image

TODO Add flavor text about why we might do this.

COPY ...

See https://stackoverflow.com/questions/32727594/how-to-pass-arguments-to-shell-script-through-docker-run and https://www.tutorialspoint.com/how-to-pass-command-line-arguments-to-a-python-docker-container for examples of passing arguments to a script. Passing arguments might be too much.

Challenge 2: Copy a script to run in the container

There is a script (make this a print("Hello World!") python script) you want to include
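
One possible way to set up this challenge is sketched below, under stated assumptions: the `hello-container` directory, the `hello.py` file name, and the `/hello.py` destination path are hypothetical choices, and the `python:3.12` base image follows Challenge 1. The script itself is the one-line program described above.

```shell
#!/bin/sh
# Hypothetical working directory and file names.
mkdir -p hello-container

# The one-line Python script described in the challenge.
cat > hello-container/hello.py <<'EOF'
print("Hello World!")
EOF

# COPY places hello.py inside the image;
# CMD runs it when a container starts.
cat > hello-container/Dockerfile <<'EOF'
FROM python:3.12
COPY hello.py /hello.py
CMD ["python", "/hello.py"]
EOF

# Show the Dockerfile that was written.
cat hello-container/Dockerfile
```

The source path given to `COPY` is relative to the directory passed to `docker build`, so with this layout the build would be run against the `hello-container` directory.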

Key Points

  • Dockerfiles include instructions for creating a Docker image
  • The FROM command in a Dockerfile indicates the base image to build on
  • The CMD command in a Dockerfile includes commands to execute when a container starts running
  • The COPY command in a Dockerfile copies files from your local machine to the Docker image so they are available for use when the container is running