A Gentle Introduction to AWS SageMaker - Part I

Arindam Dey
7 min read · Jul 2, 2022


Kings Lake Austria by Arindam Dey ( Canon EOS 450D, f/8, Exp. Time 1/30sec, ISO-200, 120mm )

I cleared my AWS ML Specialty exam in Q1 2022. While it took me almost six months to prepare, it was a fabulous learning experience. There is really no dearth of material for building a knowledge portfolio around the math and the associated code. However, when it came to applying those concepts in an industrial setting, I came to an abrupt halt. How do you deploy a model to production? How do you scale it? How do you apply updates to a model? These are just some of the questions, and this is where the AWS ML Specialty curriculum is truly an eye-opener.

In this two-part article I will demonstrate a beginner's journey into training a model in the cloud using AWS SageMaker. This is what we shall cover:

Part-I

  1. Create a SageMaker notebook instance on a virtual server on the AWS platform.
  2. Install packages that persist across restarts (Lifecycle Configuration).
  3. Launch JupyterLab.

Part-II

  1. Load a sample dataset into AWS S3 (Simple Storage Service).
  2. Grant the SageMaker instance access to the dataset.
  3. Launch a hyperparameter tuning job on a cluster of virtual machines.
  4. Pick the best parameters and build our model.
  5. Save the model in cloud storage.
  6. Load the model on another virtual machine and run predictions.

Getting Started

To do the entire exercise, the reader will need an AWS account. There's plenty of material around on setting up an AWS account, so I won't cover that here.

So let's get started by creating a notebook instance. Go to your AWS Console and type SageMaker in the search bar (ignore the auto-fill suggestions). Select Amazon SageMaker.

Fig 1. AWS Console Landing Page

Once in SageMaker, click Notebook->Notebook instances. In my case you can see a bunch of instances I have already created. Notice that all of them are in the status "Stopped". This is very important, because AWS charges you for running instances. We need to make sure we stop all instances once our work is done.

Look for the "Create notebook instance" button (the orange button below). Once you click it, another window with a lot of intimidating options opens up.

Fig 2. Running Instances

For now, we will keep everything very basic. Let us call our instance "MyDemoInstance" and, in the instance type drop-down, select "ml.t2.medium" (I will explain this in a moment). Scroll all the way down and click Create notebook instance.
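For readers who prefer the command line, the same instance can be created with the AWS CLI. This is a sketch, not part of the console walkthrough; the role ARN below is a placeholder you would replace with a SageMaker execution role from your own account:

```shell
# Create a notebook instance equivalent to the console steps above.
# --role-arn must point to a SageMaker execution role in YOUR account;
# the account ID and role name here are purely illustrative.
aws sagemaker create-notebook-instance \
    --notebook-instance-name MyDemoInstance \
    --instance-type ml.t2.medium \
    --role-arn arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-example
```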

Fig 3. Configuring the Instance

This will lead you to the following window, where you can see the status of your instance. Initially it will show "Pending". Give it a few minutes, until the status changes to InService and the Actions column shows Open Jupyter | Open JupyterLab.

Fig 4. Our Demo Instance

Before we proceed, I need to clarify the term ml.t2.medium. We are launching a virtual machine, so we need to specify the basic configuration we are looking for in terms of CPU and memory. AWS provides a large collection of instance types based on your computational needs (e.g. general purpose, memory optimized, compute optimized). Each of these has a different configuration and pricing scheme. Visit here to have a look at the various SageMaker instance types at our disposal.

The link above will show you that the "ml.t2.medium" has 2 vCPUs, 4 GiB of RAM and costs about $0.05/hr. That's pretty modest, considering we'll shut our instance down immediately after we are done.

We just launched a virtual server on AWS, on which we can use Jupyter Notebook or JupyterLab. Let’s dig deeper now.

Inspecting the Environments

Now click Open Jupyter under the Actions column, and the all-familiar Jupyter Notebook opens up. We now have a notebook running on AWS. Look for the "New" button on the right. Clicking it reveals the available environments.

Fig 5. Available Environments

To keep things simple, we will select the conda_python3 environment. However, there's one crucial point we need to address. What if the chosen environment is missing some libraries? We can certainly run a pip install in a notebook cell. Unfortunately, the moment we shut down our instance, these additional libraries cease to exist and have to be re-installed. How do we address this?

Lifecycle Configuration

Quit the Jupyter Notebook and move back to the adjacent tab in the browser. With MyDemoInstance checked, click Actions->Stop.

Fig 6. Stopping the Demo Instance

The reason for doing this is to associate a startup script, which will install an additional library, imblearn, in our python3 environment. Now look for the Lifecycle configurations option on your left. You should get a window like this. Click Create configuration.

Fig 7. Create Lifecycle Configuration

This opens up a window to create a lifecycle configuration. Let us call it imblearn (you can put any name here) and type in the small script you see in the image below. It will install the imblearn library whenever our notebook instance starts. This library helps create synthetic data to handle class imbalance. Click Create configuration.

Fig 8. Edit the Lifecycle Configuration

We won't be using this library per se; this is just a demonstration of loading external libraries. If you intend to install multiple packages (e.g. imblearn and scipy), you can simply replace the pip line with pip install --upgrade imblearn scipy
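Since the script itself is only visible in the screenshot, here is a sketch of what such an on-start script typically looks like, following AWS's documented lifecycle-configuration pattern (the anaconda3 paths are the defaults on SageMaker notebook instances):

```shell
#!/bin/bash
set -e
# Lifecycle scripts run as root on every start; drop to ec2-user so
# the package lands in that user's conda environment.
sudo -u ec2-user -i <<'EOF'
# Activate the python3 conda environment, install the package, deactivate.
source /home/ec2-user/anaconda3/bin/activate python3
pip install --upgrade imblearn
source /home/ec2-user/anaconda3/bin/deactivate
EOF
```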

Now go back to the page where we have our instances (Fig 6.) and click on the instance name MyDemoInstance itself. Two things are worth noticing here.

First, notice how the Lifecycle configuration field is blank. We will associate a configuration shortly.

Fig 9.1 Notebook Instance Settings

Second, scroll down a bit until you see Permissions and encryption. Note the cryptic text under IAM role ARN. Copy this text and keep it in a text editor; we will need it in Part-II.

Fig 9.2 Notebook Permissions and Encryption.

The 12-digit number 570517415597 is my AWS account ID. In your case it will be your own account ID.
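For orientation, an execution role created through the console typically has an ARN of the following general shape (the role-name suffix varies; treat this as illustrative, not your actual value):

```
arn:aws:iam::<your-account-id>:role/service-role/AmazonSageMaker-ExecutionRole-<creation-timestamp>
```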

Now scroll up and click Edit (Fig 9.1), so that we can associate our custom lifecycle configuration named imblearn with our instance. On the Edit notebook instance page, open the Lifecycle configuration drop-down and select imblearn. Scroll down and click Update notebook instance. Note that you can change the notebook instance type here as well, but we'll leave that for another day.

Fig 10. Associate the lifecycle script with the instance

All we did was make sure that every time our notebook instance starts, the lifecycle script activates the python3 environment, installs our package and deactivates the environment. There are loads of example lifecycle scripts here.

Now you can go back to the Notebook instances page (Fig 6.), check MyDemoInstance and click Actions->Start. Wait until the status is updated to InService, then click Open JupyterLab on the right. This will open JupyterLab in an adjacent tab of your browser.
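The same start-and-wait cycle can also be scripted with the AWS CLI; a sketch, assuming the CLI is configured with your credentials:

```shell
# Start the instance and block until it reaches InService.
aws sagemaker start-notebook-instance --notebook-instance-name MyDemoInstance
aws sagemaker wait notebook-instance-in-service --notebook-instance-name MyDemoInstance

# Confirm the current status (should print "InService").
aws sagemaker describe-notebook-instance \
    --notebook-instance-name MyDemoInstance \
    --query 'NotebookInstanceStatus'
```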

Fig 11. MyDemoInstance up and running

JupyterLab has an untitled.ipynb notebook available by default. Double-click to launch it. In case it prompts you to choose a kernel, select conda_python3. Use the File menu to rename the notebook to something suitable.

Fig 12. JupyterLab running on SageMaker Instance

Phew! We now have a SageMaker instance up and running, which we will use to build a classification model.

Before We Move on to Part-II

Let us summarize what we have done so far.

  1. We created a SageMaker notebook instance of type ml.t2.medium.
  2. We associated a lifecycle configuration named imblearn, so that the instance always starts up with this package.

In Part-II we will do the following

  1. Load the dataset into an AWS S3 (Simple Storage Service) bucket.
  2. Create policies so that the SageMaker instance can access the dataset.
  3. Pull a container holding a built-in algorithm from AWS ECR (Elastic Container Registry). In our case we will use XGBoost.
  4. Launch a hyperparameter tuning job across multiple instances running this container.
  5. Choose the best model and save it in S3.
  6. Load the best model on another instance and run predictions.

Caution !!

Make sure that you shut down your instance if you want to do the rest of the exercise later. Go back to the SageMaker Notebook instances page (Fig 4.). Select the radio button next to your instance and click Actions->Stop. If you miss this step, AWS will keep charging you for the instance as long as it is running.
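If you'd rather not rely on remembering the console step, stopping can also be done from the AWS CLI (a sketch, assuming configured credentials):

```shell
# Stop the instance so compute billing stops (attached storage is
# still billed, but the hourly instance charge is not).
aws sagemaker stop-notebook-instance --notebook-instance-name MyDemoInstance

# Optional: block until it has fully stopped.
aws sagemaker wait notebook-instance-stopped --notebook-instance-name MyDemoInstance
```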

Thank you and happy learning. On to Part-II!
