A Gentle Introduction to AWS Sagemaker - Part II

Arindam Dey
12 min read · Jul 2, 2022


Summer Delights by Arindam Dey ( Canon EOS 450D, f/6.3, Exp. Time 1/2sec, ISO-100, 210mm )

In our previous article, we saw how to run a Jupyter Notebook on AWS SageMaker. Though we did not write a single line of code there, we have a few more steps to go before we get coding. In this article we will do the following.

  1. Create a bucket on AWS S3 (Simple Storage Service) that will hold the datasets.
  2. Upload our datasets into the bucket.
  3. Create policies so that the SageMaker notebook instance can access the datasets.
  4. Create a cluster of instances to run hyperparameter tuning. Each of these instances will run a container.
  5. Choose the best model and run predictions in another instance, once again using a container.
  6. Compare the predictions on the test set with the ground truth.

The python code for this project can be found in this GitHub repo.

Uploading the Datasets

Download the ground_truth, train, validation and test CSV files. Now go back to the AWS Console and this time look for the service S3 (Simple Storage Service). We will not dwell too much on the dataset itself; just bear in mind that, according to SageMaker, the CSV files should have the following characteristics. There are 111 features and a label.

  1. There should not be a header row.
  2. The label should be in the first column. Thus, the train and validation files have 112 columns.
  3. The test set has only the features, without the label column. So, it has 111 columns.
  4. The ground truth file is the test set file with the label column prepended. Thus, it also has 112 columns.

More information on SageMaker input formats can be found here.
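
If you are curious how files in this shape could be produced, here is a minimal pandas sketch. The raw input file name and the 'label' column are hypothetical placeholders; the output file names match the ones we will use later in this article.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical starting point: a DataFrame with a 'label' column plus 111 feature columns.
df = pd.read_csv("my_raw_data.csv")
features = [c for c in df.columns if c != "label"]

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)

# Train/validation files: label first, then features; no header, no index (112 columns).
train_df[["label"] + features].to_csv("sage_train.csv", header=False, index=False)
val_df[["label"] + features].to_csv("sage_val.csv", header=False, index=False)

# Test set: features only, no label (111 columns).
test_df[features].to_csv("sage_test.csv", header=False, index=False)

# Ground truth: the test set with the label column prepended (112 columns).
test_df[["label"] + features].to_csv("Ground_Truth.csv", header=False, index=False)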

Fig 1. Amazon S3

In my case, two buckets already exist. Click Create bucket and the following window opens up.

Fig 2. Creating a New Bucket

Enter a bucket name, keeping in mind the following:

  1. S3 uses a global namespace, meaning your bucket name must be globally unique. In my case, I've named it mylaundrybucket (pardon the pun).
  2. See the fine print under the text box for additional guidelines.

Scroll all the way down and click Create bucket. Now you should be able to see the newly created bucket. Click on the bucket name and you should get the following window for uploading your files.

Fig 3. New Bucket Created

Click Upload and select all four files from your local system. This should lead you to the progress window below.

Fig 4. Uploading Files in the bucket
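
If you prefer scripting to console clicks, the same bucket creation and upload can be done with boto3. This is purely an optional sketch; replace the bucket name and region with your own.

import boto3

s3_client = boto3.client("s3")
bucket_name = "mylaundrybucket"  # replace with your own globally unique name

# Outside us-east-1, a LocationConstraint matching your region is required.
s3_client.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": "ap-south-1"},
)

# Upload the four CSV files from the local working directory.
for fname in ["sage_train.csv", "sage_val.csv", "sage_test.csv", "Ground_Truth.csv"]:
    s3_client.upload_file(fname, bucket_name, fname)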

Now click the bucket name under Destination; you should be able to see all the files in your bucket. Click Permissions and you should get the following page with the Permissions overview.

Fig 5. Permissions for Bucket

Scroll down till you see Bucket policy and click Edit.

Fig 6. Setting Bucket Policy

The following window opens up. Click Add new statement and a block of JSON appears right under Policy.

Fig 7. Bucket Policy JSON Document

Setting Policies

You can delete the statements above and paste the following JSON into the text area. You have to make two changes:

  1. Replace the string mylaundrybucket with your bucket name.
  2. Paste the IAM Role ARN from Part-1 of this article. You will find it just above fig 9.2 in Part-I.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::570517415597:role/service-role/AmazonSageMaker-ExecutionRole-20210106T114090"
            },
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::mylaundrybucket"
        }
    ]
}

All we did is give our SageMaker notebook instance (the Principal) access to the S3 bucket (the Resource) and allow it to perform all possible actions on the bucket (the Action) on our behalf. This will allow the notebook to pull the train and test set data during training. In a practical scenario, you may have datasets of a few tens of GBs. Then it is possible to send your data in parts to the training algorithm.

In our case though, the datasets are small, so the training algorithm will pull the data in its entirety.
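
For completeness, the same policy can also be attached from code instead of the console, using boto3's put_bucket_policy. This is an optional sketch; substitute your own bucket name and the IAM Role ARN from Part-I.

import json
import boto3

bucket_name = "mylaundrybucket"  # your bucket
role_arn = "arn:aws:iam::570517415597:role/service-role/AmazonSageMaker-ExecutionRole-20210106T114090"  # your role ARN from Part-I

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "s3:*",
            "Resource": "arn:aws:s3:::{}".format(bucket_name),
        }
    ],
}

# Attach the policy to the bucket.
boto3.client("s3").put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))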

Back to Sagemaker

We can now go back to the top of the console, type SageMaker in the search bar and select Amazon SageMaker. Look for Notebook > Notebook instances on the left side. You should be able to come back to this window now.

Check the instance MyDemoInstance and click Actions > Start. Wait till the Status becomes InService, then click Open JupyterLab.

Fig 8. Back to Sagemaker

You should be able to see the Jupyter Notebook you created in Part-I. Let's get coding; paste the following lines in the first cell.

from imblearn.over_sampling import SMOTE
import boto3
import sagemaker
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner,
)
from sagemaker.inputs import TrainingInput
from sagemaker import get_execution_role
import pandas as pd

Paste these lines in a cell and execute. Hopefully you got no errors. Notice that the imblearn package imports without any issue, since we included it as part of our lifecycle configuration. We won't be using it in our code; it was included only as a demonstration.

Moving on, run this code snippet in the next cell.

role = get_execution_role()
my_region = boto3.session.Session().region_name
print("Success - the MySageMakerInstance is in the " + my_region)
print(role)

This should print something similar to the following lines. In your case, you will get your own region name printed out (mine is ap-south-1). We also printed out the IAM Role ARN, which is exactly the one we used in the bucket policy above.

Success - the MySageMakerInstance is in the ap-south-1
arn:aws:iam::570517415597:role/service-role/AmazonSageMaker-ExecutionRole-20210106T114090

Now run the following lines. We intend to pull an image containing the latest version of XGBoost. We are walking into containerization territory, though I promise we won't wander too deep into containers.

sess = sagemaker.Session()
container = sagemaker.image_uris.retrieve("xgboost", my_region, "latest")
print(container)

The variable sess allows us to interact with other AWS services. In this case we will be interacting with S3 and also making API calls to launch a bunch of instances. The print statement gives us a string similar to the following. That's just the name of the image we will be pulling from AWS ECR (Elastic Container Registry).

991648021394.dkr.ecr.ap-south-1.amazonaws.com/xgboost:latest

AWS has a large repository of pre-built Docker images here. You will notice that for XGBoost alone there are many options to choose from. Similarly, there are pre-built containers for TensorFlow, scikit-learn, Spark ML, etc. The idea is to let the containers do all the heavy lifting for us. You can build your own image as well, but that is for another day.

If you visit the link above and scroll all the way to the XGBoost section, you will be able to see this image.
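
As an aside, if you would rather pin a specific XGBoost release than rely on the legacy latest tag, image_uris.retrieve accepts a version argument. A small sketch; the version string 1.5-1 is one of the published SageMaker XGBoost releases.

# Optional: retrieve a pinned XGBoost container instead of "latest".
pinned_container = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=my_region,
    version="1.5-1",
)
print(pinned_container)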

Now, this is the most important part of the code. Let us inspect each of the variables we define here.

# Replace the bucket name with your own bucket
my_bkt = 'mylaundrybucket'
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]
xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path='s3://{}/models'.format(my_bkt),
    sagemaker_session=sess,
    rules=rules,  # Debugger rules are passed to the Estimator itself
)
xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=100,
    rate_drop=0.3,
    tweedie_variance_power=1.4,
)

xgb is an instance of the SageMaker Estimator class. We pass it the container, role and sess objects we defined above.

The output_path argument is a folder that the training algorithm will create inside our bucket to store the model artifacts. In your case, a folder named models will be created automatically inside your bucket.

We set some default hyperparameters.

We will now set the input paths for the training and validation datasets. These are just paths into the S3 bucket we created above. Note that we are not downloading the datasets into our notebook instance at any point in time. We will clarify the fate of these files in a moment.

s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/sage_train.csv'.format(my_bkt), content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data='s3://{}/sage_val.csv'.format(my_bkt), content_type='csv')

Finally, we define the hyperparameter ranges. You can define many more hyperparameters, but AWS does specify some guidelines/best practices here. This is one of the hot topics in the certification exam.

hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "min_child_weight": ContinuousParameter(1, 10),
    "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(4, 10),
}
objective_metric_name = "validation:auc"
tuner = HyperparameterTuner(xgb, objective_metric_name, hyperparameter_ranges, max_jobs=6, max_parallel_jobs=3)

Note how some of the parameters are continuous and some are integers. The continuous ones move from the minimum to the maximum in decimal increments, while the integer ones take whole-number values. Finally, we define an instance tuner of the HyperparameterTuner class and pass in some of the objects we defined above. The max_jobs and max_parallel_jobs parameters do have a tradeoff, described in the best-practices link above.

max_jobs is the maximum total number of training jobs to start for the hyperparameter tuning job (default: 1, we have chosen 6).

max_parallel_jobs is the maximum number of parallel training jobs to start (default: 1, we have chosen 3).

For this exercise, we will run 6 training jobs, but only 3 of them will run simultaneously.
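
To illustrate the tradeoff: the tuner defaults to Bayesian search, where later jobs learn from earlier results, so running too many jobs in parallel reduces that benefit. Below is a sketch of an alternative configuration, not used in this walkthrough, that switches to random search with full parallelism.

# Sketch only: random search does not learn from earlier jobs,
# so full parallelism costs nothing in search quality.
random_tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    strategy="Random",
    max_jobs=6,
    max_parallel_jobs=6,
)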

Running the Tuning Job

Finally, with bated breath, we run the following line of code. This will run the training jobs on a bunch of virtual machines (EC2 instances).

tuner.fit({"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)

This cell will take some time to run. Go back to the Amazon SageMaker console in the other browser tab. Look for Training > Training jobs as shown in the image below. You should now be able to see three training jobs, all with InProgress status.

You can click on any of these training jobs and see the status for yourself. Scroll down further and you will see Input data configuration: train and Input data configuration: validation.

Let us try to understand what is happening under the hood. The code we wrote so far has launched three instances (virtual servers) of type ml.m4.xlarge. Each of these instances runs the Docker container we specified above, and each container has the XGBoost algorithm already built in. The algorithm pulls the training and validation data and searches for the best hyperparameters.
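
If you prefer to stay in the notebook rather than the console, the same status can be checked with a boto3 call. This is optional.

# Optional: list the training jobs spawned by the tuning job and their status.
sm_client = boto3.client("sagemaker")
response = sm_client.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)
for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])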

Tuning Results

Once all the training jobs are completed, you will see exactly 6 training jobs with their duration and status. You should see a window like this.

Fig 10. Training Jobs

Our code launched all these instances and shut them down immediately after the training was done. Once again, click on any of the names and we get the Job settings window below. This time we see some additional information. You can also see the billable time these instances incurred. Clicking View history reveals a description of the steps taken.

Fig 11. Details of a Training Job

Each of the instances downloaded the input data from S3, but who gave these instances permission to access our S3 bucket? Notice how they were assigned the same IAM Role ARN we came across above. This is the same role our SageMaker notebook instance has, and when we set up the bucket policy above, we explicitly granted access to this role.
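
The same details shown in the console, including the role ARN and the billable seconds, can also be pulled through the API. An optional sketch; replace the job name with any of the ones listed in Fig 10.

# Optional: inspect a single training job programmatically.
detail = boto3.client("sagemaker").describe_training_job(
    TrainingJobName="xgboost-220702-0557-006-b02e145d"  # replace with one of your job names
)
print(detail["RoleArn"])                    # the IAM role the instance assumed
print(detail.get("BillableTimeInSeconds"))  # billable seconds for this job
print(detail["ResourceConfig"])             # instance type and count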

Let’s go back to JupyterLab, and type these lines. This will give us a summary of the tuning job we ran.

from pprint import pprint

tuning_job_result = boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)
if tuning_job_result.get("BestTrainingJob", None):
    print("Best model found so far:")
    pprint(tuning_job_result["BestTrainingJob"])
else:
    print("No training jobs have reported results yet.")

The summary printout should look like the output below. Let us examine the key parameters.

Best model found so far:
{'CreationTime': datetime.datetime(2022, 7, 2, 6, 2, 44, tzinfo=tzlocal()),
'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'validation:auc','Value': 0.9948490262031555},
'ObjectiveStatus': 'Succeeded',
'TrainingEndTime': datetime.datetime(2022, 7, 2, 6, 6, 48, tzinfo=tzlocal()),
'TrainingJobArn': 'arn:aws:sagemaker:ap-south-1:570517415597:training-job/xgboost-220702-0557-006-b02e145d',
'TrainingJobName': 'xgboost-220702-0557-006-b02e145d',
'TrainingJobStatus': 'Completed',
'TrainingStartTime': datetime.datetime(2022, 7, 2, 6, 4, 25, tzinfo=tzlocal()),
'TunedHyperParameters': {'alpha': '0.36843552580362204',
'eta': '0.3283748406816732',
'max_depth': '8',
'min_child_weight': '6.631353393968709'}}

The validation AUC is a sweet 99.5%. The tuned hyperparameter values are also printed here. Finally, the best model generated in this exercise is stored in S3 under the name xgboost-220702-0557-006-b02e145d.
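
Besides the best job, you can pull a summary of all six jobs into a pandas DataFrame with the SDK's tuning-job analytics helper. An optional sketch:

# Optional: compare all the training jobs of the tuning job in one table.
tuning_analytics = sagemaker.HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.job_name, sagemaker_session=sess
)
results_df = tuning_analytics.dataframe()
print(results_df[["TrainingJobName", "FinalObjectiveValue", "TrainingJobStatus"]]
      .sort_values("FinalObjectiveValue", ascending=False))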

Scroll up, type S3 in the search bar and go back to the bucket we created. Clicking our bucket reveals that a new folder named models has been created. Inside this folder we have exactly 6 model folders, and one of them contains the best model (in this case it is xgboost-220702-0557-006-b02e145d). Navigating further inside the folder reveals a subfolder named output, which contains a file model.tar.gz. This is what we have been trying to achieve. Make a note of your best model, as we will need it shortly to run predictions.

Fig 12. Finished Model

Running Predictions

To run predictions, we have to load the saved model into another instance and send the test data to it. Run the following lines in a new cell. Our objective is to list all the saved models in our S3 bucket.

s3 = boto3.resource('s3')
bucket = s3.Bucket(my_bkt)
for obj in bucket.objects.all():
    if obj.key.find('model.tar') > -1:
        print(obj.key)
        KEY = obj.key

This should list out something similar to the lines below. Note that this includes the path of our best model as well.

models/xgboost-220702-0557-001-7e933b22/output/model.tar.gz
models/xgboost-220702-0557-002-a50caa48/output/model.tar.gz
models/xgboost-220702-0557-003-8e680304/output/model.tar.gz
models/xgboost-220702-0557-004-d59193ca/output/model.tar.gz
models/xgboost-220702-0557-005-47391ac0/output/model.tar.gz
models/xgboost-220702-0557-006-b02e145d/output/model.tar.gz

Now run the following lines. We store the best model's name in my_best_job and use it inside a container we will run for predictions.

my_best_job = tuning_job_result["BestTrainingJob"]["TrainingJobName"]
model = sagemaker.model.Model(
    image_uri=container,
    model_data='s3://{}/models/{}/output/model.tar.gz'.format(my_bkt, my_best_job),
    role=role,
)
transformer = model.transformer(
    instance_count=1, instance_type="ml.m4.xlarge",
    assemble_with="Line", accept="text/csv",
    output_path='s3://{}/models'.format(my_bkt)
)

We define model as an instance of the SageMaker Model class and wrap the same container we used above, but this time we will use it for inference (predictions) and inject our best model into it. We also assign the same IAM Role ARN we used above. For our prediction job, we will use the same instance type we used for training. The transformer method of model needs the instance count, instance type and an S3 path to store the output.

Finally, run these lines and sit back. The prediction job runs for around 4 minutes.

test_data_path='s3://{}/sage_test.csv'.format(my_bkt)
transformer.transform(test_data_path, content_type="text/csv")

Once this is done, it creates a file sage_test.csv.out containing the prediction probabilities in the models folder of our bucket.

We can simply copy it into our instance and view the predictions. We will also download the Ground_Truth.csv file into our instance, so that we can run a comparison.

# Download both the ground truth and the predictions into our instance.
s3.Bucket(my_bkt).download_file('Ground_Truth.csv', 'Ground_Truth.csv')
s3.Bucket(my_bkt).download_file('models/sage_test.csv.out', 'sage_test.csv.out')

Once we have both files, we import some standard libraries to calculate the metrics.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
import seaborn as sns

# Read in the downloaded files. pred_proba contains the predicted probabilities from our best model.
# test_df contains the ground truth data, with the first column holding the true class labels.
pred_proba = pd.read_csv('sage_test.csv.out', header=None)
test_df = pd.read_csv('Ground_Truth.csv', header=None)
y_test = test_df.iloc[:, 0].values

# Evaluate the metrics for various thresholds
fpr, tpr, thresholds = roc_curve(y_test, pred_proba.values.ravel())
# Compute the AUC
roc_auc = auc(fpr, tpr)

# Plotting the thresholds on the ROC curve
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
ax.set_title('Choosing the Correct Threshold with Hold Out ROC-AUC {:0.3f}'.format(roc_auc), fontsize=14)
ax.set_ylabel('TPR and Threshold')
ax.set_xlabel('FPR')

# Plot the thresholds curve
ax = sns.lineplot(x=fpr[1::], y=thresholds[1::], markeredgecolor='r', linestyle='dashed', color='r', label='Thresholds')
# Plot the ROC (FPR vs TPR) curve
ax = sns.lineplot(x=fpr[1::], y=tpr[1::], linewidth=3, color='b', label='FPR vs TPR')

This generates the following curve. Note that we got an AUC of 99.5% on the hold-out test set.

Fig 13. AUC Curve
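
Since the curve is meant to help us choose a threshold, here is a small follow-up sketch that picks the threshold maximising TPR minus FPR (Youden's J statistic) and turns the probabilities into hard predictions against the ground truth.

import numpy as np
from sklearn.metrics import confusion_matrix

# Pick the threshold that maximises TPR - FPR and evaluate hard predictions.
best_idx = np.argmax(tpr - fpr)
chosen_threshold = thresholds[best_idx]
y_pred = (pred_proba.values.ravel() >= chosen_threshold).astype(int)
print("Chosen threshold:", chosen_threshold)
print(confusion_matrix(y_test, y_pred))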

Conclusion

Believe it or not, we have done the entire exercise based on this block diagram. The only exception is that we did not deploy our model to an endpoint; instead, we ran our predictions in batch.

Fig 14. Containers for SageMaker (Source: SageMaker Workshop)
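
For reference, deploying the same model object to a real-time endpoint would look roughly like the sketch below. It is not part of this walkthrough, and an endpoint keeps billing until it is deleted.

from sagemaker.serializers import CSVSerializer

# Sketch only: real-time inference instead of batch transform.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
    serializer=CSVSerializer(),
)
# predictor.predict(a_single_csv_row) would return the probability for that row.
predictor.delete_endpoint()  # always clean up, or the endpoint keeps incurring charges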

Instead of launching a SageMaker notebook instance, we could have installed the SageMaker library on our local system and launched the training and prediction instances on AWS from there as well.

Caution!! Once again, I strongly recommend making sure that you shut down your instance once you are done. Go back to the SageMaker Notebook instances page, select the radio button of your instance and click Actions > Stop. If you miss this step, AWS will keep charging you for the instance as long as it is running.
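
If you like, the notebook instance can also be stopped with a single API call, for example from the AWS CLI or a script on your local machine. An optional sketch; MyDemoInstance is the instance name we used in this series.

import boto3

# Stops the notebook instance; billing for the instance stops once it is out of service.
boto3.client("sagemaker").stop_notebook_instance(NotebookInstanceName="MyDemoInstance")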
