
Aztk
First, let’s talk a bit about the Azure Distributed Data Engineering Toolkit (aztk). It’s a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It is open source, and you can find it here.
- First install aztk on your local machine
pip install aztk==0.8.1
or install it directly from the repo by following these steps.
- Initialize the project in a directory
aztk spark init
This will create a .aztk folder that contains the config files. .aztk/secrets.yaml holds the IDs of the Azure resources that you need to create.
Note: You only need to fill in secrets.yaml if you want to use the CLI.
- To finish setting up aztk, create the required resources in Azure. The steps are well explained in this section. (You can skip the Using Shared Keys section.)
At this point you have registered an Azure Active Directory application, created a Storage account, created a Batch account, given your app access to these resources, and taken note of the credentials. If you filled in the secrets.yaml file, you can verify the setup by running:
aztk spark cluster list
expected output:
Cluster | State | VM Size | Nodes
--------|-------|---------|------
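If you want to check the setup from a script rather than by eye, the table can be turned into Python dictionaries. This is only a minimal sketch: it assumes the output is a pipe-separated header followed by a dashed separator and one row per cluster, and the `demo` row in the sample is made up for illustration. `parse_cluster_table` is a hypothetical helper, not part of aztk.

```python
def parse_cluster_table(text):
    """Parse a pipe-separated table into a list of dicts keyed by the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        # Skip the separator row, which contains nothing but dashes.
        if all(set(cell) <= {"-"} for cell in cells):
            continue
        # Ignore lines that do not match the header width.
        if len(cells) != len(header):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Hypothetical sample: the header matches the expected output above,
# the "demo" row is invented for this example.
sample = """\
Cluster | State | VM Size | Nodes
--------|-------|---------|------
demo    | steady | standard_f2 | 2
"""

clusters = parse_cluster_table(sample)
print(clusters)
```

An empty list means the credentials work but no cluster exists yet, which is what you should see right after the setup.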
aztk lets you manage clusters and run applications on them. You can do this from the command line, but to deploy your model and run it programmatically, we will write a script that does it for us.
Once you have all the resources set up in Azure, it is time to look at score.py, the main module that I used to create a cluster and submit an app to it.
Code
1. Copy the content of score.py to a local file; this script is responsible for scheduling the scoring. It contains the aztk code that creates a cluster and runs code on it. I will now go over the code section by section. Look for the #UPDATE markers in the sections that you need to fill in.
2. We need to specify the secrets configurations. Here’s a brief description of each one:

Custom scripts are bash scripts that are executed in the Docker container in which the Spark environment runs. For example, you can use them to set environment variables or install additional packages on the nodes.
You create a cluster by specifying size, the number of nodes in the cluster, and vm_size, the type of VM you want to run on (for the list of VMs available in Azure, please refer to this link). If you need a GPU machine, this is the place to define it. You also need to define the Spark configuration, sparkconf, for each of the nodes. You can also add user_config if you need to SSH into the cluster.
Note: It is important that the VM specified in the cluster config has enough memory for your app; otherwise you will run into errors.
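One way to catch a memory mismatch before paying for a cluster is a quick sanity check on the numbers. The sketch below is an illustration under assumptions, not part of score.py or aztk: `parse_spark_memory`, `fits_on_vm` and the `reserved_gib` headroom are hypothetical names, and the VM memory must be looked up by you for the vm_size you picked. It compares a Spark size string (the format used by settings such as spark.executor.memory) with the memory of the VM.

```python
import re

# Binary multipliers for the size suffixes Spark accepts (k, m, g, t).
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_spark_memory(value):
    """Convert a Spark size string such as '512m' or '4g' to bytes."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def fits_on_vm(executor_memory, vm_memory_gib, reserved_gib=1.0):
    """Return True if the executor memory plus some headroom fits on the VM.

    reserved_gib is a hypothetical safety margin for the OS and the
    container; tune it for your own workload.
    """
    needed = parse_spark_memory(executor_memory) + reserved_gib * 1024 ** 3
    return needed <= vm_memory_gib * 1024 ** 3


# Example with made-up numbers: a VM with 7 GiB of memory.
print(fits_on_vm("4g", vm_memory_gib=7))
print(fits_on_vm("8g", vm_memory_gib=7))
```

With these example numbers the first check passes and the second fails, which is the situation the note above warns about.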

This code submits the app to the cluster and waits for it to finish. If your app writes logs, you will be able to see them in the blob storage you created, under the name of your app.
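To make the "wait for it to be done" step concrete, here is a generic polling loop. It is a sketch of the idea only, not the aztk implementation: `get_state` stands in for whatever call reports the application state, and the state names `"completed"` and `"failed"` are assumptions chosen for the example.

```python
import time


def wait_until_done(get_state, done_states=("completed", "failed"),
                    interval_s=10.0, timeout_s=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in done_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


# Demo with a fake state source, so nothing is actually submitted
# and the loop does not really sleep.
fake_states = iter(["running", "running", "completed"])
final_state = wait_until_done(lambda: next(fake_states),
                              interval_s=0, sleep=lambda seconds: None)
print(final_state)
```

The timeout matters here: a scoring job that hangs would otherwise keep the cluster, and the bill, running.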


You should delete the cluster once all the applications running on it have completed, so that you don’t pay for a cluster you are not using.
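A simple way to make sure the cluster is deleted even when the app fails is to wrap the run in try/finally. The sketch below shows only that pattern: `run_on_temporary_cluster` and the three callables are hypothetical stand-ins for the create, submit and delete steps of your script, not aztk functions.

```python
def run_on_temporary_cluster(create_cluster, run_app, delete_cluster):
    """Create a cluster, run the app on it, and always delete the cluster."""
    cluster_id = create_cluster()
    try:
        return run_app(cluster_id)
    finally:
        # Runs whether run_app returned normally or raised.
        delete_cluster(cluster_id)


# Demo with stand-in callables that only record what happened.
events = []


def fake_create():
    events.append("create")
    return "demo-cluster"


def failing_app(cluster_id):
    events.append("run")
    raise RuntimeError("scoring failed")


def fake_delete(cluster_id):
    events.append("delete " + cluster_id)


try:
    run_on_temporary_cluster(fake_create, failing_app, fake_delete)
except RuntimeError:
    pass

print(events)  # ['create', 'run', 'delete demo-cluster']
```

The delete step still runs after the simulated failure, so a crashed scoring job does not leave a cluster behind.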
Run the script locally:
python score.py
Sometimes you will need to debug an issue with the cluster. You can run the following command to collect the logs you need.
aztk spark cluster debug --id {cluster-id} --output path/to/output-dir
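If you want to collect these logs from Python, for example right after a failed run, you can call the same command through subprocess. This is a minimal sketch: the function names are hypothetical, the flags are written with double hyphens (`--id`, `--output`), and it assumes the aztk CLI is on your PATH and configured.

```python
import subprocess


def build_debug_command(cluster_id, output_dir):
    """Build the argument list for the aztk debug command shown above."""
    return ["aztk", "spark", "cluster", "debug",
            "--id", cluster_id, "--output", output_dir]


def collect_debug_logs(cluster_id, output_dir):
    """Run the debug command; raises CalledProcessError if aztk fails."""
    subprocess.run(build_debug_command(cluster_id, output_dir), check=True)


# Only print the command here; call collect_debug_logs(...) to run it.
print(" ".join(build_debug_command("my-cluster", "path/to/output-dir")))
```

Passing the arguments as a list avoids shell quoting problems if the output path contains spaces.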
Azure Functions
Azure Functions lets you create scheduled or triggered pieces of code implemented in a variety of programming languages. We will create a Python function and trigger it with a timer. For my use case, the function runs score.py every 24 hours to produce predictions.
Note: The Consumption plan limits how long a function can run (10 minutes at most), so if your code needs more time than that, you need to use the App Service Plan; otherwise choosing the Consumption plan is fine. To learn more about the App Service Plan click here
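The 24-hour timer is configured in the function's function.json file. As a rough sketch, the snippet below builds that configuration in Python and prints it as JSON. The binding name `mytimer` and the script file name `__init__.py` are assumptions for this example; the schedule is a six-field NCRONTAB expression (second, minute, hour, day, month, day of week), and `0 0 0 * * *` fires once a day at midnight.

```python
import json


def timer_function_json(schedule="0 0 0 * * *", binding_name="mytimer"):
    """Return a function.json structure for a timer-triggered function."""
    return {
        "scriptFile": "__init__.py",
        "bindings": [
            {
                "name": binding_name,
                "type": "timerTrigger",
                "direction": "in",
                # Six fields: second minute hour day month day-of-week.
                "schedule": schedule,
            }
        ],
    }


config = timer_function_json()
print(json.dumps(config, indent=2))
```

Save the printed JSON as function.json next to the function's code; the function body itself would then call the scoring logic from score.py.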
Resources
· Aztk git repo: https://github.com/Azure/aztk. This repo contains documentation and code, and you can file an issue there if you hit any bugs
· Useful aztk command line instructions:
> aztk spark cluster list
To list all the clusters associated with your account, as described in the secrets.yaml file
> aztk spark cluster delete --id clusterId
To delete the specified cluster
> aztk spark cluster view --id clusterId
To view the state of the nodes on the specified cluster
> aztk spark cluster debug --id clusterId --output path/to/logs.txt