Building Microsoft ML+Python App Container

I wanted to share a quick post about building out a Microsoft Machine Learning container that can be used with Azure App Service to host a front-end web app developed in Python which makes use of Microsoft’s ML Libraries.  The full documentation may be found here: https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server.  Additionally, if you’re looking just to install the client libraries on Linux that information can be found here: https://docs.microsoft.com/en-us/machine-learning-server/install/python-libraries-interpreter#install-python-libraries-on-linux.

However, if you plan to build a container and host it in Azure, particularly Azure App Service, there are a couple of things you’ll want to do in addition to simply installing the ML package.  This post will focus on getting the image size down and ensuring dependencies are installed.

There are a few problems with simply creating a Dockerfile and including a RUN command such as:

# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

 

The problems are:

  1. The bits are likely in the wrong location
  2. The dependencies for them may not be installed in your desired location
  3. SIZE matters

If you’ve already got a front-end app running that uses ML services deployed with Microsoft ML Server, then I highly suggest that you use the direct API calls and Swagger capability to implement your application’s client-side interface.  However, if you’ve gone down the path of using DeployClient() to discover service endpoints and make the calls, then you will need the Microsoft ML client libraries installed.  Again, I strongly suggest using Swagger and the requests package to create your client, but if you have to move forward with DeployClient() for now, here are my suggestions.
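
To show what the direct-API approach looks like, here is a minimal sketch that authenticates against the ML Server operationalization login endpoint and calls a deployed service with requests.  The host, credentials, service name, version, and input schema are hypothetical placeholders; the real values and routes can be confirmed from your cluster and the service’s swagger.json.

import requests

host = 'https://my-ml-cluster.contoso.com'  # hypothetical ML Server endpoint (App Gateway or ILB)

# Authenticate to obtain a bearer token
login_response = requests.post(host + '/login',
                               json={'username': 'my-user', 'password': 'my-password'})
login_response.raise_for_status()
access_token = login_response.json()['access_token']

headers = {'Authorization': 'Bearer ' + access_token}

# Call a deployed service; the name, version, and input payload below are placeholders
# that should be confirmed from the service's swagger.json
service_response = requests.post(host + '/api/myService/v1.0.0',
                                 headers=headers,
                                 json={'input_value': 42})
print(service_response.json())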

Thanks to Nick Reith for the help explaining and illustrating the staged build for me.  To kick things off, let’s start the Dockerfile by setting up stage 0:

FROM tiangolo/uwsgi-nginx-flask:python3.6 as stage0
#distro info
RUN cat /etc/issue
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils
# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

Note that I’m using tiangolo’s uwsgi-nginx-flask container.  This helps because I don’t have to worry about all of the initial configuration of uwsgi and nginx.  However, I will cover its usage and optimization in a subsequent post.  For now, let’s focus on the Python and ML library setup in the image.

In the code block above, note that we run the setup of the ML libraries in a single RUN instruction.  By default, it installs the packages into the folder /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages.  However, we need to move the required packages from that location over into our Python runtime environment.  The good news is that we don’t need all of them.  The bad news is that they have a number of dependencies that we must install.  Also, installing the packages drives the image size up to about 5.6 GB.  That’s a pretty large container.  The real problem with the size shows up at deployment time.  Depending on the size of the App Service instance I used, App Service could take as much as 13 minutes to pull the image and get it running.  This is definitely not desirable for iterative work or scale operations.  So, let’s reduce that footprint.

After the packages are installed completely, we’ll copy the ones we know that we need in order to use the DeployClient() as a client-side object for calling service endpoints.

#copy the ML libraries into a tmp folder so they can be moved into the app's Python path later
RUN mkdir /tmp/mldeps
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/adal* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/liac* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/azureml* /tmp/mldeps/

In the lines above, I create a temporary holding location and recursively copy the folders for adal, liac, and azureml into it in preparation for the next stage.  To start the next stage, the image is pulled from the same base image that was used previously.  Subsequently, the files that were copied from the Microsoft ML install into the temp location in stage 0 are copied into a temp location in this stage.

#start next stage
FROM tiangolo/uwsgi-nginx-flask:python3.6
# Copy ml packages and app files from previous install and discard the rest
COPY --from=stage0 /tmp/mldeps /tmp/mldeps

We can’t copy them directly to the site-packages location, because if we do that prior to updating pip and installing dependencies we’ll get several dependency version errors.  Thus, getting the files into the proper location in this stage follows the sequence of [copy from stage 0 to temp] -> [update pip] -> [install dependencies] -> [move to final location] -> [delete temp folder].

#must upgrade and install python package dependencies for ML packages before moving over ML packages
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

#dependencies needed for azureml packages
RUN pip install dill PyJWT cryptography
# add needed packages for front-end app
RUN pip install dash dash-core-components dash-html-components dash_dangerously_set_inner_html pandas requests

#move ML packages into place
RUN cp -r /tmp/mldeps/* /usr/local/lib/python3.6/site-packages/

#remove the temp holding directory
RUN rm -r /tmp/mldeps

Other than a couple more lines for setting up the ports, that’s about it.  The result is an image that drops from 5.6 GB to about 1.3 GB, as can be seen in my local image repository.

C:\Users\jofultz\Documents\Visual Studio 2017\Projects\App%20Service%20Linux%20Python%20App>docker images

REPOSITORY      TAG      IMAGE ID       CREATED       SIZE
ml-py-reduced   latest   f248747aedad   6 hours ago   1.31GB
ml-py           latest   f308579a7ac8   6 hours ago   5.66GB

Keeping the size down allows the image to be pulled initially and made operational in a much shorter timeframe.  For ease of reading I’ve kept mostly discrete operations, but if you want to reduce the number of image layers, you can combine a number of the RUN statements.  For reference, here is the full Dockerfile for building the reduced image:

FROM tiangolo/uwsgi-nginx-flask:python3.6 as stage0

#distro info
RUN cat /etc/issue

RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

#copy the ML libraries into a tmp folder so they can be moved into the app's Python path later
RUN mkdir /tmp/mldeps
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/adal* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/liac* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/azureml* /tmp/mldeps/

#start next stage
FROM tiangolo/uwsgi-nginx-flask:python3.6

# Copy ml packages and app files from previous install and discard the rest
COPY --from=stage0 /tmp/mldeps /tmp/mldeps

#must upgrade and install python package dependencies for ML packages before moving over ML packages
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

#dependencies needed for azureml packages
RUN pip install dill PyJWT cryptography
# add needed packages for front-end app
RUN pip install dash dash-core-components dash-html-components dash_dangerously_set_inner_html pandas requests

#move ML packages into place
RUN cp -r /tmp/mldeps/* /usr/local/lib/python3.6/site-packages/

#remove the temp holding directory
RUN rm -r /tmp/mldeps

ENV LISTEN_PORT=80

EXPOSE 80

COPY /app /app

Enterprise Azure ML Cluster Deployment

I wanted to take a look at a few of the details that aren’t necessarily clear or covered in the documentation when it comes to setting up a Microsoft Machine Learning Server cluster in Azure.  There are a few things to note for both the cluster configuration and the client configuration when it comes to certificate validation.

To get quickly to a deployed cluster footprint, one can start with the quickstart templates that exist for ML Server.  Those templates may be found in the repo: https://github.com/Microsoft/microsoft-r.  Navigating into the enterprise templates (https://github.com/Microsoft/microsoft-r/tree/master/mlserver-arm-templates/enterprise-configuration/windows-sqlserver), there are block diagrams that show the basic setup.  However, the challenges with setting up anything in a large enterprise are in the details.  The following diagram is what I’ll reference as I discuss key items that caused a bit of churn as we deployed and configured the clusters:

[Diagram: Enterprise Azure ML Cluster]

I’ve put fake IPs and subnet ranges on the diagram.  I’ve used relatively small subnet ranges as the web and compute node clusters shouldn’t really need more than 5 instances for my scenario.  Of note, I’m not addressing disaster recovery or high availability topics here.

There are two key items that can throw a wrench in the works: networking and SSL/TLS certificates.  The firewall items are pretty straightforward, but the certificate configuration is a gift that keeps on giving.

Networking

The troubles with firewalls come in from a few different vectors.  Commonly, in large enterprise organizations, you’ll see constraints such as:

  • No port 80 traffic into Azure
  • No general traffic from Azure to on-premise systems (requires specific firewall exceptions)
  • Often limited to port 443 only into Azure

The primary reason that this can be problematic is that one will not be able to configure and test the clusters over port 80 through the ILB that is configured as part of the template, meaning that the DeployClient (https://docs.microsoft.com/en-us/machine-learning-server/python-reference/azureml-model-management-sdk/deploy-client) cannot be used to directly provision and test services over port 80 from an on-premise client.  Note that when using this template, ports 80 and 443 are used to communicate with the ML Server API instead of the local install port 12800.  Of course, you may configure it differently, but as the template presently stands, communication goes over 80 and 443 through the ILB or App Gateway, respectively.  The two ways to work around the network traffic rules are to use a jumpbox on the cluster vnet (or a peered vnet), or to use a self-signed certificate on the App Gateway and use port 443 (more on certificates and client configuration shortly); see the sketch below.
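
As a rough illustration of the second workaround, here is a minimal sketch of pointing DeployClient at the App Gateway’s HTTPS endpoint instead of the node-local port 12800.  The host names and credentials are hypothetical placeholders, and the exact constructor arguments should be confirmed against the DeployClient documentation linked above.

from azureml.deploy import DeployClient

# From a jumpbox on the cluster vnet (or a peered vnet) the ILB can be reached over port 80,
# e.g. 'http://10.0.1.10' (hypothetical ILB front-end IP).  From on-premise, only the
# App Gateway over port 443 is reachable.
app_gateway_host = 'https://mlcluster.contoso.com'  # hypothetical App Gateway DNS name

# Provision and test services through the App Gateway over 443 rather than the
# node-local port 12800
client = DeployClient(app_gateway_host, use='MLServer', auth=('username', 'password'))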

On an internally facing implementation, it is likely that a private certificate authority will be used for the certificate.  This will work perfectly well, but one needs to ensure that the App Gateway has been configured properly to communicate with the on-premise certificate authority.  We want to ensure that the App Gateway has access to the PKI environment so that it may reach the Certificate Revocation List (CRL).  Access to the CRL depends on how it has been set up; in this case, it is exposed over port 80 and accessible via the vnet.  In order to resolve the address, the App Gateway needs to point to internal DNS.  Thus, on the vnet configuration under DNS Servers (working from the diagram), one would see entries for 10.1.2.3 and 10.4.5.6.  However, there is also an entry for 168.63.129.16, which is used in Azure for platform services.

Managed Disks

The current configuration in the template uses standard blob-storage-based disks.  However, if you want to ensure that your VM fault domain aligns with your disk fault domain, it is best to use Managed Disks.  That change can be made by finding the osDisk element for the ComputeNodes and the WebNodes and changing from:

"virtualMachineProfile": {
    "storageProfile": {
       "osDisk": {
          "vhdContainers": [
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[0]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[1]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[2]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[3]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[4]), '2015-06-15').primaryEndpoints.blob, 'vhds')]"
          ],
          "name": "osdisk",
          "createOption": "FromImage"
       },

to:

"virtualMachineProfile": {
   "storageProfile": {
      "osDisk": {
         "managedDisk": { 
            "storageAccountType": "Standard_LRS" },
            "createOption": "FromImage"
         },

Additionally, you can remove the dependency on the storage provisioning by deleting the "storageLoop" entry from the dependsOn array:

"dependsOn": [
   "storageLoop",
   "[concat('Microsoft.Network/virtualNetworks/', variables('vnetName'))]",
   "[resourceId('Microsoft.Network/loadBalancers', variables('webNodeLoadBalancerName'))]"

Certificates and Clients

It is best to request the SSL certificate early and have it available as you configure the environment; it will help to avoid reconfiguration later.  I want to address client issues first, because there are a few approaches to working around the use of a self-signed certificate or a private certificate authority (private CA).

Anyone who has developed sites or services even briefly is familiar with the certificate verification failure that the client (browser or client library) will generate when it retrieves a certificate from an endpoint whose authority it doesn’t recognize.  The primary approaches to resolve the issue are:

  1. Ignore certificate verification warnings
  2. Use an exported .cer file to validate that specific call
  3. Point the API to the proper authority for verification
  4. Add the certificate (contents of the .cer file) to the certificate verification store being used

The more difficult configuration is related to the Python clients calling any of the ML models that have been deployed as services.  .NET code will look at the machine’s certificate store, so assuming the client machine is part of the domain, the certificate should be verified.  However, using the Python requests library, you’ll need to either export the .cer file and specify it in the function call, set the verify flag to False, or append the contents of the .cer file to the CA bundle that is checked by the client.  The documentation for setting the verify parameter is here: http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification.  The net is that the call would be something to the effect of:

requests.<verb>('<uri>', verify=<'/path/to/certfile' | True | False>)

If you prefer to include your cert in the default check, you’ll need to open the cacert.pem file and append the contents of your .cer file to it.  Requests depends on certifi (https://pypi.org/project/certifi/) to do the SSL verification.  On a given client machine, you’ll have to look at the location in which the package was installed.  If you’re configuring one of the cluster nodes (web or compute), then you’ll find the cacert.pem file at: C:\Program Files\Microsoft\ML Server\PYTHON_SERVER\Lib\site-packages\certifi.
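
As a rough sketch, the cacert.pem location can also be found programmatically through certifi, and the private CA certificate appended to it.  The .cer file name below is a hypothetical placeholder, and the certificate is assumed to be PEM (base64) encoded.

import certifi

# Path to the CA bundle that requests/certifi verifies against by default
ca_bundle_path = certifi.where()
print(ca_bundle_path)

# Append the private CA certificate to the bundle; the .cer file below is a
# hypothetical placeholder and is assumed to be PEM (base64) encoded
with open('my-private-ca.cer', 'r') as private_ca:
    private_ca_pem = private_ca.read()

with open(ca_bundle_path, 'a') as ca_bundle:
    ca_bundle.write('\n' + private_ca_pem)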

Lastly, and not yet mentioned, if you’re using the DeployClient() to make the API calls, then the verify flag cannot be passed in the call.  To that end, you must either use the aforementioned method of appending to the cacert.pem file, or configure the DeployClient() so that you can either point to the certificate to use or set the verification flag to False.  Thus, your code would have something like the following to set configuration options prior to creating the DeployClient():

from azureml.common import Configuration

conf = Configuration()  # singleton so this will stick across requests

# --- SSL/TLS verification ---
# Set this to false to skip verifying SSL certificate when calling API
# from https server.
# Options:
# --------
# verify -- (optional) Either a boolean, in which case it controls
# whether we verify the server's TLS certificate, or a string, in which
# case it must be a path to a CA bundle to use
# Example:
# verify_ssl = True (default)
# verify_ssl = False
# verify_ssl = '/etc/ssl/certs/ca-certificates.crt'
conf.verify_ssl = True   # Default is True

# Set this to customize the certificate file to verify the peer.
# SSL client certificate default, if String, path to ssl client cert
# file (.pem). If Tuple ('client certificate file, 'client key') pair.
# Example:
# cert = ''
# cert = ('cert.pem', 'key.pem')
conf.cert = None  # Default is None
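
With the configuration above set, creating the client itself would look something like the following minimal sketch; the certificate path, host, and credentials are hypothetical placeholders, and the import path and constructor arguments should be confirmed against the azureml-model-management-sdk documentation.

from azureml.common import Configuration
from azureml.deploy import DeployClient

conf = Configuration()   # singleton, so the settings stick for subsequent calls
conf.verify_ssl = '/path/to/private-ca-bundle.pem'   # hypothetical CA bundle; use False only for testing

# With the configuration set, create the client against the cluster endpoint
# (host and credentials below are hypothetical placeholders)
client = DeployClient('https://mlcluster.contoso.com', use='MLServer', auth=('username', 'password'))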

Closing

To summarize, the changes that were made were:

  1. Networking – shut off 80, allow for internal DNS resolution, create a jumpbox
  2. Managed Disks – change the template to use them
  3. Certificate verification – several options for ensuring that certificate verification can be done either from calling clients (e.g., browser-based calls) or from the web or compute nodes for interdependent calls.

I hope this helps anyone attempting to set this up themselves within the enterprise.