Enterprise Azure ML Cluster Deployment

I wanted to take a look at a few of the details that aren’t necessarily clear or covered in the documentation when it comes to setting up a Microsoft Machine Learning Server cluster in Azure.  There are a few things to note for both the cluster configuration and the client configuration when it comes to certificate validation.

To get quickly to a deployed cluster footprint, one can start with the quickstart templates that exist for ML Server.  Those templates may be found in the repo: https://github.com/Microsoft/microsoft-r.  Navigating into the enterprise templates (https://github.com/Microsoft/microsoft-r/tree/master/mlserver-arm-templates/enterprise-configuration/windows-sqlserver), there are block diagrams that show the basic setup.  However, the challenges with setting up anything in a large enterprise are in the details.  The following diagram is what I’ll reference as I discuss key items that caused a bit of churn as we deployed and configured the clusters:

[Diagram: Enterprise Azure ML Cluster]

I’ve put fake IPs and subnet ranges on the diagram.  I’ve used relatively small subnet ranges as the web and compute node clusters shouldn’t really need more than 5 instances for my scenario.  Of note, I’m not addressing disaster recovery or high availability topics here.

There are two key items that can throw a wrench in the works: networking and SSL/TLS certificates.  The firewall items are pretty straightforward, but the certificate configuration is a gift that keeps on giving.

Networking

The troubles with firewalls come in from a few different vectors.  Commonly, in large enterprise organizations, you’ll see constraints such as:

  • No port 80 traffic into Azure
  • No general traffic from Azure to on-premise systems (requires specific firewall exceptions)
  • Often limited to port 443 only into Azure

The primary reason that this can be problematic is that one will not be able to configure and test the clusters over port 80 through the internal load balancer (ILB) that is configured as part of the template.  This means that the DeployClient (https://docs.microsoft.com/en-us/machine-learning-server/python-reference/azureml-model-management-sdk/deploy-client) cannot be used to directly provision and test services over port 80 from an on-premises client.  Note that when using this template, ports 80 and 443 are used to communicate with the ML Server API instead of the local install port 12800.  Of course, you may configure it differently, but as the template stands, communication happens over port 80 through the ILB or port 443 through the App Gateway.  The two ways to work around the network traffic rules are to use a jumpbox on the cluster vnet (or a peered vnet), or to use a self-signed certificate on the App Gateway and communicate over port 443 (more on certificates and client configuration shortly).

On an internally facing implementation, it is likely that a private certificate authority will be used for the certificate.  This will work perfectly well, but one needs to ensure that the App Gateway has been configured properly to communicate with the on-premises certificate authority.  We want to ensure that the App Gateway has access to the PKI environment so that it may access the Certificate Revocation List (CRL).  Access to this depends on how it has been set up.  In this case, it is exposed over port 80 and accessible via the vnet.  In order to resolve the address, the App Gateway needs to point to internal DNS.  Thus, on the vnet configuration under DNS Servers (working from the diagram), one would see entries for 10.1.2.3 and 10.4.5.6.  However, there is also an entry for 168.63.129.16.  This is Azure’s virtual IP, used for platform services such as Azure-provided DNS.

Managed Disks

The current configuration in the template uses unmanaged disks backed by standard blob storage.  However, if you want to ensure that your VM fault domains align with your disk fault domains, it is best to use Managed Disks.  That change can be made by finding the osDisk element for the ComputeNodes and the WebNodes and changing from:

"virtualMachineProfile": {
    "storageProfile": {
       "osDisk": {
          "vhdContainers": [
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[0]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[1]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[2]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[3]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[4]), '2015-06-15').primaryEndpoints.blob, 'vhds')]"
          ],
          "name": "osdisk",
          "createOption": "FromImage"
       },

to:

"virtualMachineProfile": {
   "storageProfile": {
      "osDisk": {
         "managedDisk": {
            "storageAccountType": "Standard_LRS"
         },
         "createOption": "FromImage"
      },

Additionally, since managed disks remove the need for the pre-provisioned storage accounts, you can delete the "storageLoop" entry from the dependsOn array of the web and compute node scale sets, leaving:

"dependsOn": [
   "[concat('Microsoft.Network/virtualNetworks/', variables('vnetName'))]",
   "[resourceId('Microsoft.Network/loadBalancers', variables('webNodeLoadBalancerName'))]"
],

Certificates and Clients

It is best to request the SSL/TLS certificate early and have it available as you configure the environment.  It will help to avoid reconfiguration later.  I want to address client issues first because there are a few approaches to working around the use of a self-signed certificate or a certificate issued by a private certificate authority (private cert).

Anyone who has developed sites or services, even briefly, is familiar with the certificate verification failure that the client (browser or client library) will generate when it retrieves a certificate from an endpoint whose issuing authority it doesn’t recognize.  The primary approaches to resolve the issue are:

  1. Ignore certificate verification warnings
  2. Use an exported .cer file to validate that specific call
  3. Point the API to the proper authority for verification
  4. Add the certificate (contents of the .cer file) to the certificate verification store being used

The more difficult configuration relates to the Python clients calling any of the ML models that have been deployed as services.  .NET code will look at the machine’s certificate store, so supposing that the client machine is part of the domain, the certificate should be verified.  However, when using the Python requests library, you’ll need to either export the .cer file and specify it in the call, set the flag to ignore verification, or append the contents of the .cer file to the CA bundle file that is checked by the client.  The documentation for the verify parameter is here: http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification.  The net is that the call would be something to the effect of:

requests.<verb>('<uri>', verify=True)                  # default: verify against the certifi CA bundle
requests.<verb>('<uri>', verify=False)                 # skip certificate verification
requests.<verb>('<uri>', verify='/path/to/certfile')   # verify against a specific CA bundle
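As a concrete sketch of the verify-with-a-specific-file approach (the service URL, token, and certificate path below are placeholders, not values from the template), the call might be prepared like this:

```python
import requests

# Hypothetical service endpoint published behind the App Gateway.
url = 'https://mlserver.contoso.com/api/myservice/1.0'

session = requests.Session()
# Verify the server's certificate against an exported private CA file
# (hypothetical path) instead of the default certifi bundle.
session.verify = 'C:/certs/private-ca.cer'

req = requests.Request(
    'POST', url,
    json={'inputs': [1, 2, 3]},
    headers={'Authorization': 'Bearer <access_token>'})
prepared = session.prepare_request(req)

# session.send(prepared) would execute the call, validating the
# certificate chain against the private CA file above.
```

The same verify value can of course be passed per call (requests.post(url, ..., verify='C:/certs/private-ca.cer')); a session just avoids repeating it.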

If you prefer to include your cert in the default check, you’ll need to open the cacert.pem file and append the contents of your .cer file to it.  Requests depends on certifi (https://pypi.org/project/certifi/) to do the SSL verification, so on a given client machine you’ll have to look at the location in which the package was installed.  If you’re configuring one of the cluster nodes (web or compute), you’ll find the cacert.pem file at: C:\Program Files\Microsoft\ML Server\PYTHON_SERVER\Lib\site-packages\certifi.
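A minimal sketch of that append, assuming the private CA certificate has been exported in Base64/PEM format (the certificate contents below are placeholders).  To stay side-effect free, the sketch writes to a copy of the bundle; on a web or compute node you would append to the file returned by certifi.where() itself:

```python
import shutil
import certifi

# Location of the cacert.pem that requests/certifi checks in this environment.
bundle = certifi.where()

# Hypothetical private CA certificate, normally read from an exported .cer file.
private_ca_pem = (
    '-----BEGIN CERTIFICATE-----\n'
    'MIIB...placeholder certificate body...\n'
    '-----END CERTIFICATE-----\n'
)

# Work on a copy here; on a cluster node, append to `bundle` directly.
patched_bundle = 'cacert-with-private-ca.pem'
shutil.copyfile(bundle, patched_bundle)
with open(patched_bundle, 'a') as dst:
    dst.write('\n' + private_ca_pem)
```

Note that certifi.where() locates the bundle for whichever Python environment is active, which is why the path differs between a developer machine and the ML Server node install.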

Lastly, and not yet mentioned, if you’re using the DeployClient() to make the API calls, then the verify flag cannot be passed in the call itself.  To that end, you must either use the aforementioned method of appending to the cacert.pem file or create the DeployClient() via configuration, so that you can either point to the certificate to use or set the flag to False.  Thus, your code would have something like the following to set configuration options prior to creating the DeployClient():

from azureml.common import Configuration

conf = Configuration()  # singleton so this will stick across requests

# --- SSL/TLS verification ---
# Set this to false to skip verifying SSL certificate when calling API
# from https server.
# Options:
# --------
# verify -- (optional) Either a boolean, in which case it controls
# whether we verify the server's TLS certificate, or a string, in which
# case it must be a path to a CA bundle to use
# Example:
# verify_ssl = True (default)
# verify_ssl = False
# verify_ssl = '/etc/ssl/certs/ca-certificates.crt'
conf.verify_ssl = True   # Default is True

# Set this to customize the certificate file to verify the peer.
# SSL client certificate default, if String, path to ssl client cert
# file (.pem). If Tuple ('client certificate file, 'client key') pair.
# Example:
# cert = ''
# cert = ('cert.pem', 'key.pem')
conf.cert = None  # Default is None

Closing

To summarize, the changes that were made were:

  1. Networking – shut off port 80, allow for internal DNS resolution, create a jumpbox
  2. Managed Disks – change the template to use them
  3. Certificate verification – several options for ensuring that client verification can be done, whether from calling clients (e.g., browser-based calls) or from the web and compute nodes for interdependent calls

I hope this helps anyone attempting to set this up themselves within the enterprise.
