
Chaining Spring WebFlux Calls

Introduction

I recently needed to create a health endpoint for our Java WebFlux application that would retrieve results from 5 different service endpoints.  Three of them were Mono<> and two were Flux<>. Each one of them had a different generic type.  Through this effort, I learned a little more about combining disparate results in Spring WebFlux.

The first part of the challenge is figuring out how best to combine something like a Mono<> with something like a Flux<>.  All of the examples I found would execute something like:

Iterable<City> filteredCityList = cityData.getAll(filter).toIterable();

or

State currentState = stateData.getStateById(abbreviation).block();

The key issue with these types of examples is that they rely on blocking calls, which aren't permitted on the non-blocking threads that Spring WebFlux runs on.  Eventually, I gave up on finding a working example that would suit my purpose.  However, the examples did give me some clues about methods I might use to accomplish the task.  To that end, I looked at merge, zip/zipWith, and concat/concatWith as methods in the WebFlux framework that might fit.

Options

Before we dive into each of them, I want to state the original, unmodified goal:

Goal: Execute requests across multiple services in parallel and capture duration at the completion of each of them.

The merge and zip methods had similar behaviour in that the results from the publishers would be combined as they arrived.  However, I need to complete the full request and provide a duration for that completion.  Interleaved results arriving at the subscriber make it much more difficult to tease apart which result belongs to which call and to figure out when each one completes.  Additionally, because of the interleaving, the output from any one publisher would affect the measurement for all of the publishers being measured.  The zip methods had a similar issue, with the added complexity of returning the results in tuples that must be unpacked in a way that still calculates the duration for each call distinct from the concurrent calls to the other publishers.
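
For illustration, a zipped version of two of the calls might look something like the following sketch.  The single tuple is all that reaches the subscriber, so the individual completion times are already folded together by the time I could measure anything (the getName() accessor on State is just assumed for the example):

// Sketch only: Mono.zip emits one Tuple2 after both publishers complete,
// so per-publisher completion times are not visible to the subscriber.
Mono<State> stateMono = stateData.getById(abbreviation);
Mono<List<City>> citiesMono = cityData.getAll(filter).collectList();

Mono<String> zipped = Mono.zip(stateMono, citiesMono)
    .map(tuple -> tuple.getT1().getName() + " has " + tuple.getT2().size() + " cities");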

My imperfect solution was to use concatWith().  This had the benefit of allowing me to treat each request as a single, complete execution.  The primary flaw is that the subscription to each of the chained publishers happens sequentially.  So, the previous examples combined with concatWith would result in something similar to:

Flux<City> citiesFlux = cityData.getAll(filter);
Mono<State> stateMono = stateData.getById(abbreviation);
// illustrative; as noted below, the chained publishers must converge on a common type
Flux<Object> combined = stateMono.cast(Object.class).concatWith(citiesFlux);

Solution

So, this gets me part of the way there in that I can now successfully create a chain of calls, but I still need to do two more things:

  1. I need to measure and record the execution time for each of the discrete Mono<> or Flux<> calls
  2. I need to collect all of the discrete measurements into a single return and provide an overall status

In order to get time measurements at the completion of each call, I need to hook into each of them.  In this case, I use map() to transform each result into a MeasurementResult.  For the Mono<> calls this is pretty straightforward, as it looks something like:

Mono<MeasurementResult> stateByIdResult = stateData.getById(abbreviation)
    .map(state -> {
        return buildResult("getStateById", getElapsedTime(), threshold);
    });

The buildResult() method returns a MeasurementResult.  The call to getElapsedTime() is a way for me to externalize the time measurement while satisfying Java's scoping rules, since the compiler requires that the variable I use to track and calculate duration be final.

So, great, now I can easily chain all of the Mono<> return types together and return a custom type.  What about the Flux<>?  If I do the same for a Flux, I'll get a bunch of MeasurementResults, because the map() executes and produces a result value for each of the data items that flow out of the Flux publisher.  This is where one is tempted to figure out how to make toIterable() or block() work, but they simply won't, as blocking calls aren't permitted.  My solution was to convert all of the types to Mono<>.  For the Flux<> methods this meant using collectList().  Using collectList() helped me in two ways:

  1. it would convert the result to a Mono<>, making it easier to chain
  2. it would allow me to process the stream output at the end, which is what I need in order to prevent multiple results and provide a single duration

To that end, my Flux<> calls end up looking like:

Mono<MeasurementResult> citiesResult = cityData.getAll(filter)
    .collectList()  // <-- converts the Flux<City> into a Mono<List<City>>
    .map(cities -> {
        return buildResult("getAllCities", getElapsedTime(), threshold);
    });

The last part was to chain them together and produce the final result in code that looks similar to:

Mono<List<MeasurementResult>> buildChain() {
    // Mono<> API call
    Mono<MeasurementResult> stateByIdResult = stateData.getById(abbreviation)
        .map(state -> {
            return buildResult("getStateById", getElapsedTime(), threshold);
        });

    // Flux<> API call
    Mono<MeasurementResult> citiesResult = cityData.getAll(filter)
        .collectList()  // <-- converts the Flux<City> into a Mono<List<City>>
        .map(cities -> {
            return buildResult("getAllCities", getElapsedTime(), threshold);
        });

    return stateByIdResult.concatWith(citiesResult).collectList();
}

Using collectList() allows me to return the discrete MeasurementResult objects from each call in a single List.  In turn, that helps me at the end, because now I need to process those results and put them into an overall results object so that the final bit of code that returns to the caller looks something like:

// returns Mono<ResponseEntity<FinalResult>>
return buildChain()
    .map(discreteResultsList -> {
        FinalResult finalResult = new FinalResult();
        finalResult.individualResults = discreteResultsList;
        finalResult.description = "[some text]";
        finalResult.version = buildConfig.getVersion();
        return finalResult;
    }).map(result -> ResponseEntity.ok().body(result));

The pattern described above may be used for any arbitrary chain by calling concatWith() and adding the next publisher.  The key is that they must all converge on a single output type.  Additionally, if you need the actual data rather than just the measurements, you could transform to a common type or interface for the types in play, or do something like add each result to a List and return it.
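
For example, adding a third call to the chain (using a hypothetical countyData service here) would look something like:

// Hypothetical third call; like the others, it is mapped to MeasurementResult
// so that every publisher in the chain converges on the same type.
Mono<MeasurementResult> countiesResult = countyData.getAllByState(abbreviation)  // assumed Flux<County>
    .collectList()
    .map(counties -> buildResult("getAllCountiesByState", getElapsedTime(), threshold));

Mono<List<MeasurementResult>> allResults = stateByIdResult
    .concatWith(citiesResult)
    .concatWith(countiesResult)
    .collectList();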

Caveats

There are a few things to note with this approach.  First and foremost, concatWith() subscribes to each of the publishers sequentially.  Thus, if you're looking for parallel execution, you'll need to look at a different approach, such as zip or merge, or create your own subscriptions with subscribe() on something like the parallel scheduler.
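
As a sketch of that trade-off, one way to get concurrent subscription while still keeping one MeasurementResult per call is to zip the already-mapped results (java.util.Arrays assumed for building the list):

// Sketch: Mono.zip subscribes to both sources when the zipped Mono is subscribed,
// so non-blocking calls can run concurrently.  Note that the shared, sequential
// time marker described below would then need to become a per-call start time.
Mono<List<MeasurementResult>> allResults = Mono.zip(stateByIdResult, citiesResult)
    .map(tuple -> Arrays.asList(tuple.getT1(), tuple.getT2()));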

Secondly, as noted previously, I had to track duration.  To do this, I need a way to record and update a start time value.  Since that value is needed in the lambdas used to map the data, it is captured in a closure, and Java requires that a variable captured from the outer scope be final.  To that end, I just declared a final List that I could mutate:

final List timeList = new ArrayList();

At the completion of each call, I use the last element of that list to calculate the duration and add a new value to the list for the next call to use.  This introduces some extra time into the duration calculation, but nothing significant enough to matter for my use case.
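
Concretely, a minimal sketch of that helper, assuming epoch milliseconds stored in a final field (a final local works too if the logic is inlined in the lambdas), might look like the following.  The markStart() method is just a convenience I'm assuming for seeding the first entry before subscribing to the chain:

// Minimal sketch: timeList is a final field, so the map() lambdas can call
// getElapsedTime() without capturing a mutable local variable.
private final List<Long> timeList = new ArrayList<>();

// seed the first start time just before subscribing to the chain
private void markStart() {
    timeList.add(System.currentTimeMillis());
}

private long getElapsedTime() {
    long now = System.currentTimeMillis();
    long elapsed = now - timeList.get(timeList.size() - 1); // time since the previous mark
    timeList.add(now);                                      // becomes the start mark for the next call
    return elapsed;
}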

 

Acquisition of Adobe Live Stream with Azure Functions

Recently I’ve been working on the acquisition of click stream events from Adobe Live Stream. As part of that work, I wanted to ensure that we could scale up throughput while keeping the overall cost and complexity low. This solution moves unprocessed data to a single EventHub instance and uses a Consumption Plan. The final flow looks like the following:

High-level Flow

The entire write-up can be found in the repository located at: https://github.com/jofultz/AdobeClickStreamIngestion.

Process Dump for Azure Functions

I originally posted this on the MSDN blogs.  However, with the decommissioning of personal blogs there, the content was lost.  I had some requests about this topic and thought I'd repost it here.


Recently I was asked how we might programmatically set up a way to dump the w3wp process for a Function App.  Jeremy Brooks (fellow Microsoftie) pointed out that I could issue the same GET requests used by Kudu to show process information in the dashboard.  So, before I get to the automation, I'll review fetching the information using Postman.

Setup Deployment Credentials

If you are currently using Kudu for deployment, you've already done this.  If not, take a look here (https://docs.microsoft.com/en-us/azure/app-service/app-service-deployment-credentials) and follow the guidance to get a username and password set up.

Requests in Postman

In Postman, use the Authorization tab to set the type to Basic and enter the username and password created as the deployment credentials.

Postman takes care of Base64 encoding and adding the header.  It’ll be a little more work when I get to the code 🙂

The next thing to know is that I will issue 3 different GET requests, but all of them are located at the Kudu endpoint.  Thus, the base URI pattern will be: https://[yourAppName].scm.azurewebsites.net/api/[command].  The three commands I wish to execute are:

  1. /processes – retrieves a list of all running processes
  2. /processes/[processId] – properties of a specific process
  3. /processes/[processId]/dump?dumpType=[1||2] – dumps the process with 1 for mini-dump and 2 for a full dump

Before I move on to the code, I want to review the sample return from the first two requests that need to be issued.  The first request is simply going to give us the list of the processes that are presently servicing the Function App.  It will look something like:


[
   {
      "id": 4292,
      "name": "w3wp",
      "href": "https://jofultzfntest.scm.azurewebsites.net/api/processes/4292",
      "user_name": "IIS APPPOOL\\jofultzFnTest"
   },
   {
      "id": 6948,
      "name": "w3wp",
      "href": "href="https://jofultzfntest.scm.azurewebsites.net/api/processes/6948">https://jofultzfntest.scm.azurewebsites.net/api/processes/6948"
   }
]

I clearly don't have much going on here; if you have anything of significance running, you'll have a lot more line items.  The key thing to note is the "name" key.  I want all items in the array where the "name" key has a value of "w3wp".  However, I don't want to dump the w3wp process that is serving Kudu, which brings me to the second set of requests.  For every item, I use the "href" value to retrieve the individual properties of the process.  The result for each of those will look something like:

{
   "id": 4292,
   "name": "w3wp",
   "href": " href="https://jofultzfntest.scm.azurewebsites.net/api/processes/4292">https://jofultzfntest.scm.azurewebsites.net/api/processes/4292",
   "minidump": " href="https://jofultzfntest.scm.azurewebsites.net/api/processes/4292/dump">https://jofultzfntest.scm.azurewebsites.net/api/processes/4292/dump",
   "iis_profile_timeout_in_seconds": 180,
   "parent": "href="https://jofultzfntest.scm.azurewebsites.net/api/processes/-1">https://jofultzfntest.scm.azurewebsites.net/api/processes/-1",
   "children": [],
-
-
   "is_scm_site": true
}

I've cut a lot of the properties out to keep it manageable in this post, but I kept the key that we need to check: "is_scm_site".  Thus, as we loop through the process list that we got from the first request, we check that this property is false before dumping the process.  Thanks to David Ebbo for pointing me to the "is_scm_site" key.

Implementation

I've created a blob storage container named "dumpfiles" to receive the process dumps, and I'm organizing them into folders by date.

functions-dump-storage

Function Code

The next bit is a small piece of code that I put into a separate Function App to retrieve the information from Kudu.  To get things set up, I need to grab the list of running processes by issuing the same GET request that I issued in Postman.  To that end, I've got a little code to set up the HttpClient.

HttpClient client = new HttpClient();
client.BaseAddress = new Uri(baseAddress);

// the creds from my .publishsettings file
var byteArrayCredential = Encoding.ASCII.GetBytes(usr + ":" + pwd);
client.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Basic", Convert.ToBase64String(byteArrayCredential));

// GET the list of processes from the Kudu API
var response = await client.GetAsync(apiProcessPath);
responseString = await response.Content.ReadAsStringAsync();

In this first call, baseAddress is "https://[Fn App Name].scm.azurewebsites.net/" and apiProcessPath is assigned the value "api/processes/".  This gets me a JSON array, which I'll loop through to grab the process details.  To that end, I've set up a loop to iterate the JArray.

JArray jsonProcessList = JArray.Parse(responseString);

//setup loop to check each process and dump if is w3wp and is not is_scm_site
foreach (var processItem in jsonProcessList)
{
   if (processItem["name"].Value() == "w3wp")
   {
      log.Info("checking " + processItem["name"] + ":" + processItem["id"] + " for scm site");
      string apiProcessDetailUri = processItem["href"].Value<string>();
      response = await client.GetAsync(apiProcessDetailUri);
      responseString = await response.Content.ReadAsStringAsync();

Now I check to make sure it is the process in which I'm interested (processItem["name"] == "w3wp") and, if it is, I'll use the "href" key returned by the previous call, which gives me the URI to fetch the process details.

JObject processDetail = JObject.Parse(responseString);

if (processDetail["is_scm_site"]?.Value() == true)
{
   log.Info("process " + processItem["name"] + ":" + processItem["id"] + " is the scm site");
}
else
{
   log.Info("fetching process dump for " + processItem["name"] + ":" + processItem["id"]);
   //TODO: Need a means to handle a larger return than can be handled by string
   string processDumpContent = default(string);
   //construct the query string based on the process URI; using type 1, minidump
   response = await client.GetAsync(apiProcessDetailUri + "/dump?dumpType=1");

   processDumpContent = await response.Content.ReadAsStringAsync();

In this section of code, I'm simply ensuring that I only dump the processes that are not serving the SCM site (Kudu).  For those, I construct the URI and query string to request a minidump (apiProcessDetailUri + "/dump?dumpType=1") for the process.  The last bit of code is just to name the file and persist it to BLOB storage.

  string path = "dumpfiles/" + DateTime.Now.ToShortDateString().Replace(@"/", "-") + "/" +

   DateTime.Now.ToShortTimeString() + " - " +
   processItem["name"] + ":" + processItem["id"];

   var attributes = new Attribute[]
   {
      new BlobAttribute(path),
      new StorageAccountAttribute("processdumpstorage_STORAGE")
   };
   using (var writer = await binder.BindAsync(attributes))
   {
      writer.Write(processDumpContent);
   }
}

Once I've named the file to get the desired organization, I set up the Attributes to connect to my BLOB storage target and use the binder with those attributes to create a TextWriter that I then use to persist the dump.

Alert + WebMethod

The last bit to tie it all together is to create an alert that calls a webhook (the Function App endpoint that I created) when there is an error.

functions-dump-alerts

I've done it here in the Alerts (Classic) UI.  I've grabbed the Function's URL from the portal, which has the code in the querystring, and provided it as the webhook in the configuration UI.

Housekeeping

With everything configured and the code in place, we should start to see w3wp dumps when we receive errors.  However, this sample has a number of things that need to be considered for a more production-ready implementation.

  1. Credentials and paths
    1. I’ve simply assigned variables directly in the code. You would want to store and retrieve the values separately.
    2. Key Vault is ideally the place for the credentials
    3. The root path for the Kudu API is something to explore.  For example, could we pass it in via the new Alerts mechanism?  So, there is some research to be done there.  The same sort of thinking applies to the dump type.
  2. Scale issue – if the Function App has a large number of hosts we need to parallelize the dump collection and look into ways to target better.
    1. My for loop that iterates the JArray and checks each process for is_scm_site is sequential.  It would be worth looking at using a parallel for and running a few checks at a time.  The exact number of concurrent fetches would probably be a bit of a science to get right.
    2. Need to research errors reported to see if information is available that could be passed to the webhook to specifically target the process.
  3. Data Size – I used a string to retrieve the process dump which limits me to a theoretical 2GB.
    1. Look at a more efficient type for receiving and holding data (maybe local filestream as temporary holder spot?).
    2. If local filestream works then maybe there is more efficient copy mechanism too.

I’m sure I missed out on a few things, but these are my thoughts for the moment.  If you have any thoughts or feedback, post it up and I’ll try to incorporate where it makes sense.


Scaling Python Apps on App Service for Linux Containers

With the introduction of Linux and containers to the App Service capabilities, the world of PaaS container hosting gets a little more palatable for those that are looking to modernize their application deployment footprints but do not want to take on learning all of the intricacies of running a full environment using something like Kubernetes. With containers on App Service we can reap the benefits of containerized deployments in addition to the benefits of the high-fidelity service integration that App Service has across Azure services such as Azure Active Directory, Azure Container Registry, App Insights, and so on.

However, with a new system comes new things to understand in order to get the most out of the system. In this writing, I will focus on key considerations for scaling when hosting a containerized Python application using App Service. This is a key consideration as we move from the IIS + FastCGI world into the Linux + NGINX + uWSGI world in App Service.

To read the full write-up, please navigate to the repo where I've posted the content in the Readme and published the artifacts used while testing.  For easier copy-paste, the link is: https://github.com/jofultz/AppServiceContainerOptimization/blob/master/README.md.

 

Building Microsoft ML+Python App Container

I wanted to share a quick post about building out a Microsoft Machine Learning container that can be used with Azure App Service to host a front-end web app developed in Python which makes use of Microsoft’s ML Libraries.  The full documentation may be found here: https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server.  Additionally, if you’re looking just to install the client libraries on Linux that information can be found here: https://docs.microsoft.com/en-us/machine-learning-server/install/python-libraries-interpreter#install-python-libraries-on-linux.

However, if you plan to build a container and host it in Azure, particularly Azure App Service, there are a couple of things you’ll want to do in addition to simply installing the ML package.  This post will focus on getting the image size down and ensuring dependencies are installed.

There are a few problems that exist if one simply takes the route of creating a Dockerfile and including a RUN command such as:

# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

 

The problems are:

  1. The bits are likely in the wrong location
  2. The dependencies for them may not be installed in your desired location
  3. SIZE matters

If you've already got a front-end app running that uses ML services deployed using Microsoft ML Server, then I highly suggest that you use direct API calls and the Swagger capability to implement your application's client-side interface.  However, if you've gone down the path of using the DeployClient() to discover service endpoints and make the calls, then you will need the Microsoft ML client libraries installed.  Again, I strongly suggest using Swagger and the requests package to create your client, but if you have to move forward with DeployClient() for now, here are my suggestions.

Thanks to Nick Reith for the help explaining and illustrating the staged build for me.  To kick things off, let's ensure that we start this Dockerfile by setting our stage 0:

FROM tiangolo/uwsgi-nginx-flask:python3.6 as stage0
# distro info
RUN cat /etc/issue
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils
# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

Note that I'm using tiangolo's uwsgi-nginx-flask container.  This helps as I don't have to worry about all of the initial configuration of uWSGI and NGINX.  However, I will cover its usage and optimization in a subsequent post.  For now, let's focus on the Python and ML library setup in the image.

In the code block above, note that we run the setup of the ML libraries all in a single RUN instruction.  By default, it installs the packages into the folder /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages.  However, we need to move the required packages from that location over into our Python runtime environment.  The good news is that we don't need all of them.  The bad news is that they have a number of dependencies that we must install.  Also, installing the packages drives the image size up to about 5.6 GB.  That's a pretty large container.  The real problem with the size shows up at deployment time.  Depending on the size of the App Service instance I used, App Service would take as much as 13 minutes to pull the image and get it running.  This is definitely not desirable for iterative work or scale operations.  So, let's reduce that footprint.

After the packages are installed completely, we’ll copy the ones we know that we need in order to use the DeployClient() as a client-side object for calling service endpoints.

#copy the needed ML libraries to a tmp folder; they'll be moved into the app's python path later
RUN mkdir /tmp/mldeps
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/adal* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/liac* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/azureml* /tmp/mldeps/

In the lines above, I create a temporary holding location and recursively copy the folders for adal, liac, and azureml into it in preparation for the next stage.  To start the next stage, the image is pulled from the same image repo that was used previously.  Then the files that were copied from the Microsoft ML install into a temp location in stage 0 are copied into a temp location in this stage.

#start next stage
FROM tiangolo/uwsgi-nginx-flask:python3.6
# Copy ml packages and app files from previous install and discard the rest
COPY --from=stage0 /tmp/mldeps /tmp/mldeps

We can't copy them directly to the site-packages location, because if we do that prior to updating pip and installing dependencies, we'll get several dependency version errors.  Thus, getting the files over into the proper location in this stage follows the sequence of [copy from stage 0 to temp] -> [update pip] -> [install dependencies] -> [move to final location] -> [delete temp folder].

#must upgrade and install python package dependencies for ML packages before moving over ML packages
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

#dependencies needed for azureml packages
RUN pip install dill PyJWT cryptography
# add needed packages for front-end app
RUN pip install dash dash-core-components dash-html-components dash_dangerously_set_inner_html pandas requests

#move ML packages into place
RUN cp -r /tmp/mldeps/* /usr/local/lib/python3.6/site-packages/

#remove the temp holding directory
RUN rm -r /tmp/mldeps

Other than a couple more lines for setting up the ports, that's about it.  The result is an image that drops from 5.6 GB to about 1.3 GB, as can be seen in my image repo.

C:\Users\jofultz\Documents\Visual Studio 2017\Projects\App Service Linux Python App>docker images

REPOSITORY       TAG       IMAGE ID       CREATED       SIZE

ml-py-reduced    latest    f248747aedad   6 hours ago   1.31GB

ml-py            latest    f308579a7ac8   6 hours ago   5.66GB

Keeping the size down allows the image to be pulled initially and made operational in a much shorter timeframe. For ease of reading I've kept the operations mostly discrete, but if you want to reduce the number of image layers, you can combine a number of the RUN statements and reduce the layering for the image.  For reference, here is the full Dockerfile for building the reduced image:

FROM tiangolo/uwsgi-nginx-flask:python3.6 as stage0

# distro info
RUN cat /etc/issue

RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

# ML Server Python
RUN apt-get -y update \
&& apt-get install -y apt-transport-https wget \
&& wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb \
&& dpkg -i packages-microsoft-prod.deb \
&& rm -f packages-microsoft-prod.deb \
&& apt-key adv --keyserver packages.microsoft.com --recv-keys 52E16F86FEE04B979B07E28DB02C46DF417A0893 \
&& apt-get -y update \
&& apt-get install -y microsoft-mlserver-packages-py-9.3.0

#copy the needed ML libraries to a tmp folder; they'll be moved into the app's python path later
RUN mkdir /tmp/mldeps
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/adal* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/liac* /tmp/mldeps/
RUN cp -r /opt/microsoft/mlserver/9.3.0/runtime/python/lib/python3.5/site-packages/azureml* /tmp/mldeps/

#start next stage
FROM tiangolo/uwsgi-nginx-flask:python3.6

# Copy ml packages and app files from previous install and discard the rest
COPY --from=stage0 /tmp/mldeps /tmp/mldeps

#must upgrade and install python package dependencies for ML packages before moving over ML packages
RUN pip install --upgrade pip \
&& apt-get update \
&& apt-get install -y apt-utils

#dependencies needed for azureml packages
RUN pip install dill PyJWT cryptography
# add needed packages for front-end app
RUN pip install dash dash-core-components dash-html-components dash_dangerously_set_inner_html pandas requests

#move ML packages into place
RUN cp -r /tmp/mldeps/* /usr/local/lib/python3.6/site-packages/

#remove the temp holding directory
RUN rm -r /tmp/mldeps

ENV LISTEN_PORT=80

EXPOSE 80

COPY /app /app

Enterprise Azure ML Cluster Deployment

I wanted to take a look at a few of the details that aren't necessarily clear or covered in the documentation when it comes to setting up a Microsoft Machine Learning Server cluster in Azure.  There are a few things to note for both the cluster configuration and the client configuration when it comes to certificate validation.

To get quickly to a deployed cluster footprint, one can start with the quickstart templates that exist for ML Server.  Those templates may be found in the repo: https://github.com/Microsoft/microsoft-r.  Navigating into the enterprise templates (https://github.com/Microsoft/microsoft-r/tree/master/mlserver-arm-templates/enterprise-configuration/windows-sqlserver), there are block diagrams that show the basic setup.  However, the challenges with setting up anything in a large enterprise are in the details.  The following diagram is what I'll reference as I discuss the key items that caused a bit of churn as we deployed and configured the clusters:

Enterprise-Azure-ML-Cluster

I’ve put fake IPs and subnet ranges on the diagram.  I’ve used relatively small subnet ranges as the web and compute node clusters shouldn’t really need more than 5 instances for my scenario.  Of note, I’m not addressing disaster recovery or high availability topics here.

There are two key items that can throw a wrench in the works: networking and SSL/TLS certificates.  The firewall items are pretty straightforward, but the certificate configuration is a gift that keeps on giving.

Networking

The troubles with firewalls come in from a few different vectors.  Commonly, in large enterprise organizations, you’ll see constraints such as:

  • No port 80 traffic into Azure
  • No general traffic from Azure to on-premise systems (requires specific firewall exceptions)
  • Often limited to port 443 only into Azure

The primary reason that this can be problematic is that one will not be able to configure and test the clusters over port 80 through the ILB that is configured as part of the template.  This means that the DeployClient (https://docs.microsoft.com/en-us/machine-learning-server/python-reference/azureml-model-management-sdk/deploy-client) cannot be used to directly provision services and test them over port 80 from an on-premise client.  Note that when using this template, ports 80 and 443 are used to communicate with the ML Server API instead of the local install port 12800.  Of course, you may configure it differently, but as the template presently stands it will be done over 80 and 443 through the ILB or App Gateway respectively.  The two ways to work around the network traffic rules are to use a jumpbox on the cluster vnet (or a peered vnet), or to use a self-signed certificate on the App Gateway and use port 443 (more on certificates and client configuration shortly).

On an internally facing implementation, it is likely that a private certificate authority will be used for the certificate.  This will work perfectly well, but one needs to ensure that the App Gateway has been configured properly to communicate with the on-premise certificate authority.  We want to ensure that the App Gateway has access to the PKI environment so that it may access the Certificate Revocation List (CRL).  Access to this depends on how it has been set up.  In this case, it is exposed over port 80 and accessible via the vnet.  In order to resolve the address, the App Gateway needs to point to internal DNS.  Thus, on the vnet configuration under DNS Servers (working from the diagram), one would see entries for 10.1.2.3 and 10.4.5.6.  However, there is also an entry for 168.63.129.16, which is used in Azure for platform services.

Managed Disks

The current configuration in the template is to use standard blob storage based disks.  However, if you want to ensure that your VM fault domain aligns with your disk fault domain, it is best to use Managed Disks.  That configuration can be done by finding the osdisk element for the ComputeNodes and the WebNodes and changing from:

"virtualMachineProfile": {
    "storageProfile": {
       "osDisk": {
          "vhdContainers": [
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[0]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[1]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[2]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[3]), '2015-06-15').primaryEndpoints.blob, 'vhds')]",
             "[concat(reference(concat('Microsoft.Storage/storageAccounts/sa', variables('uniqueStringArray')[4]), '2015-06-15').primaryEndpoints.blob, 'vhds')]"
          ],
          "name": "osdisk",
          "createOption": "FromImage"
       },

to:

"virtualMachineProfile": {
   "storageProfile": {
      "osDisk": {
         "managedDisk": { 
            "storageAccountType": "Standard_LRS" },
            "createOption": "FromImage"
         },

Additionally, you can remove the dependency on the storage provisioning (the "storageLoop" entry in the dependsOn element):

"dependsOn": [
   "storageLoop",
   "[concat('Microsoft.Network/virtualNetworks/', variables('vnetName'))]",
   "[resourceId('Microsoft.Network/loadBalancers', variables('webNodeLoadBalancerName'))]"

Certificates and Clients

It is best to request the SSL certificate early and have it available as you configure the environment.  It will help to avoid reconfiguration later.  I want to address client issues first because there are a few approaches to working around the use of a self-signed certificate or a private certificate authority (private cert).

Anyone who has developed sites or services even briefly is familiar with the certificate verification failure that the client (browser or client library) will generate when it retrieves a certificate from an endpoint whose authority it doesn't recognize.  The primary approaches to resolving the issue are:

  1. Ignore certificate verification warnings
  2. Use an exported .cer file to validate that specific call
  3. Point the API to the proper authority for verification
  4. Add the certificate (contents of the .cer file) to the certificate verification store being used

The more difficult configuration is related to the Python clients calling any of the ML models that have been deployed as services.  .NET code will look at the machine's certificate authority store, so supposing that the client machine is part of the domain, the certificate should verify.  However, using the Python requests library, you'll need to either export the .cer file and specify it in the call, set the flag to ignore verification, or append the contents of the .cer file to the CA bundle that is checked by the client.  The documentation for setting the verify parameter is here: http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification.  The net is that the call would be something to the effect of:

requests.<verb>('<uri>', verify=<'/path/to/certfile' | True | False>)

If you prefer to include your cert in the default check you’ll need to open the cacert.pem file and append the contents of your file to the contents of the cacert.pem.  Requests depends on certifi (https://pypi.org/project/certifi/) to do the SSL verification.  On a given client machine, you’ll have to look at the location in which the package was installed.  If you’re configuring one of the cluster nodes (web or compute) then you’ll be able to find the cacert.pem file at the location: C:\Program Files\Microsoft\ML Server\PYTHON_SERVER\Lib\site-packages\certifi.

Lastly, and not yet mentioned, if you’re using the DeployClient() to make the API calls then the verify flag cannot be used in the call.  To that end, you must either use the aforementioned method of appending to the cacert.pem file or create the DeployClient() via configuration so that you can either point to the certificate to use or set the flag to false.  Thus, your code would have something like the following to set configuration options prior to creating the DeployClient():

from azureml.common import Configuration

conf = Configuration()  # singleton so this will stick across requests

# --- SSL/TLS verification ---
# Set this to false to skip verifying SSL certificate when calling API
# from https server.
# Options:
# --------
# verify -- (optional) Either a boolean, in which case it controls
# whether we verify the server's TLS certificate, or a string, in which
# case it must be a path to a CA bundle to use
# Example:
# verify_ssl = True (default)
# verify_ssl = False
# verify_ssl = '/etc/ssl/certs/ca-certificates.crt'
conf.verify_ssl = True   # Default is True

# Set this to customize the certificate file to verify the peer.
# SSL client certificate default, if String, path to ssl client cert
# file (.pem). If Tuple ('client certificate file, 'client key') pair.
# Example:
# cert = ''
# cert = ('cert.pem', 'key.pem')
conf.cert = None  # Default is None

Closing

To summarize, the changes that were made were:

  1. Networking – shut off port 80, allow for internal DNS resolution, create a jumpbox
  2. Managed Disks – change the template to use them
  3. Certificate verification – several options for ensuring that client verification can be done either from calling clients (e.g., browser-based calls) or from the web or compute nodes for interdependent calls.

I hope this helps anyone attempting to set this up themselves within the enterprise.