Machine Learning Cheat Sheet For PMs and Business Owners

3/19/2023

You can't be a PM or business owner without a deep understanding of machine learning: establishing the business value and then rallying a team to deliver to that value without falling into the many unique pitfalls of ML programs. Here is a "cheat sheet" compilation of how to deliver machine learning products and programs, highlighting practical tips from someone who's been on the journey.

THE 4 PHASES OF MACHINE LEARNING PROGRAMS

Machine learning programs - whether the creation of a new product or a new operational process - are delivered in 4 phases:

Establish the business case and scope to be delivered
Gather and prepare the data
Develop your model (aka Machine Learning algorithm)
Deploy and then continuously monitor and refine

These phases are iterative. Teams are encouraged to jump back and forth and cycle between phases frequently.

Here's a look at the popular CRISP-DM process championed by IBM. It has 6 boxes but essentially fits the 4 phases I mention. Note that it is iterative, as illustrated by the circular arrows.

Microsoft has its own ML project flow called Team Data Science Process (TDSP). Though the diagram they published looks more complex, at its core it has the same basic 4 phases: understand the business, ready the data, develop a model, and deploy (seen in the 4 big circles in the center of the diagram).

And finally here is the project lifecycle proposed by Andrew Ng from DeepLearning.ai, a popular thought leader and online instructor, showing yet again essentially the same 4 phases.

Let's go through the 4 main phases and highlight the key things PMs need to anticipate for each.

1. ESTABLISH THE BUSINESS CASE AND PROJECT PLAN

Business Case

Establishing the business case and business plan / product strategy is truly product management 101, yet with the hype of ML we often skip the rigor of answering some of these fundamental questions:

Who is the persona we are targeting?
What is the problem or pain of this persona that we're trying to solve?
How does the persona solve the problem today? What gaps or drawbacks in the way they solve the problem today can be solved by our new approach?
What circumstances and constraints are they under (time pressure, lack of knowledge, high pressure environment, established way of doing things, other technologies in use, etc.)?
What business outcomes are we expecting and how will they be measured? (e.g. revenue)

Startups are subject to a great deal of scrutiny on these questions from their investors, but ironically it's established companies looking to launch an ML product that are more likely to gloss over the business case in their hurry to develop something related to "artificial intelligence". Don't let your org skip this fundamental step!

The answers to these questions are not things you come up with in a board room with the team. They come from the outside, talking to customers and prospects directly via product-market fit interviews.

Feasibility

Next you validate feasibility, i.e. is this actually something that is likely solvable by ML? Feasibility is a sticky subject because there are still a lot of unsolved business problems out there that ML could have the potential to solve, so the PM and leadership need to strike the right balance of taking a risk vs. being confident that the solution is sufficiently within reach.

Ask your team: do you even need an ML algorithm? Is what you're trying to accomplish something that could be done by traditional business logic instead? As Wytsma and Carquex argue in 5 Steps for Building Machine Learning Models for Business, a LOT of what business owners think they want from ML can be more efficiently solved by heuristics implemented with traditional business logic, as shown in the table below! The moral: don't do ML just to say that you're doing ML. Your business problem needs to be something that truly warrants ML.

Problem Statement

ML programs require very specific and precise objectives. If you are not clear and aligned, from customer to leadership to the team, on what you are trying to achieve, ML programs are more likely than traditional programs to fail to deliver.

Is the most important thing to save customers time, or encourage them to buy more? Are we optimizing for accuracy or speed? Managers who are used to saying "I want all of those things" and leaving requirements open-ended will have a hard time, as discussed in Most Common Pitfalls of Delivering an AI Program. There will be major trade-offs and unexpected miscommunications if these decisions are not confronted head on.

Write down a short problem statement that summarizes what you are trying to achieve. Make it visible and check in regularly. Everybody on the team should share the same understanding of the problem that you're solving.

Outcome Metrics and Output Metrics

To support the business case and problem statement, you should include quantifiable outcome metrics. These are specific targets that you expect to achieve, such as "Customers renew 20% more often because the new algorithm increased its relevance" or "Employees save 10% of time thanks to automation of a specific daily task". Then the technical team must write corresponding output metrics, such as a quantifiable measure of the algorithm's relevance, accuracy, precision, throughput, etc.

The PM is in the middle, responsible for making sure that if the technical team delivers output metrics, this will ensure that the business achieves its outcome metrics.

To help focus the ML team and the ML algorithm itself, it's best to select 1 optimizing metric that the team should focus on improving as much as possible, and other satisficing metrics that simply need to meet a minimum threshold. For example, perhaps your optimizing metric is precision and you select satisficing metrics for recall, speed and throughput above a certain basic threshold.
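Here is a minimal sketch of how that selection rule could look in code. The candidate models, their scores and the thresholds are made-up numbers for illustration only; the point is simply that satisficing metrics act as a filter, and the optimizing metric picks the winner among whatever passes.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    precision: float   # optimizing metric: push as high as possible
    recall: float      # satisficing metric: must meet a minimum
    latency_ms: float  # satisficing metric: must stay under a maximum


def pick_model(candidates, min_recall, max_latency_ms):
    """Return the highest-precision candidate that meets every threshold, or None."""
    eligible = [
        c for c in candidates
        if c.recall >= min_recall and c.latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.precision)


# Hypothetical results from three experiments.
candidates = [
    Candidate("A", precision=0.95, recall=0.60, latency_ms=40),   # fails recall
    Candidate("B", precision=0.91, recall=0.82, latency_ms=80),   # passes both
    Candidate("C", precision=0.93, recall=0.85, latency_ms=250),  # fails latency
]

best = pick_model(candidates, min_recall=0.80, max_latency_ms=100)
print(best.name if best else "no candidate meets the thresholds")
```

Note that model A has the best precision but is never considered, because it misses the recall threshold. That is exactly the conversation the PM wants to have upfront rather than after the fact.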

Ensure targets are reasonable by comparing them to benchmarks such as human-level performance, or competitor or open-source benchmarks, or research study benchmarks (taken in the right context). Otherwise you may have a manager insisting on 99.9% accuracy and you don't have grounds to explain why this is or isn't reasonable.

Finally, remember that you FIRST set targets that will achieve the business plan, THEN worry about how to achieve them (same as in other disciplines like sales - first you set the sales target your business needs, then you figure out how to get there by doing things like hiring more, changing tactics, etc.). Don't undermine your business case upfront by setting too low a bar.

Non-functional requirements

Let's not forget that for every ML program, there's not just the algorithm but also all the software and hardware infrastructure around it! Work with the engineering team to develop your non-functional requirements: is the overall system in the cloud or on the edge, and what are the system availability, resiliency posture and security needs? (For a fuller list, see checklist of non-functional requirements.)

There are also some key non-functional requirements that must be defined for the algorithm itself, notably:

explainability - how opaque or transparent should the model be? Some deep learning models can give great results but be very opaque, i.e. difficult to understand what the algorithm is doing to arrive at its results, and therefore harder to debug if it goes wrong. Would you sacrifice some performance in favour of a more transparent algorithm that is easier to debug?
cost constraints - sourcing, labeling and maintaining data can be a very costly endeavour depending on how you approach it. Is there a ceiling to how much the team can spend upfront? Over time once the system is live?
ethics, privacy and regulatory requirements - regulatory programs like GDPR and CCPA are signs of how attuned we now are to data privacy, and ethical AI is also a top concern. Clearly understanding your requirements upfront is critical, as re-work later could mean a re-write of the whole model.

Project Plan and Scorecards

Finally you develop the project plan, defining the scope of the project and the plan to deliver to the business plan. In Amazon's excellent paper Managing Machine Learning Projects, the author lives and dies by scorecards for all facets of the program - go-to-market, financial, data, etc. PMPs will recognize these scorecards as simply actionable risk registers, used to continuously anticipate and steer clear of pitfalls and prioritize mitigations.

Read more: How Are AI Programs Different From Traditional Programs?

2. GATHER AND PREPARE THE DATA

Sourcing Data

Once we have an initial pass through the business plan and project plan phases, it's time to source data. Data needs to be sourced in a way that is representative of the data your algorithm will encounter in production. It's worth getting leadership to look at samples of the data set to make sure you have buy-in that this is the right data. Even a slight change in approach (e.g. data from US vs. data from Canada, photos of cars taken from a street camera vs. photos of cars taken from a mobile phone) could totally change the algorithm.

For supervised learning (the most common type of ML), the data you source has to be labelled, so that the model can learn how to map the features of the data to the labels. This forms the basis of the prediction or classification of future data.

There are lots of different ways to source data, each with cost, time and quality trade-offs. The team should not get focused on just 1 source but instead start by brainstorming a wide list of possible sources and then deciding which sources to acquire first. Sources can depend on the amount of data you need, and whether that data is structured or unstructured:

If you only need a small amount of data (<10,000 samples): you can find ways to pay generalists or SMEs to manually gather and label data. This is especially true if the data is unstructured (images, audio clips, etc.), because humans are particularly adept at labeling unstructured data. Labelling must be very clean and accurate for a small data set. You can always go manually inspect the quality, and get labelers together to address inconsistencies between how they did the labeling.
If you need a lot of unstructured data (>10,000 samples), you have to be clever in how you gather enough labelled data at a reasonable cost. The good news is that the need for perfectly-labelled data is lower as the model will be able to figure out the patterns given enough volume. If using human labelers, be very precise with your labeling instructions since one of the biggest sources of error is inconsistent interpretation of the labeling rules from one labeler to another. Data augmentation techniques such as GANs can be used to amplify a smaller set of data into a larger one.
If you need a lot of structured data (>10,000 samples), this can be the biggest challenge to obtain, since human labeling is harder and data augmentation techniques don't apply. In this case finding ways of having the system programmatically gather data from users (e.g. e-commerce system tracking user behaviour such as purchases and returns) is the best way.

Data Pipeline

In ML engineering, we start to establish an ML Pipeline across data, the model, and the deployment and monitoring (rollout) phases. Here is Robert Crowley's representation of an ML pipeline's components.

In essence, once we've sourced data from users, employees or 3rd parties, we build a data pipeline that processes the data in various stages, getting it ready for the model to be trained:

The first row of the diagram is the data pipeline:

Data cleansing initially prepares the data to be used for training manually or programmatically. In real-world data, you often spend a lot of time here due to missing data, erroneous data, outliers, etc.
Data ingestion is the process of absorbing data from different sources and transferring it to a target site where it can be analyzed
Data analysis and transformation: here we get stats on the data (e.g. what type of data fields and distribution) to better understand it, then apply techniques like normalization (e.g. rescaling all numeric elements into the 0-1 range), standardization (rescaling the attributes so that they have a mean of 0 and a variance of 1), or clipping (e.g. cropping images to ensure they are the same size). For non-numerical data such as categories of things (e.g. shoes, socks, pants), convert these to numeric representations (shoes=0, socks=1, pants=2). This makes the data faster and easier to process programmatically. As we transform the data, we derive a schema which we can use to identify and address anomalies (more on schemas below).
Feature engineering is selecting and combining features that have the greatest predictive signal while pruning features to ensure the model is efficient to run in practice. Often this requires domain knowledge. For example, a large weather data set might include dozens of features (temperature, humidity, pressure, wind, time of day), but if you are training a model to predict if it will rain, only a subset of those features will be the biggest predictors. Feature engineering can include discarding irrelevant features, combining correlated features into new features, and performing feature selection techniques to eliminate any duplicative or unnecessary features (e.g. dimensionality reduction using algorithms like Principal Component Analysis). An important concept is to arrive at a set of features that are orthogonal, i.e. not correlated, so that as you work on your model you can test the weighting of each in isolation.
Data splitting is about separating your data into a training set (the data that you will initially train several possible models on), dev set (data that you will use to decide which of the possible models work best) and test set (once you have selected a candidate model, use it on the test data to see if it generalizes well). If you have a small amount of data, the training/dev/test set split can be 70% / 15% / 15%. If you have a lot of data, the training/dev/test set split could be 98% / 1% / 1%, maximizing the amount of data on which to train the model while still providing enough to satisfy dev and test purposes.
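To make a few of these pipeline steps concrete, here is a minimal sketch in plain Python of normalization, standardization, category encoding and a 70% / 15% / 15% split. The function names are my own for illustration, and a real team would more likely reach for a data library than hand-roll these, but the arithmetic is the same.

```python
import random
import statistics


def min_max_normalize(values):
    """Rescale numbers into the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # all values identical: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale numbers so they have mean 0 and variance 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def encode_categories(labels):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping


def split_dataset(rows, train_frac=0.70, dev_frac=0.15, seed=0):
    """Shuffle, then cut into training, dev and test sets (test gets the remainder)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is repeatable
    n_train = round(len(rows) * train_frac)
    n_dev = round(len(rows) * dev_frac)
    return rows[:n_train], rows[n_train:n_train + n_dev], rows[n_train + n_dev:]


print(min_max_normalize([10, 20, 30]))
print(encode_categories(["shoes", "socks", "pants", "socks"])[0])
train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

For the 98% / 1% / 1% case, the same function can be called with train_frac=0.98 and dev_frac=0.01.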

Read more: What Makes A Top 1% Project Manager?

3. DEVELOP THE MODEL

There are 3 major types of ML algorithms:

1. Supervised Learning: find patterns within the labels of your data to make predictions. Ultimately it creates a mapping of all possible inputs x to predictions y.

Predictions can be classifications, such as yes/no, cat/dog/mouse, happy/sad/angry, benign/malignant. A model that predicts by classifying is called a "classifier", e.g. Naive Bayes, Logistic Regression, Support Vector Machine (SVM), Gradient Boosting, and Generalized Additive Models (GAMs).
Predictions can be values, such as the predicted value of a piece of real estate, or the predicted likelihood that a customer will buy an umbrella given the amount of rainfall. A model that maps inputs to predicted values is known as a "regression" model, using approaches such as Linear Regression illustrated below, Random Forest, Polynomial Regression, and again GAMs.

2. Unsupervised Learning: the data is not labelled, so the algorithm finds patterns by grouping together points that have similar features into clusters (e.g. the K-means clustering algorithm). Unsupervised learning is relatively harder, and sometimes the clusters obtained are difficult to understand because of the lack of labels or classes. Other unsupervised learning approaches include the Apriori algorithm (generating IF-THEN rules such as IF people buy an iPad THEN they also buy an iPad Case to protect it), and Principal Component Analysis.

3. Reinforcement Learning: figure out a solution to a problem using rewards and punishments as reinforcements. The model is rewarded if it completes the job and punished when it fails. The tricky part is to figure out what kind of rewards and punishment would be suited for the model. These algorithms choose an action, based on each data point and later learn how good the decision was. Over time, the algorithm changes its strategy to learn better and achieve the best reward.
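For readers who want to see what "mapping inputs x to predictions y" means in the simplest supervised case, here is a small sketch of linear regression fitted with ordinary least squares. The rainfall and umbrella numbers are invented for illustration (they sit exactly on a straight line, which real data never does), and fit_line is just a name I picked.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labelled training data: the feature is rainfall, the label is umbrellas sold.
rainfall_mm = [0, 5, 10, 15, 20]
umbrellas_sold = [2, 12, 22, 32, 42]

slope, intercept = fit_line(rainfall_mm, umbrellas_sold)


def predict(rain_mm):
    """Use the learned mapping to predict the label for a new input."""
    return slope * rain_mm + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted umbrellas at 25 mm: {predict(25):.1f}")
```

The "learning" here is nothing more than finding the slope and intercept that best fit the labelled examples. More sophisticated models learn many more parameters, but the idea is the same.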

Hypothesis, Train, Error Analysis, Repeat

Developing the model (aka algorithm) is a highly iterative process in and of itself. It should be a cycle where you continuously:

Develop a hypothesis (based on a chosen base function such as linear regression with a quadratic function and initial set of hyperparameters)
Train the model on the training data set (which fits the model's parameters; the hyperparameters are set beforehand and adjusted between iterations)
Perform error analysis, where you look at your model's error function and results on optimizing and satisficing metrics and then judge how to update your hypothesis for the next iteration.

Don't Spend Too Much Time

Although each step may sound complex to a PM or business owner, best practice is to spend about 2 days on each step. The ML team shouldn't get stuck in analysis paralysis or perfectionist thinking on any step beyond a few days. Just start by overfitting a small amount of data.

Training Optimization

What do I mean by overfitting? As you go through this iterative process and are conducting the error analysis step, 2 possibilities emerge:

Your model underfits the training data. In this case, your error rate on the training data is still high and could be improved. Your model is said to exhibit high bias. The short explanation is that you may not have the exact formula right yet: maybe you aren't taking into account enough features of the data, or the formula needs an update (e.g. another quadratic term). The ML team must keep testing updated hypotheses.
Your model overfits the data. In this case, your error rate on the training data is low, but when you run the same model on the dev set, the error rate is high. Your model is said to generalize poorly and to have high variance. In this case, your model might be taking into account features that seem to be relevant in the training set but actually aren't relevant in the general case. So again the team goes back and adjusts the hypothesis and tries again.

Going through this process is called training optimization. Earlier I mentioned the importance of engineering features that are orthogonal. This way if you underfit or overfit the data, you can try modifying parameters that are relevant to one feature at a time, in isolation, to see if it can yield an improvement.
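The underfit / overfit check described above can be boiled down to a simple rule of thumb. The sketch below is a simplification under my own assumptions: the function name, the target error and the gap tolerance are all hypothetical, and a real model can suffer from both problems at once. It only reports the first one it finds.

```python
def diagnose_fit(train_error, dev_error, target_error, gap_tolerance=0.02):
    """Rough bias/variance diagnosis from training and dev error rates.

    target_error: the training error rate the team considers acceptable
                  (e.g. informed by human-level performance).
    gap_tolerance: how much worse the dev set may be before we call it overfitting.
    """
    if train_error > target_error:
        return "underfit (high bias)"
    if dev_error - train_error > gap_tolerance:
        return "overfit (high variance)"
    return "ok"


# Hypothetical error rates from three iterations, with a 5% target.
print(diagnose_fit(train_error=0.15, dev_error=0.16, target_error=0.05))
print(diagnose_fit(train_error=0.01, dev_error=0.12, target_error=0.05))
print(diagnose_fit(train_error=0.03, dev_error=0.04, target_error=0.05))
```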

Modeling in Academia vs. Modeling in Production

A lot of data science talent will come from the academic world, and they will have to get used to priorities being different when building ML for production:

the optimizing metric will often be related to fast inference and interpretability of the model, rather than optimizing for accuracy
data is constantly shifting and changing, rather than being a fixed set of data in a lab. As a result, you are constantly training and tuning the model rather than spending too long optimizing a model on a single data set.
data scientists will have to grow an appreciation for the overall engineering of the system, not just the model, that makes the business results possible

Metadata

The process of modelling should use version control to help the team keep track of what has been tried already, what worked and what hasn't. This is critical when analyzing results compared to previous iterations and formulating your next hypothesis.

Track metadata for data provenance and data lineage. Start in a spreadsheet for a small project if you want, but as data scales you'll need a tool like ML Metadata (MLMD, part of TensorFlow Extended) to manage metadata across large amounts of data and many different iterations of the possible models.

Model Evaluation

When evaluating your model hypothesis, compare it to the specific output metrics you identified in Phase 1. For example, maybe your optimizing metric is the common F1 score, where F1 is the harmonic mean of:

precision (of all the predictions the algorithm made that the patient has cancer, what % were correct, in other words True Positives / (True Positives + False Positives)), and
recall (of all the cancer patients in the sample, what % did the algorithm predict, aka True Positives / (True Positives + False Negatives)).
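Here is a short worked example of those formulas, using made-up counts from a hypothetical cancer screening model. F1 is computed as the harmonic mean of precision and recall, i.e. 2 x precision x recall / (precision + recall).

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: the model flagged 100 patients, 80 of them correctly,
# and missed 40 patients who actually had cancer.
precision, recall, f1 = precision_recall_f1(true_pos=80, false_pos=20, false_neg=40)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.80 recall=0.67 f1=0.73
```

Because it is a harmonic mean, F1 is dragged down by whichever of the two is weaker, so a model cannot score well by maximizing one and ignoring the other.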

When you feel that the model is sufficiently trained, re-train the final model from scratch on all of the training and dev data, keeping the test set held out. Then score the model on the test set and evaluate it using the error metrics defined for the project. Make sure you also understand its performance on specific slices of the data.

When doing error analysis, tag the various errors (e.g. blurry image, small image) and see which tags come up most often. That will point you to the biggest sources of error to focus on.
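Tallying the tags does not need anything fancy. Here is a minimal sketch using Python's built-in Counter; the error records and tag names are invented for illustration.

```python
from collections import Counter

# Hypothetical misclassified examples, each tagged by a reviewer.
errors = [
    {"id": 1, "tags": ["blurry image"]},
    {"id": 2, "tags": ["small image", "blurry image"]},
    {"id": 3, "tags": ["poor lighting"]},
    {"id": 4, "tags": ["blurry image"]},
    {"id": 5, "tags": ["small image"]},
]

# Count how many errors carry each tag (an error can have several tags).
tag_counts = Counter(tag for error in errors for tag in error["tags"])

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count} of {len(errors)} errors ({count / len(errors):.0%})")
```

The tag at the top of the list is the ceiling on how much you could gain by fixing that category, which is a handy way to prioritize the next iteration.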

Balancing Accuracy with Performance and Cost

Having an optimizing metric such as F1 alongside satisficing (or gating) metrics that enforce thresholds for latency and throughput means that there may be trade-offs between the model and the software/hardware infrastructure that supports it. For example, a super accurate model might have many high-order polynomial calculations that take a long time to compute. You could invest in a lot of high-performance hardware (servers, GPUs) and distributed architecture to compensate for this latency, but this comes at a high cost that may be out of reach if you're not Google or Facebook. Or you might sacrifice 1% of accuracy to meet latency and throughput thresholds. This is where data science meets engineering meets business thinking - all groups have to get creative together on the optimal path to handle the trade-offs.

4. DEPLOY AND MONITOR

Once the system is deployed, the job is only 50% done. The other 50% ahead is now the substantial effort required to monitor and maintain the system, which can be far more taxing than monitoring and maintaining traditional software.

Progressive Roll-out

To start, ML systems are typically not deployed broadly. Rather, start small and progressively roll out to more users as you have success. Options include:

Canary deployment: start with a pilot (e.g. 5% of users) and progressively ramp up with monitoring to track issues
Blue/green deployment: if replacing an old system, use blue/green deployment to route traffic to the new system while keeping the old system alive as a fallback until all traffic is migrated over
Shadow mode, decision support, partial automation: in programs where the goal is to automate what was previously manual work, consider first running the algorithm alongside the human worker (shadow mode, to see if the ML's actions closely match what the human does), providing advice to the human (decision support), and perhaps aiming for only partial automation of specific subsets of tasks rather than trying to take on complete automation right away.
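As an illustration of the canary option above, here is one common way to decide which users see the new model: hash each user ID into a stable bucket and compare it to the rollout percentage. This is a sketch under my own assumptions (the in_canary name and the user ID format are hypothetical), not a description of any particular deployment tool.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: float) -> bool:
    """Return True if this user should be routed to the new model.

    Hashing the user ID gives each user a stable bucket from 0 to 9999, so the
    same user always gets the same answer, and raising the percentage only
    adds users (nobody flips back to the old system).
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < rollout_percent * 100


# Start the pilot at 5% of users, then ramp up later.
users = [f"user-{i}" for i in range(10_000)]
pilot = [u for u in users if in_canary(u, rollout_percent=5)]
print(f"{len(pilot) / len(users):.1%} of users are in the 5% pilot")
```

Keeping the assignment stable per user matters for the measurement step that follows: you want to compare the same pilot group over time rather than a reshuffled one.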

Along the way you measure whether you are achieving the business objectives (as measured by the outcome metrics you specified in Phase 1), using techniques like hindsight scenario testing, A/B testing, and user interviews about their qualitative experience during the pilot phase.

Model Drift

Models eventually lose their predictive power, called "model drift" or "model decay". This can be due to data drift and concept drift. Data drift is when the live data coming in to your model no longer has the expected distribution. For example, in an e-commerce store selling shoes, data drift may occur as the seasons change and customers' preferences for different types of shoes change. Concept drift is when the basic concepts your model has been trained to recognize start to be labelled differently altogether. For example, fashion changes and the shoes of today no longer look like shoes the model was trained to recognize a few years ago.

Since both drifts involve a statistical change in the data, the best approach to detect them is by monitoring the data's statistical properties, the model's predictions, and their correlation with other factors.
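One way to monitor those statistical properties is the Population Stability Index (PSI), which compares how a feature is distributed in live data against the training baseline. PSI is my choice of example here rather than the only option, and the thresholds people quote for it (for instance treating values above roughly 0.25 as a significant shift) are rules of thumb that should be tuned per use case. The data below is synthetic.

```python
import math


def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two samples of one numeric feature; larger values mean more drift.

    Bins are equal-width over the range of the baseline (expected) sample.
    Live values outside that range fall into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # baseline is constant; any positive width works

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            index = int((v - lo) / width)
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # eps keeps empty bins from causing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic example: the live data has shifted upward by 3 units.
baseline = [i / 100 for i in range(1000)]
live = [v + 3 for v in baseline]

print(f"PSI, no drift:   {population_stability_index(baseline, baseline):.3f}")
print(f"PSI, with drift: {population_stability_index(baseline, live):.3f}")
```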

Drifts can be slow or very sudden. Usually consumer changes are slow trends over time, whereas in a B2B context an entire enterprise can shift quite suddenly (if for example a company tells all of its workers to change their behaviour one day, or installs a new company-wide software).

Monitoring the Right Metrics

The selection of metrics to monitor is critical. Consider input metrics (input length, values, volume, # of missing values, avg image brightness), and output metrics (e.g. clickthrough rates, null value returns, number of times the user has to repeat their query or switch to manual typing). Set thresholds for alarms.

Monitoring is also critical for checking the health of the software and hardware of the system itself. Many problems can arise over time such as sensor malfunctions, system downtime, bandwidth issues, etc. Consider software metrics (memory, server load, throughput, latency) in addition to inputs and outputs.
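Setting thresholds for alarms can start out very simply. The sketch below checks a snapshot of input, output and software metrics against allowed ranges. The metric names and limits are hypothetical examples, and in practice this logic usually lives in whatever monitoring tool your engineering team already uses.

```python
def check_metrics(observed, thresholds):
    """Return a list of alert messages for metrics outside their allowed range."""
    alerts = []
    for name, (low, high) in thresholds.items():
        value = observed.get(name)
        if value is None:
            alerts.append(f"{name}: no value reported")
        elif not (low <= value <= high):
            alerts.append(f"{name}: {value} is outside [{low}, {high}]")
    return alerts


# Hypothetical allowed ranges agreed by the PM, data science and engineering.
thresholds = {
    "missing_value_rate": (0.0, 0.05),   # input metric
    "clickthrough_rate": (0.02, 1.0),    # output metric
    "p95_latency_ms": (0.0, 300.0),      # software metric
}

# Hypothetical snapshot from the live system.
observed = {
    "missing_value_rate": 0.12,
    "clickthrough_rate": 0.031,
}

for alert in check_metrics(observed, thresholds):
    print("ALERT", alert)
```

Treating a missing metric as an alert in its own right is deliberate: a sensor or logging failure often shows up as silence rather than as a bad number.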

Deployment Is An Iterative Process Too

Just as all other steps of the ML process described in this article are iterative, so too is deployment and maintenance. As metrics show that the model is no longer performing at an acceptable level, update the model and redeploy with refreshed thresholds so that monitoring is prepared to sound the alarms when the time comes again.
