Delivering an AI Program: What Could Possibli Go Wrong?

6/15/2020

AI/ML has the potential to deliver incredible value to customers. A single breakthrough can make a product company's growth soar. A fine-tuned model can completely transform an enterprise's operations.

The mechanics of delivering such a program are very similar to traditional projects on the face of it, so it is tempting to think of the implementation as being the easy part. But because of the non-deterministic nature of AI/ML, all the pitfalls of ordinary software development programs - misaligned objectives, underestimation, lack of process, skimping on QA, ignoring risks, etc. - are amplified x10. To deliver successfully, you need to ratchet up the program diligence.

Don't Oversell the Vision Before It's Proven.

In many industries, sales is used to ‘selling’ the vision of the product in advance of it being built, and customers assume vapourware by default. No one bats an eye because we’re accustomed to the idea that engineering will always be able to fulfill whatever we’re selling, given enough time and money.

AI/ML however is a world of possibilities, but many of these possibilities may not actually be able to be implemented in practice, even with a huge budget. It's easy to promise "The product will automatically predict X with high accuracy." where X could be anything from detecting a security breach to predicting stock prices to finding the perfect outfit for you wear. But even if the prototype is already 70% accurate, it may never get to 80%, or whatever you need it to be to be commercially viable.

Don't oversell the vision, be honest. Be honest with customers, and be honest within the company walls too. Imagine slightly implying or exaggerating results to your manager, who then implies or exaggerates a little further to their manager, etc. and pretty soon the board is hearing that the product is 110% accurate and can slice bread too!

Be Absolutely Clear: Is This a Prototype or a Commercial Product?

When starting a traditional project with new technology, it’s common to start with a prototype. Although best practice is to "throw away" that prototype and start fresh on the commercial development, management often pushes teams to continue work directly on the prototype to turn it into a commercial product, skipping the throw away step. The result is a shaky development foundation that can result in code debt and bugs for the lifetime of the product. Far from ideal, but it can be liveable in a business sense.

With AI, skipping the throwaway step is no longer possible. When a data scientist is prototyping a model, their focus is truly to validate that the product could ever have a chance of being delivered, and if not, what other related things it could do instead to deliver value. Prototyping AI involves many shortcuts - working on a "clean" data set (artificially clean, biased, small sample), where the availability and timing of the data may not be an issue. The work is done on a Jupyter notebook or other simulated environment that is vastly different from a real-world environment, and using models and techniques that could never be used in a commercial context. The business problem itself may be fluid, as the data scientist may realize that the algorithm can't solve X but can solve Y instead.

When the prototype has met its objectives, a new project should be started from scratch for the commercial development. In commercial development, the team's work is focused on entirely different things such as:

how to implement the model at scale
on large sets of data
on imperfect real-world data that may not always be available in time or at all
on data that may require considerable ETL
and ensuring the model's predictive performance is not only good when you develop it but continues to deliver accurate results even as time goes on in the field and data may change (known as drift)

So, is the project a prototype or a commercial endeavour? Alignment from the team to upper management is critical. The answer can't be both.

R&D Must be Informed by Business Requirements

From research and prototyping to commercial development, AI/ML requires making decisions and trade-off's. This is where the PM needs to provide loads of context to the R&D team: what are the customer problem(s) are that we are trying to solve? What parameters and trade-off's would be acceptable to a customer? Have we considered the cost trade-offs of operating in practice? For example:

is the program intended to deliver a incremental improvement over what exists in the industry today? Or is this a "big bet" to do something very innovative? The former implies use of well-established models, whereas the latter could mean focusing on more speculative research that could take longer and has a smaller chance of a big payoff.

What quality of data will we be able to get in practice, in the field? Has the business considered the cost of obtaining that data, and the cost of performing FMT (filter, merge, transform) / ETL (extract, transform, load)?

What is the cost of an error? The product will make mistakes with some margin of error. Is this a program where mistakes are ok to make because the cost is low. For example, if Netflix recommends a movie you don't like, that's forgivable. If medical software recommends a certain medication that harms a patient, that is not! This kind of cost/benefit analysis allows the R&D team to surface requirements such as “must be at least as accurate as the current process,” or “must provide transparency into how decisions are being made”.

If drift occurs, or without human intervention is the cost still economically viable?

From a data scientist's point of view, whether a model is sufficient to support the business case might be a higher bar than whether the model is a good model. Clear business strategy and customer context upfront can help guide the direction to take the development (kill the research and start fresh, re-focus it, etc.)

Test environment vs. Real-world environment

Skimping on QA strategy, test environments and real-world testing are classic software development pitfalls. With AI/ML programs, again this problem is amplified. The biggest challenges in commercial success lie in the substantial delta between the test environment and test data, and the highly variable nature of the field. Pitfalls include:

Test data is not a large enough sample size,
Test data not representative of reality
Data in the field requires so much ETL that it is not viable to work with in practice, or data changes from the time it was captured to the time it's processed, leading to lack of accuracy
Data in the field produces different errors/outliers/rare cases not seen in prototyping that skew results or generate unexpected behaviour
In research, the features of the data that best predicted results are not available in the data in practice
Model seemed to be fine in prototype but revealed to simply not be the right model in the field (e.g. due to correlation vs. causation assumptions in the smaller test data set)
Drift - data changes over time and the model doesn't keep up
Model works but can't be made performant enough to be commercially viable within the time to deliver
Business case problems - data sources cost more than expected, the cost of maintaining&retraining the model are too high

Careful modelling of the different data sources and their attributes - features, availability, timing, anticipated errors and drift - can lay the foundation for the right QA strategy what to test, where to test, and how to judge success.

Risk Management

Unless you're working at a big bank or large enterprise, PMs don't spend near as much time on risk management as PMI would say. In fact, if you spend too much time talking about risks in a small company, and you get pegged as a dissenter that doesn't believe in the company's success ("Why are you talking about risks? You don't believe we can make it happen?")

In Amazon's excellent paper Managing Machine Learning Projects, the author lives and dies by scorecards for all facets of the program - go-to-market, financial, data, etc. PMPs will recognize these scorecards as simply actionable risk registers.

Here's an example of a scorecard for go-to-market, which the program manager would continuously to track and use to direct progress and align the organization:

Example scorecard for financials:

Example scorecard for the data aspects of the program:

Starting from canned scorecards like these is recommended, since the list of risks is pre-populated based on others' lessons learned. Then expand on them with your own project-specific risks.

Give the team the big picture

In AI/ML programs, once production starts, research usually continues in parallel. Managing both as one cohesive team is a new challenge for traditional program managers.

From a process perspective, sometimes the method of communicating the requirements from research to production is by simply giving them the researcher’s Jupyter notebook, or a set of Python or R scripts. If the prod team redevelops and optimizes the code for production while the research team continues from their base notebook, you have the problem of versioning the code and identifying changes.

From a human perspective, it can be easy to assume that because the individual team members are often highly educated and experienced, especially data scientists who may have a PhD. Nevertheless these are still just people, and people see the world through their own lens until the manager gives them the big picture.

To ensure a well-oiled research and production machine:

Always provide the big picture. Ensure the scientists understand the goals and challenges of production, and the production team understand the goals and challenges of the scientists, as they may otherwise underestimate the risks of each other's parts.
give the research team production requirements so that the data scientists share in the compromises made with engineers. Conversely, have the data sceintists design tests and test data for production
Ensure whoever is managing has a deep understanding of both sides, including the goals and challenges, and particular terminology, to foster trust with all team members and bridge communication gaps.

Don't Assume Linear Progress

Finally, and perhaps most obviously, because of the non-deterministic nature of AI/ML development, achieving results usually does not happen as a set of linear steps that can be measured.

The model may jump from 50% to 70% accuracy quickly, but getting to 80% may take much longer, or not be possible at all. Or to get there, the team may have to scrap some of the work and go back down to 50% before it can go back up. Similarly, with one data source the algorithm might be highly performant and accurate, and with another it suddenly drops.

Metrics-focused management teams who are used to linear progressions can find these spiky results jarring. Metrics need to be part of the equation, especially before making any commercial release decisions, but they need to be accompanied with qualitative insight, lots of communication, and flexibility to pivot from the original plan if things are not progressing as expected after a certain time. Amazon against provides tools to help management visualize progress and plan contingency, but ultimately for any organization that hasn't undertaken AI/ML before, there's no panacea. This is culture change, plain and simple.

AI Programs - New Diligence Required

Back in the 1970s and 80s, software development was a sort of "black magic", where both benefits and risks could take you by surprise, forcing management to plan carefully. We have since made leaps and bounds in the maturity and predictability of our software engineering practices, to the point that we take for granted that anything can be engineered given enough time and budget. In a sense, AI takes us back to those early days where anything was possible, but nothing was to be taken for granted. The risks are high, but in a way. that's a good thing - it should increase the maturity of our management processes and responsibility.

1 Comment

Make New Programs Succeed.

Delivering an AI Program: What Could Possibli Go Wrong?

Leave a Reply.

Categories

Make New Programs Succeed​.

Delivering an AI Program: What Could Possibli Go Wrong?

Leave a Reply.

Categories

Make New Programs Succeed.