Machine Learning 101 prototyping workflow

Master the very basic ML prototyping workflow

Sebastien Sime
5 min read · Oct 16, 2021

What you will learn:

  • The necessary steps to perform when prototyping a machine learning solution
  • The reasons why each step is useful
  • How each step is performed using the most popular Python libraries (through the provided notebook)

Motivation:

Put yourself in the shoes of a data scientist who has just been given the amazing task of building a cutting-edge machine learning solution. You have the data and the motivation but don’t know where to start. Just pause a minute and think about the necessary steps…

Is the path clear in your mind, or do you feel a rush in your chest without seeing exactly where to begin?

My motivation here is simple: to show you, in a straightforward way, where to start and why each step is important. I remember when I started this journey into the data world, feeling a little crushed under the data science buzzwords and their associated techniques: it was like being in a storm in a little canoe. After reading this post, I believe you will have enough faith to walk on water. So let’s get started.

The main objectives:

The traditional machine learning (ML) workflow broadly consists of 3 major steps: getting/loading the data, training the models and evaluating the results.

ML workflow major steps

You can visualize these steps as the 3 main objectives of your project:

  1. The first objective is to get good, clean data related to your project. The data could be in any form (a CSV file, pictures, sounds etc.).
  2. The second objective is to train some models with the data you got in the previous step. The purpose of this step is to find the pattern or structure (if any) hidden in your data. For this you will use a few algorithms to capture that structure.
  3. With the model (AKA the pattern or structure) provided by the previous step, you now want to decide whether or not the structure is a good representation of your data. That’s the purpose of evaluation.

Most of the time, the evaluation step will raise further questions that require more data transformation, model training and so on.

So let’s suppose that you have the data you wanted for your project and that you now want to create something like a Gantt chart of your project. What are the simple steps?

Machine learning project simple steps:

Let’s first have a look at the project skeleton. Like me, you can use this skeleton as a checklist to make sure you have all the necessary steps. I believe following this template gives you a clean and professional way of knowing what to do, when to do it, and how much time and how many steps remain until the end of the project.

ML project skeleton:

1. Prepare the project work environment

a) Load libraries

b) Load dataset
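
As a rough sketch, step 1 could look like this. It assumes scikit-learn's bundled copy of the Breast Cancer Wisconsin (Diagnostic) dataset as a stand-in for a file you would download yourself; the variable names `df`, `X` and `y` are illustrative and not taken from the notebook.

```python
# a) Load libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer

# b) Load dataset
# Assumption: scikit-learn's bundled data replaces a CSV that you would
# otherwise load with pd.read_csv("your_file.csv").
data = load_breast_cancer(as_frame=True)
df = data.frame                  # 30 feature columns plus a "target" column
X = df.drop(columns="target")    # features
y = df["target"]                 # labels (0 = malignant, 1 = benign)

print(df.shape)
```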

2. Get a sense of the data through summarization

This step will help you build solid knowledge of your dataset.

a) Descriptive statistics

  • Peek At Your Data.
  • Dimensions of Your Data.
  • Data Types.
  • Class Distribution.
  • Data Summary.
  • Correlations.
  • Skewness.
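
Each item in the list above maps to roughly one pandas call. The sketch below again assumes scikit-learn's bundled breast cancer data and a label column named `target`; adapt the column name to your own dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

print(df.head())                              # peek at your data
print(df.shape)                               # dimensions (rows, columns)
print(df.dtypes)                              # data types

class_counts = df["target"].value_counts()    # class distribution
print(class_counts)

print(df.describe())                          # data summary

correlations = df.corr(method="pearson")      # pairwise correlations
print(correlations["target"].sort_values())

skewness = df.skew()                          # skewness of each column
print(skewness)
```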

b) Data visualizations

  • Histograms.
  • Density Plots.
  • Box and Whisker Plots.
  • Correlation Matrix Plot.
  • Scatter Plot Matrix.
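
The plots listed above can be produced with pandas and matplotlib, for example as in the sketch below. The four columns are an arbitrary choice to keep the figures readable, and the `Agg` backend line is only an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
cols = ["mean radius", "mean texture", "mean area", "mean smoothness"]
subset = df[cols]

# Histograms
hist_axes = subset.hist(figsize=(8, 6))

# Density plots
subset.plot(kind="density", subplots=True, layout=(2, 2),
            sharex=False, figsize=(8, 6))

# Box and whisker plots
subset.plot(kind="box", subplots=True, layout=(2, 2),
            sharex=False, sharey=False, figsize=(8, 6))

# Correlation matrix plot
fig, ax = plt.subplots()
im = ax.matshow(subset.corr().to_numpy(), vmin=-1, vmax=1)
fig.colorbar(im)
ax.set_xticks(range(len(cols)))
ax.set_yticks(range(len(cols)))
ax.set_xticklabels(cols, rotation=90)
ax.set_yticklabels(cols)

# Scatter plot matrix
scatter_axes = pd.plotting.scatter_matrix(subset, figsize=(8, 8))

plt.close("all")  # in a notebook, let the figures render instead
```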

3. Prepare the data for modelling

Most datasets cannot be given directly to ML algorithms. Depending on what you’re trying to achieve, some transformations will be needed.

a) Data Cleaning

b) Data Transforms

  • Rescale data.
  • Standardize data.
  • Normalize data.
  • Binarize data.
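
The four transforms above each have a counterpart in `sklearn.preprocessing`. This is a minimal sketch; the binarization threshold (the overall mean) is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import (Binarizer, MinMaxScaler, Normalizer,
                                   StandardScaler)

X, y = load_breast_cancer(return_X_y=True)

# Rescale: squeeze every feature into the range [0, 1]
rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardize: every feature gets mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(X)

# Normalize: every row (sample) is scaled to unit L2 length
normalized = Normalizer(norm="l2").fit_transform(X)

# Binarize: values above the threshold become 1, the others 0
# (threshold chosen arbitrarily for this illustration)
binarized = Binarizer(threshold=float(X.mean())).fit_transform(X)

np.set_printoptions(precision=3, suppress=True)
print(rescaled[:3, :5])
```

In a real project the scalers would usually be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.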

c) Feature Selection

  • Univariate Selection.
  • Recursive Feature Elimination.
  • Principal Component Analysis.
  • Feature Importance.
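
A possible scikit-learn version of the four approaches above is sketched below. The number of features kept (5), the number of components (3) and the estimators are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Univariate selection: keep the 5 features with the highest ANOVA F-score
X_kbest = SelectKBest(score_func=f_classif, k=5).fit_transform(X_scaled, y)

# Recursive feature elimination wrapped around a logistic regression
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_scaled, y)
print("RFE kept feature indices:", rfe.support_.nonzero()[0])

# Principal component analysis: project onto 3 components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Feature importance estimated by an ensemble of randomized trees
forest = ExtraTreesClassifier(n_estimators=100, random_state=7).fit(X, y)
importances = forest.feature_importances_
print("Most important feature index:", importances.argmax())
```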

4. Try different algorithms and compare them to one another

To make sure you end up with the best ML algorithm, you will have to try different types of algorithms and compare them to one another.

a) Split the dataset into train/test/validation parts

  • Train and Test Sets.
  • Cross-Validation.
  • Leave One Out Cross-Validation.
  • Repeated Random Test-Train Splits.
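
The splitting strategies above are available in `sklearn.model_selection`. The sketch below only builds the splitters; the 80/20 ratio, the 10 folds and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     train_test_split)

X, y = load_breast_cancer(return_X_y=True)

# Train and test sets: a single 80/20 split, stratified on the labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# Cross-validation: 10 folds, each used once as the test fold
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Leave-one-out cross-validation: one fold per sample
loo = LeaveOneOut()

# Repeated random test-train splits: 10 independent 80/20 splits
shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=7)

for name, splitter in [("k-fold", kfold), ("leave-one-out", loo),
                       ("shuffle-split", shuffle_split)]:
    print(name, splitter.get_n_splits(X))
```

Any of these splitter objects can be passed as the `cv` argument of `cross_val_score`.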

b) Test options and evaluation metric

c) Try different Algorithms

d) Compare Algorithms
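
Sub-steps b) to d) can be sketched as a small spot-checking loop. The four candidate models, the accuracy metric and the 10-fold setup are assumptions for illustration; pick whatever suits your problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# b) Test options and evaluation metric
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scoring = "accuracy"

# c) Try different algorithms (scaling lives inside the pipeline so that
#    it is fitted on the training folds only)
models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "CART": DecisionTreeClassifier(random_state=7),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# d) Compare algorithms on the same folds
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring=scoring)
    results[name] = scores
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```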

5. Improve the performance of the chosen model (fine-tuning to improve accuracy)

a) Algorithm tuning, i.e. optimization of the model’s hyper-parameters

b) Try ensemble algorithms (algorithms using boosting and bagging techniques)
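
A minimal sketch of both sub-steps, assuming an SVM was the model chosen earlier. The parameter grid is deliberately tiny and the two ensembles (a random forest for bagged trees, gradient boosting for boosting) are illustrative picks.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# a) Algorithm tuning: exhaustive search over a small hyper-parameter grid
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")

# b) Ensembles: bagged trees (random forest) and boosting
ensembles = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=7),
    "GBM": GradientBoostingClassifier(random_state=7),
}
ensemble_scores = {}
for name, model in ensembles.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    ensemble_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```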

6. Finalize Model

a) Predictions on validation dataset

b) Create standalone model on entire training dataset

c) Save model for later use
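
Finalizing could look like the sketch below. The SVM pipeline, its `C` value and the file name `finalized_model.joblib` are assumptions for illustration; the order follows what the text gives, which is to fit on the training data, check the validation set, then persist the model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# b) Standalone model trained on the entire training dataset
model = make_pipeline(StandardScaler(), SVC(C=1.0))
model.fit(X_train, y_train)

# a) Predictions on the validation dataset (never seen during training)
predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
print(f"Validation accuracy: {accuracy:.3f}")
print(confusion_matrix(y_val, predictions))
print(classification_report(y_val, predictions))

# c) Save the model for later use, then reload it to check the round trip
path = os.path.join(tempfile.gettempdir(), "finalized_model.joblib")
joblib.dump(model, path)
loaded = joblib.load(path)
same_predictions = bool((loaded.predict(X_val) == predictions).all())
print("Reloaded model gives identical predictions:", same_predictions)
```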

You can use this skeleton by simply copying it into your favourite work environment as comments and then filling in the gaps between them. But first, you have to do some research on your project’s subject to build general knowledge of the domain you are working in. I believe this may be the most important step, as it will surely help you understand what you are working on and why. The urge to write code should be held back until you understand the context of your study.

This is only one way of getting things done, and some steps can be skipped depending on your data and objective. If there are terms you didn’t quite understand when reading the skeleton, I found this post which breaks things down even further.

Finally, note that this prototyping process is not linear but iterative. While performing one step you might discover something and go back to the previous steps.

The practical application:

In the provided Python notebook (go to the GitHub repository), I chose to work with the breast cancer detection dataset from the UCI repository. This is one of the “Hello world” examples of machine learning on the internet, but before jumping right into the code, I spent a significant amount of time trying to understand a couple of things: Where did the data come from? How was the data gathered and labelled? Why could bringing AI into the detection process help? What do the dataset’s statistical variables represent in real life? etc.

Conclusion:

The lesson learned while doing this project is that even if a project seems simple, there is always much to be learned about the subject and the process. Carrying out an ML project is not only about cutting-edge lines of code but also about the learning experience and the problem-solving attitude of asking questions.
