
Build Interactive Data Apps of Scikit-learn Models Using Taipy

A low-code data pipeline interface to compare various models.

Avi Chawla

Data scientists and machine learning engineers mainly use Jupyter to explore data and build models.

However, for those who don’t care about our code and are only interested in the results, an interactive app is a much better medium.


While we have previously discussed Taipy, an open-source, full-stack data application builder that uses Python alone (I also like to call it a supercharged version of Streamlit), we have yet to do a practical demo of building a data application.

So today, let me walk you through building a simple data pipeline:

  • It will train various classification models implemented in sklearn.
  • It will plot their decision region plots to let the user visually assess their performance.

By the end of this newsletter, we will have built the following app:

[Image: Taipy app example]

Let’s begin!

Building blocks of Taipy apps

A typical data pipeline has a series of steps, such as:

  • Load data → select columns → clean data → fit model → predict → find training score → find test score.

But if you look closely, is it necessary to execute these steps strictly one after another? Some of them are independent — once the model is fit, computing the training score and the test score, for instance, can happen in parallel.

In other words, an optimal execution graph can look something like this:

[Image: an optimized execution graph with parallelized steps]

Unless such parallelization is explicitly implemented, traditional data application builders (and even plain Python scripts) cannot take advantage of it. As a result, they resort to step-by-step execution, which is not optimal.

However, data apps built with Taipy can leverage such optimizations.
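
For instance, Taipy lets you configure how jobs are executed. Here is a hedged sketch of enabling parallel execution (the mode and worker count are illustrative, not from the demo):

```python
from taipy import Config

# Run independent tasks in parallel worker processes
# (the default mode executes jobs synchronously)
Config.configure_job_executions(mode="standalone", max_nb_of_workers=4)
```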

There are four core components of Taipy that facilitate this (a minimal sketch follows the list):

  1. Data node → A placeholder for data like text, numbers, class objects, CSV files, and more.
  2. Task → A function to perform a specific task in the pipeline — cleaning data, for instance. Thus, it accepts data node(s) as its input and outputs another data node, which can be an input to another task.
  3. Pipeline → A sequence of tasks to be executed (with parallelism and caching).
  4. Scenario → A scenario configuration defines a given pipeline. An instance/execution of such a configuration is called a Scenario. Typically, a new Scenario is created and executed whenever we run the pipeline with modified/different inputs.
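
To make these concrete, here is a minimal, self-contained sketch of the four building blocks on a toy task. It is not the demo code, and it assumes a recent Taipy API where tasks attach directly to a scenario configuration:

```python
import taipy as tp
from taipy import Config, Core

def double(x):
    return 2 * x

# Data nodes: placeholders for the task's input and output
input_cfg = Config.configure_data_node("input", default_data=21)
output_cfg = Config.configure_data_node("output")

# Task: wraps the function and wires the data nodes together
task_cfg = Config.configure_task("double", double, input_cfg, output_cfg)

# Scenario configuration: the pipeline of tasks to execute
scenario_cfg = Config.configure_scenario("demo", task_configs=[task_cfg])

if __name__ == "__main__":
    Core().run()                                 # start Taipy's orchestrator
    scenario = tp.create_scenario(scenario_cfg)  # a Scenario instance
    scenario.submit()                            # execute the pipeline
    print(scenario.output.read())                # -> 42
```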

Next, let’s understand how we can utilize them in Taipy to build efficient data apps.

Prerequisites

To get started, install Taipy:

pip install taipy

Next, install Taipy Studio, a VS Code extension. It provides a graphical interface to build and visualize data pipelines.

These are the steps: VS Code → Extensions → Search “Taipy Studio” → Install.

Sklearn Model app using Taipy

You can download the code for this project here: Taipy-Sklearn demo.

I recommend downloading the above code, opening it in VS Code, and reading this newsletter side-by-side.

We will be creating four Python files in this project:

  • main.py: This will be the project’s base file.
  • config.py: This will define the connections between tasks, which data nodes they accept, and what data nodes they output.
  • algos.py: This will implement the tasks utilized in our data pipeline. So here’s where we will train the models and create their decision region plots.
  • interface.py: This will implement the interface for the user to interact with our data app. We discussed creating interfaces in Taipy in an earlier issue, so we won’t go through it again. Note that if you are not comfortable with Taipy's GUI Markdown syntax, Taipy also provides a pure Python interface.

First, let’s look at the algos.py file, as it is the easiest to begin with.

It has two functions (sketched below):

  1. fit: Based on the value of the model_name parameter, it creates the corresponding model object and trains it on the data.
  2. plot: This function accepts the model trained above and returns a decision region plot.
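
Here is a minimal sketch of such an algos.py. It assumes plot() also receives X and y to overlay the data points and returns a matplotlib figure; the downloadable demo’s code may differ:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def fit(X, y, model_name):
    # Map the model's name to a sklearn estimator and train it
    models = {
        "Logistic Regression": LogisticRegression(),
        "SVM": SVC(),
        "Decision Tree": DecisionTreeClassifier(),
    }
    model = models[model_name]
    model.fit(X, y)
    return model

def plot(model, X, y):
    # Evaluate the model over a 2-D grid to draw its decision regions
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    fig, ax = plt.subplots()
    ax.contourf(xx, yy, Z, alpha=0.3)                 # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")  # data points
    ax.set_title(type(model).__name__)
    return fig
```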

Next, consider the config.py file. It imports the two functions defined in algos.py above, along with the Config class from the Taipy library.


Let’s look at the configure() method in detail.

Recall that the fit() function defined above has three parameters → X, y, and model_name.

Thus, we begin by defining those as data nodes.


The output of the fit() function is a model object, which is also a data node, so we define that as well.

Moving on, we define a task in the pipeline to fit a model (a sketch follows the argument list):

  • 1st argument: Name of the task → “fit”
  • 2nd argument: Function this task is supposed to execute → fit.
  • 3rd argument: The input data nodes accepted by the task (We defined them in earlier lines of the configure method)
  • 4th argument: The data node produced by the task.
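
Here is a hedged sketch of how these definitions might look inside configure() — the variable and node names are my assumptions, and the downloadable demo may differ:

```python
from taipy import Config
from algos import fit, plot

def configure():
    # Input data nodes: features, labels, and the model's name
    X_cfg = Config.configure_data_node("X")
    y_cfg = Config.configure_data_node("y")
    model_name_cfg = Config.configure_data_node("model_name")

    # Output data node: the trained model
    model_cfg = Config.configure_data_node("model")

    # The "fit" task: fit(X, y, model_name) -> model
    fit_task_cfg = Config.configure_task(
        "fit", fit, [X_cfg, y_cfg, model_name_cfg], model_cfg
    )
```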

Similarly, we define a task to plot the decision regions as well:

  • 1st argument: Name of the task → “plot”
  • 2nd argument: Function this task is supposed to execute → plot.
  • 3rd argument: The input data nodes accepted by the task (We defined them in earlier lines of the configure method)
  • 4th argument: The data node produced by the task — a figure object.

This defines our pipeline.

Finally, we bind the tasks together in a scenario, export the configuration to a TOML file (for visualization), and return the scenario configuration:
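
Continuing the same sketch inside configure() (same assumptions as above):

```python
    # (continuing inside configure())
    fig_cfg = Config.configure_data_node("figure")

    # The "plot" task: plot(model, X, y) -> figure
    plot_task_cfg = Config.configure_task(
        "plot", plot, [model_cfg, X_cfg, y_cfg], fig_cfg
    )

    # Bind both tasks into one scenario configuration and export it
    scenario_cfg = Config.configure_scenario(
        "scenario", task_configs=[fit_task_cfg, plot_task_cfg]
    )
    Config.export("config.toml")
    return scenario_cfg
```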

We will see shortly that once we run the pipeline, it creates a config.toml file, which can be visualized using Taipy Studio (the extension we installed earlier).

The resulting diagram makes it quite easy to visualize how different tasks and data nodes contribute to the overall pipeline.

Next, let’s look at the main.py file, which is the base file of this project.

We import the following here:

  • The taipy library.
  • The configure() method we defined in the config.py file.
  • The GUI markdown object (interface) defined in interface.py.
  • The Core and GUI components of Taipy.
    • Core is used to run pipelines.
    • GUI is used to define a web interface.
  • The method to create the moons dataset from sklearn (make_moons).

Under the main block (if __name__ == "__main__"), we first instantiate an object of the Core class imported above and load the scenario configuration defined in the configure() method.

Next, for every model:

  • We create a scenario.
  • We specify values for the data nodes needed to run it and submit the scenario.

Finally, under the same main block, we create our GUI by passing the Markdown object created in interface.py.
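
Putting it all together, here is a minimal sketch of main.py — the model names and dataset settings are illustrative assumptions, and the downloadable demo may differ:

```python
import taipy as tp
from taipy import Core, Gui
from sklearn.datasets import make_moons

from config import configure
from interface import interface  # the GUI Markdown object

if __name__ == "__main__":
    core = Core()
    scenario_cfg = configure()
    core.run()  # start Taipy's orchestrator

    # A toy 2-D dataset for the decision-region plots
    X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

    # One scenario per model
    for model_name in ["Logistic Regression", "SVM", "Decision Tree"]:
        scenario = tp.create_scenario(scenario_cfg)
        scenario.X.write(X)
        scenario.y.write(y)
        scenario.model_name.write(model_name)
        scenario.submit()

    # Launch the web interface
    Gui(page=interface).run()
```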

Done!

Executing main.py as follows launches the data app I showed you above:

python main.py

Wasn’t that simple?

To recap, here’s what we did:

  • We implemented all the tasks in algos.py.
  • In config.py, we defined the overall workflow of our data application and how different tasks and data nodes interact with one another.
  • In interface.py, we defined our graphical interface.
  • In main.py, we launched all independent scenarios by specifying data node values and executing the application.

The whole application was entirely Pythonic and took no more than 160-180 lines of code to implement.

You can find the code for this project here: Taipy-Sklearn demo.

A departing note

Taipy is genuinely one of the best application builders I have ever used. It’s hard for me to switch to any alternatives now.

In practical apps, the latency difference is quite noticeable when I use Taipy compared to apps built with other tools, as depicted below:

[Image: latency comparison of Taipy vs. other libraries]

Taipy has witnessed exponential growth over the last couple of months or so, which shows that more and more programmers are adopting it and finding it useful.

It has also trended on GitHub multiple times.

They are releasing new updates almost every week now, so do give them a star on GitHub to support their work: Taipy GitHub.

I love Taipy’s mission of supporting data scientists and machine learning engineers in building full-stack apps themselves while Taipy takes care of all backend optimizations.

They are solving a big problem with existing tools, and I’m eager to see how they continue!


🙌 A big thanks to the Taipy team, who very kindly partnered with me on this newsletter issue.

👉 Over to you: What problems will you use Taipy for?

Thanks for reading!
