Big Data Charting Strategies in Python

This article is a follow-up to Python Charting: Big Data without crashing, but it can be read independently.

As data scientists, we often have to display large data sets graphically. I am not even talking about Big Data here: a mid-size data set of more than tens of thousands of points can already be a challenge for most Python charting libraries, let alone data sets with millions of points.

One common approach is to use heatmap techniques (such as Datashader). However, this covers only a small set of situations, usually involving maps. The bread and butter of data scientists remains time series and scatter plots.

Taipy (a Python library to fast-track your front-end and back-end development) addresses this problem with “Decimators”. These implement data sampling that minimizes the volume of data transiting between the application and the GUI client while preserving the information that matters for the chart. Consider a table (dataframe, array, etc.) with just 20 columns and 500K rows: generating a graphical view for such a table will require over 1 minute on a 10MB/s network!

Taipy reduces the number of points sent to the client while keeping the ones that define the shape of the plot or curve.
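The transfer-time estimate above can be checked with back-of-the-envelope arithmetic. The sketch below is only illustrative: the per-value payload sizes are assumptions (8 bytes for a raw float, and a hypothetical 60 bytes for a verbose text encoding, which is the size that matches the one-minute figure), and the real number depends on how the data is serialized.

```python
def transfer_seconds(rows, cols, bytes_per_value, bandwidth_bytes_per_s):
    """Time needed to ship a rows x cols table at a given payload size per value."""
    return rows * cols * bytes_per_value / bandwidth_bytes_per_s


ROWS, COLS = 500_000, 20
BANDWIDTH = 10_000_000  # 10 MB/s, as in the text

# Assumed payload sizes per value (hypothetical, for illustration only)
for label, size in [("raw float64", 8), ("verbose text encoding", 60)]:
    seconds = transfer_seconds(ROWS, COLS, size, BANDWIDTH)
    print(f"{label}: {seconds:.0f} s")

# The same table once each column is reduced to 300 points
print(f"decimated: {transfer_seconds(300, COLS, 60, BANDWIDTH):.3f} s")
```

Whatever the exact encoding, the cost grows linearly with the number of rows, which is why cutting the row count before sending is so effective.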

Taipy Decimators for Time Series

The basic “algorithm”

The process of ‘reducing’ a dataset by eliminating ‘non-significant’ points is generally referred to as ‘decimation’ or ‘downsampling’ (or even ‘subsampling’).

Here we apply the MinMaxDecimator to the curve of the Apple stock price over time. This is not a very large dataset, but the principle works for any size of dataset, including those that would crash your Python environment (most likely for lack of memory).

import pandas as pd
import yfinance as yf
from taipy.gui import Gui
from taipy.gui.data.decimator import MinMaxDecimator

# Download the daily AAPL history; "max" is yfinance's documented way
# to request the full available period
df_AAPL = yf.Ticker("AAPL").history(interval="1d", period="max")
df_AAPL["DATE"] = df_AAPL.index
df_AAPL["DATE"] = pd.to_datetime(df_AAPL["DATE"]).dt.normalize()

n_out = 300

# Step 1: creation of a MinMaxDecimator with a set limit of 300 points
decimator_instance = MinMaxDecimator(n_out=n_out)

# Taipy's augmented markdown code
# Two charts are displayed: the first without the decimator, the second using it
page = """
## Decimator Example - Time Series

From a data length of <|{len(df_AAPL)}|> to <|{n_out}|>

<|layout|
<|{df_AAPL}|chart|x=DATE|y=Open|title=No Decimator|>

<|{df_AAPL}|chart|x=DATE|y=Open|decimator=decimator_instance|title=With Decimator|>
|>
"""

# Launching & execution of the Web client
gui = Gui(page)
gui.run(use_reloader=True, port=5004)

Note that the Taipy code that displays the two charts is condensed in these lines:

<|{df_AAPL}|chart|x=DATE|y=Open|title=No Decimator|>

<|{df_AAPL}|chart|x=DATE|y=Open|decimator=decimator_instance|title=With Decimator|>

Taipy provides two methods to generate graphics:

One is based on an ‘augmented’ markdown syntax, which is our choice here. It extends markdown with graphical elements that receive Python objects. Here, a dataframe, df_AAPL, is the input to the ‘chart’ markdown object. More can be found in Taipy’s documentation.
The second is a pure Python interface. No markdown.

You get the following results:

A few comments on these two charts:

First, the right-hand chart preserves the shape of the curve even though the number of points is reduced to 300. The MinMaxDecimator removes the points that least modify the shape of the curve and keeps the 300 most meaningful ones. Note that 300 is just a parameter.
Second, if you zoom in on the second chart, the graphical object will ‘repopulate’ the selected area with 300 points again. This allows the ‘optimal’ display of the data, whatever zoom level the end-user applies.
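The idea described above can be sketched in a few lines of NumPy. This is not Taipy's implementation, only a minimal illustration of min-max downsampling under one assumption: the series is split into equal-width buckets and the lowest and highest point of each bucket are kept. The function name minmax_indices is hypothetical.

```python
import numpy as np


def minmax_indices(y, n_out):
    """Return sorted indices of at most n_out points: the min and max of each bucket."""
    y = np.asarray(y)
    n = len(y)
    if n <= n_out:
        return np.arange(n)  # nothing to decimate
    n_bins = max(n_out // 2, 1)  # two points (min, max) are kept per bucket
    edges = np.linspace(0, n, n_bins + 1, dtype=int)
    keep = []
    for start, end in zip(edges[:-1], edges[1:]):
        chunk = y[start:end]
        keep.append(start + int(np.argmin(chunk)))
        keep.append(start + int(np.argmax(chunk)))
    return np.unique(keep)  # sorted, duplicates removed


# Synthetic random walk standing in for a price series
rng = np.random.default_rng(0)
prices = rng.normal(size=10_000).cumsum()

full_view = minmax_indices(prices, n_out=300)
# "Zooming" amounts to running the same reduction on the visible slice only
zoomed_view = minmax_indices(prices[2_000:3_000], n_out=300)
print(len(full_view), len(zoomed_view))
```

Because every bucket contributes its extremes, peaks and troughs survive the reduction, which is what keeps the overall shape of the curve recognizable.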

The results after zooming: the diagram on the right still displays 300 points representing the selected area.

Besides the MinMax decimator, there are two other types of decimators worth exploring: the RDP (Ramer–Douglas–Peucker) and the LTTB (Largest-Triangle-Three-Buckets) decimators.
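To give a feel for how RDP differs from the min-max approach, here is a minimal sketch of the Ramer–Douglas–Peucker algorithm in its 'epsilon' form. It is an independent illustration, not Taipy's code: a point is kept only if it lies farther than epsilon from the straight line joining the two kept points around it. The function name rdp_mask is hypothetical.

```python
import numpy as np


def rdp_mask(points, epsilon):
    """Boolean mask of the points kept by Ramer-Douglas-Peucker (2D, iterative)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    keep = np.zeros(n, dtype=bool)
    keep[0] = keep[-1] = True
    stack = [(0, n - 1)]  # segments still to examine
    while stack:
        start, end = stack.pop()
        if end - start < 2:
            continue  # no interior point left
        p0, p1 = points[start], points[end]
        seg = p1 - p0
        seg_len = np.hypot(seg[0], seg[1])
        inner = points[start + 1:end]
        if seg_len == 0:
            dists = np.hypot(inner[:, 0] - p0[0], inner[:, 1] - p0[1])
        else:
            # Perpendicular distance to the segment line via the 2D cross product
            cross = seg[0] * (inner[:, 1] - p0[1]) - seg[1] * (inner[:, 0] - p0[0])
            dists = np.abs(cross) / seg_len
        farthest = int(np.argmax(dists))
        if dists[farthest] > epsilon:
            idx = start + 1 + farthest
            keep[idx] = True
            stack.append((start, idx))
            stack.append((idx, end))
    return keep


# A flat line with a single spike at x = 5
xs = np.arange(11, dtype=float)
ys = np.zeros(11)
ys[5] = 10.0
mask = rdp_mask(np.column_stack([xs, ys]), epsilon=1.0)
print(np.flatnonzero(mask).tolist())  # kept indices: [0, 4, 5, 6, 10]
```

With an epsilon threshold the number of output points is not known in advance; it depends on how jagged the curve is.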

Taipy Decimators for clouds of points (2D or 3D)

The previous decimators are well suited for 2D line curves but are not satisfactory for scatter plots: for instance, in the situation where you want to visualize clusters.

Here, we will be using the ScatterDecimator.

Here’s a Python script displaying two ‘scatter’ charts:

One without the decimator
And the second with the decimator.

The first part of the script is about creating a data set for classification (using sklearn’s make_classification() generator).

The second part is similar to the previous script, with Taipy’s markdown generating two different graphs.

import pandas as pd
import numpy as np
from taipy.gui import Gui
from taipy.gui.data.decimator import ScatterDecimator

from sklearn.datasets import make_classification

n_clusters = 3
n_features = 3

X, y = make_classification(
    n_samples=30000,
    n_features=n_features,
    n_informative=n_features,
    n_redundant=0,
    n_clusters_per_class=2,
    n_classes=3
)
df = pd.DataFrame(X[:, :3], columns=["x", "y", "z"])
df["class"] = y

# Map each class label to a display color
df["Colors"] = df["class"].map({0: "red", 1: "blue", 2: "green"})

marker = {"color": "Colors"}

# Parameters to create the ScatterDecimator:
# threshold: minimum number of points required before the decimator is applied
# binning_ratio: the divider that determines the grid size; the larger the
#   binning_ratio, the fewer points in the final chart
# max_overlap_points: how many points can overlap within a single grid cell
decimator = ScatterDecimator(binning_ratio=100, max_overlap_points=2)
chart = """
<|layout|

<|{df}|chart|type=scatter3d|x=x|y=y|z=z|mode=markers|marker={marker}|height=800px|title=No Decimator|>

<|{df}|chart|type=scatter3d|x=x|y=y|z=z|mode=markers|marker={marker}|height=800px|decimator=decimator|title=With Decimator|>

|>
"""
gui = Gui(chart)
gui.run(use_reloader=True, port=5005)

Here are the resulting charts:

Again, the right-hand chart displays far fewer points, which gives a cleaner visual (in the first chart, the shape of the clusters is obscured). The interesting thing is that in the second chart, the outer envelope of each cluster is respected (the frontier points of each cluster are preserved).
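The binning_ratio and max_overlap_points parameters used in the script can be understood with a small grid-based sketch. This is not Taipy's implementation. It is a 2D illustration that assumes the chart area is divided into cells whose count per axis is the chart size in pixels divided by binning_ratio, and that each cell keeps at most max_overlap_points points. The function name and the width_px and height_px parameters are hypothetical.

```python
import numpy as np


def grid_decimate_mask(x, y, width_px=800, height_px=800,
                       binning_ratio=100, max_overlap_points=2):
    """Boolean mask keeping at most max_overlap_points points per grid cell."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: a larger binning_ratio means fewer, larger cells
    n_x = max(1, round(width_px / binning_ratio))
    n_y = max(1, round(height_px / binning_ratio))

    def to_cell(values, n_cells):
        span = values.max() - values.min()
        if span == 0:
            return np.zeros(len(values), dtype=int)
        cells = ((values - values.min()) / span * n_cells).astype(int)
        return np.minimum(cells, n_cells - 1)  # the maximum falls in the last cell

    cell_x, cell_y = to_cell(x, n_x), to_cell(y, n_y)
    counts = np.zeros((n_x, n_y), dtype=int)
    keep = np.zeros(len(x), dtype=bool)
    for i, (cx, cy) in enumerate(zip(cell_x, cell_y)):
        if counts[cx, cy] < max_overlap_points:
            counts[cx, cy] += 1
            keep[i] = True
    return keep


rng = np.random.default_rng(0)
x_demo = rng.normal(size=30_000)
y_demo = rng.normal(size=30_000)
kept = grid_decimate_mask(x_demo, y_demo)
print(f"{kept.sum()} of {len(kept)} points kept")
```

Every occupied cell keeps at least one point, so sparse cells at the edge of a cluster stay visible while dense cells in the middle are thinned out.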

Performance

The decimators also differ substantially in performance. Taking Min-Max as the reference:

Min-Max: 1 (reference value)
LTTB: 10x
RDP (Ramer–Douglas–Peucker, using the 'epsilon' threshold): 20x
RDP (Ramer–Douglas–Peucker, using a fixed number of points): 800x
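The relative figures above are the article's own. To obtain comparable ratios on your own data, a small timing harness like the one below can help. It is a sketch only: relative_cost is a hypothetical helper, and the two functions it times are stand-ins (a vectorized and a pure-Python per-bucket min/max), not Taipy's decimators.

```python
import timeit

import numpy as np


def relative_cost(candidates, y, repeat=3):
    """Best-of-repeat run time of each callable, divided by the fastest one."""
    best = {
        name: min(timeit.repeat(lambda fn=fn: fn(y), number=1, repeat=repeat))
        for name, fn in candidates.items()
    }
    fastest = min(best.values())
    return {name: elapsed / fastest for name, elapsed in best.items()}


def minmax_vectorized(a, n_bins=150):
    # Reshape into equal-size buckets and reduce each row with NumPy
    buckets = a[: len(a) // n_bins * n_bins].reshape(n_bins, -1)
    return buckets.min(axis=1), buckets.max(axis=1)


def minmax_loop(a, n_bins=150):
    # Same reduction, but iterating over the values in pure Python
    size = len(a) // n_bins
    return [
        (min(a[i * size:(i + 1) * size]), max(a[i * size:(i + 1) * size]))
        for i in range(n_bins)
    ]


rng = np.random.default_rng(0)
series = rng.normal(size=150_000).cumsum()
ratios = relative_cost(
    {"vectorized": minmax_vectorized, "python_loop": minmax_loop}, series
)
print(ratios)
```

The fastest candidate always gets a ratio of 1, which mirrors the way Min-Max is used as the reference value in the list above.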

Conclusion

Taipy lets the developer decide whether to apply a decimation algorithm, which one to use, and with what parameters. If the algorithm affects the application's performance, these parameters can always be fine-tuned.

The benefits are:

A massive decrease in the load of the network and the memory used (on the client side)
More meaningful visuals get displayed
A much better / smoother user experience
