This article is a follow-up to Python Charting: Big Data without crashing, but it can be read on its own.
As data scientists, we often need to display large data sets graphically. I am not even talking about Big Data here: a mid-size dataset of more than a few tens of thousands of points can already be a challenge for most Python charting libraries, let alone data sets with millions of points.
One common approach is to use heatmap techniques (such as datashader). However, this only covers a small set of situations, usually involving maps. The bread and butter of data scientists remains time series and scatter plots.
Taipy (a Python library to fast-track your front-end and back-end development) addresses this problem with “Decimators”. These provide clever implementations of data sampling that minimize the volume of data transiting between the application and the GUI client, without losing visually significant information.
Consider a table (dataframe, array, etc.) with just 20 columns and 500K rows: generating a graphical view of such a table will require over 1 minute on a 10MB/s network!
Taipy ‘smartly’ reduces the number of points needed to define the shape of the plot or curve.
Taipy Decimators for Time Series
The basic “algorithm”
The process of ‘reducing’ a dataset by eliminating ‘non-significant’ points is generally referred to as ‘decimation’ or ‘downsampling’ (or even ‘subsampling’).
Here we apply the MinMaxDecimator to the curve of the Apple stock price over time. This is not a very large dataset, but the principle described here works for datasets of any size, including those that would crash your Python environment (most likely for lack of memory).
Note that the Taipy code that displays the two charts is condensed in the lines
Taipy provides two methods to generate graphics:
- One is based on an ‘augmented’ markdown syntax. This is our choice here. You get standard markdown extended with a graphical syntax that accepts Python objects. Here, a dataframe, df_AAPL, is the input to the ‘chart’ markdown object. More can be found in Taipy’s documentation.
- The second is a pure Python interface, with no markdown.
You get the following results:
A few comments on these two charts:
- First, the right-hand chart preserves the shape of the curve even though it is limited to 300 points. The MinMaxDecimator removes the points that contribute least to the shape of the curve and keeps the 300 most meaningful ones. Note that 300 is just a parameter.
- Second, if you zoom in on the second chart, the graphical object ‘repopulates’ the selected area with 300 points again. This gives an ‘optimal’ display of the data at whatever zoom level the end-user applies.
The results after zooming: the diagram on the right still displays 300 points representing the selected area.
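To give an idea of what a decimator of this kind does, here is a minimal pure-Python sketch of min-max downsampling. It is not Taipy's implementation: the function name minmax_decimate and the bucketing scheme are assumptions made for illustration. The series is split into n_out // 2 buckets and only the lowest and highest point of each bucket are kept, so peaks and troughs survive. Calling it again on a slice of the data mimics the ‘repopulate on zoom’ behavior described above.

```python
import math


def minmax_decimate(xs, ys, n_out):
    """Keep at most n_out points: the min and the max of each bucket.

    Illustrative sketch only, not the Taipy MinMaxDecimator code.
    """
    if n_out < 2:
        raise ValueError("n_out must be at least 2")
    n = len(xs)
    if n_out >= n:
        return list(zip(xs, ys))

    n_buckets = n_out // 2  # each bucket contributes up to two points
    kept = set()
    for b in range(n_buckets):
        start = b * n // n_buckets
        end = (b + 1) * n // n_buckets
        bucket = range(start, end)
        kept.add(min(bucket, key=lambda i: ys[i]))  # lowest point of the bucket
        kept.add(max(bucket, key=lambda i: ys[i]))  # highest point of the bucket
    return [(xs[i], ys[i]) for i in sorted(kept)]


# Synthetic series with one isolated spike.
xs = list(range(10_000))
ys = [math.sin(i / 200) for i in xs]
ys[1234] = 5.0

full_view = minmax_decimate(xs, ys, n_out=300)

# "Zooming" on indices 2000..2999: decimate only the visible slice,
# so the zoomed area is again described by up to 300 points.
zoomed_view = minmax_decimate(xs[2000:3000], ys[2000:3000], n_out=300)
```

The spike at index 1234 is kept because it is the maximum of its bucket, whereas taking every k-th point could easily miss it.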
Besides the MinMaxDecimator, there are two other types of decimators worth exploring: the RDP and the LTTB decimators.
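LTTB stands for Largest-Triangle-Three-Buckets. As a rough illustration of the idea, the sketch below is a plain-Python version of the commonly published algorithm, not Taipy's code. The function name lttb and its signature are hypothetical. The first and last points are always kept. For each interior bucket, the point kept is the one that forms the largest triangle with the previously kept point and the average of the next bucket.

```python
import math


def lttb(points, n_out):
    """Downsample (x, y) points, sorted by x, to n_out points (LTTB sketch)."""
    if n_out < 3:
        raise ValueError("n_out must be at least 3")
    n = len(points)
    if n_out >= n:
        return list(points)

    bucket = (n - 2) / (n_out - 2)  # width of each interior bucket
    sampled = [points[0]]           # the first point is always kept
    a = 0                           # index of the most recently kept point

    for i in range(n_out - 2):
        # Average of the next bucket: the third corner of the triangle.
        nxt_start = int((i + 1) * bucket) + 1
        nxt_end = min(int((i + 2) * bucket) + 1, n)
        nxt = points[nxt_start:nxt_end]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)

        # Candidate points of the current bucket.
        start = int(i * bucket) + 1
        end = int((i + 1) * bucket) + 1
        ax, ay = points[a]
        # Twice the triangle area; the constant factor does not change the argmax.
        best = max(
            range(start, end),
            key=lambda j: abs((ax - avg_x) * (points[j][1] - ay)
                              - (ax - points[j][0]) * (avg_y - ay)),
        )
        sampled.append(points[best])
        a = best

    sampled.append(points[-1])      # the last point is always kept
    return sampled


series = [(i, math.sin(i / 200)) for i in range(10_000)]
series[1234] = (1234, 5.0)  # isolated spike
reduced = lttb(series, n_out=300)
```

Unlike the min-max approach, this variant returns exactly n_out points and picks one point per bucket based on its visual contribution.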
Taipy Decimators for clouds of points (2D or 3D)
The previous decimators are well suited for 2D line curves but are not satisfactory for scatter plots: for instance, in the situation where you want to visualize clusters.
Here, we will be using the ScatterDecimator.
Here’s a Python script displaying two ‘scatter’ charts:
- One without the decimator
- And the second with the decimator.
The first part of the script is about creating a data set for classification (using sklearn’s make_classification() generator).
The second part is similar to the previous script, with Taipy’s markdown generating two different graphs.
Here are the resulting charts:
Again, the right-hand chart displays far fewer points, allowing for better visuals (whereas in the first chart, the shape of the clusters gets obscured). The interesting thing is that in the second chart, the outer envelope of each cluster is respected (the frontier points of each cluster are preserved).
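One simple way to obtain this kind of result is to lay a grid over the plot area and cap the number of points kept per cell. The sketch below illustrates that idea only. It makes no claim about how the ScatterDecimator is implemented, and the names grid_decimate, grid_size and max_per_cell are hypothetical. Dense cells in the core of a cluster are thinned out, while sparse cells near the edges keep all of their points, so isolated frontier points tend to survive.

```python
import random
from collections import defaultdict


def grid_decimate(points, grid_size=50, max_per_cell=3):
    """Keep at most max_per_cell points in each cell of a grid_size x grid_size grid."""
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid dividing by zero on degenerate data
    y_span = (max(ys) - y_min) or 1.0

    counts = defaultdict(int)
    kept = []
    for x, y in points:
        col = min(int((x - x_min) / x_span * grid_size), grid_size - 1)
        row = min(int((y - y_min) / y_span * grid_size), grid_size - 1)
        if counts[(col, row)] < max_per_cell:
            counts[(col, row)] += 1
            kept.append((x, y))
    return kept


# Two synthetic Gaussian clusters standing in for a classification dataset.
random.seed(0)
cloud = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]
cloud += [(random.gauss(4, 1), random.gauss(3, 1)) for _ in range(10_000)]

thinned = grid_decimate(cloud, grid_size=50, max_per_cell=3)
```

The number of points sent to the client is bounded by grid_size * grid_size * max_per_cell, whatever the size of the original cloud.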
Performance
Some substantial differences in performance between the various decimators are also worth noting.
- Min-Max: 1 (reference value)
- LTTB: 10x
- RDP (Ramer–Douglas–Peucker, using the 'epsilon' threshold): 20x
- RDP (Ramer–Douglas–Peucker, using a fixed number of points): 800x
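For reference, the epsilon-based variant of Ramer–Douglas–Peucker can be sketched in a few lines. This is a generic textbook version written for illustration, not the code used by Taipy, and the helper names are hypothetical. A segment is drawn between the first and last points. If the farthest intermediate point lies more than epsilon away from it, that point is kept and the two halves are processed in turn. Otherwise all intermediate points are dropped.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / norm


def rdp(points, epsilon):
    """Simplify a polyline with Ramer-Douglas-Peucker (iterative version)."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]  # explicit stack avoids deep recursion
    while stack:
        first, last = stack.pop()
        max_dist, index = 0.0, None
        for i in range(first + 1, last):
            d = perpendicular_distance(points[i], points[first], points[last])
            if d > max_dist:
                max_dist, index = d, i
        if index is not None and max_dist > epsilon:
            keep[index] = True
            stack.append((first, index))
            stack.append((index, last))
    return [p for p, k in zip(points, keep) if k]


curve = [(i / 100, math.sin(i / 100)) for i in range(2000)]
coarse = rdp(curve, epsilon=0.05)
fine = rdp(curve, epsilon=0.001)
```

A larger epsilon yields fewer points, so with this variant the final number of points is controlled only indirectly.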
Conclusion
Taipy lets you choose whether or not to apply a decimation algorithm. A developer can choose which algorithm to use and with what parameters. If the algorithm affects the application's performance, these parameters can always be fine-tuned.
The benefits are:
- A massive decrease in the load of the network and the memory used (on the client side)
- More meaningful visuals get displayed
- A much better / smoother user experience