Graphical Effects with Python, Tkinter, Cython, and Numba

Yesterday, Saturday, I felt like creating a flame effect (fire) in Python. This effect was quite popular in the early 90s. I remembered that the algorithm was quite simple, but there were some tricks to do with the color palette.

I found this article with the implementation in C: https://lodev.org/cgtutor/fire.html

From the same article, we can get an idea of how the effect looks:

Flame Effect

After reading the article and watching some videos on YouTube, I faced two problems:

1. I needed a graphical application capable of displaying images as an animation, one after another, as quickly as possible (at least about 15 frames per second, ideally above 30).
2. I suspected that generating the images would be slow, since a single 1024 x 1024 image has over a million points and uses about 3 bytes per point. Imagining a matrix of this size, I saw that it would not be easy to write this part in pure Python, so I installed numpy to be safe.

I expected that problem one would be relatively simple, but I’ll explain what complicated it a bit. Since I only want to display an image, Python’s tkinter would be sufficient. I started by creating a simple application, showing a Canvas and adding an image. However, due to problem two, the window freezes completely while an image is being generated; you cannot move or close it.

The code is in Portuguese and English, but basically, it is a tkinter application where the main window has a Label to display a message, in this case, the current frame number; and an image.

class App(tk.Tk):
    def __init__(self, desenhador, func, preFunc, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.setup_windows()
        self.queue = Queue()
        self.queueStop = Queue()
        self.setup_thread(desenhador, func, preFunc)
        self.buffer = None
        self.running = True
        self.dead = False
        self.after(1, self.check_queue)

    def setup_windows(self):
        self.title('Image Generator')
        self.status = tk.StringVar(self, value='Waiting')
        tk.Label(self, textvariable=self.status).pack()
        self.canvas = tk.Canvas(self, width=LARGURA, height=ALTURA)
        self.image = self.canvas.create_image(0, 0, anchor=tk.NW)
        self.canvas.pack()
        self.protocol("WM_DELETE_WINDOW", self.terminate)

The setup_windows method sets up the window, adding the Label, creating the Canvas, and the image. Since we will frequently swap the image, it also keeps a reference to the image on the canvas in self.image. This method also sets up the window to call self.terminate if the user closes it.

The setup_thread method initializes the thread; the class that manages the thread is passed to __init__ as desenhador (drawer). To facilitate communication with the thread, two queues are created: one to receive messages from the thread (self.queue) and another, self.queueStop, to wait for the thread’s completion (more details later). func and preFunc are two functions used to facilitate testing, so that the functions that draw the image can be passed as parameters. preFunc generates the first image, and func is called within a loop to generate the subsequent frames. The drawer thread is started immediately after its creation.

    def setup_thread(self, desenhador, func, preFunc):
        self.desenhador = desenhador(self.queue, self.queueStop, func, preFunc)
        self.desenhador.start()

Once the window and the thread that updates the images have been created, we need a method that periodically checks if there are new images in the queue. This method is check_queue, called in __init__ with self.after(1, self.check_queue). The use of self.after is crucial because it starts executing check_queue outside of __init__, after the window and event loop have been created.

check_queue checks if the queue with the images generated by the thread is empty. If it is, it does nothing, but if not, it takes the new image and swaps the image on the Canvas. In the end, it schedules to run again 10 ms later and repeats this process to swap the images as quickly as possible.

    def check_queue(self):
        if not self.queue.empty():
            contador, self.buffer = self.queue.get()
            self.status.set(f"Frame: {contador}")
            self.canvas.itemconfig(self.image, image=self.buffer)
            self.queue.task_done()
        if self.running:
            self.after(10, self.check_queue)
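
The polling idea can be tried without a window. The sketch below is a minimal, GUI-free illustration, not the article's code: a producer thread stands in for the drawer, and drain_latest (a hypothetical helper) plays the role of check_queue. It assumes that skipping straight to the newest frame is acceptable, whereas check_queue takes one image per poll, and it uses get_nowait() instead of empty() followed by get().

```python
import queue
import threading


def producer(q, frames):
    """Stand-in for the drawer thread: pushes (counter, payload) tuples."""
    for counter in range(frames):
        q.put((counter, f"frame-{counter}"))


def drain_latest(q):
    """Return the newest item waiting in the queue, or None if it is empty.

    Never blocks, so it would be safe to call from a GUI callback.
    """
    latest = None
    while True:
        try:
            latest = q.get_nowait()
        except queue.Empty:
            return latest
        q.task_done()


frames_queue = queue.Queue()
worker = threading.Thread(target=producer, args=(frames_queue, 5))
worker.start()
worker.join()  # the real program keeps polling instead of joining here

print(drain_latest(frames_queue))  # newest frame that was produced
print(drain_latest(frames_queue))  # nothing left in the queue
```

In a tkinter application, the body of drain_latest would run inside the method scheduled with after, so the window stays responsive between polls.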

When working with multiple threads in tkinter, as in most GUI frameworks, we can normally only change objects managed by the framework from the thread that runs the mainloop. That is why the image is swapped in check_queue. This also creates other coordination problems between the threads and the GUI. For example, the image conversion performed in the drawing thread (details later) needs tkinter to be running and processing events, even though the object is off-screen and not associated with any widget. This is a characteristic of tkinter.

That is why terminate does not destroy the window immediately. It first stops the drawer with self.desenhador.stop() and then calls check_thread_dead, which keeps rescheduling itself on the tkinter main loop until the drawer has really stopped. This is where the other queue, queueStop, comes in: it remains empty during the program’s execution and only receives something when the drawer’s loop finishes its work. Only then is the window destroyed with the call to self.destroy().

    def check_thread_dead(self):
        if self.queueStop.empty() and not self.dead:
            self.after(1, self.check_thread_dead)
            return
        self.queueStop.get()
        self.dead = True
        self.desenhador.join()
        self.destroy()

    def terminate(self, e=None):
        self.running = False
        if not self.dead:
            self.desenhador.stop()
            self.check_thread_dead()
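
The same stop-then-wait handshake can be sketched without tkinter. This is an illustration under assumptions, not the article's implementation: Worker is a hypothetical stand-in for the drawer, and it uses a threading.Event in place of the boolean running flag, which is a swapped-in technique.

```python
import queue
import threading
import time


class Worker(threading.Thread):
    """Hypothetical stand-in for the drawer thread."""

    def __init__(self, done_queue):
        super().__init__()
        self.done_queue = done_queue
        self.stop_requested = threading.Event()

    def run(self):
        try:
            while not self.stop_requested.is_set():
                time.sleep(0.001)  # pretend to draw one frame
        finally:
            # Same role as queueStop.put((0, 'DONE')) in the article
            self.done_queue.put("DONE")

    def stop(self):
        self.stop_requested.set()


done = queue.Queue()
worker = Worker(done)
worker.start()
time.sleep(0.02)
worker.stop()
message = done.get(timeout=5)  # the GUI version polls with after() instead
worker.join()
print(message, worker.is_alive())
```

The blocking done.get() is fine in a console script; in the window, blocking would freeze the event loop, which is why check_thread_dead reschedules itself instead.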

This is all just to have the window being updated by another thread. We still haven’t drawn anything. Let’s look at an implementation of the drawer:

class Desenha(Thread):
    def __init__(self, queue, queueStop, func, preFunc, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.queue = queue
        self.queueStop = queueStop
        self.running = True
        self.func = func
        self.preFunc = preFunc or func

    def run(self):
        try:
            data = numpy.zeros((ALTURA, LARGURA, 3), dtype=numpy.uint8)
            c = 0
            self.preFunc(data, c, LARGURA, ALTURA)
            while self.running:
                with TimeIt("Loop") as t:
                    # with TimeIt("ForLoop") as t:
                    self.func(data, c, LARGURA, ALTURA)
                    # with TimeIt("FROM ARRAY") as t1:
                    image = Image.fromarray(data)
                    # with TimeIt("Convert") as t2:
                    converted_image = ImageTk.PhotoImage(image)
                    # with TimeIt("Queue") as t3:
                    self.queue.put((c, converted_image))
                    c += 1
        finally:
            self.running = False
            self.queueStop.put((0, 'DONE'))

    def stop(self):
        self.running = False

The Desenha class receives the queues to which it will send the images that will be created within run and also the message indicating it has finished. The work itself is performed inside run, which is executed when the thread is started.

Since the matrices are large, easily over 1 million elements for 1024 x 1024 images, Desenha uses optimized numpy arrays. It would simply be much slower to work with Python lists to perform these operations, as we have to fill all the points for each image.

If you are not familiar with NumPy, it is a library widely used in data science and in other areas that need matrix operations and mathematical calculations in Python. The documentation is available on the NumPy website. The great advantage of NumPy is that its core is implemented in C; it is also part of the SciPy ecosystem.

Returning to run, it creates a matrix large enough to represent the points of the new image. Its shape is (ALTURA, LARGURA, 3): height, width, and one entry for each RGB color component (red, green, and blue; one byte each). Thus, with an image of 1024 x 1024 points, we have 1024 x 1024 x 3 = 3,145,728 bytes just to store the matrix of points.
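
The size estimate is easy to reproduce. The helper below is a hypothetical convenience (frame_bytes is not part of the article's code) that multiplies the dimensions; the second call assumes a single-channel matrix of 4-byte integers.

```python
def frame_bytes(width, height, channels=3, bytes_per_value=1):
    """Bytes needed for one uncompressed frame stored as a dense array."""
    return width * height * channels * bytes_per_value


rgb = frame_bytes(1024, 1024)  # 3 channels, 1 byte each
one_channel_32bit = frame_bytes(1024, 1024, channels=1, bytes_per_value=4)

print(f"RGB uint8 frame: {rgb} bytes ({rgb / 2**20:.1f} MiB)")
print(f"single-channel uint32 matrix: {one_channel_32bit} bytes")
```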

Once the matrix is created, run calls the drawing function self.preFunc, which draws the first image, passing the matrix, a frame counter, and the dimensions of the image. This signature evolved as I ran tests. Then, within the main loop, it calls self.func with the same parameters to create the subsequent images. This organization with preFunc and func was necessary to better visualize the data, as the drawing algorithm I started using for testing did not provide quick visual feedback. So, I used preFunc to draw an image and func to modify it, for example by moving its lines up.

Since we need to swap the images as quickly as possible (30 frames per second = ~33 ms between each image), run must execute its loop as quickly as possible.
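
To make the time budget concrete, here is a small sketch that converts a target frame rate into the time available per frame. The function names are hypothetical helpers for illustration.

```python
def frame_budget_ms(fps):
    """Milliseconds available to produce one frame at the given frame rate."""
    return 1000.0 / fps


def fits_budget(elapsed_ms, fps):
    """True if a frame that took elapsed_ms keeps up with the target rate."""
    return elapsed_ms <= frame_budget_ms(fps)


for target in (15, 30):
    print(f"{target} fps -> {frame_budget_ms(target):.1f} ms per frame")
```

Everything inside the loop in run, including the image conversion and the queue hand-off, has to fit inside that budget.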

The next step, which is independent of how the image was created, is to transform the point matrix into an image. This transformation is performed by image = Image.fromarray(data). At this point we have an image, but in the format of Pillow (PIL), the imaging library used here. To convert it to a tkinter image, also using Pillow, we call converted_image = ImageTk.PhotoImage(image). converted_image is then ready to go to the queue and be drawn on the screen, and we can move on to drawing the next image.

Curiously, it was precisely ImageTk.PhotoImage that complicated the thread termination: this class needs the tkinter event loop to be running, which is why the termination had to be coordinated through queueStop.

The loop in run keeps running until self.running is False. And that is exactly what the stop method does.

Since the drawing thread is independent of the main program thread, where tkinter runs, the stop request can arrive at any point in the loop. That is why we cannot shut tkinter down until the current iteration finishes and the loop reaches the while that checks self.running again.

Upon exiting the loop, a message is posted to queueStop. This message serves as a signal for the main loop to continue its termination and subsequently close the window.

You may have noticed several commented calls to TimeIt. This class was created just to measure the execution time of some blocks, once I realized the program was very slow.

class TimeIt:
    """Class to measure the execution time of some blocks.
       Should be used as a context manager, in a with block."""
    def __init__(self, name, silent=False):
        self.name = name
        self.start = 0
        self.end = 0
        self.silent = silent

    def __enter__(self):
        self.start = datetime.now()
        return self  # so that "with TimeIt(...) as t" binds the instance

    def __exit__(self, *args, **kwargs):
        self.end = datetime.now()
        if not self.silent:
            seconds = self.elapsed().total_seconds()
            if seconds == 0:
                return
            fps = 1.0 / seconds
            print(f"Elapsed {self.name}: {self.elapsed()} Frames: {fps}")

    def elapsed(self):
        return self.end - self.start
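
For comparison, a similar timer can be built on time.perf_counter, a clock intended for measuring short intervals. This is an alternative sketch, not the class used in the article; the name Stopwatch is hypothetical.

```python
import time


class Stopwatch:
    """Context manager that records the elapsed time of a with block."""

    def __init__(self, name):
        self.name = name
        self.seconds = 0.0

    def __enter__(self):
        self._start = time.perf_counter()
        return self  # lets "with Stopwatch(...) as sw" bind the instance

    def __exit__(self, exc_type, exc, tb):
        self.seconds = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised in the block


with Stopwatch("sum") as sw:
    total = sum(range(100_000))

print(f"{sw.name}: {sw.seconds * 1000:.3f} ms")
```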

Before optimizing, it was necessary to find the source of the slowness. In the loop, the call to self.func always dominated the execution time. You can uncomment the TimeIt lines and indent the line that follows each one to print the individual timings. Image.fromarray and ImageTk.PhotoImage execute very quickly, in the range of 1 ms on my computer. The drawing function, on the other hand, took up to 3 s (3000 ms) at the beginning. Remember that each frame must be drawn in at most 33 ms to reach 30 frames per second.

Let’s look at a simple drawing function:

def draw(data, c, largura, altura):
    for y in numpy.arange(0, altura):
        for x in numpy.arange(0, largura):
            data[y, x] = [0, 0, y // (c + 1)]

This function simply draws a series of stripes on the screen, setting the blue component of each point to the integer division of the current line number by the frame counter plus one (c + 1). The idea was just to traverse the points of the image and be able to visualize it on the screen as quickly as possible.

This function has horrible performance:

Elapsed Loop: 0:00:01.427400 Frames: 0.7005744710662744
Elapsed Loop: 0:00:01.316119 Frames: 0.7598097132554122
Elapsed Loop: 0:00:01.308270 Frames: 0.764368211454822
Elapsed Loop: 0:00:01.341486 Frames: 0.7454419949220491
Elapsed Loop: 0:00:01.359058 Frames: 0.7358037699641957

Less than 1 frame per second, since we are spending more than 1s to generate a single image.

Blue Screen

As time goes by, the image gets darker because c increases with each frame. But at this speed the animation is so slow that you can barely notice any change on the screen.

Even using NumPy, the execution time of the drawing loop was very high. I then decided to use another library called Numba. Numba is a JIT (just-in-time) compiler for Python. With it, you annotate your functions, and they are compiled the first time they are called. On later calls, the compiled, optimized version runs instead of the original, with native-code performance (as long as interaction with the interpreter is limited). Let’s see what we need to change to use Numba:

@jit(nopython=True, parallel=True, fastmath=True, nogil=True)
def drawNumba(data, c, largura, altura):
    for y in numpy.arange(0, altura):
        for x in numpy.arange(0, largura):
            data[y, x] = [0, 0, y // (c + 1)]

The code is the same; we simply added the @jit decorator from Numba to mark that we want this function to be optimized. Nothing else in the code was changed, except for the import of Numba itself. For the rest of the program, the function behaves the same way as before.

Let’s see the result with Numba:

Elapsed PreLoop: 0:00:01.276058 Frames: 0.7836634384957424
Elapsed Loop: 0:00:00.210905 Frames: 4.74147127853773
Elapsed Loop: 0:00:00.205027 Frames: 4.877406390377853
Elapsed Loop: 0:00:00.219663 Frames: 4.552428037493797
Elapsed Loop: 0:00:00.220664 Frames: 4.5317768190552155
Elapsed Loop: 0:00:00.209917 Frames: 4.763787592238837

I added a TimeIt context to measure the execution time of the first call to the function, in this case preFunc. Note that this first call was practically as slow as the non-accelerated version, because it includes the compilation. From the second call on, however, the execution time dropped from 1.27 s to 0.21 s, raising our frames per second to about 4.7 (the number of frames on the screen may be a bit lower due to communication with tkinter). The accelerated version with Numba runs in about 16% of the time, meaning it is roughly 6 times faster. All this with an installation via pip and two lines in the code. But 4 frames per second is still very slow and far from the desired 30, and so far I haven’t even started to create the flame effect.
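
The ratio can be checked directly from the logged timings. The sketch below copies the loop times printed earlier for the pure-Python run and for the Numba run (excluding the PreLoop call) and compares their means.

```python
# Loop times in seconds, copied from the two logs above
python_loops = [1.427400, 1.316119, 1.308270, 1.341486, 1.359058]
numba_loops = [0.210905, 0.205027, 0.219663, 0.220664, 0.209917]

mean_python = sum(python_loops) / len(python_loops)
mean_numba = sum(numba_loops) / len(numba_loops)
speedup = mean_python / mean_numba

print(f"pure Python: {mean_python:.3f} s per frame")
print(f"Numba:       {mean_numba:.3f} s per frame")
print(f"speed-up:    {speedup:.1f}x ({mean_numba / mean_python:.0%} of the time)")
```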

Another alternative is to use a compiled module created with Cython. Cython (not to be confused with CPython, the standard interpreter) is a compiler that translates a Python-like language into a C extension module that Python can import.

To use Cython, we need to make more significant changes. First, install Cython and make sure a C/C++ compiler is installed on the machine. On Windows with Python 3.8, I used Visual Studio 2019 without problems.

A program in Cython is written in a file with the .pyx extension. Converting the drawing function to Cython, we have:

import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport abs
from libc.stdlib cimport rand


ctypedef np.uint8_t DTYPE_t
ctypedef np.uint32_t DTYPE32_t


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def draw2(np.ndarray[DTYPE_t, ndim=3] data, int c, int max_x, int max_y):
    cdef int x, y
    cdef int ic = c
    cdef np.ndarray[DTYPE_t, ndim=3] h = data
    cdef int cmax_y = max_y, cmax_x = max_x
    for y in range(cmax_y):
        for x in range(cmax_x):
            h[y, x, 0] = 0
            h[y, x, 1] = 0
            h[y, x, 2] = y / (ic + 1)

Very similar to Python and C.

Cython also requires the configuration of a setup.py to compile the module.

from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    name='Screen Generator',
    ext_modules=cythonize("compute.pyx", annotate=True, language_level=3),
    include_dirs=[numpy.get_include()],
    zip_safe=False,
)

And it needs to be compiled with:

python setup.py build_ext --inplace

But the results are very good:

Elapsed Loop: 0:00:00.023445 Frames: 42.65301770100235
Elapsed Loop: 0:00:00.022442 Frames: 44.55930843953302
Elapsed Loop: 0:00:00.023414 Frames: 42.70949004868882
Elapsed Loop: 0:00:00.024410 Frames: 40.96681687832855
Elapsed Loop: 0:00:00.023431 Frames: 42.67850283812044
Elapsed Loop: 0:00:00.022455 Frames: 44.533511467379206
Elapsed Loop: 0:00:00.023436 Frames: 42.66939750810719

Now we went from 4 to 40 frames per second and generated a new image in just 23 ms!

In fact, it became so fast that the image turns black very quickly. To make visualization easier, I created another function called drawUp. Instead of darkening the image, it scrolls the lines up the screen, so the program can run longer without the screen turning black.

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def drawUp(np.ndarray[DTYPE_t, ndim=3] data, int c, int max_x, int max_y):
    cdef int x, y
    cdef int ic = c
    cdef np.ndarray[DTYPE_t, ndim=3] h = data
    cdef int cmax_y = max_y, cmax_x = max_x
    # Copy top to bottom
    for x in range(0, cmax_x):
        h[cmax_y - 2, x, 0] = h[0, x, 0]
        h[cmax_y - 2, x, 1] = h[0, x, 1]
        h[cmax_y - 2, x, 2] = h[0, x, 2]
    for y in range(1, cmax_y - 1):
        for x in range(0, cmax_x):
            h[y - 1, x, 0] = h[y, x, 0]
            h[y - 1, x, 1] = h[y, x, 1]
            h[y - 1, x, 2] = h[y, x, 2]

This change is what led to the separation between preFunc and func. draw2 acts as preFunc and creates an image like the one generated in pure Python. drawUp acts as func and simply rolls the lines of the image, copying the top line down and moving the other lines up.
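
The rolling idea itself fits in one line of pure Python. The sketch below is a simplified illustration using a list of rows: it performs a plain rotation and returns a new list, whereas drawUp copies values in place, element by element, and treats the last rows slightly differently. roll_up is a hypothetical name.

```python
def roll_up(rows):
    """Return a new list with every row moved up by one; the top row wraps to the bottom."""
    return rows[1:] + rows[:1]


image = [[10, 10], [20, 20], [30, 30], [40, 40]]
rolled = roll_up(image)
print(rolled)
```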

At this point, both the performance problem and the window termination issue have been resolved. We just need to convert the algorithm to generate the flames.

The first step is to generate a compatible color palette, since the algorithm uses 256 colors to indicate the intensity of the fire.

Converting to Python, we have something like:

def build_fire_palette():
    palette = numpy.zeros((256, 3), dtype=numpy.uint8)
    for x in range(256):
        h = x // 3
        saturation = 100
        b = min(256, x * 2) / 256.0 * 100.0
        css = f"hsl({h},{saturation}%,{b}%)"
        palette[x] = ImageColor.getrgb(css)
    return palette

The palette is simply a color table that we will use to transform a value between 0 and 255 (byte) into an RGB color (red, green, and blue with 3 bytes).
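
To see what such a table contains without installing Pillow, the same hue and lightness formulas can be fed to the standard library's colorsys module. This is a swapped-in technique for illustration only; build_fire_palette_colorsys is a hypothetical name, and rounding may differ slightly from what ImageColor.getrgb returns.

```python
import colorsys


def build_fire_palette_colorsys():
    """256 (r, g, b) tuples: hue x // 3 degrees, full saturation,
    lightness min(256, 2 * x) / 256."""
    palette = []
    for x in range(256):
        hue = (x // 3) / 360.0                    # colorsys expects 0..1
        lightness = min(256, x * 2) / 256.0
        r, g, b = colorsys.hls_to_rgb(hue, lightness, 1.0)
        palette.append((round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = build_fire_palette_colorsys()
for index in (0, 64, 128, 255):
    print(index, palette[index])
```

Low indexes come out black, the middle of the first half is a saturated red-orange, and with this lightness formula everything from index 128 up is white.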

A problem arises with the drawer, as the Desenha class does not support color palettes. We will need another drawer:

class DesenhaComPalette(Desenha):
    def run(self):
        try:
            palette = build_fire_palette()
            data = numpy.zeros((ALTURA, LARGURA), dtype=numpy.uint8)
            fogo = numpy.zeros((ALTURA, LARGURA), dtype=numpy.uint32)
            c = 0
            while self.running:
                with TimeIt("Loop") as t:
                    # with TimeIt("ForLoop") as t:
                    self.func(data, c, LARGURA, ALTURA, fogo)
                    # with TimeIt("FROM ARRAY") as t1:
                    image = Image.fromarray(data, mode="P")
                    image.putpalette(palette)
                    # with TimeIt("Convert") as t2:
                    converted_image = ImageTk.PhotoImage(image)
                    # with TimeIt("Queue") as t3:
                    self.queue.put((c, converted_image))
                    c += 1
        finally:
            self.running = False
            self.queueStop.put((0, 'DONE'))

The difference is that we create the image differently, because we have to pass the points (with colors 0 to 255) and the palette (which translates each index to a color). We also create the fogo (fire) matrix, but as a 32-bit integer matrix rather than a byte matrix. This increases the memory used, but it is necessary for the flame algorithm, which keeps the fire information from one frame to the next. In data, we store the points as palette indexes (256 colors).

The algorithm converted to Python looks like this:

def desenhaPythonFlamas(data, c, largura, altura, fogo):
    # Seed the bottom row with random heat values
    for x in range(largura):
        fogo[altura - 1, x] = int(min(random.random() * 2048, 2048))

    # Each cell becomes a slightly cooled average of the cells below it
    for y in range(1, altura - 2):
        for x in range(0, largura):
            v = int((fogo[(y + 1) % altura, x] +
                     fogo[(y + 1) % altura, (x - 1) % largura] +
                     fogo[(y + 1) % altura, (x + 1) % largura] +
                     fogo[(y + 2) % altura, x]) * 32) // 129
            fogo[y, x] = v
    # Reduce the heat values to palette indexes (0 to 255)
    for y in range(altura):
        for x in range(largura):
            data[y, x] = fogo[y, x] % 256
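
The factor 32 / 129 is what makes the flames fade as they rise: summing four neighbours and multiplying by 32 / 129 gives their average scaled by 128 / 129, slightly less than one. The sketch below isolates that step for a single cell, assuming integer division as in the Cython version shown later; next_heat is a hypothetical helper.

```python
def next_heat(below, below_left, below_right, two_below):
    """Heat of one cell, computed from the three cells under it
    and the cell two rows down."""
    return (below + below_left + below_right + two_below) * 32 // 129


heat = 2047  # hottest value the bottom row can receive
for row in range(5):
    print(f"{row} rows above the source: {heat}")
    heat = next_heat(heat, heat, heat, heat)
```

Even when all four neighbours are equally hot, each row is a little cooler than the one below it; with random neighbours the cooling is irregular, which produces the flame shapes.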

The pure-Python version runs ultra slow, as expected:

Elapsed Loop: 0:00:06.345203 Frames: 0.15759937073723254
Elapsed Loop: 0:00:06.327644 Frames: 0.15803670370836284
Elapsed Loop: 0:00:06.362772 Frames: 0.15716420453223848
Elapsed Loop: 0:00:06.387171 Frames: 0.15656383710409505

It takes an enormous 6s to generate just one screen with the flames! Let’s move on to the optimized version with Numba, simply by adding the decorator, as we did earlier.

Elapsed Loop: 0:00:00.022445 Frames: 44.55335263978615
Elapsed Loop: 0:00:00.024425 Frames: 40.941658137154555
Elapsed Loop: 0:00:00.024421 Frames: 40.94836411285369
Elapsed Loop: 0:00:00.024396 Frames: 40.99032628299721

Much better! We reached over 30 frames per second, as hoped, taking about 23 ms to generate a frame. Keep in mind that all these timings are for images of 1024 x 1024 points. If you have a slower computer, you can reduce the screen size.

And how would it look in Cython? Let’s find out:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def desenhaflamas(np.ndarray[DTYPE_t, ndim=2] data,
                  int c, int max_x, int max_y,
                  np.ndarray[DTYPE32_t, ndim=2] fogo):
    cdef int x, y
    cdef int ic = c
    cdef np.ndarray[DTYPE_t, ndim=2] d = data
    cdef np.ndarray[DTYPE32_t, ndim=2] f = fogo
    cdef int cmax_y = max_y, cmax_x = max_x

    for x in range(cmax_x):
        f[cmax_y - 1, x] = abs(32768 + rand()) % 2048

    for y in range(1, cmax_y - 2):
        for x in range(0, cmax_x):
            f[y, x] = ((f[(y + 1) % cmax_y, x] +
                        f[(y + 1) % cmax_y, (x - 1) % cmax_x] +
                        f[(y + 1)% cmax_y, (x + 1) % cmax_x] +
                        f[(y + 2)% cmax_y, x]) * 32) / 129
    for y in range(max_y):
        for x in range(max_x):
            d[y, x] = f[y, x] % 256

Which has performance:

Elapsed Loop: 0:00:00.026373 Frames: 37.917567208887874
Elapsed Loop: 0:00:00.026379 Frames: 37.90894271958755
Elapsed Loop: 0:00:00.028309 Frames: 35.324455120279765
Elapsed Loop: 0:00:00.030253 Frames: 33.05457310018841
Elapsed Loop: 0:00:00.028315 Frames: 37.91612952149844
Elapsed Loop: 0:00:00.026374 Frames: 37.91612952149844

It got a little worse than with Numba. I believe some detail in the Cython code is holding it back. But we can already see the effect running:

Flames

How to run all this in one program? We need a configuration section:

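if len(sys.argv) < 5:
    print("Usage: python desenha.py <algorithm> <accelerator> <width> <height>")
    print("Algorithm: drawing, flames")
    print("Accelerator: cython, python, numba")
    sys.exit(4)  # stop here; reading sys.argv below would raise IndexError

ALGORITHM = sys.argv[1].lower()
ACCELERATOR = sys.argv[2].lower()
WIDTH = int(sys.argv[3])
HEIGHT = int(sys.argv[4])
# The classes and drawing functions read the dimensions as LARGURA and ALTURA
LARGURA, ALTURA = WIDTH, HEIGHT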

print(f"ALGORITHM: {ALGORITHM}")
print(f"ACCELERATOR: {ACCELERATOR}")
print(f"WIDTH: {WIDTH} HEIGHT: {HEIGHT}")

CONFIGURATION = {
    "flames": {"drawer": DesenhaComPalette,
               "optimization": {"python": (desenhaPythonFlamas.py_func, None),
                                "cython": (desenhaflamas, None),
                                "numba": (desenhaPythonFlamas, None)
                                }},
    "drawing": {"drawer": Desenha,
                "optimization": {"python": (drawNumba.py_func, None),
                                 "cython": (drawUp, draw2),
                                 "numba": (drawNumba, None)
                                 }}
}

if ALGORITHM not in CONFIGURATION:
    print(f"Algorithm {ALGORITHM} is invalid", file=sys.stderr)
    sys.exit(1)

if ACCELERATOR not in CONFIGURATION[ALGORITHM]["optimization"]:
    print(f"Accelerator {ACCELERATOR} is invalid", file=sys.stderr)
    sys.exit(2)

if HEIGHT < MIN_V or WIDTH < MIN_V or HEIGHT > MAX_V or WIDTH > MAX_V:
    print(f"Height and width must be values between {MIN_V} and {MAX_V}.")
    sys.exit(3)

drawer = CONFIGURATION[ALGORITHM]["drawer"]
func = CONFIGURATION[ALGORITHM]["optimization"][ACCELERATOR][0]
prefunc = CONFIGURATION[ALGORITHM]["optimization"][ACCELERATOR][1]

app = App(desenhador=drawer, func=func, preFunc=prefunc)
app.mainloop()
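
The lookup logic can be exercised on its own. The sketch below is a reduced illustration with placeholder strings instead of the real classes and functions; select and CONFIG are hypothetical names, and raising ValueError instead of calling sys.exit is an assumption made so the function is easy to test.

```python
# Placeholder names only; the real program stores classes and functions here.
CONFIG = {
    "flames": {"drawer": "DesenhaComPalette",
               "optimization": {"python": ("flames_py", None),
                                "numba": ("flames_numba", None)}},
    "drawing": {"drawer": "Desenha",
                "optimization": {"python": ("draw_py", None),
                                 "cython": ("drawUp", "draw2")}},
}


def select(config, algorithm, accelerator):
    """Return (drawer, func, pre_func) or raise ValueError with a readable message."""
    if algorithm not in config:
        raise ValueError(f"Algorithm {algorithm} is invalid")
    options = config[algorithm]["optimization"]
    if accelerator not in options:
        raise ValueError(f"Accelerator {accelerator} is invalid")
    func, pre_func = options[accelerator]
    return config[algorithm]["drawer"], func, pre_func


print(select(CONFIG, "drawing", "cython"))
```

Keeping the table as plain data means a new algorithm or accelerator only needs a new dictionary entry, not another branch of if statements.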

Phew, finally running! I hope you enjoyed the article and that you became curious about performance in Python, Cython, and Numba. The complete code is published on GitHub: https://github.com/lskbr/flamas_em_python

And what if we used this code to simulate the Game of Life? That will be for another article.