OpenGL Compute Shaders: SPA

This notebook is the third in a series about general-purpose GPU (GPGPU) computing using OpenGL’s compute shaders, with the goal of an accurate and fast GPU implementation of Reda & Andreas’s Solar Position Algorithm (SPA). There’s still some work to do, but what I have is already useful in many contexts, and as I get closer to the finish line I wanted to do some more rigorous validation of both accuracy and runtime, using pvlib’s numpy and numba implementations as the baseline.

These comparisons (both error and runtime) will be specific to this particular GPU. I will be interested to see how the results vary across devices…

import moderngl
import numpy as np
import pandas as pd
import pvlib
import matplotlib.pyplot as plt
import time

This function does the data processing needed to shuffle the arguments to and from the OpenGL kernel, as well as coordinate the kernel execution itself. Again, no promises that this is the best way to do it – but it works!

context = moderngl.create_standalone_context(require=430)  # compute shaders require OpenGL 4.3

def gpu_spa(times, lat, lon, elevation=0, temperature=12, pressure=101325, delta_t=67, atmos_refract=0.5667, groupsize=1024):
    n_times = len(times)
    n_vars = 5  # number of outputs per timestamp

    # convert the DatetimeIndex to Unix seconds, as uint32 for the shader
    unixtimes = np.array(times.view(np.int64)/10**9, dtype=np.uint32)
    # pack the scalar inputs into a float32 array for the parameter buffer
    params = np.array([lat, lon, elevation, temperature, pressure, delta_t, atmos_refract], dtype='f4')

    # read the shader source and substitute in the local workgroup size
    with open("spa.glsl", "r") as f:
        source = f.read().replace("%%N%%", str(min(n_times, groupsize)))

    compute_shader = context.compute_shader(source)

    # allocate GPU buffers: timestamps in, parameters in, results out
    buffer_times = context.buffer(unixtimes)
    buffer_parameters = context.buffer(params)
    buffer_outputs = context.buffer(reserve=n_vars*4*n_times)

    # bind each buffer to the storage buffer binding point the shader expects
    buffer_times.bind_to_storage_buffer(0)
    buffer_parameters.bind_to_storage_buffer(1)
    buffer_outputs.bind_to_storage_buffer(2)

    # dispatch enough workgroups to cover every timestamp
    n_group = 1 + (n_times-1) // groupsize
    compute_shader.run(group_x=n_group)

    # read back and de-interleave the float32 outputs
    outputs = np.frombuffer(buffer_outputs.read(), dtype=np.float32)
    out = pd.DataFrame({
        'elevation': outputs[0::n_vars],
        'apparent_elevation': outputs[1::n_vars],
        'zenith': outputs[2::n_vars],
        'apparent_zenith': outputs[3::n_vars],
        'azimuth': outputs[4::n_vars],
    }, index=times)
    return out

Accuracy Benchmarks

First, check that the GPU implementation gives the same output as pvlib under several test scenarios. pvlib’s numpy implementation is the reference.

2019 1-minute simulation

def describe(x1, x2):
    d = x1 - x2
    return pd.DataFrame({
        'rmse': (d**2).mean()**0.5,
        'mbe': d.mean(),
        'mae': d.abs().mean(),
    }).T
times = pd.date_range('2019-01-01', '2020-01-01', freq='T')
gpu = gpu_spa(times, 40, -80)
cpu = pvlib.solarposition.get_solarposition(times, 40, -80)
describe(cpu, gpu)
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 0.001248 0.001248 0.000568 0.000334 NaN 0.000334
mbe -0.000004 0.000004 -0.000219 -0.000002 NaN 0.000002
mae 0.000258 0.000258 0.000411 0.000257 NaN 0.000257
filt = cpu['zenith'] < 89
describe(cpu.loc[filt, :], gpu.loc[filt, :])
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 0.000323 0.000323 0.000561 0.000325 NaN 0.000325
mbe 0.000008 -0.000008 -0.000216 0.000008 NaN -0.000008
mae 0.000246 0.000246 0.000408 0.000248 NaN 0.000248

Error is very low for this test, although there is likely still some room for improvement. These results (and all others in this notebook) are still using the arcsin/arctan hack from the previous post, so there is some accuracy to be gained there if nowhere else.
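(For reference only, since the previous post has the details: the usual way to trade an arcsine for an arctangent is the identity \(\arcsin(x) = \arctan\bigl(x / \sqrt{1 - x^2}\bigr)\), which is presumably the kind of substitution at work here.)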

One of SPA’s claims to fame is its accuracy over a very wide range of years; I’m not necessarily interested in trying to match that, but we should at least test some years besides 2019:

1900–2100 hourly simulation

times = pd.date_range('1900-01-01', '2100-01-01', freq='H')
gpu = gpu_spa(times, 40, -80)
cpu = pvlib.solarposition.get_solarposition(times, 40, -80)
describe(cpu, gpu)
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 35.638818 35.638818 106.151313 35.629696 NaN 35.629696
mbe 0.411457 -0.411457 4.656588 0.408023 NaN -0.408023
mae 21.895442 21.895442 64.165011 21.890567 NaN 21.890567

Yikes! So something is going wrong somewhere… dropping nighttime and the atmospheric correction doesn’t fix it either:

filt = cpu['zenith'] < 89
describe(cpu.loc[filt, :], gpu.loc[filt, :])
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 36.085840 36.085840 85.652276 36.084422 NaN 36.084422
mbe 16.209380 -16.209380 -21.528633 16.198121 NaN -16.198121
mae 22.415592 22.415592 53.339417 22.417768 NaN 22.417768

Turns out that it all goes haywire for dates prior to Jan 1 2000:

(gpu['zenith'] - cpu['zenith']).loc['1999-01-01':'2001-01-01'].plot()
[Figure: GPU minus CPU zenith error from 1999 through 2001, with large errors appearing only before 2000-01-01]
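One thing worth ruling out first (an aside of mine, not part of the original analysis): whether the timestamps themselves survive the trip to the shader, independent of the shader’s math. The uint32 cast in gpu_spa is an obvious suspect for old dates, so a quick CPU-side comparison of the Julian day computed from float64 seconds versus the uint32 round trip, for timestamps straddling 2000-01-01, shows whether the bug is in the timestamp handling or inside the shader itself:

check_times = pd.date_range('1999-12-31 18:00', '2000-01-01 06:00', freq='H')
unix_f64 = np.array(check_times.view(np.int64) / 10**9)    # float64 seconds since the epoch
unix_u32 = unix_f64.astype(np.uint32)                      # what gpu_spa actually sends
jd_f64 = unix_f64 / 86400 + 2440587.5                      # standard Unix-time-to-Julian-day conversion
jd_u32 = unix_u32.astype(np.float64) / 86400 + 2440587.5
print(np.max(np.abs(jd_f64 - jd_u32)))                     # ~0 would point the finger back at the shader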

Guess I need to take another look at my Julian date handlers! Moving on, let’s try out different locations to make sure things don’t get wonky at negative latitudes or near the poles or something:

Lat/Lon Sweep

times = pd.date_range('2019-01-01', '2020-01-01', freq='H')

rmse = []
mbe = []
mae = []

for lat in np.arange(-90, 90, 5):
    for lon in np.arange(-180, 180, 10):
        gpu = gpu_spa(times, lat, lon)
        cpu = pvlib.solarposition.get_solarposition(times, lat, lon)
        res = describe(cpu, gpu).assign(lat=lat, lon=lon)
        rmse.append(res.loc['rmse', :])
        mbe.append(res.loc['mbe', :])
        mae.append(res.loc['mae', :])

rmse = pd.concat(rmse, axis=1).T
mbe = pd.concat(mbe, axis=1).T
mae = pd.concat(mae, axis=1).T
for column in ['apparent_zenith', 'zenith', 'azimuth']:
    fig, axes = plt.subplots(1, 3, figsize=(13, 3), dpi=100)
    for i, (label, err) in enumerate([('rmse', rmse), ('mbe', mbe), ('mae', mae)]):
        piv = err.pivot_table(index='lat',
                              columns='lon',
                              values=column)
        pcm = axes[i].pcolormesh(piv.columns, piv.index, piv, shading='auto')
        axes[i].set_ylabel(r'Latitude [$\degree$]')
        axes[i].set_xlabel(r'Longitude [$\degree$]')
        axes[i].set_title(fr'{label} [$\degree$]')
        plt.colorbar(pcm, ax=axes[i])

    fig.suptitle(column)
    fig.tight_layout()
[Figures: maps of rmse, mbe, and mae versus latitude and longitude, one figure each for apparent_zenith, zenith, and azimuth]

There are some interesting patterns here that I’m not immediately sure how to interpret, but in any case the variation with location doesn’t seem very significant. Now let’s play with the other SPA knobs:

Refraction Correction

The SPA zenith refraction correction depends on annual average temperature and pressure. To make sure that correction is calculated correctly, let’s sweep across both and ensure that refraction-corrected (“apparent”) zenith still matches between the two implementations. The temperatures and pressures chosen here are each roughly representative of the ranges found on Earth, although certain combinations might not be particularly realistic.
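For reference, here is my reading of the correction formula from the Reda & Andreas paper (it also matches pvlib’s implementation as far as I can tell): with pressure in millibars, temperature in °C, and e0 the uncorrected topocentric elevation in degrees, the correction added to the elevation is roughly

def refraction_correction(pressure_mbar, temp_c, e0_deg):
    # correction in degrees, applied only when the sun is at or above the horizon
    # (SPA gates it on a threshold derived from atmos_refract)
    return ((pressure_mbar / 1010.0) * (283.0 / (273.0 + temp_c))
            * 1.02 / (60.0 * np.tan(np.radians(e0_deg + 10.3 / (e0_deg + 5.11)))))

Treat this as a sketch rather than the exact GLSL; the point is just that temperature and pressure enter only through this one expression, which is what the sweep below exercises.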

times = pd.date_range('2019-06-01', '2019-06-02', freq='10s')

n = 5
fig, axes = plt.subplots(n, n, sharex=True, sharey=True, dpi=75, figsize=(12, 12))

for i, temperature in enumerate(np.linspace(-30, 30, num=n)):
    for j, pressure in enumerate(np.linspace(30000, 102000, num=n)):
        gpu = gpu_spa(times, 40, -80, pressure=pressure, temperature=temperature)
        cpu = pvlib.solarposition.get_solarposition(times, 40, -80, pressure=pressure, temperature=temperature)
        rmse = np.mean((gpu['apparent_zenith'] - cpu['apparent_zenith'])**2)**0.5
        emax = np.max(np.abs(gpu['apparent_zenith'] - cpu['apparent_zenith']))
        axes[i, j].scatter(cpu['apparent_zenith'], gpu['apparent_zenith'], s=1)
        axes[i, j].axline((0, 0), slope=1, c='k')
        axes[i, j].set_title(fr'T={temperature:0.0f} $\degree$C  P={pressure/1000:0.01f} kPa')
        axes[i, j].text(0, 100, f'rmse={rmse:0.06f}\nmax={emax:0.06f}')

for i in range(n):
    axes[i, 0].set_ylabel('GPU')
    axes[n-1, i].set_xlabel('CPU')

fig.suptitle('Apparent Zenith')
fig.tight_layout()
[Figure: scatter plots of GPU vs. CPU apparent zenith for each temperature/pressure combination, with per-panel rmse and max error annotations]

All looks pretty good here. Now, on to comparing calculation times!

Execution Speed Benchmarks

My first-order mental model of GPU processing is that runtime is a combination of fixed costs (shader compilation, OpenGL context overhead) that are more or less constant for any input and variable costs (memory allocation, data transfer, and of course the actual algorithm runtime) that scale with the number of timestamps being simulated. So of course it will be interesting to compare runtimes across a range of \(N\), the number of simulated timestamps.
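Written out, that model is just a line in \(N\): total time ≈ fixed overhead + per-timestamp cost × \(N\). As a rough aside of mine (not part of the original benchmark), it could be sanity-checked later by fitting the measured GPU timings:

def runtime_model(n, t_fixed, t_per_timestamp):
    # fixed costs (compilation, context overhead) plus costs that scale with N
    return t_fixed + t_per_timestamp * n

# once the `timings` DataFrame below exists:
# t_per_timestamp, t_fixed = np.polyfit(timings.index, timings['gpu'], 1)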

import warnings
warnings.filterwarnings('ignore', message='reloading spa')

timings = []

for n in 10**np.arange(1, 8):
    times = pd.date_range('2019-01-01', freq='T', periods=n)
    timing = {'N': n}

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numpy')
    ed = time.perf_counter()
    timing['cpu (numpy)'] = ed - st

    # warmup
    _ = pvlib.solarposition.spa_python(times[:10], 40, -80, how='numba')

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=4)
    ed = time.perf_counter()
    timing['cpu (numba, 4 threads)'] = ed - st

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=8)
    ed = time.perf_counter()
    timing['cpu (numba, 8 threads)'] = ed - st

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=16)
    ed = time.perf_counter()
    timing['cpu (numba, 16 threads)'] = ed - st

    st = time.perf_counter()
    _ = gpu_spa(times, 40, -80)
    ed = time.perf_counter()
    timing['gpu'] = ed - st

    timings.append(timing)

timings = pd.DataFrame(timings).set_index('N')
fig, axes = plt.subplots(2, 1, sharex=True, dpi=100, figsize=(6, 6))
timings.plot(ax=axes[0], logx=True, logy=True)
axes[0].set_ylabel('Elapsed Time [s]')
axes[0].grid()

ratio = 1/timings.divide(timings['cpu (numpy)'], axis=0)
ratio.plot(ax=axes[1], logx=True, logy=True)
axes[1].set_ylabel('Speed-up (numpy=1)')
axes[1].grid()

fig.tight_layout()
[Figure: elapsed time vs. N (top) and speed-up relative to numpy (bottom), log-log, for each implementation]

So, on this machine, the GPU implementation is:

  • Faster than the numpy implementation for all N

  • Faster than the numba implementation starting somewhere between 1000 and 10000 timestamps

  • Vastly faster than numpy (two orders of magnitude or more) for 100000 timestamps and beyond

So for an 8760 (a year of hourly values) you might expect roughly a (20x, 3x) speedup over (numpy, numba). For an annual 1-minute simulation (N=525600), the speedup might be more like (200x, 20x). Boggles the mind, especially considering that this GPU is a weeny integrated GPU (I think)!
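As a rough cross-check on those ballpark figures (my own back-of-envelope, interpolating the measured ratios on a log scale rather than re-running the benchmark at exactly those sizes):

gpu_vs_numpy = ratio['gpu']
gpu_vs_numba = ratio['gpu'] / ratio['cpu (numba, 16 threads)']
for n in (8760, 525600):
    logn = np.log(n)
    est_numpy = np.exp(np.interp(logn, np.log(ratio.index.values), np.log(gpu_vs_numpy.values)))
    est_numba = np.exp(np.interp(logn, np.log(ratio.index.values), np.log(gpu_vs_numba.values)))
    print(f'N={n}: ~{est_numpy:0.0f}x vs numpy, ~{est_numba:0.0f}x vs numba (16 threads)')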

Here are the same timings in tabular form, for reference:

timings
          cpu (numpy)  cpu (numba, 4 threads)  cpu (numba, 8 threads)  cpu (numba, 16 threads)       gpu
N
10           0.005542                0.000670                0.000681                  0.001133  0.002333
100          0.009041                0.000662                0.000809                  0.001022  0.001723
1000         0.016032                0.002164                0.001544                  0.001640  0.002802
10000        0.073568                0.015643                0.009942                  0.007940  0.002448
100000       0.655516                0.153031                0.091456                  0.069525  0.006387
1000000      8.010285                1.538795                0.917086                  0.703851  0.023457
10000000   102.137903               16.005205                9.922412                  7.520801  0.275046
ratio
          cpu (numpy)  cpu (numba, 4 threads)  cpu (numba, 8 threads)  cpu (numba, 16 threads)         gpu
N
10                1.0                8.266644                8.141435                  4.890245    2.375776
100               1.0               13.652682               11.175203                  8.847653    5.247172
1000              1.0                7.407335               10.382628                  9.775966    5.720831
10000             1.0                4.702821                7.399333                  9.265220   30.048933
100000            1.0                4.283561                7.167524                  9.428449  102.639141
1000000           1.0                5.205558                8.734495                 11.380647  341.483374
10000000          1.0                6.381543               10.293657                 13.580722  371.348489

Closing thoughts:

  • It’s a shame that the shader compilation cost is incurred on every invocation. I want to see if I can modify the shader in such a way that I can compile it once but apply it to an input of any length; that would hopefully make it more competitive with numba for the smaller inputs. (A rough sketch of one possible approach is below, after this list.)

  • Definitely need to fix that “reverse Y2K bug”…

  • Definitely want to re-run this notebook on a different computer. I’ve addressed the big error sources on this machine, but the fact that the trigonometric functions have such large error here gives me significant pause – maybe other computers will have totally different error sources. If the error is acceptable across machines (or I rewrite the code to only use “safe” GLSL functions), then evaluating the speed-up with a proper beefy GPU would be very cool.
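On that first point, here is one shape it might take (an untested sketch of mine, not code from the project): cache the compiled program keyed by the substituted workgroup size, so repeated calls stop paying the compile cost and only genuinely new sizes trigger a recompile. Passing the element count through the parameter buffer (or a uniform) with a bounds check in the shader would go further, but that means touching spa.glsl itself.

_shader_cache = {}

def get_compute_shader(n_times, groupsize=1024):
    # reuse a previously compiled program when the substituted local size repeats
    local_size = min(n_times, groupsize)
    if local_size not in _shader_cache:
        with open("spa.glsl", "r") as f:
            source = f.read().replace("%%N%%", str(local_size))
        _shader_cache[local_size] = context.compute_shader(source)
    return _shader_cache[local_size]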