OpenGL Compute Shaders: SPA

This notebook is the third in a series about general-purpose GPU (GPGPU) computing using OpenGL’s compute shaders, with the goal of an accurate and fast GPU implementation of Reda & Andreas’s Solar Position Algorithm (SPA). There’s still some work to do, but what I have is already useful in many contexts, and as I get closer to the finish line I wanted to do some more rigorous validation of both accuracy and runtime, using pvlib’s numpy and numba implementations as the baseline.

These comparisons (both error and runtime) will be specific to this particular GPU. I will be interested to see how the results vary across devices…

import moderngl
import numpy as np
import pandas as pd
import pvlib
import matplotlib.pyplot as plt
import time

This function does the data processing needed to shuffle the arguments to and from the OpenGL kernel, as well as coordinate the kernel execution itself. Again, no promises that this is the best way to do it – but it works!

context = moderngl.create_standalone_context(require=430)  # compute shaders require OpenGL 4.3

def gpu_spa(times, lat, lon, elevation=0, temperature=12, pressure=101325, delta_t=67, atmos_refract=0.5667, groupsize=1024):
    n_times = len(times)
    n_vars = 5  # number of outputs per timestamp

    # convert the DatetimeIndex to Unix seconds, as uint32 for the shader
    unixtimes = np.array(times.view(np.int64)/10**9, dtype=np.uint32)
    # pack the scalar inputs into a float32 array for the parameter buffer
    params = np.array([lat, lon, elevation, temperature, pressure, delta_t, atmos_refract], dtype='f4')

    # read the shader source and substitute in the local workgroup size
    with open("spa.glsl", "r") as f:
        source = f.read().replace("%%N%%", str(min(n_times, groupsize)))

    compute_shader = context.compute_shader(source)

    # allocate GPU buffers: timestamps in, parameters in, results out
    buffer_times = context.buffer(unixtimes)
    buffer_parameters = context.buffer(params)
    buffer_outputs = context.buffer(reserve=n_vars*4*n_times)

    # bind each buffer to the storage buffer binding point the shader expects
    buffer_times.bind_to_storage_buffer(0)
    buffer_parameters.bind_to_storage_buffer(1)
    buffer_outputs.bind_to_storage_buffer(2)

    # dispatch enough workgroups to cover every timestamp
    n_group = 1 + (n_times-1) // groupsize
    compute_shader.run(group_x=n_group)

    # read back and de-interleave the float32 outputs
    outputs = np.frombuffer(buffer_outputs.read(), dtype=np.float32)
    out = pd.DataFrame({
        'elevation': outputs[0::n_vars],
        'apparent_elevation': outputs[1::n_vars],
        'zenith': outputs[2::n_vars],
        'apparent_zenith': outputs[3::n_vars],
        'azimuth': outputs[4::n_vars],
    }, index=times)
    return out

Accuracy Benchmarks

First, check that the GPU implementation gives the same output as pvlib under several test scenarios. pvlib’s numpy implementation is the reference.

2019 1-minute simulation

def describe(x1, x2):
    d = x1 - x2
    return pd.DataFrame({
        'rmse': (d**2).mean()**0.5,
        'mbe': d.mean(),
        'mae': d.abs().mean(),
    }).T
times = pd.date_range('2019-01-01', '2020-01-01', freq='T')
gpu = gpu_spa(times, 40, -80)
cpu = pvlib.solarposition.get_solarposition(times, 40, -80)
describe(cpu, gpu)
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 0.001248 0.001248 0.000568 0.000334 NaN 0.000334
mbe -0.000004 0.000004 -0.000219 -0.000002 NaN 0.000002
mae 0.000258 0.000258 0.000411 0.000257 NaN 0.000257
filt = cpu['zenith'] < 89
describe(cpu.loc[filt, :], gpu.loc[filt, :])
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 0.000323 0.000323 0.000561 0.000325 NaN 0.000325
mbe 0.000008 -0.000008 -0.000216 0.000008 NaN -0.000008
mae 0.000246 0.000246 0.000408 0.000248 NaN 0.000248

Error is very low for this test, although there is likely still some room for improvement. These results (and all others in this notebook) are still using the arcsin/arctan hack from the previous post, so there is some accuracy to be gained there if nowhere else.
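(For reference only, since the previous post has the details: the usual way to trade an arcsine for an arctangent is the identity \(\arcsin(x) = \arctan\bigl(x / \sqrt{1 - x^2}\bigr)\), which is presumably the kind of substitution at work here.)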

One of SPA’s claims to fame is its accuracy over a very wide range of years; I’m not necessarily interested in trying to match that, but we should at least test some years besides 2019:

1900–2100 hourly simulation

times = pd.date_range('1900-01-01', '2100-01-01', freq='H')
gpu = gpu_spa(times, 40, -80)
cpu = pvlib.solarposition.get_solarposition(times, 40, -80)
describe(cpu, gpu)
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 35.638818 35.638818 106.151313 35.629696 NaN 35.629696
mbe 0.411457 -0.411457 4.656588 0.408023 NaN -0.408023
mae 21.895442 21.895442 64.165011 21.890567 NaN 21.890567

Yikes! So something is going wrong somewhere… dropping nighttime and the atmospheric correction doesn’t fix it either:

filt = cpu['zenith'] < 89
describe(cpu.loc[filt, :], gpu.loc[filt, :])
apparent_elevation apparent_zenith azimuth elevation equation_of_time zenith
rmse 36.085840 36.085840 85.652276 36.084422 NaN 36.084422
mbe 16.209380 -16.209380 -21.528633 16.198121 NaN -16.198121
mae 22.415592 22.415592 53.339417 22.417768 NaN 22.417768

Turns out that it all goes haywire for dates prior to Jan 1 2000:

(gpu['zenith'] - cpu['zenith']).loc['1999-01-01':'2001-01-01'].plot()
[Figure: GPU minus CPU zenith error from 1999 through 2001, with large errors appearing only before 2000-01-01]
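One thing worth ruling out first (an aside of mine, not part of the original analysis): whether the timestamps themselves survive the trip to the shader, independent of the shader’s math. The uint32 cast in gpu_spa is an obvious suspect for old dates, so a quick CPU-side comparison of the Julian day computed from float64 seconds versus the uint32 round trip, for timestamps straddling 2000-01-01, shows whether the bug is in the timestamp handling or inside the shader itself:

check_times = pd.date_range('1999-12-31 18:00', '2000-01-01 06:00', freq='H')
unix_f64 = np.array(check_times.view(np.int64) / 10**9)    # float64 seconds since the epoch
unix_u32 = unix_f64.astype(np.uint32)                      # what gpu_spa actually sends
jd_f64 = unix_f64 / 86400 + 2440587.5                      # standard Unix-time-to-Julian-day conversion
jd_u32 = unix_u32.astype(np.float64) / 86400 + 2440587.5
print(np.max(np.abs(jd_f64 - jd_u32)))                     # ~0 would point the finger back at the shader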

Guess I need to take another look at my Julian date handlers! Moving on, let’s try out different locations to make sure things don’t get wonky at negative latitudes or near the poles or something:

Lat/Lon Sweep

times = pd.date_range('2019-01-01', '2020-01-01', freq='H')

rmse = []
mbe = []
mae = []

for lat in np.arange(-90, 90, 5):
    for lon in np.arange(-180, 180, 10):
        gpu = gpu_spa(times, lat, lon)
        cpu = pvlib.solarposition.get_solarposition(times, lat, lon)
        res = describe(cpu, gpu).assign(lat=lat, lon=lon)
        rmse.append(res.loc['rmse', :])
        mbe.append(res.loc['mbe', :])
        mae.append(res.loc['mae', :])

rmse = pd.concat(rmse, axis=1).T
mbe = pd.concat(mbe, axis=1).T
mae = pd.concat(mae, axis=1).T
for column in ['apparent_zenith', 'zenith', 'azimuth']:
    fig, axes = plt.subplots(1, 3, figsize=(13, 3), dpi=100)
    for i, (label, err) in enumerate([('rmse', rmse), ('mbe', mbe), ('mae', mae)]):
        piv = err.pivot_table(index='lat',
                              columns='lon',
                              values=column)
        pcm = axes[i].pcolormesh(piv.columns, piv.index, piv, shading='auto')
        axes[i].set_ylabel(r'Latitude [$\degree$]')
        axes[i].set_xlabel(r'Longitude [$\degree$]')
        axes[i].set_title(fr'{label} [$\degree$]')
        plt.colorbar(pcm, ax=axes[i])

    fig.suptitle(column)
    fig.tight_layout()
[Figures: maps of rmse, mbe, and mae versus latitude and longitude, one figure each for apparent_zenith, zenith, and azimuth]

There are some interesting patterns here that I’m not immediately sure how to interpret, but in any case the variation with location doesn’t seem very significant. Now let’s play with the other SPA knobs:

Refraction Correction

The SPA zenith refraction correction depends on annual average temperature and pressure. To make sure that correction is calculated correctly, let’s sweep across both and ensure that refraction-corrected (“apparent”) zenith still matches between the two implementations. The temperatures and pressures chosen here are each roughly representative of the ranges found on Earth, although certain combinations might not be particularly realistic.
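For reference, here is my reading of the correction formula from the Reda & Andreas paper (it also matches pvlib’s implementation as far as I can tell): with pressure in millibars, temperature in °C, and e0 the uncorrected topocentric elevation in degrees, the correction added to the elevation is roughly

def refraction_correction(pressure_mbar, temp_c, e0_deg):
    # correction in degrees, applied only when the sun is at or above the horizon
    # (SPA gates it on a threshold derived from atmos_refract)
    return ((pressure_mbar / 1010.0) * (283.0 / (273.0 + temp_c))
            * 1.02 / (60.0 * np.tan(np.radians(e0_deg + 10.3 / (e0_deg + 5.11)))))

Treat this as a sketch rather than the exact GLSL; the point is just that temperature and pressure enter only through this one expression, which is what the sweep below exercises.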

times = pd.date_range('2019-06-01', '2019-06-02', freq='10s')

n = 5
fig, axes = plt.subplots(n, n, sharex=True, sharey=True, dpi=75, figsize=(12, 12))

for i, temperature in enumerate(np.linspace(-30, 30, num=n)):
    for j, pressure in enumerate(np.linspace(30000, 102000, num=n)):
        gpu = gpu_spa(times, 40, -80, pressure=pressure, temperature=temperature)
        cpu = pvlib.solarposition.get_solarposition(times, 40, -80, pressure=pressure, temperature=temperature)
        rmse = np.mean((gpu['apparent_zenith'] - cpu['apparent_zenith'])**2)**0.5
        emax = np.max(np.abs(gpu['apparent_zenith'] - cpu['apparent_zenith']))
        axes[i, j].scatter(cpu['apparent_zenith'], gpu['apparent_zenith'], s=1)
        axes[i, j].axline((0, 0), slope=1, c='k')
        axes[i, j].set_title(fr'T={temperature:0.0f} $\degree$C  P={pressure/1000:0.01f} kPa')
        axes[i, j].text(0, 100, f'rmse={rmse:0.06f}\nmax={emax:0.06f}')

for i in range(n):
    axes[i, 0].set_ylabel('GPU')
    axes[n-1, i].set_xlabel('CPU')

fig.suptitle('Apparent Zenith')
fig.tight_layout()
[Figure: scatter plots of GPU vs. CPU apparent zenith for each temperature/pressure combination, with per-panel rmse and max error annotations]

All looks pretty good here. Now, on to comparing calculation times!

Execution Speed Benchmarks

My first-order mental model of GPU processing is that runtime is a combination of fixed costs (shader compilation, OpenGL context overhead) that are more or less constant for any input and variable costs (memory allocation, data transfer, and of course the actual algorithm runtime) that scale with the number of timestamps being simulated. So of course it will be interesting to compare runtimes across a range of \(N\), the number of simulated timestamps.
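Written out, that model is just a line in \(N\): total time ≈ fixed overhead + per-timestamp cost × \(N\). As a rough aside of mine (not part of the original benchmark), it could be sanity-checked later by fitting the measured GPU timings:

def runtime_model(n, t_fixed, t_per_timestamp):
    # fixed costs (compilation, context overhead) plus costs that scale with N
    return t_fixed + t_per_timestamp * n

# once the `timings` DataFrame below exists:
# t_per_timestamp, t_fixed = np.polyfit(timings.index, timings['gpu'], 1)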

import warnings
warnings.filterwarnings('ignore', message='reloading spa')

timings = []

for n in 10**np.arange(1, 8):
    times = pd.date_range('2019-01-01', freq='T', periods=n)
    timing = {'N': n}

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numpy')
    ed = time.perf_counter()
    timing['cpu (numpy)'] = ed - st

    # warmup
    _ = pvlib.solarposition.spa_python(times[:10], 40, -80, how='numba')

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=4)
    ed = time.perf_counter()
    timing['cpu (numba, 4 threads)'] = ed - st

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=8)
    ed = time.perf_counter()
    timing['cpu (numba, 8 threads)'] = ed - st

    st = time.perf_counter()
    _ = pvlib.solarposition.spa_python(times, 40, -80, how='numba', numthreads=16)
    ed = time.perf_counter()
    timing['cpu (numba, 16 threads)'] = ed - st

    st = time.perf_counter()
    _ = gpu_spa(times, 40, -80)
    ed = time.perf_counter()
    timing['gpu'] = ed - st

    timings.append(timing)

timings = pd.DataFrame(timings).set_index('N')
fig, axes = plt.subplots(2, 1, sharex=True, dpi=100, figsize=(6, 6))
timings.plot(ax=axes[0], logx=True, logy=True)
axes[0].set_ylabel('Elapsed Time [s]')
axes[0].grid()

ratio = 1/timings.divide(timings['cpu (numpy)'], axis=0)
ratio.plot(ax=axes[1], logx=True, logy=True)
axes[1].set_ylabel('Speed-up (numpy=1)')
axes[1].grid()

fig.tight_layout()
[Figure: elapsed time vs. N (top) and speed-up relative to numpy (bottom), log-log, for each implementation]

So, on this machine, the GPU implementation is:

  • Faster than the numpy implementation for all N

  • Faster than the numba implementation starting somewhere between 1000 and 10000 timestamps

  • Vastly faster than numpy (two orders of magnitude or more) for 100000 timestamps and beyond

So for an 8760 (a year of hourly values) you might expect roughly a (20x, 3x) speedup over (numpy, numba). For an annual 1-minute simulation (N=525600), the speedup might be more like (200x, 20x). Boggles the mind, especially considering that this GPU is a weeny integrated GPU (I think)!
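As a rough cross-check on those ballpark figures (my own back-of-envelope, interpolating the measured ratios on a log scale rather than re-running the benchmark at exactly those sizes):

gpu_vs_numpy = ratio['gpu']
gpu_vs_numba = ratio['gpu'] / ratio['cpu (numba, 16 threads)']
for n in (8760, 525600):
    logn = np.log(n)
    est_numpy = np.exp(np.interp(logn, np.log(ratio.index.values), np.log(gpu_vs_numpy.values)))
    est_numba = np.exp(np.interp(logn, np.log(ratio.index.values), np.log(gpu_vs_numba.values)))
    print(f'N={n}: ~{est_numpy:0.0f}x vs numpy, ~{est_numba:0.0f}x vs numba (16 threads)')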

Here are the same timings in tabular form, for reference:

timings
          cpu (numpy)  cpu (numba, 4 threads)  cpu (numba, 8 threads)  cpu (numba, 16 threads)       gpu
N
10           0.005542                0.000670                0.000681                  0.001133  0.002333
100          0.009041                0.000662                0.000809                  0.001022  0.001723
1000         0.016032                0.002164                0.001544                  0.001640  0.002802
10000        0.073568                0.015643                0.009942                  0.007940  0.002448
100000       0.655516                0.153031                0.091456                  0.069525  0.006387
1000000      8.010285                1.538795                0.917086                  0.703851  0.023457
10000000   102.137903               16.005205                9.922412                  7.520801  0.275046
ratio
          cpu (numpy)  cpu (numba, 4 threads)  cpu (numba, 8 threads)  cpu (numba, 16 threads)         gpu
N
10                1.0                8.266644                8.141435                  4.890245    2.375776
100               1.0               13.652682               11.175203                  8.847653    5.247172
1000              1.0                7.407335               10.382628                  9.775966    5.720831
10000             1.0                4.702821                7.399333                  9.265220   30.048933
100000            1.0                4.283561                7.167524                  9.428449  102.639141
1000000           1.0                5.205558                8.734495                 11.380647  341.483374
10000000          1.0                6.381543               10.293657                 13.580722  371.348489

Closing thoughts:

  • It’s a shame that the shader compilation cost is incurred on every invocation. I want to see if I can modify the shader in such a way that I can compile it once but apply it to an input of any length; that would hopefully make it more competitive with numba for the smaller inputs. (A rough sketch of one possible approach is below, after this list.)

  • Definitely need to fix that “reverse Y2K bug”…

  • Definitely want to re-run this notebook on a different computer. I’ve addressed the big error sources on this machine, but the fact that the trigonometric functions have such large error here gives me significant pause – maybe other computers will have totally different error sources. If the error is acceptable across machines (or I rewrite the code to only use “safe” GLSL functions), then evaluating the speed-up with a proper beefy GPU would be very cool.
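On that first point, here is one shape it might take (an untested sketch of mine, not code from the project): cache the compiled program keyed by the substituted workgroup size, so repeated calls stop paying the compile cost and only genuinely new sizes trigger a recompile. Passing the element count through the parameter buffer (or a uniform) with a bounds check in the shader would go further, but that means touching spa.glsl itself.

_shader_cache = {}

def get_compute_shader(n_times, groupsize=1024):
    # reuse a previously compiled program when the substituted local size repeats
    local_size = min(n_times, groupsize)
    if local_size not in _shader_cache:
        with open("spa.glsl", "r") as f:
            source = f.read().replace("%%N%%", str(local_size))
        _shader_cache[local_size] = context.compute_shader(source)
    return _shader_cache[local_size]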