Matt Pharr’s blog

A Visit to the Sponza Palace’s Atrium

2023-07-10T00:00:00-07:00

Back in the early 2000s, the CGTechniques website had a “rendering challenge”, where an interesting model would be posted and then artists would try to make the best rendering they could of it. I remember how remarkable the images were in those early days of global illumination–seeing complex 3D models coupled with beautiful lighting was incredibly inspiring, especially when there were few especially interesting scenes available for use in rendering research. One of the models used for the challenge was the now-famous “Sponza Atrium,” created by Marko Dabrovic. The CGTechniques website is remarkably still online, but the contest pages are missing images, though archive.org delivers some of them at least.

In those days of the rendering challenge, Greg Humphreys and I were in the thick of working on the first edition of Physically Based Rendering. We were desperate for good scenes to render. I got in touch with Marko and he was more than happy to give us permission to use the Sponza Atrium model in the book and to help with the conversion. In addition to the atrium and a model of the Šibenik Cathedral, less famous though still wonderful, Marko and his colleague Mihovil Odak had a nice model of an Audi TT that we made extensive use of as well. They were all great scenes, especially for those days; we kept using them in the book through the third edition, only now moving on to new ones, two decades later.

I was recently able to visit Croatia, home to the originals for the Sponza Atrium and Šibenik Cathedral models. I didn’t make it to Šibenik, so missed the cathedral, but I did visit Dubrovnik, home of the Sponza Palace. You get a nice postcard when you pay the entrance fee.

The Sponza Palace is now home to the Dubrovnik State Archive. A room at the entrance has a memorial to the 200 soldiers who died defending Dubrovnik during the Siege of Dubrovnik. The Palace itself was damaged then, hit by a number of shells. You couldn’t tell now, nor do you see any hint of the broader devastation to the city, just 30 years ago.

It was quiet when I visited the Palace even though the streets of Dubrovnik outside were packed with tourists. And by quiet I mean that no one else was in the Atrium most of the time I was there, not that I’m complaining. I’m happy to confirm that the GI effects are impressive to see in person.

I took advantage of the emptiness and recorded a three-minute long video, walking about and panning my camera around to capture the details as well as I could. Here’s a low-resolution GIF of the first 20 seconds of it. (Caution: I find it a little nausea-inducing to view the video now; the goal was a thorough capture rather than something pleasing for human consumption.) I offer it up as fodder for a fun NeRF, or at least as reference to the original.

Some News About the 4th Edition of Physically Based Rendering

2022-12-22T00:00:00-08:00

I’m delighted to report that the final laid-out pages of the 4th edition of Physically Based Rendering are now on their way to the printer. As they say, it’s been a journey, but I think that all involved are thrilled with the final result. It has been a delight working with MIT Press this time around, especially after all of the disappointment of the conglomerate that shall not be named that was the publisher for the last two editions.

Speaking of delightful things (and of things that printers require), we have a final cover design. It’s another thing that I think turned out well; the idea is to convey some sense of the distance that’s traveled, more or less from scratch to photorealism, over the course of the book.

The printed book will be available on March 28, 2023 and full book text will be available online for free starting on November 1, 2023. That first date has once again slipped, though just a few weeks this time. Apologies, though I’m glad we took a little extra time for the last rounds of reviews and fine edits before sending it to the printer.

For buyers in the US, a 20% off preorder discount and free shipping is available if you enter the promo code “MITPHoliday22” and order from Penguin Random House. Alternatively, it’s available for preorder on Amazon and elsewhere.

In the meantime, we have posted PDFs of two complete chapters from the new edition: Chapter 11, Volume Scattering and Chapter 14, Light Transport: Volume Rendering. Together they are over 100 pages of text, almost all of it brand new or largely rewritten since the third edition. Those two chapters cover the state of the art in volumetric light transport, up to and including the null-scattering path integral.

One more thing… The cover image has always been an important part of the book, conveying the value proposition—study the contents and you can understand how to write a program that makes images like this. For each new edition, we’ve tried to find better and better scenes to keep up with pbrt’s increasing capabilities and all of the topics covered in the book.

This time around, we licensed the rights to two lovely scenes from Angelo Ferretti, allowing us to convert them into pbrt’s format and to distribute the result. They are now available in the pbrt-v4-scenes repository. Together they are nearly 6 GiB, so your git pull may take some time.
While you wait, here’s a selection of a few views of them that are rendered, naturally, with pbrt.

Happy rendering!

Let’s Stop Calling it “GGX”

2022-05-06T00:00:00-07:00

Fifteen years ago, Walter et al. published a fantastic paper about microfacets at EGSR 2007. It’s full of great contributions, including working out the theory of refraction through rough microfacet models and evaluating various models with respect to measured data. Justifiably, it won the EGSR Test of Time award in 2021.

That paper also introduced a microfacet distribution, named there “GGX.” That distribution was more effective at fitting their measured data than distributions that had been used before in graphics. To the authors’ knowledge at the time it was new, but it later became apparent that GGX is equivalent to a microfacet distribution that Trowbridge and Reitz introduced in 1975.¹ It was an unintentional reinvention—these things happen.

Although this connection now seems to be fairly widely known, “GGX” seems to have stuck in graphics. To this day, “GGX” is used widely in the titles of papers and their text, often without any reference to Trowbridge and Reitz. It’s an unfortunate state of affairs:

First and foremost, Trowbridge and Reitz deserve their acknowledgment. Their paper is fantastic² and their work dates to 1975.
It doesn’t reflect well on graphics as a field for us to continue to use our own renaming of a preexisting model. For example, if we all called Monte Carlo integration “the Kajiya method,” the broader Monte Carlo community would quite reasonably raise an eyebrow.
It reduces the impact of work done in graphics that is based on the Trowbridge–Reitz distribution; if someone in another field is aware of Trowbridge–Reitz but not “GGX,” then there’s research in graphics that they’re unlikely to find even though it may be relevant to their work.

So, better late than never—let’s make it “Trowbridge–Reitz,” or if you prefer, “Trowbridge–Reitz (GGX).”

notes

Trowbridge, S., and K. P. Reitz. 1975. Average irregularity representation of a rough surface for ray reflection. Journal of the Optical Society of America 65 (5), 531–36. ↩
I love this: “The ellipsoid model may prove useful by allowing estimations of its parameter (e) to a reasonable accuracy simply from visual examination of a surface’s micro-structure. On each of the surfaces we examined, one of the authors has visually estimated the shape of the average ellipsoid by observing cross sections of surface irregularities and by observing variations of abundances of surface microareas with orientation relative to the macrosurface.” ↩

Sampling in Floating Point (2/3): 1D Intervals

2022-03-14T00:00:00-07:00

After learning about Walker’s algorithm for uniformly sampling \([0,1)\) in floating-point, I started thinking about how his approach might be generalized to arbitrary intervals; being able to uniformly sample any interval while potentially generating all possible floating-point values inside it would certainly be a nice tool to add to the toolbox.

The good news is that it was a fun thought exercise with things learned along the way. In the end I found enough insights to come up with a solution. However, upon further digging I found two previous implementations of the approach I came up with. So much for a tour de force paper describing my findings in ACM Transactions on Modeling and Computer Simulation. They are:

Christoph Conrads’s Rademacher Floating Point Library, which dates to 2018. See the make_uniform_random_value() function there.
Olaf Bernstein’s dist_uniformf_dense() which seems to date to 2021.

(I’d be interested to hear if there are any earlier instances of it.)

Nevertheless, I still thought it might be useful to write up some of the motivation, walk through my route to a solution, and explain some of the subtleties.

What’s Wrong With Linear Interpolation?

If you want to sample a value in \([a,b)\) for arbitrary \(a\) and \(b\), the usual thing is to take a uniform value in \([0,1)\) and use it to linearly interpolate between \(a\) and \(b\). Last time we saw how to compute a gold-standard uniform floating-point value in \([0,1)\), so why not use that to interpolate? Needless to say, that works in practice, but once again, there are some subtleties.

First one must choose an equation with which to linearly interpolate. For an interval \([a,b)\) with interpolation parameter \(t\), there are two common choices: \(a+t (b-a)\) and \((1-t)a +t b\). Each has its own strengths and weaknesses.

The first, \(a + t (b-a)\), requires fewer operations but has the disadvantage that even if \(t \in [0,1)\), it is not guaranteed that the interpolated value will be in \([a,b)\). For example, with \(a=2.5\), \(b=8.87385559\), and \(t=1-2^{-24}\), the last floating-point value before 1, float32 arithmetic gives \[ 2.5 + (1 - 2^{-24}) (8.87385559 - 2.5) \rightarrow 8.87385559; \] the result is equal to the upper bound even though \(t<1\). A similar problem occurs with the closed interval \([a,b]\): the value \(t=1\) can yield a value that is greater than \(b\).

In graphics, respecting intervals is important since we often do things like bound vertex positions at different times, linearly interpolate them, and then want to be able to assert that the result is inside the bounds. In this case, \((1-t)a+tb\) is preferable, since \(t=0\) always yields \(a\) and \(t=1\) gives \(b\). However, that formulation has the surprising shortcoming that increasing \(t\) sometimes causes the interpolated value to move backwards. Consider this interpolant with \(a=2.5\) and \(b=10.53479\): \[ (1-t) \cdot 2.5 + t \cdot 10.53479. \] With \(t=0.985167086\), the interpolant gives 10.4156113. Moving \(t\) up to the next possible floating-point value, 0.985167146, the interpolant’s value is reduced, down to 10.4156103. The rounding on the terms has gone differently with that small change in \(t\) and it’s (slightly) downhill from there. In practice, these little wobbles are unlikely to cause trouble, though they do mean that an assertion that implicitly assumes that the interpolant is monotonic may fail for fairly obscure reasons.
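Both behaviors are easy to reproduce. The following is a minimal sketch; the helper names (`LerpOneMul`, `LerpTwoMul`, `NextFloatUp`) are mine rather than from any library, and it assumes float32 arithmetic without extended intermediate precision.

```cpp
#include <cassert>
#include <cmath>

// a + t (b - a): fewer operations, but may land on b even when t < 1.
float LerpOneMul(float a, float b, float t) { return a + t * (b - a); }

// (1 - t) a + t b: exact at t = 0 and t = 1, but not always monotonic in t.
float LerpTwoMul(float a, float b, float t) { return (1 - t) * a + t * b; }

// Next representable float32 above v, for stepping t by one float.
float NextFloatUp(float v) { return std::nextafter(v, INFINITY); }
```

With the values from the text, `LerpOneMul(2.5f, 8.87385559f, 0x1.fffffep-1f)` returns the upper bound. Comparing `LerpTwoMul(2.5f, 10.53479f, t)` for `t = 0.985167086f` against the same call with `NextFloatUp(t)` is a way to look for the backward step; a compiler that fuses the multiply and add may round the last bit differently.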

Both of these approaches also suffer from a minor bias for reasons similar to why dividing a random 32-bit value by \(2^{32}\) to generate a uniform random variable led to a bias in sampled floats: each \(t\) value maps to a single floating-point value and if there are more of one than the other, rounding may introduce a minor non-uniformity. (The numerics people like to say this is due to the pigeonhole principle. I can’t say that is incorrect, but I like to think of it in terms of aliasing: it’s taking something at one frequency and then resampling it at another, and things always get a little messy when you do that and aren’t careful.)

For more on the above, including an entertaining review of how linear interpolation is implemented in assorted programming languages’ standard libraries and assorted misstatements in their documentation about the characteristics of the expected results, see Drawing random floating-point numbers from an interval, by Frédéric Goualard.¹

A Few Utility Functions

Before we go further, let’s specify a few utility functions that will be used in the forthcoming implementations. (All of the following code is available in a small header-only library.)

For efficient sampling of the \(p=1/2\) geometric distribution, a function that uses native bit counting instructions will be useful:

int CountLeadingZeros(uint64_t value);

We’ll assume the existence of two functions that provide uniform random values.

uint64_t Random64Bits();
uint32_t Random32Bits();
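These three functions are only declared here. One possible set of stand-ins is sketched below, assuming that a seeded `std::mt19937_64` is an acceptable bit source. The generator, the seed, and the portable fallback loop are my choices for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#if __has_include(<bit>)
#include <bit>
#endif

// Number of zero bits above the highest set bit; 64 for a zero input.
int CountLeadingZeros(uint64_t value) {
#if defined(__cpp_lib_bitops)
    return std::countl_zero(value);  // C++20; typically a single instruction
#else
    // Portable fallback for compilers without C++20 <bit> support.
    int n = 0;
    for (uint64_t bit = 1ull << 63; bit != 0 && (value & bit) == 0; bit >>= 1)
        ++n;
    return n;
#endif
}

// Assumption: a Mersenne Twister with a fixed seed stands in for the
// unspecified source of random bits.
uint64_t Random64Bits() {
    static std::mt19937_64 rng(0x5eed);
    return rng();
}

// The upper half of a 64-bit draw.
uint32_t Random32Bits() { return uint32_t(Random64Bits() >> 32); }
```

A fixed seed makes runs repeatable, which is convenient for testing; a real application would presumably seed differently or use its own generator.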

C++20’s std::bit_cast() function makes it easy to convert from a float32 to its bitwise representation and back.

uint32_t ToBits(float f) { return std::bit_cast<uint32_t>(f); }
float FromBits(uint32_t b) { return std::bit_cast<float>(b); }

A few helper functions extract the pieces of a float32. (If necessary, see the Wikipedia page for details of the in-memory layout of float32s to understand what these are doing.) Zero is returned by SignBit() for positive values and one for negative. Because the float32 exponent is stored in memory in biased form as an unsigned value from zero to 255, Exponent() returns the unbiased exponent, which ranges from \(-126\) to 127 for normal floating-point values, with \(-127\) reserved for zero and the denormalized floats.

int SignBit(float f) { return ToBits(f) >> 31; }
int Exponent(float f) { return ((ToBits(f) >> 23) & 0xff) - 127; }
constexpr int SignificandMask = (1 << 23) - 1;
int Significand(float f) { return ToBits(f) & SignificandMask; }

We’ll also find it useful to be able to generate uniform random significands.

uint32_t RandomSignificand() { return Random32Bits() & SignificandMask; }

Float32FromParts() constructs a float32 value from the specified pieces. The assertions document the requirements for the input parameters.

float Float32FromParts(int sign, int exponent, int significand) {
    assert(sign == 0 || sign == 1);
    assert(exponent >= -127 && exponent <= 127);
    assert(significand >= 0 && significand < (1 << 23));
    return FromBits((sign << 31) | ((exponent + 127) << 23) | significand);
}
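As a quick check of these conventions, the sketch below decomposes a few values and reassembles them. It restates the helpers with `std::memcpy` in place of `std::bit_cast` only so that it also builds before C++20; `RoundTrips()` and the test values are mine.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

uint32_t ToBits(float f) { uint32_t b; std::memcpy(&b, &f, sizeof(b)); return b; }
float FromBits(uint32_t b) { float f; std::memcpy(&f, &b, sizeof(f)); return f; }

// Unbiased exponent; -127 for zero and the denormalized floats.
int Exponent(float f) { return int((ToBits(f) >> 23) & 0xff) - 127; }
// Low 23 bits of the representation.
int Significand(float f) { return int(ToBits(f) & ((1u << 23) - 1)); }

float Float32FromParts(int sign, int exponent, int significand) {
    return FromBits((uint32_t(sign) << 31) | (uint32_t(exponent + 127) << 23) |
                    uint32_t(significand));
}

// True if v survives a trip through its sign, exponent, and significand.
bool RoundTrips(float v) {
    int sign = int(ToBits(v) >> 31);
    return Float32FromParts(sign, Exponent(v), Significand(v)) == v;
}
```

For example, 8 and 8.5 share the exponent 3 and differ only in the significand (0 versus `1 << 19`), and an exponent of \(-127\) with a significand of 1 gives the smallest denormalized float.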

A positive power-of-two float32 value can be constructed by shifting the biased exponent into place.

float FloatPow2(int exponent) {
    assert(exponent >= -126 && exponent <= 127);
    return FromBits((exponent + 127) << 23);
}

Expressed with these primitives, Reynolds’s pragmatic compromise algorithm for uniformly sampling in \([0,1)\) is:

float Sample01() {
    uint64_t bits = Random64Bits();
    int significand = bits & SignificandMask;
    if (int lz = CountLeadingZeros(bits); lz <= 40)
        return Float32FromParts(0, -1 - lz, significand);
    return 0x1p-64f * significand;
}

First Steps

Back to our original question: can we do better than linear interpolation to sample an arbitrary interval, and more specifically, is it possible to generalize Walker’s algorithm to remove the limitation of sampling over \([0,1)\)? I had no idea how to go all the way from the original algorithm to arbitrary intervals so I started with a few small thought experiments to chip away at the edges of the problem and improve my intuition. (In all of the following, I’ll assume a half-open interval \([a,b)\) that does not span zero; we’ll come back to the generalizations of a closed interval \([a,b]\) and intervals that span zero at the end.)

An easy first case is to consider intervals that start at zero and end with an arbitrary power of two. I first took the smallest step possible and thought about \([0,2)\). Indeed, Walker’s approach just works; there’s nothing in it that requires the upper bound to be 1; we can apply the same idea of starting with the upper \([1,2)\) interval, randomly selecting it with probability 1/2 and otherwise continuing down intervals until one is chosen or we hit the denorms. There’s an easy first victory.

Have we missed anything? We should be careful. To further validate this direction, consider the case where we have a tiny power-of-two sized interval, say \([0,2^{-124})\). The minimum exponent for normal numbers is \(-126\), so we have just two regular power-of-two sized intervals and then the denorms. Here’s how that looks on the floating-point number line with valid floats marked with hashes and a 3-bit significand to keep the figure scrutable:

We sample the \([2^{-125}, 2^{-124})\) interval with probability 1/2 and otherwise sample \([2^{-126}, 2^{-125})\) with probability 1/2. If neither is selected, we uniformly sample the denorms which are between \([0,2^{-126})\). This extreme case helps us better understand how the edge case of the denorms is handled: because the width of the last interval of normal floats and the width of the denorms is equal, choosing between them with equal probability leads to uniform sampling of the full interval.

On this topic, Walker wrote:

In practice, the range of the exponent will be limited, and the probability of the number falling into either of the two smallest intervals will be the same.

Denormalized numbers were invented after his paper so it seems that this was a minor fudge in his original approach, corrected today by advances in floating-point.

Here is a function to sample a float32 exponent for this case, taking 64 random bits at a time and counting zeros to sample the distribution. An exponent is returned either if a one bit is found in the random bits or if enough zero bits have been seen to make it to the denorms. In either case, if the denorms have been reached, \(-127\) is returned so that a denormalized or zero floating-point value results.

int SampleToPowerOfTwoExponent(int exponent) {
    assert(exponent >= -127 && exponent <= 127);
    while (exponent > -126) {
        if (int lz = CountLeadingZeros(Random64Bits()); lz == 64)
            exponent -= 64;
        else
            return std::max(-127, exponent - 1 - lz);
    }
    return -127;
}

Given SampleToPowerOfTwoExponent(), the full algorithm to uniformly sample an interval \([0,2^x)\) is simple.

float SampleToPowerOfTwo(int exponent) {
    int ex = SampleToPowerOfTwoExponent(exponent);
    return Float32FromParts(0, ex, RandomSignificand());
}

An implementation that uses a fixed number of random bits can be found with a straightforward generalization of Reynolds’s “pragmatic” algorithm that always consumes only 64 random bits, though there are two differences compared to the \([0,1)\) case. First, if the initial interval is \([0,2^{-88})\) or smaller, then the 41 bits remaining after the significand is extracted from the 64-bit random value are more than are needed to consider for all of the possible power-of-two intervals. In that case, we need to be careful to finish in the denorms rather than trying to construct a float32 with an invalid exponent. Clamping the exponent at \(-127\) takes care of this.

The second difference is that if all of the bits used to select an interval are zero, then if the initial exponent is \(x\), then the remaining interval that we will sample using equal spacing is \([0,2^{x-41})\). Given a 23-bit significand \(s\), the sampled value is then \[ \frac{s}{2^{23}} 2^{x-41}. \] It is tempting to merge the division by \(2^{23}\) and the multiplication by \(2^{x-41}\) into a single constant, though doing so would lead to underflow when \(x < -63\). (Reynolds’s algorithm for \([0,1)\) could just multiply the significand by \(2^{-64}\) in this case since \(x\) was always 0 and there were no concerns about underflow.)

float SampleToPowerOfTwoFast(int exponent, uint64_t bits) {
    int significand = bits & SignificandMask;
    int lz = CountLeadingZeros(bits);
    if (lz >= 41 && exponent - 41 > -127)
        return significand * 0x1p-23f * FloatPow2(exponent - 41);
    int ex = exponent - 1 - lz;
    return Float32FromParts(0, std::max(-127, ex), significand);
}

Another easy case comes with an interval where both endpoints have the same exponent. In that case, the spacing between them is uniform and a value can be sampled by randomly sampling a significand between theirs. That setting is shown on the floating-point number line below; the values marked in red are the possible results, depending on the significand.

The code is easy to write given a RandomInt() function that returns a uniform random integer between 0 and the specified value, inclusive:

float SampleSameExponent(float a, float b) {
    assert(a < b && Exponent(a) == Exponent(b));
    int sa = Significand(a), sb = Significand(b);
    int sig = sa + RandomInt(sb - sa - 1);
    return Float32FromParts(SignBit(a), Exponent(a), sig);
}
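`RandomInt()` is not defined in the text. One unbiased way to write it is to mask random bits down to the smallest covering power of two and reject out-of-range values. The minimal sketch below does that, with a stand-in body for `Random32Bits()` so that it builds on its own; the generator is my choice.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Stand-in for the Random32Bits() helper assumed earlier.
uint32_t Random32Bits() {
    static std::mt19937 rng(0x5eed);
    return uint32_t(rng());
}

// Uniform random integer in [0, maxValue], inclusive.
int RandomInt(int maxValue) {
    assert(maxValue >= 0);
    // Smear the highest set bit downward to get an all-ones mask that
    // covers maxValue.
    uint32_t mask = uint32_t(maxValue);
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    // Each candidate is accepted with probability greater than 1/2.
    while (true) {
        uint32_t v = Random32Bits() & mask;
        if (v <= uint32_t(maxValue))
            return int(v);
    }
}
```

Taking the random bits modulo `maxValue + 1` would be shorter, but it is slightly biased whenever the range does not divide \(2^{32}\).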

Arbitrary (Power of 2) Lower Bounds

The easy successes stop coming when we consider intervals with a power-of-two value at their lower bounds: say that we’d like to sample uniformly over \([1,8)\). Our intervals are \([1,2)\), \([2,4)\), and \([4,8)\). Their respective widths are 1, 2, and 4; the sampling probabilities are \(1/7\), \(2/7\), and \(4/7\). So much for a nice geometric distribution with \(p=1/2\). The setting is illustrated below:

Here we most definitely see the importance of the denorms and the last power-of-two sized interval of normal floating-point numbers having the same width. With a power-of-two interval that ends above zero, we no longer have two intervals at the end that should be sampled with the same probability and things fall apart.

Upon reaching this realization, I had no idea how to proceed; I feared that the cause might be lost. Lacking any other ideas, I wondered if it would work to apply Walker’s approach still with the probability \(1/2\) of sampling each interval but then cycling around when one goes past the lower interval, along these lines:

With this method, the probability of sampling the \([4,8)\) interval is then \(1/2\) the first time around. With \(1/8\) probability we cycle back around for another \(1/2\) chance, and so forth. We have: \[ \frac{1}{2} + \frac{1}{8} \frac{1}{2} + \cdots = \frac{1}{2} \sum_{i=0}^\infty \frac{1}{8^i} = \frac{4}{7} \] Success! (Needless to say, the other intervals work out with the desired probabilities as well.)
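The same bookkeeping can be done numerically for any number of intervals. The sketch below (the function name is mine) sums the probabilities of all leading-zero counts that wrap around to a given interval. With \(n\) intervals, the \(k\)-th from the top should come out to \(2^{n-1-k}/(2^n-1)\), which is its share of the total width.

```cpp
#include <cassert>
#include <cmath>

// Probability that the cycling scheme selects the k-th interval from the
// top (k = 0 is the widest) out of n power-of-two sized intervals.
double CyclingProbability(int n, int k) {
    double p = 0;
    // A run of j zero bits followed by a one bit has probability 2^-(j+1)
    // and selects interval j mod n. Terms past j = 1023 are far below
    // double precision and are dropped.
    for (int j = k; j < 1024; j += n)
        p += std::ldexp(1.0, -(j + 1));
    return p;
}
```

For three intervals, `CyclingProbability(3, 0)`, `CyclingProbability(3, 1)`, and `CyclingProbability(3, 2)` agree with \(4/7\), \(2/7\), and \(1/7\) to within rounding.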

SampleExponent() implements the algorithm that consumes random bits until it successfully samples such a single power-of-two interval.

int SampleExponent(int emin, int emax) {
    int c = 0;
    while (true) {
        if (int lz = CountLeadingZeros(Random64Bits()); lz == 64)
            c += 64;
        else
            return emax - 1 - ((c + lz) % (emax - emin));
    }
}

If emin and emax are not known at compile time, computing the integer modulus in SampleExponent() may be expensive. Because the maximum value of emax-emin is 253, it may be worthwhile to maintain a table of constants for use with an efficient integer modulus algorithm (see e.g., Lemire et al. 2021.)

With SampleExponent() in hand, the full algorithm is straightforward.

// Sample uniformly and comprehensively in [2^emin, 2^emax).
float SampleExponentRange(int emin, int emax) {
    assert(emax > emin);
    int significand = RandomSignificand();
    return Float32FromParts(0, SampleExponent(emin, emax), significand);
}

For a “pragmatic” version of this algorithm that uses a fixed number of random bits, we could take the number of leading zeros modulo the number of power-of-two sized intervals under consideration to choose an interval and then use a uniform random significand. However, in the rare case where all of the bits used to sample an interval are zero, the remaining interval is of the form \([2^a, 2^c)\) where \(c=41 \mod (b-a)\); we’ve used up all of our random bits to be faced with the same general problem we started with (unless it happens that \(c=a+1\).) At that point we might use linear interpolation to sample the remaining interval, though that’s admittedly unsatisfying, as linear interpolation is the thing we’re trying to avoid.

Partial Intervals at One or Both Ends

With that we finally have enough to return to the original task, uniformly and comprehensively sampling an arbitrary interval \([a,b)\). This is, unfortunately, the point at which I haven’t been able to figure out a reasonable “pragmatic” implementation that uses a small and fixed number of random bits. The figure below shows the general setting; as before, the valid candidate values are marked in red.

An approach based on rejection sampling can be used to sample the specified interval: the idea is that we will sample all of the possible intervals as before, with probability according to their width. Then we uniformly sample a significand in the chosen interval and then accept the value if it is within \([a,b)\). For the power-of-two intervals in the middle, we will always accept the sample, and for the intervals on the ends, the probability of acceptance is proportional to how much of the power-of-two interval overlaps \([a,b)\).

The implementation isn’t much code given all the helpers we’ve already defined, though there are two important details. First, the upper exponent is bumped up by one if b’s significand is non-zero. To understand why, consider the difference between sampling the intervals \([a, 8)\) and \([a, 8.5)\). In the former case, we will never need to consider an exponent of 3, but for the latter case, we must. Second, the algorithm used for sampling the exponent must account for whether zero or the denorms are included in \([a,b)\); this corresponds to the differences we saw earlier in how to sample intervals like \([0,2^x)\) versus \([2^x,2^y)\).

float SampleRange(float a, float b) {
    assert(a < b && a >= 0.f && b >= 0.f);
    int ea = Exponent(a), eb = Exponent(b);
    if (Significand(b) != 0) ++eb;
    while (true) {
       int e = (ea == -127) ? SampleToPowerOfTwoExponent(eb) :
                              SampleExponent(ea, eb);
       float v = Float32FromParts(0, e, RandomSignificand());
       if (v >= a && v < b)
           return v;
    }
}

Note that it would probably be worthwhile to handle the special case of matching exponents with a call to SampleSameExponent(), as rejection sampling with a significand that spans the entire power-of-two range will be highly inefficient if the two values are close together.

The worst case for this algorithm comes when b’s significand is small—i.e., b is just past a power-of-two. The upper power-of-two range will be sampled with probability at least 1/2 but then the sampled value v will usually be rejected, requiring another time through the while loop. Conversely, having a just below a power of 2 is less trouble, since the corresponding power-of-two interval is the least likely to be sampled.

Closed Intervals

One nice thing about the algorithm implemented in SampleRange() is that handling closed intervals \([a,b]\) is mostly a matter of updating the if test in the while loop accordingly. The only other difference is that eb is always increased by one. Thus, the worst case for this version of the algorithm is when b is an exact power of 2, again giving a \(1/2\) chance of selecting the upper interval each time, with a \(1-2^{-23}\) probability of rejecting the sample in that interval.
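A minimal sketch of that closed-interval variant follows. It restates the helpers it needs in compact form so that it builds on its own. The random bit source, the portable leading-zero loop, the name `SampleRangeClosed()`, and the restriction to \(b < 2^{127}\) are my assumptions rather than part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <random>

// Stand-ins for helpers assumed in the text; generator and seed are my choice.
uint64_t Random64Bits() { static std::mt19937_64 rng(0x5eed); return rng(); }
int RandomSignificand() { return int(Random64Bits() >> 41); }  // top 23 bits
int CountLeadingZeros(uint64_t v) {
    int n = 0;
    for (uint64_t bit = 1ull << 63; bit != 0 && (v & bit) == 0; bit >>= 1) ++n;
    return n;
}
int Exponent(float f) {
    uint32_t b; std::memcpy(&b, &f, sizeof(b));
    return int((b >> 23) & 0xff) - 127;
}
float Float32FromParts(int sign, int exponent, int significand) {
    uint32_t b = (uint32_t(sign) << 31) | (uint32_t(exponent + 127) << 23) |
                 uint32_t(significand);
    float f; std::memcpy(&f, &b, sizeof(f));
    return f;
}

// As in the text: an exponent for [0, 2^exponent), or -127 for the denorms.
int SampleToPowerOfTwoExponent(int exponent) {
    while (exponent > -126) {
        int lz = CountLeadingZeros(Random64Bits());
        if (lz == 64) exponent -= 64;
        else return std::max(-127, exponent - 1 - lz);
    }
    return -127;
}

// As in the text: an exponent for [2^emin, 2^emax), cycling past the bottom.
int SampleExponent(int emin, int emax) {
    int c = 0;
    while (true) {
        int lz = CountLeadingZeros(Random64Bits());
        if (lz == 64) c += 64;
        else return emax - 1 - ((c + lz) % (emax - emin));
    }
}

// Closed-interval variant of SampleRange(); assumes 0 <= a < b < 2^127.
float SampleRangeClosed(float a, float b) {
    assert(a < b && a >= 0.f && Exponent(b) < 127);
    int ea = Exponent(a);
    int eb = Exponent(b) + 1;  // always bumped, so that b itself is reachable
    while (true) {
        int e = (ea == -127) ? SampleToPowerOfTwoExponent(eb)
                             : SampleExponent(ea, eb);
        float v = Float32FromParts(0, e, RandomSignificand());
        if (v >= a && v <= b)  // closed upper bound in the acceptance test
            return v;
    }
}
```

Calling `SampleRangeClosed(1.f, 4.f)` exercises the worst case described above: the \([4,8)\) interval is selected often, but only the value 4 itself is accepted from it.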

Further Improvements

Stratified sampling is a topic we didn’t get to today; it is often desirable when one is generating multiple samples over an interval. For a power-of-2 stratification, it’s possible to work backward from the various sampling algorithms to determine constraints on the bit patterns that achieve stratification. I’ll leave the details of that to the reader; Sample01() is a good place to start.

We also haven’t dug into the case of an interval that spans zero. To achieve uniform sampling a similar rejection-based approach is probably needed where given such an interval \([a,b)\) we define an extended interval \([-c,c)\) with \(c=\max (|a|, |b|)\) that encompasses the original interval. We can then randomly select the positive or negative side, generate a sample, and then reject it if it is not inside the original interval. However, the combination of an unbalanced interval that spans zero and also includes an exact power of two at its upper bound gives an even worse worst case: consider a highly unbalanced interval like \([-2^{-100}, 2^{64}]\): we end up with a nearly \(3/4\) chance of rejecting each candidate sample.

Discussion

It was pretty good going through the special cases until we reached the end. Unfortunately, I don’t see a good way to work around the need to do rejection sampling when there are partial power-of-two intervals at the ends of the range. Perhaps that isn’t the worst thing ever, but having an irregular amount of computation afoot is not ideal if high performance on GPUs or using SIMD instructions is of interest.

Nevertheless, a quick benchmark suggests that SampleRange() is only about 2.5 times slower than \((1-t)a+tb\) on my system here if the cost of random number generation is included. If a reasonable amount of computation is performed for each sample, the added cost may be no concern. However, lacking a clear example of a case where this first-class sampling makes a difference in the final results, it’s hard to argue for the added expense in general.

note

Goualard also suggests a sampling algorithm based on a uniform spacing over the interval that is by design not able to generate all possible floating-point values. This algorithm seems to have been previously derived by Artur Grabowski in his random-double library from 2015; see rd_positive() there. ↩

Sampling in Floating Point (1/3): The Unit Interval

2022-03-05T00:00:00-08:00

(The following assumes basic familiarity with the IEEE floating point representation—sign, power of two exponent, and significand—but not necessarily expert-level understanding of it.)

Taking samples from various distributions is at the heart of rendering; perhaps most importantly, it allows us to use importance sampling when performing Monte Carlo integration, which gives us a powerful tool to reduce error. The associated sampling algorithms are generally derived assuming that real numbers are afoot but are then implemented using floating-point math on computers. For sampling, the differences between reals and floats usually don’t cause any problems, though if you look closely enough there are a few interesting subtleties. We’ll start this short series on that topic today with what seems like it should be the simplest of problems: uniformly sampling a floating-point value between zero and one.

Uniform Floats by Dividing by an Integer

Just about anywhere you look, from Stack Overflow to all four editions of Physically Based Rendering, you’ll be told that it’s easy to sample a uniform floating-point value in \([0,1)\): just generate a random \(n\) bit unsigned integer and divide by \(2^{n}\). Given real numbers, that’s fine: the largest value your integer can take is \(2^n-1\) and dividing by \(2^n\) gives a value that’s strictly less than one.¹ With 32-bit floats (as we will exclusively consider today), there’s a nit: say that \(n=32\) (as is used in pbrt). After floating point rounding, one will find that \[ \frac{2^{32} - 1}{2^{32}} \rightarrow 1; \] so much for that non-inclusive upper bound. The problem is that the spacing between the floats right below 1 is \(2^{-24}\). Because \(2^{-32}\) is much less than half that, \(1-2^{-32}\) rounds to 1. Even worse, all 128 integer values in \([2^{32}-128, 2^{32}-1]\) round to 1 after the division.

pbrt works around that problem by bumping any such 1s down to \(1-2^{-24}\), the last representable float before 1. That gets things back to \([0,1)\) but it’s a stinkiness in the code that in retrospect should have led to the algorithm used being given more attention.
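To make the failure and the workaround concrete, here is a minimal sketch. The function names are mine, and the clamp is written from the description above rather than copied from pbrt’s source; it assumes the default round-to-nearest-even conversion from integer to float.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Divide-by-2^32 mapping. The integer is rounded to float32 first, so
// inputs close to 2^32 end up at exactly 1.
float UniformFloatNaive(uint32_t v) { return float(v) * 0x1p-32f; }

// The same mapping with any 1s bumped down to the largest float32 below 1.
float UniformFloatClamped(uint32_t v) {
    const float oneMinusEpsilon = 0x1.fffffep-1f;  // 1 - 2^-24
    return std::min(UniformFloatNaive(v), oneMinusEpsilon);
}
```

The 128 largest inputs, from `0xffffff80` through `0xffffffff`, all give 1 with the naive version, while `0xffffff7f` still gives a value below 1.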

One way to avoid this issue and many of the following is to set \(n=24\), in which case all values after division are valid float32 values and no rounding is required.² However, that gives slightly more than 16 million unique values; that’s a fair number of them, but there are actually a total of 1,065,353,216 float32 values in \([0,1)\)—nearly a quarter of all possible 32-bit floats. Under that lens, those 16 million seem rather few.
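The three variants discussed so far can be sketched in a few lines. This is a minimal illustration rather than pbrt's actual code; the function names are made up for this example, and it assumes the usual round-to-nearest-even mode for both the integer-to-float conversion and the multiplication.

```cpp
#include <cstdint>

// Naive approach with n = 32: convert to float32, then scale by 2^-32.
// The integer-to-float conversion already rounds, since a float32 only
// holds 24 significant bits.
inline float UniformFloat32Naive(uint32_t v) {
    return float(v) * 0x1p-32f;
}

// The workaround described above: bump any result that reached 1 down to
// the largest float32 below 1.
inline float UniformFloat32Clamped(uint32_t v) {
    const float oneMinusEpsilon = 0x1.fffffep-1f;  // 1 - 2^-24
    float f = UniformFloat32Naive(v);
    return f < oneMinusEpsilon ? f : oneMinusEpsilon;
}

// n = 24: keep only the top 24 bits, so the conversion and the scaling
// are both exact and no result can reach 1.
inline float UniformFloat24(uint32_t v) {
    return float(v >> 8) * 0x1p-24f;
}
```

Feeding `0xffffffff` to the naive version gives exactly 1, while the clamped and 24-bit versions both top out at \(1-2^{-24}\).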

How much better do we do with \(n=32\)? Although we start out with over 4 billion distinct integer values, if you divide each by \(2^{32}\) to generate samples in \([0,1)\) and count how many float32s are generated, it turns out that those four billion yield only 83,886,081 distinct floating-point values, or 7.87% of all of the possible ones between zero and one. Not only do we have multiple integer values mapping to the same floating-point value all the way from \(2^{-8}=0.00390625\) to 1, but between 0 and \(2^{-9}=0.001953125\), the spacing between floats is less than \(2^{-32}\) and many floating-point values are never generated.

There’s another problem that comes with the choice of \(n>24\), nicely explained in the paper Generating Random Floating-Point Numbers by Dividing Integers: A Case Study, by Frédéric Goualard. When the usual round-to-nearest-even is applied after dividing by \(2^{32}\), a systemic bias is introduced in the final values, clearly shown in that paper with examples that use low floating-point precision. Thus, it’s not just “we’re not making the most of what we’ve been given”, but it’s “the distribution isn’t actually uniform.”

The rounding problem is still evident with float32s and \(n=32\) bits; if we consider all \(2^{32}\) integer values, we would expect for example that all floats in \([0.5,1)\) would be generated the same number of times. (Indeed, we would expect 256 of each since we have \(2^{32}\) values, the float32 spacing in that interval is \(2^{-24}\), and \(2^{32} \cdot 2^{-24} = 256\).) However, if we count them up, it turns out that alternating floating-point values are generated 255 times and 257 times, all the way from 0.5 to 1. That happens in many other intervals, becoming its worst in the interval \([0.00390625, 0.0078125)\) where alternating values are generated one and three times.³
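Those counts can be spot-checked without enumerating all \(2^{32}\) integers. The sketch below uses a hypothetical helper, `CountPreimages`, that scans only the integers near \(f \cdot 2^{32}\) and counts how many of them map to a given float \(f\) under the convert-then-multiply approach. It assumes round-to-nearest-even and that \(f\) is a value that approach can actually produce.

```cpp
#include <cmath>
#include <cstdint>

// Count how many 32-bit integers v satisfy float(v) * 2^-32 == f.
// At most a few hundred integers can map to a single float, so scanning
// 1024 integers on either side of f * 2^32 is comfortably enough.
inline int CountPreimages(float f) {
    int64_t center = std::llround(double(f) * 4294967296.0);
    int count = 0;
    for (int64_t v = center - 1024; v <= center + 1024; ++v) {
        if (v < 0 || v > 0xffffffffLL) continue;  // outside the uint32_t range
        if (float(uint32_t(v)) * 0x1p-32f == f) ++count;
    }
    return count;
}
```

For example, `CountPreimages(0.75f)` is 257 and the next float up gets 255; in \([2^{-8}, 2^{-7})\), `CountPreimages(0x1.8p-8f)` is 3 and its upper neighbor gets 1.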

Depending on one’s application, all of these issues may be no problem in practice, and I wouldn’t make the argument that they are likely to cause errors in rendered images. Most of the time \(n=24\) and not worrying about it is probably fine. Yet IEEE has given us all that precision and it seems wasteful not to make use of it, if it isn’t too much trouble to do so…

Uniform Floats by Sampling a Geometric Distribution

What might be done about these problems? A remarkably elegant and efficient solution dates to 1974 with Walker’s paper Fast Generation of Uniformly Distributed Pseudorandom Numbers with Floating-Point Representation, which is based on the following observation (expressed here in terms of modern IEEE float32s):

In the interval \([1/2, 1)\), there are exactly \(2^{23}\) equally-spaced numbers that can be represented in float32.
In the interval \([1/4, 1/2)\), there are exactly \(2^{23}\) equally-spaced numbers that can be represented in float32.
In the interval \([1/8, 1/4)\), there are exactly \(2^{23}\) equally-spaced numbers that can be represented in float32.
And so on…

We would like an algorithm that can generate all of those numbers but does so in a way that gives a uniform distribution over \([0,1)\). Walker observed that this can be done in two steps: first by choosing an interval with probability according to its width, and then by sampling uniformly within that interval. (This algorithm is sometimes credited to Downey, who seems to have independently derived it in an unfinished paper from 2007.)

Because each interval’s width is half that of the one above it, sampling an interval corresponds to sampling a geometric distribution with \(p=1/2\). There’s thus an easy iterative algorithm to select an interval: one can first randomly choose to generate the sample within \([1/2, 1)\) with probability \(1/2\). Otherwise, sample within \([1/4,1/2)\) with probability \(1/2\) and so forth; bottom out if you hit the denorms. Given an interval, the exponent follows and a sample within an interval can be found by uniformly sampling a significand, since values within a given interval are equally-spaced.

Choosing an interval in that way takes only two iterations in expectation, but the worst case requires many more. The associated execution divergence is especially undesirable for processors like GPUs. Walker had another trick up his sleeve, however:

Pseudorandom integer numbers with a truncated geometric distribution may be obtained by counting consecutive 1s or 0s in a binary random number, drawn from a set having a uniform frequency distribution.

In other words, generate a random binary integer and, say, count the number of leading zero bits. Use that count to choose an interval, where zero leading zero bits has you sampling in \([1/2,1)\), one leading zero bit puts you in \([1/4,1/2)\), and so forth. Given an index \(i\) into the intervals that starts at 0, the exponent is then \(-1 - i\). Modern processors offer bit counting instructions that yield such counts, so this algorithm can be implemented very efficiently.
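To make the mapping from interval index to float concrete, here is a small sketch. The names are hypothetical, the leading-zero count is a portable loop standing in for a hardware instruction, and only normal floats are handled (denorms are left out).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Interval index from a random 32-bit word: the number of leading zero bits.
// Production code would use a count-leading-zeros intrinsic instead.
inline int LeadingZeros32(uint32_t v) {
    int n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (v & mask) == 0; mask >>= 1)
        ++n;
    return n;
}

// Build the float32 with exponent -1 - i and the given 23-bit significand,
// i.e. a value in [2^(-1-i), 2^(-i)). Valid for 0 <= i <= 125.
inline float FloatInInterval(int i, uint32_t significand) {
    assert(i >= 0 && i <= 125);
    uint32_t biasedExponent = uint32_t(127 - 1 - i);  // float32 bias is 127
    uint32_t bits = (biasedExponent << 23) | (significand & 0x7fffffu);
    float f;
    std::memcpy(&f, &bits, sizeof(f));  // reinterpret the bit pattern
    return f;
}
```

With no leading zeros and a zero significand this gives 0.5; two leading zeros put the result in \([1/8, 1/4)\).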

From Theory to Implementation

With float32, the floating-point exponent factors over the \([0,1)\) interval go from \(2^{-1}\) down to \(2^{-126}\) before the denorms start. Thus, 126 random bits may be required to choose the interval. However, those intervals start becoming so small that one’s commitment to possibly sampling every possible float might start to waver; the odds of making it to one of the tiny ones become vanishingly small.

A blog post by Marc Reynolds has all sorts of good insights on the efficient implementation of this algorithm. (More generally, Marc’s blog is full of great sampling and floating-point content; highly recommended.) He considers multiple approaches (for example, successively generating as many random 32-bit values as needed) and ends with a pragmatic compromise that takes a single 64-bit random value, uses 41 bits to choose the exponent, and uses the remaining 23 bits to sample the significand. The remaining \([0,2^{-40})\) interval is sampled uniformly. As long as an efficient count leading zeros instruction is used, it’s only slightly more work than multiplying by \(2^{-32}\) and clamping; in practice, most of the extra expense comes from needing to generate a 64-bit pseudorandom value rather than just a 32-bit one.
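Here is one way such a single-64-bit-value scheme might look. It is my own sketch of the idea and not Marc's code; in particular, the exact bit layout is an assumption. The low 23 bits supply the significand, the leading-zero count over the upper 40 bits picks the interval, and when those 40 bits are all zero, bit 23 joins the low bits to cover \([0,2^{-40})\) uniformly.

```cpp
#include <cstdint>
#include <cstring>

// Portable stand-in for a 64-bit count-leading-zeros instruction.
inline int LeadingZeros64(uint64_t v) {
    int n = 0;
    for (uint64_t mask = 1ull << 63; mask != 0 && (v & mask) == 0; mask >>= 1)
        ++n;
    return n;
}

inline float UniformFloatDense(uint64_t u) {
    if ((u >> 24) == 0) {
        // Upper 40 bits are all zero (probability 2^-40): spread the low
        // 24 bits uniformly over [0, 2^-40). The conversion and the
        // product are both exact here.
        return float(uint32_t(u)) * 0x1p-64f;
    }
    int i = LeadingZeros64(u);  // 0..39 selects [2^(-1-i), 2^(-i))
    uint32_t bits = (uint32_t(126 - i) << 23) | (uint32_t(u) & 0x7fffffu);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

The all-ones input gives \(1-2^{-24}\), so no clamping is needed, and `1ull << 24` lands exactly on \(2^{-40}\), where the two branches meet.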

Conclusion

Unless you’re bottlenecked on sample generation, it’s worth considering using an efficient implementation of Walker’s algorithm to generate uniform random floating-point numbers over \([0,1)\). It’s not much more computation than the usual, it makes the most of what floating point offers, and it eliminates a minor source of bias. Plus, you get to exercise the bit counting instructions and feel like that much more of a hacker.

Next time we’ll look at uniformly sampling intervals of floating point numbers beyond \([0,1)\). After that, on to how low-discrepancy sampling interacts with some of the topics that came up today as well as some discussion about avoiding an unnecessary waste of precision when sampling exponential functions.

notes

In practice, one multiplies by \(2^{-32}\) since dividing by a power of two and multiplying by its reciprocal give the same result with IEEE floats. ↩
If I remember correctly, Petrik Clarberg explained the superiority of \(n=24\) over \(n=32\) in this context to me a few years ago; it’s a point that I underappreciated at the time. ↩
On a processor where changing the rounding mode is inexpensive, it is probably a good idea to select rounding down in this case. For example, in CUDA, the multiplication by \(2^{-32}\) might be performed using __fmul_rd(). ↩

Update: Some Analysis of Physically Based Rendering’s Bibliography

2022-01-05T00:00:00-08:00

It’s just over a year ago now that I posted an analysis of citation counts in the bibliographies of the first three editions of Physically Based Rendering. Back then I promised an update with statistics for the forthcoming fourth edition “in the next few weeks.” That turns out to have been rather optimistic, but here we now are with results finally available.

The fact that these results are ready means that we’re done fiddling with the text; we will shortly be handing it over to the publisher so that the book production process can begin. We have switched to MIT Press for the fourth edition and it’s been a fine experience working with them so far; we’re optimistic that our interests are all well aligned in producing a quality book, much more so than with the previous corporate conglomerate that shall not be named. Happily, MIT Press has agreed that we can continue to post a free edition of the book online. (The current plan is for that to be made available roughly six months after the print edition hits the shelves.) However, that brings us to our first point of drama in this bibliographical vanity contest: the online edition will be a superset of the print edition and so there are differences between their bibliographies.

The differences between the two versions are due to the amount of new content that we wrote for the fourth edition; all in all, it would be about 1,600 printed pages. That’s too much for a single volume, at least if one wants paper that isn’t newspaper-thin and a binding that won’t fall apart. Thus, Wenzel and I went through the exercise of rejiggering the book into a 1,200 page version for print while maintaining the full text for the online edition. In deciding what would be online-only, we looked for content that was mostly independent of the rest of the book and was little-changed from the third edition. As examples, both the section on realistic camera models and the chapter on bidirectional light transport will not be there in the print edition this time.

The print edition still includes citations and discussion of previous work for topics that are not included in its text, though not as much of it as in the online edition—it doesn’t make sense to go into as much depth in the citations when the text doesn’t deeply discuss the corresponding topics. Therefore, here I will report the results for both the print and online editions. No doubt there will be years of arguments to come about which is the more proper measure—one might argue that the online edition’s bibliography is the canonical one, as it reflects what would be printed if physical limitations didn’t intrude, or one might argue for the print edition in that those citations earned consumption of actual paper and not just electrons. We will leave that question to be resolved by future historians of computer graphics.

Finally, a few notes on methodology: as before, the following is a simple count of how often each name appears in the bibliography. Editing a book or a conference proceedings doesn’t count, but otherwise every citation is counted equally—from a single-author SIGGRAPH paper to a blog post. The citations include work through SIGGRAPH 2021 but nothing published subsequently. Yes, SIGGRAPH Asia papers are now out, but we had to draw a line somewhere in order to get that thing out the door.

As before, many caveats are in order about how arbitrary a measure this is. Another to mention today is the impact of the fine series of Eurographics State of the Art Reports (STARs), twelve of which appear in the fourth edition’s bibliography. For topics that are not central to the book (e.g., texture synthesis), we will often cite a STAR and only a few additional publications rather than comprehensively survey previous work. Thus, there is an irony in successfully developing a new area of research to the point that it merits a STAR: in our bibliographic measure, a lengthy publication record may end up collapsed into a STAR and a few additional citations, putting one lower than one would have been otherwise.

With that, here are the results for all four editions—author last names with citation counts:

1st (2004)	2nd (2010)	3rd (2016)	4th (print, 2022)	4th (online, 2022)
Greenberg (26)	Jensen (31)	Jensen (33)	Jarosz, Jensen (35)	Jarosz, Jensen (40)
Shirley (25)	Shirley (29)	Shirley (31)	Ramamoorthi (32)	Ramamoorthi (38)
Hanrahan (22)	Greenberg, Hanrahan (27)	Keller (27)	Keller, Shirley (31)	Hanika (36)
Jensen (16)	Ramamoorthi (23)	Wald (25)	Hanika, Jakob (30)	Dachsbacher (33)
Arvo (14)	Wald (18)	Greenberg, Hanrahan (24)	Wald (29)	Jakob, Shirley (32)
Mitchell (13)	Keller (17)	Slusallek (21)	Slusallek (27)	Keller (31)
Keller (12)	Arvo, Seidel, Slusallek (16)	Marschner, Ramamoorthi (19)	Dachsbacher, Marschner (26)	Křivànek, Wald (30)
Heckbert, Torrance (10)	Mitchell (13)	Arvo, Seidel (17)	Křivànek (25)	Marschner, Slusallek (27)
Cook, Kajiya, Levoy, Pattanaik, Ward (8)	Pattanaik, Torrance, Walter (12)	Jarosz (16)	Hanrahan, Novák (23)	Hachisuka (25)

Henrik continues his reign, though he now shares the top spot with Wojciech Jarosz, who had not a single appearance in the first edition’s bibliography. The smallest of lexicographical differences—“a” before “e”—puts Wojciech first in the alphabetical ordering. Fittingly, Wojciech was Henrik’s Ph.D. student, so I have to assume that for Henrik the bitterness of sharing the glory is balanced by the sweetness of a former student’s achievement.

Besides Henrik, only Alex Keller and Pete Shirley made the list for all four editions, though Pat Hanrahan also has that distinction if one neglects the online 4th edition. (Surely Pat will therefore be out there arguing vigorously that the print edition is canonical.)

Ravi Ramamoorthi has displaced Pete Shirley from his long-held number two position, though just by a hair, at least in the print edition. Johannes Hanika has also rocketed up in the standings, making an especially quick climb given that he was not cited in either the first or the second editions. Carsten Dachsbacher has also climbed rapidly, starting from one citation in the first edition, two in the second, and 10 in the third. Finally and fittingly, Jaroslav Křivànek is up there in the latest edition as well.

There you have it. I must admit my disappointment at seeing that the top two spots had the same names for both versions of the fourth edition—all the less potential controversy over which version is the canonical one, though there’s enough motion farther down the list that one might hope for some sparks in the future.

Debugging Your Renderer (5/n): Rendering Deterministically

2021-12-24T00:00:00-08:00

Deterministic program execution has a lot going for it. For most programs, it’s the natural way of being: for any particular input, the program generates the same output. Determinism makes debugging much easier, as it saves you from having to re-run the system repeatedly to trigger a bug that only happens sometimes, and it’s great for end-to-end tests, since you can safely make strict assertions about cases where the program’s output should remain absolutely unchanged (e.g., that float parser example).

However, deterministic execution doesn’t always come naturally when you’re rendering, especially when you’re rendering in parallel. Today’s post will go into some of the ways that deterministic execution can be lost, talk about how to maintain determinism, and then finish with some further discussion of its benefits.

The Basics

To start, let’s settle on a more precise definition of deterministic rendering than “same input gives same output.” It is too much to ask for bit accuracy in output across machines; not only will we encounter different standard math libraries with different levels of precision, but there are a number of corners of C++ that allow for things like variation in order of evaluation across compilers that can lead to innocuous differences in output.

Therefore, we’ll define the observable effect of determinism as: on a particular system with a particular compiler, repeatedly running the renderer on the same input always produces the same value at every pixel. Implicit in that definition is that the same computations are performed to compute each pixel’s value, though not necessarily in the same order. That definition is plenty for our needs; the benefits from nailing it down further almost certainly wouldn’t be worth the trouble.

A renderer running on a single core should naturally achieve that goal. If it does not, fixing that is the first order of business. Most likely it’s an uninitialized memory access, other memory corruption, or code somewhere that randomly seeds a random number generator based on something that varies like the process id or current time. (I won’t say more about fixing those sorts of problems here, as it’s all rendering-independent and is regular everyday debugging.)

Rendering in parallel is when things get more complicated. Indeed, none of the versions of pbrt before the latest, pbrt-v4, was deterministic. That was always a minor annoyance when debugging and testing the system, though I honestly didn’t realize what a productivity drag it was until determinism was achieved.

Consistent Samples

For rendering to be deterministic, the Monte Carlo sampling routines must use exactly the same random sample points at every sample taken in every pixel. If they do not, then determinism is lost from the start, since different rays will be traced each time due to slightly different rays leaving the camera, different sampling decisions will be made at intersections, and so forth. One might assume that determinism is the natural way of being for the Samplers that generate those points, but that was not so prior to pbrt-v4. There were two issues: the placement of low discrepancy point sets and carried state in samplers that led to nondeterminism with multithreading.

When using low discrepancy point sets like Halton points, pbrt-v3 aligns the origin of the points with the upper left pixel of the image. That’s normally \((0,0)\), but then if the user specifies a crop window to render just part of the image the low discrepancy points all shift in compensation. That was always a bother for debugging since you couldn’t narrow in on a problem pixel without perturbing all of the samples and often no longer hitting the bug. That detail was easy enough to fix given attention to it.

The other issue came from the fact that each thread maintains its own Sampler instance. This way samplers can maintain state that depends on the current pixel and pixel sample (e.g., an offset into the Halton sequence). Many samplers also use pseudorandom number generators (RNGs) in their work; those, too, are per-sampler state. (For example, the stratified sampler uses a RNG to jitter sample locations and low discrepancy samplers use RNGs for randomization via scrambling.)

In pbrt-v3, those per-sampler RNGs are seeded once at system startup time and then chug along, generating random numbers as requested. Because threads are dynamically assigned to work on regions of the image, they may not work on the same pixels over multiple runs. In turn, the values that a RNG returns at a pixel depend both on which thread was assigned that pixel and on how many random numbers it had supplied previously for other pixels.

The fix was easy: reseed the RNG before generating sample points at a particular pixel sample. The Sampler interface includes a StartPixelSample() method that is called before samples are requested at a given pixel sample, so it’s just a few lines of code to put those RNGs in a known state. Here’s that method in IndependentSampler, which generates uniform independent samples without any further nuance:

void StartPixelSample(Point2i p, int sampleIndex, int dimension) {
    rng.SetSequence(Hash(p, seed));
    rng.Advance(sampleIndex * 65536ull + dimension);
}

There are two things to note in StartPixelSample()’s implementation. First, pbrt uses the PCG RNG, which allows the specification of both a particular sequence of pseudorandom values as well as an offset into that sequence. Thus, we choose a sequence according to the pixel coordinates and then offset into it according to the index of the sample being taken in the pixel.

The other thing to mention there is Hash(), which has been useful all over the place in pbrt-v4. Here is its signature:

template <typename... Args> uint64_t Hash(Args... args);

You can pass a bunch of values or objects straight away to it and it marshals them up and passes them to MurmurHash to hash them.¹ In its use in the IndependentSampler, we also allow the user to specify a seed for random number generation; Hash() makes it simple to mush that together with the current pixel coordinates to choose a pseudorandom sequence for the current pixel.
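For readers without the pbrt source at hand, the idea can be sketched with standard-library pieces. This is not pbrt's implementation: `HashSketch` uses FNV-1a as a stand-in for MurmurHash, `std::mt19937_64` stands in for the PCG RNG (so the hash picks a seed rather than a sequence plus an offset), and `ReseedingSampler` is a hypothetical class that exists only for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>

// Fold the raw bytes of one object into the running FNV-1a hash.
inline void HashBytes(uint64_t &h, const void *data, std::size_t size) {
    const unsigned char *bytes = static_cast<const unsigned char *>(data);
    for (std::size_t i = 0; i < size; ++i) {
        h ^= bytes[i];
        h *= 1099511628211ull;  // FNV-1a 64-bit prime
    }
}

// Variadic hash: hashes the in-memory bytes of each argument in order.
template <typename... Args>
uint64_t HashSketch(const Args &...args) {
    uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    int expand[] = {0, (HashBytes(h, &args, sizeof(args)), 0)...};
    (void)expand;  // only used to expand the parameter pack in order
    return h;
}

// Hypothetical sampler whose RNG state is reset at every pixel sample, so
// the values returned depend only on (pixel, sample index, user seed).
class ReseedingSampler {
  public:
    explicit ReseedingSampler(uint64_t seed) : seed(seed) {}
    void StartPixelSample(int x, int y, int sampleIndex) {
        rng.seed(HashSketch(x, y, sampleIndex, seed));
    }
    float Get1D() {
        // 24 random bits scaled by 2^-24: exact and always less than 1.
        return float(rng() >> 40) * 0x1p-24f;
    }

  private:
    uint64_t seed;
    std::mt19937_64 rng;
};
```

Two samplers constructed with the same seed return the same values at a given pixel sample no matter which other pixels each one visited beforehand, which is the property the multithreaded renderer needs.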

There is, needless to say, a short unit test that ensures all of the samplers consistently generate the same sample values.

Other Moments of Randomness

Samplers were much of the trouble in bringing pbrt-v4 into the land of deterministic output, though two other places in the system that made random decisions without the involvement of a sampler needed attention.

First was a stochastic alpha test, deep in the primitive intersection code. For shapes that have an alpha texture assigned to them, we’d like to ignore any intersections where the alpha texture is zero and randomly accept ones with fractional alpha with probability according to their alpha value. The sampler isn’t available in the ray intersection routines and keeping a persistent RNG in that code has obvious problems, so here is what we do instead:

if (Float a = alpha.Evaluate(si->intr); a < 1) {
    // Possibly ignore intersection based on stochastic alpha test
    Float u = (a <= 0) ? 1.f : HashFloat(ray.o, ray.d);
    if (u > a) {
        // Ignore this intersection and trace a new ray
        [...]

Given a less-than-one alpha value, a call to HashFloat() gives a uniform random floating-point value between 0 and 1. It’s a buddy of Hash() and is also happy to take whichever-all values you pass it to turn into a random floating-point value. (Above, it’s the ray origin and direction.)

template <typename... Args>
Float HashFloat(Args... args) {
    return uint32_t(Hash(args...)) * 0x1p-32f;
}

Thus, the results are deterministic for any given ray.

The second case was in pbrt-v4’s LayeredBxDF class, which implements Guo et al.’s algorithm for stochastic evaluation and sampling of the BRDFs of layered materials. That needs an unbounded number of independent random samples, so we instantiate an RNG for each evaluation, but seed it via the incident and outgoing directions. Thus again, for any pair of directions passed to the BRDF evaluation method, the same set of random samples will be generated and the returned value will be deterministic.

Consistent Pixel Sums

With what we have so far, the same rays will be traced each time the renderer runs and in turn, if an assertion fires along the way, it will do so consistently. That’s a big benefit for debugging, but we have not yet achieved deterministic output, which is important for making end-to-end tests maximally useful.

The remaining challenge lies in summing sample values to compute each pixel’s final value. Because floating-point addition is not associative, if the image samples that contribute to a pixel are not accumulated carefully the order of summation may be different across different runs of the program and so the output may change. That was a problem in pbrt-v3 due to how it computed final pixel values: there, the image is decomposed into rectangular regions that are assigned to threads and threads generate samples within their regions, updating the pixels that each sample contributes to.
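A tiny example shows how much the order of summation can matter in float32. The helper below is hypothetical and the values are deliberately extreme; it assumes arithmetic is actually carried out in float32, with no extended-precision intermediates and no fast-math reassociation.

```cpp
#include <vector>

// Accumulate in float32, left to right, the way a renderer might add
// sample contributions to a pixel in whatever order they arrive.
inline float SumInOrder(const std::vector<float> &values) {
    float sum = 0.f;
    for (float v : values)
        sum += v;
    return sum;
}
```

Summing `{1e8f, -1e8f, 1.0f}` gives 1, but summing the same three values as `{1.0f, 1e8f, -1e8f}` gives 0, because the spacing between floats near \(10^8\) is 8 and the 1 is rounded away. Real pixel sums differ by far less, but any difference at all is enough to break bit-exact image comparisons.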

This figure illustrates the problem with that, showing all of the samples that contribute to a particular output pixel (black dot):

We have two threads responsible for adjacent \(4 \times 4\) pixel regions of the image (thick boxes). For an output image pixel near the boundary of the two regions that has a reconstruction filter that is wider than the pixel spacing (shaded circle), some of the samples that contribute will be taken by thread 1 (orange dots) and some will come from samples taken by thread 2 (blue dots). Because the threads are independent, the filtered sample values are not accumulated in a deterministic order and thus, the final pixel value is not deterministic.

pbrt-v4 addresses this issue by adopting Ernst et al.’s filter importance sampling approach. Independent samples are taken for each output pixel, with no sample sharing with other pixels. If only a single thread works on a pixel at a time, then the samples for each output pixel are naturally generated in a consistent order, giving a consistent sum. (Filter importance sampling has a number of additional advantages that are detailed in the paper, including better preservation of the benefits of high-quality sampling patterns.) With that tuned up, we (almost) have deterministic output.

Those Pesky Splats

One more thing… pbrt-v4’s output is not quite deterministic if a light transport algorithm that traces paths starting from the light sources is being used. In that case, light path vertices are splatted into the image at whichever pixel they are visible; if multiple threads end up splatting into the same pixel, then we are back to nondeterminism from unordered floating-point addition.

This issue could be addressed by having each thread splat into its own image and then summing the images at the end, though that would incur a cost in memory use that scales with the number of threads. Alternatively, we might use fixed-point rather than floating-point to store those pixel values. For now that issue is unaddressed; it rarely causes any trouble, especially since those splatted values are accumulated in double precision and generally converted all the way down to half-float precision for storage. Most of the time that loss of precision hides any sloppy sums.
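The fixed-point alternative might look something like the following. To be clear, this is a sketch of an approach that pbrt-v4 does not implement; the class name and the choice of 32 fractional bits are assumptions made for illustration. Integer addition is associative, so the final sum no longer depends on the order in which threads add their splats.

```cpp
#include <atomic>
#include <cmath>
#include <cstdint>

// Hypothetical splat accumulator: each contribution is rounded once to
// fixed point and then added with an atomic integer add.
class FixedPointSplatPixel {
  public:
    void AddSplat(double v) {
        sum.fetch_add(std::llround(v * kScale), std::memory_order_relaxed);
    }
    double Value() const {
        return double(sum.load(std::memory_order_relaxed)) / kScale;
    }

  private:
    static constexpr double kScale = 4294967296.0;  // 2^32: 32 fractional bits
    std::atomic<int64_t> sum{0};
};
```

The price is a fixed range and resolution: with this particular split, sums must stay below \(2^{31}\) in magnitude and contributions smaller than about \(2^{-33}\) round to zero, so the scale would need to be chosen with the expected splat magnitudes in mind.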

The Joys of --debugstart

The greatest benefit of deterministic rendering has been the ability to quickly iterate on bugs: you can add some logging code or more assertions, recompile, and re-render, confident that the new code will see the same inputs as triggered the bug. Samplers that give exactly the same samples at each pixel also mean that you can speed things up by just rendering a crop window or even a single pixel as you’re chasing a bug.

Even better, it was easy to go even further and add support for retracing just a single offending ray path. pbrt-v4 has a CheckCallbackScope class that uses RAII to register a callback function that will run if an assertion fails or if the renderer crashes. Here is how it is used in most of pbrt’s CPU integrators:

thread_local Point2i threadPixel;
thread_local int threadSampleIndex;

CheckCallbackScope _([&]() {
    return StringPrintf("Rendering failed at pixel (%d, %d) sample %d. Debug with "
                        "\"--debugstart %d,%d,%d\"\n",
                        threadPixel.x, threadPixel.y, threadSampleIndex,
                        threadPixel.x, threadPixel.y, threadSampleIndex);
});

As rendering proceeds, each thread keeps its thread-local threadPixel and threadSampleIndex variables up to date and if the renderer aborts due to an error, you get a message like:

Rendering failed at pixel (915, 249) sample 83. Debug with "--debugstart 915,249,83"

at the bottom of the crash output. If you then rerun pbrt passing it that --debugstart option, a specialized code path traces just that single ray path in the main thread of execution. That gives a simpler debugging context than launching a bunch of threads and waiting for the bug to hit again; it’s delightfully helpful for bugs that otherwise only happen after a substantial amount of time has gone by.
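The RAII pattern behind that can be sketched as follows. This is a simplified, single-threaded illustration and not pbrt's CheckCallbackScope: the names `CallbackScopeSketch`, `FailureCallbacks`, and `CollectFailureContext` are invented here, and a real version would need synchronization and a hook into the assertion and crash handlers.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical global registry of callbacks to run when something fails.
inline std::vector<std::function<std::string()>> &FailureCallbacks() {
    static std::vector<std::function<std::string()>> callbacks;
    return callbacks;
}

// Registers a callback on construction and removes it on destruction, so
// the callback is active exactly while the enclosing scope is running.
class CallbackScopeSketch {
  public:
    explicit CallbackScopeSketch(std::function<std::string()> callback) {
        FailureCallbacks().push_back(std::move(callback));
    }
    ~CallbackScopeSketch() { FailureCallbacks().pop_back(); }
    CallbackScopeSketch(const CallbackScopeSketch &) = delete;
    CallbackScopeSketch &operator=(const CallbackScopeSketch &) = delete;
};

// What an assertion-failure or crash handler would call before aborting.
inline std::string CollectFailureContext() {
    std::string out;
    for (const auto &callback : FailureCallbacks())
        out += callback();
    return out;
}
```

Because the callback captures the pixel and sample index by reference, the message always reflects whatever the renderer was working on at the moment of failure.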

Conclusion

We’ve made it past “detecting rendering bugs” and have made our way to “reliably replicating those bugs.” Next time will be a few thoughts about performance bugs before we get into actual debugging techniques.

note

The attentive reader of the Hash() implementation will note that if a struct or class that has padding between elements is passed to it, the results may be nondeterministic since it hashes their in-memory contents directly. It would be nice to use a C++ SFINAE trick to get a compilation error in that case, but I’m not aware of a way to detect that at compile time. ↩

Debugging Your Renderer (4/n): End-to-end tests (or, “why did that image change?”)

2021-12-19T00:00:00-08:00

Here we are, three posts into the meat of this series, and we’re still on the topic of determining if the renderer is buggy in the first place—the actual craft of debugging has not yet seen much discussion. We’re getting there—I promise—but I’m going to finish discussing ways of detecting bugs before getting into fixing them.

Beyond unit tests, I’ve also found that having a good set of end-to-end rendering tests is of enormous benefit. In this context, the idea of an end-to-end test is simple: you render an image of a scene and then check the image to make sure it is correct.

There’s plenty of nuance in that sentence: which scene? (And not just one, right?) How do you check whether the output is correct? Needless to say, it’s “many scenes,” and as we’ll see, verifying correctness from an image can be as much art as science. We’ll dig into all of those questions today.

Building a Library of Test Scenes

I’ve been collecting scenes to use for testing pbrt for at least a decade; there are upward of 600 of them in the test suite today. Most of them don’t make pretty pictures and some output very low resolution images. Some are as small as \(10 \times 10\) pixels—nothing much to look at at all. They can be split into a few categories:

Simple scenes with analytic solutions.
Scenes that target a single renderer feature.
Complex(ish) scenes.
Reproduction cases for user-reported bugs.

Each type is valuable. Take the scenes with analytic solutions: one such scene is a diffuse sphere with radius 1, a reflectance of 0.5, and a point light with intensity \(\pi\) at its center. Put a camera inside that thing and render it with your path tracer: if your pixels don’t all have a value very close to 1 (given sufficient samples), you’ve got a bug. Stop right there, fix it, and be happy you had such an easy way to detect something was off.

You can take that scene and easily make variations of it. Replace that single point light with four point lights with intensities that sum to \(\pi\); that should be all ones as well. Or take out the point light and make the interior of the sphere emissive with spatially- and directionally-uniform radiance of 0.5, leaving the diffuse reflectance at 0.5. Once again, you should get pixels that are all 1. That emissive sphere you can make bigger or smaller; it should be all ones if you make a variant with a different radius.

Once you start thinking in terms of scenes where you can work out the correct answer, there’s lots more you can do. You could light a diffuse quad with an infinite light source and then again with an emissive sphere surrounding it. You could test your bidirectional algorithms by putting a glass sphere with an index of refraction of 1 around the quad; in principle, that should have no effect.

And then you can also make variants of those variants that exercise all of the different sample generation algorithms and light transport algorithms; each of those is just a small change to the scene description file, so getting up to 600 doesn’t need to go one at a time.

The analytic scenes rarely fail once you’ve gotten them working the first time, but when they do, the debugging problem is a relatively easy one—much nicer than “images of the Moana Island scene are too dark when the bidirectional path tracer is used.” For example, for the scene with a single point light, every ray should return the same value—at each intersection point, the reflected radiance due to direct lighting should be 0.5 and then the indirect radiance (also 0.5) should be scaled by 0.5. (Expand out that series and you get your expected value of 1.)
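Spelled out, using notation introduced just for this derivation (\(\rho = 0.5\) for the reflectance, \(I = \pi\) for the light’s intensity, \(r = 1\) for the sphere’s radius): light from the center arrives at normal incidence everywhere, so the irradiance is \(E = I / r^2 = \pi\) and the directly reflected radiance is \[ L_\mathrm{direct} = \frac{\rho}{\pi} E = 0.5. \] Every point on the inside of the sphere sees the same outgoing radiance \(L\) from all directions, and a diffuse surface reflects a fraction \(\rho\) of that, so \[ L = L_\mathrm{direct} + \rho L = \sum_{k=0}^{\infty} 0.5 \cdot 0.5^k = \frac{0.5}{1 - 0.5} = 1. \] The emissive variant has the same form, with the emitted radiance of 0.5 taking the place of \(L_\mathrm{direct}\).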

Of course, those scenes may all render correctly and you may well still find that the Moana Island scene is still too dark with your bidirectional path tracer, but you’ve at least carved off the easy-to-fix cases in a way that makes them easy to debug.

For most of the renderer’s capabilities, it’s not too hard to come up with a simple scene that targets that feature without exercising too many other parts of the renderer. Those are also useful to have in end-to-end tests. As an example, pbrt’s test suite includes a scene comprised of a single quad with a high-frequency texture viewed at an oblique angle. The BSDF is diffuse, it’s lit by a directional light, there’s no complex visibility or multiple light scattering. This is it:

That scene is effectively a test of pbrt’s ray differentials and texture filtering code. If one makes a change to the renderer and then that scene goes bad, you can make a good guess about where the bug lies from the limited subset of the rendering code that runs in generating it. In such a case, if scenes without textures still render correctly then you have a stronger hint, though if those are also broken, then you have a hint that texture filtering isn’t your problem. (Or, that you have multiple problems.)

Sometimes things only go wrong in the presence of complexity; a number of scenes culled from the pbrt-v4-scenes distribution and added to the end-to-end tests take care of that. When those scenes fail, it’s usually the case that simpler ones do as well. If not, it’s often worth trying to simplify the more complex scene as much as possible while still hitting the bug; that, too, is a source of more test scenes for the future. (More on that topic in a future post as well.)

Finally, there are the scenes from user bug reports. I add all of those to the test suite; not only are they all cases that testing previously wasn’t rigorous enough to catch, but there’s no reason to risk the embarrassment (on this end) and annoyance (on the bug reporter’s end) of that same bug reappearing in the future due to a change to the renderer inadvertently reintroducing it.

There is a time versus coverage trade-off in assembling this collection of scenes: the more scenes you have with the more pixels to render and the more samples per pixel, the more you’re exercising the renderer. Yet, the more of all of that you have, the longer it takes to run the tests. If running them takes too long, you won’t run them as often as you should. I’ve ended up tuning them to be about an hour of single-core CPU time (though they run on multiple cores, so it’s just a few minutes of wall-clock time). As you add scenes and the total time to run all of them increases, you can judiciously reduce the resolution of some of the tests or dial down the sampling rate used when rendering them.

Does Everything Render to Completion?

So you have a few tens or hundreds of test scenes and, let’s hope, a script to render all of them and save the images. What now? Run that script and see what happens.

Most of the corners of the renderer’s code end up being fairly well exercised if you have hundreds of varied scenes designed to exercise it.¹ That’s good news for your assertions, as far as giving them plenty of variety to assert about. It’s also encouragement to add more assertions; sometimes adding a new assertion and running through all of the existing test scenes will unearth a new failure. You might even add expensive assertions for a single run-through of the test scenes to see if they find anything, planning to debug if so and to remove them or demote them to debug-only assertions when you’re done.

Finding a failing assertion in that way really is a good thing, even though you’ve found more work for yourself. You’ve got yourself a debugging task ahead of you but it’s not completely open ended, and it’s on your own terms without the panic of a user reporting a serious bug where you have no idea what the cause may be. It’s also likely with a simpler scene than a user would have been rendering if they encountered the bug later.

Assertions aside, the renderer may crash for some or even all of the scenes. Same deal with that: a crash is not fun, but better to find it yourself while running the tests and fix it before your users are bothered by it.

A good collection of test scenes is also good fodder for tools like valgrind, helgrind, and assorted sanitizers. There’s a much better chance of those sorts of tools finding something if you give them a variety of rendering computations to examine. Chasing down any errors those report is also something you must do before proceeding when you find them: there’s no way to know how much havoc lies in their wake, so you might as well fix them once you’re aware of them, lest you spend hours chasing down some other bug that turned out to be due to one of those.

Are the Images Correct?

If all of the scenes render to completion, now you have a few hundred images sitting on disk. How do you know if each one is correct?

For pbrt’s test scenes, I maintain a set of “golden” images that provide a reference.² The test script then checks the output from the current version of the renderer with the golden images. How tricky could that be? The first hard problem is generating golden images in the first place. The second is determining if a rendered image is correct. We’ll consider both topics in turn.

Creating an initial set of golden images is a bootstrapping problem. For the scenes with analytic solutions you can manually verify correctness via their pixel values, but for the rest it’s not so easy. I have partially been able to sidestep that issue by assuming that the last released version of pbrt is bug free and using its output as a starting point. While pbrt is surely not bug free, after it has been out for a few years enough people have spent enough time with the code that it’s reasonable to assume it’s in pretty good shape.

For a different renderer, one might try using the output of pbrt or another renderer as an initial reference, though that’s tricky business, with differences in BSDF models, texture filtering, and details like rendering in RGB versus using spectra. One can at least make sure that one’s renderer is in the right ballpark that way, if another renderer is both trusted and well-understood and if it’s not too hard to render scenes in both it and your own renderer.

Another option is to gain confidence in candidate golden images via experiments. We’ll come back to this topic in more detail once we get to debugging techniques, but to understand the idea, let’s consider that texture filtering test from before. Say that you’ve implemented ray differentials and a texture filtering algorithm and can render images that aren’t obviously wrong. Lacking a verified solution, how can you become more confident that they are correct?

You might render the scene with no texture filtering but with many pixel samples to get an antialiased image that way. That’s something to compare to. You know that your implementation won’t match that perfectly, but if it’s too far off you might be suspicious of your differentials’ correctness. Another useful technique is to explore the parameter space: render it once with your implementation, then again with your texture filter widths half as wide as you think they should be, then again with them twice as wide. You should see aliasing with the narrow filters and blurring with the wide ones. If so, you have some more confidence in your implementation, and if not, you have something to dig into further.

Here are some images that show the results of applying that approach for the texture filtering test above; the images are as we would expect.

At minimum one may decree that the output of the renderer at some point in time gives the golden images. Going forward, any deviation in them should be explained, either from fixing a bug or from a well-understood improvement to the renderer.

When the Images Should Not Change at All

Given golden images, a change to the renderer, and a run of the end-to-end tests, we have a set of new images that may or may not match the golden images. How one feels about that depends on the sort of change one has made. Here are a few representative cases where not a single pixel of a single image should be different:

A pooled memory allocator was introduced to optimize small memory allocations.
An optimized routine for parsing text floating-point values in the scene description was adopted.
The function that loads image texture maps has been parallelized to reduce start-up time.

For all of those cases, there’s no reasonable explanation for why anything should change in the final output, yet sometimes you make a change like that and find differences. If it’s major differences, then presumably you’ve broken something fundamental; the debugging problem in those cases is often not too bad due to the wide impact. Choose one of the simplest scenes that went astray and take it from there.

For minor differences, it’s also critical to understand what happened. It can be hard to be disciplined about that: if just one pixel in one scene, out of hundreds of scenes with perhaps billions of pixels, changes after you replaced the float parser, it’s easy to tell yourself that a single float was parsed differently and hey, quite possibly you just fixed a bug you didn’t know you had. Yet something more serious may be lurking; it may just be that your tests only hit a buggy case once but other scenes would hit it often. If you don’t understand the root cause, you’re building the rest of the system on sand.

For the case of the float parser, it’d be crucial to track down which float (or floats) went astray and why—keep both parsers around, call both for each float parsed, and assert that both give the same result. When they disagree, figure out which one was correct. Your assertion may never fire, which would be “interesting” as well; it may be that the pixel change was not due to a difference in parsing floats but was due to some other bug that was tickled by your changes. Those sorts of bugs aren’t fun to chase down but are equally important to understand when you encounter them.
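Here is a rough sketch of that "run both and compare" idea. Everything in it is an assumption for illustration: the reference parser is just the C library's `strtof()`, and `ParseFloatCandidate` is a deliberately simple stand-in for whatever new parser is being adopted; neither is pbrt's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Reference ("old") parser: the C library's strtof().
float ParseFloatReference(const std::string &s) {
    return std::strtof(s.c_str(), nullptr);
}

// Hypothetical stand-in for the new parser. It only understands
// [-]digits[.digits] and does no validation of its input.
float ParseFloatCandidate(const std::string &s) {
    std::size_t i = 0;
    bool negative = (i < s.size() && s[i] == '-');
    if (negative) ++i;
    double mantissa = 0, scale = 1;
    bool seenPoint = false;
    for (; i < s.size(); ++i) {
        if (s[i] == '.') { seenPoint = true; continue; }
        mantissa = 10 * mantissa + (s[i] - '0');
        if (seenPoint) scale *= 10;
    }
    double v = mantissa / scale;
    return float(negative ? -v : v);
}

// Bit-for-bit comparison, so that -0 and +0 count as different and
// identical NaNs count as equal (unlike with operator==).
bool SameBits(float a, float b) {
    return std::memcmp(&a, &b, sizeof(float)) == 0;
}

// Parse with both implementations; report and abort if they disagree.
float ParseFloatChecked(const std::string &s) {
    float ref = ParseFloatReference(s), cand = ParseFloatCandidate(s);
    if (!SameBits(ref, cand)) {
        std::fprintf(stderr, "Float parsers disagree on \"%s\": %.9g vs %.9g\n",
                     s.c_str(), ref, cand);
        std::abort();
    }
    return cand;
}
```

Printing the offending string along with both results at nine significant digits gives you the specific float to investigate, which is the whole point of the exercise.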

Implicit in these imperative statements about no pixels changing has been the assumption that the renderer is deterministic—that rendering the same scene gives exactly the same output image. For now we will take that as given. Making the renderer so is tricky but worthwhile; that will be the sole topic of the next post in this series.

When the Images May Change

Whenever changes are made to code involving ray tracing, other geometric computations, or light transport algorithms, it’s almost inevitable that images will change. This brings us to the tricky question of “are those changes ok, or suggestive that there is a bug?”

To motivate this case, let’s consider a (real) example: making what is believed to be an improvement to the algorithm that makes sure that rays leaving bilinear patches do not incorrectly reintersect the patch. Assuming that we had a reasonable algorithm for this previously, we would expect very small changes in the images for every scene that has bilinear patches in it, but would not expect any big image changes. (Though we might hope to have a scene that shows a case where the current algorithm is insufficient, in which case we would hope for significant and visually evident improvement with it.)

My testing script uses pbrt’s imgtool program to compare the output images to the golden images. It prints nothing when they match exactly, so if you’ve just changed the float parser, you might run the end-to-end tests, wait for them to finish, and move along happily if nothing is reported. When there is a discrepancy, imgtool’s output is like this:

That output is carefully crafted. The three lines in turn:

The news: the images are different.
The pathnames of the two images, relative to the current directory. These are there alone and together on a line so that it’s easy to triple click that line to select it, then type the name of an image viewer in the shell, paste the selection, hit return, and then view the two images.
Numerical details about the images and how they differ.

Those details include the average value of all of the pixels in each image, their percentage difference, and their mean squared difference. Often those numbers alone are enough to indicate what’s going on. If we saw something like the above for all of the test scenes that had bilinear patches in them (minuscule differences in average pixel values and MSE), we could be fairly confident that all was well. It would still be worth a quick glance at a few of the images, but there would be no need to view all of them to feel good about the change.
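To make those quantities concrete, here is a small sketch of how one might compute them for two images stored as flat arrays of linear float values. This is not imgtool's implementation; the struct, the function name, and the exact definition of the percentage difference (relative to the first image's average) are my assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ImageDiffStats {
    double avgA, avgB;   // average pixel value of each image
    double percentDiff;  // 100 * (avgB - avgA) / avgA (assumed definition)
    double mse;          // mean squared per-pixel difference
};

// Both images must have the same, nonzero number of values.
ImageDiffStats CompareImages(const std::vector<float> &a,
                             const std::vector<float> &b) {
    assert(a.size() == b.size() && !a.empty());
    double sumA = 0, sumB = 0, sumSq = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sumA += a[i];
        sumB += b[i];
        double d = double(a[i]) - double(b[i]);
        sumSq += d * d;
    }
    double n = double(a.size());
    ImageDiffStats s;
    s.avgA = sumA / n;
    s.avgB = sumB / n;
    s.percentDiff = (s.avgA != 0) ? 100 * (s.avgB - s.avgA) / s.avgA : 0;
    s.mse = sumSq / n;
    return s;
}
```

Accumulating in double precision is a deliberate choice here: with millions of pixels, summing in single precision can introduce error comparable to the small differences you are trying to measure.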

With that workflow in mind, imgtool offers some color in its output to make it easier to see higher levels of error. Here’s what it said about another scene after I made that change:

That red text says “this seems a little high”, and indeed it is so—here are the corresponding images:

Something funny is happening at the boundary of the liquid at the top of the cup; it is evident that one of the two images must be wrong, though it isn’t obvious which one is. Time to start debugging.

Using Statistics to Your Advantage

For the bilinear patch intersection example, the image statistics are useful for giving a good first indicator of “all is well”, “something may be fishy here”, or “things are Not Good.” That is plenty useful, but when one is making changes to Monte Carlo sampling code, those numbers have even greater value. Consider improving a BRDF importance sampling routine to better match the BRDF. In that case, we hope for significant image changes for the better thanks to lower error. How do we distinguish between an improvement in error and an incorrect result?

Just looking at the images may not be enough. Consider these three images of the San Miguel scene where the first is the baseline reference and the others correspond to two different changes to the renderer, one correct and one buggy. It’s not evident from just looking at the images which one is wrong.

However, imgtool has something interesting to report: the average pixel value of “Change A” is 0.17% higher than the reference image, but the average pixel value of “Change B” is 4.03% higher. In the context of unbiased Monte Carlo, a 4% change is most definitely a sign of something going wrong.

One way to think about why this is so is that if you’re using unbiased Monte Carlo algorithms, rendering images of thousands of pixels, each with tens of samples, then you have hundreds of thousands or even millions of sample values that feed into that average. If you have changed your importance sampling routines (and your estimators don’t have ridiculously high variance), then those average image values should be well locked in if both “before” and “after” are bug-free.

That idea also explains why that San Miguel test has a fairly low sampling rate—just 16 samples per pixel. You often don’t need to render the whole image to convergence to tell if the Monte Carlo bits have gone wrong; the statistics over all of the pixels often tell the tale.

But how do you know how much of a change is acceptable? Is that 0.17% something to worry about? In practice, the answer depends on how many samples you’re taking and how much variance there is in your estimators. For pbrt’s tests, I’ve learned to have a sense of what’s expected, but that’s admittedly imprecise. A much better way would be to follow the ideas presented in Kartic Subr and Jim Arvo’s paper on applying proper statistical tests to these tasks. They show not only the right way to decide if two images have the same mean, accounting for the number of samples taken in setting a threshold, but also how to robustly determine the answer to questions like “does image a have lower variance than image b?”
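To give a flavor of what "accounting for the number of samples" means, here is a sketch of the textbook Welch's t statistic for comparing two sample means. To be clear, this is a generic test that I'm swapping in for illustration, not the procedure from the Subr and Arvo paper, and treating values as independent samples is itself an assumption; the names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MeanVar {
    double mean, var;  // sample mean and unbiased sample variance
    std::size_t n;     // number of samples
};

// Requires at least two values.
MeanVar ComputeMeanVar(const std::vector<double> &x) {
    double sum = 0;
    for (double v : x) sum += v;
    double mean = sum / x.size();
    double ss = 0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return {mean, ss / (x.size() - 1), x.size()};
}

// Welch's t statistic: the difference of the means measured in units of
// the standard error of that difference.
double WelchT(const MeanVar &a, const MeanVar &b) {
    double se = std::sqrt(a.var / a.n + b.var / b.n);
    return (a.mean - b.mean) / se;
}
```

The useful property is that the standard error in the denominator shrinks as the sample counts grow, so the same 0.17% difference in means can be unremarkable at low sample counts and suspicious at high ones.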

For all of these evaluations of images, it’s crucial that images are stored in floating point, not clamped, and without any tone mapping or gamma correction. When you’re making images for people to look at, you’re more than welcome to use 8-bit PNGs and run your pixels through the ACES curve for a “filmic” look. For the purposes of end-to-end tests, maintaining good old linear values with their full dynamic range is the only thing that allows you to reason about what’s going on with them statistically.
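A tiny made-up example shows why clamping in particular is a problem for those statistics: one bright pixel carries much of the image's energy, and clamping it changes the average substantially. The helper names and pixel values below are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Average of all values, accumulated in double precision.
double Mean(const std::vector<float> &v) {
    double sum = 0;
    for (float x : v) sum += x;
    return v.empty() ? 0 : sum / v.size();
}

// Returns a copy of the image with every value clamped to [0, 1], as
// would happen when writing to a typical 8-bit format.
std::vector<float> ClampTo01(std::vector<float> v) {
    for (float &x : v) x = std::min(1.0f, std::max(0.0f, x));
    return v;
}
```

For the four pixel values 0.25, 0.25, 0.25, and 3.25, the true average is 1, but the average after clamping is 0.4375; any comparison of averages made on the clamped data would be meaningless.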

Finally, even if the numbers look good, it’s still important to view the images, or at least all of those ones with the greatest reported differences. A shortcoming of those image-wide statistics is that they don’t indicate whether the error has some unsightly structure to it that is sneaking under the radar. One way to better automate that test would be to also use a perceptual error metric like ꟻlip, though that requires high-quality reference images, which pbrt’s end-to-end tests currently avoid in the interests of running more quickly.

Conclusion

This has turned into a longer post than I intended and there’s still plenty more to say, especially about the tricky problem of having two rendered images and trying to figure out what their differences signify. We will most certainly come back to that in following posts since it is frequently integral to the renderer debugging process.

The best thing about having a good set of tests, both unit and end-to-end, is being able to iterate on code with confidence. You can refactor swaths of the system, you can clean up things that are a little grungy, and if the tests are clear, you can feel confident about committing those changes. Sometimes you can try out speculative ideas (things where you’re not sure if the idea is right) and quickly gather some empirical data about whether the idea works or not. If those indicators are promising and you pursue your idea you should still find better ways to validate it, but I’ve found that a quick yes/no can be a helpful guide.

Next time we’ll go into the details of making a renderer deterministic, which is one of the foundations of everything discussed today. That post will certainly be less to digest than this one was.

notes

The Right Thing to do would be to use a tool that measures code coverage, see which parts of the renderer never or rarely run given your test scenes, and to introduce new scenes intentionally to exercise that code. Admittedly, I have not yet found that discipline for pbrt. ↩
Note that the golden images must be generated from scratch for each operating system and compiler used, as differences in details like precision in the system math library usually lead to minor image differences across different systems. ↩

Debugging Your Renderer (3/n): Assertions (and on not sweeping things under the rug)

2021-12-02T00:00:00-08:00

Today we’ll keep the discussion to the topic of runtime assertions in renderers; next time it’ll be on to end-to-end tests, which will start to lead us into a more image-focused view of graphics debugging that will keep us busy for a while.

A principle in the last post on unit testing for renderers was the idea that you’d like your debugging problem to be as simple as possible; one way to achieve that is if bugs manifest themselves in a way other than “some of these pixels don’t look right…” While there will always be plenty of that sort of bug, those are usually a much harder debugging problem than a conventional one like “the program printed an error and crashed.” A good set of runtime assertions can be an effective way to turn obscure bugs into more obvious ones.

An assertion is a simple thing: a statement that a condition is always true at some point in the execution of a program. It seems that the original idea of them dates to Goldstine and von Neumann in 1947.¹ If such a statement is ever found to be false, then a fundamental assumption underlying the system’s implementation has been violated. The implications, whether for the performance of the program or for the correctness of its output, may be wide-ranging and possibly impossible to recover from. Assertions are a great way to catch little things early before they turn into big things that are only evident much later.

In contrast to unit tests, which just have to be fast enough to not be annoying to run often, assertions must be efficient, since they often run in the innermost loops of the renderer. In return, they have the advantage that they can check many more situations than a unit test. It turns out that a myriad of unexpected edge cases come up as you trace billions of rays in many different scenes. Yet an assertion that has no chance of firing is only a drag on overall performance without offering any value. The art is to write the ones that you don’t think will ever fire but that sometimes do.

For a well-written general discussion of assertions, see John Regehr’s blog post on the topic.

The Basics

While C++ provides an assert macro in the standard library, it has a few shortcomings:

Assertions are either enabled or disabled, via the NDEBUG macro. Often, they are disabled completely for optimized builds, which in turn means that they run rarely and do not catch many bugs.
When an assertion fails, only the text of the assertion (e.g., “x > 0”) and its location in the source code are printed, without any further context.

pbrt-v4 therefore has its own set of assertion macros, which are also integrated with pbrt’s runtime logging system. pbrt’s assertion macros are based on those in Google’s glog package. They include assertions that are always enabled, even in release builds, and those that are only for debug builds, where more costly checks may be acceptable. They also provide much more helpful information than assert() does when an assertion fails.

Beyond a basic Boolean assertion (CHECK()), there are separate assertions for checking equality, inequality, and greater-than/less-than. For example, CHECK_GE() checks that the first value provided to it is greater than or equal to the second. Here is an example of its use in pbrt:

CHECK_GE(1 - pAbsorb - pScatter, -1e-6);

There’s a bit of context packed into that simple check: we have two probabilities, pAbsorb and pScatter, and if you look at the code before it you can see that the light transport algorithm has just computed three probabilities where the third, pNull, is 1 - pAbsorb - pScatter. Thus, the assertion is effectively making sure that we are using valid probabilities when computing pNull.

More broadly, that check is in the context of pbrt’s code for sampling volumetric scattering. That code requires that the volumetric representation provide a majorant that bounds the density of the volume over a region of space. The CHECK_GE() then is effectively checking that the majorant is a valid bound. Thus, it’s really a check on the validity of the code that computes those bounds, which is far away in the system from where the check is made.

While that decoupling has the disadvantage that a failing assertion may require searching to find the code actually responsible for the bug, the advantage is that the check is made at every sample taken in every volumetric medium that is provided to pbrt for rendering; it gives the majorant computations a thorough workout. That check has found many bugs in that code since it was introduced; there are plenty of corner cases in the majorant computations, especially when you’re doing trilinear interpolation, which requires considering a larger footprint, and also using the nested grid representation of NanoVDB.

If that assertion fails, pbrt dumps more information than just the text of the assertion:²

[ tid 12129819 @     1.252s cpu/integrators.cpp:1004 ]
    FATAL Check failed: 1 - pAbsorb - pScatter >= -1e-6
        with 1 - pAbsorb - pScatter = -0.3336507, -1e-6 = -0.000001

In addition to the id of the thread in which the assertion failed, we have the elapsed time since rendering began (about 1.25 seconds here), the location of the assertion in the source code, what was asserted, as well as both of the values that were passed to CHECK_GE(). Having those values immediately at hand is often helpful. In the best case, one can understand the bug immediately, for example by seeing that an edge case that had been assumed to be impossible actually happens in practice. For this one, knowing whether the value was slightly outside of the limit or far outside of the limit (as it was here) may be a good starting point for further investigation.
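For a sense of how a macro can report both the expressions and their values, here is a stripped-down sketch. It is not pbrt's or glog's implementation (it omits the thread id, timing, and stack trace entirely), and `MY_CHECK_GE` and `CheckGEMessage` are names I made up for it.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <string>

// Returns an empty string if a >= b holds; otherwise returns a message
// that includes both the source text of the operands and their values.
// Note that a NaN operand makes a >= b false, so the check fails.
template <typename A, typename B>
std::string CheckGEMessage(const char *aExpr, const char *bExpr,
                           const A &a, const B &b) {
    if (a >= b) return "";
    std::ostringstream os;
    os.precision(9);  // enough digits to distinguish any two floats
    os << "Check failed: " << aExpr << " >= " << bExpr << " with "
       << aExpr << " = " << a << ", " << bExpr << " = " << b;
    return os.str();
}

// The # operator captures each argument's source text; each argument is
// evaluated exactly once.
#define MY_CHECK_GE(a, b)                                            \
    do {                                                             \
        std::string msg_ = CheckGEMessage(#a, #b, (a), (b));         \
        if (!msg_.empty()) {                                         \
            std::fprintf(stderr, "%s:%d FATAL %s\n", __FILE__,       \
                         __LINE__, msg_.c_str());                    \
            std::abort();                                            \
        }                                                            \
    } while (false)
```

Splitting the message formatting into a function keeps the macro small and makes the formatting testable without having to crash the program.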

A full stack trace then follows; that, too, can give a useful first pointer for understanding the issue. It is especially useful in still getting something from bug reports from users when it’s not possible to reproduce a bug locally as well as when pbrt is used for assignments in classes. In the latter case, the conversation often goes something like this:

“pbrt is buggy! It crashes when I call the function to normalize a vector.”
“That’s interesting–what does it print when it crashes?”
(pbrt’s output)
“That’s not a crash; it’s a failing assertion. The problem is that the foo() function that you added there is passing a degenerate vector to the vector normalization routine.”

Given that students often don’t seem to read that output in the first place, I’m not sure if any lessons are being learned about the value of assertions through that exercise, but you can at least work through that cycle much more quickly if it doesn’t require the student to fire up the debugger to provide more information.

Resilience Versus Rigidity

When an assertion fails, a program generally terminates. That’s a harsh punishment, especially if the program is well into a lengthy computation. One can treat failed assertions as exceptions and terminate just part of the computation (and maybe just a small part, like a single ray path), or one can also try to recover from the failing case and go on. How to approach all this is something of a philosophical question.

A widely-accepted principle about assertions is that they should not be used for error handling: invalid input from the user should never lead to an assertion failure but rather should be caught sooner (and a helpful error message printed, even if the program then terminates). An assertion failure should only represent an actual bug in the system: a mistake on the programmer’s side, not on the user’s, even if something goofy provided by the user is what tripped up the program. That to me seems like an unquestionably good principle.

But even with assertions limited to errors in the implementation, what else might one do when one fails? One might try to recover, patching over the underlying issue (for example, forcing the third probability to zero in the majorant case), but that approach isn’t fully satisfying. One issue is that the code paths for the error cases will only run rarely, so they won’t be well tested—it’s then hard to have confidence in their correctness.

For a commercial product (or one that is not open source), not annoying your users with an unexpected program termination is probably a good idea, though I have to say that in my experience the error handling you get is often not much better.

More optimistically, assertion failures represent useful data points. Papering over them is ignoring evidence of a deeper issue. Perhaps your code for recovering from the failed assertion is running all the time and there’s a massive bug lurking but you have no idea it exists in the first place.

So I have come to believe that the best approach is to be strict, at least for a system like pbrt. Include error handling code to deal with invalid user input, add cases as necessary to make your algorithms general-purpose and robust, but when things go wrong in a way that you hadn’t thought was possible, don’t try to muddle through it—fail if a null vector is to be normalized and abort if the majorants are seriously off. Those sorts of unexpected cases merit investigation and resolution. By making them impossible to ignore you reduce the chance of letting something serious fester for a long time. It’s an annoyance in the moment, but it makes the system much more robust in the end.

Track Down Rare Failures(!)

About not letting things fester… One of the reasons I’ve come to the rigidity view is an experience I had with the first version of pbrt. That version was more on the resilience side of things, or perhaps it was just negligence. Over the course of rendering the image below it would always print a handful of warnings about rays having not-a-number (NaN) values in their direction vectors.

I expected that something obscure was occasionally going wrong in the middle of BSDF sampling but I didn’t dig in for years after first seeing those warnings. Part of my laziness came from the (correct) assumption that it would be painful debugging since the warnings didn’t appear until rendering had gone on for some time. The underlying bug didn’t seem important to fix since it happened so rarely.

Eventually I chased it down. As with many difficult bugs, the fix was a single-character change: a greater than that should have been a greater than or equals, with “equals” being a case that otherwise led to a division by zero.

        // Handle total internal reflection for transmission
-       if (sint2 > 1.) return 0.;
+       if (sint2 >= 1.) return 0.;

When I rendered that scene afterward, not only were the warnings gone, but the entire rendering computation was \(1.25\times\) faster than it was before. I couldn’t understand why that would be so and spent hours trying to figure out what was going on. At first I assumed the speedup must be due to something else, like a different setting for compiler optimizations, but I found that it truly was entirely due to that one-character fix.

Eventually I got to the bottom of it. Here is where things were going catastrophically wrong; with a few lines of code elided, this is the heart of the kd-tree traversal code in pbrt-v1:

int axis = node->SplitAxis();
float tplane = (node->SplitPos() - ray.o[axis]) * invDir[axis];
// ...
if (tplane > tmax || tplane <= 0) {
    // visit first child node next
} else if (tplane < tmin) {
    // visit second child node next
} else {
    // enqueue second child to visit later and visit first child next
}

Consider that code through the lens of not-a-number. There are two rules to keep in mind: a calculation that includes a NaN will yield a NaN, and any comparison that includes a NaN evaluates to false, with the exception of !=, which evaluates to true. (Thus, the fun idiom of testing x == x as a way to check for a NaN: it is false only when x is a NaN.) Above, tplane will be NaN since the inverse ray direction is NaN. The condition in the first “if” test will be false, since both comparisons include a NaN. The condition in the second “if” test will also be false. In turn, the third case is always taken and every node of the kd-tree will be visited.

Thus, a NaN-direction ray is intersected with each and every primitive in the scene. For a complex scene, that’s a lot of intersection tests and thus, the performance impact of just a handful of those rays was substantial. Good times.
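The branch selection can be isolated into a few lines to see this behavior directly. This is a sketch that mirrors the structure of the conditionals above rather than the actual pbrt-v1 code; the enum and function names are made up, and it assumes IEEE floating-point semantics (so no compiler flags like -ffast-math).

```cpp
#include <cassert>
#include <cmath>

enum class Visit { FirstOnly, SecondOnly, Both };

// Same three-way test as the traversal code, with the branch bodies
// replaced by a return value saying which children would be visited.
Visit ChooseChildren(float tplane, float tmin, float tmax) {
    if (tplane > tmax || tplane <= 0)
        return Visit::FirstOnly;
    else if (tplane < tmin)
        return Visit::SecondOnly;
    else
        return Visit::Both;  // a NaN tplane always ends up here
}
```

With a NaN for tplane, the function returns `Visit::Both` regardless of tmin and tmax, which is exactly the "visit everything" behavior described above.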

Conclusion

Here we are with two posts in a row that are comprised of me arguing for a particular way of doing things and then ending with a story about me not practicing what I’m preaching. One could take this to mean that I don’t know what I’m talking about, or one could take it to mean that my pain has the potential to be your gain. Either way works for me.

More generally, I’ve come to learn that if something seems a little stinky or uncertain in code, it really is worth stopping to take the time to chase down whether there is in fact something wrong. You have in hand evidence of a problem in a particular place in a system—that’s valuable. If you ignore it and there is a bug there, often that bug will later manifest itself in a way that’s much more obscure, maybe not evidently connected to that part of the system at all. You end up spending hours chasing it down just to discover that if you had investigated the questionable behavior when you first encountered it, you’d have fixed the underlying issue much earlier and much more easily.

notes

Goldstine and von Neumann. 1948. Planning and Coding of Problems for an Electronic Computing Instrument. Technical Report, Institute for Advanced Study. ↩
To my previous frequent frustration, the CHECK macros in Google’s glog package do not print floating-point values with their full precision, which leads to error messages like Check failed: x != 0 with x = 0 being printed when x is very small but not actually zero. This is another reason pbrt provides its own CHECK macros. ↩

Debugging Your Renderer (2/n): Unit Tests

2021-11-26T00:00:00-08:00

Here we are, a year and a half after I posted an introduction that was full of talk about a forthcoming series of blog posts about debugging renderers. When I posted that I already had a text file full of notes and had the idea that I’d get through a series of 8 or so posts over the following few weeks.

…and it’s been nothing but crickets after that setup.

There’s no good reason for my poor follow-through, though this series did turn into one of those things that got more daunting to return to the longer time went by; I felt like the bar kept getting higher and that my eventual postings would have to make up for the bait and switch.

Now that I’m at it again, I can’t promise that these posts will make up for the wait; in general, you get what you pay for around here. But let’s reset and try getting back into it.

To get back in the right mood, here are a pair of images back from the first time I tried to implement Greg Ward’s irradiance caching algorithm back when I was in grad school:

In the left image (which was rendered from right to left for some reason), there was a bug that caused energy to grow without bound as the cache was populated (no doubt a missing factor of \(1/\pi\) that led to a feedback loop). I always liked how that image went from ok to a little too bright to thermonuclear by the time it was halfway through. The image on the right is my eventual success, with a slightly different scene layout.

Avoiding The Bad Place

There’s nothing fun about an image that starts out ok and then goes bad, or about your renderer crashing after it’s been running for an hour with a stack trace 20 levels deep. There’s lots to be unhappy about:

Things are broken, but they’re not utterly broken, which suggests that the underlying bug will be subtle and thus difficult to track down.
There’s an enormous amount of state to reason about—the scene in all its complexity, all of the derived data structures, and everything that happened since the start of rendering until things evidently went wrong. Any bit of it may hold the problem that led to disaster.
More specifically, the actual bug may be in code that ran long before the bug became evident; some incorrect value computed earlier that messed things up later, possibly in an indirect way. This is a particular challenge with algorithms that reuse earlier results, be it spatially, temporally or otherwise.
It may be minutes or even hours into rendering before the bug manifests itself; each time you think you’ve fixed it, you’ve got to again wait that much longer to confirm that you’re right.

Anything you can do to avoid that sad situation reduces the amount of time you spend on gnarly debugging problems and, in turn, makes you more productive (and lets you have more fun, actually implementing fun new things rather than trying to make the old things work correctly). That goal leads to the first principle of renderer debugging:

Try to make it a conventional debugging problem (“given these inputs, this function produces this incorrect output”) and not an unbounded “this image is wrong and I don’t know why” problem.

One of the best ways to have more bugs be in the first category is to have a good suite of unit tests. There’s nothing glamorous about writing unit tests, at least in the moment, but they can give you a lot in return for not too much work. Not only does a failing unit test immediately narrow down the source of a bug to the few things that the test exercises, but it generally gives you an easier debugging problem than a failure in the context of the full renderer.

Starting Simple

A good unit test is crisp—easy to understand and just testing one thing. Writing tests becomes more fun if you embrace that way of going about it—it’s easy coding since the whole goal is to not be tricky, with the idea that you want to minimize the chance that your test itself has bugs. A good testing framework helps by making it easy to add tests; I’ve been using googletest for years, but there are plenty of others.

It’s good to start out by testing the most obvious things you can think of. That may be counter-intuitive—it’s tempting to start with devious tests that poke all the edge cases. However, if you think about it from the perspective of encountering a failing test, then the simpler the test is, the easier it is to reason about the correct behavior, and the easier debugging will be. (There is an analogy here to the old joke about the drunk searching for his car keys under the street light.) Only once the basics are covered in your tests is it worth getting more clever. If your simpler tests pass and only the more complex ones fail, then at least you can assume that simple stuff is functioning correctly; that may help you reason about why the harder cases have gone wrong.

Here is an example of a simple one from pbrt-v4. pbrt provides an AtomicFloat class that can atomically add values to a floating-point variable.¹ This test ensures that AtomicFloat isn’t utterly broken.

TEST(FloatingPoint, AtomicFloat) {
    AtomicFloat af(0);
    Float f = 0.;
    EXPECT_EQ(f, af);

    af.Add(1.0251);
    f += 1.0251;
    EXPECT_EQ(f, af);

    af.Add(2.);
    f += 2.;
    EXPECT_EQ(f, af);
}

The test is as simple as it could be: it performs a few additions and makes sure that the result is the same as if a regular float had been used. It’s hard to imagine that this test would ever fail, but if it did, jackpot! We have an easy case to reason about and trace through.
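
For readers wondering what sits behind a test like that, here is a minimal sketch of how an atomic floating-point add can be built from a compare/exchange loop. It is an illustration under assumptions, not pbrt's AtomicFloat: the class name SketchAtomicFloat and its layout are hypothetical, and it assumes a 32-bit float whose bits are stored in a std::atomic<uint32_t>.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of an atomically addable float (not pbrt's class).
class SketchAtomicFloat {
  public:
    explicit SketchAtomicFloat(float v = 0) : bits(ToBits(v)) {}

    // Reads the current value.
    operator float() const { return FromBits(bits.load()); }

    // Adds v atomically by retrying until no other thread has intervened.
    void Add(float v) {
        uint32_t oldBits = bits.load();
        uint32_t newBits;
        do {
            newBits = ToBits(FromBits(oldBits) + v);
            // On failure, compare_exchange_weak stores the value it found
            // in oldBits, so the sum is recomputed from fresh data.
        } while (!bits.compare_exchange_weak(oldBits, newBits));
    }

  private:
    // memcpy is the portable way to reinterpret the bits of a float.
    static uint32_t ToBits(float f) {
        uint32_t u;
        std::memcpy(&u, &f, sizeof(u));
        return u;
    }
    static float FromBits(uint32_t u) {
        float f;
        std::memcpy(&f, &u, sizeof(f));
        return f;
    }

    std::atomic<uint32_t> bits;
};
```

With many threads adding concurrently, the order in which the additions land is not fixed, so the final sum can differ slightly from run to run; a single-threaded test like the one above sidesteps that and can demand exact equality.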

Here’s another example of a not-very-clever test from pbrt-v4. Most of the sampling functions there now provide an inversion function that goes from sampled values back to the original \([0,1]^n\) sample space. Thus, it’s worth checking that a round-trip brings you back to (more or less) where you started. The following test takes a bunch of random samples u, warps them to directions dir on the hemisphere, then warps the directions back to points up in the canonical \([0,1]^2\) square, before checking the result is pretty much back where it started.

TEST(Sampling, InvertUniformHemisphere) {
    for (Point2f u : Uniform2D(1000)) {
        Vector3f dir = SampleUniformHemisphere(u);
        Point2f up = InvertUniformHemisphereSample(dir);

        EXPECT_LT(std::abs(u.x - up.x), 1e-3);
        EXPECT_LT(std::abs(u.y - up.y), 1e-3);
    }
}

There’s not much to that test, but it’s a nice one to have in the bag. Once it passes, you can feel pretty good about your InvertUniformHemisphereSample function, at least if you have independent confidence that SampleUniformHemisphere works. And how long does it take to write? No more than a minute or two. Once it is passing, you can more confidently make improvements to the implementations of either of those functions knowing that this test has a good chance of failing if you mess something up.
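
To make the round trip concrete outside of pbrt, here is a self-contained sketch of one standard uniform hemisphere mapping and its inverse. The type and function names (Vec3, Pt2, SampleHemisphereSketch, InvertHemisphereSketch, RoundTripOk) are stand-ins chosen for this example, and the mapping is not guaranteed to match pbrt's implementation line for line.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Pt2 { float x, y; };

const float kPi = 3.14159265358979323846f;

// Maps u in [0,1)^2 to a direction on the +z hemisphere: u.x picks the
// height z, u.y picks the angle phi around the z axis.
Vec3 SampleHemisphereSketch(Pt2 u) {
    float z = u.x;
    float r = std::sqrt(std::max(0.f, 1 - z * z));
    float phi = 2 * kPi * u.y;
    return {r * std::cos(phi), r * std::sin(phi), z};
}

// Inverse mapping: recovers u from the direction.
Pt2 InvertHemisphereSketch(Vec3 w) {
    float phi = std::atan2(w.y, w.x);
    if (phi < 0) phi += 2 * kPi;  // atan2 returns angles in [-pi, pi]
    return {w.z, phi / (2 * kPi)};
}

// Round-trips an n x n grid of cell-centered points and reports whether
// every one of them comes back within eps of where it started.
bool RoundTripOk(int n, float eps) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            Pt2 u = {(i + 0.5f) / n, (j + 0.5f) / n};
            Pt2 up = InvertHemisphereSketch(SampleHemisphereSketch(u));
            if (std::abs(u.x - up.x) >= eps || std::abs(u.y - up.y) >= eps)
                return false;
        }
    }
    return true;
}
```

The grid deliberately stays away from u.x = 1, where the direction is the pole and the angle is no longer recoverable; that is the sort of edge case better left to a separate, more devious test.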

About succinctness in tests: that Uniform2D in that test is a little thing I wrote purely to make unit tests more concise. It’s crafted to be used with C++ range-based for loops and here generates 1000 uniformly distributed 2D sample values to be looped over. It and a handful of other sample point generators save a few lines of code in each test that otherwise needs a number of random values of some dimensionality and pattern. I’ve found that just about anything that reduces friction when writing tests ends up being worthwhile in that each of those things generally leads to more tests being written in the end.
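
A helper along those lines does not take much code. The sketch below shows one way it could be put together; the name Uniform2DSketch, the use of std::mt19937 and the std::pair return type are all assumptions made for this example and say nothing about how pbrt's Uniform2D is actually implemented.

```cpp
#include <cassert>
#include <random>
#include <utility>

// Hypothetical range-for friendly generator of `count` uniform 2D points.
class Uniform2DSketch {
  public:
    explicit Uniform2DSketch(int count, unsigned seed = 0)
        : count(count), seed(seed) {}

    class Iterator {
      public:
        Iterator(int index, unsigned seed) : index(index), rng(seed) {
            Generate();
        }

        std::pair<float, float> operator*() const { return current; }
        Iterator &operator++() {
            ++index;
            Generate();
            return *this;
        }
        // Range-based for only needs to know when the end index is reached.
        bool operator!=(const Iterator &other) const {
            return index != other.index;
        }

      private:
        // Top 24 bits of a 32-bit draw, scaled to a float in [0, 1).
        float NextFloat() { return (rng() >> 8) * (1.0f / 16777216.0f); }
        void Generate() {
            float x = NextFloat();
            float y = NextFloat();
            current = {x, y};
        }

        int index;
        std::mt19937 rng;
        std::pair<float, float> current;
    };

    Iterator begin() const { return Iterator(0, seed); }
    Iterator end() const { return Iterator(count, seed); }

  private:
    int count;
    unsigned seed;
};
```

A test can then loop with for (std::pair<float, float> u : Uniform2DSketch(1000)) and use u.first and u.second. The fixed default seed keeps runs reproducible, which matters when a failing test needs to be rerun under a debugger.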

The Challenge of Sampling

One of the challenges in implementing a Monte Carlo renderer is that the computation is statistical in nature; sometimes it’s hard to tell if a given sample value is incorrect or if it’s a valid outlier. Bugs often only become evident in the aggregate with many samples. That challenge extends to writing unit tests—for example, given a routine to draw samples from some distribution, how can we be sure the samples are in fact from the expected distribution?

The Right Thing to do is to apply proper statistical tests. For example, Wenzel has written code that applies a \(\chi^2\)-test to pbrt’s BSDF sampling routines. Those tests recently helped him chase down and fix a tricky bug in pbrt’s rough dielectric sampling code. Much respect for doing it the right way.

My discipline is not always as strong as Wenzel’s, though there are some more straightforward alternatives that are also effective. For example, pbrt has many little sampling functions that draw samples from some distribution. An easy way to test them is to evaluate the underlying function to create a tabularized distribution and to confirm that both it and the sampling method to be tested more or less generate the same samples with the same probabilities. As an example, here is an excerpt from the test for sampling a trimmed exponential:

    auto exp = [&](Float x) { return std::exp(-c * x); };
    auto values = Sample1DFunction(exp, 32768, 16, 0, xMax);
    PiecewiseConstant1D distrib(values, 0, xMax);

    for (Float u : Uniform1D(100)) {
        Float sampledX = SampleTrimmedExponential(u, c, xMax);
        Float sampledProb = TrimmedExponentialPDF(sampledX, c, xMax);

        Float discreteProb;
        Float discreteX = distrib.Sample(u, &discreteProb);
        EXPECT_LT(std::abs(sampledX - discreteX), 1e-2);
        EXPECT_LT(std::abs(sampledProb - discreteProb), 1e-2);
    }

The Sample1DFunction utility routine takes a function and evaluates it in a specified number of buckets covering a specified range, returning a vector of values. PiecewiseConstant1D then computes the corresponding piecewise-constant 1D distribution. We then take samples using the exact sampling routine and the piecewise-constant routine and ensure that each sample value is approximately the same and each returned sample probability is close as well. (This test implicitly depends on both sampling approaches warping uniform samples to samples from the function with values of u close to zero at the lower end of the exponential and u close to one at the upper end, which is the case here.)
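
The same comparison can be sketched without any pbrt dependencies. Everything here is a stand-in written for this example: TabulatedSketch plays the role of the piecewise-constant distribution, and SampleTrimmedExpSketch and TrimmedExpPdfSketch are the analytic routines obtained by inverting the CDF of exp(-c x) restricted to [0, xMax]. None of it should be read as pbrt's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Analytic inversion: solve (1 - exp(-c x)) / (1 - exp(-c xMax)) = u for x.
double SampleTrimmedExpSketch(double u, double c, double xMax) {
    return -std::log(1 - u * (1 - std::exp(-c * xMax))) / c;
}

// Normalized density of exp(-c x) on [0, xMax].
double TrimmedExpPdfSketch(double x, double c, double xMax) {
    if (x < 0 || x > xMax) return 0;
    return c * std::exp(-c * x) / (1 - std::exp(-c * xMax));
}

// Piecewise-constant distribution built by evaluating f at bucket centers.
class TabulatedSketch {
  public:
    template <typename F>
    TabulatedSketch(F f, int n, double xMin, double xMax)
        : func(n), cdf(n + 1), xMin(xMin), xMax(xMax) {
        for (int i = 0; i < n; ++i)
            func[i] = f(xMin + (i + 0.5) / n * (xMax - xMin));
        cdf[0] = 0;
        for (int i = 0; i < n; ++i)
            cdf[i + 1] = cdf[i] + func[i] / n;
        funcInt = cdf[n];  // integral of f over the domain rescaled to [0,1]
        for (double &v : cdf) v /= funcInt;
    }

    // Inverts the tabulated CDF; also returns the density with respect to x.
    double Sample(double u, double *pdf) const {
        int n = (int)func.size();
        // Index of the last CDF entry that is <= u.
        int i = int(std::upper_bound(cdf.begin(), cdf.end(), u) - cdf.begin()) - 1;
        i = std::max(0, std::min(i, n - 1));
        double t = (u - cdf[i]) / (cdf[i + 1] - cdf[i]);
        *pdf = func[i] / (funcInt * (xMax - xMin));
        return xMin + (i + t) / n * (xMax - xMin);
    }

  private:
    std::vector<double> func, cdf;
    double funcInt, xMin, xMax;
};

// Compares the two samplers at 100 evenly spaced values of u.
bool AnalyticMatchesTabulated(double c, double xMax) {
    TabulatedSketch table([c](double x) { return std::exp(-c * x); }, 4096, 0, xMax);
    for (int k = 0; k < 100; ++k) {
        double u = (k + 0.5) / 100;
        double x = SampleTrimmedExpSketch(u, c, xMax);
        double p = TrimmedExpPdfSketch(x, c, xMax);
        double tp;
        double tx = table.Sample(u, &tp);
        if (std::abs(x - tx) > 1e-2 || std::abs(p - tp) > 1e-2) return false;
    }
    return true;
}
```

The check only works because both samplers are monotonically increasing in u, so a given u lands at the same x in each; two correct samplers that warped u differently would need to be compared in the aggregate instead.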

To be clear: SampleTrimmedExponential could still be buggy even when that test passes. One might fret about those fairly large 1e-2 epsilons used for the equality test, for example. It is possible that the looseness of those epsilons might mask something subtly wrong, but we can at least trust that the function isn’t completely broken, off by a significant constant factor or the like.

Writing this sort of test requires trusting your functions for sampling tabularized distributions, but those too have their own tests; eventually one can be confident in all of the foundations. For example, one of them compares those results to a case where the expected result can be worked out by hand and ensures that they match.

Preserving the Evidence

Another good use for unit tests is for isolating bugs, both for debugging them when they first occur and for ensuring that a subsequent change to the system doesn’t inadvertently reintroduce them.

Disney’s Moana Island scene helped surface all sorts of bugs in pbrt; many were fairly painful to debug since they took the form of “render for a few hours before the crash happens.” For those, I found it useful to turn them into small unit tests as soon as I could narrow down what was going wrong.

Here’s one for a ray-triangle intersection that went bad. We have a degenerate triangle (note that the x and z coordinates are all equal), and so the intersection test should never return true. But for the specific ray here, it once did, and then things went south from there. Trying potential fixes with a small test like this was a nice way to work through the issue in the first place—it was easy to try a fix, recompile, and quickly see if it worked.

TEST(Triangle, BadCases) {
    Transform identity;
    std::vector<int> indices{ 0, 1, 2 };
    std::vector<Point3f> p { Point3f(-1113.45459, -79.0496140, -56.2431908),
                             Point3f(-1113.45459, -87.0922699, -56.2431908),
                             Point3f(-1113.45459, -79.2090149, -56.2431908) };
    TriangleMesh mesh(identity, false, indices, p, {}, {}, {}, {});
    auto tris = Triangle::CreateTriangles(&mesh, Allocator());

    Ray ray(Point3f(-1081.47925, 99.9999542, 87.7701111),
            Vector3f(-32.1072998, -183.355865, -144.607635), 0.9999);

    EXPECT_FALSE(tris[0].Intersect(ray).has_value());
}

One thing to note when extracting failure cases like this is that it’s critical to get every last digit of floating-point values: if the floats you test with aren’t precisely the same as the ones that led to the bug, you may not hit the bug at all in a test run.
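
As a small illustration of that point, the sketch below prints a float both with the default six significant digits and with std::numeric_limits<float>::max_digits10 digits (9 for IEEE single precision), then parses each string back. The helper names FormatExact, FormatDefault and RoundTrips are made up for this example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <limits>
#include <string>

// Prints with max_digits10 significant digits, which is enough for the
// string to parse back to exactly the same float.
std::string FormatExact(float v) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.*g",
                  std::numeric_limits<float>::max_digits10, v);
    return buf;
}

// Prints with the default precision of "%g": six significant digits.
std::string FormatDefault(float v) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%g", v);
    return buf;
}

// Reports whether parsing s gives back exactly v.
bool RoundTrips(const std::string &s, float v) {
    return std::strtof(s.c_str(), nullptr) == v;
}
```

For a coordinate like -1113.45459f from the triangle above, FormatExact round-trips while FormatDefault does not, since six digits only get as far as -1113.45. The "%a" hexadecimal float format is another option when human readability is not a concern.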

Never Defer Looking into a Failing Test

A cautionary tale to wrap up: a few months ago a bug report about a failing unit test in pbrt-v4 came in. It had the following summary:

gcc-8.4 has stuck forever on ZSobolSampler.ValidIndices test

gcc-9.3 passed all tests

gcc-10.3 gives me the following message (in an eternal cycle) during tests

/src/pbrt/samplers_test.cpp:182: Failure
Value of: returnedIndices.find(index) == returnedIndices.end()
Actual: false
Expected: true

The ZSobolSampler implements Ahmed and Wonka’s blue noise sampler, which is based on permuting a set of low-discrepancy samples in a way that improves their blue noise characteristics. pbrt’s ZSobolSampler.ValidIndices test essentially just checks that the permutation is correct by verifying that the same sample isn’t returned for two different pixels. That test had been helpful when I first implemented that sampler, but it had been no trouble for months when that bug report arrived.

When the bug report came in, I took a quick look at that test and couldn’t imagine how it would ever run forever. No one else had reported anything similar and so, to my shame, I assumed it must be a problem with the compiler installation on the user’s system or some other one-off error. I didn’t look at it again for almost two months.

When I gave it more attention, I immediately found that I could reproduce the bug using those compilers, just as reported. It was a gnarly bug—one that disappeared when I recompiled with debugging symbols and even disappeared with an optimized build with debugging symbols. The bug would randomly disappear if I added print statements to log the program’s execution. Eventually I thought to try UBSan, and it saved the day, identifying this line of code as the problem:

int p = (MixBits(higherDigits ^ (0x55555555 * dimension)) >> 24) % 24;

0x55555555 is a signed integer and multiplying by dimension, which was an integer that starts at 0 and goes up from there, quickly led to overflow, which is undefined behavior (UB) in C++. In turn, gcc was presumably assuming that there was no UB in the program and optimizing accordingly, leading in one case to an infinite loop and in another to a bogus sample permutation.

At least the fix was easy—all is fine with an unsigned integer, where overflow is allowed and well-defined:

int p = (MixBits(higherDigits ^ (0x55555555u * dimension)) >> 24) % 24;
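
To see how early the signed version gets into trouble, here is a small standalone sketch. ScaleDimension and SignedMultiplyWouldOverflow are hypothetical helpers written for this example, and the sketch assumes a 32-bit int, as on the platforms in the bug report. With gcc and clang, building with -fsanitize=undefined enables the UBSan checks that flag the signed multiply at runtime.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// The fixed form: unsigned arithmetic wraps modulo 2^32, so the product is
// well-defined for every dimension.
uint32_t ScaleDimension(int32_t dimension) {
    return 0x55555555u * static_cast<uint32_t>(dimension);
}

// Reports whether the original signed 32-bit multiply would overflow, by
// doing the multiplication in 64 bits where it cannot.
bool SignedMultiplyWouldOverflow(int32_t dimension) {
    int64_t wide = int64_t(0x55555555) * dimension;
    return wide > std::numeric_limits<int32_t>::max() ||
           wide < std::numeric_limits<int32_t>::min();
}
```

The signed product already overflows at dimension 2, since 0x55555555 * 2 is 0xAAAAAAAA, which is larger than the maximum int value of 0x7FFFFFFF. The unsigned product simply wraps: ScaleDimension(4) is 0x55555554.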

Leaving aside the joys of undefined behavior in C++, it was hard enough to chase that bug down with it already narrowed down to a failing test. If the bug had been something like “images are slightly too dark with gcc-10.3” (as could conceivably happen with repeated sample values, depending on how they were being repeated), it surely would have been an even longer and more painful journey. Score +1 for unit tests and -1 for me.

Conclusion

We’re not done with testing! With the unit testing lecture over, next time it will be on to some thoughts about writing effective assertions and how end-to-end tests fit in for testing renderers.

note

That capability wasn’t provided by the C++ standard library before C++20 since floating-point addition is not associative, so different execution orders may give different results. For pbrt’s purposes, that’s not a concern, so AtomicFloat provides that functionality through atomic compare/exchange operations. ↩