Today we are joined by Intel Principal Engineer Karthik Vaidyanathan to talk extensively about Intel’s upcoming neural supersampling technology, XeSS, and that very impressive demo. Karthik was instrumental in the development of XeSS and will be fielding the community’s questions about the upscaling tech. The interview will begin with some basic background on how upscaling technologies are classified and then move on to more pointed questions.
While you might be tempted to skip ahead to the juicier parts of the interview, we would encourage everyone to read through it all, because there are nuggets of insight in practically every one of Karthik’s responses (we especially loved the examples that help gamers understand just what is going on behind the scenes).
Karthik was also able to reveal some truly exciting details about XeSS (we won’t spoil them here). This writeup will be a 36-minute read for the average reader, so grab a cup of coffee, some snacks, and enjoy the interview!
Special shout out to Patrick Moorhead (@PatrickMoorhead), Sebastian Moore (@Sebasti90655465), Locuza (@Locuza_), MelodicWarrior (@MelodicWarrior1), Albert Thomas (@ultrawide219), and WildCracks (@wild_cracks) for contributing to the interview questions.
There is spatial upscaling. You’ve probably heard about that quite a bit; these are also commonly referred to as super resolution techniques. And then on the other hand you have super sampling techniques, and a lot of modern games already use super sampling techniques like checkerboard rendering. Now, diving deeper into some of these spatial upscaling techniques, they tend to be simpler. They typically engage after anti-aliasing - after TAA, for example - and they look at a single frame at a lower resolution and just upscale it to the target resolution. There are lots of smart techniques that they use to upscale the image, but when you’re working with such a small set of pixels to begin with, often the information is simply not there. And moreover, when you’re working after anti-aliasing - you can think of anti-aliasing as a low pass filter - that also ends up removing a lot of the detail and a lot of the information.
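To make that concrete, here is a minimal sketch of what a purely spatial upscaler has to work with: a single, already anti-aliased low-resolution frame and nothing else. This is our own illustration of a generic bilinear upscale, not FSR, XeSS, or any shipping implementation.

```python
import numpy as np

def bilinear_upscale(lowres: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Naive single-frame spatial upscale: every output pixel is interpolated
    from the few low-resolution pixels around it. No history, no extra samples."""
    in_h, in_w, _ = lowres.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = lowres[y0][:, x0] * (1 - wx) + lowres[y0][:, x1] * wx
    bot = lowres[y1][:, x0] * (1 - wx) + lowres[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# A sub-pixel feature (e.g. a thin wire) that barely covers any low-res pixels
# simply isn't in the input, so no spatial filter can bring it back.
frame_720p = np.random.rand(720, 1280, 3).astype(np.float32)
frame_4k = bilinear_upscale(frame_720p, 2160, 3840)
```

However clever the filter, the output can only be a recombination of the pixels that survived the low-resolution render and the TAA pass, which is exactly the limitation Karthik describes.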
So you’re working with a very small set of information, and then you’re left with two choices to try and produce all the missing pixels: either approximate or hallucinate. Most of the real-time techniques that you will see are approximators. The way it works is, if you have a well-defined edge or some well-defined features, you can detect it, and then either sharpen it or produce those details, because you have detected it in the first place. Now you can imagine a scenario where you have very fine details - the most common one being thin wires, but there are so many more: fine reflections, highlights, all of these things. If you look at a low resolution render of a wire you would have maybe one pixel over here and one pixel over there, and there’s no way to infer that there’s a wire, so it’s impossible to produce those details. And that’s where it becomes quite challenging for these kinds of single-frame spatial upscaling techniques to produce that detail. Now there are state-of-the-art techniques like GANs which can hallucinate, but these are really not fast enough for the real-time domain and they also don’t generalize very well, at least not yet.
Super sampling is a completely different kind of approach, and it’s good to point out the distinction between the two. The way super sampling works is, first of all, it treats anti-aliasing and upscaling as a single problem - these are not separate problems - and it works directly off of the unfiltered pixels coming from the render. So you have all the information coming from the render, and not only that, you also look at pixels in previous frames - there’s a lot of information there - and you combine that information to produce all the pixels that you need for your target resolution. So first of all, you’re not limited as much as you would normally be in the amount of information that you have. For example, if you have a wire, in one frame you might only see two pixels on that wire, but over say eight frames you begin to see that this is something that looks like a wire. And the interesting thing is that super sampling has been around for a while. Checkerboard rendering is already there. State-of-the-art game engine renderers already employ these kinds of techniques. And often you might find that these already perform better than spatial upscaling.
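A hedged sketch of the core idea behind temporal super sampling follows - a generic accumulator of our own, not the actual XeSS or checkerboard code. Each new low-resolution frame contributes a few samples, and the history buffer is reprojected with motion vectors so those samples keep accumulating over time.

```python
import numpy as np

def reproject(history: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Fetch, for every pixel, the history pixel it came from.
    motion[y, x] = (dy, dx) in pixels, pointing back into the previous frame."""
    h, w, _ = history.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.round(yy + motion[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xx + motion[..., 1]).astype(int), 0, w - 1)
    return history[src_y, src_x]

def accumulate(current: np.ndarray, history: np.ndarray,
               motion: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Exponential accumulation: a small share of the new frame is blended into
    the reprojected history, so detail builds up over many frames."""
    return alpha * current + (1.0 - alpha) * reproject(history, motion)

# Over ~8 frames, a feature that only covers one or two pixels per frame
# (the "thin wire" case) starts to emerge in the accumulated buffer.
h, w = 128, 128
history = np.zeros((h, w, 3), dtype=np.float32)
for _ in range(8):
    current = np.random.rand(h, w, 3).astype(np.float32)  # stand-in for a jittered render
    motion = np.zeros((h, w, 2), dtype=np.float32)         # static scene for simplicity
    history = accumulate(current, history, motion)
```

Real implementations add sub-pixel jitter, sharper resampling filters, and the validity logic discussed next, but the accumulation loop above is the essence of "combining pixels from previous frames".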
[Super sampling] is not an easy problem, especially with current state-of-the-art super sampling techniques, which are often based on heuristics. The problem is that it’s challenging to find all the information - all the pixels from your previous frames - that can actually be utilized to reconstruct your current frame, because there are lots of scenarios where it might not be feasible. For example, disocclusion - where something is visible in the current frame, but was not visible in the previous frame. Imagine some big object in the foreground which was visible in the previous frame, but in the current frame it has moved away, revealing what was behind it - you cannot use history for those pixels because they were not visible in the previous frame. So that’s one scenario. There are many such scenarios where, because the scene is dynamic and things are moving, you cannot have a one-to-one correspondence between the pixels in your previous frame and your current frame.
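One common way engines detect the disocclusion case Karthik describes - again an illustrative heuristic of our own, not anything XeSS-specific - is to reproject the previous frame’s depth and reject history wherever it disagrees too much with the current depth.

```python
import numpy as np

def disocclusion_mask(curr_depth: np.ndarray, prev_depth_reprojected: np.ndarray,
                      tolerance: float = 0.01) -> np.ndarray:
    """True where the reprojected history is invalid: the surface visible now was
    hidden behind something else in the previous frame, so its reprojected depth
    does not match and its colour history must not be reused."""
    return np.abs(curr_depth - prev_depth_reprojected) > tolerance

# Where the mask is True, a temporal accumulator falls back to the current frame
# only (or to a spatial approximation) instead of blending stale history.
curr_depth = np.random.rand(128, 128).astype(np.float32)
prev_depth = np.random.rand(128, 128).astype(np.float32)
invalid = disocclusion_mask(curr_depth, prev_depth)
```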
You really need some smartness to try and detect which pixels are usable, and in the event that you are not able to use those pixels, you still need a good approximation like the spatial techniques have. Most state-of-the-art game [techniques] use a lot of heuristics - a lot of hand-designed approaches - to try and use as many pixels as they can, while trying to do it in a way where you don’t end up integrating false or invalid information. But they don’t work all the time, and therefore you will often see artifacts like ghosting and blurring - issues commonly associated with techniques like TAA and checkerboard rendering. And that’s where neural networks come in, because this is almost an ideal problem for neural networks: they are very good at detecting complex features, and that’s where we can use them to integrate just the right amount of information, and when that information is not there, to detect these complex features and reconstruct them. So that sort of summarizes the technology.
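The "hand-designed heuristics" he refers to are things like neighbourhood colour clamping, which many TAA implementations use. A rough sketch of that idea follows (our illustration, not Intel’s code): when the clamp is too loose you get ghosting, and when it is too tight you get blurring and flicker, which is exactly the trade-off a learned model is meant to escape.

```python
import numpy as np

def neighborhood_clamp(history: np.ndarray, current: np.ndarray) -> np.ndarray:
    """Classic TAA-style heuristic: clamp each reprojected history pixel to the
    min/max of the current frame's 3x3 neighbourhood. History that drifts outside
    that range (likely ghosting) is pulled back toward plausible values."""
    h, w, _ = current.shape
    padded = np.pad(current, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Stack the 3x3 neighbourhood of every pixel along a new leading axis.
    neigh = np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])
    lo, hi = neigh.min(axis=0), neigh.max(axis=0)
    return np.clip(history, lo, hi)

# Usage: clamp the reprojected history before blending it with the current frame.
current = np.random.rand(64, 64, 3).astype(np.float32)
history = np.random.rand(64, 64, 3).astype(np.float32)
blended = 0.1 * current + 0.9 * neighborhood_clamp(history, current)
```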
So FSR, to my knowledge, is spatial upscaling, and we already discussed spatial upscaling and some of its limitations. DLSS 1.0 - again, I am not aware of the internals of DLSS because it’s not open - but from my understanding, it was not something that generalized across games. DLSS 2.0 and beyond was neural network based and it generalized very well. For XeSS, from day one, our objective has been to build a generalized technique.
You mentioned Unreal Engine 5, and it produces some of the highest-quality shadow, geometry, and lighting fidelity. When you’re investing so much in your render, you really don’t want to lose any of that quality when you scale from those pixels to your target resolution - that’s been our objective from day one. And you also don’t want a solution that’s fragile, that requires training for every game that someone ships - that has also been our objective from day one.
Microsoft has enabled this through Shader Model 6.4 and above, and XeSS will work on all these platforms.
I can give you a very simple example. You can have a network try to predict all three color components, R, G and B. Or you can have a network that only predicts a filter that is applied to all three color components, right? And this is a very, very simple example that I’m coming up with just to convey the idea that the way you define the problem gives you a path to better generalization. That is the key, and all the issues that you might have seen with DLSS 1.0 - where, you know, it might have been blurry in some places, or had issues with motion - inherently, it comes down to defining the problem in a way that is easy to generalize, and to treating anti-aliasing and upscaling as a single problem where you use motion vectors for both. Treating it as one - that’s the key.
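His "predict a filter instead of predicting RGB" example corresponds to what the research literature calls kernel prediction. Below is a hedged sketch of what applying such a predicted per-pixel filter looks like - our illustration of the concept, not the XeSS network. Because the network only decides how to weight real rendered samples, it is much harder for it to invent colours that were never in the input, which is one intuition for why this formulation generalizes better.

```python
import numpy as np

def apply_predicted_kernels(image: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Apply a per-pixel 3x3 filter predicted by a network.
    image:   (H, W, 3) input colours
    kernels: (H, W, 9) predicted weights, one 3x3 kernel per pixel
    The output is always a weighted combination of real input pixels, applied
    identically to R, G and B, rather than colours predicted from scratch."""
    h, w, _ = image.shape
    padded = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    neigh = np.stack([padded[dy:dy + h, dx:dx + w]
                      for dy in range(3) for dx in range(3)], axis=-2)   # (H, W, 9, 3)
    weights = kernels / (kernels.sum(axis=-1, keepdims=True) + 1e-8)     # normalize per pixel
    return (neigh * weights[..., None]).sum(axis=-2)

# Here the "network output" is random; in practice it would come from the model.
image = np.random.rand(64, 64, 3).astype(np.float32)
kernels = np.random.rand(64, 64, 9).astype(np.float32)
filtered = apply_predicted_kernels(image, kernels)
```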
And yes, we do plan to open source. It’s important for us; it’s part of our vision for a technology like this. First of all, ISVs actually know what they’re integrating when we share the source - they understand the technology, they can build upon it, and we get bigger mindshare as a result.
We have a certain perspective on this. We have some of the best researchers solving this problem, but I imagine sharing this with the larger community allows us to leverage so much more, and there’s so much more that we could do as a result. So that’s one part of it. And of course, having something that is cross-vendor and open source is much better, because there’s a lower barrier to adoption, right? If you have a technology that’s open source and runs on multiple platforms, it’s something that you can integrate into your game engine and not have to differentiate for every single platform that you’re running on. So, yes, it’s also been our objective from day one to have a solution that works on other GPUs, is open source, and can establish a path to wider adoption across the industry.
And that is something you need for wider application of technologies like this. You can come up with the most disruptive technology, but if it’s a black box, there’s always a challenge.
But yeah, BioShock Infinite was probably the last game that I recall very distinctly, for one reason. For me personally, I am more into it for the storytelling aspect - games are another form of artistic expression, something more like a movie but with more degrees of freedom and more potential. And so I tend to be biased towards games that have a very strong storytelling aspect, right? I’m not really into multiplayer gaming, because for me, that’s not what has interested me all along. It’s about having an immersive experience and sort of just admiring the visuals, the creative input, just the artistic aspect of it. And that’s what draws me to this field, this technology. And yeah, I guess BioShock Infinite was one of the games that had a very strong plotline, in my opinion.
And the way these settings work in all games, across FSR and DLSS, is that they just control your input resolution. So ultra quality runs the input at a higher resolution - you are producing more pixels. Quality produces slightly fewer pixels, balanced produces even fewer, and performance produces the smallest number of pixels. And in some ways that reflects on the capabilities of your upscaling technology. When you’re running at quality or ultra quality, there’s not much for the upscaling technique to do - you already have, you know, a majority of the pixels, and you might be able to get away with a cheaper upscaling technique. It’s when you start going down to performance, the kind which unlocks 60 frames per second with the highest quality rendering, that’s where, you know, the capabilities of the upscaling technology really start coming out.
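As a rough illustration of how such presets translate into input resolutions, here is the arithmetic with the per-axis scale factors commonly used by existing upscalers - these are purely illustrative and are not confirmed XeSS values.

```python
# Illustrative per-axis scale factors (typical of existing upscalers; actual
# XeSS presets may differ). Each preset just changes the render resolution.
presets = {"ultra_quality": 1.3, "quality": 1.5, "balanced": 1.7, "performance": 2.0}

target_w, target_h = 3840, 2160  # 4K output
for name, scale in presets.items():
    in_w, in_h = round(target_w / scale), round(target_h / scale)
    print(f"{name:>13}: renders at {in_w}x{in_h}, upscales to {target_w}x{target_h}")
```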
So from our standpoint, that’s what I can talk about. We train with reference images that have 64 samples per pixel.
Now, if you want to derive a resolution from that, you can do the math. 64 would be, you know, 8 samples in X and 8 in Y, so that would be 32K - that’s what it would be. Okay, but I wouldn’t call it 32K, because effectively all those samples - all 64 of them - are contributing to the same pixel. But yeah, effectively that’s what we use to train - what we use to create the reference for one image.
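To unpack that arithmetic (our reading of his numbers, assuming a 4K target, which he did not confirm): 64 samples per pixel is an 8×8 grid of samples within every pixel, so the reference is sampled on a grid roughly eight times wider than the target - on the order of "32K" horizontally - but all 64 samples are averaged back down into a single reference pixel.

```python
# Back-of-the-envelope version of the "32K" remark, assuming a 4K target frame.
samples_per_pixel = 64
samples_per_axis = int(samples_per_pixel ** 0.5)   # 8 samples in X and 8 in Y

target_w, target_h = 3840, 2160                    # 4K (assumed for illustration)
effective_w = target_w * samples_per_axis          # 30720, roughly "32K" of samples in X
effective_h = target_h * samples_per_axis          # 17280

print(effective_w, effective_h)  # all of it folded back into one 4K reference image
```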
Now, coming to your question. So, for DP4a, yes - SM 6.4 and beyond support DP4a, and SM 6.6, for example, also supports packing intrinsics for extracting and packing 8-bit data. So we recommend SM 6.6.
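For readers unfamiliar with DP4a: it is a single instruction that multiplies four packed 8-bit values against four others, sums the products, and adds the result to a 32-bit accumulator - the core operation of an int8 neural-network layer. Below is a plain reference of the arithmetic the instruction performs (not shader code, just the math; the signed variants differ only in how the bytes are interpreted).

```python
def dp4a(packed_a: int, packed_b: int, acc: int) -> int:
    """Reference of what a DP4a instruction computes: unpack four unsigned 8-bit
    lanes from each 32-bit operand, multiply lane-wise, sum, add to accumulator."""
    total = acc
    for i in range(4):
        a_i = (packed_a >> (8 * i)) & 0xFF
        b_i = (packed_b >> (8 * i)) & 0xFF
        total += a_i * b_i
    return total

# Example: weights [1, 2, 3, 4] against activations [5, 6, 7, 8], accumulator 10.
a = 1 | (2 << 8) | (3 << 16) | (4 << 24)
b = 5 | (6 << 8) | (7 << 16) | (8 << 24)
print(dp4a(a, b, 10))  # 1*5 + 2*6 + 3*7 + 4*8 + 10 = 80
```

The packing intrinsics Karthik mentions are what let a shader build and unpack those 32-bit operands from 8-bit data efficiently, which is why SM 6.6 is the recommended baseline.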
We don’t use DirectML. See, at the kind of performance that you’re looking at when it comes to real time, even a hundred microseconds is very significant. And so for our implementation we need to really push the boundaries of the implementation and the optimization, and we need a lot of custom capabilities - custom layers, custom fusion, things like that - to extract that level of performance. In its current form, DirectML doesn’t meet those requirements, but we are certainly looking forward to the evolution of the standards around matrix acceleration, we’re definitely keeping an eye on it, and we hope that our approach to XeSS sets the stage for the standardization effort around real-time neural networks.
It requires developer support, but having said that, generally, super sampling technologies that are implemented at the tail end of the pipeline, closer to the display, will always have more challenges. I can give you a clear example. Let’s say you had film grain noise that was introduced as a post process - trying to apply an upscaling or super sampling solution after the fact becomes very challenging.
So even if one were to implement something like this as an upscaling solution close to the display, for example, there are always going to be scenarios like this where, you know, the game engine does some kind of post-processing that just breaks it. So being closer to the render gives you, as we discussed the last time, the highest fidelity information, with the amount of controllability that you need to be able to produce the best result.