Hello yuz-ers. What a year! We ended 2022 with more yuzu Fried Chicken, Vulkan changes, a new input driver, an exorbitant amount of kernel work, more performance, better visuals, and much more!
Blinkhawk has also been working hard on his beloved project, releasing Y.F.C. Part 1.5.
Basically an abridged version of what is expected for the full “Part 2” release.
The changes in this pull request include a rework of the MacroHLE
implementation to include various new macros for indirect draws and configurations.
As discussed in previous articles, macros are small GPU programs that implement features like indirect and instanced draws. They must be emulated. MacroHLE (High-Level Emulation) is the process of avoiding executing a requested macro and instead translating it directly to the code that it would have generated (like an instanced or indirect draw). This works in contrast and in parallel with MacroJIT, which works by actually emulating the loops and control flow contained in macro programs in a just-in-time fashion.
Now, why keep both? Well, each one performs their own specialized task. MacroHLE’s advantage compared to MacroJIT has to do with the emulation of indirect calls. An indirect call, such as a draw, uses data generated somewhere in the GPU through some shader in order to establish the draw parameters and its draw count. Traditionally with MacroJIT we had to sync the Host GPU and Guest GPU to obtain the indirect data in order to execute the macro correctly. With MacroHLE, we create an indirect draw command in the host GPU that points to the translated address of where the GPU generated data should be. Thus skipping the syncing.
Thanks to these improvements, yuzu now is able to more efficiently execute macros, considerably reducing CPU overhead, and without having to change any setting. What we internally like to call a “passive skill”.
As a result of these changes, performance has been improved in several titles, including those developed by Koei Tecmo, Pokémon Scarlet and Violet
, Bayonetta 3
, and Monster Hunter Rise
(with the exception of version 12.0.0, which still requires further fixes) to name a few.
The crashes in Fire Emblem: Warriors
have also been fixed.
We measured a 5-20% performance boost in select titles, but the improvement may be higher on CPUs with a lot of cache. From our testing, the 5800X3D can reach over 30% in some games.
The performance cost of rendering at higher resolutions was also greatly reduced.
But the goodies don’t end here! Blinkhawk also added support for the VK_EXT_extended_dynamic_state2
and VK_EXT_extended_dynamic_state3
Vulkan extensions, reducing the amount and size of shaders needed to be built during gameplay.
This relatively “new” pair, along with the already implemented VK_EXT_extended_dynamic_state
and VK_EXT_vertex_input_dynamic_state
, are the four extensions responsible for considerably reducing shader building stuttering.
But as it always goes, support for these extensions in consumer GPUs is spotty at best, and a mess to support at worst.
State3
in particular is only supported by the NVIDIA Vulkan Beta drivers, version 527.86 at the time of writing, and recent (late 2021 and newer) RADV Mesa drivers.
We recommend anyone interested in testing how a fresh shader cache performs to give these drivers a go.
With no alternative, implementing these extensions forced us to perform another dreaded cache invalidation.
Most drivers cover at least 3 of the 4 extensions without issue, one way or another, with one glaring exception, AMD Windows drivers. The price of this is higher stuttering during gameplay when new shaders are being processed compared to running the same card on Linux with RADV, or using any other brand.
A small side-note, Linux RADV users should update their Mesa version to the latest (or use a more recent distro version if needed), as support for state2
was broken in versions before 21.2.
As a last second change, Blinkhawk tested removing the 16-bit floating point (FP16) blacklist enforced on NVIDIA Ampere and newer GPUs (RTX 3000 series and higher). If it worked, it would have allowed them to work similarly to Turing and AMD Radeon offerings in this aspect. However, NVIDIA redesigned how their FP32 and FP16 units operate on Ampere and newer, with both providing identical performance. Unfortunately, even if it were faster, it’d be irrelevant in the end, as FP16 on Ampere and Ada is still bugged in the drivers, producing graphical issues in many games.
The only remaining architecture that could benefit from enabling blacklisted FP16 support is Intel on Windows, but their drivers are a dumpster fire regarding FP16. So they continue to emulate 16-bit precision with 32-bit the same way as Ampere and Ada, in this case with its always present performance loss. Of course the weakest architecture that could benefit the most from this change is the only one that remains broken…
Another extra benefit of this iteration of Y.F.C.
is that Normal
GPU accuracy is much safer to use.
Particles will continue to be better in High
, but games like Pokémon Scarlet/Violet
, Bayonetta 3
, and many others can be played with Normal
accuracy without glitches much more regularly with the big performance benefit this provides (Bayonetta 3
in particular still needs High
for its title screen, but gameplay is safe on Normal
).
The month doesn’t stop there, there has been a plethora of changes worth mentioning too in our GPU codebase.
Oh boy, byte[] sure has been busy this month.
To start off, he is responsible for implementing the SMAA anti-aliasing filter for our Vulkan and OpenGL backends. But that’s not the whole story, so let’s elaborate further.
SMAA
, or enhanced subpixel morphological antialiasing, is an improvement over MLAA developed by the Spanish Universidad de Zaragoza and video game studio Crytek, of Crysis fame.
BreadFish implemented the original OpenGL version, intending to release it as part of the resolution scaler. As it turns out, implementing SMAA
for Vulkan is no joke, and after being nagged by your writer, byte[] had to work 2 weeks to get it in shape.
SMAA
, being based on MLAA
, intends to be a post-processing (aka shader-based) option focused on quality over performance by analyzing adjacent pixels, unlike FXAA
which just blurs the entire screen.
The SMAA
filter is implemented using render passes and it produces its best results when combined with FSR filtering.
AMD recommends properly anti-aliasing the image in their official Overview Integration Guide.
The results speak for themselves:
Here you can see an ideal test case for SMAA, the simple triangle of death
Ropes and power lines, the classic example for anti-aliasing testing (Pokémon Scarlet)
Sprite elements in 3D games benefit from it (Xenoblade Chronicles 3)
For those interested, we used the ULTRA
preset, testing showed a low performance loss even with a GT 1030, so we preferred to focus on quality.
Only users with old integrated GPUs should avoid SMAA
. For the rest, it’s a safe option to turn on and forget.
You can find the feature in Emulation > Configure > Graphics > Anti-Aliasing Method
.
SMAA doesn’t suffer from the horrible colour banding of FXAA (The Legend of Zelda: Breath of the Wild)
And it's a great help for users running low resolution multipliers. This example is 0.5x Bilinear alone Vs 0.5x FSR + SMAA (Pokémon Scarlet)
As a side note, NVIDIA’s version of FSR, NIS, was also tested, but the result is so ugly and over-sharpened, that we decided to keep the best option of the two, FSR.
byte[] has also fixed a problem with anisotropic filtering.
If users ran the RADV driver on Linux, anisotropic filtering values other than Default
would cause a distinct “acne-like” rendering issue in Super Mario Odyssey
. The issue persists at other anisotropic filtering and resolution multiplier values., but byte[] continues to work on the issue.
The so called RADV acne (Super Mario Odyssey)
The change also addresses an issue with the buggy water rendering in Super Mario Sunshine
with automatic anisotropic filtering on Lavapipe (Mesa, Linux), although the error still occurs at other anisotropic filtering values.
Kind of makes it look even older (Super Mario Sunshine)
byte[] also corrected the semantics of data cache management operations in the memory.
Previously, when the guest requested a cache invalidation, the implementation would simply invalidate the cache on the hardware, rather than making the memory visible to the GPU as intended.
On the side, he also promoted various Vulkan Extensions to use core methods. In the Vulkan API, vendor extensions are optional features provided by specific hardware vendors or drivers that may not be available on all systems. In contrast, core methods are a fundamental part of the Vulkan specification and are guaranteed to be available on all systems that support the API. Thus, promoting extensions to use core methods can improve their reliability and portability.
byte[] made further initialization tweaks to the Vulkan API.
These changes included the restoration of VK_KHR_timeline_semaphore
and VK_EXT_host_query_reset
, which were mistakenly removed in a previous PR. He also added the flag VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR
to the VkInstanceCreateInfo
structure for MoltenVK
to allow MoltenVK
to be detected as an available Vulkan device.
Keep in mind that a lot more work is needed in order to get yuzu rendering on macOS devices. This is only early preliminary work.
vonchenplus has implemented the draw manager for Maxwell3D
with the aim of eliminating workarounds and reorganising the drawing process to more accurately enumerate the drawing behaviour.
As a result of these changes, the issue in Dragon Quest Builders
where some 3D models were not rendering properly has been fixed.
No more missing stuff! (Dragon Quest Builders)
Following these changes, vonchenplus also improved the code for the topology update logic so that the implementation is more accurate. This change was necessary in order to implement special topologies with Vulkan.
This includes support for quad strips
, which require the use of triangles to simulate them, and the ability to simulate indexed and non-indexed modes.
In non-indexed mode, a fixed mapping table is used to connect the vertices, while in indexed mode, a compute shader is used to dynamically map the original drawing indices.
vonchenplus has also implemented support for line loops, which require the use of triangle lists to simulate them, and for polygons, which require the use of triangle fans.
These changes fixed the Hero’s path in Legend of Zelda: Breath of the Wild
, as well as the Status Summary graphic in Pokémon Scarlet and Violet
, and they also gave us another shader cache invalidation, yay!
Don’t mess with the stats! Can’t do breeding without the stats! (Pokémon Scarlet)
When the Sheika GPS signal returns (The Legend of Zelda: Breath of the Wild)
Blinkhawk has added alpha to coverage and alpha to one to our Vulkan backend.
Alpha to coverage
is a multisampling technique that is used to improve the quality of transparent or partially transparent pixels.
It works by blending the alpha values of multiple samples taken from the same pixel to produce a single, more accurate result.
This can help to reduce aliasing and other rendering artifacts that can occur when rendering transparent pixels.
Alpha to one
, on the other hand, is a technique that is used to improve the quality of partially transparent pixels by setting the alpha value of each pixel to a maximum of 1.0
.
This can help to reduce the amount of alpha blending that needs to be performed, which can improve the performance of the rendering pipeline.
These changes have fixed the shading of trees and grass problems when viewed up close or from a distance in Pokémon Scarlet and Violet
.
The camera isn’t more interested in that tree, you should learn from this, Dark Souls (Pokémon Scarlet)
vonchenplus has corrected errors caused by yuzu’s faulty detection of draw types.
In the past, yuzu would set every vertex and index count register to zero after each draw to determine if the next draw would be a regular or indexed draw.
Xenoblade Chronicles 3
proves to us that these registers initiate some draw calls based on previous values.
Changing this behaviour
partially fixes the particles present in Xenoblade Chronicles 3
. You can now more easily perform your off-seer duties.
Meat is on the menu! (Xenoblade Chronicles 3)
We don’t usually cover compilation changes here, but this time we had to do it because it affects compatibility.
Your writer (or co-writer in this progress report, my partner did most of the work this time) has been playing with compilation flags in order to get more free performance, following previous work done by Blinkhawk some time ago.
Microsoft Visual C++ (MSVC, Visual Studio) is simple enough (we’ll talk about Linux later). You enable full program optimizations, optimize for performance instead of size, a bit here, a bit there, and you gain a nice 3%, but I wanted more. Last month, epicboy improved the build process, saving both time and memory. This created a “gap” big enough to enable the Big One, Link-Time Optimizations (LTO), an optimization that in the past had to be discarded for eating all the available RAM of our buildbots.
Windows testing went well and in some cases the performance uplift reached up to 12%. The problem was Linux. LTO is aggressive by nature, and there’s no guarantee that all parts of the project will react nicely to it. In this case, the problem was Qt, the UI looked completely garbled. So LTO had to go, but in its place, we now require what Dynarmic already did for a while, x86-64-v2 hardware.
GCC and Clang builds will now compile assuming the features of CPUs are compatible with the instruction sets that form part of x86-64-v2, the highest one being SSE4.2. This means the minimum CPU required for yuzu to work without crashing, in both Windows and Linux, is now the first generation of Core i-series (500-900 series), which are almost 15 years old, and the FX and APU series from AMD, which are almost 12 years old. The performance boost on GCC and Clang is up to 7%.
First system runs i7-12700H - 2x16GB 4800MHz CL40 - RTX 3080 Mobile 16GB 175W, second system runs R7 5800X3D - 2x16GB 3600MHz CL16 - RTX 4090
We originally wanted to enforce x86-64-v3 to get an even bigger performance boost, as well as to ensure a minimum level of performance, as any CPU lacking AVX, AVX2, and in particular, FMA, will be very slow, no matter its clock speed or core count.
Yes, that means the 8 core Ivy Bridge Xeon you bought for 20 bucks is not fast enough for this task.
The problem, however, is that doing so would leave close to 9% of the user base out of support, according to our telemetry. That many users is a considerable number, so we’ve decided to wait until more users adopt more modern CPUs before implementing this change. We’ll re-evaluate enforcing x86-64-v3 in the future once OpenGL eventually ends up on the chopping block as well.
While this change would also apply for Windows, MSVC is not flexible enough to let us build for x86-64-v2, it either supports SSE2, or jumps straight to the first AVX. Dynarmic already manually uses x86-64-v2 extensions, so any CPU lacking SSE4.2 is considered unstable regardless of the OS in use.
x86-64-v4 will not be an option for many years, mainly because Intel can’t decide if AVX-512, made by themselves, is something that their users should be allowed to actually use.
If an old-school user is so strongly set on running yuzu on decades old CPUs, the Flatpak builds are still generic, or there’s always the option of building yuzu manually, allowing you to configure any requirements.
german77 has done it again, giving us an amazing Christmas gift, a new input driver for Joy-Cons This is an in-house development that doesn’t rely on SDL, so it gives us much more freedom to add features that weren’t previously available.
The basics are covered. Single and Dual Joy-Con modes are available, button, stick, motion mapping works the same as before. But that’s not exciting, here’s all the new stuff that was added:
Night Vision
, Game Builder Garage
, and Nintendo Labo
can make use of this neat feature at the base of the right Joy-Con.All this extra accuracy highlights a problem we didn’t often face before: PC Bluetooth connections are very easy to saturate. Cheaper/Intel bluetooth chipsets or areas with tons of interference are especially prone to this. For this reason, HD Rumble can potentially cause lag depending on the user’s specific circumstances. We recommend unmapping/disabling rumble in those cases.
Speaking of saturation, the IR camera may be slow in some games. The reason being that we currently implement only the image transfer mode, which saves 320x240 pictures. Some games prefer faster framerates at the cost of resolution, going as low as 40x30. Once all modes are added in, the choppy framerate will disappear.
Amiibo data writing is a work in progress.
german77’s desire for incredible input improvements doesn’t end there.
german77 implemented the mifare
service,
allowing games read and write plain mifare tags.
Games like Skylanders Imaginators
make use of this feature.
The only feature lacking is support for encrypted read and writes.
Speaking of SDL, a recent update broke the way it handled the GUID, the identifier of several controllers, including the one integrated into the Steam Deck, causing many annoyances for Deck users. So, with no alternative on hand, german77 had to implement a custom filter to solve the issue.
And lastly, as a very important quality of life change, german77 made the input device list refresh automatically, ensuring that yuzu detects controllers without the need for manual intervention. Goodbye tiny refresh button!
To close the section, MonsterDruide1 increased the accuracy of analog sticks for TAS by hardcoding the range and deadzones of the user input.
With an update for Dynarmic and SDL2, byte[] enabled support for ARM64 compilation. This means all Switch titles can be tested on Linux ARM64 devices with compatible Vulkan drivers.
As part of this effort, we started implementing Flatpak support for ARM64 Linux devices. This required making OpenGL optional for the build process, as Flatpak’s Qt build only supports OpenGL ES, not the full fledged OpenGL 4.6 compatibility profile we require.
Part of these changes fixed compilation for macOS, but the situation remains the same, without MoltenVK
support, nothing will be rendered.
epicboy implemented a series of changes with the goal of minimizing the overhead of dynamic memory allocation, a task which involves requesting memory from the operating system, and can slow-down performance in some circumstances.
The texture cache, in particular, was a significant contributor to this issue, as it constantly allocated and then deallocated memory when transferring textures to and from the GPU. To address this problem, epicboy optimized the texture cache to pre-allocate a buffer to store swizzle data and reuse it whenever possible, rather than performing a dynamic memory allocation every time this was done. This change should result in reduced stuttering, as memory will now only be requested from the operating system if the buffer is not large enough to hold the data.
epicboy also made similar changes to optimise the ReadBuffer
function
, which likewise takes a similar approach: instead of allocating and deallocating memory, a buffer is created once to hold data in the memory, and it only reallocates whenever it needs to grow.
Additionally, he introduced a ScratchBuffer
class
to act as a wrapper around a heap-allocated buffer of memory.
The advantage of this class lies with the fact that it eliminates the need to initialize the stored values, and the need to copy the data when the buffer needs to grow. Thus, it would help to speed up things by minimizing the amount of time spent on memory management tasks.
german77 implemented the FreeThreadCount
info type,
which is needed by titles such as Just Dance 2023 Edition
(although that game requires additional changes in order to work).
Saalvage noticed an error in yuzu’s kernel implementation and made the necessary changes to unlock thread mutex before destruction, as not doing so incurs an undefined behaviour. “Here be Dragons” and all that.
byte[] submitted a change that improves the handling of system startup failure, in order to prevent deadlocks and crashes when/if the GPU initialization fails.
He also added KHardwareTimer
.
This component is designed to fix an issue with incorrect event unregistration when threads request a timeout for certain operations.
Without the fix, the threads would return successfully from the operation but fail to cancel the timeout, which would cause the timer to mistakenly fire on the thread and cancel a random unrelated operation.
This change fixes the random hangs that have been plaguing us for months in The Legend of Zelda: Breath of the Wild
, as well as Persona 5 Royal
.
byte[] also introduced a workaround for crashes caused due to unallocated memory after noticing that yuzu always used memory blocks without marking them as allocated, causing it to overlap memory used by the game. He fixed the bug by making sure we now allocate the memory before using it. This is meant to alleviate the situation while other parts of the kernel are being ironed out.
This is more related to error handling, but counts nonetheless. byte[] added an option to force the emulator to break when an invalid memory access happens. This means that if/when a game explodes in the background, the emulator will crash instead of slowly eating all the available system RAM. Problems like these can be caused by emulation issues, damaged game dumps, or even some wonky mods, so it’s always a better option to avoid crashing the entire emulator, and if the user has little enough RAM, making the OS suffer.
Along with endless silent changes that don’t get mentioned here, lioncash made some changes to the input code to reduce memory use ever so slightly.
We’ve had some interesting user interface quality of life changes implemented! lioncash made the SPIR-V shader backend element translatable, so it doesn’t always show in English for everyone. The community effort working on translation can now take the label and update it accordingly.
Some months ago, with the core timing changes, we allowed users to boot games with their framerate unlocked after continuous requests from the community.
As it turns out, nothing changed. Several games hate booting with unlocked framerates, and the support channels get their fair share of people asking why their game doesn’t want to boot.
So, simple fix, unlocked framerate at boot rights denied.
The hotkey to toggle unlocked framerate is Ctrl + U
by default, only a small nuisance.
Users reported that they couldn’t record or stream their yuzu window while in windowed mode.
byte[] found the cause was setting the WA_DontCreateNativeAncestors
Qt property for all platforms, instead of just for wayland.
Issue down, streamers rejoice.
Discord user piplup reported that yuzu didn’t save the device name (what you would call the console) after accessing a game’s custom configuration window. german77 fixed the issue (this particular setting lacked a custom configuration equivalent), and also fixed Qt 6 build issues while at it.
Another very nice quality of life improvement made by german77 is making yuzu remember the last selected directory
for Install files to NAND…
.
If you keep your dumps in the same folder, updating your games is going to take fewer clicks now.
byte[] managed an amazing victory in the war against crashes when closing/stopping games. He worked on making shutdown not visibly freeze yuzu, avoiding crashing the emulator while the game quits, and showing a nice pop-up message while at it too!
Morph helped make this possible by hiding button dialog boxes, allowing for the creation of the dialog overlay byte[] added.
Another battle fought on this front is related to homebrew apps. byte[] is responsible for making them quit properly now too.
ChrisOboe suddenly shows up with a glorious quality of life fix for the terminal-based yuzu-cmd build. Marking the build as a “Windows” application instead of a “Console” one ensures that no empty command line window pops up needlessly. This can help streaming programs set up to run specific games with yuzu-cmd, as this prevents the sudden empty black box from appearing in front of other windows.
For our single audio change of this month, Maide properly signals a buffer event on audio stops,
fixing an early softlock that affected the Pokémon Brilliant Diamond/Shining Pearl
games.
We’re only a few days into 2023 and we already want to publish the next progress report. So much has happened in such a short time!
Also, Blinkhaw, bunnei, and byte[] are up to something, and we can’t wait to tell you more. And yes, there will be yet more cache invalidations. All in the name of progress.
That’s all folks! Expect a few but very critical Vulkan improvements next time, hope to see you then!
⭐⭐⭐ Grande Messi
Advertisement
Advertisement