Configure Nvidia RTD3 for Linux Laptops
Many recent high-performance laptops have a discrete GPU, and most of these are Nvidia. RTD3 is a power management technique used by modern Nvidia GPUs, usually in combination with prime offloading. This is very useful for battery life.
In this blog post, I will explain how to configure RTD3 on devices with edge cases, such as the Nvidia GPU owning the framebuffer, or no VGA compatible controller existing at all.
Linux and Nvidia ship with good defaults, so RTD3 normally works out of the box. From my research, the firmware misconfiguration covered here is mostly seen on laptops with an AMD Strix Point integrated GPU and an Nvidia discrete GPU.
I think popular desktop environments (compositors) like KDE Plasma and GNOME have implemented smarter device selection. This blog post is about fixing the issue kernel-side (without recompilation, of course). There are patches under discussion on the mailing list to address this; with those patches, the kernel will be able to identify the real primary GPU.
Here is a curated list of useful commands used in this blog post or at least useful for this topic:
- `ls -l /dev/dri/by-path/*`: See the `dri`-to-`pci` device mapping.
- `lspci -d ::03xx`: Show all your `pci` display devices.
- `cat /sys/class/drm/card*/device/power_state`: Check each card's power state; our goal here is to get the Nvidia card into `D3cold` (example below).
- `nvidia-smi`: Check Nvidia status.
- `upower -b`: Dump battery details. You can use this to see power consumption while your battery is discharging.
- `ls -la /sys/class/drm/`: See which port is wired to which render device.
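For instance, the `power_state` check might look like this once RTD3 is working (illustrative output; `D3cold` means the dGPU is fully powered off):

```
$ cat /sys/class/drm/card*/device/power_state
D0
D3cold
```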
Here are some definitions of acronyms, just nice to know (so you can pretend you are knowledgeable about these):
- `bios`: Basic Input/Output System. The old way your computer loads your operating system.
- `efi`: Extensible Firmware Interface. Intel's original firmware specification, the predecessor to UEFI.
- `dri`: Direct Rendering Infrastructure. Enables userspace programs to talk to graphics devices directly.
- `drm`: Direct Rendering Manager. Manages `dri`.
- `fb`: Frame Buffer.
- `pci`: Peripheral Component Interconnect. A bus (think of "bus" as an abbreviation of Latin *omnibus*, "for all": a "line" where all devices talk to the CPU).
- `pcie`: `PCI` Express. A faster version of `pci`.
- `RTD3`: Runtime D3. What we are trying to implement here: turn the GPU off when it is not used.
- `s0ix`: Modern Standby. Mostly irrelevant here; this Microsoft technique saves battery when the lid is closed.
- `uefi`: Unified Extensible Firmware Interface. The new `bios`.
- `vga`: Video Graphics Array. The old 90s interface you would only find on old motherboards, with screws on both sides. This whole blog post is arguably about it. The `vga-compatible` class assigned to your GPU does not do anything on modern `uefi` architectures besides conventionally marking it as the primary display device, which some operating systems, namely Linux, will use. If you want to know what it does in `BIOS`, see here.
And here are some useful links:
- Latest Nvidia driver doc on RTD3 (at the time this post is published)
- Arch wiki about PRIME (which is used with RTD3)
- AMD DRM issue tracker for "Laptop GPU order reversed from normal causes boot_vga/drm issues"
- Kernel mailing list about patch that uses the GPU connected to `eDP` as primary display
Check Availability
Before we start, you should check that you have an Nvidia card, that the nvidia driver is installed and loaded, and that you DO NOT have ANY other power-saving tool controlling your discrete GPU.
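A quick way to confirm the driver modules are loaded (assuming the amdgpu-plus-nvidia setup this post is about):

```
$ lsmod | grep -e '^nvidia' -e '^amdgpu'
```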
To check your cards, run lspci -d ::03xx to enumerate your pci display devices.
Here is an example of my output:
$ lspci -d ::03xx
64:00.0 VGA compatible controller: NVIDIA Corporation AD106M [GeForce RTX 4070 Max-Q / Mobile] (rev a1)
65:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Strix [Radeon 880M / 890M] (rev c1)
According to the Nvidia driver doc, NVreg_DynamicPowerManagement=0x03 (which means auto) is set by default by the driver. For Ampere or later laptops, this translates to fine-grained power control (RTD3, or NVreg_DynamicPowerManagement=0x02).
To check if your machine supports RTD3, run (and here is my output):
$ cat /proc/driver/nvidia/gpus/<Your Nvidia bus number from "ls -l /dev/dri/by-path/*">/power
# on my machine: cat /proc/driver/nvidia/gpus/0000:64:00.0/power
Runtime D3 status: Enabled (fine-grained)
Video Memory: Off
GPU Hardware Support:
Video Memory Self Refresh: Supported
Video Memory Off: Supported
S0ix Power Management:
Platform Support: Supported
Status: Disabled
Notebook Dynamic Boost: Supported
You at least want to see that Runtime D3 is supported; usually it will say Enabled. But since you are here, it probably means your compositor is unwilling to use your iGPU for rendering. If you want to force-enable RTD3, you can create a modprobe config (a file under /etc/modprobe.d/) and put options nvidia NVreg_DynamicPowerManagement=0x02 inside, as sketched below.
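For example (the filename is arbitrary, only the `.conf` suffix matters):

```
# /etc/modprobe.d/nvidia-pm.conf
# 0x02 = fine-grained power control (RTD3); 0x03 = driver default ("auto")
options nvidia NVreg_DynamicPowerManagement=0x02
```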
The Fix
I put this section in the front because this is probably what you are here for. If you want to know what you are putting into your command line, please continue reading after this section.
- Try this first: Force your compositor to use your integrated GPU render device.
In my compositor, niri, this means adding this to the niri config file:
debug {
render-drm-device "/dev/dri/renderD128" // Render node of your integrated GPU
}
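To figure out which render node belongs to your iGPU, use the by-path mapping. Illustrative, annotated output based on my PCI numbers from earlier (your renderD numbers may differ):

```
$ ls -l /dev/dri/by-path/ | grep render
... pci-0000:64:00.0-render -> ../renderD129   # Nvidia dGPU
... pci-0000:65:00.0-render -> ../renderD128   # AMD iGPU
```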
This should fix most of the issues on newer kernel versions. On older versions, poorly written applications that do not respect boot_vga and are hardcoded to card0 remain a problem, so we need the changes below to make your iGPU card0.
- Make your Nvidia driver depend on `amdgpu`
In a modprobe config file, have:
softdep nvidia pre: amdgpu
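After a reboot, you can verify that the iGPU now owns card0; expected output, assuming the reordering worked:

```
$ grep DRIVER /sys/class/drm/card0/device/uevent
DRIVER=amdgpu
```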
- Disable `efi` framebuffers if the previous parts did not fix your issue
Add this to your kernel cmdline:
video=efifb:off initcall_blacklist=sysfb_init
If you are using grub, the variable is GRUB_CMDLINE_LINUX_DEFAULT (or GRUB_CMDLINE_LINUX) in /etc/default/grub; just append the parameters to that line and regenerate the config, as shown below.
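A minimal sketch (your existing line will contain other parameters; Debian-based distros use `update-grub` instead of `grub-mkconfig`):

```
# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet video=efifb:off initcall_blacklist=sysfb_init"

# then regenerate the config:
$ sudo grub-mkconfig -o /boot/grub/grub.cfg
```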
The Issue and Explanation
There are mainly two different causes of this issue, both of which result in applications choosing the Nvidia discrete GPU as their default display.
Client and Driver Code Analysis
Before going into the two cases, I should first explain how applications and compositors decide which GPU to use. For compositors, here is a snippet from smithay, the library niri uses to implement the wayland protocol, that decides which render device to use:
// From smithay/src/backend/udev.rs:237-265
pub fn primary_gpu<S: AsRef<str>>(seat: S) -> io::Result<Option<PathBuf>> {
let mut enumerator = Enumerator::new()?;
enumerator.match_subsystem("drm")?;
enumerator.match_sysname("card[0-9]*")?;
if let Some(path) = enumerator
.scan_devices()?
.filter(|device| {
let seat_name = device
.property_value("ID_SEAT")
.map(|x| x.to_os_string())
.unwrap_or_else(|| OsString::from("seat0"));
if seat_name == *seat.as_ref() {
if let Ok(Some(pci)) = device.parent_with_subsystem(Path::new("pci")) {
if let Some(id) = pci.attribute_value("boot_vga") {
return id == "1";
}
}
}
false
})
.flat_map(|device| device.devnode().map(PathBuf::from))
.next()
{
Ok(Some(path))
} else {
all_gpus(seat).map(|all| all.into_iter().next())
}
}
Basically, smithay chooses the card with the boot_vga flag, falling back to card0 otherwise. For applications, the wgpu library is smart about GPU selection, which is driven by the developer's choice of PowerPreference; see the sketch below.
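A minimal sketch of that choice from the application side (API shape as of recent wgpu releases; exact signatures vary between versions):

```rust
// Ask wgpu for an adapter, hinting that we prefer the low-power (integrated) GPU.
async fn pick_adapter(instance: &wgpu::Instance) -> Option<wgpu::Adapter> {
    instance
        .request_adapter(&wgpu::RequestAdapterOptions {
            // LowPower prefers the iGPU; HighPerformance prefers the dGPU.
            power_preference: wgpu::PowerPreference::LowPower,
            force_fallback_adapter: false,
            compatible_surface: None,
        })
        .await
}
```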
For OpenGL and Vulkan, the only driver we can analyze is mesa (we have no insight into the Nvidia driver since it is proprietary, and amdvlk is discontinued; please do not use it, it has weird Vulkan issues that I'll cover in another blog post). Mesa is smart enough to ask the compositor which card it chose.
This chunk of code is for Vulkan (although OpenGL code functionally does almost the same):
// From mesa/src/vulkan/device-select-layer/device_select.c:324-369
static int
get_default(const struct instance_info *info, uint32_t count, struct device_pci_info *pci_infos)
{
int default_idx = -1;
bool has_cpu = false;
for (unsigned i = 0; i < count; ++i)
has_cpu |= pci_infos[i].device_type == VK_PHYSICAL_DEVICE_TYPE_CPU;
if (default_idx == -1 && info->has_wayland) {
default_idx = device_select_find_wayland_pci_default(pci_infos, count);
if (info->debug && default_idx != -1)
fprintf(stderr, "device-select: device_select_find_wayland_pci_default selected %i\n",
default_idx);
}
if (default_idx == -1 && info->has_xcb) {
default_idx = device_select_find_xcb_pci_default(pci_infos, count);
if (info->debug && default_idx != -1)
fprintf(stderr, "device-select: device_select_find_xcb_pci_default selected %i\n",
default_idx);
}
if (default_idx == -1) {
if (info->has_vulkan11 && info->has_pci_bus)
default_idx = device_select_find_boot_vga_default(pci_infos, count);
else
default_idx = device_select_find_boot_vga_vid_did(pci_infos, count);
if (info->debug && default_idx != -1)
fprintf(stderr, "device-select: device_select_find_boot_vga selected %i\n", default_idx);
}
/* If no GPU has been selected so far, select the first non-CPU device. If none are available,
* pick the first CPU device.
*/
if (default_idx == -1) {
default_idx = device_select_find_non_cpu(pci_infos, count);
if (info->debug && default_idx != -1)
fprintf(stderr, "device-select: device_select_find_non_cpu selected %i\n", default_idx);
}
if (default_idx == -1 && has_cpu)
default_idx = 0;
return default_idx;
}
And as you can see, they all love to look for the boot_vga flag, and if the flag does not exist, they fall back to card0. You can inspect the flag yourself, as shown below.
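Illustrative sysfs check (cards whose PCI device is not VGA-class may not expose this file at all; `1` marks the boot VGA device):

```
$ cat /sys/class/drm/card*/device/boot_vga
1
0
```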
Kernel Code Analysis
In the Linux kernel (the latest version, of course), here's how boot_vga is determined (quoting a comment from linux/drivers/pci/vgaarb.c:581):
We select the default VGA device in this order:
- Firmware framebuffer (screen_info from EFI/VESA)
- Legacy VGA device (owns VGA_RSRC_LEGACY_MASK)
- Non-legacy integrated device (ACPI_VIDEO_HID)
- Non-legacy discrete device
- Other device
Here is where things get tricky. As I said above, there are mainly two reasons for Nvidia to be chosen as the default device.
It is either:
- Nvidia owns your framebuffer.
- Your integrated gpu owns the framebuffer, but it is not VGA compatible and thus not considered first, so Nvidia gets selected instead.
Both are edge cases. Here is a partial vgaarb trace:
drivers/pci/vgaarb.c:1515:
This is the entry point for detecting boot_vga.
// static int __init vga_arb_device_init(void)
...
while ((pdev =
pci_get_subsys(PCI_ANY_ID, PCI_ANY_ID, PCI_ANY_ID,
PCI_ANY_ID, pdev)) != NULL) {
if (pci_is_vga(pdev))
vga_arbiter_add_pci_device(pdev);
}
...
include/linux/pci.h:768:
Here, only VGA compatible controllers (PCI class 0x0300) pass the filter; a plain Display controller like the AMD iGPU above (class 0x0380) is skipped.
static inline bool pci_is_vga(struct pci_dev *pdev)
{
if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
return true;
if ((pdev->class >> 8) == PCI_CLASS_NOT_DEFINED_VGA)
return true;
return false;
}
drivers/pci/vgaarb.c:736:
This chunk of code decides whether we have found a better candidate for boot_vga; if we have, it swaps.
// static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
...
if (vga_is_boot_device(vgadev)) {
vgaarb_info(&pdev->dev, "setting as boot VGA device%s\n",
vga_default_device() ?
" (overriding previous)" : "");
vga_set_default_device(pdev);
}
...
drivers/pci/vgaarb.c:581:
Here is basically what the comment says:
If we haven't found a legacy VGA device, accept a non-legacy device. It may have either IO or MEM enabled, and bridges may not have PCI_BRIDGE_CTL_VGA enabled, so it may not be able to use legacy VGA resources. Prefer an integrated GPU over others.
/*
* Return true if vgadev is a better default VGA device than the best one
* we've seen so far.
*/
// static bool vga_is_boot_device(struct vga_device *vgadev)
...
/*
* If we haven't found a legacy VGA device, accept a non-legacy
* device. It may have either IO or MEM enabled, and bridges may
* not have PCI_BRIDGE_CTL_VGA enabled, so it may not be able to
* use legacy VGA resources. Prefer an integrated GPU over others.
*/
pci_read_config_word(pdev, PCI_COMMAND, &cmd);
if (cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY)) {
/*
* An integrated GPU overrides a previous non-legacy
* device. We expect only a single integrated GPU, but if
* there are more, we use the *last* because that was the
* previous behavior.
*/
if (vga_arb_integrated_gpu(&pdev->dev))
return true;
/*
* We prefer the first non-legacy discrete device we find.
* If we already found one, vgadev is no better.
*/
if (boot_vga) {
pci_read_config_word(boot_vga->pdev, PCI_COMMAND,
&boot_cmd);
if (boot_cmd & (PCI_COMMAND_IO | PCI_COMMAND_MEMORY))
return false;
}
return true;
}
...
For the first case, check the pci number of each device, and hopefully both are VGA compatible controllers or both are Display controllers. If there is one and only one VGA compatible controller, according to the rules above, it will be chosen as boot_vga. If both devices have the same class, the tie is broken by pci number (smaller numbers get higher priority). You can check the pci numbers and display classes of all your graphics devices with lspci -d ::03xx. If your integrated gpu has the higher pci number, you can make it your boot_vga by disabling framebuffers:
- Disable `efifb`, the legacy framebuffer driver that is mostly never used on modern systems because we have `simpledrm`.
- Blacklist `sysfb_init`, so that no framebuffer is claimed by Linux from your bootloader.
video=efifb:off initcall_blacklist=sysfb_init
If you are in the second case like me, where Nvidia claims to be the VGA compatible controller, there is not much we can do besides waiting for the kernel patch.
Here are some questions you might have:
- Why would the solution above work?
The first step makes the wayland compositor use the render node of the integrated gpu, and all well-behaved graphics apis defer to the wayland compositor.
The second step makes `nvidia` depend on `amdgpu`, so amdgpu binds first and gets the lower-numbered render node (and card0), which is used next when there is no wayland compositor or your compositor does not implement that api.
The third step covers a rare case that I have only seen once, on the Arch Linux forum, where someone solved this issue by removing the framebuffer. I personally believe hardware from good manufacturers should never end up in this case.
- What does `simpledrm` do?
It is an `fbdev` replacement. It takes over the framebuffer from `simple-framebuffer`, which is initialized by `sysfb_init`. This is what displays logs while Linux boots, because at that stage your GPU drivers are not yet loaded. If we disable framebuffers on the kernel command line, the screen stays black until a real gpu driver is loaded (this took about 1.1s on my system).
- Why does my device behave weirdly with `RTD3` on?
Probably you have some power management background service running that is fighting with the Nvidia driver.