Today, I investigated a strange boot issue on the Nvidia Orin.

After flashing a fresh boot image and following the Nvidia documentation, the system showed an error in the final steps of the boot process, shortly before handing over to the Linux kernel:

L4TLauncher: Attempting Direct Boot
OpenAndReadUntrustedFileToBuffer: Failed to open boot\extlinux\extlinux.conf: Not Found
ProcessExtLinuxConfig:sds Failed to Authenticate boot\extlinux\extlinux.conf (Not Found)

And, unfortunately, we need to modify /boot/extlinux/extlinux.conf for our research ...

Googling OpenAndReadUntrustedFileToBuffer leads to the implementation: https://github.com/NVIDIA/edk2-nvid...

The L4TLauncher uses UEFI services to access this file from the Linux root partition, which is an ext4 file system. Maybe UEFI has issues mounting the filesystem? Time to investigate with the UEFI Shell.

Since we still had a working system, I compared the two. The UEFI Shell prints all partitions and all mounted filesystems at startup (or type map -c to see the list). On the good system, the ext4 filesystem was found immediately and listed as fs2:. But on the bad system, fs2: was the EFI partition, which would be fs3: on the good system; the ext4 partition had no filesystem mapping at all. Strange. Maybe the ext4 filesystem driver was missing? But Ext4Dxe was loaded. Even stranger.

After poking around in the UEFI Shell for a while in search of enlightenment (or serendipity), I concluded that I was just wasting my time: the built-in tools didn't tell me anything relevant. Frustrating. So I booted Linux and started debugging there. Both Linux systems showed a comparable partition layout. Hmm. Maybe there's an issue with the filesystem itself? Let's try fsck. And bingo: the bad Orin complained about an unsupported filesystem feature:

nvidia@orin:~$ sudo fsck -n /dev/mmcblk0p1
[sudo] password for nvidia: 
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Warning!  /dev/mmcblk0p1 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/mmcblk0p1 has unsupported feature(s): FEATURE_C12
e2fsck: Get a newer version of e2fsck!

/dev/mmcblk0p1: ********** WARNING: Filesystem still has errors **********
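The odd name FEATURE_C12 comes from how e2fsprogs labels a feature bit it has no name for: FEATURE_, then C, I or R (for the compat, incompat and ro_compat superblock fields), then the bit number. Below is a minimal sketch that decodes such a label. The helper decode_placeholder and the tiny name table are illustrative assumptions; the three bit values are taken from the Linux ext4 headers.

```python
import re

# Maps the letter e2fsprogs puts in "FEATURE_<letter><bit>" to the
# superblock field that holds that feature bit.
FIELDS = {
    "C": "s_feature_compat",
    "I": "s_feature_incompat",
    "R": "s_feature_ro_compat",
}

# Small hand-picked excerpt of newer feature names (hypothetical table).
KNOWN = {
    ("C", 12): "orphan_file",         # EXT4_FEATURE_COMPAT_ORPHAN_FILE      0x1000
    ("I", 13): "metadata_csum_seed",  # EXT4_FEATURE_INCOMPAT_CSUM_SEED      0x2000
    ("R", 10): "metadata_csum",       # EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400
}

def decode_placeholder(name):
    """Decode e.g. 'FEATURE_C12' into (field, bit mask, known name or None)."""
    match = re.fullmatch(r"FEATURE_([CIR])(\d+)", name)
    if match is None:
        raise ValueError(f"not an e2fsprogs feature placeholder: {name!r}")
    letter, bit = match.group(1), int(match.group(2))
    return FIELDS[letter], 1 << bit, KNOWN.get((letter, bit))

field, mask, known = decode_placeholder("FEATURE_C12")
print(f"{field} mask=0x{mask:04x} name={known}")
# prints: s_feature_compat mask=0x1000 name=orphan_file
```

So the older e2fsck simply predates the name for compat bit 12; it is not telling us the bit is corrupt.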

On the working Orin, however, fsck ran without this complaint. Could this be the issue?

I dug deeper. FEATURE_C12 stands for orphan_file, which didn't sound like a convincing culprit. So I dumped the filesystem features with dumpe2fs.

Good Orin:

$ dumpe2fs -h /dev/mmcblk0p1
...
Filesystem features:          ... needs_recovery ...

Bad Orin:

$ dumpe2fs -h /dev/mmcblk0p1
...
Filesystem features:          ... FEATURE_C12 ... metadata_csum_seed ...
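Comparing the two feature lists by eye works, but a set difference makes the delta explicit. The following is a small sketch that parses the "Filesystem features:" line of dumpe2fs -h output. The function names are hypothetical, and the two sample lines are made up and shortened for illustration; they are not the real output of the two Orins.

```python
def parse_features(dumpe2fs_output):
    """Return the set of flags listed on the 'Filesystem features:' line."""
    for line in dumpe2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            return set(value.split())
    raise ValueError("no 'Filesystem features:' line found")

def feature_diff(good_output, bad_output):
    """Return (features only on the bad system, features only on the good one)."""
    good, bad = parse_features(good_output), parse_features(bad_output)
    return sorted(bad - good), sorted(good - bad)

# Made-up sample lines for illustration only.
GOOD = "Filesystem features:      has_journal extent needs_recovery metadata_csum\n"
BAD = "Filesystem features:      has_journal extent FEATURE_C12 metadata_csum metadata_csum_seed\n"

print(feature_diff(GOOD, BAD))
# prints: (['FEATURE_C12', 'metadata_csum_seed'], ['needs_recovery'])
```

On real systems, one would save the output of dumpe2fs -h /dev/mmcblk0p1 from each Orin to a text file and pass the file contents in.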

Also, the output on the bad system was slightly longer. Maybe it's the metadata checksum seed? Let's try it out!

But how do you remount the active root filesystem read-only to apply the change? I remembered the old magic SysRq tricks, and luckily, they can still be triggered via the proc filesystem:

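sudo su -                                        # open a root shell
echo u > /proc/sysrq-trigger                     # SysRq 'u': emergency remount of filesystems read-only
tune2fs -O ^metadata_csum_seed /dev/mmcblk0p1    # clear the metadata_csum_seed feature flag
sync                                             # flush any pending writes
reboot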
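Before running tune2fs, it can be worth confirming that the SysRq remount actually took effect. A minimal sketch, assuming the standard /proc/mounts column layout (device, mount point, type, options, dump, pass): it looks up the options of the root mount and checks for "ro". The helper names and the sample text are hypothetical.

```python
def mount_options(proc_mounts, mountpoint="/"):
    """Return the mount options of the last /proc/mounts entry for mountpoint."""
    options = None
    for line in proc_mounts.splitlines():
        fields = line.split()
        # Columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mountpoint:
            options = fields[3].split(",")  # later entries shadow earlier ones
    if options is None:
        raise LookupError(f"{mountpoint} is not mounted")
    return options

def is_read_only(proc_mounts, mountpoint="/"):
    """True if the effective mount at mountpoint carries the 'ro' option."""
    return "ro" in mount_options(proc_mounts, mountpoint)

# Hypothetical /proc/mounts excerpt, made up for illustration.
SAMPLE = (
    "/dev/mmcblk0p1 / ext4 ro,relatime 0 0\n"
    "tmpfs /run tmpfs rw,nosuid,nodev 0 0\n"
)
print(is_read_only(SAMPLE))  # prints: True
```

On the device itself, one would pass the text read from /proc/mounts instead of SAMPLE and only proceed with tune2fs if the result is True.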

And the system came up again, and this time it found its extlinux.conf at boot ...

Here is what happened: we had flashed the bad Orin from a new laptop with a newer version of mkfs.ext4, which enables the metadata_csum_seed feature by default. The UEFI ext4 driver, however, refuses to mount filesystems with unsupported features. That is a sensible decision, since it avoids corrupting the filesystem, but UEFI could at least print a warning ...
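The ext4 on-disk format defines the rule a driver is expected to follow: an unknown compat feature may be ignored, an unknown ro_compat feature allows at most a read-only mount, and an unknown incompat feature means the filesystem must not be mounted. metadata_csum_seed is an incompat feature (bit 0x2000). Here is a sketch of that incompat check on a raw superblock. It is not NVIDIA's or edk2's code, and SUPPORTED_INCOMPAT is a hypothetical mask chosen for illustration, not Ext4Dxe's actual list; fake_image is likewise a hypothetical helper that builds a minimal test image.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # primary superblock starts 1024 bytes into the partition
EXT4_MAGIC = 0xEF53       # s_magic, at offset 0x38 within the superblock

# A few s_feature_incompat bits (values from the ext4 on-disk format).
INCOMPAT_FILETYPE = 0x0002
INCOMPAT_RECOVER = 0x0004    # "needs_recovery"
INCOMPAT_EXTENTS = 0x0040
INCOMPAT_64BIT = 0x0080
INCOMPAT_FLEX_BG = 0x0200
INCOMPAT_CSUM_SEED = 0x2000  # "metadata_csum_seed"

# HYPOTHETICAL set of incompat features a simple driver understands.
# An assumption for illustration, not Ext4Dxe's real list.
SUPPORTED_INCOMPAT = (INCOMPAT_FILETYPE | INCOMPAT_RECOVER | INCOMPAT_EXTENTS
                      | INCOMPAT_64BIT | INCOMPAT_FLEX_BG)

def read_features(image):
    """Return (compat, incompat, ro_compat) from the raw bytes of an ext4 partition."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    if len(sb) < 0x68:
        raise ValueError("too short to contain an ext4 superblock")
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError(f"bad superblock magic 0x{magic:04x}")
    # s_feature_compat, s_feature_incompat, s_feature_ro_compat (little-endian u32)
    return struct.unpack_from("<III", sb, 0x5C)

def may_mount(image, supported=SUPPORTED_INCOMPAT):
    """Refuse the mount if any incompat bit is set that the driver does not know."""
    _, incompat, _ = read_features(image)
    unknown = incompat & ~supported
    return unknown == 0, unknown

def fake_image(incompat):
    """Build a minimal fake partition image holding only the magic and feature words."""
    image = bytearray(2 * SUPERBLOCK_OFFSET)
    struct.pack_into("<H", image, SUPERBLOCK_OFFSET + 0x38, EXT4_MAGIC)
    struct.pack_into("<III", image, SUPERBLOCK_OFFSET + 0x5C, 0, incompat, 0)
    return bytes(image)

print(may_mount(fake_image(SUPPORTED_INCOMPAT)))                       # prints: (True, 0)
print(may_mount(fake_image(SUPPORTED_INCOMPAT | INCOMPAT_CSUM_SEED)))  # prints: (False, 8192)
```

With root privileges, the first 2048 bytes of /dev/mmcblk0p1 could be passed in place of fake_image(...), but the verdict depends entirely on the assumed support mask.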