Skip to content

Kernel: Considering the removal of jails #26392

@spholz

Description

@spholz

Additional context:

(tldr: I noticed some questionable locking patterns in VFSRootContext, a jails-related class)

While attempting to port the xHCI driver to the new WaitQueue introduced in #26160 I noticed quite a lot of questionable locking patterns related to VFSRootContext. My current progress for that is here: https://github.com/spholz/serenity/tree/xhci-locking-stuff.

The new WaitQueue doesn't support blocking while in a critical section (rightfully so, since this can easily lead to deadlocks). Currently this leads to a confusing panic message in Thread::set_state(), so I added a better panic message in spholz@9f93141.
However, the filesystem code currently holds spinlocks during disk transfers, which causes us to hit this panic, as submitting USB drive read/write requests causes us to block on a WaitQueue in the xHCI code:

xHCI panic
[Power State Switch Task(52:52)]: KERNEL PANIC! :^(
[Power State Switch Task(52:52)]: Attempted to block in critical section
[Power State Switch Task(52:52)]: at ./Kernel/Tasks/WaitQueue.cpp:84 in void Kernel::WaitQueue::Waiter::maybe_block()
Kernel + 0x000000000036b298  Kernel::__panic(char const*, unsigned int, char const*) +0x188
Kernel + 0x00000000004f2926  Kernel::WaitQueue::Waiter::maybe_block() +0x416
Kernel + 0x000000000004c851  Kernel::USB::xHCI::xHCIController::submit_bulk_transfer(Kernel::USB::Transfer&) +0x581
Kernel + 0x000000000007d3a7  Kernel::USB::BulkOutPipe::submit_bulk_out_transfer(unsigned long, void const*) +0x357
Kernel + 0x000000000017f639  AK::ErrorOr<Kernel::USB::CommandStatusWrapper, AK::Error> Kernel::USB::BulkSCSIInterface::send_scsi_command<(Kernel::USB::SCSIDataDirection)0, Kernel::SCSI::Write10, Kernel::UserOrKernelBuffer>(Kernel::SCSI: +0xc9
Kernel + 0x000000000017d1a4  Kernel::USB::BulkSCSIStorageDevice::do_write(unsigned int, unsigned int, Kernel::UserOrKernelBuffer&, unsigned long) +0x184
Kernel + 0x000000000017d3c0  Kernel::USB::BulkSCSIStorageDevice::start_request(Kernel::AsyncBlockDeviceRequest&) +0x120
Kernel + 0x000000000019459a  AK::ErrorOr<AK::NonnullLockRefPtr<Kernel::AsyncBlockDeviceRequest>, AK::Error> Kernel::Device::try_make_request<Kernel::AsyncBlockDeviceRequest, Kernel::AsyncBlockDeviceRequest::RequestType, unsigned long&,  +0x65a
Kernel + 0x000000000018ff22  Kernel::StorageDevice::write(Kernel::OpenFileDescription&, unsigned long, Kernel::UserOrKernelBuffer const&, unsigned long) +0x7e2
Kernel + 0x00000000001a671a  Kernel::BlockBasedFileSystem::raw_write(AK::DistinctNumeric<unsigned long, Kernel::__BlockIndex_tag, AK::DistinctNumericFeature::Comparison, AK::DistinctNumericFeature::CastToBool>, Kernel::UserOrKernelBuffe +0x9a
Kernel + 0x00000000001a6ae1  Kernel::BlockBasedFileSystem::raw_write_blocks(AK::DistinctNumeric<unsigned long, Kernel::__BlockIndex_tag, AK::DistinctNumericFeature::Comparison, AK::DistinctNumericFeature::CastToBool>, unsigned long, Ker +0xc1
Kernel + 0x00000000001bc93c  Kernel::Ext2FS::flush_super_block() +0x13c
Kernel + 0x00000000001c4f51  Kernel::Ext2FS::prepare_to_clear_last_mount(Kernel::Inode&) +0x7d1
Kernel + 0x00000000001fcefa  Kernel::FileSystem::prepare_to_unmount(Kernel::Inode&) +0x20a
Kernel + 0x00000000002daa59  Kernel::VirtualFileSystem::remove_mount(Kernel::Mount&, AK::Detail::IntrusiveList<Kernel::FileBackedFileSystem, Kernel::FileBackedFileSystem*, &Kernel::FileBackedFileSystem::m_file_backed_file_system_node>&) +0xd9
Kernel + 0x00000000002cca35  Kernel::VFSRootContext::unmount(AK::Detail::IntrusiveList<Kernel::FileBackedFileSystem, Kernel::FileBackedFileSystem*, &Kernel::FileBackedFileSystem::m_file_backed_file_system_node>&, Kernel::Inode&, AK::Str +0x815
Kernel + 0x00000000002da2bc  Kernel::VirtualFileSystem::unmount(Kernel::VFSRootContext&, Kernel::Inode&, AK::StringView) +0xbc
Kernel + 0x00000000002cee95  Kernel::VFSRootContext::do_full_teardown(AK::Badge<Kernel::PowerStateSwitchTask>) +0x7a5
Kernel + 0x00000000004a62c9  Kernel::PowerStateSwitchTask::perform_shutdown(Kernel::PowerStateSwitchTask::DoReboot) +0xb39
Kernel + 0x00000000004a6d37  Kernel::PowerStateSwitchTask::power_state_switch_task(void*) +0xc7

This panic is caused by prepare_to_unmount() holding the FileSystem::m_attach_count spinlock (introduced by 0fd7b68) while flushing the superblock. I fixed this in spholz@7b00d93 by converting it to a mutex.

This isn't the only lock held while blocking on xHCI transfers though. Additionally, I also had to migrate the VFSRootContext::m_details lock (introduced by 1dfc9e2) to a mutex. I did this in spholz@96fbdc8.

Now converting these two spinlocks to mutexes caused even more problems, since the m_details mutex is held while we lock the process's m_attached_vfs_root_context spinlock in the Process constructor while calling VFSRootContext::attach(). Acquiring a mutex while holding a spinlock is always a bug.
We could also migrate this lock to a mutex, but this will cause a problem here: https://github.com/spholz/serenity/blob/da0161ddabbe8422df59b86b670250c4e40d9a89/Kernel/Tasks/Process.cpp#L793-L795, where we would lock the m_current_directory spinlock before acquiring the mutex.

I also noticed that the Process class provides an accessor for the m_attached_vfs_root_context in Process::vfs_root_context(), but that function just locks the lock once, gets the VFSRootContext and then unlocks again. This seems very TOCTOU-susceptible.
This code in VFSRootContext seems also quite susceptible to TOCTOUs: https://github.com/spholz/serenity/blob/da0161ddabbe8422df59b86b670250c4e40d9a89/Kernel/FileSystem/VirtualFileSystem.cpp#L297-L300
prepare_to_unmount locks the m_attach_count and clears the last mount if the count is 1, but then later we acquire that lock again and check if the count is 1 for a second time.

Whenever we use the VFSRootContext::root_custody() function we use it like this:

    auto vfs_root_context_root_custody = Process::current().vfs_root_context()->root_custody().with([](auto& custody) -> NonnullRefPtr<Custody> {
        return custody;
    });

This also seems quite TOCTOU-y to me, if changing the root inode is possible. The jails-related copy_mount syscall might be able to do that, but I'm not really sure. If we would support chroot in the future, this would definitely be possible. If changing the root inode is impossible, then this lock should be unnecessary.

So in conclusion a lot of VFSRootContext code seems to have questionable locking patterns to me.


Nobody is currently working on jails code, so I'm considering to remove support for it entirely.
Jails in general add quite a lot of complexity in many parts of the kernel, which likely can easily bitrot if no one is paying attention to it. Proper support for containers/jails requires proper permission checks in a lot of places in the kernel. Keeping those correct and adding new ones where appropriate is also extra maintenance effort.

I don't think it's worth it to keep supporting jails, since nothing except for the runc utility is currently using them. Especially since support for them adds a lot of extra complexity spread all over the kernel code base.

If someone else want to continue working on that code and fixing these issues, please say so.
Otherwise I guess I may attempt to remove support for it, but doing so likely isn't easy, since it touches so many parts in the kernel.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions