| author | Alice Ryhl <aliceryhl@google.com> | 2025-09-19 09:42:07 +0300 |
|---|---|---|
| committer | Greg Kroah-Hartman <gregkh@linuxfoundation.org> | 2025-09-19 10:40:46 +0300 |
| commit | eafedbc7c050c44744fbdf80bdf3315e860b7513 (patch) | |
| tree | 9537baab5b4dacf3b1694016c723093737b8c6eb | /drivers/android/binder/range_alloc |
| parent | 55f6ac4484b342e62640ee4135ab1f1ffbcd5be5 (diff) | |
| download | linux-eafedbc7c050c44744fbdf80bdf3315e860b7513.tar.xz | |
rust_binder: add Rust Binder driver
We're generally not proponents of rewrites (nasty uncomfortable things
that make you late for dinner!). So why rewrite Binder?
Binder has been evolving over the past 15+ years to meet the evolving
needs of Android. Its responsibilities, expectations, and complexity
have grown considerably during that time. While we expect Binder to
continue to evolve along with Android, there are a number of factors
that currently constrain our ability to develop/maintain it. Briefly
those are:
1. Complexity: Binder is at the intersection of everything in Android and
fulfills many responsibilities beyond IPC. It has become many things
to many people, and due to its many features and their interactions
with each other, its complexity is quite high. In just 6kLOC it must
deliver transactions to the right threads. It must correctly parse
and translate the contents of transactions, which can contain several
objects of different types (e.g., pointers, fds) that can interact
with each other. It controls the size of thread pools in userspace,
and ensures that transactions are assigned to threads in ways that
avoid deadlocks where the threadpool has run out of threads. It must
track refcounts of objects that are shared by several processes by
forwarding refcount changes between the processes correctly. It must
handle numerous error scenarios and it combines/nests 13 different
locks, 7 reference counters, and atomic variables. Finally, it must
do all of this as fast and efficiently as possible. Minor performance
regressions can cause a noticeably degraded user experience.
2. Things to improve: Thousand-line functions [1], error-prone error
handling [2], and confusing structure can occur as a code base grows
organically. After more than a decade of development, this codebase
could use an overhaul.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/android/binder.c?h=v6.5#n2896
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/android/binder.c?h=v6.5#n3658
3. Security critical: Binder is a critical part of Android's sandboxing
strategy. Even Android's most de-privileged sandboxes (e.g. the
Chrome renderer, or SW Codec) have direct access to Binder. More than
just about any other component, it's important that Binder provide
robust security, and itself be robust against security
vulnerabilities.
It's #1 (high complexity) that has made continuing to evolve Binder and
resolving #2 (tech debt) exceptionally difficult without causing #3
(security issues). For Binder to continue to meet Android's needs, we
need better ways to manage (and reduce!) complexity without increasing
the risk.
The biggest change is obviously the choice of programming language. We
decided to use Rust because it directly addresses a number of the
challenges within Binder that we have faced during the last years. It
prevents mistakes with ref counting, locking, bounds checking, and also
does a lot to reduce the complexity of error handling. Additionally,
we've been able to use the more expressive type system to encode the
ownership semantics of the various structs and pointers, which takes the
complexity of managing object lifetimes out of the hands of the
programmer, reducing the risk of use-after-frees and similar problems.
Rust has many different pointer types that it uses to encode ownership
semantics into the type system, and this is probably one of the most
important aspects of how it helps in Binder. The Binder driver has a lot
of different objects that have complex ownership semantics; some
pointers own a refcount, some pointers have exclusive ownership, and
some pointers just reference the object and it is kept alive in some
other manner. With Rust, we can use a different pointer type for each
kind of pointer, which enables the compiler to enforce that the
ownership semantics are implemented correctly.
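To make that concrete, here is a minimal sketch (plain userspace Rust,
not code from the driver; the `Node` type and the functions are invented
for the example) of how exclusive ownership, owning a refcount, and a
plain borrow map onto three distinct pointer types that the compiler
keeps apart:

```rust
use std::sync::Arc;

struct Node {
    debug_id: usize,
}

// Owns the only reference: dropping the Box frees the Node.
fn take_exclusive(node: Box<Node>) -> usize {
    node.debug_id
} // `node` is dropped here; no other pointer can still observe it.

// Owns one refcount: the Node stays alive at least as long as this Arc.
fn take_refcounted(node: Arc<Node>) -> usize {
    node.debug_id
} // This clone's refcount is released here.

// Borrows without owning: the caller guarantees the Node outlives the call.
fn inspect(node: &Node) -> usize {
    node.debug_id
}

fn main() {
    let owned = Box::new(Node { debug_id: 1 });
    println!("{}", take_exclusive(owned));
    // println!("{}", owned.debug_id); // would not compile: `owned` was moved

    let shared = Arc::new(Node { debug_id: 2 });
    let clone = Arc::clone(&shared);
    println!("{}", take_refcounted(clone));
    println!("{}", inspect(&shared));
}
```

The commented-out line is the kind of mistake the compiler rejects:
using an object after its exclusive owner has already given it away.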
Another useful feature is Rust's error handling. Rust allows for simpler
error handling with features such as destructors, and you get compilation
failures if errors are not properly handled. This means that
even though Rust requires you to spend more lines of code than C on
things such as writing down invariants that are left implicit in C, the
Rust driver is still slightly smaller than C binder: Rust is 5.5kLOC and
C is 5.8kLOC. (These numbers exclude blank lines, comments, binderfs, and
any debugging facilities in C that are not yet implemented in the Rust
driver. They include abstractions in rust/kernel/ that are unlikely to be
used by drivers other than Binder.)
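As an illustration of the error-handling point (again plain userspace
Rust, not driver code; `read_config` and `Guard` are invented for the
example), the `?` operator propagates failures early while destructors
still run on every exit path, which is what replaces C's goto-cleanup
chains:

```rust
use std::fs::File;
use std::io::{self, Read};

struct Guard(&'static str);

impl Drop for Guard {
    // Runs on the success path and on every early `?` return.
    fn drop(&mut self) {
        println!("cleanup: {}", self.0);
    }
}

fn read_config(path: &str) -> io::Result<String> {
    let _guard = Guard("release resources");
    let mut file = File::open(path)?; // on error, return early; Guard still drops
    let mut contents = String::new();
    file.read_to_string(&mut contents)?; // same here
    Ok(contents)
}

fn main() {
    // Result is #[must_use]: silently ignoring a failure produces a
    // compiler warning, and there is no way to use a value that was
    // never produced.
    match read_config("/etc/hostname") {
        Ok(text) => println!("read {} bytes", text.len()),
        Err(err) => println!("failed: {err}"),
    }
}
```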
Although this rewrite completely rethinks how the code is structured and
how assumptions are enforced, we do not fundamentally change *how* the
driver does the things it does. A lot of careful thought has gone into
the existing design. Rather, the rewrite aims at improving code
health, structure, readability, robustness, security, maintainability
and extensibility. We also include more inline documentation, and
improve how assumptions in the code are enforced. Furthermore, all
unsafe code is annotated with a SAFETY comment that explains why it is
correct.
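For example, the SAFETY-comment convention looks like the following
minimal sketch (not taken from the driver; `first_byte` is invented for
the example), where the comment records why the preconditions of the
`unsafe` block hold at that call site:

```rust
fn first_byte(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    // SAFETY: `bytes` is non-empty, so index 0 is in bounds for this slice.
    Some(unsafe { *bytes.get_unchecked(0) })
}

fn main() {
    assert_eq!(first_byte(b"binder"), Some(b'b'));
    assert_eq!(first_byte(b""), None);
}
```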
We have left the binderfs filesystem component in C. Rewriting it in
Rust would be a large amount of work and requires a lot of bindings to
the file system interfaces. Binderfs has not historically had the same
challenges with security and complexity, so rewriting it seems to offer
less value than rewriting the rest of Binder.
Correctness and feature parity
------------------------------
Rust binder passes all tests that validate the correctness of Binder in
the Android Open Source Project. We can boot a device, and run a variety
of apps and functionality without issues. We have performed this both on
the Cuttlefish Android emulator device, and on a Pixel 6 Pro.
As for feature parity, Rust binder currently implements all features
that C binder supports, with the exception of some debugging facilities.
The missing debugging facilities will be added before we submit the Rust
implementation upstream.
Tracepoints
-----------
I did not include all of the tracepoints, as I felt that the mechanism
for making C access fields of Rust structs should be discussed on list
separately. I also did not include the support for building Rust Binder
as a module since that requires exporting a bunch of additional symbols
on the C side.
Original RFC Link with old benchmark numbers:
https://lore.kernel.org/r/20231101-rust-binder-v1-0-08ba9197f637@google.com
Co-developed-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Co-developed-by: Matt Gilbride <mattgilbride@google.com>
Signed-off-by: Matt Gilbride <mattgilbride@google.com>
Acked-by: Carlos Llamas <cmllamas@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20250919-rust-binder-v2-1-a384b09f28dd@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Diffstat (limited to 'drivers/android/binder/range_alloc')
| -rw-r--r-- | drivers/android/binder/range_alloc/array.rs | 251 |
| -rw-r--r-- | drivers/android/binder/range_alloc/mod.rs | 329 |
| -rw-r--r-- | drivers/android/binder/range_alloc/tree.rs | 488 |
3 files changed, 1068 insertions, 0 deletions
diff --git a/drivers/android/binder/range_alloc/array.rs b/drivers/android/binder/range_alloc/array.rs new file mode 100644 index 000000000000..07e1dec2ce63 --- /dev/null +++ b/drivers/android/binder/range_alloc/array.rs @@ -0,0 +1,251 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Copyright (C) 2025 Google LLC. + +use kernel::{ + page::{PAGE_MASK, PAGE_SIZE}, + prelude::*, + seq_file::SeqFile, + seq_print, + task::Pid, +}; + +use crate::range_alloc::{DescriptorState, FreedRange, Range}; + +/// Keeps track of allocations in a process' mmap. +/// +/// Each process has an mmap where the data for incoming transactions will be placed. This struct +/// keeps track of allocations made in the mmap. For each allocation, we store a descriptor that +/// has metadata related to the allocation. We also keep track of available free space. +pub(super) struct ArrayRangeAllocator<T> { + /// This stores all ranges that are allocated. Unlike the tree based allocator, we do *not* + /// store the free ranges. + /// + /// Sorted by offset. + pub(super) ranges: KVec<Range<T>>, + size: usize, + free_oneway_space: usize, +} + +struct FindEmptyRes { + /// Which index in `ranges` should we insert the new range at? + /// + /// Inserting the new range at this index keeps `ranges` sorted. + insert_at_idx: usize, + /// Which offset should we insert the new range at? + insert_at_offset: usize, +} + +impl<T> ArrayRangeAllocator<T> { + pub(crate) fn new(size: usize, alloc: EmptyArrayAlloc<T>) -> Self { + Self { + ranges: alloc.ranges, + size, + free_oneway_space: size / 2, + } + } + + pub(crate) fn free_oneway_space(&self) -> usize { + self.free_oneway_space + } + + pub(crate) fn count_buffers(&self) -> usize { + self.ranges.len() + } + + pub(crate) fn total_size(&self) -> usize { + self.size + } + + pub(crate) fn is_full(&self) -> bool { + self.ranges.len() == self.ranges.capacity() + } + + pub(crate) fn debug_print(&self, m: &SeqFile) -> Result<()> { + for range in &self.ranges { + seq_print!( + m, + " buffer {}: {} size {} pid {} oneway {}", + 0, + range.offset, + range.size, + range.state.pid(), + range.state.is_oneway(), + ); + if let DescriptorState::Reserved(_) = range.state { + seq_print!(m, " reserved\n"); + } else { + seq_print!(m, " allocated\n"); + } + } + Ok(()) + } + + /// Find somewhere to put a new range. + /// + /// Unlike the tree implementation, we do not bother to find the smallest gap. The idea is that + /// fragmentation isn't a big issue when we don't have many ranges. + /// + /// Returns the index that the new range should have in `self.ranges` after insertion. + fn find_empty_range(&self, size: usize) -> Option<FindEmptyRes> { + let after_last_range = self.ranges.last().map(Range::endpoint).unwrap_or(0); + + if size <= self.total_size() - after_last_range { + // We can put the range at the end, so just do that. + Some(FindEmptyRes { + insert_at_idx: self.ranges.len(), + insert_at_offset: after_last_range, + }) + } else { + let mut end_of_prev = 0; + for (i, range) in self.ranges.iter().enumerate() { + // Does it fit before the i'th range? + if size <= range.offset - end_of_prev { + return Some(FindEmptyRes { + insert_at_idx: i, + insert_at_offset: end_of_prev, + }); + } + end_of_prev = range.endpoint(); + } + None + } + } + + pub(crate) fn reserve_new( + &mut self, + debug_id: usize, + size: usize, + is_oneway: bool, + pid: Pid, + ) -> Result<usize> { + // Compute new value of free_oneway_space, which is set only on success. 
+ let new_oneway_space = if is_oneway { + match self.free_oneway_space.checked_sub(size) { + Some(new_oneway_space) => new_oneway_space, + None => return Err(ENOSPC), + } + } else { + self.free_oneway_space + }; + + let FindEmptyRes { + insert_at_idx, + insert_at_offset, + } = self.find_empty_range(size).ok_or(ENOSPC)?; + self.free_oneway_space = new_oneway_space; + + let new_range = Range { + offset: insert_at_offset, + size, + state: DescriptorState::new(is_oneway, debug_id, pid), + }; + // Insert the value at the given index to keep the array sorted. + self.ranges + .insert_within_capacity(insert_at_idx, new_range) + .ok() + .unwrap(); + + Ok(insert_at_offset) + } + + pub(crate) fn reservation_abort(&mut self, offset: usize) -> Result<FreedRange> { + // This could use a binary search, but linear scans are usually faster for small arrays. + let i = self + .ranges + .iter() + .position(|range| range.offset == offset) + .ok_or(EINVAL)?; + let range = &self.ranges[i]; + + if let DescriptorState::Allocated(_) = range.state { + return Err(EPERM); + } + + let size = range.size; + let offset = range.offset; + + if range.state.is_oneway() { + self.free_oneway_space += size; + } + + // This computes the range of pages that are no longer used by *any* allocated range. The + // caller will mark them as unused, which means that they can be freed if the system comes + // under memory pressure. + let mut freed_range = FreedRange::interior_pages(offset, size); + #[expect(clippy::collapsible_if)] // reads better like this + if offset % PAGE_SIZE != 0 { + if i == 0 || self.ranges[i - 1].endpoint() <= (offset & PAGE_MASK) { + freed_range.start_page_idx -= 1; + } + } + if range.endpoint() % PAGE_SIZE != 0 { + let page_after = (range.endpoint() & PAGE_MASK) + PAGE_SIZE; + if i + 1 == self.ranges.len() || page_after <= self.ranges[i + 1].offset { + freed_range.end_page_idx += 1; + } + } + + self.ranges.remove(i)?; + Ok(freed_range) + } + + pub(crate) fn reservation_commit(&mut self, offset: usize, data: &mut Option<T>) -> Result { + // This could use a binary search, but linear scans are usually faster for small arrays. + let range = self + .ranges + .iter_mut() + .find(|range| range.offset == offset) + .ok_or(ENOENT)?; + + let DescriptorState::Reserved(reservation) = &range.state else { + return Err(ENOENT); + }; + + range.state = DescriptorState::Allocated(reservation.clone().allocate(data.take())); + Ok(()) + } + + pub(crate) fn reserve_existing(&mut self, offset: usize) -> Result<(usize, usize, Option<T>)> { + // This could use a binary search, but linear scans are usually faster for small arrays. 
+ let range = self + .ranges + .iter_mut() + .find(|range| range.offset == offset) + .ok_or(ENOENT)?; + + let DescriptorState::Allocated(allocation) = &mut range.state else { + return Err(ENOENT); + }; + + let data = allocation.take(); + let debug_id = allocation.reservation.debug_id; + range.state = DescriptorState::Reserved(allocation.reservation.clone()); + Ok((range.size, debug_id, data)) + } + + pub(crate) fn take_for_each<F: Fn(usize, usize, usize, Option<T>)>(&mut self, callback: F) { + for range in self.ranges.iter_mut() { + if let DescriptorState::Allocated(allocation) = &mut range.state { + callback( + range.offset, + range.size, + allocation.reservation.debug_id, + allocation.data.take(), + ); + } + } + } +} + +pub(crate) struct EmptyArrayAlloc<T> { + ranges: KVec<Range<T>>, +} + +impl<T> EmptyArrayAlloc<T> { + pub(crate) fn try_new(capacity: usize) -> Result<Self> { + Ok(Self { + ranges: KVec::with_capacity(capacity, GFP_KERNEL)?, + }) + } +} diff --git a/drivers/android/binder/range_alloc/mod.rs b/drivers/android/binder/range_alloc/mod.rs new file mode 100644 index 000000000000..2301e2bc1a1f --- /dev/null +++ b/drivers/android/binder/range_alloc/mod.rs @@ -0,0 +1,329 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Copyright (C) 2025 Google LLC. + +use kernel::{page::PAGE_SIZE, prelude::*, seq_file::SeqFile, task::Pid}; + +mod tree; +use self::tree::{FromArrayAllocs, ReserveNewTreeAlloc, TreeRangeAllocator}; + +mod array; +use self::array::{ArrayRangeAllocator, EmptyArrayAlloc}; + +enum DescriptorState<T> { + Reserved(Reservation), + Allocated(Allocation<T>), +} + +impl<T> DescriptorState<T> { + fn new(is_oneway: bool, debug_id: usize, pid: Pid) -> Self { + DescriptorState::Reserved(Reservation { + debug_id, + is_oneway, + pid, + }) + } + + fn pid(&self) -> Pid { + match self { + DescriptorState::Reserved(inner) => inner.pid, + DescriptorState::Allocated(inner) => inner.reservation.pid, + } + } + + fn is_oneway(&self) -> bool { + match self { + DescriptorState::Reserved(inner) => inner.is_oneway, + DescriptorState::Allocated(inner) => inner.reservation.is_oneway, + } + } +} + +#[derive(Clone)] +struct Reservation { + debug_id: usize, + is_oneway: bool, + pid: Pid, +} + +impl Reservation { + fn allocate<T>(self, data: Option<T>) -> Allocation<T> { + Allocation { + data, + reservation: self, + } + } +} + +struct Allocation<T> { + reservation: Reservation, + data: Option<T>, +} + +impl<T> Allocation<T> { + fn deallocate(self) -> (Reservation, Option<T>) { + (self.reservation, self.data) + } + + fn debug_id(&self) -> usize { + self.reservation.debug_id + } + + fn take(&mut self) -> Option<T> { + self.data.take() + } +} + +/// The array implementation must switch to the tree if it wants to go beyond this number of +/// ranges. +const TREE_THRESHOLD: usize = 8; + +/// Represents a range of pages that have just become completely free. 
+#[derive(Copy, Clone)] +pub(crate) struct FreedRange { + pub(crate) start_page_idx: usize, + pub(crate) end_page_idx: usize, +} + +impl FreedRange { + fn interior_pages(offset: usize, size: usize) -> FreedRange { + FreedRange { + // Divide round up + start_page_idx: offset.div_ceil(PAGE_SIZE), + // Divide round down + end_page_idx: (offset + size) / PAGE_SIZE, + } + } +} + +struct Range<T> { + offset: usize, + size: usize, + state: DescriptorState<T>, +} + +impl<T> Range<T> { + fn endpoint(&self) -> usize { + self.offset + self.size + } +} + +pub(crate) struct RangeAllocator<T> { + inner: Impl<T>, +} + +enum Impl<T> { + Empty(usize), + Array(ArrayRangeAllocator<T>), + Tree(TreeRangeAllocator<T>), +} + +impl<T> RangeAllocator<T> { + pub(crate) fn new(size: usize) -> Self { + Self { + inner: Impl::Empty(size), + } + } + + pub(crate) fn free_oneway_space(&self) -> usize { + match &self.inner { + Impl::Empty(size) => size / 2, + Impl::Array(array) => array.free_oneway_space(), + Impl::Tree(tree) => tree.free_oneway_space(), + } + } + + pub(crate) fn count_buffers(&self) -> usize { + match &self.inner { + Impl::Empty(_size) => 0, + Impl::Array(array) => array.count_buffers(), + Impl::Tree(tree) => tree.count_buffers(), + } + } + + pub(crate) fn debug_print(&self, m: &SeqFile) -> Result<()> { + match &self.inner { + Impl::Empty(_size) => Ok(()), + Impl::Array(array) => array.debug_print(m), + Impl::Tree(tree) => tree.debug_print(m), + } + } + + /// Try to reserve a new buffer, using the provided allocation if necessary. + pub(crate) fn reserve_new(&mut self, mut args: ReserveNewArgs<T>) -> Result<ReserveNew<T>> { + match &mut self.inner { + Impl::Empty(size) => { + let empty_array = match args.empty_array_alloc.take() { + Some(empty_array) => ArrayRangeAllocator::new(*size, empty_array), + None => { + return Ok(ReserveNew::NeedAlloc(ReserveNewNeedAlloc { + args, + need_empty_array_alloc: true, + need_new_tree_alloc: false, + need_tree_alloc: false, + })) + } + }; + + self.inner = Impl::Array(empty_array); + self.reserve_new(args) + } + Impl::Array(array) if array.is_full() => { + let allocs = match args.new_tree_alloc { + Some(ref mut allocs) => allocs, + None => { + return Ok(ReserveNew::NeedAlloc(ReserveNewNeedAlloc { + args, + need_empty_array_alloc: false, + need_new_tree_alloc: true, + need_tree_alloc: true, + })) + } + }; + + let new_tree = + TreeRangeAllocator::from_array(array.total_size(), &mut array.ranges, allocs); + + self.inner = Impl::Tree(new_tree); + self.reserve_new(args) + } + Impl::Array(array) => { + let offset = + array.reserve_new(args.debug_id, args.size, args.is_oneway, args.pid)?; + Ok(ReserveNew::Success(ReserveNewSuccess { + offset, + oneway_spam_detected: false, + _empty_array_alloc: args.empty_array_alloc, + _new_tree_alloc: args.new_tree_alloc, + _tree_alloc: args.tree_alloc, + })) + } + Impl::Tree(tree) => { + let alloc = match args.tree_alloc { + Some(alloc) => alloc, + None => { + return Ok(ReserveNew::NeedAlloc(ReserveNewNeedAlloc { + args, + need_empty_array_alloc: false, + need_new_tree_alloc: false, + need_tree_alloc: true, + })); + } + }; + let (offset, oneway_spam_detected) = + tree.reserve_new(args.debug_id, args.size, args.is_oneway, args.pid, alloc)?; + Ok(ReserveNew::Success(ReserveNewSuccess { + offset, + oneway_spam_detected, + _empty_array_alloc: args.empty_array_alloc, + _new_tree_alloc: args.new_tree_alloc, + _tree_alloc: None, + })) + } + } + } + + /// Deletes the allocations at `offset`. 
+ pub(crate) fn reservation_abort(&mut self, offset: usize) -> Result<FreedRange> { + match &mut self.inner { + Impl::Empty(_size) => Err(EINVAL), + Impl::Array(array) => array.reservation_abort(offset), + Impl::Tree(tree) => { + let freed_range = tree.reservation_abort(offset)?; + if tree.is_empty() { + self.inner = Impl::Empty(tree.total_size()); + } + Ok(freed_range) + } + } + } + + /// Called when an allocation is no longer in use by the kernel. + /// + /// The value in `data` will be stored, if any. A mutable reference is used to avoid dropping + /// the `T` when an error is returned. + pub(crate) fn reservation_commit(&mut self, offset: usize, data: &mut Option<T>) -> Result { + match &mut self.inner { + Impl::Empty(_size) => Err(EINVAL), + Impl::Array(array) => array.reservation_commit(offset, data), + Impl::Tree(tree) => tree.reservation_commit(offset, data), + } + } + + /// Called when the kernel starts using an allocation. + /// + /// Returns the size of the existing entry and the data associated with it. + pub(crate) fn reserve_existing(&mut self, offset: usize) -> Result<(usize, usize, Option<T>)> { + match &mut self.inner { + Impl::Empty(_size) => Err(EINVAL), + Impl::Array(array) => array.reserve_existing(offset), + Impl::Tree(tree) => tree.reserve_existing(offset), + } + } + + /// Call the provided callback at every allocated region. + /// + /// This destroys the range allocator. Used only during shutdown. + pub(crate) fn take_for_each<F: Fn(usize, usize, usize, Option<T>)>(&mut self, callback: F) { + match &mut self.inner { + Impl::Empty(_size) => {} + Impl::Array(array) => array.take_for_each(callback), + Impl::Tree(tree) => tree.take_for_each(callback), + } + } +} + +/// The arguments for `reserve_new`. +#[derive(Default)] +pub(crate) struct ReserveNewArgs<T> { + pub(crate) size: usize, + pub(crate) is_oneway: bool, + pub(crate) debug_id: usize, + pub(crate) pid: Pid, + pub(crate) empty_array_alloc: Option<EmptyArrayAlloc<T>>, + pub(crate) new_tree_alloc: Option<FromArrayAllocs<T>>, + pub(crate) tree_alloc: Option<ReserveNewTreeAlloc<T>>, +} + +/// The return type of `ReserveNew`. +pub(crate) enum ReserveNew<T> { + Success(ReserveNewSuccess<T>), + NeedAlloc(ReserveNewNeedAlloc<T>), +} + +/// Returned by `reserve_new` when the reservation was successul. +pub(crate) struct ReserveNewSuccess<T> { + pub(crate) offset: usize, + pub(crate) oneway_spam_detected: bool, + + // If the user supplied an allocation that we did not end up using, then we return it here. + // The caller will kfree it outside of the lock. + _empty_array_alloc: Option<EmptyArrayAlloc<T>>, + _new_tree_alloc: Option<FromArrayAllocs<T>>, + _tree_alloc: Option<ReserveNewTreeAlloc<T>>, +} + +/// Returned by `reserve_new` to request the caller to make an allocation before calling the method +/// again. +pub(crate) struct ReserveNewNeedAlloc<T> { + args: ReserveNewArgs<T>, + need_empty_array_alloc: bool, + need_new_tree_alloc: bool, + need_tree_alloc: bool, +} + +impl<T> ReserveNewNeedAlloc<T> { + /// Make the necessary allocations for another call to `reserve_new`. 
+ pub(crate) fn make_alloc(mut self) -> Result<ReserveNewArgs<T>> { + if self.need_empty_array_alloc && self.args.empty_array_alloc.is_none() { + self.args.empty_array_alloc = Some(EmptyArrayAlloc::try_new(TREE_THRESHOLD)?); + } + if self.need_new_tree_alloc && self.args.new_tree_alloc.is_none() { + self.args.new_tree_alloc = Some(FromArrayAllocs::try_new(TREE_THRESHOLD)?); + } + if self.need_tree_alloc && self.args.tree_alloc.is_none() { + self.args.tree_alloc = Some(ReserveNewTreeAlloc::try_new()?); + } + Ok(self.args) + } +} diff --git a/drivers/android/binder/range_alloc/tree.rs b/drivers/android/binder/range_alloc/tree.rs new file mode 100644 index 000000000000..7b1a248fcb02 --- /dev/null +++ b/drivers/android/binder/range_alloc/tree.rs @@ -0,0 +1,488 @@ +// SPDX-License-Identifier: GPL-2.0 + +// Copyright (C) 2025 Google LLC. + +use kernel::{ + page::PAGE_SIZE, + prelude::*, + rbtree::{RBTree, RBTreeNode, RBTreeNodeReservation}, + seq_file::SeqFile, + seq_print, + task::Pid, +}; + +use crate::range_alloc::{DescriptorState, FreedRange, Range}; + +/// Keeps track of allocations in a process' mmap. +/// +/// Each process has an mmap where the data for incoming transactions will be placed. This struct +/// keeps track of allocations made in the mmap. For each allocation, we store a descriptor that +/// has metadata related to the allocation. We also keep track of available free space. +pub(super) struct TreeRangeAllocator<T> { + /// This collection contains descriptors for *both* ranges containing an allocation, *and* free + /// ranges between allocations. The free ranges get merged, so there are never two free ranges + /// next to each other. + tree: RBTree<usize, Descriptor<T>>, + /// Contains an entry for every free range in `self.tree`. This tree sorts the ranges by size, + /// letting us look up the smallest range whose size is at least some lower bound. + free_tree: RBTree<FreeKey, ()>, + size: usize, + free_oneway_space: usize, +} + +impl<T> TreeRangeAllocator<T> { + pub(crate) fn from_array( + size: usize, + ranges: &mut KVec<Range<T>>, + alloc: &mut FromArrayAllocs<T>, + ) -> Self { + let mut tree = TreeRangeAllocator { + tree: RBTree::new(), + free_tree: RBTree::new(), + size, + free_oneway_space: size / 2, + }; + + let mut free_offset = 0; + for range in ranges.drain_all() { + let free_size = range.offset - free_offset; + if free_size > 0 { + let free_node = alloc.free_tree.pop().unwrap(); + tree.free_tree + .insert(free_node.into_node((free_size, free_offset), ())); + let tree_node = alloc.tree.pop().unwrap(); + tree.tree.insert( + tree_node.into_node(free_offset, Descriptor::new(free_offset, free_size)), + ); + } + free_offset = range.endpoint(); + + if range.state.is_oneway() { + tree.free_oneway_space = tree.free_oneway_space.saturating_sub(range.size); + } + + let free_res = alloc.free_tree.pop().unwrap(); + let tree_node = alloc.tree.pop().unwrap(); + let mut desc = Descriptor::new(range.offset, range.size); + desc.state = Some((range.state, free_res)); + tree.tree.insert(tree_node.into_node(range.offset, desc)); + } + + // After the last range, we may need a free range. 
+ if free_offset < size { + let free_size = size - free_offset; + let free_node = alloc.free_tree.pop().unwrap(); + tree.free_tree + .insert(free_node.into_node((free_size, free_offset), ())); + let tree_node = alloc.tree.pop().unwrap(); + tree.tree + .insert(tree_node.into_node(free_offset, Descriptor::new(free_offset, free_size))); + } + + tree + } + + pub(crate) fn is_empty(&self) -> bool { + let mut tree_iter = self.tree.values(); + // There's always at least one range, because index zero is either the start of a free or + // allocated range. + let first_value = tree_iter.next().unwrap(); + if tree_iter.next().is_some() { + // There are never two free ranges next to each other, so if there is more than one + // descriptor, then at least one of them must hold an allocated range. + return false; + } + // There is only one descriptor. Return true if it is for a free range. + first_value.state.is_none() + } + + pub(crate) fn total_size(&self) -> usize { + self.size + } + + pub(crate) fn free_oneway_space(&self) -> usize { + self.free_oneway_space + } + + pub(crate) fn count_buffers(&self) -> usize { + self.tree + .values() + .filter(|desc| desc.state.is_some()) + .count() + } + + pub(crate) fn debug_print(&self, m: &SeqFile) -> Result<()> { + for desc in self.tree.values() { + let state = match &desc.state { + Some(state) => &state.0, + None => continue, + }; + seq_print!( + m, + " buffer: {} size {} pid {}", + desc.offset, + desc.size, + state.pid(), + ); + if state.is_oneway() { + seq_print!(m, " oneway"); + } + match state { + DescriptorState::Reserved(_res) => { + seq_print!(m, " reserved\n"); + } + DescriptorState::Allocated(_alloc) => { + seq_print!(m, " allocated\n"); + } + } + } + Ok(()) + } + + fn find_best_match(&mut self, size: usize) -> Option<&mut Descriptor<T>> { + let free_cursor = self.free_tree.cursor_lower_bound(&(size, 0))?; + let ((_, offset), ()) = free_cursor.current(); + self.tree.get_mut(offset) + } + + /// Try to reserve a new buffer, using the provided allocation if necessary. + pub(crate) fn reserve_new( + &mut self, + debug_id: usize, + size: usize, + is_oneway: bool, + pid: Pid, + alloc: ReserveNewTreeAlloc<T>, + ) -> Result<(usize, bool)> { + // Compute new value of free_oneway_space, which is set only on success. + let new_oneway_space = if is_oneway { + match self.free_oneway_space.checked_sub(size) { + Some(new_oneway_space) => new_oneway_space, + None => return Err(ENOSPC), + } + } else { + self.free_oneway_space + }; + + // Start detecting spammers once we have less than 20% + // of async space left (which is less than 10% of total + // buffer size). + // + // (This will short-circut, so `low_oneway_space` is + // only called when necessary.) 
+ let oneway_spam_detected = + is_oneway && new_oneway_space < self.size / 10 && self.low_oneway_space(pid); + + let (found_size, found_off, tree_node, free_tree_node) = match self.find_best_match(size) { + None => { + pr_warn!("ENOSPC from range_alloc.reserve_new - size: {}", size); + return Err(ENOSPC); + } + Some(desc) => { + let found_size = desc.size; + let found_offset = desc.offset; + + // In case we need to break up the descriptor + let new_desc = Descriptor::new(found_offset + size, found_size - size); + let (tree_node, free_tree_node, desc_node_res) = alloc.initialize(new_desc); + + desc.state = Some(( + DescriptorState::new(is_oneway, debug_id, pid), + desc_node_res, + )); + desc.size = size; + + (found_size, found_offset, tree_node, free_tree_node) + } + }; + self.free_oneway_space = new_oneway_space; + self.free_tree.remove(&(found_size, found_off)); + + if found_size != size { + self.tree.insert(tree_node); + self.free_tree.insert(free_tree_node); + } + + Ok((found_off, oneway_spam_detected)) + } + + pub(crate) fn reservation_abort(&mut self, offset: usize) -> Result<FreedRange> { + let mut cursor = self.tree.cursor_lower_bound(&offset).ok_or_else(|| { + pr_warn!( + "EINVAL from range_alloc.reservation_abort - offset: {}", + offset + ); + EINVAL + })?; + + let (_, desc) = cursor.current_mut(); + + if desc.offset != offset { + pr_warn!( + "EINVAL from range_alloc.reservation_abort - offset: {}", + offset + ); + return Err(EINVAL); + } + + let (reservation, free_node_res) = desc.try_change_state(|state| match state { + Some((DescriptorState::Reserved(reservation), free_node_res)) => { + (None, Ok((reservation, free_node_res))) + } + None => { + pr_warn!( + "EINVAL from range_alloc.reservation_abort - offset: {}", + offset + ); + (None, Err(EINVAL)) + } + allocated => { + pr_warn!( + "EPERM from range_alloc.reservation_abort - offset: {}", + offset + ); + (allocated, Err(EPERM)) + } + })?; + + let mut size = desc.size; + let mut offset = desc.offset; + let free_oneway_space_add = if reservation.is_oneway { size } else { 0 }; + + self.free_oneway_space += free_oneway_space_add; + + let mut freed_range = FreedRange::interior_pages(offset, size); + // Compute how large the next free region needs to be to include one more page in + // the newly freed range. + let add_next_page_needed = match (offset + size) % PAGE_SIZE { + 0 => usize::MAX, + unalign => PAGE_SIZE - unalign, + }; + // Compute how large the previous free region needs to be to include one more page + // in the newly freed range. 
+ let add_prev_page_needed = match offset % PAGE_SIZE { + 0 => usize::MAX, + unalign => unalign, + }; + + // Merge next into current if next is free + let remove_next = match cursor.peek_next() { + Some((_, next)) if next.state.is_none() => { + if next.size >= add_next_page_needed { + freed_range.end_page_idx += 1; + } + self.free_tree.remove(&(next.size, next.offset)); + size += next.size; + true + } + _ => false, + }; + + if remove_next { + let (_, desc) = cursor.current_mut(); + desc.size = size; + cursor.remove_next(); + } + + // Merge current into prev if prev is free + match cursor.peek_prev_mut() { + Some((_, prev)) if prev.state.is_none() => { + if prev.size >= add_prev_page_needed { + freed_range.start_page_idx -= 1; + } + // merge previous with current, remove current + self.free_tree.remove(&(prev.size, prev.offset)); + offset = prev.offset; + size += prev.size; + prev.size = size; + cursor.remove_current(); + } + _ => {} + }; + + self.free_tree + .insert(free_node_res.into_node((size, offset), ())); + + Ok(freed_range) + } + + pub(crate) fn reservation_commit(&mut self, offset: usize, data: &mut Option<T>) -> Result { + let desc = self.tree.get_mut(&offset).ok_or(ENOENT)?; + + desc.try_change_state(|state| match state { + Some((DescriptorState::Reserved(reservation), free_node_res)) => ( + Some(( + DescriptorState::Allocated(reservation.allocate(data.take())), + free_node_res, + )), + Ok(()), + ), + other => (other, Err(ENOENT)), + }) + } + + /// Takes an entry at the given offset from [`DescriptorState::Allocated`] to + /// [`DescriptorState::Reserved`]. + /// + /// Returns the size of the existing entry and the data associated with it. + pub(crate) fn reserve_existing(&mut self, offset: usize) -> Result<(usize, usize, Option<T>)> { + let desc = self.tree.get_mut(&offset).ok_or_else(|| { + pr_warn!( + "ENOENT from range_alloc.reserve_existing - offset: {}", + offset + ); + ENOENT + })?; + + let (debug_id, data) = desc.try_change_state(|state| match state { + Some((DescriptorState::Allocated(allocation), free_node_res)) => { + let (reservation, data) = allocation.deallocate(); + let debug_id = reservation.debug_id; + ( + Some((DescriptorState::Reserved(reservation), free_node_res)), + Ok((debug_id, data)), + ) + } + other => { + pr_warn!( + "ENOENT from range_alloc.reserve_existing - offset: {}", + offset + ); + (other, Err(ENOENT)) + } + })?; + + Ok((desc.size, debug_id, data)) + } + + /// Call the provided callback at every allocated region. + /// + /// This destroys the range allocator. Used only during shutdown. + pub(crate) fn take_for_each<F: Fn(usize, usize, usize, Option<T>)>(&mut self, callback: F) { + for (_, desc) in self.tree.iter_mut() { + if let Some((DescriptorState::Allocated(allocation), _)) = &mut desc.state { + callback( + desc.offset, + desc.size, + allocation.debug_id(), + allocation.take(), + ); + } + } + } + + /// Find the amount and size of buffers allocated by the current caller. + /// + /// The idea is that once we cross the threshold, whoever is responsible + /// for the low async space is likely to try to send another async transaction, + /// and at some point we'll catch them in the act. This is more efficient + /// than keeping a map per pid. 
+ fn low_oneway_space(&self, calling_pid: Pid) -> bool { + let mut total_alloc_size = 0; + let mut num_buffers = 0; + for (_, desc) in self.tree.iter() { + if let Some((state, _)) = &desc.state { + if state.is_oneway() && state.pid() == calling_pid { + total_alloc_size += desc.size; + num_buffers += 1; + } + } + } + + // Warn if this pid has more than 50 transactions, or more than 50% of + // async space (which is 25% of total buffer size). Oneway spam is only + // detected when the threshold is exceeded. + num_buffers > 50 || total_alloc_size > self.size / 4 + } +} + +type TreeDescriptorState<T> = (DescriptorState<T>, FreeNodeRes); +struct Descriptor<T> { + size: usize, + offset: usize, + state: Option<TreeDescriptorState<T>>, +} + +impl<T> Descriptor<T> { + fn new(offset: usize, size: usize) -> Self { + Self { + size, + offset, + state: None, + } + } + + fn try_change_state<F, Data>(&mut self, f: F) -> Result<Data> + where + F: FnOnce(Option<TreeDescriptorState<T>>) -> (Option<TreeDescriptorState<T>>, Result<Data>), + { + let (new_state, result) = f(self.state.take()); + self.state = new_state; + result + } +} + +// (Descriptor.size, Descriptor.offset) +type FreeKey = (usize, usize); +type FreeNodeRes = RBTreeNodeReservation<FreeKey, ()>; + +/// An allocation for use by `reserve_new`. +pub(crate) struct ReserveNewTreeAlloc<T> { + tree_node_res: RBTreeNodeReservation<usize, Descriptor<T>>, + free_tree_node_res: FreeNodeRes, + desc_node_res: FreeNodeRes, +} + +impl<T> ReserveNewTreeAlloc<T> { + pub(crate) fn try_new() -> Result<Self> { + let tree_node_res = RBTreeNodeReservation::new(GFP_KERNEL)?; + let free_tree_node_res = RBTreeNodeReservation::new(GFP_KERNEL)?; + let desc_node_res = RBTreeNodeReservation::new(GFP_KERNEL)?; + Ok(Self { + tree_node_res, + free_tree_node_res, + desc_node_res, + }) + } + + fn initialize( + self, + desc: Descriptor<T>, + ) -> ( + RBTreeNode<usize, Descriptor<T>>, + RBTreeNode<FreeKey, ()>, + FreeNodeRes, + ) { + let size = desc.size; + let offset = desc.offset; + ( + self.tree_node_res.into_node(offset, desc), + self.free_tree_node_res.into_node((size, offset), ()), + self.desc_node_res, + ) + } +} + +/// An allocation for creating a tree from an `ArrayRangeAllocator`. +pub(crate) struct FromArrayAllocs<T> { + tree: KVec<RBTreeNodeReservation<usize, Descriptor<T>>>, + free_tree: KVec<RBTreeNodeReservation<FreeKey, ()>>, +} + +impl<T> FromArrayAllocs<T> { + pub(crate) fn try_new(len: usize) -> Result<Self> { + let num_descriptors = 2 * len + 1; + + let mut tree = KVec::with_capacity(num_descriptors, GFP_KERNEL)?; + for _ in 0..num_descriptors { + tree.push(RBTreeNodeReservation::new(GFP_KERNEL)?, GFP_KERNEL)?; + } + + let mut free_tree = KVec::with_capacity(num_descriptors, GFP_KERNEL)?; + for _ in 0..num_descriptors { + free_tree.push(RBTreeNodeReservation::new(GFP_KERNEL)?, GFP_KERNEL)?; + } + + Ok(Self { tree, free_tree }) + } +} |
