From de54b9ac253787c366bbfb28d901a31954eb3511 Mon Sep 17 00:00:00 2001 From: Marcus Gelderie Date: Thu, 6 Aug 2015 15:46:10 -0700 Subject: ipc: modify message queue accounting to not take kernel data structures into account A while back, the message queue implementation in the kernel was improved to use btrees to speed up retrieval of messages, in commit d6629859b36d ("ipc/mqueue: improve performance of send/recv"). That patch introducing the improved kernel handling of message queues (using btrees) has, as a by-product, changed the meaning of the QSIZE field in the pseudo-file created for the queue. Before, this field reflected the size of the user-data in the queue. Since, it also takes kernel data structures into account. For example, if 13 bytes of user data are in the queue, on my machine the file reports a size of 61 bytes. There was some discussion on this topic before (for example https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael Kerrisk gave the following background (https://lkml.org/lkml/2015/6/16/74): The pseudofiles in the mqueue filesystem (usually mounted at /dev/mqueue) expose fields with metadata describing a message queue. One of these fields, QSIZE, as originally implemented, showed the total number of bytes of user data in all messages in the message queue, and this feature was documented from the beginning in the mq_overview(7) page. In 3.5, some other (useful) work happened to break the user-space API in a couple of places, including the value exposed via QSIZE, which now includes a measure of kernel overhead bytes for the queue, a figure that renders QSIZE useless for its original purpose, since there's no way to deduce the number of overhead bytes consumed by the implementation. (The other user-space breakage was subsequently fixed.) This patch removes the accounting of kernel data structures in the queue. Reporting the size of these data-structures in the QSIZE field was a breaking change (see Michael's comment above). Without the QSIZE field reporting the total size of user-data in the queue, there is no way to deduce this number. It should be noted that the resource limit RLIMIT_MSGQUEUE is counted against the worst-case size of the queue (in both the old and the new implementation). Therefore, the kernel overhead accounting in QSIZE is not necessary to help the user understand the limitations RLIMIT imposes on the processes. Signed-off-by: Marcus Gelderie Acked-by: Doug Ledford Acked-by: Michael Kerrisk Acked-by: Davidlohr Bueso Cc: David Howells Cc: Alexander Viro Cc: John Duffy Cc: Arto Bendiken Cc: Manfred Spraul Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/mqueue.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'ipc') diff --git a/ipc/mqueue.c b/ipc/mqueue.c index a24ba9fe5bb8..161a1807e6ef 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -142,7 +142,6 @@ static int msg_insert(struct msg_msg *msg, struct mqueue_inode_info *info) if (!leaf) return -ENOMEM; INIT_LIST_HEAD(&leaf->msg_list); - info->qsize += sizeof(*leaf); } leaf->priority = msg->m_type; rb_link_node(&leaf->rb_node, parent, p); @@ -187,7 +186,6 @@ try_again: "lazy leaf delete!\n"); rb_erase(&leaf->rb_node, &info->msg_tree); if (info->node_cache) { - info->qsize -= sizeof(*leaf); kfree(leaf); } else { info->node_cache = leaf; @@ -200,7 +198,6 @@ try_again: if (list_empty(&leaf->msg_list)) { rb_erase(&leaf->rb_node, &info->msg_tree); if (info->node_cache) { - info->qsize -= sizeof(*leaf); kfree(leaf); } else { info->node_cache = leaf; @@ -1034,7 +1031,6 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr, /* Save our speculative allocation into the cache */ INIT_LIST_HEAD(&new_leaf->msg_list); info->node_cache = new_leaf; - info->qsize += sizeof(*new_leaf); new_leaf = NULL; } else { kfree(new_leaf); @@ -1142,7 +1138,6 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr, /* Save our speculative allocation into the cache */ INIT_LIST_HEAD(&new_leaf->msg_list); info->node_cache = new_leaf; - info->qsize += sizeof(*new_leaf); } else { kfree(new_leaf); } -- cgit v1.2.3 From e1832f2923ec92d0e590e496c8890675457f8568 Mon Sep 17 00:00:00 2001 From: Stephen Smalley Date: Thu, 6 Aug 2015 15:46:55 -0700 Subject: ipc: use private shmem or hugetlbfs inodes for shm segments. The shm implementation internally uses shmem or hugetlbfs inodes for shm segments. As these inodes are never directly exposed to userspace and only accessed through the shm operations which are already hooked by security modules, mark the inodes with the S_PRIVATE flag so that inode security initialization and permission checking is skipped. This was motivated by the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W ------------------------------------------------------- httpd/1597 is trying to acquire lock: (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130 but task is already holding lock: (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++++}: lock_acquire+0xc7/0x270 __might_fault+0x7a/0xa0 filldir+0x9e/0x130 xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs] xfs_readdir+0x1b4/0x330 [xfs] xfs_file_readdir+0x2b/0x30 [xfs] iterate_dir+0x97/0x130 SyS_getdents+0x91/0x120 entry_SYSCALL_64_fastpath+0x12/0x76 -> #2 (&xfs_dir_ilock_class){++++.+}: lock_acquire+0xc7/0x270 down_read_nested+0x57/0xa0 xfs_ilock+0x167/0x350 [xfs] xfs_ilock_attr_map_shared+0x38/0x50 [xfs] xfs_attr_get+0xbd/0x190 [xfs] xfs_xattr_get+0x3d/0x70 [xfs] generic_getxattr+0x4f/0x70 inode_doinit_with_dentry+0x162/0x670 sb_finish_set_opts+0xd9/0x230 selinux_set_mnt_opts+0x35c/0x660 superblock_doinit+0x77/0xf0 delayed_superblock_init+0x10/0x20 iterate_supers+0xb3/0x110 selinux_complete_init+0x2f/0x40 security_load_policy+0x103/0x600 sel_write_load+0xc1/0x750 __vfs_write+0x37/0x100 vfs_write+0xa9/0x1a0 SyS_write+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 ... Signed-off-by: Stephen Smalley Reported-by: Morten Stevens Acked-by: Hugh Dickins Acked-by: Paul Moore Cc: Manfred Spraul Cc: Davidlohr Bueso Cc: Prarit Bhargava Cc: Eric Paris Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/hugetlbfs/inode.c | 2 ++ ipc/shm.c | 2 +- mm/shmem.c | 4 ++-- 3 files changed, 5 insertions(+), 3 deletions(-) (limited to 'ipc') diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 0cf74df68617..973c24ce59ad 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1010,6 +1010,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size, inode = hugetlbfs_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0); if (!inode) goto out_dentry; + if (creat_flags == HUGETLB_SHMFS_INODE) + inode->i_flags |= S_PRIVATE; file = ERR_PTR(-ENOMEM); if (hugetlb_reserve_pages(inode, 0, diff --git a/ipc/shm.c b/ipc/shm.c index 06e5cf2fe019..4aef24d91b63 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -545,7 +545,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) if ((shmflg & SHM_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER) acctflag = VM_NORESERVE; - file = shmem_file_setup(name, size, acctflag); + file = shmem_kernel_file_setup(name, size, acctflag); } error = PTR_ERR(file); if (IS_ERR(file)) diff --git a/mm/shmem.c b/mm/shmem.c index 4caf8ed24d65..dbe0c1e8349c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3363,8 +3363,8 @@ put_path: * shmem_kernel_file_setup - get an unlinked file living in tmpfs which must be * kernel internal. There will be NO LSM permission checks against the * underlying inode. So users of this interface must do LSM checks at a - * higher layer. The one user is the big_key implementation. LSM checks - * are provided at the key level rather than the inode level. + * higher layer. The users are the big_key and shm implementations. LSM + * checks are provided at the key or shm level rather than the inode. * @name: name for dentry (to be seen in /proc//maps * @size: size to be set for the file * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size -- cgit v1.2.3