kernel/linux.git - Linux kernel stable tree (mirror)

Age	Commit message (Collapse)	Author	Files	Lines
2008-11-01	mac80211/drivers: rewrite the rate control API	Johannes Berg	12	-310/+306
	So after the previous changes we were still unhappy with how convoluted the API is and decided to make things simpler for everybody. This completely changes the rate control API, now taking into account 802.11n with MCS rates and more control, most drivers don't support that though. Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: add might_sleep to hw_config	Johannes Berg	1	-0/+2
	Just to catch bugs when changing mac80211. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: rewrite HT handling	Johannes Berg	4	-153/+134
	The HT handling has the following deficiencies, which I've (partially) fixed: * it always uses the AP info even if there is no AP, hence has no chance of working as an AP * it pretends to be HW config, but really is per-BSS * channel sanity checking is left to the drivers * it generally lets the driver control too much HT enabling is still wrong with this patch if you have more than one virtual STA mode interface, but that never happens currently. Once WDS, IBSS or AP/VLAN gets HT capabilities, it will also be wrong, see the comment in ieee80211_enable_ht(). Additionally, this fixes a number of bugs: * mac80211: ieee80211_set_disassoc doesn't notify the driver any more since the refactoring * iwl-agn-rs: always uses the HT capabilities from the wrong stuff mac80211 gives it rather than the actual peer STA * ath9k: a number of bugs resulting from the broken HT API I'm not entirely happy with putting the HT capabilities into struct ieee80211_sta as restricted to our own HT TX capabilities, but I see no cleaner solution for now. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: move bss_conf into vif	Johannes Berg	8	-38/+35
	Move bss_conf into the vif struct so that drivers can access it during ->tx without having to store it in the private data or similar. No driver updates because this is only for when they want to start using it. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: make retry limits part of hw config	Johannes Berg	5	-24/+18
	Instead of having a separate callback, use the HW config callback with a new flag to change retry limits. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	nl80211: export HT capabilities	Johannes Berg	1	-0/+13
	This exports the local HT capabilities in nl80211. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: provide sequence numbers	Johannes Berg	2	-0/+12
	I've come to think that not providing sequence numbers for the normal STA mode case was a mistake, at least two drivers now had to implement code they wouldn't otherwise need, and I believe at76_usb and adm8211 might be broken. This patch makes mac80211 assign a sequence number to all those frames that need one except beacons. That means that if a driver only implements modes that do not do beaconing it need not worry about the sequence number. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	rfkill: rate-limit rfkill-input workqueue usage (v3)	Henrique de Moraes Holschuh	1	-8/+41
	Limit the number of "expensive" rfkill workqueue operations per second, in order to not hog system resources too much when faced with a rogue source of rfkill input events. The old rfkill-input code (before it was refactored) had such a limit in place. It used to drop new events that were past the rate limit. This behaviour was not implemented as an anti-DoS measure, but rather as an attempt to work around deficiencies in input device drivers which would issue multiple KEY_FOO events too soon for a given key FOO (i.e. ones that do not implement mechanical debouncing properly). However, we can't really expect such issues to be worked around by every input handler out there, and also by every userspace client of input devices. It is the input device driver's responsability to do debouncing instead of spamming the input layer with bogus events. The new limiter code is focused only on anti-DoS behaviour, and tries to not lose events (instead, it coalesces them when possible). The transmitters are updated once every 200ms, maximum. Care is taken not to delay a request to _enter_ rfkill transmitter Emergency Power Off (EPO) mode. If mistriggered (e.g. by a jiffies counter wrap), the code delays processing once by 200ms. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ivo van Doorn <IvDoorn@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	rfkill: honour EPO state when resuming a rfkill controller	Henrique de Moraes Holschuh	1	-2/+11
	rfkill_resume() would always restore the rfkill controller state to its pre-suspend state. Now that we know when we are under EPO, kick the rfkill controller to SOFT_BLOCKED state instead of to its pre-suspend state when it is resumed while EPO mode is active. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	rfkill: add master_switch_mode and EPO lock to rfkill and rfkill-input	Henrique de Moraes Holschuh	3	-80/+274
	Add of software-based sanity to rfkill and rfkill-input so that it can reproduce what hardware-based EPO switches do, blocking all transmitters and locking down any further attempts to unblock them until the switch is deactivated. rfkill-input is responsible for issuing the EPO control requests, like before. While an rfkill EPO is active, all transmitters are locked to one of the BLOCKED states and all attempts to change that through the rfkill API (userspace and kernel) will be either ignored or return -EPERM errors. The lock will be released upon receipt of EV_SW SW_RFKILL_ALL ON by rfkill-input, or should modular rfkill-input be unloaded. This makes rfkill and rfkill-input extend the operation of an existing wireless master kill switch to all wireless devices in the system, even those that are not under hardware or firmware control. Since the above is the expected operational behavior for the master rfkill switch, the EPO lock functionality is not optional. Also, extend rfkill-input to allow for three different behaviors when it receives an EV_SW SW_RFKILL_ALL ON input event. The user can set which behavior he wants through the master_switch_mode parameter: master_switch_mode = 0: EV_SW SW_RFKILL_ALL ON just unlocks rfkill controller state changes (so that the rfkill userspace and kernel APIs can now be used to change rfkill controller states again), but doesn't change any of their states (so they will all remain blocked). This is the safest mode of operation, as it requires explicit operator action to re-enable a transmitter. master_switch_mode = 1: EV_SW SW_RFKILL_ALL ON causes rfkill-input to attempt to restore the system to the state before the last EV_SW SW_RFKILL_ALL OFF event, or to the default global states if no EV_SW SW_RFKILL_ALL OFF ever happened. This is the recommended mode of operation for laptops. master_switch_mode = 2: tries to unblock all rfkill controllers (i.e. enable all transmitters) when an EV_SW SW_RFKILL_ALL ON event is received. This is the default mode of operation, as it mimics the previous behavior of rfkill-input. In order to implement these features in a clean way, the entire event handling of rfkill-input was refactored into a single worker function. Protection against input event DoS (repeatedly firing rfkill events for rfkill-input to process) was removed during the code refactoring. It will be added back in a future patch. Note that with these changes, rfkill-input doesn't need to explicitly handle any radio types for which KEY_<radio type> or SW_<radio type> events do not exist yet. Code to handle EV_SW SW_{WLAN,WWAN,BLUETOOTH,WIMAX,...} was added as it might be needed in the future (and its implementation is not that obvious), but is currently #ifdef'd out to avoid wasting resources. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ivo van Doorn <IvDoorn@gmail.com> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	rfkill: export global states to rfkill-input	Henrique de Moraes Holschuh	2	-0/+15
	Export the the global switch states to rfkill-input. This is needed to properly implement KEY_* handling without disregarding the initial state. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	rfkill: use killable locks instead of interruptible	Henrique de Moraes Holschuh	1	-3/+4
	Apparently, many applications don't expect to get EAGAIN from fd read/write operations, since POSIX doesn't mandate it. Use mutex_lock_killable instead of mutex_lock_interruptible, which won't cause issues. Signed-off-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ivo van Doorn <IvDoorn@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: introduce hw config change flags	Johannes Berg	7	-32/+56
	This makes mac80211 notify the driver which configuration actually changed, e.g. channel etc. No driver changes, this is just plumbing, driver authors are expected to act on this if they want to. Also remove the HW CONFIG debug printk, it's incorrect, often we configure something else. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: kill hw.conf.antenna_sel_{rx,tx}	Johannes Berg	3	-11/+0
	Never actually used. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	802.11: clean up/fix HT support	Johannes Berg	7	-154/+167
	This patch cleans up a number of things: * the unusable definition of the HT capabilities/HT information information elements * variable names that are hard to understand * mac80211: move ieee80211_handle_ht to ht.c and remove the unused enable_ht parameter * mac80211: fix bug with MCS rate 32 in ieee80211_handle_ht * mac80211: fix bug with casting the result of ieee80211_bss_get_ie to an information element _contents_ rather than the whole element, add size checking (another out-of-bounds access bug fixed!) * mac80211: remove some unused return values in favour of BUG_ON checking * a few minor other things Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: remove unused declaration of struct sta_attribute.	Rami Rosen	1	-5/+0
	This patch removes unused definition of struct sta_attribute in net/mac80211/ieee80211_i.h. Signed-off-by: Rami Rosen <ramirose@gmail.com> Acked-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: fix short slot handling	Johannes Berg	2	-37/+46
	This patch makes mac80211 handle short slot requests from the AP properly. Also warn about uses of IEEE80211_CONF_SHORT_SLOT_TIME and optimise out the code since it cannot ever be hit anyway. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: remove max_antenna_gain config	Johannes Berg	1	-2/+0
	The antenna gain isn't exactly configurable, despite the belief of some unnamed individual who thinks that the EEPROM might influence it. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: clean up ieee80211_hw_config errors	Johannes Berg	5	-25/+13
	Warn when ieee80211_hw_config returns an error, it shouldn't happen; remove a number of printks that would happen in such a case and one printk that is user-triggerable. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: remove wiphy_to_hw	Johannes Berg	1	-7/+0
	This isn't used by anyone, if we ever need it we can add it back, until then it's useless. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: minor code cleanups	Johannes Berg	9	-52/+45
	Nothing very interesting, some checkpatch inspired stuff, some other things. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: remove writable debugs mesh parameters	Johannes Berg	1	-88/+24
	These parameters shouldn't be configurable via debugfs, if they need to be configurable nl80211 support has to be added, if not then they don't need to be writable here either. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Cc: Javier Cardona <javier@cozybit.com> Cc: Luis Carlos Cobo <luisca@cozybit.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-11-01	mac80211: remove aggregation status write support from debugfs	Johannes Berg	1	-72/+1
	This code uses static variables and thus cannot be kept. Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
2008-10-31	net: replace NIPQUAD() in net/*/	Harvey Harrison	26	-150/+127
	Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	net: replace NIPQUAD() in net/netfilter/	Harvey Harrison	16	-108/+78
	Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	net: replace NIPQUAD() in net/ipv4/ net/ipv6/	Harvey Harrison	15	-103/+78
	Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	net: replace NIPQUAD() in net/ipv4/netfilter/	Harvey Harrison	10	-85/+69
	Using NIPQUAD() with NIPQUAD_FMT, %d.%d.%d.%d or %u.%u.%u.%u can be replaced with %pI4 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	pkt_sched: Add peek emulation for non-work-conserving qdiscs.	Jarek Poplawski	6	-6/+11
	This patch adds qdisc_peek_dequeued() wrapper to emulate peek method with qdisc->dequeue() and storing "peeked" skb in qdisc->gso_skb until dequeuing. This is mainly for compatibility reasons not to break some strange configs because peeking is expected for non-work-conserving parent qdiscs to query work-conserving child qdiscs. This implementation requires using qdisc_dequeue_peeked() wrapper instead of directly calling qdisc->dequeue() for all qdiscs ever querried with qdisc->ops->peek() or qdisc_peek_dequeued(). Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	pkt_sched: Use qdisc->ops->peek() instead of ->dequeue() & ->requeue()	Jarek Poplawski	4	-29/+19
	Use qdisc->ops->peek() instead of ->dequeue() & ->requeue() pair. After this patch the only remaining user of qdisc->ops->requeue() is netem_enqueue(). Based on ideas of Herbert Xu, Patrick McHardy and David S. Miller. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	pkt_sched: Add qdisc->ops->peek() implementation.	Jarek Poplawski	8	-0/+69
	Add qdisc->ops->peek() implementation for work-conserving qdiscs. With feedback from Patrick McHardy. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	pkt_sched: sch_generic: Add generic qdisc->ops->peek() implementation.	Jarek Poplawski	2	-0/+28
	With feedback from Patrick McHardy. Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	pkt_sched: Add ->peek() methods for fifo, prio and SFQ qdiscs.	Patrick McHardy	3	-0/+28
	From: Patrick McHardy <kaber@trash.net> Just as a demonstration how easy adding a peek operation to the work-conserving qdiscs actually is. It doesn't need to keep or change any internal state in many cases thanks to the guarantee that the packet will either be dequeued or, if another packet arrives, the upper qdisc will immediately ->peek again to reevaluate the state. (This is only slightly modified Patrick's patch.) Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	xfrm: C99 for xfrm_dev_notifier	Alexey Dobriyan	1	-3/+1
	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	Merge branch 'master' of ↵	David S. Miller	12	-41/+144
	master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/p54/p54common.c
2008-10-31	xfrm: do not leak ESRCH to user space	fernando@oss.ntt.co	1	-0/+2
	I noticed that, under certain conditions, ESRCH can be leaked from the xfrm layer to user space through sys_connect. In particular, this seems to happen reliably when the kernel fails to resolve a template either because the AF_KEY receive buffer being used by racoon is full or because the SA entry we are trying to use is in XFRM_STATE_EXPIRED state. However, since this could be a transient issue it could be argued that EAGAIN would be more appropriate. Besides this error code is not even documented in the man page for sys_connect (as of man-pages 3.07). Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	Merge branch 'master' of git://git.infradead.org/users/pcmoore/lblnet-2.6	David S. Miller	4	-4/+29

2008-10-31	netfilter: nf_conntrack_proto_gre: switch to register_pernet_gen_subsys()	Alexey Dobriyan	1	-2/+2
	register_pernet_gen_device() can't be used is nf_conntrack_pptp module is also used (compiled in or loaded). Right now, proto_gre_net_exit() is called before nf_conntrack_pptp_net_exit(). The former shutdowns and frees GRE piece of netns, however the latter absolutely needs it to flush keymap. Oops is inevitable. Switch to shiny new register_pernet_gen_subsys() to get correct ordering in netns ops list. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	netns: add register_pernet_gen_subsys/unregister_pernet_gen_subsys	Alexey Dobriyan	1	-0/+32
	netns ops which are registered with register_pernet_gen_device() are shutdown strictly before those which are registered with register_pernet_subsys(). Sometimes this leads to opposite (read: buggy) shutdown ordering between two modules. Add register_pernet_gen_subsys()/unregister_pernet_gen_subsys() for modules which aren't elite enough for entry in struct net, and which can't use register_pernet_gen_device(). PPTP conntracking module is such one. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-31	udp: Should use spin_lock_bh()/spin_unlock_bh() in udp_lib_unhash()	Eric Dumazet	1	-2/+2
	Spotted by Alexander Beregalov Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-30	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6	Linus Torvalds	2	-18/+58
	* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: SUNRPC: Fix potential race in put_rpccred() SUNRPC: Fix rpcauth_prune_expired NFS: Convert nfs_attr_generation_counter into an atomic_long SUNRPC: Respond promptly to server TCP resets
2008-10-30	netlabel: Fix compilation warnings in net/netlabel/netlabel_addrlist.c	Manish Katiyar	2	-0/+24
	Enable netlabel auditing functions only when CONFIG_AUDIT is set Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-10-29	netlabel: Fix compiler warnings in netlabel_mgmt.c	Paul Moore	1	-1/+1
	Fix the compiler warnings below, thanks to Andrew Morton for finding them. net/netlabel/netlabel_mgmt.c: In function `netlbl_mgmt_listentry': net/netlabel/netlabel_mgmt.c:268: warning: 'ret_val' might be used uninitialized in this function Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-10-29	cipso: unsigned buf_len cannot be negative	roel kluin	1	-3/+4
	unsigned buf_len cannot be negative Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Paul Moore <paul.moore@hp.com>
2008-10-29	net: replace %p6 with %pI6	Harvey Harrison	33	-60/+60
	Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	net: replace %#p6 format specifier with %pi6	Harvey Harrison	6	-9/+9
	gcc warns when using the # modifier with the %p format specifier, so we can't use this to omit the colons when needed, introduces %pi6 instead. Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	udp: introduce sk_for_each_rcu_safenext()	Eric Dumazet	2	-4/+4
	Corey Minyard found a race added in commit 271b72c7fa82c2c7a795bc16896149933110672d (udp: RCU handling for Unicast packets.) "If the socket is moved from one list to another list in-between the time the hash is calculated and the next field is accessed, and the socket has moved to the end of the new list, the traversal will not complete properly on the list it should have, since the socket will be on the end of the new list and there's not a way to tell it's on a new list and restart the list traversal. I think that this can be solved by pre-fetching the "next" field (with proper barriers) before checking the hash." This patch corrects this problem, introducing a new sk_for_each_rcu_safenext() macro. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	udp: udp_get_next() should use spin_unlock_bh()	Eric Dumazet	1	-1/+1
	Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	udp: calculate udp_mem based on low memory instead of all memory	Eric Dumazet	1	-3/+6
	This patch mimics commit 57413ebc4e0f1e471a3b4db4aff9a85c083d090e (tcp: calculate tcp_mem based on low memory instead of all memory) The udp_mem array which contains limits on the total amount of memory used by UDP sockets is calculated based on nr_all_pages. On a 32 bits x86 system, we should base this on the number of lowmem pages. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	udp: RCU handling for Unicast packets.	Eric Dumazet	5	-17/+54
	Goals are : 1) Optimizing handling of incoming Unicast UDP frames, so that no memory writes should happen in the fast path. Note: Multicasts and broadcasts still will need to take a lock, because doing a full lockless lookup in this case is difficult. 2) No expensive operations in the socket bind/unhash phases : - No expensive synchronize_rcu() calls. - No added rcu_head in socket structure, increasing memory needs, but more important, forcing us to use call_rcu() calls, that have the bad property of making sockets structure cold. (rcu grace period between socket freeing and its potential reuse make this socket being cold in CPU cache). David did a previous patch using call_rcu() and noticed a 20% impact on TCP connection rates. Quoting Cristopher Lameter : "Right. That results in cacheline cooldown. You'd want to recycle the object as they are cache hot on a per cpu basis. That is screwed up by the delayed regular rcu processing. We have seen multiple regressions due to cacheline cooldown. The only choice in cacheline hot sensitive areas is to deal with the complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU." - Because udp sockets are allocated from dedicated kmem_cache, use of SLAB_DESTROY_BY_RCU can help here. Theory of operation : --------------------- As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()), special attention must be taken by readers and writers. Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed, reused, inserted in a different chain or in worst case in the same chain while readers could do lookups in the same time. In order to avoid loops, a reader must check each socket found in a chain really belongs to the chain the reader was traversing. If it finds a mismatch, lookup must start again at the begining. This restart loop is the reason we had to use rdlock for the multicast case, because we dont want to send same message several times to the same socket. We use RCU only for fast path. Thus, /proc/net/udp still takes spinlocks. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-29	udp: introduce struct udp_table and multiple spinlocks	Eric Dumazet	6	-148/+200
	UDP sockets are hashed in a 128 slots hash table. This hash table is protected by one rwlock. This rwlock is readlocked each time an incoming UDP message is handled. This rwlock is writelocked each time a socket must be inserted in hash table (bind time), or deleted from this table (close time) This is not scalable on SMP machines : 1) Even in read mode, lock() and unlock() are atomic operations and must dirty a contended cache line, shared by all cpus. 2) A writer might be starved if many readers are 'in flight'. This can happen on a machine with some NIC receiving many UDP messages. User process can be delayed a long time at socket creation/dismantle time. This patch prepares RCU migration, by introducing 'struct udp_table and struct udp_hslot', and using one spinlock per chain, to reduce contention on central rwlock. Introducing one spinlock per chain reduces latencies, for port randomization on heavily loaded UDP servers. This also speedup bindings to specific ports. udp_lib_unhash() was uninlined, becoming to big. Some cleanups were done to ease review of following patch (RCUification of UDP Unicast lookups) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>