summaryrefslogtreecommitdiff
path: root/lib/hweight.c
AgeCommit message (Collapse)AuthorFilesLines
2016-06-08x86/hweight: Get rid of the special calling conventionBorislav Petkov1-0/+4
People complained about ARCH_HWEIGHT_CFLAGS and how it throws a wrench into kcov, lto, etc, experimentations. Add asm versions for __sw_hweight{32,64}() and do explicit saving and restoring of clobbered registers. This gets rid of the special calling convention. We get to call those functions on !X86_FEATURE_POPCNT CPUs. We still need to hardcode POPCNT and register operands as some old gas versions which we support, do not know about POPCNT. Btw, remove redundant REX prefix from 32-bit POPCNT because alternatives can do padding now. Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1464605787-20603-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-13Make ARCH_HAS_FAST_MULTIPLIER a real config variableLinus Torvalds1-2/+2
It used to be an ad-hoc hack defined by the x86 version of <asm/bitops.h> that enabled a couple of library routines to know whether an integer multiply is faster than repeated shifts and additions. This just makes it use the real Kconfig system instead, and makes x86 (which was the only architecture that did this) select the option. NOTE! Even for x86, this really is kind of wrong. If we cared, we would probably not enable this for builds optimized for netburst (P4), where shifts-and-adds are generally faster than multiplies. This patch does *not* change that kind of logic, though, it is purely a syntactic change with no code changes. This was triggered by the fact that we have other places that really want to know "do I want to expand multiples by constants by hand or not", particularly the hash generation code. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-08lib: reduce the use of module.h wherever possiblePaul Gortmaker1-1/+1
For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2010-04-07x86: Add optimized popcnt variantsBorislav Petkov1-10/+10
Add support for the hardware version of the Hamming weight function, popcnt, present in CPUs which advertize it under CPUID, Function 0x0000_0001_ECX[23]. On CPUs which don't support it, we fallback to the default lib/hweight.c sw versions. A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost a 3x speedup on a F10h machine. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> LKML-Reference: <20100318112015.GC11152@aftab> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-04-07bitops: Optimize hweight() by making use of compile-time evaluationPeter Zijlstra1-9/+10
Rename the extisting runtime hweight() implementations to __arch_hweight(), rename the compile-time versions to __const_hweight() and then have hweight() pick between them. Suggested-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20100318111929.GB11152@aftab> Acked-by: H. Peter Anvin <hpa@zytor.com> LKML-Reference: <1265028224.24455.154.camel@laptop> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2009-12-28x86, core: Optimize hweight32()Akinobu Mita1-0/+7
Optimize hweight32 by using the same technique in hweight64. The proof of this technique can be found in the commit log for f9b4192923fa6e38331e88214b1fe5fc21583fcc ("bitops: hweight() speedup"). The userspace benchmark on x86_32 showed 20% speedup with bitmap_weight() which uses hweight32 to count bits for each unsigned long on 32bit architectures. int main(void) { #define SZ (1024 * 1024 * 512) static DECLARE_BITMAP(bitmap, SZ) = { [0 ... 100] = 1, }; return bitmap_weight(bitmap, SZ); } Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <1258603932-4590-1-git-send-email-akinobu.mita@gmail.com> [ only x86 sets ARCH_HAS_FAST_MULTIPLIER so we do this via the x86 tree] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-19remove asm/bitops.h includesJiri Slaby1-1/+1
remove asm/bitops.h includes including asm/bitops directly may cause compile errors. don't include it and include linux/bitops instead. next patch will deny including asm header directly. Cc: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-09-26[PATCH] optimize hweight64 for x86_64Andi Kleen1-2/+8
Based on patch from David Rientjes <rientjes@google.com>, but changed by AK. Optimizes the 64-bit hamming weight for x86_64 processors assuming they have fast multiplication. Uses five fewer bitops than the generic hweight64. Benchmark on one EMT64 showed ~25% speedup with 2^24 consecutive calls. Define a new ARCH_HAS_FAST_MULTIPLIER that can be set by other architectures that can also multiply fast. Signed-off-by: Andi Kleen <ak@suse.de>
2006-03-26[PATCH] bitops: hweight() speedupAkinobu Mita1-15/+14
<linux@horizon.com> wrote: This is an extremely well-known technique. You can see a similar version that uses a multiply for the last few steps at http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel whch refers to "Software Optimization Guide for AMD Athlon 64 and Opteron Processors" http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF It's section 8.6, "Efficient Implementation of Population-Count Function in 32-bit Mode", pages 179-180. It uses the name that I am more familiar with, "popcount" (population count), although "Hamming weight" also makes sense. Anyway, the proof of correctness proceeds as follows: b = a - ((a >> 1) & 0x55555555); c = (b & 0x33333333) + ((b >> 2) & 0x33333333); d = (c + (c >> 4)) & 0x0f0f0f0f; #if SLOW_MULTIPLY e = d + (d >> 8) f = e + (e >> 16); return f & 63; #else /* Useful if multiply takes at most 4 cycles */ return (d * 0x01010101) >> 24; #endif The input value a can be thought of as 32 1-bit fields each holding their own hamming weight. Now look at it as 16 2-bit fields. Each 2-bit field a1..a0 has the value 2*a1 + a0. This can be converted into the hamming weight of the 2-bit field a1+a0 by subtracting a1. That's what the (a >> 1) & mask subtraction does. Since there can be no borrows, you can just do it all at once. Enumerating the 4 possible cases: 0b00 = 0 -> 0 - 0 = 0 0b01 = 1 -> 1 - 0 = 1 0b10 = 2 -> 2 - 1 = 1 0b11 = 3 -> 3 - 1 = 2 The next step consists of breaking up b (made of 16 2-bir fields) into even and odd halves and adding them into 4-bit fields. Since the largest possible sum is 2+2 = 4, which will not fit into a 4-bit field, the 2-bit ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ "which will not fit into a 2-bit field" fields have to be masked before they are added. After this point, the masking can be delayed. Each 4-bit field holds a population count from 0..4, taking at most 3 bits. These numbers can be added without overflowing a 4-bit field, so we can compute c + (c >> 4), and only then mask off the unwanted bits. This produces d, a number of 4 8-bit fields, each in the range 0..8. From this point, we can shift and add d multiple times without overflowing an 8-bit field, and only do a final mask at the end. The number to mask with has to be at least 63 (so that 32 on't be truncated), but can also be 128 or 255. The x86 has a special encoding for signed immediate byte values -128..127, so the value of 255 is slower. On other processors, a special "sign extend byte" instruction might be faster. On a processor with fast integer multiplies (Athlon but not P4), you can reduce the final few serially dependent instructions to a single integer multiply. Consider d to be 3 8-bit values d3, d2, d1 and d0, each in the range 0..8. The multiply forms the partial products: d3 d2 d1 d0 d3 d2 d1 d0 d3 d2 d1 d0 + d3 d2 d1 d0 ---------------------- e3 e2 e1 e0 Where e3 = d3 + d2 + d1 + d0. e2, e1 and e0 obviously cannot generate any carries. Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-26[PATCH] bitops: generic hweight{64,32,16,8}()Akinobu Mita1-0/+54
This patch introduces the C-language equivalents of the functions below: unsigned int hweight32(unsigned int w); unsigned int hweight16(unsigned int w); unsigned int hweight8(unsigned int w); unsigned long hweight64(__u64 w); In include/asm-generic/bitops/hweight.h This code largely copied from: include/linux/bitops.h Signed-off-by: Akinobu Mita <mita@miraclelinux.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>