Merge branch 'bpf-nfp-mul-div-support'

Jiong Wang says: ==================== NFP supports u16 and u32 multiplication. Multiplication is done 8-bits per step, therefore we need 2 steps for u16 and 4 steps for u32. We also need one start instruction to initialize the sequence and one or two instructions to fetch the result depending on either you need the high halve of u32 multiplication. For ALU64, if either operand is beyond u32's value range, we reject it. One thing to note, if the source operand is BPF_K, then we need to check "imm" field directly, and we'd reject it if it is negative. Because for ALU64, "imm" (with s32 type) is expected to be sign extended to s64 which NFP mul doesn't support. For ALU32, it is fine for "imm" be negative though, because the result is 32-bits and here is no difference on the low halve of result for signed/unsigned mul, so we will get correct result. NFP doesn't have integer divide instruction, this patch set uses reciprocal algorithm (the basic one, reciprocal_div) to emulate it. For each u32 divide, we would need 11 instructions to finish the operation. 7 (for multiplication) + 4 (various ALUs) = 11 Given NFP only supports multiplication no bigger than u32, we'd require divisor and dividend no bigger than that as well. Also eBPF doesn't support signed divide and has enforced this on C language level by failing compilation. However LLVM assembler hasn't enforced this, so it is possible for negative constant to leak in as a BPF_K operand through assembly code, we reject such cases as well. Meanwhile reciprocal_div.h only implemented the basic version of: "Division by Invariant Integers Using Multiplication" - Torbjörn Granlund and Peter L. Montgomery This patch set further implements the optimized version (Figure 4.2 in the paper) inside existing reciprocal_div.h. When the divider is even and the calculated reciprocal magic number doesn't fit u32, we could reduce the required ALU instructions from 4 to 2 or 1 for some cases. The advanced version requires more complex calculation to get the reciprocal multiplier and other control variables, but then could reduce the required emulation operations. It makes sense to use it for JIT divide code generation (for example eBPF JIT backends) for which we are willing to trade performance of JITed code with that of host. v2: - add warning in l == 32 code path. (Song Liu/Jakub) - jit separate insn sequence for l == 32. (Jakub/Edwin) - should use unrestricted operand for mul. - once place should use "1U << exp" instead of "1 << exp". ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
author: Daniel Borkmann <daniel@iogearbox.net> 2018-07-07 02:45:31 +0300
committer: Daniel Borkmann <daniel@iogearbox.net> 2018-07-07 02:45:32 +0300
commit: d90c936fb3181695d63a1edb155d26fc576b17b4 (patch)
tree: 2e8fd0bfafbfffba139074dfbf7e8f0406caee31 /include/linux
parent: 02000b55850deeadffe433e4b4930a8831f477de (diff)
parent: 9fb410a89e8fa92f8ebc7aa95563442a14da21eb (diff)
download: linux-d90c936fb3181695d63a1edb155d26fc576b17b4.tar.xz
1 files changed, 68 insertions, 0 deletions
diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h
index e031e9f2f9d8..585ce89c0f33 100644
--- a/include/linux/reciprocal_div.h
+++ b/include/linux/reciprocal_div.h
@@ -25,6 +25,9 @@ struct reciprocal_value {
 	u8 sh1, sh2;
 };
 
+/* "reciprocal_value" and "reciprocal_divide" together implement the basic
+ * version of the algorithm described in Figure 4.1 of the paper.
+ */
 struct reciprocal_value reciprocal_value(u32 d);
 
 static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
@@ -33,4 +36,69 @@ static inline u32 reciprocal_divide(u32 a, struct reciprocal_value R)
 	return (t + ((a - t) >> R.sh1)) >> R.sh2;
 }
 
+struct reciprocal_value_adv {
+	u32 m;
+	u8 sh, exp;
+	bool is_wide_m;
+};
+
+/* "reciprocal_value_adv" implements the advanced version of the algorithm
+ * described in Figure 4.2 of the paper except when "divisor > (1U << 31)" whose
+ * ceil(log2(d)) result will be 32 which then requires u128 divide on host. The
+ * exception case could be easily handled before calling "reciprocal_value_adv".
+ *
+ * The advanced version requires more complex calculation to get the reciprocal
+ * multiplier and other control variables, but then could reduce the required
+ * emulation operations.
+ *
+ * It makes no sense to use this advanced version for host divide emulation,
+ * those extra complexities for calculating multiplier etc could completely
+ * waive our saving on emulation operations.
+ *
+ * However, it makes sense to use it for JIT divide code generation for which
+ * we are willing to trade performance of JITed code with that of host. As shown
+ * by the following pseudo code, the required emulation operations could go down
+ * from 6 (the basic version) to 3 or 4.
+ *
+ * To use the result of "reciprocal_value_adv", suppose we want to calculate
+ * n/d, the pseudo C code will be:
+ *
+ *   struct reciprocal_value_adv rvalue;
+ *   u8 pre_shift, exp;
+ *
+ *   // handle exception case.
+ *   if (d >= (1U << 31)) {
+ *     result = n >= d;
+ *     return;
+ *   }
+ *
+ *   rvalue = reciprocal_value_adv(d, 32)
+ *   exp = rvalue.exp;
+ *   if (rvalue.is_wide_m && !(d & 1)) {
+ *     // floor(log2(d & (2^32 -d)))
+ *     pre_shift = fls(d & -d) - 1;
+ *     rvalue = reciprocal_value_adv(d >> pre_shift, 32 - pre_shift);
+ *   } else {
+ *     pre_shift = 0;
+ *   }
+ *
+ *   // code generation starts.
+ *   if (imm == 1U << exp) {
+ *     result = n >> exp;
+ *   } else if (rvalue.is_wide_m) {
+ *     // pre_shift must be zero when reached here.
+ *     t = (n * rvalue.m) >> 32;
+ *     result = n - t;
+ *     result >>= 1;
+ *     result += t;
+ *     result >>= rvalue.sh - 1;
+ *   } else {
+ *     if (pre_shift)
+ *       result = n >> pre_shift;
+ *     result = ((u64)result * rvalue.m) >> 32;
+ *     result >>= rvalue.sh;
+ *   }
+ */
+struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec);
+
 #endif /* _LINUX_RECIPROCAL_DIV_H */
author	Daniel Borkmann <daniel@iogearbox.net>	2018-07-07 02:45:31 +0300
committer	Daniel Borkmann <daniel@iogearbox.net>	2018-07-07 02:45:32 +0300
commit	d90c936fb3181695d63a1edb155d26fc576b17b4 (patch)
tree	2e8fd0bfafbfffba139074dfbf7e8f0406caee31 /include/linux
parent	02000b55850deeadffe433e4b4930a8831f477de (diff)
parent	9fb410a89e8fa92f8ebc7aa95563442a14da21eb (diff)
download	linux-d90c936fb3181695d63a1edb155d26fc576b17b4.tar.xz