1 files changed, 429 insertions, 0 deletions
diff --git a/Documentation/sparc/oradax/oracle-dax.txt b/Documentation/sparc/oradax/oracle-dax.txt
new file mode 100644
index 000000000000..9d53ac93286f
--- /dev/null
+++ b/Documentation/sparc/oradax/oracle-dax.txt
@@ -0,0 +1,429 @@
+Oracle Data Analytics Accelerator (DAX)
+---------------------------------------
+
+DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
+(DAX2) processor chips, and has direct access to the CPU's L3 caches
+as well as physical memory. It can perform several operations on data
+streams with various input and output formats.  A driver provides a
+transport mechanism and has limited knowledge of the various opcodes
+and data formats. A user space library provides high level services
+and translates these into low level commands which are then passed
+into the driver and subsequently the Hypervisor and the coprocessor.
+The library is the recommended way for applications to use the
+coprocessor, and the driver interface is not intended for general use.
+This document describes the general flow of the driver, its
+structures, and its programmatic interface. It also provides example
+code sufficient to write user or kernel applications that use DAX
+functionality.
+
+The user library is open source and available at:
+    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
+
+The Hypervisor interface to the coprocessor is described in detail in
+the accompanying document, dax-hv-api.txt, which is a plain text
+excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
+Specification" version 3.0.20+15, dated 2017-09-25.
+
+
+High Level Overview
+-------------------
+
+A coprocessor request is described by a Command Control Block
+(CCB). The CCB contains an opcode and various parameters. The opcode
+specifies what operation is to be done, and the parameters specify
+options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
+is passed to the Hypervisor, which handles queueing and scheduling of
+requests to the available coprocessor execution units. A status code
+returned indicates if the request was submitted successfully or if
+there was an error.  One of the addresses given in each CCB is a
+pointer to a "completion area", which is a 128 byte memory block that
+is written by the coprocessor to provide execution status. No
+interrupt is generated upon completion; the completion area must be
+polled by software to find out when a transaction has finished, but
+the M7 and later processors provide a mechanism to pause the virtual
+processor until the completion status has been updated by the
+coprocessor. This is done using the monitored load and mwait
+instructions, which are described in more detail later.  The DAX
+coprocessor was designed so that after a request is submitted, the
+kernel is no longer involved in the processing of it.  The polling is
+done at the user level, which results in almost zero latency between
+completion of a request and resumption of execution of the requesting
+thread.
+
+
+Addressing Memory
+-----------------
+
+The kernel does not have access to physical memory in the Sun4v
+architecture, as there is an additional level of memory virtualization
+present. This intermediate level is called "real" memory, and the
+kernel treats this as if it were physical.  The Hypervisor handles the
+translations between real memory and physical so that each logical
+domain (LDOM) can have a partition of physical memory that is isolated
+from that of other LDOMs.  When the kernel sets up a virtual mapping,
+it specifies a virtual address and the real address to which it should
+be mapped.
+
+The DAX coprocessor can only operate on physical memory, so before a
+request can be fed to the coprocessor, all the addresses in a CCB must
+be converted into physical addresses. The kernel cannot do this since
+it has no visibility into physical addresses. So a CCB may contain
+either the virtual or real addresses of the buffers or a combination
+of them. An "address type" field is available for each address that
+may be given in the CCB. In all cases, the Hypervisor will translate
+all the addresses to physical before dispatching to hardware. Address
+translations are performed using the context of the process initiating
+the request.
+
+
+The Driver API
+--------------
+
+An application makes requests to the driver via the write() system
+call, and gets results (if any) via read(). The completion areas are
+made accessible via mmap(), and are read-only for the application.
+
+The request may either be an immediate command or an array of CCBs to
+be submitted to the hardware.
+
+Each open instance of the device is exclusive to the thread that
+opened it, and must be used by that thread for all subsequent
+operations. The driver open function creates a new context for the
+thread and initializes it for use.  This context contains pointers and
+values used internally by the driver to keep track of submitted
+requests. The completion area buffer is also allocated, and this is
+large enough to contain the completion areas for many concurrent
+requests.  When the device is closed, any outstanding transactions are
+flushed and the context is cleaned up.
+
+On a DAX1 system (M7), the device will be called "oradax1", while on a
+DAX2 system (M8) it will be "oradax2". If an application requires one
+or the other, it should simply attempt to open the appropriate
+device. Only one of the devices will exist on any given system, so the
+name can be used to determine what the platform supports.
+
+The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
+all of these, success is indicated by a return value from write()
+equal to the number of bytes given in the call. Otherwise -1 is
+returned and errno is set.
+
+CCB_DEQUEUE
+
+Tells the driver to clean up resources associated with past
+requests. Since no interrupt is generated upon the completion of a
+request, the driver must be told when it may reclaim resources.  No
+further status information is returned, so the user should not
+subsequently call read().
+
+CCB_KILL
+
+Kills a CCB during execution. The CCB is guaranteed to not continue
+executing once this call returns successfully. On success, read() must
+be called to retrieve the result of the action.
+
+CCB_INFO
+
+Retrieves information about a currently executing CCB. Note that some
+Hypervisors might return 'notfound' when the CCB is in 'inprogress'
+state. To ensure a CCB in the 'notfound' state will never be executed,
+CCB_KILL must be invoked on that CCB. Upon success, read() must be
+called to retrieve the details of the action.
+
+Submission of an array of CCBs for execution
+
+A write() whose length is a multiple of the CCB size is treated as a
+submit operation. The file offset is treated as the index of the
+completion area to use, and may be set via lseek() or using the
+pwrite() system call. If -1 is returned then errno is set to indicate
+the error. Otherwise, the return value is the length of the array that
+was actually accepted by the coprocessor. If the accepted length is
+equal to the requested length, then the submission was completely
+successful and there is no further status needed; hence, the user
+should not subsequently call read(). Partial acceptance of the CCB
+array is indicated by a return value less than the requested length,
+and read() must be called to retrieve further status information.  The
+status will reflect the error caused by the first CCB that was not
+accepted, and status_data will provide additional data in some cases.
+
+MMAP
+
+The mmap() function provides access to the completion area allocated
+in the driver.  Note that the completion area is not writeable by the
+user process, and the mmap call must not specify PROT_WRITE.
+
+
+Completion of a Request
+-----------------------
+
+The first byte in each completion area is the command status which is
+updated by the coprocessor hardware. Software may take advantage of
+new M7/M8 processor capabilities to efficiently poll this status byte.
+First, a "monitored load" is achieved via a Load from Alternate Space
+(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
+"monitored wait" is achieved via the mwait instruction (a write to
+%asr28). This instruction is like pause in that it suspends execution
+of the virtual processor for the given number of nanoseconds, but in
+addition will terminate early when one of several events occur. If the
+block of data containing the monitored location is modified, then the
+mwait terminates. This causes software to resume execution immediately
+(without a context switch or kernel to user transition) after a
+transaction completes. Thus the latency between transaction completion
+and resumption of execution may be just a few nanoseconds.
+
+
+Application Life Cycle of a DAX Submission
+------------------------------------------
+
+ - open dax device
+ - call mmap() to get the completion area address
+ - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
+ - submit CCB via write() or pwrite()
+ - go into a loop executing monitored load + monitored wait and
+   terminate when the command status indicates the request is complete
+   (CCB_KILL or CCB_INFO may be used any time as necessary)
+ - perform a CCB_DEQUEUE
+ - call munmap() for completion area
+ - close the dax device
+
+
+Memory Constraints
+------------------
+
+The DAX hardware operates only on physical addresses. Therefore, it is
+not aware of virtual memory mappings and the discontiguities that may
+exist in the physical memory that a virtual buffer maps to. There is
+no I/O TLB or any scatter/gather mechanism. All buffers, whether input
+or output, must reside in a physically contiguous region of memory.
+
+The Hypervisor translates all addresses within a CCB to physical
+before handing off the CCB to DAX. The Hypervisor determines the
+virtual page size for each virtual address given, and uses this to
+program a size limit for each address. This prevents the coprocessor
+from reading or writing beyond the bound of the virtual page, even
+though it is accessing physical memory directly. A simpler way of
+saying this is that a DAX operation will never "cross" a virtual page
+boundary. If an 8k virtual page is used, then the data is strictly
+limited to 8k. If a user's buffer is larger than 8k, then a larger
+page size must be used, or the transaction size will be truncated to
+8k.
+
+Huge pages. A user may allocate huge pages using standard interfaces.
+Memory buffers residing on huge pages may be used to achieve much
+larger DAX transaction sizes, but the rules must still be followed,
+and no transaction will cross a page boundary, even a huge page.  A
+major caveat is that Linux on Sparc presents 8Mb as one of the huge
+page sizes. Sparc does not actually provide a 8Mb hardware page size,
+and this size is synthesized by pasting together two 4Mb pages. The
+reasons for this are historical, and it creates an issue because only
+half of this 8Mb page can actually be used for any given buffer in a
+DAX request, and it must be either the first half or the second half;
+it cannot be a 4Mb chunk in the middle, since that crosses a
+(hardware) page boundary. Note that this entire issue may be hidden by
+higher level libraries.
+
+
+CCB Structure
+-------------
+A CCB is an array of 8 64-bit words. Several of these words provide
+command opcodes, parameters, flags, etc., and the rest are addresses
+for the completion area, output buffer, and various inputs:
+
+   struct ccb {
+       u64   control;
+       u64   completion;
+       u64   input0;
+       u64   access;
+       u64   input1;
+       u64   op_data;
+       u64   output;
+       u64   table;
+   };
+
+See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
+each of these fields, and see dax-hv-api.txt for a complete description
+of the Hypervisor API available to the guest OS (ie, Linux kernel).
+
+The first word (control) is examined by the driver for the following:
+ - CCB version, which must be consistent with hardware version
+ - Opcode, which must be one of the documented allowable commands
+ - Address types, which must be set to "virtual" for all the addresses
+   given by the user, thereby ensuring that the application can
+   only access memory that it owns
+
+
+Example Code
+------------
+
+The DAX is accessible to both user and kernel code.  The kernel code
+can make hypercalls directly while the user code must use wrappers
+provided by the driver. The setup of the CCB is nearly identical for
+both; the only difference is in preparation of the completion area. An
+example of user code is given now, with kernel code afterwards.
+
+In order to program using the driver API, the file
+arch/sparc/include/uapi/asm/oradax.h must be included.
+
+First, the proper device must be opened. For M7 it will be
+/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
+procedure is to attempt to open both, as only one will succeed:
+
+	fd = open("/dev/oradax1", O_RDWR);
+	if (fd < 0)
+		fd = open("/dev/oradax2", O_RDWR);
+	if (fd < 0)
+	       /* No DAX found */
+
+Next, the completion area must be mapped:
+
+      completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
+
+All input and output buffers must be fully contained in one hardware
+page, since as explained above, the DAX is strictly constrained by
+virtual page boundaries.  In addition, the output buffer must be
+64-byte aligned and its size must be a multiple of 64 bytes because
+the coprocessor writes in units of cache lines.
+
+This example demonstrates the DAX Scan command, which takes as input a
+vector and a match value, and produces a bitmap as the output. For
+each input element that matches the value, the corresponding bit is
+set in the output.
+
+In this example, the input vector consists of a series of single bits,
+and the match value is 0. So each 0 bit in the input will produce a 1
+in the output, and vice versa, which produces an output bitmap which
+is the input bitmap inverted.
+
+For details of all the parameters and bits used in this CCB, please
+refer to section 36.2.1.3 of the DAX Hypervisor API document, which
+describes the Scan command in detail.
+
+	ccb->control =       /* Table 36.1, CCB Header Format */
+		  (2L << 48)     /* command = Scan Value */
+		| (3L << 40)     /* output address type = primary virtual */
+		| (3L << 34)     /* primary input address type = primary virtual */
+		             /* Section 36.2.1, Query CCB Command Formats */
+		| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
+		| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
+		| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
+		| (0 <<  5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */
+		| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */
+
+	ccb->completion = 0;    /* Completion area address, to be filled in by driver */
+
+	ccb->input0 = (unsigned long) input; /* primary input address */
+
+	ccb->access =       /* Section 36.2.1.2, Data Access Control */
+		  (2 << 24)    /* Primary input length format = bits */
+		| (nbits - 1); /* number of bits in primary input stream, minus 1 */
+
+	ccb->input1 = 0;       /* secondary input address, unused */
+
+	ccb->op_data = 0;      /* scan criteria (value to be matched) */
+
+	ccb->output = (unsigned long) output;	/* output address */
+
+	ccb->table = 0;	       /* table address, unused */
+
+The CCB submission is a write() or pwrite() system call to the
+driver. If the call fails, then a read() must be used to retrieve the
+status:
+
+	if (pwrite(fd, ccb, 64, 0) != 64) {
+		struct ccb_exec_result status;
+		read(fd, &status, sizeof(status));
+		/* bail out */
+	}
+
+After a successful submission of the CCB, the completion area may be
+polled to determine when the DAX is finished. Detailed information on
+the contents of the completion area can be found in section 36.2.2 of
+the DAX HV API document.
+
+	while (1) {
+		/* Monitored Load */
+		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
+				     : "=r" (status)
+				     : "r"  (completion_area));
+
+		if (status)	     /* 0 indicates command in progress */
+			break;
+
+		/* MWAIT */
+		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
+	}
+
+A completion area status of 1 indicates successful completion of the
+CCB and validity of the output bitmap, which may be used immediately.
+All other non-zero values indicate error conditions which are
+described in section 36.2.2.
+
+	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
+		/* completion_area[0] contains the completion status */
+		/* completion_area[1] contains an error code, see 36.2.2 */
+	}
+
+After the completion area has been processed, the driver must be
+notified that it can release any resources associated with the
+request. This is done via the dequeue operation:
+
+	struct dax_command cmd;
+	cmd.command = CCB_DEQUEUE;
+	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
+		/* bail out */
+	}
+
+Finally, normal program cleanup should be done, i.e., unmapping
+completion area, closing the dax device, freeing memory etc.
+
+[Kernel example]
+
+The only difference in using the DAX in kernel code is the treatment
+of the completion area. Unlike user applications which mmap the
+completion area allocated by the driver, kernel code must allocate its
+own memory to use for the completion area, and this address and its
+type must be given in the CCB:
+
+	ccb->control |=      /* Table 36.1, CCB Header Format */
+	        (3L << 32);     /* completion area address type = primary virtual */
+
+	ccb->completion = (unsigned long) completion_area;   /* Completion area address */
+
+The dax submit hypercall is made directly. The flags used in the
+ccb_submit call are documented in the DAX HV API in section 36.3.1.
+
+#include <asm/hypervisor.h>
+
+	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
+				 HV_CCB_QUERY_CMD |
+				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
+				 HV_CCB_VA_PRIVILEGED,
+				 0, &bytes_accepted, &status_data);
+
+	if (hv_rv != HV_EOK) {
+		/* hv_rv is an error code, status_data contains */
+		/* potential additional status, see 36.3.1.1 */
+	}
+
+After the submission, the completion area polling code is identical to
+that in user land:
+
+	while (1) {
+		/* Monitored Load */
+		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
+				     : "=r" (status)
+				     : "r"  (completion_area));
+
+		if (status)	     /* 0 indicates command in progress */
+			break;
+
+		/* MWAIT */
+		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
+	}
+
+	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
+		/* completion_area[0] contains the completion status */
+		/* completion_area[1] contains an error code, see 36.2.2 */
+	}
+
+The output bitmap is ready for consumption immediately after the
+completion status indicates success.