diff options
Diffstat (limited to 'Documentation/sparc/oradax/oracle-dax.txt')
-rw-r--r-- | Documentation/sparc/oradax/oracle-dax.txt | 429 |
1 files changed, 429 insertions, 0 deletions
diff --git a/Documentation/sparc/oradax/oracle-dax.txt b/Documentation/sparc/oradax/oracle-dax.txt new file mode 100644 index 000000000000..9d53ac93286f --- /dev/null +++ b/Documentation/sparc/oradax/oracle-dax.txt @@ -0,0 +1,429 @@ +Oracle Data Analytics Accelerator (DAX) +--------------------------------------- + +DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 +(DAX2) processor chips, and has direct access to the CPU's L3 caches +as well as physical memory. It can perform several operations on data +streams with various input and output formats. A driver provides a +transport mechanism and has limited knowledge of the various opcodes +and data formats. A user space library provides high level services +and translates these into low level commands which are then passed +into the driver and subsequently the Hypervisor and the coprocessor. +The library is the recommended way for applications to use the +coprocessor, and the driver interface is not intended for general use. +This document describes the general flow of the driver, its +structures, and its programmatic interface. It also provides example +code sufficient to write user or kernel applications that use DAX +functionality. + +The user library is open source and available at: + https://oss.oracle.com/git/gitweb.cgi?p=libdax.git + +The Hypervisor interface to the coprocessor is described in detail in +the accompanying document, dax-hv-api.txt, which is a plain text +excerpt of the (Oracle internal) "UltraSPARC Virtual Machine +Specification" version 3.0.20+15, dated 2017-09-25. + + +High Level Overview +------------------- + +A coprocessor request is described by a Command Control Block +(CCB). The CCB contains an opcode and various parameters. The opcode +specifies what operation is to be done, and the parameters specify +options, flags, sizes, and addresses. The CCB (or an array of CCBs) +is passed to the Hypervisor, which handles queueing and scheduling of +requests to the available coprocessor execution units. A status code +returned indicates if the request was submitted successfully or if +there was an error. One of the addresses given in each CCB is a +pointer to a "completion area", which is a 128 byte memory block that +is written by the coprocessor to provide execution status. No +interrupt is generated upon completion; the completion area must be +polled by software to find out when a transaction has finished, but +the M7 and later processors provide a mechanism to pause the virtual +processor until the completion status has been updated by the +coprocessor. This is done using the monitored load and mwait +instructions, which are described in more detail later. The DAX +coprocessor was designed so that after a request is submitted, the +kernel is no longer involved in the processing of it. The polling is +done at the user level, which results in almost zero latency between +completion of a request and resumption of execution of the requesting +thread. + + +Addressing Memory +----------------- + +The kernel does not have access to physical memory in the Sun4v +architecture, as there is an additional level of memory virtualization +present. This intermediate level is called "real" memory, and the +kernel treats this as if it were physical. The Hypervisor handles the +translations between real memory and physical so that each logical +domain (LDOM) can have a partition of physical memory that is isolated +from that of other LDOMs. When the kernel sets up a virtual mapping, +it specifies a virtual address and the real address to which it should +be mapped. + +The DAX coprocessor can only operate on physical memory, so before a +request can be fed to the coprocessor, all the addresses in a CCB must +be converted into physical addresses. The kernel cannot do this since +it has no visibility into physical addresses. So a CCB may contain +either the virtual or real addresses of the buffers or a combination +of them. An "address type" field is available for each address that +may be given in the CCB. In all cases, the Hypervisor will translate +all the addresses to physical before dispatching to hardware. Address +translations are performed using the context of the process initiating +the request. + + +The Driver API +-------------- + +An application makes requests to the driver via the write() system +call, and gets results (if any) via read(). The completion areas are +made accessible via mmap(), and are read-only for the application. + +The request may either be an immediate command or an array of CCBs to +be submitted to the hardware. + +Each open instance of the device is exclusive to the thread that +opened it, and must be used by that thread for all subsequent +operations. The driver open function creates a new context for the +thread and initializes it for use. This context contains pointers and +values used internally by the driver to keep track of submitted +requests. The completion area buffer is also allocated, and this is +large enough to contain the completion areas for many concurrent +requests. When the device is closed, any outstanding transactions are +flushed and the context is cleaned up. + +On a DAX1 system (M7), the device will be called "oradax1", while on a +DAX2 system (M8) it will be "oradax2". If an application requires one +or the other, it should simply attempt to open the appropriate +device. Only one of the devices will exist on any given system, so the +name can be used to determine what the platform supports. + +The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For +all of these, success is indicated by a return value from write() +equal to the number of bytes given in the call. Otherwise -1 is +returned and errno is set. + +CCB_DEQUEUE + +Tells the driver to clean up resources associated with past +requests. Since no interrupt is generated upon the completion of a +request, the driver must be told when it may reclaim resources. No +further status information is returned, so the user should not +subsequently call read(). + +CCB_KILL + +Kills a CCB during execution. The CCB is guaranteed to not continue +executing once this call returns successfully. On success, read() must +be called to retrieve the result of the action. + +CCB_INFO + +Retrieves information about a currently executing CCB. Note that some +Hypervisors might return 'notfound' when the CCB is in 'inprogress' +state. To ensure a CCB in the 'notfound' state will never be executed, +CCB_KILL must be invoked on that CCB. Upon success, read() must be +called to retrieve the details of the action. + +Submission of an array of CCBs for execution + +A write() whose length is a multiple of the CCB size is treated as a +submit operation. The file offset is treated as the index of the +completion area to use, and may be set via lseek() or using the +pwrite() system call. If -1 is returned then errno is set to indicate +the error. Otherwise, the return value is the length of the array that +was actually accepted by the coprocessor. If the accepted length is +equal to the requested length, then the submission was completely +successful and there is no further status needed; hence, the user +should not subsequently call read(). Partial acceptance of the CCB +array is indicated by a return value less than the requested length, +and read() must be called to retrieve further status information. The +status will reflect the error caused by the first CCB that was not +accepted, and status_data will provide additional data in some cases. + +MMAP + +The mmap() function provides access to the completion area allocated +in the driver. Note that the completion area is not writeable by the +user process, and the mmap call must not specify PROT_WRITE. + + +Completion of a Request +----------------------- + +The first byte in each completion area is the command status which is +updated by the coprocessor hardware. Software may take advantage of +new M7/M8 processor capabilities to efficiently poll this status byte. +First, a "monitored load" is achieved via a Load from Alternate Space +(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a +"monitored wait" is achieved via the mwait instruction (a write to +%asr28). This instruction is like pause in that it suspends execution +of the virtual processor for the given number of nanoseconds, but in +addition will terminate early when one of several events occur. If the +block of data containing the monitored location is modified, then the +mwait terminates. This causes software to resume execution immediately +(without a context switch or kernel to user transition) after a +transaction completes. Thus the latency between transaction completion +and resumption of execution may be just a few nanoseconds. + + +Application Life Cycle of a DAX Submission +------------------------------------------ + + - open dax device + - call mmap() to get the completion area address + - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc. + - submit CCB via write() or pwrite() + - go into a loop executing monitored load + monitored wait and + terminate when the command status indicates the request is complete + (CCB_KILL or CCB_INFO may be used any time as necessary) + - perform a CCB_DEQUEUE + - call munmap() for completion area + - close the dax device + + +Memory Constraints +------------------ + +The DAX hardware operates only on physical addresses. Therefore, it is +not aware of virtual memory mappings and the discontiguities that may +exist in the physical memory that a virtual buffer maps to. There is +no I/O TLB or any scatter/gather mechanism. All buffers, whether input +or output, must reside in a physically contiguous region of memory. + +The Hypervisor translates all addresses within a CCB to physical +before handing off the CCB to DAX. The Hypervisor determines the +virtual page size for each virtual address given, and uses this to +program a size limit for each address. This prevents the coprocessor +from reading or writing beyond the bound of the virtual page, even +though it is accessing physical memory directly. A simpler way of +saying this is that a DAX operation will never "cross" a virtual page +boundary. If an 8k virtual page is used, then the data is strictly +limited to 8k. If a user's buffer is larger than 8k, then a larger +page size must be used, or the transaction size will be truncated to +8k. + +Huge pages. A user may allocate huge pages using standard interfaces. +Memory buffers residing on huge pages may be used to achieve much +larger DAX transaction sizes, but the rules must still be followed, +and no transaction will cross a page boundary, even a huge page. A +major caveat is that Linux on Sparc presents 8Mb as one of the huge +page sizes. Sparc does not actually provide a 8Mb hardware page size, +and this size is synthesized by pasting together two 4Mb pages. The +reasons for this are historical, and it creates an issue because only +half of this 8Mb page can actually be used for any given buffer in a +DAX request, and it must be either the first half or the second half; +it cannot be a 4Mb chunk in the middle, since that crosses a +(hardware) page boundary. Note that this entire issue may be hidden by +higher level libraries. + + +CCB Structure +------------- +A CCB is an array of 8 64-bit words. Several of these words provide +command opcodes, parameters, flags, etc., and the rest are addresses +for the completion area, output buffer, and various inputs: + + struct ccb { + u64 control; + u64 completion; + u64 input0; + u64 access; + u64 input1; + u64 op_data; + u64 output; + u64 table; + }; + +See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of +each of these fields, and see dax-hv-api.txt for a complete description +of the Hypervisor API available to the guest OS (ie, Linux kernel). + +The first word (control) is examined by the driver for the following: + - CCB version, which must be consistent with hardware version + - Opcode, which must be one of the documented allowable commands + - Address types, which must be set to "virtual" for all the addresses + given by the user, thereby ensuring that the application can + only access memory that it owns + + +Example Code +------------ + +The DAX is accessible to both user and kernel code. The kernel code +can make hypercalls directly while the user code must use wrappers +provided by the driver. The setup of the CCB is nearly identical for +both; the only difference is in preparation of the completion area. An +example of user code is given now, with kernel code afterwards. + +In order to program using the driver API, the file +arch/sparc/include/uapi/asm/oradax.h must be included. + +First, the proper device must be opened. For M7 it will be +/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest +procedure is to attempt to open both, as only one will succeed: + + fd = open("/dev/oradax1", O_RDWR); + if (fd < 0) + fd = open("/dev/oradax2", O_RDWR); + if (fd < 0) + /* No DAX found */ + +Next, the completion area must be mapped: + + completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0); + +All input and output buffers must be fully contained in one hardware +page, since as explained above, the DAX is strictly constrained by +virtual page boundaries. In addition, the output buffer must be +64-byte aligned and its size must be a multiple of 64 bytes because +the coprocessor writes in units of cache lines. + +This example demonstrates the DAX Scan command, which takes as input a +vector and a match value, and produces a bitmap as the output. For +each input element that matches the value, the corresponding bit is +set in the output. + +In this example, the input vector consists of a series of single bits, +and the match value is 0. So each 0 bit in the input will produce a 1 +in the output, and vice versa, which produces an output bitmap which +is the input bitmap inverted. + +For details of all the parameters and bits used in this CCB, please +refer to section 36.2.1.3 of the DAX Hypervisor API document, which +describes the Scan command in detail. + + ccb->control = /* Table 36.1, CCB Header Format */ + (2L << 48) /* command = Scan Value */ + | (3L << 40) /* output address type = primary virtual */ + | (3L << 34) /* primary input address type = primary virtual */ + /* Section 36.2.1, Query CCB Command Formats */ + | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */ + | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */ + | (8 << 10) /* 36.2.1.1.6 output format = bit vector */ + | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */ + | (31 << 0); /* 36.2.1.3 Disable second scan criteria */ + + ccb->completion = 0; /* Completion area address, to be filled in by driver */ + + ccb->input0 = (unsigned long) input; /* primary input address */ + + ccb->access = /* Section 36.2.1.2, Data Access Control */ + (2 << 24) /* Primary input length format = bits */ + | (nbits - 1); /* number of bits in primary input stream, minus 1 */ + + ccb->input1 = 0; /* secondary input address, unused */ + + ccb->op_data = 0; /* scan criteria (value to be matched) */ + + ccb->output = (unsigned long) output; /* output address */ + + ccb->table = 0; /* table address, unused */ + +The CCB submission is a write() or pwrite() system call to the +driver. If the call fails, then a read() must be used to retrieve the +status: + + if (pwrite(fd, ccb, 64, 0) != 64) { + struct ccb_exec_result status; + read(fd, &status, sizeof(status)); + /* bail out */ + } + +After a successful submission of the CCB, the completion area may be +polled to determine when the DAX is finished. Detailed information on +the contents of the completion area can be found in section 36.2.2 of +the DAX HV API document. + + while (1) { + /* Monitored Load */ + __asm__ __volatile__("lduba [%1] 0x84, %0\n" + : "=r" (status) + : "r" (completion_area)); + + if (status) /* 0 indicates command in progress */ + break; + + /* MWAIT */ + __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ + } + +A completion area status of 1 indicates successful completion of the +CCB and validity of the output bitmap, which may be used immediately. +All other non-zero values indicate error conditions which are +described in section 36.2.2. + + if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ + /* completion_area[0] contains the completion status */ + /* completion_area[1] contains an error code, see 36.2.2 */ + } + +After the completion area has been processed, the driver must be +notified that it can release any resources associated with the +request. This is done via the dequeue operation: + + struct dax_command cmd; + cmd.command = CCB_DEQUEUE; + if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) { + /* bail out */ + } + +Finally, normal program cleanup should be done, i.e., unmapping +completion area, closing the dax device, freeing memory etc. + +[Kernel example] + +The only difference in using the DAX in kernel code is the treatment +of the completion area. Unlike user applications which mmap the +completion area allocated by the driver, kernel code must allocate its +own memory to use for the completion area, and this address and its +type must be given in the CCB: + + ccb->control |= /* Table 36.1, CCB Header Format */ + (3L << 32); /* completion area address type = primary virtual */ + + ccb->completion = (unsigned long) completion_area; /* Completion area address */ + +The dax submit hypercall is made directly. The flags used in the +ccb_submit call are documented in the DAX HV API in section 36.3.1. + +#include <asm/hypervisor.h> + + hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64, + HV_CCB_QUERY_CMD | + HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY | + HV_CCB_VA_PRIVILEGED, + 0, &bytes_accepted, &status_data); + + if (hv_rv != HV_EOK) { + /* hv_rv is an error code, status_data contains */ + /* potential additional status, see 36.3.1.1 */ + } + +After the submission, the completion area polling code is identical to +that in user land: + + while (1) { + /* Monitored Load */ + __asm__ __volatile__("lduba [%1] 0x84, %0\n" + : "=r" (status) + : "r" (completion_area)); + + if (status) /* 0 indicates command in progress */ + break; + + /* MWAIT */ + __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */ + } + + if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */ + /* completion_area[0] contains the completion status */ + /* completion_area[1] contains an error code, see 36.2.2 */ + } + +The output bitmap is ready for consumption immediately after the +completion status indicates success. |