path: root/fs/exofs
Age | Commit message | Author | Files | Lines
2012-07-23 | Merge branch 'for-linus-2' of ↵ | Linus Torvalds | 1 | -2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is *big* and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file *, it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata * anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller *and* not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. 
For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." 
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...
2012-07-20 | ore: Unlock r4w pages in exact reverse order of locking | Boaz Harrosh | 1 | -12/+12
The read-4-write pages are locked in ascending address order, but were unlocked in whatever way was easiest to code. Fix that: locks should be released in the opposite order of locking, i.e. descending address order. I have not hit this deadlock. It was found by inspecting the debug print-outs. I suspect there is a higher lock at the caller that protects us, but fix it regardless.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
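The ordering rule in this commit can be sketched in a few lines of userspace C. This is only an illustration: `struct toy_page`, `lock_pages_ascending()` and `unlock_pages_descending()` are hypothetical names, not ORE code, and a plain flag stands in for the real page lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lockable page; not a kernel type. */
struct toy_page {
	int locked;
};

/* Take the locks in ascending index (address) order. */
static void lock_pages_ascending(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(!pages[i].locked);
		pages[i].locked = 1;
	}
}

/*
 * Release the locks in exact reverse order of locking. The index of
 * each released page is recorded in order_out[] so the order can be
 * inspected by the caller.
 */
static void unlock_pages_descending(struct toy_page *pages, size_t n,
				    size_t *order_out)
{
	size_t k = 0;

	for (size_t i = n; i-- > 0; ) {
		assert(pages[i].locked);
		pages[i].locked = 0;
		order_out[k++] = i;
	}
}
```

The `for (i = n; i-- > 0; )` form walks an unsigned index downwards without underflow, which is the usual way to write a descending loop over `size_t`.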
2012-07-20 | ore: Remove support of partial IO request (NFS crash) | Boaz Harrosh | 1 | -7/+1
Due to OOM situations the ore might fail to allocate all resources needed for IO of the full request. If some progress was possible it would proceed with a partial/short request, for the sake of forward progress. Since this crashes NFS-core and exofs is just fine without it, just remove this contraption, and fail.
TODO: Support real forward progress with some reserved allocations of resources, such as mem pools and/or bio_sets
[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
CC: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 | ore: Fix NFS crash by supporting any unaligned RAID IO | Boaz Harrosh | 1 | -31/+36
In RAID_5/6 we used to not permit an IO whose end byte is not stripe_size aligned and which spans more than one stripe, i.e. the caller must check whether the bytes actually transferred after submission are fewer than requested, and would need to resubmit a new IO with the remainder. Exofs supports this, and NFS was supposed to support this as well with its short write mechanism. But late testing has exposed a CRASH when this is used with non-RPC layout-drivers.
The change at NFS is deep and risky; in its place the fix at ORE to lift the limitation is actually clean and simple. So here it is below.
The principle here is that in the case of unaligned IO on both ends, beginning and end, we will send two read requests: one like the old code, before the calculation of the first stripe, and also one at a new site, before the calculation of the last stripe. If any "boundary" is aligned or the complete IO is within a single stripe, we do a single read like before.
The code is clean and simple by splitting the old _read_4_write into 3 even parts:
1. _read_4_write_first_stripe
2. _read_4_write_last_stripe
3. _read_4_write_execute
And calling 1+3 at the same place as before, 2+3 before the last stripe, and in the case of all in a single stripe then 1+2+3 is performed additively.
Why did I not think of it before. Well I had a stroke of genius because I have stared at this code for 2 years, and did not find this simple solution, till today. Not that I did not try.
This solution is much better for NFS than the previous supposed solution because the short write was dealt with out-of-band after IO_done, which would cause a seeky IO pattern, whereas here we execute in order. In both solutions we do 2 separate reads, only here we do it within a single IO request. (And actually combine two writes into a single submission)
NFS/exofs code need not change since the ORE API communicates the new shorter length on return; what will happen is that this case would not occur anymore.
hurray!!
[Stable this is an NFS bug since 3.2 Kernel should apply cleanly]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
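The case analysis in this message (which boundary is unaligned, and whether the IO stays inside one stripe) can be written down as a small decision function. This is a reading of the commit message, not the ORE implementation: `r4w_read_count()` is a hypothetical name, and treating a fully aligned IO as needing no read-4-write at all is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many read-4-write submissions a write of
 * `length` bytes at `offset` would need under the scheme described
 * above. Assumes length > 0 and stripe_size > 0.
 */
static unsigned r4w_read_count(uint64_t offset, uint64_t length,
			       uint64_t stripe_size)
{
	uint64_t end;
	int head_unaligned, tail_unaligned, single_stripe;

	assert(length > 0 && stripe_size > 0);

	end = offset + length;
	head_unaligned = (offset % stripe_size) != 0;
	tail_unaligned = (end % stripe_size) != 0;
	single_stripe = (offset / stripe_size) == ((end - 1) / stripe_size);

	/* Assumption: nothing to read when both ends are aligned. */
	if (!head_unaligned && !tail_unaligned)
		return 0;
	/* Within one stripe, first and last parts share one execute. */
	if (single_stripe)
		return 1;
	/* Otherwise one read per unaligned boundary. */
	return (unsigned)head_unaligned + (unsigned)tail_unaligned;
}
```

With a 64-byte stripe, a write of 100 bytes at offset 10 crosses a stripe boundary and is unaligned at both ends, so it falls in the two-read case that the old code refused to handle in one IO.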
2012-07-14 | don't pass nameidata to ->create() | Al Viro | 1 | -1/+1
boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 | stop passing nameidata to ->lookup() | Al Viro | 1 | -1/+1
Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-12 | exofs: fix sparse non-ANSI function warning | Randy Dunlap | 1 | -1/+1
Fix sparse non-ANSI function warning: fs/exofs/sys.c:112:28: warning: non-ANSI function declaration of function 'exofs_sysfs_dbg_print' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29 | Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd | Linus Torvalds | 4 | -2/+230
Pull exofs updates from Boaz Harrosh: "Just a couple of patches. The first is a BUG fix destined for stable which missed the 3.4-rc7 Kernel. The second is just a fixture addition so exofs is able to be better exported as a cluster file system via pNFS." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Add SYSFS info for autologin/pNFS export exofs: Fix CRASH on very early IO errors.
2012-05-21 | exofs: Add SYSFS info for autologin/pNFS export | Sachin Bhamare | 4 | -1/+229
Introduce sysfs infrastructure for exofs cluster filesystem. Each OSD target shows up as below in the sysfs hierarchy:
/sys/fs/exofs/<osdname>_<partition_id>/devX
Where <osdname>_<partition_id> is the unique identification of a Superblock.
Where devX: 0 <= X < device_table_size. They are ordered in device-table order as specified to the mkfs.exofs command.
Each OSD device devX has the following attributes:
osdname - ReadOnly
systemid - ReadOnly
uri - Read/Write
It is up to user-mode to update devX/uri for support of autologin. This sysfs information is used both for autologin and for exporting exofs via a pNFSD server in user-mode (e.g. NFS-Ganesha).
Signed-off-by: Sachin Bhamare <sbhamare@panasas.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-05-20 | exofs: Fix CRASH on very early IO errors. | Boaz Harrosh | 1 | -1/+1
If exofs_fill_super() terminated early due to any error, like an IO error while reading the super-block, we would crash inside exofs_free_sbi(). This is because sbi->oc.numdevs was set to 1 before we actually had a device table at all. Fix it by moving the sbi->oc.numdevs = 1 to after the allocation of the device table.
Reported-by: Johannes Schild <JSchild@gmx.de>
Stable: This is a bug since v3.2.0
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-05-06 | vfs: Rename end_writeback() to clear_inode() | Jan Kara | 1 | -2/+2
After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-03-29 | Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd | Linus Torvalds | 1 | -3/+4
Pull trivial exofs changes from Boaz Harrosh: "Just nothingness really. The big exofs changes are reserved for the next merge window." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Cap on the memcpy() size exofs: (trivial) Fix typo in super.c exofs: fix endian conversion in exofs_sync_fs()
2012-03-22 | Merge branch 'for-linus' of ↵ | Linus Torvalds | 2 | -14/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...
2012-03-21 | switch open-coded instances of d_make_root() to new helper | Al Viro | 1 | -2/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-21 | vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link} | Al Viro | 2 | -12/+2
New field of struct super_block - ->s_max_links. Maximal allowed value of ->i_nlink or 0; in the latter case all checks still need to be done in ->link/->mkdir/->rename instances. Note that this limit applies both to directories and to non-directories.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-03-20 | exofs: remove the second argument of k[un]map_atomic() | Cong Wang | 1 | -2/+2
Ack-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 | exofs: Cap on the memcpy() size | Dan Carpenter | 1 | -1/+2
This data comes from the device, so probably it's fairly trustworthy but it makes the static checkers happy if we check it. [Boaz] the system_id_len is zero, if not present, or always OSD_SYSTEMID_LEN. So always copy OSD_SYSTEMID_LEN bytes. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-03-20 | exofs: (trivial) Fix typo in super.c | Masanari Iida | 1 | -1/+1
Correct spelling "faild" to "failed" in fs/exofs/super.c Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-03-20 | exofs: fix endian conversion in exofs_sync_fs() | Dan Carpenter | 1 | -1/+1
fscb->s_numfiles is an __le64 field so we need to use cpu_to_le64() to get a little endian 64 bit on big endian systems. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
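What the conversion guarantees can be shown with a portable userspace stand-in. In the kernel the fix uses cpu_to_le64(); the helpers below (`put_le64()`, `get_le64()`) are hypothetical names that produce the same byte layout on any host by shifting instead of relying on host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store v into out[] as 8 little-endian bytes, least significant byte
 * first, regardless of the byte order of the machine running this.
 */
static void put_le64(uint8_t out[8], uint64_t v)
{
	assert(out != NULL);
	for (int i = 0; i < 8; i++)
		out[i] = (uint8_t)(v >> (8 * i));
}

/* Inverse of put_le64(): rebuild the CPU-order value from the bytes. */
static uint64_t get_le64(const uint8_t in[8])
{
	uint64_t v = 0;

	assert(in != NULL);
	for (int i = 0; i < 8; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return v;
}
```

On a little-endian machine a plain assignment happens to produce the same bytes, which is why a missing or wrong conversion of this kind only misbehaves on big-endian systems.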
2012-01-10 | Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd | Linus Torvalds | 4 | -28/+81
* 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Must support none-PAGE-aligned IO ore: fix BUG_ON, too few sgs when reading ore: Fix crash in case of an IO error. ore: FIX breakage when MISC_FILESYSTEMS is not set
2012-01-09 | exofs: oops after late failure in mount | Al Viro | 1 | -0/+2
We have already set ->s_root, so ->put_super() is going to be called. Freeing ->s_fs_info is a bloody bad idea when it's going to be dereferenced very shortly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-08 | ore: Must support none-PAGE-aligned IO | Boaz Harrosh | 1 | -12/+60
NFS might send us offsets that are not PAGE aligned. So we must read in the remainder of the first/last pages, in cases where we need it for parity calculations. We only add an sg segment to read the partial page. But we don't mark it as read=true because it is a lock-for-write page.
TODO: In some cases (IO spans a single unit) we can just adjust the raid_unit offset/length, but this is left for later Kernels.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 | ore: fix BUG_ON, too few sgs when reading | Boaz Harrosh | 2 | -2/+6
When reading RAID5 files, in rare cases, we calculated too few sg segments. There should be two extra for the beginning and end partial units. Also "too few sg segments" should not be a BUG_ON; all the mechanics are in place to handle it as a short read. So just return -ENOMEM and the rest of the code will gracefully split the IO.
[Bug in 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 | ore: Fix crash in case of an IO error. | Boaz Harrosh | 1 | -3/+3
The users of ore_check_io() expect the reported device (In case of error) to be indexed relative to the passed-in ore_components table, and not the logical dev index. This causes a crash inside objlayoutdriver in case of an IO error. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-06 | ore: FIX breakage when MISC_FILESYSTEMS is not set | Boaz Harrosh | 2 | -11/+12
As reported by Randy Dunlap, when MISC_FILESYSTEMS is not enabled and NFS4.1 is:
fs/built-in.o: In function `objio_alloc_io_state':
objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state'
fs/built-in.o: In function `_write_done':
objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io'
fs/built-in.o: In function `_read_done':
...
When MISC_FILESYSTEMS, which is more of a GUI thing than anything else, is not selected, exofs/Kconfig is never examined during Kconfig, and it can not do its magic stuff to automatically select everything needed. We must split exofs/Kconfig in two. The ore one is always included. And the exofs one is left in its old place in the menu.
[Needed for the 3.2.0 Kernel]
CC: Stable Tree <stable@kernel.org>
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-01-04 | exofs: propagate umode_t | Al Viro | 3 | -3/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04 | switch ->mknod() to umode_t | Al Viro | 1 | -1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04 | switch ->create() to umode_t | Al Viro | 1 | -1/+1
vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04 | switch vfs_mkdir() and ->mkdir() to umode_t | Al Viro | 1 | -1/+1
vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-04 | vfs: fix the stupidity with i_dentry in inode destructors | Al Viro | 1 | -1/+0
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-11-07 | Merge branch 'modsplit-Oct31_2011' of ↵ | Linus Torvalds | 2 | -0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-04 | Merge branch 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds | 1 | -1/+1
* 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (25 commits) nfs: set vs_hidden on nfs4_callback_version4 (try #2) pnfs-obj: Support for RAID5 read-4-write interface. pnfs-obj: move to ore 03: Remove old raid engine pnfs-obj: move to ore 02: move to ORE pnfs-obj: move to ore 01: ore_layout & ore_components pnfs-obj: Rename objlayout_io_state => objlayout_io_res pnfs-obj: Get rid of objlayout_{alloc,free}_io_state pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist pnfs-obj: Remove redundant EOF from objlayout_io_state nfs: Remove unused variable from write.c nfs: Fix unused variable warning from file.c NFS: Remove no-op less-than-zero checks on unsigned variables. NFS: Clean up nfs4_xdr_dec_secinfo() NFS: Fix documenting comment for nfs_create_request() NFS4: fix cb_recallany decode error nfs4: serialize layoutcommit SUNRPC: remove rpcbind clients destruction on module cleanup SUNRPC: remove rpcbind clients creation during service registering NFSd: call svc rpcbind cleanup explicitly SUNRPC: cleanup service destruction ...
2011-11-03 | pnfs-obj: move to ore 02: move to ORE | Boaz Harrosh | 1 | -1/+1
In this patch we are actually moving to the ORE. (Object Raid Engine). objio_state holds a pointer to an ore_io_state. Once we have an ore_io_state at hand we can call the ore for reading/writing. We register on the done path to kick off the nfs io_done mechanism. Again for Ease of reviewing the old code is "#if 0" but is not removed so the diff command works better. The old code will be removed in the next patch. fs/exofs/Kconfig::ORE is modified to also be auto-included if PNFS_OBJLAYOUT is set. Since we now depend on ORE. (See comments in fs/exofs/Kconfig) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-11-02 | filesystems: add set_nlink() | Miklos Szeredi | 1 | -1/+1
Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-11-01 | fs: add module.h to files that were implicitly using it | Paul Gortmaker | 2 | -0/+2
Some files were using the complete module.h infrastructure without actually including the header at all. Fix them up in advance so once the implicit presence is removed, we won't get failures like this: CC [M] fs/nfsd/nfssvc.o fs/nfsd/nfssvc.c: In function 'nfsd_create_serv': fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function) fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once fs/nfsd/nfssvc.c:335: error: for each function it appears in.) fs/nfsd/nfssvc.c: In function 'nfsd': fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit' make[3]: *** [fs/nfsd/nfssvc.o] Error 1 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-25 | ore: Enable RAID5 mounts | Boaz Harrosh | 1 | -3/+11
Now that we support raid5, enable it at mount. Raid6 will come next. Raid4 is not in demand, so it will probably not be enabled. (Until someone wants it)
NOTE: mkfs.exofs has had support for raid5/6 for a long time. (Making an empty raidX FS is just as easy as raid0 ;-} )
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-25 | exofs: Support for RAID5 read-4-write interface. | Boaz Harrosh | 1 | -2/+59
The ore needs the filesystem to supply an r4w_get_page/r4w_put_page API so it can get cache pages to read into when writing partial stripes.
Also I commented out and NULLed the .writepage (singular) vector, because it gives a terrible write pattern to raid and is apparently not needed. Even in OOM conditions the system copes (even better) without it.
TODO: How to specify to write_cache_pages() to start or include a certain page?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-25 | ore: RAID5 Write | Boaz Harrosh | 4 | -16/+578
This is finally the RAID5 Write support. The bigger part of this patch is not the XOR engine itself, but the read4write logic, which is a complete mini prepare_for_striping reading engine that can read scattered pages of a stripe into cache so they can be used for the XOR calculation. That is, if the write was not stripe aligned.
The main algorithm behind the XOR engine is the 2 dimensional array: struct __stripe_pages_2d. A drawing might save 1000 words

 --- __stripe_pages_2d
 |
 n = pages_in_stripe_unit;
 w = group_width - parity;
 |                            pages array presented to the XOR lib
 |                                            |
 V                                            |
 __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---|
 |                                                      |
 __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <---
 |
 ...                                  ...
 |
 __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par]
                               ^
                               |
            data added columns first then row
 ---

The pages are put on this array columns first, i.e.: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ... So we are doing a corner turn of the pages.
Note that pages will zigzag down and left, but are put sequentially in growing order. So when the time comes to XOR the stripe, only the beginning and end of the array need be checked. We scan the array and any NULL spot will be filled by pages-to-be-read.
The FS that wants to support RAID5 needs to supply an operations-vector that searches for a given page in cache, and specifies whether the page is uptodate or needs reading. All these pages to be read are put on a slave ore_io_state and synchronously read. All the pages of a stripe are read in one IO, using the scatter gather mechanism.
In write we constrain our IO to only be incomplete on a single stripe. Meaning either the complete IO is within a single stripe, so we might have pages to read from both the beginning and the end of the stripe, or we have some reading to do at the beginning but end at a stripe boundary.
The leftover pages are pushed to the next IO by the API already established by previous work, where an IO offset/length combination presented to the ORE might get the length truncated and the user must re-submit the leftover pages. (Both exofs and NFS support this)
But any ORE user should make its best effort to align its IO beforehand and avoid complications. A cached ore_layout->stripe_size member can be used for that calculation. (NOTE: ORE demands that stripe_size not be bigger than 32bit)
What else? Well read it and tell me.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
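Two ideas from this message can be sketched in standalone C: the column-first ("corner turn") placement of pages into the 2D array, and the XOR that produces the parity unit. This is an illustration under assumptions, not the ORE code: `page_slot()` and `xor_parity()` are hypothetical names, plain byte buffers stand in for pages, and the mapping from a sequential page number to a (row, column) slot is one reading of "columns first".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Corner turn: with n pages per stripe unit, the k-th sequential page
 * of a stripe fills column k / n (the data unit) at row k % n.
 */
static void page_slot(size_t k, size_t n, size_t *row, size_t *col)
{
	assert(n > 0);
	*col = k / n;
	*row = k % n;
}

/*
 * XOR ndata equally sized data buffers into parity[]. XOR-ing the
 * parity with all but one data buffer gives back the missing buffer.
 */
static void xor_parity(uint8_t *parity, const uint8_t *const *data,
		       size_t ndata, size_t len)
{
	assert(ndata > 0);
	for (size_t b = 0; b < len; b++) {
		uint8_t p = 0;

		for (size_t d = 0; d < ndata; d++)
			p ^= data[d][b];
		parity[b] = p;
	}
}
```

Each row of the 2D array holds one page from every data unit plus the parity page, which is the set of buffers a single XOR pass needs. That is why a partial write has to read in the missing pages of a row before the parity can be computed.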
2011-10-25 | ore: RAID5 read | Boaz Harrosh | 4 | -78/+455
This patch introduces the first stage of RAID5 support mainly the skip-over-raid-units when reading. For writes it inserts BLANK units, into where XOR blocks should be calculated and written to. It introduces the new "general raid maths", and the main additional parameters and components needed for raid5. Since at this stage it could corrupt future version that actually do support raid5. The enablement of raid5 mounting and setting of parity-count > 0 is disabled. So the raid5 code will never be used. Mounting of raid5 is only enabled later once the basic XOR write is also in. But if the patch "enable RAID5" is applied this code has been tested to be able to properly read raid5 volumes and is according to standard. Also it has been tested that the new maths still properly supports RAID0 and grouping code just as before. (BTW: I have found more bugs in the pnfs-obj RAID math fixed here) The ore.c file is getting too big, so new ore_raid.[hc] files are added that will include the special raid stuff that are not used in striping and mirrors. In future write support these will get bigger. When adding the ore_raid.c to Kbuild file I was forced to rename ore.ko to libore.ko. Is it possible to keep source file, say ore.c and module file ore.ko the same even if there are multiple files inside ore.ko? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-25 | ore: Make ore_calc_stripe_info EXPORT_SYMBOL | Boaz Harrosh | 1 | -5/+3
ore_calc_stripe_info is needed by exofs::export.c for the layout calculations. Make it exportable Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14ore/exofs: Change ore_check_io APIBoaz Harrosh2-23/+20
Current ore_check_io API receives a residual pointer, to report partial IO. But it is actually not used, because in a multiple devices IO there is never a linearity in the IO failure. On the other hand if every failing device is reported through a received callback measures can be taken to handle only failed devices. One at a time. This will also be needed by the objects-layout-driver for it's error reporting facility. Exofs is not currently using the new information and keeps the old behaviour of failing the complete IO in case of an error. (No partial completion) TODO: Use an ore_check_io callback to set_page_error only the failing pages. And re-dirty write pages. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14 | ore/exofs: Define new ore_verify_layout | Boaz Harrosh | 3 | -53/+72
All users of the ore will need to check if current code supports the given layout. For example RAID5/6 is not currently supported. So move all the checks from exofs/super.c to a new ore_verify_layout() to be used by ore users. Note that any new layout should be passed through the ore_verify_layout() because the ore engine will prepare and verify some internal members of ore_layout, and assumes it's called. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14 | ore: Support for partial component table | Boaz Harrosh | 2 | -0/+5
Users like the objlayout-driver would like to only pass a partial device table that covers the IO in question. For example exofs divides the file into raid-group-sized chunks and only serves group_width number of devices at a time.
The partiality is communicated by setting ore_components->first_dev, and the array covers all logical devices from oc->first_dev up to (oc->first_dev + oc->numdevs).
The ore_comp_dev() API receives a logical device index and returns the actual present device in the table. An out-of-range dev_index will BUG.
Logical device index is the theoretical device index as if all the devices of a file are present, i.e.:
total_devs = group_width * mirror_p1 * group_count
0 <= dev_index < total_devs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
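The index arithmetic above can be sketched as follows. `struct toy_components`, `total_devs()` and `toy_comp_slot()` are hypothetical stand-ins, not the ORE definitions. Two assumptions are made: the covered range is half-open, [first_dev, first_dev + numdevs), and an out-of-range index returns -1 here instead of triggering BUG as the real ore_comp_dev() is described to do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical partial device table descriptor. */
struct toy_components {
	unsigned first_dev;	/* logical index of the first present device */
	unsigned numdevs;	/* number of devices present in the table */
};

/* Number of logical devices as if the full table were present. */
static unsigned total_devs(unsigned group_width, unsigned mirror_p1,
			   unsigned group_count)
{
	return group_width * mirror_p1 * group_count;
}

/*
 * Translate a logical device index into a slot of the partial table.
 * Returns -1 when the device is not covered by this table.
 */
static int toy_comp_slot(const struct toy_components *oc, unsigned dev_index)
{
	assert(oc != NULL);
	if (dev_index < oc->first_dev ||
	    dev_index - oc->first_dev >= oc->numdevs)
		return -1;
	return (int)(dev_index - oc->first_dev);
}
```

For example, with group_width = 8, mirror_p1 = 2 and group_count = 3 there are 48 logical devices, and a table with first_dev = 16 and numdevs = 16 covers only the second raid group.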
2011-10-14 | ore: Support for short read/writes | Boaz Harrosh | 1 | -7/+23
Memory conditions and max_bio constraints might cause us to not comply with the full length of the requested IO. Instead of failing the complete IO we can issue a shorter read/write and report how much was actually executed in the ios->length member.
All users must check ios->length at IO_done or upon return of ore_read/write and re-issue the remainder of the bytes, because otherwise no error is returned like before.
This is part of the effort to support the pnfs-obj layout driver.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
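The caller-side contract described here (check how much was done, re-issue the remainder) can be sketched as a loop. Everything in this block is hypothetical: `toy_io_fn`, `struct toy_dev`, `toy_capped_io()` and `io_all()` are illustrative names, and the real ORE interface reports the executed length in ios->length rather than as a return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IO call: may transfer fewer bytes than asked for. */
typedef size_t (*toy_io_fn)(size_t offset, size_t length, void *ctx);

/* Toy device that never transfers more than max_per_io bytes per call. */
struct toy_dev {
	size_t max_per_io;
	unsigned calls;
};

static size_t toy_capped_io(size_t offset, size_t length, void *ctx)
{
	struct toy_dev *dev = ctx;

	(void)offset;
	dev->calls++;
	return length < dev->max_per_io ? length : dev->max_per_io;
}

/*
 * Keep re-issuing the remainder until the whole range is done or the
 * IO call makes no progress. Returns the number of bytes completed.
 */
static size_t io_all(toy_io_fn io, void *ctx, size_t offset, size_t length)
{
	size_t done = 0;

	assert(io != NULL);
	while (done < length) {
		size_t n = io(offset + done, length - done, ctx);

		if (n == 0)
			break;	/* no progress: stop instead of spinning */
		done += n;
	}
	return done;
}
```

A short transfer is not an error under this contract, so the loop has to compare lengths itself; only a transfer of zero bytes is treated as a reason to stop.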
2011-10-14 | exofs: Support for short read/writes | Boaz Harrosh | 1 | -9/+26
If at read/write_done the actual IO was shorter than requested, as reported in the returned ios->length, it is not an error. The remainder of the pages should just be unlocked but not marked uptodate or end_page_writeback. They will be re-issued later by the VFS.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14 | ore: Remove check for ios->kern_buff in _prepare_for_striping to later | Boaz Harrosh | 1 | -23/+13
Move the check and preparation of the ios->kern_buff case to later inside _write_mirror(). Since read was never used with ios->kern_buff its support is removed instead of fixed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14 | ore: cleanup: Embed an ore_striping_info inside ore_io_state | Boaz Harrosh | 1 | -37/+24
Now that each ore_io_state covers only a single raid group. A single striping_info math is needed. Embed one inside ore_io_state to cache the calculation results and eliminate an extra call. Also the outer _prepare_for_striping is removed since it does nothing. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2011-10-14 | ore: Only IO one group at a time (API change) | Boaz Harrosh | 2 | -51/+154
Usually a single IO is confined to one group of devices (group_width) and at the boundary of a raid group it can spill into a second group. Current code would allocate a full device_table size array at each io_state so it can comply with requests that span two groups. Needless to say that is very wasteful, especially when device_table count can get very large (hundreds, even thousands), while a group_width is usually 8 or 10.
* Change ore API to trim an IO that spans two raid groups. The user passes offset+length to ore_get_rw_state; the ore might trim that length if it spans a group boundary. The user must check ios->length or ios->nrpages to see how much IO will be performed. It is the responsibility of the user to re-issue the remainder of the IO.
* Modify exofs to copy spilled pages on to the next IO. This means one last kick is needed after all coalescing of pages is done.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
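The trimming rule can be illustrated with a small helper. This is a simplification, not the ORE math: `trim_to_group()` is a hypothetical name, and it assumes each raid group serves a contiguous byte range of fixed size `group_bytes`, so the next group boundary is the next multiple of that size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: how many of `length` bytes starting at
 * `offset` fit before the next group boundary. The caller is expected
 * to re-issue whatever is left as a separate IO.
 */
static uint64_t trim_to_group(uint64_t offset, uint64_t length,
			      uint64_t group_bytes)
{
	uint64_t room;

	assert(group_bytes > 0);
	room = group_bytes - (offset % group_bytes);
	return length < room ? length : room;
}
```

Because an IO never crosses a boundary after trimming, each io_state only needs room for one group's devices, which is the memory saving the commit is after.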
2011-10-04 | ore/exofs: Change the type of the devices array (API change) | Boaz Harrosh | 3 | -42/+69
In the pNFS obj-LD the device table at the layout level needs to point to a device_cache node, where it is possible and likely that many layouts will point to the same device-nodes. In Exofs we have a more orderly structure where we have a single array of devices that repeats twice for a round-robin view of the device table This patch moves to a model that can be used by the pNFS obj-LD where struct ore_components holds an array of ore_dev-pointers. (ore_dev is newly defined and contains a struct osd_dev *od member) Each pointer in the array of pointers will point to a bigger user-defined dev_struct. That can be accessed by use of the container_of macro. In Exofs an __alloc_dev_table() function allocates the ore_dev-pointers array as well as an exofs_dev array, in one allocation and does the addresses dance to set everything pointing correctly. It still keeps the double allocation trick for the inodes round-robin view of the table. The device table is always allocated dynamically, also for the single device case. So it is unconditionally freed at umount. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
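The embedding pattern described here, a small generic struct inside a bigger user-defined one and recovered with container_of, can be sketched in userspace C. The types and the macro below are hypothetical stand-ins (`toy_ore_dev`, `toy_exofs_dev`, `toy_container_of`); only the `od` member is taken from the commit message, and the other fields are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() macro. */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic per-device struct that the common table points at. */
struct toy_ore_dev {
	void *od;	/* stands in for struct osd_dev *od */
};

/* Bigger user-defined struct that embeds the generic one. */
struct toy_exofs_dev {
	unsigned did;			/* hypothetical user field */
	struct toy_ore_dev ored;	/* embedded generic part */
};

/* Recover the outer struct from a pointer to its embedded member. */
static struct toy_exofs_dev *to_exofs_dev(struct toy_ore_dev *ored)
{
	assert(ored != NULL);
	return toy_container_of(ored, struct toy_exofs_dev, ored);
}
```

The common code only ever sees pointers to the embedded struct, so each user can hang its own data off a device without the shared table knowing the outer type.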
2011-10-03 | ore: Make ore_striping_info and ore_calc_stripe_info public | Boaz Harrosh | 1 | -16/+8
The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them and make struct ore_striping_info public. ore_calc_stripe_info is still static; it will be made public on first use.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>