Commit Graph

39703 Commits (24d4e7f642882a8a13da170b4ba86eec8fa91bf2)

Author SHA1 Message Date
Al Viro 9ea459e110 delayed mntput
On final mntput() we want fs shutdown to happen before return to
userland; however, the only case where we want it happen right
there (i.e. where task_work_add won't do) is MNT_INTERNAL victim.
Those have to be fully synchronous - failure halfway through module
init might count on having vfsmount killed right there.  Fortunately,
final mntput on MNT_INTERNAL vfsmounts happens on shallow stack.
So we handle those synchronously and do an analog of delayed fput
logics for everything else.

As the result, we are guaranteed that fs shutdown will always happen
on shallow stack.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:53 -04:00
Ian Kent b3ca406f27 autofs - remove obsolete d_invalidate() from expire
Biederman's umount-on-rmdir series changes d_invalidate() to sumarily remove
mounts under the passed in dentry regardless of whether they are busy
or not. So calling this in fs/autofs4/expire.c:autofs4_tree_busy() is
definitely the wrong thing to do becuase it will silently umount entries
instead of just cleaning stale dentrys.

But this call shouldn't be needed and testing shows that automounting
continues to function without it.

As Al Viro correctly surmises the original intent of the call was to
perform what shrink_dcache_parent() does.

If at some time in the future I see stale dentries accumulating
following failed mounts I'll revisit the issue and possibly add a
shrink_dcache_parent() call if needed.

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:52 -04:00
Al Viro 8d85b4845a Allow sharing external names after __d_move()
* external dentry names get a small structure prepended to them
(struct external_name).
* it contains an atomic refcount, matching the number of struct dentry
instances that have ->d_name.name pointing to that external name.  The
first thing free_dentry() does is decrementing refcount of external name,
so the instances that are between the call of free_dentry() and
RCU-delayed actual freeing do not contribute.
* __d_move(x, y, false) makes the name of x equal to the name of y,
external or not.  If y has an external name, extra reference is grabbed
and put into x->d_name.name.  If x used to have an external name, the
reference to the old name is dropped and, should it reach zero, freeing
is scheduled via kfree_rcu().
* free_dentry() in dentry with external name decrements the refcount of
that name and, should it reach zero, does RCU-delayed call that will
free both the dentry and external name.  Otherwise it does what it
used to do, except that __d_free() doesn't even look at ->d_name.name;
it simply frees the dentry.

All non-RCU accesses to dentry external name are safe wrt freeing since they
all should happen before free_dentry() is called.  RCU accesses might run
into a dentry seen by free_dentry() or into an old name that got already
dropped by __d_move(); however, in both cases dentry must have been
alive and refer to that name at some point after we'd done rcu_read_lock(),
which means that any freeing must be still pending.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:41 -04:00
Trond Myklebust 6543f80367 NFSv4.1/pnfs: replace broken pnfs_put_lseg_async
You cannot call pnfs_put_lseg_async() more than once per lseg, so it
is really an inappropriate way to deal with a refcount issue.

Instead, replace it with a function that decrements the refcount, and
puts the final 'free' operation (which is incompatible with locks) on
the workqueue.

Cc: Weston Andros Adamson <dros@primarydata.com>
Fixes: e6cf82d1830f: pnfs: add pnfs_put_lseg_async
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-08 16:45:43 -04:00
Andy Lutomirski a1480dcc3c fs: Add a missing permission check to do_umount
Accessing do_remount_sb should require global CAP_SYS_ADMIN, but
only one of the two call sites was appropriately protected.

Fixes CVE-2014-7975.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2014-10-08 12:32:47 -07:00
Tom Haynes ea18cb3f11 NFSv4: Remove dead prototype for nfs4_insert_deviceid_node()
nfs4_insert_deviceid_node() was removed in 661373b13d

Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-08 14:31:01 -04:00
Linus Torvalds da01e61428 Merge tag 'f2fs-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "This patch-set introduces a couple of new features such as large
  sector size, FITRIM, and atomic/volatile writes.

  Several patches enhance power-off recovery and checkpoint routines.

  The fsck.f2fs starts to support fixing corrupted partitions with
  recovery hints provided by this patch-set.

  Summary:
   - retain some recovery information for fsck.f2fs
   - enhance checkpoint speed
   - enhance flush command management
   - bug fix for lseek
   - tune in-place-update policies
   - enhance roll-forward speed
   - revisit all the roll-forward and fsync rules
   - support larget sector size
   - support FITRIM
   - support atomic and volatile writes

  And several clean-ups and bug fixes are included"

* tag 'f2fs-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (42 commits)
  f2fs: support volatile operations for transient data
  f2fs: support atomic writes
  f2fs: remove unused return value
  f2fs: clean up f2fs_ioctl functions
  f2fs: potential shift wrapping buf in f2fs_trim_fs()
  f2fs: call f2fs_unlock_op after error was handled
  f2fs: check the use of macros on block counts and addresses
  f2fs: refactor flush_nat_entries to remove costly reorganizing ops
  f2fs: introduce FITRIM in f2fs_ioctl
  f2fs: introduce cp_control structure
  f2fs: use more free segments until SSR is activated
  f2fs: change the ipu_policy option to enable combinations
  f2fs: fix to search whole dirty segmap when get_victim
  f2fs: fix to clean previous mount option when remount_fs
  f2fs: skip punching hole in special condition
  f2fs: support large sector size
  f2fs: fix to truncate blocks past EOF in ->setattr
  f2fs: update i_size when __allocate_data_block
  f2fs: use MAX_BIO_BLOCKS(sbi)
  f2fs: remove redundant operation during roll-forward recovery
  ...
2014-10-08 12:53:15 -04:00
Linus Torvalds 6dea0737bc Merge branch 'for-3.18' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "Highlights:

   - support the NFSv4.2 SEEK operation (allowing clients to support
     SEEK_HOLE/SEEK_DATA), thanks to Anna.
   - end the grace period early in a number of cases, mitigating a
     long-standing annoyance, thanks to Jeff
   - improve SMP scalability, thanks to Trond"

* 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits)
  nfsd: eliminate "to_delegation" define
  NFSD: Implement SEEK
  NFSD: Add generic v4.2 infrastructure
  svcrdma: advertise the correct max payload
  nfsd: introduce nfsd4_callback_ops
  nfsd: split nfsd4_callback initialization and use
  nfsd: introduce a generic nfsd4_cb
  nfsd: remove nfsd4_callback.cb_op
  nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
  nfsd: fix nfsd4_cb_recall_done error handling
  nfsd4: clarify how grace period ends
  nfsd4: stop grace_time update at end of grace period
  nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
  nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
  nfsd: serialize nfsdcltrack upcalls for a particular client
  nfsd: pass extra info in env vars to upcalls to allow for early grace period end
  nfsd: add a v4_end_grace file to /proc/fs/nfsd
  lockd: add a /proc/fs/lockd/nlm_end_grace file
  nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
  nfsd: remove redundant boot_time parm from grace_done client tracking op
  ...
2014-10-08 12:51:44 -04:00
Linus Torvalds 25641c0c8d NFS client updates for Linux 3.18
Highlights include:
 
 Stable fixes:
 - fix an NFSv4.1 state renewal regression
 - fix open/lock state recovery error handling
 - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
 - fix statd when reconnection fails
 - Don't wake tasks during connection abort
 - Don't start reboot recovery if lease check fails
 - fix duplicate proc entries
 
 Features:
 - pNFS block driver fixes and clean ups from Christoph
 - More code cleanups from Anna
 - Improve mmap() writeback performance
 - Replace use of PF_TRANS with a more generic mechanism for avoiding
   deadlocks in nfs_release_page
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUMpFYAAoJEGcL54qWCgDywHYP/A7XNykwOGhoHVP1Cgr3xqoz
 gVhAw97AEMZE8xSNVEGS++pJTe59JVzsIsYAwdHMwePV33l3zyzYorae6N9p7zWF
 0xVaNQ4qNLVhbrNLAoB5KA/c3/jMnNjF5t15+8akZad5pt4kXLlhSKjyVpdEEtJE
 A0eneXShMYEeLZoOJhpQt5bsw0OZ8YbWWEMjGlDqyeelvV3K1+zfivQOoyX6hS4w
 XFkPEDmU7zunE/xFP9ZoUaVdLO0TvOWfEZ7STWoHm7NuWfPQiDb9w1mTnuZbZyka
 ssezoGcitzwsjCcQ5e1iKTOoFRIsm/zYXFQgFQL7VFMBU1Tss9Of8047EyDkqcPF
 GxctsGg0gQ2FkG7yx7JH7AKpyibOIuByQrQQ916coWSf7K0L4H4Rcky3vryroylP
 1e1RI49xu215OTm+dLvlvYCv55bqCrTmaUGImZac18+ixD2eh6MNfW2ubSdxk89L
 U2rTFV09Bd52N7IQOGQx1FBEI2ZnIFUV4UaFz7v+rGFxOnk6+WYe+iWyb4wC70Yc
 8Jh/gTIQDd5aghql3FTieMOyfEvO6Re4pLMXmqEWMAevicx2t8DwkJriRu6X8Iy2
 rlDlBPwu5QmRWC20Dc897f0VajwDtwdeB8puod7nobOWzOfx4FrNqLJ+jR3pmHUk
 0otvJytqemXt+zkqqHKK
 =/OQi
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - fix an NFSv4.1 state renewal regression
   - fix open/lock state recovery error handling
   - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
   - fix statd when reconnection fails
   - don't wake tasks during connection abort
   - don't start reboot recovery if lease check fails
   - fix duplicate proc entries

  Features:
  - pNFS block driver fixes and clean ups from Christoph
  - More code cleanups from Anna
  - Improve mmap() writeback performance
  - Replace use of PF_TRANS with a more generic mechanism for avoiding
    deadlocks in nfs_release_page"

* tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
  NFSv4.1: Fix an NFSv4.1 state renewal regression
  NFSv4: fix open/lock state recovery error handling
  NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
  NFS: Fabricate fscache server index key correctly
  SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
  NFSv3: Fix missing includes of nfs3_fs.h
  NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  NFS: avoid waiting at all in nfs_release_page when congested.
  NFS: avoid deadlocks with loop-back mounted NFS filesystems.
  MM: export page_wakeup functions
  SCHED: add some "wait..on_bit...timeout()" interfaces.
  NFS: don't use STABLE writes during writeback.
  NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
  rpc: Add -EPERM processing for xs_udp_send_request()
  rpc: return sent and err from xs_sendpages()
  lockd: Try to reconnect if statd has moved
  SUNRPC: Don't wake tasks during connection abort
  Fixing lease renewal
  nfs: fix duplicate proc entries
  pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
  ...
2014-10-08 12:49:23 -04:00
Qu Wenruo a43bb39b5c btrfs: Fix compile error when CONFIG_SECURITY is not set.
Fix the following compile error when CONFIG_SECURITY is not set:

error: 'struct security_mnt_opts' has no member named 'num_mnt_opts'

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-08 06:59:24 -07:00
Fabian Frederick d29c0afe4d GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)
use macro definition

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-10-08 09:57:07 +01:00
Linus Torvalds 28596c9722 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull "trivial tree" updates from Jiri Kosina:
 "Usual pile from trivial tree everyone is so eagerly waiting for"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
  Remove MN10300_PROC_MN2WS0038
  mei: fix comments
  treewide: Fix typos in Kconfig
  kprobes: update jprobe_example.c for do_fork() change
  Documentation: change "&" to "and" in Documentation/applying-patches.txt
  Documentation: remove obsolete pcmcia-cs from Changes
  Documentation: update links in Changes
  Documentation: Docbook: Fix generated DocBook/kernel-api.xml
  score: Remove GENERIC_HAS_IOMAP
  gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
  tty: doc: Fix grammar in serial/tty
  dma-debug: modify check_for_stack output
  treewide: fix errors in printk
  genirq: fix reference in devm_request_threaded_irq comment
  treewide: fix synchronize_rcu() in comments
  checkstack.pl: port to AArch64
  doc: queue-sysfs: minor fixes
  init/do_mounts: better syntax description
  MIPS: fix comment spelling
  powerpc/simpleboot: fix comment
  ...
2014-10-07 21:16:26 -04:00
Chris Mason 0d4cf4e6bf Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
Commit fccb84c94 moved added some helpers to cleanup our sanity tests,
but it looks like both Dave and I always compile with the tests enabled.

This fixes things to work when they are turned off too.

Signed-off-by: Chris Mason <clm@fb.com>
2014-10-07 13:24:20 -07:00
Jaegeuk Kim 02a1335f25 f2fs: support volatile operations for transient data
This patch adds support for volatile writes which keep data pages in memory
until f2fs_evict_inode is called by iput.

For instance, we can use this feature for the sqlite database as follows.
While supporting atomic writes for main database file, we can keep its journal
data temporarily in the page cache by the following sequence.

1. open
 -> ioctl(F2FS_IOC_START_VOLATILE_WRITE);
2. writes
 : keep all the data in the page cache.
3. flush to the database file with atomic writes
  a. ioctl(F2FS_IOC_START_ATOMIC_WRITE);
  b. writes
  c. ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
4. close
 -> drop the cached data

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-07 11:54:41 -07:00
Jeff Layton 6e129d0068 locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
Eliminate the need for a return pointer.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:13 -04:00
Jeff Layton 7ca76311fe locks: set fl_owner for leases to filp instead of current->files
Like flock locks, leases are owned by the file description. Now that the
i_have_this_lease check in __break_lease is gone, we don't actually use
the fl_owner for leases for anything. So, it's now safe to set this more
appropriately to the same value as the fl_file.

While we're at it, fix up the comments over the fl_owner_t definition
since they're rather out of date.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-07 14:06:13 -04:00
Jeff Layton 4d01b7f5e7 locks: give lm_break a return value
Christoph suggests:

   "Add a return value to lm_break so that the lock manager can tell the
    core code "you can delete this lease right now".  That gets rid of
    the games with the timeout which require all kinds of race avoidance
    code in the users."

Do that here and have the nfsd lease break routine use it when it detects
that there was a race between setting up the lease and it being broken.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:13 -04:00
Jeff Layton 03d12ddf84 locks: __break_lease cleanup in preparation of allowing direct removal of leases
Eliminate an unneeded "flock" variable. We can use "fl" as a loop cursor
everywhere. Add a any_leases_conflict helper function as well to
consolidate a bit of code.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:13 -04:00
Jeff Layton 843c6b2f4c locks: remove i_have_this_lease check from __break_lease
I think that the intent of this code was to ensure that a process won't
deadlock if it has one fd open with a lease on it and then breaks that
lease by opening another fd. In that case it'll treat the __break_lease
call as if it were non-blocking.

This seems wrong -- the process could (for instance) be multithreaded
and managing different fds via different threads. I also don't see any
mention of this limitation in the (somewhat sketchy) documentation.

Remove the check and the non-blocking behavior when i_have_this_lease
is true.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-07 14:06:13 -04:00
Jeff Layton c45198eda2 locks: move freeing of leases outside of i_lock
There was only one place where we still could free a file_lock while
holding the i_lock -- lease_modify. Add a new list_head argument to the
lm_change operation, pass in a private list when calling it, and fix
those callers to dispose of the list once the lock has been dropped.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:13 -04:00
Jeff Layton f82b4b6780 locks: move i_lock acquisition into generic_*_lease handlers
Now that we have a saner internal API for managing leases, we no longer
need to mandate that the inode->i_lock be held over most of the lease
code. Push it down into generic_add_lease and generic_delete_lease.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:13 -04:00
Jeff Layton 1c7dd2ff43 locks: define a lm_setup handler for leases
...and move the fasync setup into it for fcntl lease calls. At the same
time, change the semantics of how the file_lock double-pointer is
handled. Up until now, on a successful lease return you got a pointer to
the lock on the list. This is bad, since that pointer can no longer be
relied on as valid once the inode->i_lock has been released.

Change the code to instead just zero out the pointer if the lease we
passed in ended up being used. Then the callers can just check to see
if it's NULL after the call and free it if it isn't.

The priv argument has the same semantics. The lm_setup function can
zero the pointer out to signal to the caller that it should not be
freed after the function returns.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton e6f5c78930 locks: plumb a "priv" pointer into the setlease routines
In later patches, we're going to add a new lock_manager_operation to
finish setting up the lease while still holding the i_lock.  To do
this, we'll need to pass a little bit of info in the fcntl setlease
case (primarily an fasync structure). Plumb the extra pointer into
there in advance of that.

We declare this pointer as a void ** to make it clear that this is
private info, and that the caller isn't required to set this unless
the lm_setup specifically requires it.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton 0c637be884 nfsd: don't keep a pointer to the lease in nfs4_file
Now that we don't need to pass in an actual lease pointer to
vfs_setlease on unlock, we can stop tracking a pointer to the lease in
the nfs4_file.

Switch all of the places that check the fi_lease to check fi_deleg_file
instead. We always set that at the same time so it will have the same
semantics.

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton e51673aa5d locks: clean up vfs_setlease kerneldoc comments
Some of the latter paragraphs seem ambiguous and just plain wrong.
In particular the break_lease comment makes no sense. We call
break_lease (and break_deleg) from all sorts of vfs-layer functions,
so there is clearly such a method.

Also get rid of some of the other comments about what's needed for
a full implementation.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton 0efaa7e82f locks: generic_delete_lease doesn't need a file_lock at all
Ensure that it's OK to pass in a NULL file_lock double pointer on
a F_UNLCK request and convert the vfs_setlease F_UNLCK callers to
do just that.

Finally, turn the BUG_ON in generic_setlease into a WARN_ON_ONCE
with an error return. That's a problem we can handle without
crashing the box if it occurs.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton 415b96c5a1 nfsd: fix potential lease memory leak in nfs4_setlease
It's unlikely to ever occur, but if there were already a lease set on
the file then we could end up getting back a different pointer on a
successful setlease attempt than the one we allocated. If that happens,
the one we allocated could leak.

In practice, I don't think this will happen due to the fact that we only
try to set up the lease once per nfs4_file, but this error handling is a
bit more correct given the current lease API.

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jeff Layton bfe8602436 locks: close potential race in lease_get_mtime
lease_get_mtime is called without the i_lock held, so there's no
guarantee about the stability of the list. Between the time when we
assign "flock" and then dereference it to check whether it's a lease
and for write, the lease could be freed.

Ensure that that doesn't occur by taking the i_lock before trying
to check the lease.

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-10-07 14:06:12 -04:00
Jaegeuk Kim 88b88a6679 f2fs: support atomic writes
This patch introduces a very limited functionality for atomic write support.
In order to support atomic write, this patch adds two ioctls:
 o F2FS_IOC_START_ATOMIC_WRITE
 o F2FS_IOC_COMMIT_ATOMIC_WRITE

The database engine should be aware of the following sequence.
1. open
 -> ioctl(F2FS_IOC_START_ATOMIC_WRITE);
2. writes
  : all the written data will be treated as atomic pages.
3. commit
 -> ioctl(F2FS_IOC_COMMIT_ATOMIC_WRITE);
  : this flushes all the data blocks to the disk, which will be shown all or
  nothing by f2fs recovery procedure.
4. repeat to #2.

The IO pattens should be:

  ,- START_ATOMIC_WRITE                  ,- COMMIT_ATOMIC_WRITE
 CP | D D D D D D | FSYNC | D D D D | FSYNC ...
                      `- COMMIT_ATOMIC_WRITE

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-06 17:39:50 -07:00
Alexey Khoroshilov 0f9e2bf008 ecryptfs: remove unneeded buggy code in ecryptfs_do_create()
There is a bug in error handling of lock_parent() in ecryptfs_do_create():
lock_parent() acquries mutex even if dget_parent() fails, so mutex should be unlocked anyway.

But dget_parent() does not fail, so the patch just removes unneeded buggy code.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-10-06 16:54:50 -05:00
Qu Wenruo f667aef6af btrfs: Make btrfs handle security mount options internally to avoid losing security label.
[BUG]
Originally when mount btrfs with "-o subvol=" mount option, btrfs will
lose all security lable.
And if the btrfs fs is mounted somewhere else, due to the lost of
security lable, SELinux will refuse to mount since the same super block
is being mounted using different security lable.

[REPRODUCER]
With SELinux enabled:
 #mkfs -t btrfs /dev/sda5
 #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
 #btrfs subvolume create /mnt/btrfs/subvol
 #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
  /mnt/test

kernel message:
SELinux: mount invalid.  Same superblock, different security settings
for (dev sda5, type btrfs)

[REASON]
This happens because btrfs will call vfs_kern_mount() and then
mount_subtree() to handle subvolume name lookup.
First mount will cut off all the security lables and when it comes to
the second vfs_kern_mount(), it has no security label now.

[FIX]
This patch will makes btrfs behavior much more like nfs,
which has the type flag FS_BINARY_MOUNTDATA,
making btrfs handles the security label internally.
So security label will be set in the real mount time and won't lose
label when use with "subvol=" mount option.

Reported-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-06 06:23:32 -07:00
Chao Yu 35425ea249 ecryptfs: avoid to access NULL pointer when write metadata in xattr
Christopher Head 2014-06-28 05:26:20 UTC described:
"I tried to reproduce this on 3.12.21. Instead, when I do "echo hello > foo"
in an ecryptfs mount with ecryptfs_xattr specified, I get a kernel crash:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
PGD d7840067 PUD b2c3c067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: nvidia(PO)
CPU: 3 PID: 3566 Comm: bash Tainted: P           O 3.12.21-gentoo-r1 #2
Hardware name: ASUSTek Computer Inc. G60JX/G60JX, BIOS 206 03/15/2010
task: ffff8801948944c0 ti: ffff8800bad70000 task.ti: ffff8800bad70000
RIP: 0010:[<ffffffff8110eb39>]  [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
RSP: 0018:ffff8800bad71c10  EFLAGS: 00010246
RAX: 00000000000181a4 RBX: ffff880198648480 RCX: 0000000000000000
RDX: 0000000000000004 RSI: ffff880172010450 RDI: 0000000000000000
RBP: ffff880198490e40 R08: 0000000000000000 R09: 0000000000000000
R10: ffff880172010450 R11: ffffea0002c51e80 R12: 0000000000002000
R13: 000000000000001a R14: 0000000000000000 R15: ffff880198490e40
FS:  00007ff224caa700(0000) GS:ffff88019fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000bb07f000 CR4: 00000000000007e0
Stack:
ffffffff811826e8 ffff8800a39d8000 0000000000000000 000000000000001a
ffff8800a01d0000 ffff8800a39d8000 ffffffff81185fd5 ffffffff81082c2c
00000001a39d8000 53d0abbc98490e40 0000000000000037 ffff8800a39d8220
Call Trace:
[<ffffffff811826e8>] ? ecryptfs_setxattr+0x40/0x52
[<ffffffff81185fd5>] ? ecryptfs_write_metadata+0x1b3/0x223
[<ffffffff81082c2c>] ? should_resched+0x5/0x23
[<ffffffff8118322b>] ? ecryptfs_initialize_file+0xaf/0xd4
[<ffffffff81183344>] ? ecryptfs_create+0xf4/0x142
[<ffffffff810f8c0d>] ? vfs_create+0x48/0x71
[<ffffffff810f9c86>] ? do_last.isra.68+0x559/0x952
[<ffffffff810f7ce7>] ? link_path_walk+0xbd/0x458
[<ffffffff810fa2a3>] ? path_openat+0x224/0x472
[<ffffffff810fa7bd>] ? do_filp_open+0x2b/0x6f
[<ffffffff81103606>] ? __alloc_fd+0xd6/0xe7
[<ffffffff810ee6ab>] ? do_sys_open+0x65/0xe9
[<ffffffff8157d022>] ? system_call_fastpath+0x16/0x1b
RIP  [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
RSP <ffff8800bad71c10>
CR2: 0000000000000000
---[ end trace df9dba5f1ddb8565 ]---"

If we create a file when we mount with ecryptfs_xattr_metadata option, we will
encounter a crash in this path:
->ecryptfs_create
  ->ecryptfs_initialize_file
    ->ecryptfs_write_metadata
      ->ecryptfs_write_metadata_to_xattr
        ->ecryptfs_setxattr
          ->fsstack_copy_attr_all
It's because our dentry->d_inode used in fsstack_copy_attr_all is NULL, and it
will be initialized when ecryptfs_initialize_file finish.

So we should skip copying attr from lower inode when the value of ->d_inode is
invalid.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Cc: stable@vger.kernel.org # v3.2+: b59db43 eCryptfs: Prevent file create race condition
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-10-05 23:51:43 -05:00
Jaegeuk Kim 120c2cba1d f2fs: remove unused return value
Don't return any value without any usage.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-10-05 21:05:15 -07:00
Theodore Ts'o f4bb298102 ext4: add ext4_iget_normal() which is to be used for dir tree lookups
If there is a corrupted file system which has directory entries that
point at reserved, metadata inodes, prohibit them from being used by
treating them the same way we treat Boot Loader inodes --- that is,
mark them to be bad inodes.  This prohibits them from being opened,
deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a
directory entry which points at the journal inode from being deleted
and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:56:00 -04:00
Theodore Ts'o e2bfb088fa ext4: don't orphan or truncate the boot loader inode
The boot loader inode (inode #5) should never be visible in the
directory hierarchy, but it's possible if the file system is corrupted
that there will be a directory entry that points at inode #5.  In
order to avoid accidentally trashing it, when such a directory inode
is opened, the inode will be marked as a bad inode, so that it's not
possible to modify (or read) the inode from userspace.

Unfortunately, when we unlink this (invalid/illegal) directory entry,
we will put the bad inode on the ophan list, and then when try to
unlink the directory, we don't actually remove the bad inode from the
orphan list before freeing in-memory inode structure.  This means the
in-memory orphan list is corrupted, leading to a kernel oops.

In addition, avoid truncating a bad inode in ext4_destroy_inode(),
since truncating the boot loader inode is not a smart thing to do.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-05 22:47:07 -04:00
Chris Mason 0ec31a61f0 Merge branch 'remove-unlikely' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:44 -07:00
Chris Mason 27b19cc886 Merge branch 'cleanup/blocksize-diet-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2014-10-04 09:57:14 -07:00
Chris Mason bbf65cf0b5 Merge branch 'cleanup/misc-for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/extent_io.c
2014-10-04 09:56:45 -07:00
Filipe Manana bf8e8ca6fd Btrfs: send, don't delay dir move if there's a new parent inode
If between two snapshots we rename an existing directory named X to Y and
make it a child (direct or not) of a new inode named X, we were delaying
the move/rename of the former directory unnecessarily, which would result
in attempting to rename the new directory from its orphan name to name X
prematurely.

Minimal reproducer:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ mkdir -p /mnt/merlin/RC/OSD/Source

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    $ mkdir /mnt/OSD
    $ mv /mnt/merlin/RC/OSD /mnt/OSD/OSD-Plane_788
    $ mv /mnt/OSD /mnt/merlin/RC

    $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    $ btrfs send /mnt/mysnap1 -f /tmp/1.snap
    $ btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    $ mkfs.btrfs -f /dev/vdc
    $ mount /dev/vdc /mnt2

    $ btrfs receive /mnt2 -f /tmp/1.snap
    $ btrfs receive /mnt2 -f /tmp/2.snap

The second receive (from an incremental send) failed with the following
error message: "rename o261-7-0 -> merlin/RC/OSD failed".
This is a regression introduced in the 3.16 kernel.

A test case for xfstests follows.

Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
David Sterba c926093ec5 btrfs: add more superblock checks
Populate btrfs_check_super_valid() with checks that try to verify
consistency of superblock by additional conditions that may arise from
corrupted devices or bitflips. Some of tests are only hints and issue
warnings instead of failing the mount, basically when the checks are
derived from the data found in the superblock.

Tested on a broken image provided by Qu.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Sage Weil 42383020be Btrfs: fix race in WAIT_SYNC ioctl
We check whether transid is already committed via last_trans_committed and
then search through trans_list for pending transactions.  If
last_trans_committed is updated by btrfs_commit_transaction after we check
it (there is no locking), we will fail to find the committed transaction
and return EINVAL to the caller.  This has been observed occasionally by
ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid <= last_trans_committed
after the search fails, and if so return 0.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Filipe Manana 656f30dba7 Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).

Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.

Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick 15b636e1dd Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
Do like disk-io function declared under CONFIG_BTRFS_FS_RUN_SANITY_TESTS
and keep prototype in qgroup.h only

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick b99d9a6a4a btrfs: fix shadow warning on cmp
cmp was declared twice in btrfs_compare_trees resulting in a shadow
warning. This patch renames second internal variable.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:59 -07:00
Fabian Frederick 1b6e44690d Btrfs: fix compilation errors under DEBUG
bi_sector and bi_size moved to bi_iter since commit 4f024f3797
("block: Abstract out bvec iterator")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Liu Bo 8146502820 Btrfs: fix crash of btrfs_release_extent_buffer_page
This is actually inspired by Filipe's patch.  When write_one_eb() fails on
submit_extent_page(), it'll give up writing this eb and mark it with
EXTENT_BUFFER_IOERR.  So if it's not the last page that encounter the failure,
there are some left pages which remain DIRTY, and if a later COW on this eb
happens, ie. eb is COWed and freed, it'd run into BUG_ON in
btrfs_release_extent_buffer_page() for the DIRTY page, ie. BUG_ON(PageDirty(page));

This adds the missing clear_page_dirty_for_io() for the rest pages of eb.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Filipe Manana 55e3bd2e0c Btrfs: add missing end_page_writeback on submit_extent_page failure
If submit_extent_page() fails in write_one_eb(), we end up with the current
page not marked dirty anymore, unlocked and marked for writeback. But we never
end up calling end_page_writeback() against the page, which will make calls to
filemap_fdatawait_range (e.g. at transaction commit time) hang forever waiting
for the writeback bit to be cleared from the page.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Qu Wenruo 32be3a1ac6 btrfs: Fix the wrong condition judgment about subset extent map
Previous commit: btrfs: Fix and enhance merge_extent_mapping() to insert
best fitted extent map
is using wrong condition to judgement whether the range is a subset of a
existing extent map.

This may cause bug in btrfs no-holes mode.

This patch will correct the judgment and fix the bug.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Josef Bacik bbe9051441 Btrfs: fix build_backref_tree issue with multiple shared blocks
Marc Merlin sent me a broken fs image months ago where it would blow up in the
upper->checked BUG_ON() in build_backref_tree.  This is because we had a
scenario like this

block a -- level 4 (not shared)
   |
block b -- level 3 (reloc block, shared)
   |
block c -- level 2 (not shared)
   |
block d -- level 1 (shared)
   |
block e -- level 0 (shared)

We go to build a backref tree for block e, we notice block d is shared and add
it to the list of blocks to lookup it's backrefs for.  Now when we loop around
we will check edges for the block, so we will see we looked up block c last
time.  So we lookup block d and then see that the block that points to it is
block c and we can just skip that edge since we've already been up this path.
The problem is because we clear need_check when we see block d (as it is shared)
we never add block b as needing to be checked.  And because block c is in our
path already we bail out before we walk up to block b and add it to the backref
check list.

To fix this we need to reset need_check if we trip over a block that doesn't
need to be checked.  This will make sure that any subsequent blocks in the path
as we're walking up afterwards are added to the list to be processed.  With this
patch I can now mount Marc's fs image and it'll complete the balance without
panicing.  Thanks,

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Josef Bacik 75bfb9aff4 Btrfs: cleanup error handling in build_backref_tree
When balance panics it tends to panic in the

BUG_ON(!upper->checked);

test, because it means it couldn't build the backref tree properly.  This is
annoying to users and frankly a recoverable error, nothing in this function is
actually fatal since it is just an in-memory building of the backrefs for a
given bytenr.  So go through and change all the BUG_ON()'s to ASSERT()'s, and
fix the BUG_ON(!upper->checked) thing to just return an error.

This patch also fixes the error handling so it tears down the work we've done
properly.  This code was horribly broken since we always just panic'ed instead
of actually erroring out, so it needed to be completely re-worked.  With this
patch my broken image no longer panics when I mount it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-10-03 16:14:58 -07:00
Linus Torvalds 7d1419f30c Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs/smb3 fixes from Steve French:
 "Fix for CIFS/SMB3 oops on reconnect during readpages (3.17 regression)
  and for incorrectly closing file handle in symlink error cases"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix readpages retrying on reconnects
  Fix problem recognizing symlinks
2014-10-03 13:09:57 -07:00
Dmitry Monakhov 3e67cfad22 ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT
Otherwise this provokes complain like follows:
WARNING: CPU: 12 PID: 5795 at fs/ext4/ext4_jbd2.c:48 ext4_journal_check_start+0x4e/0xa0()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 12 PID: 5795 Comm: python Not tainted 3.17.0-rc2-00175-gae5344f #158
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
 0000000000000030 ffff8808116cfd28 ffffffff815c7dfc 0000000000000030
 0000000000000000 ffff8808116cfd68 ffffffff8106ce8c ffff8808116cfdc8
 ffff880813b16000 ffff880806ad6ae8 ffffffff81202008 0000000000000000
Call Trace:
 [<ffffffff815c7dfc>] dump_stack+0x51/0x6d
 [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff81202008>] ? ext4_ioctl+0x9e8/0xeb0
 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
 [<ffffffff8122867e>] ext4_journal_check_start+0x4e/0xa0
 [<ffffffff81228c10>] __ext4_journal_start_sb+0x90/0x110
 [<ffffffff81202008>] ext4_ioctl+0x9e8/0xeb0
 [<ffffffff8107b0bd>] ? ptrace_stop+0x24d/0x2f0
 [<ffffffff81088530>] ? alloc_pid+0x480/0x480
 [<ffffffff8107b1f2>] ? ptrace_do_notify+0x92/0xb0
 [<ffffffff81186545>] do_vfs_ioctl+0x4e5/0x550
 [<ffffffff815cdbcb>] ? _raw_spin_unlock_irq+0x2b/0x40
 [<ffffffff81186603>] SyS_ioctl+0x53/0x80
 [<ffffffff815ce2ce>] tracesys+0xd0/0xd5

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-03 12:47:23 -04:00
Bob Peterson d24e0569e0 GFS2: Use gfs2_rbm_incr in rgblk_free
This patch speeds up GFS2 unlink operations by using function
gfs2_rbm_incr rather than continuously calculating the rbm.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-10-03 14:40:10 +01:00
alex chen 55dacd22db ocfs2/dlm: should put mle when goto kill in dlm_assert_master_handler
In dlm_assert_master_handler, the mle is get in dlm_find_mle, should be
put when goto kill, otherwise, this mle will never be released.

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-02 16:28:44 -07:00
Mark Tinguely 52177937e9 xfs: xfs_iflush_done checks the wrong log item callback
Commit 3013683 ("xfs: remove all the inodes on a buffer from the AIL
in bulk") made the xfs inode flush callback more efficient by
combining all the inode writes on the buffer and the deletions of
the inode log item from AIL.

The initial loop in this patch should be looping through all
the log items on the buffer to see which items have
xfs_iflush_done as their callback function. But currently,
only the log item passed to the function has its callback
compared to xfs_iflush_done. If the log item pointer passed to
the function does have the xfs_iflush_done callback function,
then all the log items on the buffer are removed from the
li_bio_list on the buffer b_fspriv and could be removed from
the AIL even though they may have not been written yet.

This problem is masked by the fact that currently all inodes on a
buffer will have the same calback function - either xfs_iflush_done
or xfs_istale_done - and hence the bug cannot manifest in any way.
Still, we need to remove the landmine so that if we add new
callbacks in future this doesn't cause us problems.

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-03 09:09:50 +10:00
Pavel Shilovsky 1209bbdff2 CIFS: Fix readpages retrying on reconnects
If we got a reconnect error from async readv we re-add pages back
to page_list and continue loop. That is wrong because these pages
have been already added to the pagecache but page_list has pages that
have not been added to the pagecache yet. This ends up with a general
protection fault in put_pages after readpages. Fix it by not retrying
the read of these pages and falling back to readpage instead.

Fixes debian bug 762306

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
2014-10-02 14:17:41 -05:00
Steve French 19e81573fc Fix problem recognizing symlinks
Changeset eb85d94bd introduced a problem where if a cifs open
fails during query info of a file we
will still try to close the file (happens with certain types
of reparse points) even though the file handle is not valid.

In addition for SMB2/SMB3 we were not mapping the return code returned
by Windows when trying to open a file (like a Windows NFS symlink)
which is a reparse point.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
CC: stable <stable@vger.kernel.org> #v3.13+
2014-10-02 14:10:04 -05:00
David Sterba fccb84c94a btrfs: move checks for DUMMY_ROOT into a helper
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba 7ec20afbcb btrfs: new define for the inline extent data start
Use a common definition for the inline data start so we don't have to
open-code it and introduce bugs like "Btrfs: fix wrong max inline data
size limit" fixed.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:33 +02:00
David Sterba fb85fc9a67 btrfs: kill extent_buffer_page helper
It used to be more complex but now it's just a simple array access.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:32 +02:00
David Sterba a50924e3a4 btrfs: drop constant param from btrfs_release_extent_buffer_page
All callers use the same value, simplify the function.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:32 +02:00
David Sterba 2755a0de64 btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:31 +02:00
David Sterba 94404e82e5 btrfs: let merge_reloc_roots return void
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:31 +02:00
David Sterba 8b9456da03 btrfs: remove unused members from struct scrub_warning
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:30 +02:00
David Sterba 97eb6b69d1 btrfs: use slab for end_io_wq structures
The structure is frequently reused.  Rename it according to the slab
name.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:29 +02:00
David Sterba af13b4922b btrfs: fix error labels in init_btrfs_fs
btrfs_interface_init rarely fails but we could leak the prelim_ref slab.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:29 +02:00
David Sterba bfebd8b544 btrfs: use enum for wq endio metadata type
The enum exists but is not consistently used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:28 +02:00
David Sterba 01d5bc3789 btrfs: remove unused extent state bits
The last users are long gone.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:30:27 +02:00
Filipe David Borba Manana 95ac567af2 Btrfs: set default max_inline to 8KiB instead of 8MiB
8MiB is way too large and likely set by mistake. This is not
a significant issue as in practice the max amount of data
added to an inline extent is also limited by the page cache
and btree leaf sizes.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:29:24 +02:00
David Sterba 4d75f8a9c8 btrfs: remove blocksize from btrfs_alloc_free_block and rename
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free +
_tree_block family. The parameter blocksize was set to the metadata
block size, directly or indirectly.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:54 +02:00
David Sterba 0308af4465 btrfs: remove unused parameter blocksize from btrfs_find_tree_block
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:52 +02:00
David Sterba ce86cd5917 btrfs: remove parameter blocksize from read_tree_block
We know the tree block size, no need to pass it around.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 17:14:50 +02:00
David Sterba 453848a05f btrfs: inline code of reada_tree_block and remove it
It's trivial with a single user. And remove one pointless BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:31:27 +02:00
David Sterba 6197d86eab btrfs: return void from readahead_tree_block
Errors in readahead are not fatal and ignored elsewhere in the code.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:30:40 +02:00
David Sterba 58dc4ce432 btrfs: remove unused parameter from readahead_tree_block
The parent_transid parameter has been unused since its introduction in
ca7a79ad8d ("Pass down the expected generation number when reading
tree blocks").  In reada_tree_block, it was even wrongly set to leafsize.
Transid check is done in the proper read and readahead ignores errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:30:18 +02:00
David Sterba ee39b432b4 btrfs: remove unlikely from data-dependent branches and slow paths
There are the branch hints that obviously depend on the data being
processed, the CPU predictor will do better job according to the actual
load. It also does not make sense to use the hints in slow paths that do
a lot of other operations like locking, waiting or IO.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:15:21 +02:00
David Sterba 5d99a998f3 btrfs: remove unlikely from NULL checks
Unlikely is implicit for NULL checks of pointers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-02 16:06:19 +02:00
Dmitry Monakhov be5cd90dda ext4: optimize block allocation on grow indepth
It is reasonable to prepend newly created index to older one.

[ Dropped no longer used function parameter newext. -tytso ]

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:57:09 -04:00
Dmitry Monakhov dfe076c106 ext4: get rid of code duplication
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:26:17 -04:00
Dmitry Monakhov c5d311926d ext4: fix over-defensive complaint after journal abort
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-01 22:23:15 -04:00
Li Xi bce92d566a ext4: fix return value of ext4_do_update_inode
When ext4_do_update_inode() gets error from ext4_inode_blocks_set(),
error number should be returned.

Signed-off-by: Li Xi <lixi@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-10-01 22:11:06 -04:00
Jan Kara d6320cbfc9 ext4: fix mmap data corruption when blocksize < pagesize
Use truncate_isize_extended() when hole is being created in a file so that
->page_mkwrite() will get called for the partial tail page if it is
mmaped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-01 21:49:46 -04:00
Jan Kara 90a8020278 vfs: fix data corruption when blocksize < pagesize for mmaped data
->page_mkwrite() is used by filesystems to allocate blocks under a page
which is becoming writeably mmapped in some process' address space. This
allows a filesystem to return a page fault if there is not enough space
available, user exceeds quota or similar problem happens, rather than
silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where
filesystems need it when blocksize < pagesize. For example when
blocksize = 1024, pagesize = 4096 the following is problematic:
  ftruncate(fd, 0);
  pwrite(fd, buf, 1024, 0);
  map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
  map[0] = 'a';       ----> page_mkwrite() for index 0 is called
  ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
  mremap(map, 1024, 10000, 0);
  map[4095] = 'a';    ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, filesystem can allocate only
one block for the page because i_size == 1024. Otherwise it would create
blocks beyond i_size which is generally undesirable. But later at
->writepage() time, we also need to store data at offset 4095 but we
don't have block allocated for it.

This patch introduces a helper function filesystems can use to have
->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-01 21:49:18 -04:00
Brian Foster da5f10969d xfs: flush the range before zero range conversion
XFS currently discards delalloc blocks within the target range of a
zero range request. Unaligned start and end offsets are zeroed
through the page cache and the internal, aligned blocks are
converted to unwritten extents.

If EOF is page aligned and covered by a delayed allocation extent.
The inode size is not updated until I/O completion. If a zero range
request discards a delalloc range that covers page aligned EOF as
such, the inode size update never occurs. For example:

$ rm -f /mnt/file
$ xfs_io -fc "pwrite 0 64k" -c "zero 60k 4k" /mnt/file
$ stat -c "%s" /mnt/file
65536
$ umount /mnt
$ mount <dev> /mnt
$ stat -c "%s" /mnt/file
61440

Update xfs_zero_file_space() to flush the range rather than discard
delalloc blocks to ensure that inode size updates occur
appropriately.

[dchinner: Note that this is really a workaround to avoid the
underlying problems. More work is needed (and ongoing) to fix those
issues so this fix is being added as a temporary stop-gap measure. ]

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:44:54 +10:00
Brian Foster 07d08681d2 xfs: restore buffer_head unwritten bit on ioend cancel
xfs_vm_writepage() walks each buffer_head on the page, maps to the block
on disk and attaches to a running ioend structure that represents the
I/O submission. A new ioend is created when the type of I/O (unwritten,
delayed allocation or overwrite) required for a particular buffer_head
differs from the previous. If a buffer_head is a delalloc or unwritten
buffer, the associated bits are cleared by xfs_map_at_offset() once the
buffer_head is added to the ioend.

The process of mapping each buffer_head occurs in xfs_map_blocks() and
acquires the ilock in blocking or non-blocking mode, depending on the
type of writeback in progress. If the lock cannot be acquired for
non-blocking writeback, we cancel the ioend, redirty the page and
return. Writeback will revisit the page at some later point.

Note that we acquire the ilock for each buffer on the page. Therefore
during non-blocking writeback, it is possible to add an unwritten buffer
to the ioend, clear the unwritten state, fail to acquire the ilock when
mapping a subsequent buffer and cancel the ioend. If this occurs, the
unwritten status of the buffer sitting in the ioend has been lost. The
page will eventually hit writeback again, but xfs_vm_writepage() submits
overwrite I/O instead of unwritten I/O and does not perform unwritten
extent conversion at I/O completion. This leads to data corruption
because unwritten extents are treated as holes on reads and zeroes are
returned instead of reading from disk.

Modify xfs_cancel_ioend() to restore the buffer unwritten bit for ioends
of type XFS_IO_UNWRITTEN. This ensures that unwritten extent conversion
occurs once the page is eventually written back.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:42:06 +10:00
Eric Sandeen 5cca3f611d xfs: check for null dquot in xfs_quota_calc_throttle()
Coverity spotted this.

Granted, we *just* checked xfs_inod_dquot() in the caller (by
calling xfs_quota_need_throttle). However, this is the only place we
don't check the return value but the check is cheap and future-proof
so add it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:27:09 +10:00
Eric Sandeen 04dd1a0d4b xfs: fix crc field handling in xfs_sb_to/from_disk
I discovered this in userspace, but the same change applies
to the kernel.

If we xfs_mdrestore an image from a non-crc filesystem, lo
and behold the restored image has gained a CRC:

# db/xfs_metadump.sh -o /dev/sdc1 - | xfs_mdrestore - test.img
# xfs_db -c "sb 0" -c "p crc" /dev/sdc1
crc = 0 (correct)
# xfs_db -c "sb 0" -c "p crc" test.img
crc = 0xb6f8d6a0 (correct)

This is because xfs_sb_from_disk doesn't fill in sb_crc,
but xfs_sb_to_disk(XFS_SB_ALL_BITS) does write the in-memory
CRC to disk - so we get uninitialized memory on disk.

Fix this by always initializing sb_crc to 0 when we read
the superblock, and masking out the CRC bit from ALL_BITS
when we write it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:24:11 +10:00
Eric Sandeen 6ee49a20c1 xfs: don't send null bp to xfs_trans_brelse()
In this case, if bp is NULL, error is set, and we send a
NULL bp to xfs_trans_brelse, which will try to dereference it.

Test whether we actually have a buffer before we try to
free it.

Coverity spotted this.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:23:49 +10:00
Brian Foster ce57bcf6b8 xfs: check for inode size overflow in xfs_new_eof()
If we write to the maximum file offset (2^63-2), XFS fails to log the
inode size update when the page is flushed. For example:

$ xfs_io -fc "pwrite `echo "2^63-1-1" | bc` 1" /mnt/file
wrote 1/1 bytes at offset 9223372036854775806
1.000000 bytes, 1 ops; 0.0000 sec (22.711 KiB/sec and 23255.8140 ops/sec)
$ stat -c %s /mnt/file
9223372036854775807
$ umount /mnt ; mount <dev> /mnt/
$ stat -c %s /mnt/file
0

This occurs because XFS calculates the new file size as io_offset +
io_size, I/O occurs in block sized requests, and the maximum supported
file size is not block aligned. Therefore, a write to the max allowable
offset on a 4k blocksize fs results in a write of size 4k to offset
2^63-4096 (e.g., equivalent to round_down(2^63-1, 4096), or IOW the
offset of the block that contains the max file size). The offset plus
size calculation (2^63 - 4096 + 4096 == 2^63) overflows the signed
64-bit variable which goes negative and causes the > comparison to the
on-disk inode size to fail. This returns 0 from xfs_new_eof() and
results in no change to the inode on-disk.

Update xfs_new_eof() to explicitly detect overflow of the local
calculation and use the VFS inode size in this scenario. The VFS inode
size is capped to the maximum and thus XFS writes the correct inode size
to disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:21:53 +10:00
Dave Chinner a872703f34 xfs: only set extent size hint when asked
Currently the extent size hint is set unconditionally in
xfs_ioctl_setattr() when the FSX_EXTSIZE flag is set. Hence we can
set hints when the inode flags indicating the hint should be used
are not set.  Hence only set the extent size hint from userspace
when the inode has the XFS_DIFLAG_EXTSIZE flag set to indicate that
we should have an extent size hint set on the inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:20:30 +10:00
Dave Chinner 9336e3a765 xfs: project id inheritance is a directory only flag
xfs_set_diflags() allows it to be set on non-directory inodes, and
this flags errors in xfs_repair. Further, inode allocation allows
the same directory-only flag to be inherited to non-directories.
Make sure directory inode flags don't appear on other types of
inodes.

This fixes several xfstests scratch fileystem corruption reports
(e.g. xfs/050) now that xfstests checks scratch filesystems after
test completion.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:18:40 +10:00
Dave Chinner e076b0f3a5 xfs: kill time.h
The typedef for timespecs and nanotime() are completely unnecessary,
and delay() can be moved to fs/xfs/linux.h, which means this file
can go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:18:13 +10:00
Dave Chinner b1d6cc02f2 xfs: compat_xfs_bstat does not have forkoff
struct compat_xfs_bstat is missing the di_forkoff field and so does
not fully translate the structure correctly. Fix it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:17:58 +10:00
Dave Chinner 75e58ce4c8 Merge branch 'xfs-buf-iosubmit' into for-next 2014-10-02 09:11:14 +10:00
Christoph Hellwig 8c15612546 xfs: simplify xfs_zero_remaining_bytes
xfs_zero_remaining_bytes() open codes a log of buffer manupulations
to do a read forllowed by a write. It can simply be replaced by an
uncached read followed by a xfs_bwrite() call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:44 +10:00
Dave Chinner ba3726742c xfs: check xfs_buf_read_uncached returns correctly
xfs_buf_read_uncached() has two failure modes. If can either return
NULL or bp->b_error != 0 depending on the type of failure, and not
all callers check for both. Fix it so that xfs_buf_read_uncached()
always returns the error status, and the buffer is returned as a
function parameter. The buffer will only be returned on success.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:32 +10:00
Dave Chinner 595bff75dc xfs: introduce xfs_buf_submit[_wait]
There is a lot of cookie-cutter code that looks like:

	if (shutdown)
		handle buffer error
	xfs_buf_iorequest(bp)
	error = xfs_buf_iowait(bp)
	if (error)
		handle buffer error

spread through XFS. There's significant complexity now in
xfs_buf_iorequest() to specifically handle this sort of synchronous
IO pattern, but there's all sorts of nasty surprises in different
error handling code dependent on who owns the buffer references and
the locks.

Pull this pattern into a single helper, where we can hide all the
synchronous IO warts and hence make the error handling for all the
callers much saner. This removes the need for a special extra
reference to protect IO completion processing, as we can now hold a
single reference across dispatch and waiting, simplifying the sync
IO smeantics and error handling.

In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
make it explicitly handle on asynchronous IO. This forces all users
to be switched specifically to one interface or the other and
removes any ambiguity between how the interfaces are to be used. It
also means that xfs_buf_iowait() goes away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:14 +10:00
Dave Chinner 8b131973d1 xfs: kill xfs_bioerror_relse
There is only one caller now - xfs_trans_read_buf_map() - and it has
very well defined call semantics - read, synchronous, and b_iodone
is NULL. Hence it's pretty clear what error handling is necessary
for this case. The bigger problem of untangling
xfs_trans_read_buf_map error handling is left to a future patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:05:05 +10:00
Dave Chinner 2718775469 xfs: xfs_bioerror can die.
Internal buffer write error handling is a mess due to the unnatural
split between xfs_bioerror and xfs_bioerror_relse().

xfs_bwrite() only does sync IO and determines the handler to
call based on b_iodone, so for this caller the only difference
between xfs_bioerror() and xfs_bioerror_release() is the XBF_DONE
flag. We don't care what the XBF_DONE flag state is because we stale
the buffer in both paths - the next buffer lookup will clear
XBF_DONE because XBF_STALE is set. Hence we can use common
error handling for xfs_bwrite().

__xfs_buf_delwri_submit() is a similar - it's only ever called
on writes - all sync or async - and again there's no reason to
handle them any differently at all.

Clean up the nasty error handling and remove xfs_bioerror().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:56 +10:00
Dave Chinner 8dac392198 xfs: kill xfs_bdstrat_cb
Only has two callers, and is just a shutdown check and error handler
around xfs_buf_iorequest. However, the error handling is a mess of
read and write semantics, and both internal callers only call it for
writes. Hence kill the wrapper, and follow up with a patch to
sanitise the error handling.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:40 +10:00
Dave Chinner 61be9c529a xfs: rework xfs_buf_bio_endio error handling
Currently the report of a bio error from completion
immediately marks the buffer with an error. The issue is that this
is racy w.r.t. synchronous IO - the submitter can see b_error being
set before the IO is complete, and hence we cannot differentiate
between submission failures and completion failures.

Add an internal b_io_error field protected by the b_lock to catch IO
completion errors, and only propagate that to the buffer during
final IO completion handling. Hence we can tell in xfs_buf_iorequest
if we've had a submission failure bey checking bp->b_error before
dropping our b_io_remaining reference - that reference will prevent
b_io_error values from being propagated to b_error in the event that
completion races with submission.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:31 +10:00
Dave Chinner e8aaba9a78 xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality
We do some work in xfs_buf_ioend, and some work in
xfs_buf_iodone_work, but much of that functionality is the same.
This work can all be done in a single function, leaving
xfs_buf_iodone just a wrapper to determine if we should execute it
by workqueue or directly. hence rename xfs_buf_iodone_work to
xfs_buf_ioend(), and add a new xfs_buf_ioend_async() for places that
need async processing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:22 +10:00
Dave Chinner e11bb8052c xfs: synchronous buffer IO needs a reference
When synchronous IO runs IO completion work, it does so without an
IO reference or a hold reference on the buffer. The IO "hold
reference" is owned by the submitter, and released when the
submission is complete. The IO reference is released when both the
submitter and the bio end_io processing is run, and so if the io
completion work is run from IO completion context, it is run without
an IO reference.

Hence we can get the situation where the submitter can submit the
IO, see an error on the buffer and unlock and free the buffer while
there is still IO in progress. This leads to use-after-free and
memory corruption.

Fix this by taking a "sync IO hold" reference that is owned by the
IO and not released until after the buffer completion calls are run
to wake up synchronous waiters. This means that the buffer will not
be freed in any circumstance until all IO processing is completed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:11 +10:00
Dave Chinner cf53e99d19 xfs: Don't use xfs_buf_iowait in the delwri buffer code
For the special case of delwri buffer submission and waiting, we
don't need to issue IO synchronously at all. The second pass to call
xfs_buf_iowait() can be replaced with  blocking on xfs_buf_lock() -
the buffer will be unlocked when the async IO is complete.

This formalises a sane the method of waiting for async IO - take an
extra reference, submit the IO, call xfs_buf_lock() when you want to
wait for IO completion. i.e.:

	bp = xfs_buf_find();
	xfs_buf_hold(bp);
	bp->b_flags |= XBF_ASYNC;
	xfs_buf_iosubmit(bp);
	xfs_buf_lock(bp)
	error = bp->b_error;
	....
	xfs_buf_relse(bp);

While this is somewhat racy for gathering IO errors, none of the
code that calls xfs_buf_delwri_submit() will race against other
users of the buffers being submitted. Even if they do, we don't
really care if the error is detected by the delwri code or the user
we raced against. Either way, the error will be detected and
handled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:04:01 +10:00
Dave Chinner a870fe6dfa xfs: force the log before shutting down
When we have marked the filesystem for shutdown, we want to prevent
any further buffer IO from being submitted. However, we currently
force the log after marking the filesystem as shut down, hence
allowing IO to the log *after* we have marked both the filesystem
and the log as in an error state.

Clean this up by forcing the log before we mark the filesytem with
an error. This replaces the pure CIL flush that we currently have
which works around this same issue (i.e the CIL can't be flushed
once the shutdown flags are set) and hence enables us to clean up
the logic substantially.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-02 09:02:28 +10:00
David Sterba 143f363618 btrfs: remove unused variable from btrfs_parse_options
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-01 19:31:33 +02:00
David Sterba aab110abcb btrfs: defrag, use unsigned type for extent thresh
Signed type mismatches the ioctl structure, all extent calculations are
done on unsigned types.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-10-01 19:30:52 +02:00
Jeff Layton 34549ab09e nfsd: eliminate "to_delegation" define
We now have cb_to_delegation and to_delegation, which do the same thing
and are defined separately in different .c files. Move the
cb_to_delegation definition into a header file and eliminate the
redundant to_delegation definition.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-10-01 12:28:01 -04:00
Bob Peterson 19aeb5a65f GFS2: Make rename not save dirent location
This patch fixes a regression in the patch "GFS2: Remember directory
insert point", commit 2b47dad866.
The problem had to do with the rename function: The function found
space for the new dirent, and remembered that location. But then the
old dirent was removed, which often moved the eligible location for
the renamed dirent. Putting the new dirent at the saved location
caused file system corruption.

This patch adds a new "save_loc" variable to struct gfs2_diradd.
If 1, the dirent location is saved. If 0, the dirent location is not
saved and the buffer_head is released as per previous behavior.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-10-01 14:06:15 +01:00
Jaegeuk Kim 52656e6cf7 f2fs: clean up f2fs_ioctl functions
This patch cleans up f2fs_ioctl functions for better readability.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:34:56 -07:00
Dan Carpenter 8a21984d5d f2fs: potential shift wrapping buf in f2fs_trim_fs()
My static checker complains that segment is a u64 but only the lower 31
bits can be used before we hit a shift wrapping bug.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:34:56 -07:00
Jaegeuk Kim 44c1615651 f2fs: call f2fs_unlock_op after error was handled
This patch relocates f2fs_unlock_op in every directory operations to be called
after any error was processed.
Otherwise, the checkpoint can be entered with valid node ids without its
dentry when -ENOSPC is occurred.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:34:55 -07:00
Jaegeuk Kim 7cd8558baa f2fs: check the use of macros on block counts and addresses
This patch cleans up the existing and new macros for readability.

Rule is like this.

         ,-----------------------------------------> MAX_BLKADDR -,
         |  ,------------- TOTAL_BLKS ----------------------------,
         |  |                                                     |
         |  ,- seg0_blkaddr   ,----- sit/nat/ssa/main blkaddress  |
block    |  | (SEG0_BLKADDR)  | | | |   (e.g., MAIN_BLKADDR)      |
address  0..x................ a b c d .............................
            |                                                     |
global seg# 0...................... m .............................
            |                       |                             |
            |                       `------- MAIN_SEGS -----------'
            `-------------- TOTAL_SEGS ---------------------------'
                                    |                             |
 seg#                               0..........xx..................

= Note =
 o GET_SEGNO_FROM_SEG0 : blk address -> global segno
 o GET_SEGNO           : blk address -> segno
 o START_BLOCK         : segno -> starting block address

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:34:47 -07:00
Jaegeuk Kim 309cc2b6e7 f2fs: refactor flush_nat_entries to remove costly reorganizing ops
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets
according to its nid ranges. This can improve the flushing nat pages, however,
if there are a lot of cached nat entries, it becomes a bottleneck.

This patch introduces a new set management flow by removing dirty nat list and
adding a series of set operations when the nat entry becomes dirty.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:30:41 -07:00
Jaegeuk Kim 4b2fecc846 f2fs: introduce FITRIM in f2fs_ioctl
This patch introduces FITRIM in f2fs_ioctl.
In this case, f2fs will issue small discards and prefree discards as many as
possible for the given area.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:06:09 -07:00
Jaegeuk Kim 75ab4cb830 f2fs: introduce cp_control structure
This patch add a new data structure to control checkpoint parameters.
Currently, it presents the reason of checkpoint such as is_umount and normal
sync.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-30 15:01:28 -07:00
Trond Myklebust b4b56796fe Merge branch 'client-4.2' into linux-next
Merge NFSv4.2 client SEEK implementation from Anna

* client-4.2: (55 commits)
  NFS: Implement SEEK
  NFSD: Implement SEEK
  NFSD: Add generic v4.2 infrastructure
  svcrdma: advertise the correct max payload
  nfsd: introduce nfsd4_callback_ops
  nfsd: split nfsd4_callback initialization and use
  nfsd: introduce a generic nfsd4_cb
  nfsd: remove nfsd4_callback.cb_op
  nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
  nfsd: fix nfsd4_cb_recall_done error handling
  nfsd4: clarify how grace period ends
  nfsd4: stop grace_time update at end of grace period
  nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
  nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
  nfsd: serialize nfsdcltrack upcalls for a particular client
  nfsd: pass extra info in env vars to upcalls to allow for early grace period end
  nfsd: add a v4_end_grace file to /proc/fs/nfsd
  lockd: add a /proc/fs/lockd/nlm_end_grace file
  nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
  nfsd: remove redundant boot_time parm from grace_done client tracking op
  ...
2014-09-30 17:22:02 -04:00
Trond Myklebust 72c23f0819 Merge branch 'bugfixes' into linux-next
* bugfixes:
  NFSv4.1: Fix an NFSv4.1 state renewal regression
  NFSv4: fix open/lock state recovery error handling
  NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
  NFS: Fabricate fscache server index key correctly
  SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
  nfs: fix duplicate proc entries
2014-09-30 17:21:41 -04:00
Andy Adamson d1f456b0b9 NFSv4.1: Fix an NFSv4.1 state renewal regression
Commit 2f60ea6b8c ("NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation") set the NFS4_RENEW_TIMEOUT flag in nfs4_renew_state, and does
not put an nfs41_proc_async_sequence call, the NFSv4.1 lease renewal heartbeat
call, on the wire to renew the NFSv4.1 state if the flag was not set.

The NFS4_RENEW_TIMEOUT flag is set when "now" is after the last renewal
(cl_last_renewal) plus the lease time divided by 3. This is arbitrary and
sometimes does the following:

In normal operation, the only way a future state renewal call is put on the
wire is via a call to nfs4_schedule_state_renewal, which schedules a
nfs4_renew_state workqueue task. nfs4_renew_state determines if the
NFS4_RENEW_TIMEOUT should be set, and the calls nfs41_proc_async_sequence,
which only gets sent if the NFS4_RENEW_TIMEOUT flag is set.
Then the nfs41_proc_async_sequence rpc_release function schedules
another state remewal via nfs4_schedule_state_renewal.

Without this change we can get into a state where an application stops
accessing the NFSv4.1 share, state renewal calls stop due to the
NFS4_RENEW_TIMEOUT flag _not_ being set. The only way to recover
from this situation is with a clientid re-establishment, once the application
resumes and the server has timed out the lease and so returns
NFS4ERR_BAD_SESSION on the subsequent SEQUENCE operation.

An example application:
open, lock, write a file.

sleep for 6 * lease (could be less)

ulock, close.

In the above example with NFSv4.1 delegations enabled, without this change,
there are no OP_SEQUENCE state renewal calls during the sleep, and the
clientid is recovered due to lease expiration on the close.

This issue does not occur with NFSv4.1 delegations disabled, nor with
NFSv4.0, with or without delegations enabled.

Signed-off-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1411486536-23401-1-git-send-email-andros@netapp.com
Fixes: 2f60ea6b8c (NFSv4: The NFSv4.0 client must send RENEW calls...)
Cc: stable@vger.kernel.org # 3.2.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-30 17:18:42 -04:00
Anna Schumaker 1c6dcbe5ce NFS: Implement SEEK
The SEEK operation is used when an application makes an lseek call with
either the SEEK_HOLE or SEEK_DATA flags set.  I fall back on
nfs_file_llseek() if the server does not have SEEK support.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-30 16:24:56 -04:00
Trond Myklebust 4a3a0ebad1 Merge commit '24bab491220f' into client-4.2
- Pull in patch 'NFSD: Implement SEEK' from Bruce's nfsd-next tree
  for dependencies.
2014-09-30 16:23:39 -04:00
J. Bruce Fields 15b23ef5d3 nfsd4: fix corruption of NFSv4 read data
The calculation of page_ptr here is wrong in the case the read doesn't
start at an offset that is a multiple of a page.

The result is that nfs4svc_encode_compoundres sets rq_next_page to a
value one too small, and then the loop in svc_free_res_pages may
incorrectly fail to clear a page pointer in rq_respages[].

Pages left in rq_respages[] are available for the next rpc request to
use, so xdr data may be written to that page, which may hold data still
waiting to be transmitted to the client or data in the page cache.

The observed result was silent data corruption seen on an NFSv4 client.

We tag this as "fixing" 05638dc73a because that commit exposed this
bug, though the incorrect calculation predates it.

Particular thanks to Andrea Arcangeli and David Gilbert for analysis and
testing.

Fixes: 05638dc73a "nfsd4: simplify server xdr->next_page use"
Cc: stable@vger.kernel.org
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-30 15:57:04 -04:00
Jan Kara c53f755d33 ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
ocfs2_setattr() actually needs to really use MAXQUOTAS and not
OCFS2_MAXQUOTAS since it will pass the array over to VFS. Currently
this isn't a problem since MAXQUOTAS == OCFS2_MAXQUOTAS but it would
be once we introduce project quotas.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-30 18:09:59 +02:00
James Morris 6c8ff877cd Merge commit 'v3.16' into next 2014-10-01 00:44:04 +10:00
David Howells a3b7c00484 CacheFiles: Handle object being killed before being set up
If a cache object gets killed whilst in the process of being set up - for
instance if the netfs relinquishes the cookie that the object is associated
with - then the object's state machine will transit to the DROP_OBJECT state
without necessarily going through the LOOKUP_OBJECT or CREATE_OBJECT states.

This is a problem for CacheFiles because cachefiles_drop_object() assumes that
object->dentry will be set upon reaching the DROP_OBJECT state and has an
ASSERT() to that effect (see the oops below) - but object->dentry doesn't get
set until the LOOKUP_OBJECT or CREATE_OBJECT states (and not always then if
they fail).

To fix this, just make the dentry cleanup in cachefiles_drop_object()
conditional on the dentry actually being set and remove the assertion.

	CacheFiles: Assertion failed
	------------[ cut here ]------------
	kernel BUG at .../fs/cachefiles/namei.c:425!
	...
	Workqueue: fscache_object fscache_object_work_func [fscache]
	...
	RIP: ... cachefiles_delete_object+0xcd/0x110 [cachefiles]
	...
	Call Trace:
	 [<ffffffffa043280f>] ? cachefiles_drop_object+0xff/0x130 [cachefiles]
	 [<ffffffffa02ac511>] ? fscache_drop_object+0xd1/0x1d0 [fscache]
	 [<ffffffffa02ac697>] ? fscache_object_work_func+0x87/0x210 [fscache]
	 [<ffffffff81080635>] ? process_one_work+0x155/0x450
	 [<ffffffff81081c44>] ? worker_thread+0x114/0x370
	 [<ffffffff81081b30>] ? manage_workers.isra.21+0x2c0/0x2c0
	 [<ffffffff81087fcc>] ? kthread+0xbc/0xe0
	 [<ffffffff81087f10>] ? flush_kthread_worker+0xa0/0xa0
	 [<ffffffff8150638c>] ? ret_from_fork+0x7c/0xb0
	 [<ffffffff81087f10>] ? flush_kthread_worker+0xa0/0xa0

Reported-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
2014-09-30 14:50:28 +01:00
Richard Weinberger 443b39cdd5 UBIFS: Fix trivial typo in power_cut_emulated()
s/withing/within/

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-30 09:29:44 +03:00
Al Viro 6d13f69444 missing data dependency barrier in prepend_name()
AFAICS, prepend_name() is broken on SMP alpha.  Disclaimer: I don't have
SMP alpha boxen to reproduce it on.  However, it really looks like the race
is real.

CPU1: d_path() on /mnt/ramfs/<255-character>/foo
CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>

CPU2 does d_alloc(), which allocates an external name, stores the name there
including terminating NUL, does smp_wmb() and stores its address in
dentry->d_name.name.  It proceeds to d_add(dentry, NULL) and d_move()
old dentry over to that.  ->d_name.name value ends up in that dentry.

In the meanwhile, CPU1 gets to prepend_name() for that dentry.  It fetches
->d_name.name and ->d_name.len; the former ends up pointing to new name
(64-byte kmalloc'ed array), the latter - 255 (length of the old name).
Nothing to force the ordering there, and normally that would be OK, since we'd
run into the terminating NUL and stop.  Except that it's alpha, and we'd need
a data dependency barrier to guarantee that we see that store of NUL
__d_alloc() has done.  In a similar situation dentry_cmp() would survive; it
does explicit smp_read_barrier_depends() after fetching ->d_name.name.
prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
and possibly oops due to taking a page fault in kernel mode.

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-29 14:46:30 -04:00
Anna Schumaker 24bab49122 NFSD: Implement SEEK
This patch adds server support for the NFS v4.2 operation SEEK, which
returns the position of the next hole or data segment in a file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29 14:35:20 -04:00
Anna Schumaker 87a15a8090 NFSD: Add generic v4.2 infrastructure
It's cleaner to introduce everything at once and have the server reply
with "not supported" than it would be to introduce extra operations when
implementing a specific one in the middle of the list.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-29 14:35:19 -04:00
Fabian Frederick 37993271cf udf: remove redundant sys_tz declaration
sys_tz is already declared in include/linux/time.h

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-29 13:45:12 +02:00
Dave Chinner bd438f825f Merge branch 'xfs-sparse-fixes' into for-next 2014-09-29 10:52:44 +10:00
Dave Chinner b972d07971 xfs: annotate user variables passed as void
Some argument callbacks can contain user buffers, and sparse warns
about passing them as void pointers. Cast appropriately to remove
the sparse warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 10:46:22 +10:00
Dave Chinner e3aed1a081 xfs: xfs_kset should be static
As it is accessed through the struct xfs_mount and can be set up
entirely from fs/xfs/xfs_super.c

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 10:46:08 +10:00
Dave Chinner bf1ed38330 xfs: xfs_qm_dquot_isolate needs locking annotations for sparse
To remove noise from the build.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 10:43:40 +10:00
Dave Chinner e68ed77521 xfs: fix use of agi_newino in finobt lookup
Sparse warns that we are passing the big-endian valueo f agi_newino
to the initial btree lookup function when trying to find a new
inode. This is wrong - we need to pass the host order value, not the
disk order value. This will adversely affect the next inode
allocated, but given that the free inode btree is usually much
smaller than the allocated inode btree it is much less likely to be
a performance issue if we start the search in the wrong place.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 10:43:15 +10:00
Dave Chinner 2f43bbd96e Merge branch 'xfs-trans-recover-cleanup' into for-next 2014-09-29 10:00:24 +10:00
Dave Chinner b818cca197 xfs: refactor recovery transaction start handling
Rework the transaction lookup and allocation code in
xlog_recovery_process_ophdr() to fold two related call-once
helper functions into a single helper. Then fold in all the
XLOG_START_TRANS logic to that helper to clean up the remaining
logic in xlog_recovery_process_ophdr().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:54 +10:00
Dave Chinner 7656066986 xfs: reorganise transaction recovery item code
The code for managing transactions anf the items for recovery is
spread across 3 different locations in the file. Move them all
together so that it is easy to read the code without needing to jump
long distances in the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:42 +10:00
Dave Chinner 88b863db97 xfs: fix double free in xlog_recover_commit_trans
When an error occurs during buffer submission in
xlog_recover_commit_trans(), we free the trans structure twice. Fix
it by only freeing the structure in the caller regardless of the
success or failure of the function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:32 +10:00
Dave Chinner e9131e50f9 xfs: recovery of XLOG_UNMOUNT_TRANS leaks memory
The XLOG_UNMOUNT_TRANS case skips the transaction, despite the fact
an unmount record is always in a standalone transaction. Hence
whenever we come across one of these we need to free the transaction
structure associated with it as there is no commit record that
follows it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:18 +10:00
Dave Chinner eeb1168810 xfs: refactor xlog_recover_process_data()
Clean up xlog_recover_process_data() structure in preparation for
fixing the allocation and freeing context of the transaction being
recovered.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-29 09:45:03 +10:00
Trond Myklebust df817ba357 NFSv4: fix open/lock state recovery error handling
The current open/lock state recovery unfortunately does not handle errors
such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping,
just proceeds as if the state manager is finished recovering.
This patch ensures that we loop back, handle higher priority errors
and complete the open/lock state recovery.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28 16:03:04 -04:00
Trond Myklebust a4339b7b68 NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a
CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted
a second time, then the client will currently take this to mean that it must
declare all locks to be stale, and hence ineligible for reboot recovery.

RFC3530 and RFC5661 both suggest that the client should instead rely on the
server to respond to inelegible open share, lock and delegation reclaim
requests with NFS4ERR_NO_GRACE in this situation.

Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-28 16:03:03 -04:00
Linus Torvalds 1e3827bf8a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Assorted fixes + unifying __d_move() and __d_materialise_dentry() +
  minimal regression fix for d_path() of victims of overwriting rename()
  ported on top of that"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Don't exchange "short" filenames unconditionally.
  fold swapping ->d_name.hash into switch_names()
  fold unlocking the children into dentry_unlock_parents_for_move()
  kill __d_materialise_dentry()
  __d_materialise_dentry(): flip the order of arguments
  __d_move(): fold manipulations with ->d_child/->d_subdirs
  don't open-code d_rehash() in d_materialise_unique()
  pull rehashing and unlocking the target dentry into __d_materialise_dentry()
  ufs: deal with nfsd/iget races
  fuse: honour max_read and max_write in direct_io mode
  shmem: fix nlink for rename overwrite directory
2014-09-27 17:05:14 -07:00
Mikhail Efremov d2fa4a8476 vfs: Don't exchange "short" filenames unconditionally.
Only exchange source and destination filenames
if flags contain RENAME_EXCHANGE.
In case if executable file was running and replaced by
other file /proc/PID/exe should still show correct file name,
not the old name of the file by which it was replaced.

The scenario when this bug manifests itself was like this:
* ALT Linux uses rpm and start-stop-daemon;
* during a package upgrade rpm creates a temporary file
  for an executable to rename it upon successful unpacking;
* start-stop-daemon is run subsequently and it obtains
  the (nonexistant) temporary filename via /proc/PID/exe
  thus failing to identify the running process.

Note that "long" filenames (> DNAiME_INLINE_LEN) are still
exchanged without RENAME_EXCHANGE and this behaviour exists
long enough (should be fixed too apparently).
So this patch is just an interim workaround that restores
behavior for "short" names as it was before changes
introduced by commit da1ce0670c ("vfs: add cross-rename").

See https://lkml.org/lkml/2014/9/7/6 for details.

AV: the comments about being more careful with ->d_name.hash
than with ->d_name.name are from back in 2.3.40s; they
became obsolete by 2.3.60s, when we started to unhash the
target instead of swapping hash chain positions followed
by d_delete() as we used to do when dcache was first
introduced.

Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: da1ce0670c "vfs: add cross-rename"
Signed-off-by: Mikhail Efremov <sem@altlinux.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27 15:59:39 -04:00
Linus Torvalds a28ddb87cd fold swapping ->d_name.hash into switch_names()
and do it along with ->d_name.len there

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-27 15:59:11 -04:00
Al Viro 986c01942a fold unlocking the children into dentry_unlock_parents_for_move()
... renaming it into dentry_unlock_for_move() and making it more
symmetric with dentry_lock_for_move().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 23:11:15 -04:00
Al Viro 63cf427a57 kill __d_materialise_dentry()
it folds into __d_move() now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 23:06:14 -04:00
Al Viro 4453641fe8 __d_materialise_dentry(): flip the order of arguments
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry)
case in __d_move().  A bit more and it'll fold in.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 22:54:02 -04:00
Al Viro 9d8cd306a8 __d_move(): fold manipulations with ->d_child/->d_subdirs
list_del() + list_add() is a slightly pessimised list_move()
list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init()

Interleaving those makes the resulting code even worse.  And harder to follow...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:34:01 -04:00
Al Viro 8527dd7187 don't open-code d_rehash() in d_materialise_unique()
... and get rid of duplicate BUG_ON() there

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:26:50 -04:00
Al Viro 5cc3821b57 pull rehashing and unlocking the target dentry into __d_materialise_dentry()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:25:35 -04:00
Al Viro e4502c63f5 ufs: deal with nfsd/iget races
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:17:52 -04:00
Miklos Szeredi 2c80929c4c fuse: honour max_read and max_write in direct_io mode
The third argument of fuse_get_user_pages() "nbytesp" refers to the number of
bytes a caller asked to pack into fuse request. This value may be lesser
than capacity of fuse request or iov_iter.  So fuse_get_user_pages() must
ensure that *nbytesp won't grow.

Now, when helper iov_iter_get_pages() performs all hard work of extracting
pages from iov_iter, it can be done by passing properly calculated
"maxsize" to the helper.

The other caller of iov_iter_get_pages() (dio_refill_pages()) doesn't need
this capability, so pass LONG_MAX as the maxsize argument here.

Fixes: c9c37e2e63 ("fuse: switch to iov_iter_get_pages()")
Reported-by: Werner Baumann <werner.baumann@onlinehome.de>
Tested-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:16:51 -04:00
Christoph Hellwig 0162ac2b97 nfsd: introduce nfsd4_callback_ops
Add a higher level abstraction than the rpc_ops for callback operations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:29 -04:00
Christoph Hellwig f0b5de1b6b nfsd: split nfsd4_callback initialization and use
Split out initializing the nfs4_callback structure from using it.  For
the NULL callback this gets rid of tons of pointless re-initializations.

Note that I don't quite understand what protects us from running multiple
NULL callbacks at the same time, but at least this chance doesn't make
it worse..

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:28 -04:00
Christoph Hellwig 326129d02a nfsd: introduce a generic nfsd4_cb
Add a helper to queue up a callback.  CB_NULL has a bit of special casing
because it is special in the specification, but all other new callback
operations will be able to share code with this and a few more changes
to refactor the callback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:27 -04:00
Christoph Hellwig 2faf3b4350 nfsd: remove nfsd4_callback.cb_op
We can always get at the private data by using container_of, no need for
a void pointer.  Also introduce a little to_delegation helper to avoid
opencoding the container_of everywhere.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:26 -04:00
Benny Halevy 341b51df1f nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
This is incorrect when a callback is has to be restarted, in which case
the XDR decoding of the second iteration will see a NULL cb argument.

[hch: updated description]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
Christoph Hellwig 444b6e910d nfsd: fix nfsd4_cb_recall_done error handling
For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID,
nfsd4_cb_recall_done first marks the connection down, then
retries until dl_retries hits zero, then marks the connection down
again and sets cb_done.  This changes the code to only retry
for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting
cb_done into a single point in the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-26 16:29:25 -04:00
Fabian Frederick 6ff66ac77a fs/cachefiles: add missing \n to kerror conversions
Commit 0227d6abb3 ("fs/cachefiles: replace kerror by pr_err") didn't
include newline featuring in original kerror definition

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>	[3.16.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Peter Feiner 87e6d49a00 mm: softdirty: addresses before VMAs in PTE holes aren't softdirty
In PTE holes that contain VM_SOFTDIRTY VMAs, unmapped addresses before
VM_SOFTDIRTY VMAs are reported as softdirty by /proc/pid/pagemap.  This
bug was introduced in commit 68b5a65248 ("mm: softdirty: respect
VM_SOFTDIRTY in PTE holes").  That commit made /proc/pid/pagemap look at
VM_SOFTDIRTY in PTE holes but neglected to observe the start of VMAs
returned by find_vma.

Tested:
  Wrote a selftest that creates a PMD-sized VMA then unmaps the first
  page and asserts that the page is not softdirty. I'm going to send the
  pagemap selftest in a later commit.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Jamie Liu <jamieliu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:35 -07:00
Joseph Qi 5760a97c71 ocfs2/dlm: do not get resource spinlock if lockres is new
There is a deadlock case which reported by Guozhonghua:
  https://oss.oracle.com/pipermail/ocfs2-devel/2014-September/010079.html

This case is caused by &res->spinlock and &dlm->master_lock
misordering in different threads.

It was introduced by commit 8d400b81cc ("ocfs2/dlm: Clean up refmap
helpers").  Since lockres is new, it doesn't not require the
&res->spinlock.  So remove it.

Fixes: 8d400b81cc ("ocfs2/dlm: Clean up refmap helpers")
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Reported-by: Guozhonghua <guozhonghua@h3c.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Andreas Rohner 56d7acc792 nilfs2: fix data loss with mmap()
This bug leads to reproducible silent data loss, despite the use of
msync(), sync() and a clean unmount of the file system.  It is easily
reproducible with the following script:

  ----------------[BEGIN SCRIPT]--------------------
  mkfs.nilfs2 -f /dev/sdb
  mount /dev/sdb /mnt

  dd if=/dev/zero bs=1M count=30 of=/mnt/testfile

  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_BEFORE="$(md5sum /mnt/testfile)"

  /root/mmaptest/mmaptest /mnt/testfile 30 10 5

  sync
  CHECKSUM_AFTER="$(md5sum /mnt/testfile)"
  umount /mnt
  mount /dev/sdb /mnt
  CHECKSUM_AFTER_REMOUNT="$(md5sum /mnt/testfile)"
  umount /mnt

  echo "BEFORE MMAP:\t$CHECKSUM_BEFORE"
  echo "AFTER MMAP:\t$CHECKSUM_AFTER"
  echo "AFTER REMOUNT:\t$CHECKSUM_AFTER_REMOUNT"
  ----------------[END SCRIPT]--------------------

The mmaptest tool looks something like this (very simplified, with
error checking removed):

  ----------------[BEGIN mmaptest]--------------------
  data = mmap(NULL, file_size - file_offset, PROT_READ | PROT_WRITE,
              MAP_SHARED, fd, file_offset);

  for (i = 0; i < write_count; ++i) {
        memcpy(data + i * 4096, buf, sizeof(buf));
        msync(data, file_size - file_offset, MS_SYNC))
  }
  ----------------[END mmaptest]--------------------

The output of the script looks something like this:

  BEFORE MMAP:    281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile
  AFTER MMAP:     6604a1c31f10780331a6850371b3a313  /mnt/testfile
  AFTER REMOUNT:  281ed1d5ae50e8419f9b978aab16de83  /mnt/testfile

So it is clear, that the changes done using mmap() do not survive a
remount.  This can be reproduced a 100% of the time.  The problem was
introduced in commit 136e8770cd ("nilfs2: fix issue of
nilfs_set_page_dirty() for page at EOF boundary").

If the page was read with mpage_readpage() or mpage_readpages() for
example, then it has no buffers attached to it.  In that case
page_has_buffers(page) in nilfs_set_page_dirty() will be false.
Therefore nilfs_set_file_dirty() is never called and the pages are never
collected and never written to disk.

This patch fixes the problem by also calling nilfs_set_file_dirty() if the
page has no buffers attached to it.

[akpm@linux-foundation.org: s/PAGE_SHIFT/PAGE_CACHE_SHIFT/]
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
Joseph Qi f13a568e5a ocfs2: free vol_label in ocfs2_delete_osb()
osb->vol_label is malloced in ocfs2_initialize_super but not freed if
error occurs or during umount, thus causing a memory leak.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-26 08:10:34 -07:00
hujianyang 4b1a43eab1 UBIFS: Align the dump messages of SB_NODE
I found the dump messages of UBIFS_SB_NODE is not aligned. This
patch remove the extra space from the line which is retracted.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-26 13:33:41 +03:00
David Howells f3f760314a NFS: Fabricate fscache server index key correctly
When fabricating a server index key for fscache, we should clear the index key
buffer before starting to fill it in, not in the middle.

Reported-by: James Pearson <james-p@moving-picture.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 21:25:18 -04:00
Trond Myklebust 3fc3edf141 NFSv3: Fix missing includes of nfs3_fs.h
Silence a few warnings about missing symbols that are due to missing
includes of nfs3_fs.h.

Fixes: 00a36a1090 (NFS: Move v3 declarations out of internal.h)
Fixes: cb8c20fa53 (NFS: Move NFS v3 acl functions to nfs3_fs.h)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 16:28:53 -04:00
NeilBrown 1aff525629 NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
 - it doesn't hurt for kswapd to block occasionally.  If it doesn't
   want to block it would clear __GFP_WAIT.  The current_is_kswapd()
   was only added to avoid deadlocks and we have a new approach for
   that.
 - memory allocation in the SUNRPC layer can very rarely try to
   ->releasepage() a page it is trying to handle.  The deadlock
   is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:47 -04:00
NeilBrown 353db79662 NFS: avoid waiting at all in nfs_release_page when congested.
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished.  Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().

With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds.  With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:38 -04:00
NeilBrown 9590544694 NFS: avoid deadlocks with loop-back mounted NFS filesystems.
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service.  If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second.  If the commit doesn't complete
in this time, nfs_release_page() will fail.  This means it might now
fail in some cases where it wouldn't before.  These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
 - page_cache_pipe_buf_steal() in splice.c
 - shrink_page_list() in vmscan.c
 - invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely.  The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:28 -04:00
NeilBrown e87b4c7a7a NFS: don't use STABLE writes during writeback.
commit b31268ac79
  FS: Use stable writes when not doing a bulk flush

was a bit heavy handed.
The particular problem that lead to this patch was that
small writes to an O_SYNC file we being written as UNSTABLE writes
followed by a commit.
This is appropriate for large writes (which require multiple NFS
requests) but for small writes (single NFS request), using
NFS_FILE_SYNC is more efficient.

So that patch causes the code to select between the two methods
depending on how many nfs requests get generated.

Unfortunately this ends up applying to non O_SYNC writes as well.
In particular if you memory-map a file and update random pages, then
when they are eventually written out by writeback they will go as
NFS_FILE_SYNC.  This is inefficient and slows down the application.

So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
With this patch:
 O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
    followed by COMMIT for multiple requests
 Writing immediately before close of fsync follow the same pattern.
 Non-O_SYNC writes without an fsync of close eventually get flushed
 out as UNSTABLE and a commit follows eventually as appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:23:02 -04:00
NeilBrown 8478eaa16e NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
Currently asynchronous NFSv4 request will be retried with
exponential timeout (from 1/10 to 15 seconds), but async
requests will always use a 15second retry.

Some "async" requests are really synchronous though.  The
async mechanism is used to allow the request to continue if
the requesting process is killed.
In those cases, an exponential retry is appropriate.

For example, if two different clients both open a file and
get a READ delegation, and one client then unlinks the file
(while still holding an open file descriptor), that unlink
will used the "silly-rename" handling which is async.
The first rename will result in NFS4ERR_DELAY while the
delegation is reclaimed from the other client.  The rename
will not be retried for 15 seconds, causing an unlink to take
15 seconds rather than 100msec.

This patch only added exponential timeout for async unlink and
async rename.  Other async calls, such as 'close' are sometimes
waited for so they might benefit from exponential timeout too.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:22:47 -04:00
Benjamin Coddington 173b3afcee lockd: Try to reconnect if statd has moved
If rpc.statd is restarted, upcalls to monitor hosts can fail with
ECONNREFUSED.  In that case force a lookup of statd's new port and retry the
upcall.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:08:43 -04:00
Olga Kornievskaia 8faaa6d5d4 Fixing lease renewal
Commit c9fdeb28 removed a 'continue' after checking if the lease needs
to be renewed. However, if client hasn't moved, the code falls down to
starting reboot recovery erroneously (ie., sends open reclaim and gets
back stale_clientid error) before recovering from getting stale_clientid
on the renew operation.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: c9fdeb280b (NFS: Add basic migration support to state manager thread)
Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:03:15 -04:00
Fabian Frederick 2f3169fb18 nfs: fix duplicate proc entries
Commit 65b38851a1
("NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes")

updated the following function:
static int nfs_volume_list_open(struct inode *inode, struct file *file)

it used &nfs_server_list_ops instead of &nfs_volume_list_ops
which means cat /proc/fs/nfsfs/volumes = /proc/fs/nfsfs/servers

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Fixes: 65b38851a1 (NFS: Fix /proc/fs/nfsfs/servers and...)
Cc: stable@vger.kernel.org # 3.4.x+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:00:18 -04:00
Tejun Heo 2aad2a86f6 percpu_ref: add PERCPU_REF_INIT_* flags
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
* PERCPU_REF_INIT_DEAD		: start ref killed and drained

These flags should be able to serve the above two use cases.

v2: target_core_tpg.c conversion was missing.  Fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-09-24 13:31:50 -04:00
Tejun Heo d06efebf0c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18
This is to receive 0a30288da1 ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit reverted and patches to implement proper fix will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
2014-09-24 13:00:21 -04:00
Jaegeuk Kim 95dd897301 f2fs: use more free segments until SSR is activated
Previously, f2fs activates SSR if the # of free segments reaches to the # of
overprovisioned segments.
In this case, SSR starts to use dirty segments only, so that the overprovisoned
space cannot be selected for new data.
This means that we have no chance to utilizae the overprovisioned space at all.

This patch fixes that by allowing LFS allocations until the # of free segments
reaches to the last threshold, reserved space.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:24 -07:00
Jaegeuk Kim 9b5f136fd4 f2fs: change the ipu_policy option to enable combinations
This patch changes the ipu_policy setting to use any combination of orthogonal policies.

Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:24 -07:00
Chao Yu 210f41bc04 f2fs: fix to search whole dirty segmap when get_victim
In ->get_victim we get max_search value from dirty_i->nr_dirty without
protection of seglist_lock, after that, nr_dirty can be increased/decreased
before we hold seglist_lock lock.
Then in main loop we attempt to traverse all dirty section one time to find
victim section, but it's not accurate to use max_search as the total loop count,
because we might lose checking several sections or check sections redundantly
for the case of nr_dirty are increased or decreased previously.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:23 -07:00
Chao Yu 26666c8a43 f2fs: fix to clean previous mount option when remount_fs
In manual of mount, we descript remount as below:

"mount -o remount,rw /dev/foo /dir
After  this call all old mount options are replaced and arbitrary stuff from
fstab is ignored, except the loop= option which is internally generated and
maintained by the mount command."

Previously f2fs do not clear up old mount options when remount_fs, so we have no
chance of disabling previous option (e.g. flush_merge). Fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:22 -07:00
Chao Yu 14cecc5cd6 f2fs: skip punching hole in special condition
Now punching hole in directory is not supported in f2fs, so let's limit file
type in punch_hole().

In addition, in punch_hole if offset is exceed file size, we should skip
punching hole.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:21 -07:00
Chao Yu 55cf9cb63f f2fs: support large sector size
Block size in f2fs is 4096 bytes, so theoretically, f2fs can support 4096 bytes
sector device at maximum. But now f2fs only support 512 bytes size sector, so
block device such as zRAM which uses page cache as its block storage space will
not be mounted successfully as mismatch between sector size of zRAM and sector
size of f2fs supported.

In this patch we support large sector size in f2fs, so block device with sector
size of 512/1024/2048/4096 bytes can be supported in f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:20 -07:00
Chao Yu 09db6a2ef8 f2fs: fix to truncate blocks past EOF in ->setattr
By using FALLOC_FL_KEEP_SIZE in ->fallocate of f2fs, we can fallocate block past
EOF without changing i_size of inode. These blocks past EOF will not be
truncated in ->setattr as we truncate them only when change the file size.

We should give a chance to truncate blocks out of filesize in setattr().

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:20 -07:00
Jaegeuk Kim 976e4c50ae f2fs: update i_size when __allocate_data_block
The f2fs_direct_IO uses __allocate_data_block, but inside the allocation path,
we should update i_size at the changed time to update its inode page.
Otherwise, we can get wrong i_size after roll-forward recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:19 -07:00
Jaegeuk Kim 90a893c749 f2fs: use MAX_BIO_BLOCKS(sbi)
This patch cleans up a simple macro.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:18 -07:00
Jaegeuk Kim c52e1b10b1 f2fs: remove redundant operation during roll-forward recovery
If same data is updated multiple times, we don't need to redo whole the
operations.
Let's just update the lastest one.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:17 -07:00
Jaegeuk Kim 19c9c466e5 f2fs: do not skip latest inode information
In f2fs_sync_file, if there is no written appended writes, it skips
to write its node blocks.
But, if there is up-to-date inode page, we should write it to update
its metadata during the roll-forward recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:16 -07:00
Jaegeuk Kim 441ac5cb32 f2fs: fix roll-forward missing scenarios
We can summarize the roll forward recovery scenarios as follows.

[Term] F: fsync_mark, D: dentry_mark

1. inode(x) | CP | inode(x) | dnode(F)
-> Update the latest inode(x).

2. inode(x) | CP | inode(F) | dnode(F)
-> No problem.

3. inode(x) | CP | dnode(F) | inode(x)
-> Recover to the latest dnode(F), and drop the last inode(x)

4. inode(x) | CP | dnode(F) | inode(F)
-> No problem.

5. CP | inode(x) | dnode(F)
-> The inode(DF) was missing. Should drop this dnode(F).

6. CP | inode(DF) | dnode(F)
-> No problem.

7. CP | dnode(F) | inode(DF)
-> If f2fs_iget fails, then goto next to find inode(DF).

8. CP | dnode(F) | inode(x)
-> If f2fs_iget fails, then goto next to find inode(DF).
   But it will fail due to no inode(DF).

So, this patch adds some missing points such as #1, #5, #7, and #8.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:16 -07:00
Jaegeuk Kim 88bd02c947 f2fs: fix conditions to remain recovery information in f2fs_sync_file
This patch revisited whole the recovery information during the f2fs_sync_file.

In this patch, there are three information to make a decision.

a) IS_CHECKPOINTED,	/* is it checkpointed before? */
b) HAS_FSYNCED_INODE,	/* is the inode fsynced before? */
c) HAS_LAST_FSYNC,	/* has the latest node fsync mark? */

And, the scenarios for our rule are based on:

[Term] F: fsync_mark, D: dentry_mark

1. inode(x) | CP | inode(x) | dnode(F)
2. inode(x) | CP | inode(F) | dnode(F)
3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
4. inode(x) | CP | dnode(F) | inode(F)
5. CP | inode(x) | dnode(F) | inode(DF)
6. CP | inode(DF) | dnode(F)
7. CP | dnode(F) | inode(DF)
8. CP | dnode(F) | inode(x) | inode(DF)

For example, #3, the three conditions should be changed as follows.

   inode(x) | CP | dnode(F) | inode(x) | inode(F)
a)    x       o      o          o          o
b)    x       x      x          x          o
c)    x       o      o          x          o

If f2fs_sync_file stops   ------^,
 it should write inode(F)    --------------^

So, the need_inode_block_update should return true, since
 c) get_nat_flag(e, HAS_LAST_FSYNC), is false.

For example, #8,
      CP | alloc | dnode(F) | inode(x) | inode(DF)
a)    o      x        x          x          x
b)    x               x          x          o
c)    o               o          x          o

If f2fs_sync_file stops   -------^,
 it should write inode(DF)    --------------^

Note that, the roll-forward policy should follow this rule, which means,
if there are any missing blocks, we doesn't need to recover that inode.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:15 -07:00
Jaegeuk Kim 7ef35e3b9e f2fs: introduce a flag to represent each nat entry information
This patch introduces a flag in the nat entry structure to merge various
information such as checkpointed and fsync_done marks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:14 -07:00
Jaegeuk Kim 4c521f493b f2fs: use meta_inode cache to improve roll-forward speed
Previously, all the dnode pages should be read during the roll-forward recovery.
Even worsely, whole the chain was traversed twice.
This patch removes that redundant and costly read operations by using page cache
of meta_inode and readahead function as well.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-23 11:10:12 -07:00
Dave Chinner 33044dc408 Merge branch 'xfs-misc-fixes-for-3.18-2' into for-next 2014-09-23 22:55:51 +10:00
Dave Chinner 2ebff7bbd7 xfs: flush entire last page of old EOF on truncate up
On a sub-page sized filesystem, truncating a mapped region down
leaves us in a world of hurt. We truncate the pagecache, zeroing the
newly unused tail, then punch blocks out from under the page. If we
then truncate the file back up immediately, we expose that unmapped
hole to a dirty page mapped into the user application, and that's
where it all goes wrong.

In truncating the page cache, we avoid unmapping the tail page of
the cache because it still contains valid data. The problem is that
it also contains a hole after the truncate, but nobody told the mm
subsystem that. Therefore, if the page is dirty before the truncate,
we'll never get a .page_mkwrite callout after we extend the file and
the application writes data into the hole on the page.  Hence when
we come to writing that region of the page, it has no blocks and no
delayed allocation reservation and hence we toss the data away.

This patch adds code to the truncate up case to solve it, by
ensuring the partial page at the old EOF is always cleaned after we
do any zeroing and move the EOF upwards. We can't actually serialise
the page writeback and truncate against page faults (yes, that
problem AGAIN) so this is really just a best effort and assumes it
is extremely unlikely that someone is concurrently writing to the
page at the EOF while extending the file.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 22:55:00 +10:00
Dave Chinner 7abbb8f928 xfs: xfs_swap_extent_flush can be static
Fix sparse warning introduced by commit 4ef897a ("xfs: flush both
inodes in xfs_swap_extents").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:20:11 +10:00
Dave Chinner 02cc18764c xfs: xfs_buf_write_fail_rl_state can be static
Fix sparse warning introduced by commit ac8809f9 ("xfs: abort
metadata writeback on permanent errors").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:15:45 +10:00
Fengguang Wu ea95961df7 xfs: xfs_rtget_summary can be static
Fix sparse warning introduced by commit afabfd3 ("xfs: combine
xfs_rtmodify_summary and xfs_rtget_summary").

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:11:43 +10:00
Fabian Frederick e3cf17962a xfs: remove second xfs_quota.h inclusion in xfs_icache.c
xfs_quota.h was included twice.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:05:55 +10:00
Eric Sandeen fb04013156 xfs: don't ASSERT on corrupt ftype
xfs_dir3_data_get_ftype() gets the file type off disk, but ASSERTs
if it's invalid:

     ASSERT(type < XFS_DIR3_FT_MAX);

We shouldn't ASSERT on bad values read from disk.  V3 dirs are
CRC-protected, but V2 dirs + ftype are not.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 16:05:32 +10:00
Dave Chinner 8af3dcd3c8 xfs: xlog_cil_force_lsn doesn't always wait correctly
When running a tight mount/unmount loop on an older kernel, RedHat
QE found that unmount would occasionally hang in
xfs_buf_unpin_wait() on the superblock buffer. Tracing and other
debug work by Eric Sandeen indicated that it was hanging on the
writing of the superblock during unmount immediately after logging
the superblock counters in a synchronous transaction. Further debug
indicated that the synchronous transaction was not waiting for
completion correctly, and we narrowed it down to
xlog_cil_force_lsn() returning NULLCOMMITLSN and hence not pushing
the transaction in the iclog buffer to disk correctly.

While this unmount superblock write code is now very different in
mainline kernels, the xlog_cil_force_lsn() code is identical, and it
was bisected to the backport of commit f876e44 ("xfs: always do log
forces via the workqueue"). This commit made the CIL push
asynchronous for log forces and hence exposed a race condition that
couldn't occur on a synchronous push.

Essentially, the xlog_cil_force_lsn() relied implicitly on the fact
that the sequence push would be complete by the time
xlog_cil_push_now() returned, resulting in the context being pushed
being in the committing list. When it was made asynchronous, it was
recognised that there was a race condition in detecting whether an
asynchronous push has started or not and code was added to handle
it.

Unfortunately, the fix was not quite right and left a race condition
where it it would detect an empty CIL while a push was in progress
before the context had been added to the committing list. This was
incorrectly seen as a "nothing to do" condition and so would tell
xfs_log_force_lsn() that there is nothing to wait for, and hence it
would push the iclogbufs in memory.

The fix is simple, but explaining the logic and the race condition
is a lot more complex. The fix is to add the context to the
committing list before we start emptying the CIL. This allows us to
detect the difference between an empty "do nothing" push and a push
that has not started by adding a discrete "emptying the CIL" state
to avoid the transient, incorrect "empty" condition that the
(unchanged) waiting code was seeing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:57:59 +10:00
Dave Chinner f6d31f4b04 Merge branch 'xfs-shift-extents-rework' into for-next 2014-09-23 15:51:14 +10:00
Brian Foster 8b5279e33f xfs: only writeback and truncate pages for the freed range
xfs_free_file_space() only affects the range of the file for which space
is being freed. It currently writes and truncates the page cache from
the start offset of the free to EOF.

Modify xfs_free_file_space() to write back and truncate page cache of
just the range being freed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:05 +10:00
Brian Foster f71721d061 xfs: writeback and inval. file range to be shifted by collapse
The collapse range operation currently writes the entire file before
starting the collapse to avoid changes in the in-core extent list due to
writeback causing the extent count to change. Now that collapse range is
fsb based rather than extent index based it can sustain changes in the
extent list during the shift sequence without disruption.

Modify xfs_collapse_file_space() to writeback and invalidate pages
associated with the range of the file to be shifted.
xfs_free_file_space() currently has similar behavior, but the space free
need only affect the region of the file that is freed and this could
change in the future.

Also update the comments to reflect the current implementation. We
retain the eofblocks trim permanently as a best option for dealing with
delalloc extents. We don't shift delalloc extents because this scenario
only occurs with post-eof preallocation (since data must be flushed such
that the cache can be invalidated and data can be shifted). That means
said space must also be initialized before being shifted into the
accessible region of the file only to be immediately truncated off as
the last part of the collapse. In other words, the eofblocks trim will
happen anyways, we just run it first to ensure the file remains in a
consistent state throughout the collapse.

Finally, detect and fail explicitly in the event of a delalloc extent
during the extent shift. The implementation does not support delalloc
extents and the caller is expected to prevent this scenario in advance
as is done by collapse.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:05 +10:00
Brian Foster a979bdfea1 xfs: refactor single extent shift into xfs_bmse_shift_one() helper
xfs_bmap_shift_extents() has a variety of conditions and error checks
that make the logic difficult to follow and indent heavy. Refactor the
loop body of this function into a new xfs_bmse_shift_one() helper. This
simplifies the error checks, eliminates index decrement on merge hack by
pushing the index increment down into the helper, and makes the code
more readable by reducing multiple levels of indentation.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:39:04 +10:00
Brian Foster ddb19e3180 xfs: refactor shift-by-merge into xfs_bmse_merge() helper
The extent shift mechanism in xfs_bmap_shift_extents() is complicated
and handles several different, non-deterministic scenarios. These
include extent shifts, extent merges and potential btree updates in
either of the former scenarios.

Refactor the code to be more linear and readable. The loop logic in
xfs_bmap_shift_extents() and some initial error checking is adjusted
slightly. The associated btree lookup and update/delete operations are
condensed into single blocks of code. This reduces the number of
btree-specific blocks and facilitates the separation of the merge
operation into a new xfs_bmse_merge() and xfs_bmse_can_merge() helpers.

This is a code refactor only. The behavior of extent shift and collapse
range is not modified.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:38:09 +10:00
Brian Foster 2c845f5a5f xfs: track collapse via file offset rather than extent index
The collapse range implementation uses a transaction per extent shift.
The progress of the overall operation is tracked via the current extent
index of the in-core extent list. This is racy because the ilock must be
dropped and reacquired for each transaction according to locking and log
reservation rules. Therefore, writeback to prior regions of the file is
possible and can change the extent count. This changes the extent to
which the current index refers and causes the collapse to fail mid
operation. To avoid this problem, the entire file is currently written
back before the collapse operation starts.

To eliminate the need to flush the entire file, use the file offset
(fsb) to track the progress of the overall extent shift operation rather
than the extent index. Modify xfs_bmap_shift_extents() to
unconditionally convert the start_fsb parameter to an extent index and
return the file offset of the extent where the shift left off, if
further extents exist. The bulk of ths function can remain based on
extent index as ilock is held by the caller. xfs_collapse_file_space()
now uses the fsb output as the starting point for the subsequent shift.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:37:09 +10:00
Dave Chinner 0d085a529b xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly
XFS has been having trouble with stray delayed allocation extents
beyond EOF for a long time. Recent changes to the collapse range
code has triggered erroneous EBUSY errors on page invalidtion for
block size smaller than page size filesystems. These
have been caused by dirty buffers beyond EOF on a partial page which
do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a
partial page and handles it by leaving the page dirty but pushing it
into a writeback state. This used to work just fine, as the
write_cache_pages() code would then find the dirty partial page in
the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to
writeback to fix other writeback sync issues, we broken this. THe
act of marking the page as under writeback now clears the TOWRITE
tag in the radix tree, even though the page is still dirty. This
causes the TOWRITE tag to be cleared, and hence the next lookup on
the mapping tree does not find the dirty partial page and so doesn't
try to write it again.

This same writeback bug was found recently in ext4 and fixed in
commit 1c8349a ("ext4: fix data integrity sync in ordered mode")
without communication to the wider filesystem community. We can use
exactly the same fix here so the TOWRITE flag is not cleared on
partial page writes.

cc: stable@vger.kernel.org # dependent on 1c8349a171
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-23 15:36:27 +10:00
Ingo Molnar 6273143359 Merge branch 'rcu/next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v3.18 RCU changes from Paul E. McKenney:

"
  * Update RCU documentation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/378.

  * Miscellaneous fixes.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/386.  An additional fix that
    eliminates a documented (but now inconvenient) deadlock between
    RCU hotplug and expedited grace periods was posted at
    https://lkml.org/lkml/2014/8/28/573.

  * Changes related to No-CBs CPUs and NO_HZ_FULL.  These were posted
    to LKML at https://lkml.org/lkml/2014/8/28/412.

  * Torture-test updates.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/546 and at
    https://lkml.org/lkml/2014/9/11/1114.

  * RCU-tasks implementation.  These were posted to LKML at
    https://lkml.org/lkml/2014/8/28/540.
"

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-23 07:21:42 +02:00
Linus Torvalds e2519c2c85 FS-Cache fixes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAVB/u8ROxKuMESys7AQJGxA/9Ep4IwIAclMYcVQtk4zzvz+A9jVw0pu+X
 aItU70GcGoxehGlOxyqKXp/dg3CZ7ZPcxmQqzl7nzCZY28or5z/Nu6Y/YII+V0NB
 f6vgHPhfGO7CSSdwlSzFY/6DdU/EXhjA3X4suowRPIO3B/2ydLxDe3sahIWY/AhM
 4h3ioDR+A4ZZvU8fmw62l9Iae46mkk7KT8Am/2bUJQpENu4caVEtNtyJ2IsiPtZ8
 pddgvPL/f+piz4Ufdmc/XH4KB8kiLwU+7t3xfoZnSzYz9Vzcq4sQVvSxqNSaYzSC
 REEnELuLEweIh57PP67giZGPE6MA50kISH/soc43yl/oWj9i3cqkNb2JW+nxpvfS
 pIHfgRmzqUPe9MrmVUtle9c1HhjAyts3y27JB4onpLghN96JgIcTWJ1vZxnxbXzp
 ADVCgc3z8nhttLQMU+NsfD4fHUvj+bxULW1bSPdB5DfSrkpqmQJZBRexeFZWGyA9
 p1p9Yaty2Sdq/RPmtIfMPA3cu1D9LG4wkvGPmmEAjnIlKebRBjriKDmCdwf9okWA
 m+fAbuBnTjc+R0ERE6Lh5OI72hbwcgyAyDx5RdWkXVIPMBuBK67nR6nJK7vPW4x0
 qK0B5Pk2iwobKQMicVSxI+GImeeiQLi798KHJwN4uFBy74rIISWilIsylBrxlxK8
 s2hdI83XJJI=
 =ojl4
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fs-cache fixes from David Howells:

 - Put a timeout in releasepage() to deal with a recursive hang between
   the memory allocator, writeback, ext4 and fscache under memory
   pressure.

 - Fix a pair of refcount bugs in the fscache error handling.

 - Remove a couple of unused pagevecs.

 - The cachefiles requirement that the base directory support rename
   should permit rename2 as an alternative - otherwise certain
   filesystems cannot now be used as backing stores (such as ext4).

* tag 'fscache-fixes-20140917' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  CacheFiles: Handle rename2
  cachefiles: remove two unused pagevecs.
  FS-Cache: refcount becomes corrupt under vma pressure.
  FS-Cache: Reduce cookie ref count if submit fails.
  FS-Cache: Timeout for releasepage()
2014-09-22 17:52:16 -07:00
Josef Bacik 1d52c78afb Btrfs: try not to ENOSPC on log replay
When doing log replay we may have to update inodes, which traditionally goes
through our delayed inode stuff.  This will try to move space over from the
trans handle, but we don't reserve space in our trans handle on replay since we
don't know how much we will need, so instead we try to flush.  But because we
have a trans handle open we won't flush anything, so if we are out of reserve
space we will simply return ENOSPC.  Since we know that if an operation made it
into the log then we definitely had space before the box bought the farm then we
don't need to worry about doing this space reservation.  Use the
fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
item directly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:36 -07:00
Josef Bacik f6acfd5011 Btrfs: don't do async reclaim during log replay
Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
during log replay.  This is because we use fs_info->fs_root as our root for
shrinking and such.  Technically we can use whatever root we want, but let's
just not allow async reclaim while we're doing log replay.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:31 -07:00
Josef Bacik 47ab2a6c68 Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting or on mount notice that a block
group has a used count of 0 then we will queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-22 17:13:21 -07:00
Jens Axboe 6d11fb454b Merge branch 'for-linus' into for-3.18/core
Moving patches from for-linus to 3.18 instead, pull in this changes
that will go to Linus today.
2014-09-22 11:57:32 -06:00
Anton Altaparmakov f2d5a94436 Fix nasty 32-bit overflow bug in buffer i/o code.
On 32-bit architectures, the legacy buffer_head functions are not always
handling the sector number with the proper 64-bit types, and will thus
fail on 4TB+ disks.

Any code that uses __getblk() (and thus bread(), breadahead(),
sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop
in __getblk_slow() with an infinite stream of errors logged to dmesg
like this:

  __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
  b_state=0x00000020, b_size=512
  device sda1 blocksize: 512

Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
top 32-bits are missing (in this case the 0x1 at the top).

This is because grow_dev_page() is broken and has a 32-bit overflow due
to shifting the page index value (a pgoff_t - which is just 32 bits on
32-bit architectures) left-shifted as the block number.  But the top
bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit
before the shift.

This patch fixes this issue by type casting "index" to sector_t before
doing the left shift.

Note this is not a theoretical bug but has been seen in the field on a
4TiB hard drive with logical sector size 512 bytes.

This patch has been verified to fix the infinite loop problem on 3.17-rc5
kernel using a 4TB disk image mounted using "-o loop".  Without this patch
doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop
100% reproducibly whilst with the patch it works fine as expected.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-22 08:41:16 -07:00
Trond Myklebust 5466112f09 pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
kbuild test robot reports:

   fs/built-in.o: In function `bl_map_stripe':
   >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod'

Fixes: 5c83746a0c (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing)
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-21 14:20:20 -04:00
Linus Torvalds 46be7b73e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've got a revert to fix a regression with btrfs device registration,
  and Filipe has part two of his fsync fix from last week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Revert "Btrfs: device_list_add() should not update list when mounted"
  Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
2014-09-19 13:10:53 -07:00
Linus Torvalds 81770f4144 NFS client fixes for 3.17
Highligts:
 - Fix an Oops in nfs4_open_and_get_state
 - Fix an Oops in the nfs4_state_manager
 - Fix another bug in the close/open_downgrade code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUHIRLAAoJEGcL54qWCgDykZkP/jHDs/0HcK3x8jW+zbxKP6tf
 xfyhJGySwnTo2v0UPD1pETtQke9bWnm38RVl04wf2H4Gb7jR/BoDZ5J1C7956vuN
 FFXt9lcnTj2Cijn/8wz2S9GneY/mjsWf9OP7NUM3O6DgxORhdoviOnYOAzqzEXjG
 ylqTP/3FVglDbawKaLy3ubI0dteNxOu9U4gLveP617Ysd8h4s5XsYHPYKOOltybS
 HhVNf/3EdoD3lms67Zj7yPl7PtdDhNKFrS32nhnfdLLgsMiwTyb9ZYaFpK2XcD9v
 KDKblibH/wpQCsnReB66dKBR8P4ktTvXM1ovkb7LFUZD5tsOcb1Bp5ROzHXUSmiI
 sXh5Ueue0FPKExU5WFKROE43+G5KOJG5pB2RwgugsqVlZjFhGotZrIle17Zuqxz0
 kVR+vGZ50O/nLQ+EoRhDRRbDBrUMT7/xxHDSPQ6d4HK2hNTbrXuSXcoe8/BvbSTt
 JXQCdbWDPZ5oR6z8RoBN1xHhJvXC3Y2w/d7ZzOpl3yLzsKpJ7K4tys4Z29iv3ut6
 ziRS1AvJvedwSK73fWTs+zEHKm+pFMqq2U+DncvWWOWOVpIv6eKRlY9O8enP7IeW
 qNHj4UVYnr9w4oAhvk2WJt1TZrhzBX9NhMjHSxUCSOs5v/YeiBjPTTy40N4O0Y6Z
 DwKwDNxZq49ILEznntsd
 =EOW6
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highligts:
   - fix an Oops in nfs4_open_and_get_state
   - fix an Oops in the nfs4_state_manager
   - fix another bug in the close/open_downgrade code"

* tag 'nfs-for-3.17-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix another bug in the close/open_downgrade code
  NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
  NFS: remove BUG possibility in nfs4_open_and_get_state
2014-09-19 13:07:49 -07:00
Richard Weinberger d577bc104f UBIFS: Remove bogus assert
This assertion was only correct before UBIFS had xattr support.
Now with xattr support also a directory node can carry data
and can act as host node.

Suggested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-19 18:11:50 +03:00
Filipe Manana 8407f55326 Btrfs: fix data corruption after fast fsync and writeback error
When we do a fast fsync, we start all ordered operations and then while
they're running in parallel we visit the list of modified extent maps
and construct their matching file extent items and write them to the
log btree. After that, in btrfs_sync_log() we wait for all the ordered
operations to finish (via btrfs_wait_logged_extents).

The problem with this is that we were completely ignoring errors that
can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
or -EIO errors for example. When such error happens, it means we have parts
of the on disk extent that weren't written to, and so we end up logging
file extent items that point to these extents that contain garbage/random
data - so after a crash/reboot plus log replay, we get our inode's metadata
pointing to those extents.

This worked in contrast with the full (non-fast) fsync path, where we
start all ordered operations, wait for them to finish and then write
to the log btree. In this path, after each ordered operation completes
we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
-EIO if so (via btrfs_wait_ordered_range).

So if an error happens with any ordered operation, just return a -EIO
error to userspace, so that it knows that not all of its previous writes
were durably persisted and the application can take proper action (like
redo the writes for e.g.) - and definitely not leave any file extent items
in the log refer to non fully written extents.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:51 -07:00
Filipe Manana 669249eea3 Btrfs: fix fsync race leading to invalid data after log replay
When the fsync callback (btrfs_sync_file) starts, it first waits for
the writeback of any dirty pages to start and finish without holding
the inode's mutex (to reduce contention). After this it acquires the
inode's mutex and repeats that process via btrfs_wait_ordered_range
only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
is set on the inode).

This is not safe for a non full sync - we need to start and wait for
writeback to finish for any pages that might have been made dirty
before acquiring the inode's mutex and after that first step mentioned
before. Why this is needed is explained by the following comment added
to btrfs_sync_file:

  "Right before acquiring the inode's mutex, we might have new
   writes dirtying pages, which won't immediately start the
   respective ordered operations - that is done through the
   fill_delalloc callbacks invoked from the writepage and
   writepages address space operations. So make sure we start
   all ordered operations before starting to log our inode. Not
   doing this means that while logging the inode, writeback
   could start and invoke writepage/writepages, which would call
   the fill_delalloc callbacks (cow_file_range,
   submit_compressed_extents). These callbacks add first an
   extent map to the modified list of extents and then create
   the respective ordered operation, which means in
   tree-log.c:btrfs_log_inode() we might capture all existing
   ordered operations (with btrfs_get_logged_extents()) before
   the fill_delalloc callback adds its ordered operation, and by
   the time we visit the modified list of extent maps (with
   btrfs_log_changed_extents()), we see and process the extent
   map they created. We then use the extent map to construct a
   file extent item for logging without waiting for the
   respective ordered operation to finish - this file extent
   item points to a disk location that might not have yet been
   written to, containing random data - so after a crash a log
   replay will make our inode have file extent items that point
   to disk locations containing invalid data, as we returned
   success to userspace without waiting for the respective
   ordered operation to finish, because it wasn't captured by
   btrfs_get_logged_extents()."

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-19 06:57:50 -07:00
Kirill Tkhai f139caf2e8 sched, cleanup, treewide: Remove set_current_state(TASK_RUNNING) after schedule()
schedule(), io_schedule() and schedule_timeout() always return
with TASK_RUNNING state set, so one more setting is unnecessary.

(All places in patch are visible good, only exception is
 kiblnd_scheduler() from:

      drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

 Its schedule() is one line above standard 3 lines of unified diff)

No places where set_current_state() is used for mb().

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Anil Belur <askb23@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry Eremin <dmitry.eremin@intel.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Isaac Huang <he.huang@intel.com>
Cc: James E.J. Bottomley <JBottomley@parallels.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Liang Zhen <liang.zhen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Masaru Nomura <massa.nomura@gmail.com>
Cc: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleg Drokin <green@linuxhacker.ru>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Robert Love <robert.w.love@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Zi Shen Lim <zlim.lnx@gmail.com>
Cc: devel@driverdev.osuosl.org
Cc: dm-devel@redhat.com
Cc: dri-devel@lists.freedesktop.org
Cc: fcoe-devel@open-fcoe.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-afs@lists.infradead.org
Cc: linux-cris-kernel@axis.com
Cc: linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-raid@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: qla2xxx-upstream@qlogic.com
Cc: user-mode-linux-devel@lists.sourceforge.net
Cc: user-mode-linux-user@lists.sourceforge.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-19 12:35:17 +02:00
Abhi Das 00a158be83 GFS2: fix bad inode i_goal values during block allocation
This patch checks if i_goal is either zero or if doesn't exist
within any rgrp (i.e gfs2_blk2rgrpd() returns NULL). If so, it
assigns the ip->i_no_addr block as the i_goal.

There are two scenarios where a bad i_goal can result in a
-EBADSLT error.

1. Attempting to allocate to an existing inode:
Control reaches gfs2_inplace_reserve() and ip->i_goal is bad.
We need to fix i_goal here.

2. A new inode is created in a directory whose i_goal is hosed:
In this case, the parent dir's i_goal is copied onto the new
inode. Since the new inode is not yet created, the ip->i_no_addr
field is invalid and so, the fix in gfs2_inplace_reserve() as per
1) won't work in this scenario. We need to catch and fix it sooner
in the parent dir itself (gfs2_create_inode()), before it is
copied to the new inode.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-19 10:45:18 +01:00
Theodore Ts'o f6e63f9080 ext4: fold ext4_nojournal_sops into ext4_sops
There's no longer any need to have a separate set of super_operations
for nojournal mode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 17:12:30 -04:00
Theodore Ts'o bb04457658 ext4: support freezing ext2 (nojournal) file systems
Through an oversight, when we added nojournal support to ext4, we
didn't add support to allow file system freezing.  This is relatively
easy to add, so let's do it.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Dexuan Cui <decui@microsoft.com>
2014-09-18 17:12:02 -04:00
Theodore Ts'o bda3253043 ext4: fold ext4_sync_fs_nojournal() into ext4_sync_fs()
This allows us to eliminate duplicate code, and eventually allow us to
also fold ext4_sops and ext4_nojournal_sops together.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 16:12:37 -04:00
Linus Torvalds d9773ceabf Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs/smb3 fixes from Steve French:
 "Fixes for problems found during testing and debugging at the SMB3
  storage test event (plugfest) this week"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  Fix mfsymlinks file size check
  Update version number displayed by modinfo for cifs.ko
  cifs: remove dead code
  Revert "cifs: No need to send SIGKILL to demux_thread during umount"
  [SMB3] Fix oops when creating symlinks on smb3
  [CIFS] Fix setting time before epoch (negative time values)
2014-09-18 11:10:35 -07:00
Zefan Li 52de4779f2 cpuset: simplify proc_cpuset_show()
Use the ONE macro instead of REG, and we can simplify proc_cpuset_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Zefan Li 006f4ac497 cgroup: simplify proc_cgroup_show()
Use the ONE macro instead of REG, and we can simplify proc_cgroup_show().

Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-18 13:27:23 -04:00
Trond Myklebust cd9288ffae NFSv4: Fix another bug in the close/open_downgrade code
James Drew reports another bug whereby the NFS client is now sending
an OPEN_DOWNGRADE in a situation where it should really have sent a
CLOSE: the client is opening the file for O_RDWR, but then trying to
do a downgrade to O_RDONLY, which is not allowed by the NFSv4 spec.

Reported-by: James Drews <drews@engr.wisc.edu>
Link: http://lkml.kernel.org/r/541AD7E5.8020409@engr.wisc.edu
Fixes: aee7af356e (NFSv4: Fix problems with close in the presence...)
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18 13:04:22 -04:00
Steve Dickson 080af20cc9 NFSv4: nfs4_state_manager() vs. nfs_server_remove_lists()
There is a race between nfs4_state_manager() and
nfs_server_remove_lists() that happens during a nfsv3 mount.

The v3 mount notices there is already a supper block so
nfs_server_remove_lists() called which uses the nfs_client_lock
spin lock to synchronize access to the client list.

At the same time nfs4_state_manager() is running through
the client list looking for work to do, using the same
lock. When nfs4_state_manager() wins the race to the
list, a v3 client pointer is found and not ignored
properly which causes the panic.

Moving some protocol checks before the state checking
avoids the panic.

CC: Stable Tree <stable@vger.kernel.org>
Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-18 13:04:21 -04:00
Chris Mason 0f23ae74f5 Revert "Btrfs: device_list_add() should not update list when mounted"
This reverts commit b96de000bc.

This commit is triggering failures to mount by subvolume id in some
configurations.  The main problem is how many different ways this
scanning function is used, both for scanning while mounted and
unmounted.  A proper cleanup is too big for late rcs.

For now, just revert the commit and we'll put a better fix into a later
merge window.

Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:49:05 -07:00
Qu Wenruo e6c4efd87a btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-18 07:14:46 -07:00
Jan Kara 279bf6d390 ext4: don't check quota format when there are no quota files
The check whether quota format is set even though there are no
quota files with journalled quota is pointless and it actually
makes it impossible to turn off journalled quotas (as there's
no way to unset journalled quota format). Just remove the check.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 01:12:15 -04:00
Jan Kara 50849db32a jbd2: simplify calling convention around __jbd2_journal_clean_checkpoint_list
__jbd2_journal_clean_checkpoint_list() returns number of buffers it
freed but noone was using the value so just stop doing that. This
also allows for simplifying the calling convention for
journal_clean_once_cp_list().

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 00:58:12 -04:00
Jan Kara cc97f1a7c7 jbd2: avoid pointless scanning of checkpoint lists
Yuanhan has reported that when he is running fsync(2) heavy workload
creating new files over ramdisk, significant amount of time is spent in
__jbd2_journal_clean_checkpoint_list() trying to clean old transactions
(but they cannot be cleaned up because flusher hasn't yet checkpointed
those buffers). The workload can be generated by:
  fs_mark -d /fs/ram0/1 -D 2 -N 2560 -n 1000000 -L 1 -S 1 -s 4096

Reduce the amount of scanning by stopping to scan the transaction list
once we find a transaction that cannot be checkpointed. Note that this
way of cleaning is still enough to keep freeing space in the journal
after fully checkpointed transactions.

Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-18 00:42:16 -04:00
David Howells e2cf1f1cc7 CacheFiles: Handle rename2
Not all filesystems now provide the rename i_op - ext4 for one - but rather
provide the rename2 i_op.  CacheFiles checks that the filesystem has rename
and so will reject ext4 now with EPERM:

	CacheFiles: Failed to register: -1

Fix this by checking for rename2 as an alternative.  The call to vfs_rename()
actually handles selection of the appropriate function, so we needn't worry
about that.

Turning on debugging shows:

	[cachef] ==> cachefiles_get_directory(,,cache)
	[cachef] subdir -> ffff88000b22b778 positive
	[cachef] <== cachefiles_get_directory() = -1 [check]

where -1 is EPERM.

Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 23:29:53 +01:00
NeilBrown 696382f938 cachefiles: remove two unused pagevecs.
These two have been unused since

commit c4d6d8dbf3
    CacheFiles: Fix the marking of cached pages

in 3.8.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 23:29:50 +01:00
Milosz Tanski 3e1199dcad FS-Cache: refcount becomes corrupt under vma pressure.
In rare cases under heavy VMA pressure the ref count for a fscache cookie
becomes corrupt. In this case we decrement ref count even if we fail before
incrementing the refcount.

FS-Cache: Assertion failed bnode-eca5f9c6/syslog
0 > 0 is false
------------[ cut here ]------------
kernel BUG at fs/fscache/cookie.c:519!
invalid opcode: 0000 [#1] SMP
Call Trace:
[<ffffffffa01ba060>] __fscache_relinquish_cookie+0x50/0x220 [fscache]
[<ffffffffa02d64ce>] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
[<ffffffffa02ae1d3>] ceph_destroy_inode+0x33/0x200 [ceph]
[<ffffffff811cf67e>] ? __fsnotify_inode_delete+0xe/0x10
[<ffffffff811a9e0c>] destroy_inode+0x3c/0x70
[<ffffffff811a9f51>] evict+0x111/0x180
[<ffffffff811aa763>] iput+0x103/0x190
[<ffffffff811a5de8>] __dentry_kill+0x1c8/0x220
[<ffffffff811a5f31>] shrink_dentry_list+0xf1/0x250
[<ffffffff811a762c>] prune_dcache_sb+0x4c/0x60
[<ffffffff811930af>] super_cache_scan+0xff/0x170
[<ffffffff8113d7a0>] shrink_slab_node+0x140/0x2c0
[<ffffffff8113f2da>] shrink_slab+0x8a/0x130
[<ffffffff81142572>] balance_pgdat+0x3e2/0x5d0
[<ffffffff811428ca>] kswapd+0x16a/0x4a0
[<ffffffff810a43f0>] ? __wake_up_sync+0x20/0x20
[<ffffffff81142760>] ? balance_pgdat+0x5d0/0x5d0
[<ffffffff81083e09>] kthread+0xc9/0xe0
[<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
[<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
[<ffffffff8159f63c>] ret_from_fork+0x7c/0xb0
[<ffffffff81083d40>] ? flush_kthread_worker+0xb0/0xb0
RIP [<ffffffffa01b984b>] __fscache_disable_cookie+0x1db/0x210 [fscache]
RSP <ffff8803bc85f9b8>
---[ end trace 254d0d7c74a01f25 ]---

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-09-17 22:41:40 +01:00
Liu Bo 4d1a40c66b Btrfs: fix up bounds checking in lseek
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END
allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't
prepare for that and convert the negative @offset into unsigned type,
so we get (end < start) warning.

[ 1269.835374] ------------[ cut here ]------------
[ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
[ 1269.838816] BTRFS: end < start 4094 18446744073709551615
[ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G        W      3.16.0+ #306
[ 1269.858229] Call Trace:
[ 1269.858612]  [<ffffffff81801a69>] dump_stack+0x4e/0x68
[ 1269.858952]  [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
[ 1269.859416]  [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
[ 1269.859929]  [<ffffffff813b0fbd>] insert_state+0x11d/0x140
[ 1269.860409]  [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
[ 1269.860805]  [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
[ 1269.861697]  [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
[ 1269.862168]  [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
[ 1269.862620]  [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
[ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This assumes that btrfs starts finding DATA/HOLE from the beginning of file
if the assigned @offset is negative.

Also we add alignment for lock_extent_bits 's range.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:46:30 -07:00
Miao Xie f612496bca Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writting,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes, we may fail to correct the data, so the failure records will be left
in the tree, we need free them when we free the inode or the memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:02 -07:00
Miao Xie 8b110e393c Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we will insert the endio work into the
  dedicated btrfs workqueue, not common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue, if we
  insert the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get right data, we write it back to the corrupted mirror.
- And if the data on the new mirror is still corrupted, we will try next
  mirror until we read right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:01 -07:00
Miao Xie 28e1cc7d1b Btrfs: Set real mirror number for read operation on RAID0/5/6
We need real mirror number for RAID0/5/6 when reading data, or if read error
happens, we would pass 0 as the number of the mirror on which the io error
happens. It is wrong and would cause the filesystem read the data from the
corrupted mirror again.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:39:00 -07:00
Miao Xie 1203b6813e Btrfs: modify clean_io_failure and make it suit direct io
We could not use clean_io_failure in the direct IO path because it got the
filesystem information from the page structure, but the page in the direct
IO bio didn't have the filesystem information in its structure. So we need
modify it and pass all the information it need by parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:59 -07:00
Miao Xie ffdd2018dd Btrfs: modify repair_io_failure and make it suit direct io
The original code of repair_io_failure was just used for buffered read,
because it got some filesystem data from page structure, it is safe for
the page in the page cache. But when we do a direct read, the pages in bio
are not in the page cache, that is there is no filesystem data in the page
structure. In order to implement direct read data repair, we need modify
repair_io_failure and pass all filesystem data it need by function
parameters.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:58 -07:00
Miao Xie 2fe6303e7c Btrfs: split bio_readpage_error into several functions
The data repair function of direct read will be implemented later, and some code
in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in direct read repair later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:56 -07:00
Miao Xie 454ff3de42 Btrfs: Cleanup unused variant and argument of IO failure handlers
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:55 -07:00
Miao Xie 6c387ab20d Btrfs: fix missing error handler if submiting re-read bio fails
We forgot to free failure record and bio after submitting re-read bio failed,
fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:54 -07:00
Miao Xie c1dc08967f Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of
raid stripe, and the filesystem will wait for all sub-bios and then run final
end io process.

But it was very hard to implement the data repair when dio read failure happens,
because at the final end io function, we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file data
check in the final end io function to the sub-bio end io function, in which we can
get the mirror number of the device we access. This patch did this work as the
first step of the direct io data repair implementation.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:53 -07:00
Miao Xie dc380aea5f Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:52 -07:00
Miao Xie 23ea8e5a07 Btrfs: load checksum data once when submitting a direct read io
The current code would load checksum data for several times when we split
a whole direct read io because of the limit of the raid stripe, it would
make us search the csum tree for several times. In fact, it just wasted time,
and made the contention of the csum tree root be more serious. This patch
improves this problem by loading the data at once.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:50 -07:00
Miao Xie c3929c3624 Btrfs: modify rw_devices counter under chunk_mutex context
rw_devices counter is often used to tune the profile when doing chunk allocation,
so we should modify it under the chunk_mutex context to avoid getting wrong
chunk profile.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:49 -07:00
Miao Xie 5f37583569 Btrfs: move the missing device to its own fs device list
For a missing device, we don't know it belong to which fs before we read its
fsid from the chunk tree. So we add them into the current fs device list at first.
When we get its fsid, we should move them to their own fs device list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:48 -07:00
Miao Xie 416d7b802a Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs
When we open a seed filesystem, if the degraded mount option is set, we continue to
mount the fs if we don't find some devices in the seed filesystem. But we should stop
mounting if other errors happen. Fix it

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:47 -07:00
Miao Xie 82372bc816 Btrfs: make the logic of source device removing more clear
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:46 -07:00
Miao Xie 67a2c45ee7 Btrfs: fix use-after-free problem of the device during device replace
The problem is:
	Task0(device scan task)		Task1(device replace task)
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
	check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

It is a temporary solution, we can fix this problem and make the code more
clear by atomic counter in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:44 -07:00
Miao Xie adbbb8631b Btrfs: fix unprotected device list access when cloning fs devices
We can build a new filesystem based a seed filesystem, and we need clone
the fs devices when we open the new filesystem. But someone might clear
the seed flag of the seed filesystem, then mount that filesystem and
remove some device. If we mount the new filesystem, we might access
a device list which was being changed when we clone the fs devices.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:43 -07:00
Miao Xie 2196d6e8a7 Btrfs: Fix misuse of chunk mutex
There were several problems about chunk mutex usage:
- Lock chunk mutex when updating metadata. It would cause the nested
  deadlock because updating metadata might need allocate new chunks
  that need acquire chunk mutex. We remove chunk mutex at this case,
  because b-tree lock and other lock mechanism can help us.
- ABBA deadlock occured between device_list_mutex and chunk_mutex.
  When we update device status, we must acquire device_list_mutex at the
  beginning, and then we might get chunk_mutex during the device status
  update because we need allocate new chunks for metadata COW. But at
  most place, we acquire chunk_mutex at first and then acquire device list
  mutex. We need change the lock order.
- Some place we needn't acquire chunk_mutex. For example we needn't get
  chunk_mutex when we free a empty seed fs_devices structure.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:42 -07:00
Miao Xie 15484377f5 Btrfs: fix unprotected device list access when getting the fs information
When we get the fs information, we forgot to acquire the mutex of device list,
it might cause the problem we might access a device that was removed. Fix
it by acquiring the device list mutex.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:41 -07:00
Miao Xie fe48a5c00f Btrfs: fix unprotected system chunk array insertion
We didn't protect the system chunk array when we added a new
system chunk into it, it would cause the array be corrupted
if someone remove/add some system chunk into array at the same
time. Fix it by chunk lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:40 -07:00
Miao Xie 7cc8e58d53 Btrfs: fix unprotected device's variants on 32bits machine
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk
lock when we change them, but sometimes we read them without any lock,
and we might get unexpected value. We fix this problem like inode's
i_size.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:38 -07:00
Miao Xie 1c1161870c Btrfs: update free_chunk_space during allocting a new chunk
We should update free_chunk_space in time when we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space data to calculate the reserved
space, if we don't update it in time, we would consider the disk space which
has be allocated as free space, and would use it to do overcommit reservation.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:37 -07:00
Miao Xie 43530c46cc Btrfs: fix unprotected device->bytes_used update
We should update device->bytes_used in the lock context of
chunk_mutex, or we would get wrong data.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:36 -07:00
Miao Xie 5d778aaeb0 Btrfs: Fix wrong free_chunk_space assignment during removing a device
During removing a device, we have modified free_chunk_space when we
shrink the device, so we needn't assign a new value to it after
the device shrink. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:35 -07:00
Miao Xie ce7213c70c Btrfs: fix wrong device bytes_used in the super block
device->bytes_used will be changed when allocating a new chunk, and
disk_total_size will be changed if resizing is successful.
Meanwhile, the on-disk super blocks of the previous transaction
might not be updated. Considering the consistency of the metadata
in the previous transaction, We should use the size in the previous
transaction to check if the super block is beyond the boundary
of the device.

Though it is not big problem because we don't use it now, but anyway
it is better that we make it be consistent with the common metadata,
maybe we will use it in the future.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:34 -07:00
Miao Xie 935e5cc935 Btrfs: fix wrong disk size when writing super blocks
total_size will be changed when resizing a device, and disk_total_size
will be changed if resizing is successful. Meanwhile, the on-disk super
blocks of the previous transaction might not be updated. Considering
the consistency of the metadata in the previous transaction, We should
use the size in the previous transaction to check if the super block is
beyond the boundary of the device. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:33 -07:00
Miao Xie 1c43366d3b Btrfs: fix unprotected assignment of the target device
We didn't protect the assignment of the target device, it might cause the
problem that the super block update was skipped because we might find wrong
size of the target device during the assignment. Fix it by moving the
assignment sentences into the initialization function of the target device.
And there is another merit that we can check if the target device is suitable
more early.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:31 -07:00
Miao Xie c7662111c7 Btrfs: cleanup double assignment of device->bytes_used when device replace finishes
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:30 -07:00
Miao Xie 90180da42c Btrfs: cleanup unused num_can_discard in fs_devices
The member variants - num_can_discard - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:29 -07:00
Li RongQing 82f70d62f7 btrfs: remove the wrong comments
This comments became wrong after c3c532[bdi: add helper function for
doing init and register of a bdi for a file system], so remove them.

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:28 -07:00
Filipe Manana a2cc11db24 Btrfs: fix directory recovery from fsync log
When replaying a directory from the fsync log, if a directory entry
exists both in the fs/subvol tree and in the log, the directory's inode
got its i_size updated incorrectly, accounting for the dentry's name
twice.

Reproducer, from a test for xfstests:

    _scratch_mkfs >> $seqres.full 2>&1
    _init_flakey
    _mount_flakey

    touch $SCRATCH_MNT/foo
    sync

    touch $SCRATCH_MNT/bar
    xfs_io -c "fsync" $SCRATCH_MNT
    xfs_io -c "fsync" $SCRATCH_MNT/bar

    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    [ -f $SCRATCH_MNT/foo ] || echo "file foo is missing"
    [ -f $SCRATCH_MNT/bar ] || echo "file bar is missing"

    _unmount_flakey
    _check_scratch_fs $FLAKEY_DEV

The filesystem check at the end failed with the message:
"root 5 root dir 256 error".

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:27 -07:00
Liu Bo 25ce459c1a Btrfs: fix loop writing of async reclaim
One of my tests shows that when we really don't have space to reclaim via
flush_space and also run out of space, this async reclaim work loops on adding
itself into the workqueue and keeps writing something to disk according to
iostat's results, and these writes mainly comes from commit_transaction which
writes super_block.  This's unacceptable as it can be bad to disks, especially
memeory storages.

This adds a check to avoid the above situation.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:25 -07:00
Josef Bacik dc046b10c8 Btrfs: make fiemap not blow when you have lots of snapshots
We have been iterating all references for each extent we have in a file when we
do fiemap to see if it is shared.  This is fine when you have a few clones or a
few snapshots, but when you have 5k snapshots suddenly fiemap just sits there
and stares at you.  So add btrfs_check_shared which will use the backref walking
code but will short circuit as soon as it finds a root or inode that doesn't
match the one we currently have.  This makes fiemap on my testbox go from
looking at me blankly for a day to spitting out actual output in a reasonable
amount of time.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:24 -07:00
Filipe Manana 78a017a2c9 Btrfs: add missing compression property remove in btrfs_ioctl_setflags
The behaviour of a 'chattr -c' consists of getting the current flags,
clearing the FS_COMPR_FL bit and then sending the result to the set
flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags
passed to the ioctl. This results in the compression property not being
cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL
was set in the received flags.

Reproducer:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt && cd /mnt
    $ mkdir a
    $ chattr +c a
    $ touch a/file
    $ lsattr a/file
    --------c------- a/file
    $ chattr -c a
    $ touch a/file2
    $ lsattr a/file2
    --------c------- a/file2
    $ lsattr -d a
    ---------------- a

Reported-by: Andreas Schneider <asn@cryptomilk.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:23 -07:00
Qu Wenruo 12b894cb28 btrfs: Fix a deadlock in btrfs_dev_replace_finishing()
btrfs-transacion:5657
[stack snip]
btrfs_bio_map()
    btrfs_bio_counter_inc_blocked()
        percpu_counter_inc(&fs_info->bio_counter)  ###bio_counter > 0(A)
        __btrfs_bio_map()
            btrfs_dev_replace_lock()
                mutex_lock(dev_replace->lock)	   ###wait mutex(B)

btrfs:32612
[stack snip]
btrfs_dev_replace_start()
    btrfs_dev_replace_lock()
	mutex_lock(dev_replace->lock)		   ###hold mutex(B)
    btrfs_dev_replace_finishing()
        btrfs_rm_dev_replace_blocked()
            wait until percpu_counter_sum == 0	   ###wait on bio_counter(A)

This bug can be triggered quite easily by the following test script:
http://pastebin.com/MQmb37Cy

This patch will fix the ABBA problem by calling
btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked().

The consistency of btrfs devices list and their superblocks is protected
by device_list_mutex, not btrfs_dev_replace_lock/unlock().
So it is safe the move btrfs_dev_replace_unlock() before
btrfs_rm_dev_replace_blocked().

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:22 -07:00
Liu Bo a583c02664 Btrfs: cleanup the same name in end_bio_extent_readpage
We've defined a 'offset' out of bio_for_each_segment_all.

This is just a clean rename, no function changes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:20 -07:00
Mark Fasheh 0b4699dcb6 btrfs: don't go readonly on existing qgroup items
btrfs_drop_snapshot() leaves subvolume qgroup items on disk after
completion. This can cause problems with snapshot creation. If a new
snapshot tries to claim the deleted subvolumes id, btrfs will get -EEXIST
from add_qgroup_item() and go read-only. The following commands will
reproduce this problem (assume btrfs is on /dev/sda and is mounted at
/btrfs)

mkfs.btrfs -f /dev/sda
mount -t btrfs /dev/sda /btrfs/
btrfs quota enable /btrfs/
btrfs su sna /btrfs/ /btrfs/snap
btrfs su de /btrfs/snap
sleep 45
umount /btrfs/
mount -t btrfs /dev/sda /btrfs/

We can fix this by catching -EEXIST in add_qgroup_item() and
initializing the existing items. We have the problem of orphaned
relation items being on disk from an old snapshot but that is outside
the scope of this patch.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:19 -07:00
Filipe Manana 2a39e59802 Btrfs: shrink further sizeof(struct extent_buffer)
The map_start and map_len fields aren't used anywhere, so just remove
them. On a x86_64 system, this reduced sizeof(struct extent_buffer)
from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
now fit into a page instead of 13.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:17 -07:00
Filipe Manana 4395e0c4da Btrfs: send, lower mem requirements for processing xattrs
Maximum xattr size can be up to nearly the leaf size. For an fs with a
leaf size larger than the page size, using kmalloc requires allocating
multiple pages that are contiguous, which might not be possible if
there's heavy memory fragmentation. Therefore fallback to vmalloc if
we fail to allocate with kmalloc. Also start with a smaller buffer size,
since xattr values typically are smaller than a page.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:16 -07:00
David Sterba f87c4318af btrfs: remove stale define after removing ordered operations
Last user removed in commit "btrfs: disable strict file flushes for
renames and truncates" (8d875f95da).

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:15 -07:00
Filipe Manana 2000552396 Btrfs: improve free space cache management and space allocation
While under random IO, a block group's free space cache eventually reaches
a state where it has a mix of extent entries and bitmap entries representing
free space regions.

As later free space regions are returned to the cache, some of them are merged
with existing extent entries if they are contiguous with them. But others are
not merged, because despite the existence of adjacent free space regions in
the cache, the merging doesn't happen because the existing free space regions
are represented in bitmap extents. Even when new free space regions are merged
with existing extent entries (enlarging the free space range they represent),
we create chances of having after an enlarged region that is contiguous with
some other region represented in a bitmap entry.

Both clustered and non-clustered space allocation work by iterating over our
extent and bitmap entries and skipping any that represents a region smaller
then the allocation request (and giving preference to extent entries before
bitmap entries). By having a contiguous free space region that is represented
by 2 (or more) entries (mix of extent and bitmap entries), we end up not
satisfying an allocation request with a size larger than the size of any of
the entries but no larger than the sum of their sizes. Making the caller assume
we're under a ENOSPC condition or force it to allocate multiple smaller space
regions (as we do for file data writes), which adds extra overhead and more
chances of causing fragmentation due to the smaller regions being all spread
apart from each other (more likely when under concurrency).

For example, if we have the following in the cache:

* extent entry representing free space range: [128Mb - 256Kb, 128Mb[

* bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
  representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
  space in this 128Mb area is marked as free

An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb,
would fail before, despite the existence of such contiguous free space area in the
cache. The caller could only allocate up to 768Kb of space at once and later another
256Kb (or vice-versa). In between each smaller allocation request, another task
working on a different file/inode might come in and take that space, preventing the
former task of getting a contiguous 1Mb region of free space.

Therefore this change implements the ability to move free space from bitmap
entries into existing and new free space regions represented with extent
entries. This is done when a space region is added to the cache.

A test was added to the sanity tests that explains in detail the issue too.

Some performance test results with compilebench on a 4 cores machine, with
32Gb of ram and using an HDD follow.

Test: compilebench -D /mnt -i 30 -r 1000 --makej

Before this change:

   intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
   compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
   read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
   delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)

After this change:

   intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
   compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
   read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
   delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:13 -07:00
Anand Jain 3c1dbdf54a btrfs: rename total_bytes to avoid confusion
we are assigning number_devices to the total_bytes,
that's very confusing for a moment

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:12 -07:00
Anand Jain de4c296f63 btrfs: fix typo in the log message
there is no matching open parenthesis for the closing parenthesis

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:11 -07:00
Anand Jain b2efedca68 btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev()
seed fs devices don't participate as rw_device, so don't increment
rw_devices when the device being handled belongs to a seed fs.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:10 -07:00
Anand Jain 8bef8401a0 btrfs: fix memory leak when there is no more seed device
When we replace all the seed device in the system there is
no point in just keeping the btrfs_fs_devices with out
any device

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:09 -07:00
Anand Jain 94d5f0c2ae btrfs: update sprout seed pointer when seed fs is relinquished
We are not updating sprout fs seed pointer when all seed device
is replaced. This patch will check if all seed device has been
replaced and then update the sprout pointer accordingly.

Same reproducer as in the previous patch would apply here.
And notice that btrfs_close_device will check if seed fs is
present and spits out the error with out this patch.

int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
{
::
                seed_devices = fs_devices->seed;
::
        while (seed_devices) {
                fs_devices = seed_devices;
                seed_devices = fs_devices->seed;
                __btrfs_close_devices(fs_devices);
                free_fs_devices(fs_devices);
        }

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:08 -07:00
Anand Jain 63dd86fa79 btrfs: fix rw_devices miss match after seed replace
reproducer:
    reproducer:
    mount /dev/sdb /btrfs
    btrfs dev add /dev/sdc /btrfs
    btrfs rep start -B /dev/sdb /dev/sdd /btrfs
    umount /btrfs

WARNING: CPU: 0 PID: 3882 at fs/btrfs/volumes.c:892 __btrfs_close_devices+0x1c8/0x200 [btrfs]()

which is

        WARN_ON(fs_devices->rw_devices);

   The problem here is that we did not add one to the rw_devices when
   we replace the seed device with a writable device.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:06 -07:00
Anand Jain 25e8e9113d btrfs: replace seed device followed by unmount causes kernel WARNING
reproducer:
mount /dev/sdb /btrfs
btrfs dev add /dev/sdc /btrfs
btrfs rep start -B /dev/sdb /dev/sdd /btrfs
umount /btrfs

WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]()
::

__btrfs_close_devices()
::
        WARN_ON(fs_devices->open_devices);

After the seed device has been replaced the new target device
is no more a seed device. So we need to update the device
numbers in the fs_devices as pointed by the fs_info.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:05 -07:00
Anand Jain d51908ce4e btrfs: preparatory to make btrfs_rm_dev_replace_srcdev() seed aware
There is no logical change in this patch, just a preparatory patch,
so that changes can be easily reasoned.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:04 -07:00
Andrey Utkin 56094eecd3 btrfs: Drop stray check of fixup_workers creation
The issue was introduced in a79b7d4b3e,
adding allocation of extent_workers, so this stray check is surely not
meant to be a check of something else.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021
Reported-by: Maks Naumov <maksqwe1@ukr.net>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:03 -07:00
Filipe Manana f98de9b9c0 Btrfs: make btrfs_search_forward return with nodes unlocked
None of the uses of btrfs_search_forward() need to have the path
nodes (level >= 1) read locked, only the leaf needs to be locked
while the caller processes it. Therefore make it return a path
with all nodes unlocked, except for the leaf.

This change is motivated by the observation that during a file
fsync we repeatdly call btrfs_search_forward() and process the
returned leaf while upper nodes of the returned path (level >= 1)
are read locked, which unnecessarily blocks other tasks that want
to write to the same fs/subvol btree.
Therefore instead of modifying the fsync code to unlock all nodes
with level >= 1 immediately after calling btrfs_search_forward(),
change btrfs_search_forward() to do it, so that it benefits all
callers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:02 -07:00
Anand Jain 79aec2b80d btrfs: sysfs label interface should check for read only FS
Not sure how this escaped many eyes so far

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:38:01 -07:00
Anand Jain 20ee0825ec btrfs: code optimize: BTRFS_ATTR_RW could set the mode
BTRFS_ATTR_RW could set the mode and be inline with BTRFS_ATTR

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:59 -07:00
Anand Jain 98b3d389eb btrfs: code optimize: BTRFS_ATTR could handle the mode
All that uses BTRFS_ATTR want mode to be set at 0444 so just do
it at the define.  And few spacing alignments.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:58 -07:00
Anand Jain 3f4b57e09d btrfs: use BTRFS_ATTR instead of btrfs_no_store()
we have BTRFS_ATTR define to create sysfs RO file, use that.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:57 -07:00
Filipe Manana 160f4089c8 Btrfs: avoid unnecessary switch of path locks to blocking mode
If we need to cow a node, increase the write lock level and retry the
tree search, there's no point of changing the node locks in our path
to blocking mode, as we only waste time and unnecessarily wake up other
tasks waiting on the spinning locks (just to block them again shortly
after) because we release our path before repeating the tree search.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:56 -07:00
Filipe Manana 24cdc847d9 Btrfs: unlock nodes earlier when inserting items in a btree
In ctree.c:setup_items_for_insert(), we can unlock all nodes in our
path before we process the leaf (shift items and data, adjust data
offsets, etc). This allows for better btree concurrency, as we're
often holding a write lock on at least the node at level 1.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:55 -07:00
Satoru Takeuchi d1b00a4711 btrfs: use IS_ALIGNED() for assertion in btrfs_lookup_csums_range() for simplicity
btrfs_lookup_csums_range() uses ALIGN() to check if "start"
and "end + 1" are aligned to "root->sectorsize". It's better to
replace these with IS_ALIGNED() for simplicity.

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:54 -07:00
Mark Fasheh d3982100ba btrfs: add trace for qgroup accounting
We want this to debug qgroup changes on live systems.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:50 -07:00
Miao Xie 443f24fee7 Btrfs: cleanup unused latest_devid and latest_trans in fs_devices
The member variants - latest_devid and latest_trans - of fs_devices structure
are set, but no one use them to do anything. so remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:49 -07:00
Miao Xie 6ba40b615f Btrfs: update the comment of total_bytes and disk_total_bytes of btrfs_devie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:48 -07:00
Miao Xie addc3fa74e Btrfs: Fix the problem that the dirty flag of dev stats is cleared
The io error might happen during writing out the device stats, and the
device stats information and dirty flag would be update at that time,
but the current code didn't consider this case, just clear the dirty
flag, it would cause that we forgot to write out the new device stats
information. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:46 -07:00
Miao Xie d5ee37bcb1 Btrfs: make the device lock and its protected data in the same cacheline
The lock in btrfs_device structure was far away from its protected data, it would
make CPU load the cache line twice when we accessed them, move them together.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:45 -07:00
Miao Xie 5f546063ce Btrfs: fix wrong generation check of super block on a seed device
The super block generation of the seed devices is not the same as the
filesystem which sprouted from them because we don't update the super
block on the seed devices when we change that new filesystem. So we
should not use the generation of that new filesystem to check the super
block generation on the seed devices, Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:44 -07:00
Miao Xie 17a9be2f28 Btrfs: fix wrong fsid check of scrub
All the metadata in the seed devices has the same fsid as the fsid
of the seed filesystem which is on the seed device, so we should check
them by the current filesystem. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:43 -07:00
David Sterba 2fad4e83e1 btrfs: wake up transaction thread from SYNC_FS ioctl
The transaction thread may want to do more work, namely it pokes the
cleaner ktread that will start processing uncleaned subvols.

This can be triggered by user via the 'btrfs fi sync' command, otherwise
there was a delay up to 30 seconds before the cleaner started to clean
old snapshots.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:42 -07:00
Wang Shilong c01a5c074c Btrfs: fix wrong max inline data size limit
inline data is stored from offset of @disk_bytenr in
struct btrfs_file_extent_item. So substracting total
size of struct btrfs_file_extent_item is wrong, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:40 -07:00
Wang Shilong 354877befa Btrfs: fix off-by-one in cow_file_range_inline()
Btrfs could still inline file data if its size is same as
page size, so don't skip max value here.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:39 -07:00
Wang Shilong 7816030eb4 Btrfs: fall into nocompression codes quickly if possible
If flag NOCOMPRESS is set which means bad compression ratio,
we could avoid call cow_file_range_async() for this case earlier.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:38 -07:00
Wang Shilong f79707b092 Btrfs: fix wrong skipping compression for an inode
If a file's compression ratios is bad, we will set NOCOMPRESS
flag for it, and it will skip compression for that inode next time.

However, if we remount fs to COMPRESS_FORCE, it still should try
if we could compress pages for that inode, this patch fix wrong
check for this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:36 -07:00
Fabian Frederick d447d0da44 Btrfs: fix sparse warning
Fix the following sparse warning:
fs/btrfs/send.c:518:51: warning: incorrect type in argument 2 (different address spaces)
fs/btrfs/send.c:518:51:    expected char const [noderef] <asn:1>*<noident>
fs/btrfs/send.c:518:51:    got char *

We can safely use (const char __user *) with set_fs(KERNEL_DS)

__force added to avoid sparse-all warning:
fs/btrfs/send.c:518:40: warning: cast adds address space to expression (<asn:1>)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Zach Brown <zab@zabbo.net>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:35 -07:00
HIMANGI SARAOGI 14586651ed Btrfs: use BUG_ON
Use BUG_ON(x) rather than if(x) BUG();

The semantic patch that fixes this problem is as follows:

// <smpl>
@@ identifier x; @@
-if (x) BUG();
+BUG_ON(x);
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:34 -07:00
Sergey Senozhatsky 7880991344 btrfs compression: merge inflate and deflate z_streams
`struct workspace' used for zlib compression contains two zlib
z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm'
used in zlib_decompress/zlib_decompress_biovec(). None of these
functions use `inf_strm' and `def_strm' simultaniously, meaning that
for every compress/decompress operation we need only one z_stream
(out of two available).

`inf_strm' and `def_strm' are different in size of ->workspace. For
inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for
deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib
returns the following workspace sizes, correspondingly: 42312 and 268104
(+ guard pages).

Keep only one `z_stream' in `struct workspace' and use it for both
compression and decompression. Hence, instead of vmalloc() of two
z_stream->worskpace-s, allocate only one of size:
	max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize())

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:33 -07:00
Filipe Manana 555e128640 Btrfs: set error return value in btrfs_get_blocks_direct
We were returning with 0 (success) because we weren't extracting the
error code from em (PTR_ERR(em)). Fix it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:32 -07:00
Filipe Manana 27a3507de9 Btrfs: reduce size of struct extent_state
The tree field of struct extent_state was only used to figure out if
an extent state was connected to an inode's io tree or not. For this
we can just use the rb_node field itself.

On a x86_64 system with this change the sizeof(struct extent_state) is
reduced from 96 bytes down to 88 bytes, meaning that with a page size
of 4096 bytes we can now store 46 extent states per page instead of 42.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:30 -07:00
Fabian Frederick 6f84e23646 btrfs: use PTR_ERR_OR_ZERO
replace IS_ERR/PTR_ERR

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:29 -07:00
Wang Shilong 29549aec76 Btrfs: print btrfs specific info for some fatal error cases
Marc argued that if there are several btrfs filesystems mounted,
while users even don't know which filesystem hit the corrupted
errors something like generation verification failure.

Since @extent_buffer structure has a member @fs_info, let's output
btrfs device info.

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:28 -07:00
Miao Xie d20983b40e Btrfs: fix writing data into the seed filesystem
If we mounted a seed filesystem with degraded option, and then added a new
device into the seed filesystem, then we found adding device failed because
of the IO failure.

Steps to reproduce:
 # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1>
 # btrfstune -S 1 <dev0>
 # mount <dev0> -o degraded <mnt>
 # btrfs device add -f <dev2> <mnt>

It is because the original didn't set the chunk on the seed device to be
read-only if the degraded flag was set. It was introduced by patch f48b90756,
which fixed the problem the raid1 filesystem became read-only after one device
of it was missing. But this fix method was not right, we should set the read-only
flag according to the number of the missing devices, not the degraded mount
option, if the number of the missing devices is less than the max error number
that the profile of the chunk tolerates, we don't set it to be read-only.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:27 -07:00
Wang Shilong 47059d930f Btrfs: make defragment work with nodatacow option
Btrfs defragment will utilize COW feature, which means this
did not work for nodatacow option, this problem was detected
by xfstests generic/018 with nodatacow mount option.

Fix this problem by forcing cow for a extent with state
@EXTETN_DEFRAG setting.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:26 -07:00
Satoru Takeuchi 48fcc3ff7d btrfs: label should not contain return char
Rediffed remaining parts of original patch from Anand Jain.  This makes
sure to avoid trailing newlines in the btrfs label output

reproducer.sh:

===============================================================================

TEST_DEV=/dev/vdb
TEST_DIR=/home/sat/mnt

umount /home/sat/mnt

mkfs.btrfs -f $TEST_DEV
UUID=$(btrfs fi show $TEST_DEV | head -1 | sed -e 's/.*uuid: \([-0-9a-z]*\)$/\1/')
mount $TEST_DEV $TEST_DIR
LABELFILE=/sys/fs/btrfs/$UUID/label

echo "Test for empty label..." >&2
LINES="$(cat $LABELFILE | wc -l | awk '{print $1}')"
RET=0

if [ $LINES -eq 0 ] ; then
    echo '[PASS] Trailing \n is removed correctly.' >&2
else
    echo '[FAIL] Trailing \n still exists.' >&2
    RET=1
fi

echo "Test for non-empty label..." >&2

echo testlabel >$LABELFILE
LINES="$(cat $LABELFILE | wc -l | awk '{print $1}')"

if [ $LINES -eq 1 ] ; then
    echo '[PASS] Trailing \n is removed correctly.' >&2
else
    echo '[FAIL] Trailing \n still exists.' >&2
    RET=1
fi

exit $RET
===============================================================================

Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:25 -07:00
Anand Jain ec95d4917b btrfs: device delete must be sysloged
as in the disk add patch, disk detached from the volume must be
recorded in the syslog as well for the same reason.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:23 -07:00
Anand Jain 43d2076168 btrfs: device add must be sysloged
when we add a new disk to the mounted btrfs we don't record it
as of now, disk add is a critical change of btrfs configuration,
it must be recorded in the syslog to help offline investigations
of customer problems when reported.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:20 -07:00
Wang Shilong 4027e0f4c4 Btrfs: clear compress-force when remounting with compress option
Steps to reproduce:
 # mkfs.btrfs -f /dev/sdb
 # mount /dev/sdb /mnt -o compress-force=lzo
 # mount /dev/sdb /mnt -o remount,compress=zlib
 # cat /proc/mounts

Remounting from compress-force to compress could not clear compress-force
option. The problem is there is no way for users to clear compress-force
option separately.

Fix this problem by clearing @FORCE_COMPRESS flag when remounting to
compress=xxx.

Suggested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:19 -07:00
David Sterba ed6078f703 btrfs: use DIV_ROUND_UP instead of open-coded variants
The form

  (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT

is equivalent to

  (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE

The rest is a simple subsitution, no difference in the generated
assembly code.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:17 -07:00
David Sterba 4e54b17ad6 btrfs: clean away stripe_align helper
Only wraps the ALIGN macro.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:16 -07:00
David Sterba 707e8a0715 btrfs: use nodesize everywhere, kill leafsize
The nodesize and leafsize were never of different values. Unify the
usage and make nodesize the one. Cleanup the redundant checks and
helpers.

Shaves a few bytes from .text:

  text    data     bss     dec     hex filename
852418   24560   23112  900090   dbbfa btrfs.ko.before
851074   24584   23112  898770   db6d2 btrfs.ko.after

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:14 -07:00
David Sterba 962a298f35 btrfs: kill the key type accessor helpers
btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:12 -07:00
David Sterba 3abdbd780e btrfs: make close_ctree return void
There's no user of the return value and we can get rid of the comment in
put_super.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:11 -07:00
David Sterba 57cdc8db21 btrfs: cleanup ino cache members of btrfs_root
The naming is confusing, generic yet used for a specific cache. Add a
prefix 'ino_' or rename appropriately.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:09 -07:00
David Sterba c6f83c74fd btrfs: clenaup: don't call btrfs_release_path before free_path
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:08 -07:00
David Sterba 32471dc2ba btrfs: remove obsolete comment in btrfs_clean_one_deleted_snapshot
The comment applied when there was a BUG_ON.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-17 13:37:07 -07:00
J. Bruce Fields 70b2823535 nfsd4: clarify how grace period ends
The grace period is ended in two steps--first userland is notified that
the grace period is now long enough that any clients who have not yet
reclaimed can be safely forgotten, then we flip the switch that forbids
reclaims and allows new opens.  I had to think a bit to convince myself
that the ordering was right here.  Document it.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17 16:33:19 -04:00
J. Bruce Fields bea57fe45b nfsd4: stop grace_time update at end of grace period
The attempt to automatically set a new grace period time at the end of
the grace period isn't really helpful.  We'll probably shut down and
reboot before we actually make use of the new grace period time anyway.
So may as well leave it up to the init system to get this right.

This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
change from what they set it to.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-17 16:33:18 -04:00
Jeff Layton 65decb650a nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
In the case of v4.0 clients, we may call into the "create" client
tracking operation multiple times (once for each openowner). Upcalling
for each one of those is wasteful and slow however. We can skip doing
further "create" operations after the first one if we know that one has
already been done.

v4.1+ clients generally only call into this function once (on
RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
lift the grace period early since the timestamp has a different meaning
in the case where the client is expected to issue a RECLAIM_COMPLETE.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:17 -04:00
Jeff Layton 788a7914ad nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
which basically results in an upcall every time we call into the client
tracking ops.

Change it to set this bit on a successful "check" or "create" request,
and clear it on a "remove" request.  Also, check to see if that bit is
set before upcalling on a "check" or "remove" request, and skip
upcalling appropriately, depending on its state.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:17 -04:00
Jeff Layton d682e750ce nfsd: serialize nfsdcltrack upcalls for a particular client
In a later patch, we want to add a flag that will allow us to reduce the
need for upcalls. In order to handle that correctly, we'll need to
ensure that racing upcalls for the same client can't occur. In practice
it should be rare for this to occur with a well-behaved client, but it
is possible.

Convert one of the bits in the cl_flags field to be an upcall bitlock,
and use it to ensure that upcalls for the same client are serialized.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:16 -04:00
Jeff Layton d4318acd5d nfsd: pass extra info in env vars to upcalls to allow for early grace period end
In order to support lifting the grace period early, we must tell
nfsdcltrack what sort of client the "create" upcall is for. We can't
reliably tell if a v4.0 client has completed reclaiming, so we can only
lift the grace period once all the v4.1+ clients have issued a
RECLAIM_COMPLETE and if there are no v4.0 clients.

Also, in order to lift the grace period, we have to tell userland when
the grace period started so that it can tell whether a RECLAIM_COMPLETE
has been issued for each client since then.

Since this is all optional info, we pass it along in environment
variables to the "init" and "create" upcalls. By doing this, we don't
need to revise the upcall format. The UMH upcall can simply make use of
this info if it happens to be present. If it's not then it can just
avoid lifting the grace period early.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:15 -04:00
Jeff Layton 7f5ef2e900 nfsd: add a v4_end_grace file to /proc/fs/nfsd
Allow a privileged userland process to end the v4 grace period early.
Writing "Y", "y", or "1" to the file will cause the v4 grace period to
be lifted.  The basic idea with this will be to allow the userland
client tracking program to lift the grace period once it knows that no
more clients will be reclaiming state.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:14 -04:00
Jeff Layton d68e3c4aa4 lockd: add a /proc/fs/lockd/nlm_end_grace file
Add a new procfile that will allow a (privileged) userland process to
end the NLM grace period early. The basic idea here will be to have
sm-notify write to this file, if it sent out no NOTIFY requests when
it runs. In that situation, we can generally expect that there will be
no reclaim requests so the grace period can be lifted early.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:13 -04:00
Jeff Layton 3b3e7b7223 nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
As stated in RFC 5661, section 18.51.3:

    Once a RECLAIM_COMPLETE is done, there can be no further reclaim
    operations for locks whose scope is defined as having completed
    recovery.  Once the client sends RECLAIM_COMPLETE, the server will
    not allow the client to do subsequent reclaims of locking state for
    that scope and, if these are attempted, will return
    NFS4ERR_NO_GRACE.

Ensure that we enforce that requirement.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:13 -04:00
Jeff Layton 919b8049f0 nfsd: remove redundant boot_time parm from grace_done client tracking op
Since it's stored in nfsd_net, we don't need to pass it in separately.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:12 -04:00
Jeff Layton f779002965 lockd: move lockd's grace period handling into its own module
Currently, all of the grace period handling is part of lockd. Eventually
though we'd like to be able to build v4-only servers, at which point
we'll need to put all of this elsewhere.

Move the code itself into fs/nfs_common and have it build a grace.ko
module. Then, rejigger the Kconfig options so that both nfsd and lockd
enable it automatically.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-17 16:33:11 -04:00
Jan Kara 52362810be ocfs2: Don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ocfs2 supports and with
addition of project quotas these two numbers stop matching. So make
ocfs2 use its private definition.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17 11:59:12 +02:00
Jan Kara aca6061773 reiserfs: Don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types reiserfs supports and with
addition of project quotas these two numbers stop matching. So make
reiserfs use its private definition.

CC: reiserfs-devel@vger.kernel.org
CC: Jeff Mahoney <jeffm@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17 11:59:12 +02:00
Jan Kara a93114e468 ext3: Don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports. This
isn't necessarily the number of types ext3 supports and with addition of
project quotas these two numbers stop matching. So make ext3 use its
private definition.

CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17 11:59:11 +02:00
Jan Kara 6fb1ca92a6 udf: Fix race between write(2) and close(2)
Currently write(2) updating i_size and close(2) of the file can race in
such a way that udf_truncate_tail_extent() called from
udf_file_release() sees old i_size but already new extents added by the
running write call. This results in complaints like:
  UDF-fs: warning (device vdb2): udf_truncate_tail_extent: Too long extent
    after EOF in inode 877: i_size: 0 lbcount: 1073739776 extent 0+1073739776
  UDF-fs: error (device vdb2): udf_truncate_tail_extent: Extent after EOF
    in inode 877

Fix the problem by grabbing i_mutex in udf_file_release() to be sure
i_size is consistent with current state of extent list. Also avoid
truncating tail extent unnecessarily when the file is still open for
writing.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-17 11:59:11 +02:00
Filipe Manana 125c4cf9f3 Btrfs: set inode's logged_trans/last_log_commit after ranged fsync
When a ranged fsync finishes if there are still extent maps in the modified
list, still set the inode's logged_trans and last_log_commit. This is important
in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
inode ref gets deleted from the log and the respective dentries in its parent
are deleted too from the log (if the parent directory was fsync'ed in the same
transaction).

Instead make btrfs_inode_in_log() return false if the list of modified extent
maps isn't empty.

This is an incremental on top of the v4 version of the patch:

    "Btrfs: fix fsync data loss after a ranged fsync"

which was added to its v5, but didn't make it on time.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-16 16:12:19 -07:00
Dmitry Monakhov 844749764b ext4: explicitly inform user about orphan list cleanup
Production fs likely compiled/mounted w/o jbd debugging, so orphan
list clearing will be silent.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16 14:52:03 -04:00
Dmitry Monakhov 1245799f75 jbd2: jbd2_log_wait_for_space improve error detetcion
If EIO happens after we have dropped j_state_lock, we won't notice
that the journal has been aborted.  So it is reasonable to move this
check after we have grabbed the j_checkpoint_mutex and re-grabbed the
j_state_lock.  This patch helps to prevent false positive complain
after EIO.

#DMESG:
__jbd2_log_wait_for_space: needed 8448 blocks and only had 8386 space available
__jbd2_log_wait_for_space: no way to get more journal space in ram1-8
------------[ cut here ]------------
WARNING: CPU: 15 PID: 6739 at fs/jbd2/checkpoint.c:168 __jbd2_log_wait_for_space+0x188/0x200()
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 15 PID: 6739 Comm: fsstress Tainted: G        W      3.17.0-rc2-00429-g684de57 #139
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
 00000000000000a8 ffff88077aaab878 ffffffff815c1a8c 00000000000000a8
 0000000000000000 ffff88077aaab8b8 ffffffff8106ce8c ffff88077aaab898
 ffff8807c57e6000 ffff8807c57e6028 0000000000002100 ffff8807c57e62f0
Call Trace:
 [<ffffffff815c1a8c>] dump_stack+0x51/0x6d
 [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
 [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
 [<ffffffff812419f8>] __jbd2_log_wait_for_space+0x188/0x200
 [<ffffffff8123be9a>] start_this_handle+0x4da/0x7b0
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810aba87>] ? lockdep_init_map+0xe7/0x180
 [<ffffffff8123c5bc>] jbd2__journal_start+0xdc/0x1d0
 [<ffffffff811f2414>] ? __ext4_new_inode+0x7f4/0x1330
 [<ffffffff81222a38>] __ext4_journal_start_sb+0xf8/0x110
 [<ffffffff811f2414>] __ext4_new_inode+0x7f4/0x1330
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff812025bb>] ext4_create+0x8b/0x150
 [<ffffffff8117fe3b>] vfs_create+0x7b/0xb0
 [<ffffffff8118097b>] do_last+0x7db/0xcf0
 [<ffffffff8117e31d>] ? inode_permission+0x4d/0x50
 [<ffffffff811845d2>] path_openat+0x242/0x590
 [<ffffffff81191a76>] ? __alloc_fd+0x36/0x140
 [<ffffffff81184a6a>] do_filp_open+0x4a/0xb0
 [<ffffffff81191b61>] ? __alloc_fd+0x121/0x140
 [<ffffffff81172f20>] do_sys_open+0x170/0x220
 [<ffffffff8117300e>] SyS_open+0x1e/0x20
 [<ffffffff811715d6>] SyS_creat+0x16/0x20
 [<ffffffff815c7e12>] system_call_fastpath+0x16/0x1b
---[ end trace cd71c831f82059db ]---

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-16 14:50:50 -04:00
Darrick J. Wong 064d83892e jbd2: free bh when descriptor block checksum fails
Free the buffer head if the journal descriptor block fails checksum
verification.

This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum
verify error in do_one_pass".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Cc: stable@vger.kernel.org
2014-09-16 14:43:09 -04:00
Darrick J. Wong a0626e7595 ext4: check EA value offset when loading
When loading extended attributes, check each entry's value offset to
make sure it doesn't collide with the entries.

Without this check it is easy to crash the kernel by mounting a
malicious FS containing a file with an EA wherein e_value_offs = 0 and
e_value_size > 0 and then deleting the EA, which corrupts the name
list.

(See the f_ea_value_crash test's FS image in e2fsprogs for an example.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-09-16 14:34:59 -04:00
David Howells c06cfb08b8 KEYS: Remove key_type::match in favour of overriding default by match_preparse
A previous patch added a ->match_preparse() method to the key type.  This is
allowed to override the function called by the iteration algorithm.
Therefore, we can just set a default that simply checks for an exact match of
the key description with the original criterion data and allow match_preparse
to override it as needed.

The key_type::match op is then redundant and can be removed, as can the
user_match() function.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
2014-09-16 17:36:06 +01:00
Linus Torvalds 37504a3be9 Here are a number of small fixes for GFS2. There is a fix for FIEMAP
on large sparse files, a negative dentry hashing fix, a fix for
 flock, and a bug fix relating to d_splice_alias usage. There are
 also (patches 1 and 5) a couple of updates which are less
 critical, but small and low risk.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJUGAT8AAoJEMrg3m4a/8jSevIQAJh0sfUAzDZ2p1etZw2OR2bm
 0ritvFVkOB4CZ97TS7DvItoqcke+e2ehT8ykIzCNZhJotld7jsitzuQm0ugsMvG2
 ltrUil2eqyeBFt1dJk9m366+akbdocVhSpVL/6hLRJI9GTbfWr26jHwUTIQisq5u
 Pdr42lLwwAPrj9D7GQRA1jSZLbA36teRGUuV+3CFxAWVgqG0kOmq7iGd27LZZ/hP
 m4hjPvdczTfhM3kjNvohicLXQiXhsJdlOZsZKfMTSGmJ2WsMRKoK2PZx2eRmDEEe
 ypfkboyfcHCJlbIx+x0qPRheYMC8a2TX3JMb3l6eaV0mZoeExB2kPHeBlfu1jddp
 XwLvbs/fbfC96Q6WPKM3JdBDY9DBf5t30Is0VxNJeMH2ERIubaLrOACeD4P7V5mX
 PIc3/bcDtNMOirgdd2NGxGR0TxmyhKs7pbtWC+e6Wud04qgsK1JtU8cNro0j8MRE
 Rfay0RY+8OjHfDn+shQYl4mh8MF2u3JF3/ThcZjtDfAB0AIhF1Bs5Ihl12B45ov1
 7ah0J+8Yk7gwH++KDFNv6XGNzjpjkbBEr9FGLLN3DNDysk4ibEYP3NwEf0Vnme7A
 EhYv0IFgUbgdLrU3owA7rRh38A8W23cr78qRSM66/TEGYqBi9+4/DwLE4EIorGYA
 mKSpLXhcNtCIrQGNCz3Q
 =5NIi
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

Pull gfs2 fixes from Steven Whitehouse:
 "Here are a number of small fixes for GFS2.

  There is a fix for FIEMAP on large sparse files, a negative dentry
  hashing fix, a fix for flock, and a bug fix relating to d_splice_alias
  usage.

  There are also (patches 1 and 5) a couple of updates which are less
  critical, but small and low risk"

* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: fix d_splice_alias() misuses
  GFS2: Don't use MAXQUOTAS value
  GFS2: Hash the negative dentry during inode lookup
  GFS2: Request demote when a "try" flock fails
  GFS2: Change maxlen variables to size_t
  GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
2014-09-16 07:47:04 -07:00
James Hogan a060dc5010 vfs: workaround gcc <4.6 build error in link_path_walk()
Commit d6bb3e9075 ("vfs: simplify and shrink stack frame of
link_path_walk()") introduced build problems with GCC versions older
than 4.6 due to the initialisation of a member of an anonymous union in
struct qstr without enclosing braces.

This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and
causes the following build error:

  fs/namei.c: In function 'link_path_walk':
  fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer

This is worked around by adding explicit braces.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206

Fixes: d6bb3e9075 (vfs: simplify and shrink stack frame of link_path_walk())
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-16 07:44:54 -07:00
Steve French 364d42930d Fix mfsymlinks file size check
If the mfsymlinks file size has changed (e.g. the file no longer
represents an emulated symlink) we were not returning an error properly.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
2014-09-16 06:48:20 -05:00
Jaegeuk Kim 60979115a6 f2fs: fix double lock for inode page during roll-foward recovery
If the inode is same and its data index are needed to truncate, we can fall into
double lock for its inode page via get_dnode_of_data.

Error case is like this.

1. write data 1, 2, 3, 4, 5 in inode #4.
2. write data 100, 102, 103, 104, 105 in dnode #6 of inode #4.
3. sync
4. update data 100->106 in dnode #6.
5. fsync inode #4.
6. power-cut

-> Then,
1. go back to #3's checkpoint
2. in do_recover_data, get_dnode_of_data() gets inode #4.
3. detect 100->106 in dnode #6.
4. check_index_in_prev_nodes tries to truncate 100 in dnode #6.
5. to trigger truncate_hole, get_dnode_of_data should grab inode #4.
6. detect *kernel hang*

This patch should resolve that bug.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16 04:10:47 -07:00
Huang Ying c6e489305e f2fs: fix a race condition in next_free_nid
The nm_i->fcnt checking is executed before spin_lock, so if another
thread delete the last free_nid from the list, the wrong nid may be
gotten.  So fix the race condition by moving the nm_i->fnct checking
into spin_lock.

Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16 04:10:46 -07:00
Huang Ying 7704182387 f2fs: use nm_i->next_scan_nid as default for next_free_nid
Now, if there is no free nid in nm_i->free_nid_list, 0 may be saved
into next_free_nid of checkpoint, this may cause useless scanning for
next mount.  nm_i->next_scan_nid should be a better default value than
0.

Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16 04:10:45 -07:00
Jaegeuk Kim c1ce1b02bb f2fs: give an option to enable in-place-updates during fsync to users
If user wrote F2FS_IPU_FSYNC:4 in /sys/fs/f2fs/ipu_policy, f2fs_sync_file
only starts to try in-place-updates.
And, if the number of dirty pages is over /sys/fs/f2fs/min_fsync_blocks, it
keeps out-of-order manner. Otherwise, it triggers in-place-updates.

This may be used by storage showing very high random write performance.

For example, it can be used when,

Seq. writes (Data) + wait + Seq. writes (Node)

is pretty much slower than,

Rand. writes (Data)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16 04:10:44 -07:00
Jaegeuk Kim a7ffdbe22c f2fs: expand counting dirty pages in the inode page cache
Previously f2fs only counts dirty dentry pages, but there is no reason not to
expand the scope.

This patch changes the names on the management of dirty pages and to count
dirty pages in each inode info as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-16 04:10:39 -07:00
Steve French 69af38dbc5 Update version number displayed by modinfo for cifs.ko
Update cifs.ko version to 2.05

Signed-off-by: Steve French <smfrench@gmail.com>w
2014-09-16 05:31:01 -05:00
Arnd Bergmann 116ae5e2b0 cifs: remove dead code
cifs provides two dummy functions 'sess_auth_lanman' and
'sess_auth_kerberos' for the case in which the respective
features are not defined. However, the caller is also under
an #ifdef, so we just get warnings about unused code:

fs/cifs/sess.c:1109:1: warning: 'sess_auth_kerberos' defined but not used [-Wunused-function]
 sess_auth_kerberos(struct sess_data *sess_data)

Removing the dead functions gets rid of the warnings without
any downsides that I can see.

(Yalin Wang reported the identical problem and fix so added him)

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-16 05:30:11 -05:00
Steve French a5c3e1c725 Revert "cifs: No need to send SIGKILL to demux_thread during umount"
This reverts commit 52a3624444.

Causes rmmod to fail for at least 7 seconds after unmount which
makes automated testing a little harder when reloading cifs.ko
between test runs.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
CC: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-16 05:26:24 -05:00
Stephen Rothwell b262b35c2c pnfs/blocklayout: include vmalloc.h for __vmalloc
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-15 19:33:28 -04:00
Linus Torvalds d6bb3e9075 vfs: simplify and shrink stack frame of link_path_walk()
Commit 9226b5b440 ("vfs: avoid non-forwarding large load after small
store in path lookup") made link_path_walk() always access the
"hash_len" field as a single 64-bit entity, in order to avoid mixed size
accesses to the members.

However, what I didn't notice was that that effectively means that the
whole "struct qstr this" is now basically redundant.  We already
explicitly track the "const char *name", and if we just use "u64
hash_len" instead of "long len", there is nothing else left of the
"struct qstr".

We do end up wanting the "struct qstr" if we have a filesystem with a
"d_hash()" function, but that's a rare case, and we might as well then
just squirrell away the name and hash_len at that point.

End result: fewer live variables in the loop, a smaller stack frame, and
better code generation.  And we don't need to pass in pointers variables
to helper functions any more, because the return value contains all the
relevant information.  So this removes more lines than it adds, and the
source code is clearer too.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-15 10:51:07 -07:00
Steve French da80659d4a [SMB3] Fix oops when creating symlinks on smb3
We were not checking for symlink support properly for SMB2/SMB3
mounts so could oops when mounted with mfsymlinks when try
to create symlink when mfsymlinks on smb2/smb3 mounts

Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org> # 3.14+
CC: Sachin Prabhu <sprabhu@redhat.com>
2014-09-15 03:04:50 -05:00
Linus Torvalds 83373f7028 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "double iput() on failure exit in lustre, racy removal of spliced
  dentries from ->s_anon in __d_materialise_dentry() plus a bunch of
  assorted RCU pathwalk fixes"

The RCU pathwalk fixes end up fixing a couple of cases where we
incorrectly dropped out of RCU walking, due to incorrect initialization
and testing of the sequence locks in some corner cases.  Since dropping
out of RCU walk mode forces the slow locked accesses, those corner cases
slowed down quite dramatically.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  be careful with nd->inode in path_init() and follow_dotdot_rcu()
  don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
  fix bogus read_seqretry() checks introduced in b37199e
  move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
  [fix] lustre: d_make_root() does iput() on dentry allocation failure
2014-09-14 17:37:36 -07:00
Linus Torvalds 9226b5b440 vfs: avoid non-forwarding large load after small store in path lookup
The performance regression that Josef Bacik reported in the pathname
lookup (see commit 99d263d4c5 "vfs: fix bad hashing of dentries") made
me look at performance stability of the dcache code, just to verify that
the problem was actually fixed.  That turned up a few other problems in
this area.

There are a few cases where we exit RCU lookup mode and go to the slow
serializing case when we shouldn't, Al has fixed those and they'll come
in with the next VFS pull.

But my performance verification also shows that link_path_walk() turns
out to have a very unfortunate 32-bit store of the length and hash of
the name we look up, followed by a 64-bit read of the combined hash_len
field.  That screws up the processor store to load forwarding, causing
an unnecessary hickup in this critical routine.

It's caused by the ugly calling convention for the "hash_name()"
function, and easily fixed by just making hash_name() fill in the whole
'struct qstr' rather than passing it a pointer to just the hash value.

With that, the profile for this function looks much smoother.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-14 17:28:32 -07:00
Steve French 2ae83bf938 [CIFS] Fix setting time before epoch (negative time values)
xfstest generic/258 sets the time on a file to a negative value
(before 1970) which fails since do_div can not handle negative
numbers.  In addition 'normal' division of 64 bit values does
not build on 32 bit arch so have to workaround this by special
casing negative values in cifs_NTtimeToUnix

Samba server also has a bug with this (see samba bugzilla 7771)
but it works to Windows server.

Signed-off-by: Steve French <smfrench@gmail.com>
2014-09-14 17:06:36 -05:00
Al Viro 4023bfc9f3 be careful with nd->inode in path_init() and follow_dotdot_rcu()
in the former we simply check if dentry is still valid after picking
its ->d_inode; in the latter we fetch ->d_inode in the same places
where we fetch dentry and its ->d_seq, under the same checks.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-14 14:24:47 -04:00
Al Viro 7bd88377d4 don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu()
return the value instead, and have path_init() do the assignment.  Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination.  This one should go where
it went.

To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers.  And do the same to set_root(), to keep them
in sync.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-14 14:19:44 -04:00
Al Viro f5be3e2912 fix bogus read_seqretry() checks introduced in b37199e
read_seqretry() returns true on mismatch, not on match...

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13 22:14:16 -04:00
Al Viro 6f18493e54 move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon)
and lock the right list there

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-13 22:14:03 -04:00
Linus Torvalds 99d263d4c5 vfs: fix bad hashing of dentries
Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77bd ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:

 "The test case is essentially

      for (i = 0; i < 1000000; i++)
              mkdir("a$i");

  On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
  dir/sec with 3.10.  This is because we spend waaaaay more time in
  __d_lookup on 3.10 than in 3.2.

  The new hashing function for strings is suboptimal for <
  sizeof(unsigned long) string names (and hell even > sizeof(unsigned
  long) string names that I've tested).  I broke out the old hashing
  function and the new one into a userspace helper to get real numbers
  and this is what I'm getting:

      Old hash table had 1000000 entries, 0 dupes, 0 max dupes
      New hash table had 12628 entries, 987372 dupes, 900 max dupes
      We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

  My test does the hash, and then does the d_hash into a integer pointer
  array the same size as the dentry hash table on my system, and then
  just increments the value at the address we got to see how many
  entries we overlap with.

  As you can see the old hash function ended up with all 1 million
  entries in their own bucket, whereas the new one they are only
  distributed among ~12.5k buckets, which is why we're using so much
  more CPU in __d_lookup".

The reason for this hash regression is two-fold:

 - On 64-bit architectures the down-mixing of the original 64-bit
   word-at-a-time hash into the final 32-bit hash value is very
   simplistic and suboptimal, and just adds the two 32-bit parts
   together.

   In particular, because there is no bit shuffling and the mixing
   boundary is also a byte boundary, similar character patterns in the
   low and high word easily end up just canceling each other out.

 - the old byte-at-a-time hash mixed each byte into the final hash as it
   hashed the path component name, resulting in the low bits of the hash
   generally being a good source of hash data.  That is not true for the
   word-at-a-time case, and the hash data is distributed among all the
   bits.

The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible.  We already have the
"hash_32|64()" functions to do that.

Reported-by: Josef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-13 11:30:10 -07:00
Al Viro cfb2f9d5c9 GFS2: fix d_splice_alias() misuses
Callers of d_splice_alias(dentry, inode) don't need iput(), neither
on success nor on failure.  Either the reference to inode is stored
in a previously negative dentry, or it's dropped.  In either case
inode reference the caller used to hold is consumed.

__gfs2_lookup() does iput() in case when d_splice_alias() has failed.
Double iput() if we ever hit that.  And gfs2_create_inode() ends up
not only with double iput(), but with link count dropped to zero - on
an inode it has just found in directory.

Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-12 20:58:55 +01:00
Linus Torvalds 602b536629 NFS client fixes for 3.17
Highlights:
 - Fix a kernel warning when removing /proc/net/nfsfs
 - Revert commit 49a4bda22e due to Oopses
 - Fix a typo in the pNFS file layout commit code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUEyYLAAoJEGcL54qWCgDylFwP/08KZad4r9Od5cONfv4LmaKu
 a+8s/ys9haoA4vaNCm1HTH216pJUUC2S31P1AZUzDoIt3Eox8QCa7VhMztoGC1G3
 U+3/k7LmBzNfJCeNcvr5WV/o/QMMbB6QJ4wQWoBOx5H+hqDuEnMd/BBMaU1qEcGa
 8AzAlPTCnNJyg+mah6xHOTx21WfukaVVJtWHKrY1vcN8cYTgaOm0vHqRQQraBYfX
 lIFH55+wKuxa+IbPV6NZ3yl7HG963IYtUkP3faRlXh91616Xthbec3yhyuoxvZzd
 5oRxPwtLIPP47kwAL0nJVGxI4wDE8q2a35kv8akw0waLGzWab6NmI5MADBamrZOP
 Vnv4wlrE5vgDvIEG42oKfPCRo5qB+Nc79wQzQ62pDRT+OkB9PbYtIc+n/l8HD8jm
 JfH09duW203D8llbJLa/YeJSQC9BeV1coyduB9WTDZBNVrAQPAVgWMO+XZLCGGFR
 l5f6vNNQbgCFx9hewCnnv0COUJuf4/MN3mKYRSO/zH/oRR7rfFnqmHtD2rwqOizk
 PPaF6qXY8IY4NIj0UF1JYFYFLPN65z1JI+XOfDSfGhdGVrWEXtkC2k1tEhIQ71rU
 1riULq67vGFfPG4SJ43Xf2JwvcpFni2VguFeOw05xRsC2RioRbzr3GucWVGsxQ66
 cK2AS2MEcBYFWqodXZky
 =+QUa
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights:
   - fix a kernel warning when removing /proc/net/nfsfs
   - revert commit 49a4bda22e due to Oopses
   - fix a typo in the pNFS file layout commit code"

* tag 'nfs-for-3.17-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  pnfs: fix filelayout_retry_commit when idx > 0
  nfs: revert "nfs4: queue free_lock_state job submission to nfsiod"
  nfs: fix kernel warning when removing proc entry
2014-09-12 11:54:54 -07:00
Linus Torvalds 7ed641be75 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Filipe is doing a careful pass through fsync problems, and these are
  the fixes so far.  I'll have one more for rc6 that we're still
  testing.

  My big commit is fixing up some inode hash races that Al Viro found
  (thanks Al)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: use insert_inode_locked4 for inode creation
  Btrfs: fix fsync data loss after a ranged fsync
  Btrfs: kfree()ing ERR_PTRs
  Btrfs: fix crash while doing a ranged fsync
  Btrfs: fix corruption after write/fsync failure + fsync + log recovery
  Btrfs: fix autodefrag with compression
2014-09-12 11:53:30 -07:00
Peng Tao 88ac815cdb nfs41: change PNFS_LAYOUTRET_ON_SETATTR to only return on truncation to smaller size
Both blocks layout and objects layout want to use it to avoid CB_LAYOUTRECALL
but that should only happen if client is doing truncation to a smaller size.
For other cases, we let server decide if it wants to recall client's layouts.
Change PNFS_LAYOUTRET_ON_SETATTR to follow the logic and not to send
layoutreturn unnecessarily.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 14:03:20 -04:00
Anna Schumaker cb8c20fa53 NFS: Move NFS v3 acl functions to nfs3_fs.h
This code is internal to the v3 module, so other parts of the client
shouldn't have any knowledge of it.

nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist
anywhere so I remove the declarations while I'm here.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:50:26 -04:00
Anna Schumaker f08460dc23 NFS: Remove v3 not compiled check from validate_mount_data()
This check is already performed by the module loading code - if the
module can't be found then -EPROTONOSUPPORT will be returned.  Let's
handle v3 this way, too.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:50:20 -04:00
Anna Schumaker 00a36a1090 NFS: Move v3 declarations out of internal.h
I am generally against the "one big header file" approach, and
everything in the client includes this file.  Let's move all the NFS v3
declarations into a v3-only header file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:49:40 -04:00
Anna Schumaker f418c64b71 NFS: Unconditionally enable commit code
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:49:31 -04:00
Trond Myklebust 164ae58c3c pNFS/blocklayout: Remove a couple of unused variables
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:34:54 -04:00
Christoph Hellwig 84c9dee3ad pnfs: enable CB_NOTIFY_DEVICEID support
This code has been around for a while, but never was enabled, although
it is in a working shape.

Note that we implement NOTIFY_DEVICEID4_CHANGE identical to
NOTIFY_DEVICEID4_DELETE.  Given that in either case we can't do anything
but preventing further lookups of a given device ID there isn't much difference
in semantics for the two.  For the delete case the server MUST ensure that
there are no outstanding layouts, while for the change case it doesn't, but
that has little relevance to the client.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig 5c83746a0c pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing
This patches moves parsing of the GETDEVICEINFO XDR to kernel space, as well
as the management of complex devices.  The reason for that is we might have
multiple outstanding complex devices after a NOTIFY_DEVICEID4_CHANGE, which
device mapper or md can't handle as they claim devices exclusively.

But as is turns out simple striping / concatenation is fairly trivial to
implement anyway, so we make our life simpler by reducing the reliance
on blkmapd.  For now we still use blkmapd by feeding it synthetic SIMPLE
device XDR to translate device signatures to device numbers, but in the
long runs I have plans to eliminate it entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig 871760ce97 pnfs/blocklayout: move all rpc_pipefs related code into a single file
Create a file to house all the rpc_pipefs boilerplate code instead of
sprinkling it over a few files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig ca0fe1dfa5 pnfs/blocklayout: refactor extent processing
Factor out a helper for all per-extent work, and merge the now trivial
functions for lseg allocation and parsing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:49 -04:00
Christoph Hellwig 9cc4754117 pnfs/blocklayout: move extent processing to blocklayout.c
This isn't device(id) related, so move it into the main file.  Simple move
for now, the next commit will clean it up a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:49 -04:00
Christoph Hellwig 34dc93c2fc pnfs/blocklayout: allocate separate pages for the layoutcommit payload
Instead of overflowing the XDR send buffer with our extent list allocate
pages and pre-encode the layoutupdate payload into them.  We optimistically
allocate a single page use alloc_page and only switch to vmalloc when we
have more extents outstanding.  Currently there is only a single testcase
(xfstests generic/113) which can reproduce large enough extent lists for
this to occur.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:22:45 -04:00
Christoph Hellwig d4b18c3e00 pnfs: remove GETDEVICELIST implementation
The current GETDEVICELIST implementation is buggy in that it doesn't handle
cursors correctly, and in that it returns an error if the server returns
NFSERR_NOTSUPP.  Given that there is no actual need for GETDEVICELIST,
it has various issues and might get removed for NFSv4.2 stop using it in
the blocklayout driver, and thus the Linux NFS client as whole.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:54 -04:00
Christoph Hellwig fd41b4748b pnfs/objlayout: fix endianess annotation in objio_alloc_deviceid_node
The kbuild test robot complained about a new sparse warning in
objio_alloc_deviceid_node, but it turns out that this was just a moved
reference to an existing variable.  Fix it to have the right big endian
annotated type.

Note that there are some other endianess issues in this file that I didn't
bother to sort out as they involve global headers.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:43 -04:00
Christoph Hellwig 3e3f6b4e26 pnfs/blocklayout: remove some debugging
The kbuild test robot complained that we got the printk format wrong.
Let's just kill these printks instead of fixing them as there is not
point after the initial tree algorithm debugging.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:35 -04:00
NeilBrown f39c010479 NFS: remove BUG possibility in nfs4_open_and_get_state
commit 4fa2c54b51
    NFS: nfs4_do_open should add negative results to the dcache.

used "d_drop(); d_add();" to ensure that a dentry was hashed
as a negative cached entry.
This is not safe if the dentry has an non-NULL ->d_inode.
It will trigger a BUG_ON in d_instantiate().
In that case, d_delete() is needed.

Also, only d_add if the dentry is currently unhashed, it seems
pointless removed and re-adding it unchanged.

Reported-by: Christoph Hellwig <hch@infradead.org>
Fixes: 4fa2c54b51
Cc: Jeff Layton <jeff.layton@primarydata.com>
Link: http://lkml.kernel.org/r/20140908144525.GB19811@infradead.org
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:10:53 -04:00
Darrick J. Wong 684de57486 ext4: don't keep using page if inline conversion fails
If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Tao Ma <tm@tao.ma>
2014-09-11 11:45:12 -04:00
Darrick J. Wong df4763bea5 ext4: validate external journal superblock checksum
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:44:36 -04:00
Darrick J. Wong feb8c6d3dd jbd2: fix journal checksum feature flag handling
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want.  Rearrange the error
checking so that newer flags get complained about first.

Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:38:21 -04:00
Jens Axboe b207892b06 Merge branch 'for-linus' into for-3.18/core
A bit of churn on the for-linus side that would be nice to have
in the core bits for 3.18, so pull it in to catch us up and make
forward progress easier.

Signed-off-by: Jens Axboe <axboe@fb.com>

Conflicts:
	block/scsi_ioctl.c
2014-09-11 09:31:18 -06:00
Lukas Czerner c7f725435a ext4: provide separate operations for sysfs feature files
Currently sysfs feature files uses ext4_attr_ops as the file operations
to show/store data. However the feature files is not supposed to contain
any data at all, the sole existence of the file means that the module
support the feature. Moreover, none of the sysfs feature attributes
actually register show/store functions so that would not be a problem.

However if a sysfs feature attribute register a show or store function
we might be in trouble because the kobject in this case is _not_ embedded
in the ext4_sb_info structure as ext4_attr_show/store expect.

So just to be safe, provide separate empty sysfs_ops to use in
ext4_feat_ktype. This might safe us from potential problems in the
future. As a bonus we can "store" something more descriptive than
nothing in the files, so let it contain "enabled" to make it clear that
the feature is really present in the module.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:27:58 -04:00
Lukas Czerner 52c198c682 ext4: add sysfs entry showing whether the fs contains errors
Currently there is no easy way to tell that the mounted file system
contains errors other than checking for log messages, or reading the
information directly from superblock.

This patch adds new sysfs entries:

errors_count		(number of fs errors we encounter)
first_error_time	(unix timestamp for the first error we see)
last_error_time		(unix timestamp for the last error we see)

If the file system is not marked as containing errors then any of the
file will return 0. Otherwise it will contain valid information. More
details about the errors should as always be found in the logs.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:18:13 -04:00
Jan Kara a2d4a646e6 ext4: don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types ext4 supports. Although
ext4 will support project quotas, use ext4 private definition for
consistency with other filesystems.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-11 11:15:15 -04:00
Christoph Hellwig f0c63124a6 nfsd: update mtime on truncate
This fixes a failure in xfstests generic/313 because nfs doesn't update
mtime on a truncate.  The protocol requires this to be done implicity
for a size changing setattr.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-11 11:12:16 -04:00
Jan Kara a937cca270 GFS2: Don't use MAXQUOTAS value
MAXQUOTAS value defines maximum number of quota types VFS supports.
This isn't necessarily the number of types gfs2 supports and with
addition of project quotas these two numbers stop matching. So make gfs2
use its private definition.

CC: cluster-devel@redhat.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-11 10:59:56 +01:00
Benjamin Coddington 7b7a91152d GFS2: Hash the negative dentry during inode lookup
Fix a regression introduced by:
6d4ade986f GFS2: Add atomic_open support
where an early return misses d_splice_alias() which had been
adding the negative dentry.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-09-11 10:54:43 +01:00
Jaegeuk Kim 2403c155b8 f2fs: remove lengthy inode->i_ino
This patch is to remove lengthy name by adding a new variable.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-10 17:00:25 -07:00
Linus Torvalds 584f1adaf0 Merge branch 'akpm' (fixes from Andrew Morton)
Merge misc fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
  fsnotify/fdinfo: use named constants instead of hardcoded values
  kcmp: fix standard comparison bug
  mm/mmap.c: use pr_emerg when printing BUG related information
  shm: add memfd.h to UAPI export list
  checkpatch: allow commit descriptions on separate line from commit id
  sh: get_user_pages_fast() must flush cache
  eventpoll: fix uninitialized variable in epoll_ctl
  kernel/printk/printk.c: fix faulty logic in the case of recursive printk
  mem-hotplug: let memblock skip the hotpluggable memory regions in __next_mem_range()
2014-09-10 15:42:18 -07:00
Andrey Vagin 7e8824816b fs/notify: don't show f_handle if exportfs_encode_inode_fh failed
Currently we handle only ENOSPC.  In case of other errors the file_handle
variable isn't filled properly and we will show a part of stack.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Andrey Vagin 1fc98d11ca fsnotify/fdinfo: use named constants instead of hardcoded values
MAX_HANDLE_SZ is equal to 128, but currently the size of pad is only 64
bytes, so exportfs_encode_inode_fh can return an error.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Nicolas Iooss c680e41b3a eventpoll: fix uninitialized variable in epoll_ctl
When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this unintialized field has EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
produces unexpected messages in the audit log, such as (on a system
running SELinux):

    type=AVC msg=audit(1408212798.866:410): avc:  denied
    { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
    scontext=unconfined_u:unconfined_r:unconfined_t
    tcontext=unconfined_u:unconfined_r:unconfined_t
    tclass=capability2 permissive=1

    type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
    success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
    pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
    fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
    exe="/usr/bin/dbus-daemon"
    subj=unconfined_u:unconfined_r:unconfined_t key=(null)

("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")

Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.

Fixes: 4d7e30d989 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Linus Torvalds 7ec62d421b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes from Jan Kara:
 "Fixes for UDF handling of NFS handles and one fix for proper handling
  of corrupted media"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: saner calling conventions for udf_new_inode()
  udf: fix the udf_iget() vs. udf_new_inode() races
  udf: merge the pieces inserting a new non-directory object into directory
  udf: Set i_generation field
  udf: Properly detect stale inodes
  udf: Make udf_read_inode() and udf_iget() return error
  udf: Avoid infinite loop when processing indirect ICBs
  udf: Fold udf_fill_inode() into __udf_read_inode()
  udf: Avoid dir link count to go negative
2014-09-10 14:04:17 -07:00
Jeff Layton 8d11620e1e nfs: add __acquires and __releases annotations to seqfile start/stop routines
To make sparse happy...

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Jeff Layton dad2b015bb nfs: fix RCU cl_xprt handling in nfs_swap_activate/deactivate
sparse says:

fs/nfs/file.c:543:60: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:543:60:    expected struct rpc_xprt *xprt
fs/nfs/file.c:543:60:    got struct rpc_xprt [noderef] <asn:4>*cl_xprt
fs/nfs/file.c:548:53: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:548:53:    expected struct rpc_xprt *xprt
fs/nfs/file.c:548:53:    got struct rpc_xprt [noderef] <asn:4>*cl_xprt

cl_xprt is RCU-managed, so we need to take care to dereference and use
it while holding the RCU read lock.

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 08a899d5d9 nfs: setattr can only change regular file sizes
The VFS never calls setattr with ATTR_SIZE on anything but regular
files.  Remove the if check and turn it into an assert similar to
what some other file systems do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 20d655d619 pnfs/blocklayout: use the device id cache
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 30ff0603ca pnfs: add a nfs4_get_deviceid helper
This will be used by the block layout driver when splitting extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 9dd2fcd32f pnfs: add a common GETDEVICELIST implementation
At a simple helper to issue a GETDEVICELIST operation and pre-load
the device id cache based on the result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 661373b13d pnfs: factor GETDEVICEINFO implementations
Add support to the common pNFS core to issue GETDEVICEINFO calls on
a device ID cache miss.  The code is taken from the well debugged
file layout implementation and calls out to the layoutdriver through
a new alloc_deviceid_node method.  The calling conventions for
nfs4_find_get_deviceid are changed so that all information needed to
send a GETDEVICEINFO request is passed to the common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 848746bd24 pnfs/blocklayout: return layouts on setattr
This speads up truncate-heavy workloads like fsx by multiple orders of
magnitude.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 71d5b76302 pnfs/blocklayout: implement the return_range method
This allows removing extents from the extent tree especially on truncate
operations, and thus fixing reads from truncated and re-extended that
previously returned stale data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8067253c8c pnfs/blocklayout: rewrite extent tracking
Currently the block layout driver tracks extents in three separate
data structures:

 - the two list of pnfs_block_extent structures returned by the server
 - the list of sectors that were in invalid state but have been written to
 - a list of pnfs_block_short_extent structures for LAYOUTCOMMIT

All of these share the property that they are not only highly inefficient
data structures, but also that operations on them are even more inefficient
than nessecary.

In addition there are various implementation defects like:

 - using an int to track sectors, causing corruption for large offsets
 - incorrect normalization of page or block granularity ranges
 - insufficient error handling
 - incorrect synchronization as extents can be modified while they are in
   use

This patch replace all three data with a single unified rbtree structure
tracking all extents, as well as their in-memory state, although we still
need to instance for read-only and read-write extent due to the arcane
client side COW feature in the block layouts spec.

To fix the problem of extent possibly being modified while in use we make
sure to return a copy of the extent for use in the write path - the
extent can only be invalidated by a layout recall or return which has
to wait until the I/O operations finished due to refcounts on the layout
segment.

The new extent tree work similar to the schemes used by block based
filesystems like XFS or ext4.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8c792ea940 pnfs/blocklayout: don't set pages uptodate
The core nfs code handles setting pages uptodate on reads, no need to mess
with the pageflags outselves.  Also remove a debug function to dump page
flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 3a6fd1f004 pnfs/blocklayout: remove read-modify-write handling in bl_write_pagelist
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
handling to core nfs code, and remove a huge chunk of deadlock prone
mess from the block layout writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig c88953d87f pnfs: add return_range method
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments.  Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 612aa983a0 pnfs: add flag to force read-modify-write in ->write_begin
Like all block based filesystems, the pNFS block layout driver can't read
or write at a byte granularity and thus has to perform read-modify-write
cycles on writes smaller than this granularity.

Add a flag so that the core NFS code always reads a whole page when
starting a smaller write, so that we can do it in the place where the VFS
expects it instead of doing in very deadlock prone way in the writeback
handler.

Note that in theory we could do less than page size reads here for disks
that have a smaller sector size which are served by a server with a smaller
pnfs block size.  But so far that doesn't seem like a worthwhile
optimization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 7c5d187581 pnfs: force a layout commit when encountering busy segments during recall
Expedite layout recall processing by forcing a layout commit when
we see busy segments.  Without it the layout recall might have to wait
until the VM decided to start writeback for the file, which can introduce
long delays.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Trond Myklebust 3a3908c8b0 NFS: Fix a compile warning when !(CONFIG_NFS_V3 || CONFIG_NFS_V4)
gcc reports:

linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
                                                                  ^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
  struct nfs_commit_info cinfo;

Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 921b81a8cd pnfs/blocklayout: correctly decrement extent length
When we do non-page sized reads we can underflow the extent_length variable
and read incorrect data.  Fix the extent_length calculation and change to
defensive <= checks for the extent length in the read and write path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig be98fd0ac3 pnfs/blocklayout: plug block queues
Make sure the block queue is plugged when performing pNFS blocklayout I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 72c5e59f63 pnfs/blocklayout: improve GETDEVICEINFO error reporting
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance
to debug it, especially with the userspace daemon clusterf***k in the block
layout driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig e3aaf7f2b8 pnfs/blocklayout: reject pnfs blocksize larger than page size
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well.  While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages.  Reject this case early on instead
of pretending to support it (badly).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 5f919c9f10 pnfs: allow splicing pre-encoded pages into the layoutcommit args
Currently there is no XDR buffer space allocated for the per-layout driver
layoutcommit payload, which leads to server buffer overflows in the
blocklayout driver even under simple workloads.  As we can't do per-layout
sizes for XDR operations we'll have to splice a previously encoded list
of pages into the XDR stream, similar to how we handle ACL buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 47abadefad pnfs: avoid using stale stateids after layoutreturn
After we issued a layoutreturn operations the may free the layout stateid
and will thus cause bad stateid error when the client uses it again.

We currently try to avoid this case by chosing the open stateid if not
lsegs are present for this inode.  But various places can hold refererence
on lsegs and thus cause the list not to be empty shortly after a layout
return.  Add an explicit flag to mark the current layout stateid invalid
and force usage of the openstateid after we did a full file layoutreturn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig defb846088 pnfs: retry after a bad stateid error from layoutget
Currently we fall through to nfs4_async_handle_error when we get
a bad stateid error back from layoutget.  nfs4_async_handle_error
with a NULL state argument will never retry the operations but return
the error to higher layer, causing an avoiable fallback to MDS I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 362f74745c pnfs: don't check sequence on new stateids in layoutget
When layoutget returns an entirely new layout stateid it should not
check the generation counter as the new stateid will start with a new
counter entirely unrelated to old one.

The current behavior causes constant layoutget failures against a block
server which allocates a new stateid after an recall that removed all
outstanding layouts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 1013df6115 pnfs: do not pass uninitialized lsegs to ->free_lseg
Ensure the lsegs are initialized early so that we don't pass an unitialized
one back to ->free_lseg during error processing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 2e11f8296d nfs: cap request size to fit a kmalloced page array
pNFS servers may return arbitrarily large layouts.  Trim back the I/O size
to one that we can at least allocate the page array for.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Peng Tao bc7d4b8fd0 nfs/filelayout: set layoutcommit depending on write verifier
Following http://www.rfc-editor.org/errata_search.php?rfc=5661&eid=2751
Don't set layoutcommit for commit_through_mds case.
For FILE_SYNC writes, don't set layoutcommit.
For DATA_SYNC wirtes, set layout commit right after wirtes done.
For UNSTABLE writes, set layout commit when commit done.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Peng Tao 378520b837 nfs41: add a helper function to set layoutcommit after commit
Track lwb in nfs_commit_data so that we can use it to setup
layoutcommit in commit_done callback.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:00 -07:00
Anna Schumaker 61beef75cc NFS: Clear up state owner lock usage
can_open_cached() reads values out of the state structure, meaning that
we need the so_lock to have a correct return value.  As a bonus, this
helps clear up some potentially confusing code.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:00 -07:00
Weston Andros Adamson 224ecbf5a6 pnfs: fix filelayout_retry_commit when idx > 0
filelayout_retry_commit was recently split out from alloc_ds_commits,
but was done in such a way that the bucket pointer always starts at
index 0 no matter what the @idx argument is set to.

The intention of the @idx argument is to retry commits starting at
bucket @idx. This is called when alloc_ds_commits fails for a bucket.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:43:45 -07:00
Linus Torvalds e874a5fe3e Merge branch 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs/smb3 fixes from Steve French:
 "This includes various cifs and smb3 bug fixes including those for bugs
  found with the recently updated xfstests.

  Also I am working fixes for two additional cifs problems found by
  xfstests which I plan to send later (when reviewed and run additional
  tests)"

* 'for-next-3.17' of git://git.samba.org/sfrench/cifs-2.6:
  Clarify Kconfig help text for CIFS and SMB2/SMB3
  CIFS: Fix wrong filename length for SMB2
  CIFS: Fix wrong restart readdir for SMB1
  CIFS: Fix directory rename error
  cifs: No need to send SIGKILL to demux_thread during umount
  cifs: Allow directIO read/write during cache=strict
  cifs: remove unneeded check of null checking in if condition
  cifs: fix a possible use of uninit variable in SMB2_sess_setup
  cifs: fix memory leak when password is supplied multiple times
  cifs: fix a possible null pointer deref in decode_ascii_ssetup
  Trivial whitespace fix
2014-09-09 17:00:43 -07:00
Jaegeuk Kim 0b4c5afde9 f2fs: fix negative value for lseek offset
If application throws negative value of lseek with SEEK_DATA|SEEK_HOLE,
previous f2fs went into BUG_ON in get_dnode_of_data, which was reported
by Tommi Rantala.

He could make a simple code to detect this having:
	lseek(fd, -17595150933902LL, SEEK_DATA);

This patch should resolve that bug.

Reported-by: Tommi Rentala <tt.rantala@gmail.com>
[Jaegeuk Kim: relocate the condition as suggested by Chao]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 14:46:36 -07:00
Huang Ying 9a01b56b1a f2fs: avoid node page to be written twice in gc_node_segment
In gc_node_segment, if node page gc is run concurrently with node page
writeback, and check_valid_map and get_node_page run after page locked
and before cur_valid_map is updated as below, it is possible for the
page to be written twice unnecessarily.

			sync_node_pages
			  try_lock_page
			  ...
check_valid_map		  f2fs_write_node_page
			    ...
			    write_node_page
			      do_write_page
			        allocate_data_block
				  ...
				  refresh_sit_entry /* update cur_valid_map */
				  ...
			    ...
			    unlock_page
get_node_page
...
set_page_dirty
...
f2fs_put_page
  unlock_page

This can be solved via calling check_valid_map after get_node_page again.

Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:07 -07:00
Gu Zheng 721bd4d5c3 f2fs: use lock-less list(llist) to simplify the flush cmd management
We use flush cmd control to collect many flush cmds, and flush them
together. In this case, we use two list to manage the flush cmds
(collect and dispatch), and one spin lock is used to protect this.
In fact, the lock-less list(llist) is very suitable to this case,
and we use simplify this routine.

-
v2:
-use llist_for_each_entry_safe to fix possible use-after-free issue.
-remove the unused field from struct flush_cmd.
Thanks for Yu's suggestion.
-

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:06 -07:00
Chao Yu 184a5cd2ce f2fs: refactor flush_sit_entries codes for reducing SIT writes
In commit aec71382c6 ("f2fs: refactor flush_nat_entries codes for reducing NAT
writes"), we descripte the issue as below:

"Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
   nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
   journal is full, then flush the left dirty entries to disk without merge
   journaled entries, so these journaled entries may be flushed to disk at next
   checkpoint but lost chance to flushed last time."

Actually, we have the same problem in using SIT journal area.

In this patch, firstly we will update sit journal with dirty entries as many as
possible. Secondly if there is no space in sit journal, we will remove all
entries in journal and walk through the whole dirty entry bitmap of sit,
accounting dirty sit entries located in same SIT block to sit entry set. All
entry sets are linked to list sit_entry_set in sm_info, sorted ascending order
by count of entries in set. Later we flush entries in set which have fewest
entries into journal as many as we can, and then flush dense set with merged
entries to disk.

In this way we can use sit journal area more effectively, also we will reduce
SIT update, result in gaining in performance and saving lifetime of flash
device.

In my testing environment, it shows this patch can help to reduce SIT block
update obviously.

virtual machine + hard disk:
fsstress -p 20 -n 400 -l 5
		sit page num	cp count	sit pages/cp
based		2006.50		1349.75		1.486
patched		1566.25		1463.25		1.070

Our latency of merging op is small when handling a great number of dirty SIT
entries in flush_sit_entries:
latency(ns)	dirty sit count
36038		2151
49168		2123
37174		2232

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:05 -07:00
Chao Yu d3a14afd5e f2fs: remove unneeded sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO
sit_i in macro SIT_BLOCK_OFFSET/START_SEGNO is not used, remove it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:05 -07:00
Jaegeuk Kim b0c44f05a2 f2fs: need fsck.f2fs if the recovery was failed
If the roll-forward recovery was failed, we'd better conduct fsck.f2fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:04 -07:00
Jaegeuk Kim ec325b5270 f2fs: handle bug cases by letting fsck.f2fs initiate
This patch adds to handle corner buggy cases for fsck.f2fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:03 -07:00
Jaegeuk Kim 05796763b8 f2fs: add BUG cases to initiate fsck.f2fs
This patch replaces BUG cases with f2fs_bug_on to remain fsck.f2fs information.
And it implements some void functions to initiate fsck.f2fs too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:03 -07:00
Jaegeuk Kim 9850cf4a89 f2fs: need fsck.f2fs when f2fs_bug_on is triggered
If any f2fs_bug_on is triggered, fsck.f2fs is needed.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:15:02 -07:00
Jaegeuk Kim 2ae4c673e3 f2fs: retain inconsistency information to initiate fsck.f2fs
This patch adds sbi->need_fsck to conduct fsck.f2fs later.
This flag can only be removed by fsck.f2fs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-09 13:14:25 -07:00
Jeff Layton e0b93eddfe security: make security_file_set_fowner, f_setown and __f_setown void return
security_file_set_fowner always returns 0, so make it f_setown and
__f_setown void return functions and fix up the error handling in the
callers.

Cc: linux-security-module@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09 16:01:36 -04:00
Jeff Layton 1c994a0909 locks: consolidate "nolease" routines
GFS2 and NFS have setlease routines that always just return -EINVAL.
Turn that into a generic routine that can live in fs/libfs.c.

Cc: <linux-nfs@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: <cluster-devel@redhat.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09 16:01:36 -04:00
Jeff Layton 699688a416 locks: remove lock_may_read and lock_may_write
There are no callers of these functions.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Jeff Layton 09802fd2a8 lockd: rip out deferred lock handling from testlock codepath
As Kinglong points out, the nlm_block->b_fl field is no longer used at
all. Also, vfs_test_lock in the generic locking code will only return
FILE_LOCK_DEFERRED if FL_SLEEP is set, and it isn't here.

The only other place that returns that value is the DLM lock code, but
it only does that in dlm_posix_lock, never in dlm_posix_get.

Remove all of the deferred locking code from the testlock codepath
since it doesn't appear to ever be used anyway.

I do have a small concern that this might cause a behavior change in the
case where you have a block already sitting on the list when the
testlock request comes in, but that looks like it doesn't really work
properly anyway. I think it's best to just pass that down to
vfs_test_lock and let the filesystem report that instead of trying to
infer what's going on with the lock by looking at an existing block.

Cc: cluster-devel@redhat.com
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
2014-09-09 16:01:09 -04:00
Kinglong Mee aef9583b23 NFSD: Get reference of lockowner when coping file_lock
v5: using nfs4_get_stateowner() instead of an inline function
v3: Update based on Jeff's comments
v2: Fix bad using of struct file_lock_operations for handle the owner

Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Kinglong Mee b5971afa0b NFSD: New helper nfs4_get_stateowner() for atomic_inc sop reference
v5: same as the first version

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Kinglong Mee f328296e27 locks: Copy fl_lmops information for conflock in locks_copy_conflock()
Commit d5b9026a67 ([PATCH] knfsd: locks: flag NFSv4-owned locks) using
fl_lmops field in file_lock for checking nfsd4 lockowner.

But, commit 1a747ee0cc (locks: don't call ->copy_lock methods on return
of conflicting locks) causes the fl_lmops of conflock always be NULL.

Also, commit 0996905f93 (lockd: posix_test_lock() should not call
locks_copy_lock()) caused the fl_lmops of conflock always be NULL too.

Make sure copy the private information by fl_copy_lock() in struct
file_lock_operations, merge __locks_copy_lock() to fl_copy_lock().

Jeff advice, "Set fl_lmops on conflocks, but don't set fl_ops.
fl_ops are superfluous, since they are callbacks into the filesystem.
There should be no need to bother the filesystem at all with info
in a conflock. But, lock _ownership_ matters for conflocks and that's
indicated by the fl_lmops. So you really do want to copy the fl_lmops
for conflocks I think."

v5: add missing calling of locks_release_private() in nlmsvc_testlock()
v4: only copy fl_lmops for conflock, don't copy fl_ops

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Kinglong Mee 5c97d7b147 locks: New ops in lock_manager_operations for get/put owner
NFSD or other lockmanager may increase the owner's reference,
so adds two new options for copying and releasing owner.

v5: change order from 2/6 to 3/6
v4: rename lm_copy_owner/lm_release_owner to lm_get_owner/lm_put_owner

Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Kinglong Mee 3fe0fff18f locks: Rename __locks_copy_lock() to locks_copy_conflock()
Jeff advice, " Right now __locks_copy_lock is only used to copy
conflocks. It would be good to rename that to something more
distinct (i.e.locks_copy_conflock), to make it clear that we're
generating a conflock there."

v5: change order from 3/6 to 2/6
v4: new patch only renaming function name

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:09 -04:00
Joe Perches d0449b90f8 locks: Remove unused conf argument from lm_grant
This argument is always NULL so don't pass it around.

[jlayton: remove dependencies on previous patches in series]

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-09-09 16:01:06 -04:00
Jeff Layton f39b913cee locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
The argument to locks_unlink_lock can't be just any pointer to a
pointer. It must be a pointer to the fl_next field in the previous
lock in the list.

Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-09-09 16:00:51 -04:00
Dmitry Kasatkin 3034a14682 ima: pass 'opened' flag to identify newly created files
Empty files and missing xattrs do not guarantee that a file was
just created.  This patch passes FILE_CREATED flag to IMA to
reliably identify new files.

Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>  3.14+
2014-09-09 10:28:43 -04:00
Dave Chinner a4241aebe9 Merge branch 'xfs-misc-fixes-for-3.18-1' into for-next 2014-09-09 13:25:31 +10:00
Eric Sandeen ab6978c295 xfs: remove rbpp check from xfs_rtmodify_summary_int
rbpp is always passed into xfs_rtmodify_summary
and xfs_rtget_summary, so there is no need to
test for it in xfs_rtmodify_summary_int.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:59:12 +10:00
Eric Sandeen afabfd30d0 xfs: combine xfs_rtmodify_summary and xfs_rtget_summary
xfs_rtmodify_summary and xfs_rtget_summary are almost identical;
fold them into xfs_rtmodify_summary_int(), with wrappers for each of
the original calls.

The _int function modifies if a delta is passed, and returns a
summary pointer if *sum is passed.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:58:42 +10:00
Eric Sandeen b16ed7c114 xfs: combine xfs_dir_canenter into xfs_dir_createname
xfs_dir_canenter and xfs_dir_createname are
almost identical.

Fold the former into the latter, with a helpful
wrapper for the former.  If createname is called without
an inode number, it now only checks for space, and does
not actually add the entry.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:58:07 +10:00
Eric Sandeen 94f3cad555 xfs: check resblks before calling xfs_dir_canenter
Move the resblks test out of the xfs_dir_canenter,
and into the caller.

This makes a little more sense on the face of it;
xfs_dir_canenter immediately returns if resblks !=0;
and given some of the comments preceding the calls:

 * Check for ability to enter directory entry, if no space reserved.

even more so.

It also facilitates the next patch.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:57:52 +10:00
Eric Sandeen 970fd3f04d xfs: deduplicate xlog_do_recovery_pass()
In xlog_do_recovery_pass(), there are 2 distinct cases:
non-wrapped and wrapped log recovery.

If we find a wrapped log, we recover around the end
of the log, and then handle the rest of recovery
exactly as in the non-wrapped case - using exactly the same
(duplicated) code.

Rather than having the same code in both cases, we can
get the wrapped portion out of the way first if needed,
and then recover the non-wrapped portion of the log.

There should be no functional change here, just code
reorganization & deduplication.

The patch looks a bit bigger than it really is; the last
hunk is whitespace changes (un-indenting).

Tested with xfstests "check -g log" on a stock configuration.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:57:29 +10:00
Eric Sandeen 59f9c00432 xfs: lseek: the "whence" argument is called "whence"
For some reason, the older commit:

    965c8e5 lseek: the "whence" argument is called "whence"

    lseek: the "whence" argument is called "whence"

    But the kernel decided to call it "origin" instead.
    Fix most of the sites.

left out xfs.  So fix xfs.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:57:10 +10:00
Eric Sandeen 49c69591c8 xfs: combine xfs_seek_hole & xfs_seek_data
xfs_seek_hole & xfs_seek_data are remarkably similar;
so much so that they can be combined, saving a fair
bit of semi-complex code duplication.

The following patch passes generic/285 and generic/286,
which specifically test seek behavior.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:56:48 +10:00
Brian Foster 2e22717874 xfs: export log_recovery_delay to delay mount time log recovery
XFS log recovery has been discovered to have race conditions with
buffers when I/O errors occur. External tools are available to simulate
I/O errors to XFS, but this alone is not sufficient for testing log
recovery. XFS unconditionally resets the inactive region of the log
prior to log recovery to avoid confusion over processing any partially
written log records that might have been written before an unclean
shutdown. Therefore, unconditional write I/O failures at mount time are
caught by the reset sequence rather than log recovery and hinder the
ability to test the latter.

The device-mapper dm-flakey module uses an up/down timer to define a
cycle for when to fail I/Os. Create a pre log recovery delay tunable
that can be used to coordinate XFS log recovery with I/O errors
simulated by dm-flakey. This facilitates coordination in userspace that
allows the reset of stale log blocks to succeed and writes due to log
recovery to fail. For example, define a dm-flakey instance with an
uptime long enough to allow log reset to succeed and a log recovery
delay long enough to allow the dm-flakey uptime to expire.

The 'log_recovery_delay' sysfs tunable is exported under
/sys/fs/xfs/debug and is only enabled for kernels compiled in XFS debug
mode. The value is exported in units of seconds and allows for a delay
of up to 60 seconds. Note that this is for XFS debug and test
instrumentation purposes only and should not be used by applications. No
delay is enabled by default.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:56:13 +10:00
Brian Foster 65b65735fe xfs: add debug sysfs attribute set
Create a top-level debug directory for global debug sysfs attributes.
This directory is added and removed on XFS module initialization and
removal respectively for DEBUG mode kernels only. It typically resides
at /sys/fs/xfs/debug. It is located at the top level of the xfs sysfs
hierarchy as attributes might define global behavior or behavior that
must be configured before an xfs mount is available (e.g., log recovery
behavior).

Define the global debug kobject that represents the debug sysfs
directory and add generic attribute show/store helpers to support future
attributes. No debug attributes are exported as of yet.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:52:42 +10:00
Eric Sandeen e1b05723ed xfs: add a few more verifier tests
These were exposed by fsfuzzer runs; without them we fail
in various exciting and sometimes convoluted ways when we
encounter disk corruption.

Without the MAXLEVELS tests we tend to walk off the end of
an array in a loop like this:

        for (i = 0; i < cur->bc_nlevels; i++) {
                if (cur->bc_bufs[i])

Without the dirblklog test we try to allocate more memory
than we could possibly hope for and loop forever:

xfs_dabuf_map()
	nfsb = mp->m_dir_geo->fsbcount;
	irecs = kmem_zalloc(sizeof(irec) * nfsb, KM_SLEEP...

As for the logbsize check, that's the convoluted one.

If logbsize is specified at mount time, it's sanitized
in xfs_parseargs; in particular it makes sure that it's
not > XLOG_MAX_RECORD_BSIZE.

If not specified at mount time, it comes from the superblock
via sb_logsunit; this is limited to 256k at mkfs time as well;
it's copied into m_logbsize in xfs_finish_flags().

However, if for some reason the on-disk value is corrupt and
too large, nothing catches it.  It's a circuitous path, but
that size eventually finds its way to places that make the kernel
very unhappy, leading to oopses in xlog_pack_data() because we
use the size as an index into iclog->ic_data, but the array
is not necessarily that big.

Anyway - bounds checking when we read from disk is a good thing!

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:47:24 +10:00
Brian Foster 8018ec083c xfs: mark all internal workqueues as freezable
Workqueues must be explicitly set as freezable to ensure they are frozen
in the assocated part of the hibernation/suspend sequence. Freezing of
workqueues and kernel threads is important to ensure that modifications
are not made on-disk after the hibernation image has been created.
Otherwise, the in-memory state can become inconsistent with what is on
disk and eventually lead to filesystem corruption. We have reports of
free space btree corruptions that occur immediately after restore from
hibernate that suggest the xfs-eofblocks workqueue could be causing
such problems if it races with hibernation.

Mark all of the internal XFS workqueues as freezable to ensure nothing
changes on-disk once the freezer infrastructure freezes kernel threads
and creates the hibernation image.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reported-by: Carlos E. R. <carlos.e.r@opensuse.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-09 11:44:46 +10:00
Jeff Layton 0c0e0d3c09 nfs: revert "nfs4: queue free_lock_state job submission to nfsiod"
This reverts commit 49a4bda22e.

Christoph reported an oops due to the above commit:

generic/089 242s ...[ 2187.041239] general protection fault: 0000 [#1]
SMP
[ 2187.042899] Modules linked in:
[ 2187.044000] CPU: 0 PID: 11913 Comm: kworker/0:1 Not tainted 3.16.0-rc6+ #1151
[ 2187.044287] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 2187.044287] Workqueue: nfsiod free_lock_state_work
[ 2187.044287] task: ffff880072b50cd0 ti: ffff88007a4ec000 task.ti: ffff88007a4ec000
[ 2187.044287] RIP: 0010:[<ffffffff81361ca6>]  [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30
[ 2187.044287] RSP: 0018:ffff88007a4efd58  EFLAGS: 00010296
[ 2187.044287] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88007a947ac0 RCX: 8000000000000000
[ 2187.044287] RDX: ffffffff826af9e0 RSI: ffff88007b093c00 RDI: ffff88007b093db8
[ 2187.044287] RBP: ffff88007a4efd58 R08: ffffffff832d3e10 R09: 000001c40efc0000
[ 2187.044287] R10: 0000000000000000 R11: 0000000000059e30 R12: ffff88007fc13240
[ 2187.044287] R13: ffff88007fc18b00 R14: ffff88007b093db8 R15: 0000000000000000
[ 2187.044287] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[ 2187.044287] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2187.044287] CR2: 00007f93ec33fb80 CR3: 0000000079dc2000 CR4: 00000000000006f0
[ 2187.044287] Stack:
[ 2187.044287]  ffff88007a4efdd8 ffffffff810cc877 ffffffff810cc80d ffff88007fc13258
[ 2187.044287]  000000007a947af0 0000000000000000 ffffffff8353ccc8 ffffffff82b6f3d0
[ 2187.044287]  0000000000000000 ffffffff82267679 ffff88007a4efdd8 ffff88007fc13240
[ 2187.044287] Call Trace:
[ 2187.044287]  [<ffffffff810cc877>] process_one_work+0x1c7/0x490
[ 2187.044287]  [<ffffffff810cc80d>] ? process_one_work+0x15d/0x490
[ 2187.044287]  [<ffffffff810cd569>] worker_thread+0x119/0x4f0
[ 2187.044287]  [<ffffffff810fbbad>] ? trace_hardirqs_on+0xd/0x10
[ 2187.044287]  [<ffffffff810cd450>] ? init_pwq+0x190/0x190
[ 2187.044287]  [<ffffffff810d3c6f>] kthread+0xdf/0x100
[ 2187.044287]  [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70
[ 2187.044287]  [<ffffffff81d9873c>] ret_from_fork+0x7c/0xb0
[ 2187.044287]  [<ffffffff810d3b90>] ? __init_kthread_worker+0x70/0x70
[ 2187.044287] Code: 0f 1f 44 00 00 31 c0 5d c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 8d b7 48 fe ff ff 48 8b 87 58 fe ff ff 48 89 e5 48 8b 40 30 <48> 8b 00 48 8b 10 48 89 c7 48 8b 92 90 03 00 00 ff 52 28 5d c3
[ 2187.044287] RIP  [<ffffffff81361ca6>] free_lock_state_work+0x16/0x30
[ 2187.044287]  RSP <ffff88007a4efd58>
[ 2187.103626] ---[ end trace 0f11326d28e5d8fa ]---

The original reason for this patch was because the fl_release_private
operation couldn't sleep. With commit ed9814d858 (locks: defer freeing
locks in locks_delete_lock until after i_lock has been dropped), this is
no longer a problem so we can revert this patch.

Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-08 17:00:32 -07:00
Cong Wang 21e81002f9 nfs: fix kernel warning when removing proc entry
I saw the following kernel warning:

[ 1852.321222] ------------[ cut here ]------------
[ 1852.326527] WARNING: CPU: 0 PID: 118 at fs/proc/generic.c:521 remove_proc_entry+0x154/0x16b()
[ 1852.335630] remove_proc_entry: removing non-empty directory 'fs/nfsfs', leaking at least 'volumes'
[ 1852.344084] CPU: 0 PID: 118 Comm: kworker/u8:2 Not tainted 3.16.0+ #540
[ 1852.350036] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1852.354992] Workqueue: netns cleanup_net
[ 1852.358701]  0000000000000000 ffff880116f2fbd0 ffffffff819c03e9 ffff880116f2fc18
[ 1852.366474]  ffff880116f2fc08 ffffffff810744ee ffffffff811e0e6e ffff8800d4e96238
[ 1852.373507]  ffffffff81dbe665 ffff8800d46a5948 0000000000000005 ffff880116f2fc68
[ 1852.380224] Call Trace:
[ 1852.381976]  [<ffffffff819c03e9>] dump_stack+0x4d/0x66
[ 1852.385495]  [<ffffffff810744ee>] warn_slowpath_common+0x7a/0x93
[ 1852.389869]  [<ffffffff811e0e6e>] ? remove_proc_entry+0x154/0x16b
[ 1852.393987]  [<ffffffff8107457b>] warn_slowpath_fmt+0x4c/0x4e
[ 1852.397999]  [<ffffffff811e0e6e>] remove_proc_entry+0x154/0x16b
[ 1852.402034]  [<ffffffff8129c73d>] nfs_fs_proc_net_exit+0x53/0x56
[ 1852.406136]  [<ffffffff812a103b>] nfs_net_exit+0x12/0x1d
[ 1852.409774]  [<ffffffff81785bc9>] ops_exit_list+0x44/0x55
[ 1852.413529]  [<ffffffff81786389>] cleanup_net+0xee/0x182
[ 1852.417198]  [<ffffffff81088c9e>] process_one_work+0x209/0x40d
[ 1852.502320]  [<ffffffff81088bf7>] ? process_one_work+0x162/0x40d
[ 1852.587629]  [<ffffffff810890c1>] worker_thread+0x1f0/0x2c7
[ 1852.673291]  [<ffffffff81088ed1>] ? process_scheduled_works+0x2f/0x2f
[ 1852.759470]  [<ffffffff8108e079>] kthread+0xc9/0xd1
[ 1852.843099]  [<ffffffff8109427f>] ? finish_task_switch+0x3a/0xce
[ 1852.926518]  [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61
[ 1853.008565]  [<ffffffff819cbeac>] ret_from_fork+0x7c/0xb0
[ 1853.076477]  [<ffffffff8108dfb0>] ? __kthread_parkme+0x61/0x61
[ 1853.140653] ---[ end trace 69c4c6617f78e32d ]---

It looks wrong that we add "/proc/net/nfsfs" in nfs_fs_proc_net_init()
while remove "/proc/fs/nfsfs" in nfs_fs_proc_net_exit().

Fixes: commit 65b38851a1 (NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes)
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Dan Aloni <dan@kernelim.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
[Trond: replace uses of remove_proc_entry() with remove_proc_subtree()
as suggested by Al Viro]
Cc: stable@vger.kernel.org # 3.4.x : 65b38851a17: NFS: Fix /proc/fs/nfsfs/servers
Cc: stable@vger.kernel.org # 3.4.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-08 16:41:36 -07:00
Linus Torvalds 8c68face55 Merge branch 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfix from Ted Ts'o.

[ Hmm.  It's possible we should make kfree() aware of error pointers,
  and use IS_ERR_OR_NULL rather than a NULL check.  But in the meantime
  this is obviously the right fix.  - Linus ]

* 'for_linus_urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: avoid trying to kfree an ERR_PTR pointer
2014-09-08 15:51:01 -07:00
Linus Torvalds 861b7102b5 Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "A couple minor nfsd bugfixes"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
  lockd: fix rpcbind crash on lockd startup failure
  nfsd4: fix rd_dircount enforcement
2014-09-08 15:18:06 -07:00
Chris Mason b0d5d10f41 Btrfs: use insert_inode_locked4 for inode creation
Btrfs was inserting inodes into the hash table before we had fully
set the inode up on disk.  This leaves us open to rare races that allow
two different inodes in memory for the same [root, inode] pair.

This patch fixes things by using insert_inode_locked4 to insert an I_NEW
inode and unlock_new_inode when we're ready for the rest of the kernel
to use the inode.

It also makes sure to init the operations pointers on the inode before
going into the error handling paths.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-08 13:56:45 -07:00
Filipe Manana 49dae1bc1c Btrfs: fix fsync data loss after a ranged fsync
While we're doing a full fsync (when the inode has the flag
BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a
portion of the file), we might have ordered operations that are started
before or while we're logging the inode and that fall outside the fsync
range.

Therefore when a full ranged fsync finishes don't remove every extent
map from the list of modified extent maps - as for some of them, that
fall outside our fsync range, their respective ordered operation hasn't
finished yet, meaning the corresponding file extent item wasn't inserted
into the fs/subvol tree yet and therefore we didn't log it, and we must
let the next fast fsync (one that checks only the modified list) see this
extent map and log a matching file extent item to the log btree and wait
for its ordered operation to finish (if it's still ongoing).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-08 13:56:43 -07:00
Dan Carpenter c47ca32d3a Btrfs: kfree()ing ERR_PTRs
The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in
btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them.

These kind of bugs are "One Err Bugs" where there is just one error
label that does everything.  I could set the "inherit = NULL" and keep
the single out label but it ends up being more complicated that way.  It
makes the code simpler to re-order the unwind so it's in the mirror
order of the allocation and introduce some new error labels.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-08 13:56:42 -07:00
J. Bruce Fields 7c17705e77 lockd: fix rpcbind crash on lockd startup failure
Nikita Yuschenko reported that booting a kernel with init=/bin/sh and
then nfs mounting without portmap or rpcbind running using a busybox
mount resulted in:

  # mount -t nfs 10.30.130.21:/opt /mnt
  svc: failed to register lockdv1 RPC service (errno 111).
  lockd_up: makesock failed, error=-111
  Unable to handle kernel paging request for data at address 0x00000030
  Faulting instruction address: 0xc055e65c
  Oops: Kernel access of bad area, sig: 11 [#1]
  MPC85xx CDS
  Modules linked in:
  CPU: 0 PID: 1338 Comm: mount Not tainted 3.10.44.cge #117
  task: cf29cea0 ti: cf35c000 task.ti: cf35c000
  NIP: c055e65c LR: c0566490 CTR: c055e648
  REGS: cf35dad0 TRAP: 0300   Not tainted  (3.10.44.cge)
  MSR: 00029000 <CE,EE,ME>  CR: 22442488  XER: 20000000
  DEAR: 00000030, ESR: 00000000

  GPR00: c05606f4 cf35db80 cf29cea0 cf0ded80 cf0dedb8 00000001 1dec3086
  00000000
  GPR08: 00000000 c07b1640 00000007 1dec3086 22442482 100b9758 00000000
  10090ae8
  GPR16: 00000000 000186a5 00000000 00000000 100c3018 bfa46edc 100b0000
  bfa46ef0
  GPR24: cf386ae0 c07834f0 00000000 c0565f88 00000001 cf0dedb8 00000000
  cf0ded80
  NIP [c055e65c] call_start+0x14/0x34
  LR [c0566490] __rpc_execute+0x70/0x250
  Call Trace:
  [cf35db80] [00000080] 0x80 (unreliable)
  [cf35dbb0] [c05606f4] rpc_run_task+0x9c/0xc4
  [cf35dbc0] [c0560840] rpc_call_sync+0x50/0xb8
  [cf35dbf0] [c056ee90] rpcb_register_call+0x54/0x84
  [cf35dc10] [c056f24c] rpcb_register+0xf8/0x10c
  [cf35dc70] [c0569e18] svc_unregister.isra.23+0x100/0x108
  [cf35dc90] [c0569e38] svc_rpcb_cleanup+0x18/0x30
  [cf35dca0] [c0198c5c] lockd_up+0x1dc/0x2e0
  [cf35dcd0] [c0195348] nlmclnt_init+0x2c/0xc8
  [cf35dcf0] [c015bb5c] nfs_start_lockd+0x98/0xec
  [cf35dd20] [c015ce6c] nfs_create_server+0x1e8/0x3f4
  [cf35dd90] [c0171590] nfs3_create_server+0x10/0x44
  [cf35dda0] [c016528c] nfs_try_mount+0x158/0x1e4
  [cf35de20] [c01670d0] nfs_fs_mount+0x434/0x8c8
  [cf35de70] [c00cd3bc] mount_fs+0x20/0xbc
  [cf35de90] [c00e4f88] vfs_kern_mount+0x50/0x104
  [cf35dec0] [c00e6e0c] do_mount+0x1d0/0x8e0
  [cf35df10] [c00e75ac] SyS_mount+0x90/0xd0
  [cf35df40] [c000ccf4] ret_from_syscall+0x0/0x3c

The addition of svc_shutdown_net() resulted in two calls to
svc_rpcb_cleanup(); the second is no longer necessary and crashes when
it calls rpcb_register_call with clnt=NULL.

Reported-by: Nikita Yushchenko <nyushchenko@dev.rtsoft.ru>
Fixes: 679b033df4 "lockd: ensure we tear down any live sockets when socket creation fails during lockd_up"
Cc: stable@vger.kernel.org
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-08 12:03:32 -04:00
J. Bruce Fields aee3776441 nfsd4: fix rd_dircount enforcement
Commit 3b29970909 "nfsd4: enforce rd_dircount" totally misunderstood
rd_dircount; it refers to total non-attribute bytes returned, not number
of directory entries returned.

Bring the code into agreement with RFC 3530 section 14.2.24.

Cc: stable@vger.kernel.org
Fixes: 3b29970909 "nfsd4: enforce rd_dircount"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-08 12:02:03 -04:00
Tejun Heo 018a17bdc8 bdi: reimplement bdev_inode_switch_bdi()
A block_device may be attached to different gendisks and thus
different bdis over time.  bdev_inode_switch_bdi() is used to switch
the associated bdi.  The function assumes that the inode could be
dirty and transfers it between bdis if so.  This is a bit nasty in
that it reaches into bdi internals.

This patch reimplements the function so that it writes out the inode
if dirty.  This is a lot simpler and can be implemented without
exposing bdi internals.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08 10:00:43 -06:00
Tejun Heo ff9ea32381 block, bdi: an active gendisk always has a request_queue associated with it
bdev_get_queue() returns the request_queue associated with the
specified block_device.  blk_get_backing_dev_info() makes use of
bdev_get_queue() to determine the associated bdi given a block_device.

All the callers of bdev_get_queue() including
blk_get_backing_dev_info() assume that bdev_get_queue() may return
NULL and implement NULL handling; however, bdev_get_queue() requires
the passed in block_device is opened and attached to its gendisk.
Because an active gendisk always has a valid request_queue associated
with it, bdev_get_queue() can never return NULL and neither can
blk_get_backing_dev_info().

Make it clear that neither of the two functions can return NULL and
remove NULL handling from all the callers.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-09-08 10:00:35 -06:00
Artem Bityutskiy ba29e721eb UBIFS: fix free log space calculation
Hu (hujianyang <hujianyang@huawei.com>) discovered an issue in the
'empty_log_bytes()' function, which calculates how many bytes are left in the
log:

"
If 'c->lhead_lnum + 1 == c->ltail_lnum' and 'c->lhead_offs == c->leb_size', 'h'
would equalent to 't' and 'empty_log_bytes()' would return 'c->log_bytes'
instead of 0.
"

At this point it is not clear what would be the consequences of this, and
whether this may lead to any problems, but this patch addresses the issue just
in case.

Cc: stable@vger.kernel.org
Tested-by: hujianyang <hujianyang@huawei.com>
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-08 15:55:28 +03:00
Artem Bityutskiy 052c28073f UBIFS: fix a race condition
Hu (hujianyang@huawei.com) discovered a race condition which may lead to a
situation when UBIFS is unable to mount the file-system after an unclean
reboot. The problem is theoretical, though.

In UBIFS, we have the log, which basically a set of LEBs in a certain area. The
log has the tail and the head.

Every time user writes data to the file-system, the UBIFS journal grows, and
the log grows as well, because we append new reference nodes to the head of the
log. So the head moves forward all the time, while the log tail stays at the
same position.

At any time, the UBIFS master node points to the tail of the log. When we mount
the file-system, we scan the log, and we always start from its tail, because
this is where the master node points to. The only occasion when the tail of the
log changes is the commit operation.

The commit operation has 2 phases - "commit start" and "commit end". The former
is relatively short, and does not involve much I/O. During this phase we mostly
just build various in-memory lists of the things which have to be written to
the flash media during "commit end" phase.

During the commit start phase, what we do is we "clean" the log. Indeed, the
commit operation will index all the data in the journal, so the entire journal
"disappears", and therefore the data in the log become unneeded. So we just
move the head of the log to the next LEB, and write the CS node there. This LEB
will be the tail of the new log when the commit operation finishes.

When the "commit start" phase finishes, users may write more data to the
file-system, in parallel with the ongoing "commit end" operation. At this point
the log tail was not changed yet, it is the same as it had been before we
started the commit. The log head keeps moving forward, though.

The commit operation now needs to write the new master node, and the new master
node should point to the new log tail. After this the LEBs between the old log
tail and the new log tail can be unmapped and re-used again.

And here is the possible problem. We do 2 operations: (a) We first update the
log tail position in memory (see 'ubifs_log_end_commit()'). (b) And then we
write the master node (see the big lock of code in 'do_commit()').

But nothing prevents the log head from moving forward between (a) and (b), and
the log head may "wrap" now to the old log tail. And when the "wrap" happens,
the contends of the log tail gets erased. Now a power cut happens and we are in
trouble. We end up with the old master node pointing to the old tail, which was
erased. And replay fails because it expects the master node to point to the
correct log tail at all times.

This patch merges the abovementioned (a) and (b) operations by moving the master
node change code to the 'ubifs_log_end_commit()' function, so that it runs with
the log mutex locked, which will prevent the log from being changed benween
operations (a) and (b).

Cc: stable@vger.kernel.org # 07e19df UBIFS: remove mst_mutex
Cc: stable@vger.kernel.org
Reported-by: hujianyang <hujianyang@huawei.com>
Tested-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-09-08 15:55:02 +03:00
Tejun Heo a34375ef9e percpu-refcount: add @gfp to percpu_ref_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing.  Updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Jens Axboe <axboe@kernel.dk>
2014-09-08 09:51:30 +09:00
Tejun Heo 908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Paul E. McKenney bde6c3aa99 rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loops
RCU-tasks requires the occasional voluntary context switch
from CPU-bound in-kernel tasks.  In some cases, this requires
instrumenting cond_resched().  However, there is some reluctance
to countenance unconditionally instrumenting cond_resched() (see
http://lwn.net/Articles/603252/), so this commit creates a separate
cond_resched_rcu_qs() that may be used in place of cond_resched() in
locations prone to long-duration in-kernel looping.

This commit currently instruments only RCU-tasks.  Future possibilities
include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce
IPI usage.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07 16:27:20 -07:00
Linus Torvalds 9142eadefe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull filesystem fixes from Al Viro:
 "Several bugfixes (all of them -stable fodder).

  Alexey's one deals with double mutex_lock() in UFS (apparently, nobody
  has tried to test "ufs: sb mutex merge + mutex_destroy" on something
  like file creation/removal on ufs).  Mine deal with two kinds of
  umount bugs, in umount propagation and in handling of automounted
  submounts, both resulting in bogus transient EBUSY from umount"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs: fix deadlocks introduced by sb mutex merge
  fix EBUSY on umount() from MNT_SHRINKABLE
  get rid of propagate_umount() mistakenly treating slaves as busy.
2014-09-07 10:59:58 -07:00
Alexey Khoroshilov 9ef7db7f38 ufs: fix deadlocks introduced by sb mutex merge
Commit 0244756edc ("ufs: sb mutex merge + mutex_destroy") introduces
deadlocks in ufs_new_inode() and ufs_free_inode().
Most callers of that functions acqure the mutex by themselves and
ufs_{new,free}_inode() do that via lock_ufs(),
i.e we have an unavoidable double lock.

The patch proposes to resolve the issue by making sure that
ufs_{new,free}_inode() are not called with the mutex held.

Found by Linux Driver Verification project (linuxtesting.org).

Cc: stable@vger.kernel.org # 3.16
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-07 13:26:39 -04:00
Linus Torvalds 11e9739813 xfs: fixes for v3.17-rc3
Fix:
 - a direct IO read/buffered read data corruption
 - the associated fallout from the DIO data corruption fix
 - collapse range bugs that are potential data corruption issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUCkM+AAoJEK3oKUf0dfodUBgP+gJu50/XV4TFRLPlCRxhvN61
 371i3ASls1y7ivhj40NzgbDDAZHM2q8Zqwd//318dFViHhWQDlH/1ga06kscRpZX
 d8cQEbFHApgUGQL5Gdq2l2hvAzYa75H0m6cq3jveyrN2rscjCSmAXwtlcEmx3AR6
 TnCpxuVL5asjEGZYb0KfQACq//rASHJbhukpo1gB4ccZ0boWHOVf5SxuS4remzs9
 y+rlPFNl5RD/WVdnJSvu9zu/nP6op3Ax5r7jZanoKbisKHfd7QOa+k65+Vz0Vq9G
 kxgfhz+yLfkOvcktq+41e1lVBln7fCIlcO9m3b53uxWPx5cla323893UiGYsA4F/
 j/gGlh1qaT6C/1M1JBWDLDx931S78XiR1Y+WbtAU1PO+GuO0IEap3+iqtS2+oNAv
 OrpThLOgqTspK6MJeToCzdn2lRT2BJpcKwxIyDK8g+p9N6qCpyw3DfiKyu0wipGH
 D2D3mtE6drSHNaSceFAz8CrQvPOR7Ygj92QGpGSfkohxap9h6VJR/wNp/oGnpmN0
 qgcxTrHvx3kw1hXssB4gjh6fBDnOUkac0isqxdow22Qt529t9sIanzMBvz+JxHQF
 zeqeFSh96lOXmB7UFBU+QyOhbDp3cJWChHrtY3Esw/+FmG6fxEy8z6pdZsiJYELr
 5tka2richPD+gyXzcZwP
 =tRQo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs fixes from Dave Chinner:
 "The fixes all address recently discovered data corruption issues.

  The original Direct IO issue was discovered by Chris Mason @ Facebook
  on a production workload which mixed buffered reads with direct reads
  and writes IO to the same file.  The fix for that exposed other issues
  with page invalidation (exposed by millions of fsx operations) failing
  due to dirty buffers beyond EOF.

  Finally, the collapse_range code could also cause problems due to
  racing writeback changing the extent map while it was being shifted
  around.  The commits for that problem are simple mitigation fixes that
  prevent the problem from occuring.  A more robust fix for 3.18 that
  addresses the underlying problem is currently being worked on by
  Brian.

  Summary of fixes:
   - a direct IO read/buffered read data corruption
   - the associated fallout from the DIO data corruption fix
   - collapse range bugs that are potential data corruption issues"

* tag 'xfs-for-linus-3.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
  xfs: trim eofblocks before collapse range
  xfs: xfs_file_collapse_range is delalloc challenged
  xfs: don't log inode unless extent shift makes extent modifications
  xfs: use ranged writeback and invalidation for direct IO
  xfs: don't zero partial page cache pages during O_DIRECT writes
  xfs: don't zero partial page cache pages during O_DIRECT writes
  xfs: don't dirty buffers beyond EOF
2014-09-06 12:13:17 -07:00
Anton Altaparmakov 10096fb108 Export sync_filesystem() for modular ->remount_fs() use
This patch changes sync_filesystem() to be EXPORT_SYMBOL().

The reason this is needed is that starting with 3.15 kernel, due to
Theodore Ts'o's commit 02b9984d64 ("fs: push sync_filesystem() down to
the file system's remount_fs()"), all file systems that have dirty data
to be written out need to call sync_filesystem() from their
->remount_fs() method when remounting read-only.

As this is now a generically required function rather than an internal
only function it should be EXPORT_SYMBOL() so that all file systems can
call it.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-05 08:16:21 -07:00
Gioh Kim a49058fab2 jbd/jbd2: use non-movable memory for the jbd superblock
Sicne the jbd/jbd2 superblock is not released until the file system is
unmounted, allocate the buffer cache from the non-moveable area to
allow page migration and CMA allocations to more easily succeed.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:36:35 -04:00
Gioh Kim a8ac900b81 ext4: use non-movable memory for the ext4 superblock
Since the ext4 superblock is not released until the file system is
unmounted, allocate the buffer cache entry for the ext4 superblock out
of the non-moveable are to allow page migrations and thus CMA
allocations to more easily succeed if the CMA area is limited.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:36:15 -04:00
Gioh Kim 3b5e6454aa fs/buffer.c: support buffer cache allocations with gfp modifiers
A buffer cache is allocated from movable area because it is referred
for a while and released soon.  But some filesystems are taking buffer
cache for a long time and it can disturb page migration.

New APIs are introduced to allocate buffer cache with user specific
flag.  *_gfp APIs are for user want to set page allocation flag for
page cache allocation.  And *_unmovable APIs are for the user wants to
allocate page cache from non-movable area.

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 22:04:42 -04:00
Linus Torvalds b7fece1be8 Merge git://git.kvack.org/~bcrl/aio-fixes
Pull aio bugfixes from Ben LaHaise:
 "Two small fixes"

* git://git.kvack.org/~bcrl/aio-fixes:
  aio: block exit_aio() until all context requests are completed
  aio: add missing smp_rmb() in read_events_ring
2014-09-04 16:08:55 -07:00
Theodore Ts'o d26e2c4d72 ext4: renumber EXT4_EX_* flags to avoid flag aliasing problems
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-04 18:09:29 -04:00
Jan Kara 0e5ecf0a76 jbd2: optimize jbd2_log_do_checkpoint() a bit
When we discover written out buffer in transaction checkpoint list we
don't have to recheck validity of a transaction. Either this is the
last buffer in a transaction - and then we are done - or this isn't
and then we can just take another buffer from the checkpoint list
without dropping j_list_lock.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-04 18:09:29 -04:00
Theodore Ts'o dc6e8d669c jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()
The __jbd2_journal_remove_checkpoint() doesn't require an elevated
b_count; indeed, until the jh structure gets released by the call to
jbd2_journal_put_journal_head(), the bh's b_count is elevated by
virtue of the existence of the jh structure.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-04 18:09:22 -04:00
Theodore Ts'o 754cfed6bb ext4: drop the EXT4_STATE_DELALLOC_RESERVED flag
Having done a full regression test, we can now drop the
DELALLOC_RESERVED state flag.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:08:22 -04:00
Theodore Ts'o e3cf5d5d9a ext4: prepare to drop EXT4_STATE_DELALLOC_RESERVED
The EXT4_STATE_DELALLOC_RESERVED flag was originally implemented
because it was too hard to make sure the mballoc and get_block flags
could be reliably passed down through all of the codepaths that end up
calling ext4_mb_new_blocks().

Since then, we have mb_flags passed down through most of the code
paths, so getting rid of EXT4_STATE_DELALLOC_RESERVED isn't as tricky
as it used to.

This commit plumbs in the last of what is required, and then adds a
WARN_ON check to make sure we haven't missed anything.  If this passes
a full regression test run, we can then drop
EXT4_STATE_DELALLOC_RESERVED.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:07:25 -04:00
Theodore Ts'o a521100231 ext4: pass allocation_request struct to ext4_(alloc,splice)_branch
Instead of initializing the allocation_request structure in
ext4_alloc_branch(), set it up in ext4_ind_map_blocks(), and then pass
it to ext4_alloc_branch() and ext4_splice_branch().

This allows ext4_ind_map_blocks to pass flags in the allocation
request structure without having to add Yet Another argument to
ext4_alloc_branch().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-04 18:06:25 -04:00
Gu Zheng 6098b45b32 aio: block exit_aio() until all context requests are completed
It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in current implemention, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2014-09-04 16:54:47 -04:00
Al Viro 0b93a92be4 udf: saner calling conventions for udf_new_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:37:41 +02:00
Al Viro b231509616 udf: fix the udf_iget() vs. udf_new_inode() races
Currently udf_iget() (triggered by NFS) can race with udf_new_inode()
leading to two inode structures with the same inode number:

nfsd: iget_locked() creates inode
nfsd: try to read from disk, block on that.
udf_new_inode(): allocate inode with that inumber
udf_new_inode(): insert it into icache, set it up and dirty
udf_write_inode(): write inode into buffer cache
nfsd: get CPU again, look into buffer cache, see nice and sane on-disk
  inode, set the in-core inode from it

Fix the problem by putting inode into icache in locked state (I_NEW set)
and unlocking it only after it's fully set up.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:37:41 +02:00
Al Viro d2be51cb34 udf: merge the pieces inserting a new non-directory object into directory
boilerplate code in udf_{create,mknod,symlink} taken to new helper

symlink case converted to unique id calculated by udf_new_inode() - no
point finding a new one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:37:40 +02:00
Jan Kara 470cca56c3 udf: Set i_generation field
Currently UDF doesn't initialize i_generation in any way and thus NFS
can easily get reallocated inodes from stale file handles. Luckily UDF
already has a unique object identifier associated with each inode -
i_unique. Use that for initialization of i_generation.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:37:40 +02:00
Jan Kara 4071b91362 udf: Properly detect stale inodes
NFS can easily ask for inodes that are already deleted. Currently UDF
happily returns such inodes which is a bug. Return -ESTALE if
udf_read_inode() is asked to read deleted inode.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:37:39 +02:00
Jan Kara 6d3d5e860a udf: Make udf_read_inode() and udf_iget() return error
Currently __udf_read_inode() wasn't returning anything and we found out
whether we succeeded reading inode by checking whether inode is bad or
not. udf_iget() returned NULL on failure and inode pointer otherwise.
Make these two functions properly propagate errors up the call stack and
use the return value in callers.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 21:36:35 +02:00
Jan Kara c03aa9f6e1 udf: Avoid infinite loop when processing indirect ICBs
We did not implement any bound on number of indirect ICBs we follow when
loading inode. Thus corrupted medium could cause kernel to go into an
infinite loop, possibly causing a stack overflow.

Fix the possible stack overflow by removing recursion from
__udf_read_inode() and limit number of indirect ICBs we follow to avoid
infinite loops.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 14:12:29 +02:00
Jan Kara bb7720a0b4 udf: Fold udf_fill_inode() into __udf_read_inode()
There's no good reason to separate these since udf_fill_inode() is
called only from __udf_read_inode() and both do part of the same thing.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 13:32:50 +02:00
Jan Kara 8a70ee3307 udf: Avoid dir link count to go negative
If we are writing back inode of unlinked directory, its link count ends
up being (u16)-1. Although the inode is deleted, udf_iget() can load the
inode when NFS uses stale file handle and get confused.

Signed-off-by: Jan Kara <jack@suse.cz>
2014-09-04 11:47:51 +02:00
Jaegeuk Kim 4081363fbe f2fs: introduce F2FS_I_SB, F2FS_M_SB, and F2FS_P_SB
This patch adds three inline functions to clean up dirty casting codes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-03 17:37:13 -07:00
Kinglong Mee 027bc41a3e NFSD: Put export if prepare_creds() fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:04 -04:00
Kinglong Mee 13c82e8eb5 NFSD: Full checking of authentication name
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:03 -04:00
Kinglong Mee 48c348b09c NFSD: Fix bad using of return value from qword_get
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:02 -04:00
Kinglong Mee 15d176c195 NFSD: Fix a memory leak if nfsd4_recdir_load fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:01 -04:00
Kinglong Mee c2236f141e NFSD: Reset creds after mnt_want_write_file() fail
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:01 -04:00
Kinglong Mee 8519f994e5 NFSD: Put file after ima_file_check fail in nfsd_open()
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-03 17:43:00 -04:00
Linus Torvalds 70c8038dd6 Merge tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs bug fixes from Jaegeuk Kim:
 "This series includes patches to:

   - fix recovery routines
   - fix bugs related to inline_data/xattr
   - fix when casting the dentry names
   - handle EIO or ENOMEM correctly
   - fix memory leak
   - fix lock coverage"

* tag 'for-f2fs-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (28 commits)
  f2fs: reposition unlock_new_inode to prevent accessing invalid inode
  f2fs: fix wrong casting for dentry name
  f2fs: simplify by using a literal
  f2fs: truncate stale block for inline_data
  f2fs: use macro for code readability
  f2fs: introduce need_do_checkpoint for readability
  f2fs: fix incorrect calculation with total/free inode num
  f2fs: remove rename and use rename2
  f2fs: skip if inline_data was converted already
  f2fs: remove rewrite_node_page
  f2fs: avoid double lock in truncate_blocks
  f2fs: prevent checkpoint during roll-forward
  f2fs: add WARN_ON in f2fs_bug_on
  f2fs: handle EIO not to break fs consistency
  f2fs: check s_dirty under cp_mutex
  f2fs: unlock_page when node page is redirtied out
  f2fs: introduce f2fs_cp_error for readability
  f2fs: give a chance to mount again when encountering errors
  f2fs: trigger release_dirty_inode in f2fs_put_super
  f2fs: don't skip checkpoint if there is no dirty node pages
  ...
2014-09-03 10:10:28 -07:00
Theodore Ts'o a9cfcd63e8 ext4: avoid trying to kfree an ERR_PTR pointer
Thanks to Dan Carpenter for extending smatch to find bugs like this.
(This was found using a development version of smatch.)

Fixes: 36de928641
Reported-by: Dan Carpenter <dan.carpenter@oracle.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-09-03 09:37:30 -04:00
Filipe Manana dac5705cad Btrfs: fix crash while doing a ranged fsync
While doing a ranged fsync, that is, one whose range doesn't cover the
whole possible file range (0 to LLONG_MAX), we can crash under certain
circumstances with a trace like the following:

[41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
(...)
[41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
(...)
[41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
(...)
[41074.644919] Stack:
(...)
[41074.644919] Call Trace:
[41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
[41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
[41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
[41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
[41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
[41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
[41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
[41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
[41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
[41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
[41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
(...)

The necessary conditions that lead to such crash are:

* an incremental fsync (when the inode doesn't have the
  BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
  a file extent item ending at offset X;

* the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
  to a file truncate operation that reduces the file to a size smaller
  than X;

* a ranged fsync call happens (via an msync for example), with a range that
  doesn't cover the whole file and the end of this range, lets call it Y, is
  smaller than X;

* btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
  calls btrfs_truncate_inode_items() to remove all items from the log
  tree that are associated with our file;

* btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
  file extent item it removed is the one ending at offset X, where X > 0 and
  X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
  parameter set to X;

* btrfs_ordered_update_i_size() sees that X is greater then the current ordered
  size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
  ordered operation with a range covering the offset X, calling a BUG_ON() if
  such ordered operation exists. This assumption is made because the disk_i_size
  is only increased after the corresponding file extent item is added to the
  btree (btrfs_finish_ordered_io);

* But because our fsync covers only a limited range, such an ordered extent might
  exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
  extent to finish when calling btrfs_wait_ordered_range();

And then by the time btrfs_ordered_update_i_size() is called, via:

   btrfs_sync_file() ->
       btrfs_log_dentry_safe() ->
           btrfs_log_inode_parent() ->
               btrfs_log_inode() ->
                   btrfs_truncate_inode_items() ->
                       btrfs_ordered_update_i_size()

We hit the BUG_ON(), which could never happen if the fsync range covered the whole
possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
finish before calling btrfs_truncate_inode_items().

So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
from a log tree, which isn't supposed to change the in memory inode's disk_i_size.

Issue found while running xfstests/generic/127 (happens very rarely for me), more
specifically via the fsx calls that use memory mapped IO (and issue msync calls).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Filipe Manana d9f85963e3 Btrfs: fix corruption after write/fsync failure + fsync + log recovery
While writing to a file, in inode.c:cow_file_range() (and same applies to
submit_compressed_extents()), after reserving an extent for the file data,
we create a new extent map for the written range and insert it into the
extent map cache. After that, we create an ordered operation, but if it
fails (due to a transient/temporary-ENOMEM), we return without dropping
that extent map, which points to a reserved extent that is freed when we
return. A subsequent incremental fsync (when the btrfs inode doesn't have
the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
logs a file extent item based on that extent map, which points to a disk
extent that doesn't contain valid data - it was freed by us earlier, at this
point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after
we added the new extent map to the cache, drop it from the cache before
returning.

Some sequence of steps that lead to this:

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o commit=9999 /dev/sdd /mnt
    $ cd /mnt

    $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
    $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
    $ sync

    $ od -t x1 foo
    0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
    *
    0020000

    $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

    # Now this write + fsync fail with -ENOMEM, which was returned by
    # btrfs_add_ordered_extent() in inode.c:cow_file_range().
    $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
    $ xfs_io -c "fsync" foo
    fsync: Cannot allocate memory

    # Now do a new write + fsync, which will succeed. Our previous
    # -ENOMEM was a transient/temporary error.
    $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
    $ xfs_io -c "fsync" foo

    # Our file content (in page cache) is now:
    $ od -t x1 foo
    0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
    *
    0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # Now reboot the machine, and mount the fs, so that fsync log replay
    # takes place.

    # The file content is now weird, in particular the first 8Kb, which
    # do not match our data before nor after the sync command above.
    $ od -t x1 foo
    0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
    *
    0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    *
    0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
    *
    0050000

    # In fact these first 4Kb are a duplicate of the last 4kb block.
    # The last write got an extent map/file extent item that points to
    # the same disk extent that we got in the write+fsync that failed
    # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
    # verify that:

    $ btrfs-debug-tree /dev/sdd
    (...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

    $ umount /dev/sdd
    $ btrfsck /dev/sdd
    Checking filesystem on /dev/sdd
    UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
    checking extents
    extent item 12582912 has multiple extent items
    ref mismatch on [12582912 4096] extent item 1, found 2
    Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
    backpointer mismatch on [12582912 4096]
    Errors found in extent allocation tree or chunk allocation
    checking free space cache
    checking fs roots
    root 5 inode 257 errors 1000, some csum missing
    found 131074 bytes used err is 1
    total csum bytes: 4
    total tree bytes: 131072
    total fs tree bytes: 32768
    total extent tree bytes: 16384
    btree space waste bytes: 123404
    file data blocks allocated: 274432
     referenced 274432
    Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-02 16:46:05 -07:00
Trond Myklebust 66f09ca717 nfs: do not start the callback thread until we set rqstp->rq_task
This fixes an Oopsable race when starting up the callback server.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-02 17:53:30 -04:00
Trond Myklebust d4e8990299 lockd: Do not start the lockd thread before we've set nlmsvc_rqst->rq_task
This fixes an Oopsable race when starting lockd.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-09-02 17:49:17 -04:00
Jeff Moyer 2ff396be60 aio: add missing smp_rmb() in read_events_ring
We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events.  After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves.  This small patch fixes the
problem in testing.

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
2014-09-02 15:20:03 -04:00
Chao Yu b73e52824c f2fs: reposition unlock_new_inode to prevent accessing invalid inode
As the race condition on the inode cache, following scenario can appear:
[Thread a]				[Thread b]
					->f2fs_mkdir
					  ->f2fs_add_link
					    ->__f2fs_add_link
					      ->init_inode_metadata failed here
->gc_thread_func
  ->f2fs_gc
    ->do_garbage_collect
      ->gc_data_segment
        ->f2fs_iget
          ->iget_locked
            ->wait_on_inode
					  ->unlock_new_inode
        ->move_data_page
					  ->make_bad_inode
					  ->iput

When we fail in create/symlink/mkdir/mknod/tmpfile, the new allocated inode
should be set as bad to avoid being accessed by other thread. But in above
scenario, it allows f2fs to access the invalid inode before this inode was set
as bad.
This patch fix the potential problem, and this issue was found by code review.

change log from v1:
 o Add condition judgment in gc_data_segment() suggested by Changman Lee.
 o use iget_failed to simplify code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-09-02 00:22:24 -07:00
Zheng Liu eb68d0e2fc ext4: track extent status tree shrinker delay statictics
This commit adds some statictics in extent status tree shrinker.  The
purpose to add these is that we want to collect more details when we
encounter a stall caused by extent status tree shrinker.  Here we count
the following statictics:
  stats:
    the number of all objects on all extent status trees
    the number of reclaimable objects on lru list
    cache hits/misses
    the last sorted interval
    the number of inodes on lru list
  average:
    scan time for shrinking some objects
    the number of shrunk objects
  maximum:
    the inode that has max nr. of objects on lru list
    the maximum scan time for shrinking some objects

The output looks like below:
  $ cat /proc/fs/ext4/sda1/es_shrinker_info
  stats:
    28228 objects
    6341 reclaimable objects
    5281/631 cache hits/misses
    586 ms last sorted interval
    250 inodes on lru list
  average:
    153 us scan time
    128 shrunk objects
  maximum:
    255 inode (255 objects, 198 reclaimable)
    125723 us max scan time

If the lru list has never been sorted, the following line will not be
printed:
    586ms last sorted interval
If there is an empty lru list, the following lines also will not be
printed:
    250 inodes on lru list
  ...
  maximum:
    255 inode (255 objects, 198 reclaimable)
    0 us max scan time

Meanwhile in this commit a new trace point is defined to print some
details in __ext4_es_shrink().

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:26:49 -04:00
Zheng Liu e963bb1de4 ext4: improve extents status tree trace point
This commit improves the trace point of extents status tree.  We rename
trace_ext4_es_shrink_enter in ext4_es_count() because it is also used
in ext4_es_scan() and we can not identify them from the result.

Further this commit fixes a variable name in trace point in order to
keep consistency with others.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:22:13 -04:00
Seunghun Lee d91bd2c1d7 ext4: fix comments about get_blocks
get_blocks is renamed to get_block.

Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 22:15:30 -04:00
Brian Foster 41b9d7263e xfs: trim eofblocks before collapse range
xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.

The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.

As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:53 +10:00
Dave Chinner 1669a8ca21 xfs: xfs_file_collapse_range is delalloc challenged
If we have delalloc extents on a file before we run a collapse range
opertaion, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents is the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.

Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:53 +10:00
Brian Foster ca446d880c xfs: don't log inode unless extent shift makes extent modifications
The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining exents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:53 +10:00
Dave Chinner 7d4ea3ce63 xfs: use ranged writeback and invalidation for direct IO
Now we are not doing silly things with dirtying buffers beyond EOF
and using invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match that of the IO
being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:53 +10:00
Dave Chinner 834ffca6f7 xfs: don't zero partial page cache pages during O_DIRECT writes
Similar to direct IO reads, direct IO writes are using 
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:52 +10:00
Chris Mason 85e584da32 xfs: don't zero partial page cache pages during O_DIRECT writes
xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads.  This is different from the other filesystems who
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page.  This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

buffered reads will find an up to date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:52 +10:00
Dave Chinner 22e757a49c xfs: don't dirty buffers beyond EOF
generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extents EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solutions the other filesystems and generic block code use
of marking the buffers clean in ->writepage does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-09-02 12:12:51 +10:00
Darrick J. Wong 45f1a9c3f6 ext4: enable block_validity by default
Enable by default the block_validity feature, which checks for
collisions between newly allocated blocks and critical system
metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 21:34:09 -04:00
Theodore Ts'o 88fe1acb5b jbd2: fold __wait_cp_io into jbd2_log_do_checkpoint()
__wait_cp_io() is only called by jbd2_log_do_checkpoint().  Fold it in
to make it a bit easier to understand.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 21:26:09 -04:00
Theodore Ts'o be1158cc61 jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()
__process_buffer() is only called by jbd2_log_do_checkpoint(), and it
had a very complex locking protocol where it would be called with the
j_list_lock, and sometimes exit with the lock held (if the return code
was 0), or release the lock.

This was confusing both to humans and to smatch (which erronously
complained that the lock was taken twice).

Folding __process_buffer() to the caller allows us to simplify the
control flow, making the resulting function easier to read and reason
about, and dropping the compiled size of fs/jbd2/checkpoint.c by 150
bytes (over 4% of the text size).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-09-01 21:19:01 -04:00
Theodore Ts'o ed8a1a766a ext4: rename ext4_ext_find_extent() to ext4_find_extent()
Make the function name less redundant.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:43:09 -04:00
Theodore Ts'o 3bdf14b4d7 ext4: reuse path object in ext4_move_extents()
Reuse the path object in ext4_move_extents() so we don't unnecessarily
free and reallocate it.

Also clean up the get_ext_path() wrapper so that it has the same
semantics of freeing the path object on error as ext4_ext_find_extent().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:42:09 -04:00
Theodore Ts'o ee4bd0d963 ext4: reuse path object in ext4_ext_shift_extents()
Now that the semantics of ext4_ext_find_extent() are much cleaner,
it's safe and more efficient to reuse the path object across the
multiple calls to ext4_ext_find_extent() in ext4_ext_shift_extents().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:41:09 -04:00
Theodore Ts'o 10809df84a ext4: teach ext4_ext_find_extent() to realloc path if necessary
This adds additional safety in case for some reason we end reusing a
path structure which isn't big enough for current depth of the inode.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:40:09 -04:00
Theodore Ts'o b7ea89ad0a ext4: allow a NULL argument to ext4_ext_drop_refs()
Teach ext4_ext_drop_refs() to accept a NULL argument, much like
kfree().  This allows us to drop a lot of checks to make sure path is
non-NULL before calling ext4_ext_drop_refs().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:39:09 -04:00
Theodore Ts'o 523f431ccf ext4: call ext4_ext_drop_refs() from ext4_ext_find_extent()
In nearly all of the calls to ext4_ext_find_extent() where the caller
is trying to recycle the path object, ext4_ext_drop_refs() gets called
to release the buffer heads before the path object gets overwritten.
To simplify things for the callers, and to avoid the possibility of a
memory leak, make ext4_ext_find_extent() responsible for dropping the
buffers.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:38:09 -04:00
Theodore Ts'o dfe5080939 ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code
Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(),
ext4_split_extent(), ext4_convert_unwritten_extents_endio().

This requires fixing all of their callers to potentially
ext4_ext_find_extent() to free the struct ext4_ext_path object in case
of an error, and there are interlocking dependencies all the way up to
ext4_ext_map_blocks(), ext4_swap_extents(), and
ext4_ext_remove_space().

Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it
is no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:37:09 -04:00
Theodore Ts'o 4f224b8b7b ext4: drop EXT4_EX_NOFREE_ON_ERR in convert_initialized_extent()
Transfer responsibility of freeing struct ext4_ext_path on error to
ext4_ext_find_extent().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:36:09 -04:00
Theodore Ts'o e8b83d9303 ext4: collapse ext4_convert_initialized_extents()
The function ext4_convert_initialized_extents() is only called by a
single function --- ext4_ext_convert_initalized_extents().  Inline the
code and get rid of the unnecessary bits in order to simplify the code.

Rename ext4_ext_convert_initalized_extents() to
convert_initalized_extents() since it's a static function that is
actually only used in a single caller, ext4_ext_map_blocks().

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:35:09 -04:00
Theodore Ts'o 705912ca95 ext4: teach ext4_ext_find_extent() to free path on error
Right now, there are a places where it is all to easy to leak memory
on an error path, via a usage like this:

	struct ext4_ext_path *path = NULL

	while (...) {
		...
		path = ext4_ext_find_extent(inode, block, path, 0);
		if (IS_ERR(path)) {
			/* oops, if path was non-NULL before the call to
			   ext4_ext_find_extent, we've leaked it!  :-(  */
			...
			return PTR_ERR(path);
		}
		...
	}

Unfortunately, there some code paths where we are doing the following
instead:

	path = ext4_ext_find_extent(inode, block, orig_path, 0);

and where it's important that we _not_ free orig_path in the case
where ext4_ext_find_extent() returns an error.

So change the function signature of ext4_ext_find_extent() so that it
takes a struct ext4_ext_path ** for its third argument, and by
default, on an error, it will free the struct ext4_ext_path, and then
zero out the struct ext4_ext_path * pointer.  In order to avoid
causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes
ext4_ext_find_extent() to use the original behavior of forcing the
caller to deal with freeing the original path pointer on the error
case.

The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this
allows for a gentle transition and makes the patches easier to verify.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:34:09 -04:00
Theodore Ts'o bd30d702fc ext4: fix accidental flag aliasing in ext4_map_blocks flags
Commit b8a8684502 introduced an accidental flag aliasing between
EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN.

Fortunately, this didn't introduce any untorward side effects --- we
got lucky.  Nevertheless, fix this and leave a warning to hopefully
avoid this from happening in the future.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-09-01 14:33:09 -04:00
Theodore Ts'o 713e8dde3e ext4: fix ZERO_RANGE bug hidden by flag aliasing
We accidently aliased EXT4_EX_NOCACHE and EXT4_GET_CONVERT_UNWRITTEN
falgs, which apparently was hiding a bug that was unmasked when this
flag aliasing issue was addressed (see the subsequent commit).  The
reproduction case was:

   fsx -N 10000 -l 500000 -r 4096 -t 4096 -w 4096 -Z -R -W /vdb/junk

... which would cause fsx to report corruption in the data file.

The fix we have is a bit of an overkill, but I'd much rather be
conservative for now, and we can optimize ZERO_RANGE_FL handling
later.  The fact that we need to zap the extent_status cache for the
inode is unfortunate, but correctness is far more important than
performance.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
2014-09-01 14:32:09 -04:00
Theodore Ts'o 19008f6dfa ext4: fix ext4_swap_extents() error handling
If ext4_ext_find_extent() returns an error, we have to clear path1 or
path2 or else we would end up trying to free an ERR_PTR, which would
be bad.

Also eliminate some redundant code and mark the error paths as unlikely()

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-31 15:03:14 -04:00
Linus Torvalds 35e274458c File locking related bugfixes for v3.17 (pile #3)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUAmbRAAoJEAAOaEEZVoIVah4P/iupUO7Ae5ODMDMog/vOp+SM
 +sWnyqnEyeMlQlNDoHoef5TPQ28aKEAq1Sg7CsqlK3qZSYSSPhb4KFsGWLZe6D5A
 7iWSMKabdnuQ3qBCsb2Y6ZdB8IRAJz81sIAVI8F32NDmSs325wN/coVwfV4g8mQF
 QSpv78TjwBY0qNhNw06pS/FLV45IaPTDDgnTHRcOLrfHajDdGTdqrKI/L0ES1PFB
 0ZUtG3qMPS2XYRyS6ZQ0TZZrl2/HMA5/fOwqKspNKxYxKS+TOf/umKwPPjHBnMHo
 mfD1XnG64ECkNio9bpg2CkjUqaT8aloNPgDxuP15vEV6bZ5WBLKjGUOY2IvPa198
 do8CaAdp2Ql6kE2IyD+G+IkjqcxZ9H8hNH3cBM+3TzvxYqaiZKKZAky5UTau2LvG
 E5cyWhDPsVBGvAJXEPBf4vhIgzhaSuNox0+73nL4xU+L7bPTDzYIYyhS/InKO0X+
 ZwAZn2u62XQmUDI8b+zrgOAHfWB0hHlcIfIsIrDxotM24TPPbJ2k2Dz0hKCZAraR
 DYDYPJZg+/QyPc8bujL5Hwjh8MogdDt1vxd9B65MwQWRn791LSGbq6VSsUnsMoAa
 dhG5U+a5eI8oQ2gkMEEK45o2ljcnZ3BSim6SGdmZ6YrNyEdk63xA4GjozK1YhWzG
 tLkXb4/7zV/dR8VTOQR/
 =Wb1t
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux

Pull file locking bugfx from Jeff Layton:
 "Just a bugfix for a bug that crept in to v3.15.  It's in a rather rare
  error path, and I'm not aware of anyone having hit it, but it's worth
  fixing for v3.17"

* tag 'locks-v3.17-3' of git://git.samba.org/jlayton/linux:
  locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
2014-08-30 21:04:37 -07:00
Dmitry Monakhov fcf6b1b729 ext4: refactor ext4_move_extents code base
ext4_move_extents is too complex for review. It has duplicate almost
each function available in the rest of other codebase. It has useless
artificial restriction orig_offset == donor_offset. But in fact logic
of ext4_move_extents is very simple:

Iterate extents one by one (similar to ext4_fill_fiemap_extents)
   ->Iterate each page covered extent (similar to generic_perform_write)
     ->swap extents for covered by page (can be shared with IOC_MOVE_DATA)

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:52:19 -04:00
Dmitry Monakhov f8fb4f4150 ext4: use ext4_ext_next_allocated_block instead of mext_next_extent
This allows us to make mext_next_extent static and potentially get rid
of it.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:50:56 -04:00
Dmitry Monakhov ee124d2746 ext4: use ext4_update_i_disksize instead of opencoded ones
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-30 23:34:06 -04:00
Al Viro 81b6b06197 fix EBUSY on umount() from MNT_SHRINKABLE
We need the parents of victims alive until namespace_unlock() gets to
dput() of the (ex-)mountpoints.  However, that screws up the "is it
busy" checks in case when we have shrinkable mounts that need to be
killed.  Solution: go ahead and decrement refcounts of parents right
in umount_tree(), increment them again just before dropping rwsem in
namespace_unlock() (and let the loop in the end of namespace_unlock()
finally drop those references for good, as we do now).  Parents can't
get freed until we drop rwsem - at least one reference is kept until
then, both in case when parent is among the victims and when it is
not.  So they'll still be around when we get to namespace_unlock().

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-30 18:32:05 -04:00
Al Viro 88b368f27a get rid of propagate_umount() mistakenly treating slaves as busy.
The check in __propagate_umount() ("has somebody explicitly mounted
something on that slave?") is done *before* taking the already doomed
victims out of the child lists.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-30 18:31:41 -04:00
Wang Shilong 52c826db6d ext4: remove a duplicate call in ext4_init_new_dir()
ext4_journal_get_write_access() has just been called in ext4_append()
calling it again here is duplicated.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 23:20:44 -04:00
Theodore Ts'o f8b3b59d4d ext4: convert do_split() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:18 -04:00
Theodore Ts'o dd73b5d5cb ext4: convert dx_probe() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:17 -04:00
Theodore Ts'o 1c2150283c ext4: convert ext4_bread() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:52:15 -04:00
Theodore Ts'o 1056008226 ext4: convert ext4_getblk() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:51:32 -04:00
Theodore Ts'o 537d8f9380 ext4: convert ext4_dx_find_entry() to use the ERR_PTR convention
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-29 20:49:51 -04:00
Linus Torvalds 10f3291a1d Merge branch 'akpm' (fixes from Andrew Morton)
Merge patches from Andrew Morton:
 "22 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  kexec: purgatory: add clean-up for purgatory directory
  Documentation/kdump/kdump.txt: add ARM description
  flush_icache_range: export symbol to fix build errors
  tools: selftests: fix build issue with make kselftests target
  ocfs2: quorum: add a log for node not fenced
  ocfs2: o2net: set tcp user timeout to max value
  ocfs2: o2net: don't shutdown connection when idle timeout
  ocfs2: do not write error flag to user structure we cannot copy from/to
  x86/purgatory: use approprate -m64/-32 build flag for arch/x86/purgatory
  drivers/rtc/rtc-s5m.c: re-add support for devices without irq specified
  xattr: fix check for simultaneous glibc header inclusion
  kexec: remove CONFIG_KEXEC dependency on crypto
  kexec: create a new config option CONFIG_KEXEC_FILE for new syscall
  x86,mm: fix pte_special versus pte_numa
  hugetlb_cgroup: use lockdep_assert_held rather than spin_is_locked
  mm/zpool: use prefixed module loading
  zram: fix incorrect stat with failed_reads
  lib: turn CONFIG_STACKTRACE into an actual option.
  mm: actually clear pmd_numa before invalidating
  memblock, memhotplug: fix wrong type in memblock_find_in_range_node().
  ...
2014-08-29 16:28:29 -07:00
Junxiao Bi 8c7b638cec ocfs2: quorum: add a log for node not fenced
For debug use, we can see from the log whether the fence decision is
made and why it is not fenced.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:17 -07:00
Junxiao Bi 8e9801dfe3 ocfs2: o2net: set tcp user timeout to max value
When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time.  So we set tcp user
timeout to override the retransmit timeout to the max value.  This is OK
for ocfs2 since we have disk heartbeat, if peer crash, the disk
heartbeat will timeout and it will be evicted, if disk heartbeat not
timeout and connection idle for a long time, then this means the cluster
enters split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:16 -07:00
Junxiao Bi c43c363def ocfs2: o2net: don't shutdown connection when idle timeout
This patch series is to fix a possible message lost bug in ocfs2 when
network go bad.  This bug will cause ocfs2 hung forever even network
become good again.

The messages may lost in this case.  After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to
reconnect, so pending messages in tcp queues will be lost.  This
messages may be from dlm.  Dlm may get hung in this case.  This may
cause the whole ocfs2 cluster hung.

This is very possible to happen when network state goes bad.  Do the
reconnect is useless, it will fail if network state is still bad.  Just
waiting there for network recovering may be a good idea, it will not
lost messages and some node will be fenced until cluster goes into
split-brain state, for this case, Tcp user timeout is used to override
the tcp retransmit timeout.  It will timeout after 25 days, user should
have notice this through the provided log and fix the network, if they
don't, ocfs2 will fall back to original reconnect way.

This patch (of 3):

Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout.  If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.

To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.

This bug can be saw when network state go bad.  It may cause ocfs2 hung
forever if some packets lost.  With this fix, ocfs2 will recover from
hung if network becomes good again.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:16 -07:00
Ben Hutchings 2b462638e4 ocfs2: do not write error flag to user structure we cannot copy from/to
If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).

In any case, if we cannot copy from/to the structure, why should we
expect putting just the flags to work?

Also make sure ocfs2_info_handle_freeinode() returns the right error
code if the copy_to_user() fails.

Fixes: ddee5cdb70 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-29 16:28:16 -07:00
Linus Torvalds 878e580e21 NFS client fixes for 3.17
Highlights:
 - NFSv3 stable fix for another POSIX ACL regression
 - NFSv4 stable fix for a regression with OPEN_DOWNGRADE
 - NFSv4 stable fix for bad close() behaviour when holding a delegation
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUANWVAAoJEGcL54qWCgDygyQQALF755JgqEjVy+uRjmqoXn/q
 4Gc7fGEkSfGbP8BO0dJ9Qs4IX7WQTFNUw1x3xqR+wBrvjCFbQeMclI2XIBwYarBl
 7zNnBK9NpmnNh94cataR9WANTmHMm+3xSA3UmK7OnovFOSDviJpKMa3AkRIrJMX5
 ZKYvwN2CigcIefYwtQ2NqkDt0CGbt53zYavQ4hp+//LexaN5z0f2krVj8pPwquYZ
 3JX1sm+C6bKxyTbAyJ8cWCWnJ/gxDOzl2ZPjtWah4G3tVpO6CF5+07xbQ8+B5KOc
 Bm434dWJFlCYSXvmRgbC9i7d7mJU2+fI0rcUP2LDeA73oKDjndsmqmtq08hmPz5K
 FfIA7gko4SJXvYzNKyuoS8j5r+LCtEqKoCCwMucVRwy33rpinmlzw68WTsUm8YtK
 0qYDeAqeuCc9ZerGMMFfkmgigAd2cWhhUnL+V5tlpCEeFRnL1+jqnRxuBhLlzgN3
 SaikZfmncB6gNR6cGwMfceo1E2AoA1GuVy0am1yPsYMhRF6OPCxaLRR53WgioXrt
 DwKUqhQtcE0qN1MII44x0Yxl0oFMTTCl279exnjyCWMpGYX/SAI9ErOpsc+QpIxJ
 wQEL+xUHOV7B6gTt4Y6GbXtL7toBLcMmT71gz6OHcTNJN0OtMuqtp9jgy1iTG7Gm
 2Gd5Su14xza8DFEaDLca
 =3xjW
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights:
   - NFSv3 stable fix for another POSIX ACL regression
   - NFSv4 stable fix for a regression with OPEN_DOWNGRADE
   - NFSv4 stable fix for bad close() behaviour when holding a delegation"

* tag 'nfs-for-3.17-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: Fix another acl regression
  NFSv4: Don't clear the open state when we just did an OPEN_DOWNGRADE
  NFSv4: Fix problems with close in the presence of a delegation
2014-08-29 13:04:13 -07:00
Linus Torvalds d4f03186c8 Ext4 bug fixes for 3.17, to provide better handling of memory
allocation failures, and to fix some journaling bugs involving journal
 checksums and FALLOC_FL_ZERO_RANGE.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJT/+hGAAoJENNvdpvBGATwlU8P/02752nzboRRtqYZBxh/rP6L
 QoawhKslb516QFcwgxBpmhf/uSg0XIIakANEFSvlJksj0hcSLNmHl3SjGB6EGyu0
 1qOjgXSULFFnLGkjJ9ptzn266irQRR2AX5+mBP1T/JV6L5dRFwylCWbSElxEjobt
 WhUe0TzXjazYviItOugh8tQYKrfWlfc0UnMSOU7abastStYkROPuvUUOg0fcQCW/
 jZpgFQDKO+TmIZ/QtP26Bogz27Cthe5d1XnA9555JOyYjxpRh3HnVaZXLXOtA2nf
 eQZmDpfXCnbqORLsqDQbq1+TMMFVjudQyIgHkmMojshTc2PWGZyl/KtgxDHCoBxz
 j3a/qafUPbkqEKTLOunDggkWvOKhah7Z6ZCxzamC3d5Cy2GtjUhhp+iyllf4Tmga
 OEWIPp/5F3/UfJj/0e3fcmj8tzTP8bOgVh4xC/Iwf3wugKzeGs9iaWEs02TJpCQk
 Yu+xqhHP05MGQuMXcQbPJy+DPq3a43Y/PBlzyF9ZmvJKqs0SxRIhgDnpRXDLla/m
 a2zYkzqZBog081idgy1KSJjL1XVBjHkcMwUaZ/mOCd/ok7pAhQIYPVeJEpBylf+l
 ABtTgn8qU6+QmkaeTypY8h3OAMES/PgA+PQp46zgmPJVopov0926QuMWzXb2Wbq9
 ZFGJziWdAZos5XWnMo6A
 =e6Ou
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Ext4 bug fixes for 3.17, to provide better handling of memory
  allocation failures, and to fix some journaling bugs involving
  journal checksums and FALLOC_FL_ZERO_RANGE"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix same-dir rename when inline data directory overflows
  jbd2: fix descriptor block size handling errors with journal_csum
  jbd2: fix infinite loop when recovering corrupt journal blocks
  ext4: update i_disksize coherently with block allocation on error path
  ext4: fix transaction issues for ext4_fallocate and ext_zero_range
  ext4: fix incorect journal credits reservation in ext4_zero_range
  ext4: move i_size,i_disksize update routines to helper function
  ext4: fix BUG_ON in mb_free_blocks()
  ext4: propagate errors up to ext4_find_entry()'s callers
2014-08-29 11:52:46 -07:00
Jaegeuk Kim 3304b56401 f2fs: fix wrong casting for dentry name
The dentry name type is unsigned char *.
If we don't match this type, some character codes can be changed by signed bit.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-29 00:26:50 -07:00
Darrick J. Wong d80d448c6c ext4: fix same-dir rename when inline data directory overflows
When performing a same-directory rename, it's possible that adding or
setting the new directory entry will cause the directory to overflow
the inline data area, which causes the directory to be converted to an
extent-based directory.  Under this circumstance it is necessary to
re-read the directory when deleting the old dirent because the "old
directory" context still points to i_block in the inode table, which
is now an extent tree root!  The delete fails with an FS error, and
the subsequent fsck complains about incorrect link counts and
hardlinked directories.

Test case (originally found with flat_dir_test in the metadata_csum
test program):

# mkfs.ext4 -O inline_data /dev/sda
# mount /dev/sda /mnt
# mkdir /mnt/x
# touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
# sync
# for i in /mnt/x/*; do mv $i $i.longer; done
# ls -la /mnt/x/
total 0
-rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
-rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
-rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer

(Hey!  Why are there four files now??)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Darrick J. Wong db9ee22036 jbd2: fix descriptor block size handling errors with journal_csum
It turns out that there are some serious problems with the on-disk
format of journal checksum v2.  The foremost is that the function to
calculate descriptor tag size returns sizes that are too big.  This
causes alignment issues on some architectures and is compounded by the
fact that some parts of jbd2 use the structure size (incorrectly) to
determine the presence of a 64bit journal instead of checking the
feature flags.

Therefore, introduce journal checksum v3, which enlarges the
descriptor block tag format to allow for full 32-bit checksums of
journal blocks, fix the journal tag function to return the correct
sizes, and fix the jbd2 recovery code to use feature flags to
determine 64bitness.

Add a few function helpers so we don't have to open-code quite so
many pieces.

Switching to a 16-byte block size was found to increase journal size
overhead by a maximum of 0.1%, to convert a 32-bit journal with no
checksumming to a 32-bit journal with checksum v3 enabled.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reported-by: TR Reardon <thomas_reardon@hotmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:29 -04:00
Darrick J. Wong 022eaa7517 jbd2: fix infinite loop when recovering corrupt journal blocks
When recovering the journal, don't fall into an infinite loop if we
encounter a corrupt journal block.  Instead, just skip the block and
return an error, which fails the mount and thus forces the user to run
a full filesystem fsck.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:22:28 -04:00
Dmitry Monakhov 6603120e96 ext4: update i_disksize coherently with block allocation on error path
In case of delalloc block i_disksize may be less than i_size. So we
have to update i_disksize each time we allocated and submitted some
blocks beyond i_disksize.  We weren't doing this on the error paths,
so fix this.

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-28 22:20:41 -04:00
J. Bruce Fields ccad7dad86 nfsd4: remove labeled NFS warning from config help
The working group appears committed to keeping the protocol stable, the
code has gotten some use and seems to work OK.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 16:00:07 -04:00
Anna Schumaker 2b8941b962 NFSD: Update some as-yet unused 4.2 error codes
Recent NFS v4.2 drafts have removed NFS4ERR_METADATA_NOTSUPP and
reassigned the error code to NFS4ERR_UNION_NOTSUPP.

I also add in the NFS4ERR_OFFLOAD_NO_REQS error code.

We're not using any of these yet, so there's no harm done.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 16:00:01 -04:00
Kinglong Mee 6cd906627b NFSD: Remove duplicate initialization of file_lock
locks_alloc_lock() has initialized struct file_lock, no need to
re-initialize it here.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-28 15:58:35 -04:00
Dan Carpenter 922cedbd00 f2fs: simplify by using a literal
We can make the code a bit simpler because we know that "!retry" is
zero.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-28 09:25:29 -07:00
Dmitry Monakhov c174e6d697 ext4: fix transaction issues for ext4_fallocate and ext_zero_range
After commit f282ac19d8 we use different transactions for
preallocation and i_disksize update which result in complain from fsck
after power-failure.  spotted by generic/019. IMHO this is regression
because fs becomes inconsistent, even more 'e2fsck -p' will no longer
works (which drives admins go crazy) Same transaction requirement
applies ctime,mtime updates

testcase: xfstest generic/019

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-27 18:40:00 -04:00
Dmitry Monakhov 69dc953640 ext4: fix incorect journal credits reservation in ext4_zero_range
Currently we reserve only 4 blocks but in worst case scenario
ext4_zero_partial_blocks() may want to zeroout and convert two
non adjacent blocks.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-27 18:33:49 -04:00
Linus Torvalds 1fb00cbca0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The biggest of these comes from Liu Bo, who tracked down a hang we've
  been hitting since moving to kernel workqueues (it's a btrfs bug, not
  in the generic code).  His patch needs backporting to 3.16 and 3.15
  stable, which I'll send once this is in.

  Otherwise these are assorted fixes.  Most were integrated last week
  during KS, but I wanted to give everyone the chance to test the
  result, so I waited for rc2 to come out before sending"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix task hang under heavy compressed write
  Btrfs: fix filemap_flush call in btrfs_file_release
  Btrfs: fix crash on endio of reading corrupted block
  btrfs: fix leak in qgroup_subtree_accounting() error path
  btrfs: Use right extent length when inserting overlap extent map.
  Btrfs: clone, don't create invalid hole extent map
  Btrfs: don't monopolize a core when evicting inode
  Btrfs: fix hole detection during file fsync
  Btrfs: ensure tmpfile inode is always persisted with link count of 0
  Btrfs: race free update of commit root for ro snapshots
  Btrfs: fix regression of btrfs device replace
  Btrfs: don't consider the missing device when allocating new chunks
  Btrfs: Fix wrong device size when we are resizing the device
  Btrfs: don't write any data into a readonly device when scrub
  Btrfs: Fix the problem that the replace destroys the seed filesystem
  btrfs: Return right extent when fiemap gives unaligned offset and len.
  Btrfs: fix wrong extent mapping for DirectIO
  Btrfs: fix wrong write range for filemap_fdatawrite_range()
  Btrfs: fix wrong missing device counter decrease
  Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
  ...
2014-08-27 09:14:17 -07:00
Chris Mason e9512d72e8 Btrfs: fix autodefrag with compression
The autodefrag code skips defrag when two extents are adjacent.  But one
big advantage for autodefrag is cutting down on the number of small
extents, even when they are adjacent.  This commit changes it to defrag
all small extents.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-27 08:45:37 -07:00
Milosz Tanski 920bce20d7 FS-Cache: Reduce cookie ref count if submit fails.
I've been seeing issues with disposing cookies under vma pressure. The symptom
is that the refcount gets out of sync. In this case we fail to decrement the
refcount if submit fails. I found this while auditing the error in and around
cookie operations.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27 15:29:34 +01:00
Milosz Tanski 9776de96e5 FS-Cache: Timeout for releasepage()
This is meant to avoid a recusive hang caused by underlying filesystem trying
to grab a free page and causing a write-out.

INFO: task kworker/u30:7:28375 blocked for more than 120 seconds.
      Not tainted 3.15.0-virtual #74
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u30:7   D 0000000000000000     0 28375      2 0x00000000
Workqueue: fscache_operation fscache_op_work_func [fscache]
 ffff88000b147148 0000000000000046 0000000000000000 ffff88000b1471c8
 ffff8807aa031820 0000000000014040 ffff88000b147fd8 0000000000014040
 ffff880f0c50c860 ffff8807aa031820 ffff88000b147158 ffff88007be59cd0
Call Trace:
 [<ffffffff815930e9>] schedule+0x29/0x70
 [<ffffffffa018bed5>] __fscache_wait_on_page_write+0x55/0x90 [fscache]
 [<ffffffff810a4350>] ? __wake_up_sync+0x20/0x20
 [<ffffffffa018c135>] __fscache_maybe_release_page+0x65/0x1e0 [fscache]
 [<ffffffffa02ad813>] ceph_releasepage+0x83/0x100 [ceph]
 [<ffffffff811635b0>] ? anon_vma_fork+0x130/0x130
 [<ffffffff8112cdd2>] try_to_release_page+0x32/0x50
 [<ffffffff81140096>] shrink_page_list+0x7e6/0x9d0
 [<ffffffff8113f278>] ? isolate_lru_pages.isra.73+0x78/0x1e0
 [<ffffffff81140932>] shrink_inactive_list+0x252/0x4c0
 [<ffffffff811412b1>] shrink_lruvec+0x3e1/0x670
 [<ffffffff8114157f>] shrink_zone+0x3f/0x110
 [<ffffffff81141b06>] do_try_to_free_pages+0x1d6/0x450
 [<ffffffff8114a939>] ? zone_statistics+0x99/0xc0
 [<ffffffff81141e44>] try_to_free_pages+0xc4/0x180
 [<ffffffff81136982>] __alloc_pages_nodemask+0x6b2/0xa60
 [<ffffffff811c1d4e>] ? __find_get_block+0xbe/0x250
 [<ffffffff810a405e>] ? wake_up_bit+0x2e/0x40
 [<ffffffff811740c3>] alloc_pages_current+0xb3/0x180
 [<ffffffff8112cf07>] __page_cache_alloc+0xb7/0xd0
 [<ffffffff8112da6c>] grab_cache_page_write_begin+0x7c/0xe0
 [<ffffffff81214072>] ? ext4_mark_inode_dirty+0x82/0x220
 [<ffffffff81214a89>] ext4_da_write_begin+0x89/0x2d0
 [<ffffffff8112c6ee>] generic_perform_write+0xbe/0x1d0
 [<ffffffff811a96b1>] ? update_time+0x81/0xc0
 [<ffffffff811ad4c2>] ? mnt_clone_write+0x12/0x30
 [<ffffffff8112e80e>] __generic_file_aio_write+0x1ce/0x3f0
 [<ffffffff8112ea8e>] generic_file_aio_write+0x5e/0xe0
 [<ffffffff8120b94f>] ext4_file_write+0x9f/0x410
 [<ffffffff8120af56>] ? ext4_file_open+0x66/0x180
 [<ffffffff8118f0da>] do_sync_write+0x5a/0x90
 [<ffffffffa025c6c9>] cachefiles_write_page+0x149/0x430 [cachefiles]
 [<ffffffff812cf439>] ? radix_tree_gang_lookup_tag+0x89/0xd0
 [<ffffffffa018c512>] fscache_write_op+0x222/0x3b0 [fscache]
 [<ffffffffa018b35a>] fscache_op_work_func+0x3a/0x100 [fscache]
 [<ffffffff8107bfe9>] process_one_work+0x179/0x4a0
 [<ffffffff8107d47b>] worker_thread+0x11b/0x370
 [<ffffffff8107d360>] ? manage_workers.isra.21+0x2e0/0x2e0
 [<ffffffff81083d69>] kthread+0xc9/0xe0
 [<ffffffff81010000>] ? ftrace_raw_event_xen_mmu_release_ptpage+0x70/0x90
 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0
 [<ffffffff8159eefc>] ret_from_fork+0x7c/0xb0
 [<ffffffff81083ca0>] ? flush_kthread_worker+0xb0/0xb0

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2014-08-27 15:24:06 +01:00
Dan Carpenter 88299c9bdb timerfd: Remove an always true check
We would have returned -EINVAL earlier if ticks wasn't set.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/20140801082848.GF28869@mwanda
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-08-27 11:17:48 +02:00
Trond Myklebust f87d928f6d NFSv3: Fix another acl regression
When creating a new object on the NFS server, we should not be sending
posix setacl requests unless the preceding posix_acl_create returned a
non-trivial acl. Doing so, causes Solaris servers in particular to
return an EINVAL.

Fixes: 013cdf1088 (nfs: use generic posix ACL infrastructure,,,)
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-26 16:17:48 -04:00
Trond Myklebust 412f6c4c26 NFSv4: Don't clear the open state when we just did an OPEN_DOWNGRADE
If we did an OPEN_DOWNGRADE, then the right thing to do on success, is
to apply the new open mode to the struct nfs4_state. Instead, we were
unconditionally clearing the state, making it appear to our state
machinery as if we had just performed a CLOSE.

Fixes: 226056c5c3 (NFSv4: Use correct locking when updating nfs4_state...)
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-26 16:17:48 -04:00
Trond Myklebust aee7af356e NFSv4: Fix problems with close in the presence of a delegation
In the presence of delegations, we can no longer assume that the
state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open
stateid share mode, and so we need to calculate the initial value
for calldata->arg.fmode using the state->flags.

Reported-by: James Drews <drews@engr.wisc.edu>
Fixes: 88069f77e1 (NFSv41: Fix a potential state leakage when...)
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-26 16:17:48 -04:00
Christoph Lameter a0b6bc63a2 block: Replace __this_cpu_ptr with raw_cpu_ptr
__this_cpu_ptr is being phased out use raw_cpu_ptr instead which was
introduced in 3.15-rc1.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:45 -04:00
Paul Bolle 78b1e540f2 fs: fix comment for 'CONFIG_LBADF'
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-26 09:35:56 +02:00
Rasmus Villemoes a71db86e86 fs/btrfs/tree-log.c: Fix closing brace followed by if
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-26 09:35:51 +02:00
Linus Torvalds f01bfc977e NFS client fixes for 3.17
Highlights:
 
 - More fixes for read/write codepath regressions
   - Sleeping while holding the inode lock
   - Stricter enforcement of page contiguity when coalescing requests
   - Fix up error handling in the page coalescing code
 - Don't busy wait on SIGKILL in the file locking code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT+0LpAAoJEGcL54qWCgDyWfsP/imrpge47aZywi95chV8vgjM
 O85ITZbupTFwXbB7kE63CrcaxRGhFrSStk4UDhDCDkHfFb1ksjZaPR1mnkwvkR2p
 4+JUoq0fkPfeX21+rqKCYmnhstpne/N8K8FJBsEs3/TqiCBWxWOelLXdyWun4H5B
 9JBYQ7FYitUazeSiSiDXcl7Di/E09cFPi0H5VPKRyuNdYxySabnsBOELBE/28iXr
 egW1I9UKQR2EtBrvgazBbWE5XmB9XAm4X3sD1l0QD65mfSNkbnNhPFSiCdT7f/d6
 9uxECR0Y4wNYgYAfVLBew5/MXJajcv03BFMKmTUeGj9fOQzycpBT4Dx2KxEWqfnt
 Xk2nNbISxBnO0koMflmo+LPv2lv+Br3kQ+eZCHHKknvBrX2a6bJdTCZkwACVtND9
 LdbAveFQpdaeLrm/28TnRoE927r+VeAVM19yOSG8sNAskFFg4Yy51tR0e1GivkJT
 +qmmTRx+l78HjHvoPXOYdNgBC954r6APH5ST7su/7WxNClM36fEK6XxA9xbDLJWm
 wUzlGKvpwEeBJJhgjbQLwuU8BiksjFz/CaiObNvPOpc/d2GoKIhnTg19kNhg2R//
 UCDa2d5fep4z0Bo9p0s1KZm9pSBkkLjvRp9dm8WEIxLcdaF1jBK3dJECepm6ccvw
 dmEmEfjbMudVdt/ZhapJ
 =2wRt
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights:

   - more fixes for read/write codepath regressions
     * sleeping while holding the inode lock
     * stricter enforcement of page contiguity when coalescing requests
     * fix up error handling in the page coalescing code

   - don't busy wait on SIGKILL in the file locking code"

* tag 'nfs-for-3.17-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
  nfs: can_coalesce_requests must enforce contiguity
  nfs: disallow duplicate pages in pgio page vectors
  nfs: don't sleep with inode lock in lock_and_join_requests
  nfs: fix error handling in lock_and_join_requests
  nfs: use blocking page_group_lock in add_request
  nfs: fix nonblocking calls to nfs_page_group_lock
  nfs: change nfs_page_group_lock argument
2014-08-25 15:34:28 -07:00
Steve French ca5d13fc33 Clarify Kconfig help text for CIFS and SMB2/SMB3
Clarify descriptions of SMB2 and SMB3 support in Kconfig

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2014-08-25 17:01:05 -05:00
Jaegeuk Kim c2e69583a4 f2fs: truncate stale block for inline_data
This verifies to truncate any allocated blocks, offset[0], by inline_data.
Not figured out, but for making sure.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-25 14:52:09 -07:00
Pavel Shilovsky 1bbe4997b1 CIFS: Fix wrong filename length for SMB2
The existing code uses the old MAX_NAME constant. This causes
XFS test generic/013 to fail. Fix it by replacing MAX_NAME with
PATH_MAX that SMB1 uses. Also remove an unused MAX_NAME constant
definition.

Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-25 16:45:17 -05:00
Pavel Shilovsky f736906a76 CIFS: Fix wrong restart readdir for SMB1
The existing code calls server->ops->close() that is not
right. This causes XFS test generic/310 to fail. Fix this
by using server->ops->closedir() function.

Cc: <stable@vger.kernel.org> # v3.7+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-25 16:44:28 -05:00
Benjamin LaHaise d856f32a86 aio: fix reqs_available handling
As reported by Dan Aloni, commit f8567a3845 ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer.  Reverting that commit would reintroduce a
regression when user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression.  Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring.  So long as there are as many events in the ring buffer as
there have been completion events generate, we cannot call
put_reqs_available().  The code to check for this is now placed in
refill_reqs_available().

A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Acked-by: Dan Aloni <dan@kernelim.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a3845 was backported to
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-24 15:47:27 -07:00
Liu Bo 9e0af23764 Btrfs: fix task hang under heavy compressed write
This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-24 07:17:02 -07:00
Dmitry Monakhov 4631dbf677 ext4: move i_size,i_disksize update routines to helper function
Cc: stable@vger.kernel.org # needed for bug fix patches
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-08-23 17:48:28 -04:00
Theodore Ts'o c99d1e6e83 ext4: fix BUG_ON in mb_free_blocks()
If we suffer a block allocation failure (for example due to a memory
allocation failure), it's possible that we will call
ext4_discard_allocated_blocks() before we've actually allocated any
blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
triggering the BUG_ON on mb_free_blocks():

	BUG_ON(last >= (sb->s_blocksize << 3));

Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len
is zero.

Also fix a missing ext4_mb_unload_buddy() call in
ext4_discard_allocated_blocks().

Google-Bug-Id: 16844242

Fixes: 86f0afd463
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-23 17:47:28 -04:00
Theodore Ts'o 36de928641 ext4: propagate errors up to ext4_find_entry()'s callers
If we run into some kind of error, such as ENOMEM, while calling
ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
gets propagated up to ext4_find_entry() and then to its callers.  This
way, transient errors such as ENOMEM can get propagated to the VFS.
This is important so that the system calls return the appropriate
error, and also so that in the case of ext4_lookup(), we return an
error instead of a NULL inode, since that will result in a negative
dentry cache entry that will stick around long past the OOM condition
which caused a transient ENOMEM error.

Google-Bug-Id: #17142205

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-08-23 17:47:19 -04:00
David Jeffery 92a56555bd nfs: Don't busy-wait on SIGKILL in __nfs_iocounter_wait
If a SIGKILL is sent to a task waiting in __nfs_iocounter_wait,
it will busy-wait or soft lockup in its while loop.
nfs_wait_bit_killable won't sleep, and the loop won't exit on
the error return.

Stop the busy-wait by breaking out of the loop when
nfs_wait_bit_killable returns an error.

Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:44 -04:00
Weston Andros Adamson 78270e8fbc nfs: can_coalesce_requests must enforce contiguity
Commit 6094f83864
"nfs: allow coalescing of subpage requests" got rid of the requirement
that requests cover whole pages, but it made some incorrect assumptions.

It turns out that callers of this interface can map adjacent requests
(by file position as seen by req_offset + req->wb_bytes) to different pages,
even when they could share a page. An example is the direct I/O interface -
iov_iter_get_pages_alloc may return one segment with a partial page filled
and the next segment (which is adjacent in the file position) starts with a
new page.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:44 -04:00
Weston Andros Adamson bba5c1887a nfs: disallow duplicate pages in pgio page vectors
Adjacent requests that share the same page are allowed, but should only
use one entry in the page vector. This avoids overruning the page
vector - it is sized based on how many bytes there are, not by
request count.

This fixes issues that manifest as "Redzone overwritten" bugs (the
vector overrun) and hangs waiting on page read / write, as it waits on
the same page more than once.

This also adds bounds checking to the page vector with a graceful failure
(WARN_ON_ONCE and pgio error returned to application).

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:44 -04:00
Weston Andros Adamson 7c3af97525 nfs: don't sleep with inode lock in lock_and_join_requests
This handles the 'nonblock=false' case in nfs_lock_and_join_requests.
If the group is already locked and blocking is allowed, drop the inode lock
and wait for the group lock to be cleared before trying it all again.
This should fix warnings found in peterz's tree (sched/wait branch), where
might_sleep() checks are added to wait.[ch].

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:43 -04:00
Weston Andros Adamson 94970014c4 nfs: fix error handling in lock_and_join_requests
This fixes handling of errors from nfs_page_group_lock in
nfs_lock_and_join_requests.  It now releases the inode lock and the
reference to the head request.

Reported-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:43 -04:00
Weston Andros Adamson bfd484a560 nfs: use blocking page_group_lock in add_request
__nfs_pageio_add_request was calling nfs_page_group_lock nonblocking, but
this can return -EAGAIN which would end up passing -EIO to the application.

There is no reason not to block in this path, so change the two calls to
do so. Also, there is no need to check the return value of
nfs_page_group_lock when nonblock=false, so remove the error handling code.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:43 -04:00
Weston Andros Adamson bc8a309e88 nfs: fix nonblocking calls to nfs_page_group_lock
nfs_page_group_lock was calling wait_on_bit_lock even when told not to
block. Fix by first trying test_and_set_bit, followed by wait_on_bit_lock
if and only if blocking is allowed.  Return -EAGAIN if nonblocking and the
test_and_set of the bit was already locked.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:42 -04:00
Weston Andros Adamson fd2f3a06d3 nfs: change nfs_page_group_lock argument
Flip the meaning of the second argument from 'wait' to 'nonblock' to
match related functions. Update all five calls to reflect this change.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Reviewed-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-22 18:04:42 -04:00
Chao Yu b5b822050c f2fs: use macro for code readability
This patch introduces DEF_NIDS_PER_INODE/GET_ORPHAN_BLOCKS/F2FS_CP_PACKS macro
instead of numbers in code for readability.

change log from v1:
 o fix typo pointed out by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-22 13:56:47 -07:00
Jeff Layton e0b760ff71 locks: pass correct "before" pointer to locks_unlink_lock in generic_add_lease
The argument to locks_unlink_lock can't be just any pointer to a
pointer. It must be a pointer to the fl_next field in the previous
lock in the list.

Cc: <stable@vger.kernel.org> # v3.15+
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-08-22 09:58:22 -04:00
Pavel Shilovsky a07d322059 CIFS: Fix directory rename error
CIFS servers process nlink counts differently for files and directories.
In cifs_rename() if we the request fails on the existing target, we
try to remove it through cifs_unlink() but this is not what we want
to do for directories. As the result the following sequence of commands

mkdir {1,2}; mv -T 1 2; rmdir {1,2}; mkdir {1,2}; echo foo > 2/bar

and XFS test generic/023 fail with -ENOENT error. That's why the second
mkdir reuses the existing inode (target inode of the mv -T command) with
S_DEAD flag.

Fix this by checking whether the target is directory or not and
calling cifs_rmdir() rather than cifs_unlink() for directories.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-22 00:26:56 -05:00
Namjae Jeon 52a3624444 cifs: No need to send SIGKILL to demux_thread during umount
There is no need to explicitly send SIGKILL to cifs_demultiplex_thread
as it is calling module_put_and_exit to exit cleanly.

socket sk_rcvtimeo is set to 7 HZ so the thread will wake up in 7 seconds and
clean itself.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Acked-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-22 00:20:58 -05:00
Namjae Jeon 787aded650 cifs: Allow directIO read/write during cache=strict
Currently cifs have all or nothing approach for directIO operations.
cache=strict mode does not allow directIO while cache=none mode performs
all the operations as directIO even when user does not specify O_DIRECT
flag. This patch enables strict cache mode to honour directIO semantics.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-22 00:20:39 -05:00
Chao Yu 9d1589ef2e f2fs: introduce need_do_checkpoint for readability
This patch introduce need_do_checkpoint() to include numerous judgment condition
for readability.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:07 -07:00
Chao Yu c200b1aa6c f2fs: fix incorrect calculation with total/free inode num
Theoretically, our total inodes number is the same as total node number, but
there are three node ids are reserved in f2fs, they are 0, 1 (node nid), and 2
(meta nid), and they should never be used by user, so our total/free inode
number calculated in ->statfs is wrong.

This patch indroduces F2FS_RESERVED_NODE_NUM and then fixes this issue by
recalculating total/free inode number with the macro.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:06 -07:00
Jaegeuk Kim 04859dba50 f2fs: remove rename and use rename2
Refer the following patch.

commit 7177a9c4b5
Author: Miklos Szeredi <mszeredi@suse.cz>
Date:   Wed Jul 23 15:15:30 2014 +0200

    fs: call rename2 if exists

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:04 -07:00
Jaegeuk Kim ec4e7af4ca f2fs: skip if inline_data was converted already
This patch checks inline_data one more time under the inode page lock whether
its inline_data is converted or not.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:03 -07:00
Jaegeuk Kim 202095a7a0 f2fs: remove rewrite_node_page
I think we need to let the dirty node pages remain in the page cache instead
of rewriting them in their places.
So, after done with successful recovery, write_checkpoint will flush all of them
through the normal write path.
Through this, we can avoid potential error cases in terms of block allocation.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:02 -07:00
Jaegeuk Kim 764aa3e978 f2fs: avoid double lock in truncate_blocks
The init_inode_metadata calls truncate_blocks when error is occurred.
The callers holds f2fs_lock_op, so we should not call it again in
truncate_blocks.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:01 -07:00
Jaegeuk Kim 14f4e69085 f2fs: prevent checkpoint during roll-forward
Any checkpoint should not be done during the core roll-forward procedure.
Especially, it includes error cases too.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:57:00 -07:00
Jaegeuk Kim b3fe0a0da2 f2fs: add WARN_ON in f2fs_bug_on
This patch adds WARN_ON when f2fs_bug_on is disable to see kernel messages.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:56:59 -07:00
Jaegeuk Kim cf779cab14 f2fs: handle EIO not to break fs consistency
There are two rules when EIO is occurred.
1. don't write any checkpoint data to preserve the previous checkpoint
2. don't lose the cached dentry/node/meta pages

So, at first, this patch adds set_page_dirty in f2fs_write_end_io's failure.
Then, writing checkpoint/dentry/node blocks is not allowed.

Note that, for the data pages, we can't just throw away by redirtying them.
Otherwise, kworker can fall into infinite loop to flush them.
(Ref. xfstests/019)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 13:55:05 -07:00
Namjae Jeon d4a029d215 cifs: remove unneeded check of null checking in if condition
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-21 12:13:05 -05:00
Namjae Jeon 7de975e349 cifs: fix a possible use of uninit variable in SMB2_sess_setup
In case of error, goto ssetup_exit can be hit and we could end up using
uninitialized value of resp_buftype

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-21 12:12:59 -05:00
Namjae Jeon d6ccf4997e cifs: fix memory leak when password is supplied multiple times
Unlikely but possible. When password is supplied multiple times, we have
to free the previous allocation.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-21 12:06:57 -05:00
Namjae Jeon 27b7edcf1c cifs: fix a possible null pointer deref in decode_ascii_ssetup
When kzalloc fails, we will end up doing NULL pointer derefrence

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-21 12:04:29 -05:00
Jaegeuk Kim 8501017e50 f2fs: check s_dirty under cp_mutex
It needs to check s_dirty under cp_mutex, since s_dirty is reset under that
mutex.
And previous condition was not correct, since we can omit doing checkpoint
when checkpoint was done followed by all the node pages were written back.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:02 -07:00
Jaegeuk Kim 5274651927 f2fs: unlock_page when node page is redirtied out
This patch fixes missing unlock_page when a node page is redirtied out.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:01 -07:00
Jaegeuk Kim 1e968fdfe6 f2fs: introduce f2fs_cp_error for readability
This patch adds f2fs_cp_error for readability.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:00 -07:00
Jaegeuk Kim ed2e621a95 f2fs: give a chance to mount again when encountering errors
This patch gives another chance to try mount process when we encounter an error.
This makes an effect on the roll-forward recovery failures as well.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:21:00 -07:00
Jaegeuk Kim 6f12ac25f0 f2fs: trigger release_dirty_inode in f2fs_put_super
The generic_shutdown_super calls sync_filesystem, evict_inode, and then
f2fs_put_super. In f2fs_evict_inode, we remain some dirty inode information
so we should release them at f2fs_put_super.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-21 09:20:29 -07:00
Chris Mason f6dc45c7a9 Btrfs: fix filemap_flush call in btrfs_file_release
We should only be flushing on close if the file was flagged as needing
it during truncate.  I broke this with my ordered data vs transaction
commit deadlock fix.

Thanks to Miao Xie for catching this.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
2014-08-21 07:55:31 -07:00
Liu Bo 38c1c2e44b Btrfs: fix crash on endio of reading corrupted block
The crash is

------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:2124!
[...]
Workqueue: btrfs-endio normal_work_helper [btrfs]
RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]

This is in fact a regression.

It is because we forgot to increase @offset properly in reading corrupted block,
so that the @offset remains, and this leads to checksum errors while reading
left blocks queued up in the same bio, and then ends up with hiting the above
BUG_ON.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:30 -07:00
Eric Sandeen a3c108950d btrfs: fix leak in qgroup_subtree_accounting() error path
Coverity pointed this out; in the newly added
qgroup_subtree_accounting(), if btrfs_find_all_roots()
returns an error, we leak at least the parents pointer,
and possibly the roots pointer, depending on what failure
occurs.

If btrfs_find_all_roots() returns an error, we need to
free up all allocations before we return.  "roots" is
initialized to NULL, so it should be safe to free
it unconditionally (ulist_free() handles that case).

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:29 -07:00
Qu Wenruo 51f395ad40 btrfs: Use right extent length when inserting overlap extent map.
When current btrfs finds that a new extent map is going to be insereted
but failed with -EEXIST, it will try again to insert the extent map
but with the length of sectorsize.
This is OK if we don't enable 'no-holes' feature since all extent space
is continuous, we will not go into the not found->insert routine.

But if we enable 'no-holes' feature, it will make things out of control.
e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
btrfs_get_extent() args: start:  27874 len 4100
28672		  27874		28672	27874+4100	32768
                    |-----------------------|
|---------hole--------------------|---------data----------|

1) not found and insert
Since no extent map containing the range, btrfs_get_extent() will go
into the not_found and insert routine, which will try to insert the
extent map (27874, 27847 + 4100).

2) first overlap
But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
by add_extent_mapping().

3) retry but still overlap
After catching the -EEXIST, then btrfs_get_extent() will try insert it
again but with 4K length, which still overlaps, so -EEXIST will be
returned.

This makes the following patch fail to punch hole.
d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.

This patch will use the right length, which is the (exsisting->start -
em->start) to insert, making the above patch works in 'no-holes' mode.
Also, some small code style problems in above patch is fixed too.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe David Manana <fdmanana@suse.com>
Tested-by: Filipe David Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:27 -07:00
Filipe Manana 62e2390e1a Btrfs: clone, don't create invalid hole extent map
When cloning a file that consists of an inline extent, we were creating
an extent map that represents a non-existing trailing hole starting at a
file offset that isn't a multiple of the sector size. This happened because
when processing an inline extent we weren't aligning the extent's length to
the sector size, and therefore incorrectly treating the range
[inline_extent_length; sector_size[ as a hole.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:26 -07:00
Filipe Manana 7064dd5c36 Btrfs: don't monopolize a core when evicting inode
If an inode has a very large number of extent maps, we can spend
a lot of time freeing them, which triggers a soft lockup warning.
Therefore reschedule if we need to when freeing the extent maps
while evicting the inode.

I could trigger this all the time by running xfstests/generic/299 on
a file system with the no-holes feature enabled. That test creates
an inode with 11386677 extent maps.

    $ mkfs.btrfs -f -O no-holes $TEST_DEV
    $ MKFS_OPTIONS="-O no-holes" ./check generic/299
    generic/299 382s ...
    Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
     kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
     384s
    Ran: generic/299
    Passed all 1 tests

    $ dmesg
    (...)
    [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
    (...)
    [86304.300036] Call Trace:
    [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
    [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
    [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
    [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
    [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
    [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
    [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
    [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
    [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
    [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
    [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
    [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
    [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
    [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
    [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
    [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
    [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:25 -07:00
Filipe Manana 74121f7cbb Btrfs: fix hole detection during file fsync
The file hole detection logic during a file fsync wasn't correct,
because it didn't look back (in a previous leaf) for the last file
extent item that can be in a leaf to the left of our leaf and that
has a generation lower than the current transaction id. This made it
assume that a hole exists when it really doesn't exist in the file.

Such false positive hole detection happens in the following scenario:

* We have a file that has many file extent items, covering 3 or more
  btree leafs (the first leaf must contain non file extent items too).

* Two ranges of the file are modified, with their extent items being
  located at 2 different leafs and those leafs aren't consecutive.

* When processing the second modified leaf, we weren't checking if
  some file extent item exists that is located in some leaf that is
  between our 2 modified leafs, and therefore assumed the range defined
  between the last file extent item in the first leaf and the first file
  extent item in the second leaf matched a hole.

Fortunately this didn't result in overriding the log with wrong data,
instead it made the last loop in copy_items() attempt to insert a
duplicated key (for a hole file extent item), which makes the file
fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
turn ends up doing a full transaction commit, which is much more
expensive then writing only to the log tree and wait for it to be
durably persisted (as well as the file's modified extents/pages).
Therefore fix the hole detection logic, so that we don't pay the
cost of doing full transaction commits.

I could trigger this issue with the following test for xfstests (which
never fails, either without or with this patch). The last fsync call
results in a full transaction commit, due to the -EEXIST error mentioned
above. I could also observe this behaviour happening frequently when
running xfstests/generic/075 in a loop.

Test:

    _cleanup()
    {
        _cleanup_flakey
        rm -fr $tmp
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_flakey
    _need_to_be_root

    rm -f $seqres.full

    # Create a file with many file extent items, each representing a 4Kb extent.
    # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
    # as of btrfs-progs 3.12).
    _scratch_mkfs -l 16384 >/dev/null 2>&1
    _init_flakey
    SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
    MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
    _mount_flakey

    # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
    $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
            $SCRATCH_MNT/foo | _filter_xfs_io

    # For any of the following fsync calls, inode doesn't have the flag
    # BTRFS_INODE_NEEDS_FULL_SYNC set.
    for ((i = 1; i <= 500; i++)); do
        OFFSET=$((4096 * i))
        LEN=4096
        $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                $SCRATCH_MNT/foo | _filter_xfs_io
    done

    # Commit transaction and bump next transaction's id (to 7).
    sync

    # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
    # inode runtime flags.
    $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo

    # Commit transaction and bump next transaction's id (to 8).
    sync

    # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
    # in the middle, containing only file extent items, isn't touched. So the
    # next fsync, when calling btrfs_search_forward(), won't visit that middle
    # leaf. First and 3rd leaf have now a generation with value 8, while the
    # middle leaf remains with a generation with value 6.
    $XFS_IO_PROG \
        -c "pwrite -S 0xee -b 4096 0 4096" \
        -c "pwrite -S 0xff -b 4096 2043904 4096" \
        -c "fsync" \
        $SCRATCH_MNT/foo | _filter_xfs_io

    _load_flakey_table $FLAKEY_DROP_WRITES
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey

    _load_flakey_table $FLAKEY_ALLOW_WRITES
    # During mount, we'll replay the log created by the fsync above, and the file's
    # md5 digest should be the same we got before the unmount.
    _mount_flakey
    md5sum $SCRATCH_MNT/foo | _filter_scratch
    _unmount_flakey
    MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"

    status=0
    exit

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:24 -07:00
Filipe Manana 5762b5c958 Btrfs: ensure tmpfile inode is always persisted with link count of 0
If we open a file with O_TMPFILE, don't do any further operation on
it (so that the inode item isn't updated) and then force a transaction
commit, we get a persisted inode item with a link count of 1, and not 0
as it should be.

Steps to reproduce it (requires a modern xfs_io with -T support):

    $ mkfs.btrfs -f /dev/sdd
    $ mount -o /dev/sdd /mnt
    $ xfs_io -T /mnt &
    $ sync

Then btrfs-debug-tree shows the inode item with a link count of 1:

    $ btrfs-debug-tree /dev/sdd
    (...)
    fs tree key (FS_TREE ROOT_ITEM 0)
    leaf 29556736 items 4 free space 15851 generation 6 owner 5
    fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
    chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
    	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
    	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
    		inode ref index 0 namelen 2 name: ..
    	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
    		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
    	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
		orphan item
    checksum tree key (CSUM_TREE ROOT_ITEM 0)
    (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:23 -07:00
Filipe Manana 9c3b306e1c Btrfs: race free update of commit root for ro snapshots
This is a better solution for the problem addressed in the following
commit:

    Btrfs: update commit root on snapshot creation after orphan cleanup
    (3821f34888)

The previous solution wasn't the best because of 2 reasons:

    1) It added another full transaction commit, which is more expensive
       than just swapping the commit root with the root;

    2) If a reboot happened after the first transaction commit (the one
       that creates the snapshot) and before the second transaction commit,
       then we would end up with the same problem if a send using that
       snapshot was requested before the first transaction commit after
       the reboot.

This change addresses those 2 issues. The second issue is addressed by
switching the commit root in the dentry lookup VFS callback, which is
also called by the snapshot/subvol creation ioctl and performs orphan
cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
preventing race issues between a dentry lookup and snapshot creation.

Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:21 -07:00
Liu Bo 87fa3bb078 Btrfs: fix regression of btrfs device replace
Commit 49c6f736f34f901117c20960ebd7d5e60f12fcac(
btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry
in the process of device replace, but didn't take missing devices into account,
so now we have

BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
...

To reproduce it,
1. mkfs.btrfs -f disk1 disk2
2. mkfs.ext4 disk1
3. mount disk2 /mnt -odegraded
4. btrfs replace start -B 1 disk3 /mnt
--------------------------

This fixes the problem.

Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-21 07:55:20 -07:00
Bob Peterson 2ddfbdd684 GFS2: Request demote when a "try" flock fails
This patch changes the flock code so that it uses the TRY_1CB flag
instead of the TRY flag on the first attempt. That forces any holding
nodes to issue a dlm callback, which requests a demote of the glock.
Then, if the "try" failed, it sleeps a small amount of time for the
demote to occur. Then it tries again, for an increasing amount of time.
Subsequent attempts to gain the "try" lock don't use "_1CB" so that
only one callback is issued.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-08-21 10:22:52 +01:00
Bob Peterson b650738cd0 GFS2: Change maxlen variables to size_t
This patch changes some variables (especially maxlen in function
gfs2_block_map) from unsigned int to size_t. We need 64-bit arithmetic
for very large files (e.g. 1PB) where the variables otherwise get
shifted to all 0's.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-08-21 10:22:23 +01:00
Fabian Frederick eaebdedc61 GFS2: fs/gfs2/super.c: replace seq_printf by seq_puts
fix checkpatch warnings:
"WARNING: Prefer seq_puts to seq_printf"

Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-08-21 10:22:05 +01:00
Steve French 2bb93d2441 Trivial whitespace fix
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-20 21:21:29 -05:00
Linus Torvalds 372b1dbdd1 Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "Most important fixes in this set include three SMB3 fixes for stable
  (including fix for possible kernel oops), and a workaround to allow
  writes to Mac servers (only cifs dialect, not more current SMB2.1,
  worked to Mac servers).  Also fallocate support added, and lease fix
  from Jeff"

* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
  [SMB3] Enable fallocate -z support for SMB3 mounts
  enable fallocate punch hole ("fallocate -p") for SMB3
  Incorrect error returned on setting file compressed on SMB2
  CIFS: Fix wrong directory attributes after rename
  CIFS: Fix SMB2 readdir error handling
  [CIFS] Possible null ptr deref in SMB2_tcon
  [CIFS] Workaround MacOS server problem with SMB2.1 write  response
  cifs: handle lease F_UNLCK requests properly
  Cleanup sparse file support by creating worker function for it
  Add sparse file support to SMB2/SMB3 mounts
  Add missing definitions for CIFS File System Attributes
  cifs: remove unused function cifs_oplock_break_wait
2014-08-20 18:33:21 -05:00
Chin-Tsung Cheng e6d8fb340f ext3: Count internal journal as bsddf overhead in ext3_statfs
The journal blocks of external journal device should not
be counted as overhead.

Signed-off-by: Chin-Tsung Cheng <chintzung@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 23:16:51 +02:00
Jaegeuk Kim 97c3c5cac2 f2fs: don't skip checkpoint if there is no dirty node pages
This is the errorneous scenario.
1. write data
2. do checkpoint
3. produce some dirty node pages by the gc thread
4. write back dirty node pages
5. f2fs_put_super will skip the checkpoint, since dirty count for node pages is
  zero.

This patch removes such the wrong condition check.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:35 -07:00
Jaegeuk Kim b307384e4f f2fs: avoid bug_on when error is occurred
During the recovery, if an error like EIO or ENOMEM, f2fs_bug_on should skip.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:35 -07:00
Jaegeuk Kim 1c35a90e8a f2fs: fix to recover inline_xattr/data and blocks
This patch fixes not to skip xattr recovery and inline xattr/data recovery
order.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim e3b4d43f7c f2fs: should clear the inline_xattr flag
During the recovery, we should clear the inline_xattr flag if its xattr node
block is recovered.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 695facc05a f2fs: clear FI_INC_LINK during the recovery
If an inode are fsynced multiple times with fsync & dent marks, this inode will
set FI_INC_LINK at find_fsync_dnodes during the recovery.
But, in recover_inode, recover_dentry doesn't clear that flag when multiple hits
were occurred.

So this patch removes the flag for the further consistency.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 617deb8c05 f2fs: fix the initial inode page for recovery
If a new inode page is needed for recover_dentry, we should assing i_inline
as zero.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:34 -07:00
Jaegeuk Kim 0342fd301a f2fs: make clear on test condition and return types
This patch adds a parentheses to make clear for condition check.
And also it changes the return type for better meanings.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
Jaegeuk Kim b067ba1f1b f2fs: should convert inline_data during the mkwrite
If mkwrite is called to an inode having inline_data, it can overwrite the data
index space as NEW_ADDR. (e.g., the first 4 bytes are coincidently zero)

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
arter97 e1c4204520 f2fs: fix typo
Fix typo and some grammatical errors.

The words "filesystem" and "readahead" are being used without the space treewide.

Signed-off-by: Park Ju Hyung <qkrwngud825@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-19 10:01:33 -07:00
Jan Kara 410dd3cf4c isofs: Fix unbounded recursion when processing relocated directories
We did not check relocated directory in any way when processing Rock
Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL
entry pointing to another CL entry leading to possibly unbounded
recursion in kernel code and thus stack overflow or deadlocks (if there
is a loop created from CL entries).

Fix the problem by not allowing CL entry to point to a directory entry
with CL entry (such use makes no good sense anyway) and by checking
whether CL entry doesn't point to itself.

CC: stable@vger.kernel.org
Reported-by: Chris Evans <cevans@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 18:29:30 +02:00
Chao Yu 85cd083b49 udf: avoid unneeded up_write when fail to add entry in ->symlink
We have released the ->i_data_sem before invoking udf_add_entry(),
so in following error path, we should not release this lock again.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-19 18:29:30 +02:00
Miao Xie 95669976bd Btrfs: don't consider the missing device when allocating new chunks
The original code allocated new chunks by the number of the writable devices
and missing devices to make sure that any RAID levels on a degraded FS continue
to be honored, but it introduced a problem that it stopped us to allocating
new chunks, the steps to reproduce is following:

 # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
 # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
 # mount -o degraded <dev0> <mnt>
 # dd if=/dev/null of=<mnt>/tmpfile bs=1M

It is because we allocate new chunks only on the writable devices, if we take
the number of missing devices into account, and want to allocate new chunks
with higher RAID level, we will fail becaue we don't have enough writable
device. Fix it by ignoring the number of missing devices when allocating
new chunks.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:19 -07:00
Miao Xie 7df69d3e94 Btrfs: Fix wrong device size when we are resizing the device
total_bytes of device is just a in-memory variant which is used to record
the size of the device, and it might be changed before we resize a device,
if the resize operation fails, it will be fallbacked. But some code used it
to update on-disk metadata of the device, it would cause the problem that
on-disk metadata of the devices was not consistent. We should use the other
variant named disk_total_bytes to update the on-disk metadata of device,
because that variant is updated only when the resize operation is successful.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:18 -07:00
Miao Xie 5d68da3b8e Btrfs: don't write any data into a readonly device when scrub
We should not write data into a readonly device especially seed device when
doing scrub, skip those devices.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:17 -07:00
Miao Xie ff61d17c63 Btrfs: Fix the problem that the replace destroys the seed filesystem
The seed filesystem was destroyed by the device replace, the reproduce
method is:
 # mkfs.btrfs -f <dev0>
 # btrfstune -S 1 <dev0>
 # mount <dev0> <mnt>
 # btrfs device add <dev1> <mnt>
 # umount <mnt>
 # mount <dev1> <mnt>
 # btrfs replace start -f <dev0> <dev2> <mnt>
 # umount <mnt>
 # mount <dev0> <mnt>

It is because we erase the super block on the seed device. It is wrong,
we should not change anything on the seed device.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:16 -07:00
Qu Wenruo 2c91943b50 btrfs: Return right extent when fiemap gives unaligned offset and len.
When page aligned start and len passed to extent_fiemap(), the result is
good, but when start and len is not aligned, e.g. start = 1 and len =
4095 is passed to extent_fiemap(), it returns no extent.

The problem is that start and len is all rounded down which causes the
problem. This patch will round down start and round up (start + len) to
return right extent.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:14 -07:00
Wang Shilong e2eca69dc6 Btrfs: fix wrong extent mapping for DirectIO
btrfs_next_leaf() will use current leaf's last key to search
and then return a bigger one. So it may still return a file extent
item that is smaller than expected value and we will
get an overflow here for @em->len.

This is easy to reproduce for Btrfs Direct writting, it did not
cause any problem, because writting will re-insert right mapping later.

However, by hacking code to make DIO support compression, wrong extent
mapping is kept and it encounter merging failure(EEXIST) quickly.

Fix this problem by looping to find next file extent item that is bigger
than @start or we could not find anything more.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:13 -07:00
Wang Shilong 9a025a0860 Btrfs: fix wrong write range for filemap_fdatawrite_range()
filemap_fdatawrite_range() expect the third arg to be @end
not @len, fix it.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:12 -07:00
Miao Xie 3a7d55c84c Btrfs: fix wrong missing device counter decrease
The missing devices are accounted by its own fs device, for example
the missing devices in seed filesystem will be accounted by the fs device
of the seed filesystem, not by the new filesystem which is based on
the seed filesystem, so when we remove the missing device in the
seed filesystem, we should decrease the counter of its own fs device.
Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:52:10 -07:00
Miao Xie 69611ac810 Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs
We forgot to zero some members in fs_devices when we create new fs_devices
from the one of the seed fs. It would cause the problem that we got wrong
chunk profile when allocating chunks. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:32 -07:00
Anand Jain 77bdae4d13 btrfs: check generation as replace duplicates devid+uuid
When FS in unmounted we need to check generation number as well
since devid+uuid combination could match with the missing replaced
disk when it reappears, and without this patch it might pair with
the replaced disk again.

 device_list_add() function is called in the following threads,
	mount device option
	mount argument
	ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan)
	ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>)
 they have been unit tested to work fine with this patch.

 If the user knows what he is doing and really want to pair with
 replaced disk (which is not a standard operation), then he should
 first clear the kernel btrfs device list in the memory by doing
 the module unload/load and followed with the mount -o device option.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:30 -07:00
Anand Jain b96de000bc Btrfs: device_list_add() should not update list when mounted
device_list_add() is called when user runs btrfs dev scan, which would add
any btrfs device into the btrfs_fs_devices list.

Now think of a mounted btrfs. And a new device which contains the a SB
from the mounted btrfs devices.

In this situation when user runs btrfs dev scan, the current code would
just replace existing device with the new device.

Which is to note that old device is neither closed nor gracefully
removed from the btrfs.

The FS is still operational with the old bdev however the device name
is the btrfs_device is new which is provided by the btrfs dev scan.

reproducer:

devmgt[1] detach /dev/sdc

replace the missing disk /dev/sdc

btrfs rep start -f 1 /dev/sde /btrfs
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120
        Total devices 2 FS bytes used 32.00KiB
        devid    1 size 958.94MiB used 115.88MiB path /dev/sde
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

make /dev/sdc to reappear

devmgt attach host2

btrfs dev scan

btrfs fi show -m
Label: none  uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M
        Total devices 2 FS bytes used 32.00KiB^M
        devid    1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong.
        devid    2 size 958.94MiB used 103.88MiB path /dev/sdd

since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be
part of the btrfs-fsid when it reappears. If user want it to be part of it
then sys admin should be using btrfs device add instead.

[1] github.com/anajain/devmgt.git

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:28 -07:00
chandan 1707e26d6a Btrfs: fill_holes: Fix slot number passed to hole_mergeable() call.
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot
where the key could have been present, which in this case would be the slot
containing the extent item which would be the next neighbor of the file range
being punched. The current code passes an incremented path->slots[0] and we
skip to the wrong file extent item. This would mean that we would fail to
merge the "yet to be created" hole with the next neighboring hole (if one
exists). Fix this.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:26 -07:00
Miao Xie 7a5c3c9be1 Btrfs: fix put dio bio twice when we submit dio bio fail
The caller of btrfs_submit_direct_hook() will put the original dio bio
when btrfs_submit_direct_hook() return a error number, so we needn't
put the original bio in btrfs_submit_direct_hook().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-19 08:36:24 -07:00
Rajesh Ghanekar 18c01ab302 nfsd: allow turning off nfsv3 readdir_plus
One of our customer's application only needs file names, not file
attributes. With directories having 10K+ inodes (assuming buffer cache
has directory blocks cached having file names, but inode cache is
limited and hence need eviction of older cached inodes), older inodes
are evicted periodically. So if they keep on doing readdir(2) from NSF
client on multiple directories, some directory's files are periodically
removed from inode cache and hence new readdir(2) on same directory
requires disk access to bring back inodes again to inode cache.

As READDIRPLUS request fetches attributes also, doing getattr on each
file on server, it causes unnecessary disk accesses. If READDIRPLUS on
NFS client is returned with -ENOTSUPP, NFS client uses READDIR request
which just gets the names of the files in a directory, not attributes,
hence avoiding disk accesses on server.

There's already a corresponding client-side mount option, but an export
option reduces the need for configuration across multiple clients.

This flag affects NFSv3 only.  If it turns out it's needed for NFSv4 as
well then we may have to figure out how to extend the behavior to NFSv4,
but it's not currently obvious how to do that.

Signed-off-by: Rajesh Ghanekar <rajesh_ghanekar@symantec.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-18 15:12:14 -04:00
Steve French 30175628bf [SMB3] Enable fallocate -z support for SMB3 mounts
fallocate -z (FALLOC_FL_ZERO_RANGE) can map to SMB3
FSCTL_SET_ZERO_DATA SMB3 FSCTL but FALLOC_FL_ZERO_RANGE
when called without the FALLOC_FL_KEEPSIZE flag set could want
the file size changed so we can not support that subcase unless
the file is cached (and thus we know the file size).

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2014-08-17 18:16:40 -05:00
Steve French 31742c5a33 enable fallocate punch hole ("fallocate -p") for SMB3
Implement FALLOC_FL_PUNCH_HOLE (which does not change the file size
fortunately so this matches the behavior of the equivalent SMB3
fsctl call) for SMB3 mounts.  This allows "fallocate -p" to work.
It requires that the server support setting files as sparse
(which Windows allows).

Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 18:12:38 -05:00
Steve French ad3829cf1d Incorrect error returned on setting file compressed on SMB2
When the server (for an SMB2 or SMB3 mount) doesn't support
an ioctl (such as setting the compressed flag
on a file) we were incorrectly returning EIO instead
of EOPNOTSUPP, this is confusing e.g. doing chattr +c to a file
on a non-btrfs Samba partition, now the error returned is more
intuitive to the user.  Also fixes error mapping on setting
hardlink to servers which don't support that.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@suse.de>
2014-08-17 18:12:31 -05:00
J. Bruce Fields f7b43d0c99 nfsd4: reserve adequate space for LOCK op
As of  8c7424cff6 "nfsd4: don't try to encode conflicting owner if low
on space", we permit the server to process a LOCK operation even if
there might not be space to return the conflicting lockowner, because
we've made returning the conflicting lockowner optional.

However, the rpc server still wants to know the most we might possibly
return, so we need to take into account the possible conflicting
lockowner in the svc_reserve_space() call here.

Symptoms were log messages like "RPC request reserved 88 but used 108".

Fixes: 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low on space"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
J. Bruce Fields 1383bf37ce nfsd4: remove obsolete comment
We do what Neil suggests now.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Ross Lagerwall 63bab0651b nfsd3: Check write permission after checking existence
When creating a file that already exists in a read-only directory with
O_EXCL, the NFSv3 server returns EACCES rather than EEXIST (which local
files and the NFSv4 server return).  Fix this by checking the MAY_CREATE
permission only if the file does not exist.  Since this already happens
in do_nfsd_create, the check in nfsd3_proc_create can simply be removed.

Signed-off-by: Ross Lagerwall <rosslagerwall@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Jeff Layton afbda402a0 nfsd: call nfs4_put_deleg_lease outside of state_lock
Currently, we hold the state_lock when releasing the lease. That's
potentially problematic in the future if we allow for setlease methods
that can sleep. Move the nfs4_put_deleg_lease call out of the delegation
unhashing routine (which was always a bit goofy anyway), and into the
unlocked sections of the callers of unhash_delegation_locked.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:14 -04:00
Jeff Layton 6bcc034eac nfsd: protect lease-related nfs4_file fields with fi_lock
Currently these fields are protected with the state_lock, but that
doesn't really make a lot of sense. These fields are "private" to the
nfs4_file, and can be protected with the more granular fi_lock.

The fi_lock is already held when setting these fields. Make the code
hold the fp->fi_lock when clearing the lease-related fields in the
nfs4_file, and no longer require that the state_lock be held when
calling into this function.

To prevent lock inversion with the i_lock, we also move the vfs_setlease
and fput calls outside of the fi_lock. This also sets us up for allowing
vfs_setlease calls to block in the future.

Finally, remove a redundant NULL pointer check. unhash_delegation_locked
locks the fp->fi_lock prior to that check, so fp in that function must
never be NULL.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust ef9b16dc6d nfsd: Reorder nfsd_cache_match to check more powerful discriminators first
We would normally expect the xid and the checksum to be the best
discriminators. Check them before looking at the procedure number,
etc.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust 89a26b3d29 nfsd: split DRC global spinlock into per-bucket locks
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:13 -04:00
Trond Myklebust 31e60f5222 nfsd: convert num_drc_entries to an atomic_t
...so we can remove the spinlocking around it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust 11acf6ef3b nfsd: Remove the cache_hash list
Now that the lru list is per-bucket, we don't need a second list for
searches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust bedd4b61a4 nfsd: convert the lru list into a per-bucket thing
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust 7142b98d9f nfsd: Clean up drc cache in preparation for global spinlock elimination
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:12 -04:00
Trond Myklebust 887999774a nfs: Ensure that nfs_callback_start_svc sets the server rq_task...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:10 -04:00
Trond Myklebust d6a7ce424f lockd: Ensure that lockd_start_svc sets the server rq_task...
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-17 12:00:10 -04:00
Pavel Shilovsky b46799a8f2 CIFS: Fix wrong directory attributes after rename
When we requests rename we also need to update attributes
of both source and target parent directories. Not doing it
causes generic/309 xfstest to fail on SMB2 mounts. Fix this
by marking these directories for force revalidating.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 05:08:46 -05:00
Pavel Shilovsky 52755808d4 CIFS: Fix SMB2 readdir error handling
SMB2 servers indicates the end of a directory search with
STATUS_NO_MORE_FILE error code that is not processed now.
This causes generic/257 xfstest to fail. Fix this by triggering
the end of search by this error code in SMB2_query_directory.

Also when negotiating CIFS protocol we tell the server to close
the search automatically at the end and there is no need to do
it itself. In the case of SMB2 protocol, we need to close it
explicitly - separate close directory checks for different
protocols.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-17 05:08:39 -05:00
Steve French 18f39e7be0 [CIFS] Possible null ptr deref in SMB2_tcon
As Raphael Geissert pointed out, tcon_error_exit can dereference tcon
and there is one path in which tcon can be null.

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org> # v3.7+
Reported-by: Raphael Geissert <geissert@debian.org>
2014-08-17 00:41:02 -05:00
Linus Torvalds e64df3ebe8 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "These are all fixes I'd like to get out to a broader audience.

  The biggest of the bunch is Mark's quota fix, which is also in the
  SUSE kernel, and makes our subvolume quotas dramatically more
  accurate.

  I've been running xfstests with these against your current git
  overnight, but I'm queueing up longer tests as well"

* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: disable strict file flushes for renames and truncates
  Btrfs: fix csum tree corruption, duplicate and outdated checksums
  Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
  Btrfs: fix compressed write corruption on enospc
  btrfs: correctly handle return from ulist_add
  btrfs: qgroup: account shared subtrees during snapshot delete
  Btrfs: read lock extent buffer while walking backrefs
  Btrfs: __btrfs_mod_ref should always use no_quota
  btrfs: adjust statfs calculations according to raid profiles
2014-08-16 09:06:55 -06:00
Linus Torvalds 53b95d6341 File locking related bugfixes for v3.17 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT7ey6AAoJEAAOaEEZVoIVgdkQAJfToVhgddVMOweeGo4wqPqM
 lmS35dYVEy+gPfYhCU2Zytgk9yLlmNJeDT7Y+XGe4l15Ax/PDNuWLiRnUKRy1FrU
 l7cbKRAn1L6TqBO2ang5t68Bw7ojUpRoKWKnDyjVAj80mZYRZPWvQOurLeTtra2o
 XLnHbK54F36s07OXwSTZgbh/ffVQ1RMUBU8fy+0Ws3mTAbzO1KuB5Ws4pdt2ysjI
 14pBHO553X0VXAJxGkH66xxblt1o+Q9aBZp7RL0VNtR4bGU4FMvXy+0D5G6h8AGt
 rhl2PWTNKGg8HFgUK+7nCheH4j/0maXo541D3q+sJhbknRhD3n3x7IBvjkm9tdjB
 OgTkAp1mwL21mJP21MOrAil8uwGoSr7qTCngZY6nNWH4L3EkVl27+LlGDtkgBp/n
 BJyhcPbFSh+K4TTD3dg2rEx7wF/npQ6yPOljjvWXKJEm5lx3ZPkK1l1xS3QnVTMe
 pLq+wTZ9v1cU7+9JFWICQwclv4unjIgqxLo7/op5P5KvTWOHFW4cjdwCBPVE1g3a
 WC2c5jdB0Up6Z59aXAN3p5dk8MCy6NA41lkMfJN4AUAIzb6NvNeDBhrcFaHwXowm
 jgCtEPqFN/HqsEwJmhJ7fcIEKYrOCuzeaR5tGuwJ11re/oULGXTgMkrFwgm/YWwu
 esRBAc53/hZg6oo3ipUH
 =09BX
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux

Pull file locking bugfixes from Jeff Layton:
 "Most of these patches are to fix a long-standing regression that crept
  in when the BKL was removed from the file-locking code.  The code was
  converted to use a conventional spinlock, but some fl_release_private
  ops can block and you can end up sleeping inside the lock.

  There's also a patch to make /proc/locks show delegations as 'DELEG'"

* tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux:
  locks: update Locking documentation to clarify fl_release_private behavior
  locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
  locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
  locks: don't reuse file_lock in __posix_lock_file
  locks: don't call locks_release_private from locks_copy_lock
  locks: show delegations as "DELEG" in /proc/locks
2014-08-16 08:58:47 -06:00
Linus Torvalds da06df548e Merge git://git.kvack.org/~bcrl/aio-next
Pull aio updates from Ben LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: use iovec array rather than the single one
  aio: fix some comments
  aio: use the macro rather than the inline magic number
  aio: remove the needless registration of ring file's private_data
  aio: remove no longer needed preempt_disable()
  aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
  aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
2014-08-16 08:56:27 -06:00
Steve French 754789a1c0 [CIFS] Workaround MacOS server problem with SMB2.1 write
response

Writes fail to Mac servers with SMB2.1 mounts (works with cifs though) due
to them sending an incorrect RFC1001 length for the SMB2.1 Write response.
Workaround this problem. MacOS server sends a write response with 3 bytes
of pad beyond the end of the SMB itself.  The RFC1001 length is 3 bytes
more than the sum of the SMB2.1 header length + the write reponse.

Incorporate feedback from Jeff and JRA to allow servers to send
a tcp frame that is even more than three bytes too long
(ie much longer than the SMB2/SMB3 request that it contains) but
we do log it once now. In the earlier version of the patch I had
limited how far off the length field could be before we fail the request.

Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-15 23:49:01 -05:00
Jeff Layton 024408062b cifs: handle lease F_UNLCK requests properly
Currently any F_UNLCK request for a lease just gets back -EAGAIN. Allow
them to go immediately to generic_setlease instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-15 23:01:52 -05:00
Steve French d43cc79343 Cleanup sparse file support by creating worker function for it
Simply move code to new function (for clarity). Function sets or clears
the sparse file attribute flag.

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
2014-08-15 23:01:00 -05:00
Chris Mason 8d875f95da btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file
with new versions.  Applications often expect this to be an atomic
replacement, even if they haven't done anything to make sure the new
version is fully on disk.

Btrfs has strict flushing in place to make sure that renaming over an
old file with a new file will fully flush out the new file before
allowing the transaction commit with the rename to complete.

This ordering means the commit code needs to be able to lock file pages,
and there are a few paths in the filesystem where we will try to end a
transaction with the page lock held.  It's rare, but these things can
deadlock.

This patch removes the ordered flushes and switches to a best effort
filemap_flush like ext4 uses. It's not perfect, but it should fix the
deadlocks.

Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:42 -07:00
Filipe Manana 27b9a8122f Btrfs: fix csum tree corruption, duplicate and outdated checksums
Under rare circumstances we can end up leaving 2 versions of a checksum
for the same file extent range.

The reason for this is that after calling btrfs_next_leaf we process
slot 0 of the leaf it returns, instead of processing the slot set in
path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after
btrfs_next_leaf() releases the path and before it searches for the next
leaf, another task might cause a split of the next leaf, which migrates
some of its keys to the leaf we were processing before calling
btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the
same leaf but with path->slots[0] having a slot number corresponding
to the first new key it got, that is, a slot number that didn't exist
before calling btrfs_next_leaf(), as the leaf now has more keys than
it had before. So we must really process the returned leaf starting at
path->slots[0] always, as it isn't always 0, and the key at slot 0 can
have an offset much lower than our search offset/bytenr.

For example, consider the following scenario, where we have:

sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568
four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472

  Leaf N:

    slot = 0                           slot = btrfs_header_nritems() - 1
  |-------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] |
  |-------------------------------------------------------------------|

  Leaf N + 1:

      slot = 0                          slot = btrfs_header_nritems() - 1
  |--------------------------------------------------------------------|
  | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 |
  |--------------------------------------------------------------------|

Because we are at the last slot of leaf N, we call btrfs_next_leaf() to
find the next highest key, which releases the current path and then searches
for that next key. However after releasing the path and before finding that
next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call
to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore
btrfs_next_leaf() will returns us a path again with leaf N but with the slot
pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N
is then:

    slot = 0                        slot = btrfs_header_nritems() - 2  slot = btrfs_header_nritems() - 1
  |----------------------------------------------------------------------------------------------------|
  | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4]  [(CSUM CSUM 40161280), size 32] |
  |----------------------------------------------------------------------------------------------------|

And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump
into the "insert:" label, which will set tmp to:

    tmp = min((sums->len - total_bytes) >> blocksize_bits,
        (next_offset - file_key.offset) >> blocksize_bits) =
    min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) =
    min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4

and

   ins_size = csum_size * tmp = 4 * 4 = 16 bytes.

In other words, we insert a new csum item in the tree with key
(CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums
for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong,
because the item with key (CSUM CSUM 40161280) (the one that was moved from
leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288
bytes of our data and won't get those old checksums removed.

So this leaves us 2 different checksums for 3 4kb blocks of data in the tree,
and breaks the logical rule:

   Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover

An obvious bad effect of this is that a subsequent csum tree lookup to get
the checksum of any of the blocks with logical offset of 40161280, 40165376
or 40169472 (the last 3 4kb blocks of file data), will get the old checksums.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:40 -07:00
Takashi Iwai 4eb1f66dce Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
We've got bug reports that btrfs crashes when quota is enabled on
32bit kernel, typically with the Oops like below:
 BUG: unable to handle kernel NULL pointer dereference at 00000004
 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs]
 *pde = 00000000
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S      W 3.15.2-1.gd43d97e-default #1
 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs]
 task: f1478130 ti: f147c000 task.ti: f147c000
 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0
 EIP is at find_parent_nodes+0x360/0x1380 [btrfs]
 EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000
 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690
 Stack:
  00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050
  00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000
  00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000
 Call Trace:
  [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs]
  [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs]
  [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs]
  [<c025e38b>] process_one_work+0x11b/0x390
  [<c025eea1>] worker_thread+0x101/0x340
  [<c026432b>] kthread+0x9b/0xb0
  [<c0712a71>] ret_from_kernel_thread+0x21/0x30
  [<c0264290>] kthread_create_on_node+0x110/0x110

This indicates a NULL corruption in prefs_delayed list.  The further
investigation and bisection pointed that the call of ulist_add_merge()
results in the corruption.

ulist_add_merge() takes u64 as aux and writes a 64bit value into
old_aux.  The callers of this function in backref.c, however, pass a
pointer of a pointer to old_aux.  That is, the function overwrites
64bit value on 32bit pointer.  This caused a NULL in the adjacent
variable, in this case, prefs_delayed.

Here is a quick attempt to band-aid over this: a new function,
ulist_add_merge_ptr() is introduced to pass/store properly a pointer
value instead of u64.  There are still ugly void ** cast remaining
in the callers because void ** cannot be taken implicitly.  But, it's
safer than explicit cast to u64, anyway.

Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046
Cc: <stable@vger.kernel.org> [v3.11+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:19 -07:00
Liu Bo ce62003f69 Btrfs: fix compressed write corruption on enospc
When failing to allocate space for the whole compressed extent, we'll
fallback to uncompressed IO, but we've forgotten to redirty the pages
which belong to this compressed extent, and these 'clean' pages will
simply skip 'submit' part and go to endio directly, at last we got data
corruption as we write nothing.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-By: Martin Steigerwald <martin@lichtvoll.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:18 -07:00
Mark Fasheh f90e579c2b btrfs: correctly handle return from ulist_add
ulist_add() can return '1' on sucess, which qgroup_subtree_accounting()
doesn't take into account. As a result, that value can be bubbled up to
callers, causing an error to be printed. Fix this by only returning the
value of ulist_add() when it indicates an error.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:16 -07:00
Mark Fasheh 1152651a08 btrfs: qgroup: account shared subtrees during snapshot delete
During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on it's
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:14 -07:00
Filipe Manana 6f7ff6d783 Btrfs: read lock extent buffer while walking backrefs
Before processing the extent buffer, acquire a read lock on it, so
that we're safe against concurrent updates on the extent buffer.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:13 -07:00
Josef Bacik e339a6b097 Btrfs: __btrfs_mod_ref should always use no_quota
Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out to be not the
case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call it's process function with no_quota set
always.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:11 -07:00
David Sterba ba7b6e62f4 btrfs: adjust statfs calculations according to raid profiles
This has been discussed in thread:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528

and this patch implements this proposal:
http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536

Works fine for "clean" raid profiles where the raid factor correction
does the right job. Otherwise it's pessimistic and may show low space
although there's still some left.

The df nubmers are lightly wrong in case of mixed block groups, but this
is not a major usecase and can be addressed later.

The RAID56 numbers are wrong almost the same way as before and will be
addressed separately.

CC: Hugo Mills <hugo@carfax.org.uk>
CC: cwillu <cwillu@cwillu.com>
CC: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-08-15 07:43:10 -07:00
Jeff Layton 2dfb928f7e locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
There's no need to call locks_free_lock here while still holding the
i_lock. Defer that until the lock has been dropped.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Jeff Layton ed9814d858 locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
In commit 72f98e7255 (locks: turn lock_flocks into a spinlock), we
moved from using the BKL to a global spinlock. With this change, we lost
the ability to block in the fl_release_private operation.

This is problematic for NFS (and probably some other filesystems as
well). Add a new list_head argument to locks_delete_lock. If that
argument is non-NULL, then queue any locks that we want to free to the
list instead of freeing them.

Then, add a new locks_dispose_list function that will walk such a list
and call locks_free_lock on them after the i_lock has been dropped.

Finally, change all of the callers of locks_delete_lock to pass in a
list_head, except for lease_modify. That function can be called long
after the i_lock has been acquired. Deferring the freeing of a lease
after unlocking it in that function is non-trivial until we overhaul
some of the spinlocking in the lease code.

Currently though, no filesystem that sets fl_release_private supports
leases, so this is not currently a problem. We'll eventually want to
make the same change in the lease code, but it needs a lot more work
before we can reasonably do so.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Jeff Layton b84d49f944 locks: don't reuse file_lock in __posix_lock_file
Currently in the case where a new file lock completely replaces the old
one, we end up overwriting the existing lock with the new info. This
means that we have to call fl_release_private inside i_lock. Change the
code to instead copy the info to new_fl, insert that lock into the
correct spot and then delete the old lock. In a later patch, we'll defer
the freeing of the old lock until after the i_lock has been dropped.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-14 10:07:47 -04:00
Linus Torvalds 06b8ab5528 NFS client updates for Linux 3.17
Highlights include:
 
 - Stable fix for a bug in nfs3_list_one_acl()
 - Speed up NFS path walks by supporting LOOKUP_RCU
 - More read/write code cleanups
 - pNFS fixes for layout return on close
 - Fixes for the RCU handling in the rpcsec_gss code
 - More NFS/RDMA fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT65zoAAoJEGcL54qWCgDyvq8QAJ+OKuC5dpngrZ13i4ZJIcK1
 TJSkWCr44FhYPlrmkLCntsGX6C0376oFEtJ5uqloqK0+/QtvwRNVSQMKaJopKIVY
 mR4En0WwpigxVQdW2lgto6bfOhzMVO+llVdmicEVrU8eeSThATxGNv7rxRzWorvL
 RX3TwBkWSc0kLtPi66VRFQ1z+gg5I0kngyyhsKnLOaHHtpTYP2JDZlRPRkokXPUg
 nmNedmC3JrFFkarroFIfYr54Qit2GW/eI2zVhOwHGCb45j4b2wntZ6wr7LpUdv3A
 OGDBzw59cTpcx3Hij9CFvLYVV9IJJHBNd2MJqdQRtgWFfs+aTkZdk4uilUJCIzZh
 f4BujQAlm/4X1HbPxsSvkCRKga7mesGM7e0sBDPHC1vu0mSaY1cakcj2kQLTpbQ7
 gqa1cR3pZ+4shCq37cLwWU0w1yElYe1c4otjSCttPCrAjXbXJZSFzYnHm8DwKROR
 t+yEDRL5BIXPu1nEtSnD2+xTQ3vUIYXooZWEmqLKgRtBTtPmgSn9Vd8P1OQXmMNo
 VJyFXyjNx5WH06Wbc/jLzQ1/cyhuPmJWWyWMJlVROyv+FXk9DJUFBZuTkpMrIPcF
 NlBXLV1GnA7PzMD9Xt9bwqteERZl6fOUDJLWS9P74kTk5c2kD+m+GaqC/rBTKKXc
 ivr2s7aIDV48jhnwBSVL
 =KE07
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - stable fix for a bug in nfs3_list_one_acl()
   - speed up NFS path walks by supporting LOOKUP_RCU
   - more read/write code cleanups
   - pNFS fixes for layout return on close
   - fixes for the RCU handling in the rpcsec_gss code
   - more NFS/RDMA fixes"

* tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  nfs: reject changes to resvport and sharecache during remount
  NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
  SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred
  NFS: fix two problems in lookup_revalidate in RCU-walk
  NFS: allow lockless access to access_cache
  NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
  NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
  NFS: support RCU_WALK in nfs_permission()
  sunrpc/auth: allow lockless (rcu) lookup of credential cache.
  NFS: prepare for RCU-walk support but pushing tests later in code.
  NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
  NFS: add checks for returned value of try_module_get()
  nfs: clear_request_commit while holding i_lock
  pnfs: add pnfs_put_lseg_async
  pnfs: find swapped pages on pnfs commit lists too
  nfs: fix comment and add warn_on for PG_INODE_REF
  nfs: check wait_on_bit_lock err in page_group_lock
  sunrpc: remove "ec" argument from encrypt_v2 operation
  sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c
  sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c
  ...
2014-08-13 18:13:19 -06:00
Linus Torvalds dc1cc85133 xfs: update for 3.17-rc1
This update contains:
 o conversion of the XFS core to pass negative error numbers
 o restructing of core XFS code that is shared with userspace to fs/xfs/libxfs
 o introduction of sysfs interface for XFS
 o bulkstat refactoring
 o demand driven speculative preallocation removal
 o XFS now always requires 64 bit sectors to be configured
 o metadata verifier changes to ensure CRCs are calculated during log recovery
 o various minor code cleanups
 o miscellaneous bug fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJT6dJvAAoJEK3oKUf0dfodYOYP/jhWGv29OrX958WC+U1kCItb
 WmNV6dSENWgtgYRgdtcxa5ZE0HeEndvN63LOCzAB9VegZK7XjpLob9J2f07gMvP1
 u4vv9LvgLdky/tYxttBYHuyen0hQO0r+esRg/xrawpklKQ3Efofi4MUSbb5ZBDzP
 QvVeNhIDI04A0CoDngWDkV1PpbdwDUjVZFZon36XVDTcSCf/h2B+nekjOVJQpEDC
 qJ9nZWRgAcm4IzZZBqwGt0LJ62Z+Ww8jksevWEnjcXm1xfsltL8gf8o6WOX5iXDR
 PSeoJwVRAxLfiQxnQSENEOeLvKWVXNyC/B10w1H20qZkvQJMsX/Fq0MraBL0Ediu
 0qyZgJsOyoOAvJYXWfUyykUe03dF5/cgrB+RpOLmp4B9lCugCEsQHtjRXEHE1EK2
 +4sHc13I7+SLIAlgySmJdrLe8Kq1vamo0/WD2wH+pdOvNmQ7HJzBUFEj1D83b55a
 DWfBjbz/N5lJW82Vfek2coURJGi/1B87koAtODXWfeM+nbBs47z0dv31G235Boq3
 +Qy2qQdCv05xV+CZn/RAmaudAFFnJEV8L4ADpwqSGfVM72ug0KQF8M5DblG1Q8sb
 xRAQq+krG7K4ui1kPNKcxoHoUIYQUJm0p1aU4V2OAQlikYCCaSgRYGF2xzSVC7j9
 uZd4e4bllsBdPF/Se+8W
 =EUjq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Dave Chinner:
 "This update contains:
   - conversion of the XFS core to pass negative error numbers
   - restructing of core XFS code that is shared with userspace to
     fs/xfs/libxfs
   - introduction of sysfs interface for XFS
   - bulkstat refactoring
   - demand driven speculative preallocation removal
   - XFS now always requires 64 bit sectors to be configured
   - metadata verifier changes to ensure CRCs are calculated during log
     recovery
   - various minor code cleanups
   - miscellaneous bug fixes

  The diffstat is kind of noisy because of the restructuring of the code
  to make kernel/userspace code sharing simpler, along with the XFS wide
  change to use the standard negative error return convention (at last!)"

* tag 'xfs-for-linus-3.17-rc1' of git://oss.sgi.com/xfs/xfs: (45 commits)
  xfs: fix coccinelle warnings
  xfs: flush both inodes in xfs_swap_extents
  xfs: fix swapext ilock deadlock
  xfs: kill xfs_vnode.h
  xfs: kill VN_MAPPED
  xfs: kill VN_CACHED
  xfs: kill VN_DIRTY()
  xfs: dquot recovery needs verifiers
  xfs: quotacheck leaves dquot buffers without verifiers
  xfs: ensure verifiers are attached to recovered buffers
  xfs: catch buffers written without verifiers attached
  xfs: avoid false quotacheck after unclean shutdown
  xfs: fix rounding error of fiemap length parameter
  xfs: introduce xfs_bulkstat_ag_ichunk
  xfs: require 64-bit sector_t
  xfs: fix uflags detection at xfs_fs_rm_xquota
  xfs: remove XFS_IS_OQUOTA_ON macros
  xfs: tidy up xfs_set_inode32
  xfs: allow inode allocations in post-growfs disk space
  xfs: mark xfs_qm_quotacheck as static
  ...
2014-08-13 17:49:53 -06:00
Linus Torvalds cec997093b Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, UDF updates from Jan Kara:
 "Scalability improvements for quota, a few reiserfs fixes, and couple
  of misc cleanups (udf, ext2)"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  reiserfs: Fix use after free in journal teardown
  reiserfs: fix corruption introduced by balance_leaf refactor
  udf: avoid redundant memcpy when writing data in ICB
  fs/udf: re-use hex_asc_upper_{hi,lo} macros
  fs/quota: kernel-doc warning fixes
  udf: use linux/uaccess.h
  fs/ext2/super.c: Drop memory allocation cast
  quota: remove dqptr_sem
  quota: simplify remove_inode_dquot_ref()
  quota: avoid unnecessary dqget()/dqput() calls
  quota: protect Q_GETFMT by dqonoff_mutex
2014-08-13 17:45:40 -06:00
Linus Torvalds 8d2d441ac4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is a lot of refactoring and hardening of the libceph and rbd
  code here from Ilya that fix various smaller bugs, and a few more
  important fixes with clone overlap.  The main fix is a critical change
  to the request_fn handling to not sleep that was exposed by the recent
  mutex changes (which will also go to the 3.16 stable series).

  Yan Zheng has several fixes in here for CephFS fixing ACL handling,
  time stamps, and request resends when the MDS restarts.

  Finally, there are a few cleanups from Himangi Saraogi based on
  Coccinelle"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (39 commits)
  libceph: set last_piece in ceph_msg_data_pages_cursor_init() correctly
  rbd: remove extra newlines from rbd_warn() messages
  rbd: allocate img_request with GFP_NOIO instead GFP_ATOMIC
  rbd: rework rbd_request_fn()
  ceph: fix kick_requests()
  ceph: fix append mode write
  ceph: fix sizeof(struct tYpO *) typo
  ceph: remove redundant memset(0)
  rbd: take snap_id into account when reading in parent info
  rbd: do not read in parent info before snap context
  rbd: update mapping size only on refresh
  rbd: harden rbd_dev_refresh() and callers a bit
  rbd: split rbd_dev_spec_update() into two functions
  rbd: remove unnecessary asserts in rbd_dev_image_probe()
  rbd: introduce rbd_dev_header_info()
  rbd: show the entire chain of parent images
  ceph: replace comma with a semicolon
  rbd: use rbd_segment_name_free() instead of kfree()
  ceph: check zero length in ceph_sync_read()
  ceph: reset r_resend_mds after receiving -ESTALE
  ...
2014-08-13 17:43:29 -06:00
Linus Torvalds 89838b80bb No significant changes, mostly small fixes here and there. The more important
fixes are:
 
 * UBI deleted list items while iterating the list with 'list_for_each_entry'
 * The UBI block driver did not work properly with very large UBI volumes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT6IALAAoJECmIfjd9wqK0yRwP/R4XpNyWZOrK72fvb3M6ucK0
 SK3pu6FNfvOSYibmhOtZWjFbiMkgUd7o8jy+4uJa6PIMRQbKGrH/xvujPzON6GQT
 ZZxeJ7n1O8IAZL/dibqFpEv3NQFjgz1IaLnpktQkkp6rWsXCBYNbvFIywWYgkndt
 1w11QG/TAhkJcA/0Xy/+zmkHeXjOQrVeVMbBN2MMVekWg8JgMNTd9mkWGa1A1oGI
 nuq30nIkUP7LX9T/olFGhjFIeeb/bkmWgpzS4K9PI+hOGGCZvIWj/pu8Jj8DAR54
 UGA7qtMkTjxldvHPnuACsSKNS/N7UGY64kTwcucEcJaN2MXmaBFP8ipI/wtb9bCV
 fUoBt+p470E7iDoQymbM5zNw/HPNh4XWVd2X1oYVftmO70PZ6sWeOKWC+tJn/Aj8
 xsrgd1PcWmuyu399EFW/Il5ItOqfYsNIkVFNzIb1O6Pd8Ylt5eYtuyoPXk5aRA4p
 xZ3ohO3OihvWXwwtCUjforwPy37iaX8vwnuI9xdgnEisTHJWR6PYITvcrwqAIH44
 6PEqINU4zwGTGxOTOWKZxdtPVIuChuwqDyY+7+xQD+LIPQ1/BodozExGiQ0lhAmA
 DDC2Te8q5lpZTz03E8ot9zZHmjQoG+uo10DhEzT5U1ycv9p7dISGfbVucyuKDTw9
 JdUnI8cZ2k6mZi8q6dfA
 =zgqa
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS changes from Artem Bityutskiy:
 "No significant changes, mostly small fixes here and there.  The more
  important fixes are:

   - UBI deleted list items while iterating the list with
     'list_for_each_entry'
   - The UBI block driver did not work properly with very large UBI
     volumes"

* tag 'upstream-3.17-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  UBIFS: Add log overlap assertions
  Revert "UBIFS: add a log overlap assertion"
  UBI: bugfix in ubi_wl_flush()
  UBI: block: Avoid disk size integer overflow
  UBI: block: Set disk_capacity out of the mutex
  UBI: block: Make ubiblock_resize return something
  UBIFS: add a log overlap assertion
  UBIFS: remove unnecessary check
  UBIFS: remove mst_mutex
  UBIFS: kernel-doc warning fix
  UBI: init_volumes: Ignore volumes with no LEBs
  UBIFS: replace seq_printf by seq_puts
  UBIFS: replace count*size kzalloc by kcalloc
  UBIFS: kernel-doc warning fix
  UBIFS: fix error path in create_default_filesystem()
  UBIFS: fix spelling of "scanned"
  UBIFS: fix some comments
  UBIFS: remove useless @ecc in struct ubifs_scan_leb
  UBIFS: remove useless statements
  UBIFS: Add missing break statements in dbg_chk_pnode()
  ...
2014-08-13 17:42:11 -06:00
Steve French 3d1a3745d8 Add sparse file support to SMB2/SMB3 mounts
Many Linux filesystes make a file "sparse" when extending
a file with ftruncate. This does work for CIFS to Samba
(only) but not for SMB2/SMB3 (to Samba or Windows) since
there is a "set sparse" fsctl which is supposed to be
sent to mark a file as sparse.

This patch marks a file as sparse by sending this simple
set sparse fsctl if it is extended more than 2 pages.
It has been tested to Windows 8.1, Samba and various
SMB2/SMB3 servers which do support setting sparse (and
MacOS which does not appear to support the fsctl yet).
If a server share does not support setting a file
as sparse, then we do not retry setting sparse on that
share.

The disk space savings for sparse files can be quite
large (even more significant on Windows servers than Samba).

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
2014-08-13 13:18:35 -05:00
Steve French 8ae31240cc Add missing definitions for CIFS File System Attributes
Signed-off-by: Steve French <smfrench@gmail.com>
Acked-by: Shirish Pargaonkar <spargaonkar@suse.com>
2014-08-12 23:47:14 -05:00
Jan Kara 01777836c8 reiserfs: Fix use after free in journal teardown
If do_journal_release() races with do_journal_end() which requeues
delayed works for transaction flushing, we can leave work items for
flushing outstanding transactions queued while freeing them. That
results in use after free and possible crash in run_timers_softirq().

Fix the problem by not requeueing works if superblock is being shut down
(MS_ACTIVE not set) and using cancel_delayed_work_sync() in
do_journal_release().

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-12 12:46:30 +02:00
Linus Torvalds f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
Jeff Layton 566709bd62 locks: don't call locks_release_private from locks_copy_lock
All callers of locks_copy_lock pass in a brand new file_lock struct, so
there's no need to call locks_release_private on it. Replace that with
a warning that fires in the event that we receive a target lock that
doesn't look like it's properly initialized.

Acked-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-11 14:24:22 -04:00
Jeff Layton 8144f1f699 locks: show delegations as "DELEG" in /proc/locks
Now that they are a distinct lease type, show them as such.

Cc: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-08-11 13:36:54 -04:00
Al Viro 12a5b5294c fix copy_tree() regression
Since 3.14 we had copy_tree() get the shadowing wrong - if we had one
vfsmount shadowing another (i.e. if A is a slave of B, C is mounted
on A/foo, then D got mounted on B/foo creating D' on A/foo shadowed
by C), copy_tree() of A would make a copy of D' shadow the the copy of
C, not the other way around.

It's easy to fix, fortunately - just make sure that mount follows
the one that shadows it in mnt_child as well as in mnt_hash, and when
copy_tree() decides to attach a new mount, check if the last child
it has added to the same parent should be shadowing the new one.
And if it should, just use the same logics commit_tree() has - put the
new mount into the hash and children lists right after the one that
should shadow it.

Cc: stable@vger.kernel.org [3.14 and later]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-11 12:28:10 -04:00
Vincent Stehlé e91259f3c7 cifs: remove unused function cifs_oplock_break_wait
Commit 743162013d ("sched: Remove proliferation of wait_on_bit() action
functions") has removed the call to cifs_oplock_break_wait, making this
function unused; remove it.

This fixes the following compilation warning:

  fs/cifs/misc.c:578:1: warning: ‘cifs_oplock_break_wait’ defined but not used [-Wunused-function]

Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-11 01:31:03 -05:00
Linus Torvalds 155134fef2 Revert "proc: Point /proc/{mounts,net} at /proc/thread-self/{mounts,net} instead of /proc/self/{mounts,net}"
This reverts commits 344470cac4 and e813244072.

It turns out that the exact path in the symlink matters, if for somewhat
unfortunate reasons: some apparmor configurations don't allow dhclient
access to the per-thread /proc files.  As reported by Jörg Otte:

  audit: type=1400 audit(1407684227.003:28): apparmor="DENIED"
    operation="open" profile="/sbin/dhclient"
    name="/proc/1540/task/1540/net/dev" pid=1540 comm="dhclient"
    requested_mask="r" denied_mask="r" fsuid=0 ouid=0

so we had better revert this for now.  We might be able to work around
this in practice by only using the per-thread symlinks if the thread
isn't the thread group leader, and if the namespaces differ between
threads (which basically never happens).

We'll see. In the meantime, the revert was made to be intentionally easy.

Reported-by: Jörg Otte <jrg.otte@gmail.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-10 21:24:59 -07:00
Linus Torvalds 77e40aae76 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace updates from Eric Biederman:
 "This is a bunch of small changes built against 3.16-rc6.  The most
  significant change for users is the first patch which makes setns
  drmatically faster by removing unneded rcu handling.

  The next chunk of changes are so that "mount -o remount,.." will not
  allow the user namespace root to drop flags on a mount set by the
  system wide root.  Aks this forces read-only mounts to stay read-only,
  no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec
  mounts to stay no exec and it prevents unprivileged users from messing
  with a mounts atime settings.  I have included my test case as the
  last patch in this series so people performing backports can verify
  this change works correctly.

  The next change fixes a bug in NFS that was discovered while auditing
  nsproxy users for the first optimization.  Today you can oops the
  kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever
  with pid namespaces.  I rebased and fixed the build of the
  !CONFIG_NFS_FS case yesterday when a build bot caught my typo.  Given
  that no one to my knowledge bases anything on my tree fixing the typo
  in place seems more responsible that requiring a typo-fix to be
  backported as well.

  The last change is a small semantic cleanup introducing
  /proc/thread-self and pointing /proc/mounts and /proc/net at it.  This
  prevents several kinds of problemantic corner cases.  It is a
  user-visible change so it has a minute chance of causing regressions
  so the change to /proc/mounts and /proc/net are individual one line
  commits that can be trivially reverted.  Unfortunately I lost and
  could not find the email of the original reporter so he is not
  credited.  From at least one perspective this change to /proc/net is a
  refgression fix to allow pthread /proc/net uses that were broken by
  the introduction of the network namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
  proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
  proc: Implement /proc/thread-self to point at the directory of the current thread
  proc: Have net show up under /proc/<tgid>/task/<tid>
  NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
  mnt: Add tests for unprivileged remount cases that have found to be faulty
  mnt: Change the default remount atime from relatime to the existing value
  mnt: Correct permission checks in do_remount
  mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
  mnt: Only change user settable mount flags in remount
  namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-09 17:10:41 -07:00
Linus Torvalds 0d10c2c170 Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "This includes a major rewrite of the NFSv4 state code, which has
  always depended on a single mutex.  As an example, open creates are no
  longer serialized, fixing a performance regression on NFSv3->NFSv4
  upgrades.  Thanks to Jeff, Trond, and Benny, and to Christoph for
  review.

  Also some RDMA fixes from Chuck Lever and Steve Wise, and
  miscellaneous fixes from Kinglong Mee and others"

* 'for-3.17' of git://linux-nfs.org/~bfields/linux: (167 commits)
  svcrdma: remove rdma_create_qp() failure recovery logic
  nfsd: add some comments to the nfsd4 object definitions
  nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
  nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
  nfsd: remove nfs4_lock_state: nfs4_laundromat
  nfsd: Remove nfs4_lock_state(): reclaim_complete()
  nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
  nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
  nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
  nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
  nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
  nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
  nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
  nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
  nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
  nfsd: remove old fault injection infrastructure
  nfsd: add more granular locking to *_delegations fault injectors
  nfsd: add more granular locking to forget_openowners fault injector
  nfsd: add more granular locking to forget_locks fault injector
  nfsd: add a list_head arg to nfsd_foreach_client_lock
  ...
2014-08-09 14:31:18 -07:00
Linus Torvalds 023f78b02c Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS updates from Steve French:
 "The most visible change in this set is the additional of multi-credit
  support for SMB2/SMB3 which dramatically improves the large file i/o
  performance for these dialects and significantly increases the maximum
  i/o size used on the wire for SMB2/SMB3.

  Also reconnection behavior after network failure is improved"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (35 commits)
  Add worker function to set allocation size
  [CIFS] Fix incorrect hex vs. decimal in some debug print statements
  update CIFS TODO list
  Add Pavel to contributor list in cifs AUTHORS file
  Update cifs version
  CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
  CIFS: Optimize readpages in a short read case on reconnects
  CIFS: Optimize cifs_user_read() in a short read case on reconnects
  CIFS: Improve indentation in cifs_user_read()
  CIFS: Fix possible buffer corruption in cifs_user_read()
  CIFS: Count got bytes in read_into_pages()
  CIFS: Use separate var for the number of bytes got in async read
  CIFS: Indicate reconnect with ECONNABORTED error code
  CIFS: Use multicredits for SMB 2.1/3 reads
  CIFS: Fix rsize usage for sync read
  CIFS: Fix rsize usage in user read
  CIFS: Separate page reading from user read
  CIFS: Fix rsize usage in readpages
  CIFS: Separate page search from readpages
  CIFS: Use multicredits for SMB 2.1/3 writes
  ...
2014-08-09 13:03:34 -07:00
Linus Torvalds c309bfa9b4 MTD updates for 3.17-rc1
AMD-compatible CFI driver:
  - Support OTP programming for Micron M29EW family
  - Increase buffer write timeout, according to detected flash parameter info
 
 NAND
  - Add helpers for retrieving ONFI timing modes
  - GPMI: provide option to disable bad block marker swapping (required for
      Ka-On electronics platforms)
 
 SPI NOR
  - EON EN25QH128 support
  - Support new Flag Status Register (FSR) on a few Micron flash
 
 Common
  - New sysfs entries for bad block and ECC stats
 
 And a few miscellaneous refactorings, cleanups, and driver improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT5WCXAAoJEFySrpd9RFgt0rwP/1anAulAcve/QzVF9LDFPec8
 jSvK8WWFcHLVb9EvTHtUjRz2RSRNhe0eeyEld3WpdKZZ73VVVaHnGdJv8J8Ys8jn
 kfNBDfgDLFrVzycYCqjQ2gdvidCyrjQgtPP0E/Q/RN6FBur0/rp2WKoJ2FvuT6SS
 kOz5f3TOe+iNtxQoJwkFvs/IjfFMThGs+YMJ8Z9s4LcJHD65T6hF+zDwl8xF2xfG
 b104PsG3I58kJdYjKhRQ2/ol+YCPoVhQorhhuaqeouZum/Hb/2g3rKHVZpAv2n6m
 JWnTpbdJDqGoPFVPyJr5Vm/UYOwxEBSWimuNp+2WN7EsXux1x9JZZl5+ZNUMmb4q
 vxYhIDul2+Sg1lN+ruBe+xi6d8DI8Y5WIc9xJgn3YHLC8YSkiZ11bhQyyeHA9i5h
 jZYKSkN/ERl8iAA4ULD6tsZv4ds8LVI/XOxrcSM7myQ4p8oY5QBxEWEuGPgyH6A6
 qCVkc0TAriSPfcCBvs4o8s2uNUocq7x6ZT1xfdlJ0KVstCmeEJBJnBYwWIXR3tu7
 3+I/bI41Q29ZGV8x5PEGDSgLVDNAJZnfGPuCwdMWuVD7UQuEEOJkCvy8o7C2Fold
 hRXh3SlUICCGDzd0JdNmdt5hPuB0tzsG0YkRoRj2sS30TlHi77nN/m3HYi/JQ4UW
 gA21laizfJ+z7g/5Kabb
 =Wkmz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd

Pull MTD updates from Brian Norris:
 "AMD-compatible CFI driver:
   - Support OTP programming for Micron M29EW family
   - Increase buffer write timeout, according to detected flash
     parameter info

  NAND
   - Add helpers for retrieving ONFI timing modes
   - GPMI: provide option to disable bad block marker swapping (required
     for Ka-On electronics platforms)

  SPI NOR
   - EON EN25QH128 support
   - Support new Flag Status Register (FSR) on a few Micron flash

  Common
   - New sysfs entries for bad block and ECC stats

  And a few miscellaneous refactorings, cleanups, and driver
  improvements"

* tag 'for-linus-20140808' of git://git.infradead.org/linux-mtd: (31 commits)
  mtd: gpmi: make blockmark swapping optional
  mtd: gpmi: remove line breaks from error messages and improve wording
  mtd: gpmi: remove useless (void *) type casts and spaces between type casts and variables
  mtd: atmel_nand: NFC: support multiple interrupt handling
  mtd: atmel_nand: implement the nfc_device_ready() by checking the R/B bit
  mtd: atmel_nand: add NFC status error check
  mtd: atmel_nand: make ecc parameters same as definition
  mtd: nand: add ONFI timing mode to nand_timings converter
  mtd: nand: define struct nand_timings
  mtd: cfi_cmdset_0002: fix do_write_buffer() timeout error
  mtd: denali: use 8 bytes for READID command
  mtd/ftl: fix the double free of the buffers allocated in build_maps()
  mtd: phram: Fix whitespace issues
  mtd: spi-nor: add support for EON EN25QH128
  mtd: cfi_cmdset_0002: Add support for locking OTP memory
  mtd: cfi_cmdset_0002: Add support for writing OTP memory
  mtd: cfi_cmdset_0002: Invalidate cache after entering/exiting OTP memory
  mtd: cfi_cmdset_0002: Add support for reading OTP
  mtd: spi-nor: add support for flag status register on Micron chips
  mtd: Account for BBT blocks when a partition is being allocated
  ...
2014-08-08 18:13:21 -07:00
David Herrmann 40e041a2c8 shm: add sealing API
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement.  However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies.  Look at the following two
use-cases:

  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy, anymore. The
     source may have put sensible data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING.  If you seal a file, a
specific set of operations is blocked on that file forever.  Unlike locks,
seals can only be set, never removed.  Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEAL operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
David Herrmann 4bb5f5d939 mm: allow drivers to prevent new writable mappings
This patch (of 6):

The i_mmap_writable field counts existing writable mappings of an
address_space.  To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set.  In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers.  This is the same
that we do for i_writecount.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
Fabian Frederick e0d9bf4cc0 fs/dlm/debug_fs.c: remove unnecessary null test before debugfs_remove
This fixes checkpatch warning:

  WARNING: debugfs_remove(NULL) is safe this check is probably not required

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:27 -07:00
Yinghai Lu d97b07c54f initramfs: support initramfs that is bigger than 2GiB
Now with 64bit bzImage and kexec tools, we support ramdisk that size is
bigger than 2g, as we could put it above 4G.

Found compressed initramfs image could not be decompressed properly.  It
turns out that image length is int during decompress detection, and it
will become < 0 when length is more than 2G.  Furthermore, during
decompressing len as int is used for inbuf count, that has problem too.

Change len to long, that should be ok as on 32 bit platform long is
32bits.

Tested with following compressed initramfs image as root with kexec.
	gzip, bzip2, xz, lzma, lzop, lz4.
run time for populate_rootfs():
   size        name       Nehalem-EX  Westmere-EX  Ivybridge-EX
 9034400256 root_img     :   26s           24s          30s
 3561095057 root_img.lz4 :   28s           27s          27s
 3459554629 root_img.lzo :   29s           29s          28s
 3219399480 root_img.gz  :   64s           62s          49s
 2251594592 root_img.xz  :  262s          260s         183s
 2226366598 root_img.lzma:  386s          376s         277s
 2901482513 root_img.bz2 :  635s          599s

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Rashika Kheria <rashika.kheria@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kyungsik Lee <kyungsik.lee@lge.com>
Cc: P J P <ppandit@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Daniel M. Weeks" <dan@danweeks.net>
Cc: Alexandre Courbot <acourbot@nvidia.com>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:26 -07:00
Fabian Frederick fa5a7a41a6 fs/qnx6: update debugging to current functions
Add DDEBUG in Makefile when CONFIG_QNX6FS_DEBUG is set.  All QNX6DEBUG
messages are replaced by pr_debug which means debugging will be emitted in
debug level only and no more in error and info levels.  debug uses now
pr_fmt and __func__

QNX6DEBUG definition has been removed.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:26 -07:00
Fabian Frederick e6c3261653 fs/qnx6: use pr_fmt and __func__ in logging
Remove "qnx6:" and "qnx6: " from each logging instruction.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:26 -07:00
Fabian Frederick e00d5b5ad7 fs/qnx6: convert printk to pr_foo()
Use current logging functions.

Coalesce formats.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Kai Bankett <chaosman@ontika.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick b6d53b1b11 fs/romfs/super.c: add blank line after declarations
Fix checkpatch warning:

  WARNING: Missing a blank line after declarations

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick ca9e5368a7 fs/romfs/super.c: use pr_fmt in logging
- Remove "Error" in format logging (already in pr_ level)

- Use modulename in pr_fmt instead of ROMFS: in each pr_ callsites.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 3d79d3145f fs/romfs/super.c: convert printk to pr_foo()
Use current logging functions. Coalesce formats.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 1508f3eb69 fs/cramfs/inode.c: use linux/uaccess.h
Fixes checkpatch warning:

  WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 31d92e5519 fs/cramfs: code clean-up
Fixes some checkpatch errors/warnings:

  WARNING: Missing a blank line after declarations
  ERROR: spaces required around that '=' (ctx:VxV)
  ERROR: "foo * bar" should be "foo *bar"
  ERROR: space prohibited after that open parenthesis '('

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 4f21e1ea09 fs/cramfs: use pr_fmt
Use module name for "cramfs: " prefix.  (note that uncompress.c printk had
no prefix).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick f175ff8100 fs/cramfs: convert printk to pr_foo()
Use current logging functions.  No level printk converted to pr_err

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 998d6688fb fs/omfs/inode.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick b8f52d89c0 fs/pstore/ram_core.c: replace count*size kmalloc by kmalloc_array
kmalloc_array manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:25 -07:00
Fabian Frederick 1da85fdff5 fs/bfs: use bfs prefix for dump_imap
All bfs related functions use bfs_ prefix.  This patch also moves extern
declaration to bfs.h and removes prototype from inode.c

This fixes checkpatch warning:

  WARNING: externs should be avoided in .c files

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Tigran A. Aivazian" <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Joe Perches 19bdd41a57 adfs: add __printf verification, fix format/argument mismatches
Might as well do the right thing.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Fabian Frederick b16214d43d fs/adfs/dir_fplus.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Fabian Frederick e2ffcf5c7e fs/adfs/dir_fplus.c: use ARRAY_SIZE instead of sizeof/sizeof[0]
Use kernel.h definition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Fabian Frederick b134079f1f fs/exofs/ore_raid.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Joe Perches e5eea0981a sysctl: remove typedef ctl_table
Remove the final user, and the typedef itself.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:24 -07:00
Vitaly Kuznetsov 0692dedcf6 fs/proc/vmcore.c:mmap_vmcore: skip non-ram pages reported by hypervisors
We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()).  However, when
vmcore is read with mmap() no such check is performed.  That can lead to
unpredictable results, e.g.  when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page.  Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was been designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

[fengguang.wu@intel.com: remap_oldmem_pfn_checked() can be static]
[akpm@linux-foundation.org: clean up layout]
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Vladimir Davydov 41f727fde1 fork/exec: cleanup mm initialization
mm initialization on fork/exec is spread all over the place, which makes
the code look inconsistent.

We have mm_init(), which is supposed to init/nullify mm's internals, but
it doesn't init all the fields it should:

 - on fork ->mmap,mm_rb,vmacache_seqnum,map_count,mm_cpumask,locked_vm
   are zeroed in dup_mmap();

 - on fork ->pmd_huge_pte is zeroed in dup_mm(), immediately before
   calling mm_init();

 - ->cpu_vm_mask_var ptr is initialized by mm_init_cpumask(), which is
   called before mm_init() on both fork and exec;

 - ->context is initialized by init_new_context(), which is called after
   mm_init() on both fork and exec;

Let's consolidate all the initializations in mm_init() to make the code
look cleaner.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan 8f053ac11f proc: remove INF macro
If you're applying this patch, all /proc/$PID/* files were converted
to seq_file interface and this code became unused.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan d962c14483 proc: convert /proc/$PID/hardwall to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan 19aadc98d6 proc: convert /proc/$PID/io to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan 6ba51e3751 proc: convert /proc/$PID/oom_score to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan f6e826ca37 proc: convert /proc/$PID/schedstat to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan edfcd6064f proc: convert /proc/$PID/wchan to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan 2ca66ff70a proc: convert /proc/$PID/cmdline to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:23 -07:00
Alexey Dobriyan 09d93bd627 proc: convert /proc/$PID/syscall to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan 1c963eb135 proc: convert /proc/$PID/limits to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan f9ea536ef8 proc: convert /proc/$PID/auxv to seq_file interface
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan cedbccab8b proc: more "const char *" pointers
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan 849d20a128 proc: remove proc_tty_ldisc variable
/proc/tty/ldisc appear to be unused as a directory and
it had been always that way.

But it is userspace visible thing.

Cowardly remove only in-kernel variable holding it.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan 4dcc03fc45 proc: make proc_subdir_lock static
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan 335eb53158 proc: faster /proc/$PID lookup
Currently lookup for /proc/$PID first goes through spinlock and whole list
of misc /proc entries only to confirm that, yes, /proc/42 can not possibly
match random proc entry.

List is is several dozens entries long (52 entries on my setup).

None of this is necessary.

Try to convert dentry name to integer first.
If it works, it must be /proc/$PID.
If it doesn't, it must be random proc entry.

Based on patch from Al Viro.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan dbcdb50441 proc: add and remove /proc entry create checks
* remove proc_create(NULL, ...) check, let it oops

* warn about proc_create("", ...) and proc_create("very very long name", ...)
  proc code keeps length as u8, no 256+ name length possible

* warn about proc_create("123", ...)
  /proc/$PID and /proc/misc namespaces are separate things,
  but dumb module might create funky a-la $PID entry.

* remove post mortem strchr('/') check
  Triggering it implies either strchr() is buggy or memory corruption.
  It should be VFS check anyway.

In reality, none of these checks will ever trigger,
it is preparation for the next patch.

Based on patch from Al Viro.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Fabian Frederick ccf94f1b4a proc: constify seq_operations
proc_uid_seq_operations, proc_gid_seq_operations and
proc_projid_seq_operations are only called in proc_id_map_open with
seq_open as const struct seq_operations so we can constify the 3
structures and update proc_id_map_open prototype.

   text    data     bss     dec     hex filename
   6817     404    1984    9205    23f5 kernel/user_namespace.o-before
   6913     308    1984    9205    23f5 kernel/user_namespace.o-after

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Fabian Frederick 108a8a11cb fs/proc/kcore.c: use PAGE_ALIGN instead of ALIGN(PAGE_SIZE)
Use mm.h definition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Fabian Frederick f58d6c7547 fs/hpfs/dnode.c: fix suspect code indent
Fix 2 checkpatch warnings:

  WARNING: suspect code indent for conditional statements

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Fabian Frederick f3fb9e2732 fs/reiserfs/xattr.c: fix blank line missing after declarations
Fix checkpatch warning:

  WARNING: Missing a blank line after declarations

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Fabian Frederick 17093991af fs/reiserfs: use linux/uaccess.h
Fix checkpatch warning

  WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick 53872ed077 fs/reiserfs: replace not-standard %Lu/%Ld
Fixes checkpatch warnings:

"WARNING: %Ld/%Lu are not-standard C, use %lld/%llu"

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick edc023caf4 fs/ufs/inode.c: kernel-doc warning fixes
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick d4beaabd30 fs/ufs: convert UFSD printk to pr_debug
Convert no level printk to pr_debug in UFSD.  DEBUG is defined with
CONFIG_UFS_DEBUG so pr_debug are emitted here.

Also fixing call to UFSD (add;)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick 7e1e4167d4 fs/ufs/super.c: use va_format instead of buffer/vsnprintf
Remove error_buffer and use %pV

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick 07bc94fdb4 fs/ufs/super.c: use __func__ in logging
Replace approximate function name by __func__ using standard format
"function():"

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick de771bdaaa fs/ufs: use pr_fmt
Replace UFS-fs, UFS-fs: and UFS: by pr_fmt with module name "ufs: "

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Fabian Frederick a9814c5d2d fs/ufs: convert printk to pr_foo()
Use current logging functions.

- no level printk under CONFIG_UFS_DEBUG converted to pr_debug

- no level printk elsewhere converted to pr_err

- add DDEBUG flag in Makefile

- coalesce formats

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Vyacheslav Dubeyko dd70edbde2 nilfs2: integrate sysfs support into driver
This patch integrates creation of sysfs groups and
attributes into NILFS file system driver.

It was found the issue with nilfs_sysfs_{create/delete}_snapshot_group
functions by Michael L Semon <mlsemon35@gmail.com> in the first
version of the patch:

  BUG: sleeping function called from invalid context at kernel/locking/mutex.c:579
  in_atomic(): 1, irqs_disabled(): 0, pid: 32676, name: umount.nilfs2
  2 locks held by umount.nilfs2/32676:
   #0:  (&type->s_umount_key#21){++++..}, at: [<790c18e2>] deactivate_super+0x37/0x58
   #1:  (&(&nilfs->ns_cptree_lock)->rlock){+.+...}, at: [<791bf659>] nilfs_put_root+0x23/0x5a
  Preemption disabled at:[<791bf659>] nilfs_put_root+0x23/0x5a

  CPU: 0 PID: 32676 Comm: umount.nilfs2 Not tainted 3.14.0+ #2
  Hardware name: Dell Computer Corporation Dimension 2350/07W080, BIOS A01 12/17/2002
  Call Trace:
    dump_stack+0x4b/0x75
    __might_sleep+0x111/0x16f
    mutex_lock_nested+0x1e/0x3ad
    kernfs_remove+0x12/0x26
    sysfs_remove_dir+0x3d/0x62
    kobject_del+0x13/0x38
    nilfs_sysfs_delete_snapshot_group+0xb/0xd
    nilfs_put_root+0x2a/0x5a
    nilfs_detach_log_writer+0x1ab/0x2c1
    nilfs_put_super+0x13/0x68
    generic_shutdown_super+0x60/0xd1
    kill_block_super+0x1d/0x60
    deactivate_locked_super+0x22/0x3f
    deactivate_super+0x3e/0x58
    mntput_no_expire+0xe2/0x141
    SyS_oldumount+0x70/0xa5
    syscall_call+0x7/0xb

The reason of the issue was placement of
nilfs_sysfs_{create/delete}_snapshot_group() call under
nilfs->ns_cptree_lock protection.  But this protection is unnecessary and
wrong solution.  The second version of the patch fixes this issue.

[fengguang.wu@intel.com: nilfs_sysfs_create_mounted_snapshots_group can be static]
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Vyacheslav Dubeyko a5a7332a29 nilfs2: add /sys/fs/nilfs2/<device>/mounted_snapshots/<snapshot> group
This patch adds creation of <snapshot> group for every mounted
snapshot in /sys/fs/nilfs2/<device>/mounted_snapshots group.

The group contains details about mounted snapshot:
(1) inodes_count - show number of inodes for snapshot.
(2) blocks_count - show number of blocks for snapshot.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Vyacheslav Dubeyko a2ecb791a9 nilfs2: add /sys/fs/nilfs2/<device>/mounted_snapshots group
This patch adds creation of /sys/fs/nilfs2/<device>/mounted_snapshots
group.

The mounted_snapshots group contains group for every
mounted snapshot.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Vyacheslav Dubeyko 02a0ba1c60 nilfs2: add /sys/fs/nilfs2/<device>/checkpoints group
This patch adds creation of /sys/fs/nilfs2/<device>/checkpoints
group.

The checkpoints group contains attributes that describe
details about volume's checkpoints:
(1) checkpoints_number - show number of checkpoints on volume.
(2) snapshots_number - show number of snapshots on volume.
(3) last_seg_checkpoint - show checkpoint number of the latest segment.
(4) next_checkpoint - show next checkpoint number.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:21 -07:00
Vyacheslav Dubeyko ef43d5cd84 nilfs2: add /sys/fs/nilfs2/<device>/segments group
This patch adds creation of /sys/fs/nilfs2/<device>/segments
group.

The segments group contains attributes that describe
details about volume's segments:
(1) segments_number - show number of segments on volume.
(2) blocks_per_segment - show number of blocks in segment.
(3) clean_segments - show count of clean segments.
(4) dirty_segments - show count of dirty segments.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Vyacheslav Dubeyko abc968dbf2 nilfs2: add /sys/fs/nilfs2/<device>/segctor group
This patch adds creation of /sys/fs/nilfs2/<device>/segctor
group.

The segctor group contains attributes that describe
segctor thread activity details:
(1) last_pseg_block - show start block number of the latest segment.
(2) last_seg_sequence - show sequence value of the latest segment.
(3) last_seg_checkpoint - show checkpoint number of the latest segment.
(4) current_seg_sequence - show segment sequence counter.
(5) current_last_full_seg - show index number of the latest full segment.
(6) next_full_seg - show index number of the full segment index
to be used next.
(7) next_pseg_offset - show offset of next partial segment in
the current full segment.
(8) next_checkpoint - show next checkpoint number.
(9) last_seg_write_time - show write time of the last segment
in human-readable format.
(10) last_seg_write_time_secs - show write time of the last segment
in seconds.
(11) last_nongc_write_time - show write time of the last segment
not for cleaner operation in human-readable format.
(12) last_nongc_write_time_secs - show write time of the last segment
not for cleaner operation in seconds.
(13) dirty_data_blocks_count - show number of dirty data blocks.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Vyacheslav Dubeyko caa05d49df nilfs2: add /sys/fs/nilfs2/<device>/superblock group
This patch adds creation of /sys/fs/nilfs2/<device>/superblock
group.

The superblock group contains attributes that describe
superblock's details:
(1) sb_write_time - show previous write time of super block in
human-readable format.
(2) sb_write_time_secs - show previous write time of super block
in seconds.
(3) sb_write_count - show write count of super block.
(4) sb_update_frequency - show/set interval of periodical update
of superblock (in seconds). You can set preferable frequency of
superblock update by command:

echo <value> > /sys/fs/nilfs2/<device>/superblock/sb_update_frequency

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Vyacheslav Dubeyko da7141fb78 nilfs2: add /sys/fs/nilfs2/<device> group
This patch adds creation of /sys/fs/nilfs2/<device> group.

The <device> group contains attributes that describe file
system partition's details:
(1) revision - show NILFS file system revision.
(2) blocksize - show volume block size in bytes.
(3) device_size - show volume size in bytes.
(4) free_blocks - show count of free blocks on volume.
(5) uuid - show volume's UUID.
(6) volume_name - show volume's name.

Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Vyacheslav Dubeyko aebe17f684 nilfs2: add /sys/fs/nilfs2/features group
This patchset implements creation of sysfs groups and attributes with
the purpose to show NILFS2 volume details, internal state of the driver
and to manage internal state of NILFS2 driver.

Sysfs is a virtual file system that exports information about devices
and drivers from the kernel device model to user space, and is also used
for configuration.  NILFS2 is a complex file system that has segctor
thread, GC thread, checkpoint/snapshot model and so on.  Sysfs namespace
provides native and easy way for: (1) getting info and statistics about
volume state; (2) getting info and configuration of internal subsystems
(segctor thread); (3) snapshots management.

Suggested patchset provides basis for managing segctor thread behaviour
and manipulation by snapshots.  Currently, it informs only about segctor
thread's internal parameters and about mounted snapshots.  But sysfs
interface can provide easy and simple way for deep management of segctor
thread and snapshots.

This patchset provides opportunity to manage interval of periodical
update of superblock (in seconds).  Default value is 10 seconds.  Now a
user can increase this value by means of
nilfs2/<device>/superblock/sb_update_frequency attribute in the case of
necessity.

Also the patchset provides opportunity to get information easily about
key volumes's parameters (free blocks, superblock write count,
superblock update frequency, latest segment info, dirty data blocks
count, count of clean segments, count of dirty segments and so on) in
real time manner.  Such information can be used in scripts for subtle
management of filesystem.

Implemented functionality creates such groups:
(1) /sys/fs/nilfs2 - root group
(2) /sys/fs/nilfs2/features - group contains attributes that describe NILFS
file system driver features
(3) /sys/fs/nilfs2/<device> - group contains attributes that describe file
system partition's details
(4) /sys/fs/nilfs2/<device>/superblock - group contains attributes that describe
superblock's details
(5) /sys/fs/nilfs2/<device>/segctor - group contains attributes that describe
segctor thread activity details
(6) /sys/fs/nilfs2/<device>/segments - group contains attributes that describe
details about volume's segments
(7) /sys/fs/nilfs2/<device>/checkpoints - group contains attributes that describe
details about volume's checkpoints
(8) /sys/fs/nilfs2/<device>/mounted_snapshots - group contains group for every
mounted snapshot
(9) /sys/fs/nilfs2/<device>/mounted_snapshots/<snapshot> - group contains
details about mounted snapshot

This patch (of 9):

This patch adds code of creation /sys/fs/nilfs2 group and
/sys/fs/nilfs2/features group.

The features group contains attributes that describe NILFS
file system driver features:
(1) revision - show current revision of NILFS file system driver.

There are two formats of timestamp output - seconds and human-readable
format.  Every showed timestamp has two sysfs files (time-<xxx> and
time-<xxx>-secs).  One sysfs file (time-<xxx>) shows time in
human-readable format.  Another sysfs file (time-<xxx>-secs) shows time in
seconds.

It was reported by Michael Semon that timestamp output in human-readable
format should be changed from "2014-4-12 14:5:38" to "2014-04-12
14:05:38".  Second version of the patch fixes this issue.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Vyacheslav Dubeyko <Vyacheslav.Dubeyko@hgst.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Fabian Frederick 834b46c37a fs/coda: use linux/uaccess.h
Fix checkpatch warning

  WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Fabian Frederick 8e19189ef8 fs/befs/linuxvfs.c: check superblock before dump operation
befs_dump_super_block was called between befs_load_sb and befs_check_sb.
It has been reported to crash (5/900) with null block testing.

This patch loads, checks and only dump superblock if it's a valid one
then brelse bh.

(befs_dump_super_block uses disk_sb (bh->b_data) so it seems we need to
call it before brelse(bh) but I don't know why befs_check_sb was called
after brelse.  Another thing I don't understand is why this problem
appears now).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
Qi Yong 6d6747f853 minix zmap block counts calculation fix
The original minix zmap blocks calculation was correct, in the formula of:

	sbi->s_nzones - sbi->s_firstdatazone + 1

It is

	sp->s_zones - (sp->s_firstdatazone - 1)

in the minix3 source code.

But a later commit 016e8d44bc ("fs/minix: Verify bitmap block counts
before mounting") has changed it unfortunately as:

  sbi->s_nzones - (sbi->s_firstdatazone + 1)

This would show free blocks one block less than the real when the total
data blocks are in "full zmap blocks plus one".

This patch corrects that zmap blocks calculation and tidy a printk
message while at it.

Signed-off-by: Qi Yong <qiyong@fc-cn.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:20 -07:00
NeilBrown 3b97dd0581 autofs4: comment typo: remove a a doubled word
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:19 -07:00
NeilBrown bdac38329e autofs4: remove some unused inline functions
{__,}manage_dentry_{set,clear}_{automount,transit}

are 4 unused inline functions.  Discard them.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:19 -07:00
NeilBrown 668128e90b autofs4: don't take spinlock when not needed in autofs4_lookup_expiring
If the expiring_list is empty, we can avoid a costly spinlock in the
rcu-walk path through autofs4_d_manage (once the rest of the path
becomes rcu-walk friendly).

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:19 -07:00
NeilBrown c312442fe3 autofs4: remove a redundant assignment
The variable 'ino' already exists and already has the correct value.
The d_fsdata of a dentry is never changed after the d_fsdata is
instantiated, so this new assignment cannot be necessary.

It was introduced in commit b5b801779d ("autofs4: Add d_manage()
dentry operation").

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:19 -07:00
NeilBrown 26b7a54a35 autofs4: remove unused autofs4_ispending()
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:18 -07:00
Fabian Frederick 6f4535ed7d fs/ramfs/file-nommu.c: replace count*size kzalloc by kcalloc
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:18 -07:00
Fabian Frederick ca35664031 fs/efs/namei.c: return is not a function
Fix checkpatch errors:

"ERROR: return is not a function, parentheses are not required"

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:18 -07:00
Al Viro c7f3888ad7 switch iov_iter_get_pages() to passing maximal number of pages
... instead of maximal size.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:11 -04:00
Fengguang Wu 49c7dd287a fs: mark __d_obtain_alias static
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:11 -04:00
J. Bruce Fields 95ad5c2913 dcache: d_splice_alias should detect loops
I believe this can only happen in the case of a corrupted filesystem.
So -EIO looks like the appropriate error.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:11 -04:00
J. Bruce Fields 8d80d7dabe dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
If we get to this point and discover the dentry is not a root dentry, or
not DCACHE_DISCONNECTED--great, we always prefer that anyway.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields 52ed46f0fa dcache: remove unused d_find_alias parameter
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields 1a0a397e41 dcache: d_obtain_alias callers don't all want DISCONNECTED
There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.

This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.

It isn't really a serious problem, but it would really be clearer if we
reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry.  Josef worked around this by unsetting DCACHE_DISCONNECTED
manually in 3a0dfa6a12 "Btrfs: unset DCACHE_DISCONNECTED when mounting
default subvol", and this replaces that workaround.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields da093a9b76 dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
Any IS_ROOT() alias should be safe to use; there's nothing special about
DCACHE_DISCONNECTED dentries.

Note that this is in fact useful for filesystems such as btrfs which can
legimately encounter a directory with a preexisting IS_ROOT alias on a
lookup that crosses into a subvolume.  (Those aliases are currently
marked DCACHE_DISCONNECTED--but not really for any good reason, and
we'll change that soon.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields 908790fa3b dcache: d_splice_alias mustn't create directory aliases
Currently if d_splice_alias finds a directory with an alias that is not
IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory.

Duplicate directory dentries are unacceptable; it is better just to
error out.

(In the case of a local filesystem the most likely case is filesystem
corruption: for example, perhaps two directories point to the same child
directory, and the other parent has already been found and cached.)

Note that distributed filesystems may encounter this case in normal
operation if a remote host moves a directory to a location different
from the one we last cached in the dcache.  For that reason, such
filesystems should instead use d_materialise_unique, which tries to move
the old directory alias to the right place instead of erroring out.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields 75a2352d01 dcache: close d_move race in d_splice_alias
d_splice_alias will d_move an IS_ROOT() directory dentry into place if
one exists.  This should be safe as long as the dentry remains IS_ROOT,
but I can't see what guarantees that: once we drop the i_lock all we
hold here is the i_mutex on an unrelated parent directory.

Instead copy the logic of d_materialise_unique.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields 3f70bd51cb dcache: move d_splice_alias
Just a trivial move to locate it near (similar) d_materialise_unique
code and save some forward references in a following patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
J. Bruce Fields d03b29a271 namei: trivial fix to vfs_rename_dir comment
Looks like the directory loop check is actually done in renameat?
Whatever, leave this out rather than trying to keep it up to date with
the code.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
NeilBrown b8faf035ea VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
In REF-walk mode, ->d_manage can return -EISDIR to indicate
that the dentry is not really a mount trap (or even a mount point)
and that any mounts or any DCACHE_NEED_AUTOMOUNT flag should be
ignored.

RCU-walk mode doesn't currently support this, so if there is a dentry
with DCACHE_NEED_AUTOMOUNT set but which shouldn't be a mount-trap,
lookup_fast() will always drop in REF-walk mode.

With this patch, an -EISDIR from ->d_manage will always cause mounts
and automounts to be ignored, both in REF-walk and RCU-walk.

Bug-fixed-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:10 -04:00
Miklos Szeredi 7c33d5972c cifs: support RENAME_NOREPLACE
This flag gives CIFS the ability to support its native rename semantics.

Implementation is simple: just bail out before trying to hack around the
noreplace semantics.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Miklos Szeredi 9a423bb6e3 hostfs: support rename flags
Support RENAME_NOREPLACE and RENAME_EXCHANGE flags on hostfs if the
underlying filesystem supports it.

Since renameat2(2) is not yet in any libc, use syscall(2) to invoke the
renameat2 syscall.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Miklos Szeredi 80ace85c91 btrfs: add RENAME_NOREPLACE
RENAME_NOREPLACE is trivial to implement for most filesystems: switch over
to ->rename2() and check for the supported flags.  The rest is done by the
VFS.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Miklos Szeredi a0dbc56610 bad_inode: add ->rename2()
so we return -EIO instead of -EINVAL.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Miklos Szeredi 7177a9c4b5 fs: call rename2 if exists
Christoph Hellwig suggests:

1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
   use ->rename2
3) see how many ->rename instances we'll have left after a few
   iterations of 2.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro 3064c3563b death to mnt_pinned
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone.  Then attach the pin to original
vfsmount.  Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out.  Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.

mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro 8fa1f1c2bd make fs/{namespace,super}.c forget about acct.h
These externs belong in fs/internal.h.  Rename (they are not acct-specific
anymore) and move them over there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro efb170c228 take fs_pin stuff to fs/*
Add a new field to fs_pin - kill(pin).  That's what umount and r/o remount
will be calling for all pins attached to vfsmount and superblock resp.
Called after bumping the refcount, so it won't go away under us.  Dropping
the refcount is responsibility of the instance.  All generic stuff moved to
fs/fs_pin.c; the next step will rip all the knowledge of kernel/acct.c from
fs/super.c and fs/namespace.c.  After that - death to mnt_pin(); it was
intended to be usable as generic mechanism for code that wants to attach
objects to vfsmount, so that they would not make the sucker busy and
would get killed on umount.  Never got it right; it remained acct.c-specific
all along.  Now it's very close to being killable.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro 0aec09d049 drop ->s_umount around acct_auto_close()
just repeat the frozen check after regaining it, and check that sb
is still alive.  If several threads hit acct_auto_close() at the
same time, acct_auto_close() will survive that just fine.  And we
really don't want to play with writes and closing the file with
->s_umount held exclusive - it's a deadlock country.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro 215752fce3 acct: get rid of acct_list
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro ed44724b79 acct: switch to __kernel_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro 82df9c8beb Merge commit 'ccbf62d8a284cf181ac28c8e8407dd077d90dd4b' into for-next
backmerge to avoid kernel/acct.c conflict
2014-08-07 14:07:57 -04:00
Yan, Zheng 282c105225 ceph: fix kick_requests()
__do_request() may unregister the request. So we should update
iterator 'p' before calling __do_request()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-08-07 14:30:00 +04:00
Linus Torvalds 33caee3992 Merge branch 'akpm' (patchbomb from Andrew Morton)
Merge incoming from Andrew Morton:
 - Various misc things.
 - arch/sh updates.
 - Part of ocfs2.  Review is slow.
 - Slab updates.
 - Most of -mm.
 - printk updates.
 - lib/ updates.
 - checkpatch updates.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits)
  checkpatch: update $declaration_macros, add uninitialized_var
  checkpatch: warn on missing spaces in broken up quoted
  checkpatch: fix false positives for --strict "space after cast" test
  checkpatch: fix false positive MISSING_BREAK warnings with --file
  checkpatch: add test for native c90 types in unusual order
  checkpatch: add signed generic types
  checkpatch: add short int to c variable types
  checkpatch: add for_each tests to indentation and brace tests
  checkpatch: fix brace style misuses of else and while
  checkpatch: add --fix option for a couple OPEN_BRACE misuses
  checkpatch: use the correct indentation for which()
  checkpatch: add fix_insert_line and fix_delete_line helpers
  checkpatch: add ability to insert and delete lines to patch/file
  checkpatch: add an index variable for fixed lines
  checkpatch: warn on break after goto or return with same tab indentation
  checkpatch: emit a warning on file add/move/delete
  checkpatch: add test for commit id formatting style in commit log
  checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings
  checkpatch: improve "no space after cast" test
  checkpatch: allow multiple const * types
  ...
2014-08-06 21:14:42 -07:00
Linus Torvalds 158c12948f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree changes from Jiri Kosina:
 "Summer edition of trivial tree updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  doc: fix two typos in watchdog-api.txt
  irq-gic: remove file name from heading comment
  MAINTAINERS: Add miscdevice.h to file list for char/misc drivers.
  scsi: mvsas: mv_sas.c: Fix for possible null pointer dereference
  doc: replace "practise" with "practice" in Documentation
  befs: remove check for CONFIG_BEFS_RW
  scsi: doc: fix 'SCSI_NCR_SETUP_MASTER_PARITY'
  drivers/usb/phy/phy.c: remove a leading space
  mfd: fix comment
  cpuidle: fix comment
  doc: hpfall.c: fix missing null-terminate after strncpy call
  usb: doc: hotplug.txt code typos
  kbuild: fix comment in Makefile.modinst
  SH: add proper prompt to SH_MAGIC_PANEL_R2_VERSION
  ARM: msm: Remove MSM_SCM
  crypto: Remove MPILIB_EXTRA
  doc: CN: remove dead link, kerneltrap.org no longer works
  media: update reference, kerneltrap.org no longer works
  hexagon: update reference, kerneltrap.org no longer works
  doc: LSM: update reference, kerneltrap.org no longer works
  ...
2014-08-06 21:03:53 -07:00
Ken Helias 1d023284c3 list: fix order of arguments for hlist_add_after(_rcu)
All other add functions for lists have the new item as first argument
and the position where it is added as second argument.  This was changed
for no good reason in this function and makes using it unnecessary
confusing.

The name was changed to hlist_add_behind() to cause unconverted code to
generate a compile error instead of using the wrong parameter order.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Ken Helias <kenhelias@firemail.de>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[intel driver bits]
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:24 -07:00
Peter Feiner 68b5a65248 mm: softdirty: respect VM_SOFTDIRTY in PTE holes
After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
should report that the VMA's virtual pages are soft-dirty until
VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
/proc/pid/clear_refs).  However, pagemap ignores the VM_SOFTDIRTY flag
for virtual addresses that fall in PTE holes (i.e., virtual addresses
that don't have a PMD, PUD, or PGD allocated yet).

To observe this bug, use mmap to create a VMA large enough such that
there's a good chance that the VMA will occupy an unused PMD, then test
the soft-dirty bit on its pages.  In practice, I found that a VMA that
covered a PMD's worth of address space was big enough.

This patch adds the necessary VMA lookup to the PTE hole callback in
/proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
VM_SOFTDIRTY flag.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:22 -07:00
Rafael Aquini cc7452b6dc mm: export NR_SHMEM via sysinfo(2) / si_meminfo() interfaces
Historically, we exported shared pages to userspace via sysinfo(2)
sharedram and /proc/meminfo's "MemShared" fields.  With the advent of
tmpfs, from kernel v2.4 onward, that old way for accounting shared mem
was deemed inaccurate and we started to export a hard-coded 0 for
sysinfo.sharedram.  Later on, during the 2.6 timeframe, "MemShared" got
re-introduced to /proc/meminfo re-branded as "Shmem", but we're still
reporting sysinfo.sharedmem as that old hard-coded zero, which makes the
"shared memory" report inconsistent across interfaces.

This patch leverages the addition of explicit accounting for pages used
by shmem/tmpfs -- "4b02108 mm: oom analysis: add shmem vmstat" -- in
order to make the users of sysinfo(2) and si_meminfo*() friends aware of
that vmstat entry and make them report it consistently across the
interfaces, as well to make sysinfo(2) returned data consistent with our
current API documentation states.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:19 -07:00
Fabian Frederick 1b7f8ba603 fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Tariq Saeed bba1cb17d9 ocfs2: race between umount and unfinished remastering during recovery
Orabug: 19074140

When umount is issued during recovery on the new master that has not
finished remastering locks, it triggers BUG() in
dlm_send_mig_lockres_msg().  Here is the situation:

 1) node A has a lock on resource X mastered by node B.

 2) node B dies ->  node A sets recovering flag for res X

 3) Node C becomes the new master for resources owned by the
    dead node and is remastering locks of the dead node but
    has not finished the remastering process yet.

 4) umount is issued on node C.

 5) During processing of umount, ignoring unfished recovery,
    node C attempts to migrate resource X to node A.

 6) node A finds res X in DLM_LOCK_RES_RECOVERING state, considers
    it a logic error and sends back -EFAULT.

 7) node C asserts BUG() upon seeing EFAULT resp from node B.

Fix is to delay migrating res X till remastering is finished at which
point recovering flag will be cleared on both A and C.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Xue jiufei 7567c14883 ocfs2: remove conversion of total_backoff in dlm_join_domain()
The unit of total_backoff is msecs not jiffies, so no need to do the
conversion.  Otherwise, the join timeout is not 90 sec.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Yingtai Xie 981035b47d ocfs2: correctly check the return value of ocfs2_search_extent_list
ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert, otherwise it may cause array index out of
bound.

And ocfs2_search_extent_list can only return value less than
el->l_next_free_rec, so check if it is equal or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.

Signed-off-by: Yingtai Xie <xieyingtai@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Fabian Frederick c811f5f41e fs/squashfs/super.c: logging cleanup
- Convert printk to pr_foo()
- Add pr_fmt for future logging entries
- Coalesce formats

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Fabian Frederick 14694888db fs/squashfs/file_direct.c: replace count*size kmalloc by kmalloc_array
kmalloc_array() manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:13 -07:00
Fabian Frederick 2502722dde ntfs: kernel-doc warning fixes
cached_page and lru_pvec were removed from ntfs_attr_extend_initialized
in commit 2ec93b0bf3 ("ntfs: clean up ntfs_attr_extend_initialized")

lru_pvec has been removed from __ntfs_grab_cache_pages in commit
4c99000ac4 ("ntfs: use add_to_page_cache_lru()")

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Fabian Frederick 2122da26c3 fs/logfs/readwrite.c: kernel-doc warning fixes
s/-/:/ and fix variable names.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Jan Kara 5838d4442b fanotify: fix double free of pending permission events
Commit 8581679424 ("fanotify: Fix use after free for permission
events") introduced a double free issue for permission events which are
pending in group's notification queue while group is being destroyed.
These events are freed from fanotify_handle_event() but they are not
removed from groups notification queue and thus they get freed again
from fsnotify_flush_notify().

Fix the problem by removing permission events from notification queue
before freeing them if we skip processing access response.  Also expand
comments in fanotify_release() to explain group shutdown in detail.

Fixes: 8581679424
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Douglas Leeder <douglas.leeder@sophos.com>
Tested-by: Douglas Leeder <douglas.leeder@sophos.com>
Reported-by: Heinrich Schuchard <xypron.glpk@gmx.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Jan Kara 8ba8fa9170 fsnotify: rename event handling functions
Rename fsnotify_add_notify_event() to fsnotify_add_event() since the
"notify" part is duplicit.  Rename fsnotify_remove_notify_event() and
fsnotify_peek_notify_event() to fsnotify_remove_first_event() and
fsnotify_peek_first_event() respectively since "notify" part is duplicit
and they really look at the first event in the queue.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Fabian Frederick 3e58406484 fs/fscache: make ctl_table static
fscache_sysctls and fscache_sysctls_root are only used in main.c

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:12 -07:00
Linus Torvalds ae045e2455 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Steady transitioning of the BPF instructure to a generic spot so
      all kernel subsystems can make use of it, from Alexei Starovoitov.

   2) SFC driver supports busy polling, from Alexandre Rames.

   3) Take advantage of hash table in UDP multicast delivery, from David
      Held.

   4) Lighten locking, in particular by getting rid of the LRU lists, in
      inet frag handling.  From Florian Westphal.

   5) Add support for various RFC6458 control messages in SCTP, from
      Geir Ola Vaagland.

   6) Allow to filter bridge forwarding database dumps by device, from
      Jamal Hadi Salim.

   7) virtio-net also now supports busy polling, from Jason Wang.

   8) Some low level optimization tweaks in pktgen from Jesper Dangaard
      Brouer.

   9) Add support for ipv6 address generation modes, so that userland
      can have some input into the process.  From Jiri Pirko.

  10) Consolidate common TCP connection request code in ipv4 and ipv6,
      from Octavian Purdila.

  11) New ARP packet logger in netfilter, from Pablo Neira Ayuso.

  12) Generic resizable RCU hash table, with intial users in netlink and
      nftables.  From Thomas Graf.

  13) Maintain a name assignment type so that userspace can see where a
      network device name came from (enumerated by kernel, assigned
      explicitly by userspace, etc.) From Tom Gundersen.

  14) Automatic flow label generation on transmit in ipv6, from Tom
      Herbert.

  15) New packet timestamping facilities from Willem de Bruijn, meant to
      assist in measuring latencies going into/out-of the packet
      scheduler, latency from TCP data transmission to ACK, etc"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1536 commits)
  cxgb4 : Disable recursive mailbox commands when enabling vi
  net: reduce USB network driver config options.
  tg3: Modify tg3_tso_bug() to handle multiple TX rings
  amd-xgbe: Perform phy connect/disconnect at dev open/stop
  amd-xgbe: Use dma_set_mask_and_coherent to set DMA mask
  net: sun4i-emac: fix memory leak on bad packet
  sctp: fix possible seqlock seadlock in sctp_packet_transmit()
  Revert "net: phy: Set the driver when registering an MDIO bus device"
  cxgb4vf: Turn off SGE RX/TX Callback Timers and interrupts in PCI shutdown routine
  team: Simplify return path of team_newlink
  bridge: Update outdated comment on promiscuous mode
  net-timestamp: ACK timestamp for bytestreams
  net-timestamp: TCP timestamping
  net-timestamp: SCHED timestamp on entering packet scheduler
  net-timestamp: add key to disambiguate concurrent datagrams
  net-timestamp: move timestamp flags out of sk_flags
  net-timestamp: extend SCM_TIMESTAMPING ancillary data struct
  cxgb4i : Move stray CPL definitions to cxgb4 driver
  tcp: reduce spurious retransmits due to transient SACK reneging
  qlcnic: Initialize dcbnl_ops before register_netdev
  ...
2014-08-06 09:38:14 -07:00
Linus Torvalds bb2cbf5e93 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this release:

   - PKCS#7 parser for the key management subsystem from David Howells
   - appoint Kees Cook as seccomp maintainer
   - bugfixes and general maintenance across the subsystem"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (94 commits)
  X.509: Need to export x509_request_asymmetric_key()
  netlabel: shorter names for the NetLabel catmap funcs/structs
  netlabel: fix the catmap walking functions
  netlabel: fix the horribly broken catmap functions
  netlabel: fix a problem when setting bits below the previously lowest bit
  PKCS#7: X.509 certificate issuer and subject are mandatory fields in the ASN.1
  tpm: simplify code by using %*phN specifier
  tpm: Provide a generic means to override the chip returned timeouts
  tpm: missing tpm_chip_put in tpm_get_random()
  tpm: Properly clean sysfs entries in error path
  tpm: Add missing tpm_do_selftest to ST33 I2C driver
  PKCS#7: Use x509_request_asymmetric_key()
  Revert "selinux: fix the default socket labeling in sock_graft()"
  X.509: x509_request_asymmetric_keys() doesn't need string length arguments
  PKCS#7: fix sparse non static symbol warning
  KEYS: revert encrypted key change
  ima: add support for measuring and appraising firmware
  firmware_class: perform new LSM checks
  security: introduce kernel_fw_from_file hook
  PKCS#7: Missing inclusion of linux/err.h
  ...
2014-08-06 08:06:39 -07:00
Linus Torvalds e7fda6c4c3 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and time updates from Thomas Gleixner:
 "A rather large update of timers, timekeeping & co

   - Core timekeeping code is year-2038 safe now for 32bit machines.
     Now we just need to fix all in kernel users and the gazillion of
     user space interfaces which rely on timespec/timeval :)

   - Better cache layout for the timekeeping internal data structures.

   - Proper nanosecond based interfaces for in kernel users.

   - Tree wide cleanup of code which wants nanoseconds but does hoops
     and loops to convert back and forth from timespecs.  Some of it
     definitely belongs into the ugly code museum.

   - Consolidation of the timekeeping interface zoo.

   - A fast NMI safe accessor to clock monotonic for tracing.  This is a
     long standing request to support correlated user/kernel space
     traces.  With proper NTP frequency correction it's also suitable
     for correlation of traces accross separate machines.

   - Checkpoint/restart support for timerfd.

   - A few NOHZ[_FULL] improvements in the [hr]timer code.

   - Code move from kernel to kernel/time of all time* related code.

   - New clocksource/event drivers from the ARM universe.  I'm really
     impressed that despite an architected timer in the newer chips SoC
     manufacturers insist on inventing new and differently broken SoC
     specific timers.

[ Ed. "Impressed"? I don't think that word means what you think it means ]

   - Another round of code move from arch to drivers.  Looks like most
     of the legacy mess in ARM regarding timers is sorted out except for
     a few obnoxious strongholds.

   - The usual updates and fixlets all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  timekeeping: Fixup typo in update_vsyscall_old definition
  clocksource: document some basic timekeeping concepts
  timekeeping: Use cached ntp_tick_length when accumulating error
  timekeeping: Rework frequency adjustments to work better w/ nohz
  timekeeping: Minor fixup for timespec64->timespec assignment
  ftrace: Provide trace clocks monotonic
  timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
  seqcount: Add raw_write_seqcount_latch()
  seqcount: Provide raw_read_seqcount()
  timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
  timekeeping: Create struct tk_read_base and use it in struct timekeeper
  timekeeping: Restructure the timekeeper some more
  clocksource: Get rid of cycle_last
  clocksource: Move cycle_last validation to core code
  clocksource: Make delta calculation a function
  wireless: ath9k: Get rid of timespec conversions
  drm: vmwgfx: Use nsec based interfaces
  drm: i915: Use nsec based interfaces
  timekeeping: Provide ktime_get_raw()
  hangcheck-timer: Use ktime_get_ns()
  ...
2014-08-05 17:46:42 -07:00
Jeff Mahoney 27d0e5bc85 reiserfs: fix corruption introduced by balance_leaf refactor
Commits f1f007c308 (reiserfs: balance_leaf refactor, pull out
balance_leaf_insert_left) and cf22df182b (reiserfs: balance_leaf
refactor, pull out balance_leaf_paste_left) missed that the `body'
pointer was getting repositioned. Subsequent users of the pointer
would expect it to be repositioned, and as a result, parts of the
tree would get overwritten. The most common observed corruption
is indirect block pointers being overwritten.

Since the body value isn't actually used anymore in the called routines,
we can pass back the offset it should be shifted. We constify the body
and ih pointers in the balance_leaf as a mostly-free preventative measure.

Cc: <stable@vger.kernel.org> # 3.16
Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-08-05 23:18:38 +02:00
Jeff Layton 14a571a8ec nfsd: add some comments to the nfsd4 object definitions
Add some comments that describe what each of these objects is, and how
they related to one another.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 16:09:20 -04:00
Jeff Layton b687f6863e nfsd: remove the client_mutex and the nfs4_lock/unlock_state wrappers
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 15:00:54 -04:00
Steve French f29ebb47d5 Add worker function to set allocation size
Adds setinfo worker function for SMB2/SMB3 support of SET_ALLOCATION_INFORMATION

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
2014-08-05 12:53:37 -05:00
Jeff Layton 74cf76df0f nfsd: remove nfs4_lock_state: nfs4_state_shutdown_net
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:20 -04:00
Jeff Layton dab6ef2415 nfsd: remove nfs4_lock_state: nfs4_laundromat
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:20 -04:00
Trond Myklebust 05149dd4dc nfsd: Remove nfs4_lock_state(): reclaim_complete()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:19 -04:00
Trond Myklebust cb86fb1428 nfsd: Remove nfs4_lock_state(): setclientid, setclientid_confirm, renew
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:18 -04:00
Trond Myklebust 3974552dce nfsd: Remove nfs4_lock_state(): exchange_id, create/destroy_session()
Also destroy_clientid and bind_conn_to_session.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:17 -04:00
Trond Myklebust 3234975f47 nfsd: Remove nfs4_lock_state(): nfsd4_open and nfsd4_open_confirm
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:16 -04:00
Trond Myklebust 084d4d4549 nfsd: Remove nfs4_lock_state(): nfsd4_delegreturn()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:15 -04:00
Trond Myklebust 36626a2ecf nfsd: Remove nfs4_lock_state(): nfsd4_open_downgrade + nfsd4_close
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:14 -04:00
Trond Myklebust 2dd7f2ad4e nfsd: Remove nfs4_lock_state(): nfsd4_lock/locku/lockt()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:13 -04:00
Trond Myklebust 51f5e78355 nfsd: Remove nfs4_lock_state(): nfsd4_release_lockowner
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:12 -04:00
Trond Myklebust e7d5dc19ce nfsd: Remove nfs4_lock_state(): nfsd4_test_stateid/nfsd4_free_stateid
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:12 -04:00
Trond Myklebust c2d1d6a8f0 nfsd: Remove nfs4_lock_state(): nfs4_preprocess_stateid_op()
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:11 -04:00
Jeff Layton 285abdee53 nfsd: remove old fault injection infrastructure
Remove the old nfsd_for_n_state function and move nfsd_find_client
higher up into the file to get rid of forward declaration. Remove
the struct nfsd_fault_inject_op arguments from the operations as
they are no longer needed by any of them.

Finally, remove the old "standard" get and set routines, which
also eliminates the client_mutex from this code.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:10 -04:00
Jeff Layton 98d5c7c5bd nfsd: add more granular locking to *_delegations fault injectors
...instead of relying on the client_mutex.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:09 -04:00
Jeff Layton 82e05efaec nfsd: add more granular locking to forget_openowners fault injector
...instead of relying on the client_mutex.

Also, fix up the printk output that is generated when the file is read.
It currently says that it's reporting the number of open files, but
it's actually reporting the number of openowners.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:08 -04:00
Jeff Layton 016200c373 nfsd: add more granular locking to forget_locks fault injector
...instead of relying on the client_mutex.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:07 -04:00
Jeff Layton 3738d50e7f nfsd: add a list_head arg to nfsd_foreach_client_lock
In a later patch, we'll want to collect the locks onto a list for later
destruction. If "func" is defined and "collect" is defined, then we'll
add the lock stateid to the list.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:06 -04:00
Jeff Layton 69fc9edf98 nfsd: add nfsd_inject_forget_clients
...which uses the client_lock for protection instead of client_mutex.
Also remove nfsd_forget_client as there are no more callers.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:05 -04:00
Jeff Layton a0926d1527 nfsd: add a forget_client set_clnt routine
...that relies on the client_lock instead of client_mutex.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:04 -04:00
Jeff Layton 7ec0e36f1a nfsd: add a forget_clients "get" routine with proper locking
Add a new "get" routine for forget_clients that relies on the
client_lock instead of the client_mutex.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:04 -04:00
Jeff Layton c96223d3b6 nfsd: abstract out the get and set routines into the fault injection ops
Now that we've added more granular locking in other places, it's time
to address the fault injection code. This code is currently quite
reliant on the client_mutex for protection. Start to change this by
adding a new set of fault injection op vectors.

For now they all use the legacy ones. In later patches we'll add new
routines that can deal with more granular locking.

Also, move some of the printk routines into the callers to make the
results of the operations more uniform.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:02 -04:00
Jeff Layton 294ac32e99 nfsd: protect clid and verifier generation with client_lock
The clid counter is a global counter currently. Move it to be a per-net
property so that it can be properly protected by the nn->client_lock
instead of relying on the client_mutex.

The verifier generator is also potentially racy if there are two
simultaneous callers. Generate the verifier when we generate the clid
value, so it's also created under the client_lock. With this, there's
no need to keep two counters as they'd always be in sync anyway, so
just use the clientid_counter for both.

As Trond points out, what would be best is to eventually move this
code to use IDR instead of the hash tables. That would also help ensure
uniqueness, but that's probably best done as a separate project.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:02 -04:00
Jeff Layton fd699b8a48 nfsd: don't destroy clients that are busy
It's possible that we'll have an in-progress call on some of the clients
while a rogue EXCHANGE_ID or DESTROY_CLIENTID call comes in. Be sure to
try and mark the client expired first, so that the refcount is
respected.

This will only be a problem once the client_mutex is removed.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:55:01 -04:00
Kinglong Mee fb94d766af NFSD: Put the reference of nfs4_file when freeing stid
After testing nfs4 lock, I restart the nfsd service, got messages as,

[ 5677.403419] nfsd: last server has exited, flushing export cache
[ 5677.463728] =============================================================================
[ 5677.463942] BUG nfsd4_files (Tainted: G    B      OE): Objects remaining in nfsd4_files on kmem_cache_close()
[ 5677.464055] -----------------------------------------------------------------------------

[ 5677.464203] INFO: Slab 0xffffea0000233400 objects=28 used=1 fp=0xffff880008cd3d98 flags=0x3ffc0000004080
[ 5677.464318] CPU: 0 PID: 3772 Comm: rmmod Tainted: G    B      OE 3.16.0-rc2+ #29
[ 5677.464420] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5677.464538]  0000000000000000 0000000036af2c9f ffff88000ce97d68 ffffffff816eacfa
[ 5677.464643]  ffffea0000233400 ffff88000ce97e40 ffffffff811cda44 ffffffff00000020
[ 5677.464774]  ffff88000ce97e50 ffff88000ce97e00 656a624f00000008 616d657220737463
[ 5677.464875] Call Trace:
[ 5677.464925]  [<ffffffff816eacfa>] dump_stack+0x45/0x56
[ 5677.464983]  [<ffffffff811cda44>] slab_err+0xb4/0xe0
[ 5677.465040]  [<ffffffff811d0457>] ? __kmalloc+0x117/0x290
[ 5677.465099]  [<ffffffff81100eec>] ? on_each_cpu_cond+0xac/0xf0
[ 5677.465158]  [<ffffffff811d1bc0>] ? kmem_cache_close+0x110/0x2e0
[ 5677.465218]  [<ffffffff811d1be0>] kmem_cache_close+0x130/0x2e0
[ 5677.465279]  [<ffffffff8135a0c1>] ? kobject_cleanup+0x91/0x1b0
[ 5677.465338]  [<ffffffff811d22be>] __kmem_cache_shutdown+0xe/0x10
[ 5677.465399]  [<ffffffff8119bd28>] kmem_cache_destroy+0x48/0x100
[ 5677.465466]  [<ffffffffa05ef78d>] nfsd4_free_slabs+0x2d/0x50 [nfsd]
[ 5677.465530]  [<ffffffffa05fa987>] exit_nfsd+0x34/0x6ad [nfsd]
[ 5677.465589]  [<ffffffff81104ac2>] SyS_delete_module+0x162/0x200
[ 5677.465649]  [<ffffffff81013b69>] ? do_notify_resume+0x59/0x90
[ 5677.465759]  [<ffffffff816f2369>] system_call_fastpath+0x16/0x1b
[ 5677.465822] INFO: Object 0xffff880008cd0000 @offset=0
[ 5677.465882] INFO: Allocated in nfsd4_process_open1+0x61/0x350 [nfsd] age=7599 cpu=0 pid=3253
[ 5677.466115]  __slab_alloc+0x3b0/0x4b1
[ 5677.466166]  kmem_cache_alloc+0x1e4/0x240
[ 5677.466220]  nfsd4_process_open1+0x61/0x350 [nfsd]
[ 5677.466276]  nfsd4_open+0xee/0x860 [nfsd]
[ 5677.466329]  nfsd4_proc_compound+0x4d7/0x7f0 [nfsd]
[ 5677.466384]  nfsd_dispatch+0xbb/0x200 [nfsd]
[ 5677.466447]  svc_process_common+0x453/0x6f0 [sunrpc]
[ 5677.466506]  svc_process+0x103/0x170 [sunrpc]
[ 5677.466559]  nfsd+0x117/0x190 [nfsd]
[ 5677.466609]  kthread+0xd8/0xf0
[ 5677.466656]  ret_from_fork+0x7c/0xb0
[ 5677.466775] kmem_cache_destroy nfsd4_files: Slab cache still has objects
[ 5677.466839] CPU: 0 PID: 3772 Comm: rmmod Tainted: G    B      OE 3.16.0-rc2+ #29
[ 5677.466937] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5677.467049]  0000000000000000 0000000036af2c9f ffff88000ce97eb0 ffffffff816eacfa
[ 5677.467150]  ffff880020bb2d00 ffff88000ce97ed0 ffffffff8119bdd9 0000000000000000
[ 5677.467250]  ffffffffa06065c0 ffff88000ce97ee0 ffffffffa05ef78d ffff88000ce97ef0
[ 5677.467351] Call Trace:
[ 5677.467397]  [<ffffffff816eacfa>] dump_stack+0x45/0x56
[ 5677.467454]  [<ffffffff8119bdd9>] kmem_cache_destroy+0xf9/0x100
[ 5677.467516]  [<ffffffffa05ef78d>] nfsd4_free_slabs+0x2d/0x50 [nfsd]
[ 5677.467579]  [<ffffffffa05fa987>] exit_nfsd+0x34/0x6ad [nfsd]
[ 5677.467639]  [<ffffffff81104ac2>] SyS_delete_module+0x162/0x200
[ 5677.467765]  [<ffffffff81013b69>] ? do_notify_resume+0x59/0x90
[ 5677.467826]  [<ffffffff816f2369>] system_call_fastpath+0x16/0x1b

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Fixes: 11b9164ada "nfsd: Add a struct nfs4_file field to struct nfs4_stid"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-05 10:53:36 -04:00
Linus Torvalds 8e099d1e8b Bug fixes and clean ups for the 3.17 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJT4Eh0AAoJENNvdpvBGATwmaAP/0W9B/VPY4rSnanpXvsYlone
 wjIjh11NZdU3daW/c/VhgEIO81rFaoFoQ/dN+pIo7uHG8JRqZjyHXZY2eVF5SFFp
 FQyGEGMRhE+u78IDg3U99OlAUTo8SAHjHVlYUBVpT9lazMLWRPqP7uHJbXow5ijK
 /bXSVY+6fOWY1/yruCZv1nRtg9JNZgCc1LOPDqn6K16jItBKfYBvVbpw6hise2v2
 rPfcSlKJ5Wzo/PgNX+IR9nnUOXzpbdM2CZbxy0qB2jZYirzE6VNeHhc1JWxQWNB1
 Dg3j/ynEWPs03+9ywLcQ0kEQvUXhQQlMGmfPkgzWeAUQDUv4QAeYmFiRhc/EgJWY
 othDmKVqy0Pn9rmGCOMg/TJFH0Mz/c7PTxFVF1onxs2Sqvl3yCdZANT+ie5UoC9m
 zkUHdY3HkARiK/I6d5CCJzvHMxWNyf6bAJmoR6L/SaOPXebc4cfFtjgV01kbTWv+
 rW9MCj3TiIC9MpWdVJADcmc2w3cY0L/NUbFWHZhMSDFiuJvcLUw5afaJTvKEPuKp
 WnnYICPj6wMP7Gy/isTxBGGi0UjPm67DHGLpG+syPDi1RxU6Vw3p9PjbdvCRj7so
 UD3xzntCHOzVcgAxE92V4pMZAajtv0sfIBVzMm7k8iUXzb7Er1c5dV6ROkOzguq3
 Ogj/c2JHSSB4TSYdVxsG
 =hf9n
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Bug fixes and clean ups for the 3.17 merge window"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
  ext4: fix COLLAPSE RANGE test for bigalloc file systems
  ext4: check inline directory before converting
  ext4: fix incorrect locking in move_extent_per_page
  ext4: use correct depth value
  ext4: add i_data_sem sanity check
  ext4: fix wrong size computation in ext4_mb_normalize_request()
  ext4: make ext4_has_inline_data() as a inline function
  ext4: remove readpage() check in ext4_mmap_file()
  ext4: fix punch hole on files with indirect mapping
  ext4: remove metadata reservation checks
  ext4: rearrange initialization to fix EXT4FS_DEBUG
2014-08-04 20:46:54 -07:00
Linus Torvalds b54ecfb702 Merge tag 'for-f2fs-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
 "This series includes patches to:
   - add nobarrier mount option
   - support tmpfile and rename2
   - enhance the fdatasync behavior
   - fix the error path
   - fix the recovery routine
   - refactor a part of the checkpoint procedure
   - reduce some lock contentions"

* tag 'for-f2fs-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (40 commits)
  f2fs: use for_each_set_bit to simplify the code
  f2fs: add f2fs_balance_fs for expand_inode_data
  f2fs: invalidate xattr node page when evict inode
  f2fs: avoid skipping recover_inline_xattr after recover_inline_data
  f2fs: add tracepoint for f2fs_direct_IO
  f2fs: reduce competition among node page writes
  f2fs: fix coding style
  f2fs: remove redundant lines in allocate_data_block
  f2fs: add tracepoint for f2fs_issue_flush
  f2fs: avoid retrying wrong recovery routine when error was occurred
  f2fs: test before set/clear bits
  f2fs: fix wrong condition for unlikely
  f2fs: enable in-place-update for fdatasync
  f2fs: skip unnecessary data writes during fsync
  f2fs: add info of appended or updated data writes
  f2fs: use radix_tree for ino management
  f2fs: add infra for ino management
  f2fs: punch the core function for inode management
  f2fs: add nobarrier mount option
  f2fs: fix to put root inode in error path of fill_super
  ...
2014-08-04 20:30:07 -07:00
Linus Torvalds 29b88e23a9 Driver core patches for 3.17-rc1
Here's the big driver-core pull request for 3.17-rc1.
 
 Largest thing in here is the dma-buf rework and fence code, that touched
 many different subsystems so it was agreed it should go through this
 tree to handle merge issues.  There's also some firmware loading
 updates, as well as tests added, and a few other tiny changes, the
 changelog has the details.
 
 All have been in linux-next for a long time.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlPf1XcACgkQMUfUDdst+ylREACdHLXBa02yLrRzbrONJ+nARuFv
 JuQAoMN49PD8K9iMQpXqKBvZBsu+iCIY
 =w8OJ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here's the big driver-core pull request for 3.17-rc1.

  Largest thing in here is the dma-buf rework and fence code, that
  touched many different subsystems so it was agreed it should go
  through this tree to handle merge issues.  There's also some firmware
  loading updates, as well as tests added, and a few other tiny changes,
  the changelog has the details.

  All have been in linux-next for a long time"

* tag 'driver-core-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits)
  ARM: imx: Remove references to platform_bus in mxc code
  firmware loader: Fix _request_firmware_load() return val for fw load abort
  platform: Remove most references to platform_bus device
  test: add firmware_class loader test
  doc: fix minor typos in firmware_class README
  staging: android: Cleanup style issues
  Documentation: devres: Sort managed interfaces
  Documentation: devres: Add devm_kmalloc() et al
  fs: debugfs: remove trailing whitespace
  kernfs: kernel-doc warning fix
  debugfs: Fix corrupted loop in debugfs_remove_recursive
  stable_kernel_rules: Add pointer to netdev-FAQ for network patches
  driver core: platform: add device binding path 'driver_override'
  driver core/platform: remove unused implicit padding in platform_object
  firmware loader: inform direct failure when udev loader is disabled
  firmware: replace ALIGN(PAGE_SIZE) by PAGE_ALIGN
  firmware: read firmware size using i_size_read()
  firmware loader: allow disabling of udev as firmware loader
  reservation: add suppport for read-only access using rcu
  reservation: update api and add some helpers
  ...

Conflicts:
	drivers/base/platform.c
2014-08-04 18:34:04 -07:00
Linus Torvalds 98959948a7 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:

 - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
   from Frederic Weisbecker.

  This necessiated quite some background infrastructure rework,
  including:

   * Clean up some irq-work internals
   * Implement remote irq-work
   * Implement nohz kick on top of remote irq-work
   * Move full dynticks timer enqueue notification to new kick
   * Move multi-task notification to new kick
   * Remove unecessary barriers on multi-task notification

 - Remove proliferation of wait_on_bit() action functions and allow
   wait_on_bit_action() functions to support a timeout.  (Neil Brown)

 - Another round of sched/numa improvements, cleanups and fixes.  (Rik
   van Riel)

 - Implement fast idling of CPUs when the system is partially loaded,
   for better scalability.  (Tim Chen)

 - Restructure and fix the CPU hotplug handling code that may leave
   cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
   cpu.  (Kirill Tkhai)

 - Robustify the sched topology setup code.  (Peterz Zijlstra)

 - Improve sched_feat() handling wrt.  static_keys (Jason Baron)

 - Misc fixes.

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
  sched/fair: Fix 'make xmldocs' warning caused by missing description
  sched: Use macro for magic number of -1 for setparam
  sched: Robustify topology setup
  sched: Fix sched_setparam() policy == -1 logic
  sched: Allow wait_on_bit_action() functions to support a timeout
  sched: Remove proliferation of wait_on_bit() action functions
  sched/numa: Revert "Use effective_load() to balance NUMA loads"
  sched: Fix static_key race with sched_feat()
  sched: Remove extra static_key*() function indirection
  sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
  sched: Transform resched_task() into resched_curr()
  sched/deadline: Kill task_struct->pi_top_task
  sched: Rework check_for_tasks()
  sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
  sched/fair: Disable runtime_enabled on dying rq
  sched/numa: Change scan period code to match intent
  sched/numa: Rework best node setting in task_numa_migrate()
  sched/numa: Examine a task move when examining a task swap
  sched/numa: Simplify task_numa_compare()
  sched/numa: Use effective_load() to balance NUMA loads
  ...
2014-08-04 16:23:30 -07:00
Scott Mayhew 71a6ec8ac5 nfs: reject changes to resvport and sharecache during remount
Commit c8e47028 made it possible to change resvport/noresvport and
sharecache/nosharecache via a remount operation, neither of which should be
allowed.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Fixes: c8e47028 (nfs: Apply NFS_MOUNT_CMP_FLAGMASK to nfs_compare_remount_data)
Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-04 17:41:52 -04:00
Kinglong Mee 5b53dc88b0 NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error
Fix Commit 60ea681299 (NFS: Migration support for RELEASE_LOCKOWNER)
If getting expired error, client will enter a infinite loop as,

client                            server
   RELEASE_LOCKOWNER(old clid) ----->
                <--- expired error
   RENEW(old clid)             ----->
                <--- expired error
   SETCLIENTID                 ----->
                <--- a new clid
   SETCLIENTID_CONFIRM (new clid) -->
                <--- ok
   RELEASE_LOCKOWNER(old clid) ----->
                <--- expired error
   RENEW(new clid)             ----->
                <-- ok
   RELEASE_LOCKOWNER(old clid) ----->
                <--- expired error
   RENEW(new clid)             ----->
                <-- ok
                ... ...

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
[Trond: replace call to nfs4_async_handle_error() with
 nfs4_schedule_lease_recovery()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-04 16:51:38 -04:00
Chao Yu b65ee14818 f2fs: use for_each_set_bit to simplify the code
This patch uses for_each_set_bit to simplify some codes in f2fs.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-04 13:20:53 -07:00
Chao Yu 497a0930bb f2fs: add f2fs_balance_fs for expand_inode_data
This patch adds f2fs_balance_fs in expand_inode_data to avoid allocation failure
with segment.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-04 13:02:03 -07:00
Chao Yu 002a41cabb f2fs: invalidate xattr node page when evict inode
When inode is evicted, all the page cache belong to this inode should be
released including the xattr node page. But previously we didn't do this, this
patch fixed this issue.

v2:
 o reposition invalidate_mapping_pages() to the right place suggested by
Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-04 13:01:22 -07:00
Linus Torvalds f2a84170ed Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:

 - Major reorganization of percpu header files which I think makes
   things a lot more readable and logical than before.

 - percpu-refcount is updated so that it requires explicit destruction
   and can be reinitialized if necessary.  This was pulled into the
   block tree to replace the custom percpu refcnting implemented in
   blk-mq.

 - In the process, percpu and percpu-refcount got cleaned up a bit

* 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (21 commits)
  percpu-refcount: implement percpu_ref_reinit() and percpu_ref_is_zero()
  percpu-refcount: require percpu_ref to be exited explicitly
  percpu-refcount: use unsigned long for pcpu_count pointer
  percpu-refcount: add helpers for ->percpu_count accesses
  percpu-refcount: one bit is enough for REF_STATUS
  percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
  workqueue: stronger test in process_one_work()
  workqueue: clear POOL_DISASSOCIATED in rebind_workers()
  percpu: Use ALIGN macro instead of hand coding alignment calculation
  percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations
  percpu: preffity percpu header files
  percpu: use raw_cpu_*() to define __this_cpu_*()
  percpu: reorder macros in percpu header files
  percpu: move {raw|this}_cpu_*() definitions to include/linux/percpu-defs.h
  percpu: move generic {raw|this}_cpu_*_N() definitions to include/asm-generic/percpu.h
  percpu: only allow sized arch overrides for {raw|this}_cpu_*() ops
  percpu: reorganize include/linux/percpu-defs.h
  percpu: move accessors from include/linux/percpu.h to percpu-defs.h
  percpu: include/asm-generic/percpu.h should contain only arch-overridable parts
  percpu: introduce arch_raw_cpu_ptr()
  ...
2014-08-04 10:09:27 -07:00
Eric W. Biederman 344470cac4 proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
In oddball cases where the thread has a different mount namespace than
the thread group leader or more likely in cases where the thread
remains and the thread group leader has exited this ensures that
/proc/mounts continues to work.

This should not cause any problems but if it does this patch can just
be reverted.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-08-04 10:07:15 -07:00
Eric W. Biederman e813244072 proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net
In oddball cases where the thread has a different network namespace
than the primary thread group leader or more likely in cases where
the thread remains and the thread group leader has exited this
ensures that /proc/net continues to work.

This should not cause any problems but if it does this patch can just
be reverted.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-08-04 10:07:13 -07:00
Eric W. Biederman 0097875bd4 proc: Implement /proc/thread-self to point at the directory of the current thread
/proc/thread-self is derived from /proc/self.  /proc/thread-self
points to the directory in proc containing information about the
current thread.

This funtionality has been missing for a long time, and is tricky to
implement in userspace as gettid() is not exported by glibc.  More
importantly this allows fixing defects in /proc/mounts and /proc/net
where in a threaded application today they wind up being empty files
when only the initial pthread has exited, causing problems for other
threads.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-08-04 10:07:11 -07:00
Eric W. Biederman 6ba8ed79a3 proc: Have net show up under /proc/<tgid>/task/<tid>
Network namespaces are per task so it make sense for them to show up
in the task directory.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-08-04 10:07:08 -07:00
Linus Torvalds 1bff598860 File locking related changes for v3.17 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT34bvAAoJEAAOaEEZVoIVAPIQAINMD2fqeF3g9ZHxyzsKWoUp
 f14ZKeF/6nbG4Bn+iIihzxz/Bs9qS+03oVeI4oAg1c9crT+qZ6+nLM4C1n5gfck0
 Z0DvF1ITFcr+Nv0D/GSIiI4NY8ZJLP5gZWCPYaO4xamwVs2Bh4/B4uxi7ETIkfXh
 uL6dN739D2fBDNZBbeRh4VJTGXbT6ipzkTIBFXkMfmqGtUxzeTfepN+IdhE5gVnx
 xXc8ZZOVNmWI7g/YAYKSMlLbufHHgX47U2sNTljtHII4GXf98DmiYulcJvhfQ1JP
 7xSmbIrvn9Gm2iGobzbfED/OjXA0rsdw1vSzTO/uHUYPRriMOwuDRGE+S3oP0dRD
 ZdxQa8iOZjWEsWbDTRekBBAIXWcTUN8g8EbPj74EN0GWi3HYFj/ORkowj5Ym6zWh
 Sv4w9SafNMOKy9tt4RVh4iwendU/pNLrRgvR407aM+UWkhwCpinlO6vSLHppUwlC
 dgxFZtkdeBf5tMkm8Tja+XAV2SjU8DwP4nFU1kHu25L0W7m7hmmIeu6Crq5qlL3J
 0NCPTO1LeGNP1WiOQf99nXoJVeL3//CfD+H4LjIMcGCc4P7gJc346rH91Zd6rXY/
 kGomnkBMw+5WLvfOJ1NhuaEy3g8Wfk84QzlsmWgTEzl5qT+SEjPLcDHWqWJ2GTvB
 gFUPWHenVMcrFN/n0CWg
 =hvV8
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.17-1' of git://git.samba.org/jlayton/linux

Pull file locking related changes from Jeff Layton:
 "Just a couple of changes from Christoph to start us down the road
  toward getting rid of the fl_owner_t typedef"

* tag 'locks-v3.17-1' of git://git.samba.org/jlayton/linux:
  locks: purge fl_owner_t from fs/locks.c
  locks: typedef fl_owner_t to void *
2014-08-04 10:03:10 -07:00
Eric W. Biederman 65b38851a1 NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes
The usage of pid_ns->child_reaper->nsproxy->net_ns in
nfs_server_list_open and nfs_client_list_open is not safe.

/proc for a pid namespace can remain mounted after the all of the
process in that pid namespace have exited.  There are also times
before the initial process in a pid namespace has started or after the
initial process in a pid namespace has exited where
pid_ns->child_reaper can be NULL or stale.  Making the idiom
pid_ns->child_reaper->nsproxy a double whammy of problems.

Luckily all that needs to happen is to move /proc/fs/nfsfs/servers and
/proc/fs/nfsfs/volumes under /proc/net to /proc/net/nfsfs/servers and
/proc/net/nfsfs/volumes and add a symlink from the original location,
and to use seq_open_net as it has been designed.

Cc: stable@vger.kernel.org
Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-08-04 09:28:32 -07:00
NeilBrown 50d77739fa NFS: fix two problems in lookup_revalidate in RCU-walk
1/ rcu_dereference isn't correct: that field isn't
   RCU protected.   It could potentially change at any time
   so ACCESS_ONCE might be justified.

   changes to ->d_parent are protected by ->d_seq.  However
   that isn't always checked after ->d_revalidate is called,
   so it is safest to keep the double-check that ->d_parent
   hasn't changed at the end of these functions.

2/ in nfs4_lookup_revalidate, "->d_parent" was forgotten.
   So 'parent' was not the parent of 'dentry'.
   This fails safe is the context is that dentry->d_inode is
   NULL, and the result of parent->d_inode being NULL is
   that ECHILD is returned, which is always safe.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-04 09:22:08 -04:00
Dave Chinner 645f985721 Merge branch 'xfs-misc-fixes-3.17-2' into for-next 2014-08-04 13:55:27 +10:00
Dave Chinner b076d8720d Merge branch 'xfs-bulkstat-refactor' into for-next 2014-08-04 13:54:46 +10:00
Dave Chinner 4d7eece2c0 Merge branch 'xfs-misc-fixes-3.17-1' into for-next 2014-08-04 13:54:14 +10:00
Dave Chinner e0ac6d45bc Merge branch 'xfs-quota-eofblocks-scan' into for-next 2014-08-04 13:53:47 +10:00
kbuild test robot 6eee8972cc xfs: fix coccinelle warnings
Removes unneeded semicolon, introduced by commit a70a4fa5 ("xfs: fix
a couple error sequence jumps in xfs_mountfs"):

fs/xfs/xfs_mount.c:858:24-25: Unneeded semicolon

Generated by: scripts/coccinelle/misc/semicolon.cocci

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:49:40 +10:00
Dave Chinner 4ef897a275 xfs: flush both inodes in xfs_swap_extents
We need to treat both inodes identically from a page cache point of
view when prepareing them for extent swapping. We don't do this
right now - we assume that one of the inodes empty, because that's
what xfs_fsr currently does. Remove this assumption from the code.

While factoring out the flushing and related checks, move the
transactions reservation to immeidately after the flushes so that we
don't need to pick up and then drop the ilock to do the transaction
reservation. There are no issues with aborting the transaction it if
the checks fail before we join the inodes to the transaction and
dirty them, so this is a safe change to make.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:44:08 +10:00
Dave Chinner 8121768321 xfs: fix swapext ilock deadlock
xfs_swap_extents() holds the ilock over a call to
filemap_write_and_wait(), which can then try to write data and take
the ilock. That causes a self-deadlock.

Fix the deadlock and clean up the code by separating the locking
appropriately. Add a lockflags variable to track what locks we are
holding as we gain and drop them and cleanup the error handling to
always use "out_unlock" with the lockflags variable.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:29:32 +10:00
Dave Chinner b92cc59f69 xfs: kill xfs_vnode.h
Move the IO flag definitions to xfs_inode.h and kill the header file
as it is now empty.

Removing the xfs_vnode.h file showed up an implicit header include
path:
	xfs_linux.h -> xfs_vnode.h -> xfs_fs.h

And so every xfs header file has been inplicitly been including
xfs_fs.h where it is needed or not. Hence the removal of xfs_vnode.h
causes all sorts of build issues because BBTOB() and friends are no
longer automatically included in the build. This also gets fixed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:28:20 +10:00
Dave Chinner dd8c38bab0 xfs: kill VN_MAPPED
Only one user, no longer needed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:23:35 +10:00
Dave Chinner 2667c6f935 xfs: kill VN_CACHED
Only has 2 users, has outlived it's usefulness.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:23:15 +10:00
Dave Chinner eac152b474 xfs: kill VN_DIRTY()
Only one user of the macro and the dirty mapping check is redundant
so just get rid of it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 13:22:49 +10:00
Dave Chinner ad3714b82c xfs: dquot recovery needs verifiers
dquot recovery should add verifiers to the dquot buffers that it
recovers changes into. Unfortunately, it doesn't attached the
verifiers to the buffers in a consistent manner. For example,
xlog_recover_dquot_pass2() reads dquot buffers without a verifier
and then writes it without ever having attached a verifier to the
buffer.

Further, dquot buffer recovery may write a dquot buffer that has not
been modified, or indeed, shoul dbe written because quotas are not
enabled and hence changes to the buffer were not replayed. In this
case, we again write buffers without verifiers attached because that
doesn't happen until after the buffer changes have been replayed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 12:59:31 +10:00
Dave Chinner 5fd364fee8 xfs: quotacheck leaves dquot buffers without verifiers
When running xfs/305, I noticed that quotacheck was flushing dquot
buffers that did not have the xfs_dquot_buf_ops verifiers attached:

XFS (vdb): _xfs_buf_ioapply: no ops on block 0x1dc8/0x1dc8
ffff880052489000: 44 51 01 04 00 00 65 b8 00 00 00 00 00 00 00 00  DQ....e.........
ffff880052489010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
ffff880052489020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
ffff880052489030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
CPU: 1 PID: 2376 Comm: mount Not tainted 3.16.0-rc2-dgc+ #306
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff88006fe38000 ffff88004a0ffae8 ffffffff81cf1cca 0000000000000001
 ffff88004a0ffb88 ffffffff814d50ca 000010004a0ffc70 0000000000000000
 ffff88006be56dc4 0000000000000021 0000000000001dc8 ffff88007c773d80
Call Trace:
 [<ffffffff81cf1cca>] dump_stack+0x45/0x56
 [<ffffffff814d50ca>] _xfs_buf_ioapply+0x3ca/0x3d0
 [<ffffffff810db520>] ? wake_up_state+0x20/0x20
 [<ffffffff814d51f5>] ? xfs_bdstrat_cb+0x55/0xb0
 [<ffffffff814d513b>] xfs_buf_iorequest+0x6b/0xd0
 [<ffffffff814d51f5>] xfs_bdstrat_cb+0x55/0xb0
 [<ffffffff814d53ab>] __xfs_buf_delwri_submit+0x15b/0x220
 [<ffffffff814d6040>] ? xfs_buf_delwri_submit+0x30/0x90
 [<ffffffff814d6040>] xfs_buf_delwri_submit+0x30/0x90
 [<ffffffff8150f89d>] xfs_qm_quotacheck+0x17d/0x3c0
 [<ffffffff81510591>] xfs_qm_mount_quotas+0x151/0x1e0
 [<ffffffff814ed01c>] xfs_mountfs+0x56c/0x7d0
 [<ffffffff814f0f12>] xfs_fs_fill_super+0x2c2/0x340
 [<ffffffff811c9fe4>] mount_bdev+0x194/0x1d0
 [<ffffffff814f0c50>] ? xfs_finish_flags+0x170/0x170
 [<ffffffff814ef0f5>] xfs_fs_mount+0x15/0x20
 [<ffffffff811ca8c9>] mount_fs+0x39/0x1b0
 [<ffffffff811e4d67>] vfs_kern_mount+0x67/0x120
 [<ffffffff811e757e>] do_mount+0x23e/0xad0
 [<ffffffff8117abde>] ? __get_free_pages+0xe/0x50
 [<ffffffff811e71e6>] ? copy_mount_options+0x36/0x150
 [<ffffffff811e8103>] SyS_mount+0x83/0xc0
 [<ffffffff81cfd40b>] tracesys+0xdd/0xe2

This was caused by dquot buffer readahead not attaching a verifier
structure to the buffer when readahead was issued, resulting in the
followup read of the buffer finding a valid buffer and so not
attaching new verifiers to the buffer as part of the read.

Also, when a verifier failure occurs, we then read the buffer
without verifiers. Attach the verifiers manually after this read so
that if the buffer is then written it will be verified that the
corruption has been repaired.

Further, when flushing a dquot we don't ask for a verifier when
reading in the dquot buffer the dquot belongs to. Most of the time
this isn't an issue because the buffer is still cached, but when it
is not cached it will result in writing the dquot buffer without
having the verfier attached.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 12:43:26 +10:00
Dave Chinner 67dc288c21 xfs: ensure verifiers are attached to recovered buffers
Crash testing of CRC enabled filesystems has resulted in a number of
reports of bad CRCs being detected after the filesystem was mounted.
Errors such as the following were being seen:

XFS (sdb3): Mounting V5 Filesystem
XFS (sdb3): Starting recovery (logdev: internal)
XFS (sdb3): Metadata CRC error detected at xfs_agf_read_verify+0x5a/0x100 [xfs], block 0x1
XFS (sdb3): Unmount and run xfs_repair
XFS (sdb3): First 64 bytes of corrupted metadata buffer:
ffff880136ffd600: 58 41 47 46 00 00 00 01 00 00 00 00 00 0f aa 40  XAGF...........@
ffff880136ffd610: 00 02 6d 53 00 02 77 f8 00 00 00 00 00 00 00 01  ..mS..w.........
ffff880136ffd620: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 03  ................
ffff880136ffd630: 00 00 00 04 00 08 81 d0 00 08 81 a7 00 00 00 00  ................
XFS (sdb3): metadata I/O error: block 0x1 ("xfs_trans_read_buf_map") error 74 numblks 1

The errors were typically being seen in AGF, AGI and their related
btree block buffers some time after log recovery had run. Often it
wasn't until later subsequent mounts that the problem was
discovered. The common symptom was a buffer with the correct
contents, but a CRC and an LSN that matched an older version of the
contents.

Some debug added to _xfs_buf_ioapply() indicated that buffers were
being written without verifiers attached to them from log recovery,
and Jan Kara isolated the cause to log recovery readahead an dit's
interactions with buffers that had a more recent LSN on disk than
the transaction being recovered. In this case, the buffer did not
get a verifier attached, and os when the second phase of log
recovery ran and recovered EFIs and unlinked inodes, the buffers
were modified and written without the verifier running. Hence they
had up to date contents, but stale LSNs and CRCs.

Fix it by attaching verifiers to buffers we skip due to future LSN
values so they don't escape into the buffer cache without the
correct verifier attached.

This patch is based on analysis and a patch from Jan Kara.

cc: <stable@vger.kernel.org>
Reported-by: Jan Kara <jack@suse.cz>
Reported-by: Fanael Linithien <fanael4@gmail.com>
Reported-by: Grozdan <neutrino8@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 12:43:06 +10:00
Dave Chinner 400b9d8875 xfs: catch buffers written without verifiers attached
We recently had a bug where buffers were slipping through log
recovery without any verifier attached to them. This was resulting
in on-disk CRC mismatches for valid data. Add some warning code to
catch this occurrence so that we catch such bugs during development
rather than not being aware they exist.

Note that we cannot do this verification unconditionally as non-CRC
filesystems don't always attach verifiers to the buffers being
written. e.g. during log recovery we cannot identify all the
different types of buffers correctly on non-CRC filesystems, so we
can't attach the correct verifiers in all cases and so we don't
attach any. Hence we don't want on non-CRC filesystems to avoid
spamming the logs with false indications.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 12:42:40 +10:00
Eric Sandeen 5ef828c415 xfs: avoid false quotacheck after unclean shutdown
The commit

83e782e xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD

added a new function xfs_sb_quota_from_disk() which swaps
on-disk XFS_OQUOTA_* flags for in-core XFS_GQUOTA_* and XFS_PQUOTA_*
flags after the superblock is read.

However, if log recovery is required, the superblock is read again,
and the modified in-core flags are re-read from disk, so we have
XFS_OQUOTA_* flags in memory again.  This causes the
XFS_QM_NEED_QUOTACHECK() test to be true, because the XFS_OQUOTA_CHKD
is still set, and not XFS_GQUOTA_CHKD or XFS_PQUOTA_CHKD.

Change xfs_sb_from_disk to call xfs_sb_quota_from disk and always
convert the disk flags to in-memory flags.

Add a lower-level function which can be called with "false" to
not convert the flags, so that the sb verifier can verify
exactly what was on disk, per Brian Foster's suggestion.

Reported-by: Cyril B. <cbay@excellency.fr>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
2014-08-04 11:35:44 +10:00
Brian Foster eedf32bfca xfs: fix rounding error of fiemap length parameter
The offset and length parameters are converted from bytes to basic
blocks by xfs_vn_fiemap(). The BTOBB() converter rounds the value up to
the nearest basic block. This leads to unexpected behavior when
unaligned offsets are provided to FIEMAP.

Fix the conversions of byte values to block values to cover the provided
offsets. Round down the start offset to the nearest basic block.
Calculate the end offset based on the provided values, round up and
calculate length based on the start block offset.

Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 11:35:35 +10:00
Jie Liu 1e773c4989 xfs: introduce xfs_bulkstat_ag_ichunk
Introduce xfs_bulkstat_ag_ichunk() to process inodes in chunk with a
pointer to a formatter function that will iget the inode and fill in
the appropriate structure.

Refactor xfs_bulkstat() with it.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-08-04 11:22:31 +10:00
NeilBrown f682a398b2 NFS: allow lockless access to access_cache
The access cache is used during RCU-walk path lookups, so it is best
to avoid locking if possible as taking a lock kills concurrency.

The rbtree is not rcu-safe and cannot easily be made so.
Instead we simply check the last (i.e. most recent) entry on the LRU
list.  If this doesn't match, then we return -ECHILD and retry in
lock/refcount mode.

This requires freeing the nfs_access_entry struct with rcu, and
requires using rcu access primatives when adding entries to the lru, and
when examining the last entry.

Calling put_rpccred before kfree_rcu looks a bit odd, but as
put_rpccred already provides rcu protection, we know that the cred will
not actually be freed until the next grace period, so any concurrent
access will be safe.

This patch provides about 5% performance improvement on a stat-heavy
synthetic work load with 4 threads on a 2-core CPU.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:13 -04:00
NeilBrown 1fa1e38447 NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU
It fails with -ECHILD rather than make an RPC call.

This allows nfs_lookup_revalidate to call it in RCU-walk mode.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:13 -04:00
NeilBrown 912a108da7 NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU
This requires nfs_check_verifier to take an rcu_walk flag, and requires
an rcu version of nfs_revalidate_inode which returns -ECHILD rather
than making an RPC call.

With this, nfs_lookup_revalidate can call nfs_neg_need_reval in
RCU-walk mode.

We can also move the LOOKUP_RCU check past the nfs_check_verifier()
call in nfs_lookup_revalidate.

If RCU_WALK prevents nfs_check_verifier or nfs_neg_need_reval from
doing a full check, they return a status indicating that a revalidation
is required.  As this revalidation will not be possible in RCU_WALK
mode, -ECHILD will ultimately be returned, which is the desired result.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:12 -04:00
NeilBrown f3324a2a94 NFS: support RCU_WALK in nfs_permission()
nfs_permission makes two calls which are not always safe in RCU_WALK,
rpc_lookup_cred and nfs_do_access.

The second can easily be made rcu-safe by aborting with -ECHILD before
making the RPC call.

The former can be made rcu-safe by calling rpc_lookup_cred_nonblock()
instead.
As this will almost always succeed, we use it even when RCU_WALK
isn't being used as it still saves some spinlocks in a common case.
We only fall back to rpc_lookup_cred() if rpc_lookup_cred_nonblock()
fails and MAY_NOT_BLOCK isn't set.

This optimisation (always trying rpc_lookup_cred_nonblock()) is
particularly important when a security module is active.
In that case inode_permission() may return -ECHILD from
security_inode_permission() even though ->permission() succeeded in
RCU_WALK mode.
This leads to may_lookup() retrying inode_permission after performing
unlazy_walk().  The spinlock that rpc_lookup_cred() takes is often
more expensive than anything security_inode_permission() does, so that
spinlock becomes the main bottleneck.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:12 -04:00
NeilBrown d51ac1a8e9 NFS: prepare for RCU-walk support but pushing tests later in code.
nfs_lookup_revalidate, nfs4_lookup_revalidate, and nfs_permission
all need to understand and handle RCU-walk for NFS to gain the
benefits of RCU-walk for cached information.

Currently these functions all immediately return -ECHILD
if the relevant flag (LOOKUP_RCU or MAY_NOT_BLOCK) is set.

This patch pushes those tests later in the code so that we only abort
immediately before we enter rcu-unsafe code.  As subsequent patches
make that rcu-unsafe code rcu-safe, several of these new tests will
disappear.

With this patch there are several paths through the code which will no
longer return -ECHILD during an RCU-walk.  However these are mostly
error paths or other uninteresting cases.

A noteworthy change in nfs_lookup_revalidate is that we don't take
(or put) the reference to ->d_parent when LOOKUP_RCU is set.
Rather we rcu_dereference ->d_parent, and check that ->d_inode
is not NULL.  We also check that ->d_parent hasn't changed after
all the tests.

In nfs4_lookup_revalidate we simply avoid testing LOOKUP_RCU on the
path that only calls nfs_lookup_revalidate() as that function
already performs the required test.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:11 -04:00
NeilBrown 49317a7fda NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used.
nfs4_lookup_revalidate only uses 'parent' to get 'dir', and only
uses 'dir' if 'inode == NULL'.

So we don't need to find out what 'parent' or 'dir' is until we
know that 'inode' is NULL.

By moving 'dget_parent' inside the 'if', we can reduce the number of
call sites for 'dput(parent)'.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:11 -04:00
Alexey Khoroshilov 1f70ef96b1 NFS: add checks for returned value of try_module_get()
There is a couple of places in client code where returned value
of try_module_get() is ignored. As a result there is a small chance
to premature unload module because of unbalanced refcounting.

The patch adds error handling in that places.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:14:10 -04:00
Weston Andros Adamson 411a99adff nfs: clear_request_commit while holding i_lock
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:26 -04:00
Weston Andros Adamson e6cf82d183 pnfs: add pnfs_put_lseg_async
This is useful when lsegs need to be released while holding locks.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:25 -04:00
Weston Andros Adamson 02d1426c70 pnfs: find swapped pages on pnfs commit lists too
nfs_page_find_head_request_locked looks through the regular nfs commit lists
when the page is swapped out, but doesn't look through the pnfs commit lists.

I'm not sure if anyone has hit any issues caused by this.

Suggested-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:25 -04:00
Weston Andros Adamson b412ddf066 nfs: fix comment and add warn_on for PG_INODE_REF
Fix the comment in nfs_page.h for PG_INODE_REF to reflect that it's no longer
set only on head requests. Also add a WARN_ON_ONCE in nfs_inode_remove_request
as PG_INODE_REF should always be set.

Suggested-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:25 -04:00
Weston Andros Adamson e7029206ff nfs: check wait_on_bit_lock err in page_group_lock
Return errors from wait_on_bit_lock from nfs_page_group_lock.

Add a bool argument @wait to nfs_page_group_lock. If true, loop over
wait_on_bit_lock until it returns cleanly. If false, return the error
from wait_on_bit_lock.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:24 -04:00
NeilBrown 4fa2c54b51 NFS: nfs4_do_open should add negative results to the dcache.
If you have an NFSv4 mounted directory which does not container 'foo'
and:

  ls -l foo
  ssh $server touch foo
  cat foo

then the 'cat' will fail (usually, depending a bit on the various
cache ages).  This is correct as negative looks are cached by default.
However with the same initial conditions:

  cat foo
  ssh $server touch foo
  cat foo

will usually succeed.  This is because an "open" does not add a
negative dentry to the dcache, while a "lookup" does.

This can have negative performance effects.  When "gcc" searches for
an include file, it will try to "open" the file in every director in
the search path.  Without caching of negative "open" results, this
generates much more traffic to the server than it should (or than
NFSv3 does).

The root of the problem is that _nfs4_open_and_get_state() will call
d_add_unique() on a positive result, but not on a negative result.
Compare with nfs_lookup() which calls d_materialise_unique on both
a positive result and on ENOENT.

This patch adds a call d_add() in the ENOENT case for
_nfs4_open_and_get_state() and also calls nfs_set_verifier().

With it, many fewer "open" requests for known-non-existent files are
sent to the server.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:22 -04:00
Andrey Utkin 7a9e75a185 nfs3_list_one_acl(): check get_acl() result with IS_ERR_OR_NULL
There was a check for result being not NULL. But get_acl() may return
NULL, or ERR_PTR, or actual pointer.
The purpose of the function where current change is done is to "list
ACLs only when they are available", so any error condition of get_acl()
mustn't be elevated, and returning 0 there is still valid.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81111
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 74adf83f5d (nfs: only show Posix ACLs in listxattr if actually...)
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:05:22 -04:00
Trond Myklebust 9806755c56 Merge branch 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma into linux-next
* 'nfs-rdma' of git://git.linux-nfs.org/projects/anna/nfs-rdma: (916 commits)
  xprtrdma: Handle additional connection events
  xprtrdma: Remove RPCRDMA_PERSISTENT_REGISTRATION macro
  xprtrdma: Make rpcrdma_ep_disconnect() return void
  xprtrdma: Schedule reply tasklet once per upcall
  xprtrdma: Allocate each struct rpcrdma_mw separately
  xprtrdma: Rename frmr_wr
  xprtrdma: Disable completions for LOCAL_INV Work Requests
  xprtrdma: Disable completions for FAST_REG_MR Work Requests
  xprtrdma: Don't post a LOCAL_INV in rpcrdma_register_frmr_external()
  xprtrdma: Reset FRMRs after a flushed LOCAL_INV Work Request
  xprtrdma: Reset FRMRs when FAST_REG_MR is flushed by a disconnect
  xprtrdma: Properly handle exhaustion of the rb_mws list
  xprtrdma: Chain together all MWs in same buffer pool
  xprtrdma: Back off rkey when FAST_REG_MR fails
  xprtrdma: Unclutter struct rpcrdma_mr_seg
  xprtrdma: Don't invalidate FRMRs if registration fails
  xprtrdma: On disconnect, don't ignore pending CQEs
  xprtrdma: Update rkeys after transport reconnect
  xprtrdma: Limit data payload size for ALLPHYSICAL
  xprtrdma: Protect ia->ri_id when unmapping/invalidating MRs
  ...
2014-08-03 17:04:51 -04:00
Trond Myklebust 3a505845cd NFS: Enforce an upper limit on the number of cached access call
This may be used to limit the number of cached credentials building up
inside the access cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-08-03 17:03:22 -04:00
Steve French 59b04c5df7 [CIFS] Fix incorrect hex vs. decimal in some debug print statements
Joe Perches and Hans Wennborg noticed that various places in the
kernel were printing decimal numbers with 0x prefix.
    printk("0x%d") or equivalent
This fixes the instances of this in the cifs driver.

CC: Hans Wennborg <hans@hanshq.net>
CC: Joe Perches <joe@perches.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 21:16:48 -05:00
Chao Yu 70cfed88ef f2fs: avoid skipping recover_inline_xattr after recover_inline_data
When we recover data of inode in roll-forward procedure, and the inode has both
inline data and inline xattr. We may skip recovering inline xattr if we recover
inline data form node page first.
This patch will fix the problem that we lost inline xattr data in above
scenario.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-02 07:43:51 -07:00
Chao Yu 70407fad85 f2fs: add tracepoint for f2fs_direct_IO
This patch adds a tracepoint for f2fs_direct_IO.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-08-02 07:34:46 -07:00
Steve French 81691503b2 Update cifs version
to 2.04

Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky 21496687a7 CIFS: Fix STATUS_CANNOT_DELETE error mapping for SMB2
The existing mapping causes unlink() call to return error after delete
operation. Changing the mapping to -EACCES makes the client process
the call like CIFS protocol does - reset dos attributes with ATTR_READONLY
flag masked off and retry the operation.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky b770ddfa26 CIFS: Optimize readpages in a short read case on reconnects
by marking pages with a data from a partially received response up-to-date.
This is suitable for non-signed connections.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky d913ed17f0 CIFS: Optimize cifs_user_read() in a short read case on reconnects
by filling the output buffer with a data got from a partially received
response and requesting the remaining data from the server. This is
suitable for non-signed connections.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky fb8a3e5255 CIFS: Improve indentation in cifs_user_read()
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky 2e8a05d802 CIFS: Fix possible buffer corruption in cifs_user_read()
If there was a short read in the middle of the rdata list,
we can end up with a corrupt output buffer.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky b3160aebb4 CIFS: Count got bytes in read_into_pages()
that let us know how many bytes we have already got before reconnect.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky 34a54d6177 CIFS: Use separate var for the number of bytes got in async read
and don't mix it with the number of bytes that was requested.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:04 -05:00
Pavel Shilovsky 3fabaa2746 CIFS: Indicate reconnect with ECONNABORTED error code
that let us not mix it with EAGAIN.

Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky bed9da0213 CIFS: Use multicredits for SMB 2.1/3 reads
If we negotiate SMB 2.1 and higher version of the protocol and
a server supports large read buffer size, we need to consume 1
credit per 65536 bytes. So, we need to know how many credits
we have and obtain the required number of them before constructing
a readdata structure in readpages and user read.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky e374d90f8a CIFS: Fix rsize usage for sync read
If a server changes maximum buffer size for read requests (rsize)
on reconnect we can fail on repeating with a big size buffer on
-EAGAIN error in cifs_read. Fix this by checking rsize all the
time before repeating requests.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky 25f402598d CIFS: Fix rsize usage in user read
If a server changes maximum buffer size for read (rsize) requests
on reconnect we can fail on repeating with a big size buffer on
-EAGAIN error in user read. Fix this by checking rsize all the
time before repeating requests.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky 0ada36b244 CIFS: Separate page reading from user read
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky 69cebd7560 CIFS: Fix rsize usage in readpages
If a server changes maximum buffer size for read (rsize) requests
on reconnect we can fail on repeating with a big size buffer on
-EAGAIN error in readpages. Fix this by checking rsize all the
time before repeating requests.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky 387eb92ac6 CIFS: Separate page search from readpages
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky cb7e9eabb2 CIFS: Use multicredits for SMB 2.1/3 writes
If we negotiate SMB 2.1 and higher version of the protocol and
a server supports large write buffer size, we need to consume 1
credit per 65536 bytes. So, we need to know how many credits
we have and obtain the required number of them before constructing
a writedata structure in writepages and iovec write.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:03 -05:00
Pavel Shilovsky 6ec0b01b26 CIFS: Fix wsize usage in iovec write
If a server change maximum buffer size for write (wsize) requests
on reconnect we can fail on repeating with a big size buffer on
-EAGAIN error in iovec write. Fix this by checking wsize all the
time before repeating request in iovec write.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 43de94eadf CIFS: Separate writing from iovec write
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 66386c08be CIFS: Separate filling pages from iovec write
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 7f6c50086a CIFS: Fix cifs_writev_requeue when wsize changes
If wsize changes on reconnect we need to use new writedata structure
that for retrying.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 66231a4796 CIFS: Fix wsize usage in writepages
If a server change maximum buffer size for write (wsize) requests
on reconnect we can fail on repeating with a big size buffer on
-EAGAIN error in writepages. Fix this by checking wsize all the
time before repeating request in writepages.

Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 90ac1387c2 CIFS: Separate pages initialization from writepages
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Pavel Shilovsky 619aa48edb CIFS: Separate page sending from writepages
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:02 -05:00
Steve French 27924075b5 Remove sparse build warning
The recent session setup patch set
(cifs-Separate-rawntlmssp-auth-from-CIFS_SessSetup.patch)
had introduced a trivial sparse build warning.

Signed-off-by: Steve French <smfrench@gmail.com>
Cc: Sachin Prabhu <sprabhu@redhat.com>
2014-08-02 01:23:01 -05:00
Pavel Shilovsky 7e48ff8202 CIFS: Separate page processing from writepages
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:01 -05:00
Pavel Shilovsky 038bc961c3 CIFS: Fix async reading on reconnects
If we get into read_into_pages() from cifs_readv_receive() and then
loose a network, we issue cifs_reconnect that moves all mids to
a private list and issue their callbacks. The callback of the async
read request sets a mid to retry, frees it and wakes up a process
that waits on the rdata completion.

After the connection is established we return from read_into_pages()
with a short read, use the mid that was freed before and try to read
the remaining data from the a newly created socket. Both actions are
not what we want to do. In reconnect cases (-EAGAIN) we should not
mask off the error with a short read but should return the error
code instead.

Acked-by: Jeff Layton <jlayton@samba.org>
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-08-02 01:23:01 -05:00
Jeff Layton 7abea1e8e8 nfsd: don't destroy client if mark_client_expired_locked fails
If it fails, it means that the client is in use and so destroying it
would be bad. Currently, the client_mutex prevents this from happening
but once we remove it, we won't be able to do this.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:26 -04:00
Jeff Layton 97403d95e1 nfsd: move unhash_client_locked call into mark_client_expired_locked
All the callers except for the fault injection code call it directly
afterward, and in the fault injection case it won't hurt to do so
anyway.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:25 -04:00
Jeff Layton 217526e7ec nfsd: protect the close_lru list and oo_last_closed_stid with client_lock
Currently, it's protected by the client_mutex. Move it so that the list
and the fields in the openowner are protected by the client_lock.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:24 -04:00
Trond Myklebust 0a880a28f8 nfsd: Add lockdep assertions to document the nfs4_client/session locking
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:23 -04:00
Trond Myklebust 3e339f964b nfsd: Ensure lookup_clientid() takes client_lock
Ensure that the client lookup is done safely under the client_lock, so
we're not relying on the client_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:23 -04:00
Trond Myklebust 6b10ad193d nfsd: Protect nfsd4_destroy_clientid using client_lock
...instead of relying on the client_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:22 -04:00
Jeff Layton d20c11d86d nfsd: Protect session creation and client confirm using client_lock
In particular, we want to ensure that the move_to_confirmed() is
protected by the nn->client_lock spin lock, so that we can use that when
looking up the clientid etc. instead of relying on the client_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:21 -04:00
Trond Myklebust 3dbacee6e1 nfsd: Protect unconfirmed client creation using client_lock
...instead of relying on the client_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:20 -04:00
Trond Myklebust 5cc40fd7b6 nfsd: Move create_client() call outside the lock
For efficiency reasons, and because we want to use spin locks instead
of relying on the client_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:20 -04:00
Trond Myklebust 425510f5c8 nfsd: Don't require client_lock in free_client
The struct nfs_client is supposed to be invisible and unreferenced
before it gets here.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:19 -04:00
Trond Myklebust 4864af97e0 nfsd: Ensure that the laundromat unhashes the client before releasing locks
If we leave the client on the confirmed/unconfirmed tables, and leave
the sessions visible on the sessionid_hashtbl, then someone might
find them before we've had a chance to destroy them.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:18 -04:00
Trond Myklebust 4beb345b37 nfsd: Ensure struct nfs4_client is unhashed before we try to destroy it
When we remove the client_mutex protection, we will need to ensure
that it can't be found by other threads while we're destroying it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:17 -04:00
J. Bruce Fields 83e452fee8 nfsd4: fix out of date comment
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:28:16 -04:00
Kinglong Mee d9499a9571 NFSD: Decrease nfsd_users in nfsd_startup_generic fail
A memory allocation failure could cause nfsd_startup_generic to fail, in
which case nfsd_users wouldn't be incorrectly left elevated.

After nfsd restarts nfsd_startup_generic will then succeed without doing
anything--the first consequence is likely nfs4_start_net finding a bad
laundry_wq and crashing.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 4539f14981 "nfsd: replace boolean nfsd_up flag by users counter"
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-08-01 16:26:09 -04:00
Eric Biggers 6d2b6170c8 vfs: fix check for fallocate on active swapfile
Fix the broken check for calling sys_fallocate() on an active swapfile,
introduced by commit 0790b31b69 ("fs: disallow all fallocate
operation on active swapfile").

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-01 02:36:04 -04:00
Christoph Hellwig af43647277 direct-io: fix AIO regression
The direct-io.c rewrite to use the iov_iter infrastructure stopped updating
the size field in struct dio_submit, and thus rendered the check for
allowing asynchronous completions to always return false.  Fix this by
comparing it to the count of bytes in the iov_iter instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
2014-08-01 02:35:51 -04:00
Sachin Prabhu cc87c47d9d cifs: Separate rawntlmssp auth from CIFS_SessSetup()
Separate rawntlmssp authentication from CIFS_SessSetup(). Also cleanup
CIFS_SessSetup() since we no longer do any auth within it.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-07-31 23:11:15 -05:00
Sachin Prabhu ee03c646dd cifs: Split Kerberos authentication off CIFS_SessSetup()
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-07-31 23:11:15 -05:00
Sachin Prabhu 583cf7afc7 cifs: Split ntlm and ntlmv2 authentication methods off CIFS_SessSetup()
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-07-31 23:11:15 -05:00
Sachin Prabhu 80a0e63751 cifs: Split lanman auth from CIFS_SessSetup()
In preparation for splitting CIFS_SessSetup() into smaller more
manageable chunks, we first add helper functions.

We then proceed to split out lanman auth out of CIFS_SessSetup()

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-07-31 23:11:15 -05:00
Sachin Prabhu 6d81ed1ec2 cifs: replace code with free_rsp_buf()
The functionality provided by free_rsp_buf() is duplicated in a number
of places. Replace these instances with a call to free_rsp_buf().

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-07-31 23:11:15 -05:00
Eric W. Biederman ffbc6f0ead mnt: Change the default remount atime from relatime to the existing value
Since March 2009 the kernel has treated the state that if no
MS_..ATIME flags are passed then the kernel defaults to relatime.

Defaulting to relatime instead of the existing atime state during a
remount is silly, and causes problems in practice for people who don't
specify any MS_...ATIME flags and to get the default filesystem atime
setting.  Those users may encounter a permission error because the
default atime setting does not work.

A default that does not work and causes permission problems is
ridiculous, so preserve the existing value to have a default
atime setting that is always guaranteed to work.

Using the default atime setting in this way is particularly
interesting for applications built to run in restricted userspace
environments without /proc mounted, as the existing atime mount
options of a filesystem can not be read from /proc/mounts.

In practice this fixes user space that uses the default atime
setting on remount that are broken by the permission checks
keeping less privileged users from changing more privileged users
atime settings.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:12:59 -07:00
Eric W. Biederman 9566d67428 mnt: Correct permission checks in do_remount
While invesgiating the issue where in "mount --bind -oremount,ro ..."
would result in later "mount --bind -oremount,rw" succeeding even if
the mount started off locked I realized that there are several
additional mount flags that should be locked and are not.

In particular MNT_NOSUID, MNT_NODEV, MNT_NOEXEC, and the atime
flags in addition to MNT_READONLY should all be locked.  These
flags are all per superblock, can all be changed with MS_BIND,
and should not be changable if set by a more privileged user.

The following additions to the current logic are added in this patch.
- nosuid may not be clearable by a less privileged user.
- nodev  may not be clearable by a less privielged user.
- noexec may not be clearable by a less privileged user.
- atime flags may not be changeable by a less privileged user.

The logic with atime is that always setting atime on access is a
global policy and backup software and auditing software could break if
atime bits are not updated (when they are configured to be updated),
and serious performance degradation could result (DOS attack) if atime
updates happen when they have been explicitly disabled.  Therefore an
unprivileged user should not be able to mess with the atime bits set
by a more privileged user.

The additional restrictions are implemented with the addition of
MNT_LOCK_NOSUID, MNT_LOCK_NODEV, MNT_LOCK_NOEXEC, and MNT_LOCK_ATIME
mnt flags.

Taken together these changes and the fixes for MNT_LOCK_READONLY
should make it safe for an unprivileged user to create a user
namespace and to call "mount --bind -o remount,... ..." without
the danger of mount flags being changed maliciously.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:12:34 -07:00
Eric W. Biederman 07b645589d mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount
There are no races as locked mount flags are guaranteed to never change.

Moving the test into do_remount makes it more visible, and ensures all
filesystem remounts pass the MNT_LOCK_READONLY permission check.  This
second case is not an issue today as filesystem remounts are guarded
by capable(CAP_DAC_ADMIN) and thus will always fail in less privileged
mount namespaces, but it could become an issue in the future.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:12:17 -07:00
Eric W. Biederman a6138db815 mnt: Only change user settable mount flags in remount
Kenton Varda <kenton@sandstorm.io> discovered that by remounting a
read-only bind mount read-only in a user namespace the
MNT_LOCK_READONLY bit would be cleared, allowing an unprivileged user
to the remount a read-only mount read-write.

Correct this by replacing the mask of mount flags to preserve
with a mask of mount flags that may be changed, and preserve
all others.   This ensures that any future bugs with this mask and
remount will fail in an easy to detect way where new mount flags
simply won't change.

Cc: stable@vger.kernel.org
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-31 17:11:54 -07:00
Jeff Layton 4ae098d327 nfsd: rename unhash_generic_stateid to unhash_ol_stateid
...to better match other functions that deal with open/lock stateids.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:31 -04:00
Jeff Layton d83017f94c nfsd: don't thrash the cl_lock while freeing an open stateid
When we remove the client_mutex, we'll have a potential race between
FREE_STATEID and CLOSE.

The root of the problem is that we are walking the st_locks list,
dropping the spinlock and then trying to release the persistent
reference to the lockstateid. In between, a FREE_STATEID call can come
along and take the lock, find the stateid and then try to put the
reference. That leads to a double put.

Fix this by not releasing the cl_lock in order to release each lock
stateid. Use put_generic_stateid_locked to unhash them and gather them
onto a list, and free_ol_stateid_reaplist to free any that end up on the
list.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:31 -04:00
Jeff Layton 2c41beb0e5 nfsd: reduce cl_lock thrashing in release_openowner
Releasing an openowner is a bit inefficient as it can potentially thrash
the cl_lock if you have a lot of stateids attached to it. Once we remove
the client_mutex, it'll also potentially be dangerous to do this.

Add some functions to make it easier to defer the part of putting a
generic stateid reference that needs to be done outside the cl_lock while
doing the parts that must be done while holding it under a single lock.

First we unhash each open stateid. Then we call
put_generic_stateid_locked which will put the reference to an
nfs4_ol_stateid. If it turns out to be the last reference, it'll go
ahead and remove the stid from the IDR tree and put it onto the reaplist
using the st_locks list_head.

Then, after dropping the lock we'll call free_ol_stateid_reaplist to
walk the list of stateids that are fully unhashed and ready to be freed,
and free each of them. This function can sleep, so it must be done
outside any spinlocks.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:30 -04:00
Jeff Layton fc5a96c3b7 nfsd: close potential race in nfsd4_free_stateid
Once we remove the client_mutex, it'll be possible for the sc_type of a
lock stateid to change after it's found and checked, but before we can
go to destroy it. If that happens, we can end up putting the persistent
reference to the stateid more than once, and unhash it more than once.

Fix this by unhashing the lock stateid prior to dropping the cl_lock but
after finding it.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:29 -04:00
Jeff Layton 3c1c995cc2 nfsd: optimize destroy_lockowner cl_lock thrashing
Reduce the cl_lock trashing in destroy_lockowner. Unhash all of the
lockstateids on the lockowner's list. Put the reference under the lock
and see if it was the last one. If so, then add it to a private list
to be destroyed after we drop the lock.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:28 -04:00
Jeff Layton a819ecc1bb nfsd: add locking to stateowner release
Once we remove the client_mutex, we'll need to properly protect
the stateowner reference counts using the cl_lock.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:27 -04:00
Jeff Layton 882e9d25e1 nfsd: clean up and reorganize release_lockowner
Do more within the main loop, and simplify the function a bit. Also,
there's no need to take a stateowner reference unless we're going to call
release_lockowner.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:27 -04:00
Trond Myklebust d4f0489f38 nfsd: Move the open owner hash table into struct nfs4_client
Preparation for removing the client_mutex.

Convert the open owner hash table into a per-client table and protect it
using the nfs4_client->cl_lock spin lock.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:26 -04:00
Trond Myklebust c58c6610ec nfsd: Protect adding/removing lock owners using client_lock
Once we remove client mutex protection, we'll need to ensure that
stateowner lookup and creation are atomic between concurrent compounds.
Ensure that alloc_init_lock_stateowner checks the hashtable under the
client_lock before adding a new element.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:25 -04:00
Trond Myklebust 7ffb588086 nfsd: Protect adding/removing open state owners using client_lock
Once we remove client mutex protection, we'll need to ensure that
stateowner lookup and creation are atomic between concurrent compounds.
Ensure that alloc_init_open_stateowner checks the hashtable under the
client_lock before adding a new element.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:24 -04:00
Jeff Layton b401be22b5 nfsd: don't allow CLOSE to proceed until refcount on stateid drops
Once we remove client_mutex protection, it'll be possible to have an
in-flight operation using an openstateid when a CLOSE call comes in.
If that happens, we can't just put the sc_file reference and clear its
pointer without risking an oops.

Fix this by ensuring that v4.0 CLOSE operations wait for the refcount
to drop before proceeding to do so.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:23 -04:00
Jeff Layton d3134b1049 nfsd: make openstateids hold references to their openowners
Change it so that only openstateids hold persistent references to
openowners. References can still be held by compounds in progress.

With this, we can get rid of NFS4_OO_NEW. It's possible that we
will create a new openowner in the process of doing the open, but
something later fails. In the meantime, another task could find
that openowner and start using it on a successful open. If that
occurs we don't necessarily want to tear it down, just put the
reference that the failing compound holds.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:23 -04:00
Jeff Layton 5adfd8850b nfsd: clean up refcounting for lockowners
Ensure that lockowner references are only held by lockstateids and
operations that are in-progress. With this, we can get rid of
release_lockowner_if_empty, which will be racy once we remove
client_mutex protection.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:22 -04:00
Trond Myklebust e4f1dd7fc2 nfsd: Make lock stateid take a reference to the lockowner
A necessary step toward client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:21 -04:00
Jeff Layton 8f4b54c53f nfsd: add an operation for unhashing a stateowner
Allow stateowners to be unhashed and destroyed when the last reference
is put. The unhashing must be idempotent. In a future patch, we'll add
some locking around it, but for now it's only protected by the
client_mutex.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:20 -04:00
Jeff Layton 5db1c03feb nfsd: clean up lockowner refcounting when finding them
Ensure that when finding or creating a lockowner, that we get a
reference to it. For now, we also take an extra reference when a
lockowner is created that can be put when release_lockowner is called,
but we'll remove that in a later patch once we change how references are
held.

Since we no longer destroy lockowners in the event of an error in
nfsd4_lock, we must change how the seqid gets bumped in the lk_is_new
case. Instead of doing so on creation, do it manually in nfsd4_lock.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:20 -04:00
Jeff Layton 58fb12e6a4 nfsd: Add a mutex to protect the NFSv4.0 open owner replay cache
We don't want to rely on the client_mutex for protection in the case of
NFSv4 open owners. Instead, we add a mutex that will only be taken for
NFSv4.0 state mutating operations, and that will be released once the
entire compound is done.

Also, ensure that nfsd4_cstate_assign_replay/nfsd4_cstate_clear_replay
take a reference to the stateowner when they are using it for NFSv4.0
open and lock replay caching.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:19 -04:00
Jeff Layton 6b180f0b57 nfsd: Add reference counting to state owners
The way stateowners are managed today is somewhat awkward. They need to
be explicitly destroyed, even though the stateids reference them. This
will be particularly problematic when we remove the client_mutex.

We may create a new stateowner and attempt to open a file or set a lock,
and have that fail. In the meantime, another RPC may come in that uses
that same stateowner and succeed. We can't have the first task tearing
down the stateowner in that situation.

To fix this, we need to change how stateowners are tracked altogether.
Refcount them and only destroy them once all stateids that reference
them have been destroyed. This patch starts by adding the refcounting
necessary to do that.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:18 -04:00
Trond Myklebust 2d3f96689f nfsd: Migrate the stateid reference into nfs4_find_stateid_by_type()
Allow nfs4_find_stateid_by_type to take the stateid reference, while
still holding the &cl->cl_lock. Necessary step toward client_mutex
removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:17 -04:00
Trond Myklebust fd9110113c nfsd: Migrate the stateid reference into nfs4_lookup_stateid()
Allow nfs4_lookup_stateid to take the stateid reference, instead
of having all the callers do so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:16 -04:00
Trond Myklebust 4cbfc9f704 nfsd: Migrate the stateid reference into nfs4_preprocess_seqid_op
Allow nfs4_preprocess_seqid_op to take the stateid reference, instead
of having all the callers do so.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:15 -04:00
Trond Myklebust 0667b1e9d8 nfsd: Add reference counting to nfs4_preprocess_confirmed_seqid_op
Ensure that all the callers put the open stateid after use.
Necessary step toward client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:15 -04:00
Trond Myklebust 2585fc7958 nfsd: nfsd4_open_confirm() must reference the open stateid
Ensure that nfsd4_open_confirm() keeps a reference to the open
stateid until it is done working with it.

Necessary step toward client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:14 -04:00
Trond Myklebust 8a0b589d8f nfsd: Prepare nfsd4_close() for open stateid referencing
Prepare nfsd4_close for a future where nfs4_preprocess_seqid_op()
hands it a fully referenced open stateid. Necessary step toward
client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:13 -04:00
Trond Myklebust d6f2bc5dcf nfsd: nfsd4_process_open2() must reference the open stateid
Ensure that nfsd4_process_open2() keeps a reference to the open
stateid until it is done working with it. Necessary step toward
client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:12 -04:00
Trond Myklebust dcd94cc2e7 nfsd: nfsd4_process_open2() must reference the delegation stateid
Ensure that nfsd4_process_open2() keeps a reference to the delegation
stateid until it is done working with it. Necessary step toward
client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:11 -04:00
Trond Myklebust 67cb1279be nfsd: Ensure that nfs4_open_delegation() references the delegation stateid
Ensure that nfs4_open_delegation() keeps a reference to the delegation
stateid until it is done working with it. Necessary step toward
client_mutex removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:11 -04:00
Trond Myklebust 858cc57336 nfsd: nfsd4_locku() must reference the lock stateid
Ensure that nfsd4_locku() keeps a reference to the lock stateid
until it is done working with it. Necessary step toward client_mutex
removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:10 -04:00
Trond Myklebust 3d0fabd5a4 nfsd: Add reference counting to lock stateids
Ensure that nfsd4_lock() references the lock stateid while it is
manipulating it. Not currently necessary, but will be once the
client_mutex is removed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:09 -04:00
Jeff Layton 1af71cc801 nfsd: ensure atomicity in nfsd4_free_stateid and nfsd4_validate_stateid
Hold the cl_lock over the bulk of these functions. In addition to
ensuring that they aren't freed prematurely, this will also help prevent
a potential race that could be introduced later. Once we remove the
client_mutex, it'll be possible for FREE_STATEID and CLOSE to race and
for both to try to put the "persistent" reference to the stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:08 -04:00
Jeff Layton 356a95ece7 nfsd: clean up races in lock stateid searching and creation
Preparation for removal of the client_mutex.

Currently, no lock aside from the client_mutex is held when calling
find_lock_state. Ensure that the cl_lock is held by adding a lockdep
assertion.

Once we remove the client_mutex, it'll be possible for another thread to
race in and insert a lock state for the same file after we search but
before we insert a new one. Ensure that doesn't happen by redoing the
search after allocating a new stid that we plan to insert. If one is
found just put the one that was allocated.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:07 -04:00
Jeff Layton 1c755dc1ad nfsd: Add locking to protect the state owner lists
Change to using the clp->cl_lock for this. For now, there's a lot of
cl_lock thrashing, but in later patches we'll eliminate that and close
the potential races that can occur when releasing the cl_lock while
walking the lists. For now, the client_mutex prevents those races.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:20:07 -04:00
Jeff Layton b49e084d8c nfsd: do filp_close in sc_free callback for lock stateids
Releasing locks when we unhash the stateid instead of doing so only when
the stateid is actually released will be problematic in later patches
when we need to protect the unhashing with spinlocks. Move it into the
sc_free operation instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:19:50 -04:00
Jeff Layton 4770d72201 nfsd4: use cl_lock to synchronize all stateid idr calls
Currently, this is serialized by the client_mutex, which is slated for
removal. Add finer-grained locking here. Also, do some cleanup around
find_stateid to prepare for taking references.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Benny Halevy <bhalevy@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 14:19:25 -04:00
Trond Myklebust 11b9164ada nfsd: Add a struct nfs4_file field to struct nfs4_stid
All stateids are associated with a nfs4_file. Let's consolidate.
Replace delegation->dl_file with the dl_stid.sc_file, and
nfs4_ol_stateid->st_file with st_stid.sc_file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 12:51:34 -04:00
Trond Myklebust 6011695da2 nfsd: Add reference counting to the lock and open stateids
When we remove the client_mutex, we'll need to be able to ensure that
these objects aren't destroyed while we're not holding locks.

Add a ->free() callback to the struct nfs4_stid, so that we can
release a reference to the stid without caring about the contents.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-31 12:43:53 -04:00
hujianyang 25601a3c97 UBIFS: Add log overlap assertions
We use a circle area to record the log nodes in ubifs. This log area
should not be overlapped. But after researching the code, I found
some conditions may lead log head wraps log ltail. Although we've
fixed the problems discovered, there may be some other issues still
left.

This patch adds assertions where lhead changes to next leb to make
sure ltail is not wrapped.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-31 15:52:51 +03:00
Chao Yu b3582c6892 f2fs: reduce competition among node page writes
We do not need to block on ->node_write among different node page writers e.g.
fsync/flush, unless we have a node page writer from write_checkpoint.
So it's better use rw_semaphore instead of mutex type for ->node_write to
promote performance.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 23:28:37 -07:00
Theodore Ts'o 86f0afd463 ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa struct
If there is a failure while allocating the preallocation structure, a
number of blocks can end up getting marked in the in-memory buddy
bitmap, and then not getting released.  This can result in the
following corruption getting reported by the kernel:

EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126,
12793 clusters in bitmap, 12729 in gd

In that case, we need to release the blocks using mb_free_blocks().

Tested: fs smoke test; also demonstrated that with injected errors,
	the file system is no longer getting corrupted

Google-Bug-Id: 16657874

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-30 22:17:17 -04:00
Jaegeuk Kim 65b85ccce0 f2fs: fix coding style
This patch fixes wrong coding style.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 17:25:54 -07:00
Dongho Sim 33be828ada f2fs: remove redundant lines in allocate_data_block
There are redundant lines in allocate_data_block.

In this function, we call refresh_sit_entry with old seg and old curseg.
After that, we call locate_dirty_segment with old curseg.

But, the new address is always allocated from old curseg and
we call locate_dirty_segment with old curseg in refresh_sit_entry.
So, we do not need to call locate_dirty_segment with old curseg again.

We've discussed like below:

Jaegeuk said:
 "When considering SSR, we need to take care of the following scenario.
  - old segno : X
  - new address : Z
  - old curseg : Y
  This means, a new block is supposed to be written to Z from X.
  And Z is newly allocated in the same path from Y.

  In that case, we should trigger locate_dirty_segment for Y, since
  it was a current_segment and can be dirty owing to SSR.
  But that was not included in the dirty list."

Changman said:
 "We already choosed old curseg(Y) and then we allocate new address(Z) from old
  curseg(Y). After that we call refresh_sit_entry(old address, new address).
  In the funcation, we call locate_dirty_segment with old seg and old curseg.
  So calling locate_dirty_segment after refresh_sit_entry again is redundant."

Jaegeuk said:
 "Right. The new address is always allocated from old_curseg."

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Dongho Sim <dh.sim@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 15:38:59 -07:00
Jaegeuk Kim 24a9ee0fa3 f2fs: add tracepoint for f2fs_issue_flush
This patch adds a tracepoint for f2fs_issue_flush.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:36 -07:00
Jaegeuk Kim cf2271e781 f2fs: avoid retrying wrong recovery routine when error was occurred
This patch eliminates the propagation of recovery errors to the next mount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:35 -07:00
Jaegeuk Kim 61e0f2d0a5 f2fs: test before set/clear bits
If the bit is already set, we don't need to reset it, and vice versa.
Because we don't need to make the caches dirty for that.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:35 -07:00
Jaegeuk Kim 01229f5e1b f2fs: fix wrong condition for unlikely
This patch fixes the wrongly used unlikely condition.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:31 -07:00
Jaegeuk Kim ea1aa12ca2 f2fs: enable in-place-update for fdatasync
This patch enforces in-place-updates only when fdatasync is requested.
If we adopt this in-place-updates for the fdatasync, we can skip to write the
recovery information.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:23 -07:00
Jaegeuk Kim 6d99ba41a7 f2fs: skip unnecessary data writes during fsync
This patch intends to improve the fsync performance by skipping remaining the
recovery information, only when there is no data that we should recover.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-30 14:13:03 -07:00
David S. Miller f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
Jeff Layton b3fbfe0e7a nfsd: print status when nfsd4_open fails to open file it just created
It's possible for nfsd to fail opening a file that it has just created.
When that happens, we throw a WARN but it doesn't include any info about
the error code. Print the status code to give us a bit more info.

Our QA group hit some of these warnings under some very heavy stress
testing. My suspicion is that they hit the file-max limit, but it's hard
to know for sure. Go ahead and add a -ENFILE mapping to
nfserr_serverfault to make the error more distinct (and correct).

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 23:08:38 -04:00
Eric W. Biederman 728dba3a39 namespaces: Use task_lock and not rcu to protect nsproxy
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.

Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare.  So optimize for same process reads and write
by switching using rask_lock instead.

This yields a simpler to understand lock, and a faster setns system call.

In particular this fixes a performance regression observed
by Rafael David Tinoco <rafael.tinoco@canonical.com>.

This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d Make access to task's nsproxy lighter
from 2007.  The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-07-29 18:08:50 -07:00
Christoph Hellwig d5cf09bace xfs: require 64-bit sector_t
Trying to support tiny disks only and saving a bit memory might have
made sense on an SGI O2 15 years ago, but is pretty pointless today.

Remove the rarely tested codepath that uses various smaller in-memory
types to reduce our test matrix and make the codebase a little bit
smaller and less complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-30 09:12:05 +10:00
Jeff Layton 650ecc8f8f nfsd: remove dl_fh field from struct nfs4_delegation
Now that the nfs4_file has a filehandle in it, we no longer need to
keep a per-delegation copy of it. Switch to using the one in the
nfs4_file instead.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:58 -04:00
Jeff Layton f54fe962b8 nfsd: give block_delegation and delegation_blocked its own spinlock
The state lock can be fairly heavily contended, and there's no reason
that nfs4_file lookups and delegation_blocked should be mutually
exclusive.  Let's give the new block_delegation code its own spinlock.
It does mean that we'll need to take a different lock in the delegation
break code, but that's not generally as critical to performance.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:57 -04:00
Jeff Layton 0b26693c56 nfsd: clean up nfs4_set_delegation
Move the alloc_init_deleg call into nfs4_set_delegation and change the
function to return a pointer to the delegation or an IS_ERR return. This
allows us to skip allocating a delegation if the file has already
experienced a lease conflict.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:56 -04:00
Jeff Layton 4cf59221c7 nfsd: clean up arguments to nfs4_open_delegation
No need to pass in a net pointer since we can derive that.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:55 -04:00
Jeff Layton f9416e281e nfsd: drop unused stp arg to alloc_init_deleg
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:54 -04:00
Trond Myklebust 02a3508dba nfsd: Convert delegation counter to an atomic_long_t type
We want to convert to an atomic type so that we don't need to lock
across the call to alloc_init_deleg(). Then convert to a long type so
that we match the size of 'max_delegations'.

None of this is a problem today, but it will be once we remove
client_mutex protection.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:54 -04:00
Jeff Layton 2d4a532d38 nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock
Currently, both destroy_revoked_delegation and revoke_delegation
manipulate the cl_revoked list without any locking aside from the
client_mutex. Ensure that the clp->cl_lock is held when manipulating it,
except for the list walking in destroy_client. At that point, the client
should no longer be in use, and so it should be safe to walk the list
without any locking. That also means that we don't need to do the
list_splice_init there either.

Also, the fact that revoke_delegation deletes dl_recall_lru list_head
without any locking makes it difficult to know whether it's doing so
safely in all cases. Move the list_del_init calls into the callers, and
add a WARN_ON in the event that t's passed a delegation that has a
non-empty list_head.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:53 -04:00
Jeff Layton 4269067696 nfsd: fully unhash delegations when revoking them
Ensure that the delegations cannot be found by the laundromat etc once
we add them to the various 'revoke' lists.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:52 -04:00
Trond Myklebust f83388341b nfsd: simplify stateid allocation and file handling
Don't allow stateids to clear the open file pointer until they are
being destroyed. In a later patches we'll want to rely on the fact that
we have a valid file pointer when dealing with the stateid and this
will save us from having to do a lot of NULL pointer checks before
doing so.

Also, move to allocating stateids with kzalloc and get rid of the
explicit zeroing of fields.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-29 14:49:51 -04:00
David Howells 0ef1351523 AFS: Correctly assemble the client UUID
Correctly assemble the client UUID by OR'ing in the flags rather than
assigning them over the other components.

Reported-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-29 10:14:36 -07:00
Jaegeuk Kim fff04f90c1 f2fs: add info of appended or updated data writes
This patch introduces a inode number list in which represents inodes having
appended data writes or updated data writes after last checkpoint.
This will be used at fsync to determine whether the recovery information
should be written or not.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-29 07:46:11 -07:00
Jaegeuk Kim 39efac41fb f2fs: use radix_tree for ino management
For better ino management, this patch replaces the data structure from list
to radix tree.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-29 07:46:11 -07:00
Jaegeuk Kim 6451e041c8 f2fs: add infra for ino management
This patch changes the naming of orphan-related data structures to use as
inode numbers managed globally.
Later, we can use this facility for managing any inode number lists.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-29 07:45:54 -07:00
Namjae Jeon ee98fa3a8b ext4: fix COLLAPSE RANGE test for bigalloc file systems
Blocks in collapse range should be collapsed per cluster unit when
bigalloc is enable. If bigalloc is not enable, EXT4_CLUSTER_SIZE will
be same with EXT4_BLOCK_SIZE.

With this bug fixed, patch enables COLLAPSE_RANGE for bigalloc, which
fixes a large number of xfstest failures which use fsx.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-29 10:45:31 -04:00
Jaegeuk Kim 953e6cc6bc f2fs: punch the core function for inode management
This patch punches out the core functions to manage the inode numbers.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-29 07:40:25 -07:00
Jaegeuk Kim 0f7b2abd18 f2fs: add nobarrier mount option
This patch adds a mount option, nobarrier, in f2fs.
The assumption in here is that file system keeps the IO ordering, but
doesn't care about cache flushes inside the storages.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-29 05:27:48 -07:00
David S. Miller 3fd0202a0d Merge tag 'master-2014-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
pull request: wireless-next 2014-07-25

Please pull this batch of updates intended for the 3.17 stream!

For the mac80211 bits, Johannes says:

"We have a lot of TDLS patches, among them a fix that should make hwsim
tests happy again. The rest, this time, is mostly small fixes."

For the Bluetooth bits, Gustavo says:

"Some more patches for 3.17. The most important change here is the move of
the 6lowpan code to net/6lowpan. It has been agreed with Davem that this
change will go through the bluetooth tree. The rest are mostly clean up and
fixes."

and,

"Here follows some more patches for 3.17. These are mostly fixes to what
we've sent to you before for next merge window."

For the iwlwifi bits, Emmanuel says:

"I have the usual amount of BT Coex stuff. Arik continues to work
on TDLS and Ariej contributes a few things for HS2.0. I added a few
more things to the firmware debugging infrastructure. Eran fixes a
small bug - pretty normal content."

And for the Atheros bits, Kalle says:

"For ath6kl me and Jessica added support for ar6004 hw3.0, our latest
version of ar6004.

For ath10k Janusz added a printout so that it's easier to check what
ath10k kconfig options are enabled. He also added a debugfs file to
configure maximum amsdu and ampdu values. Also we had few fixes as
usual."

On top of that is the usual large batch of various driver updates --
brcmfmac, mwifiex, the TI drivers, and wil6210 all get some action.
Rafał has also been very busy with b43 and related updates.

Also, I pulled the wireless tree into this in order to resolve a
merge conflict...

P.S.  The change to fs/compat_ioctl.c reflects a name change in a
Bluetooth header file...
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28 17:36:25 -07:00
Darrick J. Wong 40b163f1c4 ext4: check inline directory before converting
Before converting an inline directory to a regular directory, check
the directory entries to make sure they're not obviously broken.
This helps us to avoid a BUG_ON if one of the dirents is trashed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
2014-07-28 13:06:26 -04:00
Artem Bityutskiy 6390e99177 Revert "UBIFS: add a log overlap assertion"
This reverts commit 545f7fdf6d.

Hujianyang's testing revealed that the patch is bogus.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-28 19:15:19 +03:00
Yan, Zheng 06fee30f6a ceph: fix append mode write
generic_write_checks() may update 'pos', so we need to pass 'pos'
to ceph_sync_write() and ceph_sync_direct_write();

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-28 13:29:33 +04:00
Ilya Dryomov 7e8a295295 ceph: fix sizeof(struct tYpO *) typo
struct ceph_xattr -> struct ceph_inode_xattr

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-28 13:29:27 +04:00
Ilya Dryomov 1a295bd8c8 ceph: remove redundant memset(0)
xattrs array of pointers is allocated with kcalloc() - no need to
memset() it to 0 right after that.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-07-28 13:28:33 +04:00
Ingo Molnar ca5bc6cd5d Merge branch 'sched/urgent' into sched/core, to merge fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:03:00 +02:00
Dmitry Monakhov 6e2631463f ext4: fix incorrect locking in move_extent_per_page
If we have to copy data we must drop i_data_sem because of
get_blocks() will be called inside mext_page_mkuptodate(), but later we must
reacquire it again because we are about to change extent's tree

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:32:27 -04:00
Dmitry Monakhov 29faed1638 ext4: use correct depth value
Inode's depth can be changed from here:
ext4_ext_try_to_merge() ->ext4_ext_try_to_merge_up()
We must use correct value.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:30:29 -04:00
Dmitry Monakhov 4b1f166071 ext4: add i_data_sem sanity check
Each caller of ext4_ext_dirty must hold i_data_sem,
The only exception is migration code, let's make it convenient.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2014-07-27 22:28:15 -04:00
Xiaoguang Wang b27b1535ac ext4: fix wrong size computation in ext4_mb_normalize_request()
As the member fe_len defined in struct ext4_free_extent is expressed as
number of clusters, the variable "size" computation is wrong, we need to
first translate fe_len to block number, then to bytes.

Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-27 22:26:36 -04:00
Linus Torvalds fbf08efa04 Merge branch 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfs
Pull vfs fixes from Christoph Hellwig:
 "A vfsmount leak fix, and a compile warning fix"

* 'vfs-for-3.16' of git://git.infradead.org/users/hch/vfs:
  fs: umount on symlink leaks mnt count
  direct-io: fix uninitialized warning in do_direct_IO()
2014-07-27 09:43:41 -07:00
Linus Torvalds 0246544fc9 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "These two pathes fix issues with the kernel-userspace protocol changes
  in v3.15"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: add FUSE_NO_OPEN_SUPPORT flag to INIT
  fuse: s_time_gran fix
2014-07-25 16:16:34 -07:00
Chao Yu 9d84795077 f2fs: fix to put root inode in error path of fill_super
We should put root inode correctly in error path of fill_super, otherwise we
may encounter a leak case of inode resource.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-25 08:19:57 -07:00
Chao Yu dbf20cb259 f2fs: avoid use invalid mapping of node_inode when evict meta inode
Andrey Tsyvarev reported:
"Using memory error detector reveals the following use-after-free error
in 3.15.0:

AddressSanitizer: heap-use-after-free in f2fs_evict_inode
Read of size 8 by thread T22279:
  [<ffffffffa02d8702>] f2fs_evict_inode+0x102/0x2e0 [f2fs]
  [<ffffffff812359af>] evict+0x15f/0x290
  [<     inlined    >] iput+0x196/0x280 iput_final
  [<ffffffff812369a6>] iput+0x196/0x280
  [<ffffffffa02dc416>] f2fs_put_super+0xd6/0x170 [f2fs]
  [<ffffffff81210095>] generic_shutdown_super+0xc5/0x1b0
  [<ffffffff812105fd>] kill_block_super+0x4d/0xb0
  [<ffffffff81210a86>] deactivate_locked_super+0x66/0x80
  [<ffffffff81211c98>] deactivate_super+0x68/0x80
  [<ffffffff8123cc88>] mntput_no_expire+0x198/0x250
  [<     inlined    >] SyS_umount+0xe9/0x1a0 SYSC_umount
  [<ffffffff8123f1c9>] SyS_umount+0xe9/0x1a0
  [<ffffffff81cc8df9>] system_call_fastpath+0x16/0x1b

Freed by thread T3:
  [<ffffffffa02dc337>] f2fs_i_callback+0x27/0x30 [f2fs]
  [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 __rcu_reclaim
  [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 rcu_do_batch
  [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 invoke_rcu_callbacks
  [<     inlined    >] rcu_process_callbacks+0x2d6/0x930 __rcu_process_callbacks
  [<ffffffff810fd266>] rcu_process_callbacks+0x2d6/0x930
  [<ffffffff8107cce2>] __do_softirq+0x142/0x380
  [<ffffffff8107cf50>] run_ksoftirqd+0x30/0x50
  [<ffffffff810b2a87>] smpboot_thread_fn+0x197/0x280
  [<ffffffff810a8238>] kthread+0x148/0x160
  [<ffffffff81cc8d4c>] ret_from_fork+0x7c/0xb0

Allocated by thread T22276:
  [<ffffffffa02dc7dd>] f2fs_alloc_inode+0x2d/0x170 [f2fs]
  [<ffffffff81235e2a>] iget_locked+0x10a/0x230
  [<ffffffffa02d7495>] f2fs_iget+0x35/0xa80 [f2fs]
  [<ffffffffa02e2393>] f2fs_fill_super+0xb53/0xff0 [f2fs]
  [<ffffffff81211bce>] mount_bdev+0x1de/0x240
  [<ffffffffa02dbce0>] f2fs_mount+0x10/0x20 [f2fs]
  [<ffffffff81212a85>] mount_fs+0x55/0x220
  [<ffffffff8123c026>] vfs_kern_mount+0x66/0x200
  [<     inlined    >] do_mount+0x2b4/0x1120 do_new_mount
  [<ffffffff812400d4>] do_mount+0x2b4/0x1120
  [<     inlined    >] SyS_mount+0xb2/0x110 SYSC_mount
  [<ffffffff812414a2>] SyS_mount+0xb2/0x110
  [<ffffffff81cc8df9>] system_call_fastpath+0x16/0x1b

The buggy address ffff8800587866c8 is located 48 bytes inside
  of 680-byte region [ffff880058786698, ffff880058786940)

Memory state around the buggy address:
  ffff880058786100: ffffffff ffffffff ffffffff ffffffff
  ffff880058786200: ffffffff ffffffff ffffffrr rrrrrrrr
  ffff880058786300: rrrrrrrr rrffffff ffffffff ffffffff
  ffff880058786400: ffffffff ffffffff ffffffff ffffffff
  ffff880058786500: ffffffff ffffffff ffffffff fffffffr
 >ffff880058786600: rrrrrrrr rrrrrrrr rrrfffff ffffffff
                                                ^
  ffff880058786700: ffffffff ffffffff ffffffff ffffffff
  ffff880058786800: ffffffff ffffffff ffffffff ffffffff
  ffff880058786900: ffffffff rrrrrrrr rrrrrrrr rrrr....
  ffff880058786a00: ........ ........ ........ ........
  ffff880058786b00: ........ ........ ........ ........
Legend:
  f - 8 freed bytes
  r - 8 redzone bytes
  . - 8 allocated bytes
  x=1..7 - x allocated bytes + (8-x) redzone bytes

Investigation shows, that f2fs_evict_inode, when called for
'meta_inode', uses invalidate_mapping_pages() for 'node_inode'.
But 'node_inode' is deleted before 'meta_inode' in f2fs_put_super via
iput().

It seems that in common usage scenario this use-after-free is benign,
because 'node_inode' remains partially valid data even after
kmem_cache_free().
But things may change if, while 'meta_inode' is evicted in one f2fs
filesystem, another (mounted) f2fs filesystem requests inode from cache,
and formely
'node_inode' of the first filesystem is returned."

Nids for both meta_inode and node_inode are reservation, so it's not necessary
for us to invalidate pages which will never be allocated.
To fix this issue, let's skipping needlessly invalidating pages for
{meta,node}_inode in f2fs_evict_inode.

Reported-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Tested-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-25 08:19:55 -07:00
Chao Yu 32f9bc25cb f2fs: support ->rename2()
Now new interface ->rename2() is added to VFS, here are related description:
https://lkml.org/lkml/2014/2/7/873
https://lkml.org/lkml/2014/2/7/758

This patch adds function f2fs_rename2() to support ->rename2() including
handling both RENAME_EXCHANGE and RENAME_NOREPLACE flag.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-25 08:14:08 -07:00
Huang Ying 79e35dc3c2 f2fs: add f2fs_balance_fs for direct IO
Otherwise, if a large amount of direct IO writes were done, the
segment allocation may be failed because no enough segments are gced.

Changes:

v2: add f2fs_balance_fs into __get_data_block instead of f2fs_direct_IO.

Signed-off-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-25 08:13:58 -07:00
Gu Zheng 00fefb9cf2 aio: use iovec array rather than the single one
Previously, we only offer a single iovec to handle all the read/write cases, so
the PREADV/PWRITEV request always need to alloc more iovec buffer when copying
user vectors.
If we use a tmp iovec array rather than the single one, some small PREADV/PWRITEV
workloads(vector size small than the tmp buffer) will not need to alloc more
iovec buffer when copying user vectors.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-07-24 10:59:40 -04:00
Gu Zheng 2be4e7deec aio: fix some comments
The function comments of aio_run_iocb and aio_read_events are out of date, so
fix them here.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-07-24 10:59:40 -04:00
Gu Zheng 8dc4379e17 aio: use the macro rather than the inline magic number
Replace the inline magic number with the ready-made macro(AIO_RING_MAGIC),
just clean up.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-07-24 10:59:40 -04:00
Gu Zheng b53f1f82fb aio: remove the needless registration of ring file's private_data
Remove the registration of ring file's private_data, we do not use
it.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-07-24 10:59:40 -04:00
Eric Paris 7d8b6c6375 CAPABILITIES: remove undefined caps from all processes
This is effectively a revert of 7b9a7ec565
plus fixing it a different way...

We found, when trying to run an application from an application which
had dropped privs that the kernel does security checks on undefined
capability bits.  This was ESPECIALLY difficult to debug as those
undefined bits are hidden from /proc/$PID/status.

Consider a root application which drops all capabilities from ALL 4
capability sets.  We assume, since the application is going to set
eff/perm/inh from an array that it will clear not only the defined caps
less than CAP_LAST_CAP, but also the higher 28ish bits which are
undefined future capabilities.

The BSET gets cleared differently.  Instead it is cleared one bit at a
time.  The problem here is that in security/commoncap.c::cap_task_prctl()
we actually check the validity of a capability being read.  So any task
which attempts to 'read all things set in bset' followed by 'unset all
things set in bset' will not even attempt to unset the undefined bits
higher than CAP_LAST_CAP.

So the 'parent' will look something like:
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	ffffffc000000000

All of this 'should' be fine.  Given that these are undefined bits that
aren't supposed to have anything to do with permissions.  But they do...

So lets now consider a task which cleared the eff/perm/inh completely
and cleared all of the valid caps in the bset (but not the invalid caps
it couldn't read out of the kernel).  We know that this is exactly what
the libcap-ng library does and what the go capabilities library does.
They both leave you in that above situation if you try to clear all of
you capapabilities from all 4 sets.  If that root task calls execve()
the child task will pick up all caps not blocked by the bset.  The bset
however does not block bits higher than CAP_LAST_CAP.  So now the child
task has bits in eff which are not in the parent.  These are
'meaningless' undefined bits, but still bits which the parent doesn't
have.

The problem is now in cred_cap_issubset() (or any operation which does a
subset test) as the child, while a subset for valid cap bits, is not a
subset for invalid cap bits!  So now we set durring commit creds that
the child is not dumpable.  Given it is 'more priv' than its parent.  It
also means the parent cannot ptrace the child and other stupidity.

The solution here:
1) stop hiding capability bits in status
	This makes debugging easier!

2) stop giving any task undefined capability bits.  it's simple, it you
don't put those invalid bits in CAP_FULL_SET you won't get them in init
and you won't get them in any other task either.
	This fixes the cap_issubset() tests and resulting fallout (which
	made the init task in a docker container untraceable among other
	things)

3) mask out undefined bits when sys_capset() is called as it might use
~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
	This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

4) mask out undefined bit when we read a file capability off of disk as
again likely all bits are set in the xattr for forward/backward
compatibility.
	This lets 'setcap all+pe /bin/bash; /bin/bash' run

Signed-off-by: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Andrew G. Morgan <morgan@kernel.org>
Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Dan Walsh <dwalsh@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
2014-07-24 21:53:47 +10:00
James Morris 4ca332e11d Merge tag 'keys-next-20140722' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into next 2014-07-24 21:36:19 +10:00
Jie Liu 74dc93a908 xfs: fix uflags detection at xfs_fs_rm_xquota
We are intended to check up uflags against FS_PROJ_QUOTA rather than
FS_USER_UQUOTA once more, it looks to me like a typo, but might cause
the project quota metadata space can not be removed.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 21:27:17 +10:00
Jie Liu 7b409a7d6f xfs: remove XFS_IS_OQUOTA_ON macros
Remove the XFS_IS_OQUOTA_ON macros as it is obsoleted.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 21:27:16 +10:00
Eric Sandeen 54aa61f82d xfs: tidy up xfs_set_inode32
xfs_set_inode32() caught my eye because it had weird spacing around
the "-1's".  In cleaning that up, I realized that the assignment in
the declaration of "ino" is never used; it's rewritten before it
gets read.

Drop the ino initializer from its declaration since it's not used,
and move the agino initialization into the body of the function,
mostly so that we can have pretty whitespace and not exceed 80
columns.  :)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:53:10 +10:00
Eric Sandeen 9de67c3ba9 xfs: allow inode allocations in post-growfs disk space
Today, if we perform an xfs_growfs which adds allocation groups,
mp->m_maxagi is not properly updated when the growfs is complete.

Therefore inodes will continue to be allocated only in the
AGs which existed prior to the growfs, and the new space
won't be utilized.

This is because of this path in xfs_growfs_data_private():

xfs_growfs_data_private
	xfs_initialize_perag(mp, nagcount, &nagimax);
		if (mp->m_flags & XFS_MOUNT_32BITINODES)
			index = xfs_set_inode32(mp);
		else
			index = xfs_set_inode64(mp);

		if (maxagi)
			*maxagi = index;

where xfs_set_inode* iterates over the (old) agcount in
mp->m_sb.sb_agblocks, which has not yet been updated
in the growfs path.  So "index" will be returned based on
the old agcount, not the new one, and new AGs are not available
for inode allocation.

Fix this by explicitly passing the proper AG count (which
xfs_initialize_perag() already has) down another level,
so that xfs_set_inode* can make the proper decision about
acceptable AGs for inode allocation in the potentially
newly-added AGs.

This has been broken since 3.7, when these two
xfs_set_inode* functions were added in commit 2d2194f.
Prior to that, we looped over "agcount" not sb_agblocks
in these calculations.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:51:54 +10:00
Jie Liu eb866bbf09 xfs: mark xfs_qm_quotacheck as static
xfs_qm_quotacheck() is not used outside of xfs_qm.c.  Mark it static
and move it around in the file to avoid a forward declaration.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:49:57 +10:00
Mark Tinguely 5c18717ea2 xfs: fix cil push sequence after log recovery
When the CIL checkpoint is fully written to the log, the LSN of the checkpoint
commit record is written into the CIL context structure. This allows log force
waiters to correctly detect when the checkpoint they are waiting on have been
fully written into the log buffers.

However, the initial context after mount is initialised with a non-zero commit
LSN, so appears to waiters as though it is complete even though it may not have
even been pushed, let alone written to the log buffers. Hence a log force
immediately after a filesystem is mounted may not behave correctly, nor does
commit record ordering if multiple CIL pushes interleave immediately after
mount.

To fix this, make sure the initial context commit LSN is not touched until the
first checkpointis actually pushed.

[dchinner: rewrite commit message]

Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 20:49:40 +10:00
Vasily Averin 295dc39d94 fs: umount on symlink leaks mnt count
Currently umount on symlink blocks following umount:

/vz is separate mount

# ls /vz/ -al | grep test
drwxr-xr-x.  2 root root       4096 Jul 19 01:14 testdir
lrwxrwxrwx.  1 root root         11 Jul 19 01:16 testlink -> /vz/testdir
# umount -l /vz/testlink
umount: /vz/testlink: not mounted (expected)

# lsof /vz
# umount /vz
umount: /vz: device is busy. (unexpected)

In this case mountpoint_last() gets an extra refcount on path->mnt

Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Ian Kent <raven@themaw.net>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-07-24 06:18:12 -04:00
Boaz Harrosh 6fcc5420bf direct-io: fix uninitialized warning in do_direct_IO()
The following warnings:

  fs/direct-io.c: In function ‘__blockdev_direct_IO’:
  fs/direct-io.c:1011:12: warning: ‘to’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  fs/direct-io.c:913:16: note: ‘to’ was declared here
  fs/direct-io.c:1011:12: warning: ‘from’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  fs/direct-io.c:913:10: note: ‘from’ was declared here

are false positive because dio_get_page() either fails, or sets both
'from' and 'to'.

Paul Bolle said ...
Maybe it's better to move initializing "to" and "from" out of
dio_get_page(). That _might_ make it easier for both the the reader and
the compiler to understand what's going on. Something like this:

Christoph Hellwig said ...
The fix of moving the code definitively looks nicer, while I think
uninitialized_var is horrible wart that won't get anywhere near my code.

Boaz Harrosh: I agree with Christoph and Paul

Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-07-24 06:17:07 -04:00
Brian Foster f074051ff5 xfs: squash prealloc while over quota free space as well
From: Brian Foster <bfoster@redhat.com>

Commit 4d559a3b introduced heavy prealloc. squashing to catch the case
of requesting too large a prealloc on smaller filesystems, leading to
repeated flush and retry cycles that occur on ENOSPC. Now that we issue
eofblocks scans on EDQUOT/ENOSPC, squash the prealloc against the
minimum available free space across all applicable quotas as well to
avoid a similar problem of repeated eofblocks scans.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 19:56:08 +10:00
Brian Foster dc06f398f0 xfs: run an eofblocks scan on ENOSPC/EDQUOT
From: Brian Foster <bfoster@redhat.com>

Speculative preallocation and and the associated throttling metrics
assume we're working with large files on large filesystems. Users have
reported inefficiencies in these mechanisms when we happen to be dealing
with large files on smaller filesystems. This can occur because while
prealloc throttling is aggressive under low free space conditions, it is
not active until we reach 5% free space or less.

For example, a 40GB filesystem has enough space for several files large
enough to have multi-GB preallocations at any given time. If those files
are slow growing, they might reserve preallocation for long periods of
time as well as avoid the background scanner due to frequent
modification. If a new file is written under these conditions, said file
has no access to this already reserved space and premature ENOSPC is
imminent.

To handle this scenario, modify the buffered write ENOSPC handling and
retry sequence to invoke an eofblocks scan. In the smaller filesystem
scenario, the eofblocks scan resets the usage of preallocation such that
when the 5% free space threshold is met, throttling effectively takes
over to provide fair and efficient preallocation until legitimate
ENOSPC.

The eofblocks scan is selective based on the nature of the failure. For
example, an EDQUOT failure in a particular quota will use a filtered
scan for that quota. Because we don't know which quota might have caused
an allocation failure at any given time, we include each applicable
quota determined to be under low free space conditions in the scan.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 19:49:28 +10:00
Brian Foster f452639792 xfs: support a union-based filter for eofblocks scans
From: Brian Foster <bfoster@redhat.com>

The eofblocks scan inode filter uses intersection logic by default.
E.g., specifying both user and group quota ids filters out inodes that
are not covered by both the specified user and group quotas. This is
suitable for behavior exposed to userspace.

Scans that are initiated from within the kernel might require more broad
semantics, such as scanning all inodes under each quota associated with
an inode to alleviate low free space conditions in each.

Create the XFS_EOF_FLAGS_UNION flag to support a conditional union-based
filtering algorithm for eofblocks scans. This flag is intentionally left
out of the valid mask as it is not supported for scans initiated from
userspace.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 19:44:28 +10:00
Brian Foster 5400da7dc0 xfs: add scan owner field to xfs_eofblocks
From: Brian Foster <bfoster@redhat.com>

The scan owner field represents an optional inode number that is
responsible for the current scan. The purpose is to identify that an
inode is under iolock and as such, the iolock shouldn't be attempted
when trimming eofblocks. This is an internal only field.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 19:40:22 +10:00
Jie Liu f3d1e58743 xfs: introduce xfs_bulkstat_grab_ichunk
From: Jie Liu <jeff.liu@oracle.com>

Introduce xfs_bulkstat_grab_ichunk() to look up an inode chunk in where
the given inode resides, then grab the record.  Update the data for the
pointed-to record if the inode was not the last in the chunk and there
are some left allocated, return the grabbed inode count on success.

Refactor xfs_bulkstat() with it.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 18:42:21 +10:00
Jie Liu 4b8fdfecd8 xfs: introduce xfs_bulkstat_ichunk_ra
From: Jie Liu <jeff.liu@oracle.com>

Introduce xfs_bulkstat_ichunk_ra() to loop over all clusters in the
next inode chunk, then performs readahead if there are any allocated
inodes in that cluster.

Refactor xfs_bulkstat() with it.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 18:41:18 +10:00
Jie Liu d4c2734875 xfs: fix error handling at xfs_bulkstat
From: Jie Liu <jeff.liu@oracle.com>

We should not ignore the btree operation errors at xfs_bulkstat() but
to propagate them if any. This patch fix two places in this function
and the remaining things will be fixed with code refactoring thereafter.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 18:40:43 +10:00
Jie Liu 296dfd7fdb xfs: remove redundant user buffer count checks at xfs_bulkstat
From: Jie Liu <jeff.liu@oracle.com>

Remove the redundant user buffer and count checks as it has already
been validated at xfs_ioc_bulkstat().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 18:40:26 +10:00
Himangi Saraogi 08a0f24e4c ceph: replace comma with a semicolon
Replace a comma between expression statements by a semicolon. This changes
the semantics of the code, but given the current indentation appears to be
what is intended.

A simplified version of the Coccinelle semantic patch that performs this
transformation is as follows:
// <smpl>
@r@
expression e1,e2;
@@

 e1
-,
+;
 e2;
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
2014-07-24 12:04:46 +04:00
Jie Liu c7cb51dcb0 xfs: fix error handling at xfs_inumbers
From: Jie Liu <jeff.liu@oracle.com>

To fetch the file system number tables, we currently just ignore the
errors and proceed to loop over the next AG or bump agino to the next
chunk in case of btree operations failed, that is not properly because
those errors might hint us potential file system problems.

This patch rework xfs_inumbers() to handle the btree operation errors
as well as the loop conditions.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 12:18:47 +10:00
Jie Liu 549fa00679 xfs: consolidate xfs_inumbers
From: Jie Liu <jeff.liu@oracle.com>

Consolidate xfs_inumbers() to make the formatter function return correct
error and make the source code looks a bit neat.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 12:11:47 +10:00
Christoph Hellwig d716f8eedb xfs: remove xfs_bulkstat_single
From: Christoph Hellwig <hch@lst.de>

xfs_bukstat_one doesn't have any failure case that would go away when
called through xfs_bulkstat, so remove the fallback and the now unessecary
xfs_bulkstat_single function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 12:07:15 +10:00
Jie Liu 8fe657760d xfs: remove redundant stat assignment in xfs_bulkstat_one_int
From: Jie Liu <jeff.liu@oracle.com>

Remove the redundant BULKSTAT_RV_NOTHING assignment in case of call
xfs_iget() failed at xfs_bulkstat_one_int().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 11:33:28 +10:00
Linus Torvalds 82e13c71bc Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
 "Another regression from the xdr encoding rewrite"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  NFSD: Fix crash encoding lock reply on 32-bit
2014-07-23 17:55:11 -07:00
Hugh Dickins 4e66d445d0 simple_xattr: permit 0-size extended attributes
If a filesystem uses simple_xattr to support user extended attributes,
LTP setxattr01 and xfstests generic/062 fail with "Cannot allocate
memory": simple_xattr_alloc()'s wrap-around test mistakenly excludes
values of zero size.  Fix that off-by-one (but apparently no filesystem
needs them yet).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-23 15:10:55 -07:00
Silesh C V aed8adb768 coredump: fix the setting of PF_DUMPCORE
Commit 079148b919 ("coredump: factor out the setting of PF_DUMPCORE")
cleaned up the setting of PF_DUMPCORE by removing it from all the
linux_binfmt->core_dump() and moving it to zap_threads().But this ended
up clearing all the previously set flags.  This causes issues during
core generation when tsk->flags is checked again (eg.  for PF_USED_MATH
to dump floating point registers).  Fix this.

Signed-off-by: Silesh C V <svellattu@mvista.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: <stable@vger.kernel.org>	[3.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-23 15:10:54 -07:00
Thomas Gleixner 5eaaed4fe2 fs: lockd: Use ktime_get_ns()
Replace the ever recurring:
        ts = ktime_get_ts();
        ns = timespec_to_ns(&ts);
with
        ns = ktime_get_ns();

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:44 -07:00
Jeff Layton f9c00c3ab4 nfsd: Do not let nfs4_file pin the struct inode
Remove the fi_inode field in struct nfs4_file in order to remove the
possibility of struct nfs4_file pinning the inode when it does not have
any open state.

The only place we still need to get to an inode is in check_for_locks,
so change it to use find_any_file and use the inode from any that it
finds. If it doesn't find one, then just assume there aren't any.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 16:35:24 -04:00
Trond Myklebust b07c54a4a3 nfsd: nfs4_check_fh - make it actually check the filehandle
...instead of just checking the inode that corresponds to it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 16:35:24 -04:00
Trond Myklebust ca94321783 nfsd: Use the filehandle to look up the struct nfs4_file instead of inode
This makes more sense anyway since an inode pointer value can change
even when the filehandle doesn't.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 16:35:24 -04:00
Trond Myklebust e2cf80d73f nfsd: Store the filehandle with the struct nfs4_file
For use when we may not have a struct inode.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 16:35:23 -04:00
Himangi Saraogi fc8e5a644c nfsd4: convert comma to semicolon
Replace a comma between expression statements by a semicolon. This changes
the semantics of the code, but given the current indentation appears to be
what is intended.

A simplified version of the Coccinelle semantic patch that performs this
transformation is as follows:
// <smpl>
@r@
expression e1,e2;
@@

 e1
-,
+;
 e2;
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 14:20:49 -04:00
Jeff Layton 2f6ce8e73c nfsd: ensure that st_access_bmap and st_deny_bmap are initialized to 0
Open stateids must be initialized with the st_access_bmap and
st_deny_bmap set to 0, so that nfs4_get_vfs_file can properly record
their state in old_access_bmap and old_deny_bmap.

This bug was introduced in commit baeb4ff0e5 (nfsd: make deny mode
enforcement more efficient and close races in it) and was causing the
refcounts to end up incorrect when nfs4_get_vfs_file returned an error
after bumping the refcounts. This made it impossible to unmount the
underlying filesystem after running pynfs tests that involve deny modes.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 14:20:47 -04:00
Thomas Gleixner 57e0be041d sched: Make task->real_start_time nanoseconds based
Simplify the only user of this data by removing the timespec
conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:05 -07:00
Thomas Gleixner 53cc7bad37 timerfd: Use ktime_mono_to_real()
We have a few other use cases of ktime_get_monotonic_offset() which
can be optimized with ktime_mono_to_real(). The timerfd code uses the
offset only for comparison, so we can use ktime_mono_to_real(0) for
this as well.

Funny enough text size shrinks with that on ARM and x8664 !?

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:02 -07:00
Kinglong Mee f98bac5a30 NFSD: Fix crash encoding lock reply on 32-bit
Commit 8c7424cff6 "nfsd4: don't try to encode conflicting owner if low
on space" forgot to free conf->data in nfsd4_encode_lockt and before
sign conf->data to NULL in nfsd4_encode_lock_denied, causing a leak.

Worse, kfree() can be called on an uninitialized pointer in the case of
a succesful lock (or one that fails for a reason other than a conflict).

(Note that lock->lk_denied.ld_owner.data appears it should be zero here,
until you notice that it's one arm of a union the other arm of which is
written to in the succesful case by the

	memcpy(&lock->lk_resp_stateid, &lock_stp->st_stid.sc_stateid,
	                                sizeof(stateid_t));

in nfsd4_lock().  In the 32-bit case this overwrites ld_owner.data.)

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes: 8c7424cff6 ""nfsd4: don't try to encode conflicting owner if low on space"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-23 10:31:56 -04:00
David Howells 633706a2ee Merge branch 'keys-fixes' into keys-next
Signed-off-by: David Howells <dhowells@redhat.com>
2014-07-22 21:55:45 +01:00
David Howells f9167789df KEYS: user: Use key preparsing
Make use of key preparsing in user-defined and logon keys so that quota size
determination can take place prior to keyring locking when a key is being
added.

Also the idmapper key types need to change to match as they use the
user-defined key type routines.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
Acked-by: Jeff Layton <jlayton@primarydata.com>
2014-07-22 21:46:17 +01:00
Jeff Layton d55a166c96 nfsd: bump dl_time when unhashing delegation
There's a potential race between a lease break and DELEGRETURN call.

Suppose a lease break comes in and queues the workqueue job for a
delegation, but it doesn't run just yet. Then, a DELEGRETURN comes in
finds the delegation and calls destroy_delegation on it to unhash it and
put its primary reference.

Next, the workqueue job runs and queues the delegation back onto the
del_recall_lru list, issues the CB_RECALL and puts the final reference.
With that, the final reference to the delegation is put, but it's still
on the LRU list.

When we go to unhash a delegation, it's because we intend to get rid of
it soon afterward, so we don't want lease breaks to mess with it once
that occurs. Fix this by bumping the dl_time whenever we unhash a
delegation, to ensure that lease breaks don't monkey with it.

I believe this is a regression due to commit 02e1215f9f (nfsd: Avoid
taking state_lock while holding inode lock in nfsd_break_one_deleg).
Prior to that, the state_lock was held in the lm_break callback itself,
and that would have prevented this race.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-22 15:34:47 -04:00
Andrew Gallagher d7afaec0b5 fuse: add FUSE_NO_OPEN_SUPPORT flag to INIT
Here some additional changes to set a capability flag so that clients can
detect when it's appropriate to return -ENOSYS from open.

This amends the following commit introduced in 3.14:

  7678ac5061  fuse: support clients that don't implement 'open'

However we can only add the flag to 3.15 and later since there was no
protocol version update in 3.14.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
2014-07-22 16:37:43 +02:00
Miklos Szeredi a800bad366 fuse: s_time_gran fix
Default s_time_gran is 1, don't overwrite that if userspace didn't
explicitly specify one.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
2014-07-22 16:37:42 +02:00
Benjamin LaHaise be6fb451a2 aio: remove no longer needed preempt_disable()
Based on feedback from Jens Axboe on 263782c1c9,
clean up get/put_reqs_available() to remove the no longer needed preempt_disable()
and preempt_enable() pair.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>
2014-07-22 09:56:56 -04:00
Trond Myklebust 72c0b0fb9f nfsd: Move the delegation reference counter into the struct nfs4_stid
We will want to add reference counting to the lock stateid and open
stateids too in later patches.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-21 17:03:00 -04:00
Jeff Layton 417c6629b2 nfsd: fix race that grants unrecallable delegation
If nfs4_setlease succesfully acquires a new delegation, then another
task breaks the delegation before we reach hash_delegation_locked, then
the breaking task will see an empty fi_delegations list and do nothing.
The client will receive an open reply incorrectly granting a delegation
and will never receive a recall.

Move more of the delegation fields to be protected by the fi_lock. It's
more granular than the state_lock and in later patches we'll want to
be able to rely on it in addition to the state_lock.

Attempt to acquire a delegation. If that succeeds, take the spinlocks
and then check to see if the file has had a conflict show up since then.
If it has, then we assume that the lease is no longer valid and that
we shouldn't hand out a delegation.

There's also one more potential (but very unlikely) problem. If the
lease is broken before the delegation is hashed, then it could leak.
In the event that the fi_delegations list is empty, reset the
fl_break_time to jiffies so that it's cleaned up ASAP by
the normal lease handling code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-21 16:31:17 -04:00
Greg Kroah-Hartman 90125edbc4 Merge 3.16-rc6 into driver-core-next
We want the platform changes in here as well.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-21 10:07:25 -07:00
J. Bruce Fields 57a3714421 nfsd4: CREATE_SESSION should update backchannel immediately
nfsd4_probe_callback kicks off some work that will eventually run
nfsd4_process_cb_update and update the session flags.  In theory we
could process a following SEQUENCE call before that update happens
resulting in flags that don't accurately represent, for example, the
lack of a backchannel.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-21 12:30:50 -04:00
Brian Norris d0d5864676 Linux 3.16-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJTzJFGAAoJEHm+PkMAQRiGNzQH/087gQch5K+A2HKvPzjUXq57
 G82DJHLONMMq8+NY3Vqhp8g2V8zRbXGJEvMJMsyuscO37Vo7ADcrYo8lqY9w5bIl
 h+Zarhkqz0rqRs2SfMMIVzdd2W7MzL+lqj3GplGPxHztw0+qk7PRKILx6eRppGaH
 JaD4NfkD5+1vfve/2d1ze9D5pCiw6PFNzjesKZxScQhNhIyLdRamfSTY4r9XeURo
 CxpwjphEYfvAcgc39mwzEHPHyKSqULu0By6R8FXQpJ9QjVtzcGEiF+cPqGncpZOR
 5ZSyU5e1CpBl9w8o6Lm9ewXmaCSnBU/VFrOwWvZrXfokZedXBOz7KdShU93XFjU=
 =0VJM
 -----END PGP SIGNATURE-----

Merge tag 'v3.16-rc6' into MTD development branch

Linux 3.16-rc6
2014-07-21 00:01:16 -07:00
Linus Torvalds da83fc6e0f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We have two more fixes in my for-linus branch.

  I was hoping to also include a fix for a btrfs deadlock with
  compression enabled, but we're still nailing that one down"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: test for valid bdev before kobj removal in btrfs_rm_device
  Btrfs: fix abnormal long waiting in fsync
2014-07-20 20:21:05 -07:00
Linus Torvalds 90d51d5606 NFS client fixes for Linux 3.16
Highlights include;
 - Stable fix for an NFSv3 posix ACL regression
 - Multiple fixes for regressions to the NFS generic read/write code
   - Fix page splitting bugs that come into play when a small rsize/wsize
     read/write needs to be sent again (due to error conditions or page
     redirty).
   - Fix nfs_wb_page_cancel, which is called by the "invalidatepage" method
 - Fix 2 compile warnings about unused variables.
 - Fix a performance issue affecting unstable writes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTys5JAAoJEGcL54qWCgDygakP/1JOfQZzyHNRml5aDVRLtfp/
 QQpn8ne3jjEov+0BhzSxqkpHP+8fCcF0UgD5asEnbM83ruoB8EUHErXdq7kRkAYf
 COpDZAww4sPXUszG/xGqIED503DKbf69Ds5pQMh8g71hfQpw04UmMQYbnRNBP2bI
 Z5eu0Xiwaf3MVRaxHMXVy9sl7in/cQBvrXnKLTIYOLA0U5bdCI9JWT5+qbOUCC/3
 YuYm5EjM+OMyhvEWyVDXFp9kmw0vtBS8FwAXYKBjwfJLNl8dGuERWKlFDPNOxdpZ
 QrOBhUH73d2tgvTHzUg/RRtpA6mKOfznU3SQK3muP188/2/sbb4y/RChwv+Ignat
 YqWFgbDTz1idKvjj+Vzcd7eL3HO9Kono/YkAG8i5xmOlSeenzDra1lDAbuB+zNzd
 oLVY1AJp8+13c74gaCDurxJ7fq6Fth97eqxdo8fdQH8Mn6m6qRm2V57rOo59eFQS
 bX6EN4ja8ashrKprCvlhXDGJmQKZWJMEGoTXJCj5w+HBklONV6jdbyzGFKT+Zmhg
 IP+USsaLJDswZNpdWq8Zb9RthfFAURKEQEemwV5eLQNtTa+xQ22ZOF2ycrbBqsED
 etCCm8gyNukl2RCAxGfLrtpBytlcNMmcRKGktKqO1SAAua3IP33wGD7zql8rwGCZ
 m9XMXUkVKiHoErl3jZ4T
 =qklz
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Apologies for the relative lateness of this pull request, however the
  commits fix some issues with the NFS read/write code updates in
  3.16-rc1 that can cause serious Oopsing when using small r/wsize.  The
  delay was mainly due to extra testing to make sure that the fixes
  behave correctly.

  Highlights include;
   - Stable fix for an NFSv3 posix ACL regression
   - Multiple fixes for regressions to the NFS generic read/write code:
     - Fix page splitting bugs that come into play when a small
       rsize/wsize read/write needs to be sent again (due to error
       conditions or page redirty)
     - Fix nfs_wb_page_cancel, which is called by the "invalidatepage"
       method
   - Fix 2 compile warnings about unused variables
   - Fix a performance issue affecting unstable writes"

* tag 'nfs-for-3.16-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Don't reset pg_moreio in __nfs_pageio_add_request
  NFS: Remove 2 unused variables
  nfs: handle multiple reqs in nfs_wb_page_cancel
  nfs: handle multiple reqs in nfs_page_async_flush
  nfs: change find_request to find_head_request
  nfs: nfs_page should take a ref on the head req
  nfs: mark nfs_page reqs with flag for extra ref
  nfs: only show Posix ACLs in listxattr if actually present
2014-07-20 19:55:44 -07:00
Yan, Zheng d0d0db2268 ceph: check zero length in ceph_sync_read()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-21 10:17:05 +08:00
Eric Sandeen 0bfaa9c5cb btrfs: test for valid bdev before kobj removal in btrfs_rm_device
commit 99994cd btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
Liu Bo 98ce2deda2 Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem.

With commit 7fc34a62ca, now fsync will only flush
data within the passed range.  This is the cause of the above problem,
-- btrfs's fsync has a stage called 'sync log' which will wait for all the
ordered extents it've recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate,
punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will
mmap, and then msync.  And I find that msync will wait for quite a long time
(about 20s in my case), thanks to ftrace, it turns out that the previous
fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the
range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants,
there can be some ordered extents created but not getting corresponding pages
flushed, then they're left in memory until we fsync which runs into the
stage 'sync log', and fsync will just wait for the system writeback thread
to flush those pages and get ordered extents finished, so the latency is
inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in
btrfs_wait_logged_extents() to fix that.

Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-19 11:49:44 -07:00
Artem Bityutskiy 545f7fdf6d UBIFS: add a log overlap assertion
Add an assertion which checkes that the head of the log never overlaps with the
tail of the log.

Suggested-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:55:19 +03:00
Artem Bityutskiy f1cb705acc UBIFS: remove unnecessary check
Remove the "if (c->lhead_offs == 0)" check because is unnecessary, since
at that point the log head offset is guaranteed to be zero due to the previous
operation.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:54:25 +03:00
Artem Bityutskiy 07e19dff63 UBIFS: remove mst_mutex
The 'mst_mutex' is not needed since because 'ubifs_write_master()' is only
called on the mount path and commit path. The mount path is sequential and
there is no parallelism, and the commit path is also serialized - there is only
one commit going on at a time.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
Fabian Frederick 39274a1e8d UBIFS: kernel-doc warning fix
s/data/timer

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
Fabian Frederick d4eb08ff0a UBIFS: replace seq_printf by seq_puts
Fix checkpatch warnings:
"WARNING: Prefer seq_puts to seq_printf"

Andrew Morton wrote:

"
- puts is presumably faster

- puts doesn't go rogue if you accidentally pass it a "%".

- this patch actually made fs/ubifs/super.o 12 bytes smaller.
  Perhaps because seq_printf() is a varargs function, forcing the
  caller to pass args on the stack instead of in registers.
"

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
Fabian Frederick 86b4c14de3 UBIFS: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
Fabian Frederick ef13f01828 UBIFS: kernel-doc warning fix
No grouped argument in drop_last_node.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
hujianyang 6dcfb80264 UBIFS: fix error path in create_default_filesystem()
In the end of 'create_default_filesystem()' we need to check
the return value of 'ubifs_write_node()' to ensure that we have
successfully written the 'cs_node'.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:52 +03:00
Artem Bityutskiy f2b6521aa1 UBIFS: fix spelling of "scanned"
Randy Dunlap pointed that we should use "scanned" instead of "scaned". This
patch makes the correction.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
Seunghun Lee d685c41215 UBIFS: fix some comments
This patch fixes some comments about return type.

Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
hujianyang c7b5bb0beb UBIFS: remove useless @ecc in struct ubifs_scan_leb
We set @ecc in ubifs_scan_leb only if leb_read returns EBADMSG and
do not use it any more. This patch removes this variable and adds
comments about EBADMSG handling.

Artem: re-phrase commentaries

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
hujianyang b793a8c888 UBIFS: remove useless statements
This patch removes useless and duplicate statements.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
hujianyang ce6ebdb87e UBIFS: Add missing break statements in dbg_chk_pnode()
This is a minor fix. These two branches in 'dbg_chk_pnode()'
are dealing with different conditions. Although there is
no fault in current state, I think adding "break"s in
each end of branch is better.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
hujianyang 5a95741a57 UBIFS: fix error handling in dump_lpt_leb()
This patch checks the return value of 'ubifs_unpack_nnode()'.
If this function returns an error, 'nnode' may not be
initialized, so just print an error message and break.

Signed-off-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
2014-07-19 09:53:51 +03:00
Kees Cook c2e1f2e30d seccomp: implement SECCOMP_FILTER_FLAG_TSYNC
Applying restrictive seccomp filter programs to large or diverse
codebases often requires handling threads which may be started early in
the process lifetime (e.g., by code that is linked in). While it is
possible to apply permissive programs prior to process start up, it is
difficult to further restrict the kernel ABI to those threads after that
point.

This change adds a new seccomp syscall flag to SECCOMP_SET_MODE_FILTER for
synchronizing thread group seccomp filters at filter installation time.

When calling seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_TSYNC,
filter) an attempt will be made to synchronize all threads in current's
threadgroup to its new seccomp filter program. This is possible iff all
threads are using a filter that is an ancestor to the filter current is
attempting to synchronize to. NULL filters (where the task is running as
SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be
transitioned into SECCOMP_MODE_FILTER. If prctrl(PR_SET_NO_NEW_PRIVS,
...) has been set on the calling thread, no_new_privs will be set for
all synchronized threads too. On success, 0 is returned. On failure,
the pid of one of the failing threads will be returned and no filters
will have been applied.

The race conditions against another thread are:
- requesting TSYNC (already handled by sighand lock)
- performing a clone (already handled by sighand lock)
- changing its filter (already handled by sighand lock)
- calling exec (handled by cred_guard_mutex)
The clone case is assisted by the fact that new threads will have their
seccomp state duplicated from their parent before appearing on the tasklist.

Holding cred_guard_mutex means that seccomp filters cannot be assigned
while in the middle of another thread's exec (potentially bypassing
no_new_privs or similar). The call to de_thread() may kill threads waiting
for the mutex.

Changes across threads to the filter pointer includes a barrier.

Based on patches by Will Drewry.

Suggested-by: Julien Tinnes <jln@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-07-18 12:13:40 -07:00
Kees Cook 1d4457f999 sched: move no_new_privs into new atomic flags
Since seccomp transitions between threads requires updates to the
no_new_privs flag to be atomic, the flag must be part of an atomic flag
set. This moves the nnp flag into a separate task field, and introduces
accessors.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
2014-07-18 12:13:38 -07:00
Linus Torvalds f839719122 This patch set contains two minor docs/spelling fixes, some fixes for
flock, a change to use GFP_NOFS to avoid recursion on a rarely used
 code path and a fix for a race relating to the glock lru.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJTyPZQAAoJEMrg3m4a/8jSFBEQAKSnJQUP9MSxVwNBrgOiybXW
 kQd8RYs7cdt33i97C3Im9xSVktPz4HKTvuwHyvNV1oyWScfWSyqCgC//cU+/zlYV
 wJDZWIASNoQheY6UfxR6TeBPZo9Hgq7RQRGj4h1ttag9+b8Zz9aV5TCxcoh28ULF
 629TyNwg4xdiEKX2xZusDwGCoHn5f5l9pAa5MyPrcyPzn1lOJP1lz++Lci2nqC4g
 DvA/KzQzDLQ2lKXdSd95avwQxnHqmeCTvClPmK9GgONrt66tqq6CcCLB1jPRE7/O
 J7f0VWy/PEeo8ot+9siiA380EvM6hWvJx5Fuen/Qb9dQ5sgsJMkvgbqlHK6zB/i3
 Je6Qq+aVPz3qktmXdyEagpXfZAQAxy0PUWezQBQH8HIlhwKMGC1QaFgMoAFIks1Y
 S38IBHCwlymytWYdVaRhyUOnlzzaSyeYROzs7hZoxRRUilge5rPkrqtv4HWLSRtZ
 rGFEid181+qTO2TyoiMRY2oR3U0PHfbE9Dhv5Pu9caTl55kj9eAGwvqnOn6IpyvF
 eiUoWOnDYFO+8sxFKPYFndglEZx0zBU6B/7axyQ3qam3BojTJwKh+2+4TqauM0zo
 4ehwJEzVmV21sbyMfUHCKTQEkW8OjQ+EkxAEmGhp4IODNwZ3vPfFBdhFi3fBipqO
 WhDmeDmOddb9cCoQG8WZ
 =VTve
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

Pull gfs2 fixes from Steven Whitehouse:
 "This patch set contains two minor docs/spelling fixes, some fixes for
  flock, a change to use GFP_NOFS to avoid recursion on a rarely used
  code path and a fix for a race relating to the glock lru"

* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes
  GFS2: memcontrol: Spelling s/invlidate/invalidate/
  GFS2: Allow caching of glocks for flock
  GFS2: Allow flocks to use normal glock dq rather than dq_wait
  GFS2: replace count*size kzalloc by kcalloc
  GFS2: Use GFP_NOFS when allocating glocks
  GFS2: Fix race in glock lru glock disposal
  GFS2: Only wait for demote when last holder is dequeued
2014-07-18 06:26:04 -10:00
Linus Torvalds 847f56eb0e xfs: fixes for 3.15-rc5
Fixes for low memory perforamnce regressions and a quota inode handling
 regression.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTyFtkAAoJEK3oKUf0dfodAioP/0nsIq3be9zch0/Jjf9aduON
 1aMaUE/9p4g/zu0f96ld3GI5/guRVllZ/0qJFyYAnAgOGs1hGARQfOuNnHBjgdLZ
 D261JbT9z8d8inQ7BMSc3EBBQ2CAZsAwRmg6UbeaWBE4hjlJ03RGWBE/aMMh10Wh
 t2fWUoeYWZWLi+Gfa+lPpnTESH7nBK5cW+daC16I1fU9Z/RcXAl+pCN2s5Ls6h7G
 1mQNfhAuII+/ydB7UWXL6SQ7/sDvDwNvedMLljwg6oYu/riSG2kYPZKW6O9qwL6T
 z9onPg5lEQMjWlbl6qzNo6OT3pAs37vssG0zVw2ZS8GjZ84edRzw2PGctJAFu2Hh
 sWqmtYGNKBjtjnxJ4zlRfeBkwpHbZGlLyOwzKoDlyQ8j9KZ8v8lsKyEpJK6/XJNG
 1rMJZV5twu+xvZUwf0zkg0tuxoT/3T3kbIHsFkEaJmQW7jrxTvdybW/rp6KHSbcb
 rzCpuZ5Ghh9qss+EeCv3k2nnWjysDP4kSwQMZ0zCedzvDTmga2TMw//MUwaM+i7M
 D7Raq4Qcs2updrFk2j9OyML2hi49KuPTtEu2OC7ObfxvBsZgSbTvyw1Vq/rsiDM5
 FZMV/giKRoCFpRpp7xF+db0zkBC2xDU9tGz196dzGtg7rvp6Z5401mS8fAr9H/LJ
 D2Wf2OXx3oss9v4rrO7N
 =rnN8
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Dave Chinner:
 "Fixes for low memory perforamnce regressions and a quota inode
  handling regression.

  These are regression fixes for issues recently introduced - the change
  in the stack switch location is fairly important, so I've held off
  sending this update until I was sure that it still addresses the stack
  usage problem the original solved.  So while the commits in the xfs
  tree are recent, it has been under tested for several weeks now"

* tag 'xfs-for-linus-3.16-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: null unused quota inodes when quota is on
  xfs: refine the allocation stack switch
  Revert "xfs: block allocation work needs to be kswapd aware"
2014-07-18 06:21:43 -10:00
Chuck Lever 3c45ddf823 svcrdma: Select NFSv4.1 backchannel transport based on forward channel
The current code always selects XPRT_TRANSPORT_BC_TCP for the back
channel, even when the forward channel was not TCP (eg, RDMA). When
a 4.1 mount is attempted with RDMA, the server panics in the TCP BC
code when trying to send CB_NULL.

Instead, construct the transport protocol number from the forward
channel transport or'd with XPRT_TRANSPORT_BC. Transports that do
not support bi-directional RPC will not have registered a "BC"
transport, causing create_backchannel_client() to fail immediately.

Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-18 11:35:45 -04:00
Fabian Frederick 27ff6a0f7f GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes
Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:15:14 +01:00
Geert Uytterhoeven 6b49d1d9c3 GFS2: memcontrol: Spelling s/invlidate/invalidate/
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: cluster-devel@redhat.com
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:14:31 +01:00
Bob Peterson 97a4f1d765 GFS2: Allow caching of glocks for flock
This patch removes the GLF_NOCACHE flag from the glocks associated with
flocks. There should be no good reason not to cache glocks for flocks:
they only force the glock to be demoted before they can be reacquired,
which can slow down performance and even cause glock hangs, especially
in cases where the flocks are held in Shared (SH) mode.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:14:12 +01:00
Bob Peterson 5bef3e7cf1 GFS2: Allow flocks to use normal glock dq rather than dq_wait
This patch allows flock glocks to use a non-blocking dequeue rather
than dq_wait. It also reverts the previous patch I had posted regarding
dq_wait. The reverted patch isn't necessarily a bad idea, but I decided
this might avoid unforeseen side effects, and was therefore safer.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:56 +01:00
Fabian Frederick 6ec43b1838 GFS2: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:38 +01:00
Steven Whitehouse fe0bbd2986 GFS2: Use GFP_NOFS when allocating glocks
Normally GFP_KERNEL is ok here, but there is now a rarely used code path
relating to deallocation of unlinked inodes (in certain corner cases)
which if hit at times of memory shortage can cause recursion while
trying to free memory.

One solution would be to try and move the gfs2_glock_get() call so
that it is no longer called while another glock is held, but that
doesn't look at all easy, so GFP_NOFS is the best solution for the
time being.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:13:12 +01:00
Steven Whitehouse 94a09a3999 GFS2: Fix race in glock lru glock disposal
We must not leave items on the LRU list with GLF_LOCK set, since
they can be removed if the glock is brought back into use, which
may then potentially result in a hang, waiting for GLF_LOCK to
clear.

It doesn't happen very often, since it requires a glock that has
not been used for a long time to be brought back into use at the
same moment that the shrinker is part way through disposing of
glocks.

The fix is to set GLF_LOCK at a later time, when we already know
that the other locks can be obtained. Also, we now only release
the lru_lock in case a resched is needed, rather than on every
iteration.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:12:51 +01:00
Bob Peterson 79272b3562 GFS2: Only wait for demote when last holder is dequeued
Function gfs2_glock_dq_wait is supposed to dequeue a glock and then
wait for the lock to be demoted. The problem is, if this is a shared
lock, its demote will depend on the other holders, which means you
might end up waiting forever because the other process is blocked.
This problem is especially apparent when dealing with nested flocks.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-07-18 11:12:14 +01:00
Cyrill Gorcunov 5442e9fbd7 timerfd: Implement timerfd_ioctl method to restore timerfd_ctx::ticks, v3
The read() of timerfd files allows to fetch the number of timer ticks
while there is no way to set it back from userspace.

To restore the timer's state as it was at checkpoint moment we need
a path to bring @ticks back. Initially I thought about writing ticks
back via write() interface but it seems such API is somehow obscure.

Instead implement timerfd_ioctl() method with TFD_IOC_SET_TICKS
command which allows to adjust @ticks into non-zero value waking
up the waiters.

I wrapped code with CONFIG_CHECKPOINT_RESTORE which can be
dropped off if there users except c/r camp appear.

v2 (by akpm@):
 - Use define timerfd_ioctl NULL for non c/r config

v3:
 - Use copy_from_user for @ticks fetching since
   not all arch support get_user for 8 byte argument

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christopher Covington <cov@codeaurora.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20140715215703.285617923@openvz.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-18 11:49:57 +02:00
Cyrill Gorcunov af9c4957cf timerfd: Implement show_fdinfo method
For checkpoint/restore of timerfd files we need to know how exactly
the timer were armed, to be able to recreate it on restore stage.
Thus implement show_fdinfo method which provides enough information
for that.

One of significant changes I think is the addition of @settime_flags
member. Currently there are two flags TFD_TIMER_ABSTIME and
TFD_TIMER_CANCEL_ON_SET, and the second can be found from
@might_cancel variable but in case if the flags will be extended
in future we most probably will have to somehow remember them
explicitly anyway so I guss doing that right now won't hurt.

To not bloat the timerfd_ctx structure I've converted @expired
to short integer and defined @settime_flags as short too.

v2 (by avagin@, vdavydov@ and tglx@):

 - Add it_value/it_interval fields
 - Save flags being used in timerfd_setup in context

v3 (by tglx@):
 - don't forget to use CONFIG_PROC_FS

v4 (by akpm@):
 -Use define timerfd_show NULL for non c/r config

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20140715215703.114365649@openvz.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-18 11:49:57 +02:00
J. Bruce Fields 5d6031ca74 nfsd4: zero op arguments beyond the 8th compound op
The first 8 ops of the compound are zeroed since they're a part of the
argument that's zeroed by the

	memset(rqstp->rq_argp, 0, procp->pc_argsize);

in svc_process_common().  But we handle larger compounds by allocating
the memory on the fly in nfsd4_decode_compound().  Other than code
recently fixed by 01529e3f81 "NFSD: Fix memory leak in encoding denied
lock", I don't know of any examples of code depending on this
initialization. But it definitely seems possible, and I'd rather be
safe.

Compounds this long are unusual so I'm much more worried about failure
in this poorly tested cases than about an insignificant performance hit.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-17 16:20:39 -04:00
Jeff Layton ae4b884fc6 nfsd: silence sparse warning about accessing credentials
sparse says:

    fs/nfsd/auth.c:31:38: warning: incorrect type in argument 1 (different address spaces)
    fs/nfsd/auth.c:31:38:    expected struct cred const *cred
    fs/nfsd/auth.c:31:38:    got struct cred const [noderef] <asn:4>*real_cred

Add a new accessor for the ->real_cred and use that to fetch the
pointer. Accessing current->real_cred directly is actually quite safe
since we know that they can't go away so this is mostly a cosmetic fixup
to silence sparse.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-17 16:15:35 -04:00
David Howells 0c7774abb4 KEYS: Allow special keys (eg. DNS results) to be invalidated by CAP_SYS_ADMIN
Special kernel keys, such as those used to hold DNS results for AFS, CIFS and
NFS and those used to hold idmapper results for NFS, used to be
'invalidateable' with key_revoke().  However, since the default permissions for
keys were reduced:

	Commit: 96b5c8fea6
	KEYS: Reduce initial permissions on keys

it has become impossible to do this.

Add a key flag (KEY_FLAG_ROOT_CAN_INVAL) that will permit a key to be
invalidated by root.  This should not be used for system keyrings as the
garbage collector will try and remove any invalidate key.  For system keyrings,
KEY_FLAG_ROOT_CAN_CLEAR can be used instead.

After this, from userspace, keyctl_invalidate() and "keyctl invalidate" can be
used by any possessor of CAP_SYS_ADMIN (typically root) to invalidate DNS and
idmapper keys.  Invalidated keys are immediately garbage collected and will be
immediately rerequested if needed again.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Steve Dickson <steved@redhat.com>
2014-07-17 20:45:08 +01:00
Trond Myklebust b0fc29d6fc nfsd: Ensure stateids remain unique until they are freed
Add an extra delegation state to allow the stateid to remain in the idr
tree until the last reference has been released. This will be necessary
to ensure uniqueness once the client_mutex is removed.

[jlayton: reset the sc_type under the state_lock in unhash_delegation]

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-16 21:39:51 -04:00
Jeff Layton d564fbec7a nfsd: nfs4_alloc_init_lease should take a nfs4_file arg
No need to pass the delegation pointer in here as it's only used to get
the nfs4_file pointer.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-16 21:35:25 -04:00
Jeff Layton 02e1215f9f nfsd: Avoid taking state_lock while holding inode lock in nfsd_break_one_deleg
state_lock is a heavily contended global lock. We don't want to grab
that while simultaneously holding the inode->i_lock.

Add a new per-nfs4_file lock that we can use to protect the
per-nfs4_file delegation list. Hold that while walking the list in the
break_deleg callback and queue the workqueue job for each one.

The workqueue job can then take the state_lock and do the list
manipulations without the i_lock being held prior to starting the
rpc call.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-16 21:06:12 -04:00
Jeff Layton e8051c837b nfsd: eliminate nfsd4_init_callback
It's just an obfuscated INIT_WORK call. Just make the work_func_t a
non-static symbol and use a normal INIT_WORK call.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-16 14:18:58 -04:00
NeilBrown c1221321b7 sched: Allow wait_on_bit_action() functions to support a timeout
It is currently not possible for various wait_on_bit functions
to implement a timeout.

While the "action" function that is called to do the waiting
could certainly use schedule_timeout(), there is no way to carry
forward the remaining timeout after a false wake-up.
As false-wakeups a clearly possible at least due to possible
hash collisions in bit_waitqueue(), this is a real problem.

The 'action' function is currently passed a pointer to the word
containing the bit being waited on.  No current action functions
use this pointer.  So changing it to something else will be a
little noisy but will have no immediate effect.

This patch changes the 'action' function to take a pointer to
the "struct wait_bit_key", which contains a pointer to the word
containing the bit so nothing is really lost.

It also adds a 'private' field to "struct wait_bit_key", which
is initialized to zero.

An action function can now implement a timeout with something
like

static int timed_out_waiter(struct wait_bit_key *key)
{
	unsigned long waited;
	if (key->private == 0) {
		key->private = jiffies;
		if (key->private == 0)
			key->private -= 1;
	}
	waited = jiffies - key->private;
	if (waited > 10 * HZ)
		return -EAGAIN;
	schedule_timeout(waited - 10 * HZ);
	return 0;
}

If any other need for context in a waiter were found it would be
easy to use ->private for some other purpose, or even extend
"struct wait_bit_key".

My particular need is to support timeouts in nfs_release_page()
to avoid deadlocks with loopback mounted NFS.

While wait_on_bit_timeout() would be a cleaner interface, it
will not meet my need.  I need the timeout to be sensitive to
the state of the connection with the server, which could change.
 So I need to use an 'action' interface.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051604.28027.41257.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:41 +02:00
NeilBrown 743162013d sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().

So:
 Rename wait_on_bit and        wait_on_bit_lock to
        wait_on_bit_action and wait_on_bit_lock_action
 to make it explicit that they need an action function.

 Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
 which are *not* given an action function but implicitly use
 a standard one.
 The decision to error-out if a signal is pending is now made
 based on the 'mode' argument rather than being encoded in the action
 function.

 All instances of the old wait_on_bit and wait_on_bit_lock which
 can use the new version have been changed accordingly and their
 action functions have been discarded.
 wait_on_bit{_lock} does not return any specific error code in the
 event of a signal so the caller must check for non-zero and
 interpolate their own error code as appropriate.

The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"

The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.

A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack.  So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).

Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS.  CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-16 15:10:39 +02:00
Linus Torvalds c20ddc6499 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota fix from Jan Kara:
 "Fix locking of dquot shrinker"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: missing lock in dqcache_shrink_scan()
2014-07-15 17:47:42 -10:00
Chao Yu f1121ab0ba f2fs: reduce searching region of segmap when free section
In __set_test_and_free we will check whether all segment are free in one section
When free one segment, in order to set section to free status.
But the searching region of segmap is from start segno to last segno of f2fs,
it's not necessary. So let's just only check all segment bitmap of target
section.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-15 13:56:49 -07:00
Chao Yu 3f1be4f9c9 udf: avoid redundant memcpy when writing data in ICB
Valid data within i_size in page cache will be copied to ICB cache when we
writeback the page by invoking udf_adinicb_writepage, so the copy in
udf_adinicb_write_end is redundant.

After we remove the copy, it's better to use simple_write_end directly in
udf_adinicb_aops instead of udf_adinicb_write_end.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:24 +02:00
Andy Shevchenko c7ff48212d fs/udf: re-use hex_asc_upper_{hi,lo} macros
This patch cleans up udf_translate_to_linux() a bit by using globally defined
macros instead of custom code.

We can use sprintf(buf, "%04X", ...) there as well, but this one faster.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:24 +02:00
Fabian Frederick 2c15ac5bdb fs/quota: kernel-doc warning fixes
type and id were removed and qid added to quota_send_warning in commit

431f19744d
("userns: Convert quota netlink aka quota_send_warning")

Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:23 +02:00
Fabian Frederick e973606cc2 udf: use linux/uaccess.h
Fix checkpatch warning
WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:23 +02:00
Himangi Saraogi e625b310ed fs/ext2/super.c: Drop memory allocation cast
Drop cast on the result of kmem_cache_alloc.

The semantic patch that makes this change is as follows:

// <smpl>
@@
type T;
@@

- (T *)
  (\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
   kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
// </smpl>

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:22 +02:00
Niu Yawei b9ba6f94b2 quota: remove dqptr_sem
Remove dqptr_sem to make quota code scalable: Remove the dqptr_sem,
accessing inode->i_dquot now protected by dquot_srcu, and changing
inode->i_dquot is now serialized by dq_data_lock.

Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:22 +02:00
Niu Yawei 9eb6463f31 quota: simplify remove_inode_dquot_ref()
Simplify the remove_inode_dquot_ref() to make it more obvious
that now we keep one reference for each dquot from inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:21 +02:00
Niu Yawei 1ea06bec78 quota: avoid unnecessary dqget()/dqput() calls
Avoid unnecessary dqget()/dqput() calls in __dquot_initialize(),
that will introduce global lock contention otherwise.

Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:21 +02:00
Niu Yawei 606cdcca04 quota: protect Q_GETFMT by dqonoff_mutex
dqptr_sem will go away. Protect Q_GETFMT quotactl by
dqonoff_mutex instead. This is also enough to make sure
quota info will not go away while we are looking at it.

Signed-off-by: Lai Siyao <lai.siyao@intel.com>
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:40:20 +02:00
Niu Yawei d68aab6b8f quota: missing lock in dqcache_shrink_scan()
Commit 1ab6c4997e (fs: convert fs shrinkers to new scan/count API)
accidentally removed locking from quota shrinker. Fix it -
dqcache_shrink_scan() should use dq_list_lock to protect the
scan on free_dquots list.

CC: stable@vger.kernel.org
Fixes: 1ab6c4997e
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2014-07-15 22:36:18 +02:00
Linus Torvalds 0b632204c7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
 "This contains miscellaneous fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: replace count*size kzalloc by kcalloc
  fuse: release temporary page if fuse_writepage_locked() failed
  fuse: restructure ->rename2()
  fuse: avoid scheduling while atomic
  fuse: handle large user and group ID
  fuse: inode: drop cast
  fuse: ignore entry-timeout on LOOKUP_REVAL
  fuse: timeout comparison fix
2014-07-15 08:57:17 -07:00
Zheng Liu 83447ccb4d ext4: make ext4_has_inline_data() as a inline function
Now ext4_has_inline_data() is used in wide spread codepaths.  So we need
to make it as a inline function to avoid burning some CPU cycles.

Change in text size:

         text     data      bss     dec     hex filename
before: 326110    19258    5528  350896   55ab0 fs/ext4/ext4.o
after:  326227    19258    5528  351013   55b25 fs/ext4/ext4.o

I use the following script to measure the CPU usage.

  #!/bin/bash

  shm_base='/dev/shm'
  img=${shm_base}/ext4-img
  mnt=/mnt/loop

  e2fsprgs_base=$HOME/e2fsprogs
  mkfs=${e2fsprgs_base}/misc/mke2fs
  fsck=${e2fsprgs_base}/e2fsck/e2fsck

  sudo umount $mnt
  dd if=/dev/zero of=$img bs=4k count=3145728
  ${mkfs} -t ext4 -O inline_data -F $img
  sudo mount -t ext4 -o loop $img $mnt

  # start testing...
  testdir="${mnt}/testdir"
  mkdir $testdir
  cd $testdir

  echo "start testing..."
  for ((cnt=0;cnt<100;cnt++)); do

  for ((i=0;i<5;i++)); do
  	for ((j=0;j<5;j++)); do
  		for ((k=0;k<5;k++)); do
  			for ((l=0;l<5;l++)); do
  				mkdir -p $i/$j/$k/$l
  				echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
  			done
  		done
  	done
  done

  ls -R $testdir > /dev/null
  rm -rf $testdir/*

  done

The result of `perf top -G -U` is as below.

vanilla:
 13.92%  [ext4]  [k] ext4_do_update_inode
  9.36%  [ext4]  [k] __ext4_get_inode_loc
  4.07%  [ext4]  [k] ftrace_define_fields_ext4_writepages
  3.83%  [ext4]  [k] __ext4_handle_dirty_metadata
  3.42%  [ext4]  [k] ext4_get_inode_flags
  2.71%  [ext4]  [k] ext4_mark_iloc_dirty
  2.46%  [ext4]  [k] ftrace_define_fields_ext4_direct_IO_enter
  2.26%  [ext4]  [k] ext4_get_inode_loc
  2.22%  [ext4]  [k] ext4_has_inline_data
  [...]

After applied the patch, we don't see ext4_has_inline_data() because it
has been inlined and perf couldn't sample it.  Although it doesn't mean
that the CPU cycles can be saved but at least the overhead of function
calls can be eliminated.  So IMHO we'd better inline this function.

Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 10:10:04 -04:00
Zhang Zhen 590a141863 ext4: remove readpage() check in ext4_mmap_file()
There is no kind of file which does not supply a page reading function.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 09:56:19 -04:00
Lukas Czerner 4f579ae7de ext4: fix punch hole on files with indirect mapping
Currently punch hole code on files with direct/indirect mapping has some
problems which may lead to a data loss. For example (from Jan Kara):

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096.

Also the code is a bit weird and it's not using infrastructure provided
by indirect.c, but rather creating it's own way.

This patch fixes the issues as well as making the operation to run 4
times faster from my testing (punching out 60GB file). It uses similar
approach used in ext4_ind_truncate() which takes advantage of
ext4_free_branches() function.

Also rename the ext4_free_hole_blocks() to something more sensible, like
the equivalent we have for extent mapped files. Call it
ext4_ind_remove_space().

This has been tested mostly with fsx and some xfstests which are testing
punch hole but does not require unwritten extents which are not
supported with direct/indirect mapping. Not problems showed up even with
1024k block size.

CC: stable@vger.kernel.org
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:03:38 -04:00
Theodore Ts'o 71d4f7d032 ext4: remove metadata reservation checks
Commit 27dd438542 ("ext4: introduce reserved space") reserves 2% of
the file system space to make sure metadata allocations will always
succeed.  Given that, tracking the reservation of metadata blocks is
no longer necessary.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-15 06:02:38 -04:00
Theodore Ts'o d5e03cbb0c ext4: rearrange initialization to fix EXT4FS_DEBUG
The EXT4FS_DEBUG is a *very* developer specific #ifdef designed for
ext4 developers only.  (You have to modify fs/ext4/ext4.h to enable
it.)

Rearrange how we initialize data structures to avoid calling
ext4_count_free_clusters() until the multiblock allocator has been
initialized.

This also allows us to only call ext4_count_free_clusters() once, and
simplifies the code somewhat.

(Thanks to Chen Gang <gang.chen.5i5j@gmail.com> for pointing out a
!CONFIG_SMP compile breakage in the original patch.)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-15 06:01:38 -04:00
Brian Foster 80d6d69821 xfs: add log attributes for log lsn and grant head data
Create log attributes to export the current runtime state of the log to
sysfs. Note that the filesystem should be frozen for consistency across
attributes.

The following per-mount attributes are created: log_head_lsn,
log_tail_lsn, reserve_grant_head and write_grant_head. These represent
the physical log head, tail and reserve and write grant heads
respectively. Attribute values are exported in the following format:

	"cycle:[block,byte]"

... where cycle represents the log cycle and [block,bytes] represents
either the basic block or byte offset of the log, depending on the
attribute.  Log sequence number (LSN) values are encoded in basic blocks
and grant heads are encoded in bytes. All values are in decimal format.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:48 +10:00
Brian Foster baff4e44b9 xfs: add xlog sysfs kobject and attribute handlers
Embed a kobject into the xfs log data structure (xlog). This creates a
'log' subdirectory for every XFS mount instance in sysfs. The lifecycle
of the log kobject is tied to the lifecycle of the log.

Also define a set of generic attribute handlers associated with the log
kobject in preparation for the addition of attributes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:29 +10:00
Brian Foster a31b1d3d89 xfs: add xfs_mount sysfs kobject
Embed a base kobject into xfs_mount. This creates a kobject associated
with each XFS mount and a subdirectory in sysfs with the name of the
filesystem. The subdirectory lifecycle matches that of the mount. Also
add the new xfs_sysfs.[c,h] source files with some XFS sysfs
infrastructure to facilitate attribute creation.

Note that there are currently no attributes exported as part of the
xfs_mount kobject. It exists solely to serve as a per-mount container
for child objects.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:01 +10:00
Brian Foster 3d8712265c xfs: add a sysfs kset
Create a sysfs kset to contain all sub-objects associated with the XFS
module. The kset is created and removed on module initialization and
removal respectively. The kset uses fs_obj as a parent. This leads to
the creation of a /sys/fs/xfs directory when the kset exists.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:41:37 +10:00
Brian Foster a70a4fa528 xfs: fix a couple error sequence jumps in xfs_mountfs()
xfs_mountfs() has a couple failure conditions that do not jump to the
correct labels. Specifically:

- xfs_initialize_perag_data() failure does not deallocate the log even
  though it occurs after log initialization
- xfs_mount_reset_sbqflags() failure returns the error directly rather
  than jump to the error sequence

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:41:25 +10:00
Dave Chinner 7f8a058f6d Merge branch 'xfs-libxfs-restructure' into for-next 2014-07-15 07:37:18 +10:00
Dave Chinner 03e01349c6 xfs: null unused quota inodes when quota is on
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broken this rule for non-project quota inode enabled
filesystem, as the code now refuses to write the group quota inode
if neither group or project quotas are enabled. This regression was
introduced by commit d892d58 ("xfs: Start using pquotaino from the
superblock").

In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.

Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:

[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount

By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:28:41 +10:00
Dave Chinner cf11da9c5d xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served it's
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows.  Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performacne overhead caused by switching stacks. The worse case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.

The amount of stack being used is much less than I've previously be
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:24 +10:00
Dave Chinner aa182e64f1 Revert "xfs: block allocation work needs to be kswapd aware"
This reverts commit 1f6d64829d.

This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocation could dip into kswapd specific memory reserves and
avoid being throttled. This cause a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:10 +10:00
Fabian Frederick 04ec5f5c00 ecryptfs: remove unnecessary break after goto
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: ecryptfs@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-07-14 14:29:11 -05:00
Benjamin LaHaise 6e830d5371 Merge ../aio-fixes 2014-07-14 13:14:27 -04:00
Benjamin LaHaise 263782c1c9 aio: protect reqs_available updates from changes in interrupt handlers
As of commit f8567a3845 it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
lead to aio_complete() corrupting the available io requests count when run
under a heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org
2014-07-14 13:05:26 -04:00
Fabian Frederick f2b3455e47 fuse: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:30:25 +02:00
Maxim Patlasov 27f1b36326 fuse: release temporary page if fuse_writepage_locked() failed
tmp_page to be freed if fuse_write_file_get() returns NULL.

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-14 16:17:57 +02:00
Yan, Zheng 51da8e8c6f ceph: reset r_resend_mds after receiving -ESTALE
this makes __choose_mds() choose mds according caps

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-14 10:49:15 +08:00
Christoph Hellwig 73a8f5f7e6 locks: purge fl_owner_t from fs/locks.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-07-13 21:39:07 -04:00
Linus Torvalds 18b34d9a7a More bug fixes for ext4 -- most importantly, a fix for a bug
(introduced in 3.15) that can end up triggering a file system
 corruption error after a journal replay.  (It shouldn't lead to any
 actual data corruption, but it is scary and can force file systems to
 be remounted read-only, etc.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJTwuY+AAoJENNvdpvBGATwgN8QAJ2/S5GFxQwHbglHmayXYuMQ
 fU411FwJ1wbqjYyYb+jyBoYcsgpsCKPTqA2JbPlHsFTm2Ec+BPzsybhtYw5ybdeW
 1qAfPTSgNxYXroNwpaqOamxgfXgOaV4iqwvZ4tYcLcrtPq0MOcC5rlSaKMdJuSA1
 6M2/8PijOTndUVJpS/GhSMdKlTAXjtfv9V6t/pfLuoo7cNadlggpJnwC8Qm9DNAA
 5ETVZK44q2+2YvGwrvY6LBb9BVBpL29YbWPNqqw/OXXY++ZFhBJV07osZO38MpsB
 QzUyfRaMTgm9/BdbkG8uxA7Zk6C0YBl5eC4aU79LWGWjGO225CLj95LoBOVjQw9f
 eh+RFGapwVvtyzScDF/a9pH6UwGco/s4kCq8rLr2ztljlO595N3LUwhQBHtiSGtm
 fr65NRDyJMXbqy8yLGrlOnP/4ll2VfTH+el2+tzr5smoTD29EASM155hKDDUOAG0
 TrDHtNrxG1MIROHjp+HSui424Op7NXTnfjwmuKzo+mGpPOcPclPSmAacFJpRGVBE
 220hnk+LrBf525nJzQYHifdCL+JAqbWv/S4YSRGizgppK3DlO/gYcu1zpWb0WWuo
 0VuvxUZDSIZY1aVpMEOQov74WtovB7YyG8RPHl7h2m5dJuLLFgJmLDMDTJR1LLNT
 +tHNJ6jERLQz9wqTvquh
 =OX7Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "More bug fixes for ext4 -- most importantly, a fix for a bug
  introduced in 3.15 that can end up triggering a file system corruption
  error after a journal replay.

  It shouldn't lead to any actual data corruption, but it is scary and
  can force file systems to be remounted read-only, etc"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix potential null pointer dereference in ext4_free_inode
  ext4: fix a potential deadlock in __ext4_es_shrink()
  ext4: revert commit which was causing fs corruption after journal replays
  ext4: disable synchronous transaction batching if max_batch_time==0
  ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
  ext4: clarify error count warning messages
  ext4: fix unjournalled bg descriptor while initializing inode bitmap
2014-07-13 13:14:55 -07:00
Trond Myklebust e655f945cd Merge branch 'bugfixes' into linux-next
* bugfixes:
  NFS: Don't reset pg_moreio in __nfs_pageio_add_request
  NFS: Remove 2 unused variables
  nfs: handle multiple reqs in nfs_wb_page_cancel
  nfs: handle multiple reqs in nfs_page_async_flush
  nfs: change find_request to find_head_request
  nfs: nfs_page should take a ref on the head req
  nfs: mark nfs_page reqs with flag for extra ref
  nfs: only show Posix ACLs in listxattr if actually present

Conflicts:
	fs/nfs/write.c
2014-07-13 15:22:02 -04:00
Trond Myklebust f563b89b18 NFS: Don't reset pg_moreio in __nfs_pageio_add_request
Once we've started sending unstable NFS writes, we do not want to
clear pg_moreio, or we may end up sending the very last request as
a stable write if the commit lists are still empty.

Do, however, reset pg_moreio in the case where we end up having to
recoalesce the write if an attempt to use pNFS failed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-13 15:18:44 -04:00
Fabian Frederick 002160269f NFS: use ARRAY_SIZE instead of sizeof/sizeof[0]
Use macro definition

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:58 -04:00
Himangi Saraogi 8ee2b78a44 NFSv4: Drop cast
This patch does away with the cast on void * as it is unnecessary.

The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:47 -04:00
Fabian Frederick 57b696fb1b fs/nfs_common/nfsacl.c: move EXPORT symbol after functions
Fix checkpatch warnings:

"WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable"

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:43:42 -04:00
Jeff Layton f11b2a1cfb nfs4: copy acceptor name from context to nfs_client
The current CB_COMPOUND handling code tries to compare the principal
name of the request with the cl_hostname in the client. This is not
guaranteed to ever work, particularly if the client happened to mount
a CNAME of the server or a non-fqdn.

Fix this by instead comparing the cr_principal string with the acceptor
name that we get from gssd. In the event that gssd didn't send one
down (i.e. it was too old), then we fall back to trying to use the
cl_hostname as we do today.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:41:25 -04:00
Jeff Layton f1cdae87fc nfs4: turn free_lock_state into a void return operation
Nothing checks its return value.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:37 -04:00
Jeff Layton 49a4bda22e nfs4: queue free_lock_state job submission to nfsiod
We got a report of the following warning in Fedora:

BUG: sleeping function called from invalid context at mm/slub.c:969
in_atomic(): 1, irqs_disabled(): 0, pid: 533, name: bash
3 locks held by bash/533:
 #0:  (&sp->so_delegreturn_mutex){+.+...}, at: [<ffffffffa033da62>] nfs4_proc_lock+0x262/0x910 [nfsv4]
 #1:  (&nfsi->rwsem){.+.+.+}, at: [<ffffffffa033da6a>] nfs4_proc_lock+0x26a/0x910 [nfsv4]
 #2:  (&sb->s_type->i_lock_key#23){+.+...}, at: [<ffffffff812998dc>] flock_lock_file_wait+0x8c/0x3a0
CPU: 0 PID: 533 Comm: bash Not tainted 3.15.0-0.rc1.git1.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 00000000d664ff3c ffff880078b69a70 ffffffff817e82e0
 0000000000000000 ffff880078b69a98 ffffffff810cf1a4 0000000000000050
 0000000000000050 ffff88007cc01a00 ffff880078b69ad8 ffffffff8121449e
Call Trace:
 [<ffffffff817e82e0>] dump_stack+0x4d/0x66
 [<ffffffff810cf1a4>] __might_sleep+0x184/0x240
 [<ffffffff8121449e>] kmem_cache_alloc_trace+0x4e/0x330
 [<ffffffffa0331124>] ? nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0331124>] nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0352340>] nfs4_put_lock_state+0x90/0xb0 [nfsv4]
 [<ffffffffa0352375>] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
 [<ffffffff81297515>] locks_free_lock+0x45/0x90
 [<ffffffff8129996c>] flock_lock_file_wait+0x11c/0x3a0
 [<ffffffffa033da6a>] ? nfs4_proc_lock+0x26a/0x910 [nfsv4]
 [<ffffffffa033301e>] do_vfs_lock+0x1e/0x30 [nfsv4]
 [<ffffffffa033da79>] nfs4_proc_lock+0x279/0x910 [nfsv4]
 [<ffffffff810dbb26>] ? local_clock+0x16/0x30
 [<ffffffff810f5a3f>] ? lock_release_holdtime.part.28+0xf/0x200
 [<ffffffffa02f820c>] do_unlk+0x8c/0xc0 [nfs]
 [<ffffffffa02f85c5>] nfs_flock+0xa5/0xf0 [nfs]
 [<ffffffff8129a6f6>] locks_remove_file+0xb6/0x1e0
 [<ffffffff812159d8>] ? kfree+0xd8/0x2d0
 [<ffffffff8123bc63>] __fput+0xd3/0x210
 [<ffffffff8123bdee>] ____fput+0xe/0x10
 [<ffffffff810bfb6d>] task_work_run+0xcd/0xf0
 [<ffffffff81019cd1>] do_notify_resume+0x61/0x90
 [<ffffffff817fbea2>] int_signal+0x12/0x17

The problem is that NFSv4 is trying to do an allocation from
fl_release_private (in order to send a RELEASE_LOCKOWNER call). That
function can be called while holding the inode->i_lock, and it's
currently set up to do __GFP_WAIT allocations. v4.1 code has a
similar problem.

This patch adds a work_struct to the nfs4_lock_state and has the code
queue the free_lock_state operation to nfsiod.

Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:35 -04:00
Jeff Layton 8003d3c4aa nfs4: treat lock owners as opaque values
Do the following set of ops with a file on a NFSv4 mount:

    exec 3>>/file/on/nfsv4
    flock -x 3
    exec 3>&-

You'll see the LOCK request go across the wire, but no LOCKU when the
file is closed.

What happens is that the fd is passed across a fork, and the final close
is done in a different process than the opener. That makes
__nfs4_find_lock_state miss finding the correct lock state because it
uses the fl_pid as a search key. A new one is created, and the locking
code treats it as a delegation stateid (because NFS_LOCK_INITIALIZED
isn't set).

The root cause of this breakage seems to be commit 77041ed9b4
(NFSv4: Ensure the lockowners are labelled using the fl_owner and/or
fl_pid).

That changed it so that flock lockowners are allocated based on the
fl_pid. I think this is incorrect. flock locks should be "owned" by the
struct file, and that is already accounted for in the fl_owner field of
the lock request when it comes through nfs_flock.

This patch basically reverts the above commit and with it, a LOCKU is
sent in the above reproducer.

Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:36:31 -04:00
Peng Tao 039b756a2d nfs41: layout return on close in delegation return
If file is not opened by anyone, we do layout return on close
in delegation return.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:23:17 -04:00
Peng Tao fe08c54691 nfs41: return layout on last close
If client has valid delegation, do not return layout on close at all.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:23:04 -04:00
Peng Tao 15bb3afe90 nfs4: add nfs4_check_delegation
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:58 -04:00
Peng Tao 0b0bc6ea77 pnfs/filelayout: retry ds commit if nfs_commitdata_alloc fails
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:45 -04:00
Peng Tao c8a3292d24 pnfs/filelayout: fix race between mark_request_commit and scan_commit_lists
We need to hold cinfo lock while setting bucket->wlseg and adding req to nwritten
list at the same time. Otherwise there might be a window where nwritten list
is empty yet we set bucket->wlseg, in which case ff_layout_scan_ds_commit_list()
may end up clearing bucket->wlseg incorrectly, casuing client to oops later on.

This was found when testing flexfile layout but filelayout has the same problem.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:22:41 -04:00
Trond Myklebust f3792d63d2 NFSv4: Fix OPEN w/create access mode checking
POSIX states that open("foo", O_CREAT|O_RDONLY, 000) should succeed if
the file "foo" does not already exist. With the current NFS client,
it will fail with an EACCES error because of the permissions checks in
nfs4_opendata_access().

Fix is to turn that test off if the server says that we created the file.

Reported-by: "Frank S. Filz" <ffilzlnx@mindspring.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 18:20:55 -04:00
Trond Myklebust aafe37504c NFS: Remove 2 unused variables
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:57 -04:00
Weston Andros Adamson 3e2170451e nfs: handle multiple reqs in nfs_wb_page_cancel
Use nfs_lock_and_join_requests to merge all subrequests into the head request -
this cancels and dereferences all subrequests.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:47 -04:00
Weston Andros Adamson d458138353 nfs: handle multiple reqs in nfs_page_async_flush
Change nfs_find_and_lock_request so nfs_page_async_flush can handle multiple
requests in a page. There is only one request for a page the first time
nfs_page_async_flush is called, but if a write or commit fails, async_flush
is called again and there may be multiple requests associated with the page.
The solution is to merge all the requests in a page group into a single
request before calling nfs_pageio_add_request.

Rename nfs_find_and_lock_request to nfs_lock_and_join_requests and
change it to first lock all requests for the page, then cancel and merge
all subrequests into the head request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 17:35:46 -04:00
Weston Andros Adamson 84d3a9a913 nfs: change find_request to find_head_request
nfs_page_find_request_locked* should find the head request for that page.
Rename the functions and add comments to make this clear, and fix a bug
that could return a subrequest when page_private isn't set on the page.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson 85710a837c nfs: nfs_page should take a ref on the head req
nfs_pages that aren't the the head of a group must take a reference on the
head as long as ->wb_head is set to it. This stops the head from hitting
a refcount of 0 while there is still an active nfs_page for the page group.

This avoids kref warnings in the writeback code when the page group head
is found and referenced.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Weston Andros Adamson 17089a29a2 nfs: mark nfs_page reqs with flag for extra ref
Change the use of PG_INODE_REF - set it when taking extra reference on
subrequests and take care to only release once for each request.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-12 16:51:41 -04:00
Namjae Jeon bf40c92635 ext4: fix potential null pointer dereference in ext4_free_inode
Fix potential null pointer dereferencing problem caused by e43bb4e612
("ext4: decrement free clusters/inodes counters when block group declared bad")

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2014-07-12 16:11:42 -04:00
Theodore Ts'o 3f1f9b8513 ext4: fix a potential deadlock in __ext4_es_shrink()
This fixes the following lockdep complaint:

[ INFO: possible circular locking dependency detected ]
3.16.0-rc2-mm1+ #7 Tainted: G           O  
-------------------------------------------------------
kworker/u24:0/4356 is trying to acquire lock:
 (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0

but task is already holding lock:
 (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

which lock already depends on the new lock.

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ei->i_es_lock);
                               lock(&(&sbi->s_es_lru_lock)->rlock);
                               lock(&ei->i_es_lock);
  lock(&(&sbi->s_es_lru_lock)->rlock);

 *** DEADLOCK ***

6 locks held by kworker/u24:0/4356:
 #0:  ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560
 #2:  (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90
 #3:  (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0
 #4:  (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550
 #5:  (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180

stack backtrace:
CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G           O   3.16.0-rc2-mm1+ #7
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: writeback bdi_writeback_workfn (flush-253:0)
 ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007
 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568
 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930
Call Trace:
 [<ffffffff815df0bb>] dump_stack+0x4e/0x68
 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c
 [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00
 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe
 [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff810a8707>] lock_acquire+0x87/0x120
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70
 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50
 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0
 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0
 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180
 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550
 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00
	...

Reported-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: stable@vger.kernel.org
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
2014-07-12 15:32:24 -04:00
Linus Torvalds bae78dc259 Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
 "Another xdr encoding regression that may cause incorrect encoding on
  failures of certain readdirs"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix bad reserving space for encoding rdattr_error
2014-07-11 15:10:04 -07:00
Gu Zheng 4b2868aa4f f2fs: remove the unused stat_lock
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-11 15:01:48 -07:00
Gu Zheng 7a6c76b1b2 f2fs: cleanup the needless return of f2fs_create_root_stats
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-11 15:01:47 -07:00
Kinglong Mee d5d5c304b1 NFSD: Fix bad checking of space for padding in splice read
Note that the caller has already reserved space for count and eof, so
xdr->p has already moved past them, only the padding remains.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Fixes dc97618ddd (nfsd4: separate splice and readv cases)
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 15:19:25 -04:00
Kinglong Mee 35e634b83c NFSD: Check acl returned from get_acl/posix_acl_from_mode
Commit 4ac7249ea5 (nfsd: use get_acl and ->set_acl)
don't check the acl returned from get_acl()/posix_acl_from_mode().

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 15:03:53 -04:00
Theodore Ts'o f9ae9cf5d7 ext4: revert commit which was causing fs corruption after journal replays
Commit 007649375f ("ext4: initialize multi-block allocator before
checking block descriptors") causes the block group descriptor's count
of the number of free blocks to become inconsistent with the number of
free blocks in the allocation bitmap.  This is a harmless form of fs
corruption, but it causes the kernel to potentially remount the file
system read-only, or to panic, depending on the file systems's error
behavior.

Thanks to Eric Whitney for his tireless work to reproduce and to find
the guilty commit.

Fixes: 007649375f ("ext4: initialize multi-block allocator before checking block descriptors"

Cc: stable@vger.kernel.org  # 3.15
Reported-by: David Jander <david@protonic.nl>
Reported-by: Matteo Croce <technoboy85@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Suggested-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-07-11 13:55:40 -04:00
Jeff Layton a46cb7f287 nfsd: cleanup and rename nfs4_check_open
Rename it to better describe what it does, and have it just return the
stateid instead of a __be32 (which is now always nfs_ok). Also, do the
search for an existing stateid after the delegation check, to reduce
cleanup if the delegation check returns error.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:38 -04:00
Jeff Layton baeb4ff0e5 nfsd: make deny mode enforcement more efficient and close races in it
The current enforcement of deny modes is both inefficient and scattered
across several places, which makes it hard to guarantee atomicity. The
inefficiency is a problem now, and the lack of atomicity will mean races
once the client_mutex is removed.

First, we address the inefficiency. We have to track deny modes on a
per-stateid basis to ensure that open downgrades are sane, but when the
server goes to enforce them it has to walk the entire list of stateids
and check against each one.

Instead of doing that, maintain a per-nfs4_file deny mode. When a file
is opened, we simply set any deny bits in that mode that were specified
in the OPEN call. We can then use that unified deny mode to do a simple
check to see whether there are any conflicts without needing to walk the
entire stateid list.

The only time we'll need to walk the entire list of stateids is when a
stateid that has a deny mode on it is being released, or one is having
its deny mode downgraded. In that case, we must walk the entire list and
recalculate the fi_share_deny field. Since deny modes are pretty rare
today, this should be very rare under normal workloads.

To address the potential for races once the client_mutex is removed,
protect fi_share_deny with the fi_lock. In nfs4_get_vfs_file, check to
make sure that any deny mode we want to apply won't conflict with
existing access. If that's ok, then have nfs4_file_get_access check that
new access to the file won't conflict with existing deny modes.

If that also passes, then get file access references, set the correct
access and deny bits in the stateid, and update the fi_share_deny field.
If opening the file or truncating it fails, then unwind the whole mess
and return the appropriate error.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:32 -04:00
Jeff Layton 7214e8600e nfsd: always hold the fi_lock when bumping fi_access refcounts
Once we remove the client_mutex, there's an unlikely but possible race
that could occur. It will be possible for nfs4_file_put_access to race
with nfs4_file_get_access. The refcount will go to zero (briefly) and
then bumped back to one. If that happens we set ourselves up for a
use-after-free and the potential for a lock to race onto the i_flock
list as a filp is being torn down.

Ensure that we can safely bump the refcount on the file by holding the
fi_lock whenever that's done. The only place it currently isn't is in
get_lock_access.

In order to ensure atomicity with finding the file, use the
find_*_file_locked variants and then call get_lock_access to get new
access references on the nfs4_file under the same lock.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:17 -04:00
Jeff Layton 3b84240a7b nfsd: clean up reset_union_bmap_deny
Fix the "deny" argument type, and start the loop at 1. The 0 iteration
is always a noop.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:11 -04:00
Jeff Layton 6eb3a1d096 nfsd: set stateid access and deny bits in nfs4_get_vfs_file
Cleanup -- ensure that the stateid bits are set at the same time that
the file access refcounts are incremented. Keeping them coherent like
this makes it easier to ensure that we account for all of the
references.

Since the initialization of the st_*_bmap fields is done when it's
hashed, we go ahead and hash the stateid before getting access to the
file and unhash it if that function returns error. This will be
necessary anyway in a follow-on patch that will overhaul deny mode
handling.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:05 -04:00
Jeff Layton c11c591fe6 nfsd: shrink st_access_bmap and st_deny_bmap
We never use anything above bit #3, so an unsigned long for each is
wasteful. Shrink them to a char each, and add some WARN_ON_ONCE calls if
we try to set or clear bits that would go outside those sizes.

Note too that because atomic bitops work on unsigned longs, we have to
abandon their use here. That shouldn't be a problem though since we
don't really care about the atomicity in this code anyway. Using them
was just a convenient way to flip bits.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:06:04 -04:00
Jeff Layton 6d338b51eb nfsd: remove nfs4_file_put_fd
...and replace it with a simple swap call.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:05:57 -04:00
Jeff Layton 1265965172 nfsd: refactor nfs4_file_get_access and nfs4_file_put_access
Have them take NFS4_SHARE_ACCESS_* flags instead of an open mode. This
spares the callers from having to convert it themselves.

This also allows us to simplify these functions as we no longer need
to do the access_to_omode conversion in either one.

Note too that this patch eliminates the WARN_ON in
__nfs4_file_get_access. It's valid for now, but in a later patch we'll
be bumping the refcounts prior to opening the file in order to close
some races, at which point we'll need to remove it anyway.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-11 11:03:23 -04:00
Marcel Holtmann f49daa8190 Bluetooth: Move HCI socket definitions into its own header file
All the HCI sockets and ioctl based definitions have been in a global
header file that also includes all the HCI protocol structures. To
make this a bit cleaner, move them into its own file.

This also adjusts fs/compat_ioctl.c to only include this new file
and not all the protocol structures that are not needed.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2014-07-11 13:53:04 +03:00
Chao Yu 81e366f87f f2fs: check name_len of dir entry to prevent from deadloop
We assume that modification of some special application could result in zeroed
name_len, or it is consciously made by somebody. We will deadloop in
find_in_block when name_len of dir entry is zero.

This patch is added for preventing deadloop in above scenario.

change log from v1:
 o use f2fs_bug_on rather than break out from searching dir entry suggested by
Jaegeuk Kim.

Jaegeuk describe:
"Well, IMO, it would be good to add f2fs_bug_on() here with a specific comment.
In the current phase of f2fs, it is more important to investigate the file
system bugs, rather than workarounds for any corrupted images.
And, definitely it needs to stop the kernel if any corrupted image was mounted,
so that we can figure out where the bugs are occurred."

Suggested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-10 17:00:02 -07:00
Trond Myklebust e20fcf1e65 nfsd: clean up helper __release_lock_stateid
Use filp_close instead of open coding. filp_close does a bit more than
just release the locks and put the filp. It also calls ->flush and
dnotify_flush, both of which should be done here anyway.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-10 15:05:26 -04:00
Trond Myklebust de18643dce nfsd: Add locking to the nfs4_file->fi_fds[] array
Preparation for removal of the client_mutex, which currently protects
this array. While we don't actually need the find_*_file_locked variants
just yet, a later patch will. So go ahead and add them now to reduce
future churn in this code.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-10 15:05:26 -04:00
Trond Myklebust 1d31a2531a nfsd: Add fine grained protection for the nfs4_file->fi_stateids list
Access to this list is currently serialized by the client_mutex. Add
finer grained locking around this list in preparation for its removal.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-10 15:05:25 -04:00
Linus Torvalds 40f6123737 Merge branch 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
 "Mostly fixes for the fallouts from the recent cgroup core changes.

  The decoupled nature of cgroup dynamic hierarchy management
  (hierarchies are created dynamically on mount but may or may not be
  reused once unmounted depending on remaining usages) led to more
  ugliness being added to kernfs.

  Hopefully, this is the last of it"

* 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: break kernfs active protection in cpuset_write_resmask()
  cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
  kernfs: introduce kernfs_pin_sb()
  cgroup: fix mount failure in a corner case
  cpuset,mempolicy: fix sleeping function called from invalid context
  cgroup: fix broken css_has_online_children()
2014-07-10 11:38:23 -07:00
Jeff Layton d6c249b4d4 nfsd: reduce some spinlocking in put_client_renew
No need to take the lock unless the count goes to 0.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-10 13:41:00 -04:00
Jeff Layton dff1399f8a nfsd: close potential race between delegation break and laundromat
Bruce says:

    There's also a preexisting expire_client/laundromat vs break race:

    - expire_client/laundromat adds a delegation to its local
      reaplist using the same dl_recall_lru field that a delegation
      uses to track its position on the recall lru and drops the
      state lock.

    - a concurrent break_lease adds the delegation to the lru.

    - expire/client/laundromat then walks it reaplist and sees the
      lru head as just another delegation on the list....

Fix this race by checking the dl_time under the state_lock. If we find
that it's not 0, then we know that it has already been queued to the LRU
list and that we shouldn't queue it again.

In the case of destroy_client, we must also ensure that we don't hit
similar races by ensuring that we don't move any delegations to the
reaplist with a dl_time of 0. Just bump the dl_time by one before we
drop the state_lock. We're destroying the delegations anyway, so a 1s
difference there won't matter.

The fault injection code also requires a bit of surgery here:

First, in the case of nfsd_forget_client_delegations, we must prevent
the same sort of race vs. the delegation break callback. For that, we
just increment the dl_time to ensure that a delegation callback can't
race in while we're working on it.

We can't do that for nfsd_recall_client_delegations, as we need to have
it actually queue the delegation, and that won't happen if we increment
the dl_time. The state lock is held over that function, so we don't need
to worry about these sorts of races there.

There is one other potential bug nfsd_recall_client_delegations though.
Entries on the victims list are not dequeued before calling
nfsd_break_one_deleg. That's a potential list corruptor, so ensure that
we do that there.

Reported-by: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-10 13:40:51 -04:00
Miklos Szeredi 4237ba43b6 fuse: restructure ->rename2()
Make ->rename2() universal, i.e. able to handle zero flags.  This is to
make future change of the API easier.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-10 10:50:19 +02:00
Kinglong Mee 01529e3f81 NFSD: Fix memory leak in encoding denied lock
Commit 8c7424cff6 (nfsd4: don't try to encode conflicting owner if low on space)
forgot free conf->data in nfsd4_encode_lockt and before sign conf->data to NULL
in nfsd4_encode_lock_denied.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:08 -04:00
Trond Myklebust 0fe492db60 nfsd: Convert nfs4_check_open_reclaim() to work with lookup_clientid()
lookup_clientid is preferable to find_confirmed_client since it's able
to use the cached client in the compound state.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:07 -04:00
Trond Myklebust 2d91e8953c nfsd: Always use lookup_clientid() in nfsd4_process_open1
In later patches, we'll be moving the stateowner table into the
nfs4_client, and by doing this we ensure that we have a cached
nfs4_client pointer.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:06 -04:00
Trond Myklebust 13d6f66b08 nfsd: Convert nfsd4_process_open1() to work with lookup_clientid()
...and have alloc_init_open_stateowner just use the cstate->clp pointer
instead of passing in a clp separately. This allows us to use the
cached nfs4_client pointer in the cstate instead of having to look it
up again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:05 -04:00
Jeff Layton 4b24ca7d30 nfsd: Allow struct nfsd4_compound_state to cache the nfs4_client
We want to use the nfsd4_compound_state to cache the nfs4_client in
order to optimise away extra lookups of the clid.

In the v4.0 case, we use this to ensure that we only have to look up the
client at most once per compound for each call into lookup_clientid. For
v4.1+ we set the pointer in the cstate during SEQUENCE processing so we
should never need to do a search for it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:04 -04:00
Jeff Layton 62814d6a9b nfsd: add a nfserrno mapping for -E2BIG to nfserr_fbig
I saw this pop up with some pynfs testing:

    [  123.609992] nfsd: non-standard errno: -7

...and -7 is -E2BIG. I think what happened is that XFS returned -E2BIG
due to some xattr operations with the ACL10 pynfs TEST (I guess it has
limited xattr size?).

Add a better mapping for that error since it's possible that we'll need
it. How about we convert it to NFSERR_FBIG? As Bruce points out, they
both have "BIG" in the name so it must be good.

Also, turn the printk in this function into a WARN() so that we can get
a bit more information about situations that don't have proper mappings.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:03 -04:00
Jeff Layton 722b620d18 nfsd: properly convert return from commit_metadata to __be32
Commit 2a7420c03e504 (nfsd: Ensure that nfsd_create_setattr commits
files to stable storage), added a couple of calls to commit_metadata,
but doesn't convert their return codes to __be32 in the appropriate
places.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:02 -04:00
Trond Myklebust 2dd6e458c3 nfsd: Cleanup - Let nfsd4_lookup_stateid() take a cstate argument
The cstate already holds information about the session, and hence
the client id, so it makes more sense to pass that information
rather than the current practice of passing a 'minor version' number.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:01 -04:00
Trond Myklebust d4e19e7027 nfsd: Don't get a session reference without a client reference
If the client were to disappear from underneath us while we're holding
a session reference, things would be bad. This cleanup helps ensure
that it cannot, which will be a possibility when the client_mutex is
removed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:00 -04:00
Jeff Layton fd44907c2d nfsd: clean up nfsd4_release_lockowner
Now that we know that we won't have several lockowners with the same,
owner->data, we can simplify nfsd4_release_lockowner and get rid of
the lo_list in the process.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:59 -04:00
Trond Myklebust b3c32bcd9c nfsd: NFSv4 lock-owners are not associated to a specific file
Just like open-owners, lock-owners are associated with a name, a clientid
and, in the case of minor version 0, a sequence id. There is no association
to a file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:58 -04:00
Jeff Layton c53530da4d nfsd: Allow lockowners to hold several stateids
A lockowner can have more than one lock stateid. For instance, if a
process has more than one file open and has locks on both, then the same
lockowner has more than one stateid associated with it. Change it so
that this reality is better reflected by the objects that nfsd uses.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:57 -04:00
Rahul Bedarkar 88e412ea5e fs: debugfs: remove trailing whitespace
fixes checkpatch.pl trailing whitespace errors

Signed-off-by: Rahul Bedarkar <rahulbedarkar89@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:58:21 -07:00
Fabian Frederick 8278bd3abd kernfs: kernel-doc warning fix
s/static_name/name_is_static

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:37:29 -07:00
Steven Rostedt 485d44022a debugfs: Fix corrupted loop in debugfs_remove_recursive
[ I'm currently running my tests on it now, and so far, after a few
 hours it has yet to blow up. I'll run it for 24 hours which it never
 succeeded in the past. ]

The tracing code has a way to make directories within the debugfs file
system as well as deleting them using mkdir/rmdir in the instance
directory. This is very limited in functionality, such as there is
no renames, and the parent directory "instance" can not be modified.
The tracing code creates the instance directory from the debugfs code
and then replaces the dentry->d_inode->i_op with its own to allow
for mkdir/rmdir to work.

When these are called, the d_entry and inode locks need to be released
to call the instance creation and deletion code. That code has its own
accounting and locking to serialize everything to prevent multiple
users from causing harm. As the parent "instance" directory can not
be modified this simplifies things.

I created a stress test that creates several threads that randomly
creates and deletes directories thousands of times a second. The code
stood up to this test and I submitted it a while ago.

Recently I added a new test that adds readers to the mix. While the
instance directories were being added and deleted, readers would read
from these directories and even enable tracing within them. This test
was able to trigger a bug:

 general protection fault: 0000 [#1] PREEMPT SMP
 Modules linked in: ...
 CPU: 3 PID: 17789 Comm: rmdir Tainted: G        W     3.15.0-rc2-test+ #41
 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
 task: ffff88003786ca60 ti: ffff880077018000 task.ti: ffff880077018000
 RIP: 0010:[<ffffffff811ed5eb>]  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
 RSP: 0018:ffff880077019df8  EFLAGS: 00010246
 RAX: 0000000000000002 RBX: ffff88006f0fe490 RCX: 0000000000000000
 RDX: dead000000100058 RSI: 0000000000000246 RDI: ffff88003786d454
 RBP: ffff88006f0fe640 R08: 0000000000000628 R09: 0000000000000000
 R10: 0000000000000628 R11: ffff8800795110a0 R12: ffff88006f0fe640
 R13: ffff88006f0fe640 R14: ffffffff81817d0b R15: ffffffff818188b7
 FS:  00007ff13ae24700(0000) GS:ffff88007d580000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000003054ec7be0 CR3: 0000000076d51000 CR4: 00000000000007e0
 Stack:
  ffff88007a41ebe0 dead000000100058 00000000fffffffe ffff88006f0fe640
  0000000000000000 ffff88006f0fe678 ffff88007a41ebe0 ffff88003793a000
  00000000fffffffe ffffffff810bde82 ffff88006f0fe640 ffff88007a41eb28
 Call Trace:
  [<ffffffff810bde82>] ? instance_rmdir+0x15b/0x1de
  [<ffffffff81132e2d>] ? vfs_rmdir+0x80/0xd3
  [<ffffffff81132f51>] ? do_rmdir+0xd1/0x139
  [<ffffffff8124ad9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c
  [<ffffffff814fea62>] ? system_call_fastpath+0x16/0x1b
 Code: fe ff ff 48 8d 75 30 48 89 df e8 c9 fd ff ff 85 c0 75 13 48 c7 c6 b8 cc d2 81 48 c7 c7 b0 cc d2 81 e8 8c 7a f5 ff 48 8b 54 24 08 <48> 8b 82 a8 00 00 00 48 89 d3 48 2d a8 00 00 00 48 89 44 24 08
 RIP  [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367
  RSP <ffff880077019df8>

It took a while, but every time it triggered, it was always in the
same place:

	list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) {

Where the child->d_u.d_child seemed to be corrupted.  I added lots of
trace_printk()s to see what was wrong, and sure enough, it was always
the child's d_u.d_child field. I looked around to see what touches
it and noticed that in __dentry_kill() which calls dentry_free():

static void dentry_free(struct dentry *dentry)
{
	/* if dentry was never visible to RCU, immediate free is OK */
	if (!(dentry->d_flags & DCACHE_RCUACCESS))
		__d_free(&dentry->d_u.d_rcu);
	else
		call_rcu(&dentry->d_u.d_rcu, __d_free);
}

I also noticed that __dentry_kill() unlinks the child->d_u.child
under the parent->d_lock spin_lock.

Looking back at the loop in debugfs_remove_recursive() it never takes the
parent->d_lock to do the list walk. Adding more tracing, I was able to
prove this was the issue:

 ftrace-t-15385   1.... 246662024us : dentry_kill <ffffffff81138b91>: free ffff88006d573600
    rmdir-15409   2.... 246662024us : debugfs_remove_recursive <ffffffff811ec7e5>: child=ffff88006d573600 next=dead000000100058

The dentry_kill freed ffff88006d573600 just as the remove recursive was walking
it.

In order to fix this, the list walk needs to be modified a bit to take
the parent->d_lock. The safe version is no longer necessary, as every
time we remove a child, the parent->d_lock must be released and the
list walk must start over. Each time a child is removed, even though it
may still be on the list, it should be skipped by the first check
in the loop:

		if (!debugfs_positive(child))
			continue;

Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09 16:37:29 -07:00
Chao Yu 6b2920a513 f2fs: use inner macro and function to clean up codes
In this patch we use below inner macro and function to clean up codes.
1. ADDRS_PER_PAGE
2. SM_I
3. f2fs_readonly

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Chao Yu 3aab8f828e f2fs: introduce f2fs_write_failed to handle error case when write
When we fail in ->write_begin()/->direct_IO(), our allocated node block in disk
and page cache are still kept, despite these may not be used again.

This patch introduce f2fs_write_failed() to handle the error case of these two
interfaces, it will truncate page cache and blocks of this file according to
i_size.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng eee6160f2e f2fs: arguments cleanup of finding file flow functions
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng 1c3bb97899 f2fs: remove the needless point-cast
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:26 -07:00
Gu Zheng 34e6d456da f2fs: remove the redundant validation check of acl
kernel side(xx_init_acl), the acl is get/cloned from the parent dir's,
which is credible. So remove the redundant validation check of acl
here.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu 1256010ab1 f2fs: reduce region of f2fs_lock_op covered for better concurrency
In our rename process, region of f2fs_lock_op covered is too big as some of the
code like f2fs_empty_dir/f2fs_find_entry are not needed to protect by this lock.

So in the extreme case like doing checkpoint when we rename old inode to exist
inode in a large directory could cause lower concurrency.

Let's reduce the region of f2fs_lock_op to fix this.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Fabian Frederick b434babf85 f2fs: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Chao Yu aec71382c6 f2fs: refactor flush_nat_entries codes for reducing NAT writes
Although building NAT journal in cursum reduce the read/write work for NAT
block, but previous design leave us lower performance when write checkpoint
frequently for these cases:
1. if journal in cursum has already full, it's a bit of waste that we flush all
   nat entries to page for persistence, but not to cache any entries.
2. if journal in cursum is not full, we fill nat entries to journal util
   journal is full, then flush the left dirty entries to disk without merge
   journaled entries, so these journaled entries may be flushed to disk at next
   checkpoint but lost chance to flushed last time.

In this patch we merge dirty entries located in same NAT block to nat entry set,
and linked all set to list, sorted ascending order by entries' count of set.
Later we flush entries in sparse set into journal as many as we can, and then
flush merged entries to disk. In this way we can not only gain in performance,
but also save lifetime of flash device.

In my testing environment, it shows this patch can help to reduce NAT block
writes obviously. In hard disk test case: cost time of fsstress is stablely
reduced by about 5%.

1. virtual machine + hard disk:
fsstress -p 20 -n 200 -l 5
		node num	cp count	nodes/cp
based		4599.6		1803.0		2.551
patched		2714.6		1829.6		1.483

2. virtual machine + 32g micro SD card:
fsstress -p 20 -n 200 -l 1 -w -f chown=0 -f creat=4 -f dwrite=0
-f fdatasync=4 -f fsync=4 -f link=0 -f mkdir=4 -f mknod=4 -f rename=5
-f rmdir=5 -f symlink=0 -f truncate=4 -f unlink=5 -f write=0 -S

		node num	cp count	nodes/cp
based		84.5		43.7		1.933
patched		49.2		40.0		1.23

Our latency of merging op shows not bad when handling extreme case like:
merging a great number of dirty nats:
latency(ns)	dirty nat count
3089219		24922
5129423		27422
4000250		24523

change log from v1:
 o fix wrong logic in add_nat_entry when grab a new nat entry set.
 o swith to create slab cache in create_node_manager_caches.
 o use GFP_ATOMIC instead of GFP_NOFS to avoid potential long latency.

change log from v2:
 o make comment position more appropriate suggested by Jaegeuk Kim.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim a014e037be f2fs: clean up an unused parameter and assignment
This patch cleans up simple unnecessary codes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:25 -07:00
Jaegeuk Kim b97a9b5da8 f2fs: introduce f2fs_do_tmpfile for code consistency
This patch adds f2fs_do_tmpfile to eliminate the redundant init_inode_metadata
flow.
Throught this, we can provide the consistent lock usage, e.g., fi->i_sem,  and
this will enable better debugging stuffs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu 50732df02e f2fs: support ->tmpfile()
Add function f2fs_tmpfile() to support O_TMPFILE file creation, and modify logic
of init_inode_metadata to enable linkat temp file.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu ca0a81b397 f2fs: avoid to truncate non-updated page partially
After we call find_data_page in truncate_partial_data_page, we could not
guarantee this page is updated or not as error may occurred in lower layer.

We'd better check status of the page to avoid this no updated page be
writebacked to device.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Chao Yu 5576cd6ca5 f2fs: avoid unneeded SetPageUptodate in f2fs_write_end
We have already set page update in ->write_begin, so we should remove redundant
SetPageUptodate in ->write_end.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 14:04:24 -07:00
Linus Torvalds 191d385f25 f2fs bugfixes for 3.16
o fix normal and recovery path for fallocated regions
 o fix error case mishandling
 o recover renamed fsync inodes correctly
 o fix to get out of infinite loops in balance_dirty_pages
 o fix kernel NULL pointer error
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTvUA5AAoJEEAUqH6CSFDSSKgP/RQ6ryncwwSUilDswq95/VI1
 qXwAlHLBgJkPquld6Klqw//4ot49sThCjBtusxdNqoyB5aSb/xqupJxRvCrJe1RQ
 dRDYP1Mq63phd0cWsjAokfwXuiJQ2Ys/1bq2HguzAhL+7qNVNJEoy27ISUgvh71J
 3v9pTfOqFY/qMxAa1Y91kIat3/27QTCtVQdS1sQM7s8UXlZHIIGyxrSmYWPUGNar
 yVtMNtgMQcEtmekRAjstM0glj3IukosTP1jameXYumEw9bchfIeeLznvtDiEqxKA
 maXtEPA+yrEk5y+RhOiBgaHuV/9uNmrHHvTwoqhMl9Wl+I4RzxpOhD2agRAUFbdn
 rvPKU514tsjhkdelSYf0v2rXf0PxZcZ5XE27TZ+xyhCADKykBdN5ZzTH1OUWjEOA
 TNdPVKv2btpvEdGdmdGzjKIQpPfjLgJLAKqDNNTSQ3u4XlVioMn6IyzEGddz41By
 kSU0Hzj3iBHk+XlqBWSELOd34aCuvqXG/gcE7rWOj0qbJ5T6GKVRTQN5CbqMNutJ
 Udw0JDhImgYxNI5fsy7Stg/5IqOwhp/pDIpLOHXRnYpLb2rJ1kzvgz4B/eJAZCcc
 zmjxZBn1C2GLBJYFDbY1KeR5Tp6WZ9yok+wbXFiO1mpx5RsU7jIL64X/7+Zg0X84
 p3LlN/vBn1nr2DiB3+n/
 =pwxz
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs bugfixes from Jaegeuk Kim:
 "This includes a couple of bug fixes found by xfstests.  In addition,
  one critical bug was reported by Brian Chadwick, which is falling into
  the infinite loop in balance_dirty_pages.  And it turned out due to
  the IO merging policy in f2fs, which was newly merged in 3.16.

   - fix normal and recovery path for fallocated regions
   - fix error case mishandling
   - recover renamed fsync inodes correctly
   - fix to get out of infinite loops in balance_dirty_pages
   - fix kernel NULL pointer error"

* tag 'f2fs-fixes-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: avoid to access NULL pointer in issue_flush_thread
  f2fs: check bdi->dirty_exceeded when trying to skip data writes
  f2fs: do checkpoint for the renamed inode
  f2fs: release new entry page correctly in error path of f2fs_rename
  f2fs: fix error path in init_inode_metadata
  f2fs: check lower bound nid value in check_nid_range
  f2fs: remove unused variables in f2fs_sm_info
  f2fs: fix not to allocate unnecessary blocks during fallocate
  f2fs: recover fallocated data and its i_size together
  f2fs: fix to report newly allocate region as extent
2014-07-09 09:46:58 -07:00
Chao Yu 50e1f8d221 f2fs: avoid to access NULL pointer in issue_flush_thread
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=75861

Denis 2014-05-10 11:28:59 UTC reported:
"F2FS-fs (mmcblk0p28): mounting..
 Unable to handle kernel NULL pointer dereference at virtual address 00000018
 ...
 [<c0a2f678>] (_raw_spin_lock+0x3c/0x70) from [<c03a0330>] (issue_flush_thread+0x50/0x17c)
 [<c03a0330>] (issue_flush_thread+0x50/0x17c) from [<c01b4064>] (kthread+0x98/0xa4)
 [<c01b4064>] (kthread+0x98/0xa4) from [<c0108060>] (kernel_thread_exit+0x0/0x8)"

This patch assign cmd_control_info in sm_info before issue_flush_thread is being
created, so this make sure that issue flush thread will have no chance to access
invalid info in fcc.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:55 -07:00
Jaegeuk Kim 2743f86554 f2fs: check bdi->dirty_exceeded when trying to skip data writes
If we don't check the current backing device status, balance_dirty_pages can
fall into infinite pausing routine.

This can be occurred when a lot of directories make a small number of dirty
dentry pages including files.

Reported-by: Brian Chadwick <brianchad@westnet.com.au>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:45 -07:00
Jaegeuk Kim b2c0829912 f2fs: do checkpoint for the renamed inode
If an inode is renamed, it should be registered as file_lost_pino to conduct
checkpoint at f2fs_sync_file.
Otherwise, the inode cannot be recovered due to no dent_mark in the following
scenario.

Note that, this scenario is from xfstests/322.

1. create "a"
2. fsync "a"
3. rename "a" to "b"
4. fsync "b"
5. Sudden power-cut

After recovery is done, "b" should be seen.
However, the result shows "a", since the recovery procedure does not enter
recover_dentry due to no dent_mark.

The reason is like below.
- The nid of "a" is checkpointed during #2, f2fs_sync_file.
- The inode page for "b" produced by #3 is written without dent_mark by
sync_node_pages.

So, this patch fixes this bug by assinging file_lost_pino to the "a"'s inode.
If the pino is lost, f2fs_sync_file conducts checkpoint, and then recovers
the latest pino and its dentry information for further recovery.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:31 -07:00
Chao Yu dd4d961fe7 f2fs: release new entry page correctly in error path of f2fs_rename
This patch correct releasing code of new_page to avoid BUG_ON in error patch of
f2fs_rename.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:59:11 -07:00
Chao Yu 90d72459cc f2fs: fix error path in init_inode_metadata
If we fail in this path:
->init_inode_metadata
  ->make_empty_dir
    ->get_new_data_page
      ->grab_cache_page return -ENOMEM

We will bug on in error path of init_inode_metadata when call remove_inode_page
because i_block = 2 (one inode block will be released later & one dentry block).

We should release the dentry block in init_inode_metadata to avoid this BUG_ON,
and avoid leak of dentry block resource, because we never have second chance to
release that block in ->evict_inode as in upper error path we make this inode
'bad'.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:50 -07:00
Chao Yu d6b7d4b31d f2fs: check lower bound nid value in check_nid_range
This patch add lower bound verification for nid in check_nid_range, so nids
reserved like 0, node, meta passed by caller could be checked there.

And then check_nid_range could be used in f2fs_nfs_get_inode for simplifying
code.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:58:08 -07:00
Chao Yu 8bc6f60e3f f2fs: remove unused variables in f2fs_sm_info
Remove unused variables in struct f2fs_sm_info.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-07-09 05:57:57 -07:00
Trond Myklebust 3c87b9b7c0 nfsd: lock owners are not per open stateid
In the NFSv4 spec, lock stateids are per-file objects. Lockowners are not.
This patch replaces the current list of lock owners in the open stateids
with a list of lock stateids.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:37 -04:00
Trond Myklebust acf9295b1c nfsd: clean up nfsd4_close_open_stateid
Minor cleanup that should introduce no behavioral changes.

Currently this function just unhashes the stateid and leaves the caller
to do the work of the CLOSE processing.

Change nfsd4_close_open_stateid so that it handles doing all of the work
of closing a stateid. Move the handling of the unhashed stateid into it
instead of doing that work in nfsd4_close. This will help isolate some
coming changes to stateid handling from nfsd4_close.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:36 -04:00
Jeff Layton db24b3b4b2 nfsd: declare v4.1+ openowners confirmed on creation
There's no need to confirm an openowner in v4.1 and above, so we can
go ahead and set NFS4_OO_CONFIRMED when we create openowners in
those versions. This will also be necessary when we remove the
client_mutex, as it'll be possible for two concurrent opens to race
in versions >4.0.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:35 -04:00
Trond Myklebust b607664ee7 nfsd: Cleanup nfs4svc_encode_compoundres
Move the slot return, put session etc into a helper in fs/nfsd/nfs4state.c
instead of open coding in nfs4svc_encode_compoundres.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:34 -04:00
Trond Myklebust e17f99b728 nfsd: nfs4_preprocess_seqid_op should only set *stpp on success
Not technically a bugfix, since nothing tries to use the return pointer
if this function doesn't return success, but it could be a problem
with some coming changes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:33 -04:00
Jeff Layton 5b8db00bae nfsd: add a new /proc/fs/nfsd/max_connections file
Currently, the maximum number of connections that nfsd will allow
is based on the number of threads spawned. While this is fine for a
default, there really isn't a clear relationship between the two.

The number of threads corresponds to the number of concurrent requests
that we want to allow the server to process at any given time. The
connection limit corresponds to the maximum number of clients that we
want to allow the server to handle. These are two entirely different
quantities.

Break the dependency on increasing threads in order to allow for more
connections, by adding a new per-net parameter that can be set to a
non-zero value. The default is still to base it on the number of threads,
so there should be no behavior change for anyone who doesn't use it.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:32 -04:00
Trond Myklebust 0f3a24b43b nfsd: Ensure that nfsd_create_setattr commits files to stable storage
Since nfsd_create_setattr strips the mode from the struct iattr, it
is quite possible that it will optimise away the call to nfsd_setattr
altogether.
If this is the case, then we never call commit_metadata() on the
newly created file.

Also ensure that both nfsd_setattr() and nfsd_create_setattr() fail
when the call to commit_metadata fails.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:31 -04:00
Kinglong Mee 1e444f5bc0 NFSD: Remove iattr parameter from nfsd_symlink()
Commit db2e747b14 (vfs: remove mode parameter from vfs_symlink())
have remove mode parameter from vfs_symlink.
So that, iattr isn't needed by nfsd_symlink now, just remove it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:31 -04:00
Trond Myklebust 950e0118d0 nfsd: Protect addition to the file_hashtbl
Current code depends on the client_mutex to guarantee a single struct
nfs4_file per inode in the file_hashtbl and make addition atomic with
respect to lookup.  Rely instead on the state_Lock, to make it easier to
stop taking the client_mutex here later.

To prevent an i_lock/state_lock inversion, change nfsd4_init_file to
use ihold instead if igrab. That's also more efficient anyway as we
definitely hold a reference to the inode at that point.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:30 -04:00
Christoph Hellwig 7e6a72e5f1 nfsd: fix file access refcount leak when nfsd4_truncate fails
nfsd4_process_open2 will currently will get access to the file, and then
call nfsd4_truncate to (possibly) truncate it. If that operation fails
though, then the access references will never be released as the
nfs4_ol_stateid is never initialized.

Fix by moving the nfsd4_truncate call into nfs4_get_vfs_file, ensuring
that the refcounts are properly put if the truncate fails.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:29 -04:00
Kinglong Mee 1055414fe1 NFSD: Avoid warning message when compile at i686 arch
fs/nfsd/nfs4xdr.c: In function 'nfsd4_encode_readv':
>> fs/nfsd/nfs4xdr.c:3137:148: warning: comparison of distinct pointer types lacks a cast [enabled by default]
thislen = min(len, ((void *)xdr->end - (void *)xdr->p));

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:28 -04:00
J. Bruce Fields d5e2338324 nfsd4: replace defer_free by svcxdr_tmpalloc
Avoid an extra allocation for the tmpbuf struct itself, and stop
ignoring some allocation failures.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:27 -04:00
J. Bruce Fields bcaab953b1 nfsd4: remove nfs4_acl_new
This is a not-that-useful kmalloc wrapper.  And I'd like one of the
callers to actually use something other than kmalloc.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:27 -04:00
J. Bruce Fields 29c353b3fe nfsd4: define svcxdr_dupstr to share some common code
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:26 -04:00
J. Bruce Fields ce043ac826 nfsd4: remove unused defer_free argument
28e05dd845 "knfsd: nfsd4: represent nfsv4 acl with array instead of
linked list" removed the last user that wanted a custom free function.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:25 -04:00
J. Bruce Fields 7fb84306f5 nfsd4: rename cr_linkname->cr_data
The name of a link is currently stored in cr_name and cr_namelen, and
the content in cr_linkname and cr_linklen.  That's confusing.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:24 -04:00
J. Bruce Fields 52ee04330f nfsd: let nfsd_symlink assume null-terminated data
Currently nfsd_symlink has a weird hack to serve callers who don't
null-terminate symlink data: it looks ahead at the next byte to see if
it's zero, and copies it to a new buffer to null-terminate if not.

That means callers don't have to null-terminate, but they *do* have to
ensure that the byte following the end of the data is theirs to read.

That's a bit subtle, and the NFSv4 code actually got this wrong.

So let's just throw out that code and let callers pass null-terminated
strings; we've already fixed them to do that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:23 -04:00
J. Bruce Fields 0aeae33f5d nfsd: make NFSv2 null terminate symlink data
It's simple enough for NFSv2 to null-terminate the symlink data.

A bit weird (it depends on knowing that we've already read the following
byte, which is either padding or part of the mode), but no worse than
the conditional kstrdup it otherwise relies on in nfsd_symlink().

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:23 -04:00
J. Bruce Fields b829e9197a nfsd: fix rare symlink decoding bug
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.

Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:22 -04:00
Christoph Hellwig 74adf83f5d nfs: only show Posix ACLs in listxattr if actually present
The big ACL switched nfs to use generic_listxattr, which calls all existing
->list handlers.  Add a custom .listxattr implementation that only lists
the ACLs if they actually are present on the given inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Philippe Troin <phil@fifi.org>
Tested-by: Philippe Troin <phil@fifi.org>
Fixes: 013cdf1088 (nfs: use generic posix ACL infrastructure ...)
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-08 14:36:08 -04:00
Peng Tao 31434f496a nfs: check hostname in nfs_get_client
We reference cl_hostname in many places. Add a check to make
sure it exists.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-08 14:30:03 -04:00
Peng Tao a363e32e94 nfsv4: set hostname when creating nfsv4 ds connection
We reference cl_hostname in many places for debugging purpose.
So make it useful by setting hostname when calling nfs_get_client.

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-07-08 14:30:03 -04:00
Yan, Zheng f5f1864743 ceph: properly apply umask when ACL is enabled
when ACL is enabled, posix_acl_create() may change inode's mode

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08 15:08:47 +04:00
Yan, Zheng 5aaa432ad9 ceph: pass proper page offset to copy_page_to_iter()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08 15:08:47 +04:00
Yan, Zheng c5c9a0bf1b ceph: include time stamp in replayed MDS requests
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08 15:08:46 +04:00
Yan, Zheng 494d77bf8f ceph: check unsupported fallocate mode
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-07-08 15:08:46 +04:00
Kinglong Mee c3a4561796 nfsd: Fix bad reserving space for encoding rdattr_error
Introduced by commit 561f0ed498 (nfsd4: allow large readdirs).

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-07 14:16:31 -04:00
Miklos Szeredi c55a01d360 fuse: avoid scheduling while atomic
As reported by Richard Sharpe, an attempt to use fuse_notify_inval_entry()
triggers complains about scheduling while atomic:

  BUG: scheduling while atomic: fuse.hf/13976/0x10000001

This happens because fuse_notify_inval_entry() attempts to allocate memory
with GFP_KERNEL, holding "struct fuse_copy_state" mapped by kmap_atomic().

Introduced by commit 58bda1da4b "fuse/dev: use atomic maps"

Fix by moving the map/unmap to just cover the actual memcpy operation.

Original patch from Maxim Patlasov <mpatlasov@parallels.com>

Reported-by: Richard Sharpe <realrichardsharpe@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org> # v3.15+
2014-07-07 15:28:51 +02:00
Miklos Szeredi 233a01fa9c fuse: handle large user and group ID
If the number in "user_id=N" or "group_id=N" mount options was larger than
INT_MAX then fuse returned EINVAL.

Fix this to handle all valid uid/gid values.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2014-07-07 15:28:51 +02:00
Himangi Saraogi 7b3d8bf771 fuse: inode: drop cast
This patch removes the cast on data of type void * as it is not needed.
The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2014-07-07 15:28:51 +02:00
Anand Avati 154210ccb3 fuse: ignore entry-timeout on LOOKUP_REVAL
The following test case demonstrates the bug:

  sh# mount -t glusterfs localhost:meta-test /mnt/one

  sh# mount -t glusterfs localhost:meta-test /mnt/two

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
  bash: /mnt/one/file: Stale file handle

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now, would be for FUSE to ignore the entry-timeout and
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.

Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.

Signed-off-by: Anand Avati <avati@redhat.com>
Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
2014-07-07 15:28:51 +02:00
Miklos Szeredi 126b9d4365 fuse: timeout comparison fix
As suggested by checkpatch.pl, use time_before64() instead of direct
comparison of jiffies64 values.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: <stable@vger.kernel.org>
2014-07-07 15:28:50 +02:00
Eric Sandeen 5dd214248f ext4: disable synchronous transaction batching if max_batch_time==0
The mount manpage says of the max_batch_time option,

	This optimization can be turned off entirely
	by setting max_batch_time to 0.

But the code doesn't do that.  So fix the code to do
that.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:18:22 -04:00
Theodore Ts'o 94d4c066a4 ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()
We are spending a lot of time explaining to users what this error
means.  Let's try to improve the message to avoid this problem.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 19:15:50 -04:00
Theodore Ts'o ae0f78de2c ext4: clarify error count warning messages
Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-07-05 18:40:52 -04:00
Theodore Ts'o 61c219f581 ext4: fix unjournalled bg descriptor while initializing inode bitmap
The first time that we allocate from an uninitialized inode allocation
bitmap, if the block allocation bitmap is also uninitalized, we need
to get write access to the block group descriptor before we start
modifying the block group descriptor flags and updating the free block
count, etc.  Otherwise, there is the potential of a bad journal
checksum (if journal checksums are enabled), and of the file system
becoming inconsistent if we crash at exactly the wrong time.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-07-05 16:28:35 -04:00
Linus Torvalds b82207b8e8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "We've queued up a few fixes in my for-linus branch"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix crash when starting transaction
  Btrfs: fix btrfs_print_leaf for skinny metadata
  Btrfs: fix race of using total_bytes_pinned
  btrfs: use E2BIG instead of EIO if compression does not help
  btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
  Btrfs: fix use-after-free when cloning a trailing file hole
  btrfs: fix null pointer dereference in btrfs_show_devname when name is null
  btrfs: fix null pointer dereference in clone_fs_devices when name is null
  btrfs: fix nossd and ssd_spread mount option regression
  Btrfs: fix race between balance recovery and root deletion
  Btrfs: atomically set inode->i_flags in btrfs_update_iflags
  btrfs: only unlock block in verify_parent_transid if we locked it
  Btrfs: assert send doesn't attempt to start transactions
  btrfs compression: reuse recently used workspace
  Btrfs: fix crash when mounting raid5 btrfs with missing disks
  btrfs: create sprout should rename fsid on the sysfs as well
  btrfs: dev replace should replace the sysfs entry
  btrfs: dev add should add its sysfs entry
  btrfs: dev delete should remove sysfs entry
  btrfs: rename add_device_membership to btrfs_kobj_add_device
2014-07-04 08:53:53 -07:00
Linus Torvalds 3089f54a79 Driver core fixes for 3.16-rc4
Well, one drivercore fix for kernfs to resolve a reported issue with
 sysfs files being updated from atomic contexts, and another lz4 bugfix
 for testing potential buffer overflows.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlO1/FEACgkQMUfUDdst+ynRPACfWcssJKICc2N7g9/0XXGVTjVT
 PwwAnjQ8bjOfu6i2z/lViLtZGjOnzKor
 =qtjB
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Well, one drivercore fix for kernfs to resolve a reported issue with
  sysfs files being updated from atomic contexts, and another lz4 bugfix
  for testing potential buffer overflows"

* tag 'driver-core-3.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  lz4: add overrun checks to lz4_uncompress_unknownoutputsize()
  kernfs: kernfs_notify() must be useable from non-sleepable contexts
2014-07-03 18:53:13 -07:00
Linus Torvalds 0fba687f9b Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "By coincidence, two NFSv4 symlink bugs, one introduced in the 3.16 xdr
  encoding rewrite, the other a decoding bug that I think we've had
  since the start but that just doesn't trigger very often"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  nfs: fix nfs4d readlink truncated packet
  nfsd: fix rare symlink decoding bug
2014-07-03 18:33:22 -07:00
Steven Rostedt 27199b15e4 ecryptfs: Remove unnecessary include of syscall.h in keystore.c
There's no reason to include syscalls.h in keystore.c. Remove it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-07-03 16:38:23 -05:00
Fabian Frederick 3db593e8af fs/ecryptfs/messaging.c: remove null test before kfree
Fix checkpatch warning:
WARNING: kfree(NULL) is safe this check is probably not required

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: ecryptfs@vger.kernel.org
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-07-03 16:38:09 -05:00
Himangi Saraogi c4cf3ba4f3 ecryptfs: Drop cast
This patch does away with cast on void * and the if as it is unnecessary.

The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2014-07-03 16:37:56 -05:00
Heiko Carstens 058504edd0 fs/seq_file: fallback to vmalloc allocation
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.

E.g.  for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation.  In such
situations reading /proc/stat is not possible anymore.

Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.

For reference a call trace where reading from /proc/stat failed:

  sadc: page allocation failure: order:4, mode:0x1040d0
  CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
  [...]
  Call Trace:
    show_stack+0x6c/0xe8
    warn_alloc_failed+0xd6/0x138
    __alloc_pages_nodemask+0x9da/0xb68
    __get_free_pages+0x2e/0x58
    kmalloc_order_trace+0x44/0xc0
    stat_open+0x5a/0xd8
    proc_reg_open+0x8a/0x140
    do_dentry_open+0x1bc/0x2c8
    finish_open+0x46/0x60
    do_last+0x382/0x10d0
    path_openat+0xc8/0x4f8
    do_filp_open+0x46/0xa8
    do_sys_open+0x114/0x1f0
    sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:54 -07:00
Heiko Carstens f74373a5cc /proc/stat: convert to single_open_size()
These two patches are supposed to "fix" failed order-4 memory
allocations which have been observed when reading /proc/stat.  The
problem has been observed on s390 as well as on x86.

To address the problem change the seq_file memory allocations to
fallback to use vmalloc, so that allocations also work if memory is
fragmented.

This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator.  Also it "fixes" other users as well,
which use seq_file's single_open() interface.

This patch (of 2):

Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it.  Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number
of online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:54 -07:00
Ian Kent 571ff4731b autofs4: fix false positive compile error
On strict build environments we can see:

  fs/autofs4/inode.c: In function 'autofs4_fill_super':
  fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this function
  make[2]: *** [fs/autofs4/inode.o] Error 1
  make[1]: *** [fs/autofs4] Error 2
  make: *** [fs] Error 2
  make: *** Waiting for unfinished jobs....

This is due to the use of pgrp_set being used to indicate pgrp has has
been set rather than initializing pgrp itself.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-03 09:21:53 -07:00
Filipe Manana abdd2e80a5 Btrfs: fix crash when starting transaction
Often when starting a transaction we commit the currently running transaction,
which can end up writing block group caches when the current process has its
journal_info set to NULL (and not to a transaction). This makes our assertion
at btrfs_check_data_free_space() (current_journal != NULL) fail, resulting
in a crash/hang. Therefore fix it by setting journal_info.

Two different traces of this issue follow below.

1)

    [51502.241936] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [51502.242213] ------------[ cut here ]------------
    [51502.242493] kernel BUG at fs/btrfs/ctree.h:3964!
    [51502.242669] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [51502.244010] Call Trace:
    [51502.244010]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [51502.244010]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [51502.244010]  [<ffffffffa0357a6a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [51502.244010]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [51502.244010]  [<ffffffff8168ec7b>] ? _raw_spin_unlock+0x2b/0x40
    [51502.244010]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [51502.244010]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [51502.244010]  [<ffffffffa02d73e1>] __unlink_start_trans+0x31/0xe0 [btrfs]
    [51502.244010]  [<ffffffffa02dea67>] btrfs_unlink+0x37/0xc0 [btrfs]
    [51502.244010]  [<ffffffff811bb054>] ? do_unlinkat+0x114/0x2a0
    [51502.244010]  [<ffffffff811baebc>] vfs_unlink+0xcc/0x150
    [51502.244010]  [<ffffffff811bb1a0>] do_unlinkat+0x260/0x2a0
    [51502.244010]  [<ffffffff811a9ef4>] ? filp_close+0x64/0x90
    [51502.244010]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [51502.244010]  [<ffffffff81349cab>] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [51502.244010]  [<ffffffff811be9eb>] SyS_unlinkat+0x1b/0x40
    [51502.244010]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [51502.244010] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 71 13 36 a0 48 89 fe 31 c0 48 c7 c7 b8 43 36 a0 48 89 e5 e8 5d b0 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [51502.244010] RIP  [<ffffffffa03575da>] assfail.constprop.88+0x1e/0x20 [btrfs]

2)

    [25405.097230] BTRFS: assertion failed: current->journal_info, file: fs/btrfs/extent-tree.c, line: 3670
    [25405.097488] ------------[ cut here ]------------
    [25405.097767] kernel BUG at fs/btrfs/ctree.h:3964!
    [25405.097940] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    (...)
    [25405.100008] Call Trace:
    [25405.100008]  [<ffffffffa02bc025>] btrfs_check_data_free_space+0x395/0x3a0 [btrfs]
    [25405.100008]  [<ffffffffa02c3bdc>] btrfs_write_dirty_block_groups+0x4ac/0x640 [btrfs]
    [25405.100008]  [<ffffffffa035755a>] commit_cowonly_roots+0x164/0x226 [btrfs]
    [25405.100008]  [<ffffffffa02d53cd>] btrfs_commit_transaction+0x4ed/0xab0 [btrfs]
    [25405.100008]  [<ffffffff8109c170>] ? bit_waitqueue+0xc0/0xc0
    [25405.100008]  [<ffffffffa02d6259>] start_transaction+0x459/0x620 [btrfs]
    [25405.100008]  [<ffffffffa02d67ab>] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [25405.100008]  [<ffffffffa02e3407>] btrfs_create+0x47/0x210 [btrfs]
    [25405.100008]  [<ffffffffa02d74cc>] ? btrfs_permission+0x3c/0x80 [btrfs]
    [25405.100008]  [<ffffffff811bc63b>] vfs_create+0x9b/0x130
    [25405.100008]  [<ffffffff811bcf19>] do_last+0x849/0xe20
    [25405.100008]  [<ffffffff811b9409>] ? link_path_walk+0x79/0x820
    [25405.100008]  [<ffffffff811bd5b5>] path_openat+0xc5/0x690
    [25405.100008]  [<ffffffff810ab07d>] ? trace_hardirqs_on+0xd/0x10
    [25405.100008]  [<ffffffff811cdcd2>] ? __alloc_fd+0x32/0x1d0
    [25405.100008]  [<ffffffff811be2a3>] do_filp_open+0x43/0xa0
    [25405.100008]  [<ffffffff811cddf1>] ? __alloc_fd+0x151/0x1d0
    [25405.100008]  [<ffffffff811abcfc>] do_sys_open+0x13c/0x230
    [25405.100008]  [<ffffffff810aaea6>] ? trace_hardirqs_on_caller+0x16/0x1e0
    [25405.100008]  [<ffffffff811abe12>] SyS_open+0x22/0x30
    [25405.100008]  [<ffffffff81698452>] system_call_fastpath+0x16/0x1b
    [25405.100008] Code: 0b 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 89 f1 48 c7 c2 51 13 36 a0 48 89 fe 31 c0 48 c7 c7 d0 43 36 a0 48 89 e5 e8 6d b5 32 e1 <0f> 0b 0f 1f 44 00 00 55 b9 11 00 00 00 48 89 e5 41 55 49 89 f5
    [25405.100008] RIP  [<ffffffffa03570ca>] assfail.constprop.88+0x1e/0x20 [btrfs]

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:18 -07:00
Josef Bacik be2c765dff Btrfs: fix btrfs_print_leaf for skinny metadata
We wouldn't actuall print the extent information if we had a skinny metadata
item, this fixes that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:16 -07:00
Liu Bo d288db5dc0 Btrfs: fix race of using total_bytes_pinned
This percpu counter @total_bytes_pinned is introduced to skip unnecessary
operations of 'commit transaction', it accounts for those space we may free
but are stuck in delayed refs.

And we zero out @space_info->total_bytes_pinned every transaction period so
we have a better idea of how much space we'll actually free up by committing
this transaction.  However, we do the 'zero out' part a little earlier, before
we actually unpin space, so we end up returning ENOSPC when we actually have
free space that's just unpinned from committing transaction.

xfstests/generic/074 complained then.

This fixes it by actually accounting the percpu pinned number when 'unpin',
and since it's protected by space_info->lock, the race is gone now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:15 -07:00
David Sterba 130d5b415a btrfs: use E2BIG instead of EIO if compression does not help
Return codes got updated in 60e1975acb
(btrfs: return errno instead of -1 from compression)
lzo wrapper returns E2BIG in this case, do the same for zlib.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:13 -07:00
David Sterba 0a4eaea892 btrfs: remove stale comment from btrfs_flush_all_pending_stuffs
Commit fcebe4562d (Btrfs: rework qgroup
accounting) removed the qgroup accounting after delayed refs.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-07-03 07:04:12 -07:00
Filipe Manana 14f5979633 Btrfs: fix use-after-free when cloning a trailing file hole
The transaction handle was being used after being freed.

Cc: Chris Mason <clm@fb.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:10 -07:00
Anand Jain 0aeb8a6e67 btrfs: fix null pointer dereference in btrfs_show_devname when name is null
dev->name is null but missing flag is not set.
Strictly speaking the missing flag should have been set, but there
are more places where code just checks if name is null. For now this
patch does the same.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
IP: [<ffffffffa0228908>] btrfs_show_devname+0x58/0xf0 [btrfs]

[<ffffffff81198879>] show_vfsmnt+0x39/0x130
[<ffffffff81178056>] m_show+0x16/0x20
[<ffffffff8117d706>] seq_read+0x296/0x390
[<ffffffff8115aa7d>] vfs_read+0x9d/0x160
[<ffffffff8115b549>] SyS_read+0x49/0x90
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:09 -07:00
Anand Jain e755f78086 btrfs: fix null pointer dereference in clone_fs_devices when name is null
when one of the device path is missing btrfs_device name is null. So this
patch will check for that.

stack:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [<ffffffff812e18c0>] strlen+0x0/0x30
[<ffffffffa01cd92a>] ? clone_fs_devices+0xaa/0x160 [btrfs]
[<ffffffffa01cdcf7>] btrfs_init_new_device+0x317/0xca0 [btrfs]
[<ffffffff81155bca>] ? __kmalloc_track_caller+0x15a/0x1a0
[<ffffffffa01d6473>] btrfs_ioctl+0xaa3/0x2860 [btrfs]
[<ffffffff81132a6c>] ? handle_mm_fault+0x48c/0x9c0
[<ffffffff81192a61>] ? __blkdev_put+0x171/0x180
[<ffffffff817a784c>] ? __do_page_fault+0x4ac/0x590
[<ffffffff81193426>] ? blkdev_put+0x106/0x110
[<ffffffff81179175>] ? mntput+0x35/0x40
[<ffffffff8116d4b0>] do_vfs_ioctl+0x460/0x4a0
[<ffffffff8115c72e>] ? ____fput+0xe/0x10
[<ffffffff81068033>] ? task_work_run+0xb3/0xd0
[<ffffffff8116d547>] SyS_ioctl+0x57/0x90
[<ffffffff817a793e>] ? do_page_fault+0xe/0x10
[<ffffffff817abe52>] system_call_fastpath+0x16/0x1b

reproducer:
mkfs.btrfs -draid1 -mraid1 /dev/sdg1 /dev/sdg2
btrfstune -S 1 /dev/sdg1
modprobe -r btrfs && modprobe btrfs
mount -o degraded /dev/sdg1 /btrfs
btrfs dev add /dev/sdg3 /btrfs

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:07 -07:00
Eric Sandeen 2aa06a35d0 btrfs: fix nossd and ssd_spread mount option regression
The commit

0780253 btrfs: Cleanup the btrfs_parse_options for remount.

broke ssd options quite badly; it stopped making ssd_spread
imply ssd, and it made "nossd" unsettable.

Put things back at least as well as they were before
(though ssd mount option handling is still pretty odd:
# mount -o "nossd,ssd_spread" works?)

Reported-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:06 -07:00
Wang Shilong 5f3164813b Btrfs: fix race between balance recovery and root deletion
Balance recovery is called when RW mounting or remounting from
RO to RW, it is called to finish roots merging.

When doing balance recovery, relocation root's corresponding
fs root(whose root refs is 0) might be destroyed by cleaner
thread, this will make btrfs fail to mount.

Fix this problem by holding @cleaner_mutex when doing balance
recovery.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:04:04 -07:00
Filipe Manana 3cc7939255 Btrfs: atomically set inode->i_flags in btrfs_update_iflags
This change is based on the corresponding recent change for ext4:

  ext4: atomically set inode->i_flags in ext4_set_inode_flags()

That has the following commit message that applies to btrfs as well:

  "Use cmpxchg() to atomically set i_flags instead of clearing out the
   S_IMMUTABLE, S_APPEND, etc. flags and then setting them from the
   EXT4_IMMUTABLE_FL, EXT4_APPEND_FL flags, since this opens up a race
   where an immutable file has the immutable flag cleared for a brief
   window of time."

Replacing EXT4_IMMUTABLE_FL and EXT4_APPEND_FL with BTRFS_INODE_IMMUTABLE
and BTRFS_INODE_APPEND, respectively.

Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-07-03 07:03:23 -07:00
Fabian Frederick 086f2f76f4 fs/jffs2/xattr.c: remove null test before kfree
Fix checkpatch warning:
WARNING: kfree(NULL) is safe this check is probably not required

Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2014-07-02 15:25:44 -07:00
Fabian Frederick b6861d0a15 fs/jffs2/acl.c: remove null test before kfree
Fix checkpatch warning:
WARNING: kfree(NULL) is safe this check is probably not required

Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Brian Norris <computersforpeace@gmail.com>
2014-07-02 15:25:39 -07:00
Avi Kivity 69bbd9c7b9 nfs: fix nfs4d readlink truncated packet
XDR requires 4-byte alignment; nfs4d READLINK reply writes out the padding,
but truncates the packet to the padding-less size.

Fix by taking the padding into consideration when truncating the packet.

Symptoms:

	# ll /mnt/
	ls: cannot read symbolic link /mnt/test: Input/output error
	total 4
	-rw-r--r--. 1 root root  0 Jun 14 01:21 123456
	lrwxrwxrwx. 1 root root  6 Jul  2 03:33 test
	drwxr-xr-x. 1 root root  0 Jul  2 23:50 tmp
	drwxr-xr-x. 1 root root 60 Jul  2 23:44 tree

Signed-off-by: Avi Kivity <avi@cloudius-systems.com>
Fixes: 476a7b1f4b (nfsd4: don't treat readlink like a zero-copy operation)
Reviewed-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-02 17:37:13 -04:00
Tejun Heo ecca47ce82 kernfs: kernfs_notify() must be useable from non-sleepable contexts
d911d98748 ("kernfs: make kernfs_notify() trigger inotify events
too") added fsnotify triggering to kernfs_notify() which requires a
sleepable context.  There are already existing users of
kernfs_notify() which invoke it from an atomic context and in general
it's silly to require a sleepable context for triggering a
notification.

The following is an invalid context bug triggerd by md invoking
sysfs_notify() from IO completion path.

 BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
 2 locks held by swapper/1/0:
  #0:  (&(&vblk->vq_lock)->rlock){-.-...}, at: [<ffffffffa0039042>] virtblk_done+0x42/0xe0 [virtio_blk]
  #1:  (&(&bitmap->counts.lock)->rlock){-.....}, at: [<ffffffff81633718>] bitmap_endwrite+0x68/0x240
 irq event stamp: 33518
 hardirqs last  enabled at (33515): [<ffffffff8102544f>] default_idle+0x1f/0x230
 hardirqs last disabled at (33516): [<ffffffff818122ed>] common_interrupt+0x6d/0x72
 softirqs last  enabled at (33518): [<ffffffff810a1272>] _local_bh_enable+0x22/0x50
 softirqs last disabled at (33517): [<ffffffff810a29e0>] irq_enter+0x60/0x80
 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.16.0-0.rc2.git2.1.fc21.x86_64 #1
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000000 f90db13964f4ee05 ffff88007d403b80 ffffffff81807b4c
  0000000000000000 ffff88007d403ba8 ffffffff810d4f14 0000000000000000
  0000000000441800 ffff880078fa1780 ffff88007d403c38 ffffffff8180caf2
 Call Trace:
  <IRQ>  [<ffffffff81807b4c>] dump_stack+0x4d/0x66
  [<ffffffff810d4f14>] __might_sleep+0x184/0x240
  [<ffffffff8180caf2>] mutex_lock_nested+0x42/0x440
  [<ffffffff812d76a0>] kernfs_notify+0x90/0x150
  [<ffffffff8163377c>] bitmap_endwrite+0xcc/0x240
  [<ffffffffa00de863>] close_write+0x93/0xb0 [raid1]
  [<ffffffffa00df029>] r1_bio_write_done+0x29/0x50 [raid1]
  [<ffffffffa00e0474>] raid1_end_write_request+0xe4/0x260 [raid1]
  [<ffffffff813acb8b>] bio_endio+0x6b/0xa0
  [<ffffffff813b46c4>] blk_update_request+0x94/0x420
  [<ffffffff813bf0ea>] blk_mq_end_io+0x1a/0x70
  [<ffffffffa00392c2>] virtblk_request_done+0x32/0x80 [virtio_blk]
  [<ffffffff813c0648>] __blk_mq_complete_request+0x88/0x120
  [<ffffffff813c070a>] blk_mq_complete_request+0x2a/0x30
  [<ffffffffa0039066>] virtblk_done+0x66/0xe0 [virtio_blk]
  [<ffffffffa002535a>] vring_interrupt+0x3a/0xa0 [virtio_ring]
  [<ffffffff81116177>] handle_irq_event_percpu+0x77/0x340
  [<ffffffff8111647d>] handle_irq_event+0x3d/0x60
  [<ffffffff81119436>] handle_edge_irq+0x66/0x130
  [<ffffffff8101c3e4>] handle_irq+0x84/0x150
  [<ffffffff818146ad>] do_IRQ+0x4d/0xe0
  [<ffffffff818122f2>] common_interrupt+0x72/0x72
  <EOI>  [<ffffffff8105f706>] ? native_safe_halt+0x6/0x10
  [<ffffffff81025454>] default_idle+0x24/0x230
  [<ffffffff81025f9f>] arch_cpu_idle+0xf/0x20
  [<ffffffff810f5adc>] cpu_startup_entry+0x37c/0x7b0
  [<ffffffff8104df1b>] start_secondary+0x25b/0x300

This patch fixes it by punting the notification delivery through a
work item.  This ends up adding an extra pointer to kernfs_elem_attr
enlarging kernfs_node by a pointer, which is not ideal but not a very
big deal either.  If this turns out to be an actual issue, we can move
kernfs_elem_attr->size to kernfs_node->iattr later.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Josh Boyer <jwboyer@fedoraproject.org>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-02 09:32:09 -07:00
Li Zefan 4e26445faa kernfs: introduce kernfs_pin_sb()
kernfs_pin_sb() tries to get a refcnt of the superblock.

This will be used by cgroupfs.

v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().

tj: Updated the comment a bit.

[ This is a prerequisite for a bugfix. ]
Cc: <stable@vger.kernel.org> # 3.15
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-06-30 10:16:25 -04:00
Linus Torvalds 16874b2cb8 Fix a regression when trying to compile ext4 on older versions gcc.
Fix a number of miscellaneous bugs for punch hole as well as a
 long-standing potential double buffer head release when failing a
 block allocation for an indirect-mapped file.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTsKRpAAoJENNvdpvBGATwf3YQAJoGNvrxd6BPpB0bSTLRdh+k
 0IDShwB9PpSJulIHx8xIOpFIvpk5T0E1Ho+UqUnpMImRiqueYbFfOPca4lYsRBqW
 r9TNmm0O8Bf8K0j8YUaV2P8BGNJuiCzv7YcCbSHt1/eP6/InyfJWA17hYD6oxBPF
 ZrrcZemDH863KhYIF55sbx9UvBLz515ifd4kHN9pSVbV3pJT5/zPiRk/wujQZTrX
 v0z5pe5i8GfYoBMmdCNak3a1YoXdf+FUdID+pvGWtfzs8AG8nS7jb8C1zkfgUWtC
 zau9yBnFHKlBdmoPrFJIpvDT/2rBRks6g6uM2Jsc+YS+zHCXi0xJR1EaPIqkRJbf
 vWHNSogzOKetyFZFML1Dg1cjXVEnPusOsyhvcXTFwb3n/YY1NBFdLiVgNbTzQ7aU
 X7P4M+ca2yib2rQos5Ltipk4ju9eT4d5qsf+QSaaryYeUHg+8sNaqd82y42eita1
 Dgg7IpyKWdNo+lBHEtDSV5Gli4oRrXddj7Yvrib7nobf+FTi4cpbbu0PfY5qXfDV
 vKrsvIiJcPJYsvi+USH3fvv2m6UaZd+gDo/RObV06tSNTZWqGzalSeviaJZ+cfEP
 v05ODGLUNvg+thhvBsYAGNeaV/SOYztmNA6v0qfI+OwscH8ycGeb6R98Kil/hd97
 P3i12fhP7tkffNTNPM/E
 =X4vY
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 bugfixes from Ted Ts'o:
 "Fix a regression when trying to compile ext4 on older versions gcc.

  Fix a number of miscellaneous bugs for punch hole as well as a
  long-standing potential double buffer head release when failing a
  block allocation for an indirect-mapped file"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Fix hole punching for files with indirect blocks
  ext4: Fix block zeroing when punching holes in indirect block files
  ext4: decrement free clusters/inodes counters when block group declared bad
  fs/mbcache: replace __builtin_log2() with ilog2()
  ext4: Fix buffer double free in ext4_alloc_branch()
2014-06-29 19:20:43 -07:00
Josef Bacik 472b909ff6 btrfs: only unlock block in verify_parent_transid if we locked it
This is a regression from my patch a26e8c9f75, we
need to only unlock the block if we were the one who locked it.  Otherwise this
will trip BUG_ON()'s in locking.c  Thanks,

cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:47 -07:00
Filipe Manana 46c4e71e9b Btrfs: assert send doesn't attempt to start transactions
When starting a transaction just assert that current->journal_info
doesn't contain a send transaction stub, since send isn't supposed
to start transactions and when it finishes (either successfully or
not) it's supposed to set current->journal_info to NULL.

This is motivated by the change titled:

    Btrfs: fix crash when starting transaction

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Sergey Senozhatsky c39aa7056f btrfs compression: reuse recently used workspace
Add compression `workspace' in free_workspace() to
`idle_workspace' list head, instead of tail. So we have
better chances to reuse most recently used `workspace'.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:46 -07:00
Liu Bo 5588383ece Btrfs: fix crash when mounting raid5 btrfs with missing disks
The reproducer is

$ mkfs.btrfs D1 D2 D3 -mraid5
$ mkfs.ext4 D2 && mkfs.ext4 D3
$ mount D1 /btrfs -odegraded

-------------------

[   87.672992] ------------[ cut here ]------------
[   87.673845] kernel BUG at fs/btrfs/raid56.c:1828!
...
[   87.673845] RIP: 0010:[<ffffffff813efc7e>]  [<ffffffff813efc7e>] __raid_recover_end_io+0x4ae/0x4d0
...
[   87.673845] Call Trace:
[   87.673845]  [<ffffffff8116bbc6>] ? mempool_free+0x36/0xa0
[   87.673845]  [<ffffffff813f0255>] raid_recover_end_io+0x75/0xa0
[   87.673845]  [<ffffffff81447c5b>] bio_endio+0x5b/0xa0
[   87.673845]  [<ffffffff81447cb2>] bio_endio_nodec+0x12/0x20
[   87.673845]  [<ffffffff81374621>] end_workqueue_fn+0x41/0x50
[   87.673845]  [<ffffffff813ad2aa>] normal_work_helper+0xca/0x2c0
[   87.673845]  [<ffffffff8108ba2b>] process_one_work+0x1eb/0x530
[   87.673845]  [<ffffffff8108b9c9>] ? process_one_work+0x189/0x530
[   87.673845]  [<ffffffff8108c15b>] worker_thread+0x11b/0x4f0
[   87.673845]  [<ffffffff8108c040>] ? rescuer_thread+0x290/0x290
[   87.673845]  [<ffffffff810939c4>] kthread+0xe4/0x100
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220
[   87.673845]  [<ffffffff817e7c7c>] ret_from_fork+0x7c/0xb0
[   87.673845]  [<ffffffff810938e0>] ? kthread_create_on_node+0x220/0x220

-------------------

It's because that we miscalculate @rbio->bbio->error so that it doesn't
reach maximum of tolerable errors while it should have.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Satoru Takeuchi<takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:45 -07:00
Anand Jain b2373f255c btrfs: create sprout should rename fsid on the sysfs as well
Creating sprout will change the fsid of the mounted root.
do the same on the sysfs as well.

reproducer:
 mount /dev/sdb /btrfs (seed disk)
 btrfs dev add /dev/sdc /btrfs
 mount -o rw,remount /btrfs
 btrfs dev del /dev/sdb /btrfs
 mount /dev/sdb /btrfs

Error:
kobject_add_internal failed for fe350492-dc28-4051-a601-e017b17e6145 with -EEXIST, don't try to register things with the same name in the same directory.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain 49c6f736f3 btrfs: dev replace should replace the sysfs entry
when we replace the device its corresponding sysfs
entry has to be replaced as well

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:44 -07:00
Anand Jain 0d39376aa2 btrfs: dev add should add its sysfs entry
we would need the device links to be created,
when device is added.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:43 -07:00
Anand Jain 99994cde9c btrfs: dev delete should remove sysfs entry
when we delete the device from the mounted btrfs,
we would need its corresponding sysfs enty to
be removed as well.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:42 -07:00
Anand Jain 9b4eaf43f4 btrfs: rename add_device_membership to btrfs_kobj_add_device
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-28 13:48:41 -07:00
Tejun Heo 9a1049da9b percpu-refcount: require percpu_ref to be exited explicitly
Currently, a percpu_ref undoes percpu_ref_init() automatically by
freeing the allocated percpu area when the percpu_ref is killed.
While seemingly convenient, this has the following niggles.

* It's impossible to re-init a released reference counter without
  going through re-allocation.

* In the similar vein, it's impossible to initialize a percpu_ref
  count with static percpu variables.

* We need and have an explicit destructor anyway for failure paths -
  percpu_ref_cancel_init().

This patch removes the automatic percpu counter freeing in
percpu_ref_kill_rcu() and repurposes percpu_ref_cancel_init() into a
generic destructor now named percpu_ref_exit().  percpu_ref_destroy()
is considered but it gets confusing with percpu_ref_kill() while
"exit" clearly indicates that it's the counterpart of
percpu_ref_init().

All percpu_ref_cancel_init() users are updated to invoke
percpu_ref_exit() instead and explicit percpu_ref_exit() calls are
added to the destruction path of all percpu_ref users.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Nicholas A. Bellinger <nab@linux-iscsi.org>
Cc: Li Zefan <lizefan@huawei.com>
2014-06-28 08:10:14 -04:00
Tejun Heo 55c6c814ae percpu-refcount, aio: use percpu_ref_cancel_init() in ioctx_alloc()
ioctx_alloc() reaches inside percpu_ref and directly frees
->pcpu_count in its failure path, which is quite gross.  percpu_ref
has been providing a proper interface to do this,
percpu_ref_cancel_init(), for quite some time now.  Let's use that
instead.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Kent Overstreet <kmo@daterainc.com>
2014-06-28 08:10:12 -04:00
J. Bruce Fields 76f47128f9 nfsd: fix rare symlink decoding bug
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.

Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-27 16:10:46 -04:00
Jan Kara a93cd4cf86 ext4: Fix hole punching for files with indirect blocks
Hole punching code for files with indirect blocks wrongly computed
number of blocks which need to be cleared when traversing the indirect
block tree. That could result in punching more blocks than actually
requested and thus effectively cause a data loss. For example:

fallocate -n -p 10240000 4096

will punch the range 10240000 - 12632064 instead of the range 1024000 -
10244096. Fix the calculation.

CC: stable@vger.kernel.org
Fixes: 8bad6fc813
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:30:54 -04:00
Jan Kara 77ea2a4ba6 ext4: Fix block zeroing when punching holes in indirect block files
free_holes_block() passed local variable as a block pointer
to ext4_clear_blocks(). Thus ext4_clear_blocks() zeroed out this local
variable instead of proper place in inode / indirect block. We later
zero out proper place in inode / indirect block but don't dirty the
inode / buffer again which can lead to subtle issues (some changes e.g.
to inode can be lost).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-06-26 12:28:57 -04:00
Namjae Jeon e43bb4e612 ext4: decrement free clusters/inodes counters when block group declared bad
We should decrement free clusters counter when block bitmap is marked
as corrupt and free inodes counter when the allocation bitmap is
marked as corrupt to avoid misunderstanding due to incorrect available
size in statfs result.  User can get immediately ENOSPC error from
write begin without reaching for the writepages.

Cc: Darrick J. Wong<darrick.wong@oracle.com>
Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
2014-06-26 10:11:53 -04:00
Linus Torvalds d7933ab727 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Small set of misc cifs/smb3 fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
  cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
  fs/cifs: fix regression in cifs_create_mf_symlink()
2014-06-25 21:47:28 -07:00
Linus Torvalds ec71feae06 NFS client fixes for Linux 3.16
Highlights include:
 
 - Stable fix for a data corruption case due to incorrect cache validation
 - Fix a couple of false positive cache invalidations
 - Fix NFSv4 security negotiation issues
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTq1GIAAoJEGcL54qWCgDyCfUP/3S7Py5Gocdqvb7FBPpCWtsb
 PJlv1RjC4ngT+BJpBeDSOFEcZerfeQAwguL5kEgIjdyKmsAjVjIF7ThagNQK/0yr
 qpeKh2EtbAipjjXVmul7saG3Ucuv/PggEhqGl9iJK0QyPdmnr30cHGHHt3kCIPGE
 e4AkaCN4ZuXBdDOO4YpKzIl6wQPb0Gjwps1boW4INCvnBvK6Yno26Q6ilDf92gJE
 hisEn0l8l09C6t2jZKP7daCyGForTYYlMxIbmjmQhsMEwnh1kmfpr/xuAQP2bflr
 14OFrNbrZg3p4ucp8g7EzgS1Z5m/Ism0xNKfO4LgNwUobSgbvvvScAC3/LP2HIIk
 RXuRhgb8u6pbWQRqq4XznB+csh6DGR/ui2PhonK4lJDaJxcU3bnFlhTgoC0GSyCa
 Wbbdv+nhXhw5Xi9jsma6PW/CnHJH6sk/8KviRPOpC+RsCg+X41vTHzC4XvWbentw
 aZGkNuWAnBKMyswu08E4+ScFQxToSB6ju4RjOsTTMleC0ewWXD3Y6FL+B5p4crPO
 L05KCLkP+SeRxpakOM3e/x/bkVOa+DBna7foXUZ9snWybYoOmuxOkJgJT7bxrYaA
 /3N0e/WUUgPR/bhdydMJSRo6DchKj+5GRSpx8FB9eMqqp8mNE+I61/Kq0dFEbtPQ
 1IQCFT4w1PEegDpwjb0L
 =o+QR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

   - Stable fix for a data corruption case due to incorrect cache
     validation
   - Fix a couple of false positive cache invalidations
   - Fix NFSv4 security negotiation issues"

* tag 'nfs-for-3.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
  NFS Return -EPERM if no supported or matching SECINFO flavor
  NFS check the return of nfs4_negotiate_security in nfs4_submount
  NFS: Don't mark the data cache as invalid if it has been flushed
  NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
  nfs: Fix cache_validity check in nfs_write_pageuptodate()
2014-06-25 20:06:06 -07:00
T Makphaibulchoke ec7756ae15 fs/mbcache: replace __builtin_log2() with ilog2()
Fix compiler error with some gcc version(s) that do not
support __builtin_log2() by replacing __builtin_log2() with
ilog2().

Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
2014-06-25 22:08:29 -04:00
Fabian Frederick f15b504144 FS/NFS: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-25 19:02:14 -04:00
Weston Andros Adamson 0446278999 nfs: get rid of duplicate dprintk
This was introduced by a merge error with my recent pgio patchset.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-25 19:01:27 -04:00
Dave Chinner 2451337dd0 xfs: global error sign conversion
Convert all the errors the core XFs code to negative error signs
like the rest of the kernel and remove all the sign conversion we
do in the interface layers.

Errors for conversion (and comparison) found via searches like:

$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs

Negation points found via searches like:

$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs

[ with some bits I missed from Brian Foster ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:58:08 +10:00
Dave Chinner 30f712c9dd libxfs: move source files
Move all the source files that are shared with userspace into
libxfs/. This is done as one big chunk simpy to get it done
quickly

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:53 +10:00
Dave Chinner 84be0ffc90 libxfs: move header files
Move all the header files that are shared with userspace into
libxfs. This is done as one big chunk simpy to get it done quickly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:36 +10:00
Dave Chinner 69116a1317 xfs: create libxfs infrastructure
To minimise the differences between kernel and userspace code,
split the kernel code into the same structure as the userspace code.
That is, the gneric core functionality of XFS is moved to a libxfs/
directory and treat it as a layering barrier in the XFS code.

This patch introduces the libxfs directory, the build infrastructure
and an initial source and header file to build. The libxfs directory
will contain the header files that are needed to build libxfs - most
of userspace does not care about the location of these header files
as they are accessed indirectly. Hence keeping them inside libxfs
makes it easy to track the changes and script the sync process as
the directory structure will be identical.

To allow this changeover to occur in the kernel code, there are some
temporary infrastructure in the makefiles to grab the header
filesystem from both locations. Once all the files are moved,
modifications will be made in the source code that will make the
need for these include directives go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:22 +10:00
Anna Schumaker 343ae531f1 nfs: Fix unused variable error
inode is unused when CONFIG_SUNRPC_DEBUG=n.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:02 -04:00
Weston Andros Adamson c6639dac53 nfs: remove unneeded EXPORTs
EXPORT_GPLs of nfs_pageio_add_request and nfs_pageio_complete aren't
needed anymore.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:01 -04:00
Weston Andros Adamson 53113ad35e pnfs: clean up *_resend_to_mds
Clean up pnfs_read_done_resend_to_mds and pnfs_write_done_resend_to_mds:
 - instead of passing all arguments from a nfs_pgio_header, just pass the header
 - share the common code

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:01 -04:00
Weston Andros Adamson 4714fb51fd nfs: remove pgio_header refcount, related cleanup
The refcounting on nfs_pgio_header was related to there being (possibly)
more than one nfs_pgio_data. Now that nfs_pgio_data has been merged into
nfs_pgio_header, there is no reason to do this ref counting.  Just call
the completion callback on nfs_pgio_release/nfs_pgio_error.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:01 -04:00
Weston Andros Adamson c65e6254ca nfs: remove unused writeverf code
Remove duplicate writeverf structure from merge of nfs_pgio_header and
nfs_pgio_data and remove writeverf related flags and logic to handle
more than one RPC per nfs_pgio_header.

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:00 -04:00
Weston Andros Adamson d45f60c678 nfs: merge nfs_pgio_data into _header
struct nfs_pgio_data only exists as a member of nfs_pgio_header, but is
passed around everywhere, because there used to be multiple _data structs
per _header. Many of these functions then use the _data to find a pointer
to the _header.  This patch cleans this up by merging the nfs_pgio_data
structure into nfs_pgio_header and passing nfs_pgio_header around instead.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:47:00 -04:00
Weston Andros Adamson 823b0c9d98 nfs: rename members of nfs_pgio_data
Rename "verf" to "writeverf" and "pages" to "page_array" to prepare for
merge of nfs_pgio_data and nfs_pgio_header.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:59 -04:00
Weston Andros Adamson 1e7f3a4859 nfs: move nfs_pgio_data and remove nfs_rw_header
nfs_rw_header was used to allocate an nfs_pgio_header along with an
nfs_pgio_data, because a _header would need at least one _data.

Now there is only ever one nfs_pgio_data for each nfs_pgio_header -- move
it to nfs_pgio_header and get rid of nfs_rw_header.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:59 -04:00
Andy Adamson 66b0686049 NFSv4: test SECINFO RPC_AUTH_GSS pseudoflavors for support
Fix nfs4_negotiate_security to create an rpc_clnt used to test each SECINFO
returned pseudoflavor. Check credential creation  (and gss_context creation)
which is important for RPC_AUTH_GSS pseudoflavors which can fail for multiple
reasons including mis-configuration.

Don't call nfs4_negotiate in nfs4_submount as it was just called by
nfs4_proc_lookup_mountpoint (nfs4_proc_lookup_common)

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix corrupt return value from nfs_find_best_sec()]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:58 -04:00
Andy Adamson 8445cd3528 NFS Return -EPERM if no supported or matching SECINFO flavor
Do not return RPC_AUTH_UNIX if SEINFO reply tests fail. This
prevents an infinite loop of NFS4ERR_WRONGSEC for non RPC_AUTH_UNIX mounts.

Without this patch, a mount with no sec= option to a server
that does not include RPC_AUTH_UNIX in the
SECINFO return can be presented with an attemtp to use RPC_AUTH_UNIX
which will result in an NFS4ERR_WRONG_SEC which will prompt the SECINFO
call which will again try RPC_AUTH_UNIX....

Signed-off-by: Andy Adamson <andros@netapp.com>
Tested-By: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:58 -04:00
Andy Adamson 57bbe3d7c1 NFS check the return of nfs4_negotiate_security in nfs4_submount
Signed-off-by: Andy Adamson <andros@netapp.com>
Tested-By: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Trond Myklebust 6edf96097b NFS: Don't mark the data cache as invalid if it has been flushed
Now that we have functions such as nfs_write_pageuptodate() that use
the cache_validity flags to check if the data cache is valid or not,
it is a little more important to keep the flags in sync with the
state of the data cache.
In particular, we'd like to ensure that if the data cache is empty, we
don't start marking it as needing revalidation.

Reported-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Trond Myklebust f2467b6f64 NFS: Clear NFS_INO_REVAL_PAGECACHE when we update the file size
In nfs_update_inode(), if the change attribute is seen to change on
the server, then we set NFS_INO_REVAL_PAGECACHE in order to make
sure that we check the file size.
However, if we also update the file size in the same function, we
don't need to check it again. So make sure that we clear the
NFS_INO_REVAL_PAGECACHE that was set earlier.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:57 -04:00
Scott Mayhew 18dd78c427 nfs: Fix cache_validity check in nfs_write_pageuptodate()
NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation.

We're still having some problems with data corruption when multiple
clients are appending to a file and those clients are being granted
write delegations on open.

To reproduce:

Client A:
vi /mnt/`hostname -s`
while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

Client B:
vi /mnt/`hostname -s`
while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done

What's happening is that in nfs_update_inode() we're recognizing that
the file size has changed and we're setting NFS_INO_INVALID_DATA
accordingly, but then we ignore the cache_validity flags in
nfs_write_pageuptodate() because we have a delegation.  As a result,
in nfs_updatepage() we're extending the write to cover the full page
even though we've not read in the data to begin with.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: <stable@vger.kernel.org> # v3.11+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-24 18:46:56 -04:00
Oleg Nesterov 855ef0dec7 aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
ioctx_add_table() is the writer, it does not need rcu_read_lock() to
protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just
add the confusion.

And it doesn't need rcu_dereference() by the same reason, it must see
any updates previously done under the same ->ioctx_lock. We could use
rcu_dereference_protected() but the patch uses rcu_dereference_raw(),
the function is simple enough.

The same for kill_ioctx(), although it does not update the pointer.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-06-24 18:10:25 -04:00
Oleg Nesterov 4b70ac5fd9 aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
On 04/30, Benjamin LaHaise wrote:
>
> > -		ctx->mmap_size = 0;
> > -
> > -		kill_ioctx(mm, ctx, NULL);
> > +		if (ctx) {
> > +			ctx->mmap_size = 0;
> > +			kill_ioctx(mm, ctx, NULL);
> > +		}
>
> Rather than indenting and moving the two lines changing mmap_size and the
> kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
> the number of lines changed and avoid excessive indentation.

OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
of course, I won't argue.

The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

-------------------------------------------------------------------------------
Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

1. We can read ->ioctx_table only once and we do not read rcu_read_lock()
   or even rcu_dereference().

   This mm has no users, nobody else can play with ->ioctx_table. Otherwise
   the code is buggy anyway, if we need rcu_read_lock() in a loop because
   ->ioctx_table can be updated then kfree(table) is obviously wrong.

2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid
   munmap(), but another reason is that we simply can't do vm_munmap() unless
   current->mm == mm and this is not true in general, the caller is mmput().

3. We do not really need to nullify mm->ioctx_table before return, probably
   the current code does this to catch the potential problems. But in this
   case RCU_INIT_POINTER(NULL) looks better.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2014-06-24 18:10:24 -04:00
Benjamin LaHaise edfbbf388f aio: fix kernel memory disclosure in io_getevents() introduced in v3.10
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380be.  The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206.  Thanks to Mateusz and
Petr for disclosing this issue.

This patch applies to v3.12+.  A separate backport is needed for 3.10/3.11.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@vger.kernel.org
2014-06-24 13:46:01 -04:00
Benjamin LaHaise f8567a3845 aio: fix aio request leak when events are reaped by userspace
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping.  Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+.  A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: stable@vger.kernel.org
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
2014-06-24 13:32:27 -04:00
Steve French ce36d9ab3b [CIFS] fix mount failure with broken pathnames when smb3 mount with mapchars option
When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ?
via the Unicode Windows to POSIX remap range) empty paths
(eg when we open "" to query the root of the SMB3 directory on mount) were not
null terminated so we sent garbarge as a path name on empty paths which caused
SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified.  mapchars is
particularly important since Unix Extensions for SMB3 are not supported (yet)

Signed-off-by: Steve French <smfrench@gmail.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
2014-06-24 08:10:24 -05:00
Xue jiufei ac4fef4d23 ocfs2/dlm: do not purge lockres that is queued for assert master
When workqueue is delayed, it may occur that a lockres is purged while it
is still queued for master assert.  it may trigger BUG() as follows.

N1                                         N2
dlm_get_lockres()
->dlm_do_master_requery
                                  is the master of lockres,
                                  so queue assert_master work

                                  dlm_thread() start running
                                  and purge the lockres

                                  dlm_assert_master_worker()
                                  send assert master message
                                  to other nodes
receiving the assert_master
message, set master to N2

dlmlock_remote() send create_lock message to N2, but receive DLM_IVLOCKID,
if it is RECOVERY lockres, it triggers the BUG().

Another BUG() is triggered when N3 become the new master and send
assert_master to N1, N1 will trigger the BUG() because owner doesn't
match.  So we should not purge lockres when it is queued for assert
master.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
jiangyiwen b9aaac5a6b ocfs2: do not return DLM_MIGRATE_RESPONSE_MASTERY_REF to avoid endless,loop during umount
The following case may lead to endless loop during umount.

node A         node B               node C       node D
umount volume,
migrate lockres1
to B
                                                 want to lock lockres1,
                                                 send
                                                 MASTER_REQUEST_MSG
                                                 to C
                                    init block mle
               send
               MIGRATE_REQUEST_MSG
               to C
                                    find a block
                                    mle, and then
                                    return
                                    DLM_MIGRATE_RESPONSE_MASTERY_REF
                                    to B
               set C in refmap
                                    umount successfully
               try to umount, endless
               loop occurs when migrate
               lockres1 since C is in
               refmap

So we can fix this endless loop case by only returning
DLM_MIGRATE_RESPONSE_MASTERY_REF if it has a mastery mle when receiving
MIGRATE_REQUEST_MSG.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
jiangyiwen 595297a8f9 ocfs2: manually do the iput once ocfs2_add_entry failed in ocfs2_symlink and ocfs2_mknod
When the call to ocfs2_add_entry() failed in ocfs2_symlink() and
ocfs2_mknod(), iput() will not be called during dput(dentry) because no
d_instantiate(), and this will lead to umount hung.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Yiwen Jiang f7a14f32e7 ocfs2: fix a tiny race when running dirop_fileop_racer
When running dirop_fileop_racer we found a dead lock case.

2 nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
/race/16/1 in the filesystem, and let the inode number of dir 16 is less
than the inode number of dir race.

Node A                            Node B
mv /race/16/1 /race/
                                  right after Node A has got the
                                  EX mode of /race/16/, and tries to
                                  get EX mode of /race
                                  ls /race/16/

In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/.  Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/.  Since EX and PR are mutually exclusive, dead
lock happens.

This patch fixes this case by locking in ancestor order before trying
inode number order.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Xue jiufei a270c6d3c0 ocfs2/dlm: fix misuse of list_move_tail() in dlm_run_purge_list()
When a lockres in purge list but is still in use, it should be moved to
the tail of purge list.  dlm_thread will continue to check next lockres in
purge list.  However, code list_move_tail(&dlm->purge_list,
&lockres->purge) will do *no* movements, so dlm_thread will purge the same
lockres in this loop again and again.  If it is in use for a long time,
other lockres will not be processed.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Wengang Wang 8a8ad1c2f6 ocfs2: refcount: take rw_lock in ocfs2_reflink
This patch tries to fix this crash:

 #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
    RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
    RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
    R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
 #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
 #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

It crashes at
        BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.

ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().

It can happen in this case:

Node A(which crashes)				Node B
------------------------                 ---------------------------
ocfs2_file_aio_write
  ocfs2_prepare_inode_for_write
    ocfs2_inode_lock
    ...
    ocfs2_inode_unlock
  #no refcount found
....					ocfs2_reflink
                                          ocfs2_inode_lock
                                          ...
                                          ocfs2_inode_unlock
                                          #now, refcount flag set on extent

                                        ...
                                        flush change to disk

ocfs2_direct_IO_get_blocks
  ocfs2_get_clusters
    #extent map miss
    #buffer_head miss
    read extents from disk
  found refcount flag on extent
  crash..

Fix:
Take rw_lock in ocfs2_reflink path

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Xue jiufei b253bfd878 ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously"
75f82eaa50 ("ocfs2: fix NULL pointer dereference when dismount and
ocfs2rec simultaneously") may cause umount hang while shutting down
truncate log.

The situation is as followes:
ocfs2_dismout_volume
-> ocfs2_recovery_exit
  -> free osb->recovery_map
-> ocfs2_truncate_shutdown
  -> lock global bitmap inode
    -> ocfs2_wait_for_recovery
          -> check whether osb->recovery_map->rm_used is zero

Because osb->recovery_map is already freed, rm_used can be any other
values, so it may yield umount hang.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Tariq Saeed 27bf6305cf ocfs2: fix deadlock when two nodes are converting same lock from PR to EX and idletimeout closes conn
Orabug: 18639535

Two node cluster and both nodes hold a lock at PR level and both want to
convert to EX at the same time.  Master node 1 has sent BAST and then
closes the connection due to idletime out.  Node 0 receives BAST, sends
unlock req with cancel flag but gets error -ENOTCONN.  The problem is
this error is ignored in dlm_send_remote_unlock_request() on the
**incorrect** assumption that the master is dead.  See NOTE in comment
why it returns DLM_NORMAL.  Upon getting DLM_NORMAL, node 0 proceeds to
sends convert (without cancel flg) which fails with -ENOTCONN.  waits 5
sec and resends.

This time gets DLM_IVLOCKID from the master since lock not found in
grant, it had been moved to converting queue in response to conv PR->EX
req.  No way out.

Node 1 (master)				Node 0
==============				======

  lock mode PR				PR

  convert PR -> EX
  mv grant -> convert and que BAST
  ...
                     <-------- convert PR -> EX
  convert que looks like this: ((node 1, PR -> EX) (node 0, PR -> EX))
  ...
                        BAST (want PR -> NL)
                     ------------------>
  ...
  idle timout, conn closed
                                ...
                                In response to BAST,
                                sends unlock with cancel convert flag
                                gets -ENOTCONN. Ignores and
                                sends remote convert request
                                gets -ENOTCONN, waits 5 Sec, retries
  ...
  reconnects
                   <----------------- convert req goes through on next try
  does not find lock on grant que
                   status DLM_IVLOCKID
                   ------------------>
  ...

No way out.  Fix is to keep retrying unlock with cancel flag until it
succeeds or the master dies.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
alex chen 5fb1beb069 ocfs2: should add inode into orphan dir after updating entry in ocfs2_rename()
There are two files a and b in dir /mnt/ocfs2.

    node A                           node B

  mv a b
  In ocfs2_rename(), after calling
  ocfs2_orphan_add(), the inode of
  file b will be added into orphan
  dir.

  If ocfs2_update_entry() fails,
  ocfs2_rename return error and mv
  operation fails. But file b still
  exists in the parent dir.

  ocfs2_queue_orphan_scan
   -> ocfs2_queue_recovery_completion
   -> ocfs2_complete_recovery
   -> ocfs2_recover_orphans
  The inode of the file b will be
  put with iput().

  ocfs2_evict_inode
   -> ocfs2_delete_inode
   -> ocfs2_wipe_inode
   -> ocfs2_remove_inode
  OCFS2_VALID_FL in the inode
  i_flags will be cleared.

                                   The file b still can be accessed
                                   on node B.
                                   ls /mnt/ocfs2
                                   When first read the file b with
                                   ocfs2_read_inode_block(). It will
                                   validate the inode using
                                   ocfs2_validate_inode_block().
                                   Because OCFS2_VALID_FL not set in
                                   the inode i_flags, so the file
                                   system will be readonly.

So we should add inode into orphan dir after updating entry in
ocfs2_rename().

Signed-off-by: alex.chen <alex.chen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-23 16:47:45 -07:00
Jeff Layton d4c8e34fe8 nfsd: properly handle embedded newlines in fault_injection input
Currently rpc_pton() fails to handle the case where you echo an address
into the file, as it barfs on the newline. Ensure that we NULL out the
first occurrence of any newline.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:38 -04:00
Jeff Layton f7ce5d2842 nfsd: fix return of nfs4_acl_write_who
AFAICT, the only way to hit this error is to pass this function a bogus
"who" value. In that case, we probably don't want to return -1 as that
could get sent back to the client. Turn this into nfserr_serverfault,
which is a more appropriate error for a server bug like this.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:38 -04:00
Jeff Layton 94ec938b61 nfsd: add appropriate __force directives to filehandle generation code
The filehandle structs all use host-endian values, but will sometimes
stuff big-endian values into those fields. This is OK since these
values are opaque to the client, but it confuses sparse. Add __force to
make it clear that we are doing this intentionally.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton e2afc81919 nfsd: nfsd_splice_read and nfsd_readv should return __be32
The callers expect a __be32 return and the functions they call return
__be32, so having these return int is just wrong. Also, nfsd_finish_read
can be made static.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton b3d8d1284a nfsd: clean up sparse endianness warnings in nfscache.c
We currently hash the XID to determine a hash bucket to use for the
reply cache entry, which is fed into hash_32 without byte-swapping it.
Add __force to make sparse happy, and add some comments to explain
why.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton f419992c1f nfsd: add __force to opaque verifier field casts
sparse complains that we're stuffing non-byte-swapped values into
__be32's here. Since they're supposed to be opaque, it doesn't matter
much. Just add __force to make sparse happy.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Kinglong Mee bf18f163e8 NFSD: Using exp_get for export getting
Don't using cache_get besides export.h, using exp_get for export.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee 0da22a919d NFSD: Using path_get when assigning path for export
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee f15a5cf912 SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_ok
rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee 3c7aa15d20 NFSD: Using min/max/min_t/max_t for calculate
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Jaegeuk Kim 98397ff3cd f2fs: fix not to allocate unnecessary blocks during fallocate
This patch fixes the fallocate bug like below. (See xfstests/255)

In fallocate(fd, 0, 20480),
expand_inode_data processes
	for (index = pg_start; index <= pg_end; index++) {
		f2fs_reserve_block();
		...
	}

So, even though fallocate requests 20480, 5 blocks, f2fs allocates 6 blocks
including pg_end.
So, this patch adds one condition to avoid block allocation.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim ead432756a f2fs: recover fallocated data and its i_size together
This patch arranges the f2fs_locks to cover the fallocated data and its i_size.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:08 +09:00
Jaegeuk Kim ccfb30001f f2fs: fix to report newly allocate region as extent
Previous get_block in f2fs didn't report the newly allocated region which has
NEW_ADDR.
For reader, it should not report, but fiemap needs this.
So, this patch introduces two get_block sharing core function.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-23 10:05:07 +09:00
Eric Sandeen b474c7ae43 xfs: Nuke XFS_ERROR macro
XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.

Just nuke XFS_ERROR and associated bits.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:04:54 +10:00
Eric Sandeen d99831ff39 xfs: return is not a function
return is not a function.  "return(EIO);" is silly;
"return (EIO);" moreso.  return is not a function.
Nuke the pointless parens.

[dchinner: catch a couple of extra cases in xfs_attr_list.c,
xfs_acl.c and xfs_linux.h.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:03:54 +10:00
Linus Torvalds 2dfded8210 File locking related bugfixes for v3.16 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTpiwZAAoJEAAOaEEZVoIVplsP/383a9q3eXonbsi+Ea8CGbRl
 tdjjVhM1OY4NZYFAoulILDt3HqPTC6MBnqKlHz+BuMziwd/1+3w8S4E7IEwm/KtM
 ghNYX8ct5Bf1nc5QEdDmwf4PX48QbRwTuT1uIcXaJ+KtxTzI9qN7mnjRN91TUtq4
 WRGvOl0AsGCVq8YqxjztgD3TYbu7AG/72Em+DE9f81PTArAPTo2ySc3gxPuJJAsg
 G1x46Gx46sfqFX2FY4SPsXen+J/67Og67y6eBawxnT2Bp6ZGDuW+jyPRmkhf0yth
 pWAtkUi3XmEe6kk6GHiICsS0Yn0RG4jbz39+Ja+X7jibQVJ8Iz6b+Optw9RNQwYt
 jDWHKFS2AaL/CDejHYOQ1shHcozpRojtIbDLIZ9vTNTQ2r5cdaBvkXMmQzdoktmN
 wQtQ9AzBl8fHOFOQeCAwd/ZfCLIotvLoLds3K/CSqmpsyK2+9IyriQLKKZ2xm6Iu
 8+UUspGQcNVwcMP6YWtI6G+u58/mVanmK6dtpiyXrncZLAfU4H7ETL2IPu8jJTbv
 kTFCOJtXQzNZa5Xqur1hIewOG9/RlvAZAnii0Ghc3nXTWCCeNeI8re3jw4g9KdRv
 33t4sYfJld8LQ1NSMqIDyAs+fvytmGurYt+uhVpb58G/4CqBLNpmmIGIQ6LFLMfs
 75FQnbAezrD0H/JyAHUk
 =AJD3
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "File locking related bugfixes

  Nothing too earth-shattering here.  A fix for a potential regression
  due to a patch in pile #1, and the addition of a memory barrier to
  prevent a race condition between break_deleg and generic_add_lease"

* tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux:
  locks: set fl_owner for leases back to current->files
  locks: add missing memory barrier in break_deleg
2014-06-21 16:40:30 -10:00
Linus Torvalds e13d100beb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This fixes some lockups in btrfs reported with rc1.  It probably has
  some performance impact because it is backing off our spinning locks
  more often and switching to a blocking lock.  I'll be able to nail
  that down next week, but for now I want to get the lockups taken care
  of.

  Otherwise some more stack reduction and assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong error handle when the device is missing or is not writeable
  Btrfs: fix deadlock when mounting a degraded fs
  Btrfs: use bio_endio_nodec instead of open code
  Btrfs: fix NULL pointer crash when running balance and scrub concurrently
  btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
  Btrfs: fix broken free space cache after the system crashed
  Btrfs: make free space cache write out functions more readable
  Btrfs: remove unused wait queue in struct extent_buffer
  Btrfs: fix deadlocks with trylock on tree nodes
2014-06-21 14:21:43 -10:00
Linus Torvalds 147f1404db Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "Fixes for a new regression from the xdr encoding rewrite, and a
  delegation problem we've had for a while (made somewhat more annoying
  by the vfs delegation support added in 3.13)"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  NFSD: fix bug for readdir of pseudofs
  NFSD: Don't hand out delegations for 30 seconds after recalling them.
2014-06-21 14:20:38 -10:00
Miao Xie 8408c716d7 Btrfs: fix wrong error handle when the device is missing or is not writeable
The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie c55f139640 Btrfs: fix deadlock when mounting a degraded fs
The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
 # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
 # echo 1 > /sys/block/`basename <dev0>`/device/delete
 # mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie e990f16763 Btrfs: use bio_endio_nodec instead of open code
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Wang Shilong 298a8f9cf1 Btrfs: fix NULL pointer crash when running balance and scrub concurrently
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:

[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395]  RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078

This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.

Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Qu Wenruo ced96edc48 btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie e570fd27f2 Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie 5349d6c3ff Btrfs: make free space cache write out functions more readable
This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Filipe Manana 46fefe41b5 Btrfs: remove unused wait queue in struct extent_buffer
The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:28 -07:00
Chris Mason ea4ebde02e Btrfs: fix deadlocks with trylock on tree nodes
The Btrfs tree trylock function is poorly named.  It always takes
the spinlock and backs off if the blocking lock is held.  This
can lead to surprising lockups because people expect it to really be a
trylock.

This commit makes it a pure trylock, both for the spinlock and the
blocking lock.  It also reworks the nested lock handling slightly to
avoid taking the read lock while a spinning write lock might be held.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:19:55 -07:00
Jeff Layton 08bc03539d cifs: revalidate mapping prior to satisfying read_iter request with cache=loose
Before satisfying a read with cache=loose, we should always check
that the pagecache is valid before allowing a read to be satisfied
out of it.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-06-19 13:34:04 -05:00
Paul Bolle fe786f61f3 befs: remove check for CONFIG_BEFS_RW
Befs contains a check for CONFIG_BEFS_RW for over a decade now. The
related Kconfig symbol never existed, so this check always evaluated to
true. Remove it.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-06-19 15:28:14 +02:00
Kinglong Mee f41c5ad2ff NFSD: fix bug for readdir of pseudofs
Commit 561f0ed498 (nfsd4: allow large readdirs) introduces a bug
about readdir the root of pseudofs.

Call xdr_truncate_encode() revert encoded name when skipping.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-17 16:42:48 -04:00
NeilBrown 6282cd5655 NFSD: Don't hand out delegations for 30 seconds after recalling them.
If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.

The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a conflicting
operation on a read-delegated file receives NFS4ERR_DELAY and retries
the operation, but by the time it retries the server may already have
given out another delegation.

We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy handed.

We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.

The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.

We have 2 bloom filters, each of which is valid for 30 seconds.   When a
delegation is recalled the filehandle is added to one filter and will remain
disabled for between 30 and 60 seconds.

We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.

The bloom filters have 256 bits and 3 hash functions.  This should allow a
couple of dozen blocked  filehandles with minimal false positives.  If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-17 16:42:47 -04:00
Konstantin Khlebnikov ebe06187bf epoll: fix use-after-free in eventpoll_release_file
This fixes use-after-free of epi->fllink.next inside list loop macro.
This loop actually releases elements in the body.  The list is
rcu-protected but here we cannot hold rcu_read_lock because we need to
lock mutex inside.

The obvious solution is to use list_for_each_entry_safe().  RCU-ness
isn't essential because nobody can change this list under us, it's final
fput for this file.

The bug was introduced by ae10b2b4eb ("epoll: optimize EPOLL_CTL_DEL
using rcu")

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Stable <stable@vger.kernel.org> # 3.13+
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-16 17:21:59 -10:00
Björn Baumbach a1d0b84c30 fs/cifs: fix regression in cifs_create_mf_symlink()
commit d81b8a40e2
("CIFS: Cleanup cifs open codepath")
changed disposition to FILE_OPEN.

Signed-off-by: Björn Baumbach <bb@sernet.de>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Cc: <stable@vger.kernel.org> # v3.14+
Cc: Pavel Shilovsky <piastry@etersoft.ru>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2014-06-16 13:50:11 -05:00
Jan Kara c5c7b8ddfb ext4: Fix buffer double free in ext4_alloc_branch()
Error recovery in ext4_alloc_branch() calls ext4_forget() even for
buffer corresponding to indirect block it did not allocate. This leads
to brelse() being called twice for that buffer (once from ext4_forget()
and once from cleanup in ext4_ind_map_blocks()) leading to buffer use
count misaccounting. Eventually (but often much later because there
are other users of the buffer) we will see messages like:
VFS: brelse: Trying to free free buffer

Another manifestation of this problem is an error:
JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh);
inconsistent data on disk

The fix is easy - don't forget buffer we did not allocate. Also add an
explanatory comment because the indexing at ext4_alloc_branch() is
somewhat subtle.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-06-15 23:46:28 -04:00
Linus Torvalds 16d52ef7c0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This has a few fixes since our last pull and a new ioctl for doing
  btree searches from userland.  It's very similar to the existing
  ioctl, but lets us return larger items back down to the app"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix error handling in create_pending_snapshot
  btrfs: fix use of uninit "ret" in end_extent_writepage()
  btrfs: free ulist in qgroup_shared_accounting() error path
  Btrfs: fix qgroups sanity test crash or hang
  btrfs: prevent RCU warning when dereferencing radix tree slot
  Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
  btrfs: new ioctl TREE_SEARCH_V2
  btrfs: tree_search, search_ioctl: direct copy to userspace
  btrfs: new function read_extent_buffer_to_user
  btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
  btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
  btrfs: tree_search, search_ioctl: accept varying buffer
  btrfs: tree_search: eliminate redundant nr_items check
2014-06-14 19:48:43 -05:00
Linus Torvalds a311c48038 Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix and cleanups from Ben LaHaise:
 "This consists of a couple of code cleanups plus a minor bug fix"

* git://git.kvack.org/~bcrl/aio-next:
  aio: cleanup: flatten kill_ioctx()
  aio: report error from io_destroy() when threads race in io_destroy()
  fs/aio.c: Remove ctx parameter in kiocb_cancel
2014-06-14 19:43:27 -05:00
Eric Sandeen 47a306a748 btrfs: fix error handling in create_pending_snapshot
fcebe456 cut and pasted some code to a later point
in create_pending_snapshot(), but didn't switch
to the appropriate error handling for this stage
of the function.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:30 -07:00
Eric Sandeen 3e2426bd0e btrfs: fix use of uninit "ret" in end_extent_writepage()
If this condition in end_extent_writepage() is false:

	if (tree->ops && tree->ops->writepage_end_io_hook)

we will then test an uninitialized "ret" at:

	ret = ret < 0 ? ret : -EIO;

The test for ret is for the case where ->writepage_end_io_hook
failed, and we'd choose that ret as the error; but if
there is no ->writepage_end_io_hook, nothing sets ret.

Initializing ret to 0 should be sufficient; if
writepage_end_io_hook wasn't set, (!uptodate) means
non-zero err was passed in, so we choose -EIO in that case.

Signed-of-by: Eric Sandeen <sandeen@redhat.com>

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:28 -07:00
Eric Sandeen d737278091 btrfs: free ulist in qgroup_shared_accounting() error path
If tmp = ulist_alloc(GFP_NOFS) fails, we return without
freeing the previously allocated qgroups = ulist_alloc(GFP_NOFS)
and cause a memory leak.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:26 -07:00
Filipe Manana b050f9f6dd Btrfs: fix qgroups sanity test crash or hang
Often when running the qgroups sanity test, a crash or a hang happened.
This is because the extent buffer the test uses for the root node doesn't
have an header level explicitly set, making it have a random level value.
This is a problem when it's not zero for the btrfs_search_slot() calls
the test ends up doing, resulting in crashes or hangs such as the following:

[ 6454.127192] Btrfs loaded, debug=on, assert=on, integrity-checker=on
(...)
[ 6454.127760] BTRFS: selftest: Running qgroup tests
[ 6454.127964] BTRFS: selftest: Running test_test_no_shared_qgroup
[ 6454.127966] BTRFS: selftest: Qgroup basic add
[ 6480.152005] BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:5383]
[ 6480.152005] Modules linked in: btrfs(+) xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc i2c_piix4 i2c_core pcspkr evbug psmouse serio_raw e1000 [last unloaded: btrfs]
[ 6480.152005] irq event stamp: 188448
[ 6480.152005] hardirqs last  enabled at (188447): [<ffffffff8168ef5c>] restore_args+0x0/0x30
[ 6480.152005] hardirqs last disabled at (188448): [<ffffffff81698e6a>] apic_timer_interrupt+0x6a/0x80
[ 6480.152005] softirqs last  enabled at (188446): [<ffffffff810516cf>] __do_softirq+0x1cf/0x450
[ 6480.152005] softirqs last disabled at (188441): [<ffffffff81051c25>] irq_exit+0xb5/0xc0
[ 6480.152005] CPU: 0 PID: 5383 Comm: modprobe Not tainted 3.15.0-rc8-fdm-btrfs-next-33+ #4
[ 6480.152005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6480.152005] task: ffff8802146125a0 ti: ffff8800d0d00000 task.ti: ffff8800d0d00000
[ 6480.152005] RIP: 0010:[<ffffffff81349a63>]  [<ffffffff81349a63>] __write_lock_failed+0x13/0x20
[ 6480.152005] RSP: 0018:ffff8800d0d038e8  EFLAGS: 00000287
[ 6480.152005] RAX: 0000000000000000 RBX: ffffffff8168ef5c RCX: 000005deb8525852
[ 6480.152005] RDX: 0000000000000000 RSI: 0000000000001d45 RDI: ffff8802105000b8
[ 6480.152005] RBP: ffff8800d0d038e8 R08: fffffe12710f63db R09: ffffffffa03196fb
[ 6480.152005] R10: ffff8802146125a0 R11: ffff880214612e28 R12: ffff8800d0d03858
[ 6480.152005] R13: 0000000000000000 R14: ffff8800d0d00000 R15: ffff8802146125a0
[ 6480.152005] FS:  00007f14ff804700(0000) GS:ffff880215e00000(0000) knlGS:0000000000000000
[ 6480.152005] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 6480.152005] CR2: 00007fff4df0dac8 CR3: 00000000d1796000 CR4: 00000000000006f0
[ 6480.152005] Stack:
[ 6480.152005]  ffff8800d0d03908 ffffffff810ae967 0000000000000001 ffff8802105000b8
[ 6480.152005]  ffff8800d0d03938 ffffffff8168e57e ffffffffa0319c16 0000000000000007
[ 6480.152005]  ffff880210500000 ffff880210500100 ffff8800d0d039b8 ffffffffa0319c16
[ 6480.152005] Call Trace:
[ 6480.152005]  [<ffffffff810ae967>] do_raw_write_lock+0x47/0xa0
[ 6480.152005]  [<ffffffff8168e57e>] _raw_write_lock+0x5e/0x80
[ 6480.152005]  [<ffffffffa0319c16>] ? btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa0319c16>] btrfs_tree_lock+0x116/0x270 [btrfs]
[ 6480.152005]  [<ffffffffa02b2acb>] btrfs_lock_root_node+0x3b/0x50 [btrfs]
[ 6480.152005]  [<ffffffffa02b81a6>] btrfs_search_slot+0x916/0xa20 [btrfs]
[ 6480.152005]  [<ffffffff811a727f>] ? create_object+0x23f/0x300
[ 6480.152005]  [<ffffffffa02b9958>] btrfs_insert_empty_items+0x78/0xd0 [btrfs]
[ 6480.152005]  [<ffffffffa036041a>] insert_normal_tree_ref.constprop.4+0xa2/0x19a [btrfs]
[ 6480.152005]  [<ffffffffa03605c3>] test_no_shared_qgroup+0xb1/0x1ca [btrfs]
[ 6480.152005]  [<ffffffff8108cad6>] ? local_clock+0x16/0x30
[ 6480.152005]  [<ffffffffa035ef8e>] btrfs_test_qgroups+0x1ae/0x1d7 [btrfs]
[ 6480.152005]  [<ffffffffa03a69d2>] ? ftrace_define_fields_btrfs_space_reservation+0xfd/0xfd [btrfs]
[ 6480.152005]  [<ffffffffa03a6a86>] init_btrfs_fs+0xb4/0x153 [btrfs]
[ 6480.152005]  [<ffffffff81000352>] do_one_initcall+0x102/0x150
[ 6480.152005]  [<ffffffff8103d223>] ? set_memory_nx+0x43/0x50
[ 6480.152005]  [<ffffffff81682668>] ? set_section_ro_nx+0x6d/0x74
[ 6480.152005]  [<ffffffff810d91cc>] load_module+0x1cdc/0x2630
(...)

Therefore initialize the extent buffer as an empty leaf (level 0).

Issue easy to reproduce when btrfs is built as a module via:

    $ for ((i = 1; i <= 1000000; i++)); do rmmod btrfs; modprobe btrfs; done

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:24 -07:00
Sasha Levin f1e3c28949 btrfs: prevent RCU warning when dereferencing radix tree slot
Mark the dereference as protected by lock. Not doing so triggers
an RCU warning since the radix tree assumed that RCU is in use.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:22 -07:00
Wang Shilong 5fbc7c59fd Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting
Steps to reproduce:

 # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5
 # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device
 # mount /dev/sdb /mnt -o degraded
 # btrfs scrub start -BRd /mnt

This is because readahead would skip missing device, this is not true
for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block
mapping. If expected data locates in missing device, readahead thread
would not call __readahead_hook() which makes event @rc->elems=0
wait forever.

Fix this problem by checking return value of btrfs_map_block(),we
can only skip missing device safely if there are several mirrors.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-13 09:52:21 -07:00
Gerhard Heift cc68a8a5a4 btrfs: new ioctl TREE_SEARCH_V2
This new ioctl call allows the user to supply a buffer of varying size in which
a tree search can store its results. This is much more flexible if you want to
receive items which are larger than the current fixed buffer of 3992 bytes or
if you want to fetch more items at once. Items larger than this buffer are for
example some of the type EXTENT_CSUM.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-13 09:52:19 -07:00
Linus Torvalds 4bdeb31208 dlm for 3.16
This set includes one small fix related to resending SCTP messages.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTmwixAAoJEDgbc8f8gGmqlT0P/Akhh1P264NVAXSwVhQf1HKs
 SZFlCIKolyM8ih4TkPM1QDcB3i2FBJiLkaLQT3yd8jcxW6fpMnH6wmRkU3T3uSlB
 cV88eXUdS18gnPeMx1vMZrip2VOa5xpCLWyp7ClC6U7E3xI5xkc3eIbzilhhsr0b
 sp72ahBujPsbltUYpg2giJg0DtoQtc9Tw0PMbF2smv/p1m+gKE8IE0v9JiVoK148
 B3vRxiuqSfScOYLhNjyLmgMeCxalbNUwdd+v16HfgUCjVQR/ji9EO29AGDGwvR8P
 UCqHPS+WA9gUBff0/kHFNNweDWdAhvfVSDmxLUWxmEN7zyCZA/SYsq6COnh+zYz6
 S1rT4nCvlchPNxK7OthZh/iSJy2xqfX2mPy/rKf7SFHoE/AZidY4OIKAE4Y92/mp
 cq3RCYvn+SGFNLt7Y4VceWmH5IUFak9KhK6d+13oNFneY5QRCqMH5AcOSoCV0V7m
 sGRiobgMY9PW06/WMSJYN1H1QQ1YbEY2Jn44xEIeyDcegRUIi8joOTuM3iRx/ZuZ
 uKIe5lMSajeIsQvJOrWkrr+sD+odhxMs3NlsTjLoopXAM9oihMdnrWAapP6G3z2G
 0ZSHAtuy9KJxi/ycKwwHRcXrY6OFtWpBUcmDeTz7wQ3q4idf3q64tyf0VRfkMTGj
 jqP6JMfHKKcn67bjEcbn
 =HEIE
 -----END PGP SIGNATURE-----

Merge tag 'dlm-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm fix from David Teigland:
 "This contains one small fix related to resending SCTP messages"

* tag 'dlm-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: keep listening connection alive with sctp mode
2014-06-13 07:41:57 -07:00
Linus Torvalds 6d87c225f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "This has a mix of bug fixes and cleanups.

  Alex's patch fixes a rare race in RBD.  Ilya's patches fix an ENOENT
  check when a second rbd image is mapped and a couple memory leaks.
  Zheng fixes several issues with fragmented directories and multiple
  MDSs.  Josh fixes a spin/sleep issue, and Josh and Guangliang's
  patches fix setting and unsetting RBD images read-only.

  Naturally there are several other cleanups mixed in for good measure"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  rbd: only set disk to read-only once
  rbd: move calls that may sleep out of spin lock range
  rbd: add ioctl for rbd
  ceph: use truncate_pagecache() instead of truncate_inode_pages()
  ceph: include time stamp in every MDS request
  rbd: fix ida/idr memory leak
  rbd: use reference counts for image requests
  rbd: fix osd_request memory leak in __rbd_dev_header_watch_sync()
  rbd: make sure we have latest osdmap on 'rbd map'
  libceph: add ceph_monc_wait_osdmap()
  libceph: mon_get_version request infrastructure
  libceph: recognize poolop requests in debugfs
  ceph: refactor readpage_nounlock() to make the logic clearer
  mds: check cap ID when handling cap export message
  ceph: remember subtree root dirfrag's auth MDS
  ceph: introduce ceph_fill_fragtree()
  ceph: handle cap import atomically
  ceph: pre-allocate ceph_cap struct for ceph_add_cap()
  ceph: update inode fields according to issued caps
  rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
  ...
2014-06-12 23:06:23 -07:00
Linus Torvalds 3737a12761 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Ingo Molnar:
 "A second round of perf updates:

   - wide reaching kprobes sanitization and robustization, with the hope
     of fixing all 'probe this function crashes the kernel' bugs, by
     Masami Hiramatsu.

   - uprobes updates from Oleg Nesterov: tmpfs support, corner case
     fixes and robustization work.

   - perf tooling updates and fixes from Jiri Olsa, Namhyung Ki, Arnaldo
     et al:
        * Add support to accumulate hist periods (Namhyung Kim)
        * various fixes, refactorings and enhancements"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  perf: Differentiate exec() and non-exec() comm events
  perf: Fix perf_event_comm() vs. exec() assumption
  uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
  perf/documentation: Add description for conditional branch filter
  perf/x86: Add conditional branch filtering support
  perf/tool: Add conditional branch filter 'cond' to perf record
  perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
  uprobes: Teach copy_insn() to support tmpfs
  uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
  perf/x86: Use common PMU interrupt disabled code
  perf/ARM: Use common PMU interrupt disabled code
  perf: Disable sampled events if no PMU interrupt
  perf: Fix use after free in perf_remove_from_context()
  perf tools: Fix 'make help' message error
  perf record: Fix poll return value propagation
  perf tools: Move elide bool into perf_hpp_fmt struct
  perf tools: Remove elide setup for SORT_MODE__MEMORY mode
  perf tools: Fix "==" into "=" in ui_browser__warning assignment
  perf tools: Allow overriding sysfs and proc finding with env var
  perf tools: Consider header files outside perf directory in tags target
  ...
2014-06-12 19:18:49 -07:00
Gerhard Heift ba346b357d btrfs: tree_search, search_ioctl: direct copy to userspace
By copying each found item seperatly to userspace, we do not need extra
buffer in the kernel.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:22:05 -07:00
Gerhard Heift 550ac1d85e btrfs: new function read_extent_buffer_to_user
This new function reads the content of an extent directly to user memory.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:56 -07:00
Gerhard Heift 9b6e817d02 btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW
If an item in tree_search is too large to be stored in the given buffer, return
the needed size (including the header).

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:47 -07:00
Gerhard Heift 8f5f6178f3 btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer
In copy_to_sk, if an item is too large for the given buffer, it now returns
-EOVERFLOW instead of copying a search_header with len = 0. For backward
compatibility for the first item it still copies such a header to the buffer,
but not any other following items, which could have fitted.

tree_search changes -EOVERFLOW back to 0 to behave similiar to the way it
behaved before this patch.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:39 -07:00
Gerhard Heift 1254444288 btrfs: tree_search, search_ioctl: accept varying buffer
rewrite search_ioctl to accept a buffer with varying size

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:21:26 -07:00
Gerhard Heift 25c9bc2e2b btrfs: tree_search: eliminate redundant nr_items check
If the amount of items reached the given limit of nr_items, we can leave
copy_to_sk without updating the key. Also by returning 1 we leave the loop in
search_ioctl without rechecking if we reached the given limit.

Signed-off-by: Gerhard Heift <Gerhard@Heift.Name>
Signed-off-by: Chris Mason <clm@fb.com>
Acked-by: David Sterba <dsterba@suse.cz>
2014-06-12 18:20:39 -07:00
Linus Torvalds 16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Lidong Zhong 883854c545 dlm: keep listening connection alive with sctp mode
The connection struct with nodeid 0 is the listening socket,
not a connection to another node.  The sctp resend function
was not checking that the nodeid was valid (non-zero), so it
would mistakenly get and resend on the listening connection
when nodeid was zero.

Signed-off-by: Lidong Zhong <lzhong@suse.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2014-06-12 10:26:14 -05:00
Al Viro c2338f2dc7 lock_parent: don't step on stale ->d_parent of all-but-freed one
Dentry that had been through (or into) __dentry_kill() might be seen
by shrink_dentry_list(); that's normal, it'll be taken off the shrink
list and freed if __dentry_kill() has already finished.  The problem
is, its ->d_parent might be pointing to already freed dentry, so
lock_parent() needs to be careful.

We need to check that dentry hasn't already gone into __dentry_kill()
*and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
sure that whatever we see in ->d_parent after dropping ->d_lock it
won't be freed until we drop rcu_read_lock().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:29:13 -04:00
Al Viro 9c1d5284c7 Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus
Backmerge of dcache.c changes from mainline.  It's that, or complete
rebase...

Conflicts:
	fs/splice.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:28:09 -04:00
Al Viro 5f07385060 kill generic_file_splice_write()
no callers left

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:21:13 -04:00
Al Viro 3551dd79ac ceph: switch to iter_file_splice_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:21:12 -04:00
Al Viro 4da54c218d nfs: switch to iter_splice_write_file()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:21:11 -04:00
Al Viro 96f9bc8fbc fs/splice.c: remove unneeded exports
ocfs2 was using a bunch of splice.c guts...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:21:11 -04:00
Al Viro 6dc8bc0fb3 ocfs2: switch to iter_file_splice_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:21:10 -04:00
Al Viro 8d0207652c ->splice_write() via ->write_iter()
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter().  A bunch of simple cases coverted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:18:51 -04:00
Linus Torvalds 2840c566e9 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull reiserfs and ext3 changes from Jan Kara:
 "Big reiserfs cleanup from Jeff, an ext3 deadlock fix, and some small
  cleanups"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (34 commits)
  reiserfs: Fix compilation breakage with CONFIG_REISERFS_CHECK
  ext3: Fix deadlock in data=journal mode when fs is frozen
  reiserfs: call truncate_setsize under tailpack mutex
  fs/jbd/revoke.c: replace shift loop by ilog2
  reiserfs: remove obsolete __constant_cpu_to_le32
  reiserfs: balance_leaf refactor, split up balance_leaf_when_delete
  reiserfs: balance_leaf refactor, format balance_leaf_finish_node
  reiserfs: balance_leaf refactor, format balance_leaf_new_nodes_paste
  reiserfs: balance_leaf refactor, format balance_leaf_paste_right
  reiserfs: balance_leaf refactor, format balance_leaf_insert_right
  reiserfs: balance_leaf refactor, format balance_leaf_paste_left
  reiserfs: balance_leaf refactor, format balance_leaf_insert_left
  reiserfs: balance_leaf refactor, pull out balance_leaf{left, right, new_nodes, finish_node}
  reiserfs: balance_leaf refactor, pull out balance_leaf_finish_node_paste
  reiserfs: balance_leaf refactor pull out balance_leaf_finish_node_insert
  reiserfs: balance_leaf refactor, pull out balance_leaf_new_nodes_paste
  reiserfs: balance_leaf refactor, pull out balance_leaf_new_nodes_insert
  reiserfs: balance_leaf refactor, pull out balance_leaf_paste_right
  reiserfs: balance_leaf refactor, pull out balance_leaf_insert_right
  reiserfs: balance_leaf refactor, pull out balance_leaf_paste_left
  ...
2014-06-11 10:45:14 -07:00
Linus Torvalds 859862ddd2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The biggest change here is Josef's rework of the btrfs quota
  accounting, which improves the in-memory tracking of delayed extent
  operations.

  I had been working on Btrfs stack usage for a while, mostly because it
  had become impossible to do long stress runs with slab, lockdep and
  pagealloc debugging turned on without blowing the stack.  Even though
  you upgraded us to a nice king sized stack, I kept most of the
  patches.

  We also have some very hard to find corruption fixes, an awesome sysfs
  use after free, and the usual assortment of optimizations, cleanups
  and other fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits)
  Btrfs: convert smp_mb__{before,after}_clear_bit
  Btrfs: fix scrub_print_warning to handle skinny metadata extents
  Btrfs: make fsync work after cloning into a file
  Btrfs: use right type to get real comparison
  Btrfs: don't check nodes for extent items
  Btrfs: don't release invalid page in btrfs_page_exists_in_range()
  Btrfs: make sure we retry if page is a retriable exception
  Btrfs: make sure we retry if we couldn't get the page
  btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
  trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
  Btrfs: fix leaf corruption after __btrfs_drop_extents
  Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
  Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
  btrfs: free delayed node outside of root->inode_lock
  btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
  Btrfs: fix transaction leak during fsync call
  btrfs: Avoid trucating page or punching hole in a already existed hole.
  Btrfs: update commit root on snapshot creation after orphan cleanup
  Btrfs: ioctl, don't re-lock extent range when not necessary
  Btrfs: avoid visiting all extent items when cloning a range
  ...
2014-06-11 09:22:21 -07:00
Linus Torvalds 412dd3a6da xfs: update for 3.16-rc1
This update contains:
 o cleanup removing unused function args
 o rework of the filestreams allocator to use dentry cache parent lookups
 o new on-disk free inode btree and optimised inode allocator
 o various bug fixes
 o rework of internal attribute API
 o cleanup of superblock feature bit support to remove historic cruft
 o more fixes and minor cleanups
 o added a new directory/attribute geometry abstraction
 o yet more fixes and minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTl+YfAAoJEK3oKUf0dfod/T8QALmvR28JZTL3vtlD5rppXp9+
 DXOMrwgJ9V+GOI39tpUgw1u/C5DuaFPRPtmCjnb9Do4DJMrHj+zD8ADvoVd6asa0
 FHH4TuulQqOJVu67SZ3ng15yjyy+wPfymQZIQPQY/IwVMUUEpWnSFnKha1GAsL8Y
 RY/WNU50wMu4wxi0++ENooHJC2EoxXpzB80cHddN81zFEFZobw0cm5Aa5xBZEZ4i
 P+GpEuUpWHKvVaWRLuIMgVC0NuOt5KtLfS97ong+tRgWCw//QVl28Rxhrj1ZHsF3
 VAskVsSFVIIPHP7qKjQyCGk71iqBfrfAgRqqJHFZgSmtSzyK3hVvJlRRDdCT5hi6
 00aHg9vz9815I7zrQwyMuy872N3DTislOxJZGD7PKgLpgfeHs4qk+cQ1xCi2gdFn
 xnh2p4mLolZHzanUsoxYpSh7f7o+NT3xgET3yS63uuO/I57o74JJDfRDjWNX6I9F
 LLtIGb1cwVFUYbXcHGfP1wxQ1BS6rYYYwKpSJqqwJXApL9MqoxH2B8Hoo0BaG43/
 3UlNi+yljvhBNiJnx2pAIdU+WaIL1ZQj9XzuU1sFSa8lnFNb2x+wkgjHzJ0Hdotm
 zZqirCo1jyyNkyTwGfwJwGzNgZemQDMQ7cr2MYzG1mhFMLEZZJeFmWVzfuzJ3yoR
 jke/Hy/qiWVK0en43MdR
 =qnz2
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.16-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs updates from Dave Chinner:
 "This update contains:
   - cleanup removing unused function args
   - rework of the filestreams allocator to use dentry cache parent
     lookups
   - new on-disk free inode btree and optimised inode allocator
   - various bug fixes
   - rework of internal attribute API
   - cleanup of superblock feature bit support to remove historic cruft
   - more fixes and minor cleanups
   - added a new directory/attribute geometry abstraction
   - yet more fixes and minor cleanups"

* tag 'xfs-for-linus-3.16-rc1' of git://oss.sgi.com/xfs/xfs: (86 commits)
  xfs: fix xfs_da_args sparse warning in xfs_readdir
  xfs: Fix rounding in xfs_alloc_fix_len()
  xfs: tone down writepage/releasepage WARN_ONs
  xfs: small cleanup in xfs_lowbit64()
  xfs: kill xfs_buf_geterror()
  xfs: xfs_readsb needs to check for magic numbers
  xfs: block allocation work needs to be kswapd aware
  xfs: remove redundant geometry information from xfs_da_state
  xfs: replace attr LBSIZE with xfs_da_geometry
  xfs: pass xfs_da_args to xfs_attr_leaf_newentsize
  xfs: use xfs_da_geometry for block size in attr code
  xfs: remove mp->m_dir_geo from directory logging
  xfs: reduce direct usage of mp->m_dir_geo
  xfs: move node entry counts to xfs_da_geometry
  xfs: convert dir/attr btree threshold to xfs_da_geometry
  xfs: convert m_dirblksize to xfs_da_geometry
  xfs: convert m_dirblkfsbs to xfs_da_geometry
  xfs: convert directory segment limits to xfs_da_geometry
  xfs: convert directory db conversion to xfs_da_geometry
  xfs: convert directory dablk conversion to xfs_da_geometry
  ...
2014-06-11 09:03:47 -07:00
Jan Kara 19ef1229bc reiserfs: Fix compilation breakage with CONFIG_REISERFS_CHECK
There was a bug in debug printout when CONFIG_REISERFS_CHECK was
enabled so one of the assertions in do_balan.c didn't compile. Fix it.

Fixes: 0080e9f9d3
Signed-off-by: Jan Kara <jack@suse.cz>
2014-06-11 17:32:10 +02:00
Linus Torvalds 9ee4d7a653 Merge branch 'akpm' (patches from Andrew Morton)
Merge leftovers from Andrew Morton:
 "A few leftovers: ocfs2, gcov, RTC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  rtc: s5m: consolidate two device type switch statements
  rtc: s5m: add support for S2MPS14 RTC
  rtc: s5m: support different register layout
  rtc: s5m: use shorter time of register update
  rtc: s5m: remove undocumented time init on first boot
  mfd/rtc: sec/s5m: rename SEC* symbols to S5M
  gcov: add support for GCC 4.9
  ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
2014-06-10 15:34:55 -07:00
Tariq Saeed 79deb3c148 ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
When o2net-accept-one() rejects an illegal connection, it terminates the
loop picking up the remaining queued connections.  This fix will
continue accepting connections till the queue is emtpy.

Addresses Orabug 17489469.

Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 15:34:46 -07:00
Linus Torvalds d1e1cda862 NFS client updates for Linux 3.16
Highlights include:
 
 - Massive cleanup of the NFS read/write code by Anna and Dros
 - Support multiple NFS read/write requests per page in order to deal with
   non-page aligned pNFS striping. Also cleans up the r/wsize < page size
   code nicely.
 - stable fix for ensuring inode is declared uptodate only after all the
   attributes have been checked.
 - stable fix for a kernel Oops when remounting
 - NFS over RDMA client fixes
 - move the pNFS files layout driver into its own subdirectory
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTl3pmAAoJEGcL54qWCgDyraIP/08ZbbDowVTP9572bxl+VR2i
 zNbrflBtl1R05D4Imi/IEySK0w6xj1CLsncNpXAT2bxTlyKPW70tpiiPlRKMPuO8
 JW+iPiepR2t0mol6MEd46yuV8btXVk8I+7IYjPXANiMJG8O5dJzNQ8NiCQOERBNt
 FQ7rzTCFO0ESGXnT6vYrT4I0bwqYVklBiJRTT4PQVzhhhDq9qUdq21BlQjQJFXP4
 9aBLurxKptlHBvE6A2Quja6ObEC0s31CxcijqHIJ+Ue4GbKcFbMG1tgjY7ESE/AD
 rqzDeF0jvWHT+frmvFEUUXWqzF1ReZ4x9pfDoOgeG6T9/K6DT91O0yMOgG8jvlbF
 8DSATNYGDX5sSjpvaG5JokGG+cGCk9srVDx+itn7HlwzalRwn0PjKtIYwOJ7TJIr
 o/j20nOsPrRGF0OqLf9phyocgRrlbMKOzj1IXldHHfAbNkRcISTK08lxvsz96Ddn
 zRyDmbsbY6QFXdB3AVSeQmg5R0OOLtzNIcsFPmNdvy5eiy67qU0lsGg8UGNnoz8k
 PHN1pcGejkctLhQ32ee3w/W6zkrgpJZcNC9JSoG8Dc3SeXus0c3IgumRknFCmiep
 ssN+1jEITAGeS5a2aBxwLQLVI2JAr2lxs5e+R4D5EsQlFkCl6Mrgtzh/aToWTuFl
 Qt7l2zI3r3VieKT9u7Bh
 =OyXR
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

   - massive cleanup of the NFS read/write code by Anna and Dros
   - support multiple NFS read/write requests per page in order to deal
     with non-page aligned pNFS striping.  Also cleans up the r/wsize <
     page size code nicely.
   - stable fix for ensuring inode is declared uptodate only after all
     the attributes have been checked.
   - stable fix for a kernel Oops when remounting
   - NFS over RDMA client fixes
   - move the pNFS files layout driver into its own subdirectory"

* tag 'nfs-for-3.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits)
  NFS: populate ->net in mount data when remounting
  pnfs: fix lockup caused by pnfs_generic_pg_test
  NFSv4.1: Fix typo in dprintk
  NFSv4.1: Comment is now wrong and redundant to code
  NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_state
  xprtrdma: Disconnect on registration failure
  xprtrdma: Remove BUG_ON() call sites
  xprtrdma: Avoid deadlock when credit window is reset
  SUNRPC: Move congestion window constants to header file
  xprtrdma: Reset connection timeout after successful reconnect
  xprtrdma: Use macros for reconnection timeout constants
  xprtrdma: Allocate missing pagelist
  xprtrdma: Remove Tavor MTU setting
  xprtrdma: Ensure ia->ri_id->qp is not NULL when reconnecting
  xprtrdma: Reduce the number of hardway buffer allocations
  xprtrdma: Limit work done by completion handler
  xprtrmda: Reduce calls to ib_poll_cq() in completion handlers
  xprtrmda: Reduce lock contention in completion handlers
  xprtrdma: Split the completion queue
  xprtrdma: Make rpcrdma_ep_destroy() return void
  ...
2014-06-10 15:02:42 -07:00
Andy Lutomirski 23adbe12ef fs,userns: Change inode_capable to capable_wrt_inode_uidgid
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 13:57:22 -07:00
Chris Mason c7548af69d Btrfs: convert smp_mb__{before,after}_clear_bit
The new call is smp_mb__{before,after}_atomic.  The __ gives us extra
protection from the atomic rays.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-10 13:10:47 -07:00
Linus Torvalds 5b174fd647 Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
 "The largest piece is a long-overdue rewrite of the xdr code to remove
  some annoying limitations: for example, there was no way to return
  ACLs larger than 4K, and readdir results were returned only in 4k
  chunks, limiting performance on large directories.

  Also:
        - part of Neil Brown's work to make NFS work reliably over the
          loopback interface (so client and server can run on the same
          machine without deadlocks).  The rest of it is coming through
          other trees.
        - cleanup and bugfixes for some of the server RDMA code, from
          Steve Wise.
        - Various cleanup of NFSv4 state code in preparation for an
          overhaul of the locking, from Jeff, Trond, and Benny.
        - smaller bugfixes and cleanup from Christoph Hellwig and
          Kinglong Mee.

  Thanks to everyone!

  This summer looks likely to be busier than usual for knfsd.  Hopefully
  we won't break it too badly; testing definitely welcomed"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux: (100 commits)
  nfsd4: fix FREE_STATEID lockowner leak
  svcrdma: Fence LOCAL_INV work requests
  svcrdma: refactor marshalling logic
  nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry
  nfs4: remove unused CHANGE_SECURITY_LABEL
  nfsd4: kill READ64
  nfsd4: kill READ32
  nfsd4: simplify server xdr->next_page use
  nfsd4: hash deleg stateid only on successful nfs4_set_delegation
  nfsd4: rename recall_lock to state_lock
  nfsd: remove unneeded zeroing of fields in nfsd4_proc_compound
  nfsd: fix setting of NFS4_OO_CONFIRMED in nfsd4_open
  nfsd4: use recall_lock for delegation hashing
  nfsd: fix laundromat next-run-time calculation
  nfsd: make nfsd4_encode_fattr static
  SUNRPC/NFSD: Remove using of dprintk with KERN_WARNING
  nfsd: remove unused function nfsd_read_file
  nfsd: getattr for FATTR4_WORD0_FILES_AVAIL needs the statfs buffer
  NFSD: Error out when getting more than one fsloc/secinfo/uuid
  NFSD: Using type of uint32_t for ex_nflavors instead of int
  ...
2014-06-10 11:50:57 -07:00
Jeff Layton 0c27362998 locks: set fl_owner for leases back to current->files
This fixes a regression due to commit 130d1f956a (locks: ensure that
fl_owner is always initialized properly in flock and lease codepaths). I
had mistakenly thought that the fl_owner wasn't used in the lease code,
but I missed the place in __break_lease that does use it.

The i_have_this_lease check in generic_add_lease uses it. While I'm not
sure that check is terribly helpful [1], reset it back to using
current->files in order to ensure that there's no behavior change here.

[1]: leases are owned by the file description. It's possible that this
     is a threaded program, and the lease breaker and the task that
     would handle the signal are different, even if they have the same
     file table. So, there is the potential for false positives with
     this check.

Fixes: 130d1f956a (locks: ensure that fl_owner is always initialized properly in flock and lease codepaths)
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2014-06-10 12:29:05 -04:00
Linus Torvalds d53b47c08d This pull request contains several UBIFS fixes. One of them fixes a race
condition between the mmap page fault path and fsync. Another just removes a
 bogus assertion from the UBIFS memory shrinker.
 
 UBIFS also started honoring the MS_SILENT mount flag, so now it won't print
 many I/O errors when user-space just tries to probe for the FS.
 
 Rest of the changes are rather minor UBI/UBIFS fixes, improvements, and
 clean-ups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTlqqmAAoJECmIfjd9wqK0cYAP+QGoB8+DOGv4+rTtUegNaWaG
 QAeAkvOBxDa72NrTYMtDFBKiPPYcz0BUnPgsCyvYIlYknRuZ4xMsu1BucLF3y8E3
 P8aHpnNas0dA3el/M+M0qwpEG0pb1lQvFT1dJVP6/D4q4VexLlgt7PlQjP+98Dua
 jp9jBDMu3sEDsqA9NiO2hfuFuIqF6DFnBpglkNcGcAyOtzQ/Ps49nivhb1mFKuzI
 2kSm2kWmZCTnLlGG1uj9+diCpTKWZoES2Jmo6gcZjQIjlejSzQw4Fo8LqdmhfwOg
 0ba1D9kJuwPHmBeM0UW4fLtrJHkvs2F66YaRcbA7GUo5lNw6+yxTqi3jkGevsPOD
 7eQB8FVCco14PFKu8Vx4lbkS2ekZN6aQuYYw/EeY2f+w8MpQTfcWINP5OVNHHGVA
 QDeRiu9mRMbm3JARRe9NeL0+sryGN402jCHtESAhONUDsCmZKX9k8HUMY1iaPAIp
 kGJWOjbyzYRMJiOfQiklv4N6rusmnECiRWGCliKDCU6TER8U6m9ZCUHUKmtX+56c
 2yRtd6y6LVequ/p3wC3x66jF64wjh2PlTM5jrWSOIg42ihIGSMAbmB6MKA+4RxIv
 EejZ4lhCxUg6Xp7zd/qgdu3aJ/P2OM8yFL+BngGKFac54EnTDEUcc7d8c2DZZv8s
 7YYjpiKK1NQGnih0FABA
 =CP7Z
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.16-rc1-v2' of git://git.infradead.org/linux-ubifs

Pull UBIFS updates from Artem Bityutskiy:
 "This contains several UBIFS fixes.  One of them fixes a race condition
  between the mmap page fault path and fsync.  Another just removes a
  bogus assertion from the UBIFS memory shrinker.

  UBIFS also started honoring the MS_SILENT mount flag, so now it won't
  print many I/O errors when user-space just tries to probe for the FS.

  Rest of the changes are rather minor UBI/UBIFS fixes, improvements,
  and clean-ups"

* tag 'upstream-3.16-rc1-v2' of git://git.infradead.org/linux-ubifs:
  UBIFS: Add an assertion for clean_zn_cnt
  UBIFS: respect MS_SILENT mount flag
  UBIFS: Remove incorrect assertion in shrink_tnc()
  UBIFS: fix debugging check
  UBIFS: add missing ui pointer in debugging code
  UBI: block: Fix error path on alloc_workqueue failure
  UBIFS: Fix dump messages in ubifs_dump_lprops
  UBI: fix rb_tree node comparison in add_map
  UBIFS: Remove unused variables in ubifs_budget_space
  UBI: weaken the 'exclusive' constraint when opening volumes to rename
  UBIFS: fix an mmap and fsync race condition
2014-06-10 08:55:46 -07:00
Mateusz Guzik a914722f33 NFS: populate ->net in mount data when remounting
Otherwise the kernel oopses when remounting with IPv6 server because
net is dereferenced in dev_get_by_name.

Use net ns of current thread so that dev_get_by_name does not operate on
foreign ns. Changing the address is prohibited anyway so this should not
affect anything.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Cc: linux-nfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-10 11:10:29 -04:00
Weston Andros Adamson c5e20cb700 pnfs: fix lockup caused by pnfs_generic_pg_test
end_offset and req_offset both return u64 - avoid casting to u32
until it's needed, when it's less than the (u32) size returned by
nfs_generic_pg_test.

Also, fix the comments in pnfs_generic_pg_test.

Running the cthon04 special tests caused this lockup in the
"write/read at 2GB, 4GB edges" test when running against a file layout server:

BUG: soft lockup - CPU#0 stuck for 22s! [bigfile2:823]
Modules linked in: nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 nfs fscache ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_filter ip6_tables iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ppdev crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd serio_raw e1000 shpchp i2c_piix4 i2c_core parport_pc parport nfsd auth_rpcgss oid_registry exportfs nfs_acl lockd sunrpc btrfs xor zlib_deflate raid6_pq mptspi scsi_transport_spi mptscsih mptbase ata_generic floppy autofs4
irq event stamp: 205958
hardirqs last  enabled at (205957): [<ffffffff814a62dc>] restore_args+0x0/0x30
hardirqs last disabled at (205958): [<ffffffff814ad96a>] apic_timer_interrupt+0x6a/0x80
softirqs last  enabled at (205956): [<ffffffff8103ffb2>] __do_softirq+0x1ea/0x2ab
softirqs last disabled at (205951): [<ffffffff8104026d>] irq_exit+0x44/0x9a
CPU: 0 PID: 823 Comm: bigfile2 Not tainted 3.15.0-rc1-branch-pgio_plus+ #3
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
task: ffff8800792ec480 ti: ffff880078c4e000 task.ti: ffff880078c4e000
RIP: 0010:[<ffffffffa02ce51f>]  [<ffffffffa02ce51f>] nfs_page_group_unlock+0x3e/0x4b [nfs]
RSP: 0018:ffff880078c4fab0  EFLAGS: 00000202
RAX: 0000000000000fff RBX: ffff88006bf83300 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88006bf83300
RBP: ffff880078c4fab8 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff8249840c R11: 0000000000000000 R12: 0000000000000035
R13: ffff88007ffc72d8 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f45f11b7740(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3a8cb632d0 CR3: 000000007931c000 CR4: 00000000001407f0
Stack:
 ffff88006bf832c0 ffff880078c4fb00 ffffffffa02cec22 ffff880078c4fad8
 00000fff810f9d99 ffff880078c4fca0 ffff88006bf832c0 ffff88006bf832c0
 ffff880078c4fca0 ffff880078c4fd60 ffff880078c4fb28 ffffffffa02cee34
Call Trace:
 [<ffffffffa02cec22>] __nfs_pageio_add_request+0x298/0x34f [nfs]
 [<ffffffffa02cee34>] nfs_pageio_add_request+0x1f/0x42 [nfs]
 [<ffffffffa02d1722>] nfs_do_writepage+0x1b5/0x1e4 [nfs]
 [<ffffffffa02d1764>] nfs_writepages_callback+0x13/0x25 [nfs]
 [<ffffffffa02d1751>] ? nfs_do_writepage+0x1e4/0x1e4 [nfs]
 [<ffffffff810eb32d>] write_cache_pages+0x254/0x37f
 [<ffffffffa02d1751>] ? nfs_do_writepage+0x1e4/0x1e4 [nfs]
 [<ffffffff8149cf9e>] ? printk+0x54/0x56
 [<ffffffff810eacca>] ? __set_page_dirty_nobuffers+0x22/0xe9
 [<ffffffffa016d864>] ? put_rpccred+0x38/0x101 [sunrpc]
 [<ffffffffa02d1ae1>] nfs_writepages+0xb4/0xf8 [nfs]
 [<ffffffff810ec59c>] do_writepages+0x21/0x2f
 [<ffffffff810e36e8>] __filemap_fdatawrite_range+0x55/0x57
 [<ffffffff810e374a>] filemap_write_and_wait_range+0x2d/0x5b
 [<ffffffffa030ba0a>] nfs4_file_fsync+0x3a/0x98 [nfsv4]
 [<ffffffff8114ee3c>] vfs_fsync_range+0x18/0x20
 [<ffffffff810e40c2>] generic_file_aio_write+0xa7/0xbd
 [<ffffffffa02c5c6b>] nfs_file_write+0xf0/0x170 [nfs]
 [<ffffffff81129215>] do_sync_write+0x59/0x78
 [<ffffffff8112956c>] vfs_write+0xab/0x107
 [<ffffffff81129c8b>] SyS_write+0x49/0x7f
 [<ffffffff814acd12>] system_call_fastpath+0x16/0x1b

Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-10 11:07:56 -04:00
Linus Torvalds 64b2d1fbbf f2fs updates for v3.16
This patch-set includes the following major enhancement patches.
  o enhance wait_on_page_writeback
  o support SEEK_DATA and SEEK_HOLE
  o enhance readahead flows
  o enhance IO flushes
  o support fiemap
  o add some tracepoints
 
 The other bug fixes are as follows.
  o fix to support a large volume > 2TB correctly
  o recovery bug fix wrt fallocated space
  o fix recursive lock on xattr operations
  o fix some cases on the remount flow
 
 And, there are a bunch of cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTleLYAAoJEEAUqH6CSFDSdhEP/iY5hTZ02jH4ZiFPP/Nee4hS
 l0BHeZvrMjoccaWUDu0ZvIPC8BiJ7kLOgK+/VWS7LAfY1PY11ALEtYQOrW+RM47+
 jMfULegTod/F8WS2B6dk31QMhAZldtnsYvA5PS1VV3S0rht+qbOz+PDejZFgSMc3
 VVQ7OZzq30gMmtsw7oh3FHfeTu4xe/bxygdMRXgljQQD2MvorJiOb4MA+ovEDd9z
 CZMMTQvRKjc0d8LPf0gOkZEvG63GR6klCgFMuiappUsua0O8IPIEhCyEGFrE66vS
 fObVKpqNAsR2ABhS2grgn6Q7UUvn4xrF6jOwDH5tuw2yzmkQiMAWINwBdgnbEy1c
 D5S89PQ8TkQK9KBSoU0v8WKWC4HzWZF4ZEd6eq9VxVTS8iT0w8DtNHXK518FVC34
 82iqrLc0EhrcGNiW/7Nrc6WzHrWqorylCFN7atB3ruhVqeMh3MZsDU4Gq0WgB2oh
 pF9XVBEpJQpV35DYSAPzLkm+2+mwHVNqwdY3HcvUs7IP90+wZlrWSRG2FEfFt/e8
 6nwvbORrHMTI5VfdYlEPgpjuesFmYPZqEvRGdaDa1YrHqhvvgStEPT9qiq2qVn9+
 tr0HjpNRDD/etkaE6ujYOYqdxuk3mm6RY68h+KSbNcY1/VTvt2bN2telwWuDsxIq
 8yOsxV2x3JB/euDAJsSU
 =xqsO
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, there is no special interesting feature, but we've
  investigated a couple of tuning points with respect to the I/O flow.
  Several major bug fixes and a bunch of clean-ups also have been made.

  This patch-set includes the following major enhancement patches:
   - enhance wait_on_page_writeback
   - support SEEK_DATA and SEEK_HOLE
   - enhance readahead flows
   - enhance IO flushes
   - support fiemap
   - add some tracepoints

  The other bug fixes are as follows:
   - fix to support a large volume > 2TB correctly
   - recovery bug fix wrt fallocated space
   - fix recursive lock on xattr operations
   - fix some cases on the remount flow

  And, there are a bunch of cleanups"

* tag 'for-f2fs-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (52 commits)
  f2fs: support f2fs_fiemap
  f2fs: avoid not to call remove_dirty_inode
  f2fs: recover fallocated space
  f2fs: fix to recover data written by dio
  f2fs: large volume support
  f2fs: avoid crash when trace f2fs_submit_page_mbio event in ra_sum_pages
  f2fs: avoid overflow when large directory feathure is enabled
  f2fs: fix recursive lock by f2fs_setxattr
  MAINTAINERS: add a co-maintainer from samsung for F2FS
  MAINTAINERS: change the email address for f2fs
  f2fs: use inode_init_owner() to simplify codes
  f2fs: avoid to use slab memory in f2fs_issue_flush for efficiency
  f2fs: add a tracepoint for f2fs_read_data_page
  f2fs: add a tracepoint for f2fs_write_{meta,node,data}_pages
  f2fs: add a tracepoint for f2fs_write_{meta,node,data}_page
  f2fs: add a tracepoint for f2fs_write_end
  f2fs: add a tracepoint for f2fs_write_begin
  f2fs: fix checkpatch warning
  f2fs: deactivate inode page if the inode is evicted
  f2fs: decrease the lock granularity during write_begin
  ...
2014-06-09 19:11:44 -07:00
Linus Torvalds b1cce8032f Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French.

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix memory leaks in SMB2_open
  cifs: ensure that vol->username is not NULL before running strlen on it
  Clarify SMB2/SMB3 create context and add missing ones
  Do not send ClientGUID on SMB2.02 dialect
  cifs: Set client guid on per connection basis
  fs/cifs/netmisc.c: convert printk to pr_foo()
  fs/cifs/cifs.c: replace seq_printf by seq_puts
  Update cifs version number to 2.03
  fs: cifs: new helper: file_inode(file)
  cifs: fix potential races in cifs_revalidate_mapping
  cifs: new helper function: cifs_revalidate_mapping
  cifs: convert booleans in cifsInodeInfo to a flags field
  cifs: fix cifs_uniqueid_to_ino_t not to ever return 0
2014-06-09 19:08:43 -07:00
Liu Bo 6eda71d0c0 Btrfs: fix scrub_print_warning to handle skinny metadata extents
The skinny extents are intepreted incorrectly in scrub_print_warning(),
and end up hitting the BUG() in btrfs_extent_inline_ref_size.

Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:17 -07:00
Filipe Manana 7ffbb598a0 Btrfs: make fsync work after cloning into a file
When cloning into a file, we were correctly replacing the extent
items in the target range and removing the extent maps. However
we weren't replacing the extent maps with new ones that point to
the new extents - as a consequence, an incremental fsync (when the
inode doesn't have the full sync flag) was a NOOP, since it relies
on the existence of extent maps in the modified list of the inode's
extent map tree, which was empty. Therefore add new extent maps to
reflect the target clone range.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:16 -07:00
Liu Bo cd857dd6bc Btrfs: use right type to get real comparison
We want to make sure the point is still within the extent item, not to verify
the memory it's pointing to.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:15 -07:00
Josef Bacik 8a56457f5f Btrfs: don't check nodes for extent items
The backref code was looking at nodes as well as leaves when we tried to
populate extent item entries.  This is not good, and although we go away with it
for the most part because we'd skip where disk_bytenr != random_memory,
sometimes random_memory would match and suddenly boom.  This fixes that problem.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:14 -07:00
Filipe Manana 6fdef6d43c Btrfs: don't release invalid page in btrfs_page_exists_in_range()
In inode.c:btrfs_page_exists_in_range(), if the page we got from
the radix tree is an exception entry, which can't be retried, we
exit the loop with a non-NULL page and then call page_cache_release
against it, which is not ok since it's not a valid page. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:14 -07:00
Filipe Manana 809f901625 Btrfs: make sure we retry if page is a retriable exception
In inode.c:btrfs_page_exists_in_range(), if the page we get from the
radix tree is an exception which should make us retry, set page to
NULL in order to really retry, because otherwise we don't get another
loop iteration executed (page != NULL makes the while loop exit).
This also was making us call page_cache_release after exiting the loop,
which isn't correct because page doesn't point to a valid page, and
possibly return true from the function when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:13 -07:00
Filipe Manana 91405151eb Btrfs: make sure we retry if we couldn't get the page
In inode.c:btrfs_page_exists_in_range(), if we can't get the page
we need to retry. However we weren't retrying because we weren't
setting page to NULL, which makes the while loop exit immediately
and will make us call page_cache_release after exiting the loop
which is incorrect because our page get didn't succeed. This could
also make us return true when we shouldn't.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:12 -07:00
Gui Hecheng c81d57679e btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56
To return EOPNOTSUPP is more user friendly than to return EINVAL,
and then user-space tool will show that the dev_replace operation
for raid56 is not currently supported rather than showing that
there is an invalid argument.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:12 -07:00
Antonio Ospite 9391558411 trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/
Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:11 -07:00
Liu Bo 0b43e04f70 Btrfs: fix leaf corruption after __btrfs_drop_extents
Several reports about leaf corruption has been floating on the list, one of them
points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted
after __btrfs_drop_extents(), it's really a rare case but it does exist.

The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents().

So in btrfs_next_leaf(), we release the current path to re-search the last key of
the leaf for locating next leaf, and we've taken it into account that there might
be balance operations between leafs during this 'unlock and re-lock' dance, so
we check the path again and advance it if there are now more items available.
But things are a bit different if that last key happens to be removed and balance
gets a bigger key as the last one, and btrfs_search_slot will return it with
ret > 0, IOW, nothing change in this leaf except the new last key, then we think
we're okay because there is no more item balanced in, fine, we thinks we can
go to the next leaf.

However, we should return that bigger key, otherwise we deserve leaf corruption,
for example, in endio, skipping that key means that __btrfs_drop_extents() thinks
it has dropped all extent matched the required range and finish_ordered_io can
safely insert a new extent, but it actually doesn't and ends up a leaf
corruption.

One may be asking that why our locking on extent io tree doesn't work as
expected, ie. it should avoid this kind of race situation.  But in
__btrfs_drop_extents(), we don't always find extents which are included within
our locking range, IOW, extents can start before our searching start, in this
case locking on extent io tree doesn't protect us from the race.

This takes the special case into account.

Reviewed-by: Filipe Manana <fdmanana@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:10 -07:00
Filipe Manana 337c6f6830 Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item
We might have had an item with the previous key in the tree right
before we released our path. And after we released our path, that
item might have been pushed to the first slot (0) of the leaf we
were holding due to a tree balance. Alternatively, an item with the
previous key can exist as the only element of a leaf (big fat item).
Therefore account for these 2 cases, so that our callers (like
btrfs_previous_item) don't miss an existing item with a key matching
the previous key we computed above.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:09 -07:00
Filipe Manana f82a9901b0 Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled
If the NO_HOLES feature is enabled holes don't have file extent items in
the btree that represent them anymore. This made the clone operation
ignore the gaps that exist between consecutive file extent items and
therefore not create the holes at the destination. When not using the
NO_HOLES feature, the holes were created at the destination.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:09 -07:00
Jeff Mahoney 964930312a btrfs: free delayed node outside of root->inode_lock
On heavy workloads, we're seeing soft lockup warnings on
root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit
is to reduce the size of the critical section.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:08 -07:00
Gui Hecheng 902c68a4da btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX
To be accurate about the error case,
if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:07 -07:00
Filipe Manana b05fd8742f Btrfs: fix transaction leak during fsync call
If btrfs_log_dentry_safe() returns an error, we set ret to 1 and
fall through with the goal of committing the transaction. However,
in the case where the inode doesn't need a full sync, we would call
btrfs_wait_ordered_range() against the target range for our inode,
and if it returned an error, we would return without commiting or
ending the transaction.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:06 -07:00
Qu Wenruo d77815461f btrfs: Avoid trucating page or punching hole in a already existed hole.
btrfs_punch_hole() will truncate unaligned pages or punch hole on a
already existed hole.
This will cause unneeded zero page or holes splitting the original huge
hole.

This patch will skip already existed holes before any page truncating or
hole punching.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:06 -07:00
Filipe Manana 3821f34888 Btrfs: update commit root on snapshot creation after orphan cleanup
On snapshot creation (either writable or read-only), we do orphan cleanup
against the root of the snapshot. If the cleanup did remove any orphans,
then the current root node will be different from the commit root node
until the next transaction commit happens.

A send operation always uses the commit root of a snapshot - this means
it will see the orphans if it starts computing the send stream before the
next transaction commit happens (triggered by a timer or sync() for .e.g),
which is when the commit root gets assigned a reference to current root,
where the orphans are not visible anymore. The consequence of send seeing
the orphans is explained below.

For example:

    mkfs.btrfs -f /dev/sdd
    mount -o commit=999 /dev/sdd /mnt

    # open a file with O_TMPFILE and leave it open
    # write some data to the file
    btrfs subvolume snapshot -r /mnt /mnt/snap1

    btrfs send /mnt/snap1 -f /tmp/send.data

The send operation will fail with the following error:

    ERROR: send ioctl failed with -116: Stale file handle

What happens here is that our snapshot has an orphan inode still visible
through the commit root, that corresponds to the tmpfile. However send
will attempt to call inode.c:btrfs_iget(), with the goal of reading the
file's data, which will return -ESTALE because it will use the current
root (and not the commit root) of the snapshot.

Of course, there are other cases where we can get orphans, but this
example using a tmpfile makes it much easier to reproduce the issue.

Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
the commit root is different from the current root, just commit the
transaction associated with the snapshot's root (if it exists), so that
a send will not see any orphans that don't exist anymore. This also
guarantees a send will always see the same content regardless of whether
a transaction commit happened already before the send was requested and
after the orphan cleanup (meaning the commit root and current roots are
the same) or it hasn't happened yet (commit and current roots are
different).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:05 -07:00
Filipe Manana ff5df9b884 Btrfs: ioctl, don't re-lock extent range when not necessary
In ioctl.c:lock_extent_range(), after locking our target range, the
ordered extent that btrfs_lookup_first_ordered_extent() returns us
may not overlap our target range at all. In this case we would just
unlock our target range, wait for any new ordered extents that overlap
the range to complete, lock again the range and repeat all these steps
until we don't get any ordered extent and the delalloc flag isn't set
in the io tree for our target range.

Therefore just stop if we get an ordered extent that doesn't overlap
our target range and the dealalloc flag isn't set for the range in
the inode's io tree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:04 -07:00
Filipe Manana 2c463823cb Btrfs: avoid visiting all extent items when cloning a range
When cloning a range of a file, we were visiting all the extent items in
the btree that belong to our source inode. We don't need to visit those
extent items that don't overlap the range we are cloning, as doing so only
makes us waste time and do unnecessary btree navigations (btrfs_next_leaf)
for inodes that have a large number of file extent items in the btree.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:04 -07:00
Filipe Manana c55bfa67e9 Btrfs: set dead flag on the right root when destroying snapshot
We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the
parent of our target snapshot, instead of setting it in the target
snapshot's root.

This is easy to observe by running the following scenario:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    btrfs subvolume create /mnt/first_subvol
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    btrfs subvolume delete /mnt/first_subvol
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data

The send command failed because the send ioctl returned -EPERM.
A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:03 -07:00
Filipe Manana c125b8bff1 Btrfs: ensure readers see new data after a clone operation
We were cleaning the clone target file range from the page cache before
we did replace the file extent items in the fs tree. This was racy,
as right after cleaning the relevant range from the page cache and before
replacing the file extent items, a read against that range could be
performed by another task and populate again the page cache with stale
data (stale after the cloning finishes). This would result in reads after
the clone operation successfully finishes to get old data (and potentially
for a very long time). Therefore evict the pages after replacing the file
extent items, so that subsequent reads will always get the new data.

Similarly, we were prone to races while cloning the file extent items
because we weren't locking the target range and wait for any existing
ordered extents against that range to complete. It was possible that
after cloning the extent items, a write operation that was performed
before the clone operation and overlaps the same range, would end up
undoing all or part of the work the clone operation did (a worker task
running inode.c:btrfs_finish_ordered_io). Therefore lock the target
range in the io tree, wait for all pending ordered extents against that
range to finish and then safely perform the cloning.

The issue of reading stale data after the clone operation is easy to
reproduce by running the following C program in a loop until it exits
with return value 1.

 #include <unistd.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <errno.h>
 #include <pthread.h>
 #include <fcntl.h>
 #include <assert.h>
 #include <asm/types.h>
 #include <linux/ioctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>

 #define SRC_FILE "/mnt/sdd/foo"
 #define DST_FILE "/mnt/sdd/bar"
 #define FILE_SIZE (16 * 1024)
 #define PATTERN_SRC 'X'
 #define PATTERN_DST 'Y'

struct btrfs_ioctl_clone_range_args {
	__s64 src_fd;
	__u64 src_offset, src_length;
	__u64 dest_offset;
};

 #define BTRFS_IOCTL_MAGIC 0x94
 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
				   struct btrfs_ioctl_clone_range_args)

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int clone_done = 0;
static int reader_ready = 0;
static int stale_data = 0;

static void *reader_loop(void *arg)
{
	char buf[4096], want_buf[4096];

	memset(want_buf, PATTERN_SRC, 4096);
	pthread_mutex_lock(&mutex);
	reader_ready = 1;
	pthread_mutex_unlock(&mutex);

	while (1) {
		int done, fd, ret;

		fd = open(DST_FILE, O_RDONLY);
		assert(fd != -1);

		pthread_mutex_lock(&mutex);
		done = clone_done;
		pthread_mutex_unlock(&mutex);

		ret = read(fd, buf, 4096);
		assert(ret == 4096);
		close(fd);

		if (done) {
			ret = memcmp(buf, want_buf, 4096);
			if (ret == 0) {
				printf("Found new content\n");
			} else {
				printf("Found old content\n");
				pthread_mutex_lock(&mutex);
				stale_data = 1;
				pthread_mutex_unlock(&mutex);
			}
			break;
		}
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t reader;
	int ret, i, fd;
	struct btrfs_ioctl_clone_range_args clone_args;
	int fd1, fd2;

	ret = remove(SRC_FILE);
	if (ret == -1 && errno != ENOENT) {
		fprintf(stderr, "Error deleting src file: %s\n", strerror(errno));
		return 1;
	}
	ret = remove(DST_FILE);
	if (ret == -1 && errno != ENOENT) {
		fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno));
		return 1;
	}

	fd = open(SRC_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
	assert(fd != -1);
	for (i = 0; i < FILE_SIZE; i++) {
		char c = PATTERN_SRC;
		ret = write(fd, &c, 1);
		assert(ret == 1);
	}
	close(fd);
	fd = open(DST_FILE, O_CREAT | O_WRONLY | O_TRUNC, S_IRWXU);
	assert(fd != -1);
	for (i = 0; i < FILE_SIZE; i++) {
		char c = PATTERN_DST;
		ret = write(fd, &c, 1);
		assert(ret == 1);
	}
	close(fd);
        sync();

	ret = pthread_create(&reader, NULL, reader_loop, NULL);
	assert(ret == 0);
	while (1) {
		int r;
		pthread_mutex_lock(&mutex);
		r = reader_ready;
		pthread_mutex_unlock(&mutex);
		if (r) break;
	}

	fd1 = open(SRC_FILE, O_RDONLY);
	if (fd1 < 0) {
		fprintf(stderr, "Error open src file: %s\n", strerror(errno));
		return 1;
	}
	fd2 = open(DST_FILE, O_RDWR);
	if (fd2 < 0) {
		fprintf(stderr, "Error open dst file: %s\n", strerror(errno));
		return 1;
	}
	clone_args.src_fd = fd1;
	clone_args.src_offset = 0;
	clone_args.src_length = 4096;
	clone_args.dest_offset = 0;
	ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args);
	assert(ret == 0);
	close(fd1);
	close(fd2);

	pthread_mutex_lock(&mutex);
	clone_done = 1;
	pthread_mutex_unlock(&mutex);
	ret = pthread_join(reader, NULL);
	assert(ret == 0);

	pthread_mutex_lock(&mutex);
	ret = stale_data ? 1 : 0;
	pthread_mutex_unlock(&mutex);
	return ret;
}

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:02 -07:00
Rickard Strandqvist 8321cf2596 fs: btrfs: volumes.c: Fix for possible null pointer dereference
There is otherwise a risk of a possible null pointer dereference.

Was largely found by using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:01 -07:00
Jeff Mahoney c1895442be btrfs: allocate raid type kobjects dynamically
We are currently allocating space_info objects in an array when we
allocate space_info. When a user does something like:

# btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
# btrfs balance start -mconvert=single -dconvert=single /mnt -f
# btrfs balance start -mconvert=raid1 -dconvert=raid1 /

We can end up with memory corruption since the kobject hasn't
been reinitialized properly and the name pointer was left set.

The rationale behind allocating them statically was to avoid
creating a separate kobject container that just contained the
raid type. It used the index in the array to determine the index.

Ultimately, though, this wastes more memory than it saves in all
but the most complex scenarios and introduces kobject lifetime
questions.

This patch allocates the kobjects dynamically instead. Note that
we also remove the kobject_get/put of the parent kobject since
kobject_add and kobject_del do that internally.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:01 -07:00
Filipe Manana 7e3ae33efa Btrfs: send, use the right limits for xattr names and values
We were limiting the sum of the xattr name and value lengths to PATH_MAX,
which is not correct, specially on filesystems created with btrfs-progs
v3.12 or higher, where the default leaf size is max(16384, PAGE_SIZE), or
systems with page sizes larger than 4096 bytes.

Xattrs have their own specific maximum name and value lengths, which depend
on the leaf size, therefore use these limits to be able to send xattrs with
sizes larger than PATH_MAX.

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:21:00 -07:00
Filipe Manana 1af56070e3 Btrfs: send, don't error in the presence of subvols/snapshots
If we are doing an incremental send and the base snapshot has a
directory with name X that doesn't exist anymore in the second
snapshot and a new subvolume/snapshot exists in the second snapshot
that has the same name as the directory (name X), the incremental
send would fail with -ENOENT error. This is because it attempts
to lookup for an inode with a number matching the objectid of a
root, which doesn't exist.

Steps to reproduce:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt

    mkdir /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    rmdir /mnt/testdir
    btrfs subvolume create /mnt/testdir
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data

A test case for xfstests follows.

Reported-by: Robert White <rwhite@pobox.com>
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:59 -07:00
Chris Mason a79b7d4b3e Btrfs: async delayed refs
Delayed extent operations are triggered during transaction commits.
The goal is to queue up a healthly batch of changes to the extent
allocation tree and run through them in bulk.

This farms them off to async helper threads.  The goal is to have the
bulk of the delayed operations being done in the background, but this is
also important to limit our stack footprint.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Chris Mason 40f765805f Btrfs: split up __extent_writepage to lower stack usage
__extent_writepage has two unrelated parts.  First it does the delayed
allocation dance and second it does the mapping and IO for the page
we're actually writing.

This splits it up into those two parts so the stack from one doesn't
impact the stack from the other.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:58 -07:00
Alex Gartrell fc4adbff82 btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking
In these instances, we are trying to determine if a page has been accessed
since we began the operation for the sake of retry.  This is easily
accomplished by doing a gang lookup in the page mapping radix tree, and it
saves us the dependency on the flag (so that we might eventually delete
it).

btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
the radix tree look up with a gang lookup of 1, so that we can find the
next highest page >= index and see if it falls into our lock range.

Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
2014-06-09 17:20:57 -07:00
Chris Mason 0e378df15c Btrfs: cut down stack usage in btree_write_cache_pages
This adds noinline_for_stack to two helpers used by
btree_write_cache_pages.  It shaves us down from 424 bytes on the
stack to 280.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:56 -07:00
Chris Mason d4452bc526 Btrfs: break up __btrfs_write_out_cache to cut down stack usage
__btrfs_write_out_cache was one of our stack pigs.  This breaks it
up into helper functions and slims it down to 194 bytes.

Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:55 -07:00
Josef Bacik 2a10840945 Btrfs: free tmp ulist for qgroup rescan
Memory leaks are bad mmkay?

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:55 -07:00
Anand Jain 402a0f4759 btrfs: usage error should not be logged into system log
I have an opinion that system logs /var/log/messages are
valuable info to investigate the real system issues at
the data center. People handling data center issues
do spend a lot time and efforts analyzing messages
files. Having usage error logged into /var/log/messages
is something we should avoid.

Signed-off-by: Anand Jain <Anand.Jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:54 -07:00
David Sterba 67a77eb147 btrfs: remove newline from inode cache kthread name
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:53 -07:00
David Sterba 351fd35321 btrfs: remove stale newlines from log messages
I've noticed an extra line after "use no compression", but search
revealed much more in messages of more critical levels and rare errors.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:53 -07:00
Chris Mason 7d78874273 Btrfs: fix double free in find_lock_delalloc_range
We need to NULL the cached_state after freeing it, otherwise
we might free it again if find_delalloc_range doesn't find anything.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org
2014-06-09 17:20:52 -07:00
ZhangZhen 58dfae6365 btrfs: replace simple_strtoull() with kstrtoull()
use the newer and more pleasant kstrtoull() to replace simple_strtoull(),
because simple_strtoull() is marked for obsoletion.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:51 -07:00
Wang Shilong 298658414a Btrfs: set right total device count for seeding support
Seeding device support allows us to create a new filesystem
based on existed filesystem.

However newly created filesystem's @total_devices should include seed
devices. This patch fix the following problem:

 # mkfs.btrfs -f /dev/sdb
 # btrfstune -S 1 /dev/sdb
 # mount /dev/sdb /mnt
 # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1
 # umount /mnt
 # mount /dev/sdc /mnt               --->fs_devices->total_devices = 2

This is because we record right @total_devices in superblock, but
@fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout().

Fix this problem by not resetting @fs_devices->total_devices.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:50 -07:00
Guangliang Zhao 45ff35d6b9 Btrfs: remove OPT_acl parse when acl disabled
Even CONFIG_BTRFS_FS_POSIX_ACL is not defined, the acl still could
been enabled using a mount option, and now fs/btrfs/acl.o is not
built, so the mount options will appear to be supported but will
be silently ignored.

Signed-off-by: Guangliang Zhao <lucienchao@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:50 -07:00
Josef Bacik faa2dbf004 Btrfs: add sanity tests for new qgroup accounting code
This exercises the various parts of the new qgroup accounting code.  We do some
basic stuff and do some things with the shared refs to make sure all that code
works.  I had to add a bunch of infrastructure because I needed to be able to
insert items into a fake tree without having to do all the hard work myself,
hopefully this will be usefull in the future.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:49 -07:00
Josef Bacik fcebe4562d Btrfs: rework qgroup accounting
Currently qgroups account for space by intercepting delayed ref updates to fs
trees.  It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly.  The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together.  Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc.  This patch
accomplishes this by only adding qgroup operations for real ref changes.  We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time.  This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq:  we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence.  This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point.  This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:48 -07:00
Liu Bo 5dca6eea91 Btrfs: mark mapping with error flag to report errors to userspace
According to commit 865ffef379
(fs: fix fsync() error reporting),
it's not stable to just check error pages because pages can be
truncated or invalidated, we should also mark mapping with error
flag so that a later fsync can catch the error.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:47 -07:00
Liu Bo 29cc83f69c Btrfs: fix NULL pointer crash of deleting a seed device
Same as normal devices, seed devices should be initialized with
fs_info->dev_root as well, otherwise we'll get a NULL pointer crash.

Cc: Chris Murphy <lists@colorremedies.com>
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:47 -07:00
Wang Shilong f017f15f7c Btrfs: fix joining same transaction handle more than twice
We hit something like the following function call flows:

|->run_delalloc_range()
 |->btrfs_join_transaction()
   |->cow_file_range()
     |->btrfs_join_transaction()
       |->find_free_extent()
         |->btrfs_join_transaction()

Trace infomation can be seen as:

[ 7411.127040] ------------[ cut here ]------------
[ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]()
[ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G           O 3.13.0+ #4
[ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013
[ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5)
[ 7411.127092] Call Trace:
[ 7411.127097]  [<ffffffff815b87b0>] dump_stack+0x45/0x56
[ 7411.127101]  [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0
[ 7411.127102]  [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20
[ 7411.127109]  [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs]
[ 7411.127115]  [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs]
[ 7411.127120]  [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs]
[ 7411.127126]  [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs]
[ 7411.127131]  [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs]
[ 7411.127137]  [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
[ 7411.127142]  [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs]
[ 7411.127146]  [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
[ 7411.127151]  [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs]
[ 7411.127157]  [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs]
[ 7411.127163]  [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs]
[ 7411.127169]  [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs]
[ 7411.127171]  [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140
[ 7411.127176]  [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[ 7411.127182]  [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs]
[ 7411.127187]  [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs]
[ 7411.127194]  [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs]
[ 7411.127200]  [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs]
[ 7411.127207]  [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs]
[ 7411.127212]  [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs]
[ 7411.127219]  [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs]
[ 7411.127222]  [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330
[ 7411.127228]  [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs]

Here we fix it by avoiding joining transaction again if we have held
a transaction handle when allocating chunk in find_free_extent().

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:46 -07:00
Miao Xie 995946dd29 Btrfs: use helpers for last_trans_log_full_commit instead of opencode
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Filipe Manana 1f21ef0a34 Btrfs: check if items are ordered when a leaf is marked dirty
To ease finding bugs during development related to modifying btree leaves
in such a way that it makes its items not sorted by key anymore. Since this
is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY
is set, which isn't meant to be enabled for regular users.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:45 -07:00
Filipe Manana 35045bf2fd Btrfs: don't access non-existent key when csum tree is empty
When the csum tree is empty, our leaf (path->nodes[0]) has a number
of items equal to 0 and since btrfs_header_nritems() returns an
unsigned integer (and so is our local nritems variable) the following
comparison always evaluates to false:

     if (path->slots[0] >= nritems - 1) {

As the casting rules lead to:

     if ((u32)0 >= (u32)4294967295) {

This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf
some lines below:

    btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
    if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
        found_key.type != BTRFS_EXTENT_CSUM_KEY) {
		found_next = 1;
		goto insert;
    }

So just don't access such non-existent slot and don't set found_next to 1
when the tree is empty. It's very unlikely we'll get a random key with the
objectid and type values above, which is where we could go into trouble.

If nritems is 0, just set found_next to 1 anyway as it will make us insert
a csum item covering our whole extent (or the whole leaf) when the tree is
empty.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:44 -07:00
Wang Shilong de348ee022 Btrfs: make sure there are not any read requests before stopping workers
In close_ctree(), after we have stopped all workers,there maybe still
some read requests(for example readahead) to submit and this *maybe* trigger
an oops that user reported before:

kernel BUG at fs/btrfs/async-thread.c:619!

By hacking codes, i can reproduce this problem with one cpu available.
We fix this potential problem by invalidating all btree inode pages before
stopping all workers.

Thanks to Miao for pointing out this problem.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:43 -07:00
Tsutomu Itoh 59885b3930 Btrfs: fix possible memory leak in btrfs_create_tree()
In btrfs_create_tree(), if btrfs_insert_root() fails, we should
free root->commit_root.

Reported-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:42 -07:00
ZhangZhen 776e4aae55 btrfs: remove useless ACL check
posix_acl_xattr_set() already does the check, and it's the only
way to feed in an ACL from userspace.
So the check here is useless, remove it.

Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:42 -07:00
Anand Jain 4d90d28b1c btrfs: btrfs_rm_device() should zero mirror SB as well
This fix will ensure all SB copies on the disk is zeroed
when the disk is intentionally removed. This helps to
better manage disks in the user land.

This version of patch also merges the Zach patch as below.

 btrfs: don't double brelse on device rm

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:41 -07:00
Miao Xie 27cdeb7096 Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Filipe Manana f959492fc1 Btrfs: send, fix more issues related to directory renames
This is a continuation of the previous changes titled:

   Btrfs: fix incremental send's decision to delay a dir move/rename
   Btrfs: part 2, fix incremental send's decision to delay a dir move/rename

There's a few more cases where a directory rename/move must be delayed which was
previously overlooked. If our immediate ancestor has a lower inode number than
ours and it doesn't have a delayed rename/move operation associated to it, it
doesn't mean there isn't any non-direct ancestor of our current inode that needs
to be renamed/moved before our current inode (i.e. with a higher inode number
than ours).

So we can't stop the search if our immediate ancestor has a lower inode number than
ours, we need to navigate the directory hierarchy upwards until we hit the root or:

1) find an ancestor with an higher inode number that was renamed/moved in the send
   root too (or already has a pending rename/move registered);
2) find an ancestor that is a new directory (higher inode number than ours and
   exists only in the send root).

Reproducer for case 1)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir -p /mnt/a/c/d
    $ mkdir /mnt/a/b/e
    $ mkdir /mnt/a/c/d/f
    $ mv /mnt/a/b /mnt/a/c/d/2b
    $ mkdir /mnt/a/x
    $ mkdir /mnt/a/y

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/x /mnt/a/y
    $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
    $ mv /mnt/a/c/d /mnt/a/h/2d
    $ mv /mnt/a/c /mnt/a/h/2d/2b/2c

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

Simple reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir /mnt/a/c
    $ mv /mnt/a/b /mnt/a/c/b2
    $ mkdir /mnt/a/e

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/c/b2 /mnt/a/e/b3
    $ mkdir /mnt/a/e/b3/f
    $ mkdir /mnt/a/h
    $ mv /mnt/a/c /mnt/a/e/b3/f/c2
    $ mv /mnt/a/e /mnt/a/h/e2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

Another simple reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir /mnt/a/c
    $ mkdir /mnt/a/b/d
    $ mkdir /mnt/a/c/e

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mkdir /mnt/a/b/d/f
    $ mkdir /mnt/a/b/g
    $ mv /mnt/a/c/e /mnt/a/b/g/e2
    $ mv /mnt/a/c /mnt/a/b/d/f/c2
    $ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

More complex reproducer for case 2)

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b
    $ mkdir -p /mnt/a/c/d
    $ mkdir /mnt/a/b/e
    $ mkdir /mnt/a/c/d/f
    $ mv /mnt/a/b /mnt/a/c/d/2b
    $ mkdir /mnt/a/x
    $ mkdir /mnt/a/y

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/x /mnt/a/y
    $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e
    $ mv /mnt/a/c/d /mnt/a/h/2d
    $ mv /mnt/a/c /mnt/a/h/2d/2b/2c

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

For both cases the incremental send would enter an infinite loop when building
path strings.

While solving these cases, this change also re-implements the code to detect
when directory moves/renames should be delayed. Instead of dealing with several
specific cases separately, it's now more generic handling all cases with a simple
detection algorithm and if when applying a delayed move/rename there's a path loop
detected, it further delays the move/rename registering a new ancestor inode as
the dependency inode (so our rename happens after that ancestor is renamed).

Tests for these cases is being added to xfstests too.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:40 -07:00
Filipe Manana a10c40766c Btrfs: send, remove dead code from __get_cur_name_and_parent
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:39 -07:00
Filipe Manana c992ec94f2 Btrfs: send, account for orphan directories when building path strings
If we have directories with a pending move/rename operation, we must take into
account any orphan directories that got created before executing the pending
move/rename. Those orphan directories are directories with an inode number higher
then the current send progress and that don't exist in the parent snapshot, they
are created before current progress reaches their inode number, with a generated
name of the form oN-M-I and at the root of the filesystem tree, and later when
progress matches their inode number, moved/renamed to their final location.

Reproducer:

          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt

          $ mkdir -p /mnt/a/b/c/d
          $ mkdir /mnt/a/b/e
          $ mv /mnt/a/b/c /mnt/a/b/e/CC
          $ mkdir /mnt/a/b/e/CC/d/f
	  $ mkdir /mnt/a/g

          $ btrfs subvolume snapshot -r /mnt /mnt/snap1
          $ btrfs send /mnt/snap1 -f /tmp/base.send

          $ mkdir /mnt/a/g/h
	  $ mv /mnt/a/b/e /mnt/a/g/h/EE
          $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD

          $ btrfs subvolume snapshot -r /mnt /mnt/snap2
          $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The second receive command failed with the following error:

    ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory

A test case for xfstests follows soon.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:38 -07:00
Filipe Manana b46ab97bcd Btrfs: send, avoid unnecessary inode item lookup in the btree
Regardless of whether the caller is interested or not in knowing the inode's
generation (dir_gen != NULL), get_first_ref always does a btree lookup to get
the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which
is in some cases).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:37 -07:00
Gui Hecheng 23f8f9b7ca btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space
For RAID0,5,6,10,
For system chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE
For data/meta chunk, there shouldn't be too many stripes to
make a btrfs_chunk that exceeds a leaf.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:36 -07:00
Gui Hecheng 5f43f86e3f btrfs: fix wrong max system array size check in kernel space
For system chunk array,
We copy a "disk_key" and an chunk item each time,
so there should be enough space to hold both of them,
not only the chunk item.

Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:36 -07:00
Qu Wenruo 65d33fd7a6 btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots.
Current btrfs_orphan_cleanup will also cleanup roots which is already in
fs_info->dead_roots without protection.
This will have conditional race with fs_info->cleaner_kthread.

This patch will use refs in root->root_item to detect roots in
dead_roots and avoid conflicts.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:35 -07:00
Miao Xie 21c7e75654 Btrfs: reclaim the reserved metadata space at background
Before applying this patch, the task had to reclaim the metadata space
by itself if the metadata space was not enough. And When the task started
the space reclamation, all the other tasks which wanted to reserve the
metadata space were blocked. At some cases, they would be blocked for
a long time, it made the performance fluctuate wildly.

So we introduce the background metadata space reclamation, when the space
is about to be exhausted, we insert a reclaim work into the workqueue, the
worker of the workqueue helps us to reclaim the reserved space at the
background. By this way, the tasks needn't reclaim the space by themselves at
most cases, and even if the tasks have to reclaim the space or are blocked
for the space reclamation, they will get enough space more quickly.

Here is my test result(Tested by compilebench):
 Memory:	2GB
 CPU:		2Cores * 1CPU
 Partition:	40GB(SSD)

Test command:
 # compilebench -D <mnt> -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:34 -07:00
Miao Xie 32d6b47fe6 Btrfs: output warning instead of error when loading free space cache failed
If we fail to load a free space cache, we can rebuild it from the extent tree,
so it is not a serious error, we should not output a error message that
would make the users uncomfortable. This patch uses warning message instead
of it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:33 -07:00
Qu Wenruo 5a1972bd9f btrfs: Add ctime/mtime update for btrfs device add/remove.
Btrfs will send uevent to udev inform the device change,
but ctime/mtime for the block device inode is not udpated, which cause
libblkid used by btrfs-progs unable to detect device change and use old
cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an
error message.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Cc: Karel Zak <kzak@redhat.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:33 -07:00
David Sterba 61155aa04e btrfs: assert that send is not in progres before root deletion
CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:32 -07:00
David Sterba 521e0546c9 btrfs: protect snapshots from deleting during send
The patch "Btrfs: fix protection between send and root deletion"
(18f687d538) does not actually prevent to delete the snapshot
and just takes care during background cleaning, but this seems rather
user unfriendly, this patch implements the idea presented in

http://www.spinics.net/lists/linux-btrfs/msg30813.html

- add an internal root_item flag to denote a dead root
- check if the send_in_progress is set and refuse to delete, otherwise
  set the flag and proceed
- check the flag in send similar to the btrfs_root_readonly checks, for
  all involved roots

The root lookup in send via btrfs_read_fs_root_no_name will check if the
root is really dead or not. If it is, ENOENT, aborted send. If it's
alive, it's protected by send_in_progress, send can continue.

CC: Miao Xie <miaox@cn.fujitsu.com>
CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:31 -07:00
Daeseok Youn 944a4515b2 btrfs: remove redundant null check in btrfs_dentry_release()
It doesn't need to check NULL for kfree()

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:31 -07:00
Filipe Manana ef3b9af50b Btrfs: implement inode_operations callback tmpfile
This implements the tmpfile callback of struct inode_operations, introduced
in the linux kernel 3.11, and implemented already by some filesystems. This
callback is invoked by the VFS when the flag O_TMPFILE is passed to the open
system call.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
2014-06-09 17:20:30 -07:00
David Sterba e4ef90ff61 btrfs: make FS_INFO ioctl available to anyone
This ioctl provides basic info about the filesystem that can be obtained
in other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:29 -07:00
David Sterba 7d6213c5a7 btrfs: make DEV_INFO ioctl available to anyone
This ioctl provides basic info about the devices that can be obtained in
other ways (eg. sysfs), there's no reason to restrict it to
CAP_SYSADMIN.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:28 -07:00
David Sterba df93589a17 btrfs: export more from FS_INFO to sysfs
Similar to the FS_INFO updates, export the basic filesystem info through
sysfs: node size, sector size and clone alignment.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:28 -07:00
David Sterba 80a773fbfc btrfs: retrieve more info from FS_INFO ioctl
Provide the basic information about filesystem through the ioctl:
* b-tree node size (same as leaf size)
* sector size
* expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments

Backward compatibility: if the values are 0, kernel does not provide
this information, the applications should ignore them.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:27 -07:00
David Sterba 7d824b6f9c btrfs: balance filter: add limit of processed chunks
This started as debugging helper, to watch the effects of converting
between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not finegrained enough and led to
converting too many chunks at once. Another example use is in connection
with drange+devid or vrange filters that allow to work with a specific
chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov <idryomov@gmail.com>
CC: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:26 -07:00
Filipe Manana fc19c5e736 Btrfs: fix leaf corruption caused by ENOSPC while hole punching
While running a stress test with multiple threads writing to the same btrfs
file system, I ended up with a situation where a leaf was corrupted in that
it had 2 file extent item keys that had the same exact key. I was able to
detect this quickly thanks to the following patch which triggers an assertion
as soon as a leaf is marked dirty if there are duplicated keys or out of order
keys:

    Btrfs: check if items are ordered when a leaf is marked dirty
    (https://patchwork.kernel.org/patch/3955431/)

Basically while running the test, I got the following in dmesg:

    [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
    (...)
    [28877.415917] Call Trace:
    [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
    [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
    [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
    [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
    [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
    [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
    [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
    (...)
    [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
    [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
    (...)
    [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
    [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
    [29854.132639] 		extent data offset 0 nr 286720 ram 286720
    [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
    [29854.132641] 		extent data disk bytenr 0 nr 0
    [29854.132643] 		extent data offset 0 nr 0 ram 0
    [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
    [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132646] 		extent data offset 0 nr 77824 ram 77824
    [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
    [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132649] 		extent data offset 0 nr 77824 ram 77824
    (...)
    [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
    (...)
    [29854.132771] Call Trace:
    [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
    [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
    [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
    [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
    [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
    [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
    [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
    (...)

So this is caused by getting an -ENOSPC error while punching a file hole, more
specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
or more file extent items due to lack of enough free space. When this happens,
in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
block reservation object to root->fs_info->trans_block_rsv, end our transaction and
start a new transaction basically - and, we keep increasing our current offset
(cur_offset) as long as it's smaller than the end of the target range (lockend) -
this makes use leave the loop with cur_offset == drop_end which in turn makes us
call fill_holes() for inserting a file extent item that represents a 0 bytes range
hole (and this insertion succeeds, as in the meanwhile more space became available).

This 0 bytes file hole extent item is a problem because any subsequent caller of
__btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
start file offset that is equal to the offset of the hole, will not remove this
extent item due to the following conditional in the while loop of
__btrfs_drop_extents:

    if (extent_end <= search_start) {
            path->slots[0]++;
            goto next_slot;
    }

This later makes the call to setup_items_for_insert() (at the very end of
__btrfs_drop_extents), insert a new file extent item with the same offset as
the 0 bytes file hole extent item that follows it. Needless is to say that this
causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
where we perform leaf sanity checks or in subsequent operations that manipulate
file extent items, as in the fallocate call as shown by the dmesg trace above.

Without my other patch to perform the leaf sanity checks once a leaf is marked
as dirty (if the integrity checker is enabled), it would have been much harder
to debug this issue.

This change might fix a few similar issues reported by users in the mailing
list regarding assertion failures in btrfs_set_item_key_safe calls performed
by __btrfs_drop_extents, such as the following report:

    http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938

Asking fill_holes() to create a 0 bytes wide file hole item also produced the
first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
that has an end smaller (by -1) than its start.

On 3.14 kernels this issue manifests itself through leaf corruption, as we get
duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
instead we have callers of __btrfs_drop_extents(), namely the functions
inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
btrfs_insert_empty_item() to insert the new file extent item, which would fail with
error -EEXIST, instead of inserting a duplicated key - which is still a serious
issue as it would make all similar file extent item replace operations keep
failing if they target the same file range.

Cc: stable@vger.kernel.org
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:26 -07:00
Liu Bo d2cbf2a260 Btrfs: do not increment on bio_index one by one
'bio_index' is just a index, it's really not necessary to do increment
one by one.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:25 -07:00
Filipe Manana a1a50f60a6 Btrfs: read inode size after acquiring the mutex when punching a hole
In a previous change, commit 12870f1c9b,
I accidentally moved the roundup of inode->i_size to outside of the
critical section delimited by the inode mutex, which is not atomic and
not correct since the size can be changed by other task before we acquire
the mutex. Therefore fix it.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:24 -07:00
Tobias Klauser 7fb18a0664 btrfs: Remove unnecessary check for NULL
iput() already checks for the inode being NULL, thus it's unnecessary to
check before calling.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:23 -07:00
Zach Brown 166ae5a418 btrfs: fix inline compressed read err corruption
uncompress_inline() is dropping the error from btrfs_decompress() after
testing it and zeroing the page that was supposed to hold decompressed
data.  This can silently turn compressed inline data in to zeros if
decompression fails due to corrupt compressed data or memory allocation
failure.

I verified this by manually forcing the error from btrfs_decompress()
for a silly named copy of od:

	if (!strcmp(current->comm, "failod"))
		ret = -ENOMEM;

  # od -x /mnt/btrfs/dir/80 | head -1
  0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031
  # echo 3 > /proc/sys/vm/drop_caches
  # cp $(which od) /tmp/failod
  # /tmp/failod -x /mnt/btrfs/dir/80 | head -1
  0000000 0000 0000 0000 0000 0000 0000 0000 0000

The fix is to pass the error to its caller.  Which still has a BUG_ON().
So we fix that too.

There seems to be no reason for the zeroing of the page on the error
from btrfs_decompress() but not from the allocation error a few lines
above.  So the page zeroing is removed.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:23 -07:00
Zach Brown 774bcb35f0 btrfs: return ptr error from compression workspace
The btrfs compression wrappers translated errors from workspace
allocation to either -ENOMEM or -1.  The compression type workspace
allocators are already returning a ERR_PTR(-ENOMEM).  Just return that
and get rid of the magical -1.

This helps a future patch return errors from the compression wrappers.

Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:22 -07:00
Zach Brown 60e1975acb btrfs: return errno instead of -1 from compression
The compression layer seems to have been built to return -1 and have
callers make up errors that make sense.  This isn't great because there
are different errors that originate down in the compression layer.

Let's return real negative errnos from the compression layer so that
callers can pass on the error without having to guess what happened.
ENOMEM for allocation failure, E2BIG when compression exceeds the
uncompressed input, and EIO for everything else.

This helps a future path return errors from btrfs_decompress().

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:21 -07:00
Stefan Behrens 98806b446d btrfs: check_int: propagate out-of-memory error upwards
This issue was not causing any harm but IMO (and in the opinion of the
static code checker) it is better to propagate this error status upwards.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:21 -07:00
Filipe Manana 61391d5622 Btrfs: fix hang on error (such as ENOSPC) when writing extent pages
When running low on available disk space and having several processes
doing buffered file IO, I got the following trace in dmesg:

[ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds.
[ 4202.720401]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.720874] kworker/u8:1    D 0000000000000001     0  5450      2 0x00000000
[ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs]
[ 4202.720908]  ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40
[ 4202.720913]  ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490
[ 4202.720918]  00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001
[ 4202.720922] Call Trace:
[ 4202.720931]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.720950]  [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs]
[ 4202.720956]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.720972]  [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs]
[ 4202.720988]  [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs]
[ 4202.720994]  [<ffffffff810680e5>] process_one_work+0x1f5/0x530
(...)
[ 4202.721027] 2 locks held by kworker/u8:1/5450:
[ 4202.721028]  #0:  (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721037]  #1:  ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530
[ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds.
[ 4202.721258]       Not tainted 3.13.0-fdm-btrfs-next-26+ #1
[ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4202.721699] btrfs           D 0000000000000001     0  7891   7890 0x00000001
[ 4202.721704]  ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40
[ 4202.721710]  ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490
[ 4202.721714]  ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff
[ 4202.721718] Call Trace:
[ 4202.721723]  [<ffffffff816a3cb9>] schedule+0x29/0x70
[ 4202.721727]  [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270
[ 4202.721732]  [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140
[ 4202.721736]  [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40
[ 4202.721740]  [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 4202.721744]  [<ffffffff816a488f>] wait_for_completion+0xdf/0x120
[ 4202.721749]  [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310
[ 4202.721765]  [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs]
[ 4202.721781]  [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs]
[ 4202.721786]  [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0
[ 4202.721799]  [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs]
[ 4202.721813]  [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs]
(...)

It turns out that extent_io.c:__extent_writepage(), which ends up being called
through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting
-ENOSPC when calling the fill_delalloc callback. In this situation, it returned
without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook)
ever being called for the respective page, which prevents the ordered extent's
bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work
is never queued into the endio_write_workers queue. This makes the task that
called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered
extent.

This is fairly easy to reproduce using a small filesystem and fsstress on
a quad core vm:

    mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd
    mount /dev/sdd /mnt

    fsstress -p 6 -d /mnt -n 100000 -x \
        "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \
	    -f allocsp=0 \
	    -f bulkstat=0 \
	    -f bulkstat1=0 \
	    -f chown=0 \
	    -f creat=1 \
	    -f dread=0 \
	    -f dwrite=0 \
	    -f fallocate=1 \
	    -f fdatasync=0 \
	    -f fiemap=0 \
	    -f freesp=0 \
	    -f fsync=0 \
	    -f getattr=0 \
	    -f getdents=0 \
	    -f link=0 \
	    -f mkdir=0 \
	    -f mknod=0 \
	    -f punch=1 \
	    -f read=0 \
	    -f readlink=0 \
	    -f rename=0 \
	    -f resvsp=0 \
	    -f rmdir=0 \
	    -f setxattr=0 \
	    -f stat=0 \
	    -f symlink=0 \
	    -f sync=0 \
	    -f truncate=1 \
	    -f unlink=0 \
	    -f unresvsp=0 \
	    -f write=4

So just ensure that if an error happens while writing the extent page
we call the writepage_end_io_hook callback. Also make it return the
error code and ensure the caller (extent_write_cache_pages) processes
all pages in the page vector even if an error happens only for some
of them, so that ordered extents end up released.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-09 17:20:20 -07:00
Dave Chinner 7691283d05 Merge branch 'xfs-misc-fixes-3-for-3.16' into for-next 2014-06-10 07:32:56 +10:00
Dave Chinner 8612c7e594 Merge branch 'xfs-da-geom' into for-next 2014-06-10 07:32:41 +10:00
Dave Chinner 35f46c5f04 xfs: fix xfs_da_args sparse warning in xfs_readdir
The kbuild test robot reported:

>> fs/xfs/xfs_dir2_readdir.c:672:41: sparse: Using plain integer as NULL pointer

Fix it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-10 07:30:36 +10:00
J. Bruce Fields 48385408b4 nfsd4: fix FREE_STATEID lockowner leak
27b11428b7 ("nfsd4: remove lockowner when removing lock stateid")
introduced a memory leak.

Cc: stable@vger.kernel.org
Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-09 17:13:54 -04:00
Tom Haynes f383b7e8fd NFSv4.1: Fix typo in dprintk
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-09 09:54:42 -04:00
Tom Haynes bf96bc0b3b NFSv4.1: Comment is now wrong and redundant to code
The save of the write offset was removed some time ago, so that
part of the comment is bogus.

The remainder is pretty self-evident.

So off with it!

Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-06-09 09:54:29 -04:00
Linus Torvalds 963649d735 9p fixes for the 3.16 merge window
Two bug fixes, one in xattr error path and the other in parsing
 major/minor numbers from devices.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: GPGTools - http://gpgtools.org
 
 iQIcBAABAgAGBQJTlM0UAAoJEDZk62b0Tg6xmiMP/i1t8K01JS64ZcyRYAy34oNn
 YR20Vz8zVfJ0Xkn7FdPXUZYIqi4CqawEWMUc1Yrw2qnRCkRMo1xtLT/Jobw7X0XF
 tfyc/bW9pvAfdfIA2npRENJtVWXB90nvzCf5ezCdr5tpRdWz1JJ5AhvTsubzEWrB
 oXZbn0/NHzEwvIt8sI/ZjuMKrvR1fD95a7wWnEg5Ckni+UZdKWJKA350vTXPfmfx
 334YbAy1/zBDnGsebzYMIEGtksybeiFZ2BWOscvafe0fh3eg0vqyE1i3wb7QZhO4
 IYuJA2Lq/L98BoFi3wshL+slfTwwXghFOct8OOV1NLdJh/osUndGeRGNXXZlW8R/
 zm2PE+pUKgKv0RjuSKUFr52bdcXsfKpyRJrSbh5Vh0jIT+FdZSW6sOD5ZTInK36o
 mD+VamD+wyJAmVwqoNlXRRweHyS1LCA7esuCjugK4gpy+lffRBWahnqxWQjK7Pc+
 7rAh6Q8ta79Y2AMQzC7SaiM87zcUOF6Dw/FkIHsS8UOI94dJNDNxdSfa8u+44WdT
 DgkR2ANQ5LryMJIwLwLKyEAt91zWSOUGk9dMbeTkBlJaBWWQDgTIhs6Ba7ptdI9/
 O/iu+S4jXSyFmemm0tSQwCysa4TQeAaiunjHQ5o+SnwbamuppyR4Y+/X40rGm+79
 Q5iBBF48tGfQal2AqLTx
 =EjHz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.16-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
 "Two bug fixes, one in xattr error path and the other in parsing
  major/minor numbers from devices"

* tag 'for-linus-3.16-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  9P: fix return value in v9fs_fid_xattr_set
  fs/9p: adjust sscanf parameters accordingly to the variable types
2014-06-08 14:35:19 -07:00
Linus Torvalds f8409abdc5 Clean ups and miscellaneous bug fixes, in particular for the new
collapse_range and zero_range fallocate functions.  In addition,
 improve the scalability of adding and remove inodes from the orphan
 list.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTk9x7AAoJENNvdpvBGATwQQ4QAN85xkNWWiq0feLGZjUVTre/
 JUgRQWXZYVogAQckQoTDXqJt1qKYxO45A8oIoUMI4uzgcFJm7iJIZJAv3Hjd2ftz
 48RVwjWHblmBz6e+CdmETpzJUaJr3KXbnk3EDQzagWg3Q64dBU/yT0c4foBO8wfX
 FI1MNin70r5NGQv6Mp4xNUfMoU6liCrsMO2RWkyxY2rcmxy6tkpNO/NBAPwhmn0e
 vwKHvnnqKM08Frrt6Lz3MpXGAJ+rhTSvmL+qSRXQn9BcbphdGa4jy+i3HbviRX4N
 z77UZMgMbfK1V3YHm8KzmmbIHrmIARXUlCM7jp4HPSnb4qhyERrhVmGCJZ8civ6Q
 3Cm9WwA93PQDfRX6Kid3K1tR/ql+ryac55o9SM990osrWp4C0IH+P/CdlSN0GspN
 3pJTLHUVVcxF6gSnOD+q/JzM8Iudl87Rxb17wA+6eg3AJRaPoQSPJoqtwZ89ZwOz
 RiZGuugFp7gDOxqo32lJ53fivO/e1zxXxu0dVHHjOnHBVWX063hlcibTg8kvFWg1
 7bBvUkvgT5jR+UuDX81wPZ+c0kkmfk4gxT5sHg6RlMKeCYi3uuLmAYgla3AM4j9G
 GeNNdVTmilH7wMgYB2wxd0C5HofgKgM5YFLZWc0FVSXMeFs5ST2kbLMXAZqzrKPa
 szHFEJHIGZByXfkP/jix
 =C1ZV
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Clean ups and miscellaneous bug fixes, in particular for the new
  collapse_range and zero_range fallocate functions.  In addition,
  improve the scalability of adding and remove inodes from the orphan
  list"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
  ext4: handle symlink properly with inline_data
  ext4: fix wrong assert in ext4_mb_normalize_request()
  ext4: fix zeroing of page during writeback
  ext4: remove unused local variable "stored" from ext4_readdir(...)
  ext4: fix ZERO_RANGE test failure in data journalling
  ext4: reduce contention on s_orphan_lock
  ext4: use sbi in ext4_orphan_{add|del}()
  ext4: use EXT_MAX_BLOCKS in ext4_es_can_be_merged()
  ext4: add missing BUFFER_TRACE before ext4_journal_get_write_access
  ext4: remove unnecessary double parentheses
  ext4: do not destroy ext4_groupinfo_caches if ext4_mb_init() fails
  ext4: make local functions static
  ext4: fix block bitmap validation when bigalloc, ^flex_bg
  ext4: fix block bitmap initialization under sparse_super2
  ext4: find the group descriptors on a 1k-block bigalloc,meta_bg filesystem
  ext4: avoid unneeded lookup when xattr name is invalid
  ext4: fix data integrity sync in ordered mode
  ext4: remove obsoleted check
  ext4: add a new spinlock i_raw_lock to protect the ext4's raw inode
  ext4: fix locking for O_APPEND writes
  ...
2014-06-08 13:03:35 -07:00
Linus Torvalds 3f17ea6dea Merge branch 'next' (accumulated 3.16 merge window patches) into master
Now that 3.15 is released, this merges the 'next' branch into 'master',
bringing us to the normal situation where my 'master' branch is the
merge window.

* accumulated work in next: (6809 commits)
  ufs: sb mutex merge + mutex_destroy
  powerpc: update comments for generic idle conversion
  cris: update comments for generic idle conversion
  idle: remove cpu_idle() forward declarations
  nbd: zero from and len fields in NBD_CMD_DISCONNECT.
  mm: convert some level-less printks to pr_*
  MAINTAINERS: adi-buildroot-devel is moderated
  MAINTAINERS: add linux-api for review of API/ABI changes
  mm/kmemleak-test.c: use pr_fmt for logging
  fs/dlm/debug_fs.c: replace seq_printf by seq_puts
  fs/dlm/lockspace.c: convert simple_str to kstr
  fs/dlm/config.c: convert simple_str to kstr
  mm: mark remap_file_pages() syscall as deprecated
  mm: memcontrol: remove unnecessary memcg argument from soft limit functions
  mm: memcontrol: clean up memcg zoneinfo lookup
  mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
  mm/mempool.c: update the kmemleak stack trace for mempool allocations
  lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
  mm: introduce kmemleak_update_trace()
  mm/kmemleak.c: use %u to print ->checksum
  ...
2014-06-08 11:31:16 -07:00
Linus Torvalds 9d2cd01b15 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd into next
Pull exofs raid6 support from Boaz Harrosh:
 "These simple patches will enable raid6 using the kernel's raid6_pq
  engine for support under exofs and pnfs-objects.

  There is nothing needed to do at exofs and pnfs-obj.  Just fire your
  mkfs.exofs with --raid=6 (that was already supported before) and off
  you go as usual.  The ORE will pick up the new map and will start
  writing two devices of redundancy bits.  The patches are so simple
  because most of the ORE was already for the general raid case, only a
  few bug fixes were needed and the actual wiring into the raid6_pq
  engine"

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  ore: Support for raid 6
  ore: Remove redundant dev_order(), more cleanups
  ore: (trivial) reformat some code
2014-06-07 17:07:20 -07:00
Jaegeuk Kim 9ab7013492 f2fs: support f2fs_fiemap
This patch links f2fs_fiemap with generic function with get_block.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2014-06-08 08:56:49 +09:00
Linus Torvalds c593e89787 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fix from Chris Mason:
 "I had this in my 3.16 merge window queue, but it is small and obvious
  enough for 3.15.  I cherry-picked and retested against current rc8"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: send, fix corrupted path strings for long paths
2014-06-07 15:12:18 -07:00
Yan, Zheng 4e217b5dc8 ceph: use truncate_pagecache() instead of truncate_inode_pages()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2014-06-08 05:09:28 +08:00
Jeff Layton 1b19453d1c nfsd: don't halt scanning the DRC LRU list when there's an RC_INPROG entry
Currently, the DRC cache pruner will stop scanning the list when it
hits an entry that is RC_INPROG. It's possible however for a call to
take a *very* long time. In that case, we don't want it to block other
entries from being pruned if they are expired or we need to trim the
cache to get back under the limit.

Fix the DRC cache pruner to just ignore RC_INPROG entries.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:49 -04:00
J. Bruce Fields 999e568354 nfs4: remove unused CHANGE_SECURITY_LABEL
This constant has the wrong value.  And we don't use it.  And it's been
removed from the 4.2 spec anyway.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:49 -04:00
J. Bruce Fields 542d1ab3c7 nfsd4: kill READ64
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:48 -04:00
J. Bruce Fields 06553991e7 nfsd4: kill READ32
While we're here, let's kill off a couple of the read-side macros.

Leaving the more complicated ones alone for now.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:47 -04:00
J. Bruce Fields 05638dc73a nfsd4: simplify server xdr->next_page use
The rpc code makes available to the NFS server an array of pages to
encod into.  The server represents its reply as an xdr buf, with the
head pointing into the first page in that array, the pages ** array
starting just after that, and the tail (if any) sharing any leftover
space in the page used by the head.

While encoding, we use xdr_stream->page_ptr to keep track of which page
we're currently using.

Currently we set xdr_stream->page_ptr to buf->pages, which makes the
head a weird exception to the rule that page_ptr always points to the
page we're currently encoding into.  So, instead set it to buf->pages -
1 (the page actually containing the head), and remove the need for a
little unintuitive logic in xdr_get_next_encode_buffer() and
xdr_truncate_encode.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-06 19:22:46 -04:00
Fabian Frederick 0244756edc ufs: sb mutex merge + mutex_destroy
Commit 788257d610 ("ufs: remove the BKL") replaced BKL with mutex
protection using functions lock_ufs, unlock_ufs and struct mutex 'mutex'
in sb_info.

Commit b6963327e0 ("ufs: drop lock/unlock super") removed lock/unlock
super and added struct mutex 's_lock' in sb_info.

Those 2 mutexes are generally locked/unlocked at the same time except in
allocation (balloc, ialloc).

This patch merges the 2 mutexes and propagates first commit solution.
It also adds mutex destruction before kfree during ufs_fill_super
failure and ufs_put_super.

[akpm@linux-foundation.org: avoid ifdefs, return -EROFS not -EINVAL]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: "Chen, Jet" <jet.chen@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:18 -07:00
Fabian Frederick c1d4518c4e fs/dlm/debug_fs.c: replace seq_printf by seq_puts
Replace seq_printf where possible.  This patch also fixes the following
checkpatch warning "unnecessary whitespace before a quoted newline"

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:18 -07:00
Fabian Frederick 6edb56871a fs/dlm/lockspace.c: convert simple_str to kstr
Replace obsolete functions.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:18 -07:00
Fabian Frederick 4f4c337fb7 fs/dlm/config.c: convert simple_str to kstr
Replace obsolete functions

simple_strtoul/kstrtouint
simple_strtol/kstrtoint
(kstr __must_check requires the right function to be applied)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:17 -07:00
Fabian Frederick 310bf8f749 fs/reiserfs/stree.c: remove obsolete __constant
__constant_cpu_to_le32 converted to cpu_to_le32

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:17 -07:00
Fabian Frederick ae0a50aba0 fs/reiserfs/bitmap.c: coding style fixes
-Trivial code clean-up
-Fix endif }; (coccinelle warning)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches 1f7e0616cd fs: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches 5eccdf3954 ntfs: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches 92f778dd5d inotify: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches f5102e5630 nfs: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches 7ac9fe571d lockd: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches 75a3294ec5 fscache: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Joe Perches a88bbbeef6 coda: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Fabian Frederick 04541a2f31 fs/devpts/inode.c: convert printk to pr_foo()
Also convert spaces to tabs (checkpatch warnings) if (!dentry) KERN_NOTICE
converted to pr_err (like if (!inode) error process)

[akpm@linux-foundation.org: use KBUILD_MODNAME, per Joe]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:14 -07:00
Fabian Frederick 0227d6abb3 fs/cachefiles: replace kerror by pr_err
Also add pr_fmt in internal.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:14 -07:00
Fabian Frederick 4e1eb88305 FS/CACHEFILES: convert printk to pr_foo()
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:14 -07:00
Fabian Frederick ef74885353 fs/pstore: logging clean-up
- Define pr_fmt in plateform.c and ram_core.c for global prefix.

- Coalesce format fragments.

- Separate format/arguments on lines > 80 characters.

Note: Some pr_foo() were initially declared without prefix and therefore
this could break existing log analyzer.

[akpm@linux-foundation.org: missed a couple of prefix removals]
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:13 -07:00
Fabian Frederick 9606d9aa85 fs/affs: pr_debug cleanup
- Remove AFFS: prefix (defined in pr_fmt)

- Use __func__

- Separate format/arguments on lines > 80 characters.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:13 -07:00
Fabian Frederick 0158de12b0 fs/affs: convert printk to pr_foo()
-All printk(KERN_foo converted to pr_foo()

-Default printk converted to pr_warn()

-Add pr_fmt to affs.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:13 -07:00
Fabian Frederick 0c89d67016 fs/affs/file.c: remove unnecessary function parameters
- affs_do_readpage_ofs is always called with from = 0 ie reading from
  page->index

- File parameter is never used

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:13 -07:00
Fabian Frederick a05e16ada4 fs/proc/vmcore.c: remove NULL assignment to static
Static values are automatically initialized to NULL.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:12 -07:00
Fabian Frederick 17c2b4ee40 fs/proc/task_mmu.c: replace seq_printf by seq_puts
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:12 -07:00
Oleg Nesterov c240837fa7 signals: jffs2: fix the wrong usage of disallow_signal()
jffs2_garbage_collect_thread() does disallow_signal(SIGHUP) around
jffs2_garbage_collect_pass() and the comment says "We don't want SIGHUP
to interrupt us".

But disallow_signal() can't ensure that jffs2_garbage_collect_pass()
won't be interrupted by SIGHUP, the problem is that SIGHUP can be
already pending when disallow_signal() is called, and in this case any
interruptible sleep won't block.

Note: this is in fact because disallow_signal() is buggy and should be
fixed, see the next changes.

But there is another reason why disallow_signal() is wrong: SIG_IGN set
by disallow_signal() silently discards any SIGHUP which can be sent
before the next allow_signal(SIGHUP).

Change this code to use sigprocmask(SIG_UNBLOCK/SIG_BLOCK, SIGHUP).
This even matches the old (and wrong) semantics allow/disallow had when
this logic was written.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:11 -07:00
Manuel Schölling ef19470ef8 fs/fat/inode.c: clean up string initializations (char[] instead of char *)
Initializations like 'char *foo = "bar"' will create two variables: a
static string and a pointer (foo) to that static string.  Instead 'char
foo[] = "bar"' will declare a single variable and will end up in shorter
assembly (according to Jeff Garzik on the KernelJanitor's TODO list).

Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:11 -07:00
Conrad Meyer 190a8843de fs/fat/: add support for DOS 1.x formatted volumes
Add structure for parsed BPB information, struct fat_bios_param_block,
and move all of the deserialization and validation logic from
fat_fill_super() into fat_read_bpb().

Add a 'dos1xfloppy' mount option to infer DOS 2.x BIOS Parameter Block
defaults from block device geometry for ancient floppies and floppy
images, as a fall-back from the default BPB parsing logic.

When fat_read_bpb() finds an invalid FAT filesystem and dos1xfloppy is
set, fall back to fat_read_static_bpb().  fat_read_static_bpb()
validates that the entire BPB is zero, and that the floppy has a
DOS-style 8086 code bootstrapping header.  Then it fills in default BPB
values from media size and a table.[0]

Media size is assumed to be static for archaic FAT volumes.  See also:
[1].

Fixes kernel.org bug #42617.

[0]: https://en.wikipedia.org/wiki/File_Allocation_Table#Exceptions
[1]: http://www.win.tue.nl/~aeb/linux/fs/fat/fat-1.html

[hirofumi@mail.parknet.co.jp: fix missed error code]
Signed-off-by: Conrad Meyer <cse.cem@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick a19189e553 fs/hpfs: increase pr_warn level
This patch applies a suggestion by Mikulas Patocka asking to increase
all pr_warn without commented ones to pr_err

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick 1749a10e02 fs/hpfs: use __func__ for logging
Normalize function display fx() using __func__

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick 14da17f9c4 fs/hpfs: use pr_fmt for logging
Also remove redundant level names (warning:...)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick b7cb1ce220 fs/hpfs: convert printk to pr_foo()
No level printk in hptfs_error converted to pr_err (others to pr_warn or
pr_info)

This patch also fixes if/then/else checkpatch warnings

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick 45641c82c1 fs/ufs/balloc.c: remove err parameter in ufs_add_fragments
err is used in ufs_new_fragments (ufs_add_fragments only callsite)
not in ufs_add_fragments.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Christian Kujau df3d4e7a24 hfsplus: fix compiler warning on PowerPC
Commit a99b7069aa ("hfsplus: Fix undefined __divdi3 in
hfsplus_init_header_node()") introduced do_div() to xattr.c and the
warning below too.

As Geert remarked: "tmp" is "loff_t" which is "__kernel_loff_t", which
is "long long", i.e.  signed, while include/asm-generic/div64.h compares
its type with "uint64_t".  As inode sizes are positive, it should be
safe to change the type of "tmp" to "u64".

  In file included from
  arch/powerpc/include/asm/div64.h:1:0,
                    from include/linux/kernel.h:124,
                    from include/asm-generic/bug.h:13,
                    from arch/powerpc/include/asm/bug.h:127,
                    from include/linux/bug.h:4,
                    from include/linux/thread_info.h:11,
                    from include/asm-generic/preempt.h:4,
                    from arch/powerpc/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:18,
                    from include/linux/spinlock.h:50,
                    from include/linux/wait.h:8,
                    from include/linux/fs.h:6,
                    from fs/hfsplus/hfsplus_fs.h:19,
                    from fs/hfsplus/xattr.c:9:
  fs/hfsplus/xattr.c: In function 'hfsplus_init_header_node':
  include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
  fs/hfsplus/xattr.c:86:2: note: in expansion of macro 'do_div'
     do_div(tmp, node_size);
     ^

Signed-off-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Sergei Antonov <saproj@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick b73f3d0e70 fs/hfsplus: fix pr_foo() and hfs_dbg formats
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-By: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Sergei Antonov ffbc067161 hfsplus: coding style fix for declarations in hfsplus_fs.h
Some function declarations in hfsplus_fs.h were with argument names,
some without, and some were mixed.  This patch adds argument names
everywhere, sorts function in order they go in .c files, and moves
hfs_part_find() to a proper section.

Auto-formatting and sorting was done with:
cfunctions *.c | indent -linux | sed "s| \* | \*|"

Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick 297cc27207 fs/hfsplus/wrapper.c: replace shift loop by ilog2
Replace while blocksize;shift by ilog2

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Sergei Antonov 2cd282a1bc hfsplus: fix "unused node is not erased" error
Zero newly allocated extents in the catalog tree if volume attributes
tell us to.  Not doing so we risk getting the "unused node is not
erased" error.  See kHFSUnusedNodeFix flag in Apple's source code for
reference.

There was a previous commit clearing the node when it is freed: commit
899bed05e9 ("hfsplus: fix issue with unzeroed unused b-tree nodes").
But it did not handle newly allocated extents (this patch fixes it).
And it zeroed nodes in all trees unconditionally which is an overkill.

This patch adds a condition and also switches to 'tree->node_size' as a
simpler method of getting the length to zero.

Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Anton Altaparmakov <aia21@cam.ac.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Kyle Laracey <kalaracey@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00
Fabian Frederick 915ab236d3 fs/hfsplus/wrapper.c: replace min/casting by min_t
Also add * before function comments (it was not detected by kernel-doc)

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:10 -07:00