Commit Graph

4663 Commits (24d4e7f642882a8a13da170b4ba86eec8fa91bf2)

Author SHA1 Message Date
Philipp Reisner 2d56a974f3 drbd: reset ap_in_flight counter for new connections
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 10:10:24 -06:00
Wei Yongjun c613c5f686 mg_disk: fix error return code in mg_probe()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-28 09:43:43 -06:00
Vishal Verma f8ebf8409a NVMe: Add definitions for format command
The SCSI emulation has the ability to send format commands, so we need
to add the definition of the command.  Also add a missing error code.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-03-27 07:17:49 -04:00
Vishal Verma 13c3b0fcc8 NVMe: Move structures & definitions to header file
nvme-scsi.c uses several data structures and definitions that were
previously private to nvme-core.c.  Move the definitions to nvme.h,
protected by __KERNEL__.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-03-27 07:05:42 -04:00
Philip J Kelleher 80b00df291 rsxx: remove unused variable
Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-26 14:48:12 -06:00
Philip J Kelleher 4dcaf47258 rsxx: enable error return of rsxx_eeh_save_issued_dmas()
Commit d8d595df introduced a bug where we did not check for a NULL
return from kmalloc(). Make rsxx_eeh_save_issued_dmas() return an
error for that case, and make the callers handle that.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-26 14:48:11 -06:00
Vishal Verma 729dd1bd80 NVMe: Rename nvme.c to nvme-core.c
In preparation for adding nvme-scsi.c
It is preferable to retain the module name 'nvme'

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-03-26 14:24:56 -04:00
Keith Busch 0e5e4f0e56 NVMe: Add discard support for capable devices
This adds discard support to block queues if the nvme device is capable of
deallocating blocks as indicated by the controller's optional command support.
A discard flagged bio request will submit an NVMe deallocate Data Set
Management command for the requested blocks.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-03-26 14:18:58 -04:00
Philip J Kelleher d8d595dfce block: removes dynamic allocation on stack
This patch removes dynamic allocation on the stack error.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 19:22:31 -06:00
Jens Axboe 2124469efa aoe: get rid of cached bv variable in bufinit()
Less error prone if we just kill it, it's only used once
anyway.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 15:27:26 -06:00
Kent Overstreet f1fb3449ef aoe: Fix unitialized var usage
Commit 4f2ac93c17 (block: Remove bi_idx
references) accidently removed the bit that set bv - readd that.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: fengguang.wu@intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-25 13:46:14 -06:00
Kent Overstreet d74c6d514f block: Add bio_for_each_segment_all()
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bv_idx.  Currently, the only usage is to walk all the
bvecs after the bio has been advanced by specifying 0 index.

For immutable bvecs, we need to split these apart;
bio_for_each_segment() is going to have a different implementation.
This will also help document the intent of code that's using it -
bio_for_each_segment_all() is only legal to use for code that owns the
bio.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Neil Brown <neilb@suse.de>
CC: Boaz Harrosh <bharrosh@panasas.com>
2013-03-23 14:26:28 -07:00
Kent Overstreet ff8e0070d1 pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
In the short term this'll help with code auditing, and if this code ever
gets used now it's converted :)

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jiri Kosina <jkosina@suse.cz>
2013-03-23 14:15:37 -07:00
Kent Overstreet ffb25dc60f pktcdvd: use bio_copy_data()
Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
2013-03-23 14:15:37 -07:00
Kent Overstreet 4f2ac93c17 block: Remove bi_idx references
For immutable bvecs, all bi_idx usage needs to be audited - so here
we're removing all the unnecessary uses.

Most of these are places where it was being initialized on a bio that
was just allocated, a few others are conversions to standard macros.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
2013-03-23 14:15:31 -07:00
Kent Overstreet aa8b57aa3d block: Use bio_sectors() more consistently
Bunch of places in the code weren't using it where they could be -
this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: "Ed L. Cashin" <ecashin@coraid.com>
CC: Nick Piggin <npiggin@kernel.dk>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Jim Paris <jim@jtan.com>
CC: Geoff Levand <geoff@infradead.org>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
2013-03-23 14:15:30 -07:00
Kent Overstreet f73a1c7d11 block: Add bio_end_sector()
Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2013-03-23 14:15:29 -07:00
Lars Ellenberg 5bbcf5e6ab drbd: adjust upper limit for activity log extents
Now that the on-disk activity-log ring buffer size is adjustable,
the maximum active set can become larger, and is now limited by
the use of 16bit "labels".

This increases the maximum working set from 6433 to 65534 extents,
each of which covers an area of 4MiB.
Which means that if you use the maximum, you'd have to resync
more than 250 GiB after an unclean Primary shutdown.
With capable backend storage and replication links,
this is entirely feasible.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:18:09 -06:00
Lars Ellenberg 45ad07b3ac drbd: try hard to max out the updates per AL transaction
There may have been more incoming requests while we where preparing
the current transaction. Try to consolidate more updates into this
transaction until we make no more progres.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:18:09 -06:00
Lars Ellenberg 7e8c288f6c drbd: move start io accounting before activity log transaction
The IO accounting of the drbd "queue depth" was misleading.
We only started IO accounting once we already wrote the activity log.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:18:09 -06:00
Lars Ellenberg 08a1ddab6d drbd: consolidate as many updates as possible into one AL transaction
Depending on current IO depth, try to consolidate as many updates
as possible into one activity log transaction.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 22:18:09 -06:00
Lars Ellenberg 779b3fe4c0 drbd: queue writes on submitter thread, unless they pass the activity log fastpath
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:15:17 -06:00
Lars Ellenberg 6c3c4355d6 drbd: split out some helper functions to drbd_al_begin_io
To make the code easier to follow,
use an explicit find_active_resync_extent(),
and add a "nonblock" parameter to _al_get().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:15:17 -06:00
Lars Ellenberg b5bc8e0864 drbd: split drbd_al_begin_io into fastpath, prepare, and commit
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:15:17 -06:00
Lars Ellenberg 113fef9e20 drbd: prepare to queue write requests on a submit worker
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:14:40 -06:00
Lars Ellenberg 6d9febe237 drbd: split __drbd_make_request in before and after drbd_al_begin_io
This is in preparation to be able to defer requests that need to wait
for an activity log transaction to a submitter workqueue.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:14:00 -06:00
Lars Ellenberg ebfd5d8f71 drbd: drbd_al_being_io: short circuit to reduce latency
A request hitting an already "hot" extent should proceed right away,
even if some other requests need to wait for pending transactions.

Without that short-circuit, several simultaneous make_request contexts
race for committing the transaction, possibly penalizing the innocent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:14:00 -06:00
Lars Ellenberg 56392d2f40 drbd: Clarify when activity log I/O is delegated to the worker thread
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:14:00 -06:00
Lars Ellenberg c04ccaa669 drbd: read meta data early, base on-disk offsets on super block
We used to calculate all on-disk meta data offsets, and then compare
the stored offsets, basically treating them as magic numbers.

Now with the activity log striping, the activity log size is no longer
fixed.  We need to first read the super block, then base the activity
log and bitmap offsets on the stored offsets/al stripe settings.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Lars Ellenberg cccac9857d drbd: mechanically rename la_size to la_size_sect
Make it obvious that this value is in units of 512 Byte sectors.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Lars Ellenberg 68e41a43f1 drbd: use the cached meta_dev_idx
Now we have the cached meta_dev_idx member,
we can get rid of a few rcu_read_lock() sections and rcu_dereference().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Lars Ellenberg 3a4d4eb3cb drbd: prepare for new striped layout of activity log
Introduce two new on-disk meta data fields: al_stripes and al_stripe_size_4k
The intended use case is activity log on RAID 0 or similar.
Logically consecutive transactions will advance their on-disk position
by al_stripe_size_4k 4kB (transaction sized) blocks.

Right now, these are still asserted to be the backward compatible
values al_stripes = 1, al_stripe_size_4k = 8 (which amounts to 32kB).

Also introduce a caching member for meta_dev_idx in the in-core
structure: even though it is initially passed in in the rcu-protected
disk_conf structure, it cannot change without a detach/attach cycle.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Lars Ellenberg ae8bf312e9 drbd: cleanup ondisk meta data layout calculations and defines
Add a comment about our meta data layout variants,
and rename a few defines (e.g. MD_RESERVED_SECT -> MD_128MB_SECT)
to make it clear that they are short hand for fixed constants,
and not arbitrarily to be redefined as one may see fit.

Properly pad struct meta_data_on_disk to 4kB,
and initialize to zero not only the first 512 Byte,
but all of it in drbd_md_sync().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Lars Ellenberg 9114d79569 drbd: cleanup bogus assert message
This fixes ASSERT( mdev->state.disk == D_FAILED ) in drivers/block/drbd/drbd_main.c

When we detach from local disk, we let the local refcount hit zero twice.

First, we transition to D_FAILED, so we won't give out new references
to incoming requests; we still may give out *internal* references, though.
Once the refcount hits zero [1] while in D_FAILED, we queue a transition
to D_DISKLESS to our worker.  We need to queue it, because we may be in
atomic context when putting the reference.
Once the transition to D_DISKLESS actually happened [2] from worker context,
we don't give out new internal references either.

Between hitting zero the first time [1] and actually transition to
D_DISKLESS [2], there may be a few very short lived internal get/put,
so we may hit zero more than once while being in D_FAILED, or even see a
race where a an internal get_ldev() happened while D_FAILED, but the
corresponding put_ldev() happens just after the transition to D_DISKLESS.

That's why we have the additional test_and_set_bit(GO_DISKLESS,);
and that's why the assert was placed wrong.
Since there was exactly one code path left to drbd_go_diskless(),
and that checks already for D_FAILED, drop that assert,
and fold in the drbd_queue_work().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 18:13:59 -06:00
Linus Torvalds 5da273fe3f Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver update from Matthew Wilcox:
 "These patches have mostly been baking for a few months; sorry I didn't
  get them in during the merge window.  They're all bug fixes, except
  for the addition of the SMART log and the addition to MAINTAINERS."

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Add namespaces with no LBA range feature
  MAINTAINERS: Add entry for the NVMe driver
  NVMe: Initialize iod nents to 0
  NVMe: Define SMART log
  NVMe: Add result to nvme_get_features
  NVMe: Set result from user admin command
  NVMe: End queued bio requests when freeing queue
  NVMe: Free cmdid on nvme_submit_bio error
2013-03-22 16:43:53 -07:00
Keith Busch 122090366d NVMe: Add namespaces with no LBA range feature
The LBA Range Type feature is optional in the NVMe specification,
so we should continue with adding namespaces for controllers that do
not implement this feature.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2013-03-22 14:50:23 -04:00
Wei Yongjun d2b805d895 cciss: fix invalid use of sizeof in cciss_find_cfgtables()
sizeof() when applied to a pointer typed expression gives the
size of the pointer, not that of the pointed data.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: scameron@beardog.cce.hp.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 12:22:49 -06:00
Phillip Susi 8761a3dc1f loop: cleanup partitions when detaching loop device
Any partitions added by user space to the loop device were being
left in place after detaching the loop device.  This was because
the detach path issued a BLKRRPART to clean up partitions if
LO_FLAGS_PARTSCAN was set, meaning that the partitions were auto
scanned on attach.  Replace this BLKRRPART with code that
unconditionally cleans up partitions on detach instead.

Signed-off-by: Phillip Susi <psusi@ubuntu.com>

Modified by Jens to export delete_partition().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 12:21:53 -06:00
Wei Yongjun 183cfb5720 loop: fix error return code in loop_add()
Fix to return a negative error code from the error handling
case, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 08:59:19 -06:00
Wei Yongjun d137c8306c mtip32xx: fix error return code in mtip_pci_probe()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-22 08:58:23 -06:00
Jens Axboe 7fbaee72ff Merge branch 'stable/for-jens-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus
Konrad writes:

[the branch] has a bunch of fixes. They vary from being able to deal
with unknown requests, overflow in statistics, compile warnings, bug in
the error path, removal of unnecessary logic. There is also one
performance fix - which is to allocate pages for requests when the
driver loads - instead of doing it per request
2013-03-22 08:56:32 -06:00
Rusty Russell 0a11cc36f7 virtio_blk: remove nents member.
It's simply a flag as to whether we have data now, so make it an
explicit function parameter rather than a member of struct
virtblk_req.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Asias He <asias@redhat.com>
2013-03-20 15:44:58 +10:30
Paolo Bonzini 20af3cfd20 virtio-blk: use virtqueue_add_sgs on req path
(This is a respin of Paolo Bonzini's patch, but it calls
virtqueue_add_sgs() instead of his multi-part API).

This is similar to the previous patch, but a bit more radical
because the bio and req paths now share the buffer construction
code.  Because the req path doesn't use vbr->sg, however, we
need to add a couple of arguments to __virtblk_add_req.

We also need to teach __virtblk_add_req how to build SCSI command
requests.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Asias He <asias@redhat.com>
2013-03-20 15:44:55 +10:30
Paolo Bonzini 8f39db9d37 virtio-blk: use virtqueue_add_sgs on bio path
(This is a respin of Paolo Bonzini's patch, but it calls
virtqueue_add_sgs() instead of his multi-part API).

Move the creation of the request header and response footer to
__virtblk_add_req.  vbr->sg only contains the data scatterlist,
the header/footer are added separately using virtqueue_add_sgs().

With this change, virtio-blk (with use_bio) is not relying anymore on
the virtio functions ignoring the end markers in a scatterlist.
The next patch will do the same for the other path.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Asias He <asias@redhat.com>
2013-03-20 15:44:54 +10:30
Paolo Bonzini 5ee21a52c0 virtio-blk: reorganize virtblk_add_req
Right now, both virtblk_add_req and virtblk_add_req_wait call
virtqueue_add_buf.  To prepare for the next patches, abstract the call
to virtqueue_add_buf into a new function __virtblk_add_req, and include
the waiting logic directly in virtblk_add_req.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-03-20 15:44:54 +10:30
Roger Pau Monne b1173e316b xen-blkfront: remove frame list from blk_shadow
We already have the frame (pfn of the grant page) stored inside struct
grant, so there's no need to keep an aditional list of mapped frames
for a specific request. This reduces memory usage in blkfront.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 12:50:07 -04:00
Roger Pau Monne 9c1e050cae xen-blkfront: pre-allocate pages for requests
This prevents us from having to call alloc_page while we are preparing
the request. Since blkfront was calling alloc_page with a spinlock
held we used GFP_ATOMIC, which can fail if we are requesting a lot of
pages since it is using the emergency memory pools.

Allocating all the pages at init prevents us from having to call
alloc_page, thus preventing possible failures.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 12:50:06 -04:00
Roger Pau Monne ffb1dabd1e xen-blkback: don't store dev_bus_addr
dev_bus_addr returned in the grant ref map operation is the mfn of the
passed page, there's no need to store it in the persistent grant
entry, since we can always get it provided that we have the page.

This reduces the memory overhead of persistent grants in blkback.

While at it, rename the 'seg[i].buf' to be 'seg[i].offset' as
it makes much more sense - as we use that value in bio_add_page
which as the fourth argument expects the offset.

We hadn't used the physical address as part of this at all.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xen.org
[v1: s/buf/offset/]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 12:50:00 -04:00
Roger Pau Monne 155b7edb51 xen-blkfront: switch from llist to list
The git commit f84adf4921
(xen-blkfront: drop the use of llist_for_each_entry_safe)

was a stop-gate to fix a GCC4.1 bug. The appropiate way
is to actually use an list instead of using an llist.

As such this patch replaces the usage of llist with an
list.

Since we always manipulate the list while holding the io_lock, there's
no need for additional locking (llist used previously is safe to use
concurrently without additional locking).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
CC: stable@vger.kernel.org
[v1: Redid the git commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 10:27:56 -04:00
Roger Pau Monne 217fd5e709 xen-blkback: fix foreach_grant_safe to handle empty lists
We may use foreach_grant_safe in the future with empty lists, so make
sure we can handle them.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: xen-devel@lists.xen.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-19 08:53:15 -04:00
Mihnea Dobrescu-Balaur 29d0b218c8 xen-blkfront: replace kmalloc and then memcpy with kmemdup
The benefits are:
* code is cleaner
* kmemdup adds additional debugging info useful for tracking the real
place where memory was allocated (CONFIG_DEBUG_SLAB).

Signed-off-by: Mihnea Dobrescu-Balaur <mihneadb@gmail.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-18 16:31:31 -04:00
Jan Beulich 0e5e098ac2 xen-blkback: fix dispatch_rw_block_io() error path
Commit 7708992 ("xen/blkback: Seperate the bio allocation and the bio
submission") consolidated the pendcnt updates to just a single write,
neglecting the fact that the error path relied on it getting set to 1
up front (such that the decrement in __end_block_io_op() would actually
drop the count to zero, triggering the necessary cleanup actions).

Also remove a misleading and a stale (after said commit) comment.

CC: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-18 16:31:22 -04:00
Jens Axboe 351a2c6e7d rsxx: fix missing unlock on error return in rsxx_eeh_remap_dmas()
Spotted by Fenguan Wu's super build robot.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-16 10:10:48 +01:00
Philip J Kelleher c95246c3a2 Adding in EEH support to the IBM FlashSystem 70/80 device driver
Changes in v2 include:
o Fixed spelling of guarantee.
o Fixed potential memory leak if slot reset fails out.
o Changed list_for_each_entry_safe with list_for_each_entry.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-16 08:22:25 +01:00
Milos Vyletel 9d9598b81c virtio-blk: emit udev event when device is resized
When virtio-blk device is resized from host (using block_resize from QEMU) emit
KOBJ_CHANGE uevent to notify guest about such change. This allows user to have
custom udev rules which would take whatever action if such event occurs. As a
proof of concept I've created simple udev rule that automatically resize
filesystem on virtio-blk device.

ACTION=="change", KERNEL=="vd*", \
        ENV{RESIZE}=="1", \
        ENV{ID_FS_TYPE}=="ext[3-4]", \
        RUN+="/sbin/resize2fs /dev/%k"
ACTION=="change", KERNEL=="vd*", \
        ENV{RESIZE}=="1", \
        ENV{ID_FS_TYPE}=="LVM2_member", \
        RUN+="/sbin/pvresize /dev/%k"

Signed-off-by: Milos Vyletel <milos.vyletel@sde.cz>
Tested-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (minor simplification)
2013-03-12 15:44:55 +10:30
Philip J Kelleher 1ebfd10982 block: IBM RamSan 70/80 error message bug fix.
This patch includes a simple change to the rsxx_pci_remove
function that caused error messages because traffic was halted
too early.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-11 19:53:55 +01:00
Philip J Kelleher 9bb3c4469e block: IBM RamSan 70/80 branding changes.
This patch includes changing the hardware branding name from
IBM RamSan to IBM FlashSystem.

v2 Changes include:
o Removing the unnecessary IBM Vendor ID #define

v1 Changes include:
o Changed all references of RamSan to FlashSystem.
o Changed the vendor/device IDs for the product.
o Changed driver version number.
o Updated the MAINTAINERS file.
o Various other little things.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-11 19:53:55 +01:00
Philip J Kelleher 03ac03a897 block: IBM RamSan 70/80 fixes inconsistent locking.
This patch includes changes to the cregs locking scheme. Before,
inconsistant locking would occur because of misuse of spin_lock,
spin_lock_bh, and counter parts.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-11 19:53:55 +01:00
Philip J Kelleher f37912039e block: IBM RamSan 70/80 trivial changes.
This patch includes trivial changes that were recommended by
different members of the Linux Community.

Changes include:
o Removing the redundant wmb().
o Formatting
o Various other little things.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-03-11 19:53:55 +01:00
Zoltan Kiss 986cacbd26 xen/blkback: Change statistics counter types to unsigned
These values shouldn't be negative, but after an overflow their value
can turn into negative, if they are signed. xentop can show bogus
values in this case.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Reported-by: Ichiro Ogino <ichiro.ogino@citrix.co.jp>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11 13:56:54 -04:00
David Vrabel 0e367ae465 xen/blkback: correctly respond to unknown, non-native requests
If the frontend is using a non-native protocol (e.g., a 64-bit
frontend with a 32-bit backend) and it sent an unrecognized request,
the request was not translated and the response would have the
incorrect ID.  This may cause the frontend driver to behave
incorrectly or crash.

Since the ID field in the request is always in the same place,
regardless of the request type we can get the correct ID and make a
valid response (which will report BLKIF_RSP_EOPNOTSUPP).

This bug affected 64-bit SLES 11 guests when using a 32-bit backend.
This guest does a BLKIF_OP_RESERVED_1 (BLKIF_OP_PACKET in the SLES
source) and would crash in blkif_int() as the ID in the response would
be invalid.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-11 13:54:28 -04:00
Chen Gang a72d9002f8 xen/xen-blkback: preq.dev is used without initialized
If call xen_vbd_translate failed, the preq.dev will be not initialized.
Use blkif->vbd.pdevice instead (still better to print relative info).

Note that before commit 01c681d4c7
(xen/blkback: Don't trust the handle from the frontend.)
the value bogus, as it was the guest provided value from req->u.rw.handle
rather than the actual device.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-03-01 08:41:43 -05:00
Linus Torvalds 1cf0209c43 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "A few groups of patches here.  Alex has been hard at work improving
  the RBD code, layout groundwork for understanding the new formats and
  doing layering.  Most of the infrastructure is now in place for the
  final bits that will come with the next window.

  There are a few changes to the data layout.  Jim Schutt's patch fixes
  some non-ideal CRUSH behavior, and a set of patches from me updates
  the client to speak a newer version of the protocol and implement an
  improved hashing strategy across storage nodes (when the server side
  supports it too).

  A pair of patches from Sam Lang fix the atomicity of open+create
  operations.  Several patches from Yan, Zheng fix various mds/client
  issues that turned up during multi-mds torture tests.

  A final set of patches expose file layouts via virtual xattrs, and
  allow the policies to be set on directories via xattrs as well
  (avoiding the awkward ioctl interface and providing a consistent
  interface for both kernel mount and ceph-fuse users)."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (143 commits)
  libceph: add support for HASHPSPOOL pool flag
  libceph: update osd request/reply encoding
  libceph: calculate placement based on the internal data types
  ceph: update support for PGID64, PGPOOL3, OSDENC protocol features
  ceph: update "ceph_features.h"
  libceph: decode into cpu-native ceph_pg type
  libceph: rename ceph_pg -> ceph_pg_v1
  rbd: pass length, not op for osd completions
  rbd: move rbd_osd_trivial_callback()
  libceph: use a do..while loop in con_work()
  libceph: use a flag to indicate a fault has occurred
  libceph: separate non-locked fault handling
  libceph: encapsulate connection backoff
  libceph: eliminate sparse warnings
  ceph: eliminate sparse warnings in fs code
  rbd: eliminate sparse warnings
  libceph: define connection flag helpers
  rbd: normalize dout() calls
  rbd: barriers are hard
  rbd: ignore zero-length requests
  ...
2013-02-28 17:43:09 -08:00
Linus Torvalds f042fea0da Merge branch 'for-3.9/drivers' of git://git.kernel.dk/linux-block
Pull block driver bits from Jens Axboe:
 "After the block IO core bits are in, please grab the driver updates
  from below as well.  It contains:

   - Fix ancient regression in dac960.  Nobody must be using that
     anymore...

   - Some good fixes from Guo Ghao for loop, fixing both potential
     oopses and deadlocks.

   - Improve mtip32xx for NUMA systems, by being a bit more clever in
     distributing work.

   - Add IBM RamSan 70/80 driver.  A second round of fixes for that is
     pending, that will come in through for-linus during the 3.9 cycle
     as per usual.

   - A few xen-blk{back,front} fixes from Konrad and Roger.

   - Other minor fixes and improvements."

* 'for-3.9/drivers' of git://git.kernel.dk/linux-block:
  loopdev: ignore negative offset when calculate loop device size
  loopdev: remove an user triggerable oops
  loopdev: move common code into loop_figure_size()
  loopdev: update block device size in loop_set_status()
  loopdev: fix a deadlock
  xen-blkback: use balloon pages for persistent grants
  xen-blkfront: drop the use of llist_for_each_entry_safe
  xen/blkback: Don't trust the handle from the frontend.
  xen-blkback: do not leak mode property
  block: IBM RamSan 70/80 driver fixes
  rsxx: add slab.h include to dma.c
  drivers/block/mtip32xx: add missing GENERIC_HARDIRQS dependency
  block: remove new __devinit/exit annotations on ramsam driver
  block: IBM RamSan 70/80 device driver
  drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static?
  drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
  dac960: return success instead of -ENOTTY
  mtip32xx: add trim support
  mtip32xx: Add workqueue and NUMA support
  block: delete super ancient PC-XT driver for 1980's hardware
2013-02-28 13:16:07 -08:00
Linus Torvalds ee89f81252 Merge branch 'for-3.9/core' of git://git.kernel.dk/linux-block
Pull block IO core bits from Jens Axboe:
 "Below are the core block IO bits for 3.9.  It was delayed a few days
  since my workstation kept crashing every 2-8h after pulling it into
  current -git, but turns out it is a bug in the new pstate code (divide
  by zero, will report separately).  In any case, it contains:

   - The big cfq/blkcg update from Tejun and and Vivek.

   - Additional block and writeback tracepoints from Tejun.

   - Improvement of the should sort (based on queues) logic in the plug
     flushing.

   - _io() variants of the wait_for_completion() interface, using
     io_schedule() instead of schedule() to contribute to io wait
     properly.

   - Various little fixes.

  You'll get two trivial merge conflicts, which should be easy enough to
  fix up"

Fix up the trivial conflicts due to hlist traversal cleanups (commit
b67bfe0d42ca: "hlist: drop the node parameter from iterators").

* 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
  block: remove redundant check to bd_openers()
  block: use i_size_write() in bd_set_size()
  cfq: fix lock imbalance with failed allocations
  drivers/block/swim3.c: fix null pointer dereference
  block: don't select PERCPU_RWSEM
  block: account iowait time when waiting for completion of IO request
  sched: add wait_for_completion_io[_timeout]
  writeback: add more tracepoints
  block: add block_{touch|dirty}_buffer tracepoint
  buffer: make touch_buffer() an exported function
  block: add @req to bio_{front|back}_merge tracepoints
  block: add missing block_bio_complete() tracepoint
  block: Remove should_sort judgement when flush blk_plug
  block,elevator: use new hashtable implementation
  cfq-iosched: add hierarchical cfq_group statistics
  cfq-iosched: collect stats from dead cfqgs
  cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
  blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
  block: RCU free request_queue
  blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
  ...
2013-02-28 12:52:24 -08:00
Alex Elder 398eb08555 nbd: fix sparse warning
I just fixed this in "drivers/block/rbd.c" and I noticed that
"drivers/block/nbd.c" has the same problem.  Fix a warning issued by
sparse by adding some lockdep annotations to indicate the queue lock gets
dropped (because it's held when do_nbd_request() is called) and
re-acquired within the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Paul Clements <paul.clements@us.sios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:22 -08:00
Paolo Bonzini a83e814b5b nbd: show read-only state in sysfs
Pass the read-only flag to set_device_ro, so that it will be visible to
the block layer and in sysfs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: Alex Bligh <alex@alex.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:22 -08:00
Paolo Bonzini 3a2d63f879 nbd: fsync and kill block device on shutdown
There are two problems with shutdown in the NBD driver.

1: Receiving the NBD_DISCONNECT ioctl does not sync the filesystem.

   This patch adds the sync operation into __nbd_ioctl()'s
   NBD_DISCONNECT handler.  This is useful because BLKFLSBUF is restricted
   to processes that have CAP_SYS_ADMIN, and the NBD client may not
   possess it (fsync of the block device does not sync the filesystem,
   either).

2: Once we clear the socket we have no guarantee that later reads will
   come from the same backing storage.

   The patch adds calls to kill_bdev() in __nbd_ioctl()'s socket
   clearing code so the page cache is cleaned, lest reads that hit on the
   page cache will return stale data from the previously-accessible disk.

Example:

    # qemu-nbd -r -c/dev/nbd0 /dev/sr0
    # file -s /dev/nbd0
    /dev/stdin: # UDF filesystem data (version 1.5) etc.
    # qemu-nbd -d /dev/nbd0
    # qemu-nbd -r -c/dev/nbd0 /dev/sda
    # file -s /dev/nbd0
    /dev/stdin: # UDF filesystem data (version 1.5) etc.

While /dev/sda has:

    # file -s /dev/sda
    /dev/sda: x86 boot sector; etc.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Paul Clements <Paul.Clements@steeleye.com>
Cc: Alex Bligh <alex@alex.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:22 -08:00
Alex Bligh 75f187aba5 nbd: support FLUSH requests
Currently, the NBD device does not accept flush requests from the Linux
block layer.  If the NBD server opened the target with neither O_SYNC nor
O_DSYNC, however, the device will be effectively backed by a writeback
cache.  Without issuing flushes properly, operation of the NBD device will
not be safe against power losses.

The NBD protocol has support for both a cache flush command and a FUA
command flag; the server will also pass a flag to note its support for
these features.  This patch adds support for the cache flush command and
flag.  In the kernel, we receive the flags via the NBD_SET_FLAGS ioctl,
and map NBD_FLAG_SEND_FLUSH to the argument of blk_queue_flush.  When the
flag is active the block layer will send REQ_FLUSH requests, which we
translate to NBD_CMD_FLUSH commands.

FUA support is not included in this patch because all free software
servers implement it with a full fdatasync; thus it has no advantage over
supporting flush only.  Because I [Paolo] cannot really benchmark it in a
realistic scenario, I cannot tell if it is a good idea or not.  It is also
not clear if it is valid for an NBD server to support FUA but not flush.
The Linux block layer gives a warning for this combination, the NBD
protocol documentation says nothing about it.

The patch also fixes a small problem in the handling of flags: nbd->flags
must be cleared at the end of NBD_DO_IT, but the driver was not doing
that.  The bug manifests itself as follows.  Suppose you two different
client/server pairs to start the NBD device.  Suppose also that the first
client supports NBD_SET_FLAGS, and the first server sends
NBD_FLAG_SEND_FLUSH; the second pair instead does neither of these two
things.  Before this patch, the second invocation of NBD_DO_IT will use a
stale value of nbd->flags, and the second server will issue an error every
time it receives an NBD_CMD_FLUSH command.

This bug is pre-existing, but it becomes much more important after this
patch; flush failures make the device pretty much unusable, unlike

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bligh <alex@alex.org.uk>
Acked-by: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:22 -08:00
Tejun Heo 56de210245 drbd: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:15 -08:00
Tejun Heo c718aa652d block/loop: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:15 -08:00
Tejun Heo 9d60916677 block/loop: don't use idr_remove_all()
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop its usage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:13 -08:00
Linus Torvalds d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Sage Weil 1b83bef24c libceph: update osd request/reply encoding
Use the new version of the encoding for osd requests and replies.  In the
process, update the way we are tracking request ops and reply lengths and
results in the struct ceph_osd_request.  Update the rbd and fs/ceph users
appropriately.

The main changes are:
 - we keep pointers into the request memory for fields we need to update
   each time the request is sent out over the wire
 - we keep information about the result in an array in the request struct
   where the users can easily get at it.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2013-02-26 15:02:50 -08:00
Alex Elder c47f937154 rbd: pass length, not op for osd completions
The only thing type-specific osd completion functions do with their
osd op parameter is (in some cases) extract the number of bytes
transferred from it.  In the other cases, the xferred bytes field
is not used, and total message data transfer byte count (which may
well be zero) is used.

Just set the object request transfer count in the main osd request
callback function and provide that to the other routines.  There is
then no longer any need to pass the op pointer to the type-specific
completion routines, so drop those parameters.

Stop doing anything with the total message data length.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-02-26 15:00:06 -08:00
Alex Elder 39bf2c5d09 rbd: move rbd_osd_trivial_callback()
This function is slightly out of place, probably the result
of an errant automatic merge or something.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-02-26 14:59:49 -08:00
Al Viro 3dadecce20 switch vfs_getattr() to struct path
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:08 -05:00
Alex Elder cc344fa1b5 rbd: eliminate sparse warnings
Fengguang Wu reminded me that there were outstanding sparse reports
in the ceph and rbd code.  This patch fixes these problems in rbd
that lead to those reports:
    - Convert functions that are never referenced externally to have
      static scope.
    - Add a lockdep annotation to rbd_request_fn(), because it
      releases a lock before acquiring it again.

This partially resolves:
    http://tracker.ceph.com/issues/4184

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:37:08 -06:00
Alex Elder 37206ee5be rbd: normalize dout() calls
Add dout() calls to facilitate tracing of image and object requests.
Change a few existing calls so they use __func__ rather than the
hard-coded function name.  Have calls always add ":" after the name
of the function, and prefix pointer values with a consistent tag
indicating what it represents.  (Note that there remain some older
dout() calls that are left untouched by this patch.)

Issue a warning if rbd_osd_write_callback() ever gets a short write.

This resolves:
    http://tracker.ceph.com/issues/4235

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:36:56 -06:00
Alex Elder 632b88cade rbd: barriers are hard
Let's go shopping!

I'm afraid this may not have gotten it right:
    07741308  rbd: add barriers near done flag operations

The smp_wmb() should have been done *before* setting the done flag,
to ensure all other data was valid before marking the object request
done.

Switch to use atomic_inc_return() here to set the done flag, which
allows us to verify we don't mark something done more than once.
Doing this also implies general barriers before and after the call.

And although a read memory barrier might have been sufficient before
reading the done flag, convert this to a full memory barrier just
to put this issue to bed.

This resolves:
    http://tracker.ceph.com/issues/4238

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:36:50 -06:00
Alex Elder 4dda41d3d7 rbd: ignore zero-length requests
The old request code simply ignored zero-length requests.  We should
still operate that same way to avoid any changes in behavior.  We
can implement handling for special zero-length requests separately
(see http://tracker.ceph.com/issues/4236).

Add some assertions based on this new constraint.

This resolves:
    http://tracker.ceph.com/issues/4237

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-25 15:36:36 -06:00
Al Viro 496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Guo Chao b7a1da695f loopdev: ignore negative offset when calculate loop device size
Negative offset may cause loop device size larger than backing file
size.

 $ fallocate -l 1M a
 $ losetup --offset 0xffffffffffff0000 /dev/loop0 a
 $ blockdev --getsize64 /dev/loop0
 1114112
 $ ls -l a
 -rw-r--r-- 1 root root 1048576 Jan 23 12:46 a
 $ cat /dev/loop0
 cat: /dev/loop0: Input/output error

It makes no sense to do that. Only apply offset when it's positive.

Fix a typo in the comment by the way.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:43:22 +01:00
Guo Chao b1a6650406 loopdev: remove an user triggerable oops
When loopdev is built as module and we pass an invalid parameter,
loop_init() will return directly without deregister misc device, which
will cause an oops when insert loop module next time because we left some
garbage in the misc device list.

Test case:
sudo modprobe loop max_part=1024
(failed due to invalid parameter)
sudo modprobe loop
(oops)

Clean up nicely to avoid such oops.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:43:22 +01:00
Guo Chao 7b0576a3d8 loopdev: move common code into loop_figure_size()
Update block device size in accord with gendisk size and let userspace
know the change in loop_figure_size(). This is a clean up to remove
common code of loop_figure_size()'s two callers.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:43:22 +01:00
Guo Chao 541c742a75 loopdev: update block device size in loop_set_status()
Loop device driver sometimes fails to impose the size limit on the
device. Keep issuing following two commands:

losetup --offset 7517244416 --sizelimit 3224971264 /dev/loop0 backed_file
blockdev --getsize64 /dev/loop0

blockdev reports file size instead of sizelimit several out of 100 times.

The problems are:

	- losetup set up the device in two ioctl:
		  LOOP_SET_FD and LOOP_SET_STATUS64.

	- LOOP_SET_STATUS64 only update size of gendisk.

Block device size will be updated lazily when device comes to use. If udev
rushes in between the two ioctl, it will bring in a block device whose
size is backing file size. If the device is not released after
LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size.

Update block size in LOOP_SET_STATUS64 ioctl.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reported-by: M. Hindess <hindessm@uk.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:43:22 +01:00
Guo Chao 5370019dc2 loopdev: fix a deadlock
bd_mutex and lo_ctl_mutex can be held in different order.

Path #1:

blkdev_open
 blkdev_get
  __blkdev_get (hold bd_mutex)
   lo_open (hold lo_ctl_mutex)

Path #2:

blkdev_ioctl
 lo_ioctl (hold lo_ctl_mutex)
  lo_set_capacity (hold bd_mutex)

Lockdep does not report it, because path #2 actually holds a subclass of
lo_ctl_mutex.  This subclass seems creep into the code by mistake.  The
patch author actually just mentioned it in the changelog, see commit
f028f3b2 ("loop: fix circular locking in loop_clr_fd()"), also see:

	http://marc.info/?l=linux-kernel&m=123806169129727&w=2

Path #2 hold bd_mutex to call bd_set_size(), I've protected it
with i_mutex in a previous patch, so drop bd_mutex at this site.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:43:21 +01:00
Cong Ding 7414d4f64b drivers/block/swim3.c: fix null pointer dereference
The use of pointer fs should be after the null check.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-22 10:42:46 +01:00
Linus Torvalds 06991c28f3 Driver core patches for 3.9-rc1
Here is the big driver core merge for 3.9-rc1
 
 There are two major series here, both of which touch lots of drivers all
 over the kernel, and will cause you some merge conflicts:
   - add a new function called devm_ioremap_resource() to properly be
     able to check return values.
   - remove CONFIG_EXPERIMENTAL
 
 If you need me to provide a merged tree to handle these resolutions,
 please let me know.
 
 Other than those patches, there's not much here, some minor fixes and
 updates.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9
 weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW
 =yWAQ
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core patches from Greg Kroah-Hartman:
 "Here is the big driver core merge for 3.9-rc1

  There are two major series here, both of which touch lots of drivers
  all over the kernel, and will cause you some merge conflicts:

   - add a new function called devm_ioremap_resource() to properly be
     able to check return values.

   - remove CONFIG_EXPERIMENTAL

  Other than those patches, there's not much here, some minor fixes and
  updates"

Fix up trivial conflicts

* tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits)
  base: memory: fix soft/hard_offline_page permissions
  drivercore: Fix ordering between deferred_probe and exiting initcalls
  backlight: fix class_find_device() arguments
  TTY: mark tty_get_device call with the proper const values
  driver-core: constify data for class_find_device()
  firmware: Ignore abort check when no user-helper is used
  firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER
  firmware: Make user-mode helper optional
  firmware: Refactoring for splitting user-mode helper code
  Driver core: treat unregistered bus_types as having no devices
  watchdog: Convert to devm_ioremap_resource()
  thermal: Convert to devm_ioremap_resource()
  spi: Convert to devm_ioremap_resource()
  power: Convert to devm_ioremap_resource()
  mtd: Convert to devm_ioremap_resource()
  mmc: Convert to devm_ioremap_resource()
  mfd: Convert to devm_ioremap_resource()
  media: Convert to devm_ioremap_resource()
  iommu: Convert to devm_ioremap_resource()
  drm: Convert to devm_ioremap_resource()
  ...
2013-02-21 12:05:51 -08:00
Linus Torvalds e177bb587e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k update from Geert Uytterhoeven.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
  m68k: Sort out !CONFIG_MMU_SUN3 vs. CONFIG_HAS_DMA
  swim: Add missing spinlock init
2013-02-20 14:27:00 -08:00
Jens Axboe d4308febf3 Merge branch 'stable/for-jens-3.9' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.9/drivers
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.9

which has bug-fixes that did not make it in v3.8. They all are marked as
material for the stable tree as well. There are two bug-fixes for
the code that has been in there for some time (that is the Jan's fix
and one of mine). And there are two bug-fixes for the persistent grant
feature that debuted in v3.8 for xen blk[back|front]end.
2013-02-20 08:26:06 +01:00
Alex Elder 4c7a08c83a Merge branch 'testing' of github.com:ceph/ceph-client into into linux-3.8-ceph 2013-02-19 19:21:08 -06:00
Alex Elder 903bb32e89 libceph: drop return value from page vector copy routines
The return values provided for ceph_copy_to_page_vector() and
ceph_copy_from_page_vector() serve no purpose, so get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:05 -06:00
Alex Elder 23ed6e13b3 rbd: ignore result of ceph_copy_from_page_vector()
The result of ceph_copy_from_page_vector() is simply the length
argument it is provided.

This is called by rbd_obj_method_sync(), which returns the result if
it's non-negative.  But we always either ignore or overwrite that
return value.  So explicitly ignore what's returned by the copy
function, and have rbd_obj_method_sync() always return either a
negative errno or 0.

We also return the result of ceph_copy_from_page_vector() in
rbd_obj_read_sync().  There we still want to return the number of
bytes transferred, but we can use the value we already have in hand
rather than what ceph_copy_from_page_vector() provides.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:05 -06:00
Alex Elder 1ceae7ef0f rbd: prevent bytes transferred overflow
In rbd_obj_read_sync(), verify the number of bytes transferred won't
exceed what can be represented by a size_t before using it to
indicate the number of bytes to copy to the result buffer.

(The real motivation for this is to prepare for the next patch.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:04 -06:00
Alex Elder fbfab53966 libceph: allow STAT osd operations
Add support for CEPH_OSD_OP_STAT operations in the osd client
and in rbd.

This operation sends no data to the osd; everything required is
encoded in identity of the target object.

The result will be ENOENT if the object doesn't exist.  If it does
exist and no other error occurs the server returns the size and last
modification time of the target object as output data (in little
endian format).  The size is a 64 bit unsigned and the time is
ceph_timespec structure (two unsigned 32-bit integers, representing
a seconds and nanoseconds value).

This resolves:
    http://tracker.ceph.com/issues/4007

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:03 -06:00
Alex Elder ef06f4d32a rbd: add parentheses to object request iterator macros
The for_each_obj_request*() macros should parenthesize their uses of
the ireq parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-19 19:14:02 -06:00
Roger Pau Monne 087ffecdaa xen-blkback: use balloon pages for persistent grants
With current persistent grants implementation we are not freeing the
persistent grants after we disconnect the device. Since grant map
operations change the mfn of the allocated page, and we can no longer
pass it to __free_page without setting the mfn to a sane value, use
balloon grant pages instead, as the gntdev device does.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: stable@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:17:21 -05:00
Konrad Rzeszutek Wilk f84adf4921 xen-blkfront: drop the use of llist_for_each_entry_safe
Replace llist_for_each_entry_safe with a while loop.

llist_for_each_entry_safe can trigger a bug in GCC 4.1, so it's best
to remove it and use a while loop and do the deletion manually.

Specifically this bug can be triggered by hot-unplugging a disk, either
by doing xm block-detach or by save/restore cycle.

BUG: unable to handle kernel paging request at fffffffffffffff0
IP: [<ffffffffa0047223>] blkif_free+0x63/0x130 [xen_blkfront]
The crash call trace is:
	...
bad_area_nosemaphore+0x13/0x20
do_page_fault+0x25e/0x4b0
page_fault+0x25/0x30
? blkif_free+0x63/0x130 [xen_blkfront]
blkfront_resume+0x46/0xa0 [xen_blkfront]
xenbus_dev_resume+0x6c/0x140
pm_op+0x192/0x1b0
device_resume+0x82/0x1e0
dpm_resume+0xc9/0x1a0
dpm_resume_end+0x15/0x30
do_suspend+0x117/0x1e0

When drilling down to the assembler code, on newer GCC it does
.L29:
        cmpq    $-16, %r12      #, persistent_gnt check
        je      .L30    	#, out of the loop
.L25:
	... code in the loop
        testq   %r13, %r13      # n
        je      .L29    	#, back to the top of the loop
        cmpq    $-16, %r12      #, persistent_gnt check
        movq    16(%r12), %r13  # <variable>.node.next, n
        jne     .L25    	#,	back to the top of the loop
.L30:

While on GCC 4.1, it is:
L78:
	... code in the loop
	testq   %r13, %r13      # n
        je      .L78    #,	back to the top of the loop
        movq    16(%rbx), %r13  # <variable>.node.next, n
        jmp     .L78    #,	back to the top of the loop

Which basically means that the exit loop condition instead of
being:

	&(pos)->member != NULL;

is:
	;

which makes the loop unbound.

Since xen-blkfront is the only user of the llist_for_each_entry_safe
macro remove it from llist.h.

Orabug: 16263164
CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:17:08 -05:00
Konrad Rzeszutek Wilk 01c681d4c7 xen/blkback: Don't trust the handle from the frontend.
The 'handle' is the device that the request is from. For the life-time
of the ring we copy it from a request to a response so that the frontend
is not surprised by it. But we do not need it - when we start processing
I/Os we have our own 'struct phys_req' which has only most essential
information about the request. In fact the 'vbd_translate' ends up
over-writing the preq.dev with a value from the backend.

This assignment of preq.dev with the 'handle' value is superfluous
so lets not do it.

Cc: stable@vger.kernel.org
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:17:03 -05:00
Jan Beulich 9d092603cc xen-blkback: do not leak mode property
"be->mode" is obtained from xenbus_read(), which does a kmalloc() for
the message body. The short string is never released, so do it along
with freeing "be" itself, and make sure the string isn't kept when
backend_changed() doesn't complete successfully (which made it
desirable to slightly re-structure that function, so that the error
cleanup can be done in one place).

Reported-by: Olaf Hering <olaf@aepfle.de>
CC: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2013-02-19 15:16:52 -05:00
Philip J Kelleher c206c70924 block: IBM RamSan 70/80 driver fixes
This patch includes the following driver fixes for the
IBM RamSan 70/80 driver:

o Changed the creg_ctrl lock from a mutex to a spinlock.
o Added a count check for ioctl calls.
o Removed unnecessary casting of void pointers.
o Made every function static that needed to be.
o Added comments to explain things more thoroughly.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-18 21:35:59 +01:00
Alex Elder 3c663bbdcd libceph: kill ceph_osdc_create_event() "one_shot" parameter
There is only one caller of ceph_osdc_create_event(), and it
provides 0 as its "one_shot" argument.  Get rid of that argument and
just use 0 in its place.

Replace the code in handle_watch_notify() that executes if one_shot
is nonzero in the event with a BUG_ON() call.

While modifying "osd_client.c", give handle_watch_notify() static
scope.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-18 12:20:00 -06:00
Linus Torvalds 11e7651432 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Pull sparc fixes from David Miller:
 "A couple small fixes for sparc including some THP brown-paper-bag
  material:

   1) During the merging of all the THP support for various
      architectures, sparc missed adding a
      HAVE_ARCH_TRANSPARENT_HUGEPAGE to it's Kconfig, oops.

   2) Sparc needs to be mindful of hugepages in get_user_pages_fast().

   3) Fix memory leak in SBUS probe, from Cong Ding.

   4) The sunvdc virtual disk client driver has a test of the bitmask of
      vdisk server supported operations which was off by one bit"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sunvdc: Fix off-by-one in generic_request().
  sparc64: Fix get_user_pages_fast() wrt. THP.
  sparc64: Add missing HAVE_ARCH_TRANSPARENT_HUGEPAGE.
  sparc: kernel/sbus.c: fix memory leakage
2013-02-15 12:05:57 -08:00
David S. Miller f4d9605434 sunvdc: Fix off-by-one in generic_request().
The 'operations' bitmap corresponds one-for-one with the operation
codes, no adjustment is necessary.

Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-14 11:49:01 -08:00
Jens Axboe ec8edc764e Merge branch 'delete-xt-disk' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux into for-3.9/drivers
Paul writes:

Please pull the following to get the removal of the original IBM PC-XT
hard disk driver from the block layer (drivers/block/xd.c).

As near as I can tell, it hasn't seen a run time fix in over a dozen
years, and with drive sizes of 10-20MB, and performance of about 128kB/s
maximum, it is no surprise that it has been completely unused for well
over a decade.

The removal was originally posted[1] well over a month ago, and since
then, there has been nobody objecting to the removal, aside from someone
who had mistakenly confused it with a completely different driver (hd.c)
2013-02-14 16:29:34 +01:00
Alex Elder 077413082f rbd: add barriers near done flag operations
Somehow, I missed this little item in Documentation/atomic_ops.txt:
    *** WARNING: atomic_read() and atomic_set() DO NOT IMPLY BARRIERS! ***

Create and use some helper functions that include the proper memory
barriers for manipulating the done field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder a14ea269dd rbd: turn off interrupts for open/remove locking
This commit:
    bc7a62ee5 rbd: prevent open for image being removed
added checking for removing rbd before allowing an open, and used
the same request spinlock for protecting that and updating the open
count as is used for the request queue.

However it used the non-irq protected version of the spinlocks.
Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder 9cbb1d7268 libceph: don't require r_num_pages for bio requests
There is a check in the completion path for osd requests that
ensures the number of pages allocated is enough to hold the amount
of incoming data expected.

For bio requests coming from rbd the "number of pages" is not really
meaningful (although total length would be).  So stop requiring that
nr_pages be supplied for bio requests.  This is done by checking
whether the pages pointer is null before checking the value of
nr_pages.

Note that this value is passed on to the messenger, but there it's
only used for debugging--it's never used for validation.

While here, change another spot that used r_pages in a debug message
inappropriately, and also invalidate the r_con_filling_msg pointer
after dropping a reference to it.

This resolves:
    http://tracker.ceph.com/issues/3875

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder 1e32d34cfa rbd: don't take extra bio reference for osd client
Currently, if the OSD client finds an osd request has had a bio list
attached to it, it drops a reference to it (or rather, to the first
entry on that list) when the request is released.

The code that added that reference (i.e., the rbd client) is
therefore required to take an extra reference to that first bio
structure.

The osd client doesn't really do anything with the bio pointer other
than transfer it from the osd request structure to outgoing (for
writes) and ingoing (for reads) messages.  So it really isn't the
right place to be taking or dropping references.

Furthermore, the rbd client already holds references to all bio
structures it passes to the osd client, and holds them until the
request is completed.  So there's no need for this extra reference
whatsoever.

So remove the bio_put() call in ceph_osdc_release_request(), as
well as its matching bio_get() call in rbd_osd_req_create().

This change could lead to a crash if old libceph.ko was used with
new rbd.ko.  Add a compatibility check at rbd initialization time to
avoid this possibilty.

This resolves:
    http://tracker.ceph.com/issues/3798    and
    http://tracker.ceph.com/issues/3799

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:11 -08:00
Alex Elder b82d167be6 rbd: prevent open for image being removed
An open request for a mapped rbd image can arrive while removal of
that mapping is underway.  We need to prevent such an open request
from succeeding.  (It appears that Maciej Galkiewicz ran into this
problem.)

Define and use a "removing" flag to indicate a mapping is getting
removed.  Set it in the remove path after verifying nothing holds
the device open.  And check it in the open path before allowing the
open to proceed.  Acquire the rbd device's lock around each of these
spots to avoid any races accessing the flags and open_count fields.

This addresses:
    http://tracker.newdream.net/issues/3427

Reported-by: Maciej Galkiewicz <maciejgalkiewicz@ragnarson.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:10 -08:00
Alex Elder 6d292906f8 rbd: define flags field, use it for exists flag
Define a new rbd device flags field, manipulated using bit
operations.  Replace the use of the current "exists" flag with a bit
in this new "flags" field.  Add a little commentary about the
"exists" flag, which does not need to be manipulated atomically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:10 -08:00
Alex Elder 8eb8756530 rbd: don't drop watch requests on completion
When we register an osd request to linger, it means that request
will stay around (under control of the osd client) until we've
unregistered it.  We do that for an rbd image's header object, and
we keep a pointer to the object request associated with it.

Keep a reference to the watch object request for as long as it is
registered to linger.  Drop it again after we've removed the linger
registration.

This resolves:
    http://tracker.ceph.com/issues/3937

(Note: this originally came about because the osd client was
issuing a callback more than once.  But that behavior will be
changing soon, documented in tracker issue 3967.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:10 -08:00
Alex Elder 25dcf954c3 rbd: decrement obj request count when deleting
Decrement the obj_request_count value when deleting an object
request from its image request's list.  Rearrange a few lines
in the surrounding code.

This resolves:
    http://tracker.ceph.com/issues/3940

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:10 -08:00
Alex Elder 975241afcb rbd: track object rather than osd request for watch
Switch to keeping track of the object request pointer rather than
the osd request used to watch the rbd image header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:10 -08:00
Alex Elder 6977c3f983 rbd: unregister linger in watch sync routine
Move the code that unregisters an rbd device's lingering header
object watch request into rbd_dev_header_watch_sync(), so it
occurs in the same function that originally sets up that request.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:09 -08:00
Alex Elder 9f20e02a53 rbd: get rid of rbd_req_sync_exec()
Get rid rbd_req_sync_exec() because it is no longer used.  That
eliminates the last use of rbd_req_sync_op(), so get rid of that
too.  And finally, that leaves rbd_do_request() unreferenced, so get
rid of that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:09 -08:00
Alex Elder 36be9a7618 rbd: implement sync method with new code
Reimplement synchronous object method calls using the new request
tracking code.  Use the name rbd_obj_method_sync()

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:09 -08:00
Alex Elder cf81b60e4b rbd: send notify ack asynchronously
When we receive notification of a change to an rbd image's header
object we need to refresh our information about the image (its
size and snapshot context).  Once we have refreshed our rbd image
we need to acknowledge the notification.

This acknowledgement was previously done synchronously, but there's
really no need to wait for it to complete.

Change it so the caller doesn't wait for the notify acknowledgement
request to complete.  And change the name to reflect it's no longer
synchronous.

This resolves:
    http://tracker.newdream.net/issues/3877

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:09 -08:00
Alex Elder 5ae9db81b4 rbd: get rid of rbd_req_sync_notify_ack()
Get rid rbd_req_sync_notify_ack() because it is no longer used.
As a result rbd_simple_req_cb() becomes unreferenced, so get rid
of that too.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:09 -08:00
Alex Elder b8d70035b3 rbd: use new code for notify ack
Use the new object request tracking mechanism for handling a
notify_ack request.

Move the callback function below the definition of this so we don't
have to do a pre-declaration.

This resolves:
    http://tracker.newdream.net/issues/3754

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:08 -08:00
Alex Elder ecf7a0318b rbd: get rid of rbd_req_sync_watch()
Get rid of rbd_req_sync_watch(), because it is no longer used.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:08 -08:00
Alex Elder 9969ebc5af rbd: implement watch/unwatch with new code
Implement a new function to set up or tear down a watch event
for an mapped rbd image header using the new request code.

Create a new object request type "nodata" to handle this.  And
define rbd_osd_trivial_callback() which simply marks a request done.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:08 -08:00
Alex Elder 86ea43bfcb rbd: get rid of rbd_req_sync_read()
Delete rbd_req_sync_read() is no longer used, so get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:08 -08:00
Alex Elder 788e2df3b9 rbd: implement sync object read with new code
Reimplement the synchronous read operation used for reading a
version 1 header using the new request tracking code.  Name the
resulting function rbd_obj_read_sync() to better reflect that
it's a full object operation, not an object request.  To do this,
implement a new OBJ_REQUEST_PAGES object request type.

This implements a new mechanism to allow the caller to wait for
completion for an rbd_obj_request by calling rbd_obj_request_wait().

This partially resolves:
    http://tracker.newdream.net/issues/3755

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:08 -08:00
Alex Elder 7d250b949a rbd: kill rbd_req_coll and rbd_request
The two remaining callers of rbd_do_request() always pass a null
collection pointer, so the "coll" and "coll_index" parameters are
not needed.  There is no other use of that data structure, so it
can be eliminated.

Deleting them means there is no need to allocate a rbd_request
structure for the callback function.  And since that's the only use
of *that* structure, it too can be eliminated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:07 -08:00
Alex Elder 2250a71b59 rbd: kill rbd_rq_fn() and all other related code
Now that the request function has been replaced by one using the new
request management data structures the old one can go away.
Deleting it makes rbd_dev_do_request() no longer needed, and
deleting that makes other functions unneeded, and so on.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:07 -08:00
Alex Elder bf0d5f503d rbd: new request tracking code
This patch fully implements the new request tracking code for rbd
I/O requests.

Each I/O request to an rbd image will get an rbd_image_request
structure allocated to track it.  This provides access to all
information about the original request, as well as access to the
set of one or more object requests that are initiated as a result
of the image request.

An rbd_obj_request structure defines a request sent to a single osd
object (possibly) as part of an rbd image request.  An rbd object
request refers to a ceph_osd_request structure built up to represent
the request; for now it will contain a single osd operation.  It
also provides space to hold the result status and the version of the
object when the osd request completes.

An rbd_obj_request structure can also stand on its own.  This will
be used for reading the version 1 header object, for issuing
acknowledgements to event notifications, and for making object
method calls.

All rbd object requests now complete asynchronously with respect
to the osd client--they supply a common callback routine.

This resolves:
    http://tracker.newdream.net/issues/3741

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-02-13 18:29:07 -08:00
Jean Delvare 243eeb7890 swim: Add missing spinlock init
It doesn't seem this spinlock was properly initialized.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Finn Thain <fthain@telegraphics.com.au>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2013-02-09 14:23:33 +01:00
Jens Axboe e5e9fdaad4 rsxx: add slab.h include to dma.c
kbuild test robot says:

tree:   git://git.kernel.dk/linux-block.git for-3.9/drivers
head:   1262e24a59
commit: 8722ff8cdb [6/8] block: IBM RamSan 70/80 device driver
config: make ARCH=alpha allyesconfig

All error/warnings:

   drivers/block/rsxx/dma.c: In function 'rsxx_complete_dma':
>> drivers/block/rsxx/dma.c:251:2: error: implicit declaration of function 'kmem_cache_free' [-Werror=implicit-function-declaration]
   drivers/block/rsxx/dma.c: In function 'rsxx_queue_discard':
>> drivers/block/rsxx/dma.c:567:2: error: implicit declaration of function 'kmem_cache_alloc' [-Werror=implicit-function-declaration]
>> drivers/block/rsxx/dma.c:567:6: warning: assignment makes pointer from integer without a cast [enabled by default]
   drivers/block/rsxx/dma.c: In function 'rsxx_queue_dma':
>> drivers/block/rsxx/dma.c:601:6: warning: assignment makes pointer from integer without a cast [enabled by default]
   drivers/block/rsxx/dma.c: In function 'rsxx_dma_init':
>> drivers/block/rsxx/dma.c:985:2: error: implicit declaration of function 'KMEM_CACHE' [-Werror=implicit-function-declaration]
>> drivers/block/rsxx/dma.c:985:29: error: 'rsxx_dma' undeclared (first use in this function)
   drivers/block/rsxx/dma.c:985:29: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/block/rsxx/dma.c:985:39: error: 'SLAB_HWCACHE_ALIGN' undeclared (first use in this function)
   drivers/block/rsxx/dma.c: In function 'rsxx_dma_cleanup':
>> drivers/block/rsxx/dma.c:995:2: error: implicit declaration of function 'kmem_cache_destroy' [-Werror=implicit-function-declaration]
   cc1: some warnings being treated as errors

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-07 10:04:21 +01:00
Heiko Carstens 1262e24a59 drivers/block/mtip32xx: add missing GENERIC_HARDIRQS dependency
The MTIP32XX driver calls devm_request_irq() and therefore needs a
GENERIC_HARDIRQS dependency to prevent building it on s390.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-07 07:37:29 +01:00
Linus Torvalds 2110cf029a Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "I've got a few bits pending for 3.8 final, that I better get sent out.
  It's all been sitting for a while, I consider it safe.

  It contains:

   - Two bug fixes for mtip32xx, fixing a driver hang and a crash.

   - A few-liner protocol error fix for drbd.

   - A few fixes for the xen block front/back driver, fixing a potential
     data corruption issue.

   - A race fix for disk_clear_events(), causing spurious warnings.  Out
     of the Chrome OS base.

   - A deadlock fix for disk_clear_events(), moving it to the a
     unfreezable workqueue.  Also from the Chrome OS base."

* 'for-linus' of git://git.kernel.dk/linux-block:
  drbd: fix potential protocol error and resulting disconnect/reconnect
  mtip32xx: fix for crash when the device surprise removed during rebuild
  mtip32xx: fix for driver hang after a command timeout
  block: prevent race/cleanup
  block: remove deadlock in disk_clear_events
  xen-blkfront: handle bvecs with partial data
  llist/xen-blkfront: implement safe version of llist_for_each_entry
  xen-blkback: implement safe iterator for the list of persistent grants
2013-02-07 08:38:33 +11:00
Stephen Rothwell 82bed4d5f8 block: remove new __devinit/exit annotations on ramsam driver
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-06 09:34:20 +01:00
josh.h.morris@us.ibm.com 8722ff8cdb block: IBM RamSan 70/80 device driver
This patch includes the device driver for the IBM RamSan
family of PCI SSD flash storage cards. This driver will
include support for the RamSan 70 and 80. The driver
presents a block device for device I/O.

Signed-off-by: Philip J Kelleher <pjk1939@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-02-05 14:16:05 +01:00
Alex Elder 969e5aa3b0 Merge branch 'testing' of github.com:ceph/ceph-client into v3.8-rc5-testing 2013-01-30 07:54:34 -06:00
Greg Kroah-Hartman 422d26b6ec Merge 3.8-rc5 into driver-core-next
This resolves a gpio driver merge issue pointed out in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-25 21:06:30 -08:00
Alex Elder c04306471a rbd: don't retry setting up header watch
When an rbd image is initially mapped a watch event is registered so
we can do something if the header object changes.

The code that does this currently loops if initiating the watch
request results in an ERANGE error.  The osds will never return
ERANGE, so there's no reason to do this loop, so get rid of it.

This resolves:
    http://tracker.newdream.net/issues/3860

Note that the problem this loop was intended to solve is a race
between collecting image header information and setting up the watch
on the header object.  The real fix for that problem is described
here:
    http://tracker.newdream.net/issues/3871

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-25 17:33:37 -06:00
Alex Elder 38901e0f24 rbd: check for overflow in rbd_get_num_segments()
The return type of rbd_get_num_segments() is int, but the values it
operates on are u64.  Although it's not likely, there's no guarantee
the result won't exceed what can be respresented in an int.  The
function is already designed to return -ERANGE on error, so just add
this possible overflow as another reason to return that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-25 17:33:14 -06:00
Alex Elder 98571b5aa7 rbd: small changes
A few very minor changes to the rbd code:
    - RBD_MAX_OPT_LEN is unused, so get rid of it
    - Consolidate rbd options definitions
    - Make rbd_segment_name() return pointer to const char

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-25 17:32:46 -06:00
Jens Axboe 1383923d19 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus 2013-01-22 08:22:11 -07:00
Kees Cook 0cb3d9c6ba drivers/block/paride: remove depends on CONFIG_EXPERIMENTAL
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.

CC: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-21 14:52:42 -08:00
Lars Ellenberg 2681f7f6ce drbd: fix potential protocol error and resulting disconnect/reconnect
When we notice a disk failure on the receiving side,
we stop sending it new incoming writes.

Depending on exact timing of various events, the same transfer log epoch
could end up containing both replicated (before we noticed the failure)
and local-only requests (after we noticed the failure).

The sanity checks in tl_release(), called when receiving a
P_BARRIER_ACK, check that the ack'ed transfer log epoch matches
the expected epoch, and the number of contained writes matches
the number of ack'ed writes.

In this case, they counted both replicated and local-only writes,
but the peer only acknowledges those it has seen.  We get a mismatch,
resulting in a protocol error and disconnect/reconnect cycle.

Messages logged are
  "BAD! BarrierAck #%u received with n_writes=%u, expected n_writes=%u!\n"

A similar issue can also be triggered when starting a resync while
having a healthy replication link, by invalidating one side, forcing a
full sync, or attaching to a diskless node.

Fix this by closing the current epoch if the state changes in a way
that would cause the replication intent of the next write.

Epochs now contain either only non-replicated,
or only replicated writes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2013-01-21 22:58:36 +01:00
Linus Torvalds 226364766f Various minor fixes, but a slightly more complex one to fix the per-cpu overload
problem introduced recently by kvm id changes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJQ/IaJAAoJENkgDmzRrbjxOjAQAIrI9+Jo3Lsxk1v9gXeo9xn2
 ST4LNv7/oW2+3NFBOkKsGVpcXe1JtGySIXyx9k+dELPa5xe4Rs4HE3pHQj/VoEx8
 FKz3oUXSHkuh+paKuFXvZ2u/z0/FI99GmqHPObvGQ4iS3hTXAibzO83yYYPxwApq
 Zq4kof/dAcVVPLm8fGVAMPA2Rbh/WmjDfrIv8gv71QkDjtRLzcr40VIgky5cvu7V
 FWcBl4/DVoKkGnDPsLDhLK9QGqgBGhFIlNIcVX4Jv50DiCibOyzdjeUXYxMftoGr
 Rw56hHwGpPdqbRIjBkR071vIl/mlXTmxIv+d77vZNBin2MIBwAzCQXo8I1/HojCK
 /wKhI+RFj0J5DaDo/BTB80cmI3X2oah5sRUebW6vd9HjunhFFndg4mVeDNPa0E0+
 F72xWlj79BjdIOuD06TLg6Tg2klL49nC8bUc0wrsh6onEjhd9v7Cp/X/rxi5cKYW
 eEv3oLkKwUHoheF9gBlpnT0Yyl/HpFe+nemblzj/ybRKnk4A5vtJqV9eZnqoOS16
 lgIkKOpgXT9dzSom2EL/f4sMCeLLYC44DQwOvxNKt/BdMY0r5y8OLaJORXQGfEDF
 Ztvu2G8PmELxV0B3JZcGR/zOcKxpOBsrGoVn0/EQIul3A/0C0ID7i5zwJAyX6LP7
 V+6vyF2eHMf10tB0rbfB
 =SpOo
 -----END PGP SIGNATURE-----

Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux

Pull module fixes and a virtio block fix from Rusty Russell:
 "Various minor fixes, but a slightly more complex one to fix the
  per-cpu overload problem introduced recently by kvm id changes."

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: put modules in list much earlier.
  module: add new state MODULE_STATE_UNFORMED.
  module: prevent warning when finit_module a 0 sized file
  virtio-blk: Don't free ida when disk is in use
2013-01-20 16:44:28 -08:00
Alex Elder e0b49868d3 rbd: fix type of snap_id in rbd_dev_v2_snap_info()
The type of the snap_id local variable is defined with the
wrong byte order.  Fix that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:35:55 -06:00
Alex Elder 8b84de7940 rbd: assign watch request more directly
Both rbd_req_sync_op() and rbd_do_request() have a "linger"
parameter, which is the address of a pointer that should refer to
the osd request structure used to issue a request to an osd.

Only one case ever supplies a non-null "linger" argument: an
CEPH_OSD_OP_WATCH start.  And in that one case it is assigned
&rbd_dev->watch_request.

Within rbd_do_request() (where the assignment ultimately gets made)
we know the rbd_dev and therefore its watch_request field.  We
also know whether the op being sent is CEPH_OSD_OP_WATCH start.

Stop opaquely passing down the "linger" pointer, and instead just
assign the value directly inside rbd_do_request() when it's needed.

This makes it unnecessary for rbd_req_sync_watch() to make
arrangements to hold a value that's not available until a
bit later.  This more clearly separates setting up a watch
request from submitting it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:59 -06:00
Alex Elder 5efea49a98 rbd: move remaining osd op setup into rbd_osd_req_op_create()
The two remaining osd ops used by rbd are CEPH_OSD_OP_WATCH and
CEPH_OSD_OP_NOTIFY_ACK.  Move the setup of those operations into
rbd_osd_req_op_create(), and get rid of rbd_create_rw_op() and
rbd_destroy_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:59 -06:00
Alex Elder 2647ba3810 rbd: move call osd op setup into rbd_osd_req_op_create()
Move the initialization of the CEPH_OSD_OP_CALL operation into
rbd_osd_req_op_create().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:59 -06:00
Alex Elder 8d23bf2909 rbd: don't assign extent info in rbd_req_sync_op()
Move the assignment of the extent offset and length and payload
length out of rbd_req_sync_op() and into its caller in the one spot
where a read (and note--no write) operation might be initiated.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder c561191813 rbd: don't assign extent info in rbd_do_request()
In rbd_do_request() there's a sort of last-minute assignment of the
extent offset and length and payload length for read and write
operations.  Move those assignments into the caller (in those spots
that might initiate read or write operations)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 1821665749 rbd: don't leak rbd_req for rbd_req_sync_notify_ack()
When rbd_req_sync_notify_ack() calls rbd_do_request() it supplies
rbd_simple_req_cb() as its callback function.  Because the callback
is supplied, an rbd_req structure gets allocated and populated so it
can be used by the callback.  However rbd_simple_req_cb() is not
freeing (or even using) the rbd_req structure, so it's getting
leaked.

Since rbd_simple_req_cb() has no need for the rbd_req structure,
just avoid allocating one for this case.  Of the three calls to
rbd_do_request(), only the one from rbd_do_op() needs the rbd_req
structure, and that call can be distinguished from the other two
because it supplies a non-null rbd_collection pointer.

So fix this leak by only allocating the rbd_req structure if a
non-null "coll" value is provided to rbd_do_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 2e53c6c379 rbd: don't leak rbd_req on synchronous requests
When rbd_do_request() is called it allocates and populates an
rbd_req structure to hold information about the osd request to be
sent.  This is done for the benefit of the callback function (in
particular, rbd_req_cb()), which uses this in processing when
the request completes.

Synchronous requests provide no callback function, in which case
rbd_do_request() waits for the request to complete before returning.
This case is not handling the needed free of the rbd_req structure
like it should, so it is getting leaked.

Note however that the synchronous case has no need for the rbd_req
structure at all.  So rather than simply freeing this structure for
synchronous requests, just don't allocate it to begin with.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 907703d050 rbd: combine rbd sync watch/unwatch functions
The rbd_req_sync_watch() and rbd_req_sync_unwatch() functions are
nearly identical.  Combine them into a single function with a flag
indicating whether a watch is to be initiated or torn down.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 0903e875ca rbd: use a common layout for each device
Each osd message includes a layout structure, and for rbd it is
always the same (at least for osd's in a given pool).

Initialize a layout structure when an rbd_dev gets created and just
copy that into osd requests for the rbd image.

Replace an assertion that was done when initializing the layout
structures with code that catches and handles anything that would
trigger the assertion as soon as it is identified.  This precludes
that (bad) condition from ever occurring.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 47dba7ba26 rbd: don't bother calculating file mapping
When rbd_do_request() has a request to process it initializes a ceph
file layout structure and uses it to compute offsets and limits for
the range of the request using ceph_calc_file_object_mapping().

The layout used is fixed, and is based on RBD_MAX_OBJ_ORDER (30).
It sets the layout's object size and stripe unit to be 1 GB (2^30),
and sets the stripe count to be 1.

The job of ceph_calc_file_object_mapping() is to determine which
of a sequence of objects will contain data covered by range, and
within that object, at what offset the range starts.  It also
truncates the length of the range at the end of the selected object
if necessary.

This is needed for ceph fs, but for rbd it really serves no purpose.
It does its own blocking of images into objects, echo of which is
(1 << obj_order) in size, and as a result it ignores the "bno"
value returned by ceph_calc_file_object_mapping().  In addition,
by the point a request has reached this function, it is already
destined for a single rbd object, and its length will not exceed
that object's extent.  Because of this, and because the mapping will
result in blocking up the range using an integer multiple of the
image's object order, ceph_calc_file_object_mapping() will never
change the offset or length values defined by the request.

In other words, this call is a big no-op for rbd data requests.

There is one exception.  We read the header object using this
function, and in that case we will not have already limited the
request size.  However, the header is a single object (not a file or
rbd image), and should not be broken into pieces anyway.  So in fact
we should *not* be calling ceph_calc_file_object_mapping() when
operating on the header object.

So...

Don't call ceph_calc_file_object_mapping() in rbd_do_request(),
because useless for image data and incorrect to do sofor the image
header.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder e01e79273b rbd: open code rbd_calc_raw_layout()
This patch gets rid of rbd_calc_raw_layout() by simply open coding
it in its one caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 0829661863 rbd: pull in ceph_calc_raw_layout()
This is the first in a series of patches aimed at eliminating
the use of ceph_calc_raw_layout() by rbd.

It simply pulls in a copy of that function and renames it
rbd_calc_raw_layout().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:58 -06:00
Alex Elder 30573d6803 rbd: assume single op in a request
We now know that every of rbd_req_sync_op() passes an array of
exactly one operation, as evidenced by all callers passing 1 as its
num_op argument.  So get rid of that argument, assuming a single op.

Similarly, we now know that all callers of rbd_do_request() pass 1
as the num_op value, so that parameter can be eliminated as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder 139b4318ad rbd: there is really only one op
Throughout the rbd code there are spots where it appears we can
handle an osd request containing more than one osd request op.

But that is only the way it appears.  In fact, currently only one
operation at a time can be supported, and supporting more than
one will require much more than fleshing out the support that's
there now.

This patch changes names to make it perfectly clear that anywhere
we're dealing with a block of ops, we're in fact dealing with
exactly one of them.  We'll be able to simplify some things as
a result.

When multiple op support is implemented, we can update things again
accordingly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder ae7ca4a35b libceph: pass num_op with ops
Both ceph_osdc_alloc_request() and ceph_osdc_build_request() are
provided an array of ceph osd request operations.  Rather than just
passing the number of operations in the array, the caller is
required append an additional zeroed operation structure to signal
the end of the array.

All callers know the number of operations at the time these
functions are called, so drop the silly zero entry and supply that
number directly.  As a result, get_num_ops() is no longer needed.
This also means that ceph_osdc_alloc_request() never uses its ops
argument, so that can be dropped.

Also rbd_create_rw_ops() no longer needs to add one to reserve room
for the additional op.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder d07c09589f rbd: pass num_op with ops array
Add a num_op parameter to rbd_do_request() and rbd_req_sync_op() to
indicate the number of entries in the array.  The callers of these
functions always know how many entries are in the array, so just
pass that information down.

This is in anticipation of eliminating the extra zero-filled entry
in these ops arrays.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder 54a5400721 libceph: don't set pages or bio in ceph_osdc_alloc_request()
Only one of the two callers of ceph_osdc_alloc_request() provides
page or bio data for its payload.  And essentially all that function
was doing with those arguments was assigning them to fields in the
osd request structure.

Simplify ceph_osdc_alloc_request() by having the caller take care of
making those assignments

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 16:34:57 -06:00
Alex Elder d178a9e740 libceph: don't set flags in ceph_osdc_alloc_request()
The only thing ceph_osdc_alloc_request() really does with the
flags value it is passed is assign it to the newly-created
osd request structure.  Do that in the caller instead.

Both callers subsequently call ceph_osdc_build_request(), so have
that function (instead of ceph_osdc_alloc_request()) issue a warning
if a request comes through with neither the read nor write flags set.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder e75b45cf36 libceph: drop osdc from ceph_calc_raw_layout()
The osdc parameter to ceph_calc_raw_layout() is not used, so get rid
of it.  Consequently, the corresponding parameter in calc_layout()
becomes unused, so get rid of that as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder 4d6b250bf1 libceph: drop snapid in ceph_calc_raw_layout()
A snapshot id must be provided to ceph_calc_raw_layout() even though
it is not needed at all for calculating the layout.

Where the snapshot id *is* needed is when building the request
message for an osd operation.

Drop the snapid parameter from ceph_calc_raw_layout() and pass
that value instead in ceph_osdc_build_request().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:05 -06:00
Alex Elder 0120be3c60 libceph: pass length to ceph_osdc_build_request()
The len argument to ceph_osdc_build_request() is set up to be
passed by address, but that function never updates its value
so there's no need to do this.  Tighten up the interface by
passing the length directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:04 -06:00
Alex Elder 7c3d22cf16 rbd: don't bother setting snapid in rbd_do_request()
For some reason, the snapid field of the osd request header is
explicitly set to CEPH_NOSNAP in rbd_do_request().  Just a few lines
later--with no code that would access this field in between--a call
is made to ceph_calc_raw_layout() passing the snapid provided to
rbd_do_request(), which encodes the snapid value it is provided into
that field instead.

In other words, there is no need to fill in CEPH_NOSNAP, and doing
so suggests it might be necessary.  Don't do that any more.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:03 -06:00
Alex Elder 25704ac9de rbd: kill rbd_req_sync_op() snapc and snapid parameters
The snapc and snapid parameters to rbd_req_sync_op() always take
the values NULL and CEPH_NOSNAP, respectively.  So just get rid
of them and use those values where needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:02 -06:00
Alex Elder 07b2391fbb rbd: drop flags parameter from rbd_req_sync_exec()
All callers of rbd_req_sync_exec() pass CEPH_OSD_FLAG_READ as their
flags argument.  Delete that parameter and use CEPH_OSD_FLAG_READ
within the function.  If we find a need to support write operations
we can add it back again.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:02 -06:00
Alex Elder 4775618d92 rbd: drop snapid parameter from rbd_req_sync_read()
There is only one caller of rbd_req_sync_read(), and it passes
CEPH_NOSNAP as the snapshot id argument.  Delete that parameter
and just use CEPH_NOSNAP within the function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:01 -06:00
Alex Elder af77f26caa rbd: drop oid parameters from ceph_osdc_build_request()
The last two parameters to ceph_osd_build_request() describe the
object id, but the values passed always come from the osd request
structure whose address is also provided.  Get rid of those last
two parameters.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:01 -06:00
Alex Elder 0ec8ce87f3 rbd: separate layout init
Pull a block of code that initializes the layout structure in an osd
request into its own function so it can be reused.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:01 -06:00
Alex Elder a7b4c65f4f rbd: only get snap context for write requests
Right now we get the snapshot context for an rbd image (under
protection of the header semaphore) for every request processed.

There's no need to get the snap context if we're doing a read,
so avoid doing so in that case.

Note that we no longer need to hold the header semaphore to
check the rbd_dev's existence flag.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:00 -06:00
Alex Elder d78b650a59 rbd: make exists flag atomic
The rbd_device->exists field can be updated asynchronously, changing
from set to clear if a mapped snapshot disappears from the base
image's snapshot context.

Currently, value of the "exists" flag is only read and modified
under protection of the header semaphore, but that will change with
the next patch.  Making it atomic ensures this won't be a problem
because the a the non-existence of device will be immediately known.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:52:00 -06:00
Alex Elder b395e8b5b8 rbd: a little more cleanup of rbd_rq_fn()
Now that a big hunk in the middle of rbd_rq_fn() has been moved
into its own routine we can simplify it a little more.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:51:51 -06:00
Alex Elder cd323ac0eb rbd: end request on error in rbd_do_request() caller
Only one of the three callers of rbd_do_request() provide a
collection structure to aggregate status.

If an error occurs in rbd_do_request(), have the caller
take care of calling rbd_coll_end_req() if necessary in
that one spot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:33:41 -06:00
Alex Elder 8295cda7ce rbd: encapsulate handling for a single request
In rbd_rq_fn(), requests are fetched from the block layer and each
request is processed, looping through the request's list of bio's
until they've all been consumed.

Separate the handling for a single request into its own function to
make it a bit easier to see what's going on.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 15:04:47 -06:00
Alex Elder 8986cb37b1 rbd: be picky about osd request status type
The result field in a ceph osd reply header is a signed 32-bit type,
but rbd code often casually uses int to represent it.

The following changes the types of variables that handle this result
value to be "s32" instead of "int" to be completely explicit about
it.  Only at the point we pass that result to __blk_end_request()
does the type get converted to the plain old int defined for that
interface.

There is almost certainly no binary impact of this change, but I
prefer to show the exact size and signedness of the value since we
know it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-17 14:53:20 -06:00
Alex Elder 5f29ddd4f0 rbd: standardize ceph_osd_request variable names
There are spots where a ceph_osds_request pointer variable is given
the name "req".  Since we're dealing with (at least) three types of
requests (block layer, rbd, and osd), I find this slightly
distracting.

Change such instances to use "osd_req" consistently to make the
abstraction represented a little more obvious.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:53:15 -06:00
Alex Elder 725afc97c9 rbd: standardize rbd_request variable names
There are two names used for items of rbd_request structure type:
"req" and "req_data".  The former name is also used to represent
items of pointers to struct ceph_osd_request.

Change all variables that have these names so they are instead
called "rbd_req" consistently.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:53:07 -06:00
Alex Elder 935dc89f3e rbd: add warnings to rbd_dev_probe_update_spec()
Josh suggested adding warnings to this function to help users
diagnose problems.

Other than memory allocatino errors, there are two places where
errors can be returned.  Both represent problems that should
have been caught earlier, and as such might well have been
handled with BUG_ON() calls.  But if either ever did manage to
happen, it will be reported.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:12:46 -06:00
Alex Elder f5400b7a0e rbd: add a warning in bio_chain_clone_range()
Add a warning in bio_chain_clone_range() to help a user determine
what exactly might have led to a failure.  There is only one; please
say something if you disagree with the following reasoning.

There are three places this can return abnormally:
    - Initially, if there is nothing to clone.  It turns out that
      right now this cannot happen anyway.  The test is in place
      because the code below it doesn't work if those conditions
      don't hold.  As such they could be assertions but since I can
      return a null to indicate an error I just do that instead.
      I have not added a warning here because it won't happen.
    - While processing bio's, if none remain but there are supposed
      to be more bytes to clone.  Here I have added a warning.
    - If bio_clone_range() returns a null pointer.  That function
      will have already produced a warning (at least the first
      time, via WARN_ON_ONCE()) to distinguish the cause of the
      error.  The only exception is memory exhaustion, and I'd
      rather not pepper the code with warnings in all those spots.
      So no warning is added in that place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:12:31 -06:00
Alex Elder 4fb5d67139 rbd: add warning messages for missing arguments
Tell the user (via dmesg) what was wrong with the arguments provided
via /sys/bus/rbd/add.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2013-01-17 14:10:21 -06:00
Alex Elder 06ecc6cbf7 rbd: define and use rbd_warn()
Define a new function rbd_warn() that produces a boilerplate warning
message, identifying in the resulting message the affected rbd
device in the best way available.  Use it in a few places that now
use pr_warning().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:09:29 -06:00
Alex Elder 4caf35f9ec rbd: use kmemdup()
This replaces two kmalloc()/memcpy() combinations with a single
call to kmemdup().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:09:00 -06:00
Alex Elder 979ed480a2 rbd: kill rbd_spec->image_id_len
There is no real benefit to keeping the length of an image id, so
get rid of it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:08:54 -06:00
Alex Elder 69e7a02f63 rbd: kill rbd_spec->image_name_len
There may have been a benefit to hanging on to the length of an
image name before, but there is really none now.  The only time it's
used is when probing for rbd images, so we can just compute the
length then.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:08:46 -06:00
Alex Elder c66c6e0c0b rbd: document rbd_spec structure
I promised Josh I would document whether there were any restrictions
needed for accessing fields of an rbd_spec structure.  This adds a
big block of comments that documents the structure and how it is
used--including the fact that we don't attempt to synchronize access
to it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-01-17 14:07:50 -06:00
Fengguang Wu 478c030eec drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static?
Hi Asai,

FYI, there are new sparse warnings show up in

tree:   git://git.kernel.dk/linux-block.git for-3.9/drivers
head:   3d6a87430e
commit: 152834694d [2/3] mtip32xx: add trim support

>> drivers/block/mtip32xx/mtip32xx.c:1726:5: sparse: symbol 'mtip_send_trim' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:3348:17: sparse: cast to restricted __le32
   drivers/block/mtip32xx/mtip32xx.c:4125:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4126:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4127:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4128:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4129:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4130:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4131:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c:4132:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags':
   drivers/block/mtip32xx/mtip32xx.c:2804:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers':
   drivers/block/mtip32xx/mtip32xx.c:2781:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=]

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-12 09:15:19 +01:00
Fengguang Wu 25bac122b8 drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
Hi Asai,

FYI, there are new sparse warnings show up in

tree:   git://git.kernel.dk/linux-block.git for-3.9/drivers
head:   3d6a87430e
commit: 16c906e51c [1/3] mtip32xx: Add workqueue and NUMA support

drivers/block/mtip32xx/mtip32xx.c:3267:17: sparse: cast to restricted __le32
>> drivers/block/mtip32xx/mtip32xx.c:4029:1: sparse: symbol 'mtip_workq_sdbf0' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4030:1: sparse: symbol 'mtip_workq_sdbf1' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4031:1: sparse: symbol 'mtip_workq_sdbf2' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4032:1: sparse: symbol 'mtip_workq_sdbf3' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4033:1: sparse: symbol 'mtip_workq_sdbf4' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4034:1: sparse: symbol 'mtip_workq_sdbf5' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4035:1: sparse: symbol 'mtip_workq_sdbf6' was not declared. Should it be static?
>> drivers/block/mtip32xx/mtip32xx.c:4036:1: sparse: symbol 'mtip_workq_sdbf7' was not declared. Should it be static?
   drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_flags':
   drivers/block/mtip32xx/mtip32xx.c:2723:1: warning: the frame size of 1036 bytes is larger than 1024 bytes [-Wframe-larger-than=]
   drivers/block/mtip32xx/mtip32xx.c: In function 'mtip_hw_read_registers':
   drivers/block/mtip32xx/mtip32xx.c:2700:1: warning: the frame size of 1044 bytes is larger than 1024 bytes [-Wframe-larger-than=]

Please consider folding the below diff :-)

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-12 09:15:11 +01:00
Dan Carpenter 3d6a87430e dac960: return success instead of -ENOTTY
There is a missing break statement here.  This used to return directly
but we re-worked it in 2008 to add locking as part of the BKL push down.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:42:36 +01:00
Asai Thambi S P 152834694d mtip32xx: add trim support
TRIM support added through vendor unique command.

Signed-off-by: Sam Bradshaw < sbradshaw@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:41:34 +01:00
Asai Thambi S P 16c906e51c mtip32xx: Add workqueue and NUMA support
This patch contains
	* parallel command completion using workers
	* bind the workers to the chosen numa node
	* bind isr to the chosen numa node
	* allocating memory in the chosen numa node

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:38:57 +01:00
Asai Thambi S P 58c49df378 mtip32xx: fix for crash when the device surprise removed during rebuild
When rebuild is in progress, disk->queue is yet to be created. Surprise
removing the device will call remove()-> del_gendisk(). del_gendisk()
expect disk->queue be not NULL. Fix is to call put_disk() when disk_queue
is NULL.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:35:58 +01:00
Asai Thambi S P 47cd4b3c7e mtip32xx: fix for driver hang after a command timeout
If an I/O command times out when a PIO command is active,
MTIP_PF_EH_ACTIVE_BIT is not cleared. This results in I/O
hang in the driver. Fix is to clear this bit.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11 14:35:55 +01:00
Paul Gortmaker d1a6f4f197 block: delete super ancient PC-XT driver for 1980's hardware
This driver was for the 8 bit ISA cards that were installed in
the PC-XT machines of 1980 vintage.  They supported the dual
ribbon cable MFM drives of 10-20MB capacity, and ran at a 3:1
interleave, giving performance on the order of 128kB/s.

By the introduction of the PC-AT (286) these controllers were
already scrapped in favour of 16 bit controllers with some onboard
RAM that could support a 1:1 interleave.

The git history doesn't show any evidence of runtime fixes that
would reflect active usage; instead just the usual tree-wide API
type changes/cleanups.  Going back to in-source changelogs, the
last "runtime" fix that is evident is something I did over a
dozen years ago[1] -- and even back then, the hardware was long
since unavailable, so that ancient fix was also not runtime tested.

The time is long overdue for this to get flushed, so lets get
rid of it before anyone wastes more time doing builds and sparse
checks etc. on long since dead code.

[1] http://lkml.indiana.edu/hypermail/linux/kernel/0102.2/0027.html

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-01-04 20:17:40 -05:00
Greg Kroah-Hartman 8d85fce77e Drivers: block: remove __dev* attributes.
CONFIG_HOTPLUG is going away as an option.  As a result, the __dev*
markings need to be removed.

This change removes the use of __devinit, __devexit_p, __devinitdata,
__devinitconst, and __devexit from these drivers.

Based on patches originally written by Bill Pemberton, but redone by me
in order to handle some of the coding style issues better, by hand.

Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Chirag Kantharia <chirag.kantharia@hp.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jim Paris <jim@jtan.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Grant Likely <grant.likely@secretlab.ca>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: NeilBrown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tao Guo <Tao.Guo@emc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-03 15:57:15 -08:00
Alexander Graf f4953fe6c4 virtio-blk: Don't free ida when disk is in use
When a file system is mounted on a virtio-blk disk, we then remove it
and then reattach it, the reattached disk gets the same disk name and
ids as the hot removed one.

This leads to very nasty effects - mostly rendering the newly attached
device completely unusable.

Trying what happens when I do the same thing with a USB device, I saw
that the sd node simply doesn't get free'd when a device gets forcefully
removed.

Imitate the same behavior for vd devices. This way broken vd devices
simply are never free'd and newly attached ones keep working just fine.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org
2013-01-02 15:37:58 +10:30
Linus Torvalds 40889e8d9f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph update from Sage Weil:
 "There are a few different groups of commits here.  The largest is
  Alex's ongoing work to enable the coming RBD features (cloning,
  striping).  There is some cleanup in libceph that goes along with it.

  Cyril and David have fixed some problems with NFS reexport (leaking
  dentries and page locks), and there is a batch of patches from Yan
  fixing problems with the fs client when running against a clustered
  MDS.  There are a few bug fixes mixed in for good measure, many of
  which will be going to the stable trees once they're upstream.

  My apologies for the late pull.  There is still a gremlin in the rbd
  map/unmap code and I was hoping to include the fix for that as well,
  but we haven't been able to confirm the fix is correct yet; I'll send
  that in a separate pull once it's nailed down."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (68 commits)
  rbd: get rid of rbd_{get,put}_dev()
  libceph: register request before unregister linger
  libceph: don't use rb_init_node() in ceph_osdc_alloc_request()
  libceph: init event->node in ceph_osdc_create_event()
  libceph: init osd->o_node in create_osd()
  libceph: report connection fault with warning
  libceph: socket can close in any connection state
  rbd: don't use ENOTSUPP
  rbd: remove linger unconditionally
  rbd: get rid of RBD_MAX_SEG_NAME_LEN
  libceph: avoid using freed osd in __kick_osd_requests()
  ceph: don't reference req after put
  rbd: do not allow remove of mounted-on image
  libceph: Unlock unprocessed pages in start_read() error path
  ceph: call handle_cap_grant() for cap import message
  ceph: Fix __ceph_do_pending_vmtruncate
  ceph: Don't add dirty inode to dirty list if caps is in migration
  ceph: Fix infinite loop in __wake_requests
  ceph: Don't update i_max_size when handling non-auth cap
  bdi_register: add __printf verification, fix arg mismatch
  ...
2012-12-20 14:00:13 -08:00
Alex Elder c3e946ce72 rbd: get rid of rbd_{get,put}_dev()
The functions rbd_get_dev() and rbd_put_dev() are trivial wrappers
that add no value, and their existence suggests they may do more
than what they do.

Get rid of them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-12-20 10:56:44 -06:00
Jens Axboe b6c46cfa31 Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.8

which has a bug-fix to the xen-blkfront and xen-blkback driver
when using the persistent mode. An issue was discovered where LVM
disks could not be read correctly and this fixes it. There
is also a change in llist.h which has been blessed by akpm.
2012-12-19 20:37:10 +01:00
Linus Torvalds 848b81415c Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:
 "Incoming:

   - lots of misc stuff

   - backlight tree updates

   - lib/ updates

   - Oleg's percpu-rwsem changes

   - checkpatch

   - rtc

   - aoe

   - more checkpoint/restart support

  I still have a pile of MM stuff pending - Pekka should be merging
  later today after which that is good to go.  A number of other things
  are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
  scatterlist: don't BUG when we can trivially return a proper error.
  docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
  fs, fanotify: add @mflags field to fanotify output
  docs: add documentation about /proc/<pid>/fdinfo/<fd> output
  fs, notify: add procfs fdinfo helper
  fs, exportfs: add exportfs_encode_inode_fh() helper
  fs, exportfs: escape nil dereference if no s_export_op present
  fs, epoll: add procfs fdinfo helper
  fs, eventfd: add procfs fdinfo helper
  procfs: add ability to plug in auxiliary fdinfo providers
  tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
  breakpoint selftests: print failure status instead of cause make error
  kcmp selftests: print fail status instead of cause make error
  kcmp selftests: make run_tests fix
  mem-hotplug selftests: print failure status instead of cause make error
  cpu-hotplug selftests: print failure status instead of cause make error
  mqueue selftests: print failure status instead of cause make error
  vm selftests: print failure status instead of cause make error
  ubifs: use prandom_bytes
  mtd: nandsim: use prandom_bytes
  ...
2012-12-17 20:58:12 -08:00
Roger Pau Monne d62f691858 xen-blkfront: handle bvecs with partial data
Currently blkfront fails to handle cases in blkif_completion like the
following:

1st loop in rq_for_each_segment
 * bv_offset: 3584
 * bv_len: 512
 * offset += bv_len
 * i: 0

2nd loop:
 * bv_offset: 0
 * bv_len: 512
 * i: 0

In the second loop i should be 1, since we assume we only wanted to
read a part of the previous page. This patches fixes this cases where
only a part of the shared page is read, and blkif_completion assumes
that if the bv_offset of a bvec is less than the previous bv_offset
plus the bv_size we have to switch to the next shared page.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17 21:56:03 -05:00
Roger Pau Monne ebb351cf78 llist/xen-blkfront: implement safe version of llist_for_each_entry
Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants where freed while iterating the list,
which lead to dereferences when trying to fetch the next item.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by:  Andrew Morton <akpm@linux-foundation.org>
[v2: Move the llist_for_each_entry_safe in llist.h]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17 21:55:56 -05:00
Dan Carpenter 31279b1457 aoe: fix use after free in aoedev_by_aoeaddr()
We should return NULL on failure instead of returning a freed pointer.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin 2b37c7d865 aoe: update internal version number to 81
This version number is printed to the console on module initialization
and is available in sysfs, which is where the userland aoe-version tool
looks for it.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin bf29754ae8 aoe: identify source of runt AoE packets
This change only affects experimental AoE storage networks.

It modifies the console message about runt packets detected so that the
AoE major and minor addresses of the AoE target that generated the runt
are mentioned.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin 4a6c9ee93c aoe: allow comma separator in aoe_iflist value
By default, the aoe driver uses any ethernet interface for AoE, but the
aoe_iflist module parameter provides a convenient way to limit AoE
traffic to a specific list of local network interfaces.

This change allows a list to be specified using the comma character as a
separator.  For example,

  modprobe aoe aoe_iflist=eth2,eth3

Before, it was inconvenient to get the quoting right in shell scripts
when setting aoe_iflist to have more than one network interface.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin c450ba0fc1 aoe: allow user to disable target failure timeout
With this change, the aoe driver treats the value zero as special for
the aoe_deadsecs module parameter.  Normally, this value specifies the
number of seconds during which the driver will continue to attempt
retransmits to an unresponsive AoE target.  After aoe_deadsecs has
elapsed, the aoe driver marks the aoe device as "down" and fails all
I/O.

The new meaning of an aoe_deadsecs of zero is for the driver to
retransmit commands indefinitely.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 71114ec45f aoe: use dynamic number of remote ports for AoE storage target
Many AoE targets have four or fewer network ports, but some existing
storage devices have many, and the AoE protocol sets no limit.

This patch allows the use of more than eight remote MAC addresses per AoE
target, while reducing the amount of memory used by the aoe driver in
cases where there are many AoE targets with fewer than eight MAC addresses
each.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin e52a293264 aoe: avoid races between device destruction and discovery
This change avoids a race that could result in a NULL pointer derference
following a WARNing from kobject_add_internal, "don't try to register
things with the same name in the same directory."

The problem was found with a test that forgets and discovers an
aoe device in a loop:

  while test ! -r /tmp/stop; do
	aoe-flush -a
	aoe-discover
  done

The race was between aoedev_flush taking aoedevs out of the devlist,
allowing a new discovery of the same AoE target to take place before the
driver gets around to calling sysfs_remove_group.  Fixing that one
revealed another race between do_open and add_disk, and this patch avoids
that, too.

The fix required some care, because for flushing (forgetting) an aoedev,
some of the steps must be performed under lock and some must be able to
sleep.  Also, for discovering a new aoedev, some steps might sleep.

The check for a bad aoedev pointer remains from a time when about half of
this patch was done, and it was possible for the
bdev->bd_disk->private_data to become corrupted.  The check should be
removed eventually, but it is not expected to add significant overhead,
occurring in the aoeblk_open routine.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin bbb44e30d0 aoe: improve handling of misbehaving network paths
An AoE target can have multiple network ports used for AoE, and in the
aoe driver, those are tracked by the aoetgt struct.  These changes allow
the aoe driver to handle network paths, or aoetgts, that are not working
well, compared to the others.

Paths that do not get responses despite the retransmission of AoE
commands are marked as "tainted", and non-tainted paths are preferred.

Meanwhile, the aoe driver attempts to "probe" the tainted path in the
background by issuing reads of LBA 0 that are padded out to full
(possibly jumbo-frame) size.  If the probes get responses, then the path
is "redeemed", and its taint is removed.

This mechanism has been shown to be helpful in transparently handling
and recovering from real-world network "brown outs" in ways that the
earlier "shoot the help-needing target in the head" mechanism could not.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin b91316f2b7 aoe: return real minor number for static minors
The value returned by the static minor device number number allocator is
the real minor number, so it must be multiplied by the supported number
of partitions per aoedev.

Without this fix the support for systems without udev is incomplete, and
the few users of aoe on such systems will have surprising results when
device nodes names do not match the AoE target.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 10935d052e aoe: initialize sysminor to avoid compiler warning
Because the minor_get and related functions use the return values for
errors, the compiler doesn't know that sysminor will always either 1) be
initialized in aoedev_by_aoeaddr by the call to minor_get, or 2) be
unused as the "goto out" is executed.

This patch avoids the compiler warning.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin e0b2bbab0b aoe: make error messages more specific in static minor allocation
For some special-purpose systems where udev isn't present, static
allocation of minor numbers is desirable.  This update distinguishes
different failure scenarios, to help the user understand what went
wrong.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 60116cf773 aoe: remove call to request handler from I/O completion
There is no need to call the request handler function in the I/O
completion routine.  The user impact of not doing it is a more "nice" aoe
driver that is less susceptible to causing soft lockups.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 72837600ee aoe: cleanup: correct comment for aoetgt nout
A misplaced comment was attached to the nout member of the aoetgt.  This
change corrects the comment.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 7b6ccc5f97 aoe: increase default cap on outstanding AoE commands in the network
The aoe driver will never be waiting for more than aoe_maxout AoE
commands from a given remote network port on an AoE target.  Increasing
the cap increases performance.  Users can tighten the setting to reduce
the amount of memory used for handling AoE traffic or the network
bandwidth used for AoE.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 0a41409c51 aoe: remove vestigial request queue allocation
Before the aoe driver was an I/O request handler, it was a
make_request-style block driver.  Even so, there was a problem where
sysfs expected a request queue to exist, so one was provided in commit
7135a71b19 ("aoe: allocate unused request_queue for sysfs").

During the transition to the request-handler style, a patch was merged
that was based on a driver without the noop queue, and the noop queue
remained in place after the patch was merged, even though a new
functional queue was introduced by the patch, allocated through
blk_init_queue.

The user impact is a memory leak proportional to the number of AoE
targets discovered.  This patch removes the memory leak and cleans up
vestiges of the old do-nothing queue from the aoeblk_gdalloc function.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin fe7252bf51 aoe: copy fallback timing information on destination failover
Commit f3b8e07af774 ("aoe: commands in retransmit queue use new
destination on failure") omits the copying of the coarse-grained time
when an AoE command was sent during the failover from one destination
MAC address on the AoE target to another.

The coarse-grained timing is only used when the system time changes or
an unlikely length of time has passed since the sending of the AoE
command.  Users will not be impacted unless their system clock is very
inaccurate or something unusual (e.g., 10 GbE link reset) happens during
the period when the aoe driver is handling the failure of a port on the
AoE target.  Being effected will mean that an AoE target could be
considered "down" too eagerly.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 519b77b032 aoe: update driver-internal version to 64+
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 3fc9b03248 aoe: commands in retransmit queue use new destination on failure
When one remote MAC address isn't working as a destination for AoE
commands, the frames used to track information associated with the AoE
commands are moved to a new aoetgt (defined by the tuple of {AoE major,
AoE minor, target MAC address}).

This patch makes sure that the frames on the queue for retransmits that
need to be done are updated to use the new destination, so that
retransmits will be sent through a working network path.

Without this change, packets on the retransmit queue will be needlessly
retransmitted to the unresponsive destination MAC, possibly causing
premature target failure before there's time for the retransmit timer to
run again, decide to retransmit again, and finally update the destination
to a working MAC address on the AoE target.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 5f0c9c48e7 aoe: use high-resolution RTTs with fallback to low-res
These changes improve the accuracy of the decision about whether it's time
to retransmit an AoE command by using the microsecond-resolution
gettimeofday instead of jiffies.

Because the system time can jump suddenly, the decision reverts to using
jiffies if the high-resolution time difference is relatively large.
Otherwise the AoE targets could be considered failed inappropriately.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 0d555ecfa4 aoe: manipulate aoedev network stats under lock
With this bugfix in place the calculation of the criterion for "lateness"
is performed under lock.  Without the lock, there is a chance that one of
the non-atomic operations performed on the round trip time statistics
could be incomplete, such that an incorrect lateness criterion would be
calculated.

Without this change, the effect of the bug would be rare unecessary but
benign retransmissions.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 2292a7e109 aoe: err device: include MAC addresses for unexpected responses
The /dev/etherd/err character device provides low-level information about
normal but sometimes interesting AoE command retransmits and "unexpected
responses", i.e., responses for packets that have already been
retransmitted.

This change adds MAC addresses to the messages about unexpected responses,
so that when they occur, it's more easy to determine the network paths to
which they belong.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 3a0c40d2d2 aoe: improve network congestion handling
The aoe driver already had some congestion handling, but it was limited in
its ability to cope with the kind of congestion that can arise on more
complex networks such as those involving paths through multiple ethernet
switches.

Some of the lessons from TCP's history of development can be applied to
improving the congestion control and avoidance on AoE storage networks.
These changes use familar concepts from Van Jacobson's "Congestion
Avoidance and Control" paper from '88, without adding significant
overhead.

This patch depends on an upcoming patch that covers the failover case when
AoE commands being retransmitted are transferred from one retransmit queue
to another.  Another upcoming patch increases the timing accuracy.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 667be1e757 aoe: provide ATA identify device content to user on request
Make the aoe driver follow expected behavior when the user uses ioctl to
get the ATA device identify information, allowing access to model, serial
number, etc.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin cd220bf51f aoe: update driver-internal version number to 60
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin a04b41cd2c aoe: whitespace cleanup
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin d437962504 aoe: cleanup: remove unused ata_scnt function
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 90a2508d01 aoe: "payload" sysfs file exports per-AoE-command data transfer size
The userland aoetools package includes an "aoe-stat" command that can
display a "payload size" column when the aoe driver exports this
information.  Users can quickly see what amount of user data is
transferred inside each AoE command on the network, network headers
excluded.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin aa304fdefa aoe: support larger I/O requests via aoe_maxsectors module param
The GPFS filesystem is an example of an aoe user that requires the aoe
driver to support I/O request sizes larger than the default.  Most users
will not need large I/O request sizes, because they would need to be split
up into multiple AoE commands anyway.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 4ba9aa7f98 aoe: support the forgetting (flushing) of a user-specified AoE target
Users sometimes want to cause the aoe driver to forget a particular
previously discovered device when it is no longer online.  The aoetools
provide an "aoe-flush" command that users run to perform this
administrative task.  The changes below provide the support needed in the
driver.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 1b8a1636ce aoe: update cap on outstanding commands based on config query response
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.

By taking the current value of this field into accound, we increase
performance throughput or avoid network congestion, when the value
has increased or decreased, respectively.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 4e78dd144b aoe: print warning regarding a common reason for dropped transmits
Dropped transmits are not common, but when they do occur, increasing
the transmit queue length often helps.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 662a889608 aoe: describe the behavior of the "err" character device
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Linus Torvalds 9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Alex Elder b8f5c6edca rbd: don't use ENOTSUPP
ENOTSUPP is not a standard errno (it shows up as "Unknown error 524"
in an error message).  This is what was getting produced when the
the local rbd code does not implement features required by a
discovered rbd image.

Change the error code returned in this case to ENXIO.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 12:07:32 -06:00
Alex Elder 2fd82b9e92 rbd: get rid of RBD_MAX_SEG_NAME_LEN
RBD_MAX_SEG_NAME_LEN represents the maximum length of an rbd object
name (i.e., one of the objects providing storage backing an rbd
image).

Another symbol, MAX_OBJ_NAME_SIZE, is used in the osd client code to
define the maximum length of any object name in an osd request.

Right now they disagree, with RBD_MAX_SEG_NAME_LEN being too big.

There's no real benefit at this point to defining the rbd object
name length limit separate from any other object name, so just
get rid of RBD_MAX_SEG_NAME_LEN and use MAX_OBJ_NAME_SIZE in its
place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 08:37:29 -06:00
Alex Elder 42382b709b rbd: do not allow remove of mounted-on image
There is no check in rbd_remove() to see if anybody holds open the
image being removed.  That's not cool.

Add a simple open count that goes up and down with opens and closes
(releases) of the device, and don't allow an rbd image to be removed
if the count is non-zero.

Protect the updates of the open count value with ctl_mutex to ensure
the underlying rbd device doesn't get removed while concurrently
being opened.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 08:36:59 -06:00
Roger Pau Monne 7dc341175a xen-blkback: implement safe iterator for the list of persistent grants
Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-07 15:13:09 -05:00
Lars Ellenberg d2ec180c23 drbd: update Kconfig to match current dependencies
We no longer need the connector.
But we need libcrc32c.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:08:29 +01:00
Philipp Reisner ef86b77957 drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
This was introduces when moving the code over from the 8.3 codebase
with commit 328e0f125b

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:04:34 +01:00
Philipp Reisner 13c76aba78 drbd: close race between drbd_set_role and drbd_connect
drbd_set_role(, R_PRIMARY, ) does the state change to Primary,
some more housekeeping, and possibly generates a new UUID set.

All of this holding the "state_mutex".

The connection handshake involves sending of various state information,
including the current data generation UUID set, and two connection
state changes from C_WF_CONNECTION to C_WF_REPORT_PARAMS further to
a number of different outcomes, resync being one of them.

If the connection handshake happens between the state change to Primary
and the generation of the new UUIDs, the resync decision based on the
old UUID set may be confused, depending on circumstances.

Make sure that, before we do the handshake, any promotion to Primary
role will either be complete (including the housekeeping stuff), or can
see, and serialize with, the ongoing handshake, based on the
"STATE_SENT" bit, which is set when we start the handshake, and cleared
only when we leave C_WF_REPORT_PARAMS again.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:00:33 +01:00
Lars Ellenberg 691631c065 drbd: respect no-md-barriers setting also when changed online via disk-options
We need to propagate the configuration into the flag bits,
or it won't be effective.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:00:04 +01:00
Philipp Reisner 298307ed1d drbd: Remove obsolete check
Smatch complained about it this redundanct check.

The check was introduced in 2006-09-13. On 2007-07-24 the body of the
function was enclosed by get_ldev()/put_ldev() reference counting.
Since then the check is useless and miss leading.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 12:09:55 +01:00
Jens Axboe 84ad6845fb Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.8/drivers 2012-12-01 09:42:41 +01:00
Jens Axboe 2cecb73098 drbd: fixup after wait_even_lock_irq() addition to generic code
Compiling drbd yields:

drivers/block/drbd/drbd_state.c: In function ‘_conn_request_state’:
drivers/block/drbd/drbd_state.c:1804:5: error: macro "wait_event_lock_irq" passed 4 arguments, but takes just 3
drivers/block/drbd/drbd_state.c:1801:3: error: ‘wait_event_lock_irq’ undeclared (first use in this function)
drivers/block/drbd/drbd_state.c:1801:3: note: each undeclared identifier is reported only once for each function it appears in
drivers/block/drbd/drbd_state.c: At top level:
drivers/block/drbd/drbd_state.c:1734:1: warning: ‘_conn_rq_cond’ defined but not used [-Wunused-function]

Due to drbd having copied the MD definition for wait_event_lock_irq()
as well. Kill them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-30 21:20:15 +01:00
Lukas Czerner 7b5a35225b loop: Limit the number of requests in the bio list
Currently there is not limitation of number of requests in the loop bio
list. This can lead into some nasty situations when the caller spawns
tons of bio requests taking huge amount of memory. This is even more
obvious with discard where blkdev_issue_discard() will submit all bios
for the range and wait for them to finish afterwards. On really big loop
devices and slow backing file system this can lead to OOM situation as
reported by Dave Chinner.

With this patch we will wait in loop_make_request() if the number of
bios in the loop bio list would exceed 'nr_congestion_on'.
We'll wake up the process as we process the bios form the list. Some
threshold hysteresis is in place to avoid high frequency oscillation.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-30 11:48:05 +01:00
Roger Pau Monne 07c540a0b5 xen-blkfront: free allocated page
Free the page allocated for the persistent grant.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26 14:58:11 -05:00
Roger Pau Monne 4d4f270f18 xen-blkback: move free persistent grants code
Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26 14:58:11 -05:00
Selvan Mani 836413e8c7 mtip32xx: Fix padding issue
Hi Jens,

Another tiny patch.

Removed __packed before the struct smart_attr and added __packed at end of
the structure to fix padding issue.

Signed-off-by: Selvan Mani  <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Ed Cashin 11cfb6ff73 aoe: avoid running request handler on plugged queue
Calling the request handler directly on a plugged queue defeats
the performance improvements provided by the plugging mechanism.
Use the __blk_run_queue function instead of calling the request
handler directly, so that we don't interfere with the block
layer's ability to plug the queue.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Wei Yongjun 298d80152c mtip32xx: fix potential NULL pointer dereference in mtip_timeout_function()
The dereference to port should be moved below the NULL test.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Jens Axboe 7c5d62388e mtip32xx: fix shift larger than type warning
If we're building a 32-bit kernel and CONFIG_LBADF isn't set,
sector_t is 32-bits wide. The shifts by 32 and 40 are thus
larger than we support.

Cast the sector offset to a u64 to avoid these warnings.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani 4b9e884523 mtip32xx: Fix incorrect mask used for erase mode
Previous commit use value 3 for erasemode mask.
Changing the mask to correct value to 2

Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani eda4531492 mtip32xx: Fix to make lba address correct in big-endian systems
Earlier lba address was assigned directly to lba_low and lba_low_ex,
which would result in a different number (bytes reversed) in
big-endian systems. Now assigning lba address byte-by-byte to fis.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani 3208795e61 mtip32xx: fix potential crash on SEC_ERASE_UNIT
The mtip driver lifted this code from elsewhere and then added a special
handling check for SEC_ERASE_UNIT. If the caller tries to do a security
erase but passes no output data for the command then outbuf is not
allocated and the driver duly explodes.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
Jiri Kosina eac7cc52c6 floppy: destroy floppy workqueue before cleaning up the queue
We need to first destroy the floppy_wq workqueue before cleaning up
the queue. Otherwise we might race with still pending work with the
workqueue, but all the block queue already gone. This might lead to
various oopses, such as

 CPU 0
 Pid: 6, comm: kworker/u:0 Not tainted 3.7.0-rc4 #1 Bochs Bochs
 RIP: 0010:[<ffffffff8134eef5>]  [<ffffffff8134eef5>] blk_peek_request+0xd5/0x1c0
 RSP: 0000:ffff88000dc7dd88  EFLAGS: 00010092
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff88000f602688 RSI: ffffffff81fd95d8 RDI: 6b6b6b6b6b6b6b6b
 RBP: ffff88000dc7dd98 R08: ffffffff81fd95c8 R09: 0000000000000000
 R10: ffffffff81fd9480 R11: 0000000000000001 R12: 6b6b6b6b6b6b6b6b
 R13: ffff88000dc7dfd8 R14: ffff88000dc7dfd8 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000000 CR3: 0000000001e11000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process kworker/u:0 (pid: 6, threadinfo ffff88000dc7c000, task ffff88000dc5ecc0)
 Stack:
  0000000000000000 0000000000000000 ffff88000dc7ddb8 ffffffff8134efee
  ffff88000dc7ddb8 0000000000000000 ffff88000dc7dde8 ffffffff814aef3c
  ffffffff81e75d80 ffff88000dc0c640 ffff88000fbfb000 ffffffff814aed90
 Call Trace:
  [<ffffffff8134efee>] blk_fetch_request+0xe/0x30
  [<ffffffff814aef3c>] redo_fd_request+0x1ac/0x400
  [<ffffffff814aed90>] ? start_motor+0x130/0x130
  [<ffffffff8106b526>] process_one_work+0x136/0x450
  [<ffffffff8106af65>] ? manage_workers+0x205/0x2e0
  [<ffffffff8106bb6d>] worker_thread+0x14d/0x420
  [<ffffffff8106ba20>] ? rescuer_thread+0x1a0/0x1a0
  [<ffffffff8107075a>] kthread+0xba/0xc0
  [<ffffffff810706a0>] ? __kthread_parkme+0x80/0x80
  [<ffffffff818b553a>] ret_from_fork+0x7a/0xb0
  [<ffffffff810706a0>] ? __kthread_parkme+0x80/0x80
 Code: 0f 84 c0 00 00 00 83 f8 01 0f 85 e2 00 00 00 81 4b 40 00 00 80 00 48 89 df e8 58 f8 ff ff be fb ff ff ff
 fe ff ff <49> 8b 1c 24 49 39 dc 0f 85 2e ff ff ff 41 0f b6 84 24 28 04 00
 RIP  [<ffffffff8134eef5>] blk_peek_request+0xd5/0x1c0
  RSP <ffff88000dc7dd88>

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
Akinobu Mita d48c152a41 cciss: use check_signature()
Use check_signature() to find a signature in the mmio address.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:28:34 +01:00
Akinobu Mita 1f118bc479 cciss: cleanup bitops usage
- Remove unnecessary correction of bit and address
- Use BITS_TO_LONGS macro to calculate bitmap size
- Use bitmap_zero()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:27:22 +01:00
Keith Busch 2b19603415 NVMe: Initialize iod nents to 0
For commands that do not map a scatter list, we need to initilaize the iod's
number of sg entries (nents) to 0 and not unmap in this case.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:50 -05:00
Keith Busch 6ecec74520 NVMe: Define SMART log
This data structure is defined in the NVMe specification.  It's not used
by the kernel, but is available for use by userspace software.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:50 -05:00
Keith Busch 08df1e0565 NVMe: Add result to nvme_get_features
nvme_get_features() was not returning the result.  Add a parameter
to return the result in (similar to nvme_set_features()) and change
all callers.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch f4f117f64b NVMe: Set result from user admin command
The ioctl data structure includes space for the 'result' of the admin
command to be returned; it just wasn't filled in.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch 3295874b60 NVMe: End queued bio requests when freeing queue
If the queue has bios queued on it when it is freed, bio_endio() must be
called for them first.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch 859361a228 NVMe: Free cmdid on nvme_submit_bio error
nvme_map_bio() is called after the cmdid is allocated, so we have to
free the cmdid before returning from nvme_submit_bio() if nvme_map_bio()
returned an error.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Jens Axboe 8d0ff3924b Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.8/drivers 2012-11-12 09:18:47 -07:00
Akinobu Mita f1d6a328bb drbd: use copy_highpage
Use copy_highpage() to copy from one page to another.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:26 +01:00
Lars Ellenberg ed635cb067 drbd: if the replication link breaks during handshake, keep retrying
The 8.3.12 commit drbd: Bugfix for the connection behavior fixes a
"wasted established connection", if a former connection attempt failed
during its early stages.

However it opened a window for a regression, if a connection attempt
fails during its last stages.  The result was a terminated receiver
thread, that left behind the supposedly transient "C_UNCONNECTED" state.
Any later requests to change the connection state fail, as they wait for
the connection state to "stabilize".

Fix: short circuit and keep retrying to restablish a new connection,
if we don't reach C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:19 +01:00
Jing Wang 063eacf88c drbd: check return of kmalloc in receive_uuids
Signed-off-by: Jing Wang <windsdaemon@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:10 +01:00
Philipp Reisner 986836503e Merge branch 'drbd-8.4_ed6' into for-3.8-drivers-drbd-8.4_ed6 2012-11-09 14:20:23 +01:00
Philipp Reisner 328e0f125b drbd: Broadcast sync progress no more often than once per second
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:43 +01:00
Philipp Reisner 518a4d53b2 drbd: don't try to clear bits once the disk has failed
If the disk has failed already, there is no point trying to change the
bitmap. drbd_set_out_of_sync() already had this safeguard,
time to add it to drbd_set_in_sync() as well.

This also prevents some warning messages, like
 FIXME asender in bm_change_bits_to, bitmap locked for 'detach' by worker
if our disk fails during resync, while there are some resync acks queued up.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Philipp Reisner fd0017c124 drbd: fix regression: potential NULL pointer dereference
recent commit
    drbd: always write bitmap on detach
introduced a bitmap writeout during detach,
which obviously needs some meta data device to write to.

Unfortunately, that same error path may be taken if we fail to attach,
e.g. due to UUID mismatch, after we changed state to D_ATTACHING,
but before the lower level device pointer is even assigned.

We need to test for presence of mdev->ldev.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Philipp Reisner 4035e4c2eb drbd: Fix clearing of MDF_AL_DISABLED
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Lars Ellenberg 42839f6536 drbd: log request sector offset and size for IO errors
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:41 +01:00
Lars Ellenberg edc9f5eb7a drbd: always write bitmap on detach
If we detach due to local read-error (which sets a bit in the bitmap),
stay Primary, and then re-attach (which re-reads the bitmap from disk),
we potentially lost the "out-of-sync" (or, "bad block") information in
the bitmap.

Always (try to) write out the changed bitmap pages before going diskless.

That way, we don't lose the bit for the bad block,
the next resync will fetch it from the peer, and rewrite
it locally, which may result in block reallocation in some
lower layer (or the hardware), and thereby "heal" the bad blocks.

If the bitmap writeout errors out as well, we will (again: try to)
mark the "we need a full sync" bit in our super block,
if it was a READ error; writes are covered by the activity log already.

If that superblock does not make it to disk either, we are sorry.

Maybe we just lost an entire disk or controller (or iSCSI connection),
and there actually are no bad blocks at all, so we don't need to
re-fetch from the peer, there is no "auto-healing" necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:41 +01:00
Lars Ellenberg e34b677d09 drbd: wait for meta data IO completion even with failed disk, unless force-detached
The intention of force-detach is to be able to deal with a completely
unresponsive lower level IO stack, which does not even deliver error
completions anymore, but no completion at all.

In all other cases, we must still wait for the meta data IO completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:40 +01:00
Lars Ellenberg 8747d30af9 drbd: a few more GFP_KERNEL -> GFP_NOIO
This has not yet been observed, but conceivably, when using GFP_KERNEL
allocations from drbd_md_sync(), drbd_flush_after_epoch() or
receive_SyncParam(), we could trigger additional IO to our own device,
or an other device in a criss-cross setup, and end up in a local
deadlock, or potentially a distributed deadlock in a criss-cross setup
involving the peer blocked in a similar way waiting for us to make
progress.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:40 +01:00
Lars Ellenberg bc891c9ae3 drbd: fix potential deadlock during bitmap (re-)allocation
The former comment arguing that GFP_KERNEL was good enough was wrong: it
did not take resize into account at all, and assumed the only path
leading here was the normal attach on a still secondary device, so no
deadlock would be possible.

Both resize on a Primary, or attach on a diskless Primary,
could potentially deadlock.

drbd_bm_resize() is called while IO to the respective device is
suspended, so we must use GFP_NOIO to avoid potential deadlock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Lars Ellenberg a506c13a4d drbd: use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Philipp Reisner 1b6dd252e6 drbd: panic on delayed completion of aborted requests
"aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which do no longer
complete requests at all, not even do error completions.  In this
situation, usually a hard-reset and failover is the only way out.

By "aborting", basically faking a local error-completion,
we allow for a more graceful swichover by cleanly migrating services.
Still the affected node has to be rebooted "soon".

By completing these requests, we allow the upper layers to re-use
the associated data pages.

If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will
just put random data into unused pages; but typically it will corrupt
meanwhile completely unrelated data, causing all sorts of damage.

Which means delayed successful completion,
especially for READ requests,
is a reason to panic().

We assume that a delayed *error* completion is OK,
though we still will complain noisily about it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Philipp Reisner a3025a2737 drbd: Fix comparison of is_valid_transition()'s return code
is_valid_transition() might return SS_NOTHING_TO_DO.

The condition function _req_st_cond() returned SS_NOTHING_TO_DO, which
caused the wait_event to abort too early. Therefore drbd_req_state()
did not consume the next CL_ST_CHG_SUCCESS or SS_CW_FAILED_BY_PEER
causing serve disruption of the state machine logic...

Detaching from a single volue was one way to trigger this bug.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:38 +01:00
Philipp Reisner 1393b59f8c drbd: Remove duplicate code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:38 +01:00
Lars Ellenberg 70f17b6bd1 drbd: differentiate early and later "postponing" of requests
We use the RQ_POSTPONED flag to mark a request for several reasons.

It may be a conflicting request in a dual-primary setup,
where conflict detection and resolution on the peer decided that
this request needs to be re-submitted, it needs to re-enter
drbd_make_request() to fix the data divergence caused by these
conflicting, partially overlapping, quasi-simultaneous requests.

In this case we need to mark the corresponding area as out-of-sync,
before we call drbd_al_complete_io().

We also use the RQ_POSTPONED flag to just "push back" a request,
before even processing it, if IO is suspended for some reason.
In this case, as this request was neither submitted nor sent yet,
we must not touch the bitmap.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:37 +01:00
Philipp Reisner 76590cd1fc drbd: Fix postponed requests
A postponed request might has RQ_IN_ACT_LOG already set, but
is POSTPONED before it gets something in the RQ_LOCAL_MASK
set. Up to now this caused a left-over active extent.

Fix that by only testing for the RQ_IN_ACT_LOG bit in drbd_req_destroy()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:37 +01:00
Philipp Reisner 19fffd7b03 drbd: Call drbd_md_sync() explicitly after a state change on the connection
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.

If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.

  The worker might be blocked in sending something, i.e. it
  takes until it gets into its timeout. That is 6 seconds by
  default which is longer than the 5 seconds of the md_sync_timer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:08 +01:00
Philipp Reisner d76440181d drbd: Fix postponed requests
* Postponed requests should not set or clear out-of-sync marks
* When a request gets postponed we need to drop its reference
  mdev->local_cnt (put_ldev()).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:24 +01:00
Philipp Reisner 4ae98b4db3 drbd: Imporve the error reporting of failed conn state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:24 +01:00
Philipp Reisner 7970201177 drbd: Fix the way the STATE_SENT bit is cleared
With merging the commit
'drbd: Delay/reject other state changes while establishing a connection'
the condition check for clearing the flag was wrong.

Move the bit clearing to the __drbd_set_state() function
in order to have it already cleared for the other parts of
the function. I.e. clearing the susp_fen in the after_state_ch() function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:23 +01:00
Philipp Reisner 07fc96197a drbd: Do not check aspects that are not subject to change in _conn_requests_state()
When _conn_requests_state() is used to change other parts of the state
than the connection, do not check for a valid connection transition.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:23 +01:00
Philipp Reisner 892fdd1aee drbd: Improve readability of IO resuming after freeze due to no data access
The previous way of doing the state change was also okay since the
state change on the susp flag gets propagated from the mdev
to the tconn.

Fortunately all this goes away in drbd-9.0

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:22 +01:00
Philipp Reisner 88f79ec4ae drbd: Fix IO resuming after connection was established while executing the fence handler
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:22 +01:00
Lars Ellenberg b792b655cd drbd: fix potential list_add corruption
If the md_sync_timer triggers a second time,
while the work queued during the first time is still pending,
this could result in list_add() of an already added item,
and corrupt the work item list.

This likely only triggered because of the erroneous
batch-dequeueing of work items fixed with
  drbd: dequeue single work items in wait_for_work()

Still, skip queueing if md_sync_work is already queued.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Lars Ellenberg bc317a9ecd drbd: dequeue single work items in wait_for_work()
As long as we still use drbd_queue_work_front(),
we must only dequeue the single first item during normal operation.

The comment in drbd_worker() even says so,
but bc8a5a1 drbd: remove struct drbd_tl_epoch objects (barrier works)
introduced the batch dequeueing again via list_splice_init() in
wait_for_work().

Change back to list_move() of the first item, if any.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Lars Ellenberg c02abda2b2 drbd: mutex_unlock "... must no be used in interrupt context"
Documentation of mutex_unlock says
we must not use it in interrupt context.
So do not call it while holding the spin_lock_irq,
but give up the spinlock temporarily.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Philipp Reisner c1fd29a11f drbd: Fix a race condition that can lead to a BUG()
If the preconditions for a state change change after the wait_event() we
might hit the BUG() statement in conn_set_state().

With holding the spin_lock while evaluating the condition AND until the
actual state change we ensure the the preconditions can not change anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:20 +01:00
Lars Ellenberg 0ee98e2eb0 drbd: temporarily suspend io in drbd_adm_disk_opts
drbd_adm_disk_opts() does
	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
	drbd_al_shrink(mdev);

If the device is very busy, this can take a very long time to succeed.
Fix this by temporarily suspending IO,
then quickly change the settings, and resume.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:20 +01:00
Lars Ellenberg 4eb9b3cba0 drbd: don't send out P_BARRIER with stale information
We must only send P_BARRIER for epochs we actually sent P_DATA in.

If we (re-)establish a connection, we reinitialized the
send.current_epoch_nr, but forgot to reset send.current_epoch_writes.

This could result in a spurious P_BARRIER with stale epoch information,
and a disconnect/reconnect cycle once the then "unexpected"
P_BARRIER_ACK is received:
  BAD! BarrierAck #28823 received, expected #28829!

Introduce re_init_if_first_write() and maybe_send_barrier() helpers,
and call them appropriately for read/write/set-out-of-sync requests.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Lars Ellenberg 08332d7325 drbd: properly call drbd_rs_cancel_all() in drbd_disconnected()
drbd_disconnected() is supposed to clear the resync lru cache,
by calling drbd_rs_cancel_all().

We must do so before we call drbd_flush_workqueue(), as at least the
callback w_restart_disk_io() may wait for resync progres, and would
otherwise deadlock.

drbd_finish_peer_reqs() may again populate that cache, which will
then potentially be stale after the next resync handshake and bitmap
exchange, we have to do it again after that.

A stale resync lru cache causes no harm but ugly messages like this:
 BAD! sector=196608s enr=6 rs_left=-256 rs_failed=0 count=256 cstate=SyncTarget

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner 155522df5b drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner b66623e33e drbd: Avoid NetworkFailure state during disconnect
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.

In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.

At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:18 +01:00
Philipp Reisner 39a1aa7f49 drbd: Protect accesses to the uuid set with a spinlock
There is at least the worker context, the receiver context, the context of
receiving netlink packts.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:04 +01:00
Philipp Reisner fef45d297e drbd: Write all pages of the bitmap after an online resize
We need to write the whole bitmap after we moved the meta data
due to an online resize operation.

With the support for one peta byte devices bitmap IO was optimized
to only write out touched pages. This optimization must be turned
off when writing the bitmap after an online resize.

This issue was introduced with drbd-8.3.10.

The impact of this bug is that after an online resize, the next
resync could become larger than expected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:51 +01:00
Philipp Reisner 5af2e8ce2b drbd: Fix completion of requests while the device is suspended
In various places (E.g. CONNECTION_LOST_WHILE_PENDING) the
RQ_COMPLETION_SUSP mask is passed in the clear set to mod_rq_state().

The issue was that it tried to clear the RQ_COMPLETION_SUSP bit
out of the state mask first, and eventuelly set it afterwards,
in the drbd_req_put_completion_ref() function.

Fixed that by moving the reference getting out of
drbd_req_put_completion_ref() into the mod_rq_state(), before the place
where the extra reference might be put.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Andreas Gruenbacher 715306f69d drbd: Don't unregister socket state_change callback from within the callback
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Lars Ellenberg eb12010e9a drbd: disambiguation, s/ERR_DISCARD/ERR_DISCARD_IMPOSSIBLE/
If for some reason (typically "split-brained" cluster manager)
drbd replica data has diverged, we can chose a victim,
and reconnect using "--discard-my-data", causing the victim
to become sync-target, fetching all changed blocks from the peer.

If we are Primary, we are potentially in use, and we refuse to
"roll back" changes to the data below the page cache and other users.

Rename the error symbol for this to ERR_DISCARD_IMPOSSIBLE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Lars Ellenberg 427c0434fc drbd: disambiguation, s/DISCARD_CONCURRENT/RESOLVE_CONFLICTS/
We don't discard anything here, really.
We resolve conflicting, concurrent writes to overlapping data blocks.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg d4dabbe22d drbd: disambiguation, s/P_DISCARD_WRITE/P_SUPERSEDED/
To avoid confusion with REQ_DISCARD aka TRIM, rename our
"discard concurrent write acks" from P_DISCARD_WRITE to P_SUPERSEDED.

At the same time, rename the drbd request event DISCARD_WRITE
to CONFLICT_RESOLVED. It already triggers both successful completion
or restart of the request, depending on our RQ_POSTPONED flag.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg 232fd3f4a0 drbd: cleanup, drop unused struct
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:48 +01:00
Lars Ellenberg 46e21bbadb drbd: NEG_ACK does not imply a barrier-ack
Don't drop a request from the transfer log just because it was NEG_ACKED.
We need it around to be able to verify P_BARRIER_ACKs against the
transver log.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:48 +01:00
Lars Ellenberg 99b4d8fe6d drbd: only start a new epoch, if the current epoch contains writes
Almost all code paths calling start_new_tl_epoch() guarded it with
	if (... current_tle_writes > 0 ... ).
Just move that inside start_new_tl_epoch().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Philipp Reisner 8a0bab2a6d drbd: Finish requests that completed while IO was frozen
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.

When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.

Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.

I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Lars Ellenberg e959d08d3e drbd: Fix a potential issue with the DISCARD_CONCURRENT flag
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Lars Ellenberg 519b6d3eac drbd: fix drbd wire compatibility for empty flushes
DRBD has a concept of request epochs or reorder-domains,
which are separated on the wire by P_BARRIER packets.

Older DRBD is not able to handle zero-sized requests at all,
so we need to map empty flushes to these drbd barriers.

These are the equivalent of empty flushes, and
by default trigger flushes on the receiving side anyways
(unless not supported or explicitly disabled),
so there is no need to handle this differently in newer drbd either.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:46 +01:00
Philipp Reisner 80c6eed49d drbd: More random to the connect logic
Since the listening socket is open all the time, it was possible to
get into stable "initial packet S crossed" loops.

* when both sides realize in the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with repeated "packet S crossed"
  messages.

* when both sides do not realize with the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with alternating "packet S crossed"
  "packet M crossed" messages.

In order to break out these stable loops randomize the behaviour if
such a crossing of P_INITIAL_DATA or P_INITIAL_META packets is detected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:46 +01:00
Philipp Reisner 92f14951c0 drbd: Try to connec to peer only once per cycle
Since now our listening socket is open all the time we will get
connection tries of the peer always in. No need to try it three
times.

This is valid when connecting to older peers as well, it simply
increases the probability that the new version DRBD will accept
a connection instead that it will establish one.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Philipp Reisner b666dbf819 drbd: Remove redundant and wrong test for NULL simplification in conn_connect()
Since the drbd_socket_okay() function itself tests if the the
socket is NULL, the explicit test "if (sock.socket && &msock.socket)"
was redundent.
Apart from that the address opperator ('&') before msock.socket rendered
the test pointless.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Philipp Marek 3174f8c504 drbd: pass some more information to userspace.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Lars Ellenberg 81a3537a97 drbd: announce FLUSH/FUA capability to upper layers
In 8.4, we may have bios spanning two activity log extents.
Fixup drbd_al_begin_io() and drbd_al_complete_io() to deal with zero sized bios.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:44 +01:00
Lars Ellenberg 58ffa580a7 drbd: introduce stop-sector to online verify
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.

Had to bump the protocol version differently, we are now 101.
Added verify_can_do_stop_sector() { protocol >= 97 && protocol != 100; }

Also, the return value convention for worker callbacks has changed,
we returned "true/false" for "keep the connection up" in 8.3,
we return 0 for success and <= for failure in 8.4.
Affected: receive_state()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:32 +01:00
Lars Ellenberg 970fbde1f1 drbd: flush drbd work queue before invalidate/invalidate remote
If you do back to back wait-sync/invalidate on a Primary in a tight loop,
during application IO load, you could trigger a race:
  kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync'
    but 'write from resync_finished' still pending?

Fix this by changing the order of the drbd_queue_work() and
the wake_up() in dec_ap_pending(), and adding the additional
drbd_flush_workqueue() before requesting the full sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:41 +01:00
Lars Ellenberg 6f1a656325 drbd: call local-io-error handler early
In case we want to hard-reset from the local-io-error handler,
we need to call it before notifying the peer or aborting local IO.
Otherwise the peer will advance its data generation UUIDs even
if secondary.

This way, local io error looks like a "regular" node crash,
which reduces the number of different failure cases.
This may be useful in a bigger picture where crashed or otherwise
"misbehaving" nodes are automatically re-deployed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:40 +01:00
Lars Ellenberg a324896b17 drbd: do not reset rs_pending_cnt too early
Fix asserts like
  block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !

We reset the resync lru cache and related information (rs_pending_cnt),
once we successfully finished a resync or online verify, or if the
replication connection is lost.

We also need to reset it if a resync or online verify is aborted
because a lower level disk failed.

In that case the replication link is still established,
and we may still have packets queued in the network buffers
which want to touch rs_pending_cnt.

We do not have any synchronization mechanism to know for sure when all
such pending resync related packets have been drained.

To avoid this counter to go negative (and violate the ASSERT that it
will always be >= 0), just do not reset it when we lose a disk.

It is good enough to make sure it is re-initialized before the next
resync can start: reset it when we re-attach a disk.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:40 +01:00
Lars Ellenberg 8a94317071 drbd: reset congestion information before reporting it in /proc/drbd
We cache the congestion status in mdev->congestion_reason whenever
drbd_congested() was called.
Reset this cached info before reporting it when reading /proc/drbd.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg 6f3465ed82 drbd: report congestion if we are waiting for some userland callback
If the drbd worker thread is synchronously waiting for some userland
callback, we don't want some casual pageout to block on us.
Have drbd_congested() report congestion in that case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg 0c84966601 drbd: differentiate between normal and forced detach
Aborting local requests (not waiting for completion from the lower level
disk) is dangerous: if the master bio has been completed to upper
layers, data pages may be re-used for other things already.
If local IO is still pending and later completes,
this may cause crashes or corrupt unrelated data.

Only abort local IO if explicitly requested.
Intended use case is a lower level device that turned into a tarpit,
not completing io requests, not even doing error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg bf709c8552 drbd: cleanup, remove two unused global flags
The two unused "global flags" in 8.3 are "per volume" flags in 8.4.
Still, they are unused, so lose them.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:38 +01:00
Lars Ellenberg 3b9ef85e05 drbd: fix null pointer dereference with on-congestion policy when diskless
We must not look at mdev->actlog, unless we have a get_ldev() reference.
It also does not make much sense to try to disconnect or pull-ahead of
the peer, if we don't have good local data.

Only even consider congestion policies, if our local disk is D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:38 +01:00
Lars Ellenberg 27012382bc drbd: take error path in drbd_adm_down if interrupted by signal
drbd_adm_down() does adm_detach(), which can fail with various error
codes, or be interrupted by a signal.

The interrupted by signal case was not properly handled,
leading to
	block drbd0: ASSERT( mdev->state.disk == D_DISKLESS &&
	                     mdev->state.conn == C_STANDALONE ) in drbd/drbd_worker.c
and further to destroying objects while still in use, and resulting crashes.

Detect the interruption, and take the error path out.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg 9a278a7906 drbd: allow read requests to be retried after force-detach
Sometimes, a lower level block device turns into a tar-pit,
not completing requests at all, not even doing error completion.

We can force-detach from such a tar-pit block device,
either by disk-timeout, or by drbdadm detach --force.

Queueing for retry only from the request destruction path (kref hit 0)
makes it impossible to retry affected read requests from the peer,
until the local IO completion happened, as the locally submitted
bio holds a reference on the drbd request object.

If we can only complete READs when the local completion finally
happens, we would not need to force-detach in the first place.

Instead, queue for retry where we otherwise had done the error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg 934722a2db drbd: __req_mod: make DISCARD_WRITE and independend case
cherry-picked and adapted from drbd 9 devel branch

This looks cleaner to me,
and also gets rid of the other ugly if-inside-case-fall-through.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg a0d856dfae drbd: base completion and destruction of requests on ref counts
cherry-picked and adapted from drbd 9 devel branch

The logic for when to get or put a reference is in mod_rq_state().

To not get confused in the freeze/thaw respectively resend/restart
paths, or when cleaning up requests waiting for P_BARRIER_ACK, this
also introduces additional state flags:
RQ_COMPLETION_SUSP, and RQ_EXP_BARR_ACK.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:36 +01:00
Lars Ellenberg b406777e64 drbd: introduce completion_ref and kref to struct drbd_request
cherry-picked and adapted from drbd 9 devel branch

completion_ref will count pending events necessary for completion.
kref is for destruction.

This only introduces these new members of struct drbd_request,
a followup patch will make actual use of them.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:36 +01:00
Lars Ellenberg 5df69ece6e drbd: __drbd_make_request() is now void
The previous commit causes __drbd_make_request() to always return 0.
Change it to void.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg 5da9c83644 drbd: better separate WRITE and READ code paths in drbd_make_request
cherry-picked and adapted from drbd 9 devel branch

READs will be interesting to at most one connection,
WRITEs should be interesting for all established connections.

Introduce some helper functions to hopefully make this easier to follow.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg b6dd1a8976 drbd: remove struct drbd_tl_epoch objects (barrier works)
cherry-picked and adapted from drbd 9 devel branch

DRBD requests (struct drbd_request) are already on the per resource
transfer log list, and carry their epoch number. We do not need to
additionally link them on other ring lists in other structs.

The drbd sender thread can recognize itself when to send a P_BARRIER,
by tracking the currently processed epoch, and how many writes
have been processed for that epoch.

If the epoch of the request to be processed does not match the currently
processed epoch, any writes have been processed in it, a P_BARRIER for
this last processed epoch is send out first.
The new epoch then becomes the currently processed epoch.

To not get stuck in drbd_al_begin_io() waiting for P_BARRIER_ACK,
the sender thread also needs to handle the case when the current
epoch was closed already, but no new requests are queued yet,
and send out P_BARRIER as soon as possible.

This is done by comparing the per resource "current transfer log epoch"
(tconn->current_tle_nr) with the per connection "currently processed
epoch number" (tconn->send.current_epoch_nr), while waiting for
new requests to be processed in wait_for_work().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg d5b27b01f1 drbd: move the drbd_work_queue from drbd_socket to drbd_connection
cherry-picked and adapted from drbd 9 devel branch
In 8.4, we don't distinguish between "resource work" and "connection
work" yet, we have one worker for both, as we still have only one connection.

We only ever used the "data.work",
no need to keep the "meta.work" around.

Move tconn->data.work to tconn->sender_work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:34 +01:00
Lars Ellenberg 8c0785a5c9 drbd: allow to dequeue batches of work at a time
cherry-picked and adapted from drbd 9 devel branch

In 8.4, we still use drbd_queue_work_front(),
so in normal operation, we can not dequeue batches,
but only single items.

Still, followup commits will wake the worker
without explicitly queueing a work item,
so up() is replaced by a simple wake_up().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:34 +01:00
Lars Ellenberg b379c41ed7 drbd: transfer log epoch numbers are now per resource
cherry-picked from drbd 9 devel branch.

In preparation of multiple connections, the "barrier number" or
"epoch number" needs to be tracked per-resource, not per connection.
The sequence number space will not be reset anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg 9d05e7c4e7 drbd: rename drbd_restart_write to drbd_restart_request
Meanwhile, this is used to restart failed READ requests as well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg 629663c942 drbd: fix wrong assert in completion/retry path of failed local reads
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg ab53b90e89 drbd: fix local read error hung forever
The commit
    drbd: simplify retry path of failed READ requests
simplified it too much:
it just did not do anything for local read errors.

Add the missing req_may_be_completed_not_susp() to the
READ_COMPLETED_WITH_ERROR case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:32 +01:00
Lars Ellenberg 1b6f19740d drbd: fix access of unallocated pages and kernel panic
BUG: unable to handle kernel NULL pointer dereference at (null)
...
 [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
 [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]

This fixes an off-by-one error in the receive_bitmap() path,
if run-length encoded bitmap transfer is enabled.

If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
size), and bitmap compression (use-rle) is enabled (which became default
with 8.4), and the very last bit is dirty and reported in an rle
comressed bitmap packet, we ended up trying to kmap_atomic a page pointer
that does not exist (bitmap->bm_pages[last index + 1]).

bug introduced by:
    Date:   Fri Jul 24 15:33:24 2009 +0200
    set bits: optimize for complete last word, fix off-by-one-word corner case

made effective by:
    Date:   Thu Dec 16 00:32:38 2010 +0100
    drbd: get rid of unused debug code

    Long time ago, we had paranoia code in the bitmap that allocated one
    extra word, assigned a magic value, and checked on every occasion that
    the magic value was still unchanged.

    That debug code is unused, the extra long word complicates code a bit.
    Get rid of it.

No-one triggered this bug in the last few years, because a large subset
of our userbase is unaffected:
 * typically the last few blocks of a device are not modified
   frequently, and remain unset
 * use-rle was disabled by default in drbd < 8.4
 * those with slightly "odd" device sizes, or
 * drbd internal meta data (which will skew the device size slightly,
   thus makes it harder to have a bug relevant device size)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:32 +01:00
Philipp Reisner 7a426fd8d5 drbd: Keep the listening socket open while trying to connect to the peer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner 1f3e509b76 drbd: pull prepare_listen_socket() out of drbd_wait_for_connect()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner 9a51ab1c1b drbd: New disk option al-updates
By disabling al-updates one might increase performace. The price for
that is that in case a crashed primary (that had al-updates disabled)
is reintegraded, it will receive a full-resync instead of a bitmap
based resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Andreas Gruenbacher 26ec92871b drbd: Stop using NLA_PUT*().
These macros no longer exist in kernel version v3.5-rc1.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:30 +01:00
Philipp Reisner 7e0f096b8d drbd: Remove drbd_accept() and use kernel_accept() instead
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:30 +01:00
Philipp Reisner 2820fd3969 drbd: Move the call to listen() out of drbd_accept()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Philipp Reisner c5b005ab70 drbd: use bitmap_parse instead of __bitmap_parse
The buffer 'sc.cpu_mask' is a kernel buffer.  If bitmap_parse is used
instead of __bitmap_parse the extra parameter that indicates a kernel
buffer is not needed.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Lars Ellenberg 1882e22df7 drbd: grammar fix in log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Lars Ellenberg f66ee69746 drbd: bm_page_async_io: properly initialize page->private
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
 ...).

Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:28 +01:00
Lars Ellenberg a220d29180 drbd: allow bitmap to change during writeout from resync_finished
Symptom: messages similar to
 "FIXME asender in bm_change_bits_to,
  bitmap locked for 'write from resync_finished' by worker"

If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered.  If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.

To fix this, introduce the drbd_bm_write_copy_pages() variant.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:28 +01:00
Lars Ellenberg 5016b82a49 drbd: fix race between drbdadm invalidate/verify and finishing resync
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.

If *in that very moment* a new verify or resync is triggered,
this can race:
 ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
 FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.

This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.

Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Lars Ellenberg 07be15b12c drbd: fix resend/resubmit of frozen IO
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).

Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".

Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit

  drbd: Implemented real timeout checking for request processing time

if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.

This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Philipp Reisner 3ea35df83f drbd: fix spelling, remove boring development log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Philipp Reisner e4bad1bcac drbd: Ensure that data_size is not 0 before using data_size-1 as index
This could be exploited by a peer which runs modified code.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner a1096a6e9d drbd: Delay/reject other state changes while establishing a connection
Changes to the role and disk state should be delayed or rejected
while we establish a connection.

This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.

The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner 27eb13e99b drbd: Fixed processing of disk-barrier, disk-flushes and disk-drain
Since drbd_bump_write_ordering() is called in the attaching
process while the disk state is D_ATTACHING, it was not
considering these three flags during attach.

A call to this function was missing form drbd_adm_disk_opts().

Fixed both issues.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 9ed57dcbda drbd: ignore volume number for drbd barrier packet exchange
Transfer log epochs, and therefore P_BARRIER packets,
are per resource, not per volume.
We must not associate them with "some random volume".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 648e46b531 drbd: complete_conflicting_writes() should not care about connections
complete_conflicting_writes() should not cause -EIO.
It should not timeout either, or care for connection states.

Connection timeout is detected elsewhere, and it's cleanup path is
supposed to remove any pending requests or peer_requests from the
write_requests tree.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 4439c400ab drbd: simplify retry path of failed READ requests
If a local or remote READ request fails, just push it back to the retry
workqueue.  It will re-enter __drbd_make_request, and be re-assigned to
a suitable local or remote path, or failed, if we do not have access to
good data anymore.

This obsoletes w_read_retry_remote(),
and eliminates two goto...retry blocks in __req_mod()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:24 +01:00
Lars Ellenberg 2415308eb9 drbd: move put_ldev from __req_mod() to the endio callback
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:24 +01:00
Lars Ellenberg 6870ca6d46 drbd: factor out master_bio completion and drbd_request destruction paths
In preparation for multiple connections and reference counting,
separate the code paths for completion of the master bio
and destruction of the request object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg 8d6cdd7848 drbd: conflicting writes: make wake_up of waiting peer_requests explicit
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg 0afd569a40 drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg ea9d6729bd drbd: fix READ_RETRY_REMOTE_CANCELED to not complete if device is suspended
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:22 +01:00
Lars Ellenberg 27a434fe40 drbd: make OOS_HANDED_TO_NETWORK its own case
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:22 +01:00
Lars Ellenberg 2312f0b3c5 drbd: fix potential deadlock during "restart" of conflicting writes
w_restart_write(), run from worker context, calls __drbd_make_request()
and further drbd_al_begin_io(, delegate=true), which then
potentially deadlocks.  The previous patch moved a BUG_ON to expose
such call paths, which would now be triggered.

Also, if we call __drbd_make_request() from resource worker context,
like w_restart_write() did, and that should block for whatever reason
(!drbd_state_is_stable(), resource suspended, ...),
we potentially deadlock the whole resource, as the worker
is needed for state changes and other things.

Create a dedicated retry workqueue for this instead.

Also make sure that inc_ap_bio()/dec_ap_bio() are properly paired,
even if do_retry() needs to retry itself,
in case __drbd_make_request() returns != 0.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg f9916d61a4 drbd: don't pretend that barrier_nr == 0 was special
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg 0642d5f8e0 drbd: remove unused static helper function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg e44d71f36c drbd: remove some very outdated comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:20 +01:00
Lars Ellenberg 9dab3842b5 drbd: fix memleak in error path in bm_rw and drbd_bm_write_range
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:20 +01:00
Lars Ellenberg a6a7d4f0c1 drbd: missing wakeup after drbd_rs_del_all
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg 5cdb0bf322 drbd: remove now unused seq_num member from struct drbd_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg 4b8514ee28 drbd: fix potential data corruption and protocol error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg e8744f5aca drbd: Fixed detach
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:18 +01:00
Lars Ellenberg d93f63028f drbd: Fix a potential write ordering issue on SyncTarget nodes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:18 +01:00
Lars Ellenberg 81f448629a drbd: Fix a potential race that could case data inconsistency
When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE
at the same time, and it happens that the line

    remote = remote && drbd_should_do_remote(s);

stills sees C_WF_BITMAP_S, and

     send_oos = rw == WRITE && drbd_should_send_oos(s);

already sees C_SYNC_SOURCE both are 0.

This causes the write to not be mirrored, but marked as out-of-sync on the
Sync_Source node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner 38a05c16b8 drbd: Consider that bio->bi_bdev might be modified below DRBD
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner 72585d2428 drbd: add missing part_round_stats to _drbd_start_io_acct
Without this, iostat frequently sees bogus svctime and >= 100% "utilization".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner dd9b360475 drbd: Fix module refcount leak in drbd_accept()
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.

Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()

Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:16 +01:00
Philipp Reisner 93f5afe956 drbd: If disk timeout expires fail only the affected volume
...and not all volumes of the resource

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:16 +01:00
Philipp Reisner 32db80f6f6 drbd: Consider the disk-timeout also for meta-data IO operations
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().

This patch adds the timeout to all drbd_md_sync_page_io().

Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Philipp Reisner 25b0d6c8c1 drbd: Reinstate disabling AL updates with invalidate-remote
Commit d0ef827e (drbd: switch configuration interface from connector to
genetlink) introduced a regression by removing the ability to set all
bits in the out of sync bitmap and to suspend updates to the activity log
of a disconnected device via the invalidate-remote management call.

Credits for reporting the issue are going to Arne Redlich.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Lars Ellenberg b17f33cb0a drbd: explicitly clear unused dp_flags in drbd_send_block
We send left-over garbage from the previous packet in P_DATA_REPLY and
P_RS_DATA_REPLY packets. That's bad behaviour.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Philipp Reisner 4d0fc3fdc3 drbd: Fixed compat issue with disconnecting 8.4 from a primary 8.3
For compatibility reasons 8.4 has to send P_STATE_CHG_REQ (instead
of P_CONN_ST_CHG_REQ) when disconnecting.

In the receiving code path we missed to convert the old
answer (P_STATE_CHG_REPLY) back to 8.4 logic. Therefore
the CL_ST_CHG_SUCCESS or CL_ST_CHG_FAIL bit in the flags word
of mdev got set, while the state code was waiting for
the CONN_WD_ST_CHG_OKAY or CONN_WD_ST_CHG_FAIL bits in tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:14 +01:00
Andreas Gruenbacher 1a3cde4406 drbd: drbd_bm_ALe_set_all(): Remove unused function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:14 +01:00
Philipp Reisner 69b6a3b159 drbd: restart loop in drbd_make_request() [prepare for Linux-3.2]
With Linux-3.2 generic_make_request() will no longer loop over
the request function until it finally returns 0. Move this
loop into our drbd_make_request() function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner 7da358625c drbd: Restore late assigning of tconn->data.sock and meta.sock
With commit from Mon Mar 28 16:33:12 2011 +0200
"drbd: drbd_connect(): Initialize struct drbd_socket before sending anything"

tconn->data.sock and tconn->meta.sock get assigned early, in
conn_connect.

The early assigning can trigger an OOPS, because it may released the socket
without acquiring the mutex protecting the socket. An other thread (worker)
might use setsockopt() on the socket while it gets free()ed.

Restored the (proven) 8.3 behavior of assigning these sockets after the two
connections are established.

Credits for reporting the issue are going to Arne Redlich.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner a01842ebee drbd: Log failures of connection state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner e8cdc34335 drbd: Consider that read requests could be NEG_ACKEDed
ap_in_flight only counts writes. NEG_ACKED is an action
on a request that might be called for reads and writes.

This bug was there forever, but it becomes much more
relevant with the read balincing code.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner 6ab9b1b60b drbd: Do not send state packets while lower than C_CONNECTED cstate
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.

Actually triggered by the Testuite on kugel.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner b8853dbd8c drbd: fix race between disconnect and receive_state
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.

The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.

Fixed in several code paths:
  Make sure the connection state is NetworkFailure or worse
  before starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

  Disallow receive_state() to "upgrade" the connection state
  from an error state. This will make sure the "illegal" state
  transition won't happen.

  For all connection failure states,
  relax the safe-guard in sanitize_state() again
  to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

 Note by Philipp Reisner:
  The 3rd chunk described as "relax the safe-guard..."
  is not there in 8.4 as it is relaxed to the maximum in
  8.4 already

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner 57bcb6cf1d drbd: Do not call generic_make_request() while holding req_lock
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:11 +01:00
Philipp Reisner d60de03a66 drbd: Load balancing method: striping
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:11 +01:00
Philipp Reisner 380207d08e drbd: Load balancing of read requests
New config option for the disk secition "read-balancing", with
the values: prefer-local, prefer-remote, round-robin, when-congested-remote.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Philipp Reisner d10b4ea32b drbd: Get rid of "ASSERTION FAILED: tconn->current_epoch->list not empty"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Lars Ellenberg 615e087fbd drbd: add missing rcu locks around recently introduced idr_for_each
Recent commit
 drbd: Move write_ordering from mdev to tconn
introduced a new idr_for_each loop over all volumes,
but did not take necessary rcu locks or krefs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Andreas Gruenbacher 03d63e1d1e drbd: Remove leftover prototype
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:09 +01:00
Philipp Reisner 975b297947 drbd: fix potential spinlock deadlock
drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks
left to be resynced (rs_left) in the current resync extent.
If it detects a mismatch, it complains, and forces a disconnect using
drbd_force_state(mdev, NS(conn, C_DISCONNECTING));

Unfortunately, this may be called while holding the req_lock,
and drbd_force_state() want's to aquire that lock itself. Deadlock.

Don't force a disconnect, but fix up rs_left by recounting and
reassigning the number of dirty blocks in that extent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:09 +01:00
Philipp Reisner 77fede5137 drbd: Fix the WO=drain implementation for multiple volumes
Wait until IO is drained in all volumes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 1e9dd2912e drbd: Switch drbd_may_finish_epoch() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 12038a3a71 drbd: Move list of epochs from mdev to tconn
This is necessary since the transfer_log on the sending is also
per tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 1d2783d532 drbd: Prepare epochs per connection
An epoch object needs a pointer to the mdev it was received for.
This is necessary to be able to send the barrier ack packet for
the same volume as the original barrier packet was assigned to.

This prepares the next step, in which the (receiver side)
epoch list is moved from the device (mdev) to the connection (tconn)
object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner 4b0007c0e8 drbd: Move write_ordering from mdev to tconn
This is necessary in order to prepare the move of the (receiver side)
epoch list from the device (mdev) to the connection (tconn) objects.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner 6936fcb49a drbd: Move the CREATE_BARRIER flag from connection to device
That is necessary since the whole transfer log is per connection(tconn)
and not per device(mdev).

This bug caused list corruption on the worker list. When a barrier is queued
for sending in the context of one device, another device did not see the
CREATE_BARRIER bit, and queued the same object again -> list corruption.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 36baf6117b drbd: Fixed an obvious copy-n-paste mistake
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 43de7c852b drbd: Fixes from the drbd-8.3 branch
* drbd-8.3:
  drbd: O_SYNC gives EIO on ramdisks for some kernels (eg. RHEL6).
  drbd: send intermediate state change results to the peer

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 0cfac5dd90 drbd: Fixes from the drbd-8.3 branch
* drbd-8.3:
  drbd: fix spurious meta data IO "error"
  drbd: Fixed a race condition between detach and start of resync
  drbd: fix harmless race to not trigger an ASSERT
  drbd: Derive sync-UUIDs only from the bitmap-uuid if it is non-zero
  drbd: Fixed current UUID generation (regression introduced recently, after 8.3.11)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:05 +01:00
Philipp Reisner 376694a054 drbd: Silenced compiler warnings
Since version 4.6.1 gcc warns about variables that get
a value assigned, but which are never read later on.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:05 +01:00
Philipp Reisner 9bcd252182 drbd: fix "stalled" empty resync
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.

Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Lars Ellenberg 22d81140ae drbd: fix bitmap writeout after aborted resync
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Philipp Reisner 08b165ba11 drbd: Consider the discard-my-data flag for all volumes [bugz 359]
...not only for the first volume

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Andreas Gruenbacher 935be260c1 drbd: Improve error reporting in drbd_md_sync_page_io()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:03 +01:00
Lars Ellenberg 25e409321a drbd: fix connect failure with all default net-options
If no net-options are configured (all on their default),
no DRBD_NLA_NET_CONF will be passed to the kernel.
The kernel must not require its presence,
there is no required option in there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:03 +01:00
Andreas Gruenbacher a209b4aec3 drbd: Update some outdated comments to match the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner c4e7afdc01 drbd: Remove unused code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner 4276dea70c drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner 85d735138a drbd: Cleanup all epoch objects upon connection loss
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:01 +01:00
Philipp Reisner f132f554ce drbd: Do not display bogus log lines for pdsk in case pdsk < D_UNKNOWN
This was a regression recently introduced with commit
7848ddb752c09b6dfd1ddfabb06b69b08aa8f6b9
"drbd: Correctly handle resources without volumes"

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Lars Ellenberg 97ddb68790 drbd: detach must not try to abort non-local requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Andreas Gruenbacher f497609e4c drbd: Get rid of MR_{READ,WRITE}_SHIFT
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Philipp Reisner 823bd832a6 drbd: Bugfix for the connection behavior
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting.  But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING.  The
restart of drbd_init() then resets thi->t_state to RUNNING.

I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.

The two parts of the fix:
  * Do not cause the thread to restart if we detect the issue
    with the sockets while we are in C_WF_CONNECTION.

  * Make sure that all actions that would have set us to C_BROKEN_PIPE
    happen before the state change to C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher 7d4c782cbd drbd: Fix the data-integrity-alg setting
The last data-integrity-alg fix made data integrity checking work when the
algorithm was changed for an established connection, but the common case of
configuring the algorithm before connecting was still broken.  Fix that.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher 71fc7eedb3 drbd: Turn tl_apply() into tl_abort_disk_io()
There is no need to overly generalize this function; it only makes the code
harder to understand.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 1b7ab15b11 drbd: Fixed w_restart_disk_io() to handle non active AL-extents
Since we now apply the AL in user space onto the bitmap, the AL
is not active for the requests we want to reply.

For that a al_write_transaction() that might be called from
worker context became necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 9b743da96c drbd: Missing assignment of mdev before drbd_queue_work()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 3b03ad5929 drbd: Do not mod_timer() with a past time
In case we can not find out why the request takes too long
(happens e.g. when IO got suspended on DRBD level). rearm
the timer with a reasonable value.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:57 +01:00
Philipp Reisner 3fb4746d8d drbd: Consider that the no-data-condition could be in connected state
...when the peer has inconsistent data. In that case we failed to
clear the susp_nod flag. When the local disk was attached again

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:57 +01:00
Philipp Reisner 97af09d51e drbd: Dropped wrong clause to generate new current UUIDs
Looks like a remainder from long ago.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher accdbcc5f9 drbd: receive_protocol(): We cannot change our own data-integrity-alg setting here
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher d505d9bef2 drbd: Be consistent in reporting incompatibilities in P_PROTOCOL settings
Refer to the settings by the names which drbdsetup and drbd.conf are using.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher fbc12f4514 drbd: receive_protocol(): Make the program flow less confusing
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher b792c35cfb drbd: receive_protocol(): Give variables more easily searchable names
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher 5af172ed9e drbd: Print memory address in hex instead of decimal in error message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher f2257a56ee drbd: Allow to create devices with a minor number > minor_count
The minor_count module/kernel parameter serves to scale the size of drbd's
internal memory pool, but it is no longer a limit for the number of minors or
the minor number.  (Minor numbers can be arbitrarily high within the allowed
limit of 2^20.)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:54 +01:00
Lars Ellenberg 367d675da8 drbd: report net config even for resources without a single volume
Currently it is legal (though unusual) to create and connect a resource,
before adding in all necessary volumes. We should include the network
configuration details, even if we don't have a single volume (yet).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:53 +01:00
Philipp Reisner e0e1665381 drbd: Correctly handle resources without volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:52 +01:00
Philipp Reisner a67e1d9e8c drbd: Eliminated the "notified peer" messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:52 +01:00
Philipp Reisner 369bea6371 drbd: Fixed removal of volumes/devices from connected resources
When removing a volume/device we need to switch the connection
status of the peer back into WFReportParams.

  Before this fix it was left in Connected state. That means that
  the peer device continued to inform us about state changes, etc...
  But we deleted that minor -> protocol error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:51 +01:00
Lars Ellenberg d5d7ebd422 drbd: on attach, enforce clean meta data
Detection of unclean shutdown has moved into user space.

The kernel code will, whenever it updates the meta data, mark it as
"unclean", and will refuse to attach to such unclean meta data.

"drbdadm up" now schedules "drbdmeta apply-al", which will apply
the activity log to the bitmap, and/or reinitialize it, if necessary,
as well as set a "clean" indicator flag.

This moves a bit code out of kernel space.
As a side effect, it also prevents some 8.3 module from accidentally
ignoring the 8.4 style activity log, if someone should downgrade,
whether on purpose, or accidentally because he changed kernel versions
without providing an 8.4 for the new kernel, and the new kernel comes
with in-tree 8.3.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:51 +01:00
Philipp Reisner cdfda633d2 drbd: detach from frozen backing device
* drbd-8.3:
  documentation: Documented detach's --force and disk's --disk-timeout
  drbd: Implemented the disk-timeout option
  drbd: Force flag for the detach operation
  drbd: Allow new IOs while the local disk in in FAILED state
  drbd: Bitmap IO functions can not return prematurely if the disk breaks
  drbd: Added a kref to bm_aio_ctx
  drbd: Hold a reference to ldev while doing meta-data IO
  drbd: Keep a reference to the bio until the completion handler finished
  drbd: Implemented wait_until_done_or_disk_failure()
  drbd: Replaced md_io_mutex by an atomic: md_io_in_use
  drbd: moved md_io into mdev
  drbd: Immediately allow completion of IOs, that wait for IO completions on a failed disk
  drbd: Keep a reference to barrier acked requests

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Andreas Gruenbacher 2fcb8f307f drbd: Improve the "unexpected packet" error messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Philipp Reisner 9510b2411d drbd: Fixed state transitions in case reading meta data failes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Philipp Reisner 2ffca4f3ee drbd: Improve compatibility with drbd's older than 8.3.7
Regression introduced with 8.3.11 commit:
drbd: Take a more conservative approach when deciding max_bio_size

Never ever tell an older drbd, that we support more than 32KiB
in a single data request (packet).
Never believe an older drbd, that is supports more than 32KiB
in a single data request (packet)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:49 +01:00
Philipp Reisner d942ae4453 drbd: Fixes from the 8.3 development branch
* commit 'ae57a0a':
   drbd: Only print sanitize state's warnings, if the state change happens
   drbd: we should write meta data updates with FLUSH FUA
   drbd: fix limit define, we support 1 PiByte now
   drbd: fix log message argument order
   drbd: Typo in user-visible message.
   drbd: Make "(rcv|snd)buf-size" and "ping-timeout" available for the proxy, too.
   drbd: Allow keywords to be used in multiple config sections.
   drbd: fix typos in comments.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:49 +01:00
Lars Ellenberg 4dbdae3ec9 drbd: downgraded error printk to info
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:48 +01:00
David Howells 3b7cd457d0 DRBD: Fix comparison always false warning due to long/long long compare
Fix warnings of the following nature in the drbd header:

In file included from drivers/block/drbd/drbd_bitmap.c:32:
drivers/block/drbd/drbd_int.h: In function 'drbd_get_syncer_progress':
drivers/block/drbd/drbd_int.h:2234: warning: comparison is always false due to limited range of data

where mdev->rs_total (an unsigned long) is being compared to 1ULL << 32, which
is always false on a 32-bit machine.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:48 +01:00
Andreas Gruenbacher 6dff290220 drbd: Rename --dry-run to --tentative
drbdadm already has a --dry-run option, so this option cannot directly be
passed through to drbdsetup.  Rename the drbdsetup option to resolve this
conflict.

For backward compatibility, make --dry-run an alias of --tentative.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:47 +01:00
Andreas Gruenbacher d0fa7fd680 drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:47 +01:00
Andreas Gruenbacher afbbfa88bc drbd: Allow to pass resource options to the new-resource command
This is equivalent to how the attach and connect commands work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 089c075d88 drbd: Convert the generic netlink interface to accept connection endpoints
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 44e52cfaa2 drbd: Rename DRBD_ADM_NEED_{CONN -> RESOURCE}
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 01b39b50d3 drbd: Split off netlink mandatory attribute handling into separate file
Duplicate this file in the kernel module and in user space; both sides need it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:45 +01:00
Andreas Gruenbacher 7c3063cc6f drbd: Also need to check for DRBD_GENLA_F_MANDATORY flags before nla_find_nested()
This is done by introducing drbd_nla_find_nested() which handles the flag
before calling nla_find_nested().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:45 +01:00
Andreas Gruenbacher 789c1b626c drbd: Use the terminology suggested by the command names in the source code and messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:44 +01:00
Lars Ellenberg 67b58bf723 drbd: spelling fix: too small
It is not "to small", but "too small".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:44 +01:00
Lars Ellenberg fc251d5c24 drbd: cosmetic: fix accidental division instead of modulo when pretty printing
For large resync rates, seq_printf_with_thousands_grouping()
accidentally only produced Y,000,00Y, instead of the real numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Andreas Gruenbacher 46530e859c drbd: Use DRBD_MINOR_COUNT_DEF in one more place
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Philipp Reisner a67b813cfa drbd: Lower log priority for an event that is definitely not an error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Andreas Gruenbacher c75b9b10e7 drbd: Don't use empty nested netlink attributes
Before mainline commit ea5693cc (v2.6.29-rc1), empty nested netlink attributes
were not allowed.  Fix that by leaving out nested attributes if they are empty
and by allowing the top-level attributes to be missing.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:40 +01:00
Andreas Gruenbacher 1e2a2551ee drbd: drbd_adm_prepare(): Pass through error codes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:56 +01:00
Philipp Reisner d659f2aaea drbd: Send PROTOCOL_UPDATE packets when appropriate
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:54 +01:00
Philipp Reisner 036b17eaab drbd: Receiving part for the PROTOCOL_UPDATE packet
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner 7aca6c7549 drbd: Allocation of int_dig_in and int_dig_vv was missing
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner f179d76d76 drbd: Made cmp_after_sb() more generic into convert_after_sb()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner 46e1ce4177 drbd: protect updates to integrits_tfm by tconn->data->mutex
Since we need to hold that mutex anyways to make sure the peer
gets that change in the right position in the data stream,
it makes a lot of sense to use the same mutex to ensure existence
of the tfm.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:52 +01:00
Philipp Reisner dcb20d1a8e drbd: Refuse to change network options online when...
* the peer does not speak protocol_version 100 and the
  user wants to change one of:
    - wire_protocol
    - two_primaries
    - integrity_alg

* the user wants to remove the allow_two_primaries flag
  when there are two primaries

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:52 +01:00
Andreas Gruenbacher 95f8efd08b drbd: Fix the upper limit of resync-after
The 32-bit resync_after netlink field takes a device minor number as
parameter, which is no longer limited to 255.  We cannot statically
verify which device numbers are valid, so set the ummer limit to the
highest possible signed 32-bit integer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:51 +01:00
Andreas Gruenbacher 69ef82dea4 drbd: Refer to connect-int consistently throughout the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:50 +01:00
Andreas Gruenbacher 6394b9358e drbd: Refer to resync-rate consistently throughout the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:50 +01:00
Lars Ellenberg 7dc1d67f7c drbd: skip spurious wait_event in drbd_al_begin_io
Activity log transaction writes are serialized on a bit lock.
If several CPUs race to write an AL transaction,
those that did not get the lock the first time
may continue as soon as there are no more pending transactions.

The do not need to all grab the lock in turn,
just to realize that the AL is clean already,
and they have nothing to do.

This also closes a potential deadlock with drbd_adm_disk_opts.
Once it got the AL bit lock, it knows there are no pending transactions,
the AL is clean, and it should be safe to wait for all element references
to drop to zero.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:49 +01:00
Andreas Gruenbacher 6139f60dc1 drbd: Rename the want_lose field/flag to discard_my_data
This is what it is called in config files and on the command line as
well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:49 +01:00
Andreas Gruenbacher 6f9b5f84f5 drbd: Make broadcast events return NO_ERROR
Instead of returning a ret_code outside of the range of enum
drbd_ret_code, use NO_ERROR to indicate success.  This way,
ret_code has the same meaning in all packets.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:48 +01:00
Philipp Reisner c141ebda03 drbd: Removing drbd_cfg_rwsem
* Updates to all configuration items is done under genl_lock().
   Including removal of mdevs or tconns.
 * All read non sleeping read sides are protected by rcu
 * All sleeping read sides keep reference counts to keep the
   objects alive

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:48 +01:00
Philipp Reisner ec0bddbc55 drbd: Use RCU for the drbd_tconns list
Preparing removal of drbd_cfg_rwsem

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:47 +01:00
Philipp Reisner 81fa2e675c drbd: Refcounting for mdev objects
Preparing removal of drbd_cfg_rwsem

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:47 +01:00
Andreas Gruenbacher bb77d34ecc drbd: Turn no-tcp-cork into tcp-cork={yes|no}
Change the --no-tcp-cork drbdsetup command line option as well as
the no_cork netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher e544046ab8 drbd: Turn no-md-flushes into md-flushes={yes|no}
Change the --no-md-flushes drbdsetup command line option as well as
the no_md_flush netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher d0c980e236 drbd: Turn no-disk-drain into disk-drain={yes|no}
Change the --no-disk-drain drbdsetup command line option as well as
the no_disk_drain netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher 66b2f6b9c5 drbd: Turn no-disk-flushes into disk-flushes={yes|no}
Change the --no-disk-flushes drbdsetup command line option as well as
the no_disk_flush netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:45 +01:00
Philipp Reisner 813472ced7 drbd: RCU for rs_plan_s
This removes the issue with using peer_seq_lock out of different
contexts.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:44 +01:00
Philipp Reisner d589a21e5d drbd: Enforce limits of disk_conf members; centralized these checks
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:44 +01:00
Philipp Reisner 9958c857c7 drbd: Made the fifo object a self contained object (preparing for RCU)
* Moved rs_planed into it, named total
* When having a pointer to the object the values can
  be embedded into the fifo object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Philipp Reisner daeda1cca9 drbd: RCU for disk_conf
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Lars Ellenberg 563e4cf25e drbd: Introduce __s32_field in the genetlink macro magic
...and drop explicit typecasts (int)meta_dev_idx < 0.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Philipp Reisner 2ec91e0e29 drbd: Renamed (old|new)_conf into (old|new)_net_conf in receive_SyncParam
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:42 +01:00
Philipp Reisner dc97b70801 drbd: Split drbd_alter_sa() into drbd_sync_after_valid() and drbd_sync_after_changed()
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:42 +01:00
Philipp Reisner ef5e44a672 drbd: drbd_dew_dev_size() gets the user requests disk_size as argument
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:41 +01:00
Philipp Reisner a0095508ca drbd: Renamed the net_conf_update mutex to conf_update
Preparing to use the same mutex for disk_conf updates

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:41 +01:00
Philipp Reisner 934e6138b5 drbd: Removed dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:40 +01:00
Andreas Gruenbacher b966b5dd8e drbd: Generate the drbd_set_*_defaults() functions from drbd_genl.h
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:38 +01:00
Lars Ellenberg 009ba89db5 drbd: fix schedule in atomic
An administrative detach used to request a state change directly to D_DISKLESS,
first suspending IO to avoid the last put_ldev() occuring from an endio handler,
potentially in irq context.

This is not enough on the receiving side (typically secondary), we may miss
some peer_req on the way to local disk, which then may do the last put_ldev()
from their drbd_peer_request_endio().

This patch makes the detach always go through the intermediate D_FAILED state.
We may consider to rename it D_DETACHING.

Alternative approach would be to create yet an other work item to be scheduled
on the worker, do the destructor work from there, and get the timing right.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:01 +01:00
Lars Ellenberg 992d6e91d3 drbd: fix thread stop deadlock
There are races where the receiver may be exiting,
but still need the worker to process some stuff.

Do not wait for the receiver to die from an exiting worker.
The receiver must already be dead in case the worker decides to exit.
If the receiver was still alive, it may still want to queue work, and do
drbd_flush_workqueue() from it's disconnect cleanup code,
which would no longer be processed by an exiting worker.

This also would deadlock,
if the worker was to synchornously wait for the receiver to die.

Do not implicitly stop the worker.
The worker will only be stopped from configuration context, from
conn_reconfig_done(), drbd_adm_down() or drbd_adm_delete_connection(),
after making sure the receiver is already stopped.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:00 +01:00
Lars Ellenberg f3dfa40a67 drbd: fix race when forcefully disconnecting
If a forced disconnect hits a restarting receiver right after it passed
its final "if (C_DISCONNECTING)" test in drbdd_init(), but before it was
actually restarted by drbd_thread_setup, we could be left with a
connection stuck in C_DISCONNECTING, never reaching C_STANDALONE,
which would be necessary to take it down or reconfigure it.

Move the last cleanup into w_after_conn_state_ch(), and do an additional
state change request in conn_try_disconnect(), just in case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:00 +01:00
Andreas Gruenbacher 88104ca458 drbd: Allow to change data-integrity-alg on the fly
The main purpose of this is to allow to turn data integrity checking on
and off on demand without causing interruptions.

Implemented by allocating tconn->peer_integrity_tfm only when receiving
a P_PROTOCOL message.  l accesses to tconn->peer_integrity_tf happen in
worker context, and no further synchronization is necessary.

On the sender side, tconn->integrity_tfm is modified under
tconn->data.mutex, and a P_PROTOCOL message is sent whenever.  All
accesses to tconn->integrity_tfm already happen under this mutex.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:59 +01:00
Andreas Gruenbacher a7eb7bdf58 drbd: Introduce a "lockless" variant of drbd_send_protocoll()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:59 +01:00
Andreas Gruenbacher 4b6ad6d457 drbd: Remove obsolete drbd_crypto_is_hash()
We allocate hash transformations with crypto_alloc_hash() which will
only return hash algorithms.  It is not necessary to reconfirm that we
actually got a hash algorithm.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 5b614abe30 drbd: Rename integrity_r_tfm -> peer_integrity_tfm
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 8d412fc6d5 drbd: Rename integrity_w_tfm -> integrity_tfm
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 86db06180a drbd: Wrong use of RCU in receive_protocol()
It is not enough to grab net_conf->integrity_alg under rcu_read_lock()
and access it outside of it; the entire net_conf object may be gone by
then.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:57 +01:00
Lars Ellenberg acb104c396 drbd: fix copy/paste error in comment
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:57 +01:00
Lars Ellenberg b57a1e27ee drbd: rename variable sc to res_opts
sc was short for syncer conf, which does not exist anymore anyways.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:56 +01:00
Lars Ellenberg 5ecc72c3b9 drbd: rename variable ndc to new_disk_conf
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:54 +01:00
Lars Ellenberg 5979e36155 drbd: on reconfiguration requests, mind the SET_DEFAULTS flag
The DRBD_GENL_F_SET_DEFAULTS flag was ignored
for drbd_adm_disk_opts() and drbd_adm_net_opts().

Factor out drbd_set_*_defaults() helper functions,
and call them appropriately.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:50:38 +01:00
Philipp Reisner 0fd0ea064c drbd: Consider all crypto options in connect and in net-options
So for this was simply not considered after the options have been
re-arranged.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:08 +01:00
Lars Ellenberg d9cc6e2318 drbd: fix various disconnecting races
If an admin requests disconnect at a time when the state handling
already disconnects/reconnects, there have been some races.

Make sure to always really stop the network threads before
returning success for disconnect. Do not pretend successfull
forced disconnect, if the state handling returned an error.

Return success from drbd_adm_down() only after all threads are finished.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:08 +01:00
Lars Ellenberg 5ee743e92d drbd: remove useless kobject_uevent from drbd_adm_connect
Calling kobject_uevent, which may sleep, from within rcu_read_lock()
protected regions is not possible.
This particular kobject_uevent also is also wrong. It was supposed to
trigger a udev run, just in case something relevant to udev symlink
magic has changed, when adjusting runtime re-configurable settings while
we still had the "syncer conf".  It was improperly placed in connect
when we dropped the "syncer conf".  The right thing to do is probably to
call "udevadm trigger" directly in those cases where drbdadm thinks
there was a need to trigger extra udev runs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:07 +01:00
Philipp Reisner a18e9d1eb0 drbd: Removed the OBJECT_DYING and the CONFIG_PENDING bits
superseded by refcounting

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:07 +01:00
Philipp Reisner 0ace9dfabe drbd: Take a reference on tconn when finding a tconn by name
Rule #3 of kref.txt

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:06 +01:00
Philipp Reisner 9dc9fbb357 drbd: Basic refcounting for drbd_tconn
References hold by:
 * Each (running) drbd thread has a reference on tconn
 * Each mdev has a referenc on tconn
 * Beeing in the all_tconn list counts for one reference
 * Each after_conn_state_chg_work has a reference to tconn

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:06 +01:00
Philipp Reisner 1d04122599 drbd: Eliminated drbd_free_resoruces() it is superseeded by conn_free_crypto()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg f5e2b8b3b6 drbd: move comment about stopping the receiver thread to where it belongs
When the last volume of a replication group is unconfigured,
the worker thread exits. To not interfere with cleanup
of other threads, before the the last cleanups run,
we need to make sure the receiver has already exited.

The commend explaining that clearly belongs above
drbd_thread_stop(&tconn->receiver), not in the cleanup loop below.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg ae25b336e0 drbd: cmdname() enum to string convertion was missing a few constants
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg ed439848ca drbd: fix setsockopt for user mode linux
We use our own copy of kernel_setsockopt, and did not mess around with
get_fs/set_fs, since we thought we knew we would always be KERNEL_DS
anyways. Apparently not so for at least user mode linux, so put the
set_fs(KERNEL_DS) in there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:04 +01:00
Lars Ellenberg 71932efc1c drbd: allow status dump request all volumes of a specific resource
We had drbd_adm_get_status (one single volume),
and drbd_adm_get_status_all (dump of all volumes of all resources).

This enhances the latter to be able to dump all volumes
of just one specific resource.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:04 +01:00
Philipp Reisner 302bdeae49 drbd: Considering that the two_primaries config flag can change
Now since it is possible to change the two_primaries config
flag while the connection is up, make sure we treat a peer_req
in a consistent way if the config flag changes while the peer_req
is under IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:03 +01:00
Philipp Reisner 91fd4dad64 drbd: Proper locking for updates to net_conf under RCU
Removing the get_net_conf()/put_net_conf() functions

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:03 +01:00
Philipp Reisner 44ed167da7 drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf
Removing the get_net_conf()/put_net_conf() calls

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:48:59 +01:00
Philipp Reisner b032b6fa35 drbd: Allow online change of replication protocol only with agreed_pv >= 100
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner cd64397c0b drbd: Check consistency of net options when the get changed online
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner 303d1448a0 drbd: Runtime changeable wire protocol
The wire protocol is no longer a property that is negotiated
between the two peers. It is now expressed with two bits
(DP_SEND_WRITE_ACK and DP_SEND_RECEIVE_ACK) in each data
packet. Therefore the primary node is free to change the
wire protocol at any time without disconnect/reconnect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner d3fcb4908d drbd: protect all idr accesses that might sleep with drbd_cfg_rwsem
With this commit the locking for all accesses to IDRs is complete:

 * Non sleeping read accesses are protected by RCU
 * sleeping read accesses are protocted by a read lock on drbd_cfg_rwsem
 * accesses that add anything are protected by a write lock
 * accesses that remove an object are protoected by a write lock
   and a call to synchronize_rcu() after it is removed from the IDR
   and before the object is actually free()ed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:17 +01:00
Philipp Reisner ef35626284 drbd: Converted drbd_cfg_mutex into drbd_cfg_rwsem
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:17 +01:00
Philipp Reisner 695d08fa94 drbd: rcu_read_[un]lock() for all idr accesses that do not sleep
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:16 +01:00
Philipp Reisner cd1d9950f6 drbd: Inlined drbd_free_mdev(); it got called only from one place
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:16 +01:00
Philipp Reisner ff370e5a9e drbd: drbd_delete_device() takes a struct drbd_conf * now
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher 5cc287e0ae drbd: Rename drbd_pp_free() to drbd_free_pages()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher c37c8ecfee drbd: Rename drbd_pp_alloc() to drbd_alloc_pages() and make it non-static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher 18c2d52249 drbd: Rename drbd_pp_first_pages_or_try_alloc() to __drbd_alloc_pages()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:14 +01:00
Andreas Gruenbacher d4da15374b drbd: Make drbd_wait_ee_list_empty() and _drbd_wait_ee_list_empty() static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:14 +01:00
Andreas Gruenbacher 045417f75c drbd: Rename drbd_{ ee -> peer_req }_has_active_page
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher a990be4682 drbd: Rename reclaim_net_ee(), drbd_process_done_ee(), drbd_process_done_ee(), tconn_process_done_ee() to *_peer_reqs
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher 7721f5675e drbd: Rename drbd_release_ee() to drbd_free_peer_reqs()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher 3967deb192 drbd: Rename drbd_free_ee() and variants to *_peer_req()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:12 +01:00
Andreas Gruenbacher 0db55363cb drbd: Rename drbd_alloc_ee() to drbd_alloc_peer_req()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:12 +01:00
Andreas Gruenbacher e0ab6ad4bc drbd: drbd_init_ee() no longer exists
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher 2735a59467 drbd: Make all asynchronous command handlers return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher 859976758d drbd: validate_req_change_req_state(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher b55d84ba17 drbd: Removed outdated comments and code that envisioned VNRs in header 95
Since have now header 100, that has space for 16 bit volume numbers,
the high byte of the length in header 95 is no longer reserved for
8 bit volume numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:10 +01:00
Andreas Gruenbacher 0c8e36d9b8 drbd: Introduce protocol version 100 headers
The 8 byte header finally becomes too small. With the protocol 100 header we
have 16 bit for the volume number, proper 32 bit for the data length, and
32 bit for further extensions in the future.

Previous versions of drbd are using version 80 headers for all packets
short enough for protocol 80.  They support both header versions in
worker context, but only version 80 headers in asynchronous context.
For backwards compatibility, continue to use version 80 headers for
short packets before protocol version 100.

From protocol version 100 on, use the same header version for all
packets.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:10 +01:00
Andreas Gruenbacher e658983af6 drbd: Remove headers from on-the-wire data structures (struct p_*)
Prepare the introduction of the protocol 100 headers. The actual protocol
header is removed for the packet declarations. I.e. allow us to use the
packets with different headers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher 50d0b1ad78 drbd: Remove some fixed header size assumptions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher da39fec492 drbd: Remove now-unused int_dig_out buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher 9f5bdc339e drbd: Replace and remove old primitives
Centralize sock->mutex locking and unlocking in [drbd|conn]_prepare_command()
and [drbd|conn]_send_comman().

Therefore all *_send_* functions are touched to use these primitives instead
of drbd_get_data_sock()/drbd_put_data_sock() and former helper functions.

That change makes the *_send_* functions more standardized.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:08 +01:00
Andreas Gruenbacher 52b061a440 drbd: Introduce drbd_header_size()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:08 +01:00
Andreas Gruenbacher dba5858750 drbd: Introduce new primitives for sending commands
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Andreas Gruenbacher a17647aae4 drbd: drbd_send_ping(), drbd_send_ping(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Philipp Reisner 8b924f1d63 drbd: Use tconn in request_timer_fn()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Philipp Reisner 706cb24c23 drbd: Improved logging of state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:06 +01:00
Philipp Reisner a6d00c8ec3 drbd: Implemented IO thawing for multiple volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:06 +01:00
Philipp Reisner 4669265a7b drbd: Implemented conn_lowest_disk()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 19f83c7661 drbd: Implemented conn_lowest_conn()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 8c7e16c39f drbd: Calculate and provide ns_min to the w_after_conn_state_ch() work
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 5f082f98f5 drbd: Renamed nms to ns_max
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:04 +01:00
Philipp Reisner da9fbc276e drbd: Introduced a new type union drbd_dev_state
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:04 +01:00
Philipp Reisner 8e0af25fa8 drbd: Moved susp, susp_nod and susp_fen to the connection object
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Philipp Reisner 2aebfabb17 drbd: Renamed id_susp(union drbd_state s) to drbd_suspended(struct drbd_conf *)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Philipp Reisner 78bae59b1b drbd: Introduced drbd_read_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Lars Ellenberg e15766e9c9 drbd: improvements to activate/deactivate multiple activity log extents
Recent commit drbd: get rid of bio_split, allow bios of "arbitrary" size
had a reference count leak: it only deactivated the first of several
activity log extents for intervals crossing extent boundaries.

This commit generalizes on bios spanning multiple activity log extents
in drbd_al_begin_io, and adds the necessary loop around lc_put in
drbd_al_complete_io as well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:02 +01:00
Lars Ellenberg 23361cf32b drbd: get rid of bio_split, allow bios of "arbitrary" size
Where "arbitrary" size is currently 1 MiB, which is the BIO_MAX_SIZE
for architectures with 4k PAGE_CACHE_SIZE (most).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:02 +01:00
Lars Ellenberg 7726547e67 drbd: prepare to activate two activity log extents at once
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Lars Ellenberg 181286ad22 drbd: preparation commit, pass drbd_interval to drbd_al_begin/complete_io
We want to avoid bio_split for bios crossing activity log boundaries.
So we may need to activate two activity log extents "atomically".
drbd_al_begin_io() needs to know more than just the start sector.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Lars Ellenberg 85f103d88c drbd: introduce the "initialized" activity log transaction type
So we can initialize a clean on disk activity log area,
without the module complaining with loud assert messages
because of checksum or magic value mismatches.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Andreas Gruenbacher 6038178ebe drbd: Change how the "handshake" packets are called
Packets of type P_HAND_SHAKE define which protocol versions and features
a node supports.  For clarity, call those packets P_CONNECTION_FEATURES
instead.

(This does not determine the features that a specific drbd device
supports, such as drbd protocol A, B, C.)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:00 +01:00
Andreas Gruenbacher e5d6f33abe drbd: Change how the initial packets are called
The first packets exchanged when a connection is established are
referred to as P_HAND_SHAKE_S and P_HAND_SHAKE_M in the code, followed
by P_HAND_SHAKE packets.  To avoid confusion between these two unrelated
things, call the initial packets P_INITIAL_DATA and P_INITIAL_META.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:00 +01:00
Andreas Gruenbacher 7c96715aa8 drbd: _conn_send_cmd(), _drbd_send_cmd(): Pass a struct drbd_socket instead of a plain socket
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher 2bf896213d drbd: drbd_connect(): Initialize struct drbd_socket before sending anything
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher 1952e9166a drbd: Map from (connection, volume number) to device in the asender handlers
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher e05e1e59db drbd: Pass struct packet_info down to the asender receive functions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 438c8374ae drbd: Do not segfault if a sync dependency reaches a diskless device
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 778bcf2e29 drbd: Allow to disconnect if one volume is diskless
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 435693e89b drbd: Print common state changes of all volumes as connection state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:57 +01:00
Philipp Reisner 88ef594ed7 drbd: Fixed logging of old connection state
During a disconnect the oc variable in _conn_request_state()
could become outdated. Determin the common old state after
sleeping.
While at it, I implemented that for all parts of the state

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:57 +01:00
Philipp Reisner bd0c824a9d drbd: Use the idr_for_each_entry() iterator instead of idr_for_each()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher 4a76b1612f drbd: Map from (connection, volume number) to device in the receive handlers
The receive handlers do not all handle unknown volume numbers the same
way.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher e28572167c drbd: Pass struct packet_info down to the receive functions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher 49ba9b1bb3 drbd: Remove useless error messages
These messages can only trigger in case there is a pretty obvious
internal programming error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:55 +01:00
Andreas Gruenbacher deebe19579 drbd: A small cleanup in drbdd()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:55 +01:00
Andreas Gruenbacher 79ed9bd053 drbd: _drbd_send_bitmap(): Use the pre-allocated send buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher 5a87d920f3 drbd: Preallocate one page per drbd_socket as a send buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher fc56815c81 drbd: receive_bitmap(): Use the pre-allocated receive buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher e6ef8a5cb3 drbd: Preallocate one page per drbd_socket as a receive buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:53 +01:00
Philipp Reisner cb703454a2 drbd: Converted drbd_try_outdate_peer() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:53 +01:00
Andreas Gruenbacher a02d124091 drbd: Rename the DCBP_* functions to dcbp_* and move them to where they are used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher 058820cdd7 drbd: Make _drbd_send_bitmap() static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher e307f352b4 drbd: Move drbd_send_ping() and drbd_send_ping_ack() to drbd_main.c
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher 0916e0e308 drbd: Always use the same protocol version for the same peer
There is no need to send protocol 80 headers to peers that understand
protocol 95 headers.  Make sure that we don't send protocol 95 headers
until we have agreed upon a protocol version with our peer, though.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:51 +01:00
Andreas Gruenbacher 0829f5edf3 drbd: drbd_connected(): Return an error code upon failure.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:51 +01:00
Andreas Gruenbacher a5c3190435 drbd: Introduce and use drbd_recv_all_warn()
The pattern of receiving a fixed number of bytes and warning if a short
packet is received and the receiver has not actively been interruped is
repeated many times; clean that up.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 309a834896 drbd: Get rid of typedef drbd_work_cb
This type is not used anywhere else.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 8f7bed7774 drbd: Rename various functions from *_oos_* to *_out_of_sync_* for clarity
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 0da34df0d0 drbd: drbd_may_do_local_read(): Use bool/true/false
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 1097e9a80c drbd: Remove unnecessary assertion
This is also checked further below in the same function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 69f5ec728c drbd: Remove duplicate initialization
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 3fbf4d21ae drbd: drbd_md_sync_page_io(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:48 +01:00
Andreas Gruenbacher ac29f4039a drbd: _drbd_md_sync_page_io(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:48 +01:00
Andreas Gruenbacher 22ab6a30b8 drbd: drbd_bm_read() never returns a positive value through drbd_bitmap_io()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher 82bc01940a drbd: Make all command handlers return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher c696774691 drbd: Add drbd_recv_all(): Receive an entire buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher a982dd579c drbd: send_bitmap_rle_or_plain(): Error handling cleanup
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:46 +01:00
Andreas Gruenbacher e1c1b0fc8f drbd: recv_resync_read(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:46 +01:00
Andreas Gruenbacher 28284ceff0 drbd: recv_dless_read(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher fc5be8397f drbd: drbd_drain_block(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher 69bc7bc351 drbd: drbd_recv_header(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher 8172f3e9bf drbd: decode_header(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:44 +01:00
Andreas Gruenbacher e2b3032b90 drbd: drbd_process_done_ee(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:44 +01:00
Andreas Gruenbacher 99920dc5c5 drbd: Make all worker callbacks return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher b2f0ab62ec drbd: Temporarily change the return type of all worker callbacks
This helps to ensure that we don't miss one of them when changing their
return value semantics.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher a896527c06 drbd: drbd_send_short_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher 6bdb9b0e23 drbd: drbd_send_dblock(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 7fae55da38 drbd: _drbd_send_bio(), _drbd_send_zc_bio(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 7b57b89d62 drbd: drbd_send_block(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 9f69230cd6 drbd: _drbd_send_zc_ee(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:41 +01:00
Andreas Gruenbacher 88b390ff63 drbd: _drbd_send_page(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:41 +01:00
Andreas Gruenbacher b987427b53 drbd: _drbd_no_send_page(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher 73218a3c4c drbd: drbd_send_oos(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher db1b0b724e drbd: drbd_send_drequest_csum(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher 6c1005e74d drbd: drbd_send_drequest(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:39 +01:00
Andreas Gruenbacher 5b9f499c66 drbd: drbd_send_ov_request(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:39 +01:00
Andreas Gruenbacher fa79abd893 drbd: drbd_send_ack_ex(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher a9a9994dc7 drbd: drbd_send_ack_{dp,rp}(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher dd5161218b drbd: drbd_send_ack(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher a8c32aa846 drbd: _drbd_send_ack(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:37 +01:00
Andreas Gruenbacher d4e67d7c4f drbd: drbd_send_b_ack(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:37 +01:00
Andreas Gruenbacher 2f4e7abe51 drbd: drbd_send_sr_reply(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher d24ae219e9 drbd: drbd_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher caee1c3a92 drbd: conn_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher 758970c832 drbd: _conn_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:35 +01:00
Andreas Gruenbacher f02d4d0a9c drbd: drbd_send_sizes(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:35 +01:00
Andreas Gruenbacher 9c1b7f7282 drbd: drbd_gen_and_send_sync_uuid(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher 2ae5f95b1a drbd: drbd_send_uuids() and its variants: Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher 387eb30817 drbd: drbd_send_protocol(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher e8d17b015e drbd: drbd_send_handshake(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher 927036f908 drbd: drbd_send_state(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher 103ea27528 drbd: drbd_send_sync_param(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher f725446353 drbd: drbd_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:32 +01:00
Andreas Gruenbacher 7d168ed30f drbd: Get rid of USE_DATA_SOCKET and USE_META_SOCKET
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:32 +01:00
Andreas Gruenbacher 596a37f9ef drbd: conn_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher 04dfa13788 drbd: _drbd_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher ecf2363cb5 drbd: _conn_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher ce9879cb1f drbd: conn_send_cmd2(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:30 +01:00
Andreas Gruenbacher fb708e408f drbd: Add drbd_send_all(): Send an entire buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:30 +01:00
Andreas Gruenbacher 11b0be28e5 drbd: drbd_get_data_sock(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:29 +01:00
Andreas Gruenbacher c0d42c8e57 drbd: drbd_send(): Return a "real" error code if we have no socket
Q: Can this case even trigger?  Is failing this way any better than one
that causes a NULL pointer access?

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:29 +01:00
Philipp Reisner e90285e0ba drbd: Fixed conn_lowest_minor
It actually returned the lowest volume number. While doing that
renamed a few wrongly named variables.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:28 +01:00
Lars Ellenberg f399002e68 drbd: distribute former syncer_conf settings to disk, connection, and resource level
This commit breaks the API again.

Move per-volume former syncer options into disk_conf.
Move per-connection former syncer options into net_conf.
Renamed the remainign sync_conf to res_opts

Syncer settings have been changeable at runtime, so we need to prepare
for these settings to be runtime-changeable in their new home as well.

Introduce new configuration operations, and share the netlink attribute
between "attach" (create new disk) and "disk-opts" (change options).
Same for "connect" and "net-opts".

Some fields cannot be changed at runtime, however.
Introduce a new flag GENLA_F_INVARIANT to be able to trigger on that in
the generated validation and assignment functions.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:20 +01:00
Roger Pau Monne cb5bd4d19b xen/blkback: persistent-grants fixes
This patch contains fixes for persistent grants implementation v2:

 * handle == 0 is a valid handle, so initialize grants in blkback
   setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
   by Konrad Rzeszutek Wilk.

 * new_map is a boolean, use "true" or "false" instead of 1 and 0.
   Reported by Konrad Rzeszutek Wilk.

 * blkfront announces the persistent-grants feature as
   feature-persistent-grants, use feature-persistent instead which is
   consistent with blkback and the public Xen headers.

 * Add a consistency check in blkfront to make sure we don't try to
   access segments that have not been set.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-04 10:35:40 -05:00
Philipp Reisner 6b75dced00 drbd: conn_khelper() for user mode callbacks for connections
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:32 +01:00
Philipp Reisner 047e95e259 drbd: Allow volumes to become primary only on one side
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:31 +01:00
Lars Ellenberg 40cbf085f5 drbd: fix conn_reconfig_start without conn_reconfig_done in drbd_adm_attach
If drbd_adm_attach failed early, it left the CONFIG_PENDING bit on,
blocking any further conn_reconfig_start on that connection.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:31 +01:00
Philipp Reisner e4f78edee1 drbd: Separate connection state changes from minor dev state changes #2
New function got_conn_RqSReply()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:30 +01:00
Philipp Reisner f19e4f8ba7 drbd: Converted got_Ping() and got_PingAck() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:30 +01:00
Philipp Reisner a4fbda8eca drbd: Allow packet handler functions that take a connection (meta connection)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:29 +01:00
Philipp Reisner dfafcc8a7b drbd: Separate connection state changes from minor dev state changes #1
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:28 +01:00
Philipp Reisner 7204624c5e drbd: Converted receive_protocol() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:28 +01:00
Philipp Reisner d9ae84e790 drbd: Allow packet handler functions that take a connection
That is necessary in case a connection does not have a volume 0

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:27 +01:00
Philipp Reisner 8169e41b3e drbd: Moved CONN_DRY_RUN to the per connection (tconn) flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:27 +01:00
Philipp Reisner 38fa9988fa drbd: Do not modify the connection state with something else that conn_request_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:26 +01:00
Philipp Reisner 34f646bd57 drbd: Allow two diskless minors to be connected
In the context of drbd-8.4 it no longer makes sense to
dissalow that.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:25 +01:00
Philipp Reisner 2325eb661f drbd: New minors have to intherit the connection state form their connection
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:25 +01:00
Philipp Reisner 082a3439a2 drbd: process_done_ee() has to handle unconfigured devices now
Took the chance and converted tconn_process_done_ee() to use
idr_for_each_entry()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:24 +01:00
Philipp Reisner 2de876efa6 drbd: Ignore packets for non existing volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:23 +01:00
Lars Ellenberg 85f75dd763 drbd: introduce in-kernel "down" command
This greatly simplifies deconfiguration of whole resources.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:23 +01:00
Lars Ellenberg 3c5e5f6afd drbd: add forgotten spin_unlock
somehow a "goto abort" was introduced with commit
  drbd: Extracted is_valid_transition() out of sanitize_state()
which left drbd_req_state still holding the spin lock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:22 +01:00
Lars Ellenberg 527f4b24e5 drbd: bail out if a config requrest is over-determined, and not matching
We have resources resp. connections, volumes, and minor numbers.
A config request may specifies all three of them.
If it turns out that the minor belongs to a different connection, or a
different volume number in the same connection, that configuration
request is invalid.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:21 +01:00
Lars Ellenberg 38f19616d2 drbd: new-connection and new-minor succeed, if the object already exists
Follow O_CREAT semantics when creating connection or minor device/volume
objects.  If we need O_CREAT|O_EXCL semantics some time down the road,
we can add NLM_F_EXCL to the netlink message flags.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:21 +01:00
Lars Ellenberg cffec5b2fe drbd: Allow a Diskless Secondary volume to be removed
Even if the connection is still established.
We should be able to reduce a volume from a replication group,
without taking the whole group offline.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:20 +01:00
Lars Ellenberg d0456c72df drbd: simplify conn_all_vols_unconf, make it bool
Get rid of a temporary variable and, funny bitand assignment.
Just short circuit, returning false, once we encounter the first
still configured volume.

FIXME verify call sites for need of rcu_read_lock or stronger.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:19 +01:00
Lars Ellenberg 543cc10b4c drbd: drbd_adm_get_status needs to show some more detail
We want to see existing connection objects, even if they do not
currently have volumes attached.

Change the .dumpit variant of drbd_adm_get_status to iterate not over
minor devices, but over connections + volumes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:19 +01:00
Lars Ellenberg 8432b31457 drbd: allow holes in minor and volume id allocation
s/idr_get_new/idr_get_new_above/

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:17 +01:00
Lars Ellenberg 3b98c0c209 drbd: switch configuration interface from connector to genetlink
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:17 +01:00
Alex Elder 9e15b77d9a rbd: get additional info in parent spec
When a layered rbd image has a parent, that parent is identified
only by its pool id, image id, and snapshot id.  Images that have
been mapped also record *names* for those three id's.

Add code to look up these names for parent images so they match
mapped images more closely.  Skip doing this for an image if it
already has its pool name defined (this will be the case for images
mapped by the user).

It is possible that an the name of a parent image can't be
determined, even if the image id is valid.  If this occurs it
does not preclude correct operation, so don't treat this as
an error.

On the other hand, defined pools will always have both an id and a
name.   And any snapshot of an image identified as a parent for a
clone image will exist, and will have a name (if not it indicates
some other internal error).  So treat failure to get these bits
of information as errors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 86b00e0da6 rbd: get parent spec for version 2 images
Add support for getting the the information identifying the parent
image for rbd images that have them.  The child image holds a
reference to its parent image specification structure.  Create a new
entry "parent" in /sys/bus/rbd/image/N/ to report the identifying
information for the parent image, if any.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder a92ffdf8a9 rbd: allow null image name
Format 2 parent images are partially identified by their image id,
but it may not be possible to determine their image name.  The name
is not strictly needed for correct operation, so we won't be
treating it as an error if we don't know it.  Handle this case
gracefully in rbd_name_show().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 2c0d0a10ea rbd: allow null image name
We will know the image id for format 2 parent images, but won't
initially know its image name.  Avoid making the query for an image
id in rbd_dev_image_id() if it's already known.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 83a0626362 rbd: encapsulate last part of probe
Group the activities that now take place after an rbd_dev_probe()
call into a single function, and move the call to that function
into rbd_dev_probe() itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder c53d589337 rbd: define rbd_dev_{create,destroy}() helpers
Encapsulate the creation/initialization and destruction of rbd
device structures.  The rbd_client and the rbd_spec structures
provided on creation hold references whose ownership is transferred
to the new rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder bd4ba6554d rbd: consolidate rbd_dev init in rbd_add()
Group the allocation and initialization of fields of the rbd device
structure created in rbd_add().  Move the grouped code down later in
the function, just prior to the call to rbd_dev_probe().  This is
for the most part simple code movement.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 9d3997fdf4 rbd: don't pass rbd_dev to rbd_get_client()
The only reason rbd_dev is passed to rbd_get_client() is so its
rbd_client field can get assigned.  Instead, just return the
rbd_client pointer as a result and have the caller do the
assignment.

Change rbd_put_client() so it takes an rbd_client structure,
so follows the more typical symmetry with rbd_get_client().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 859c31df9c rbd: fill rbd_spec in rbd_add_parse_args()
Pass the address of an rbd_spec structure to rbd_add_parse_args().
Use it to hold the information defining the rbd image to be mapped
in an rbd_add() call.

Use the result in the caller to initialize the rbd_dev->id field.

This means rbd_dev is no longer needed in rbd_add_parse_args(),
so get rid of it.

Now that this transformation of rbd_add_parse_args() is complete,
correct and expand on the its header documentation to reflect the
new reality.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 8b8fb99c5c rbd: add reference counting to rbd_spec
With layered images we'll share rbd_spec structures, so add a
reference count to it.  It neatens up some code also.

A silly get/put pair is added to the alloc routine just to avoid
"defined but not used" warnings.  It will go away soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:41 -05:00
Roger Pau Monne 0a8704a51f xen/blkback: Persistent grant maps for xen blk drivers
This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.

Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.

The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.

In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
2012-10-30 09:50:04 -04:00
Alex Elder 0d7dbfce9d rbd: define image specification structure
Group the fields that uniquely specify an rbd image into a new
reference-counted rbd_spec structure.  This structure will be used
to describe the desired image when mapping an image, and when
probing parent images in layered rbd devices.  Replace the set of
fields in the rbd device structure with a pointer to a dynamically
allocated rbd_spec.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:30 -05:00
Alex Elder dc79b113d6 rbd: have rbd_add_parse_args() return error
Change the interface to rbd_add_parse_args() so it returns an
error code rather than a pointer.  Return the ceph_options result
via a pointer whose address is passed as an argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:30 -05:00
Alex Elder 4e9afeba7a rbd: pass and populate rbd_options structure
Have the caller pass the address of an rbd_options structure to
rbd_add_parse_args(), to be initialized with the information
gleaned as a result of the parse.

I know, this is another near-reversal of a recent change...

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 819d52bf72 rbd: remove snap_name arg from rbd_add_parse_args()
The snapshot name returned by rbd_add_parse_args() just gets saved
in the rbd_dev eventually.  So just do that inside that function and
do away with the snap_name argument, both in rbd_add_parse_args()
and rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder f28e565a1b rbd: remove options args from rbd_add_parse_args()
They "options" argument to rbd_add_parse_args() (and it's partner
options_size) is now only needed within the function, so there's no
need to have the caller allocate and pass the options buffer.  Just
allocate the options buffer within the function using dup_token().

Also distinguish between failures due to failed memory allocation
and failing because a required argument was missing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder e5c3553404 rbd: get rid of snap_name_len
The value returned in the "snap_name_len" argument to
rbd_add_parse_args() is never actually used, so get rid of it.

The snap_name_len recorded in rbd_dev_v2_snap_name() is not
useful either, so get rid of that too.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 0ddebc0c6c rbd: do all argument parsing in one place
This patch makes rbd_add_parse_args() be the single place all
argument parsing occurs for an image map request:
    - Move the ceph_parse_options() call into that function
    - Use local variables rather than parameters to hold the list
      of monitor addresses supplied
    - Rather than returning it, pass the snapshot name (and its
      length) back via parameters
    - Have the function return a ceph_options structure pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 78cea76e05 rbd: move ceph_parse_options() call up
Move option parsing out of rbd_get_client() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder daba5fdb4c rbd: rename snap_exists field
A Boolean field "snap_exists" in an rbd mapping is used to indicate
whether a mapped snapshot has been removed from an image's snapshot
context, to stop sending requests for that snapshot as soon as we
know it's gone.

Generalize the interpretation of this field so it applies to
non-snapshot (i.e. "head") mappings.  That is, define its value
to be false until the mapping has been set, and then define it to be
true for both snapshot mappings or head mappings.

Rename the field "exists" to reflect the broader interpretation.
The rbd_mapping structure is on its way out, so move the field
back into the rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 971f839a76 rbd: move snap info out of rbd_mapping struct
Moving the snap_id and snap_name fields into the separate
rbd_mapping structure was misguided.  (And in time, perhaps
we'll do away with that structure altogether...)

Move these fields back into struct rbd_device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 86992098e7 rbd: make pool_id a 64 bit value
If a format 2 image has a parent, its pool id will be specified
using a 64-bit value.  Change the pool id we save for an image to
match that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 41f38c2b2f rbd: remove snapshots on error in rbd_add()
If rbd_dev_snaps_update() has ever been called for an rbd device
structure there could be snapshot structures on its snaps list.
In rbd_add(), this function is called but a subsequent error
path neglected to clean up any of these snapshots.

Add a call to rbd_remove_all_snaps() in the appropriate spot to
remedy this.  Change a couple of error labels to be a little
clearer while there.

Drop the leading underscores from the function name; there's nothing
special about that function that they might signify.  As suggested
in review, the leading underscores in __rbd_remove_snap_dev() have
been removed as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:28 -05:00
Alex Elder f7760dad28 rbd: simplify rbd_rq_fn()
When processing a request, rbd_rq_fn() makes clones of the bio's in
the request's bio chain and submits the results to osd's to be
satisfied.  If a request bio straddles the boundary between objects
backing the rbd image, it must be represented by two cloned bio's,
one for the first part (at the end of one object) and one for the
second (at the beginning of the next object).

This has been handled by a function bio_chain_clone(), which
includes an interface only a mother could love, and which has
been found to have other problems.

This patch defines two new fairly generic bio functions (one which
replaces bio_chain_clone()) to help out the situation, and then
revises rbd_rq_fn() to make use of them.

First, bio_clone_range() clones a portion of a single bio, starting
at a given offset within the bio and including only as many bytes
as requested.  As a convenience, a request to clone the entire bio
is passed directly to bio_clone().

Second, bio_chain_clone_range() performs a similar function,
producing a chain of cloned bio's covering a sub-range of the
source chain.  No bio_pair structures are used, and if successful
the result will represent exactly the specified range.

Using bio_chain_clone_range() makes bio_rq_fn() a little easier
to understand, because it avoids the need to pass very much
state information between consecutive calls.  By avoiding the need
to track a bio_pair structure, it also eliminates the problem
described here:  http://tracker.newdream.net/issues/2933

Note that a block request (and therefore the complete length of
a bio chain processed in rbd_rq_fn()) is an unsigned int, while
the result of rbd_segment_length() is u64.  This change makes
this range trunctation explicit, and trips a bug if the the
segment boundary is too far off.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:28 -05:00
Lars Ellenberg ccae7868b0 drbd: log request sector offset and size for IO errors
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg a2a3c74f24 drbd: always write bitmap on detach
If we detach due to local read-error (which sets a bit in the bitmap),
stay Primary, and then re-attach (which re-reads the bitmap from disk),
we potentially lost the "out-of-sync" (or, "bad block") information in
the bitmap.

Always (try to) write out the changed bitmap pages before going diskless.

That way, we don't lose the bit for the bad block,
the next resync will fetch it from the peer, and rewrite
it locally, which may result in block reallocation in some
lower layer (or the hardware), and thereby "heal" the bad blocks.

If the bitmap writeout errors out as well, we will (again: try to)
mark the "we need a full sync" bit in our super block,
if it was a READ error; writes are covered by the activity log already.

If that superblock does not make it to disk either, we are sorry.

Maybe we just lost an entire disk or controller (or iSCSI connection),
and there actually are no bad blocks at all, so we don't need to
re-fetch from the peer, there is no "auto-healing" necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 06f10adbdb drbd: prepare for more than 32 bit flags
- struct drbd_conf { ... unsigned long flags; ... }
 + struct drbd_conf { ... unsigned long drbd_flags[N]; ... }

And introduce wrapper functions for test/set/clear bit operations
on this member.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 44edfb0d78 drbd: wait for meta data IO completion even with failed disk, unless force-detached
The intention of force-detach is to be able to deal with a completely
unresponsive lower level IO stack, which does not even deliver error
completions anymore, but no completion at all.

In all other cases, we must still wait for the meta data IO completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 8b45a5c8a1 drbd: a few more GFP_KERNEL -> GFP_NOIO
This has not yet been observed, but conceivably, when using GFP_KERNEL
allocations from drbd_md_sync(), drbd_flush_after_epoch() or
receive_SyncParam(), we could trigger additional IO to our own device,
or an other device in a criss-cross setup, and end up in a local
deadlock, or potentially a distributed deadlock in a criss-cross setup
involving the peer blocked in a similar way waiting for us to make
progress.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 0b143d4382 drbd: fix potential deadlock during bitmap (re-)allocation
The former comment arguing that GFP_KERNEL was good enough was wrong: it
did not take resize into account at all, and assumed the only path
leading here was the normal attach on a still secondary device, so no
deadlock would be possible.

Both resize on a Primary, or attach on a diskless Primary,
could potentially deadlock.

drbd_bm_resize() is called while IO to the respective device is
suspended, so we must use GFP_NOIO to avoid potential deadlock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 7fb907c15f drbd: panic on delayed completion of aborted requests
"aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which do no longer
complete requests at all, not even do error completions.  In this
situation, usually a hard-reset and failover is the only way out.

By "aborting", basically faking a local error-completion,
we allow for a more graceful swichover by cleanly migrating services.
Still the affected node has to be rebooted "soon".

By completing these requests, we allow the upper layers to re-use
the associated data pages.

If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will
just put random data into unused pages; but typically it will corrupt
meanwhile completely unrelated data, causing all sorts of damage.

Which means delayed successful completion,
especially for READ requests,
is a reason to panic().

We assume that a delayed *error* completion is OK,
though we still will complain noisily about it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner dbd0820c6f drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner 599377acb7 drbd: Avoid NetworkFailure state during disconnect
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.

In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.

At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner c12a3d8c84 drbd: Fix a potential issue with the DISCARD_CONCURRENT flag
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Lars Ellenberg 02b91b5526 drbd: introduce stop-sector to online verify
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Philipp Reisner 9f2247bb9b drbd: Protect accesses to the uuid set with a spinlock
There is at least the worker context, the receiver context, the context of
receiving netlink packts and processes reading a sysfs attribute that access
the uuids.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Dave Chinner a1ecac3b06 loop: Make explicit loop device destruction lazy
xfstests has always had random failures of tests due to loop devices
failing to be torn down and hence leaving filesytems that cannot be
unmounted. This causes test runs to immediately stop.

Over the past 6 or 7 years we've added hacks like explicit unmount
-d commands for loop mounts, losetup -d after unmount -d fails, etc,
but still the problems persist.  Recently, the frequency of loop
related failures increased again to the point that xfstests 259 will
reliably fail with a stray loop device that was not torn down.

That is despite the fact the test is above as simple as it gets -
loop 5 or 6 times running mkfs.xfs with different paramters:

        lofile=$(losetup -f)
        losetup $lofile "$testfile"
        "$MKFS_XFS_PROG" -b size=512 $lofile >/dev/null || echo "mkfs failed!"
        sync
        losetup -d $lofile

And losteup -d $lofile is failing with EBUSY on 1-3 of these loops
every time the test is run.

Turns out that blkid is running simultaneously with losetup -d, and
so it sees an elevated reference count and returns EBUSY.  But why
is blkid running? It's obvious, isn't it? udev has decided to try
and find out what is on the block device as a result of a creation
notification. And it is racing with mkfs, so might still be scanning
the device when mkfs finishes and we try to tear it down.

So, make losetup -d force autoremove behaviour. That is, when the
last reference goes away, tear down the device. xfstests wants it
*gone*, not causing random teardown failures when we know that all
the operations the tests have specifically run on the device have
completed and are no longer referencing the loop device.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:31 +01:00
Selvan Mani 4453bc88f0 mtip32xx:Added appropriate timeout value for secure erase
Added appropriate timeout value for secure erase based on identify device data

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:27 +01:00
Oliver Chick 1f999572f2 xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:20 +01:00
Akinobu Mita b7010ede43 cciss: select CONFIG_CHECK_SIGNATURE
The patch cciss-use-check_signature.patch in -mm tree introduced
a build error:

drivers/built-in.o: In function `CISS_signature_present':
drivers/block/cciss.c:4270: undefined reference to `check_signature'

Add missing CONFIG_CHECK_SIGNATURE to fix this issue.

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Fengguang Wu <wfg@linux.intel.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: "Stephen M. Cameron" <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:00 +01:00
Wei Yongjun 2541aa799f cciss: remove unneeded memset()
The memory return by kzalloc() or kmem_cache_zalloc() has already be set
to zero, so remove useless memset(0).

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:36:58 +01:00
Wei Yongjun 654dbef214 xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:36:27 +01:00
Herton Ronaldo Krzesinski 1a4ae43e4f floppy: remove dr, reuse drive on do_floppy_init
This is a small cleanup, that also may turn error handling of
unitialized disks more readable. We don't need a separate variable to
track allocated disks, remove dr and reuse drive variable instead.

Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:36:07 +01:00
Herton Ronaldo Krzesinski 8d3ab4ebfd floppy: use common function to check if floppies can be registered
The same checks to see if a drive can be or is registered are
repeated through the code, factor out the checks in a common function
and replace the repeated checks with it.

Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski d60e7ec18c floppy: properly handle failure on add_disk loop
On floppy initialization, if something failed inside the loop we call
add_disk, there was no cleanup of previous iterations in the error
handling.

Cc: stable@vger.kernel.org
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski 238ab78469 floppy: do put_disk on current dr if blk_init_queue fails
If blk_init_queue fails, we do not call put_disk on the current dr
(dr is decremented first in the error handling loop).

Cc: stable@vger.kernel.org
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski b54e1f8889 floppy: don't call alloc_ordered_workqueue inside the alloc_disk loop
Since commit 070ad7e ("floppy: convert to delayed work and single-thread
wq"), we end up calling alloc_ordered_workqueue multiple times inside
the loop, which shouldn't be intended. Besides the leak, other side
effect in the current code is if blk_init_queue fails, we would end up
calling unregister_blkdev even if we didn't call yet register_blkdev.

Just moved the allocation of floppy_wq before the loop, and adjusted the
code accordingly.

Cc: stable@vger.kernel.org # 3.5+
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:24 +01:00
Konrad Rzeszutek Wilk 2911758f14 xen/blkback: Fix compile warning
drivers/block/xen-blkback/xenbus.c:260:5: warning: symbol 'xenvbd_sysfs_addif' was not declared. Should it be static?
drivers/block/xen-blkback/xenbus.c:284:6: warning: symbol 'xenvbd_sysfs_delif' was not declared. Should it be static?

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:32:43 +01:00
Alex Elder 069a4b5690 rbd: kill rbd_device->rbd_opts
The rbd_device structure has an embedded rbd_options structure.
Such a structure is needed to work with the generic ceph argument
parsing code, but there's no need to keep it around once argument
parsing is done.

Use a local variable to hold the rbd options used in parsing in
rbd_get_client(), and just transfer its content (it's just a
read_only flag) into the field in the rbd_mapping sub-structure
that requires that information.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder e5cfeed281 rbd: simplify rbd_merge_bvec()
The aim of this patch is to make what's going on rbd_merge_bvec() a
bit more obvious than it was before.  This was an issue when a
recent btrfs bug led us to question whether the merge function was
working correctly.

Use "obj" rather than "chunk" to indicate the units whose boundaries
we care about we call (rados) "objects".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder d4b125e9eb rbd: increase maximum snapshot name length
Change RBD_MAX_SNAP_NAME_LEN to be based on NAME_MAX.  That is a
practical limit for the length of a snapshot name (based on the
presence of a directory using the name under /sys/bus/rbd to
represent the snapshot).

The /sys entry is created by prefixing it with "snap_"; define that
prefix symbolically, and take its length into account in defining
the snapshot name length limit.

Enforce the limit in rbd_add_parse_args().  Also delete a dout()
call in that function that was not meant to be committed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder db2388b6ee rbd: verify rbd image order value
This adds a verification that an rbd image's object order is
within the upper and lower bounds supported by this implementation.

It must be at least 9 (SECTOR_SHIFT), because the Linux bio system
assumes that minimum granularity.

It also must be less than 32 (at the moment anyway) because there
exist spots in the code that store the size of a "segment" (object
backing an rbd image) in a signed int variable, which can be 32 bits
including the sign.  We should be able to relax this limit once
we've verified the code uses 64-bit types where needed.

Note that the CLI tool already limits the order to the range 12-25.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder 4634246db8 rbd: consolidate rbd_do_op() calls
The two calls to rbd_do_op() from rbd_rq_fn() differ only in the
value passed for the snapshot id and the snapshot context.

For reads the snapshot always comes from the mapping, and for writes
the snapshot id is always CEPH_NOSNAP.

The snapshot context is always null for reads.  For writes, the
snapshot context always comes from the rbd header, but it is
acquired under protection of header semaphore and could change
thereafter, so we can't simply use what's available inside
rbd_do_op().

Eliminate the snapid parameter from rbd_do_op(), and set it
based on the I/O direction inside that function instead.  Always
pass the snapshot context acquired in the caller, but reset it
to a null pointer inside rbd_do_op() if the operation is a read.

As a result, there is no difference in the read and write calls
to rbd_do_op() made in rbd_rq_fn(), so just call it unconditionally.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder ff2e4bb5b3 rbd: drop rbd_do_op() opcode and flags
The only callers of rbd_do_op() are in rbd_rq_fn(), where call one
is used for writes and the other used for reads.  The request passed
to rbd_do_op() already encodes the I/O direction, and that
information can be used inside the function to set the opcode and
flags value (rather than passing them in as arguments).

So get rid of the opcode and flags arguments to rbd_do_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder 13f4042c05 rbd: kill rbd_req_{read,write}()
Both rbd_req_read() and rbd_req_write() are simple wrapper routines
for rbd_do_op(), and each is only called once.  Replace each wrapper
call with a direct call to rbd_do_op(), and get rid of the wrapper
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder be466c1cc3 rbd: fix read-only option name
The name of the "read-only" mapping option was inadvertently changed
in this commit:

    f84344f3 rbd: separate mapping info in rbd_dev

Revert that hunk to return it to what it should be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder a0ea3a40fd rbd: zero return code in rbd_dev_image_id()
When rbd_dev_probe() calls rbd_dev_image_id() it expects to get
a 0 return code if successful, but it is getting a positive value.

The reason is that rbd_dev_image_id() returns the value it gets from
rbd_req_sync_exec(), which returns the number of bytes read in as a
result of the request.  (This ultimately comes from
ceph_copy_from_page_vector() in rbd_req_sync_op()).

Force the return value to 0 when successful in rbd_dev_image_id().
Do the same in rbd_dev_v2_object_prefix().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder b213e0b1a6 rbd: fix bug in rbd_dev_id_put()
In rbd_dev_id_put(), there's a loop that's intended to determine
the maximum device id in use.  But it isn't doing that at all,
the effect of how it's written is to simply use the just-put id
number, which ignores whole purpose of this function.

Fix the bug.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Kees Cook b8977285ec drivers/block: remove CONFIG_EXPERIMENTAL
This config item has not carried much meaning for a while now and is
almost always enabled by default. As agreed during the Linux kernel
summit, remove it.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Asai Thambi S P <asamymuthupa@micron.com>
CC: Pete Zaitcev <zaitcev@redhat.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-23 22:30:38 +02:00
Linus Torvalds ce40be7a82 Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block
Pull block IO update from Jens Axboe:
 "Core block IO bits for 3.7.  Not a huge round this time, it contains:

   - First series from Kent cleaning up and generalizing bio allocation
     and freeing.

   - WRITE_SAME support from Martin.

   - Mikulas patches to prevent O_DIRECT crashes when someone changes
     the block size of a device.

   - Make bio_split() work on data-less bio's (like trim/discards).

   - A few other minor fixups."

Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton.  It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.

* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
  block: makes bio_split support bio without data
  scatterlist: refactor the sg_nents
  scatterlist: add sg_nents
  fs: fix include/percpu-rwsem.h export error
  percpu-rw-semaphore: fix documentation typos
  fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
  blockdev: turn a rw semaphore into a percpu rw semaphore
  Fix a crash when block device is read and block size is changed at the same time
  block: fix request_queue->flags initialization
  block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
  block: ioctl to zero block ranges
  block: Make blkdev_issue_zeroout use WRITE SAME
  block: Implement support for WRITE SAME
  block: Consolidate command flag and queue limit checks for merges
  block: Clean up special command handling logic
  block/blk-tag.c: Remove useless kfree
  block: remove the duplicated setting for congestion_threshold
  block: reject invalid queue attribute values
  block: Add bio_clone_bioset(), bio_clone_kmalloc()
  block: Consolidate bio_alloc_bioset(), bio_kmalloc()
  ...
2012-10-11 09:04:23 +09:00
Alex Elder 35152979e6 rbd: activate v2 image support
Now that v2 images support is fully implemented, have
rbd_dev_v2_probe() return 0 to indicate a successful probe.

(Note that an image that implements layering will fail
the probe early because of the feature chekc.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:44:01 -07:00
Alex Elder d889140c4a rbd: implement feature checks
Version 2 images have two sets of feature bit fields.  The first
indicates features possibly used by the image.  The second indicates
features that the client *must* support in order to use the image.

When an image (or snapshot) is first examined, we need to make sure
that the local implementation supports the image's required
features.  If not, fail the probe for the image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:51 -07:00
Alex Elder 117973fb4c rbd: define rbd_dev_v2_refresh()
Define a new function rbd_dev_v2_refresh() to update/refresh the
snapshot context for a format version 2 rbd image.  This function
will update anything that is not fixed for the life of an rbd
image--at the moment this is mainly the snapshot context and (for
a base mapping) the size.

Update rbd_refresh_header() so it selects which function to use
based on the image format.

Rename __rbd_refresh_header() to be rbd_dev_v1_refresh()
to be consistent with the naming of its version 2 counterpart.
Similarly rename rbd_refresh_header() to be rbd_dev_refresh().

Unrelated--we use rbd_image_format_valid() here.  Delete the other
use of it, which was primarily put in place to ensure that function
was referenced at the time it was defined.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:39 -07:00
Alex Elder 9478554ae5 rbd: define rbd_update_mapping_size()
Encapsulate the code that handles updating the size of a mapping
after an rbd image has been refreshed.  This is done in anticipation
of the next patch, which will make this common code for format 1 and
2 images.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:28 -07:00
Linus Torvalds 7035cdf36d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The bulk of this pull is a series from Alex that refactors and cleans
  up the RBD code to lay the groundwork for supporting the new image
  format and evolving feature set.  There are also some cleanups in
  libceph, and for ceph there's fixed validation of file striping
  layouts and a bugfix in the code handling a shrinking MDS cluster."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
  ceph: avoid 32-bit page index overflow
  ceph: return EIO on invalid layout on GET_DATALOC ioctl
  rbd: BUG on invalid layout
  ceph: propagate layout error on osd request creation
  libceph: check for invalid mapping
  ceph: convert to use le32_add_cpu()
  ceph: Fix oops when handling mdsmap that decreases max_mds
  rbd: update remaining header fields for v2
  rbd: get snapshot name for a v2 image
  rbd: get the snapshot context for a v2 image
  rbd: get image features for a v2 image
  rbd: get the object prefix for a v2 rbd image
  rbd: add code to get the size of a v2 rbd image
  rbd: lay out header probe infrastructure
  rbd: encapsulate code that gets snapshot info
  rbd: add an rbd features field
  rbd: don't use index in __rbd_add_snap_dev()
  rbd: kill create_snap sysfs entry
  rbd: define rbd_dev_image_id()
  rbd: define some new format constants
  ...
2012-10-08 06:38:18 +09:00
Linus Torvalds dc92b1f9ab Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio changes from Rusty Russell:
 "New workflow: same git trees pulled by linux-next get sent straight to
  Linus.  Git is awkward at shuffling patches compared with quilt or mq,
  but that doesn't happen often once things get into my -next branch."

* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
  lguest: fix occasional crash in example launcher.
  virtio-blk: Disable callback in virtblk_done()
  virtio_mmio: Don't attempt to create empty virtqueues
  virtio_mmio: fix off by one error allocating queue
  drivers/virtio/virtio_pci.c: fix error return code
  virtio: don't crash when device is buggy
  virtio: remove CONFIG_VIRTIO_RING
  virtio: add help to CONFIG_VIRTIO option.
  virtio: support reserved vqs
  virtio: introduce an API to set affinity for a virtqueue
  virtio-ring: move queue_index to vring_virtqueue
  virtio_balloon: not EXPERIMENTAL any more.
  virtio-balloon: dependency fix
  virtio-blk: fix NULL checking in virtblk_alloc_req()
  virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
  virtio-blk: Add bio-based IO path for virtio-blk
  virtio: console: fix error handling in init() function
  tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
  tools: Add guest trace agent as a user tool
  virtio/console: Allocate scatterlist according to the current pipe size
  ...
2012-10-07 21:04:56 +09:00
Linus Torvalds f1c6872e49 Features:
* Allow a Linux guest to boot as initial domain and as normal guests
    on Xen on ARM (specifically ARMv7 with virtualized extensions).
    PV console, block and network frontend/backends are working.
 Bug-fixes:
  * Fix compile linux-next fallout.
  * Fix PVHVM bootup crashing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQbJELAAoJEFjIrFwIi8fJSI4H/32qrQKyF5IIkFKHTN9FYDC1
 OxEGc4y47DIQpGUd/PgZ/i6h9Iyhj+I6pb4lCevykwgd0j83noepluZlCIcJnTfL
 HVXNiRIQKqFhqKdjTANxVM4APup+7Lqrvqj6OZfUuoxaZ3tSTLhabJ/7UXf2+9xy
 g2RfZtbSdQ1sukQ/A2MeGQNT79rh7v7PrYQUYSrqytjSjSLPTqRf75HWQ+eapIAH
 X3aVz8Tn6nTixZWvZOK7rAaD4awsFxGP6E46oFekB02f4x9nWHJiCZiXwb35lORb
 tz9F9td99f6N4fPJ9LgcYTaCPwzVnceZKqE9hGfip4uT+0WrEqDxq8QmBqI5YtI=
 =gxJD
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull ADM Xen support from Konrad Rzeszutek Wilk:

  Features:
   * Allow a Linux guest to boot as initial domain and as normal guests
     on Xen on ARM (specifically ARMv7 with virtualized extensions).  PV
     console, block and network frontend/backends are working.
  Bug-fixes:
   * Fix compile linux-next fallout.
   * Fix PVHVM bootup crashing.

  The Xen-unstable hypervisor (so will be 4.3 in a ~6 months), supports
  ARMv7 platforms.

  The goal in implementing this architecture is to exploit the hardware
  as much as possible.  That means use as little as possible of PV
  operations (so no PV MMU) - and use existing PV drivers for I/Os
  (network, block, console, etc).  This is similar to how PVHVM guests
  operate in X86 platform nowadays - except that on ARM there is no need
  for QEMU.  The end result is that we share a lot of the generic Xen
  drivers and infrastructure.

  Details on how to compile/boot/etc are available at this Wiki:

    http://wiki.xen.org/wiki/Xen_ARMv7_with_Virtualization_Extensions

  and this blog has links to a technical discussion/presentations on the
  overall architecture:

    http://blog.xen.org/index.php/2012/09/21/xensummit-sessions-new-pvh-virtualisation-mode-for-arm-cortex-a15arm-servers-and-x86/

* tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (21 commits)
  xen/xen_initial_domain: check that xen_start_info is initialized
  xen: mark xen_init_IRQ __init
  xen/Makefile: fix dom-y build
  arm: introduce a DTS for Xen unprivileged virtual machines
  MAINTAINERS: add myself as Xen ARM maintainer
  xen/arm: compile netback
  xen/arm: compile blkfront and blkback
  xen/arm: implement alloc/free_xenballooned_pages with alloc_pages/kfree
  xen/arm: receive Xen events on ARM
  xen/arm: initialize grant_table on ARM
  xen/arm: get privilege status
  xen/arm: introduce CONFIG_XEN on ARM
  xen: do not compile manage, balloon, pci, acpi, pcpu and cpu_hotplug on ARM
  xen/arm: Introduce xen_ulong_t for unsigned long
  xen/arm: Xen detection and shared_info page mapping
  docs: Xen ARM DT bindings
  xen/arm: empty implementation of grant_table arch specific functions
  xen/arm: sync_bitops
  xen/arm: page.h definitions
  xen/arm: hypercalls
  ...
2012-10-07 07:13:01 +09:00
Ed Cashin 322c9ec009 aoe: update aoe-internal version number to 50
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:30 +09:00
Ed Cashin 1ac9e60262 aoe: remove unused code
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:30 +09:00
Ed Cashin 08b6062351 aoe: make dynamic block minor numbers the default
Because udev use is so widespread, making the old static mapping the
default is too conservative, given the severe limitations it places on
usable AoE addresses.  Storage virtualization and larger shelves have made
the old limitations too confining.

These changes make the dynamic block device minor numbers the default,
removing the limitations on usable AoE addresses.

The static arrangement is still available with aoe_dyndevs=0, and the
aoe-stat tool from the userland aoetools package, the user space
counterpart to the aoe driver, recognizes the case where there is a
mismatch between the minor number in sysfs and the minor number in a
special device file.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 7159e969d1 aoe: update and specify AoE address guards and error messages
In general, specific is better when it comes to messages about AoE usage
problems.  Also, explicit checks for the AoE broadcast addresses are
added.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 4bcce1a355 aoe: retain static block device numbers for backwards compatibility
The old mapping between AoE target shelf and slot addresses and the block
device minor number is retained as a backwards-compatible feature, with a
new "aoe_dyndevs" module parameter available for enabling dynamic block
device minor numbers.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 0c96621458 aoe: support more AoE addresses with dynamic block device minor numbers
The ATA over Ethernet protocol uses a major (shelf) and minor (slot)
address to identify a particular storage target.  These changes remove an
artificial limitation the aoe driver imposes on the use of AoE addresses.
For example, without these changes, the slot address has a maximum of 15,
but users commonly use slot numbers much greater than that.

The AoE shelf and slot address space is often used sparsely.  Instead of
using a static mapping between AoE addresses and the block device minor
number, the block device minor numbers are now allocated on demand.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:28 +09:00
Ed Cashin fea05a26c3 aoe: update copyright year in touched files
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:28 +09:00
Ed Cashin 7392fbe5ad aoe: update internal version number to 49
The internal version number of the aoe driver appears in a console message
when the driver loads and is usually obtained by the user with the
userland aoe-version tool, part of the aoetools.[1]

Although this patchset includes bugfixes backported from higher-numbered
versions published on the coraid.com website, it is a form of version 49.

1. http://aoetools.sourceforge.net/

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin b21faa25c6 aoe: remove unused code and add cosmetic improvements
This change removes some unused code and attempts to increase code
consistency.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 1b86fda9ad aoe: increase net_device reference count while using it
This change eliminates the danger that the user could rmmod the driver for
a network interface that is being used for AoE by the aoe driver.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 64a80f5ac7 aoe: associate frames with the AoE storage target
In the driver code, "target" and aoetgt refer to a particular remote
interface on the AoE storage target.  The latter is identified by its AoE
major and minor addresses.  Commands that are being sent to an AoE storage
target {major, minor} can be sent or retransmitted to any of the remote
MAC addresses associated with the AoE storage target.

That is, frames are naturally associated with not an aoetgt (AoE major,
AoE minor, remote MAC address) but an aoedev (AoE major, AoE minor).
Making the code reflect that reality simplifies the driver, especially
when the path to a remote MAC address becomes unusable.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 6583303c5e aoe: disallow unsupported AoE minor addresses
A guard is inserted to prevent AoE minor addresses (slot addresses) higher
than 15 to be used, as they are not yet supported by the driver.

There is a change coming that will allow the aoe driver to overcome this
limit by using system device minor numbers dynamically, but until then,
this guard prevents unexpected targets from being used by the driver when
AoE targets with high minor numbers are on the AoE network.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin 25f4d75ea4 aoe: do revalidation steps in order
The discovery process begins with an optional AoE config query command and
an AoE config query response.  Normally when an aoe device is already
open, the config query response does not trigger an ATA identify device
command to be sent out, since the response contains storage capacity
information that, if changed, could surprise the user of the device.

The userland "aoe-revalidate" tool uses a character device to trigger an
AoE config query for a particular AoE storage target and an ATA device
identify command, even when the device is open.

This change causes the config query to go out first, reflecting the normal
discovery sequence.  The responses could come back in any order, so this
change is fairly cosmetic.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin d54d35ac66 aoe: failover remote interface based on aoe_deadsecs parameter
The aoe_deadsecs module parameter allows the user to specify a hard limit
on the number of seconds an AoE command can be retransmitted before the
AoE block device is considered to have failed.

Using aoe_deadsecs to determine the time we try using a different remote
interface helps to ensure that the hard limit is not reached before we've
tried to recover by sending to a different remote port.

As a data storage target, the AoE target is unambiguously identified by
its {major, minor} AoE address tuple, and an AoE target can have multiple
MAC addresses.  However, note that "target" in the driver code and
comments means a {major, minor, MAC address} tuple, as in "somewhere to
send packets".

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin 3f0f013374 aoe: use packets that work with the smallest-MTU local interface
Users with several network interfaces dedicated to AoE generally do not
configure them to support different-sized AoE data payloads on purpose.

For a given AoE target, there will be a set of local network interfaces
that can reach it.  Using only the payload that will fit in the
smallest-sized MTU of all those local interfaces greatly simplifies the
driver, especially in failure scenarios.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin eb086ec596 aoe: use a kernel thread for transmissions
The dev_queue_xmit function needs to have interrupts enabled, so the most
simple way to get the locking right but still fulfill that requirement is
to use a process that can call dev_queue_xmit serially over queued
transmissions.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin 69cf2d85de aoe: become I/O request queue handler for increased user control
To allow users to choose an elevator algorithm for their particular
workloads, change from a make_request-style driver to an
I/O-request-queue-handler-style driver.

We have to do a couple of things that might be surprising.  We manipulate
the page _count directly on the assumption that we still have no guarantee
that users of the block layer are prohibited from submitting bios
containing pages with zero reference counts.[1] If such a prohibition now
exists, I can get rid of the _count manipulation.

Just as before this patch, we still keep track of the sk_buffs that the
network layer still hasn't finished yet and cap the resources we use with
a "pool" of skbs.[2]

Now that the block layer maintains the disk stats, the aoe driver's
diskstats function can go away.

1. https://lkml.org/lkml/2007/3/1/374
2. https://lkml.org/lkml/2007/7/6/241

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin 896831f590 aoe: kernel thread handles I/O completions for simple locking
Make the frames the aoe driver uses to track the relationship between bios
and packets more flexible and detached, so that they can be passed to an
"aoe_ktio" thread for completion of I/O.

The frames are handled much like skbs, with a capped amount of
preallocation so that real-world use cases are likely to run smoothly and
degenerate gracefully even under memory pressure.

Decoupling I/O completion from the receive path and serializing it in a
process makes it easier to think about the correctness of the locking in
the driver, especially in the case of a remote MAC address becoming
unusable.

[dan.carpenter@oracle.com: cleanup an allocation a bit]
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Ed Cashin 3d5b06051c aoe: for performance support larger packet payloads
tAdd adds the ability to work with large packets composed of a number of
segments, using the scatter gather feature of the block layer (biovecs)
and the network layer (skb frag array).  The motivation is the performance
gained by using a packet data payload greater than a page size and by
using the network card's scatter gather feature.

Users of the out-of-tree aoe driver already had these changes, but since
early 2011, they have complained of increased memory utilization and
higher CPU utilization during heavy writes.[1] The commit below appears
related, as it disables scatter gather on non-IP protocols inside the
harmonize_features function, even when the NIC supports sg.

  commit f01a5236bd
  Author: Jesse Gross <jesse@nicira.com>
  Date:   Sun Jan 9 06:23:31 2011 +0000

      net offloading: Generalize netif_get_vlan_features().

With that regression in place, transmits always linearize sg AoE packets,
but in-kernel users did not have this patch.  Before 2.6.38, though, these
changes were working to allow sg to increase performance.

1. http://www.spinics.net/lists/linux-mm/msg15184.html

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Paul Clements a336d29870 nbd: handle discard requests
Add discard support to nbd.  If the nbd-server supports discard, it will
send NBD_FLAG_SEND_TRIM to the client.  The client will then set the flag
in the kernel via NBD_SET_FLAGS, which tells the kernel to enable discards
for the device (QUEUE_FLAG_DISCARD).

If discard support is enabled, then when the nbd client system receives a
discard request, this will be passed along to the nbd-server.  When the
discard request is received by the nbd-server, it will perform:

	fallocate(.. FALLOC_FL_PUNCH_HOLE ..)

To punch a hole in the backend storage, which is no longer needed.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Paul Clements 2f01250888 nbd: add set flags ioctl
Add a set-flags ioctl, allowing various option flags to be set on an nbd
device.  This allows the nbd-client to set the device flags (to enable
read-only mode, or enable discard support, etc.).

Flags are typically specified by the nbd-server.  During the negotiation
phase of the nbd connection, the server sends its flags to the client.
The client then uses NBD_SET_FLAGS to inform the kernel of the options.

Also included is a one-line fix to debug output for the set-timeout ioctl.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:23 +09:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Sage Weil 6cae3717cd rbd: BUG on invalid layout
This shouldn't actually be possible because the layout struct is
constructed from the RBD header and validated then.

[elder@inktank.com: converted BUG() call to equivalent rbd_assert()]

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-01 17:20:00 -05:00
Linus Torvalds d9a807461f USB merge for 3.7-rc1
Here is the big USB pull request for 3.7-rc1
 
 There are lots of gadget driver changes (including copying a bunch of
 files into the drivers/staging/ccg/ directory so that the other gadget
 drivers can be fixed up properly without breaking that driver), and we
 remove the old obsolete ub.c driver from the tree.  There are also the
 usual XHCI set of updates, and other various driver changes and updates.
 We also are trying hard to remove the old dbg() macro, but the final
 bits of that removal will be coming in through the networking tree
 before we can delete it for good.
 
 All of these patches have been in the linux-next tree.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp3+AACgkQMUfUDdst+ym5vwCfe93FyJyXn/RDkGz7iBemvWFd
 vrwAoIxjaOa4/yWZWcgrWc5bP4aO3ssc
 =jYDr
 -----END PGP SIGNATURE-----

Merge tag 'usb-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB changes from Greg Kroah-Hartman:
 "Here is the big USB pull request for 3.7-rc1

  There are lots of gadget driver changes (including copying a bunch of
  files into the drivers/staging/ccg/ directory so that the other gadget
  drivers can be fixed up properly without breaking that driver), and we
  remove the old obsolete ub.c driver from the tree.

  There are also the usual XHCI set of updates, and other various driver
  changes and updates.  We also are trying hard to remove the old dbg()
  macro, but the final bits of that removal will be coming in through
  the networking tree before we can delete it for good.

  All of these patches have been in the linux-next tree.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up several annoying - but fairly mindless - conflicts due to the
termios structure having moved into the tty device, and often clashing
with dbg -> dev_dbg conversion.

* tag 'usb-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (339 commits)
  USB: ezusb: move ezusb.c from drivers/usb/serial to drivers/usb/misc
  USB: uas: fix gcc warning
  USB: uas: fix locking
  USB: Fix race condition when removing host controllers
  USB: uas: add locking
  USB: uas: fix abort
  USB: uas: remove aborted field, replace with status bit.
  USB: uas: fix task management
  USB: uas: keep track of command urbs
  xhci: Intel Panther Point BEI quirk.
  powerpc/usb: remove checking PHY_CLK_VALID for UTMI PHY
  USB: ftdi_sio: add TIAO USB Multi-Protocol Adapter (TUMPA) support
  Revert "usb : Add sysfs files to control port power."
  USB: serial: remove vizzini driver
  usb: host: xhci: Fix Null pointer dereferencing with 71c731a for non-x86 systems
  Increase XHCI suspend timeout to 16ms
  USB: ohci-at91: fix null pointer in ohci_hcd_at91_overcurrent_irq
  USB: sierra_ms: don't keep unused variable
  fsl/usb: Add support for USB controller version 2.4
  USB: qcaux: add Pantech vendor class match
  ...
2012-10-01 13:23:01 -07:00
Alex Elder 6e14b1a6c3 rbd: update remaining header fields for v2
There are three fields that are not yet updated for format 2 rbd
image headers:  the version of the header object; the encryption
type; and the compression type.  There is no interface defined for
fetching the latter two, so just initialize them explicitly to 0 for
now.

Change rbd_dev_v2_snap_context() so the caller can be supplied the
version for the header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:54 -05:00
Alex Elder b8b1e2db52 rbd: get snapshot name for a v2 image
Define rbd_dev_v2_snap_name() to fetch the name for a particular
snapshot in a format 2 rbd image.

Define rbd_dev_v2_snap_info() to to be a wrapper for getting the
name, size, and features for a particular snapshot, using an
interface that matches the equivalent function for version 1 images.

Define rbd_dev_snap_info() wrapper function and use it to call the
appropriate function for getting the snapshot name, size, and
features, dependent on the rbd image format.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:54 -05:00
Alex Elder 35d489f946 rbd: get the snapshot context for a v2 image
Fetch the snapshot context for an rbd format 2 image by calling
the "get_snapcontext" method on its header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder b1b5402aa9 rbd: get image features for a v2 image
The features values for an rbd format 2 image are fetched from the
server using a "get_features" method.  The same method is used for
getting the features for a snapshot, so structure this addition with
a generic helper routine that can get this information for either.

The server will provide two 64-bit feature masks, one representing
the features potentially in use for this image (or its snapshot),
and one representing features that must be supported by the client
in order to work with the image.

For the time being, neither of these is really used so we keep
things simple and just record the first feature vector.  Once we
start using these feature masks, what we record and what we expose
to the user will most likely change.

Signed-off-by: Alex Elder <elder@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 1e1301998e rbd: get the object prefix for a v2 rbd image
The object prefix of an rbd format 2 image is fetched from the
server using a "get_object_prefix" method.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 9d475de5d1 rbd: add code to get the size of a v2 rbd image
The size of an rbd format 2 image is fetched from the server using a
"get_size" method.  The same method is used for getting the size of
a snapshot, so structure this addition with a generic helper routine
that we can get this information for either.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder a30b71b999 rbd: lay out header probe infrastructure
This defines a new function rbd_dev_probe() as a top-level
function for populating detailed information about an rbd device.

It first checks for the existence of a format 2 rbd image id object.
If it exists, the image is assumed to be a format 2 rbd image, and
another function rbd_dev_v2() is called to finish populating
header data for that image.  If it does not exist, it is assumed to
be an old (format 1) rbd image, and calls a similar function
rbd_dev_v1() to populate its header information.

A new field, rbd_dev->format, is defined to record which version
of the rbd image format the device represents.  For a valid mapped
rbd device it will have one of two values, 1 or 2.

So far, the format 2 images are not really supported; this is
laying out the infrastructure for fleshing out that support.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder cd892126c6 rbd: encapsulate code that gets snapshot info
Create a function that encapsulates looking up the name, size and
features related to a given snapshot, which is indicated by its
index in an rbd device's snapshot context array of snapshot ids.

This interface will be used to hide differences between the format 1
and format 2 images.

At the moment this (looking up the name anyway) is slightly less
efficient than what's done currently, but we may be able to optimize
this a bit later on by cacheing the last lookup if it proves to be a
problem.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 34b131849f rbd: add an rbd features field
Record the features values for each rbd image and each of its
snapshots.  This is really something that only becomes meaningful
for version 2 images, so this is just putting in place code
that will form common infrastructure.

It may be useful to expand the sysfs entries--and therefore the
information we maintain--for the image and for each snapshot.
But I'm going to hold off doing that until we start making
active use of the feature bits.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder c8d184250d rbd: don't use index in __rbd_add_snap_dev()
Pass the snapshot id and snapshot size rather than an index
to __rbd_add_snap_dev() to specify values for a new snapshot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 02cdb02cea rbd: kill create_snap sysfs entry
Josh proposed the following change, and I don't think I could
explain it any better than he did:

    From: Josh Durgin <josh.durgin@inktank.com>
    Date: Tue, 24 Jul 2012 14:22:11 -0700
    To: ceph-devel <ceph-devel@vger.kernel.org>
    Message-ID: <500F1203.9050605@inktank.com>

    Right now the kernel still has one piece of rbd management
    duplicated from the rbd command line tool: snapshot creation.
    There's nothing special about snapshot creation that makes it
    advantageous to do from the kernel, so I'd like to remove the
    create_snap sysfs interface.  That is,
	/sys/bus/rbd/devices/<id>/create_snap
    would be removed.

    Does anyone rely on the sysfs interface for creating rbd
    snapshots?  If so, how hard would it be to replace with:

	rbd snap create pool/image@snap

    Is there any benefit to the sysfs interface that I'm missing?

    Josh

This patch implements this proposal, removing the code that
implements the "snap_create" sysfs interface for rbd images.
As a result, quite a lot of other supporting code goes away.

Suggested-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 589d30e0b3 rbd: define rbd_dev_image_id()
New format 2 rbd images are permanently identified by a unique image
id.  Each rbd image also has a name, but the name can be changed.
A format 2 rbd image will have an object--whose name is based on the
image name--which maps an image's name to its image id.

Create a new function rbd_dev_image_id() that checks for the
existence of the image id object, and if it's found, records the
image id in the rbd_device structure.

Create a new rbd device attribute (/sys/bus/rbd/<num>/image_id) that
makes this information available.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 3bb59ad515 rbd: define some new format constants
Define constant symbols related to the rbd format 2 object names.
This begins to bring this version of the "rbd_types.h" header
more in line with the current user-space version of that file.
Complete reconciliation of differences will be done at some
point later, as a separate task.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder f8d4de6e1c rbd: support data returned from OSD methods
An OSD object method call can be made using rbd_req_sync_exec().
Until now this has only been used for creating a new RBD snapshot,
and that has only required sending data out, not receiving anything
back from the OSD.

We will now need to get data back from an OSD on a method call, so
add parameters to rbd_req_sync_exec() that allow a buffer into which
returned data should be placed to be specified, along with its size.

Previously, rbd_req_sync_exec() passed a null pointer and zero
size to rbd_req_sync_op(); change this so the new inbound buffer
information is provided instead.

Rename the "buf" and "len" parameters in rbd_req_sync_op() to
make it more obvious they are describing inbound data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 3cb4a687c7 rbd: pass flags to rbd_req_sync_exec()
In order to allow both read requests and write requests to be
initiated using rbd_req_sync_exec(), add an OSD flags value
which can be passed down to rbd_req_sync_op().  Rename the "data"
and "len" parameters to be more clear that they represent data
that is outbound.

At this point, this function is still only used (and only works) for
write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 3ee4001e0c rbd: set up watch before announcing disk
We're ready to handle header object (refresh) events at the point we
call rbd_bus_add_dev().  Set up the watch request on the rbd image
header just after that, and after we've registered the devices for
the snapshots for the initial snapshot context.  Do this before
announce the disk as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 12f029448c rbd: set initial capacity in rbd_init_disk()
Move the setting of the initial capacity for an rbd image mapping
into rb_init_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 86ff77bb68 rbd: drop dev registration check for new snap
By the time rbd_dev_snaps_register() gets called during rbd device
initialization, the main device will have already been registered.
Similarly, a header refresh will only occur for an rbd device whose
Linux device is registered.  There is therefore no need to verify
the main device is registered when registering a snapshot device.

For the time being, turn the check into a WARN_ON(), but it can
eventually just go away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 0f308a3188 rbd: call rbd_init_disk() sooner
Call rbd_init_disk() from rbd_add() as soon as we have the major
device number for the mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 85ae892675 rbd: defer setting device id
Hold off setting the device id and formatting the device name
in rbd_add() until just before it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 05fd6f6f8c rbd: read the header before registering device
Read the rbd header information and call rbd_dev_set_mapping()
earlier--before registering the block device or setting up the sysfs
entries for the image.  The sysfs entries provide users access to
some information that's only available after doing the rbd header
initialization, so this will make sure it's valid right away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 5ed1617731 rbd: call set_snap() before snap_devs_update()
rbd_header_set_snap() is a simple initialization routine for an rbd
device's mapping.  It has to be called after the snapshot context
for the rbd_dev has been updated, but can be done before snapshot
devices have been registered.

Change the name to rbd_dev_set_mapping() to better reflect its
purpose, and call it a little sooner, before registering snapshot
devices.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 304f68086f rbd: defer registering snapshot devices
When a new snapshot is found in an rbd device's updated snapshot
context, __rbd_add_snap_dev() is called to create and insert an
entry in the rbd devices list of snapshots.  In addition, a Linux
device is registered to represent the snapshot.

For version 2 rbd images, it will be undesirable to initialize the
device right away.  So in anticipation of that, this patch separates
the insertion of a snapshot entry in the snaps list from the
creation of devices for those snapshots.

To do this, create a new function rbd_dev_snaps_register() which
traverses the list of snapshots and calls rbd_register_snap_dev()
on any that have not yet been registered.

Rename rbd_dev_snap_devs_update() to be rbd_dev_snaps_update()
to better reflect that only the entry in the snaps list and not
the snapshot's device is affected by the function.

For now, call rbd_dev_snaps_register() immediately after each
call to rbd_dev_snaps_update().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 3fcf2581c2 rbd: assign header name later
Move the assignment of the header name for an rbd image a bit later,
outside rbd_add_parse_args() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder e86924a809 rbd: use snaps list in rbd_snap_by_name()
An rbd_dev structure maintains a list of current snapshots that have
already been fully initialized.  The entries on the list have type
struct rbd_snap, and each entry contains a copy of information
that's found in the rbd_dev's snapshot context and header.

The only caller of snap_by_name() is rbd_header_set_snap().  In that
call site any positive return value (the index in the snapshot
array) is ignored, so there's no need to return the index in
the snapshot context's id array when it's found.

rbd_header_set_snap() also has only one caller--rbd_add()--and that
call is made after a call to rbd_dev_snap_devs_update().  Because
the rbd_snap structures are initialized in that function, the
current snapshot list can be used instead of the snapshot context to
look up a snapshot's information by name.

Change snap_by_name() so it uses the snapshot list rather than the
rbd_dev's snapshot context in looking up snapshot information.
Return 0 if it's found rather than the snapshot id.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder cd789ab9ca rbd: don't register snapshots in bus_add_dev()
When rbd_bus_add_dev() is called (one spot--in rbd_add()), the rbd
image header has not even been read yet.  This means that the list
of snapshots will be empty at the time of the call.  As a result,
there is no need for the code that calls rbd_register_snap_dev()
for each entry in that list--so get rid of it.

Once the header has been read (just after returning), a call will
be made to rbd_dev_snap_devs_update(), which will then find every
snapshot in the context to be new and will therefore call
rbd_register_snap_dev() via __rbd_add_snap_dev() accomplishing
the same thing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 4bb1f1ed00 rbd: move locking out of rbd_header_set_snap()
Move the calls to get the header semaphore out of
rbd_header_set_snap() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 1fcdb8aa1f rbd: simplify rbd_init_disk() a bit
This just simplifies a few things in rbd_init_disk(), now that the
previous patch has moved a bunch of initialization code out if it.
Done separately to facilitate review.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 2ac4e75d89 rbd: do some header initialization earlier
Move some of the code that initializes an rbd header out of
rbd_init_disk() and into its caller.

Move the code at the end of rbd_init_disk() that sets the device
capacity and activates the Linux device out of that function and
into the caller, ensuring we still have the disk size available
where we need it.

Update rbd_free_disk() so it still aligns well as an inverse of
rbd_init_disk(), moving the rbd_header_free() call out to its
caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 8836b995fd rbd: simplify snap_by_name() interface
There is only one caller of snap_by_name(), and it passes two values
to be assigned, both of which are found within an rbd device
structure.

Change the interface so it just passes the address of the rbd_dev,
and make the assignments to its fields directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 4e1105a299 rbd: set mapping name with the rest
With the exception of the snapshot name, all of the mapping-specific
fields in an rbd device structure are set in rbd_header_set_snap().

Pass the snapshot name to be assigned into rbd_header_set_snap()
to keep all of the mapping assignments together.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 3feeb89467 rbd: return snap name from rbd_add_parse_args()
This is the first of two patches aimed at isolating the code that
sets the mapping information into a single spot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 99c1f08f64 rbd: record mapped size
Add the size of the mapped image to the set of mapping-specific
fields in an rbd_device, and use it when setting the capacity of the
disk.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder f84344f334 rbd: separate mapping info in rbd_dev
Several fields in a struct rbd_dev are related to what is mapped, as
opposed to the actual base rbd image.  If the base image is mapped
these are almost unneeded, but if a snapshot is mapped they describe
information about that snapshot.

In some contexts this can be a little bit confusing.  So group these
mapping-related field into a structure to make it clear what they
are describing.

This also includes a minor change that rearranges the fields in the
in-core image header structure so that invariant fields are at the
top, followed by those that change.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder c9aadfe786 rbd: kill rbd_image_header->total_snaps
The "total_snaps" field in an rbd header structure is never any
different from the value of "num_snaps" stored within a snapshot
context.  Avoid any confusion by just using the value held within
the snapshot context, and get rid of the "total_snaps" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 98cec111c0 rbd: kill rbd_dev->q
A copy of rbd_dev->disk->queue is held in rbd_dev->q, but it's
never actually used.  So get just get rid of the field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 9fcbb80024 rbd: rename __rbd_init_snaps_header()
The name __rbd_init_snaps_header() doesn't really convey what that
function does very well.  Its purpose is to scan a new snapshot
context and either create or destroy snapshot device entries so
that local host's view is consistent with the reality maintained
on the OSDs.  This patch just changes the name of this function,
to be rbd_dev_snap_devs_update().  Still not perfect, but I think
better.

Also add some dynamic debug statements to this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder e28393082d rbd: rename rbd_id_get()
This should have been done as part of this commit:

    commit de71a2970d
    Author: Alex Elder <elder@inktank.com>
    Date:   Tue Jul 3 16:01:19 2012 -0500
    rbd: rename rbd_device->id

rbd_id_get() is assigning the rbd_dev->dev_id field.  Change the
name of that function as well as rbd_id_put() and rbd_id_max
to reflect what they are affecting.

Add some dynamic debug statements related to rbd device id activity.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder aafb230ebc rbd: define rbd_assert()
Define rbd_assert() and use it in place of various BUG_ON() calls
now present in the code.  By default assertion checking is enabled;
we want to do this differently at some point.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 65ccfe21dd rbd: split up rbd_get_segment()
There are two places where rbd_get_segment() is called.  One, in
rbd_rq_fn(), only needs to know the length within a segment that an
I/O request should be.  The other, in rbd_do_op(), also needs the
name of the object and the offset within it for the I/O request.

Split out rbd_segment_name() into three dedicated functions:
    - rbd_segment_name() allocates and formats the name of the
      object for a segment containing a given rbd image offset
    - rbd_segment_offset() computes the offset within a segment for
      a given rbd image offset
    - rbd_segment_length() computes the length to use for I/O within
      a segment for a request, not to exceed the end of a segment
      object.

In the new functions be a bit more careful, checking for possible
error conditions:
    - watch for errors or overflows returned by snprintf()
    - catch (using BUG_ON()) potential overflow conditions
      when computing segment length

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder df111be631 rbd: check for overflow in rbd_get_num_segments()
It is possible in rbd_get_num_segments() for an overflow to occur
when adding the offset and length.  This is easily avoided.

Since the function returns an int and the one caller is already
prepared to handle errors, have it return -ERANGE if overflow would
occur.

The overflow check would not work if a zero-length request was
being tested, so short-circuit that case, returning 0 for the
number of segments required.  (This condition might be avoided
elsewhere already, I don't know.)

Have the caller end the request if either an error or 0 is returned.
The returned value is passed to __blk_end_request_all(), meaning
a 0 length request is not treated an error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 38f5f65e9d rbd: drop needless test in rbd_rq_fn()
There's a test for null rq pointer inside the while loop in
rbd_rq_fn() that's not needed.  That same test already occurred
in the immediatly preceding loop condition test.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 542582fce1 rbd: bio_chain_clone() cleanups
In bio_chain_clone(), at the end of the function the bi_next field
of the tail of the new bio chain is nulled.  This isn't necessary,
because if "tail" is non-null, its value will be the last bio
structure allocated at the top of the while loop in that function.
And before that structure is added to the end of the new chain, its
bi_next pointer is always made null.

While touching that function, clean a few other things:
    - define each local variable on its own line
    - move the definition of "tmp" to an inner scope
    - move the modification of gfpmask closer to where it's used
    - rearrange the logic that sets the chain's tail pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 84d34dcc11 rbd: kill notify_timeout option
The "notify_timeout" rbd device option is never used, so get rid of
it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder cc0538b62c rbd: add read_only rbd map option
Add the ability to map an rbd image read-only, by specifying either
"read_only" or "ro" as an option on the rbd "command line."  Also
allow the inverse to be explicitly specified using "read_write" or
"rw".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder f8c3892911 rbd: move rbd_opts to struct rbd_device
The rbd options don't really apply to the ceph client.  So don't
store a pointer to it in the ceph_client structure, and put them
(a struct, not a pointer) into the rbd_dev structure proper.

Pass the rbd device structure to rbd_client_create() so it can
assign rbd_dev->rbdc if successful, and have it return an error code
instead of the rbd client pointer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 621901d652 rbd: more cleanup in rbd_header_from_disk()
This just rearranges things a bit more in rbd_header_from_disk()
so that the snapshot sizes are initialized right after the buffer
to hold them is allocated and doing a little further consolidation
that follows from that.  Also adds a few simple comments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder f785cc1dbe rbd: kill incore snap_names_len
The only thing the on-disk snap_names_len field is needed is to
size the buffer allocated to hold a copy of the snapshot names
for an rbd image.

So don't bother saving it in the in-core rbd_image_header structure.
Just use a local variable to hold the required buffer size while
it's needed.

Move the code that actually copies the snapshot names up closer
to where the required length is saved.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 58c17b0e1b rbd: don't over-allocate space for object prefix
In rbd_header_from_disk() the object prefix buffer is sized based on
the maximum size it's block_name equivalent on disk could be.

Instead, only allocate enough to hold null-terminated string from
the on-disk header--or the maximum size of no NUL is found.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 1f7ba33115 rbd: handle locking inside __rbd_client_find()
There is only caller of __rbd_client_find(), and it somewhat
clumsily gets the appropriate lock and gets a reference to the
existing ceph_client structure if it's found.

Instead, have that function handle its own locking, and acquire the
reference if found while it holds the lock.  Drop the underscores
from the name because there's no need to signify anything special
about this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 523f32582f rbd: add new snapshots at the tail
This fixes a bug that went in with this commit:

    commit f6e0c99092cca7be00fca4080cfc7081739ca544
    Author: Alex Elder <elder@inktank.com>
    Date:   Thu Aug 2 11:29:46 2012 -0500
    rbd: simplify __rbd_init_snaps_header()

The problem is that a new rbd snapshot needs to go either after an
existing snapshot entry, or at the *end* of an rbd device's snapshot
list.  As originally coded, it is placed at the beginning.  This was
based on the assumption the list would be empty (so it wouldn't
matter), but in fact if multiple new snapshots are added to an empty
list in one shot the list will be non-empty after the first one is
added.

This addresses http://tracker.newdream.net/issues/3063

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 843a0d0879 rbd: rename block_name -> object_prefix
In the on-disk image header structure there is a field "block_name"
which represents what we now call the "object prefix" for an rbd
image.  Rename this field "object_prefix" to be consistent with
modern usage.

This appears to be the only remaining vestige of the use of "block"
in symbols that represent objects in the rbd code.

This addresses http://tracker.newdream.net/issues/1761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 4156d99840 rbd: separate reading header from decoding it
Right now rbd_read_header() both reads the header object for an rbd
image and decodes its contents.  It does this repeatedly if needed,
in order to ensure a complete and intact header is obtained.

Separate this process into two steps--reading of the raw header
data (in new function, rbd_dev_v1_header_read()) and separately
decoding its contents (in rbd_header_from_disk()).  As a result,
the latter function no longer requires its allocated_snaps argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 103a150f0c rbd: expand rbd_dev_ondisk_valid() checks
Add checks on the validity of the snap_count and snap_names_len
field values in rbd_dev_ondisk_valid().  This eliminates the
need to do them in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 28cb775de1 rbd: return earlier in rbd_header_from_disk()
The only caller of rbd_header_from_disk() is rbd_read_header().
It passes as allocated_snaps the number of snapshots it will
have received from the server for the snapshot context that
rbd_header_from_disk() is to interpret.  The first time through
it provides 0--mainly to extract the number of snapshots from
the snapshot context header--so that it can allocate an
appropriately-sized buffer to receive the entire snapshot
context from the server in a second request.

rbd_header_from_disk() will not fill in the array of snapshot ids
unless the number in the snapshot matches the number the caller
had allocated.

This patch adjusts that logic a little further to be more efficient.
rbd_read_header() doesn't even examine the snapshot context unless
the snapshot count (stored in header->total_snaps) matches the
number of snapshots allocated.  So rbd_header_from_disk() doesn't
need to allocate or fill in the snapshot context field at all in
that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 6a52325f61 rbd: rearrange rbd_header_from_disk()
This just moves code around for the most part.  It was pulled out as
a separate patch to avoid cluttering up some upcoming patches which
are more substantive.  The point is basically to group everything
related to initializing the snapshot context together.

The only functional change is that rbd_header_from_disk() now
ensures the (in-core) header it is passed is zero-filled.  This
allows a simpler error handling path in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder d2bb24e506 rbd: use sizeof (object) instead of sizeof (type)
Fix a few spots in rbd_header_from_disk() to use sizeof (object)
rather than sizeof (type).  Use a local variable to record sizes
to shorten some lines and improve readability.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder d78fd7ae03 rbd: ensure invalid pointers are made null
Fix a number of spots where a pointer value that is known to
have become invalid but was not reset to null.

Also, toss in a change so we use sizeof (object) rather than
sizeof (type).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 0f1d3f9385 rbd: make snap_names_len a u64
The snap_names_len field of an rbd_image_header structure is defined
with type size_t.  That field is used as both the source and target
of 64-bit byte-order swapping operations though, so it's best to
define it with type u64 instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 3593815022 rbd: simplify __rbd_init_snaps_header()
The purpose of __rbd_init_snaps_header() is to compare a new
snapshot context with an rbd device's list of existing snapshots.
It updates the list by adding any new snapshots or removing any
that are not present in the new snapshot context.

The code as written is a little confusing, because it traverses both
the existing snapshot list and the set of snapshots in the snapshot
context in reverse.  This was done based on an assumption about
snapshots that is not true--namely that a duplicate snapshot name
could cause an error in intepreting things if they were not
processed in ascending order.

These precautions are not necessary, because:
    - all snapshots are uniquely identified by their snapshot id
    - a new snapshot cannot be created if the rbd device has another
      snapshot with the same name
(It is furthermore not currently possible to rename a snapshot.)

This patch re-implements __rbd_init_snaps_header() so it passes
through both the existing snapshot list and the entries in the
snapshot context in forward order.  It still does the same thing
as before, but I find the logic considerably easier to understand.

By going forward through the names in the snapshot context, there
is no longer a need for the rbd_prev_snap_name() helper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Linus Torvalds fdb2f9c2eb PCI changes for the 3.7 merge window:
Host bridge hotplug
     - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
     - Clear host bridge resource info to avoid issue when releasing (Yinghai Lu)
     - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
     - Use standard list ops for acpi_pci_drivers (Jiang Liu)
 
   Device hotplug
     - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang Liu)
     - Remove fakephp driver (Bjorn Helgaas)
     - Fix VGA ref count in hotplug remove path (Yinghai Lu)
     - Allow acpiphp to handle PCIe ports without native hotplug (Jiang Liu)
     - Implement resume regardless of pciehp_force param (Oliver Neukum)
     - Make pci_fixup_irqs() work after init (Thierry Reding)
 
   Miscellaneous
     - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
     - Factor out PCI Express Capability accessors (Jiang Liu)
     - Add pcibios_window_alignment() so powerpc EEH can use generic resource assignment (Gavin Shan)
     - Make pci_error_handlers const (Stephen Hemminger)
     - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
     - Improve Vendor-Specific Extended Capability support (Bjorn Helgaas)
     - Use standard list ops for bus->devices (Bjorn Helgaas)
     - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
     - Reassign invalid bus number ranges (Intel DP43BF workaround) (Yinghai Lu)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQac4hAAoJEPGMOI97Hn6zjZYP/iaqU9kjmgTsBbSyzB4oApv/
 RRxo3I+ad9GF6XlMQfVAtyx1pgCD1gdGAtoDgGSCTqgdYD3CO10AxKU+yleAk1wo
 dbMxLifJNTrT3G1mZ/NL16yEGhCwvhfwzRtB1VoZmCT4lSApO/7cJkXl2DzHfA/i
 pmltOOiQCN8kbUcJbVPtUyTVPi2zl/8bsyCyTkS7YG0VXeGRM+ZUvPWZJ7MnWYYB
 5qoCdrw5ENCCiDQ9yw5SAfgL23b+0p6OI/x3Lkex0QQOWwSqGSiaHt4b7eitrC5b
 2eAJg32f/AzZke1YbKLMfdsL0VJP3GAswhDVHlgmo63rZkOZChm+97dgZ35Mcv5v
 kEXkWyBb1xJ3t8rZir6Qer9Iv2wOB+MkZ5qtU/Vf+l0wLQLYTrRVsKngrEDREONk
 dXbokp6iVSPeA1sTSdH9MmHlTUIj82ZLSGcxcjTsN8NWZjxx6g3rNx1uay+5MYOW
 4ET9zNu5snrAqN6N4Tb81gvtG8qYfxzdvVfrA9AaGKI6xxB7jkqgFJRp55JiEcFc
 x4cmWkhvdlhVsG2TQwFxYNfswOqD+7NCs6V4kSVZX6ezpDrH7I5VvcnnhstF7C8l
 KZul0EV7OW+kDK23pNe24lVP2xtOv6G8eK/3PmeKIXWl1V83nqre/oLufRzTfs+Z
 SxkILwY/MFpuCFteKE1t
 =haBu
 -----END PGP SIGNATURE-----

Merge tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug
    - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
    - Clear host bridge resource info to avoid issue when releasing
      (Yinghai Lu)
    - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
    - Use standard list ops for acpi_pci_drivers (Jiang Liu)

  Device hotplug
    - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang
      Liu)
    - Remove fakephp driver (Bjorn Helgaas)
    - Fix VGA ref count in hotplug remove path (Yinghai Lu)
    - Allow acpiphp to handle PCIe ports without native hotplug (Jiang
      Liu)
    - Implement resume regardless of pciehp_force param (Oliver Neukum)
    - Make pci_fixup_irqs() work after init (Thierry Reding)

  Miscellaneous
    - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
    - Factor out PCI Express Capability accessors (Jiang Liu)
    - Add pcibios_window_alignment() so powerpc EEH can use generic
      resource assignment (Gavin Shan)
    - Make pci_error_handlers const (Stephen Hemminger)
    - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
    - Improve Vendor-Specific Extended Capability support (Bjorn
      Helgaas)
    - Use standard list ops for bus->devices (Bjorn Helgaas)
    - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
    - Reassign invalid bus number ranges (Intel DP43BF workaround)
      (Yinghai Lu)"

* tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (102 commits)
  PCI: acpiphp: Handle PCIe ports without native hotplug capability
  PCI/ACPI: Use acpi_driver_data() rather than searching acpi_pci_roots
  PCI/ACPI: Protect acpi_pci_roots list with mutex
  PCI/ACPI: Use acpi_pci_root info rather than looking it up again
  PCI/ACPI: Pass acpi_pci_root to acpi_pci_drivers' add/remove interface
  PCI/ACPI: Protect acpi_pci_drivers list with mutex
  PCI/ACPI: Notify acpi_pci_drivers when hot-plugging PCI root bridges
  PCI/ACPI: Use normal list for struct acpi_pci_driver
  PCI/ACPI: Use DEVICE_ACPI_HANDLE rather than searching acpi_pci_roots
  PCI: Fix default vga ref_count
  ia64/PCI: Clear host bridge aperture struct resource
  x86/PCI: Clear host bridge aperture struct resource
  PCI: Stop all children first, before removing all children
  Revert "PCI: Use hotplug-safe pci_get_domain_bus_and_slot()"
  PCI: Provide a default pcibios_update_irq()
  PCI: Discard __init annotations for pci_fixup_irqs() and related functions
  PCI: Use correct type when freeing bus resource list
  PCI: Check P2P bridge for invalid secondary/subordinate range
  PCI: Convert "new_id"/"remove_id" into generic pci_bus driver attributes
  xen-pcifront: Use hotplug-safe pci_get_domain_bus_and_slot()
  ...
2012-10-01 12:05:36 -07:00
Linus Torvalds 21e98932dc Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver fixes from Matthew Wilcox:
 "Now that actual hardware has been released (don't have any yet
  myself), people are starting to want some of these fixes merged."

Willy doesn't have hardware? Guys...

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Cancel outstanding IOs on queue deletion
  NVMe: Free admin queue memory on initialisation failure
  NVMe: Use ida for nvme device instance
  NVMe: Fix whitespace damage in nvme_init
  NVMe: handle allocation failure in nvme_map_user_pages()
  NVMe: Fix uninitialized iod compiler warning
  NVMe: Do not set IO queue depth beyond device max
  NVMe: Set block queue max sectors
  NVMe: use namespace id for nvme_get_features
  NVMe: replace nvme_ns with nvme_dev for user admin
  NVMe: Fix nvme module init when nvme_major is set
  NVMe: Set request queue logical block size
2012-09-29 10:31:52 -07:00
Asias He bb8111086c virtio-blk: Disable callback in virtblk_done()
This reduces unnecessary interrupts that host could send to guest while
guest is in the progress of irq handling.

If one vcpu is handling the irq, while another interrupt comes, in
handle_edge_irq(), the guest will mask the interrupt via mask_msi_irq()
which is a very heavy operation that goes all the way down to host.

 Here are some performance numbers on qemu:

 Before:
 -------------------------------------
   seq-read  : io=0 B, bw=269730KB/s, iops=67432 , runt= 62200msec
   seq-write : io=0 B, bw=339716KB/s, iops=84929 , runt= 49386msec
   rand-read : io=0 B, bw=270435KB/s, iops=67608 , runt= 62038msec
   rand-write: io=0 B, bw=354436KB/s, iops=88608 , runt= 47335msec
     clat (usec): min=101 , max=138052 , avg=14822.09, stdev=11771.01
     clat (usec): min=96 , max=81543 , avg=11798.94, stdev=7735.60
     clat (usec): min=128 , max=140043 , avg=14835.85, stdev=11765.33
     clat (usec): min=109 , max=147207 , avg=11337.09, stdev=5990.35
   cpu          : usr=15.93%, sys=60.37%, ctx=7764972, majf=0, minf=54
   cpu          : usr=32.73%, sys=120.49%, ctx=7372945, majf=0, minf=1
   cpu          : usr=18.84%, sys=58.18%, ctx=7775420, majf=0, minf=1
   cpu          : usr=24.20%, sys=59.85%, ctx=8307886, majf=0, minf=0
   vdb: ios=8389107/8368136, merge=0/0, ticks=19457874/14616506,
 in_queue=34206098, util=99.68%
  43: interrupt in total: 887320
 fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

 After:
 -------------------------------------
   seq-read  : io=0 B, bw=309503KB/s, iops=77375 , runt= 54207msec
   seq-write : io=0 B, bw=448205KB/s, iops=112051 , runt= 37432msec
   rand-read : io=0 B, bw=311254KB/s, iops=77813 , runt= 53902msec
   rand-write: io=0 B, bw=377152KB/s, iops=94287 , runt= 44484msec
     clat (usec): min=81 , max=90588 , avg=12946.06, stdev=9085.94
     clat (usec): min=57 , max=72264 , avg=8967.97, stdev=5951.04
     clat (usec): min=29 , max=101046 , avg=12889.95, stdev=9067.91
     clat (usec): min=52 , max=106152 , avg=10660.56, stdev=4778.19
   cpu          : usr=15.05%, sys=57.92%, ctx=7710941, majf=0, minf=54
   cpu          : usr=26.78%, sys=101.40%, ctx=7387891, majf=0, minf=2
   cpu          : usr=19.03%, sys=58.17%, ctx=7681976, majf=0, minf=8
   cpu          : usr=24.65%, sys=58.34%, ctx=8442632, majf=0, minf=4
   vdb: ios=8389086/8361888, merge=0/0, ticks=17243780/12742010,
 in_queue=30078377, util=99.59%
  43: interrupt in total: 1259639
 fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:16 +09:30
Dan Carpenter f22cf8eb48 virtio-blk: fix NULL checking in virtblk_alloc_req()
Smatch complains about the inconsistent NULL checking here.  Fix it to
return NULL on failure.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed accidental deletion)
2012-09-28 15:05:14 +09:30
Asias He c85a1f91b3 virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
We need to support both REQ_FLUSH and REQ_FUA for bio based path since
it does not get the sequencing of REQ_FUA into REQ_FLUSH that request
based drivers can request.

REQ_FLUSH is emulated by:
A) If the bio has no data to write:
1. Send VIRTIO_BLK_T_FLUSH to device,
2. In the flush I/O completion handler, finish the bio

B) If the bio has data to write:
1. Send VIRTIO_BLK_T_FLUSH to device
2. In the flush I/O completion handler, send the actual write data to device
3. In the write I/O completion handler, finish the bio

REQ_FUA is emulated by:
1. Send the actual write data to device
2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to device
3. In the flush I/O completion handler, finish the bio

Changes in v7:
- Using vbr->flags to trace request type
- Dropped unnecessary struct virtio_blk *vblk parameter
- Reuse struct virtblk_req in bio done function

Cahnges in v6:
- Reworked REQ_FLUSH and REQ_FUA emulatation order

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:14 +09:30
Asias He a98755c559 virtio-blk: Add bio-based IO path for virtio-blk
This patch introduces bio-based IO path for virtio-blk.

Compared to request-based IO path, bio-based IO path uses driver
provided ->make_request_fn() method to bypasses the IO scheduler. It
handles the bio to device directly without allocating a request in block
layer. This reduces the IO path in guest kernel to achieve high IOPS
and lower latency. The downside is that guest can not use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk in host side uses faster disk device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, no performance difference is observed.

Using a slow device e.g. normal SATA disk, the bio-based IO path for
sequential read and write are slower than req-based IO path due to lack
of merge in guest kernel. So we make the bio-based path optional.

Performance evaluation:
-----------------------------
1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
    clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
    clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
    clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
    clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
    clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
    clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
    clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
    clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
    clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
    clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
    clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
    clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
    clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
    clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
    clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
    clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

3) Fio test is performed in a 8 vcpu guest with normal SATA based guest
using kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : -10%, -10%, 4.4%, 0.5%
 Latency improvement: -12%, -15%, 2.5%, 0.8%
Long Version:
 With bio-based IO path:
  read : io=124812KB, bw=36537KB/s, iops=9060 , runt=  3416msec
  write: io=169180KB, bw=24406KB/s, iops=6065 , runt=  6932msec
  read : io=256200KB, bw=2089.3KB/s, iops=520 , runt=122630msec
  write: io=257988KB, bw=1545.7KB/s, iops=384 , runt=166910msec
    clat (msec): min=1 , max=1527 , avg=28.06, stdev=89.54
    clat (msec): min=2 , max=344 , avg=41.12, stdev=38.70
    clat (msec): min=8 , max=1984 , avg=490.63, stdev=207.28
    clat (msec): min=33 , max=4131 , avg=659.19, stdev=304.71
  cpu          : usr=4.85%, sys=17.15%, ctx=31593, majf=0, minf=7
  cpu          : usr=3.04%, sys=11.45%, ctx=39377, majf=0, minf=0
  cpu          : usr=0.47%, sys=1.59%, ctx=262986, majf=0, minf=16
  cpu          : usr=0.47%, sys=1.46%, ctx=337410, majf=0, minf=0

 With request-based IO path:
  read : io=150120KB, bw=40420KB/s, iops=10037 , runt=  3714msec
  write: io=194932KB, bw=27029KB/s, iops=6722 , runt=  7212msec
  read : io=257136KB, bw=2001.1KB/s, iops=498 , runt=128443msec
  write: io=258276KB, bw=1537.2KB/s, iops=382 , runt=168028msec
    clat (msec): min=1 , max=1542 , avg=24.84, stdev=32.45
    clat (msec): min=3 , max=628 , avg=35.62, stdev=39.71
    clat (msec): min=8 , max=2540 , avg=503.28, stdev=236.97
    clat (msec): min=41 , max=4398 , avg=653.88, stdev=302.61
  cpu          : usr=3.91%, sys=15.75%, ctx=26968, majf=0, minf=23
  cpu          : usr=2.50%, sys=10.56%, ctx=19090, majf=0, minf=0
  cpu          : usr=0.16%, sys=0.43%, ctx=20159, majf=0, minf=16
  cpu          : usr=0.18%, sys=0.53%, ctx=81364, majf=0, minf=0

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
use_bio=1' to enable ->make_request_fn() based I/O path.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Linus Torvalds bee2d97b2c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull two ceph fixes from Sage Weil:
 "The first fixes a leak in the rbd setup error path, and the second
  fixes a more serious problem with mismatched kmap/kunmap that surfaced
  after the recent refactoring work."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: only kunmap kmapped pages
  rbd: drop dev reference on error in rbd_open()
2012-09-24 16:13:49 -07:00
Alex Elder 340c7a2b2c rbd: drop dev reference on error in rbd_open()
If a read-only rbd device is opened for writing in rbd_open(), it
returns without dropping the just-acquired device reference.

Fix this by moving the read-only check before getting the reference.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-09-21 20:48:54 -07:00
Linus Torvalds abef3bd710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:
 "More bug fixes, nothing gets past these guys"

 1) More kernel info leaks found by Mathias Krause, this time in the
    IPSEC configuration layers.

 2) When IPSEC policies change, we do not properly make sure that cached
    routes (which could now be stale) throughout the system will be
    revalidated.  Fix this by generalizing the generation count
    invalidation scheme used by ipv4.  From Nicolas Dichtel.

 3) When repairing TCP sockets, we need to allow to restore not just the
    send window scale, but the receive one too.  Extend the existing
    interface to achieve this in a backwards compatible way.  From
    Andrey Vagin.

 4) A fix for FCOE scatter gather feature validation erroneously caused
    scatter gather to be disabled for things like AOE too.  From Ed L
    Cashin.

 5) Several cases of mishandling of error pointers, from Mathias Krause,
    Wei Yongjun, and Devendra Naga.

 6) Fix gianfar build, from Richard Cochran.

 7) CAP_NET_* failures should return -EPERM not -EACCES, from Zhao
    Hongjiang.

 8) Hardware reset fix in janz-ican3 CAN driver, from Ira W Snyder.

 9) Fix oops during rmmod in ti_hecc CAN driver, from Marc Kleine-Budde.

10) The removal of the conditional compilation of the clk support code
    in the stmmac driver broke things.  This is because the interfaces
    used are the ones that don't also perform the enable/disable of the
    clk.  Fix from Stefan Roese.

11) The QFQ packet scheduler can record out of range virtual start
    times, resulting later in misbehavior and even crashes.  Fix from
    Paolo Valente.

12) If MSG_WAITALL is used with IOAT DMA under TCP, we can wedge the
    receiver when the advertised receive window goes to zero.  Detect
    this case and force the processing of the IOAT DMA queue when it
    happens to avoid getting stuck.  Fix from Michal Kubecek.

13) batman-adv assumes that test_bit() returns only 0 or 1, but this is
    not true for x86 (which returns -1 or 0, via the 'sbb' instruction).
    Fix from Linus Lussing.

14) Fix small packet corruption in e1000, from Tushar Dave.

15) make_blackhole() in the IPSEC policy code can do one read unlock too
    many, fix from Li RongQing.

16) The new tcp_try_coalesce() code introduced a bug in TCP URG
    handling, fix from Eric Dumazet.

17) Fix memory leak in __netif_receive_skb() when doing zerocopy and
    when hit an OOM condition.  From Michael S Tsirkin.

18) netxen blindly deferences pdev->bus->self, which is not guarenteed
    to be non-NULL.  Fix from Nikolay Aleksandrov.

19) Fix a performance regression caused by mistakes in ipv6 checksum
    validation in the bnx2x driver, fix from Michal Schmidt.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits)
  net/stmmac: Use clk_prepare_enable and clk_disable_unprepare
  net: change return values from -EACCES to -EPERM
  net/irda: sh_sir: fix return value check in sh_sir_set_baudrate()
  stmmac: fix return value check in stmmac_open_ext_timer()
  gianfar: fix phc index build failure
  ipv6: fix return value check in fib6_add()
  bnx2x: remove false warning regarding interrupt number
  can: ti_hecc: fix oops during rmmod
  can: janz-ican3: fix support for older hardware revisions
  net: do not disable sg for packets requiring no checksum
  aoe: assert AoE packets marked as requiring no checksum
  at91ether: return PTR_ERR if call to clk_get fails
  xfrm_user: don't copy esn replay window twice for new states
  xfrm_user: ensure user supplied esn replay window is valid
  xfrm_user: fix info leak in copy_to_user_tmpl()
  xfrm_user: fix info leak in copy_to_user_policy()
  xfrm_user: fix info leak in copy_to_user_state()
  xfrm_user: fix info leak in copy_to_user_auth()
  net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200
  tcp: restore rcv_wscale in a repair mode (v2)
  ...
2012-09-21 14:32:55 -07:00
Linus Torvalds 8ca7de9164 Bug-fixes:
* Fix M2P batching re-using the incorrect structure field.
  * Disable BIOS SMP MP table search.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQXGfdAAoJEFjIrFwIi8fJbWcH/0FI2d/VyB+ZU0ng3R0Oa7mt
 iR/x+Z+mfFdp2dXS6gs6DgJIZVA7i2K9pX4rOXjpDGGGyUeo1xoqjlQfsFWQGjZ/
 p49RrDrM93c2GdRXk3iMSWfboQI7BXBs5rnyYZQL7kMxUSR75MxbeONvhPrMSO9I
 3EBidWH08qjrn2HVF44F6xh5ONjpclo5AvGIzJ0eU4X0D0eqMnhvlAw8/UYJU2HV
 heRvuxWF9l2jNpLhKhZy1730D1X/vKA5qKAcBW8rCOpEijyPpmtKbqapeUJg/9pH
 NVquuwGutP5ozrSi7a/23+L+ezvQBmCPm5ZRG44PccBoZ/HVs8haT8UypSWSDzo=
 =TwvM
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - Fix M2P batching re-using the incorrect structure field.

   In v3.5 we added batching for M2P override (Machine Frame Number ->
   Physical Frame Number), but the original MFN was saved in an
   incorrect structure - and we would oops/restore when restoring with
   the old MFN.

 - Disable BIOS SMP MP table search.

   A bootup issue that we had ignored until we found that on DL380 G6 it
   was needed.

* tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/boot: Disable BIOS SMP MP table search.
  xen/m2p: do not reuse kmap_op->dev_bus_addr
2012-09-21 12:06:54 -07:00
Eric W. Biederman e4849737f7 userns: Convert loop to use kuid_t instead of uid_t
Cc: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:20 -07:00
Ed Cashin 8babe8cc65 aoe: assert AoE packets marked as requiring no checksum
In order for the network layer to see that AoE requires
no checksumming in a generic way, the packets must be
marked as requiring no checksum, so we make this requirement
explicit with the assertion.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20 22:23:40 -04:00
Linus Torvalds c46de2263f Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of driver fixes/updates and a core fix for 3.6.  It
  contains:

   - Bug fixes for mtip32xx, and support for new hardware (just addition
     of IDs).  They have been queued up for 3.7 for a few weeks as well.

   - rate-limit a failing command error message in block core.

   - A fix for an old cciss bug from Stephen.

   - Prevent overflow of partition count from Alan."

* 'for-linus' of git://git.kernel.dk/linux-block:
  cciss: fix handling of protocol error
  blk: add an upper sanity check on partition adding
  mtip32xx: fix user_buffer check in exec_drive_command
  mtip32xx: Remove dead code
  mtip32xx: Change printk to pr_xxxx
  mtip32xx: Proper reporting of write protect status on big-endian
  mtip32xx: Increase timeout for standby command
  mtip32xx: Handle NCQ commands during the security locked state
  mtip32xx: Add support for new devices
  block: rate-limit the error message from failing commands
2012-09-19 11:04:34 -07:00
Stephen M. Cameron 2453f5f992 cciss: fix handling of protocol error
If a command completes with a status of CMD_PROTOCOL_ERR, this
information should be conveyed to the SCSI mid layer, not dropped
on the floor.  Unlike a similar bug in the hpsa driver, this bug
only affects tape drives and CD and DVD ROM drives in the cciss
driver, and to induce it, you have to disconnect (or damage) a
cable, so it is not a very likely scenario (which would explain
why the bug has gone undetected for the last 10 years.)

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-18 11:57:08 +02:00
Paul Clements fded4e090c nbd: clear waiting_queue on shutdown
Fix a serious but uncommon bug in nbd which occurs when there is heavy
I/O going to the nbd device while, at the same time, a failure (server,
network) or manual disconnect of the nbd connection occurs.

There is a small window between the time that the nbd_thread is stopped
and the socket is shutdown where requests can continue to be queued to
nbd's internal waiting_queue.  When this happens, those requests are
never completed or freed.

The fix is to clear the waiting_queue on shutdown of the nbd device, in
the same way that the nbd request queue (queue_head) is already being
cleared.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:37 -07:00
Greg Kroah-Hartman 2bcb132c69 Merge 3.6-rc6 into usb-next
This resolves the merge problems with:
	drivers/usb/dwc3/gadget.c
	drivers/usb/musb/tusb6010.c
that had been seen in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-16 20:42:46 -07:00
Bjorn Helgaas 78890b5989 Merge commit 'v3.6-rc5' into next
* commit 'v3.6-rc5': (1098 commits)
  Linux 3.6-rc5
  HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
  Remove user-triggerable BUG from mpol_to_str
  xen/pciback: Fix proper FLR steps.
  uml: fix compile error in deliver_alarm()
  dj: memory scribble in logi_dj
  Fix order of arguments to compat_put_time[spec|val]
  xen: Use correct masking in xen_swiotlb_alloc_coherent.
  xen: fix logical error in tlb flushing
  xen/p2m: Fix one-off error in checking the P2M tree directory.
  powerpc: Don't use __put_user() in patch_instruction
  powerpc: Make sure IPI handlers see data written by IPI senders
  powerpc: Restore correct DSCR in context switch
  powerpc: Fix DSCR inheritance in copy_thread()
  powerpc: Keep thread.dscr and thread.dscr_inherit in sync
  powerpc: Update DSCR on all CPUs when writing sysfs dscr_default
  powerpc/powernv: Always go into nap mode when CPU is offline
  powerpc: Give hypervisor decrementer interrupts their own handler
  powerpc/vphn: Fix arch_update_cpu_topology() return value
  ARM: gemini: fix the gemini build
  ...

Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/rapidio/devices/tsi721.c
2012-09-13 08:41:01 -06:00
David Milburn 97651ea687 mtip32xx: fix user_buffer check in exec_drive_command
Current user_buffer check is incorrect and causes hdparm to fail

# hdparm -I /dev/rssda
 HDIO_DRIVE_CMD(identify) failed: Input/output error

/dev/rssda:

Patching linux-3.6-rc5 hdparm works as expected

# hdparm -I /dev/rssda
/dev/rssda:

ATA device, with non-removable media
	Model Number:       DELL_P320h-MTFDGAL350SAH
	Serial Number:      00000000121302025F01
	Firmware Revision:  B1442808
<snip>

Reported-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:21:13 +02:00
Asai Thambi S P ac64e6572d mtip32xx: Remove dead code
Removed the dead code in mtip_hw_read_registers() and mtip_hw_read_flags().

Reported-by: Coverity
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:20:22 +02:00
Asai Thambi S P 45422e7431 mtip32xx: Change printk to pr_xxxx
Changed printk to be compliant with latest style changes

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:20:11 +02:00
Asai Thambi S P b62868e50e mtip32xx: Proper reporting of write protect status on big-endian
Proper reporting of write protect status on big-endian

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:59 +02:00
Asai Thambi S P d7c8b94548 mtip32xx: Increase timeout for standby command
Increased timeout for standby command to work with larger capacity drives

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:49 +02:00
Asai Thambi S P 12a166c919 mtip32xx: Handle NCQ commands during the security locked state
Return error for NCQ commands when the drive is in security locked state

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:39 +02:00
Asai Thambi S P 1a131458dd mtip32xx: Add support for new devices
Added supported device IDs in pci table

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:29 +02:00
Stefano Stabellini 2fc136eecd xen/m2p: do not reuse kmap_op->dev_bus_addr
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.

Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).

CC: stable@kernel.org
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-12 11:21:40 -04:00
Kent Overstreet bf800ef181 block: Add bio_clone_bioset(), bio_clone_kmalloc()
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.

This will also help in a later patch changing how bio cloning works.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet ccc5c9ca6a pktcdvd: Switch to bio_kmalloc()
This is prep work for killing bi_destructor - previously, pktcdvd had
its own pkt_bio_alloc which was basically duplication bio_kmalloc(),
necessitating its own bi_destructor implementation.

v5: Un-reorder some functions, to make the patch easier to review

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet 395c72a707 block: Generalized bio pool freeing
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:38 +02:00
Stephen Hemminger 1d3520357d make drivers with pci error handlers const
Covers the rest of the uses of pci error handler.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-09-07 16:35:00 -06:00
Cong Wang 68a5059ecf block: remove the deprecated ub driver
It was scheduled to be removed in 3.6.

Acked-by: Pete Zaitcev <zaitcev@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-05 17:18:53 -07:00
Linus Torvalds a7e546f175 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block-related fixes from Jens Axboe:

 - Improvements to the buffered and direct write IO plugging from
   Fengguang.

 - Abstract out the mapping of a bio in a request, and use that to
   provide a blk_bio_map_sg() helper.  Useful for mapping just a bio
   instead of a full request.

 - Regression fix from Hugh, fixing up a patch that went into the
   previous release cycle (and marked stable, too) attempting to prevent
   a loop in __getblk_slow().

 - Updates to discard requests, fixing up the sizing and how we align
   them.  Also a change to disallow merging of discard requests, since
   that doesn't really work properly yet.

 - A few drbd fixes.

 - Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: replace __getblk_slow misfix by grow_dev_page fix
  drbd: Write all pages of the bitmap after an online resize
  drbd: Finish requests that completed while IO was frozen
  drbd: fix drbd wire compatibility for empty flushes
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update missing index files in block/00-INDEX
  block: move down direct IO plugging
  block: remove plugging at buffered write time
  block: disable discard request merge temporarily
  bio: Fix potential memory leak in bio_find_or_create_slab()
  block: Don't use static to define "void *p" in show_partition_start()
  block: Add blk_bio_map_sg() helper
  block: Introduce __blk_segment_map_sg() helper
  fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
  block: split discard into aligned requests
  block: reorganize rounding of max_discard_sectors
2012-08-25 11:36:43 -07:00
Stephen M. Cameron b0cf0b118c cciss: fix incorrect scsi status reporting
Delete code which sets SCSI status incorrectly as it's already been set
correctly above this incorrect code.  The bug was introduced in 2009 by
commit b0e15f6db1 ("cciss: fix typo that causes scsi status to be
lost.")

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-by: Roel van Meer <roel.vanmeer@bokxing.nl>
Tested-by: Roel van Meer <roel.vanmeer@bokxing.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:02 -07:00
Tejun Heo 136b5721d7 workqueue: deprecate __cancel_delayed_work()
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work().  Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo e7c2f96744 workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo 43829731dd workqueue: deprecate flush[_delayed]_work_sync()
flush[_delayed]_work_sync() are now spurious.  Mark them deprecated
and convert all users to flush[_delayed]_work().

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant and the regular flushes guarantee that the work item is
not pending or running on any CPU on return, so there's no reason to
use the sync flushes at all and they're going away.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mattia Dongili <malattia@linux.it>
Cc: Kent Yoder <key@linux.vnet.ibm.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Bryan Wu <bryan.wu@canonical.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: Sangbeom Kim <sbkim73@samsung.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
2012-08-20 14:51:24 -07:00
Jens Axboe 79df9b40b3 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus 2012-08-16 19:27:18 +02:00
Philipp Reisner d1aa4d04da drbd: Write all pages of the bitmap after an online resize
We need to write the whole bitmap after we moved the meta data
due to an online resize operation.

With the support for one peta byte devices bitmap IO was optimized
to only write out touched pages. This optimization must be turned
off when writing the bitmap after an online resize.

This issue was introduced with drbd-8.3.10.

The impact of this bug is that after an online resize, the next
resync could become larger than expected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:17:35 +02:00
Philipp Reisner 509fc019e5 drbd: Finish requests that completed while IO was frozen
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.

When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.

Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.

I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:14:45 +02:00
Lars Ellenberg 227f052f47 drbd: fix drbd wire compatibility for empty flushes
DRBD has a concept of request epochs or reorder-domains,
which are separated on the wire by P_BARRIER packets.

Older DRBD is not able to handle zero-sized requests at all,
so we need to map empty flushes to these drbd barriers.

These are the equivalent of empty flushes, and
by default trigger flushes on the receiving side anyways
(unless not supported or explicitly disabled),
so there is no need to handle this differently in newer drbd either.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:12:56 +02:00
Stefano Stabellini e79affc3f2 xen/arm: compile blkfront and blkback
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-08-08 17:21:14 +00:00
Matthew Wilcox a09115b23e NVMe: Cancel outstanding IOs on queue deletion
If the device is hot-unplugged while there are active commands, we should
time out the I/Os so that upper layers don't just see the I/Os disappear.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-08-07 15:56:23 -04:00
Artem Bityutskiy d97482ede2 drbd: nuke pdflush from comments
The pdflush thread is long gone, so this patch removes references to pdflush
from drbd comments.

Cc: drbd-dev@lists.linbit.com
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:39 +04:00
Matthew Wilcox 9e866774aa NVMe: Free admin queue memory on initialisation failure
If the adapter fails initialisation, the memory allocated for the
admin queue may not be freed.  Split the memory freeing part of
nvme_free_queue() into nvme_free_queue_mem() and call it in the case of
initialisation failure.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
2012-08-03 13:55:56 -04:00
Linus Torvalds eff0d13f38 Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:

 - Making the plugging support for drivers a bit more sane from Neil.
   This supersedes the plugging change from Shaohua as well.

 - The usual round of drbd updates.

 - Using a tail add instead of a head add in the request completion for
   ndb, making us find the most completed request more quickly.

 - A few floppy changes, getting rid of a duplicated flag and also
   running the floppy init async (since it takes forever in boot terms)
   from Andi.

* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
  floppy: remove duplicated flag FD_RAW_NEED_DISK
  blk: pass from_schedule to non-request unplug functions.
  block: stack unplug
  blk: centralize non-request unplug handling.
  md: remove plug_cnt feature of plugging.
  block/nbd: micro-optimization in nbd request completion
  drbd: announce FLUSH/FUA capability to upper layers
  drbd: fix max_bio_size to be unsigned
  drbd: flush drbd work queue before invalidate/invalidate remote
  drbd: fix potential access after free
  drbd: call local-io-error handler early
  drbd: do not reset rs_pending_cnt too early
  drbd: reset congestion information before reporting it in /proc/drbd
  drbd: report congestion if we are waiting for some userland callback
  drbd: differentiate between normal and forced detach
  drbd: cleanup, remove two unused global flags
  floppy: Run floppy initialization asynchronous
2012-08-01 09:06:47 -07:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds 3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman 7f338fe454 nbd: set SOCK_MEMALLOC for access to PFMEMALLOC reserves
Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC reserves
so pages backed by NBD, particularly if swap related, can be cleaned to
prevent the machine being deadlocked.  It is still possible that the
PFMEMALLOC reserves get depleted resulting in deadlock but this can be
resolved by the administrator by increasing min_free_kbytes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00
Linus Torvalds cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Quoc-Son Anh cd58ad7d18 NVMe: Use ida for nvme device instance
Signed-off-by: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:31:34 -04:00
Matthew Wilcox 0ac13140d7 NVMe: Fix whitespace damage in nvme_init
Commit 5c42ea1643 used spaces instead of tabs.
Also remove the unnecessary initialisation of the 'result' variable.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:31:16 -04:00
Dan Carpenter 22fff826e7 NVMe: handle allocation failure in nvme_map_user_pages()
We should return here and avoid a NULL dereference.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:15:43 -04:00
Jens Axboe 10af8138eb Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers 2012-07-31 11:47:36 +02:00
Fengguang Wu 2fb2ca6f5b floppy: remove duplicated flag FD_RAW_NEED_DISK
Fix coccinelle warning (without behavior change):

drivers/block/floppy.c:2518:32-48: duplicated argument to & or |

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-31 11:38:39 +02:00
NeilBrown 74018dc306 blk: pass from_schedule to non-request unplug functions.
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
NeilBrown 9cbb175088 blk: centralize non-request unplug handling.
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:14 +02:00
Chetan Loke 01ff5dbc09 block/nbd: micro-optimization in nbd request completion
Add in-flight cmds to the tail. That way while searching
(during request completion),we will always get a hit on the
first element.

Signed-off-by: Chetan Loke <loke.chetan@gmail.com>
Acked-by: Paul.Clements@steeleye.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 08:47:13 +02:00
Alex Elder 1fe5e99321 rbd: create rbd_refresh_helper()
Create a simple helper that handles the common case of calling
__rbd_refresh_header() while holding the ctl_mutex.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder b813623ab9 rbd: return obj version in __rbd_refresh_header()
Add a new parameter to __rbd_refresh_header() through which the
version of the header object is passed back to the caller.  In most
cases this isn't needed.  The main motivation is to normalize
(almost) all calls to __rbd_refresh_header() so they are all
wrapped immediately by mutex_lock()/mutex_unlock().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder ccece235d3 rbd: fixes in rbd_header_from_disk()
This fixes a few issues in rbd_header_from_disk():
    - There is a check intended to catch overflow, but it's wrong in
      two ways.
	- First, the type we don't want to overflow is size_t, not
	  unsigned int, and there is now a SIZE_MAX we can use for
	  use with that type.
	- Second, we're allocating the snapshot ids and snapshot
	  image sizes separately (each has type u64; on disk they
          grouped together as a rbd_image_header_ondisk structure).
	  So we can use the size of u64 in this overflow check.
    - If there are no snapshots, then there should be no snapshot
      names.  Enforce this, and issue a warning if we encounter a
      header with no snapshots but a non-zero snap_names_len.
    - When saving the snapshot names into the header, be more direct
      in defining the offset in the on-disk structure from which
      they're being copied by using "snap_count" rather than "i"
      in the array index.
    - If an error occurs, the "snapc" and "snap_names" fields are
      freed at the end of the function.  Make those fields be null
      pointers after they're freed, to be explicit that they are
      no longer valid.
    - Finally, move the definition of the local variable "i" to the
      innermost scope in which it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 913d2fdcf6 rbd: always pass ops array to rbd_req_sync_op()
All of the callers of rbd_req_sync_op() except one pass a non-null
"ops" pointer.  The only one that does not is rbd_req_sync_read(),
which passes CEPH_OSD_OP_READ as its "opcode" and, CEPH_OSD_FLAG_READ
for "flags".

By allocating the ops array in rbd_req_sync_read() and moving the
special case code for the null ops pointer into it, it becomes
clear that much of that code is not even necessary.

In addition, the "opcode" argument to rbd_req_sync_op() is never
actually used, so get rid of that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder d67d4be56a rbd: pass null version pointer in add_snap()
rbd_header_add_snap() passes the address of a version variable to
rbd_req_sync_exec(), but it ignores the result.  Just pass a null
pointer instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 57cfc1060f rbd: make rbd_create_rw_ops() return a pointer
Either rbd_create_rw_ops() will succeed, or it will fail because a
memory allocation failed.  Have it just return a valid pointer or
null rather than stuffing a pointer into a provided address and
returning an errno.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 4e891e0af0 rbd: have __rbd_add_snap_dev() return a pointer
It's not obvious whether the snapshot pointer whose address is
provided to __rbd_add_snap_dev() will be assigned by that function.
Change it to return the snapshot, or a pointer-coded errno in the
event of a failure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:45 -07:00
Alex Elder 070c633f60 rbd: drop "object_name" from rbd_req_sync_unwatch()
rbd_req_sync_unwatch() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:54 -07:00
Alex Elder 7f0a24d855 rbd: drop "object_name" from rbd_req_sync_notify_ack()
rbd_req_sync_notify_ack() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:53 -07:00
Alex Elder 4cb162508a rbd: drop "object_name" from rbd_req_sync_notify()
rbd_req_sync_notify() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:53 -07:00
Alex Elder 0e6f322d55 rbd: drop "object_name" from rbd_req_sync_watch()
rbd_req_sync_watch() is only called in one place, and in that place
it passes rbd_dev->header_name as the value of the "object_name"
parameter.  This value is available within the function already.

Having the extra parameter leaves the impression the object name
could take on different values, but it does not.

So get rid of the parameter.  We can always add it back again if
we find we want to watch some other object in the future.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:52 -07:00
Alex Elder 14e7085d84 rbd: drop rbd_dev parameter in snap functions
Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have
rbd_dev parameters that are unused.  Remove them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:51 -07:00
Alex Elder ed63f4fd9a rbd: drop rbd_header_from_disk() gfp_flags parameter
The function rbd_header_from_disk() is only called in one spot, and
it passes GFP_KERNEL as its value for the gfp_flags parameter.

Just drop that parameter and substitute GFP_KERNEL everywhere within
that function it had been used.  (If we find we need the parameter
again in the future it's easy enough to add back again.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:50 -07:00
Alex Elder 9a5d690b08 rbd: snapc is unused in rbd_req_sync_read()
The "snapc" parameter to in rbd_req_sync_read() is not used, so
get rid of it.

Reported-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:49 -07:00
Alex Elder de71a2970d rbd: rename rbd_device->id
The "id" field of an rbd device structure represents the unique
client-local device id mapped to the underlying rbd image.  Each rbd
image will have another id--the image id--and each snapshot has its
own id as well.  The simple name "id" no longer conveys the
information one might like to have.

Rename the device "id" field in struct rbd_dev to be "dev_id" to
make it a little more obvious what we're dealing with without having
to think more about context.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:49 -07:00
Alex Elder 8e94af8e2b rbd: encapsulate header validity test
If an rbd image header is read and it doesn't begin with the
expected magic information, a warning is displayed.  This is
a fairly simple test, but it could be extended at some point.
Fix the comparison so it actually looks at the "text" field
rather than the front of the structure.

In any case, encapsulate the validity test in its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:48 -07:00
Alex Elder bd919d45aa rbd: clean up a few dout() calls
There was a dout() call in rbd_do_request() that was reporting
the reporting the offset as the length and vice versa.  While
fixing that I did a quick scan of other dout() calls and fixed
a couple of other minor things.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:46 -07:00
Alex Elder a059329056 rbd: simplify __rbd_remove_all_snaps()
This just replaces a while loop with list_for_each_entry_safe()
in __rbd_remove_all_snaps().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:45 -07:00
Alex Elder a66f8c97a3 rbd: drop extra header_rwsem init
In commit c666601a there was inadvertently added an extra
initialization of rbd_dev->header_rwsem.  This gets rid of the
duplicate.

Reported-by: Guangliang Zhao <gzhao@suse.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:45 -07:00
Alex Elder 9e15dc735a rbd: kill rbd_image_header->snap_seq
The snap_seq field in an rbd_image_header structure held the value
from the rbd image header when it was last refreshed.  We now
maintain this value in the snapc->seq field.  So get rid of the
other one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:44 -07:00
Alex Elder 505cbb9bed rbd: set snapc->seq only when refreshing header
In rbd_header_add_snap() there is code to set snapc->seq to the
just-added snapshot id.  This is the only remnant left of the
use of that field for recording which snapshot an rbd_dev was
associated with.  That functionality is no longer supported,
so get rid of that final bit of code.

Doing so means we never actually set snapc->seq any more.  On the
server, the snapshot context's sequence value represents the highest
snapshot id ever issued for a particular rbd image.  So we'll make
it have that meaning here as well.  To do so, set this value
whenever the rbd header is (re-)read.  That way it will always be
consistent with the rest of the snapshot context we maintain.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:43 -07:00
Alex Elder 78dc447d3c rbd: preserve snapc->seq in rbd_header_set_snap()
In rbd_header_set_snap(), there is logic to make the snap context's
seq field get set to a particular snapshot id, or 0 if there is no
snapshot for the rbd image.

This seems to be an artifact of how the current snapshot id for an
rbd_dev was recorded before the rbd_dev->snap_id field began to be
used for that purpose.

There's no need to update the value of snapc->seq here any more, so
stop doing it.  Tidy up a few local variables in that function
while we're at it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:42 -07:00
Alex Elder 75fe9e1981 rbd: don't use snapc->seq that way
In what appears to be an artifact of a different way of encoding
whether an rbd image maps a snapshot, __rbd_refresh_header() has
code that arranges to update the seq value in an rbd image's
snapshot context to point to the first entry in its snapshot
array if that's where it was pointing initially.

We now use rbd_dev->snap_id to record the snapshot id--using the
special value CEPH_NOSNAP to indicate the rbd_dev is not mapping a
snapshot at all.

There is therefore no need to check for this case, nor to update the
seq value, in __rbd_refresh_header().  Just preserve the seq value
that rbd_read_header() provides (which, at the moment, is nothing).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:41 -07:00
Josh Durgin a71b891bc7 rbd: send header version when notifying
Previously the original header version was sent. Now, we update it
when the header changes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:40 -07:00
Josh Durgin d1d2564654 rbd: use reference counting for the snap context
This prevents a race between requests with a given snap context and
header updates that free it. The osd client was already expecting the
snap context to be reference counted, since it get()s it in
ceph_osdc_build_request and put()s it when the request completes.

Also remove the second down_read()/up_read() on header_rwsem in
rbd_do_request, which wasn't actually preventing this race or
protecting any other data.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:40 -07:00
Josh Durgin 93a24e084d rbd: set image size when header is updated
The image may have been resized.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:39 -07:00
Josh Durgin a51aa0c042 rbd: expose the correct size of the device in sysfs
If an image was mapped to a snapshot, the size of the head version
would be shown. Protect capacity with header_rwsem, since it may
change.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:38 -07:00
Josh Durgin 474ef7ce83 rbd: only reset capacity when pointing to head
Snapshots cannot be resized, and the new capacity of head should not
be reflected by the snapshot.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:37 -07:00
Josh Durgin e88a36ec96 rbd: return errors for mapped but deleted snapshot
When a snapshot is deleted, the OSD will return ENOENT when reading
from it. This is normally interpreted as a hole by rbd, which will
return zeroes. To minimize the time in which this can happen, stop
requests early when we are notified that our snapshot no longer
exists.

[elder@inktank.com: updated __rbd_init_snaps_header() logic]

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:36 -07:00
Alex Elder d1f57ea663 rbd: kill num_reply parameters
Several functions include a num_reply parameter, but it is never
used.  Just get rid of it everywhere--it seems to be something
that never got fully implemented.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:09 -07:00
Alex Elder 43ae470112 rbd: option symbol renames
Use the name "ceph_opts" consistently (rather than just "opt") for
pointers to a ceph_options structure.

Change the few spots that don't use "rbd_opts" for a rbd_options
pointer to match the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:08 -07:00
Alex Elder aded07ea9f rbd: more symbol renames
Rename variables named "obj" which represent object names so they're
consistently named "object_name".

Rename the "cls" and "method" parameters in rbd_req_sync_exec()
to be "class_name" and "method_name", and make similar changes
to the names of local variables in that function representing
the lengths of those names.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:07 -07:00
Alex Elder 0bed54dc9a rbd: rename some fields in struct rbd_dev
An rbd image is not a single object, but a logical construct made up
of an aggregation of objects.

Rename some fields in struct rbd_dev, in hopes of reinforcing this.
    obj         --> image_name
    obj_len     --> image_name_len
    obj_md_name --> header_name

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:06 -07:00
Alex Elder 0ce1a79413 rbd: use rbd_dev consistently
Most variables that represent a struct rbd_device are named
"rbd_dev", but in some cases "dev" is used instead.  Change all the
"dev" references so they use "rbd_dev" consistently, to make it
clear from the name that we're working with an RBD device (as
opposed to, for example, a struct device).  Similarly, change the
name of the "dev" field in struct rbd_notify_info to be "rbd_dev".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:05 -07:00
Alex Elder 820a5f3e94 rbd: dynamically allocate snapshot name
There is no need to impose a small limit the length of the snapshot
name recorded for an rbd image in a struct rbd_dev.  Remove the
limitation by allocating space for the snapshot name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:04 -07:00
Alex Elder bf3e5ae112 rbd: dynamically allocate image name
There is no need to impose a small limit the length of the rbd image
name recorded in a struct rbd_dev.  Remove the limitation by
allocating space for the image name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:04 -07:00
Alex Elder cb8627c76d rbd: dynamically allocate image header name
There is no need to impose a small limit the length of the header
name recorded for an rbd image in a struct rbd_dev.  Remove the
limitation by allocating space for the header name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:03 -07:00
Alex Elder 849b4260d4 rbd: dynamically allocate object prefix
There is no need to impose a small limit the length of the object
prefix recorded for an rbd image in a struct rbd_image_header.
Remove the limitation by allocating space for the object prefix
dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:02 -07:00
Alex Elder d22f76e703 rbd: dynamically allocate pool name
There is no need to impose a small limit the length of the pool name
recorded for an rbd image in a struct rbd_device.  Remove the
limitation by allocating space for the pool name ynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:01 -07:00
Alex Elder 9bb2f334b9 rbd: create pool_id device attribute
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that
provides the id for the pool the rbd image is assocatied with.  This
is in addition to the pool name already provided.

Rename the "poolid" field in struct rbd_device  to be "pool_id".

Update the documentation to reflect the addition of this new entry.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:00 -07:00
Alex Elder ca1e49a6af rbd: rename rbd_dev->block_name
Each rbd image has a name that forms the basis of all data objects
backing the device.  Old (format 1) images refer to this name as the
"block name," while new (format 2) images use the term "object
prefix" for this.

Change the field name in the in-core rbd image header structure to
reflect the more modern usage.  We intentionally keep the the name
"block_name" in the on-disk definition for format 1 image headers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:59 -07:00
Alex Elder ea3352f4aa rbd: define dup_token()
Define a new function dup_token(), to be used during argument
parsing for making dynamically-allocated copies of tokens being
parsed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:58 -07:00
Alex Elder ad4f232f28 rbd: drop a useless local variable
In rbd_req_sync_notify_ack(), a local variable was needlessly being
used to hold a null pointer.  Just pass NULL instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:56 -07:00
Jens Axboe 72ea1f74fc Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-3.6/drivers 2012-07-30 09:03:10 +02:00
Paolo Bonzini cd5d503862 virtio-blk: allow toggling host cache between writeback and writethrough
This patch adds support for the new VIRTIO_BLK_F_CONFIG_WCE feature,
which exposes the cache mode in the configuration space and lets the
driver modify it.  The cache mode is exposed via sysfs.

Even if the host does not support the new feature, the cache mode is
visible (thanks to the existing VIRTIO_BLK_F_WCE), but not modifiable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:52 +09:30
Asias He 2c95a32909 virtio-blk: Use block layer provided spinlock
Block layer will allocate a spinlock for the queue if the driver does
not provide one in blk_init_queue().

The reason to use the internal spinlock is that blk_cleanup_queue() will
switch to use the internal spinlock in the cleanup code path.

        if (q->queue_lock != &q->__queue_lock)
                q->queue_lock = &q->__queue_lock;

However, processes which are in D state might have taken the driver
provided spinlock, when the processes wake up, they would release the
block provided spinlock.

=====================================
[ BUG: bad unlock balance detected! ]
3.4.0-rc7+ #238 Not tainted
-------------------------------------
fio/3587 is trying to release lock (&(&q->__queue_lock)->rlock) at:
[<ffffffff813274d2>] blk_queue_bio+0x2a2/0x380
but there are no more locks to release!

other info that might help us debug this:
1 lock held by fio/3587:
 #0:  (&(&vblk->lock)->rlock){......}, at:
[<ffffffff8132661a>] get_request_wait+0x19a/0x250

Other drivers use block layer provided spinlock as well, e.g. SCSI.

Switching to the block layer provided spinlock saves a bit of memory and
does not increase lock contention. Performance test shows no real
difference is observed before and after this patch.

Changes in v2: Improve commit log as Michael suggested.

Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Asias He 483001c765 virtio-blk: Reset device after blk_cleanup_queue()
blk_cleanup_queue() will call blk_drian_queue() to drain all the
requests before queue DEAD marking. If we reset the device before
blk_cleanup_queue() the drain would fail.

1) if the queue is stopped in do_virtblk_request() because device is
full, the q->request_fn() will not be called.

blk_drain_queue() {
   while(true) {
      ...
      if (!list_empty(&q->queue_head))
        __blk_run_queue(q) {
	    if (queue is not stoped)
		q->request_fn()
	}
      ...
   }
}

Do no reset the device before blk_cleanup_queue() gives the chance to
start the queue in interrupt handler blk_done().

2) In commit b79d866c8b, We abort requests
dispatched to driver before blk_cleanup_queue(). There is a race if
requests are dispatched to driver after the abort and before the queue
DEAD mark. To fix this, instead of aborting the requests explicitly, we
can just reset the device after after blk_cleanup_queue so that the
device can complete all the requests before queue DEAD marking in the
drain process.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Asias He 02e2b12494 virtio-blk: Call del_gendisk() before disable guest kick
del_gendisk() might not return due to failing to remove the
/sys/block/vda/serial sysfs entry when another thread (udev) is
trying to read it.

virtblk_remove()
  vdev->config->reset() : guest will not kick us through interrupt
    del_gendisk()
      device_del()
        kobject_del(): got stuck, sysfs entry ref count non zero

sysfs_open_file(): user space process read /sys/block/vda/serial
   sysfs_get_active() : got sysfs entry ref count
      dev_attr_show()
        virtblk_serial_show()
           blk_execute_rq() : got stuck, interrupt is disabled
                              request cannot be finished

This patch fixes it by calling del_gendisk() before we disable guest's
interrupt so that the request sent in virtblk_serial_show() will be
finished and del_gendisk() will success.

This fixes another race in hot-unplug process.

It is save to call del_gendisk(vblk->disk) before
flush_work(&vblk->config_work) which might access vblk->disk, because
vblk->disk is not freed until put_disk(vblk->disk).

Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Keith Busch c7d36ab8fa NVMe: Fix uninitialized iod compiler warning
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-27 14:25:22 -04:00
Keith Busch a0cadb85b8 NVMe: Do not set IO queue depth beyond device max
Set the depth for IO queues to the device's maximum supported queue
entries if the requested depth exceeds the device's capabilities.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-27 13:57:23 -04:00
Keith Busch 8fc23e032d NVMe: Set block queue max sectors
Set the max hw sectors in a namespace's request queue if the nvme device
has a max data transfer size.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 13:43:20 -04:00
Keith Busch a42ceccef0 NVMe: use namespace id for nvme_get_features
The specification does not provide a use for command dword11 in the NVMe
Get Features command, but does use the NSID for some features.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:46:36 -04:00
Keith Busch 50af8baec4 NVMe: replace nvme_ns with nvme_dev for user admin
The function nvme_user_admin_command does not require a namespace to
proceed.  Replace with the nvme_dev structure so that it can be called
from contexts that do not have a namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:24:36 -04:00
Keith Busch 5c42ea1643 NVMe: Fix nvme module init when nvme_major is set
register_blkdev returns 0 when given a valid major number.

Reported-by:Ross Zwisler <ross.zwisler@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:23:38 -04:00
Keith Busch e9ef46369f NVMe: Set request queue logical block size
Sets the request queue logical block size with the block size of the
namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-25 14:52:49 -04:00
Lars Ellenberg a73ff3231d drbd: announce FLUSH/FUA capability to upper layers
Unconditionally announce FLUSH/FUA to upper layers.
If the lower layers on either node do not actually support this,
generic_make_request() will deal with it.

If this causes performance regressions on your setup,
make sure there are no volatile caches involved,
and mount -o nobarrier or equivalent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 15:14:28 +02:00
Lars Ellenberg db141b2f42 drbd: fix max_bio_size to be unsigned
We capped our max_bio_size respectively max_hw_sectors with
min_t(int, lower level limit, our limit);
unfortunately, some drivers, e.g. the kvm virtio block driver, initialize their
limits to "-1U", and that is of course a smaller "int" value than our limit.

Impact: we started to request 16 MB resync requests,
which lead to protocol error and a reconnect loop.

Fix all relevant constants and parameters to be unsigned int.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 15:14:00 +02:00
Lars Ellenberg 7ee1fb93f3 drbd: flush drbd work queue before invalidate/invalidate remote
If you do back to back wait-sync/invalidate on a Primary in a tight loop,
during application IO load, you could trigger a race:
  kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync'
	but 'write from resync_finished' still pending?

Fix this by changing the order of the drbd_queue_work() and
the wake_up() in dec_ap_pending(), and adding the additional
drbd_flush_workqueue() before requesting the full sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:15:58 +02:00
Lars Ellenberg c12e9c8964 drbd: fix potential access after free
Occasionally, if we disconnect, we triggered this assert:
  block drbd7: ASSERT FAILED tl_hash[27] == c30b0f04, expected NULL

hlist_del() happens only on master bio completion.

We used to wait for pending IO to complete before freeing tl_hash
on disconnect. We no longer do so, since we learned to "freeze"
IO on disconnect.

If the local disk is too slow, we may reach C_STANDALONE early,
and there are still some requests pending locally when we call
drbd_free_tl_hash().

If we now free the tl_hash, and later the local IO completion completes
the master bio, which then does hlist_del() and clobbers freed memory.

Do hlist_del_init() and hlist_add_fake() before kfree(tl_hash),
so the hlist_del() on master bio completion is harmless.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:15:16 +02:00
Lars Ellenberg 63a6d0bb3d drbd: call local-io-error handler early
In case we want to hard-reset from the local-io-error handler,
we need to call it before notifying the peer or aborting local IO.
Otherwise the peer will advance its data generation UUIDs even
if secondary.

This way, local io error looks like a "regular" node crash,
which reduces the number of different failure cases.
This may be useful in a bigger picture where crashed or otherwise
"misbehaving" nodes are automatically re-deployed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:10:41 +02:00
Lars Ellenberg 0029d62434 drbd: do not reset rs_pending_cnt too early
Fix asserts like
  block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !

We reset the resync lru cache and related information (rs_pending_cnt),
once we successfully finished a resync or online verify, or if the
replication connection is lost.

We also need to reset it if a resync or online verify is aborted
because a lower level disk failed.

In that case the replication link is still established,
and we may still have packets queued in the network buffers
which want to touch rs_pending_cnt.

We do not have any synchronization mechanism to know for sure when all
such pending resync related packets have been drained.

To avoid this counter to go negative (and violate the ASSERT that it
will always be >= 0), just do not reset it when we lose a disk.

It is good enough to make sure it is re-initialized before the next
resync can start: reset it when we re-attach a disk.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:09:53 +02:00
Lars Ellenberg 88437879fb drbd: reset congestion information before reporting it in /proc/drbd
We cache the congestion status in mdev->congestion_reason whenever
drbd_congested() was called.
Reset this cached info before reporting it when reading /proc/drbd.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:07:48 +02:00
Lars Ellenberg c2ba686f35 drbd: report congestion if we are waiting for some userland callback
If the drbd worker thread is synchronously waiting for some userland
callback, we don't want some casual pageout to block on us.
Have drbd_congested() report congestion in that case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:07:18 +02:00
Lars Ellenberg 383606e0de drbd: differentiate between normal and forced detach
Aborting local requests (not waiting for completion from the lower level
disk) is dangerous: if the master bio has been completed to upper
layers, data pages may be re-used for other things already.
If local IO is still pending and later completes,
this may cause crashes or corrupt unrelated data.

Only abort local IO if explicitly requested.
Intended use case is a lower level device that turned into a tarpit,
not completing io requests, not even doing error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:06:18 +02:00
Lars Ellenberg d264580145 drbd: cleanup, remove two unused global flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:02:41 +02:00
Jens Axboe b1af9be5ef Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers 2012-07-24 13:15:08 +02:00
Linus Torvalds 7100e505b7 Power management updates for 3.6
* ACPI conversion to PM handling based on struct dev_pm_ops.
 * Conversion of a number of platform drivers to PM handling based on struct
   dev_pm_ops and removal of empty legacy PM callbacks from a couple of PCI
   drivers.
 * Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 * cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti U Murthy.
 * cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 * Suspend and hibernate fixes from Srivatsa S. Bhat and Colin Cross.
 * Generic PM domains framework updates.
 * RTC CMOS wakeup signaling update from Paul Fox.
 * sparse warnings fixes from Sachin Kamat.
 * Build warnings fixes for the generic PM domains framework and PM sysfs code.
 * sysfs switch for printing device suspend times from Sameer Nanda.
 * Documentation fix from Oskar Schirmer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDF5eAAoJEKhOf7ml8uNsEaAP/2wg4faoOGob5A0/7tLqG3Cw
 xnTmGsfL7wG07Q8ykCL1BSlBb1VeJz8L6LTmUpaABI4M//oIBlcYQKyCE0Tat1AO
 9bJXFzK7qcHMhkTz6d6LDqtVzR3NGM3ypjZqj8aEXBov07LMR1AXvgNwXXhv25zM
 0unwrh1XNinBN3n+oaktpWk1YHUjsa5IMU+2tQJrocuHXcgK30vGXZVrZ4g9w1c2
 eS+ED1oKUqOYtFzIUX+aCtaDDheGaPlugk/GOtIB7Sae0s0vMlxH/T5ncB4SxRC+
 v3s4OykqQc5Dc8+0bNlBH7ykSVNB0PoQiyKDY67CxtH+q1xQSc9/f3XJqnGMaVDE
 17eZUZsL4qSyzRuCbPCGAgwBHmx3qNCMu1i1BcmnSxU+ikPUeCR7mYOP0mRThwPH
 OSfs+c/vZ+Ow6CwVE4UFrbm9Jve7ADnCrlZzT2m6XjhHGyjKP7SJlzP9TPsZ0LRk
 oxgQDYHmxbo50t9tBCz5L4ZTMKkDp28e78x84/CteP85srcW3GqDxrPyp2uzJu5O
 tvIEBvVlc4ucq8sG83RkugQwrG/2cQwG2HO9ERAwq01HHA1BYsuU3A961Jqf5CZo
 nFRSnByvVj/imPf47OWpDPAbVEs7jxufJuLEbPwGj1MkttTGDBIRu3zldXt2S6kP
 Q4qYU6fDaQQHFc90pqxQ
 =vC4/
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 - ACPI conversion to PM handling based on struct dev_pm_ops.
 - Conversion of a number of platform drivers to PM handling based on
   struct dev_pm_ops and removal of empty legacy PM callbacks from a
   couple of PCI drivers.
 - Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 - cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti
   Murthy.
 - cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 - Suspend and hibernate fixes from Srivatsa Bhat and Colin Cross.
 - Generic PM domains framework updates.
 - RTC CMOS wakeup signaling update from Paul Fox.
 - sparse warnings fixes from Sachin Kamat.
 - Build warnings fixes for the generic PM domains framework and PM
   sysfs code.
 - sysfs switch for printing device suspend times from Sameer Nanda.
 - Documentation fix from Oskar Schirmer.

* tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
  cpufreq: Fix sysfs deadlock with concurrent hotplug/frequency switch
  EXYNOS: bugfix on retrieving old_index from freqs.old
  PM / Sleep: call early resume handlers when suspend_noirq fails
  PM / QoS: Use NULL pointer instead of plain integer in qos.c
  PM / QoS: Use NULL pointer instead of plain integer in pm_qos.h
  PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
  PM / Sleep: Add missing static storage class specifiers in main.c
  cpuilde / ACPI: remove time from acpi_processor_cx structure
  cpuidle / ACPI: remove usage from acpi_processor_cx structure
  cpuidle / ACPI : remove latency_ticks from acpi_processor_cx structure
  rtc-cmos: report wakeups from interrupt handler
  PM / Sleep: Fix build warning in sysfs.c for CONFIG_PM_SLEEP unset
  PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
  olpc-xo15-sci: Use struct dev_pm_ops for power management
  PM / Domains: Replace plain integer with NULL pointer in domain.c file
  PM / Domains: Add missing static storage class specifier in domain.c file
  PM / crypto / ux500: Use struct dev_pm_ops for power management
  PM / IPMI: Remove empty legacy PCI PM callbacks
  tpm_nsc: Use struct dev_pm_ops for power management
  tpm_tis: Use struct dev_pm_ops for power management
  ...
2012-07-22 13:36:52 -07:00
Theodore Ts'o 89c30f161c xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
With the changes in the random tree, IRQF_SAMPLE_RANDOM is now a
no-op; interrupt randomness is now collected unconditionally in a very
low-overhead fashion; see commit 775f4b297b.  The IRQF_SAMPLE_RANDOM
flag was scheduled to be removed in 2009 on the
feature-removal-schedule, so this patch is preparation for the final
removal of this flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
2012-07-19 10:39:25 -04:00
Rafael J. Wysocki bfaa07bc32 Merge branch 'pm-drivers'
* pm-drivers:
  rtc-cmos: report wakeups from interrupt handler
  PM / crypto / ux500: Use struct dev_pm_ops for power management
  PM / IPMI: Remove empty legacy PCI PM callbacks
  tpm_nsc: Use struct dev_pm_ops for power management
  tpm_tis: Use struct dev_pm_ops for power management
  tpm_atmel: Use struct dev_pm_ops for power management
  PM / TPM: Drop unused pm_message_t argument from tpm_pm_suspend()
  omap-rng: Use struct dev_pm_ops for power management
  mg_disk: Use struct dev_pm_ops for power management
  msi-laptop: Use struct dev_pm_ops for power management
  hdaps: Use struct dev_pm_ops for power management
  sonypi: Use struct dev_pm_ops for power management
  intel_mid_thermal: Use struct dev_pm_ops for power management
  acer-wmi: Use struct dev_pm_ops for power management
  intel_ips: Remove empty legacy PM callbacks
  thinkpad_acpi: Use struct dev_pm_ops instead of legacy PM routines
  thinkpad_acpi: Drop pm_message_t arguments from suspend routines
2012-07-19 00:03:42 +02:00
Dan Carpenter 6a3ca4f188 rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 895cfcc810)
2012-07-17 21:30:31 -07:00
Yan, Zheng 236df3755d rbd: Fix ceph_snap_context size calculation
ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit f9f9a19044)
2012-07-17 21:30:19 -07:00
Silva Paulo 68d740d79c blk: fix wrong idr_pre_get() error check in loop.c
The idr_pre_get() function never returns a value < 0.  It returns 0 (no
memory) or 1 (OK).

Reported-by: Silva Paulo <psdasilva@yahoo.com>
[ Rewrote Silva's patch, but attributing it to Silva anyway  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14 15:39:58 -07:00
Rafael J. Wysocki 156ffcb42a mg_disk: Use struct dev_pm_ops for power management
Make the mg_disk driver define its PM callbacks through
a struct dev_pm_ops object rather than by using legacy PM hooks
in struct platform_driver.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-06 19:07:00 +02:00
Andi Kleen 0cc15d03bc floppy: Run floppy initialization asynchronous
floppy_init is quite slow, 3s on my test system to determine
that there is no floppy. Run it asynchronous to the other
init calls to improve boot time.

[jkosina@suse.cz: fix modular build]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-04 14:01:40 +02:00
Linus Torvalds dab058fd5f floppy: cancel any pending fd_timeouts before adding a new one
In commit 070ad7e793 ("floppy: convert to delayed work and
single-thread wq") the 'fd_timeout' timer was converted to a delayed
work.  However, the "del_timer(&fd_timeout)" was lost in the process,
and any previous pending timeouts would stay active when we then
re-queued the timeout.

This resulted in the floppy probe sequence having a (stale) 20s timeout
rather than the intended 3s timeout, and thus made booting with the
floppy driver (but no actual floppy controller) take much longer than it
should.

Of course, there's little reason for most people to compile the floppy
driver into the kernel at all, which is why most people never noticed.

Canceling the delayed work where we used to do the del_timer() fixes the
issue, and makes the floppy probing use the proper new timeout instead.
The three second timeout is still very wasteful, but better than the 20s
one.

Reported-and-tested-by: Andi Kleen <ak@linux.intel.com>
Reported-and-tested-by: Calvin Walton <calvin.walton@kepstin.ca>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-03 15:51:22 -07:00
Sage Weil 9a64e8e0ac Linux 3.5-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK
 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH
 dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m
 Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO
 QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u
 vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4=
 =CcWn
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc1'

Linux 3.5-rc1

Conflicts:
	net/ceph/messenger.c
2012-06-15 12:32:04 -07:00
Jens Axboe 6d407cfaf5 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus 2012-06-13 21:19:42 +02:00
Jens Axboe 987751719b Merge branch 'stable/for-jens-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus 2012-06-13 21:19:06 +02:00
Tao Guo 32587371ad umem: fix up unplugging
Fix a regression introduced by 7eaceaccab ("block: remove per-queue
plugging").  In that patch, Jens removed the whole mm_unplug_device()
function, which used to be the trigger to make umem start to work.

We need to implement unplugging to make umem start to work, or I/O will
never be triggered.

Signed-off-by: Tao Guo <Tao.Guo@emc.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-13 21:17:21 +02:00
Lars Ellenberg 0d5934e3c2 drbd: fix null pointer dereference with on-congestion policy when diskless
We must not look at mdev->actlog, unless we have a get_ldev() reference.
It also does not make much sense to try to disconnect or pull-ahead of
the peer, if we don't have good local data.

Only even consider congestion policies, if our local disk is D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:35:19 +02:00
Lars Ellenberg 1ed25b269e drbd: fix list corruption by failing but already aborted reads
If a read is aborted due to force-detach of a supposedly unresponsive
local backing device, and retried on the peer, it can happen that the
local request later still completes (hopefully with an error).
As it may already have been completed to upper layers meanwhile,
it must not be retried again now.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:34:51 +02:00
Lars Ellenberg 4eccc57979 drbd: fix access of unallocated pages and kernel panic
BUG: unable to handle kernel NULL pointer dereference at (null)
...
 [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
 [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]

This fixes an off-by-one error in the receive_bitmap() path,
if run-length encoded bitmap transfer is enabled.

If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
size), and bitmap compression (use-rle) is enabled (which became default
with 8.4), and the very last bit is dirty and reported in an rle
comressed bitmap packet, we ended up trying to kmap_atomic a page pointer
that does not exist (bitmap->bm_pages[last index + 1]).

bug introduced by:
    Date:   Fri Jul 24 15:33:24 2009 +0200
    set bits: optimize for complete last word, fix off-by-one-word corner case

made effective by:
    Date:   Thu Dec 16 00:32:38 2010 +0100
    drbd: get rid of unused debug code

    Long time ago, we had paranoia code in the bitmap that allocated one
    extra word, assigned a magic value, and checked on every occasion that
    the magic value was still unchanged.

    That debug code is unused, the extra long word complicates code a bit.
    Get rid of it.

No-one triggered this bug in the last few years, because a large subset
of our userbase is unaffected:
 * typically the last few blocks of a device are not modified
   frequently, and remain unset
 * use-rle was disabled by default in drbd < 8.4
 * those with slightly "odd" device sizes, or
 * drbd internal meta data (which will skew the device size slightly,
   thus makes it harder to have a bug relevant device size)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:32:48 +02:00
Konrad Rzeszutek Wilk 6878c32e5c xen/blkfront: Add WARN to deal with misbehaving backends.
Part of the ring structure is the 'id' field which is under
control of the frontend. The frontend stamps it with "some"
value (this some in this implementation being a value less
than BLK_RING_SIZE), and when it gets a response expects
said value to be in the response structure. We have a check
for the id field when spolling new requests but not when
de-spolling responses.

We also add an extra check in add_id_to_freelist to make
sure that the 'struct request' was not NULL - as we cannot
pass a NULL to __blk_end_request_all, otherwise that crashes
(and all the operations that the response is dealing with
end up with __blk_end_request_all).

Lastly we also print the name of the operation that failed.

[v1: s/BUG/WARN/ suggested by Stefano]
[v2: Add extra check in add_id_to_freelist]
[v3: Redid op_name per Jan's suggestion]
[v4: add const * and add WARN on failure returns]
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-06-12 08:29:04 -04:00
Dan Carpenter 895cfcc810 rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:54 -05:00
Yan, Zheng f9f9a19044 rbd: Fix ceph_snap_context size calculation
ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:53 -05:00
Asai Thambi S P 7b421d24ea mtip32xx: Create debugfs entries for troubleshooting
On module load, creates a debugfs parent 'rssd' in debugfs root. Then for each
device, create a new node with corresponding disk name. Under the new node, two
entries 'registers' and 'flags' are created.

NOTE: These entries were removed from sysfs in the previous patch

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-05 09:13:49 +02:00
Asai Thambi S P 7412ff139d mtip32xx: Remove 'registers' and 'flags' from sysfs
This patch removes entries 'registers' and 'flags' from sysfs. Updated ABI file
to reflect this change.

Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-05 09:13:48 +02:00
Sachin Kamat 87c9ea76a2 mtip32xx: Remove version.h header file inclusion
version.h header file inclusion is no longer required.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
2012-06-04 10:00:32 +02:00
Asai Thambi S P b77874c969 mtip32xx: Changes to sysfs entries
* Formatted the output of 'registers' entry
* Added "Commands in Q' to output of 'registers' entry
* Added a new entry 'flags'

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 8ce800935d mtip32xx: Convert macro definitions for flag bits to enum
Convert macro definitions for flags bits to enum

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 377b8fc6d7 mtip32xx: minor performance tweak
When checking for command completions if the register value is zero, proceed
to next register.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P e602878fd8 mtip32xx: Fix to support more than one sector in exec_drive_command()
Fix to support more than one sector in exec_drive_command().

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 0a07ab224a mtip32xx: Use plain spinlock for 'cmd_issue_lock'
'cmd_issue_lock' is for only acquiring a free slot, and it is not used
in interrupt context. So replaced irq version with non-irq version of spinlock.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 6c8ab69818 mtip32xx: Set block queue boundary variables
Set the following block queue boundary variables
	* max_hw_sectors
	* max_segment_size

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>

Removed setting of q->nr_requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P d02e1f0ad0 mtip32xx: Fix to handle TFE for PIO(IOCTL/internal) commands
If a PIO (IOCTL/internal) command resulted in TFE, signal the wait event or break out of polling.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 971890f258 mtip32xx: Change HDIO_GET_IDENTITY to return stored data
For the ioctl command HDIO_GET_IDENTITY, return the stored copy of IDENTIFY
DATA instead of sending the command to the device - similar to libata.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 2df7aa96e7 mtip32xx: Set custom timeouts for PIO commands
This change sets custom timeouts depending on PIO command.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 6bb688c048 mtip32xx: fix clearing an incorrect register in mtip_init_port
Fix clearing an incorrect register in mtip_init_port

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Konrad Rzeszutek Wilk 8c9ce606a6 xen/blkback: Copy id field when doing BLKIF_DISCARD.
We weren't copying the id field so when we sent the response
back to the frontend (especially with a 64-bit host and 32-bit
guest), we ended up using a random value. This lead to the
frontend crashing as it would try to pass to __blk_end_request_all
a NULL 'struct request' (b/c it would use the 'id' to find the
proper 'struct request' in its shadow array) and end up crashing:

BUG: unable to handle kernel NULL pointer dereference at 000000e4
IP: [<c0646d4c>] __blk_end_request_all+0xc/0x40
.. snip..
EIP is at __blk_end_request_all+0xc/0x40
.. snip..
 [<ed95db72>] blkif_interrupt+0x172/0x330 [xen_blkfront]

This fixes the bug by passing in the proper id for the response.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=824641

CC: stable@kernel.org
Tested-by: William Dauchy <wdauchy@gmail.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-30 17:20:04 -04:00
Linus Torvalds af56e0aa35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "There are some updates and cleanups to the CRUSH placement code, a bug
  fix with incremental maps, several cleanups and fixes from Josh Durgin
  in the RBD block device code, a series of cleanups and bug fixes from
  Alex Elder in the messenger code, and some miscellaneous bounds
  checking and gfp cleanups/fixes."

Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".

* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
  libceph: fix pg_temp updates
  libceph: avoid unregistering osd request when not registered
  ceph: add auth buf in prepare_write_connect()
  ceph: rename prepare_connect_authorizer()
  ceph: return pointer from prepare_connect_authorizer()
  ceph: use info returned by get_authorizer
  ceph: have get_authorizer methods return pointers
  ceph: ensure auth ops are defined before use
  ceph: messenger: reduce args to create_authorizer
  ceph: define ceph_auth_handshake type
  ceph: messenger: check return from get_authorizer
  ceph: messenger: rework prepare_connect_authorizer()
  ceph: messenger: check prepare_write_connect() result
  ceph: don't set WRITE_PENDING too early
  ceph: drop msgr argument from prepare_write_connect()
  ceph: messenger: send banner in process_connect()
  ceph: messenger: reset connection kvec caller
  libceph: don't reset kvec in prepare_write_banner()
  ceph: ignore preferred_osd field
  ceph: fully initialize new layout
  ...
2012-05-30 11:17:19 -07:00
Linus Torvalds a70f35af4e Merge branch 'for-3.5/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "Here are the driver related changes for 3.5.  It contains:

   - The floppy changes from Jiri.  Jiri is now also marked as the
     maintainer of floppy.c, I shall be publically branding his forehead
     with red hot iron at the next opportune moment.

   - A batch of drbd updates and fixes from the linbit crew, as well as
     fixes from others.

   - Two small fixes for xen-blkfront courtesy of Jan."

* 'for-3.5/drivers' of git://git.kernel.dk/linux-block: (70 commits)
  floppy: take over maintainership
  floppy: remove floppy-specific O_EXCL handling
  floppy: convert to delayed work and single-thread wq
  xen-blkfront: module exit handling adjustments
  xen-blkfront: properly name all devices
  drbd: grammar fix in log message
  drbd: check MODULE for THIS_MODULE
  drbd: Restore the request restart logic
  drbd: introduce a bio_set to allocate housekeeping bios from
  drbd: remove unused define
  drbd: bm_page_async_io: properly initialize page->private
  drbd: use the newly introduced page pool for bitmap IO
  drbd: add page pool to be used for meta data IO
  drbd: allow bitmap to change during writeout from resync_finished
  drbd: fix race between drbdadm invalidate/verify and finishing resync
  drbd: fix resend/resubmit of frozen IO
  drbd: Ensure that data_size is not 0 before using data_size-1 as index
  drbd: Delay/reject other state changes while establishing a connection
  drbd: move put_ldev from __req_mod() to the endio callback
  drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
  ...
2012-05-30 09:05:47 -07:00
Linus Torvalds 99262a3daf Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPuv35AAoJENkgDmzRrbjxUx4P/0uc+0oNnZv11vYQsqHuhURa
 zMlsVdlXGVkvPqQiLY0QkrK5LcO6KiSnSk8vEnOYFIPjL4wNqL/4RRRLnTAJwmE+
 lsrL9DblI8Ira/EZRv7d2L12QrP+F2ZGKOZr67uVxSaxH71fUqtiJ0jqA/I8AYH7
 /V7+DgdIB1DD28Ya/JEFEUi41F08A6MU10hpaQWy9kXv09gCc9apgvH7/S3s9DaQ
 G640YWkoKZAx/OFBb8XFvpu9LqZcVl02Nl8goMZOKnMctC4iU3km7HeVjfwCgLjO
 AdA5spLMhDkS/xrpI0mSQ/wT0k0+sSYW5vEdW9N4XLZza0NgH9GfU4RtEuK85Slj
 7bPviZOcpjtt0sGi4wXCaVjZyHROX6tyRvTMUAIj3D0oJglb5T9D3MCvQnadILb0
 I0+7gk3d9rHqkO6CmjNaZG9IwR9NpFkbuolcFQuEaZoUMoKd2pYNQyxpbFGl+jCl
 7ViFHAy+fydNqDoETKincld4A43KWxOV7jyEJd7hloKcCixsqI7ZdPS7X8amec72
 a0hfNgMJzarZkTgo61Hair/d+vKGRJPgEdF1Yq76SDhYKD1TeWeDjmboctsiMjqe
 f5M4C6IdNJj9cDIlCxMk+3bX250oy7KG77v7Ux0/7nvtSWVa3yEMowD57hnn1But
 0gNC8bjXDHRsho90rDRN
 =Kj9v
 -----END PGP SIGNATURE-----

Merge tag 'virtio-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

Pull virtio updates from Rusty Russell.

* tag 'virtio-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  virtio: fix typo in comment
  virtio-mmio: Devices parameter parsing
  virtio_blk: Drop unused request tracking list
  virtio-blk: Fix hot-unplug race in remove method
  virtio: Use ida to allocate virtio index
  virtio: balloon: separate out common code between remove and freeze functions
  virtio: balloon: drop restore_common()
  9p: disconnect channel when PCI device is removed
  virtio: update documentation to v0.9.5 of spec
2012-05-21 20:20:23 -07:00
Asias He f65ca1dc6a virtio_blk: Drop unused request tracking list
Benchmark shows small performance improvement on fusion io device.

Before:
  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec

After:
  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-05-22 12:16:14 +09:30
Asias He b79d866c8b virtio-blk: Fix hot-unplug race in remove method
If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Test:
~1000 rounds of hot-plug/hot-unplug test passed with this patch.

Changes in v3:
- Drop blk_abort_queue and blk_abort_request
- Use __blk_end_request_all to complete request dispatched to driver

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-05-22 12:16:13 +09:30
David S. Miller 17eea0df5f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-20 21:53:04 -04:00
Linus Torvalds 14e931a264 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A few small, but important fixes.  Most of them are marked for stable
  as well

   - Fix failure to release a semaphore on error path in mtip32xx.
   - Fix crashable condition in bio_get_nr_vecs().
   - Don't mark end-of-disk buffers as mapped, limit it to i_size.
   - Fix for build problem with CONFIG_BLOCK=n on arm at least.
   - Fix for a buffer overlow on UUID partition printing.
   - Trivial removal of unused variables in dac960."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix buffer overflow when printing partition UUIDs
  Fix blkdev.h build errors when BLOCK=n
  bio allocation failure due to bio_get_nr_vecs()
  block: don't mark buffers beyond end of disk as mapped
  mtip32xx: release the semaphore on an error path
  dac960: Remove unused variables from DAC960_CreateProcEntries()
2012-05-19 10:12:17 -07:00
Jens Axboe 4fd1ffaa12 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-3.5/drivers
Philipp writes:

This are the updates we have in the drbd-8.3 tree. They are intended
for your "for-3.5/drivers" drivers branch.

These changes include one new feature:
 * Allow detach from frozen backing devices with the new --force option;
   configurable timeout for backing devices by the new disk-timeout option

And huge number of bug fixes:
 * Fixed a write ordering problem on SyncTarget nodes for a write
   to a block that gets resynced at the same time. The bug can
   only be triggered with a device that has a firmware that
   actually reorders writes to the same block
 * Fixed a race between disconnect and receive_state, that could cause
   a IO lockup
 * Fixed resend/resubmit for requests with disk or network timeout
 * Make sure that hard state changed do not disturb the connection
   establishing process (I.e. detach due to an IO error). When the
   bug was triggered it caused a retry in the connect process
 * Postpone soft state changes to no disturb the connection
   establishing process (I.e. becoming primary). When the bug
   was triggered it could cause both nodes going into SyncSource state
 * Fixed a refcount leak that could cause failures when trying to
   unload a protocol family modules, that was used by DRBD
 * Dedicated page pool for meta data IOs
 * Deny normal detach (as opposed to --forced) if the user tries
   to detach from the last UpToDate disk in the resource
 * Fixed a possible protocol error that could be caused by
   "unusual" BIOs.
 * Enforce the disk-timeout option also on meta-data IO operations
 * Implemented stable bitmap pages when we do a full write out of
   the bitmap
 * Fixed a rare compatibility issue with DRBD's older than 8.3.7
   when negotiating the bio_size
 * Fixed a rare race condition where an empty resync could stall with
   if pause/unpause events happen in parallel
 * Made the re-establishing of connections quicker, if it got a broken pipe
   once. Previously there was a bug in the code caused it to waste the first
   successful established connection after a broken pipe event.

PS: I am postponing the drbd-8.4 for mainline for one or two kernel
    development cycles more (the ~400 patchets set).
2012-05-18 16:20:06 +02:00
Jens Axboe 13828dec45 Merge branch 'stable/for-jens-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.5/drivers
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.5

in your for-3.5/drivers branch. The changes in it are rather simple - cleaning
up some code and adding proper mechanism to unload without leaking memory.
2012-05-18 16:17:41 +02:00
Jiri Kosina bfa10b8c98 floppy: remove floppy-specific O_EXCL handling
Block layer now handles O_EXCL in a generic way for block devices.

The semantics is however different for floppy and all other block devices,
as floppy driver contains its own O_EXCL handling.

The semantics for all-but-floppy bdevs is "there can be at most one O_EXCL
open of this file", while for floppy bdev the semantics is "if someone has
the bdev open with O_EXCL, noone else can open it".

There is actual userspace-observable change in behavior because of this
since commit e525fd89d3 ("block: make blkdev_get/put() handle exclusive
access") -- on kernels containing this commit, mount of /dev/fd0 causes
the fd0 block device be claimed with _EXCL, preventing subsequent
open(/dev/fd0).

Bring things back into shape, i.e.  make it possible, analogically to
other block devices, to mount the floppy and open() it afterwards --
remove the floppy-specific handling and let the generic bdev code O_EXCL
handling take over.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-05-18 15:19:11 +02:00
Jiri Kosina 070ad7e793 floppy: convert to delayed work and single-thread wq
There are several races in floppy driver between bottom half
(scheduled_work) and timers (fd_timeout, fd_timer). Due to slowness
of the actual floppy devices, those races are never (at least to my
knowledge) triggered on a bare floppy metal. However on virtualized
(emulated) floppy drives, which are of course magnitudes faster
than the real ones, these races trigger reliably. They usually exhibit
themselves as NULL pointer dereferences during DMA setup, such as

	BUG: unable to handle kernel NULL pointer dereference at 0000000a
	[ ... snip ... ]
	EIP: 0060:[<c02053d5>] EFLAGS: 00010293 CPU: 0
	EAX: ffffe000 EBX: 0000000a ECX: 00000000 EDX: 0000000a
	ESI: c05d2718 EDI: 00000000 EBP: 00000000 ESP: f540fe44
	 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
	Process swapper (pid: 0, ti=f540e000 task=c082d5a0 task.ti=c0826000)
	Stack:
	 ffffe000 00001ffc 00000000 00000000 00000000 c05d2718 c0708b40 f540fe80
	 c020470f c05d2718 c0708b40 00000000 f540fe80 0000000a f540fee4 00000000
	 c0708b40 f540fee4 00000000 00000000 c020526b 00000000 c05d2718 c0708b40
	Call Trace:
	 [<c020470f>] dump_trace+0xaf/0x110
	 [<c020526b>] show_trace_log_lvl+0x4b/0x60
	 [<c0205298>] show_trace+0x18/0x20
	 [<c05c5811>] dump_stack+0x6d/0x72
	 [<c0248527>] warn_slowpath_common+0x77/0xb0
	 [<c02485f3>] warn_slowpath_fmt+0x33/0x40
	 [<f7ec593c>] setup_DMA+0x14c/0x210 [floppy]
	 [<f7ecaa95>] setup_rw_floppy+0x105/0x190 [floppy]
	 [<c0256d08>] run_timer_softirq+0x168/0x2a0
	 [<c024e762>] __do_softirq+0xc2/0x1c0
	 [<c02042ed>] do_softirq+0x7d/0xb0
	 [<f54d8a00>] 0xf54d89ff

but other instances can be easily seen as well. This can be observed at least under
VMWare, VirtualBox and KVM.

This patch converts all the timers and bottom halfs to be processed in a single
workqueue. This aproach has been already discussed back in 2010 if I remember
correctly, and Acked by Linus [1], but it then never made it to the tree.

This all is based on original idea and code of Stephen Hemminger.  I have
ported original Stepen's code to the current state of the floppy driver, and
performed quite some testing (on real hardware), which didn't reveal any issues
(this includes not only writing and reading data, but also formatting
(unfortunately I didn't find any Double-Density disks any more)). Ability to
handle errors properly (supplying known bad floppies) has also been verified.

[1] http://kerneltrap.org/mailarchive/linux-kernel/2010/6/11/4582092

Based-on-patch-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-05-18 15:19:10 +02:00
David S. Miller 028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Josh Durgin 263c6ca007 rbd: rename __rbd_update_snaps to __rbd_refresh_header
This function rereads the entire header and handles any changes in
it, not just changes in snapshots.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:13:09 -05:00
Josh Durgin 3591538fb2 rbd: fix snapshot size type
Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
an actual block device sizes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:13:03 -05:00
Josh Durgin b06e6a6be7 rbd: remove conditional snapid parameters
The snapid parameters passed to rbd_do_op() and rbd_req_sync_op()
are now always either a valid snapid or an explicit CEPH_NOSNAP.

[elder@dreamhost.com: Rephrased the description]

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:58 -05:00
Josh Durgin 77dfe99fe3 rbd: store snapshot id instead of index
When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:52 -05:00
Josh Durgin 403f24d3d5 rbd: protect read of snapshot sequence number
This is updated whenever a snapshot is added or deleted, and the
snapc pointer is changed with every refresh of the header.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:46 -05:00
Xi Wang 50f7c4c967 rbd: fix integer overflow in rbd_header_from_disk()
ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation.  Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.

Also use `u32' consistently for `snap_count'.

[elder@dreamhost.com: changed to use UINT_MAX rather than ULONG_MAX]

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-05-14 12:12:41 -05:00
Dan Carpenter f8ad495a8a rbd: use gfp_flags parameter in rbd_header_from_disk()
We should use the gfp_flags that the caller specified instead of
GFP_KERNEL here.

There is only one caller and it uses GFP_KERNEL, so this change is
just a cleanup and doesn't change how the code works.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-05-14 12:12:35 -05:00
Jan Beulich 8605067fb9 xen-blkfront: module exit handling adjustments
The blkdev major must be released upon exit, or else the module can't
attach to devices using the same majors upon being loaded again. Also
avoid leaking the minor tracking bitmap.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-11 16:11:54 -04:00
Jan Beulich e77c78c022 xen-blkfront: properly name all devices
- devices beyond xvdzz didn't get proper names assigned at all
- extended devices with minors not representable within the kernel's
  major/minor bit split spilled into foreign majors

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-11 16:11:52 -04:00
Asai Thambi S P a09ba13eef mtip32xx: release the semaphore on an error path
Release the semaphore in an error path in mtip_hw_get_scatterlist(). This
fixes the smatch warning inconsistent returns.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Jesper Juhl d88a440edd dac960: Remove unused variables from DAC960_CreateProcEntries()
The variables 'StatusProcEntry' and 'UserCommandProcEntry' are
assigned to once and then never used. This patch gets rid of the
variables.

While I was there I also fixed the indentation of the function to use
tabs rather than spaces for the lines that did not already do so.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Eric W. Biederman 38bf195398 connector/userns: replace netlink uses of cap_raised() with capable()
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.

In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).

Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN).  The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.

In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.

Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.

The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).

To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:21:39 -04:00
Lars Ellenberg 92b4ca291f drbd: grammar fix in log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-10 12:00:56 +02:00
Cong Wang bc4854bc91 drbd: check MODULE for THIS_MODULE
THIS_MODULE is NULL only when drbd is compiled as built-in,
so the #ifdef CONFIG_MODULES should be #ifdef MODULE instead.

This fixes the warning:

drivers/block/drbd/drbd_main.c: In function ‘drbd_buildtag’:
drivers/block/drbd/drbd_main.c:4187:24: warning: the comparison will always evaluate as ‘true’ for the address of ‘__this_module’ will never be NULL [-Waddress]

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-10 12:00:54 +02:00
Philipp Reisner f6d0a8dbfd drbd: Restore the request restart logic
It got lost with the commit 5a7bbad27a
"block: remove support for bio remapping from ->make_request"

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 17:20:59 +02:00
Lars Ellenberg 9476f39d66 drbd: introduce a bio_set to allocate housekeeping bios from
Don't rely on availability of bios from the global fs_bio_set,
we should use our own bio_set for meta data IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:07 +02:00
Lars Ellenberg 3c2f7a856f drbd: remove unused define
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:06 +02:00
Arne Redlich 0c7db27920 drbd: bm_page_async_io: properly initialize page->private
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
 ...).

Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2012-05-09 15:17:04 +02:00
Lars Ellenberg 4d95a10f97 drbd: use the newly introduced page pool for bitmap IO
Conflicts:

	drbd/drbd_bitmap.c

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:03 +02:00
Lars Ellenberg 4281808fb3 drbd: add page pool to be used for meta data IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:02 +02:00
Lars Ellenberg 0e8488ade2 drbd: allow bitmap to change during writeout from resync_finished
Symptom: messages similar to
 "FIXME asender in bm_change_bits_to,
  bitmap locked for 'write from resync_finished' by worker"

If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered.  If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.

To fix this, introduce the drbd_bm_write_copy_pages() variant.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:00 +02:00
Lars Ellenberg a574daf5d7 drbd: fix race between drbdadm invalidate/verify and finishing resync
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.

If *in that very moment* a new verify or resync is triggered,
this can race:
 ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
 FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.

This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.

Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:59 +02:00
Lars Ellenberg ba280c092e drbd: fix resend/resubmit of frozen IO
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).

Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".

Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit
  drbd: Implemented real timeout checking for request processing time
if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.

This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:58 +02:00
Philipp Reisner 5de738272e drbd: Ensure that data_size is not 0 before using data_size-1 as index
This could be exploited by a peer which runs modified code.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:56 +02:00
Philipp Reisner 197296ffed drbd: Delay/reject other state changes while establishing a connection
Changes to the role and disk state should be delayed or rejected
while we establish a connection.

This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.

The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:55 +02:00
Lars Ellenberg 46385c84ac drbd: move put_ldev from __req_mod() to the endio callback
One invocation in the endio handler is good enough,
we don't need mention it for each of the different ways
it calls __req_mod().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:51 +02:00
Lars Ellenberg d64957c9a9 drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
Just because this request happened during a resync does
not mean it may pretend to have been barrier-acked.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:50 +02:00
Lars Ellenberg 41c4a0035b drbd: fix READ_RETRY_REMOTE_CANCELED to not complete if device is suspended
READ_RETRY_REMOTE_CANCELED needs to be grouped with the other _CANCELED
cases, not with CONNECTION_LOST_WHILE_PENDING, as that would complete
(fail) the bio even if the device became suspended.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:48 +02:00
Lars Ellenberg 6d49e101fd drbd: make OOS_HANDED_TO_NETWORK its own case
OOS_HANDED_TO_NETWORK should not be grouped with the various
*_CANCELED/*_FAILED cases.
Also, not only clear the RQ_NET_QUEUED flag, but also mark it RQ_NET_DONE,
so it can be distinguished from a local-only request even after that.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:47 +02:00
Lars Ellenberg c088b2d904 drbd: don't pretend that barrier_nr == 0 was special
We used to have a barrier implementation where barrier_nr 0 was
reserved. That is long gone. Just use the full sequence space.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:46 +02:00
Lars Ellenberg 7ffcaa7194 drbd: remove unused static helper function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:44 +02:00
Lars Ellenberg a5d214f621 drbd: remove some very outdated comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:43 +02:00
Lars Ellenberg 1abc2af205 drbd: missing wakeup after drbd_rs_del_all
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:42 +02:00
Lars Ellenberg 671a74e749 drbd: remove now unused seq_num member from struct drbd_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:40 +02:00
Lars Ellenberg 001a88687a drbd: fix potential data corruption and protocol error
We assumed only bios with bi_idx == 0 would end up
in drbd_make_request().

That is wrong.

At least device mapper, in __clone_and_map(), may submit
clones only covering a partial bio, but sharing
the original bvec, by adjusting bi_idx and relevant
other bio members of the clone.

We used __bio_for_each_segment() in various places,
even though that is documented as
 * drivers should not use the __ version unless they _really_ want to
 * run through the entire bio and not just pending pieces

Impact: we would send the full bio bvec, even for the clone
with bi_idx > 0, which will cause data corruption on the
peer (because we submit wrong data at the clone offset),
and will cause a DRBD protocol error, disconnect/reconnect
and resync (thus fixing the corruption),
because the next package header would be expected right
in the middle of the sent data, causing DRBD magic mismatch.

Fix: drop the assert, and use bio_for_each_segment()
instead of the __ version.

Conflicts:

	drbd/drbd_tracing.c

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:39 +02:00
Philipp Reisner b6a370ba07 drbd: Fix a potential write ordering issue on SyncTarget nodes
If a SyncTarget node gets a P_RS_DATA_REPLY before a P_DATA packet
for the same sector, it simply submits these two IO requests.

  This is be possible because on the SyncSource node, the data of the
  P_RS_DATA_REPLY packet was read from disk.  Immediately after that a
  write request from upper layers came in.

The disk scheduler or even the "hardware" queues on the disk drive might
reorder these writes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:38 +02:00
Philipp Reisner fc28845bc0 drbd: Fix a potential race that could case data inconsistency
When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE
at the same time, and it happens that the line

	remote = remote && drbd_should_do_remote(s);

stills sees C_WF_BITMAP_S, and

	send_oos = rw == WRITE && drbd_should_send_oos(s);

already sees C_SYNC_SOURCE both are 0.

This causes the write to not be mirrored, but marked as out-of-sync on the
Sync_Source node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:34 +02:00
Lars Ellenberg 031a7c173f drbd: add missing part_round_stats to _drbd_start_io_acct
Without this, iostat frequently sees bogus svctime and >= 100% "utilization".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:33 +02:00
Lars Ellenberg 47a4f1c1bb drbd: Fix module refcount leak in drbd_accept()
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.

Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()

Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:32 +02:00
Philipp Reisner 7caacb69ac drbd: Consider the disk-timeout also for meta-data IO operations
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().

This patch adds the timeout to all drbd_md_sync_page_io().

Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:30 +02:00
Philipp Reisner 4afc433cf8 drbd: Do not send state packets while lower than C_CONNECTED cstate
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.

Actually triggered by the Testuite on kugel.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:29 +02:00
Lars Ellenberg 545752d5d8 drbd: fix race between disconnect and receive_state
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.

The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.

Fixed in several code paths:
  Make sure the connection state is NetworkFailure or worse
  before starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

  Disallow receive_state() to "upgrade" the connection state
  from an error state. This will make sure the "illegal" state
  transition won't happen.

  For all connection failure states,
  relax the safe-guard in sanitize_state() again
  to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:01 +02:00
Lars Ellenberg 763eb63625 drbd: fix potential spinlock deadlock
drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks
left to be resynced (rs_left) in the current resync extent.
If it detects a mismatch, it complains, and forces a disconnect using
drbd_force_state(mdev, NS(conn, C_DISCONNECTING));

Unfortunately, this may be called while holding the req_lock,
and drbd_force_state() want's to aquire that lock itself. Deadlock.

Don't force a disconnect, but fix up rs_left by recounting and
reassigning the number of dirty blocks in that extent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:58 +02:00
Philipp Reisner e89868a092 drbd: Fixed an obvious copy-n-paste mistake
This bug might have caused troubles if disk-barriers and the ahead-behind
more are enabled at the same time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:57 +02:00
Lars Ellenberg f479ea0661 drbd: send intermediate state change results to the peer
DRBD state changes schedule after_state_ch() actions to a worker thread,
which decides on the old and new states of that change, whether to send
an informational state update packet (P_STATE) to the peer.
If it decides to drbd_send_state(), it would however always send the
_curent_ state, which, if a second state change happens before the
after_state_ch() of the first ran, may "fast-forward" the peer's view
about this node.  In most cases that is harmless, but sometimes this can
confuse DRBD, for example into not actually starting a necessary resync
if you do a very tight detach/attach loop on a Connected Secondary.

Fix this by always sending the "new" state of the respective state
transition which scheduled this after_state_ch() work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:56 +02:00
Lars Ellenberg a2e9138197 drbd: fix spurious meta data IO "error"
When detaching, even cleanly detaching due to administrator request,
we always go through D_FAILED before we become D_DISKLESS.

Don't let that state change race with an in-flight meta data IO,
or that one might think it actually experienced an IO error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:54 +02:00
Philipp Reisner aaae506d54 drbd: Fixed a race condition between detach and start of resync
drbd_state_lock() is only there to serialize cluster wide state
changes. Testing the local disk state needs to happen while
holding the global_state_lock.

Otherwise you might see something like this (Oct 6 on kugel)
14:20:24 drbd0: conn( WFSyncUUID -> Connected ) disk( Inconsistent -> Failed )
14:20:24 drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
14:20:24 drbd0: conn( Connected -> SyncTarget ) disk( Failed -> Inconsistent )

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:53 +02:00
Lars Ellenberg 6a9a92f4ef drbd: fix harmless race to not trigger an ASSERT
We have one pre-allocated page to do certain synchronous meta data IO with,
using it is serialized like so:
	drbd_md_get_buffer();
	drbd_md_sync_page_io();
	drbd_md_sync_page_io();
	...
	drbd_md_put_buffer();

In drbd_md_sync_page_io() there is an
	ASSERT(atomic_read(&mdev->md_io_in_use) == 1);

We want to be able to timeout on unresponsive lower level devices, so we
can "detach" in that case. Inside drbd_md_sync_page_io() we grab an extra
reference, to not have a dangling pointer in case a delayed IO eventually
does still complete, even after we "detached" already.

We need to put the extra reference before we signal completion from the
completion handler, or the second drbd_md_sync_page_io() above may
trigger the assert (reference count still 2).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:52 +02:00
Philipp Reisner 5ba3dac521 drbd: Derive sync-UUIDs only from the bitmap-uuid if it is non-zero
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:50 +02:00
Andreas Gruenbacher 7b4e4d3126 drbd: drbd_nl_resize(): Fix missing put_ldev() on error path
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:49 +02:00
Lars Ellenberg 40424e4a24 drbd: fix "stalled" empty resync
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.

Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:47 +02:00
Philipp Reisner 1e86ac48af drbd: Bugfix for the connection behavior
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting.  But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING.  The
restart of drbd_init() then resets thi->t_state to RUNNING.

I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.

The two parts of the fix:
  * Do not cause the thread to restart if we detect the issue
    with the sockets while we are in C_WF_CONNECTION.

  * Make sure that all actions that would have set us to C_BROKEN_PIPE
    happen before the state change to C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:46 +02:00
Philipp Reisner 80f9fd55a6 drbd: Cleanup all epoch objects upon connection loss
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:44 +02:00
Philipp Reisner fd2491f4a4 drbd: detach must not try to abort non-local requests from drbd-8.4
Cherry picked form 8.4

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:43 +02:00
Philipp Reisner 79f16f5dbc drbd: Consider that the no-data-condition could be in connected state
...when the peer has inconsistent data. In that case we failed to
clear the susp_nod flag. When the local disk was attached again

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:42 +02:00
Philipp Reisner bca482e90b drbd: Fixed current UUID generation
Now, the new edition of the clause only fires if a diskless
peer gets promoted.

This is a fixup for "drbd: Delayed creation of current-UUID".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:50 +02:00
Lars Ellenberg 22f46ce2ef drbd: change some GFP_KERNEL to GFP_NOIO
Bitmap IO may happend in the context of an application write,
in the generic block IO path.  We need to use GFP_NOIO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:47 +02:00
Philipp Reisner dfa8bedbfe drbd: Implemented the disk-timeout option
When the disk-timeout is active, and it expires for a single request,
we consider the local disk as D_FAILED. Note: With this change,
I made both timeout based state transitions HARD state transitions.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:45 +02:00
Philipp Reisner 02ee8f95fa drbd: Force flag for the detach operation
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:38 +02:00
Philipp Reisner 5ca1de0384 drbd: Allow new IOs while the local disk in in FAILED state
The last bunch of commits prepared the 'detach from tar pit' feature.
With that we can be for long time in disk state FAILED. We need
to accept new IO requests during that time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:34 +02:00
Philipp Reisner 9e58c4dad7 drbd: Bitmap IO functions can now return prematurely if the disk breaks
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:33 +02:00
Philipp Reisner d1f3779bbe drbd: Added a kref to bm_aio_ctx
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:37:19 +02:00
Philipp Reisner b2057629ea drbd: Hold a reference to ldev while doing meta-data IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:31:11 +02:00
Philipp Reisner 4a2fe568b5 drbd: Keep a reference to the bio until the completion handler finished
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:28:51 +02:00
Philipp Reisner 0c46442515 drbd: Implemented wait_until_done_or_disk_failure()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:26:51 +02:00
Philipp Reisner e17117310b drbd: Replaced md_io_mutex by an atomic: md_io_in_use
The new function drbd_md_get_buffer() aborts waiting for the buffer
in case the disk failes in the meantime.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:22:31 +02:00
Philipp Reisner cc94c65015 drbd: moved md_io into mdev
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:17:24 +02:00
Philipp Reisner 2b4dd36fba drbd: Immediately allow completion of IOs, that wait for IO completions on a failed disk
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:16:04 +02:00
Philipp Reisner 6d7e32f568 drbd: Keep a reference to barrier acked requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:15:28 +02:00
Philipp Reisner 6809384c71 drbd: Improve compatibility with drbd's older than 8.3.7
Regression introduced with 8.3.11 commit:
drbd: Take a more conservative approach when deciding max_bio_size

Never ever tell an older drbd, that we support more than 32KiB
in a single data request (packet).
Never believe an older drbd, that is supports more than 32KiB
in a single data request (packet)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:08:57 +02:00
Philipp Reisner 77e8fdfc18 drbd: Only print sanitize state's warnings, if the state change happens
The reason for this change is that, with when doing
'drbdadm invalidate' on a disconnected resource caused
an "implicitly set pdsk from UpToDate to DUnknown" message,
which was missleading.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:08:22 +02:00
Lars Ellenberg 07667347c8 drbd: downgraded error printk to info
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:05:25 +02:00
David Howells 5f138ce01a DRBD: Fix comparison always false warning due to long/long long compare
Fix warnings of the following nature in the drbd header:

In file included from drivers/block/drbd/drbd_bitmap.c:32:
drivers/block/drbd/drbd_int.h: In function 'drbd_get_syncer_progress':
drivers/block/drbd/drbd_int.h:2234: warning: comparison is always false due to limited range of data

where mdev->rs_total (an unsigned long) is being compared to 1ULL << 32, which
is always false on a 32-bit machine.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2012-05-09 10:03:19 +02:00
Lars Ellenberg 7948bcdc38 drbd: spelling fix: too small
It is not "to small", but "too small".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:02:22 +02:00
Lars Ellenberg 1381e9a496 drbd: cosmetic: fix accidental division instead of modulo when pretty printing
For large resync rates, seq_printf_with_thousands_grouping()
accidentally only produced Y,000,00Y, instead of the real numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:01:39 +02:00
Philipp Reisner ebd2b0cde5 drbd: Lower log priority for an event that is definitely not an error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 09:59:29 +02:00
Sage Weil 3469ac1aa3 ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph.  Do
this gracefully:

 - reject attempts to specify a preferred_osd via the ioctl
 - stop exposing this information via virtual xattrs
 - always fill in -1 for requests, in case we talk to an older server
 - don't calculate preferred_osd placements/pgids

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:33:36 -07:00
David S. Miller f24001941c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")

The former moved around the sysctl register/unregister calls, the
later simply removed them.

With help from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:15:17 -04:00
Pavel Emelyanov 4a17fd5229 sock: Introduce named constants for sk_reuse
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Linus Torvalds c1acb0ba33 Fixes in various components:
* mechanism to work with misconfigured backends (where they are
    advertised but in reality don't exist).
  * two tiny compile warning fixes.
  * proper error handling in gnttab_resume
  * Not using VM_PFNMAP anymore to allow backends in the same domain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPkYeqAAoJEFjIrFwIi8fJEXIIAI+PYLNMcHTc4bxa6pErpKaS
 rq5eCXL9+EaZOwTUqHRJjfrjnlAc+BWO8lN0H41oRQWFYh14hgfUVJ+ziEujb1kw
 N1eTMVHnH/XRJV6rIFX+TiBasnyoMmNfWEAb45UL1nEUTMPL1Jv7AiRY/GxUlHyg
 M+uFG52KP3ytXxcIiGW6pYEqJd6UgWrqnclaeg5TR5zvDlWfJbUIBEMQ/PyV0WSS
 4e7biiwi4XPWT2f1qewOmI+3r68CltU3GAs1XxjcSX+bYYuh00UtY39AsBWo2N8I
 1VORuq0QPs+GB22r3e47IqBcjXkBGRIf6w1e/5a6WLiq7TqVq4bYgCGUmWT8V7o=
 =olCU
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen fixes from Konrad Rzeszutek Wilk:
 - mechanism to work with misconfigured backends (where they are
   advertised but in reality don't exist).
 - two tiny compile warning fixes.
 - proper error handling in gnttab_resume
 - Not using VM_PFNMAP anymore to allow backends in the same domain.

* tag 'stable/for-linus-3.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  Revert "xen/p2m: m2p_find_override: use list_for_each_entry_safe"
  xen/resume: Fix compile warnings.
  xen/xenbus: Add quirk to deal with misconfigured backends.
  xen/blkback: Fix warning error.
  xen/p2m: m2p_find_override: use list_for_each_entry_safe
  xen/gntdev: do not set VM_PFNMAP
  xen/grant-table: add error-handling code on failure of gnttab_resume
2012-04-20 11:31:00 -07:00
Konrad Rzeszutek Wilk a71e23d992 xen/blkback: Fix warning error.
drivers/block/xen-blkback/xenbus.c: In function 'xen_blkbk_discard':
drivers/block/xen-blkback/xenbus.c:419:4: warning: passing argument 1 of 'dev_warn' makes pointer from integer without a cast
+[enabled by default]
include/linux/device.h:894:5: note: expected 'const struct device *' but argument is of type 'long int'

It is unclear how that mistake made it in. It surely is wrong.

Acked-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-18 15:54:08 -04:00
Linus Torvalds cdd5983063 virtio: fixes on top of 3.4-rc2
Here are some virtio fixes for 3.4:
 a test build fix, a patch by Ren fixing naming for systems with a massive
 number of virtio blk devices, and balloon fixes for powerpc
 by David Gibson.
 
 There was some discussion about Ren's patch for virtio disc naming: some people
 wanted to move the legacy name mangling function to the block core.  But
 there's no concensus on that yet, and we can always deduplicate later.
 Added comments in the hope that this will stop people from
 copying this legacy naming scheme into future drivers.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPio1GAAoJECgfDbjSjVRpGDAH/3C/bXm9mriuNauRHwktHgJe
 gmh2BfUgnxly6vheuz0Fv61lTe6V8kekHVolbUYwAUgXeWEKK1C59xehrMGRIPDG
 1XUiti50U3P+skhIfrbkS5nZ7L+5Hk0ToQ6dd9v0BM2GxDOvgwidlY1bZe+wJEZf
 Lvl6w/djBCr1e3k4qfRnpTcdJJ4FnOjGbikLQhSTGfUXeNo6uWS1hljYWnAhzFkd
 1xU8h5PP0TDR0nYb80CeB+9Lxw0w4qyNPJIBhNN6ucB/1U6R+55HpEpmrLUkn910
 sEFEFsc0cRVWr8FiOTlmzxLHnwTc8AY/Bsp9TMSmnTRu3ZQcoQMTQQCczRj04xI=
 =VmpJ
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael S. Tsirkin:
 "Here are some virtio fixes for 3.4: a test build fix, a patch by Ren
  fixing naming for systems with a massive number of virtio blk devices,
  and balloon fixes for powerpc by David Gibson.

  There was some discussion about Ren's patch for virtio disc naming:
  some people wanted to move the legacy name mangling function to the
  block core.  But there's no concensus on that yet, and we can always
  deduplicate later.  Added comments in the hope that this will stop
  people from copying this legacy naming scheme into future drivers."

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: fix handling of PAGE_SIZE != 4k
  virtio_balloon: Fix endian bug
  virtio_blk: helper function to format disk names
  tools/virtio: fix up vhost/test module build
2012-04-16 18:34:12 -07:00
Linus Torvalds c104f1fa1e Merge branch 'for-3.4/drivers' of git://git.kernel.dk/linux-block
Pull block driver bits from Jens Axboe:

 - A series of fixes for mtip32xx.  Most from Asai at Micron, but also
   one from Greg, getting rid of the dependency on PCIE_HOTPLUG.

 - A few bug fixes for xen-blkfront, and blkback.

 - A virtio-blk fix for Vivek, making resize actually work.

 - Two fixes from Stephen, making larger transfers possible on cciss.
   This is needed for tape drive support.

* 'for-3.4/drivers' of git://git.kernel.dk/linux-block:
  block: mtip32xx: remove HOTPLUG_PCI_PCIE dependancy
  mtip32xx: dump tagmap on failure
  mtip32xx: fix handling of commands in various scenarios
  mtip32xx: Shorten macro names
  mtip32xx: misc changes
  mtip32xx: Add new sysfs entry 'status'
  mtip32xx: make setting comp_time as common
  mtip32xx: Add new bitwise flag 'dd_flag'
  mtip32xx: fix error handling in mtip_init()
  virtio-blk: Call revalidate_disk() upon online disk resize
  xen/blkback: Make optional features be really optional.
  xen/blkback: Squash the discard support for 'file' and 'phy' type.
  mtip32xx: fix incorrect value set for drv_cleanup_done, and re-initialize and start port in mtip_restart_port()
  cciss: Fix scsi tape io with more than 255 scatter gather elements
  cciss: Initialize scsi host max_sectors for tape drive support
  xen-blkfront: make blkif_io_lock spinlock per-device
  xen/blkfront: don't put bdev right after getting it
  xen-blkfront: use bitmap_set() and bitmap_clear()
  xen/blkback: Enable blkback on HVM guests
  xen/blkback: use grant-table.c hypercall wrappers
2012-04-13 18:45:13 -07:00
Ren Mingxin c0aa3e0916 virtio_blk: helper function to format disk names
The current virtio block's naming algorithm just supports 18278
(26^3 + 26^2 + 26) disks. If there are more virtio blocks,
there will be disks with the same name.

Based on commit 3e1a7ff8a0, add
a function "virtblk_name_format()" for virtio block to support mass
of disks naming.

Notes:
- Our naming scheme is ugly. We are stuck with it
  for virtio but don't use it for any new driver:
  new drivers should name their devices PREFIX%d
  where the sequence number can be allocated by ida
- sd_format_disk_name has exactly the same logic.
  Moving it to a central place was deferred over worries
  that this will make people keep using the legacy naming
  in new drivers.
  We kept code idential in case someone wants to deduplicate later.

Signed-off-by: Ren Mingxin <renmx@cn.fujitsu.com>
Acked-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-04-12 10:37:05 +03:00
Greg Kroah-Hartman 6363480651 block: mtip32xx: remove HOTPLUG_PCI_PCIE dependancy
This removes the HOTPLUG_PCI_PCIE dependency on the driver and makes it
depend on PCI.

Cc: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-12 08:47:05 +02:00
Asai Thambi S P 95fea2f1d9 mtip32xx: dump tagmap on failure
Dump tagmap on failure, instead of individual tags.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:39 +02:00
Asai Thambi S P c74b0f586f mtip32xx: fix handling of commands in various scenarios
* If a ncq  command time out and a non-ncq command is active, skip restart port
* Queue(pause) ncq commands during operations spanning more than one non-ncq commands - secure erase, download microcode
* When a non-ncq command is active, allow incoming non-ncq commands to wait instead of failing back
* Changed timeout for download microcode and smart commands
* If the device in write protect mode, fail all writes (do not send to device)
* Set maximum retries to 2

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:39 +02:00
Asai Thambi S P 8a857a880b mtip32xx: Shorten macro names
Shortened macros used to represent mtip_port->flags and dd->dd_flag

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P 8182b49528 mtip32xx: misc changes
* Handle the interrupt completion of polled internal commands
* Do not check remove pending flag for standby command
* On rebuild failure,
    - set corresponding bit dd_flag
    - do not send standby command
* Free ida index in remove path

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P f65872177d mtip32xx: Add new sysfs entry 'status'
* Add support for detecting the following device status
        - write protect
        - over temp (thermal shutdown)
* Add new sysfs entry 'status', possible values - online, write_protect, thermal_shutdown
* Add new file 'sysfs-block-rssd' to document ABI (Reported-by: Greg Kroah-Hartman)

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P dad40f16ff mtip32xx: make setting comp_time as common
Moved setting completion time into mtip_issue_ncq_command()

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P 45038367c2 mtip32xx: Add new bitwise flag 'dd_flag'
* Merged the following flags into one variable 'dd_flag':
        * drv_cleanup_done
        * resumeflag
* Added the following flags into 'dd_flag'
        * remove pending
        * init done
* Removed 'ftlrebuildflag' (similar flag is already part of mti_port->flags)

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Linus Torvalds 9479f0f801 Two fixes for regressions:
* one is a workaround that will be removed in v3.5 with proper fix in the tip/x86 tree,
  * the other is to fix drivers to load on PV (a previous patch made them only
    load in PVonHVM mode).
 
 The rest are just minor fixes in the various drivers and some cleanup in the
 core code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPfyVUAAoJEFjIrFwIi8fJUjUH/jbY5JavRqSlNELZW2A4Ta76
 8p00LqLHw/C56iHZcWKke8mqtWNb+ZfcQt7ZYcxDIYa4QWBL28x0OLAO2tOBIt37
 ZjYESWSdFJaJvmpADluWtFyGyZ9TYJllDTBm/jWj1ZtKSZvR1YkhuMXCS0f4AmGQ
 xFzSWJZUDdiOAqpN+VQD8wP00gfR8knQLg16XE2fvFdQo4XwpCtqLfHV/5pMMGdy
 Cs/ep6rq/7cdv/nshKOcBnw7RW8l3Xoi/28ht8k3DvAQ2VtFq1Tugv2G9pcCHwQG
 DIBkB3SOU6/v6P5at5+egKS5xR1fJetCWlkMd8kkbcdz2NPI4UDMkvOW6Q8yQls=
 =6Ve+
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen fixes from Konrad Rzeszutek Wilk:
 "Two fixes for regressions:
   * one is a workaround that will be removed in v3.5 with proper fix in
     the tip/x86 tree,
   * the other is to fix drivers to load on PV (a previous patch made
     them only load in PVonHVM mode).

  The rest are just minor fixes in the various drivers and some cleanup
  in the core code."

* tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pcifront: avoid pci_frontend_enable_msix() falsely returning success
  xen/pciback: fix XEN_PCI_OP_enable_msix result
  xen/smp: Remove unnecessary call to smp_processor_id()
  xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
  xen: only check xen_platform_pci_unplug if hvm
2012-04-06 17:54:53 -07:00
Igor Mammedov e95ae5a493 xen: only check xen_platform_pci_unplug if hvm
commit b9136d207f08
  xen: initialize platform-pci even if xen_emul_unplug=never

breaks blkfront/netfront by not loading them because of
xen_platform_pci_unplug=0 and it is never set for PV guest.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 12:12:52 -04:00
Alex Elder cd9d9f5df6 rbd: don't hold spinlock during messenger flush
A recent change made changes to the rbd_client_list be protected by
a spinlock.  Unfortunately in rbd_put_client(), the lock is taken
before possibly dropping the last reference to an rbd_client, and on
the last reference that eventually calls flush_workqueue() which can
sleep.

The problem was flagged by a debug spinlock warning:
    BUG: spinlock wrong CPU on CPU#3, rbd/27814

The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-04-05 15:43:58 -05:00
Ryosuke Saito 6d27f09a63 mtip32xx: fix error handling in mtip_init()
Ensure that block device is properly unregistered, if
pci_register_driver() fails.

Signed-off-by: Ryosuke Saito <raitosyo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-05 08:09:34 -06:00
Len Brown f6365201d8 x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() facility
The X86_32-only disable_hlt/enable_hlt mechanism was used by the
32-bit floppy driver. Its effect was to replace the use of the
HLT instruction inside default_idle() with cpu_relax() - essentially
it turned off the use of HLT.

This workaround was commented in the code as:

 "disable hlt during certain critical i/o operations"

 "This halt magic was a workaround for ancient floppy DMA
  wreckage. It should be safe to remove."

H. Peter Anvin additionally adds:

 "To the best of my knowledge, no-hlt only existed because of
  flaky power distributions on 386/486 systems which were sold to
  run DOS.  Since DOS did no power management of any kind,
  including HLT, the power draw was fairly uniform; when exposed
  to the much hhigher noise levels you got when Linux used HLT
  caused some of these systems to fail.

  They were by far in the minority even back then."

Alan Cox further says:

 "Also for the Cyrix 5510 which tended to go castors up if a HLT
  occurred during a DMA cycle and on a few other boxes HLT during
  DMA tended to go astray.

  Do we care ? I doubt it. The 5510 was pretty obscure, the 5520
  fixed it, the 5530 is probably the oldest still in any kind of
  use."

So, let's finally drop this.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephen Hemminger <shemminger@vyatta.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-3rhk9bzf0x9rljkv488tloib@git.kernel.org
[ If anyone cares then alternative instruction patching could be
  used to replace HLT with a one-byte NOP instruction. Much simpler. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-30 08:50:27 +02:00
Vivek Goyal e9986f303d virtio-blk: Call revalidate_disk() upon online disk resize
If a virtio disk is open in guest and a disk resize operation is done,
(virsh blockresize), new size is not visible to tools like "fdisk -l".
This seems to be happening as we update only part->nr_sects and not
bdev->bd_inode size.

Call revalidate_disk() which should take care of it. I tested growing disk
size of already open disk and it works for me.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-29 10:09:44 +02:00
Linus Torvalds 532bfc851a Merge branch 'akpm' (Andrew's patch-bomb)
Merge third batch of patches from Andrew Morton:
 - Some MM stragglers
 - core SMP library cleanups (on_each_cpu_mask)
 - Some IPI optimisations
 - kexec
 - kdump
 - IPMI
 - the radix-tree iterator work
 - various other misc bits.

 "That'll do for -rc1.  I still have ~10 patches for 3.4, will send
  those along when they've baked a little more."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  backlight: fix typo in tosa_lcd.c
  crc32: add help text for the algorithm select option
  mm: move hugepage test examples to tools/testing/selftests/vm
  mm: move slabinfo.c to tools/vm
  mm: move page-types.c from Documentation to tools/vm
  selftests/Makefile: make `run_tests' depend on `all'
  selftests: launch individual selftests from the main Makefile
  radix-tree: use iterators in find_get_pages* functions
  radix-tree: rewrite gang lookup using iterator
  radix-tree: introduce bit-optimized iterator
  fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
  nbd: rename the nbd_device variable from lo to nbd
  pidns: add reboot_pid_ns() to handle the reboot syscall
  sysctl: use bitmap library functions
  ipmi: use locks on watchdog timeout set on reboot
  ipmi: simplify locking
  ipmi: fix message handling during panics
  ipmi: use a tasklet for handling received messages
  ipmi: increase KCS timeouts
  ipmi: decrease the IPMI message transaction time in interrupt mode
  ...
2012-03-28 17:19:28 -07:00
Wanlong Gao f4507164e7 nbd: rename the nbd_device variable from lo to nbd
rename the nbd_device variable from "lo" to "nbd", since "lo" is just a name
copied from loop.c.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:37 -07:00
Linus Torvalds 0195c00244 Disintegrate and delete asm/system.h
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
 u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
 ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
 rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
 eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
 /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
 FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
 NypQthI85pc=
 =G9mT
 -----END PGP SIGNATURE-----

Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
 "Here are a bunch of patches to disintegrate asm/system.h into a set of
  separate bits to relieve the problem of circular inclusion
  dependencies.

  I've built all the working defconfigs from all the arches that I can
  and made sure that they don't break.

  The reason for these patches is that I recently encountered a circular
  dependency problem that came about when I produced some patches to
  optimise get_order() by rewriting it to use ilog2().

  This uses bitops - and on the SH arch asm/bitops.h drags in
  asm-generic/get_order.h by a circuituous route involving asm/system.h.

  The main difficulty seems to be asm/system.h.  It holds a number of
  low level bits with no/few dependencies that are commonly used (eg.
  memory barriers) and a number of bits with more dependencies that
  aren't used in many places (eg.  switch_to()).

  These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

        Move memory barriers here.  This already done for MIPS and Alpha.

    (2) asm/switch_to.h

        Move switch_to() and related stuff here.

    (3) asm/exec.h

        Move arch_align_stack() here.  Other process execution related bits
        could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

        Move xchg() and cmpxchg() here as they're full word atomic ops and
        frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

        Move die() and related bits.

    (6) asm/auxvec.h

        Move AT_VECTOR_SIZE_ARCH here.

  Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that.  We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
  Delete all instances of asm/system.h
  Remove all #inclusions of asm/system.h
  Add #includes needed to permit the removal of asm/system.h
  Move all declarations of free_initmem() to linux/mm.h
  Disintegrate asm/system.h for OpenRISC
  Split arch_align_stack() out from asm-generic/system.h
  Split the switch_to() wrapper out of asm-generic/system.h
  Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
  Create asm-generic/barrier.h
  Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
  Disintegrate asm/system.h for Xtensa
  Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
  Disintegrate asm/system.h for Tile
  Disintegrate asm/system.h for Sparc
  Disintegrate asm/system.h for SH
  Disintegrate asm/system.h for Score
  Disintegrate asm/system.h for S390
  Disintegrate asm/system.h for PowerPC
  Disintegrate asm/system.h for PA-RISC
  Disintegrate asm/system.h for MN10300
  ...
2012-03-28 15:58:21 -07:00
Linus Torvalds 47b816ff7d Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull a few more things for powerpc by Benjamin Herrenschmidt:
 - Anton's did some recent improvements to EPOW event reporting on
   pSeries (power supply failures and such).  The patches are self
   contained enough and replace really nasty code so I felt it should
   still go in
 - I did the vio driver registration change Greg requested, I don't see
   the point of leaving that til the next merge window
 - The remaining EEH changes I said were still pending to get rid of the
   EEH references from the generic struct device_node
 - A few more iSeries removal bits
 - A perf bug fix on 970

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/perf: Fix instruction address sampling on 970 and Power4
  powerpc+sparc/vio: Modernize driver registration
  powerpc: Random little legacy iSeries removal tidy ups
  powerpc: Remove NO_IRQ_IGNORE
  powerpc/pseries: Cut down on enthusiastic use of defines in RAS code
  powerpc/pseries: Clean up ras_error_interrupt code
  powerpc/pseries: Remove RTAS_POWERMGM_EVENTS
  powerpc/pseries: Use rtas_get_sensor in RAS code
  powerpc/pseries: Parse and handle EPOW interrupts
  powerpc: Make function that parses RTAS error logs global
  powerpc/eeh: Retrieve PHB from global list
  powerpc/eeh: Remove eeh information from pci_dn
  powerpc/eeh: Remove eeh device from OF node
2012-03-28 14:41:36 -07:00
David Howells 9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Linus Torvalds 56b59b429b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates for 3.4-rc1 from Sage Weil:
 "Alex has been busy.  There are a range of rbd and libceph cleanups,
  especially surrounding device setup and teardown, and a few critical
  fixes in that code.  There are more cleanups in the messenger code,
  virtual xattrs, a fix for CRC calculation/checks, and lots of other
  miscellaneous stuff.

  There's a patch from Amon Ott to make inos behave a bit better on
  32-bit boxes, some decode check fixes from Xi Wang, and network
  throttling fix from Jim Schutt, and a couple RBD fixes from Josh
  Durgin.

  No new functionality, just a lot of cleanup and bug fixing."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
  rbd: move snap_rwsem to the device, rename to header_rwsem
  ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
  libceph: isolate kmap() call in write_partial_msg_pages()
  libceph: rename "page_shift" variable to something sensible
  libceph: get rid of zero_page_address
  libceph: only call kernel_sendpage() via helper
  libceph: use kernel_sendpage() for sending zeroes
  libceph: fix inverted crc option logic
  libceph: some simple changes
  libceph: small refactor in write_partial_kvec()
  libceph: do crc calculations outside loop
  libceph: separate CRC calculation from byte swapping
  libceph: use "do" in CRC-related Boolean variables
  ceph: ensure Boolean options support both senses
  libceph: a few small changes
  libceph: make ceph_tcp_connect() return int
  libceph: encapsulate some messenger cleanup code
  libceph: make ceph_msgr_wq private
  libceph: encapsulate connection kvec operations
  libceph: move prepare_write_banner()
  ...
2012-03-28 10:01:29 -07:00
Benjamin Herrenschmidt cb52d8970e powerpc+sparc/vio: Modernize driver registration
This makes vio_register_driver() get the module owner & name at compile
time like PCI drivers do, and adds a name pointer directly in struct
vio_driver to avoid having to explicitly initialize the embedded
struct device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David S. Miller <davem@davemloft.net>
2012-03-28 11:33:24 +11:00
Jens Axboe 6674fb79ca Merge branch 'stable/for-jens-3.4-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.4/drivers
Konrad writes:

I've two small fixes for the xen-blkback - and I think one more will show up
eventually (a partial revert), but not sure when. So in the spirit of keeping
the patches flowing, please git pull the following branch.
2012-03-26 09:13:14 +02:00
Linus Torvalds e22057c859 One tiny feature that accidentally got lost in the initial git pull:
* Add fast-EOI acking of interrupts (clear a bit instead of hypercall)
 And bug-fixes:
  * Fix CPU bring-up code missing a call to notify other subsystems.
  * Fix reading /sys/hypervisor even if PVonHVM drivers are not loaded.
  * In Xen ACPI processor driver: remove too verbose WARN messages, fix up
    the Kconfig dependency to be a module by default, and add dependency on
    CPU_FREQ.
  * Disable CPU frequency drivers from loading when booting under Xen
    (as we want the Xen ACPI processor to be used instead).
  * Cleanups in tmem code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPbc3DAAoJEFjIrFwIi8fJTQkIAMnH2fPhcHAb4mNaz+3gdmsZ
 Flo6V1gMBcO8xKZlUkFgKKPYoOm7lLmvoceXLVSH5oOKSnSJo1zSinzKmcdJQo/D
 kPo4/EguNwtzcAcQh2dmT6/IM9O3ihMKUli7Oajif9PLCFFFqTaG3Y3YNBo/rxTY
 D3HAnJrIfmIyG0NpLnaFCWhCzUvcB4M7ysutECqcF8l5gnbHxRVeCKD0blM+n9GH
 Wyum00dQCwo6h6wTduhPOAxHAM4rncyR3heOB2vDxq9YJHSUhhcva5QCgQ+tdUVt
 6U2TQT1L2Px8iXXzr2w9YBpepOVajZReoKhajLjJ5VbkpBZFz5dVNfJ8LpF8RV8=
 =z8IB
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull more xen updates from Konrad Rzeszutek Wilk:
 "One tiny feature that accidentally got lost in the initial git pull:
   * Add fast-EOI acking of interrupts (clear a bit instead of
     hypercall)
  And bug-fixes:
   * Fix CPU bring-up code missing a call to notify other subsystems.
   * Fix reading /sys/hypervisor even if PVonHVM drivers are not loaded.
   * In Xen ACPI processor driver: remove too verbose WARN messages, fix
     up the Kconfig dependency to be a module by default, and add
     dependency on CPU_FREQ.
   * Disable CPU frequency drivers from loading when booting under Xen
     (as we want the Xen ACPI processor to be used instead).
   * Cleanups in tmem code."

* tag 'stable/for-linus-3.4-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/acpi: Fix Kconfig dependency on CPU_FREQ
  xen: initialize platform-pci even if xen_emul_unplug=never
  xen/smp: Fix bringup bug in AP code.
  xen/acpi: Remove the WARN's as they just create noise.
  xen/tmem: cleanup
  xen: support pirq_eoi_map
  xen/acpi-processor: Do not depend on CPU frequency scaling drivers.
  xen/cpufreq: Disable the cpu frequency scaling drivers from loading.
  provide disable_cpufreq() function to disable the API.
2012-03-24 12:20:25 -07:00
Konrad Rzeszutek Wilk 3389bb8bf7 xen/blkback: Make optional features be really optional.
They were using the xenbus_dev_fatal() function which would
change the state of the connection immediately. Which is not
what we want when we advertise optional features.

So make 'feature-discard','feature-barrier','feature-flush-cache'
optional.

Suggested-by: Jan Beulich <JBeulich@suse.com>
[v1: Made the discard function void and static]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-24 10:04:36 -04:00
Konrad Rzeszutek Wilk 4dae76705f xen/blkback: Squash the discard support for 'file' and 'phy' type.
The only reason for the distinction was for the special case of
'file' (which is assumed to be loopback device), was to reach inside
the loopback device, find the underlaying file, and call fallocate on it.
Fortunately "xen-blkback: convert hole punching to discard request on
loop devices" removes that use-case and we now based the discard
support based on blk_queue_discard(q) and extract all appropriate
parameters from the 'struct request_queue'.

CC: Li Dongyang <lidongyang@novell.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
[v1: Dropping pointless initializer and keeping blank line]
[v2: Remove the kfree as it is not used anymore]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-24 10:04:35 -04:00
Oleg Nesterov 70834d3070 usermodehelper: use UMH_WAIT_PROC consistently
A few call_usermodehelper() callers use the hardcoded constant instead of
the proper UMH_WAIT_PROC, fix them.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:41 -07:00
Asai Thambi S P 22be2e6e13 mtip32xx: fix incorrect value set for drv_cleanup_done, and re-initialize and start port in mtip_restart_port()
This patch includes two changes:
	* fix incorrect value set for drv_cleanup_done
	* re-initialize and start port in mtip_restart_port()

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23 12:33:03 +01:00
Stephen M. Cameron bc67f63650 cciss: Fix scsi tape io with more than 255 scatter gather elements
The total number of scatter gather elements in the CISS command
used by the scsi tape code was being cast to a u8, which can hold
at most 255 scatter gather elements.  It should have been cast to
a u16.  Without this patch the command gets rejected by the controller
since the total scatter gather count did not add up to the right
value resulting in an i/o error.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-22 21:40:09 +01:00
Stephen M. Cameron 395d287526 cciss: Initialize scsi host max_sectors for tape drive support
The default is too small (1024 blocks), use h->cciss_max_sectors (8192 blocks)
Without this change, if you try to set the block size of a tape drive above
512*1024, via "mt -f /dev/st0 setblk nnn" where nnn is greater than 524288,
it won't work right.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-22 21:40:08 +01:00
Josh Durgin c666601a93 rbd: move snap_rwsem to the device, rename to header_rwsem
A new temporary header is allocated each time the header changes, but
only the changed properties are copied over. We don't need a new
semaphore for each header update.

This addresses http://tracker.newdream.net/issues/2174

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:52 -05:00
Alex Elder 32eec68d2f rbd: don't drop the rbd_id too early
Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).

It's possible that an rbd is still in use after the rbd_remove()
call has been made.  It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation.  This means that the id shows up as free for reuse at a
time it should not be.

The effect of this was seen by Jens Rehpoehler, who:
    - had a filesystem mounted on an rbd device
    - unmapped that filesystem (without unmounting)
    - found that the mount still worked properly
    - but hit a panic when he attempted to re-map a new rbd device

This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.

Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.

Note: This fixes: http://tracker.newdream.net/issues/1907

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder 593a9e7b34 rbd: small changes
Here is another set of small code tidy-ups:
    - Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
      names throughout.  Tell the blk_queue system our physical
      block size, in the (unlikely) event we want to use something
      other than the default.
    - Delete the definition of struct rbd_info, which is never used.
    - Move the definition of dev_to_rbd() down in its source file,
      just above where it gets first used, and change its name to
      dev_to_rbd_dev().
    - Replace an open-coded operation in rbd_dev_release() to use
      dev_to_rbd_dev() instead.
    - Calculate the segment size for a given rbd_device just once in
      rbd_init_disk().
    - Use the '%zd' conversion specifier in rbd_snap_size_show(),
      since the value formatted is a size_t.
    - Switch to the '%llu' conversion specifier in rbd_snap_id_show().
      since the value formatted is unsigned.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder 00f1f36ffa rbd: do some refactoring
A few blocks of code are rearranged a bit here:
    - In rbd_header_from_disk():
	- Don't bother computing snap_count until we're sure the
	  on-disk header starts with a good signature.
	- Move a few independent lines of code so they are *after* a
	  check for a failed memory allocation.
	- Get rid of unnecessary local variable "ret".
    - Make a few other changes in rbd_read_header(), similar to the
      above--just moving things around a bit while preserving the
      functionality.
    - In rbd_rq_fn(), just assign rq in the while loop's controlling
      expression rather than duplicating it before and at the end of
      the loop body.  This allows the use of "continue" rather than
      "goto next" in a number of spots.
    - Rearrange the logic in snap_by_name().  End result is the same.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder fed4c143ba rbd: fix module sysfs setup/teardown code
Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete.  This could (in principle anyway) result
in an invalid state.

Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.

In addition, unregister the device in the event the bus registration
fails at module init time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder 7ef3214af2 rbd: don't allocate mon_addrs buffer in rbd_add()
The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add.  That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed.  Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.

Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add.   Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.

A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned.  This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder 5214ecc45c rbd: have rbd_parse_args() report found mon_addrs size
The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command."  Pass it to the
caller so it can use it to provide the length to rbd_get_client().

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder 81a8979378 rbd: do a few checks at build time
This is a bit gratuitous, but there are a few things that can be
verified at build time rather than run time, so do that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder e28fff268e rbd: don't use sscanf() in rbd_add_parse_args()
Make use of a few simple helper routines to parse the arguments
rather than sscanf().  This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case).  In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.

It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation.  It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).

The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder a725f65e52 rbd: encapsulate argument parsing for rbd_add()
Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.

Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs".   The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number.  I think "mon_addrs" captures that notion a little better.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:48 -05:00
Alex Elder 27cc25943f rbd: simplify error handling in rbd_add()
If a couple pointers are initialized to NULL then a single
"out_nomem" label can be used for all of the memory allocation
failure cases in rbd_add().

Also, get rid of the "irc" local variable there.  There is no
real need for "rc" to be type ssize_t, and it can be used in
the spot "irc" was.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder 60571c7d55 rbd: reduce memory used for rbd_dev fields
The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add().  The same holds true for the ceph + rbd options
string.  So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder d720bcb0a8 rbd: have rbd_get_client() return a rbd_client
Since rbd_get_client() currently returns an error code.  It assigns
the rbd_client field of the rbd_device structure it is passed if
successful.  Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder f0f8cef5a3 rbd: a few simple changes
Here are a few very simple cleanups:
    - Add a "RBD_" prefix to the two driver name string definitions.
    - Move the definition of struct rbd_request below struct rbd_req_coll
      to avoid the need for an empty declaration of the latter.
    - Move and group the definitions of rbd_root_dev_release() and
      rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
      close to the top of the file.  Arrange the latter so
      rbd_bus_type.bus_attrs can be initialized statically.
    - Get rid of an unnecessary local variable in rbd_open().
    - Rework some hokey logic in rbd_bus_add_dev(), so the value of
      "ret" at the end is either 0 or -ENOENT to avoid the need for
      the code duplication that was there.
    - Rename a goto target in rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder 432b858749 rbd: rename "node_lock"
The spinlock used to protect rbd_client_list is named "node_lock".
Rename it to "rbd_client_list_lock" to make it more obvious what
it's for.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder bc534d86be rbd: move ctl_mutex lock inside rbd_client_create()
Since rbd_client_create() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder d97081b0c7 rbd: move ctl_mutex lock inside rbd_get_client()
Since rbd_get_client() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Furthermore, within rbd_get_client(), it appears the mutex only
needs to be held while calling rbd_client_create().  (Moving
the lock inside that function will wait for the next patch.)

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder e6994d3dde rbd: release client list lock sooner
In rbd_get_client(), if a client is reused, a number of things
get done while still holding the list lock unnecessarily.

This just moves a few things that need no lock protection outside
the lock.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder d184f6bfde rbd: restore previous rbd id sequence behavior
It used to be that selecting a new unique identifier for an added
rbd device required searching all existing ones to find the highest
id is used.  A recent change made that unnecessary, but made it
so that id's used were monotonically non-decreasing.  It's a bit
more pleasant to have smaller rbd id's though, and this change
makes ids get allocated as they were before--each new id is one more
than the maximum currently in use.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 499afd5b8e rbd: tie rbd_dev_list changes to rbd_id operations
The only time entries are added to or removed from the global
rbd_dev_list is exactly when a "put" or "get" operation is being
performed on a rbd_dev's id.  So just move the list management code
into get/put routines.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder e124a82f3c rbd: protect the rbd_dev_list with a spinlock
The rbd_dev_list is just a simple list of all the current
rbd_devices.  Using the ctl_mutex as a concurrency guard is
overkill.  Instead, use a spinlock for that specific purpose.

This also reduces the window that the ctl_mutex needs to be held in
rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 1ddbe94eda rbd: rework calculation of new rbd id's
In order to select a new unique identifier for an added rbd device,
the list of all existing ones is searched and a value one greater
than the highest id is used.

The list search can be avoided by using an atomic variable that
keeps track of the current highest id.  Using a get/put model for
id's we can limit the boundless growth of id numbers a bit by
arranging to reuse the current highest id once it gets released.
Add these calls to "put" the id when an rbd is getting removed.

Note that this changes the pattern of device id's used--new values
will never be below the highest one seen so far (even if there
exists an unused lower one).  I assert this is OK because the key
property of an rbd id is its uniqueness, not its magnitude.

Regardless, a follow-on patch will restore the old way of doing
things, I just think this commit just makes the incremental change
to atomics a little easier to understand.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder b7f23c361b rbd: encapsulate new rbd id selection
Move the loop that finds a new unique rbd id to use into
its own helper function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Josh Durgin cc9d734c3d rbd: use a single value of snap_name to mean no snap
There's already a constant for this anyway.

Since rbd_header_set_snap() is only used to set the rbd device
snap_name field, just do that within that function rather than
having it take the snap_name as an argument.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

v2: Changed interface rbd_header_set_snap() so it explicitly updates
    the snap_name in the rbd_device.  Also added a BUILD_BUG_ON()
    to verify the size of the snap_name field is sufficient for
    SNAP_HEAD_NAME.
2012-03-22 10:47:47 -05:00
Alex Elder 1dbb439913 rbd: do not duplicate ceph_client pointer in rbd_device
The rbd_device structure maintains a duplicate copy of the
ceph_client pointer maintained in its rbd_client structure.  There
appears to be no good reason for this, and its presence presents a
risk of them getting out of synch or otherwise misused.  So kill it
off, and use the rbd_client copy only.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 2107978668 rbd: a few small cleanups
Some minor cleanups in "drivers/block/rbd.c:
    - Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
      in the definition of RBD_MAX_MD_NAME_LEN.
    - Use DEFINE_SPINLOCK() to define and initialize node_lock.
    - Drop a needless (char *) cast in parse_rbd_opts_token().
    - Make a few minor formatting changes.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Igor Mammedov b9136d207f xen: initialize platform-pci even if xen_emul_unplug=never
When xen_emul_unplug=never is specified on kernel command line
reading files from /sys/hypervisor is broken (returns -EBUSY).
It is caused by xen_bus dependency on platform-pci and
platform-pci isn't initialized when xen_emul_unplug=never is
specified.

Fix it by allowing platform-pci to ignore xen_emul_unplug=never,
and do not intialize xen_[blk|net]front instead.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-22 11:37:11 -04:00
Linus Torvalds 5375871d43 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc merge from Benjamin Herrenschmidt:
 "Here's the powerpc batch for this merge window.  It is going to be a
  bit more nasty than usual as in touching things outside of
  arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
  rid of the bugger (legacy iSeries support) which was a PITA to
  maintain and that nobody really used anymore.

  Here are some of the highlights:

   - Legacy iSeries is gone.  Thanks Stephen ! There's still some bits
     and pieces remaining if you do a grep -ir series arch/powerpc but
     they are harmless and will be removed in the next few weeks
     hopefully.

   - The 'fadump' functionality (Firmware Assisted Dump) replaces the
     previous (equivalent) "pHyp assisted dump"...  it's a rewrite of a
     mechanism to get the hypervisor to do crash dumps on pSeries, the
     new implementation hopefully being much more reliable.  Thanks
     Mahesh Salgaonkar.

   - The "EEH" code (pSeries PCI error handling & recovery) got a big
     spring cleaning, motivated by the need to be able to implement a
     new backend for it on top of some new different type of firwmare.

     The work isn't complete yet, but a good chunk of the cleanups is
     there.  Note that this adds a field to struct device_node which is
     not very nice and which Grant objects to.  I will have a patch soon
     that moves that to a powerpc private data structure (hopefully
     before rc1) and we'll improve things further later on (hopefully
     getting rid of the need for that pointer completely).  Thanks Gavin
     Shan.

   - I dug into our exception & interrupt handling code to improve the
     way we do lazy interrupt handling (and make it work properly with
     "edge" triggered interrupt sources), and while at it found & fixed
     a wagon of issues in those areas, including adding support for page
     fault retry & fatal signals on page faults.

   - Your usual random batch of small fixes & updates, including a bunch
     of new embedded boards, both Freescale and APM based ones, etc..."

I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
  powerpc/ps3: Do not adjust the wrapper load address
  powerpc: Remove the rest of the legacy iSeries include files
  powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
  init: Remove CONFIG_PPC_ISERIES
  powerpc: Remove FW_FEATURE ISERIES from arch code
  tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
  powerpc/spufs: Fix double unlocks
  powerpc/5200: convert mpc5200 to use of_platform_populate()
  powerpc/mpc5200: add options to mpc5200_defconfig
  powerpc/mpc52xx: add a4m072 board support
  powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
  Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
  powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
  powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
  MAINTAINERS: Update PowerPC 4xx tree
  powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
  powerpc: document the FSL MPIC message register binding
  powerpc: add support for MPIC message register API
  powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
  powerpc/85xx: mpc8548cds - add 36-bit dts
  ...
2012-03-21 18:55:10 -07:00
Linus Torvalds 9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Linus Torvalds 69a7aebcf0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "It's indeed trivial -- mostly documentation updates and a bunch of
  typo fixes from Masanari.

  There are also several linux/version.h include removals from Jesper."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
  kcore: fix spelling in read_kcore() comment
  constify struct pci_dev * in obvious cases
  Revert "char: Fix typo in viotape.c"
  init: fix wording error in mm_init comment
  usb: gadget: Kconfig: fix typo for 'different'
  Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
  writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
  writeback: fix typo in the writeback_control comment
  Documentation: Fix multiple typo in Documentation
  tpm_tis: fix tis_lock with respect to RCU
  Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
  Doc: Update numastat.txt
  qla4xxx: Add missing spaces to error messages
  compiler.h: Fix typo
  security: struct security_operations kerneldoc fix
  Documentation: broken URL in libata.tmpl
  Documentation: broken URL in filesystems.tmpl
  mtd: simplify return logic in do_map_probe()
  mm: fix comment typo of truncate_inode_pages_range
  power: bq27x00: Fix typos in comment
  ...
2012-03-20 21:12:50 -07:00
Linus Torvalds ed378a52da USB merge for 3.4-rc1
Here's the big USB merge for the 3.4-rc1 merge window.
 
 Lots of gadget driver reworks here, driver updates, xhci changes, some
 new drivers added, usb-serial core reworking to fix some bugs, and other
 various minor things.
 
 There are some patches touching arch code, but they have all been acked
 by the various arch maintainers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk9njL8ACgkQMUfUDdst+ylQ9wCfbBOnIT01lGOorkaE9pom0hhk
 HfMAoKq1xzCR2B+OS3UMyUQffk+Ri9Ri
 =KIQ2
 -----END PGP SIGNATURE-----

Merge tag 'usb-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB merge for 3.4-rc1 from Greg KH:
 "Here's the big USB merge for the 3.4-rc1 merge window.

  Lots of gadget driver reworks here, driver updates, xhci changes, some
  new drivers added, usb-serial core reworking to fix some bugs, and
  other various minor things.

  There are some patches touching arch code, but they have all been
  acked by the various arch maintainers."

* tag 'usb-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (302 commits)
  net: qmi_wwan: add support for ZTE MF820D
  USB: option: add ZTE MF820D
  usb: gadget: f_fs: Remove lock is held before freeing checks
  USB: option: make interface blacklist work again
  usb/ub: deprecate & schedule for removal the "Low Performance USB Block" driver
  USB: ohci-pxa27x: add clk_prepare/clk_unprepare calls
  USB: use generic platform driver on ath79
  USB: EHCI: Add a generic platform device driver
  USB: OHCI: Add a generic platform device driver
  USB: ftdi_sio: new PID: LUMEL PD12
  USB: ftdi_sio: add support for FT-X series devices
  USB: serial: mos7840: Fixed MCS7820 device attach problem
  usb: Don't make USB_ARCH_HAS_{XHCI,OHCI,EHCI} depend on USB_SUPPORT.
  usb gadget: fix a section mismatch when compiling g_ffs with CONFIG_USB_FUNCTIONFS_ETH
  USB: ohci-nxp: Remove i2c_write(), use smbus
  USB: ohci-nxp: Support for LPC32xx
  USB: ohci-nxp: Rename symbols from pnx4008 to nxp
  USB: OHCI-HCD: Rename ohci-pnx4008 to ohci-nxp
  usb: gadget: Kconfig: fix typo for 'different'
  usb: dwc3: pci: fix another failure path in dwc3_pci_probe()
  ...
2012-03-20 11:26:30 -07:00
Cong Wang 589973a704 drbd: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:29 +08:00
Cong Wang cfd8005c99 block: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:16 +08:00
Steven Noonan 3467811e26 xen-blkfront: make blkif_io_lock spinlock per-device
This patch moves the global blkif_io_lock to the per-device structure. The
spinlock seems to exists for two reasons: to disable IRQs when in the interrupt
handlers for blkfront, and to protect the blkfront VBDs when a detachment is
requested.

Having a global blkif_io_lock doesn't make sense given the use case, and it
drastically hinders performance due to contention. All VBDs with pending IOs
have to take the lock in order to get work done, which serializes everything
pretty badly.

Signed-off-by: Steven Noonan <snoonan@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Andrew Jones dad5cf659b xen/blkfront: don't put bdev right after getting it
We should hang onto bdev until we're done with it.

Signed-off-by: Andrew Jones <drjones@redhat.com>
[v1: Fixed up git commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Akinobu Mita 34ae2e47d9 xen-blkfront: use bitmap_set() and bitmap_clear()
Use bitmap_set and bitmap_clear rather than modifying individual bits
in a memory region.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Daniel De Graaf b2167ba6dd xen/blkback: Enable blkback on HVM guests
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Daniel De Graaf 4f14faaab4 xen/blkback: use grant-table.c hypercall wrappers
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Sebastian Andrzej Siewior 7396bd9fa1 usb/ub: deprecate & schedule for removal the "Low Performance USB Block" driver
Deprecate this driver. All devices which can be handled by this driver
can also be handled by the usb-storage driver.

Acked-By: Pete Zaitcev <zaitcev@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-16 13:30:10 -07:00
Stephen Rothwell ba7a4822b4 powerpc: Remove some of the legacy iSeries specific device drivers
These drivers are specific to the PowerPC legacy iSeries platform and
their Kconfig is specified in arch/powerpc.  Legacy iSeries is being
removed, so these drivers can no longer be selected.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-16 09:28:05 +11:00
Linus Torvalds f1cbd03f5e Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Been sitting on this for a while, but lets get this out the door.
  This fixes various important bugs for 3.3 final, along with a few more
  trivial ones.  Please pull!"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix ioc leak in put_io_context
  block, sx8: fix pointer math issue getting fw version
  Block: use a freezable workqueue for disk-event polling
  drivers/block/DAC960: fix -Wuninitialized warning
  drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
  block: fix __blkdev_get and add_disk race condition
  block: Fix setting bio flags in drivers (sd_dif/floppy)
  block: Fix NULL pointer dereference in sd_revalidate_disk
  block: exit_io_context() should call elevator_exit_icq_fn()
  block: simplify ioc_release_fn()
  block: replace icq->changed with icq->flags
2012-03-14 17:16:45 -07:00
Greg Kroah-Hartman f7a0d426f3 Merge 3.3-rc7 into usb-next
This resolves the conflict with drivers/usb/host/ehci-fsl.h that
happened with changes in Linus's and this branch at the same time.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-12 09:13:31 -07:00
Muthu Kumar 9354f1b8e6 floppy/scsi: fix setting of BIO flags
Fix setting bio flags in drivers (sd_dif/floppy).

Signed-off-by: Muthukumar R <muthur@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:43 -08:00