Commit Graph

4663 Commits (24d4e7f642882a8a13da170b4ba86eec8fa91bf2)

Author SHA1 Message Date
Linus Torvalds 848b81415c Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc patches from Andrew Morton:
 "Incoming:

   - lots of misc stuff

   - backlight tree updates

   - lib/ updates

   - Oleg's percpu-rwsem changes

   - checkpatch

   - rtc

   - aoe

   - more checkpoint/restart support

  I still have a pile of MM stuff pending - Pekka should be merging
  later today after which that is good to go.  A number of other things
  are twiddling thumbs awaiting maintainer merges."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (180 commits)
  scatterlist: don't BUG when we can trivially return a proper error.
  docs: update documentation about /proc/<pid>/fdinfo/<fd> fanotify output
  fs, fanotify: add @mflags field to fanotify output
  docs: add documentation about /proc/<pid>/fdinfo/<fd> output
  fs, notify: add procfs fdinfo helper
  fs, exportfs: add exportfs_encode_inode_fh() helper
  fs, exportfs: escape nil dereference if no s_export_op present
  fs, epoll: add procfs fdinfo helper
  fs, eventfd: add procfs fdinfo helper
  procfs: add ability to plug in auxiliary fdinfo providers
  tools/testing/selftests/kcmp/kcmp_test.c: print reason for failure in kcmp_test
  breakpoint selftests: print failure status instead of cause make error
  kcmp selftests: print fail status instead of cause make error
  kcmp selftests: make run_tests fix
  mem-hotplug selftests: print failure status instead of cause make error
  cpu-hotplug selftests: print failure status instead of cause make error
  mqueue selftests: print failure status instead of cause make error
  vm selftests: print failure status instead of cause make error
  ubifs: use prandom_bytes
  mtd: nandsim: use prandom_bytes
  ...
2012-12-17 20:58:12 -08:00
Roger Pau Monne d62f691858 xen-blkfront: handle bvecs with partial data
Currently blkfront fails to handle cases in blkif_completion like the
following:

1st loop in rq_for_each_segment
 * bv_offset: 3584
 * bv_len: 512
 * offset += bv_len
 * i: 0

2nd loop:
 * bv_offset: 0
 * bv_len: 512
 * i: 0

In the second loop i should be 1, since we assume we only wanted to
read a part of the previous page. This patches fixes this cases where
only a part of the shared page is read, and blkif_completion assumes
that if the bv_offset of a bvec is less than the previous bv_offset
plus the bv_size we have to switch to the next shared page.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: linux-kernel@vger.kernel.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17 21:56:03 -05:00
Roger Pau Monne ebb351cf78 llist/xen-blkfront: implement safe version of llist_for_each_entry
Implement a safe version of llist_for_each_entry, and use it in
blkif_free. Previously grants where freed while iterating the list,
which lead to dereferences when trying to fetch the next item.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by:  Andrew Morton <akpm@linux-foundation.org>
[v2: Move the llist_for_each_entry_safe in llist.h]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-17 21:55:56 -05:00
Dan Carpenter 31279b1457 aoe: fix use after free in aoedev_by_aoeaddr()
We should return NULL on failure instead of returning a freed pointer.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin 2b37c7d865 aoe: update internal version number to 81
This version number is printed to the console on module initialization
and is available in sysfs, which is where the userland aoe-version tool
looks for it.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin bf29754ae8 aoe: identify source of runt AoE packets
This change only affects experimental AoE storage networks.

It modifies the console message about runt packets detected so that the
AoE major and minor addresses of the AoE target that generated the runt
are mentioned.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin 4a6c9ee93c aoe: allow comma separator in aoe_iflist value
By default, the aoe driver uses any ethernet interface for AoE, but the
aoe_iflist module parameter provides a convenient way to limit AoE
traffic to a specific list of local network interfaces.

This change allows a list to be specified using the comma character as a
separator.  For example,

  modprobe aoe aoe_iflist=eth2,eth3

Before, it was inconvenient to get the quoting right in shell scripts
when setting aoe_iflist to have more than one network interface.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:26 -08:00
Ed Cashin c450ba0fc1 aoe: allow user to disable target failure timeout
With this change, the aoe driver treats the value zero as special for
the aoe_deadsecs module parameter.  Normally, this value specifies the
number of seconds during which the driver will continue to attempt
retransmits to an unresponsive AoE target.  After aoe_deadsecs has
elapsed, the aoe driver marks the aoe device as "down" and fails all
I/O.

The new meaning of an aoe_deadsecs of zero is for the driver to
retransmit commands indefinitely.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 71114ec45f aoe: use dynamic number of remote ports for AoE storage target
Many AoE targets have four or fewer network ports, but some existing
storage devices have many, and the AoE protocol sets no limit.

This patch allows the use of more than eight remote MAC addresses per AoE
target, while reducing the amount of memory used by the aoe driver in
cases where there are many AoE targets with fewer than eight MAC addresses
each.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin e52a293264 aoe: avoid races between device destruction and discovery
This change avoids a race that could result in a NULL pointer derference
following a WARNing from kobject_add_internal, "don't try to register
things with the same name in the same directory."

The problem was found with a test that forgets and discovers an
aoe device in a loop:

  while test ! -r /tmp/stop; do
	aoe-flush -a
	aoe-discover
  done

The race was between aoedev_flush taking aoedevs out of the devlist,
allowing a new discovery of the same AoE target to take place before the
driver gets around to calling sysfs_remove_group.  Fixing that one
revealed another race between do_open and add_disk, and this patch avoids
that, too.

The fix required some care, because for flushing (forgetting) an aoedev,
some of the steps must be performed under lock and some must be able to
sleep.  Also, for discovering a new aoedev, some steps might sleep.

The check for a bad aoedev pointer remains from a time when about half of
this patch was done, and it was possible for the
bdev->bd_disk->private_data to become corrupted.  The check should be
removed eventually, but it is not expected to add significant overhead,
occurring in the aoeblk_open routine.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin bbb44e30d0 aoe: improve handling of misbehaving network paths
An AoE target can have multiple network ports used for AoE, and in the
aoe driver, those are tracked by the aoetgt struct.  These changes allow
the aoe driver to handle network paths, or aoetgts, that are not working
well, compared to the others.

Paths that do not get responses despite the retransmission of AoE
commands are marked as "tainted", and non-tainted paths are preferred.

Meanwhile, the aoe driver attempts to "probe" the tainted path in the
background by issuing reads of LBA 0 that are padded out to full
(possibly jumbo-frame) size.  If the probes get responses, then the path
is "redeemed", and its taint is removed.

This mechanism has been shown to be helpful in transparently handling
and recovering from real-world network "brown outs" in ways that the
earlier "shoot the help-needing target in the head" mechanism could not.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin b91316f2b7 aoe: return real minor number for static minors
The value returned by the static minor device number number allocator is
the real minor number, so it must be multiplied by the supported number
of partitions per aoedev.

Without this fix the support for systems without udev is incomplete, and
the few users of aoe on such systems will have surprising results when
device nodes names do not match the AoE target.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 10935d052e aoe: initialize sysminor to avoid compiler warning
Because the minor_get and related functions use the return values for
errors, the compiler doesn't know that sysminor will always either 1) be
initialized in aoedev_by_aoeaddr by the call to minor_get, or 2) be
unused as the "goto out" is executed.

This patch avoids the compiler warning.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin e0b2bbab0b aoe: make error messages more specific in static minor allocation
For some special-purpose systems where udev isn't present, static
allocation of minor numbers is desirable.  This update distinguishes
different failure scenarios, to help the user understand what went
wrong.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 60116cf773 aoe: remove call to request handler from I/O completion
There is no need to call the request handler function in the I/O
completion routine.  The user impact of not doing it is a more "nice" aoe
driver that is less susceptible to causing soft lockups.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 72837600ee aoe: cleanup: correct comment for aoetgt nout
A misplaced comment was attached to the nout member of the aoetgt.  This
change corrects the comment.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 7b6ccc5f97 aoe: increase default cap on outstanding AoE commands in the network
The aoe driver will never be waiting for more than aoe_maxout AoE
commands from a given remote network port on an AoE target.  Increasing
the cap increases performance.  Users can tighten the setting to reduce
the amount of memory used for handling AoE traffic or the network
bandwidth used for AoE.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin 0a41409c51 aoe: remove vestigial request queue allocation
Before the aoe driver was an I/O request handler, it was a
make_request-style block driver.  Even so, there was a problem where
sysfs expected a request queue to exist, so one was provided in commit
7135a71b19 ("aoe: allocate unused request_queue for sysfs").

During the transition to the request-handler style, a patch was merged
that was based on a driver without the noop queue, and the noop queue
remained in place after the patch was merged, even though a new
functional queue was introduced by the patch, allocated through
blk_init_queue.

The user impact is a memory leak proportional to the number of AoE
targets discovered.  This patch removes the memory leak and cleans up
vestiges of the old do-nothing queue from the aoeblk_gdalloc function.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:25 -08:00
Ed Cashin fe7252bf51 aoe: copy fallback timing information on destination failover
Commit f3b8e07af774 ("aoe: commands in retransmit queue use new
destination on failure") omits the copying of the coarse-grained time
when an AoE command was sent during the failover from one destination
MAC address on the AoE target to another.

The coarse-grained timing is only used when the system time changes or
an unlikely length of time has passed since the sending of the AoE
command.  Users will not be impacted unless their system clock is very
inaccurate or something unusual (e.g., 10 GbE link reset) happens during
the period when the aoe driver is handling the failure of a port on the
AoE target.  Being effected will mean that an AoE target could be
considered "down" too eagerly.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 519b77b032 aoe: update driver-internal version to 64+
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 3fc9b03248 aoe: commands in retransmit queue use new destination on failure
When one remote MAC address isn't working as a destination for AoE
commands, the frames used to track information associated with the AoE
commands are moved to a new aoetgt (defined by the tuple of {AoE major,
AoE minor, target MAC address}).

This patch makes sure that the frames on the queue for retransmits that
need to be done are updated to use the new destination, so that
retransmits will be sent through a working network path.

Without this change, packets on the retransmit queue will be needlessly
retransmitted to the unresponsive destination MAC, possibly causing
premature target failure before there's time for the retransmit timer to
run again, decide to retransmit again, and finally update the destination
to a working MAC address on the AoE target.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 5f0c9c48e7 aoe: use high-resolution RTTs with fallback to low-res
These changes improve the accuracy of the decision about whether it's time
to retransmit an AoE command by using the microsecond-resolution
gettimeofday instead of jiffies.

Because the system time can jump suddenly, the decision reverts to using
jiffies if the high-resolution time difference is relatively large.
Otherwise the AoE targets could be considered failed inappropriately.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 0d555ecfa4 aoe: manipulate aoedev network stats under lock
With this bugfix in place the calculation of the criterion for "lateness"
is performed under lock.  Without the lock, there is a chance that one of
the non-atomic operations performed on the round trip time statistics
could be incomplete, such that an incorrect lateness criterion would be
calculated.

Without this change, the effect of the bug would be rare unecessary but
benign retransmissions.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 2292a7e109 aoe: err device: include MAC addresses for unexpected responses
The /dev/etherd/err character device provides low-level information about
normal but sometimes interesting AoE command retransmits and "unexpected
responses", i.e., responses for packets that have already been
retransmitted.

This change adds MAC addresses to the messages about unexpected responses,
so that when they occur, it's more easy to determine the network paths to
which they belong.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 3a0c40d2d2 aoe: improve network congestion handling
The aoe driver already had some congestion handling, but it was limited in
its ability to cope with the kind of congestion that can arise on more
complex networks such as those involving paths through multiple ethernet
switches.

Some of the lessons from TCP's history of development can be applied to
improving the congestion control and avoidance on AoE storage networks.
These changes use familar concepts from Van Jacobson's "Congestion
Avoidance and Control" paper from '88, without adding significant
overhead.

This patch depends on an upcoming patch that covers the failover case when
AoE commands being retransmitted are transferred from one retransmit queue
to another.  Another upcoming patch increases the timing accuracy.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 667be1e757 aoe: provide ATA identify device content to user on request
Make the aoe driver follow expected behavior when the user uses ioctl to
get the ATA device identify information, allowing access to model, serial
number, etc.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin cd220bf51f aoe: update driver-internal version number to 60
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin a04b41cd2c aoe: whitespace cleanup
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin d437962504 aoe: cleanup: remove unused ata_scnt function
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:24 -08:00
Ed Cashin 90a2508d01 aoe: "payload" sysfs file exports per-AoE-command data transfer size
The userland aoetools package includes an "aoe-stat" command that can
display a "payload size" column when the aoe driver exports this
information.  Users can quickly see what amount of user data is
transferred inside each AoE command on the network, network headers
excluded.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin aa304fdefa aoe: support larger I/O requests via aoe_maxsectors module param
The GPFS filesystem is an example of an aoe user that requires the aoe
driver to support I/O request sizes larger than the default.  Most users
will not need large I/O request sizes, because they would need to be split
up into multiple AoE commands anyway.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 4ba9aa7f98 aoe: support the forgetting (flushing) of a user-specified AoE target
Users sometimes want to cause the aoe driver to forget a particular
previously discovered device when it is no longer online.  The aoetools
provide an "aoe-flush" command that users run to perform this
administrative task.  The changes below provide the support needed in the
driver.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 1b8a1636ce aoe: update cap on outstanding commands based on config query response
The ATA over Ethernet config query response contains a "buffer count"
field reflecting the AoE target's capacity to buffer incoming AoE
commands.

By taking the current value of this field into accound, we increase
performance throughput or avoid network congestion, when the value
has increased or decreased, respectively.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 4e78dd144b aoe: print warning regarding a common reason for dropped transmits
Dropped transmits are not common, but when they do occur, increasing
the transmit queue length often helps.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Ed Cashin 662a889608 aoe: describe the behavior of the "err" character device
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:23 -08:00
Linus Torvalds 9228ff9038 Merge branch 'for-3.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver update from Jens Axboe:
 "Now that the core bits are in, here are the driver bits for 3.8.  The
  branch contains:

   - A huge pile of drbd bits that were dumped from the 3.7 merge
     window.  Following that, it was both made perfectly clear that
     there is going to be no more over-the-wall pulls and how the
     situation on individual pulls can be improved.

   - A few cleanups from Akinobu Mita for drbd and cciss.

   - Queue improvement for loop from Lukas.  This grew into adding a
     generic interface for waiting/checking an even with a specific
     lock, allowing this to be pulled out of md and now loop and drbd is
     also using it.

   - A few fixes for xen back/front block driver from Roger Pau Monne.

   - Partition improvements from Stephen Warren, allowing partiion UUID
     to be used as an identifier."

* 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
  drbd: update Kconfig to match current dependencies
  drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
  drbd: close race between drbd_set_role and drbd_connect
  drbd: respect no-md-barriers setting also when changed online via disk-options
  drbd: Remove obsolete check
  drbd: fixup after wait_even_lock_irq() addition to generic code
  loop: Limit the number of requests in the bio list
  wait: add wait_event_lock_irq() interface
  xen-blkfront: free allocated page
  xen-blkback: move free persistent grants code
  block: partition: msdos: provide UUIDs for partitions
  init: reduce PARTUUID min length to 1 from 36
  block: store partition_meta_info.uuid as a string
  cciss: use check_signature()
  cciss: cleanup bitops usage
  drbd: use copy_highpage
  drbd: if the replication link breaks during handshake, keep retrying
  drbd: check return of kmalloc in receive_uuids
  drbd: Broadcast sync progress no more often than once per second
  drbd: don't try to clear bits once the disk has failed
  ...
2012-12-17 13:39:11 -08:00
Alex Elder b8f5c6edca rbd: don't use ENOTSUPP
ENOTSUPP is not a standard errno (it shows up as "Unknown error 524"
in an error message).  This is what was getting produced when the
the local rbd code does not implement features required by a
discovered rbd image.

Change the error code returned in this case to ENXIO.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 12:07:32 -06:00
Alex Elder 2fd82b9e92 rbd: get rid of RBD_MAX_SEG_NAME_LEN
RBD_MAX_SEG_NAME_LEN represents the maximum length of an rbd object
name (i.e., one of the objects providing storage backing an rbd
image).

Another symbol, MAX_OBJ_NAME_SIZE, is used in the osd client code to
define the maximum length of any object name in an osd request.

Right now they disagree, with RBD_MAX_SEG_NAME_LEN being too big.

There's no real benefit at this point to defining the rbd object
name length limit separate from any other object name, so just
get rid of RBD_MAX_SEG_NAME_LEN and use MAX_OBJ_NAME_SIZE in its
place.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 08:37:29 -06:00
Alex Elder 42382b709b rbd: do not allow remove of mounted-on image
There is no check in rbd_remove() to see if anybody holds open the
image being removed.  That's not cool.

Add a simple open count that goes up and down with opens and closes
(releases) of the device, and don't allow an rbd image to be removed
if the count is non-zero.

Protect the updates of the open count value with ctl_mutex to ensure
the underlying rbd device doesn't get removed while concurrently
being opened.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2012-12-17 08:36:59 -06:00
Roger Pau Monne 7dc341175a xen-blkback: implement safe iterator for the list of persistent grants
Change foreach_grant iterator to a safe version, that allows freeing
the element while iterating. Also move the free code in
free_persistent_gnts to prevent freeing the element before the rb_next
call.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad@kernel.org>
Cc: xen-devel@lists.xen.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-12-07 15:13:09 -05:00
Lars Ellenberg d2ec180c23 drbd: update Kconfig to match current dependencies
We no longer need the connector.
But we need libcrc32c.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:08:29 +01:00
Philipp Reisner ef86b77957 drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
This was introduces when moving the code over from the 8.3 codebase
with commit 328e0f125b

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:04:34 +01:00
Philipp Reisner 13c76aba78 drbd: close race between drbd_set_role and drbd_connect
drbd_set_role(, R_PRIMARY, ) does the state change to Primary,
some more housekeeping, and possibly generates a new UUID set.

All of this holding the "state_mutex".

The connection handshake involves sending of various state information,
including the current data generation UUID set, and two connection
state changes from C_WF_CONNECTION to C_WF_REPORT_PARAMS further to
a number of different outcomes, resync being one of them.

If the connection handshake happens between the state change to Primary
and the generation of the new UUIDs, the resync decision based on the
old UUID set may be confused, depending on circumstances.

Make sure that, before we do the handshake, any promotion to Primary
role will either be complete (including the housekeeping stuff), or can
see, and serialize with, the ongoing handshake, based on the
"STATE_SENT" bit, which is set when we start the handshake, and cleared
only when we leave C_WF_REPORT_PARAMS again.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:00:33 +01:00
Lars Ellenberg 691631c065 drbd: respect no-md-barriers setting also when changed online via disk-options
We need to propagate the configuration into the flag bits,
or it won't be effective.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 13:00:04 +01:00
Philipp Reisner 298307ed1d drbd: Remove obsolete check
Smatch complained about it this redundanct check.

The check was introduced in 2006-09-13. On 2007-07-24 the body of the
function was enclosed by get_ldev()/put_ldev() reference counting.
Since then the check is useless and miss leading.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-12-06 12:09:55 +01:00
Jens Axboe 84ad6845fb Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.8/drivers 2012-12-01 09:42:41 +01:00
Jens Axboe 2cecb73098 drbd: fixup after wait_even_lock_irq() addition to generic code
Compiling drbd yields:

drivers/block/drbd/drbd_state.c: In function ‘_conn_request_state’:
drivers/block/drbd/drbd_state.c:1804:5: error: macro "wait_event_lock_irq" passed 4 arguments, but takes just 3
drivers/block/drbd/drbd_state.c:1801:3: error: ‘wait_event_lock_irq’ undeclared (first use in this function)
drivers/block/drbd/drbd_state.c:1801:3: note: each undeclared identifier is reported only once for each function it appears in
drivers/block/drbd/drbd_state.c: At top level:
drivers/block/drbd/drbd_state.c:1734:1: warning: ‘_conn_rq_cond’ defined but not used [-Wunused-function]

Due to drbd having copied the MD definition for wait_event_lock_irq()
as well. Kill them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-30 21:20:15 +01:00
Lukas Czerner 7b5a35225b loop: Limit the number of requests in the bio list
Currently there is not limitation of number of requests in the loop bio
list. This can lead into some nasty situations when the caller spawns
tons of bio requests taking huge amount of memory. This is even more
obvious with discard where blkdev_issue_discard() will submit all bios
for the range and wait for them to finish afterwards. On really big loop
devices and slow backing file system this can lead to OOM situation as
reported by Dave Chinner.

With this patch we will wait in loop_make_request() if the number of
bios in the loop bio list would exceed 'nr_congestion_on'.
We'll wake up the process as we process the bios form the list. Some
threshold hysteresis is in place to avoid high frequency oscillation.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-30 11:48:05 +01:00
Roger Pau Monne 07c540a0b5 xen-blkfront: free allocated page
Free the page allocated for the persistent grant.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26 14:58:11 -05:00
Roger Pau Monne 4d4f270f18 xen-blkback: move free persistent grants code
Move the code that frees persistent grants from the red-black tree
to a function. This will make it easier for other consumers to move
this to a common place.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-26 14:58:11 -05:00
Selvan Mani 836413e8c7 mtip32xx: Fix padding issue
Hi Jens,

Another tiny patch.

Removed __packed before the struct smart_attr and added __packed at end of
the structure to fix padding issue.

Signed-off-by: Selvan Mani  <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Ed Cashin 11cfb6ff73 aoe: avoid running request handler on plugged queue
Calling the request handler directly on a plugged queue defeats
the performance improvements provided by the plugging mechanism.
Use the __blk_run_queue function instead of calling the request
handler directly, so that we don't interfere with the block
layer's ability to plug the queue.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Wei Yongjun 298d80152c mtip32xx: fix potential NULL pointer dereference in mtip_timeout_function()
The dereference to port should be moved below the NULL test.

dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Jens Axboe 7c5d62388e mtip32xx: fix shift larger than type warning
If we're building a 32-bit kernel and CONFIG_LBADF isn't set,
sector_t is 32-bits wide. The shifts by 32 and 40 are thus
larger than we support.

Cast the sector offset to a u64 to avoid these warnings.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani 4b9e884523 mtip32xx: Fix incorrect mask used for erase mode
Previous commit use value 3 for erasemode mask.
Changing the mask to correct value to 2

Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani eda4531492 mtip32xx: Fix to make lba address correct in big-endian systems
Earlier lba address was assigned directly to lba_low and lba_low_ex,
which would result in a different number (bytes reversed) in
big-endian systems. Now assigning lba address byte-by-byte to fis.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:55 +01:00
Selvan Mani 3208795e61 mtip32xx: fix potential crash on SEC_ERASE_UNIT
The mtip driver lifted this code from elsewhere and then added a special
handling check for SEC_ERASE_UNIT. If the caller tries to do a security
erase but passes no output data for the command then outbuf is not
allocated and the driver duly explodes.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
Jiri Kosina eac7cc52c6 floppy: destroy floppy workqueue before cleaning up the queue
We need to first destroy the floppy_wq workqueue before cleaning up
the queue. Otherwise we might race with still pending work with the
workqueue, but all the block queue already gone. This might lead to
various oopses, such as

 CPU 0
 Pid: 6, comm: kworker/u:0 Not tainted 3.7.0-rc4 #1 Bochs Bochs
 RIP: 0010:[<ffffffff8134eef5>]  [<ffffffff8134eef5>] blk_peek_request+0xd5/0x1c0
 RSP: 0000:ffff88000dc7dd88  EFLAGS: 00010092
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff88000f602688 RSI: ffffffff81fd95d8 RDI: 6b6b6b6b6b6b6b6b
 RBP: ffff88000dc7dd98 R08: ffffffff81fd95c8 R09: 0000000000000000
 R10: ffffffff81fd9480 R11: 0000000000000001 R12: 6b6b6b6b6b6b6b6b
 R13: ffff88000dc7dfd8 R14: ffff88000dc7dfd8 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 0000000000000000 CR3: 0000000001e11000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process kworker/u:0 (pid: 6, threadinfo ffff88000dc7c000, task ffff88000dc5ecc0)
 Stack:
  0000000000000000 0000000000000000 ffff88000dc7ddb8 ffffffff8134efee
  ffff88000dc7ddb8 0000000000000000 ffff88000dc7dde8 ffffffff814aef3c
  ffffffff81e75d80 ffff88000dc0c640 ffff88000fbfb000 ffffffff814aed90
 Call Trace:
  [<ffffffff8134efee>] blk_fetch_request+0xe/0x30
  [<ffffffff814aef3c>] redo_fd_request+0x1ac/0x400
  [<ffffffff814aed90>] ? start_motor+0x130/0x130
  [<ffffffff8106b526>] process_one_work+0x136/0x450
  [<ffffffff8106af65>] ? manage_workers+0x205/0x2e0
  [<ffffffff8106bb6d>] worker_thread+0x14d/0x420
  [<ffffffff8106ba20>] ? rescuer_thread+0x1a0/0x1a0
  [<ffffffff8107075a>] kthread+0xba/0xc0
  [<ffffffff810706a0>] ? __kthread_parkme+0x80/0x80
  [<ffffffff818b553a>] ret_from_fork+0x7a/0xb0
  [<ffffffff810706a0>] ? __kthread_parkme+0x80/0x80
 Code: 0f 84 c0 00 00 00 83 f8 01 0f 85 e2 00 00 00 81 4b 40 00 00 80 00 48 89 df e8 58 f8 ff ff be fb ff ff ff
 fe ff ff <49> 8b 1c 24 49 39 dc 0f 85 2e ff ff ff 41 0f b6 84 24 28 04 00
 RIP  [<ffffffff8134eef5>] blk_peek_request+0xd5/0x1c0
  RSP <ffff88000dc7dd88>

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:32:54 +01:00
Akinobu Mita d48c152a41 cciss: use check_signature()
Use check_signature() to find a signature in the mmio address.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:28:34 +01:00
Akinobu Mita 1f118bc479 cciss: cleanup bitops usage
- Remove unnecessary correction of bit and address
- Use BITS_TO_LONGS macro to calculate bitmap size
- Use bitmap_zero()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-11-23 14:27:22 +01:00
Keith Busch 2b19603415 NVMe: Initialize iod nents to 0
For commands that do not map a scatter list, we need to initilaize the iod's
number of sg entries (nents) to 0 and not unmap in this case.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:50 -05:00
Keith Busch 6ecec74520 NVMe: Define SMART log
This data structure is defined in the NVMe specification.  It's not used
by the kernel, but is available for use by userspace software.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:50 -05:00
Keith Busch 08df1e0565 NVMe: Add result to nvme_get_features
nvme_get_features() was not returning the result.  Add a parameter
to return the result in (similar to nvme_set_features()) and change
all callers.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch f4f117f64b NVMe: Set result from user admin command
The ioctl data structure includes space for the 'result' of the admin
command to be returned; it just wasn't filled in.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch 3295874b60 NVMe: End queued bio requests when freeing queue
If the queue has bios queued on it when it is freed, bio_endio() must be
called for them first.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Keith Busch 859361a228 NVMe: Free cmdid on nvme_submit_bio error
nvme_map_bio() is called after the cmdid is allocated, so we have to
free the cmdid before returning from nvme_submit_bio() if nvme_map_bio()
returned an error.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-11-13 09:13:49 -05:00
Jens Axboe 8d0ff3924b Merge branch 'stable/for-jens-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.8/drivers 2012-11-12 09:18:47 -07:00
Akinobu Mita f1d6a328bb drbd: use copy_highpage
Use copy_highpage() to copy from one page to another.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:26 +01:00
Lars Ellenberg ed635cb067 drbd: if the replication link breaks during handshake, keep retrying
The 8.3.12 commit drbd: Bugfix for the connection behavior fixes a
"wasted established connection", if a former connection attempt failed
during its early stages.

However it opened a window for a regression, if a connection attempt
fails during its last stages.  The result was a terminated receiver
thread, that left behind the supposedly transient "C_UNCONNECTED" state.
Any later requests to change the connection state fail, as they wait for
the connection state to "stabilize".

Fix: short circuit and keep retrying to restablish a new connection,
if we don't reach C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:19 +01:00
Jing Wang 063eacf88c drbd: check return of kmalloc in receive_uuids
Signed-off-by: Jing Wang <windsdaemon@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:22:10 +01:00
Philipp Reisner 986836503e Merge branch 'drbd-8.4_ed6' into for-3.8-drivers-drbd-8.4_ed6 2012-11-09 14:20:23 +01:00
Philipp Reisner 328e0f125b drbd: Broadcast sync progress no more often than once per second
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:43 +01:00
Philipp Reisner 518a4d53b2 drbd: don't try to clear bits once the disk has failed
If the disk has failed already, there is no point trying to change the
bitmap. drbd_set_out_of_sync() already had this safeguard,
time to add it to drbd_set_in_sync() as well.

This also prevents some warning messages, like
 FIXME asender in bm_change_bits_to, bitmap locked for 'detach' by worker
if our disk fails during resync, while there are some resync acks queued up.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Philipp Reisner fd0017c124 drbd: fix regression: potential NULL pointer dereference
recent commit
    drbd: always write bitmap on detach
introduced a bitmap writeout during detach,
which obviously needs some meta data device to write to.

Unfortunately, that same error path may be taken if we fail to attach,
e.g. due to UUID mismatch, after we changed state to D_ATTACHING,
but before the lower level device pointer is even assigned.

We need to test for presence of mdev->ldev.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Philipp Reisner 4035e4c2eb drbd: Fix clearing of MDF_AL_DISABLED
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:42 +01:00
Lars Ellenberg 42839f6536 drbd: log request sector offset and size for IO errors
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:41 +01:00
Lars Ellenberg edc9f5eb7a drbd: always write bitmap on detach
If we detach due to local read-error (which sets a bit in the bitmap),
stay Primary, and then re-attach (which re-reads the bitmap from disk),
we potentially lost the "out-of-sync" (or, "bad block") information in
the bitmap.

Always (try to) write out the changed bitmap pages before going diskless.

That way, we don't lose the bit for the bad block,
the next resync will fetch it from the peer, and rewrite
it locally, which may result in block reallocation in some
lower layer (or the hardware), and thereby "heal" the bad blocks.

If the bitmap writeout errors out as well, we will (again: try to)
mark the "we need a full sync" bit in our super block,
if it was a READ error; writes are covered by the activity log already.

If that superblock does not make it to disk either, we are sorry.

Maybe we just lost an entire disk or controller (or iSCSI connection),
and there actually are no bad blocks at all, so we don't need to
re-fetch from the peer, there is no "auto-healing" necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:41 +01:00
Lars Ellenberg e34b677d09 drbd: wait for meta data IO completion even with failed disk, unless force-detached
The intention of force-detach is to be able to deal with a completely
unresponsive lower level IO stack, which does not even deliver error
completions anymore, but no completion at all.

In all other cases, we must still wait for the meta data IO completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:40 +01:00
Lars Ellenberg 8747d30af9 drbd: a few more GFP_KERNEL -> GFP_NOIO
This has not yet been observed, but conceivably, when using GFP_KERNEL
allocations from drbd_md_sync(), drbd_flush_after_epoch() or
receive_SyncParam(), we could trigger additional IO to our own device,
or an other device in a criss-cross setup, and end up in a local
deadlock, or potentially a distributed deadlock in a criss-cross setup
involving the peer blocked in a similar way waiting for us to make
progress.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:40 +01:00
Lars Ellenberg bc891c9ae3 drbd: fix potential deadlock during bitmap (re-)allocation
The former comment arguing that GFP_KERNEL was good enough was wrong: it
did not take resize into account at all, and assumed the only path
leading here was the normal attach on a still secondary device, so no
deadlock would be possible.

Both resize on a Primary, or attach on a diskless Primary,
could potentially deadlock.

drbd_bm_resize() is called while IO to the respective device is
suspended, so we must use GFP_NOIO to avoid potential deadlock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Lars Ellenberg a506c13a4d drbd: use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Philipp Reisner 1b6dd252e6 drbd: panic on delayed completion of aborted requests
"aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which do no longer
complete requests at all, not even do error completions.  In this
situation, usually a hard-reset and failover is the only way out.

By "aborting", basically faking a local error-completion,
we allow for a more graceful swichover by cleanly migrating services.
Still the affected node has to be rebooted "soon".

By completing these requests, we allow the upper layers to re-use
the associated data pages.

If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will
just put random data into unused pages; but typically it will corrupt
meanwhile completely unrelated data, causing all sorts of damage.

Which means delayed successful completion,
especially for READ requests,
is a reason to panic().

We assume that a delayed *error* completion is OK,
though we still will complain noisily about it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:39 +01:00
Philipp Reisner a3025a2737 drbd: Fix comparison of is_valid_transition()'s return code
is_valid_transition() might return SS_NOTHING_TO_DO.

The condition function _req_st_cond() returned SS_NOTHING_TO_DO, which
caused the wait_event to abort too early. Therefore drbd_req_state()
did not consume the next CL_ST_CHG_SUCCESS or SS_CW_FAILED_BY_PEER
causing serve disruption of the state machine logic...

Detaching from a single volue was one way to trigger this bug.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:38 +01:00
Philipp Reisner 1393b59f8c drbd: Remove duplicate code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:38 +01:00
Lars Ellenberg 70f17b6bd1 drbd: differentiate early and later "postponing" of requests
We use the RQ_POSTPONED flag to mark a request for several reasons.

It may be a conflicting request in a dual-primary setup,
where conflict detection and resolution on the peer decided that
this request needs to be re-submitted, it needs to re-enter
drbd_make_request() to fix the data divergence caused by these
conflicting, partially overlapping, quasi-simultaneous requests.

In this case we need to mark the corresponding area as out-of-sync,
before we call drbd_al_complete_io().

We also use the RQ_POSTPONED flag to just "push back" a request,
before even processing it, if IO is suspended for some reason.
In this case, as this request was neither submitted nor sent yet,
we must not touch the bitmap.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:37 +01:00
Philipp Reisner 76590cd1fc drbd: Fix postponed requests
A postponed request might has RQ_IN_ACT_LOG already set, but
is POSTPONED before it gets something in the RQ_LOCAL_MASK
set. Up to now this caused a left-over active extent.

Fix that by only testing for the RQ_IN_ACT_LOG bit in drbd_req_destroy()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:37 +01:00
Philipp Reisner 19fffd7b03 drbd: Call drbd_md_sync() explicitly after a state change on the connection
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.

If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.

  The worker might be blocked in sending something, i.e. it
  takes until it gets into its timeout. That is 6 seconds by
  default which is longer than the 5 seconds of the md_sync_timer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:11:08 +01:00
Philipp Reisner d76440181d drbd: Fix postponed requests
* Postponed requests should not set or clear out-of-sync marks
* When a request gets postponed we need to drop its reference
  mdev->local_cnt (put_ldev()).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:24 +01:00
Philipp Reisner 4ae98b4db3 drbd: Imporve the error reporting of failed conn state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:24 +01:00
Philipp Reisner 7970201177 drbd: Fix the way the STATE_SENT bit is cleared
With merging the commit
'drbd: Delay/reject other state changes while establishing a connection'
the condition check for clearing the flag was wrong.

Move the bit clearing to the __drbd_set_state() function
in order to have it already cleared for the other parts of
the function. I.e. clearing the susp_fen in the after_state_ch() function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:23 +01:00
Philipp Reisner 07fc96197a drbd: Do not check aspects that are not subject to change in _conn_requests_state()
When _conn_requests_state() is used to change other parts of the state
than the connection, do not check for a valid connection transition.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:23 +01:00
Philipp Reisner 892fdd1aee drbd: Improve readability of IO resuming after freeze due to no data access
The previous way of doing the state change was also okay since the
state change on the susp flag gets propagated from the mdev
to the tconn.

Fortunately all this goes away in drbd-9.0

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:22 +01:00
Philipp Reisner 88f79ec4ae drbd: Fix IO resuming after connection was established while executing the fence handler
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:22 +01:00
Lars Ellenberg b792b655cd drbd: fix potential list_add corruption
If the md_sync_timer triggers a second time,
while the work queued during the first time is still pending,
this could result in list_add() of an already added item,
and corrupt the work item list.

This likely only triggered because of the erroneous
batch-dequeueing of work items fixed with
  drbd: dequeue single work items in wait_for_work()

Still, skip queueing if md_sync_work is already queued.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Lars Ellenberg bc317a9ecd drbd: dequeue single work items in wait_for_work()
As long as we still use drbd_queue_work_front(),
we must only dequeue the single first item during normal operation.

The comment in drbd_worker() even says so,
but bc8a5a1 drbd: remove struct drbd_tl_epoch objects (barrier works)
introduced the batch dequeueing again via list_splice_init() in
wait_for_work().

Change back to list_move() of the first item, if any.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Lars Ellenberg c02abda2b2 drbd: mutex_unlock "... must no be used in interrupt context"
Documentation of mutex_unlock says
we must not use it in interrupt context.
So do not call it while holding the spin_lock_irq,
but give up the spinlock temporarily.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:21 +01:00
Philipp Reisner c1fd29a11f drbd: Fix a race condition that can lead to a BUG()
If the preconditions for a state change change after the wait_event() we
might hit the BUG() statement in conn_set_state().

With holding the spin_lock while evaluating the condition AND until the
actual state change we ensure the the preconditions can not change anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:20 +01:00
Lars Ellenberg 0ee98e2eb0 drbd: temporarily suspend io in drbd_adm_disk_opts
drbd_adm_disk_opts() does
	wait_event(mdev->al_wait, lc_try_lock(mdev->act_log));
	drbd_al_shrink(mdev);

If the device is very busy, this can take a very long time to succeed.
Fix this by temporarily suspending IO,
then quickly change the settings, and resume.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:20 +01:00
Lars Ellenberg 4eb9b3cba0 drbd: don't send out P_BARRIER with stale information
We must only send P_BARRIER for epochs we actually sent P_DATA in.

If we (re-)establish a connection, we reinitialized the
send.current_epoch_nr, but forgot to reset send.current_epoch_writes.

This could result in a spurious P_BARRIER with stale epoch information,
and a disconnect/reconnect cycle once the then "unexpected"
P_BARRIER_ACK is received:
  BAD! BarrierAck #28823 received, expected #28829!

Introduce re_init_if_first_write() and maybe_send_barrier() helpers,
and call them appropriately for read/write/set-out-of-sync requests.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Lars Ellenberg 08332d7325 drbd: properly call drbd_rs_cancel_all() in drbd_disconnected()
drbd_disconnected() is supposed to clear the resync lru cache,
by calling drbd_rs_cancel_all().

We must do so before we call drbd_flush_workqueue(), as at least the
callback w_restart_disk_io() may wait for resync progres, and would
otherwise deadlock.

drbd_finish_peer_reqs() may again populate that cache, which will
then potentially be stale after the next resync handshake and bitmap
exchange, we have to do it again after that.

A stale resync lru cache causes no harm but ugly messages like this:
 BAD! sector=196608s enr=6 rs_left=-256 rs_failed=0 count=256 cstate=SyncTarget

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner 155522df5b drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:19 +01:00
Philipp Reisner b66623e33e drbd: Avoid NetworkFailure state during disconnect
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.

In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.

At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:18 +01:00
Philipp Reisner 39a1aa7f49 drbd: Protect accesses to the uuid set with a spinlock
There is at least the worker context, the receiver context, the context of
receiving netlink packts.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:08:04 +01:00
Philipp Reisner fef45d297e drbd: Write all pages of the bitmap after an online resize
We need to write the whole bitmap after we moved the meta data
due to an online resize operation.

With the support for one peta byte devices bitmap IO was optimized
to only write out touched pages. This optimization must be turned
off when writing the bitmap after an online resize.

This issue was introduced with drbd-8.3.10.

The impact of this bug is that after an online resize, the next
resync could become larger than expected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:51 +01:00
Philipp Reisner 5af2e8ce2b drbd: Fix completion of requests while the device is suspended
In various places (E.g. CONNECTION_LOST_WHILE_PENDING) the
RQ_COMPLETION_SUSP mask is passed in the clear set to mod_rq_state().

The issue was that it tried to clear the RQ_COMPLETION_SUSP bit
out of the state mask first, and eventuelly set it afterwards,
in the drbd_req_put_completion_ref() function.

Fixed that by moving the reference getting out of
drbd_req_put_completion_ref() into the mod_rq_state(), before the place
where the extra reference might be put.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Andreas Gruenbacher 715306f69d drbd: Don't unregister socket state_change callback from within the callback
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Lars Ellenberg eb12010e9a drbd: disambiguation, s/ERR_DISCARD/ERR_DISCARD_IMPOSSIBLE/
If for some reason (typically "split-brained" cluster manager)
drbd replica data has diverged, we can chose a victim,
and reconnect using "--discard-my-data", causing the victim
to become sync-target, fetching all changed blocks from the peer.

If we are Primary, we are potentially in use, and we refuse to
"roll back" changes to the data below the page cache and other users.

Rename the error symbol for this to ERR_DISCARD_IMPOSSIBLE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:50 +01:00
Lars Ellenberg 427c0434fc drbd: disambiguation, s/DISCARD_CONCURRENT/RESOLVE_CONFLICTS/
We don't discard anything here, really.
We resolve conflicting, concurrent writes to overlapping data blocks.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg d4dabbe22d drbd: disambiguation, s/P_DISCARD_WRITE/P_SUPERSEDED/
To avoid confusion with REQ_DISCARD aka TRIM, rename our
"discard concurrent write acks" from P_DISCARD_WRITE to P_SUPERSEDED.

At the same time, rename the drbd request event DISCARD_WRITE
to CONFLICT_RESOLVED. It already triggers both successful completion
or restart of the request, depending on our RQ_POSTPONED flag.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:49 +01:00
Lars Ellenberg 232fd3f4a0 drbd: cleanup, drop unused struct
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:48 +01:00
Lars Ellenberg 46e21bbadb drbd: NEG_ACK does not imply a barrier-ack
Don't drop a request from the transfer log just because it was NEG_ACKED.
We need it around to be able to verify P_BARRIER_ACKs against the
transver log.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:48 +01:00
Lars Ellenberg 99b4d8fe6d drbd: only start a new epoch, if the current epoch contains writes
Almost all code paths calling start_new_tl_epoch() guarded it with
	if (... current_tle_writes > 0 ... ).
Just move that inside start_new_tl_epoch().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Philipp Reisner 8a0bab2a6d drbd: Finish requests that completed while IO was frozen
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.

When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.

Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.

I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Lars Ellenberg e959d08d3e drbd: Fix a potential issue with the DISCARD_CONCURRENT flag
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:47 +01:00
Lars Ellenberg 519b6d3eac drbd: fix drbd wire compatibility for empty flushes
DRBD has a concept of request epochs or reorder-domains,
which are separated on the wire by P_BARRIER packets.

Older DRBD is not able to handle zero-sized requests at all,
so we need to map empty flushes to these drbd barriers.

These are the equivalent of empty flushes, and
by default trigger flushes on the receiving side anyways
(unless not supported or explicitly disabled),
so there is no need to handle this differently in newer drbd either.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:46 +01:00
Philipp Reisner 80c6eed49d drbd: More random to the connect logic
Since the listening socket is open all the time, it was possible to
get into stable "initial packet S crossed" loops.

* when both sides realize in the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with repeated "packet S crossed"
  messages.

* when both sides do not realize with the drbd_socket_okay() call at the end
  of the loop that the other side closed the main socket you had
  the chance to get into a stable loop with alternating "packet S crossed"
  "packet M crossed" messages.

In order to break out these stable loops randomize the behaviour if
such a crossing of P_INITIAL_DATA or P_INITIAL_META packets is detected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:46 +01:00
Philipp Reisner 92f14951c0 drbd: Try to connec to peer only once per cycle
Since now our listening socket is open all the time we will get
connection tries of the peer always in. No need to try it three
times.

This is valid when connecting to older peers as well, it simply
increases the probability that the new version DRBD will accept
a connection instead that it will establish one.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Philipp Reisner b666dbf819 drbd: Remove redundant and wrong test for NULL simplification in conn_connect()
Since the drbd_socket_okay() function itself tests if the the
socket is NULL, the explicit test "if (sock.socket && &msock.socket)"
was redundent.
Apart from that the address opperator ('&') before msock.socket rendered
the test pointless.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Philipp Marek 3174f8c504 drbd: pass some more information to userspace.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:45 +01:00
Lars Ellenberg 81a3537a97 drbd: announce FLUSH/FUA capability to upper layers
In 8.4, we may have bios spanning two activity log extents.
Fixup drbd_al_begin_io() and drbd_al_complete_io() to deal with zero sized bios.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:44 +01:00
Lars Ellenberg 58ffa580a7 drbd: introduce stop-sector to online verify
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.

Had to bump the protocol version differently, we are now 101.
Added verify_can_do_stop_sector() { protocol >= 97 && protocol != 100; }

Also, the return value convention for worker callbacks has changed,
we returned "true/false" for "keep the connection up" in 8.3,
we return 0 for success and <= for failure in 8.4.
Affected: receive_state()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-09 14:05:32 +01:00
Lars Ellenberg 970fbde1f1 drbd: flush drbd work queue before invalidate/invalidate remote
If you do back to back wait-sync/invalidate on a Primary in a tight loop,
during application IO load, you could trigger a race:
  kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync'
    but 'write from resync_finished' still pending?

Fix this by changing the order of the drbd_queue_work() and
the wake_up() in dec_ap_pending(), and adding the additional
drbd_flush_workqueue() before requesting the full sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:41 +01:00
Lars Ellenberg 6f1a656325 drbd: call local-io-error handler early
In case we want to hard-reset from the local-io-error handler,
we need to call it before notifying the peer or aborting local IO.
Otherwise the peer will advance its data generation UUIDs even
if secondary.

This way, local io error looks like a "regular" node crash,
which reduces the number of different failure cases.
This may be useful in a bigger picture where crashed or otherwise
"misbehaving" nodes are automatically re-deployed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:40 +01:00
Lars Ellenberg a324896b17 drbd: do not reset rs_pending_cnt too early
Fix asserts like
  block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !

We reset the resync lru cache and related information (rs_pending_cnt),
once we successfully finished a resync or online verify, or if the
replication connection is lost.

We also need to reset it if a resync or online verify is aborted
because a lower level disk failed.

In that case the replication link is still established,
and we may still have packets queued in the network buffers
which want to touch rs_pending_cnt.

We do not have any synchronization mechanism to know for sure when all
such pending resync related packets have been drained.

To avoid this counter to go negative (and violate the ASSERT that it
will always be >= 0), just do not reset it when we lose a disk.

It is good enough to make sure it is re-initialized before the next
resync can start: reset it when we re-attach a disk.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:40 +01:00
Lars Ellenberg 8a94317071 drbd: reset congestion information before reporting it in /proc/drbd
We cache the congestion status in mdev->congestion_reason whenever
drbd_congested() was called.
Reset this cached info before reporting it when reading /proc/drbd.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg 6f3465ed82 drbd: report congestion if we are waiting for some userland callback
If the drbd worker thread is synchronously waiting for some userland
callback, we don't want some casual pageout to block on us.
Have drbd_congested() report congestion in that case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg 0c84966601 drbd: differentiate between normal and forced detach
Aborting local requests (not waiting for completion from the lower level
disk) is dangerous: if the master bio has been completed to upper
layers, data pages may be re-used for other things already.
If local IO is still pending and later completes,
this may cause crashes or corrupt unrelated data.

Only abort local IO if explicitly requested.
Intended use case is a lower level device that turned into a tarpit,
not completing io requests, not even doing error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:39 +01:00
Lars Ellenberg bf709c8552 drbd: cleanup, remove two unused global flags
The two unused "global flags" in 8.3 are "per volume" flags in 8.4.
Still, they are unused, so lose them.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:38 +01:00
Lars Ellenberg 3b9ef85e05 drbd: fix null pointer dereference with on-congestion policy when diskless
We must not look at mdev->actlog, unless we have a get_ldev() reference.
It also does not make much sense to try to disconnect or pull-ahead of
the peer, if we don't have good local data.

Only even consider congestion policies, if our local disk is D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:38 +01:00
Lars Ellenberg 27012382bc drbd: take error path in drbd_adm_down if interrupted by signal
drbd_adm_down() does adm_detach(), which can fail with various error
codes, or be interrupted by a signal.

The interrupted by signal case was not properly handled,
leading to
	block drbd0: ASSERT( mdev->state.disk == D_DISKLESS &&
	                     mdev->state.conn == C_STANDALONE ) in drbd/drbd_worker.c
and further to destroying objects while still in use, and resulting crashes.

Detect the interruption, and take the error path out.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg 9a278a7906 drbd: allow read requests to be retried after force-detach
Sometimes, a lower level block device turns into a tar-pit,
not completing requests at all, not even doing error completion.

We can force-detach from such a tar-pit block device,
either by disk-timeout, or by drbdadm detach --force.

Queueing for retry only from the request destruction path (kref hit 0)
makes it impossible to retry affected read requests from the peer,
until the local IO completion happened, as the locally submitted
bio holds a reference on the drbd request object.

If we can only complete READs when the local completion finally
happens, we would not need to force-detach in the first place.

Instead, queue for retry where we otherwise had done the error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg 934722a2db drbd: __req_mod: make DISCARD_WRITE and independend case
cherry-picked and adapted from drbd 9 devel branch

This looks cleaner to me,
and also gets rid of the other ugly if-inside-case-fall-through.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:37 +01:00
Lars Ellenberg a0d856dfae drbd: base completion and destruction of requests on ref counts
cherry-picked and adapted from drbd 9 devel branch

The logic for when to get or put a reference is in mod_rq_state().

To not get confused in the freeze/thaw respectively resend/restart
paths, or when cleaning up requests waiting for P_BARRIER_ACK, this
also introduces additional state flags:
RQ_COMPLETION_SUSP, and RQ_EXP_BARR_ACK.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:36 +01:00
Lars Ellenberg b406777e64 drbd: introduce completion_ref and kref to struct drbd_request
cherry-picked and adapted from drbd 9 devel branch

completion_ref will count pending events necessary for completion.
kref is for destruction.

This only introduces these new members of struct drbd_request,
a followup patch will make actual use of them.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:36 +01:00
Lars Ellenberg 5df69ece6e drbd: __drbd_make_request() is now void
The previous commit causes __drbd_make_request() to always return 0.
Change it to void.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg 5da9c83644 drbd: better separate WRITE and READ code paths in drbd_make_request
cherry-picked and adapted from drbd 9 devel branch

READs will be interesting to at most one connection,
WRITEs should be interesting for all established connections.

Introduce some helper functions to hopefully make this easier to follow.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg b6dd1a8976 drbd: remove struct drbd_tl_epoch objects (barrier works)
cherry-picked and adapted from drbd 9 devel branch

DRBD requests (struct drbd_request) are already on the per resource
transfer log list, and carry their epoch number. We do not need to
additionally link them on other ring lists in other structs.

The drbd sender thread can recognize itself when to send a P_BARRIER,
by tracking the currently processed epoch, and how many writes
have been processed for that epoch.

If the epoch of the request to be processed does not match the currently
processed epoch, any writes have been processed in it, a P_BARRIER for
this last processed epoch is send out first.
The new epoch then becomes the currently processed epoch.

To not get stuck in drbd_al_begin_io() waiting for P_BARRIER_ACK,
the sender thread also needs to handle the case when the current
epoch was closed already, but no new requests are queued yet,
and send out P_BARRIER as soon as possible.

This is done by comparing the per resource "current transfer log epoch"
(tconn->current_tle_nr) with the per connection "currently processed
epoch number" (tconn->send.current_epoch_nr), while waiting for
new requests to be processed in wait_for_work().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:35 +01:00
Lars Ellenberg d5b27b01f1 drbd: move the drbd_work_queue from drbd_socket to drbd_connection
cherry-picked and adapted from drbd 9 devel branch
In 8.4, we don't distinguish between "resource work" and "connection
work" yet, we have one worker for both, as we still have only one connection.

We only ever used the "data.work",
no need to keep the "meta.work" around.

Move tconn->data.work to tconn->sender_work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:34 +01:00
Lars Ellenberg 8c0785a5c9 drbd: allow to dequeue batches of work at a time
cherry-picked and adapted from drbd 9 devel branch

In 8.4, we still use drbd_queue_work_front(),
so in normal operation, we can not dequeue batches,
but only single items.

Still, followup commits will wake the worker
without explicitly queueing a work item,
so up() is replaced by a simple wake_up().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:34 +01:00
Lars Ellenberg b379c41ed7 drbd: transfer log epoch numbers are now per resource
cherry-picked from drbd 9 devel branch.

In preparation of multiple connections, the "barrier number" or
"epoch number" needs to be tracked per-resource, not per connection.
The sequence number space will not be reset anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg 9d05e7c4e7 drbd: rename drbd_restart_write to drbd_restart_request
Meanwhile, this is used to restart failed READ requests as well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg 629663c942 drbd: fix wrong assert in completion/retry path of failed local reads
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:33 +01:00
Lars Ellenberg ab53b90e89 drbd: fix local read error hung forever
The commit
    drbd: simplify retry path of failed READ requests
simplified it too much:
it just did not do anything for local read errors.

Add the missing req_may_be_completed_not_susp() to the
READ_COMPLETED_WITH_ERROR case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:32 +01:00
Lars Ellenberg 1b6f19740d drbd: fix access of unallocated pages and kernel panic
BUG: unable to handle kernel NULL pointer dereference at (null)
...
 [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
 [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]

This fixes an off-by-one error in the receive_bitmap() path,
if run-length encoded bitmap transfer is enabled.

If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
size), and bitmap compression (use-rle) is enabled (which became default
with 8.4), and the very last bit is dirty and reported in an rle
comressed bitmap packet, we ended up trying to kmap_atomic a page pointer
that does not exist (bitmap->bm_pages[last index + 1]).

bug introduced by:
    Date:   Fri Jul 24 15:33:24 2009 +0200
    set bits: optimize for complete last word, fix off-by-one-word corner case

made effective by:
    Date:   Thu Dec 16 00:32:38 2010 +0100
    drbd: get rid of unused debug code

    Long time ago, we had paranoia code in the bitmap that allocated one
    extra word, assigned a magic value, and checked on every occasion that
    the magic value was still unchanged.

    That debug code is unused, the extra long word complicates code a bit.
    Get rid of it.

No-one triggered this bug in the last few years, because a large subset
of our userbase is unaffected:
 * typically the last few blocks of a device are not modified
   frequently, and remain unset
 * use-rle was disabled by default in drbd < 8.4
 * those with slightly "odd" device sizes, or
 * drbd internal meta data (which will skew the device size slightly,
   thus makes it harder to have a bug relevant device size)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:32 +01:00
Philipp Reisner 7a426fd8d5 drbd: Keep the listening socket open while trying to connect to the peer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner 1f3e509b76 drbd: pull prepare_listen_socket() out of drbd_wait_for_connect()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Philipp Reisner 9a51ab1c1b drbd: New disk option al-updates
By disabling al-updates one might increase performace. The price for
that is that in case a crashed primary (that had al-updates disabled)
is reintegraded, it will receive a full-resync instead of a bitmap
based resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:31 +01:00
Andreas Gruenbacher 26ec92871b drbd: Stop using NLA_PUT*().
These macros no longer exist in kernel version v3.5-rc1.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:30 +01:00
Philipp Reisner 7e0f096b8d drbd: Remove drbd_accept() and use kernel_accept() instead
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:30 +01:00
Philipp Reisner 2820fd3969 drbd: Move the call to listen() out of drbd_accept()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Philipp Reisner c5b005ab70 drbd: use bitmap_parse instead of __bitmap_parse
The buffer 'sc.cpu_mask' is a kernel buffer.  If bitmap_parse is used
instead of __bitmap_parse the extra parameter that indicates a kernel
buffer is not needed.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Lars Ellenberg 1882e22df7 drbd: grammar fix in log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:29 +01:00
Lars Ellenberg f66ee69746 drbd: bm_page_async_io: properly initialize page->private
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
 ...).

Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:28 +01:00
Lars Ellenberg a220d29180 drbd: allow bitmap to change during writeout from resync_finished
Symptom: messages similar to
 "FIXME asender in bm_change_bits_to,
  bitmap locked for 'write from resync_finished' by worker"

If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered.  If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.

To fix this, introduce the drbd_bm_write_copy_pages() variant.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:28 +01:00
Lars Ellenberg 5016b82a49 drbd: fix race between drbdadm invalidate/verify and finishing resync
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.

If *in that very moment* a new verify or resync is triggered,
this can race:
 ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
 FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.

This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.

Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Lars Ellenberg 07be15b12c drbd: fix resend/resubmit of frozen IO
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).

Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".

Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit

  drbd: Implemented real timeout checking for request processing time

if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.

This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Philipp Reisner 3ea35df83f drbd: fix spelling, remove boring development log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:27 +01:00
Philipp Reisner e4bad1bcac drbd: Ensure that data_size is not 0 before using data_size-1 as index
This could be exploited by a peer which runs modified code.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner a1096a6e9d drbd: Delay/reject other state changes while establishing a connection
Changes to the role and disk state should be delayed or rejected
while we establish a connection.

This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.

The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:26 +01:00
Philipp Reisner 27eb13e99b drbd: Fixed processing of disk-barrier, disk-flushes and disk-drain
Since drbd_bump_write_ordering() is called in the attaching
process while the disk state is D_ATTACHING, it was not
considering these three flags during attach.

A call to this function was missing form drbd_adm_disk_opts().

Fixed both issues.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 9ed57dcbda drbd: ignore volume number for drbd barrier packet exchange
Transfer log epochs, and therefore P_BARRIER packets,
are per resource, not per volume.
We must not associate them with "some random volume".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 648e46b531 drbd: complete_conflicting_writes() should not care about connections
complete_conflicting_writes() should not cause -EIO.
It should not timeout either, or care for connection states.

Connection timeout is detected elsewhere, and it's cleanup path is
supposed to remove any pending requests or peer_requests from the
write_requests tree.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:25 +01:00
Lars Ellenberg 4439c400ab drbd: simplify retry path of failed READ requests
If a local or remote READ request fails, just push it back to the retry
workqueue.  It will re-enter __drbd_make_request, and be re-assigned to
a suitable local or remote path, or failed, if we do not have access to
good data anymore.

This obsoletes w_read_retry_remote(),
and eliminates two goto...retry blocks in __req_mod()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:24 +01:00
Lars Ellenberg 2415308eb9 drbd: move put_ldev from __req_mod() to the endio callback
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:24 +01:00
Lars Ellenberg 6870ca6d46 drbd: factor out master_bio completion and drbd_request destruction paths
In preparation for multiple connections and reference counting,
separate the code paths for completion of the master bio
and destruction of the request object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg 8d6cdd7848 drbd: conflicting writes: make wake_up of waiting peer_requests explicit
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg 0afd569a40 drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:23 +01:00
Lars Ellenberg ea9d6729bd drbd: fix READ_RETRY_REMOTE_CANCELED to not complete if device is suspended
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:22 +01:00
Lars Ellenberg 27a434fe40 drbd: make OOS_HANDED_TO_NETWORK its own case
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:22 +01:00
Lars Ellenberg 2312f0b3c5 drbd: fix potential deadlock during "restart" of conflicting writes
w_restart_write(), run from worker context, calls __drbd_make_request()
and further drbd_al_begin_io(, delegate=true), which then
potentially deadlocks.  The previous patch moved a BUG_ON to expose
such call paths, which would now be triggered.

Also, if we call __drbd_make_request() from resource worker context,
like w_restart_write() did, and that should block for whatever reason
(!drbd_state_is_stable(), resource suspended, ...),
we potentially deadlock the whole resource, as the worker
is needed for state changes and other things.

Create a dedicated retry workqueue for this instead.

Also make sure that inc_ap_bio()/dec_ap_bio() are properly paired,
even if do_retry() needs to retry itself,
in case __drbd_make_request() returns != 0.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg f9916d61a4 drbd: don't pretend that barrier_nr == 0 was special
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg 0642d5f8e0 drbd: remove unused static helper function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:21 +01:00
Lars Ellenberg e44d71f36c drbd: remove some very outdated comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:20 +01:00
Lars Ellenberg 9dab3842b5 drbd: fix memleak in error path in bm_rw and drbd_bm_write_range
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:20 +01:00
Lars Ellenberg a6a7d4f0c1 drbd: missing wakeup after drbd_rs_del_all
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg 5cdb0bf322 drbd: remove now unused seq_num member from struct drbd_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg 4b8514ee28 drbd: fix potential data corruption and protocol error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:19 +01:00
Lars Ellenberg e8744f5aca drbd: Fixed detach
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:18 +01:00
Lars Ellenberg d93f63028f drbd: Fix a potential write ordering issue on SyncTarget nodes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:18 +01:00
Lars Ellenberg 81f448629a drbd: Fix a potential race that could case data inconsistency
When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE
at the same time, and it happens that the line

    remote = remote && drbd_should_do_remote(s);

stills sees C_WF_BITMAP_S, and

     send_oos = rw == WRITE && drbd_should_send_oos(s);

already sees C_SYNC_SOURCE both are 0.

This causes the write to not be mirrored, but marked as out-of-sync on the
Sync_Source node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner 38a05c16b8 drbd: Consider that bio->bi_bdev might be modified below DRBD
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner 72585d2428 drbd: add missing part_round_stats to _drbd_start_io_acct
Without this, iostat frequently sees bogus svctime and >= 100% "utilization".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:17 +01:00
Philipp Reisner dd9b360475 drbd: Fix module refcount leak in drbd_accept()
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.

Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()

Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:16 +01:00
Philipp Reisner 93f5afe956 drbd: If disk timeout expires fail only the affected volume
...and not all volumes of the resource

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:16 +01:00
Philipp Reisner 32db80f6f6 drbd: Consider the disk-timeout also for meta-data IO operations
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().

This patch adds the timeout to all drbd_md_sync_page_io().

Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Philipp Reisner 25b0d6c8c1 drbd: Reinstate disabling AL updates with invalidate-remote
Commit d0ef827e (drbd: switch configuration interface from connector to
genetlink) introduced a regression by removing the ability to set all
bits in the out of sync bitmap and to suspend updates to the activity log
of a disconnected device via the invalidate-remote management call.

Credits for reporting the issue are going to Arne Redlich.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Lars Ellenberg b17f33cb0a drbd: explicitly clear unused dp_flags in drbd_send_block
We send left-over garbage from the previous packet in P_DATA_REPLY and
P_RS_DATA_REPLY packets. That's bad behaviour.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:15 +01:00
Philipp Reisner 4d0fc3fdc3 drbd: Fixed compat issue with disconnecting 8.4 from a primary 8.3
For compatibility reasons 8.4 has to send P_STATE_CHG_REQ (instead
of P_CONN_ST_CHG_REQ) when disconnecting.

In the receiving code path we missed to convert the old
answer (P_STATE_CHG_REPLY) back to 8.4 logic. Therefore
the CL_ST_CHG_SUCCESS or CL_ST_CHG_FAIL bit in the flags word
of mdev got set, while the state code was waiting for
the CONN_WD_ST_CHG_OKAY or CONN_WD_ST_CHG_FAIL bits in tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:14 +01:00
Andreas Gruenbacher 1a3cde4406 drbd: drbd_bm_ALe_set_all(): Remove unused function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:14 +01:00
Philipp Reisner 69b6a3b159 drbd: restart loop in drbd_make_request() [prepare for Linux-3.2]
With Linux-3.2 generic_make_request() will no longer loop over
the request function until it finally returns 0. Move this
loop into our drbd_make_request() function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner 7da358625c drbd: Restore late assigning of tconn->data.sock and meta.sock
With commit from Mon Mar 28 16:33:12 2011 +0200
"drbd: drbd_connect(): Initialize struct drbd_socket before sending anything"

tconn->data.sock and tconn->meta.sock get assigned early, in
conn_connect.

The early assigning can trigger an OOPS, because it may released the socket
without acquiring the mutex protecting the socket. An other thread (worker)
might use setsockopt() on the socket while it gets free()ed.

Restored the (proven) 8.3 behavior of assigning these sockets after the two
connections are established.

Credits for reporting the issue are going to Arne Redlich.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner a01842ebee drbd: Log failures of connection state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:13 +01:00
Philipp Reisner e8cdc34335 drbd: Consider that read requests could be NEG_ACKEDed
ap_in_flight only counts writes. NEG_ACKED is an action
on a request that might be called for reads and writes.

This bug was there forever, but it becomes much more
relevant with the read balincing code.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner 6ab9b1b60b drbd: Do not send state packets while lower than C_CONNECTED cstate
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.

Actually triggered by the Testuite on kugel.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner b8853dbd8c drbd: fix race between disconnect and receive_state
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.

The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.

Fixed in several code paths:
  Make sure the connection state is NetworkFailure or worse
  before starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

  Disallow receive_state() to "upgrade" the connection state
  from an error state. This will make sure the "illegal" state
  transition won't happen.

  For all connection failure states,
  relax the safe-guard in sanitize_state() again
  to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

 Note by Philipp Reisner:
  The 3rd chunk described as "relax the safe-guard..."
  is not there in 8.4 as it is relaxed to the maximum in
  8.4 already

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:12 +01:00
Philipp Reisner 57bcb6cf1d drbd: Do not call generic_make_request() while holding req_lock
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:11 +01:00
Philipp Reisner d60de03a66 drbd: Load balancing method: striping
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:11 +01:00
Philipp Reisner 380207d08e drbd: Load balancing of read requests
New config option for the disk secition "read-balancing", with
the values: prefer-local, prefer-remote, round-robin, when-congested-remote.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Philipp Reisner d10b4ea32b drbd: Get rid of "ASSERTION FAILED: tconn->current_epoch->list not empty"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Lars Ellenberg 615e087fbd drbd: add missing rcu locks around recently introduced idr_for_each
Recent commit
 drbd: Move write_ordering from mdev to tconn
introduced a new idr_for_each loop over all volumes,
but did not take necessary rcu locks or krefs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:10 +01:00
Andreas Gruenbacher 03d63e1d1e drbd: Remove leftover prototype
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:09 +01:00
Philipp Reisner 975b297947 drbd: fix potential spinlock deadlock
drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks
left to be resynced (rs_left) in the current resync extent.
If it detects a mismatch, it complains, and forces a disconnect using
drbd_force_state(mdev, NS(conn, C_DISCONNECTING));

Unfortunately, this may be called while holding the req_lock,
and drbd_force_state() want's to aquire that lock itself. Deadlock.

Don't force a disconnect, but fix up rs_left by recounting and
reassigning the number of dirty blocks in that extent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:09 +01:00
Philipp Reisner 77fede5137 drbd: Fix the WO=drain implementation for multiple volumes
Wait until IO is drained in all volumes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 1e9dd2912e drbd: Switch drbd_may_finish_epoch() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 12038a3a71 drbd: Move list of epochs from mdev to tconn
This is necessary since the transfer_log on the sending is also
per tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:08 +01:00
Philipp Reisner 1d2783d532 drbd: Prepare epochs per connection
An epoch object needs a pointer to the mdev it was received for.
This is necessary to be able to send the barrier ack packet for
the same volume as the original barrier packet was assigned to.

This prepares the next step, in which the (receiver side)
epoch list is moved from the device (mdev) to the connection (tconn)
object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner 4b0007c0e8 drbd: Move write_ordering from mdev to tconn
This is necessary in order to prepare the move of the (receiver side)
epoch list from the device (mdev) to the connection (tconn) objects.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:07 +01:00
Philipp Reisner 6936fcb49a drbd: Move the CREATE_BARRIER flag from connection to device
That is necessary since the whole transfer log is per connection(tconn)
and not per device(mdev).

This bug caused list corruption on the worker list. When a barrier is queued
for sending in the context of one device, another device did not see the
CREATE_BARRIER bit, and queued the same object again -> list corruption.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 36baf6117b drbd: Fixed an obvious copy-n-paste mistake
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 43de7c852b drbd: Fixes from the drbd-8.3 branch
* drbd-8.3:
  drbd: O_SYNC gives EIO on ramdisks for some kernels (eg. RHEL6).
  drbd: send intermediate state change results to the peer

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:06 +01:00
Philipp Reisner 0cfac5dd90 drbd: Fixes from the drbd-8.3 branch
* drbd-8.3:
  drbd: fix spurious meta data IO "error"
  drbd: Fixed a race condition between detach and start of resync
  drbd: fix harmless race to not trigger an ASSERT
  drbd: Derive sync-UUIDs only from the bitmap-uuid if it is non-zero
  drbd: Fixed current UUID generation (regression introduced recently, after 8.3.11)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:05 +01:00
Philipp Reisner 376694a054 drbd: Silenced compiler warnings
Since version 4.6.1 gcc warns about variables that get
a value assigned, but which are never read later on.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:05 +01:00
Philipp Reisner 9bcd252182 drbd: fix "stalled" empty resync
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.

Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Lars Ellenberg 22d81140ae drbd: fix bitmap writeout after aborted resync
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Philipp Reisner 08b165ba11 drbd: Consider the discard-my-data flag for all volumes [bugz 359]
...not only for the first volume

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:04 +01:00
Andreas Gruenbacher 935be260c1 drbd: Improve error reporting in drbd_md_sync_page_io()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:03 +01:00
Lars Ellenberg 25e409321a drbd: fix connect failure with all default net-options
If no net-options are configured (all on their default),
no DRBD_NLA_NET_CONF will be passed to the kernel.
The kernel must not require its presence,
there is no required option in there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:03 +01:00
Andreas Gruenbacher a209b4aec3 drbd: Update some outdated comments to match the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner c4e7afdc01 drbd: Remove unused code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner 4276dea70c drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:02 +01:00
Philipp Reisner 85d735138a drbd: Cleanup all epoch objects upon connection loss
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:01 +01:00
Philipp Reisner f132f554ce drbd: Do not display bogus log lines for pdsk in case pdsk < D_UNKNOWN
This was a regression recently introduced with commit
7848ddb752c09b6dfd1ddfabb06b69b08aa8f6b9
"drbd: Correctly handle resources without volumes"

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Lars Ellenberg 97ddb68790 drbd: detach must not try to abort non-local requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Andreas Gruenbacher f497609e4c drbd: Get rid of MR_{READ,WRITE}_SHIFT
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:58:00 +01:00
Philipp Reisner 823bd832a6 drbd: Bugfix for the connection behavior
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting.  But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING.  The
restart of drbd_init() then resets thi->t_state to RUNNING.

I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.

The two parts of the fix:
  * Do not cause the thread to restart if we detect the issue
    with the sockets while we are in C_WF_CONNECTION.

  * Make sure that all actions that would have set us to C_BROKEN_PIPE
    happen before the state change to C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher 7d4c782cbd drbd: Fix the data-integrity-alg setting
The last data-integrity-alg fix made data integrity checking work when the
algorithm was changed for an established connection, but the common case of
configuring the algorithm before connecting was still broken.  Fix that.

Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:59 +01:00
Andreas Gruenbacher 71fc7eedb3 drbd: Turn tl_apply() into tl_abort_disk_io()
There is no need to overly generalize this function; it only makes the code
harder to understand.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 1b7ab15b11 drbd: Fixed w_restart_disk_io() to handle non active AL-extents
Since we now apply the AL in user space onto the bitmap, the AL
is not active for the requests we want to reply.

For that a al_write_transaction() that might be called from
worker context became necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 9b743da96c drbd: Missing assignment of mdev before drbd_queue_work()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:58 +01:00
Philipp Reisner 3b03ad5929 drbd: Do not mod_timer() with a past time
In case we can not find out why the request takes too long
(happens e.g. when IO got suspended on DRBD level). rearm
the timer with a reasonable value.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:57 +01:00
Philipp Reisner 3fb4746d8d drbd: Consider that the no-data-condition could be in connected state
...when the peer has inconsistent data. In that case we failed to
clear the susp_nod flag. When the local disk was attached again

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:57 +01:00
Philipp Reisner 97af09d51e drbd: Dropped wrong clause to generate new current UUIDs
Looks like a remainder from long ago.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher accdbcc5f9 drbd: receive_protocol(): We cannot change our own data-integrity-alg setting here
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher d505d9bef2 drbd: Be consistent in reporting incompatibilities in P_PROTOCOL settings
Refer to the settings by the names which drbdsetup and drbd.conf are using.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:56 +01:00
Andreas Gruenbacher fbc12f4514 drbd: receive_protocol(): Make the program flow less confusing
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher b792c35cfb drbd: receive_protocol(): Give variables more easily searchable names
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher 5af172ed9e drbd: Print memory address in hex instead of decimal in error message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:55 +01:00
Andreas Gruenbacher f2257a56ee drbd: Allow to create devices with a minor number > minor_count
The minor_count module/kernel parameter serves to scale the size of drbd's
internal memory pool, but it is no longer a limit for the number of minors or
the minor number.  (Minor numbers can be arbitrarily high within the allowed
limit of 2^20.)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:54 +01:00
Lars Ellenberg 367d675da8 drbd: report net config even for resources without a single volume
Currently it is legal (though unusual) to create and connect a resource,
before adding in all necessary volumes. We should include the network
configuration details, even if we don't have a single volume (yet).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:53 +01:00
Philipp Reisner e0e1665381 drbd: Correctly handle resources without volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:52 +01:00
Philipp Reisner a67e1d9e8c drbd: Eliminated the "notified peer" messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:52 +01:00
Philipp Reisner 369bea6371 drbd: Fixed removal of volumes/devices from connected resources
When removing a volume/device we need to switch the connection
status of the peer back into WFReportParams.

  Before this fix it was left in Connected state. That means that
  the peer device continued to inform us about state changes, etc...
  But we deleted that minor -> protocol error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:51 +01:00
Lars Ellenberg d5d7ebd422 drbd: on attach, enforce clean meta data
Detection of unclean shutdown has moved into user space.

The kernel code will, whenever it updates the meta data, mark it as
"unclean", and will refuse to attach to such unclean meta data.

"drbdadm up" now schedules "drbdmeta apply-al", which will apply
the activity log to the bitmap, and/or reinitialize it, if necessary,
as well as set a "clean" indicator flag.

This moves a bit code out of kernel space.
As a side effect, it also prevents some 8.3 module from accidentally
ignoring the 8.4 style activity log, if someone should downgrade,
whether on purpose, or accidentally because he changed kernel versions
without providing an 8.4 for the new kernel, and the new kernel comes
with in-tree 8.3.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:51 +01:00
Philipp Reisner cdfda633d2 drbd: detach from frozen backing device
* drbd-8.3:
  documentation: Documented detach's --force and disk's --disk-timeout
  drbd: Implemented the disk-timeout option
  drbd: Force flag for the detach operation
  drbd: Allow new IOs while the local disk in in FAILED state
  drbd: Bitmap IO functions can not return prematurely if the disk breaks
  drbd: Added a kref to bm_aio_ctx
  drbd: Hold a reference to ldev while doing meta-data IO
  drbd: Keep a reference to the bio until the completion handler finished
  drbd: Implemented wait_until_done_or_disk_failure()
  drbd: Replaced md_io_mutex by an atomic: md_io_in_use
  drbd: moved md_io into mdev
  drbd: Immediately allow completion of IOs, that wait for IO completions on a failed disk
  drbd: Keep a reference to barrier acked requests

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Andreas Gruenbacher 2fcb8f307f drbd: Improve the "unexpected packet" error messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Philipp Reisner 9510b2411d drbd: Fixed state transitions in case reading meta data failes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:50 +01:00
Philipp Reisner 2ffca4f3ee drbd: Improve compatibility with drbd's older than 8.3.7
Regression introduced with 8.3.11 commit:
drbd: Take a more conservative approach when deciding max_bio_size

Never ever tell an older drbd, that we support more than 32KiB
in a single data request (packet).
Never believe an older drbd, that is supports more than 32KiB
in a single data request (packet)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:49 +01:00
Philipp Reisner d942ae4453 drbd: Fixes from the 8.3 development branch
* commit 'ae57a0a':
   drbd: Only print sanitize state's warnings, if the state change happens
   drbd: we should write meta data updates with FLUSH FUA
   drbd: fix limit define, we support 1 PiByte now
   drbd: fix log message argument order
   drbd: Typo in user-visible message.
   drbd: Make "(rcv|snd)buf-size" and "ping-timeout" available for the proxy, too.
   drbd: Allow keywords to be used in multiple config sections.
   drbd: fix typos in comments.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:49 +01:00
Lars Ellenberg 4dbdae3ec9 drbd: downgraded error printk to info
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:48 +01:00
David Howells 3b7cd457d0 DRBD: Fix comparison always false warning due to long/long long compare
Fix warnings of the following nature in the drbd header:

In file included from drivers/block/drbd/drbd_bitmap.c:32:
drivers/block/drbd/drbd_int.h: In function 'drbd_get_syncer_progress':
drivers/block/drbd/drbd_int.h:2234: warning: comparison is always false due to limited range of data

where mdev->rs_total (an unsigned long) is being compared to 1ULL << 32, which
is always false on a 32-bit machine.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:48 +01:00
Andreas Gruenbacher 6dff290220 drbd: Rename --dry-run to --tentative
drbdadm already has a --dry-run option, so this option cannot directly be
passed through to drbdsetup.  Rename the drbdsetup option to resolve this
conflict.

For backward compatibility, make --dry-run an alias of --tentative.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:47 +01:00
Andreas Gruenbacher d0fa7fd680 drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:47 +01:00
Andreas Gruenbacher afbbfa88bc drbd: Allow to pass resource options to the new-resource command
This is equivalent to how the attach and connect commands work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 089c075d88 drbd: Convert the generic netlink interface to accept connection endpoints
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 44e52cfaa2 drbd: Rename DRBD_ADM_NEED_{CONN -> RESOURCE}
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:46 +01:00
Andreas Gruenbacher 01b39b50d3 drbd: Split off netlink mandatory attribute handling into separate file
Duplicate this file in the kernel module and in user space; both sides need it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:45 +01:00
Andreas Gruenbacher 7c3063cc6f drbd: Also need to check for DRBD_GENLA_F_MANDATORY flags before nla_find_nested()
This is done by introducing drbd_nla_find_nested() which handles the flag
before calling nla_find_nested().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:45 +01:00
Andreas Gruenbacher 789c1b626c drbd: Use the terminology suggested by the command names in the source code and messages
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:44 +01:00
Lars Ellenberg 67b58bf723 drbd: spelling fix: too small
It is not "to small", but "too small".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:44 +01:00
Lars Ellenberg fc251d5c24 drbd: cosmetic: fix accidental division instead of modulo when pretty printing
For large resync rates, seq_printf_with_thousands_grouping()
accidentally only produced Y,000,00Y, instead of the real numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Andreas Gruenbacher 46530e859c drbd: Use DRBD_MINOR_COUNT_DEF in one more place
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Philipp Reisner a67b813cfa drbd: Lower log priority for an event that is definitely not an error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:43 +01:00
Andreas Gruenbacher c75b9b10e7 drbd: Don't use empty nested netlink attributes
Before mainline commit ea5693cc (v2.6.29-rc1), empty nested netlink attributes
were not allowed.  Fix that by leaving out nested attributes if they are empty
and by allowing the top-level attributes to be missing.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:57:40 +01:00
Andreas Gruenbacher 1e2a2551ee drbd: drbd_adm_prepare(): Pass through error codes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:56 +01:00
Philipp Reisner d659f2aaea drbd: Send PROTOCOL_UPDATE packets when appropriate
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:54 +01:00
Philipp Reisner 036b17eaab drbd: Receiving part for the PROTOCOL_UPDATE packet
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner 7aca6c7549 drbd: Allocation of int_dig_in and int_dig_vv was missing
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner f179d76d76 drbd: Made cmp_after_sb() more generic into convert_after_sb()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:53 +01:00
Philipp Reisner 46e1ce4177 drbd: protect updates to integrits_tfm by tconn->data->mutex
Since we need to hold that mutex anyways to make sure the peer
gets that change in the right position in the data stream,
it makes a lot of sense to use the same mutex to ensure existence
of the tfm.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:52 +01:00
Philipp Reisner dcb20d1a8e drbd: Refuse to change network options online when...
* the peer does not speak protocol_version 100 and the
  user wants to change one of:
    - wire_protocol
    - two_primaries
    - integrity_alg

* the user wants to remove the allow_two_primaries flag
  when there are two primaries

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:52 +01:00
Andreas Gruenbacher 95f8efd08b drbd: Fix the upper limit of resync-after
The 32-bit resync_after netlink field takes a device minor number as
parameter, which is no longer limited to 255.  We cannot statically
verify which device numbers are valid, so set the ummer limit to the
highest possible signed 32-bit integer.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:51 +01:00
Andreas Gruenbacher 69ef82dea4 drbd: Refer to connect-int consistently throughout the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:50 +01:00
Andreas Gruenbacher 6394b9358e drbd: Refer to resync-rate consistently throughout the code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:50 +01:00
Lars Ellenberg 7dc1d67f7c drbd: skip spurious wait_event in drbd_al_begin_io
Activity log transaction writes are serialized on a bit lock.
If several CPUs race to write an AL transaction,
those that did not get the lock the first time
may continue as soon as there are no more pending transactions.

The do not need to all grab the lock in turn,
just to realize that the AL is clean already,
and they have nothing to do.

This also closes a potential deadlock with drbd_adm_disk_opts.
Once it got the AL bit lock, it knows there are no pending transactions,
the AL is clean, and it should be safe to wait for all element references
to drop to zero.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:49 +01:00
Andreas Gruenbacher 6139f60dc1 drbd: Rename the want_lose field/flag to discard_my_data
This is what it is called in config files and on the command line as
well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:49 +01:00
Andreas Gruenbacher 6f9b5f84f5 drbd: Make broadcast events return NO_ERROR
Instead of returning a ret_code outside of the range of enum
drbd_ret_code, use NO_ERROR to indicate success.  This way,
ret_code has the same meaning in all packets.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:48 +01:00
Philipp Reisner c141ebda03 drbd: Removing drbd_cfg_rwsem
* Updates to all configuration items is done under genl_lock().
   Including removal of mdevs or tconns.
 * All read non sleeping read sides are protected by rcu
 * All sleeping read sides keep reference counts to keep the
   objects alive

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:48 +01:00
Philipp Reisner ec0bddbc55 drbd: Use RCU for the drbd_tconns list
Preparing removal of drbd_cfg_rwsem

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:47 +01:00
Philipp Reisner 81fa2e675c drbd: Refcounting for mdev objects
Preparing removal of drbd_cfg_rwsem

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:47 +01:00
Andreas Gruenbacher bb77d34ecc drbd: Turn no-tcp-cork into tcp-cork={yes|no}
Change the --no-tcp-cork drbdsetup command line option as well as
the no_cork netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher e544046ab8 drbd: Turn no-md-flushes into md-flushes={yes|no}
Change the --no-md-flushes drbdsetup command line option as well as
the no_md_flush netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher d0c980e236 drbd: Turn no-disk-drain into disk-drain={yes|no}
Change the --no-disk-drain drbdsetup command line option as well as
the no_disk_drain netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:46 +01:00
Andreas Gruenbacher 66b2f6b9c5 drbd: Turn no-disk-flushes into disk-flushes={yes|no}
Change the --no-disk-flushes drbdsetup command line option as well as
the no_disk_flush netlink packet.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:45 +01:00
Philipp Reisner 813472ced7 drbd: RCU for rs_plan_s
This removes the issue with using peer_seq_lock out of different
contexts.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:44 +01:00
Philipp Reisner d589a21e5d drbd: Enforce limits of disk_conf members; centralized these checks
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:44 +01:00
Philipp Reisner 9958c857c7 drbd: Made the fifo object a self contained object (preparing for RCU)
* Moved rs_planed into it, named total
* When having a pointer to the object the values can
  be embedded into the fifo object.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Philipp Reisner daeda1cca9 drbd: RCU for disk_conf
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Lars Ellenberg 563e4cf25e drbd: Introduce __s32_field in the genetlink macro magic
...and drop explicit typecasts (int)meta_dev_idx < 0.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:43 +01:00
Philipp Reisner 2ec91e0e29 drbd: Renamed (old|new)_conf into (old|new)_net_conf in receive_SyncParam
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:42 +01:00
Philipp Reisner dc97b70801 drbd: Split drbd_alter_sa() into drbd_sync_after_valid() and drbd_sync_after_changed()
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:42 +01:00
Philipp Reisner ef5e44a672 drbd: drbd_dew_dev_size() gets the user requests disk_size as argument
Preparing RCU for disk_conf

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:41 +01:00
Philipp Reisner a0095508ca drbd: Renamed the net_conf_update mutex to conf_update
Preparing to use the same mutex for disk_conf updates

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:41 +01:00
Philipp Reisner 934e6138b5 drbd: Removed dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:40 +01:00
Andreas Gruenbacher b966b5dd8e drbd: Generate the drbd_set_*_defaults() functions from drbd_genl.h
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:55:38 +01:00
Lars Ellenberg 009ba89db5 drbd: fix schedule in atomic
An administrative detach used to request a state change directly to D_DISKLESS,
first suspending IO to avoid the last put_ldev() occuring from an endio handler,
potentially in irq context.

This is not enough on the receiving side (typically secondary), we may miss
some peer_req on the way to local disk, which then may do the last put_ldev()
from their drbd_peer_request_endio().

This patch makes the detach always go through the intermediate D_FAILED state.
We may consider to rename it D_DETACHING.

Alternative approach would be to create yet an other work item to be scheduled
on the worker, do the destructor work from there, and get the timing right.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:01 +01:00
Lars Ellenberg 992d6e91d3 drbd: fix thread stop deadlock
There are races where the receiver may be exiting,
but still need the worker to process some stuff.

Do not wait for the receiver to die from an exiting worker.
The receiver must already be dead in case the worker decides to exit.
If the receiver was still alive, it may still want to queue work, and do
drbd_flush_workqueue() from it's disconnect cleanup code,
which would no longer be processed by an exiting worker.

This also would deadlock,
if the worker was to synchornously wait for the receiver to die.

Do not implicitly stop the worker.
The worker will only be stopped from configuration context, from
conn_reconfig_done(), drbd_adm_down() or drbd_adm_delete_connection(),
after making sure the receiver is already stopped.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:00 +01:00
Lars Ellenberg f3dfa40a67 drbd: fix race when forcefully disconnecting
If a forced disconnect hits a restarting receiver right after it passed
its final "if (C_DISCONNECTING)" test in drbdd_init(), but before it was
actually restarted by drbd_thread_setup, we could be left with a
connection stuck in C_DISCONNECTING, never reaching C_STANDALONE,
which would be necessary to take it down or reconfigure it.

Move the last cleanup into w_after_conn_state_ch(), and do an additional
state change request in conn_try_disconnect(), just in case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:53:00 +01:00
Andreas Gruenbacher 88104ca458 drbd: Allow to change data-integrity-alg on the fly
The main purpose of this is to allow to turn data integrity checking on
and off on demand without causing interruptions.

Implemented by allocating tconn->peer_integrity_tfm only when receiving
a P_PROTOCOL message.  l accesses to tconn->peer_integrity_tf happen in
worker context, and no further synchronization is necessary.

On the sender side, tconn->integrity_tfm is modified under
tconn->data.mutex, and a P_PROTOCOL message is sent whenever.  All
accesses to tconn->integrity_tfm already happen under this mutex.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:59 +01:00
Andreas Gruenbacher a7eb7bdf58 drbd: Introduce a "lockless" variant of drbd_send_protocoll()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:59 +01:00
Andreas Gruenbacher 4b6ad6d457 drbd: Remove obsolete drbd_crypto_is_hash()
We allocate hash transformations with crypto_alloc_hash() which will
only return hash algorithms.  It is not necessary to reconfirm that we
actually got a hash algorithm.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 5b614abe30 drbd: Rename integrity_r_tfm -> peer_integrity_tfm
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 8d412fc6d5 drbd: Rename integrity_w_tfm -> integrity_tfm
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:58 +01:00
Andreas Gruenbacher 86db06180a drbd: Wrong use of RCU in receive_protocol()
It is not enough to grab net_conf->integrity_alg under rcu_read_lock()
and access it outside of it; the entire net_conf object may be gone by
then.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:57 +01:00
Lars Ellenberg acb104c396 drbd: fix copy/paste error in comment
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:57 +01:00
Lars Ellenberg b57a1e27ee drbd: rename variable sc to res_opts
sc was short for syncer conf, which does not exist anymore anyways.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:56 +01:00
Lars Ellenberg 5ecc72c3b9 drbd: rename variable ndc to new_disk_conf
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:52:54 +01:00
Lars Ellenberg 5979e36155 drbd: on reconfiguration requests, mind the SET_DEFAULTS flag
The DRBD_GENL_F_SET_DEFAULTS flag was ignored
for drbd_adm_disk_opts() and drbd_adm_net_opts().

Factor out drbd_set_*_defaults() helper functions,
and call them appropriately.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:50:38 +01:00
Philipp Reisner 0fd0ea064c drbd: Consider all crypto options in connect and in net-options
So for this was simply not considered after the options have been
re-arranged.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:08 +01:00
Lars Ellenberg d9cc6e2318 drbd: fix various disconnecting races
If an admin requests disconnect at a time when the state handling
already disconnects/reconnects, there have been some races.

Make sure to always really stop the network threads before
returning success for disconnect. Do not pretend successfull
forced disconnect, if the state handling returned an error.

Return success from drbd_adm_down() only after all threads are finished.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:08 +01:00
Lars Ellenberg 5ee743e92d drbd: remove useless kobject_uevent from drbd_adm_connect
Calling kobject_uevent, which may sleep, from within rcu_read_lock()
protected regions is not possible.
This particular kobject_uevent also is also wrong. It was supposed to
trigger a udev run, just in case something relevant to udev symlink
magic has changed, when adjusting runtime re-configurable settings while
we still had the "syncer conf".  It was improperly placed in connect
when we dropped the "syncer conf".  The right thing to do is probably to
call "udevadm trigger" directly in those cases where drbdadm thinks
there was a need to trigger extra udev runs.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:07 +01:00
Philipp Reisner a18e9d1eb0 drbd: Removed the OBJECT_DYING and the CONFIG_PENDING bits
superseded by refcounting

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:07 +01:00
Philipp Reisner 0ace9dfabe drbd: Take a reference on tconn when finding a tconn by name
Rule #3 of kref.txt

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:06 +01:00
Philipp Reisner 9dc9fbb357 drbd: Basic refcounting for drbd_tconn
References hold by:
 * Each (running) drbd thread has a reference on tconn
 * Each mdev has a referenc on tconn
 * Beeing in the all_tconn list counts for one reference
 * Each after_conn_state_chg_work has a reference to tconn

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:06 +01:00
Philipp Reisner 1d04122599 drbd: Eliminated drbd_free_resoruces() it is superseeded by conn_free_crypto()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg f5e2b8b3b6 drbd: move comment about stopping the receiver thread to where it belongs
When the last volume of a replication group is unconfigured,
the worker thread exits. To not interfere with cleanup
of other threads, before the the last cleanups run,
we need to make sure the receiver has already exited.

The commend explaining that clearly belongs above
drbd_thread_stop(&tconn->receiver), not in the cleanup loop below.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg ae25b336e0 drbd: cmdname() enum to string convertion was missing a few constants
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:05 +01:00
Lars Ellenberg ed439848ca drbd: fix setsockopt for user mode linux
We use our own copy of kernel_setsockopt, and did not mess around with
get_fs/set_fs, since we thought we knew we would always be KERNEL_DS
anyways. Apparently not so for at least user mode linux, so put the
set_fs(KERNEL_DS) in there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:04 +01:00
Lars Ellenberg 71932efc1c drbd: allow status dump request all volumes of a specific resource
We had drbd_adm_get_status (one single volume),
and drbd_adm_get_status_all (dump of all volumes of all resources).

This enhances the latter to be able to dump all volumes
of just one specific resource.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:04 +01:00
Philipp Reisner 302bdeae49 drbd: Considering that the two_primaries config flag can change
Now since it is possible to change the two_primaries config
flag while the connection is up, make sure we treat a peer_req
in a consistent way if the config flag changes while the peer_req
is under IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:03 +01:00
Philipp Reisner 91fd4dad64 drbd: Proper locking for updates to net_conf under RCU
Removing the get_net_conf()/put_net_conf() functions

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:49:03 +01:00
Philipp Reisner 44ed167da7 drbd: rcu_read_lock() and rcu_dereference() for tconn->net_conf
Removing the get_net_conf()/put_net_conf() calls

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:48:59 +01:00
Philipp Reisner b032b6fa35 drbd: Allow online change of replication protocol only with agreed_pv >= 100
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner cd64397c0b drbd: Check consistency of net options when the get changed online
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner 303d1448a0 drbd: Runtime changeable wire protocol
The wire protocol is no longer a property that is negotiated
between the two peers. It is now expressed with two bits
(DP_SEND_WRITE_ACK and DP_SEND_RECEIVE_ACK) in each data
packet. Therefore the primary node is free to change the
wire protocol at any time without disconnect/reconnect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:18 +01:00
Philipp Reisner d3fcb4908d drbd: protect all idr accesses that might sleep with drbd_cfg_rwsem
With this commit the locking for all accesses to IDRs is complete:

 * Non sleeping read accesses are protected by RCU
 * sleeping read accesses are protocted by a read lock on drbd_cfg_rwsem
 * accesses that add anything are protected by a write lock
 * accesses that remove an object are protoected by a write lock
   and a call to synchronize_rcu() after it is removed from the IDR
   and before the object is actually free()ed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:17 +01:00
Philipp Reisner ef35626284 drbd: Converted drbd_cfg_mutex into drbd_cfg_rwsem
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:17 +01:00
Philipp Reisner 695d08fa94 drbd: rcu_read_[un]lock() for all idr accesses that do not sleep
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:16 +01:00
Philipp Reisner cd1d9950f6 drbd: Inlined drbd_free_mdev(); it got called only from one place
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:16 +01:00
Philipp Reisner ff370e5a9e drbd: drbd_delete_device() takes a struct drbd_conf * now
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher 5cc287e0ae drbd: Rename drbd_pp_free() to drbd_free_pages()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher c37c8ecfee drbd: Rename drbd_pp_alloc() to drbd_alloc_pages() and make it non-static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:15 +01:00
Andreas Gruenbacher 18c2d52249 drbd: Rename drbd_pp_first_pages_or_try_alloc() to __drbd_alloc_pages()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:14 +01:00
Andreas Gruenbacher d4da15374b drbd: Make drbd_wait_ee_list_empty() and _drbd_wait_ee_list_empty() static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:14 +01:00
Andreas Gruenbacher 045417f75c drbd: Rename drbd_{ ee -> peer_req }_has_active_page
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher a990be4682 drbd: Rename reclaim_net_ee(), drbd_process_done_ee(), drbd_process_done_ee(), tconn_process_done_ee() to *_peer_reqs
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher 7721f5675e drbd: Rename drbd_release_ee() to drbd_free_peer_reqs()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:13 +01:00
Andreas Gruenbacher 3967deb192 drbd: Rename drbd_free_ee() and variants to *_peer_req()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:12 +01:00
Andreas Gruenbacher 0db55363cb drbd: Rename drbd_alloc_ee() to drbd_alloc_peer_req()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:12 +01:00
Andreas Gruenbacher e0ab6ad4bc drbd: drbd_init_ee() no longer exists
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher 2735a59467 drbd: Make all asynchronous command handlers return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher 859976758d drbd: validate_req_change_req_state(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:11 +01:00
Andreas Gruenbacher b55d84ba17 drbd: Removed outdated comments and code that envisioned VNRs in header 95
Since have now header 100, that has space for 16 bit volume numbers,
the high byte of the length in header 95 is no longer reserved for
8 bit volume numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:10 +01:00
Andreas Gruenbacher 0c8e36d9b8 drbd: Introduce protocol version 100 headers
The 8 byte header finally becomes too small. With the protocol 100 header we
have 16 bit for the volume number, proper 32 bit for the data length, and
32 bit for further extensions in the future.

Previous versions of drbd are using version 80 headers for all packets
short enough for protocol 80.  They support both header versions in
worker context, but only version 80 headers in asynchronous context.
For backwards compatibility, continue to use version 80 headers for
short packets before protocol version 100.

From protocol version 100 on, use the same header version for all
packets.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:10 +01:00
Andreas Gruenbacher e658983af6 drbd: Remove headers from on-the-wire data structures (struct p_*)
Prepare the introduction of the protocol 100 headers. The actual protocol
header is removed for the packet declarations. I.e. allow us to use the
packets with different headers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher 50d0b1ad78 drbd: Remove some fixed header size assumptions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher da39fec492 drbd: Remove now-unused int_dig_out buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:09 +01:00
Andreas Gruenbacher 9f5bdc339e drbd: Replace and remove old primitives
Centralize sock->mutex locking and unlocking in [drbd|conn]_prepare_command()
and [drbd|conn]_send_comman().

Therefore all *_send_* functions are touched to use these primitives instead
of drbd_get_data_sock()/drbd_put_data_sock() and former helper functions.

That change makes the *_send_* functions more standardized.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:08 +01:00
Andreas Gruenbacher 52b061a440 drbd: Introduce drbd_header_size()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:08 +01:00
Andreas Gruenbacher dba5858750 drbd: Introduce new primitives for sending commands
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Andreas Gruenbacher a17647aae4 drbd: drbd_send_ping(), drbd_send_ping(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Philipp Reisner 8b924f1d63 drbd: Use tconn in request_timer_fn()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:07 +01:00
Philipp Reisner 706cb24c23 drbd: Improved logging of state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:06 +01:00
Philipp Reisner a6d00c8ec3 drbd: Implemented IO thawing for multiple volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:06 +01:00
Philipp Reisner 4669265a7b drbd: Implemented conn_lowest_disk()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 19f83c7661 drbd: Implemented conn_lowest_conn()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 8c7e16c39f drbd: Calculate and provide ns_min to the w_after_conn_state_ch() work
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:05 +01:00
Philipp Reisner 5f082f98f5 drbd: Renamed nms to ns_max
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:04 +01:00
Philipp Reisner da9fbc276e drbd: Introduced a new type union drbd_dev_state
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:04 +01:00
Philipp Reisner 8e0af25fa8 drbd: Moved susp, susp_nod and susp_fen to the connection object
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Philipp Reisner 2aebfabb17 drbd: Renamed id_susp(union drbd_state s) to drbd_suspended(struct drbd_conf *)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Philipp Reisner 78bae59b1b drbd: Introduced drbd_read_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:03 +01:00
Lars Ellenberg e15766e9c9 drbd: improvements to activate/deactivate multiple activity log extents
Recent commit drbd: get rid of bio_split, allow bios of "arbitrary" size
had a reference count leak: it only deactivated the first of several
activity log extents for intervals crossing extent boundaries.

This commit generalizes on bios spanning multiple activity log extents
in drbd_al_begin_io, and adds the necessary loop around lc_put in
drbd_al_complete_io as well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:02 +01:00
Lars Ellenberg 23361cf32b drbd: get rid of bio_split, allow bios of "arbitrary" size
Where "arbitrary" size is currently 1 MiB, which is the BIO_MAX_SIZE
for architectures with 4k PAGE_CACHE_SIZE (most).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:02 +01:00
Lars Ellenberg 7726547e67 drbd: prepare to activate two activity log extents at once
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Lars Ellenberg 181286ad22 drbd: preparation commit, pass drbd_interval to drbd_al_begin/complete_io
We want to avoid bio_split for bios crossing activity log boundaries.
So we may need to activate two activity log extents "atomically".
drbd_al_begin_io() needs to know more than just the start sector.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Lars Ellenberg 85f103d88c drbd: introduce the "initialized" activity log transaction type
So we can initialize a clean on disk activity log area,
without the module complaining with loud assert messages
because of checksum or magic value mismatches.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:01 +01:00
Andreas Gruenbacher 6038178ebe drbd: Change how the "handshake" packets are called
Packets of type P_HAND_SHAKE define which protocol versions and features
a node supports.  For clarity, call those packets P_CONNECTION_FEATURES
instead.

(This does not determine the features that a specific drbd device
supports, such as drbd protocol A, B, C.)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:00 +01:00
Andreas Gruenbacher e5d6f33abe drbd: Change how the initial packets are called
The first packets exchanged when a connection is established are
referred to as P_HAND_SHAKE_S and P_HAND_SHAKE_M in the code, followed
by P_HAND_SHAKE packets.  To avoid confusion between these two unrelated
things, call the initial packets P_INITIAL_DATA and P_INITIAL_META.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:45:00 +01:00
Andreas Gruenbacher 7c96715aa8 drbd: _conn_send_cmd(), _drbd_send_cmd(): Pass a struct drbd_socket instead of a plain socket
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher 2bf896213d drbd: drbd_connect(): Initialize struct drbd_socket before sending anything
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher 1952e9166a drbd: Map from (connection, volume number) to device in the asender handlers
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:59 +01:00
Andreas Gruenbacher e05e1e59db drbd: Pass struct packet_info down to the asender receive functions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 438c8374ae drbd: Do not segfault if a sync dependency reaches a diskless device
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 778bcf2e29 drbd: Allow to disconnect if one volume is diskless
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:58 +01:00
Philipp Reisner 435693e89b drbd: Print common state changes of all volumes as connection state changes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:57 +01:00
Philipp Reisner 88ef594ed7 drbd: Fixed logging of old connection state
During a disconnect the oc variable in _conn_request_state()
could become outdated. Determin the common old state after
sleeping.
While at it, I implemented that for all parts of the state

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:57 +01:00
Philipp Reisner bd0c824a9d drbd: Use the idr_for_each_entry() iterator instead of idr_for_each()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher 4a76b1612f drbd: Map from (connection, volume number) to device in the receive handlers
The receive handlers do not all handle unknown volume numbers the same
way.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher e28572167c drbd: Pass struct packet_info down to the receive functions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:56 +01:00
Andreas Gruenbacher 49ba9b1bb3 drbd: Remove useless error messages
These messages can only trigger in case there is a pretty obvious
internal programming error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:55 +01:00
Andreas Gruenbacher deebe19579 drbd: A small cleanup in drbdd()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:55 +01:00
Andreas Gruenbacher 79ed9bd053 drbd: _drbd_send_bitmap(): Use the pre-allocated send buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher 5a87d920f3 drbd: Preallocate one page per drbd_socket as a send buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher fc56815c81 drbd: receive_bitmap(): Use the pre-allocated receive buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:54 +01:00
Andreas Gruenbacher e6ef8a5cb3 drbd: Preallocate one page per drbd_socket as a receive buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:53 +01:00
Philipp Reisner cb703454a2 drbd: Converted drbd_try_outdate_peer() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:53 +01:00
Andreas Gruenbacher a02d124091 drbd: Rename the DCBP_* functions to dcbp_* and move them to where they are used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher 058820cdd7 drbd: Make _drbd_send_bitmap() static
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher e307f352b4 drbd: Move drbd_send_ping() and drbd_send_ping_ack() to drbd_main.c
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:52 +01:00
Andreas Gruenbacher 0916e0e308 drbd: Always use the same protocol version for the same peer
There is no need to send protocol 80 headers to peers that understand
protocol 95 headers.  Make sure that we don't send protocol 95 headers
until we have agreed upon a protocol version with our peer, though.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:51 +01:00
Andreas Gruenbacher 0829f5edf3 drbd: drbd_connected(): Return an error code upon failure.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:51 +01:00
Andreas Gruenbacher a5c3190435 drbd: Introduce and use drbd_recv_all_warn()
The pattern of receiving a fixed number of bytes and warning if a short
packet is received and the receiver has not actively been interruped is
repeated many times; clean that up.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 309a834896 drbd: Get rid of typedef drbd_work_cb
This type is not used anywhere else.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 8f7bed7774 drbd: Rename various functions from *_oos_* to *_out_of_sync_* for clarity
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:50 +01:00
Andreas Gruenbacher 0da34df0d0 drbd: drbd_may_do_local_read(): Use bool/true/false
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 1097e9a80c drbd: Remove unnecessary assertion
This is also checked further below in the same function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 69f5ec728c drbd: Remove duplicate initialization
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:49 +01:00
Andreas Gruenbacher 3fbf4d21ae drbd: drbd_md_sync_page_io(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:48 +01:00
Andreas Gruenbacher ac29f4039a drbd: _drbd_md_sync_page_io(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:48 +01:00
Andreas Gruenbacher 22ab6a30b8 drbd: drbd_bm_read() never returns a positive value through drbd_bitmap_io()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher 82bc01940a drbd: Make all command handlers return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher c696774691 drbd: Add drbd_recv_all(): Receive an entire buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:47 +01:00
Andreas Gruenbacher a982dd579c drbd: send_bitmap_rle_or_plain(): Error handling cleanup
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:46 +01:00
Andreas Gruenbacher e1c1b0fc8f drbd: recv_resync_read(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:46 +01:00
Andreas Gruenbacher 28284ceff0 drbd: recv_dless_read(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher fc5be8397f drbd: drbd_drain_block(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher 69bc7bc351 drbd: drbd_recv_header(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:45 +01:00
Andreas Gruenbacher 8172f3e9bf drbd: decode_header(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:44 +01:00
Andreas Gruenbacher e2b3032b90 drbd: drbd_process_done_ee(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:44 +01:00
Andreas Gruenbacher 99920dc5c5 drbd: Make all worker callbacks return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher b2f0ab62ec drbd: Temporarily change the return type of all worker callbacks
This helps to ensure that we don't miss one of them when changing their
return value semantics.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher a896527c06 drbd: drbd_send_short_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:43 +01:00
Andreas Gruenbacher 6bdb9b0e23 drbd: drbd_send_dblock(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 7fae55da38 drbd: _drbd_send_bio(), _drbd_send_zc_bio(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 7b57b89d62 drbd: drbd_send_block(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:42 +01:00
Andreas Gruenbacher 9f69230cd6 drbd: _drbd_send_zc_ee(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:41 +01:00
Andreas Gruenbacher 88b390ff63 drbd: _drbd_send_page(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:41 +01:00
Andreas Gruenbacher b987427b53 drbd: _drbd_no_send_page(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher 73218a3c4c drbd: drbd_send_oos(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher db1b0b724e drbd: drbd_send_drequest_csum(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:40 +01:00
Andreas Gruenbacher 6c1005e74d drbd: drbd_send_drequest(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:39 +01:00
Andreas Gruenbacher 5b9f499c66 drbd: drbd_send_ov_request(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:39 +01:00
Andreas Gruenbacher fa79abd893 drbd: drbd_send_ack_ex(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher a9a9994dc7 drbd: drbd_send_ack_{dp,rp}(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher dd5161218b drbd: drbd_send_ack(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:38 +01:00
Andreas Gruenbacher a8c32aa846 drbd: _drbd_send_ack(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:37 +01:00
Andreas Gruenbacher d4e67d7c4f drbd: drbd_send_b_ack(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:37 +01:00
Andreas Gruenbacher 2f4e7abe51 drbd: drbd_send_sr_reply(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher d24ae219e9 drbd: drbd_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher caee1c3a92 drbd: conn_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:36 +01:00
Andreas Gruenbacher 758970c832 drbd: _conn_send_state_req(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:35 +01:00
Andreas Gruenbacher f02d4d0a9c drbd: drbd_send_sizes(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:35 +01:00
Andreas Gruenbacher 9c1b7f7282 drbd: drbd_gen_and_send_sync_uuid(): Return void: the result is never used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher 2ae5f95b1a drbd: drbd_send_uuids() and its variants: Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher 387eb30817 drbd: drbd_send_protocol(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:34 +01:00
Andreas Gruenbacher e8d17b015e drbd: drbd_send_handshake(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher 927036f908 drbd: drbd_send_state(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher 103ea27528 drbd: drbd_send_sync_param(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:33 +01:00
Andreas Gruenbacher f725446353 drbd: drbd_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:32 +01:00
Andreas Gruenbacher 7d168ed30f drbd: Get rid of USE_DATA_SOCKET and USE_META_SOCKET
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:32 +01:00
Andreas Gruenbacher 596a37f9ef drbd: conn_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher 04dfa13788 drbd: _drbd_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher ecf2363cb5 drbd: _conn_send_cmd(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:31 +01:00
Andreas Gruenbacher ce9879cb1f drbd: conn_send_cmd2(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:30 +01:00
Andreas Gruenbacher fb708e408f drbd: Add drbd_send_all(): Send an entire buffer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:30 +01:00
Andreas Gruenbacher 11b0be28e5 drbd: drbd_get_data_sock(): Return 0 upon success and an error code otherwise
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:29 +01:00
Andreas Gruenbacher c0d42c8e57 drbd: drbd_send(): Return a "real" error code if we have no socket
Q: Can this case even trigger?  Is failing this way any better than one
that causes a NULL pointer access?

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:29 +01:00
Philipp Reisner e90285e0ba drbd: Fixed conn_lowest_minor
It actually returned the lowest volume number. While doing that
renamed a few wrongly named variables.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:28 +01:00
Lars Ellenberg f399002e68 drbd: distribute former syncer_conf settings to disk, connection, and resource level
This commit breaks the API again.

Move per-volume former syncer options into disk_conf.
Move per-connection former syncer options into net_conf.
Renamed the remainign sync_conf to res_opts

Syncer settings have been changeable at runtime, so we need to prepare
for these settings to be runtime-changeable in their new home as well.

Introduce new configuration operations, and share the netlink attribute
between "attach" (create new disk) and "disk-opts" (change options).
Same for "connect" and "net-opts".

Some fields cannot be changed at runtime, however.
Introduce a new flag GENLA_F_INVARIANT to be able to trigger on that in
the generated validation and assignment functions.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-08 16:44:20 +01:00
Roger Pau Monne cb5bd4d19b xen/blkback: persistent-grants fixes
This patch contains fixes for persistent grants implementation v2:

 * handle == 0 is a valid handle, so initialize grants in blkback
   setting the handle to BLKBACK_INVALID_HANDLE instead of 0. Reported
   by Konrad Rzeszutek Wilk.

 * new_map is a boolean, use "true" or "false" instead of 1 and 0.
   Reported by Konrad Rzeszutek Wilk.

 * blkfront announces the persistent-grants feature as
   feature-persistent-grants, use feature-persistent instead which is
   consistent with blkback and the public Xen headers.

 * Add a consistency check in blkfront to make sure we don't try to
   access segments that have not been set.

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
[v1: The new_map int->bool had already been changed]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-11-04 10:35:40 -05:00
Philipp Reisner 6b75dced00 drbd: conn_khelper() for user mode callbacks for connections
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:32 +01:00
Philipp Reisner 047e95e259 drbd: Allow volumes to become primary only on one side
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:31 +01:00
Lars Ellenberg 40cbf085f5 drbd: fix conn_reconfig_start without conn_reconfig_done in drbd_adm_attach
If drbd_adm_attach failed early, it left the CONFIG_PENDING bit on,
blocking any further conn_reconfig_start on that connection.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:31 +01:00
Philipp Reisner e4f78edee1 drbd: Separate connection state changes from minor dev state changes #2
New function got_conn_RqSReply()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:30 +01:00
Philipp Reisner f19e4f8ba7 drbd: Converted got_Ping() and got_PingAck() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:30 +01:00
Philipp Reisner a4fbda8eca drbd: Allow packet handler functions that take a connection (meta connection)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:29 +01:00
Philipp Reisner dfafcc8a7b drbd: Separate connection state changes from minor dev state changes #1
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:28 +01:00
Philipp Reisner 7204624c5e drbd: Converted receive_protocol() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:28 +01:00
Philipp Reisner d9ae84e790 drbd: Allow packet handler functions that take a connection
That is necessary in case a connection does not have a volume 0

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:27 +01:00
Philipp Reisner 8169e41b3e drbd: Moved CONN_DRY_RUN to the per connection (tconn) flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:27 +01:00
Philipp Reisner 38fa9988fa drbd: Do not modify the connection state with something else that conn_request_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:26 +01:00
Philipp Reisner 34f646bd57 drbd: Allow two diskless minors to be connected
In the context of drbd-8.4 it no longer makes sense to
dissalow that.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:25 +01:00
Philipp Reisner 2325eb661f drbd: New minors have to intherit the connection state form their connection
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:25 +01:00
Philipp Reisner 082a3439a2 drbd: process_done_ee() has to handle unconfigured devices now
Took the chance and converted tconn_process_done_ee() to use
idr_for_each_entry()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:24 +01:00
Philipp Reisner 2de876efa6 drbd: Ignore packets for non existing volumes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:23 +01:00
Lars Ellenberg 85f75dd763 drbd: introduce in-kernel "down" command
This greatly simplifies deconfiguration of whole resources.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:23 +01:00
Lars Ellenberg 3c5e5f6afd drbd: add forgotten spin_unlock
somehow a "goto abort" was introduced with commit
  drbd: Extracted is_valid_transition() out of sanitize_state()
which left drbd_req_state still holding the spin lock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:22 +01:00
Lars Ellenberg 527f4b24e5 drbd: bail out if a config requrest is over-determined, and not matching
We have resources resp. connections, volumes, and minor numbers.
A config request may specifies all three of them.
If it turns out that the minor belongs to a different connection, or a
different volume number in the same connection, that configuration
request is invalid.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:21 +01:00
Lars Ellenberg 38f19616d2 drbd: new-connection and new-minor succeed, if the object already exists
Follow O_CREAT semantics when creating connection or minor device/volume
objects.  If we need O_CREAT|O_EXCL semantics some time down the road,
we can add NLM_F_EXCL to the netlink message flags.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:21 +01:00
Lars Ellenberg cffec5b2fe drbd: Allow a Diskless Secondary volume to be removed
Even if the connection is still established.
We should be able to reduce a volume from a replication group,
without taking the whole group offline.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:20 +01:00
Lars Ellenberg d0456c72df drbd: simplify conn_all_vols_unconf, make it bool
Get rid of a temporary variable and, funny bitand assignment.
Just short circuit, returning false, once we encounter the first
still configured volume.

FIXME verify call sites for need of rcu_read_lock or stronger.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:19 +01:00
Lars Ellenberg 543cc10b4c drbd: drbd_adm_get_status needs to show some more detail
We want to see existing connection objects, even if they do not
currently have volumes attached.

Change the .dumpit variant of drbd_adm_get_status to iterate not over
minor devices, but over connections + volumes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:19 +01:00
Lars Ellenberg 8432b31457 drbd: allow holes in minor and volume id allocation
s/idr_get_new/idr_get_new_above/

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:17 +01:00
Lars Ellenberg 3b98c0c209 drbd: switch configuration interface from connector to genetlink
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-11-04 00:16:17 +01:00
Alex Elder 9e15b77d9a rbd: get additional info in parent spec
When a layered rbd image has a parent, that parent is identified
only by its pool id, image id, and snapshot id.  Images that have
been mapped also record *names* for those three id's.

Add code to look up these names for parent images so they match
mapped images more closely.  Skip doing this for an image if it
already has its pool name defined (this will be the case for images
mapped by the user).

It is possible that an the name of a parent image can't be
determined, even if the image id is valid.  If this occurs it
does not preclude correct operation, so don't treat this as
an error.

On the other hand, defined pools will always have both an id and a
name.   And any snapshot of an image identified as a parent for a
clone image will exist, and will have a name (if not it indicates
some other internal error).  So treat failure to get these bits
of information as errors.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 86b00e0da6 rbd: get parent spec for version 2 images
Add support for getting the the information identifying the parent
image for rbd images that have them.  The child image holds a
reference to its parent image specification structure.  Create a new
entry "parent" in /sys/bus/rbd/image/N/ to report the identifying
information for the parent image, if any.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder a92ffdf8a9 rbd: allow null image name
Format 2 parent images are partially identified by their image id,
but it may not be possible to determine their image name.  The name
is not strictly needed for correct operation, so we won't be
treating it as an error if we don't know it.  Handle this case
gracefully in rbd_name_show().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 2c0d0a10ea rbd: allow null image name
We will know the image id for format 2 parent images, but won't
initially know its image name.  Avoid making the query for an image
id in rbd_dev_image_id() if it's already known.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 83a0626362 rbd: encapsulate last part of probe
Group the activities that now take place after an rbd_dev_probe()
call into a single function, and move the call to that function
into rbd_dev_probe() itself.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder c53d589337 rbd: define rbd_dev_{create,destroy}() helpers
Encapsulate the creation/initialization and destruction of rbd
device structures.  The rbd_client and the rbd_spec structures
provided on creation hold references whose ownership is transferred
to the new rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder bd4ba6554d rbd: consolidate rbd_dev init in rbd_add()
Group the allocation and initialization of fields of the rbd device
structure created in rbd_add().  Move the grouped code down later in
the function, just prior to the call to rbd_dev_probe().  This is
for the most part simple code movement.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 9d3997fdf4 rbd: don't pass rbd_dev to rbd_get_client()
The only reason rbd_dev is passed to rbd_get_client() is so its
rbd_client field can get assigned.  Instead, just return the
rbd_client pointer as a result and have the caller do the
assignment.

Change rbd_put_client() so it takes an rbd_client structure,
so follows the more typical symmetry with rbd_get_client().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 859c31df9c rbd: fill rbd_spec in rbd_add_parse_args()
Pass the address of an rbd_spec structure to rbd_add_parse_args().
Use it to hold the information defining the rbd image to be mapped
in an rbd_add() call.

Use the result in the caller to initialize the rbd_dev->id field.

This means rbd_dev is no longer needed in rbd_add_parse_args(),
so get rid of it.

Now that this transformation of rbd_add_parse_args() is complete,
correct and expand on the its header documentation to reflect the
new reality.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:42 -05:00
Alex Elder 8b8fb99c5c rbd: add reference counting to rbd_spec
With layered images we'll share rbd_spec structures, so add a
reference count to it.  It neatens up some code also.

A silly get/put pair is added to the alloc routine just to avoid
"defined but not used" warnings.  It will go away soon.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-11-01 07:55:41 -05:00
Roger Pau Monne 0a8704a51f xen/blkback: Persistent grant maps for xen blk drivers
This patch implements persistent grants for the xen-blk{front,back}
mechanism. The effect of this change is to reduce the number of unmap
operations performed, since they cause a (costly) TLB shootdown. This
allows the I/O performance to scale better when a large number of VMs
are performing I/O.

Previously, the blkfront driver was supplied a bvec[] from the request
queue. This was granted to dom0; dom0 performed the I/O and wrote
directly into the grant-mapped memory and unmapped it; blkfront then
removed foreign access for that grant. The cost of unmapping scales
badly with the number of CPUs in Dom0. An experiment showed that when
Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
ramdisk, the IPIs from performing unmap's is a bottleneck at 5 guests
(at which point 650,000 IOPS are being performed in total). If more
than 5 guests are used, the performance declines. By 10 guests, only
400,000 IOPS are being performed.

This patch improves performance by only unmapping when the connection
between blkfront and back is broken.

On startup blkfront notifies blkback that it is using persistent
grants, and blkback will do the same. If blkback is not capable of
persistent mapping, blkfront will still use the same grants, since it
is compatible with the previous protocol, and simplifies the code
complexity in blkfront.

To perform a read, in persistent mode, blkfront uses a separate pool
of pages that it maps to dom0. When a request comes in, blkfront
transmutes the request so that blkback will write into one of these
free pages. Blkback keeps note of which grefs it has already
mapped. When a new ring request comes to blkback, it looks to see if
it has already mapped that page. If so, it will not map it again. If
the page hasn't been previously mapped, it is mapped now, and a record
is kept of this mapping. Blkback proceeds as usual. When blkfront is
notified that blkback has completed a request, it memcpy's from the
shared memory, into the bvec supplied. A record that the {gref, page}
tuple is mapped, and not inflight is kept.

Writes are similar, except that the memcpy is peformed from the
supplied bvecs, into the shared pages, before the request is put onto
the ring.

Blkback stores a mapping of grefs=>{page mapped to by gref} in
a red-black tree. As the grefs are not known apriori, and provide no
guarantees on their ordering, we have to perform a search
through this tree to find the page, for every gref we receive. This
operation takes O(log n) time in the worst case. In blkfront grants
are stored using a single linked list.

The maximum number of grants that blkback will persistenly map is
currently set to RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, to
prevent a malicios guest from attempting a DoS, by supplying fresh
grefs, causing the Dom0 kernel to map excessively. If a guest
is using persistent grants and exceeds the maximum number of grants to
map persistenly the newly passed grefs will be mapped and unmaped.
Using this approach, we can have requests that mix persistent and
non-persistent grants, and we need to handle them correctly.
This allows us to set the maximum number of persistent grants to a
lower value than RING_SIZE * BLKIF_MAX_SEGMENTS_PER_REQUEST, although
setting it will lead to unpredictable performance.

In writing this patch, the question arrises as to if the additional
cost of performing memcpys in the guest (to/from the pool of granted
pages) outweigh the gains of not performing TLB shootdowns. The answer
to that question is `no'. There appears to be very little, if any
additional cost to the guest of using persistent grants. There is
perhaps a small saving, from the reduced number of hypercalls
performed in granting, and ending foreign access.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Roger Pau Monne <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[v1: Fixed up the misuse of bool as int]
2012-10-30 09:50:04 -04:00
Alex Elder 0d7dbfce9d rbd: define image specification structure
Group the fields that uniquely specify an rbd image into a new
reference-counted rbd_spec structure.  This structure will be used
to describe the desired image when mapping an image, and when
probing parent images in layered rbd devices.  Replace the set of
fields in the rbd device structure with a pointer to a dynamically
allocated rbd_spec.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:30 -05:00
Alex Elder dc79b113d6 rbd: have rbd_add_parse_args() return error
Change the interface to rbd_add_parse_args() so it returns an
error code rather than a pointer.  Return the ceph_options result
via a pointer whose address is passed as an argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:30 -05:00
Alex Elder 4e9afeba7a rbd: pass and populate rbd_options structure
Have the caller pass the address of an rbd_options structure to
rbd_add_parse_args(), to be initialized with the information
gleaned as a result of the parse.

I know, this is another near-reversal of a recent change...

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 819d52bf72 rbd: remove snap_name arg from rbd_add_parse_args()
The snapshot name returned by rbd_add_parse_args() just gets saved
in the rbd_dev eventually.  So just do that inside that function and
do away with the snap_name argument, both in rbd_add_parse_args()
and rbd_dev_set_mapping().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder f28e565a1b rbd: remove options args from rbd_add_parse_args()
They "options" argument to rbd_add_parse_args() (and it's partner
options_size) is now only needed within the function, so there's no
need to have the caller allocate and pass the options buffer.  Just
allocate the options buffer within the function using dup_token().

Also distinguish between failures due to failed memory allocation
and failing because a required argument was missing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder e5c3553404 rbd: get rid of snap_name_len
The value returned in the "snap_name_len" argument to
rbd_add_parse_args() is never actually used, so get rid of it.

The snap_name_len recorded in rbd_dev_v2_snap_name() is not
useful either, so get rid of that too.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 0ddebc0c6c rbd: do all argument parsing in one place
This patch makes rbd_add_parse_args() be the single place all
argument parsing occurs for an image map request:
    - Move the ceph_parse_options() call into that function
    - Use local variables rather than parameters to hold the list
      of monitor addresses supplied
    - Rather than returning it, pass the snapshot name (and its
      length) back via parameters
    - Have the function return a ceph_options structure pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 78cea76e05 rbd: move ceph_parse_options() call up
Move option parsing out of rbd_get_client() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder daba5fdb4c rbd: rename snap_exists field
A Boolean field "snap_exists" in an rbd mapping is used to indicate
whether a mapped snapshot has been removed from an image's snapshot
context, to stop sending requests for that snapshot as soon as we
know it's gone.

Generalize the interpretation of this field so it applies to
non-snapshot (i.e. "head") mappings.  That is, define its value
to be false until the mapping has been set, and then define it to be
true for both snapshot mappings or head mappings.

Rename the field "exists" to reflect the broader interpretation.
The rbd_mapping structure is on its way out, so move the field
back into the rbd_device structure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 971f839a76 rbd: move snap info out of rbd_mapping struct
Moving the snap_id and snap_name fields into the separate
rbd_mapping structure was misguided.  (And in time, perhaps
we'll do away with that structure altogether...)

Move these fields back into struct rbd_device.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 86992098e7 rbd: make pool_id a 64 bit value
If a format 2 image has a parent, its pool id will be specified
using a 64-bit value.  Change the pool id we save for an image to
match that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:29 -05:00
Alex Elder 41f38c2b2f rbd: remove snapshots on error in rbd_add()
If rbd_dev_snaps_update() has ever been called for an rbd device
structure there could be snapshot structures on its snaps list.
In rbd_add(), this function is called but a subsequent error
path neglected to clean up any of these snapshots.

Add a call to rbd_remove_all_snaps() in the appropriate spot to
remedy this.  Change a couple of error labels to be a little
clearer while there.

Drop the leading underscores from the function name; there's nothing
special about that function that they might signify.  As suggested
in review, the leading underscores in __rbd_remove_snap_dev() have
been removed as well.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:28 -05:00
Alex Elder f7760dad28 rbd: simplify rbd_rq_fn()
When processing a request, rbd_rq_fn() makes clones of the bio's in
the request's bio chain and submits the results to osd's to be
satisfied.  If a request bio straddles the boundary between objects
backing the rbd image, it must be represented by two cloned bio's,
one for the first part (at the end of one object) and one for the
second (at the beginning of the next object).

This has been handled by a function bio_chain_clone(), which
includes an interface only a mother could love, and which has
been found to have other problems.

This patch defines two new fairly generic bio functions (one which
replaces bio_chain_clone()) to help out the situation, and then
revises rbd_rq_fn() to make use of them.

First, bio_clone_range() clones a portion of a single bio, starting
at a given offset within the bio and including only as many bytes
as requested.  As a convenience, a request to clone the entire bio
is passed directly to bio_clone().

Second, bio_chain_clone_range() performs a similar function,
producing a chain of cloned bio's covering a sub-range of the
source chain.  No bio_pair structures are used, and if successful
the result will represent exactly the specified range.

Using bio_chain_clone_range() makes bio_rq_fn() a little easier
to understand, because it avoids the need to pass very much
state information between consecutive calls.  By avoiding the need
to track a bio_pair structure, it also eliminates the problem
described here:  http://tracker.newdream.net/issues/2933

Note that a block request (and therefore the complete length of
a bio chain processed in rbd_rq_fn()) is an unsigned int, while
the result of rbd_segment_length() is u64.  This change makes
this range trunctation explicit, and trips a bug if the the
segment boundary is too far off.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-30 08:34:28 -05:00
Lars Ellenberg ccae7868b0 drbd: log request sector offset and size for IO errors
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg a2a3c74f24 drbd: always write bitmap on detach
If we detach due to local read-error (which sets a bit in the bitmap),
stay Primary, and then re-attach (which re-reads the bitmap from disk),
we potentially lost the "out-of-sync" (or, "bad block") information in
the bitmap.

Always (try to) write out the changed bitmap pages before going diskless.

That way, we don't lose the bit for the bad block,
the next resync will fetch it from the peer, and rewrite
it locally, which may result in block reallocation in some
lower layer (or the hardware), and thereby "heal" the bad blocks.

If the bitmap writeout errors out as well, we will (again: try to)
mark the "we need a full sync" bit in our super block,
if it was a READ error; writes are covered by the activity log already.

If that superblock does not make it to disk either, we are sorry.

Maybe we just lost an entire disk or controller (or iSCSI connection),
and there actually are no bad blocks at all, so we don't need to
re-fetch from the peer, there is no "auto-healing" necessary.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 06f10adbdb drbd: prepare for more than 32 bit flags
- struct drbd_conf { ... unsigned long flags; ... }
 + struct drbd_conf { ... unsigned long drbd_flags[N]; ... }

And introduce wrapper functions for test/set/clear bit operations
on this member.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 44edfb0d78 drbd: wait for meta data IO completion even with failed disk, unless force-detached
The intention of force-detach is to be able to deal with a completely
unresponsive lower level IO stack, which does not even deliver error
completions anymore, but no completion at all.

In all other cases, we must still wait for the meta data IO completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 8b45a5c8a1 drbd: a few more GFP_KERNEL -> GFP_NOIO
This has not yet been observed, but conceivably, when using GFP_KERNEL
allocations from drbd_md_sync(), drbd_flush_after_epoch() or
receive_SyncParam(), we could trigger additional IO to our own device,
or an other device in a criss-cross setup, and end up in a local
deadlock, or potentially a distributed deadlock in a criss-cross setup
involving the peer blocked in a similar way waiting for us to make
progress.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 0b143d4382 drbd: fix potential deadlock during bitmap (re-)allocation
The former comment arguing that GFP_KERNEL was good enough was wrong: it
did not take resize into account at all, and assumed the only path
leading here was the normal attach on a still secondary device, so no
deadlock would be possible.

Both resize on a Primary, or attach on a diskless Primary,
could potentially deadlock.

drbd_bm_resize() is called while IO to the respective device is
suspended, so we must use GFP_NOIO to avoid potential deadlock.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:18 +01:00
Lars Ellenberg 7fb907c15f drbd: panic on delayed completion of aborted requests
"aborting" requests, or force-detaching the disk, is intended for
completely blocked/hung local backing devices which do no longer
complete requests at all, not even do error completions.  In this
situation, usually a hard-reset and failover is the only way out.

By "aborting", basically faking a local error-completion,
we allow for a more graceful swichover by cleanly migrating services.
Still the affected node has to be rebooted "soon".

By completing these requests, we allow the upper layers to re-use
the associated data pages.

If later the local backing device "recovers", and now DMAs some data
from disk into the original request pages, in the best case it will
just put random data into unused pages; but typically it will corrupt
meanwhile completely unrelated data, causing all sorts of damage.

Which means delayed successful completion,
especially for READ requests,
is a reason to panic().

We assume that a delayed *error* completion is OK,
though we still will complain noisily about it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner dbd0820c6f drbd: Remove dead code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner 599377acb7 drbd: Avoid NetworkFailure state during disconnect
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.

In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.

At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:17 +01:00
Philipp Reisner c12a3d8c84 drbd: Fix a potential issue with the DISCARD_CONCURRENT flag
The DISCARD_CONCURRENT flag should be set on one node and cleared on the
other node.
As the code was before it was theoretical possible that a node accepts the
meta socket, but has to close it later on, and keeps the DISCARD_CONCURRENT
flag.
Correct this by moving the clear_bit(DISCARD_CONCURRENT) where the packet
gets sent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Lars Ellenberg 02b91b5526 drbd: introduce stop-sector to online verify
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Philipp Reisner 9f2247bb9b drbd: Protect accesses to the uuid set with a spinlock
There is at least the worker context, the receiver context, the context of
receiving netlink packts and processes reading a sysfs attribute that access
the uuids.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:39:01 +01:00
Dave Chinner a1ecac3b06 loop: Make explicit loop device destruction lazy
xfstests has always had random failures of tests due to loop devices
failing to be torn down and hence leaving filesytems that cannot be
unmounted. This causes test runs to immediately stop.

Over the past 6 or 7 years we've added hacks like explicit unmount
-d commands for loop mounts, losetup -d after unmount -d fails, etc,
but still the problems persist.  Recently, the frequency of loop
related failures increased again to the point that xfstests 259 will
reliably fail with a stray loop device that was not torn down.

That is despite the fact the test is above as simple as it gets -
loop 5 or 6 times running mkfs.xfs with different paramters:

        lofile=$(losetup -f)
        losetup $lofile "$testfile"
        "$MKFS_XFS_PROG" -b size=512 $lofile >/dev/null || echo "mkfs failed!"
        sync
        losetup -d $lofile

And losteup -d $lofile is failing with EBUSY on 1-3 of these loops
every time the test is run.

Turns out that blkid is running simultaneously with losetup -d, and
so it sees an elevated reference count and returns EBUSY.  But why
is blkid running? It's obvious, isn't it? udev has decided to try
and find out what is on the block device as a result of a creation
notification. And it is racing with mkfs, so might still be scanning
the device when mkfs finishes and we try to tear it down.

So, make losetup -d force autoremove behaviour. That is, when the
last reference goes away, tear down the device. xfstests wants it
*gone*, not causing random teardown failures when we know that all
the operations the tests have specifically run on the device have
completed and are no longer referencing the loop device.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:31 +01:00
Selvan Mani 4453bc88f0 mtip32xx:Added appropriate timeout value for secure erase
Added appropriate timeout value for secure erase based on identify device data

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:27 +01:00
Oliver Chick 1f999572f2 xen/blkback: Change xen_vbd's flush_support and discard_secure to have type unsigned int, rather than bool
Changing the type of bdev parameters to be unsigned int :1, rather than bool.
This is more consistent with the types of other features in the block drivers.

Signed-off-by: Oliver Chick <oliver.chick@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:20 +01:00
Akinobu Mita b7010ede43 cciss: select CONFIG_CHECK_SIGNATURE
The patch cciss-use-check_signature.patch in -mm tree introduced
a build error:

drivers/built-in.o: In function `CISS_signature_present':
drivers/block/cciss.c:4270: undefined reference to `check_signature'

Add missing CONFIG_CHECK_SIGNATURE to fix this issue.

Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Fengguang Wu <wfg@linux.intel.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: "Stephen M. Cameron" <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:37:00 +01:00
Wei Yongjun 2541aa799f cciss: remove unneeded memset()
The memory return by kzalloc() or kmem_cache_zalloc() has already be set
to zero, so remove useless memset(0).

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:36:58 +01:00
Wei Yongjun 654dbef214 xen/blkback: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
Using kmem_cache_zalloc() instead of kmem_cache_alloc() and memset().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:36:27 +01:00
Herton Ronaldo Krzesinski 1a4ae43e4f floppy: remove dr, reuse drive on do_floppy_init
This is a small cleanup, that also may turn error handling of
unitialized disks more readable. We don't need a separate variable to
track allocated disks, remove dr and reuse drive variable instead.

Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:36:07 +01:00
Herton Ronaldo Krzesinski 8d3ab4ebfd floppy: use common function to check if floppies can be registered
The same checks to see if a drive can be or is registered are
repeated through the code, factor out the checks in a common function
and replace the repeated checks with it.

Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski d60e7ec18c floppy: properly handle failure on add_disk loop
On floppy initialization, if something failed inside the loop we call
add_disk, there was no cleanup of previous iterations in the error
handling.

Cc: stable@vger.kernel.org
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski 238ab78469 floppy: do put_disk on current dr if blk_init_queue fails
If blk_init_queue fails, we do not call put_disk on the current dr
(dr is decremented first in the error handling loop).

Cc: stable@vger.kernel.org
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:25 +01:00
Herton Ronaldo Krzesinski b54e1f8889 floppy: don't call alloc_ordered_workqueue inside the alloc_disk loop
Since commit 070ad7e ("floppy: convert to delayed work and single-thread
wq"), we end up calling alloc_ordered_workqueue multiple times inside
the loop, which shouldn't be intended. Besides the leak, other side
effect in the current code is if blk_init_queue fails, we would end up
calling unregister_blkdev even if we didn't call yet register_blkdev.

Just moved the allocation of floppy_wq before the loop, and adjusted the
code accordingly.

Cc: stable@vger.kernel.org # 3.5+
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-30 08:34:24 +01:00
Konrad Rzeszutek Wilk 2911758f14 xen/blkback: Fix compile warning
drivers/block/xen-blkback/xenbus.c:260:5: warning: symbol 'xenvbd_sysfs_addif' was not declared. Should it be static?
drivers/block/xen-blkback/xenbus.c:284:6: warning: symbol 'xenvbd_sysfs_delif' was not declared. Should it be static?

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-10-30 08:32:43 +01:00
Alex Elder 069a4b5690 rbd: kill rbd_device->rbd_opts
The rbd_device structure has an embedded rbd_options structure.
Such a structure is needed to work with the generic ceph argument
parsing code, but there's no need to keep it around once argument
parsing is done.

Use a local variable to hold the rbd options used in parsing in
rbd_get_client(), and just transfer its content (it's just a
read_only flag) into the field in the rbd_mapping sub-structure
that requires that information.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder e5cfeed281 rbd: simplify rbd_merge_bvec()
The aim of this patch is to make what's going on rbd_merge_bvec() a
bit more obvious than it was before.  This was an issue when a
recent btrfs bug led us to question whether the merge function was
working correctly.

Use "obj" rather than "chunk" to indicate the units whose boundaries
we care about we call (rados) "objects".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder d4b125e9eb rbd: increase maximum snapshot name length
Change RBD_MAX_SNAP_NAME_LEN to be based on NAME_MAX.  That is a
practical limit for the length of a snapshot name (based on the
presence of a directory using the name under /sys/bus/rbd to
represent the snapshot).

The /sys entry is created by prefixing it with "snap_"; define that
prefix symbolically, and take its length into account in defining
the snapshot name length limit.

Enforce the limit in rbd_add_parse_args().  Also delete a dout()
call in that function that was not meant to be committed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder db2388b6ee rbd: verify rbd image order value
This adds a verification that an rbd image's object order is
within the upper and lower bounds supported by this implementation.

It must be at least 9 (SECTOR_SHIFT), because the Linux bio system
assumes that minimum granularity.

It also must be less than 32 (at the moment anyway) because there
exist spots in the code that store the size of a "segment" (object
backing an rbd image) in a signed int variable, which can be 32 bits
including the sign.  We should be able to relax this limit once
we've verified the code uses 64-bit types where needed.

Note that the CLI tool already limits the order to the range 12-25.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder 4634246db8 rbd: consolidate rbd_do_op() calls
The two calls to rbd_do_op() from rbd_rq_fn() differ only in the
value passed for the snapshot id and the snapshot context.

For reads the snapshot always comes from the mapping, and for writes
the snapshot id is always CEPH_NOSNAP.

The snapshot context is always null for reads.  For writes, the
snapshot context always comes from the rbd header, but it is
acquired under protection of header semaphore and could change
thereafter, so we can't simply use what's available inside
rbd_do_op().

Eliminate the snapid parameter from rbd_do_op(), and set it
based on the I/O direction inside that function instead.  Always
pass the snapshot context acquired in the caller, but reset it
to a null pointer inside rbd_do_op() if the operation is a read.

As a result, there is no difference in the read and write calls
to rbd_do_op() made in rbd_rq_fn(), so just call it unconditionally.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder ff2e4bb5b3 rbd: drop rbd_do_op() opcode and flags
The only callers of rbd_do_op() are in rbd_rq_fn(), where call one
is used for writes and the other used for reads.  The request passed
to rbd_do_op() already encodes the I/O direction, and that
information can be used inside the function to set the opcode and
flags value (rather than passing them in as arguments).

So get rid of the opcode and flags arguments to rbd_do_op().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder 13f4042c05 rbd: kill rbd_req_{read,write}()
Both rbd_req_read() and rbd_req_write() are simple wrapper routines
for rbd_do_op(), and each is only called once.  Replace each wrapper
call with a direct call to rbd_do_op(), and get rid of the wrapper
functions.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder be466c1cc3 rbd: fix read-only option name
The name of the "read-only" mapping option was inadvertently changed
in this commit:

    f84344f3 rbd: separate mapping info in rbd_dev

Revert that hunk to return it to what it should be.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder a0ea3a40fd rbd: zero return code in rbd_dev_image_id()
When rbd_dev_probe() calls rbd_dev_image_id() it expects to get
a 0 return code if successful, but it is getting a positive value.

The reason is that rbd_dev_image_id() returns the value it gets from
rbd_req_sync_exec(), which returns the number of bytes read in as a
result of the request.  (This ultimately comes from
ceph_copy_from_page_vector() in rbd_req_sync_op()).

Force the return value to 0 when successful in rbd_dev_image_id().
Do the same in rbd_dev_v2_object_prefix().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-26 17:18:08 -05:00
Alex Elder b213e0b1a6 rbd: fix bug in rbd_dev_id_put()
In rbd_dev_id_put(), there's a loop that's intended to determine
the maximum device id in use.  But it isn't doing that at all,
the effect of how it's written is to simply use the just-put id
number, which ignores whole purpose of this function.

Fix the bug.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-26 17:18:08 -05:00
Kees Cook b8977285ec drivers/block: remove CONFIG_EXPERIMENTAL
This config item has not carried much meaning for a while now and is
almost always enabled by default. As agreed during the Linux kernel
summit, remove it.

CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: Asai Thambi S P <asamymuthupa@micron.com>
CC: Pete Zaitcev <zaitcev@redhat.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-10-23 22:30:38 +02:00
Linus Torvalds ce40be7a82 Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block
Pull block IO update from Jens Axboe:
 "Core block IO bits for 3.7.  Not a huge round this time, it contains:

   - First series from Kent cleaning up and generalizing bio allocation
     and freeing.

   - WRITE_SAME support from Martin.

   - Mikulas patches to prevent O_DIRECT crashes when someone changes
     the block size of a device.

   - Make bio_split() work on data-less bio's (like trim/discards).

   - A few other minor fixups."

Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
Morton.  It is due to the VM no longer using a prio-tree (see commit
6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

So make set_blocksize() use mapping_mapped() instead of open-coding the
internal VM knowledge that has changed.

* 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
  block: makes bio_split support bio without data
  scatterlist: refactor the sg_nents
  scatterlist: add sg_nents
  fs: fix include/percpu-rwsem.h export error
  percpu-rw-semaphore: fix documentation typos
  fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
  blockdev: turn a rw semaphore into a percpu rw semaphore
  Fix a crash when block device is read and block size is changed at the same time
  block: fix request_queue->flags initialization
  block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
  block: ioctl to zero block ranges
  block: Make blkdev_issue_zeroout use WRITE SAME
  block: Implement support for WRITE SAME
  block: Consolidate command flag and queue limit checks for merges
  block: Clean up special command handling logic
  block/blk-tag.c: Remove useless kfree
  block: remove the duplicated setting for congestion_threshold
  block: reject invalid queue attribute values
  block: Add bio_clone_bioset(), bio_clone_kmalloc()
  block: Consolidate bio_alloc_bioset(), bio_kmalloc()
  ...
2012-10-11 09:04:23 +09:00
Alex Elder 35152979e6 rbd: activate v2 image support
Now that v2 images support is fully implemented, have
rbd_dev_v2_probe() return 0 to indicate a successful probe.

(Note that an image that implements layering will fail
the probe early because of the feature chekc.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:44:01 -07:00
Alex Elder d889140c4a rbd: implement feature checks
Version 2 images have two sets of feature bit fields.  The first
indicates features possibly used by the image.  The second indicates
features that the client *must* support in order to use the image.

When an image (or snapshot) is first examined, we need to make sure
that the local implementation supports the image's required
features.  If not, fail the probe for the image.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:51 -07:00
Alex Elder 117973fb4c rbd: define rbd_dev_v2_refresh()
Define a new function rbd_dev_v2_refresh() to update/refresh the
snapshot context for a format version 2 rbd image.  This function
will update anything that is not fixed for the life of an rbd
image--at the moment this is mainly the snapshot context and (for
a base mapping) the size.

Update rbd_refresh_header() so it selects which function to use
based on the image format.

Rename __rbd_refresh_header() to be rbd_dev_v1_refresh()
to be consistent with the naming of its version 2 counterpart.
Similarly rename rbd_refresh_header() to be rbd_dev_refresh().

Unrelated--we use rbd_image_format_valid() here.  Delete the other
use of it, which was primarily put in place to ensure that function
was referenced at the time it was defined.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:39 -07:00
Alex Elder 9478554ae5 rbd: define rbd_update_mapping_size()
Encapsulate the code that handles updating the size of a mapping
after an rbd image has been refreshed.  This is done in anticipation
of the next patch, which will make this common code for format 1 and
2 images.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-10 07:43:28 -07:00
Linus Torvalds 7035cdf36d Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The bulk of this pull is a series from Alex that refactors and cleans
  up the RBD code to lay the groundwork for supporting the new image
  format and evolving feature set.  There are also some cleanups in
  libceph, and for ceph there's fixed validation of file striping
  layouts and a bugfix in the code handling a shrinking MDS cluster."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
  ceph: avoid 32-bit page index overflow
  ceph: return EIO on invalid layout on GET_DATALOC ioctl
  rbd: BUG on invalid layout
  ceph: propagate layout error on osd request creation
  libceph: check for invalid mapping
  ceph: convert to use le32_add_cpu()
  ceph: Fix oops when handling mdsmap that decreases max_mds
  rbd: update remaining header fields for v2
  rbd: get snapshot name for a v2 image
  rbd: get the snapshot context for a v2 image
  rbd: get image features for a v2 image
  rbd: get the object prefix for a v2 rbd image
  rbd: add code to get the size of a v2 rbd image
  rbd: lay out header probe infrastructure
  rbd: encapsulate code that gets snapshot info
  rbd: add an rbd features field
  rbd: don't use index in __rbd_add_snap_dev()
  rbd: kill create_snap sysfs entry
  rbd: define rbd_dev_image_id()
  rbd: define some new format constants
  ...
2012-10-08 06:38:18 +09:00
Linus Torvalds dc92b1f9ab Merge branch 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio changes from Rusty Russell:
 "New workflow: same git trees pulled by linux-next get sent straight to
  Linus.  Git is awkward at shuffling patches compared with quilt or mq,
  but that doesn't happen often once things get into my -next branch."

* 'virtio-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (24 commits)
  lguest: fix occasional crash in example launcher.
  virtio-blk: Disable callback in virtblk_done()
  virtio_mmio: Don't attempt to create empty virtqueues
  virtio_mmio: fix off by one error allocating queue
  drivers/virtio/virtio_pci.c: fix error return code
  virtio: don't crash when device is buggy
  virtio: remove CONFIG_VIRTIO_RING
  virtio: add help to CONFIG_VIRTIO option.
  virtio: support reserved vqs
  virtio: introduce an API to set affinity for a virtqueue
  virtio-ring: move queue_index to vring_virtqueue
  virtio_balloon: not EXPERIMENTAL any more.
  virtio-balloon: dependency fix
  virtio-blk: fix NULL checking in virtblk_alloc_req()
  virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
  virtio-blk: Add bio-based IO path for virtio-blk
  virtio: console: fix error handling in init() function
  tools: Fix pthread flag for Makefile of trace-agent used by virtio-trace
  tools: Add guest trace agent as a user tool
  virtio/console: Allocate scatterlist according to the current pipe size
  ...
2012-10-07 21:04:56 +09:00
Linus Torvalds f1c6872e49 Features:
* Allow a Linux guest to boot as initial domain and as normal guests
    on Xen on ARM (specifically ARMv7 with virtualized extensions).
    PV console, block and network frontend/backends are working.
 Bug-fixes:
  * Fix compile linux-next fallout.
  * Fix PVHVM bootup crashing.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQbJELAAoJEFjIrFwIi8fJSI4H/32qrQKyF5IIkFKHTN9FYDC1
 OxEGc4y47DIQpGUd/PgZ/i6h9Iyhj+I6pb4lCevykwgd0j83noepluZlCIcJnTfL
 HVXNiRIQKqFhqKdjTANxVM4APup+7Lqrvqj6OZfUuoxaZ3tSTLhabJ/7UXf2+9xy
 g2RfZtbSdQ1sukQ/A2MeGQNT79rh7v7PrYQUYSrqytjSjSLPTqRf75HWQ+eapIAH
 X3aVz8Tn6nTixZWvZOK7rAaD4awsFxGP6E46oFekB02f4x9nWHJiCZiXwb35lORb
 tz9F9td99f6N4fPJ9LgcYTaCPwzVnceZKqE9hGfip4uT+0WrEqDxq8QmBqI5YtI=
 =gxJD
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull ADM Xen support from Konrad Rzeszutek Wilk:

  Features:
   * Allow a Linux guest to boot as initial domain and as normal guests
     on Xen on ARM (specifically ARMv7 with virtualized extensions).  PV
     console, block and network frontend/backends are working.
  Bug-fixes:
   * Fix compile linux-next fallout.
   * Fix PVHVM bootup crashing.

  The Xen-unstable hypervisor (so will be 4.3 in a ~6 months), supports
  ARMv7 platforms.

  The goal in implementing this architecture is to exploit the hardware
  as much as possible.  That means use as little as possible of PV
  operations (so no PV MMU) - and use existing PV drivers for I/Os
  (network, block, console, etc).  This is similar to how PVHVM guests
  operate in X86 platform nowadays - except that on ARM there is no need
  for QEMU.  The end result is that we share a lot of the generic Xen
  drivers and infrastructure.

  Details on how to compile/boot/etc are available at this Wiki:

    http://wiki.xen.org/wiki/Xen_ARMv7_with_Virtualization_Extensions

  and this blog has links to a technical discussion/presentations on the
  overall architecture:

    http://blog.xen.org/index.php/2012/09/21/xensummit-sessions-new-pvh-virtualisation-mode-for-arm-cortex-a15arm-servers-and-x86/

* tag 'stable/for-linus-3.7-arm-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (21 commits)
  xen/xen_initial_domain: check that xen_start_info is initialized
  xen: mark xen_init_IRQ __init
  xen/Makefile: fix dom-y build
  arm: introduce a DTS for Xen unprivileged virtual machines
  MAINTAINERS: add myself as Xen ARM maintainer
  xen/arm: compile netback
  xen/arm: compile blkfront and blkback
  xen/arm: implement alloc/free_xenballooned_pages with alloc_pages/kfree
  xen/arm: receive Xen events on ARM
  xen/arm: initialize grant_table on ARM
  xen/arm: get privilege status
  xen/arm: introduce CONFIG_XEN on ARM
  xen: do not compile manage, balloon, pci, acpi, pcpu and cpu_hotplug on ARM
  xen/arm: Introduce xen_ulong_t for unsigned long
  xen/arm: Xen detection and shared_info page mapping
  docs: Xen ARM DT bindings
  xen/arm: empty implementation of grant_table arch specific functions
  xen/arm: sync_bitops
  xen/arm: page.h definitions
  xen/arm: hypercalls
  ...
2012-10-07 07:13:01 +09:00
Ed Cashin 322c9ec009 aoe: update aoe-internal version number to 50
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:30 +09:00
Ed Cashin 1ac9e60262 aoe: remove unused code
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:30 +09:00
Ed Cashin 08b6062351 aoe: make dynamic block minor numbers the default
Because udev use is so widespread, making the old static mapping the
default is too conservative, given the severe limitations it places on
usable AoE addresses.  Storage virtualization and larger shelves have made
the old limitations too confining.

These changes make the dynamic block device minor numbers the default,
removing the limitations on usable AoE addresses.

The static arrangement is still available with aoe_dyndevs=0, and the
aoe-stat tool from the userland aoetools package, the user space
counterpart to the aoe driver, recognizes the case where there is a
mismatch between the minor number in sysfs and the minor number in a
special device file.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 7159e969d1 aoe: update and specify AoE address guards and error messages
In general, specific is better when it comes to messages about AoE usage
problems.  Also, explicit checks for the AoE broadcast addresses are
added.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 4bcce1a355 aoe: retain static block device numbers for backwards compatibility
The old mapping between AoE target shelf and slot addresses and the block
device minor number is retained as a backwards-compatible feature, with a
new "aoe_dyndevs" module parameter available for enabling dynamic block
device minor numbers.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:29 +09:00
Ed Cashin 0c96621458 aoe: support more AoE addresses with dynamic block device minor numbers
The ATA over Ethernet protocol uses a major (shelf) and minor (slot)
address to identify a particular storage target.  These changes remove an
artificial limitation the aoe driver imposes on the use of AoE addresses.
For example, without these changes, the slot address has a maximum of 15,
but users commonly use slot numbers much greater than that.

The AoE shelf and slot address space is often used sparsely.  Instead of
using a static mapping between AoE addresses and the block device minor
number, the block device minor numbers are now allocated on demand.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:28 +09:00
Ed Cashin fea05a26c3 aoe: update copyright year in touched files
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:28 +09:00
Ed Cashin 7392fbe5ad aoe: update internal version number to 49
The internal version number of the aoe driver appears in a console message
when the driver loads and is usually obtained by the user with the
userland aoe-version tool, part of the aoetools.[1]

Although this patchset includes bugfixes backported from higher-numbered
versions published on the coraid.com website, it is a form of version 49.

1. http://aoetools.sourceforge.net/

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin b21faa25c6 aoe: remove unused code and add cosmetic improvements
This change removes some unused code and attempts to increase code
consistency.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 1b86fda9ad aoe: increase net_device reference count while using it
This change eliminates the danger that the user could rmmod the driver for
a network interface that is being used for AoE by the aoe driver.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 64a80f5ac7 aoe: associate frames with the AoE storage target
In the driver code, "target" and aoetgt refer to a particular remote
interface on the AoE storage target.  The latter is identified by its AoE
major and minor addresses.  Commands that are being sent to an AoE storage
target {major, minor} can be sent or retransmitted to any of the remote
MAC addresses associated with the AoE storage target.

That is, frames are naturally associated with not an aoetgt (AoE major,
AoE minor, remote MAC address) but an aoedev (AoE major, AoE minor).
Making the code reflect that reality simplifies the driver, especially
when the path to a remote MAC address becomes unusable.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:27 +09:00
Ed Cashin 6583303c5e aoe: disallow unsupported AoE minor addresses
A guard is inserted to prevent AoE minor addresses (slot addresses) higher
than 15 to be used, as they are not yet supported by the driver.

There is a change coming that will allow the aoe driver to overcome this
limit by using system device minor numbers dynamically, but until then,
this guard prevents unexpected targets from being used by the driver when
AoE targets with high minor numbers are on the AoE network.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin 25f4d75ea4 aoe: do revalidation steps in order
The discovery process begins with an optional AoE config query command and
an AoE config query response.  Normally when an aoe device is already
open, the config query response does not trigger an ATA identify device
command to be sent out, since the response contains storage capacity
information that, if changed, could surprise the user of the device.

The userland "aoe-revalidate" tool uses a character device to trigger an
AoE config query for a particular AoE storage target and an ATA device
identify command, even when the device is open.

This change causes the config query to go out first, reflecting the normal
discovery sequence.  The responses could come back in any order, so this
change is fairly cosmetic.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin d54d35ac66 aoe: failover remote interface based on aoe_deadsecs parameter
The aoe_deadsecs module parameter allows the user to specify a hard limit
on the number of seconds an AoE command can be retransmitted before the
AoE block device is considered to have failed.

Using aoe_deadsecs to determine the time we try using a different remote
interface helps to ensure that the hard limit is not reached before we've
tried to recover by sending to a different remote port.

As a data storage target, the AoE target is unambiguously identified by
its {major, minor} AoE address tuple, and an AoE target can have multiple
MAC addresses.  However, note that "target" in the driver code and
comments means a {major, minor, MAC address} tuple, as in "somewhere to
send packets".

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:26 +09:00
Ed Cashin 3f0f013374 aoe: use packets that work with the smallest-MTU local interface
Users with several network interfaces dedicated to AoE generally do not
configure them to support different-sized AoE data payloads on purpose.

For a given AoE target, there will be a set of local network interfaces
that can reach it.  Using only the payload that will fit in the
smallest-sized MTU of all those local interfaces greatly simplifies the
driver, especially in failure scenarios.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin eb086ec596 aoe: use a kernel thread for transmissions
The dev_queue_xmit function needs to have interrupts enabled, so the most
simple way to get the locking right but still fulfill that requirement is
to use a process that can call dev_queue_xmit serially over queued
transmissions.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin 69cf2d85de aoe: become I/O request queue handler for increased user control
To allow users to choose an elevator algorithm for their particular
workloads, change from a make_request-style driver to an
I/O-request-queue-handler-style driver.

We have to do a couple of things that might be surprising.  We manipulate
the page _count directly on the assumption that we still have no guarantee
that users of the block layer are prohibited from submitting bios
containing pages with zero reference counts.[1] If such a prohibition now
exists, I can get rid of the _count manipulation.

Just as before this patch, we still keep track of the sk_buffs that the
network layer still hasn't finished yet and cap the resources we use with
a "pool" of skbs.[2]

Now that the block layer maintains the disk stats, the aoe driver's
diskstats function can go away.

1. https://lkml.org/lkml/2007/3/1/374
2. https://lkml.org/lkml/2007/7/6/241

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:25 +09:00
Ed Cashin 896831f590 aoe: kernel thread handles I/O completions for simple locking
Make the frames the aoe driver uses to track the relationship between bios
and packets more flexible and detached, so that they can be passed to an
"aoe_ktio" thread for completion of I/O.

The frames are handled much like skbs, with a capped amount of
preallocation so that real-world use cases are likely to run smoothly and
degenerate gracefully even under memory pressure.

Decoupling I/O completion from the receive path and serializing it in a
process makes it easier to think about the correctness of the locking in
the driver, especially in the case of a remote MAC address becoming
unusable.

[dan.carpenter@oracle.com: cleanup an allocation a bit]
Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Ed Cashin 3d5b06051c aoe: for performance support larger packet payloads
tAdd adds the ability to work with large packets composed of a number of
segments, using the scatter gather feature of the block layer (biovecs)
and the network layer (skb frag array).  The motivation is the performance
gained by using a packet data payload greater than a page size and by
using the network card's scatter gather feature.

Users of the out-of-tree aoe driver already had these changes, but since
early 2011, they have complained of increased memory utilization and
higher CPU utilization during heavy writes.[1] The commit below appears
related, as it disables scatter gather on non-IP protocols inside the
harmonize_features function, even when the NIC supports sg.

  commit f01a5236bd
  Author: Jesse Gross <jesse@nicira.com>
  Date:   Sun Jan 9 06:23:31 2011 +0000

      net offloading: Generalize netif_get_vlan_features().

With that regression in place, transmits always linearize sg AoE packets,
but in-kernel users did not have this patch.  Before 2.6.38, though, these
changes were working to allow sg to increase performance.

1. http://www.spinics.net/lists/linux-mm/msg15184.html

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Paul Clements a336d29870 nbd: handle discard requests
Add discard support to nbd.  If the nbd-server supports discard, it will
send NBD_FLAG_SEND_TRIM to the client.  The client will then set the flag
in the kernel via NBD_SET_FLAGS, which tells the kernel to enable discards
for the device (QUEUE_FLAG_DISCARD).

If discard support is enabled, then when the nbd client system receives a
discard request, this will be passed along to the nbd-server.  When the
discard request is received by the nbd-server, it will perform:

	fallocate(.. FALLOC_FL_PUNCH_HOLE ..)

To punch a hole in the backend storage, which is no longer needed.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:24 +09:00
Paul Clements 2f01250888 nbd: add set flags ioctl
Add a set-flags ioctl, allowing various option flags to be set on an nbd
device.  This allows the nbd-client to set the device flags (to enable
read-only mode, or enable discard support, etc.).

Flags are typically specified by the nbd-server.  During the negotiation
phase of the nbd connection, the server sends its flags to the client.
The client then uses NBD_SET_FLAGS to inform the kernel of the options.

Also included is a one-line fix to debug output for the set-timeout ioctl.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:23 +09:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Sage Weil 6cae3717cd rbd: BUG on invalid layout
This shouldn't actually be possible because the layout struct is
constructed from the RBD header and validated then.

[elder@inktank.com: converted BUG() call to equivalent rbd_assert()]

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-10-01 17:20:00 -05:00
Linus Torvalds d9a807461f USB merge for 3.7-rc1
Here is the big USB pull request for 3.7-rc1
 
 There are lots of gadget driver changes (including copying a bunch of
 files into the drivers/staging/ccg/ directory so that the other gadget
 drivers can be fixed up properly without breaking that driver), and we
 remove the old obsolete ub.c driver from the tree.  There are also the
 usual XHCI set of updates, and other various driver changes and updates.
 We also are trying hard to remove the old dbg() macro, but the final
 bits of that removal will be coming in through the networking tree
 before we can delete it for good.
 
 All of these patches have been in the linux-next tree.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlBp3+AACgkQMUfUDdst+ym5vwCfe93FyJyXn/RDkGz7iBemvWFd
 vrwAoIxjaOa4/yWZWcgrWc5bP4aO3ssc
 =jYDr
 -----END PGP SIGNATURE-----

Merge tag 'usb-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB changes from Greg Kroah-Hartman:
 "Here is the big USB pull request for 3.7-rc1

  There are lots of gadget driver changes (including copying a bunch of
  files into the drivers/staging/ccg/ directory so that the other gadget
  drivers can be fixed up properly without breaking that driver), and we
  remove the old obsolete ub.c driver from the tree.

  There are also the usual XHCI set of updates, and other various driver
  changes and updates.  We also are trying hard to remove the old dbg()
  macro, but the final bits of that removal will be coming in through
  the networking tree before we can delete it for good.

  All of these patches have been in the linux-next tree.

  Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"

Fix up several annoying - but fairly mindless - conflicts due to the
termios structure having moved into the tty device, and often clashing
with dbg -> dev_dbg conversion.

* tag 'usb-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (339 commits)
  USB: ezusb: move ezusb.c from drivers/usb/serial to drivers/usb/misc
  USB: uas: fix gcc warning
  USB: uas: fix locking
  USB: Fix race condition when removing host controllers
  USB: uas: add locking
  USB: uas: fix abort
  USB: uas: remove aborted field, replace with status bit.
  USB: uas: fix task management
  USB: uas: keep track of command urbs
  xhci: Intel Panther Point BEI quirk.
  powerpc/usb: remove checking PHY_CLK_VALID for UTMI PHY
  USB: ftdi_sio: add TIAO USB Multi-Protocol Adapter (TUMPA) support
  Revert "usb : Add sysfs files to control port power."
  USB: serial: remove vizzini driver
  usb: host: xhci: Fix Null pointer dereferencing with 71c731a for non-x86 systems
  Increase XHCI suspend timeout to 16ms
  USB: ohci-at91: fix null pointer in ohci_hcd_at91_overcurrent_irq
  USB: sierra_ms: don't keep unused variable
  fsl/usb: Add support for USB controller version 2.4
  USB: qcaux: add Pantech vendor class match
  ...
2012-10-01 13:23:01 -07:00
Alex Elder 6e14b1a6c3 rbd: update remaining header fields for v2
There are three fields that are not yet updated for format 2 rbd
image headers:  the version of the header object; the encryption
type; and the compression type.  There is no interface defined for
fetching the latter two, so just initialize them explicitly to 0 for
now.

Change rbd_dev_v2_snap_context() so the caller can be supplied the
version for the header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:54 -05:00
Alex Elder b8b1e2db52 rbd: get snapshot name for a v2 image
Define rbd_dev_v2_snap_name() to fetch the name for a particular
snapshot in a format 2 rbd image.

Define rbd_dev_v2_snap_info() to to be a wrapper for getting the
name, size, and features for a particular snapshot, using an
interface that matches the equivalent function for version 1 images.

Define rbd_dev_snap_info() wrapper function and use it to call the
appropriate function for getting the snapshot name, size, and
features, dependent on the rbd image format.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:54 -05:00
Alex Elder 35d489f946 rbd: get the snapshot context for a v2 image
Fetch the snapshot context for an rbd format 2 image by calling
the "get_snapcontext" method on its header object.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder b1b5402aa9 rbd: get image features for a v2 image
The features values for an rbd format 2 image are fetched from the
server using a "get_features" method.  The same method is used for
getting the features for a snapshot, so structure this addition with
a generic helper routine that can get this information for either.

The server will provide two 64-bit feature masks, one representing
the features potentially in use for this image (or its snapshot),
and one representing features that must be supported by the client
in order to work with the image.

For the time being, neither of these is really used so we keep
things simple and just record the first feature vector.  Once we
start using these feature masks, what we record and what we expose
to the user will most likely change.

Signed-off-by: Alex Elder <elder@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 1e1301998e rbd: get the object prefix for a v2 rbd image
The object prefix of an rbd format 2 image is fetched from the
server using a "get_object_prefix" method.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 9d475de5d1 rbd: add code to get the size of a v2 rbd image
The size of an rbd format 2 image is fetched from the server using a
"get_size" method.  The same method is used for getting the size of
a snapshot, so structure this addition with a generic helper routine
that we can get this information for either.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder a30b71b999 rbd: lay out header probe infrastructure
This defines a new function rbd_dev_probe() as a top-level
function for populating detailed information about an rbd device.

It first checks for the existence of a format 2 rbd image id object.
If it exists, the image is assumed to be a format 2 rbd image, and
another function rbd_dev_v2() is called to finish populating
header data for that image.  If it does not exist, it is assumed to
be an old (format 1) rbd image, and calls a similar function
rbd_dev_v1() to populate its header information.

A new field, rbd_dev->format, is defined to record which version
of the rbd image format the device represents.  For a valid mapped
rbd device it will have one of two values, 1 or 2.

So far, the format 2 images are not really supported; this is
laying out the infrastructure for fleshing out that support.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder cd892126c6 rbd: encapsulate code that gets snapshot info
Create a function that encapsulates looking up the name, size and
features related to a given snapshot, which is indicated by its
index in an rbd device's snapshot context array of snapshot ids.

This interface will be used to hide differences between the format 1
and format 2 images.

At the moment this (looking up the name anyway) is slightly less
efficient than what's done currently, but we may be able to optimize
this a bit later on by cacheing the last lookup if it proves to be a
problem.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 34b131849f rbd: add an rbd features field
Record the features values for each rbd image and each of its
snapshots.  This is really something that only becomes meaningful
for version 2 images, so this is just putting in place code
that will form common infrastructure.

It may be useful to expand the sysfs entries--and therefore the
information we maintain--for the image and for each snapshot.
But I'm going to hold off doing that until we start making
active use of the feature bits.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder c8d184250d rbd: don't use index in __rbd_add_snap_dev()
Pass the snapshot id and snapshot size rather than an index
to __rbd_add_snap_dev() to specify values for a new snapshot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 02cdb02cea rbd: kill create_snap sysfs entry
Josh proposed the following change, and I don't think I could
explain it any better than he did:

    From: Josh Durgin <josh.durgin@inktank.com>
    Date: Tue, 24 Jul 2012 14:22:11 -0700
    To: ceph-devel <ceph-devel@vger.kernel.org>
    Message-ID: <500F1203.9050605@inktank.com>

    Right now the kernel still has one piece of rbd management
    duplicated from the rbd command line tool: snapshot creation.
    There's nothing special about snapshot creation that makes it
    advantageous to do from the kernel, so I'd like to remove the
    create_snap sysfs interface.  That is,
	/sys/bus/rbd/devices/<id>/create_snap
    would be removed.

    Does anyone rely on the sysfs interface for creating rbd
    snapshots?  If so, how hard would it be to replace with:

	rbd snap create pool/image@snap

    Is there any benefit to the sysfs interface that I'm missing?

    Josh

This patch implements this proposal, removing the code that
implements the "snap_create" sysfs interface for rbd images.
As a result, quite a lot of other supporting code goes away.

Suggested-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 589d30e0b3 rbd: define rbd_dev_image_id()
New format 2 rbd images are permanently identified by a unique image
id.  Each rbd image also has a name, but the name can be changed.
A format 2 rbd image will have an object--whose name is based on the
image name--which maps an image's name to its image id.

Create a new function rbd_dev_image_id() that checks for the
existence of the image id object, and if it's found, records the
image id in the rbd_device structure.

Create a new rbd device attribute (/sys/bus/rbd/<num>/image_id) that
makes this information available.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 3bb59ad515 rbd: define some new format constants
Define constant symbols related to the rbd format 2 object names.
This begins to bring this version of the "rbd_types.h" header
more in line with the current user-space version of that file.
Complete reconciliation of differences will be done at some
point later, as a separate task.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder f8d4de6e1c rbd: support data returned from OSD methods
An OSD object method call can be made using rbd_req_sync_exec().
Until now this has only been used for creating a new RBD snapshot,
and that has only required sending data out, not receiving anything
back from the OSD.

We will now need to get data back from an OSD on a method call, so
add parameters to rbd_req_sync_exec() that allow a buffer into which
returned data should be placed to be specified, along with its size.

Previously, rbd_req_sync_exec() passed a null pointer and zero
size to rbd_req_sync_op(); change this so the new inbound buffer
information is provided instead.

Rename the "buf" and "len" parameters in rbd_req_sync_op() to
make it more obvious they are describing inbound data.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:53 -05:00
Alex Elder 3cb4a687c7 rbd: pass flags to rbd_req_sync_exec()
In order to allow both read requests and write requests to be
initiated using rbd_req_sync_exec(), add an OSD flags value
which can be passed down to rbd_req_sync_op().  Rename the "data"
and "len" parameters to be more clear that they represent data
that is outbound.

At this point, this function is still only used (and only works) for
write requests.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 3ee4001e0c rbd: set up watch before announcing disk
We're ready to handle header object (refresh) events at the point we
call rbd_bus_add_dev().  Set up the watch request on the rbd image
header just after that, and after we've registered the devices for
the snapshots for the initial snapshot context.  Do this before
announce the disk as available for use.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 12f029448c rbd: set initial capacity in rbd_init_disk()
Move the setting of the initial capacity for an rbd image mapping
into rb_init_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 86ff77bb68 rbd: drop dev registration check for new snap
By the time rbd_dev_snaps_register() gets called during rbd device
initialization, the main device will have already been registered.
Similarly, a header refresh will only occur for an rbd device whose
Linux device is registered.  There is therefore no need to verify
the main device is registered when registering a snapshot device.

For the time being, turn the check into a WARN_ON(), but it can
eventually just go away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 0f308a3188 rbd: call rbd_init_disk() sooner
Call rbd_init_disk() from rbd_add() as soon as we have the major
device number for the mapping.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 85ae892675 rbd: defer setting device id
Hold off setting the device id and formatting the device name
in rbd_add() until just before it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 05fd6f6f8c rbd: read the header before registering device
Read the rbd header information and call rbd_dev_set_mapping()
earlier--before registering the block device or setting up the sysfs
entries for the image.  The sysfs entries provide users access to
some information that's only available after doing the rbd header
initialization, so this will make sure it's valid right away.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 5ed1617731 rbd: call set_snap() before snap_devs_update()
rbd_header_set_snap() is a simple initialization routine for an rbd
device's mapping.  It has to be called after the snapshot context
for the rbd_dev has been updated, but can be done before snapshot
devices have been registered.

Change the name to rbd_dev_set_mapping() to better reflect its
purpose, and call it a little sooner, before registering snapshot
devices.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 304f68086f rbd: defer registering snapshot devices
When a new snapshot is found in an rbd device's updated snapshot
context, __rbd_add_snap_dev() is called to create and insert an
entry in the rbd devices list of snapshots.  In addition, a Linux
device is registered to represent the snapshot.

For version 2 rbd images, it will be undesirable to initialize the
device right away.  So in anticipation of that, this patch separates
the insertion of a snapshot entry in the snaps list from the
creation of devices for those snapshots.

To do this, create a new function rbd_dev_snaps_register() which
traverses the list of snapshots and calls rbd_register_snap_dev()
on any that have not yet been registered.

Rename rbd_dev_snap_devs_update() to be rbd_dev_snaps_update()
to better reflect that only the entry in the snaps list and not
the snapshot's device is affected by the function.

For now, call rbd_dev_snaps_register() immediately after each
call to rbd_dev_snaps_update().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 3fcf2581c2 rbd: assign header name later
Move the assignment of the header name for an rbd image a bit later,
outside rbd_add_parse_args() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder e86924a809 rbd: use snaps list in rbd_snap_by_name()
An rbd_dev structure maintains a list of current snapshots that have
already been fully initialized.  The entries on the list have type
struct rbd_snap, and each entry contains a copy of information
that's found in the rbd_dev's snapshot context and header.

The only caller of snap_by_name() is rbd_header_set_snap().  In that
call site any positive return value (the index in the snapshot
array) is ignored, so there's no need to return the index in
the snapshot context's id array when it's found.

rbd_header_set_snap() also has only one caller--rbd_add()--and that
call is made after a call to rbd_dev_snap_devs_update().  Because
the rbd_snap structures are initialized in that function, the
current snapshot list can be used instead of the snapshot context to
look up a snapshot's information by name.

Change snap_by_name() so it uses the snapshot list rather than the
rbd_dev's snapshot context in looking up snapshot information.
Return 0 if it's found rather than the snapshot id.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder cd789ab9ca rbd: don't register snapshots in bus_add_dev()
When rbd_bus_add_dev() is called (one spot--in rbd_add()), the rbd
image header has not even been read yet.  This means that the list
of snapshots will be empty at the time of the call.  As a result,
there is no need for the code that calls rbd_register_snap_dev()
for each entry in that list--so get rid of it.

Once the header has been read (just after returning), a call will
be made to rbd_dev_snap_devs_update(), which will then find every
snapshot in the context to be new and will therefore call
rbd_register_snap_dev() via __rbd_add_snap_dev() accomplishing
the same thing.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:52 -05:00
Alex Elder 4bb1f1ed00 rbd: move locking out of rbd_header_set_snap()
Move the calls to get the header semaphore out of
rbd_header_set_snap() and into its caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 1fcdb8aa1f rbd: simplify rbd_init_disk() a bit
This just simplifies a few things in rbd_init_disk(), now that the
previous patch has moved a bunch of initialization code out if it.
Done separately to facilitate review.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 2ac4e75d89 rbd: do some header initialization earlier
Move some of the code that initializes an rbd header out of
rbd_init_disk() and into its caller.

Move the code at the end of rbd_init_disk() that sets the device
capacity and activates the Linux device out of that function and
into the caller, ensuring we still have the disk size available
where we need it.

Update rbd_free_disk() so it still aligns well as an inverse of
rbd_init_disk(), moving the rbd_header_free() call out to its
caller.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 8836b995fd rbd: simplify snap_by_name() interface
There is only one caller of snap_by_name(), and it passes two values
to be assigned, both of which are found within an rbd device
structure.

Change the interface so it just passes the address of the rbd_dev,
and make the assignments to its fields directly.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 4e1105a299 rbd: set mapping name with the rest
With the exception of the snapshot name, all of the mapping-specific
fields in an rbd device structure are set in rbd_header_set_snap().

Pass the snapshot name to be assigned into rbd_header_set_snap()
to keep all of the mapping assignments together.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 3feeb89467 rbd: return snap name from rbd_add_parse_args()
This is the first of two patches aimed at isolating the code that
sets the mapping information into a single spot.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 99c1f08f64 rbd: record mapped size
Add the size of the mapped image to the set of mapping-specific
fields in an rbd_device, and use it when setting the capacity of the
disk.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder f84344f334 rbd: separate mapping info in rbd_dev
Several fields in a struct rbd_dev are related to what is mapped, as
opposed to the actual base rbd image.  If the base image is mapped
these are almost unneeded, but if a snapshot is mapped they describe
information about that snapshot.

In some contexts this can be a little bit confusing.  So group these
mapping-related field into a structure to make it clear what they
are describing.

This also includes a minor change that rearranges the fields in the
in-core image header structure so that invariant fields are at the
top, followed by those that change.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder c9aadfe786 rbd: kill rbd_image_header->total_snaps
The "total_snaps" field in an rbd header structure is never any
different from the value of "num_snaps" stored within a snapshot
context.  Avoid any confusion by just using the value held within
the snapshot context, and get rid of the "total_snaps" field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 98cec111c0 rbd: kill rbd_dev->q
A copy of rbd_dev->disk->queue is held in rbd_dev->q, but it's
never actually used.  So get just get rid of the field.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:51 -05:00
Alex Elder 9fcbb80024 rbd: rename __rbd_init_snaps_header()
The name __rbd_init_snaps_header() doesn't really convey what that
function does very well.  Its purpose is to scan a new snapshot
context and either create or destroy snapshot device entries so
that local host's view is consistent with the reality maintained
on the OSDs.  This patch just changes the name of this function,
to be rbd_dev_snap_devs_update().  Still not perfect, but I think
better.

Also add some dynamic debug statements to this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder e28393082d rbd: rename rbd_id_get()
This should have been done as part of this commit:

    commit de71a2970d
    Author: Alex Elder <elder@inktank.com>
    Date:   Tue Jul 3 16:01:19 2012 -0500
    rbd: rename rbd_device->id

rbd_id_get() is assigning the rbd_dev->dev_id field.  Change the
name of that function as well as rbd_id_put() and rbd_id_max
to reflect what they are affecting.

Add some dynamic debug statements related to rbd device id activity.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder aafb230ebc rbd: define rbd_assert()
Define rbd_assert() and use it in place of various BUG_ON() calls
now present in the code.  By default assertion checking is enabled;
we want to do this differently at some point.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 65ccfe21dd rbd: split up rbd_get_segment()
There are two places where rbd_get_segment() is called.  One, in
rbd_rq_fn(), only needs to know the length within a segment that an
I/O request should be.  The other, in rbd_do_op(), also needs the
name of the object and the offset within it for the I/O request.

Split out rbd_segment_name() into three dedicated functions:
    - rbd_segment_name() allocates and formats the name of the
      object for a segment containing a given rbd image offset
    - rbd_segment_offset() computes the offset within a segment for
      a given rbd image offset
    - rbd_segment_length() computes the length to use for I/O within
      a segment for a request, not to exceed the end of a segment
      object.

In the new functions be a bit more careful, checking for possible
error conditions:
    - watch for errors or overflows returned by snprintf()
    - catch (using BUG_ON()) potential overflow conditions
      when computing segment length

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder df111be631 rbd: check for overflow in rbd_get_num_segments()
It is possible in rbd_get_num_segments() for an overflow to occur
when adding the offset and length.  This is easily avoided.

Since the function returns an int and the one caller is already
prepared to handle errors, have it return -ERANGE if overflow would
occur.

The overflow check would not work if a zero-length request was
being tested, so short-circuit that case, returning 0 for the
number of segments required.  (This condition might be avoided
elsewhere already, I don't know.)

Have the caller end the request if either an error or 0 is returned.
The returned value is passed to __blk_end_request_all(), meaning
a 0 length request is not treated an error.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 38f5f65e9d rbd: drop needless test in rbd_rq_fn()
There's a test for null rq pointer inside the while loop in
rbd_rq_fn() that's not needed.  That same test already occurred
in the immediatly preceding loop condition test.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 542582fce1 rbd: bio_chain_clone() cleanups
In bio_chain_clone(), at the end of the function the bi_next field
of the tail of the new bio chain is nulled.  This isn't necessary,
because if "tail" is non-null, its value will be the last bio
structure allocated at the top of the while loop in that function.
And before that structure is added to the end of the new chain, its
bi_next pointer is always made null.

While touching that function, clean a few other things:
    - define each local variable on its own line
    - move the definition of "tmp" to an inner scope
    - move the modification of gfpmask closer to where it's used
    - rearrange the logic that sets the chain's tail pointer

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 84d34dcc11 rbd: kill notify_timeout option
The "notify_timeout" rbd device option is never used, so get rid of
it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder cc0538b62c rbd: add read_only rbd map option
Add the ability to map an rbd image read-only, by specifying either
"read_only" or "ro" as an option on the rbd "command line."  Also
allow the inverse to be explicitly specified using "read_write" or
"rw".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder f8c3892911 rbd: move rbd_opts to struct rbd_device
The rbd options don't really apply to the ceph client.  So don't
store a pointer to it in the ceph_client structure, and put them
(a struct, not a pointer) into the rbd_dev structure proper.

Pass the rbd device structure to rbd_client_create() so it can
assign rbd_dev->rbdc if successful, and have it return an error code
instead of the rbd client pointer.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder 621901d652 rbd: more cleanup in rbd_header_from_disk()
This just rearranges things a bit more in rbd_header_from_disk()
so that the snapshot sizes are initialized right after the buffer
to hold them is allocated and doing a little further consolidation
that follows from that.  Also adds a few simple comments.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:50 -05:00
Alex Elder f785cc1dbe rbd: kill incore snap_names_len
The only thing the on-disk snap_names_len field is needed is to
size the buffer allocated to hold a copy of the snapshot names
for an rbd image.

So don't bother saving it in the in-core rbd_image_header structure.
Just use a local variable to hold the required buffer size while
it's needed.

Move the code that actually copies the snapshot names up closer
to where the required length is saved.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 58c17b0e1b rbd: don't over-allocate space for object prefix
In rbd_header_from_disk() the object prefix buffer is sized based on
the maximum size it's block_name equivalent on disk could be.

Instead, only allocate enough to hold null-terminated string from
the on-disk header--or the maximum size of no NUL is found.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 1f7ba33115 rbd: handle locking inside __rbd_client_find()
There is only caller of __rbd_client_find(), and it somewhat
clumsily gets the appropriate lock and gets a reference to the
existing ceph_client structure if it's found.

Instead, have that function handle its own locking, and acquire the
reference if found while it holds the lock.  Drop the underscores
from the name because there's no need to signify anything special
about this function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 523f32582f rbd: add new snapshots at the tail
This fixes a bug that went in with this commit:

    commit f6e0c99092cca7be00fca4080cfc7081739ca544
    Author: Alex Elder <elder@inktank.com>
    Date:   Thu Aug 2 11:29:46 2012 -0500
    rbd: simplify __rbd_init_snaps_header()

The problem is that a new rbd snapshot needs to go either after an
existing snapshot entry, or at the *end* of an rbd device's snapshot
list.  As originally coded, it is placed at the beginning.  This was
based on the assumption the list would be empty (so it wouldn't
matter), but in fact if multiple new snapshots are added to an empty
list in one shot the list will be non-empty after the first one is
added.

This addresses http://tracker.newdream.net/issues/3063

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 843a0d0879 rbd: rename block_name -> object_prefix
In the on-disk image header structure there is a field "block_name"
which represents what we now call the "object prefix" for an rbd
image.  Rename this field "object_prefix" to be consistent with
modern usage.

This appears to be the only remaining vestige of the use of "block"
in symbols that represent objects in the rbd code.

This addresses http://tracker.newdream.net/issues/1761

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 4156d99840 rbd: separate reading header from decoding it
Right now rbd_read_header() both reads the header object for an rbd
image and decodes its contents.  It does this repeatedly if needed,
in order to ensure a complete and intact header is obtained.

Separate this process into two steps--reading of the raw header
data (in new function, rbd_dev_v1_header_read()) and separately
decoding its contents (in rbd_header_from_disk()).  As a result,
the latter function no longer requires its allocated_snaps argument.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:49 -05:00
Alex Elder 103a150f0c rbd: expand rbd_dev_ondisk_valid() checks
Add checks on the validity of the snap_count and snap_names_len
field values in rbd_dev_ondisk_valid().  This eliminates the
need to do them in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 28cb775de1 rbd: return earlier in rbd_header_from_disk()
The only caller of rbd_header_from_disk() is rbd_read_header().
It passes as allocated_snaps the number of snapshots it will
have received from the server for the snapshot context that
rbd_header_from_disk() is to interpret.  The first time through
it provides 0--mainly to extract the number of snapshots from
the snapshot context header--so that it can allocate an
appropriately-sized buffer to receive the entire snapshot
context from the server in a second request.

rbd_header_from_disk() will not fill in the array of snapshot ids
unless the number in the snapshot matches the number the caller
had allocated.

This patch adjusts that logic a little further to be more efficient.
rbd_read_header() doesn't even examine the snapshot context unless
the snapshot count (stored in header->total_snaps) matches the
number of snapshots allocated.  So rbd_header_from_disk() doesn't
need to allocate or fill in the snapshot context field at all in
that case.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 6a52325f61 rbd: rearrange rbd_header_from_disk()
This just moves code around for the most part.  It was pulled out as
a separate patch to avoid cluttering up some upcoming patches which
are more substantive.  The point is basically to group everything
related to initializing the snapshot context together.

The only functional change is that rbd_header_from_disk() now
ensures the (in-core) header it is passed is zero-filled.  This
allows a simpler error handling path in rbd_header_from_disk().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder d2bb24e506 rbd: use sizeof (object) instead of sizeof (type)
Fix a few spots in rbd_header_from_disk() to use sizeof (object)
rather than sizeof (type).  Use a local variable to record sizes
to shorten some lines and improve readability.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder d78fd7ae03 rbd: ensure invalid pointers are made null
Fix a number of spots where a pointer value that is known to
have become invalid but was not reset to null.

Also, toss in a change so we use sizeof (object) rather than
sizeof (type).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 0f1d3f9385 rbd: make snap_names_len a u64
The snap_names_len field of an rbd_image_header structure is defined
with type size_t.  That field is used as both the source and target
of 64-bit byte-order swapping operations though, so it's best to
define it with type u64 instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Alex Elder 3593815022 rbd: simplify __rbd_init_snaps_header()
The purpose of __rbd_init_snaps_header() is to compare a new
snapshot context with an rbd device's list of existing snapshots.
It updates the list by adding any new snapshots or removing any
that are not present in the new snapshot context.

The code as written is a little confusing, because it traverses both
the existing snapshot list and the set of snapshots in the snapshot
context in reverse.  This was done based on an assumption about
snapshots that is not true--namely that a duplicate snapshot name
could cause an error in intepreting things if they were not
processed in ascending order.

These precautions are not necessary, because:
    - all snapshots are uniquely identified by their snapshot id
    - a new snapshot cannot be created if the rbd device has another
      snapshot with the same name
(It is furthermore not currently possible to rename a snapshot.)

This patch re-implements __rbd_init_snaps_header() so it passes
through both the existing snapshot list and the entries in the
snapshot context in forward order.  It still does the same thing
as before, but I find the logic considerably easier to understand.

By going forward through the names in the snapshot context, there
is no longer a need for the rbd_prev_snap_name() helper function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-10-01 14:30:48 -05:00
Linus Torvalds fdb2f9c2eb PCI changes for the 3.7 merge window:
Host bridge hotplug
     - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
     - Clear host bridge resource info to avoid issue when releasing (Yinghai Lu)
     - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
     - Use standard list ops for acpi_pci_drivers (Jiang Liu)
 
   Device hotplug
     - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang Liu)
     - Remove fakephp driver (Bjorn Helgaas)
     - Fix VGA ref count in hotplug remove path (Yinghai Lu)
     - Allow acpiphp to handle PCIe ports without native hotplug (Jiang Liu)
     - Implement resume regardless of pciehp_force param (Oliver Neukum)
     - Make pci_fixup_irqs() work after init (Thierry Reding)
 
   Miscellaneous
     - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
     - Factor out PCI Express Capability accessors (Jiang Liu)
     - Add pcibios_window_alignment() so powerpc EEH can use generic resource assignment (Gavin Shan)
     - Make pci_error_handlers const (Stephen Hemminger)
     - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
     - Improve Vendor-Specific Extended Capability support (Bjorn Helgaas)
     - Use standard list ops for bus->devices (Bjorn Helgaas)
     - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
     - Reassign invalid bus number ranges (Intel DP43BF workaround) (Yinghai Lu)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJQac4hAAoJEPGMOI97Hn6zjZYP/iaqU9kjmgTsBbSyzB4oApv/
 RRxo3I+ad9GF6XlMQfVAtyx1pgCD1gdGAtoDgGSCTqgdYD3CO10AxKU+yleAk1wo
 dbMxLifJNTrT3G1mZ/NL16yEGhCwvhfwzRtB1VoZmCT4lSApO/7cJkXl2DzHfA/i
 pmltOOiQCN8kbUcJbVPtUyTVPi2zl/8bsyCyTkS7YG0VXeGRM+ZUvPWZJ7MnWYYB
 5qoCdrw5ENCCiDQ9yw5SAfgL23b+0p6OI/x3Lkex0QQOWwSqGSiaHt4b7eitrC5b
 2eAJg32f/AzZke1YbKLMfdsL0VJP3GAswhDVHlgmo63rZkOZChm+97dgZ35Mcv5v
 kEXkWyBb1xJ3t8rZir6Qer9Iv2wOB+MkZ5qtU/Vf+l0wLQLYTrRVsKngrEDREONk
 dXbokp6iVSPeA1sTSdH9MmHlTUIj82ZLSGcxcjTsN8NWZjxx6g3rNx1uay+5MYOW
 4ET9zNu5snrAqN6N4Tb81gvtG8qYfxzdvVfrA9AaGKI6xxB7jkqgFJRp55JiEcFc
 x4cmWkhvdlhVsG2TQwFxYNfswOqD+7NCs6V4kSVZX6ezpDrH7I5VvcnnhstF7C8l
 KZul0EV7OW+kDK23pNe24lVP2xtOv6G8eK/3PmeKIXWl1V83nqre/oLufRzTfs+Z
 SxkILwY/MFpuCFteKE1t
 =haBu
 -----END PGP SIGNATURE-----

Merge tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI changes from Bjorn Helgaas:
 "Host bridge hotplug
    - Protect acpi_pci_drivers and acpi_pci_roots (Taku Izumi)
    - Clear host bridge resource info to avoid issue when releasing
      (Yinghai Lu)
    - Notify acpi_pci_drivers when hot-plugging host bridges (Jiang Liu)
    - Use standard list ops for acpi_pci_drivers (Jiang Liu)

  Device hotplug
    - Use pci_get_domain_bus_and_slot() to close hotplug races (Jiang
      Liu)
    - Remove fakephp driver (Bjorn Helgaas)
    - Fix VGA ref count in hotplug remove path (Yinghai Lu)
    - Allow acpiphp to handle PCIe ports without native hotplug (Jiang
      Liu)
    - Implement resume regardless of pciehp_force param (Oliver Neukum)
    - Make pci_fixup_irqs() work after init (Thierry Reding)

  Miscellaneous
    - Add pci_pcie_type(dev) and remove pci_dev.pcie_type (Yijing Wang)
    - Factor out PCI Express Capability accessors (Jiang Liu)
    - Add pcibios_window_alignment() so powerpc EEH can use generic
      resource assignment (Gavin Shan)
    - Make pci_error_handlers const (Stephen Hemminger)
    - Cleanup drivers/pci/remove.c (Bjorn Helgaas)
    - Improve Vendor-Specific Extended Capability support (Bjorn
      Helgaas)
    - Use standard list ops for bus->devices (Bjorn Helgaas)
    - Avoid kmalloc in pci_get_subsys() and pci_get_class() (Feng Tang)
    - Reassign invalid bus number ranges (Intel DP43BF workaround)
      (Yinghai Lu)"

* tag 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (102 commits)
  PCI: acpiphp: Handle PCIe ports without native hotplug capability
  PCI/ACPI: Use acpi_driver_data() rather than searching acpi_pci_roots
  PCI/ACPI: Protect acpi_pci_roots list with mutex
  PCI/ACPI: Use acpi_pci_root info rather than looking it up again
  PCI/ACPI: Pass acpi_pci_root to acpi_pci_drivers' add/remove interface
  PCI/ACPI: Protect acpi_pci_drivers list with mutex
  PCI/ACPI: Notify acpi_pci_drivers when hot-plugging PCI root bridges
  PCI/ACPI: Use normal list for struct acpi_pci_driver
  PCI/ACPI: Use DEVICE_ACPI_HANDLE rather than searching acpi_pci_roots
  PCI: Fix default vga ref_count
  ia64/PCI: Clear host bridge aperture struct resource
  x86/PCI: Clear host bridge aperture struct resource
  PCI: Stop all children first, before removing all children
  Revert "PCI: Use hotplug-safe pci_get_domain_bus_and_slot()"
  PCI: Provide a default pcibios_update_irq()
  PCI: Discard __init annotations for pci_fixup_irqs() and related functions
  PCI: Use correct type when freeing bus resource list
  PCI: Check P2P bridge for invalid secondary/subordinate range
  PCI: Convert "new_id"/"remove_id" into generic pci_bus driver attributes
  xen-pcifront: Use hotplug-safe pci_get_domain_bus_and_slot()
  ...
2012-10-01 12:05:36 -07:00
Linus Torvalds 21e98932dc Merge git://git.infradead.org/users/willy/linux-nvme
Pull NVMe driver fixes from Matthew Wilcox:
 "Now that actual hardware has been released (don't have any yet
  myself), people are starting to want some of these fixes merged."

Willy doesn't have hardware? Guys...

* git://git.infradead.org/users/willy/linux-nvme:
  NVMe: Cancel outstanding IOs on queue deletion
  NVMe: Free admin queue memory on initialisation failure
  NVMe: Use ida for nvme device instance
  NVMe: Fix whitespace damage in nvme_init
  NVMe: handle allocation failure in nvme_map_user_pages()
  NVMe: Fix uninitialized iod compiler warning
  NVMe: Do not set IO queue depth beyond device max
  NVMe: Set block queue max sectors
  NVMe: use namespace id for nvme_get_features
  NVMe: replace nvme_ns with nvme_dev for user admin
  NVMe: Fix nvme module init when nvme_major is set
  NVMe: Set request queue logical block size
2012-09-29 10:31:52 -07:00
Asias He bb8111086c virtio-blk: Disable callback in virtblk_done()
This reduces unnecessary interrupts that host could send to guest while
guest is in the progress of irq handling.

If one vcpu is handling the irq, while another interrupt comes, in
handle_edge_irq(), the guest will mask the interrupt via mask_msi_irq()
which is a very heavy operation that goes all the way down to host.

 Here are some performance numbers on qemu:

 Before:
 -------------------------------------
   seq-read  : io=0 B, bw=269730KB/s, iops=67432 , runt= 62200msec
   seq-write : io=0 B, bw=339716KB/s, iops=84929 , runt= 49386msec
   rand-read : io=0 B, bw=270435KB/s, iops=67608 , runt= 62038msec
   rand-write: io=0 B, bw=354436KB/s, iops=88608 , runt= 47335msec
     clat (usec): min=101 , max=138052 , avg=14822.09, stdev=11771.01
     clat (usec): min=96 , max=81543 , avg=11798.94, stdev=7735.60
     clat (usec): min=128 , max=140043 , avg=14835.85, stdev=11765.33
     clat (usec): min=109 , max=147207 , avg=11337.09, stdev=5990.35
   cpu          : usr=15.93%, sys=60.37%, ctx=7764972, majf=0, minf=54
   cpu          : usr=32.73%, sys=120.49%, ctx=7372945, majf=0, minf=1
   cpu          : usr=18.84%, sys=58.18%, ctx=7775420, majf=0, minf=1
   cpu          : usr=24.20%, sys=59.85%, ctx=8307886, majf=0, minf=0
   vdb: ios=8389107/8368136, merge=0/0, ticks=19457874/14616506,
 in_queue=34206098, util=99.68%
  43: interrupt in total: 887320
 fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

 After:
 -------------------------------------
   seq-read  : io=0 B, bw=309503KB/s, iops=77375 , runt= 54207msec
   seq-write : io=0 B, bw=448205KB/s, iops=112051 , runt= 37432msec
   rand-read : io=0 B, bw=311254KB/s, iops=77813 , runt= 53902msec
   rand-write: io=0 B, bw=377152KB/s, iops=94287 , runt= 44484msec
     clat (usec): min=81 , max=90588 , avg=12946.06, stdev=9085.94
     clat (usec): min=57 , max=72264 , avg=8967.97, stdev=5951.04
     clat (usec): min=29 , max=101046 , avg=12889.95, stdev=9067.91
     clat (usec): min=52 , max=106152 , avg=10660.56, stdev=4778.19
   cpu          : usr=15.05%, sys=57.92%, ctx=7710941, majf=0, minf=54
   cpu          : usr=26.78%, sys=101.40%, ctx=7387891, majf=0, minf=2
   cpu          : usr=19.03%, sys=58.17%, ctx=7681976, majf=0, minf=8
   cpu          : usr=24.65%, sys=58.34%, ctx=8442632, majf=0, minf=4
   vdb: ios=8389086/8361888, merge=0/0, ticks=17243780/12742010,
 in_queue=30078377, util=99.59%
  43: interrupt in total: 1259639
 fio --exec_prerun="echo 3 > /proc/sys/vm/drop_caches" --group_reporting
 --ioscheduler=noop --thread --bs=4k --size=512MB --direct=1 --numjobs=16
 --ioengine=libaio --iodepth=64 --loops=3 --ramp_time=0
 --filename=/dev/vdb --name=seq-read --stonewall --rw=read
 --name=seq-write --stonewall --rw=write --name=rnd-read --stonewall
 --rw=randread --name=rnd-write --stonewall --rw=randwrite

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:16 +09:30
Dan Carpenter f22cf8eb48 virtio-blk: fix NULL checking in virtblk_alloc_req()
Smatch complains about the inconsistent NULL checking here.  Fix it to
return NULL on failure.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed accidental deletion)
2012-09-28 15:05:14 +09:30
Asias He c85a1f91b3 virtio-blk: Add REQ_FLUSH and REQ_FUA support to bio path
We need to support both REQ_FLUSH and REQ_FUA for bio based path since
it does not get the sequencing of REQ_FUA into REQ_FLUSH that request
based drivers can request.

REQ_FLUSH is emulated by:
A) If the bio has no data to write:
1. Send VIRTIO_BLK_T_FLUSH to device,
2. In the flush I/O completion handler, finish the bio

B) If the bio has data to write:
1. Send VIRTIO_BLK_T_FLUSH to device
2. In the flush I/O completion handler, send the actual write data to device
3. In the write I/O completion handler, finish the bio

REQ_FUA is emulated by:
1. Send the actual write data to device
2. In the write I/O completion handler, send VIRTIO_BLK_T_FLUSH to device
3. In the flush I/O completion handler, finish the bio

Changes in v7:
- Using vbr->flags to trace request type
- Dropped unnecessary struct virtio_blk *vblk parameter
- Reuse struct virtblk_req in bio done function

Cahnges in v6:
- Reworked REQ_FLUSH and REQ_FUA emulatation order

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:14 +09:30
Asias He a98755c559 virtio-blk: Add bio-based IO path for virtio-blk
This patch introduces bio-based IO path for virtio-blk.

Compared to request-based IO path, bio-based IO path uses driver
provided ->make_request_fn() method to bypasses the IO scheduler. It
handles the bio to device directly without allocating a request in block
layer. This reduces the IO path in guest kernel to achieve high IOPS
and lower latency. The downside is that guest can not use the IO
scheduler to merge and sort requests. However, this is not a big problem
if the backend disk in host side uses faster disk device.

When the bio-based IO path is not enabled, virtio-blk still uses the
original request-based IO path, no performance difference is observed.

Using a slow device e.g. normal SATA disk, the bio-based IO path for
sequential read and write are slower than req-based IO path due to lack
of merge in guest kernel. So we make the bio-based path optional.

Performance evaluation:
-----------------------------
1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 28%, 24%, 21%, 16%
 Latency improvement: 32%, 17%, 21%, 16%

Long version:
 With bio-based IO path:
  seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
  seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
  rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
  rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
    clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
    clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
    clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
    clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
  cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
  cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
  cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
  cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
 With request-based IO path:
  seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
  seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
  rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
  rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
    clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
    clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
    clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
    clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
  cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
  cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
  cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
  cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985

2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : 11%, 11%, 13%, 10%
 Latency improvement: 10%, 10%, 12%, 10%
Long Version:
 With bio-based IO path:
  read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
  write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
  read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
  write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
    clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
    clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
    clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
    clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
  cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
  cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
  cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
  cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
 With request-based IO path:
  read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
  write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
  read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
  write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
    clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
    clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
    clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
    clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
  cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
  cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
  cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
  cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409

3) Fio test is performed in a 8 vcpu guest with normal SATA based guest
using kvm tool.

Short version:
 With bio-based IO path, sequential read/write, random read/write
 IOPS boost         : -10%, -10%, 4.4%, 0.5%
 Latency improvement: -12%, -15%, 2.5%, 0.8%
Long Version:
 With bio-based IO path:
  read : io=124812KB, bw=36537KB/s, iops=9060 , runt=  3416msec
  write: io=169180KB, bw=24406KB/s, iops=6065 , runt=  6932msec
  read : io=256200KB, bw=2089.3KB/s, iops=520 , runt=122630msec
  write: io=257988KB, bw=1545.7KB/s, iops=384 , runt=166910msec
    clat (msec): min=1 , max=1527 , avg=28.06, stdev=89.54
    clat (msec): min=2 , max=344 , avg=41.12, stdev=38.70
    clat (msec): min=8 , max=1984 , avg=490.63, stdev=207.28
    clat (msec): min=33 , max=4131 , avg=659.19, stdev=304.71
  cpu          : usr=4.85%, sys=17.15%, ctx=31593, majf=0, minf=7
  cpu          : usr=3.04%, sys=11.45%, ctx=39377, majf=0, minf=0
  cpu          : usr=0.47%, sys=1.59%, ctx=262986, majf=0, minf=16
  cpu          : usr=0.47%, sys=1.46%, ctx=337410, majf=0, minf=0

 With request-based IO path:
  read : io=150120KB, bw=40420KB/s, iops=10037 , runt=  3714msec
  write: io=194932KB, bw=27029KB/s, iops=6722 , runt=  7212msec
  read : io=257136KB, bw=2001.1KB/s, iops=498 , runt=128443msec
  write: io=258276KB, bw=1537.2KB/s, iops=382 , runt=168028msec
    clat (msec): min=1 , max=1542 , avg=24.84, stdev=32.45
    clat (msec): min=3 , max=628 , avg=35.62, stdev=39.71
    clat (msec): min=8 , max=2540 , avg=503.28, stdev=236.97
    clat (msec): min=41 , max=4398 , avg=653.88, stdev=302.61
  cpu          : usr=3.91%, sys=15.75%, ctx=26968, majf=0, minf=23
  cpu          : usr=2.50%, sys=10.56%, ctx=19090, majf=0, minf=0
  cpu          : usr=0.16%, sys=0.43%, ctx=20159, majf=0, minf=16
  cpu          : usr=0.18%, sys=0.53%, ctx=81364, majf=0, minf=0

How to use:
-----------------------------
Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
use_bio=1' to enable ->make_request_fn() based I/O path.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-09-28 15:05:13 +09:30
Linus Torvalds bee2d97b2c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull two ceph fixes from Sage Weil:
 "The first fixes a leak in the rbd setup error path, and the second
  fixes a more serious problem with mismatched kmap/kunmap that surfaced
  after the recent refactoring work."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: only kunmap kmapped pages
  rbd: drop dev reference on error in rbd_open()
2012-09-24 16:13:49 -07:00
Alex Elder 340c7a2b2c rbd: drop dev reference on error in rbd_open()
If a read-only rbd device is opened for writing in rbd_open(), it
returns without dropping the just-acquired device reference.

Fix this by moving the read-only check before getting the reference.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-09-21 20:48:54 -07:00
Linus Torvalds abef3bd710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking updates from David Miller:
 "More bug fixes, nothing gets past these guys"

 1) More kernel info leaks found by Mathias Krause, this time in the
    IPSEC configuration layers.

 2) When IPSEC policies change, we do not properly make sure that cached
    routes (which could now be stale) throughout the system will be
    revalidated.  Fix this by generalizing the generation count
    invalidation scheme used by ipv4.  From Nicolas Dichtel.

 3) When repairing TCP sockets, we need to allow to restore not just the
    send window scale, but the receive one too.  Extend the existing
    interface to achieve this in a backwards compatible way.  From
    Andrey Vagin.

 4) A fix for FCOE scatter gather feature validation erroneously caused
    scatter gather to be disabled for things like AOE too.  From Ed L
    Cashin.

 5) Several cases of mishandling of error pointers, from Mathias Krause,
    Wei Yongjun, and Devendra Naga.

 6) Fix gianfar build, from Richard Cochran.

 7) CAP_NET_* failures should return -EPERM not -EACCES, from Zhao
    Hongjiang.

 8) Hardware reset fix in janz-ican3 CAN driver, from Ira W Snyder.

 9) Fix oops during rmmod in ti_hecc CAN driver, from Marc Kleine-Budde.

10) The removal of the conditional compilation of the clk support code
    in the stmmac driver broke things.  This is because the interfaces
    used are the ones that don't also perform the enable/disable of the
    clk.  Fix from Stefan Roese.

11) The QFQ packet scheduler can record out of range virtual start
    times, resulting later in misbehavior and even crashes.  Fix from
    Paolo Valente.

12) If MSG_WAITALL is used with IOAT DMA under TCP, we can wedge the
    receiver when the advertised receive window goes to zero.  Detect
    this case and force the processing of the IOAT DMA queue when it
    happens to avoid getting stuck.  Fix from Michal Kubecek.

13) batman-adv assumes that test_bit() returns only 0 or 1, but this is
    not true for x86 (which returns -1 or 0, via the 'sbb' instruction).
    Fix from Linus Lussing.

14) Fix small packet corruption in e1000, from Tushar Dave.

15) make_blackhole() in the IPSEC policy code can do one read unlock too
    many, fix from Li RongQing.

16) The new tcp_try_coalesce() code introduced a bug in TCP URG
    handling, fix from Eric Dumazet.

17) Fix memory leak in __netif_receive_skb() when doing zerocopy and
    when hit an OOM condition.  From Michael S Tsirkin.

18) netxen blindly deferences pdev->bus->self, which is not guarenteed
    to be non-NULL.  Fix from Nikolay Aleksandrov.

19) Fix a performance regression caused by mistakes in ipv6 checksum
    validation in the bnx2x driver, fix from Michal Schmidt.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (45 commits)
  net/stmmac: Use clk_prepare_enable and clk_disable_unprepare
  net: change return values from -EACCES to -EPERM
  net/irda: sh_sir: fix return value check in sh_sir_set_baudrate()
  stmmac: fix return value check in stmmac_open_ext_timer()
  gianfar: fix phc index build failure
  ipv6: fix return value check in fib6_add()
  bnx2x: remove false warning regarding interrupt number
  can: ti_hecc: fix oops during rmmod
  can: janz-ican3: fix support for older hardware revisions
  net: do not disable sg for packets requiring no checksum
  aoe: assert AoE packets marked as requiring no checksum
  at91ether: return PTR_ERR if call to clk_get fails
  xfrm_user: don't copy esn replay window twice for new states
  xfrm_user: ensure user supplied esn replay window is valid
  xfrm_user: fix info leak in copy_to_user_tmpl()
  xfrm_user: fix info leak in copy_to_user_policy()
  xfrm_user: fix info leak in copy_to_user_state()
  xfrm_user: fix info leak in copy_to_user_auth()
  net: qmi_wwan: adding Huawei E367, ZTE MF683 and Pantech P4200
  tcp: restore rcv_wscale in a repair mode (v2)
  ...
2012-09-21 14:32:55 -07:00
Linus Torvalds 8ca7de9164 Bug-fixes:
* Fix M2P batching re-using the incorrect structure field.
  * Disable BIOS SMP MP table search.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJQXGfdAAoJEFjIrFwIi8fJbWcH/0FI2d/VyB+ZU0ng3R0Oa7mt
 iR/x+Z+mfFdp2dXS6gs6DgJIZVA7i2K9pX4rOXjpDGGGyUeo1xoqjlQfsFWQGjZ/
 p49RrDrM93c2GdRXk3iMSWfboQI7BXBs5rnyYZQL7kMxUSR75MxbeONvhPrMSO9I
 3EBidWH08qjrn2HVF44F6xh5ONjpclo5AvGIzJ0eU4X0D0eqMnhvlAw8/UYJU2HV
 heRvuxWF9l2jNpLhKhZy1730D1X/vKA5qKAcBW8rCOpEijyPpmtKbqapeUJg/9pH
 NVquuwGutP5ozrSi7a/23+L+ezvQBmCPm5ZRG44PccBoZ/HVs8haT8UypSWSDzo=
 =TwvM
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
 - Fix M2P batching re-using the incorrect structure field.

   In v3.5 we added batching for M2P override (Machine Frame Number ->
   Physical Frame Number), but the original MFN was saved in an
   incorrect structure - and we would oops/restore when restoring with
   the old MFN.

 - Disable BIOS SMP MP table search.

   A bootup issue that we had ignored until we found that on DL380 G6 it
   was needed.

* tag 'stable/for-linus-3.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/boot: Disable BIOS SMP MP table search.
  xen/m2p: do not reuse kmap_op->dev_bus_addr
2012-09-21 12:06:54 -07:00
Eric W. Biederman e4849737f7 userns: Convert loop to use kuid_t instead of uid_t
Cc: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-09-21 03:13:20 -07:00
Ed Cashin 8babe8cc65 aoe: assert AoE packets marked as requiring no checksum
In order for the network layer to see that AoE requires
no checksumming in a generic way, the packets must be
marked as requiring no checksum, so we make this requirement
explicit with the assertion.

Signed-off-by: Ed Cashin <ecashin@coraid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20 22:23:40 -04:00
Linus Torvalds c46de2263f Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "A small collection of driver fixes/updates and a core fix for 3.6.  It
  contains:

   - Bug fixes for mtip32xx, and support for new hardware (just addition
     of IDs).  They have been queued up for 3.7 for a few weeks as well.

   - rate-limit a failing command error message in block core.

   - A fix for an old cciss bug from Stephen.

   - Prevent overflow of partition count from Alan."

* 'for-linus' of git://git.kernel.dk/linux-block:
  cciss: fix handling of protocol error
  blk: add an upper sanity check on partition adding
  mtip32xx: fix user_buffer check in exec_drive_command
  mtip32xx: Remove dead code
  mtip32xx: Change printk to pr_xxxx
  mtip32xx: Proper reporting of write protect status on big-endian
  mtip32xx: Increase timeout for standby command
  mtip32xx: Handle NCQ commands during the security locked state
  mtip32xx: Add support for new devices
  block: rate-limit the error message from failing commands
2012-09-19 11:04:34 -07:00
Stephen M. Cameron 2453f5f992 cciss: fix handling of protocol error
If a command completes with a status of CMD_PROTOCOL_ERR, this
information should be conveyed to the SCSI mid layer, not dropped
on the floor.  Unlike a similar bug in the hpsa driver, this bug
only affects tape drives and CD and DVD ROM drives in the cciss
driver, and to induce it, you have to disconnect (or damage) a
cable, so it is not a very likely scenario (which would explain
why the bug has gone undetected for the last 10 years.)

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-18 11:57:08 +02:00
Paul Clements fded4e090c nbd: clear waiting_queue on shutdown
Fix a serious but uncommon bug in nbd which occurs when there is heavy
I/O going to the nbd device while, at the same time, a failure (server,
network) or manual disconnect of the nbd connection occurs.

There is a small window between the time that the nbd_thread is stopped
and the socket is shutdown where requests can continue to be queued to
nbd's internal waiting_queue.  When this happens, those requests are
never completed or freed.

The fix is to clear the waiting_queue on shutdown of the nbd device, in
the same way that the nbd request queue (queue_head) is already being
cleared.

Signed-off-by: Paul Clements <paul.clements@steeleye.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-17 15:00:37 -07:00
Greg Kroah-Hartman 2bcb132c69 Merge 3.6-rc6 into usb-next
This resolves the merge problems with:
	drivers/usb/dwc3/gadget.c
	drivers/usb/musb/tusb6010.c
that had been seen in linux-next.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-16 20:42:46 -07:00
Bjorn Helgaas 78890b5989 Merge commit 'v3.6-rc5' into next
* commit 'v3.6-rc5': (1098 commits)
  Linux 3.6-rc5
  HID: tpkbd: work even if the new Lenovo Keyboard driver is not configured
  Remove user-triggerable BUG from mpol_to_str
  xen/pciback: Fix proper FLR steps.
  uml: fix compile error in deliver_alarm()
  dj: memory scribble in logi_dj
  Fix order of arguments to compat_put_time[spec|val]
  xen: Use correct masking in xen_swiotlb_alloc_coherent.
  xen: fix logical error in tlb flushing
  xen/p2m: Fix one-off error in checking the P2M tree directory.
  powerpc: Don't use __put_user() in patch_instruction
  powerpc: Make sure IPI handlers see data written by IPI senders
  powerpc: Restore correct DSCR in context switch
  powerpc: Fix DSCR inheritance in copy_thread()
  powerpc: Keep thread.dscr and thread.dscr_inherit in sync
  powerpc: Update DSCR on all CPUs when writing sysfs dscr_default
  powerpc/powernv: Always go into nap mode when CPU is offline
  powerpc: Give hypervisor decrementer interrupts their own handler
  powerpc/vphn: Fix arch_update_cpu_topology() return value
  ARM: gemini: fix the gemini build
  ...

Conflicts:
	drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
	drivers/rapidio/devices/tsi721.c
2012-09-13 08:41:01 -06:00
David Milburn 97651ea687 mtip32xx: fix user_buffer check in exec_drive_command
Current user_buffer check is incorrect and causes hdparm to fail

# hdparm -I /dev/rssda
 HDIO_DRIVE_CMD(identify) failed: Input/output error

/dev/rssda:

Patching linux-3.6-rc5 hdparm works as expected

# hdparm -I /dev/rssda
/dev/rssda:

ATA device, with non-removable media
	Model Number:       DELL_P320h-MTFDGAL350SAH
	Serial Number:      00000000121302025F01
	Firmware Revision:  B1442808
<snip>

Reported-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:21:13 +02:00
Asai Thambi S P ac64e6572d mtip32xx: Remove dead code
Removed the dead code in mtip_hw_read_registers() and mtip_hw_read_flags().

Reported-by: Coverity
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:20:22 +02:00
Asai Thambi S P 45422e7431 mtip32xx: Change printk to pr_xxxx
Changed printk to be compliant with latest style changes

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:20:11 +02:00
Asai Thambi S P b62868e50e mtip32xx: Proper reporting of write protect status on big-endian
Proper reporting of write protect status on big-endian

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:59 +02:00
Asai Thambi S P d7c8b94548 mtip32xx: Increase timeout for standby command
Increased timeout for standby command to work with larger capacity drives

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:49 +02:00
Asai Thambi S P 12a166c919 mtip32xx: Handle NCQ commands during the security locked state
Return error for NCQ commands when the drive is in security locked state

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:39 +02:00
Asai Thambi S P 1a131458dd mtip32xx: Add support for new devices
Added supported device IDs in pci table

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-12 22:19:29 +02:00
Stefano Stabellini 2fc136eecd xen/m2p: do not reuse kmap_op->dev_bus_addr
If the caller passes a valid kmap_op to m2p_add_override, we use
kmap_op->dev_bus_addr to store the original mfn, but dev_bus_addr is
part of the interface with Xen and if we are batching the hypercalls it
might not have been written by the hypervisor yet. That means that later
on Xen will write to it and we'll think that the original mfn is
actually what Xen has written to it.

Rather than "stealing" struct members from kmap_op, keep using
page->index to store the original mfn and add another parameter to
m2p_remove_override to get the corresponding kmap_op instead.
It is now responsibility of the caller to keep track of which kmap_op
corresponds to a particular page in the m2p_override (gntdev, the only
user of this interface that passes a valid kmap_op, is already doing that).

CC: stable@kernel.org
Reported-and-Tested-By: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-09-12 11:21:40 -04:00
Kent Overstreet bf800ef181 block: Add bio_clone_bioset(), bio_clone_kmalloc()
Previously, there was bio_clone() but it only allocated from the fs bio
set; as a result various users were open coding it and using
__bio_clone().

This changes bio_clone() to become bio_clone_bioset(), and then we add
bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
the functionality the last patch adedd.

This will also help in a later patch changing how bio cloning works.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet ccc5c9ca6a pktcdvd: Switch to bio_kmalloc()
This is prep work for killing bi_destructor - previously, pktcdvd had
its own pkt_bio_alloc which was basically duplication bio_kmalloc(),
necessitating its own bi_destructor implementation.

v5: Un-reorder some functions, to make the patch easier to review

Signed-off-by: Kent Overstreet <koverstreet@google.com>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:39 +02:00
Kent Overstreet 395c72a707 block: Generalized bio pool freeing
With the old code, when you allocate a bio from a bio pool you have to
implement your own destructor that knows how to find the bio pool the
bio was originally allocated from.

This adds a new field to struct bio (bi_pool) and changes
bio_alloc_bioset() to use it. This makes various bio destructors
unnecessary, so they're then deleted.

v6: Explain the temporary if statement in bio_put

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: NeilBrown <neilb@suse.de>
CC: Alasdair Kergon <agk@redhat.com>
CC: Nicholas Bellinger <nab@linux-iscsi.org>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-09-09 10:35:38 +02:00
Stephen Hemminger 1d3520357d make drivers with pci error handlers const
Covers the rest of the uses of pci error handler.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-09-07 16:35:00 -06:00
Cong Wang 68a5059ecf block: remove the deprecated ub driver
It was scheduled to be removed in 3.6.

Acked-by: Pete Zaitcev <zaitcev@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-05 17:18:53 -07:00
Linus Torvalds a7e546f175 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block-related fixes from Jens Axboe:

 - Improvements to the buffered and direct write IO plugging from
   Fengguang.

 - Abstract out the mapping of a bio in a request, and use that to
   provide a blk_bio_map_sg() helper.  Useful for mapping just a bio
   instead of a full request.

 - Regression fix from Hugh, fixing up a patch that went into the
   previous release cycle (and marked stable, too) attempting to prevent
   a loop in __getblk_slow().

 - Updates to discard requests, fixing up the sizing and how we align
   them.  Also a change to disallow merging of discard requests, since
   that doesn't really work properly yet.

 - A few drbd fixes.

 - Documentation updates.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: replace __getblk_slow misfix by grow_dev_page fix
  drbd: Write all pages of the bitmap after an online resize
  drbd: Finish requests that completed while IO was frozen
  drbd: fix drbd wire compatibility for empty flushes
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update tunable options in block/cfq-iosched.txt
  Documentation: update missing index files in block/00-INDEX
  block: move down direct IO plugging
  block: remove plugging at buffered write time
  block: disable discard request merge temporarily
  bio: Fix potential memory leak in bio_find_or_create_slab()
  block: Don't use static to define "void *p" in show_partition_start()
  block: Add blk_bio_map_sg() helper
  block: Introduce __blk_segment_map_sg() helper
  fs/block-dev.c:fix performance regression in O_DIRECT writes to md block devices
  block: split discard into aligned requests
  block: reorganize rounding of max_discard_sectors
2012-08-25 11:36:43 -07:00
Stephen M. Cameron b0cf0b118c cciss: fix incorrect scsi status reporting
Delete code which sets SCSI status incorrectly as it's already been set
correctly above this incorrect code.  The bug was introduced in 2009 by
commit b0e15f6db1 ("cciss: fix typo that causes scsi status to be
lost.")

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Reported-by: Roel van Meer <roel.vanmeer@bokxing.nl>
Tested-by: Roel van Meer <roel.vanmeer@bokxing.nl>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-21 16:45:02 -07:00
Tejun Heo 136b5721d7 workqueue: deprecate __cancel_delayed_work()
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work().  Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo e7c2f96744 workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().

Most conversions are straight-forward except for the following.

* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
  elaborate dancing around its delayed_work.  Collapse it such that
  linkwatch_work is queued for immediate execution if LW_URGENT and
  existing timer is kept otherwise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
2012-08-21 13:18:24 -07:00
Tejun Heo 43829731dd workqueue: deprecate flush[_delayed]_work_sync()
flush[_delayed]_work_sync() are now spurious.  Mark them deprecated
and convert all users to flush[_delayed]_work().

If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant and the regular flushes guarantee that the work item is
not pending or running on any CPU on return, so there's no reason to
use the sync flushes at all and they're going away.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mattia Dongili <malattia@linux.it>
Cc: Kent Yoder <key@linux.vnet.ibm.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: Bryan Wu <bryan.wu@canonical.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: Anton Vorontsov <cbou@mail.ru>
Cc: Sangbeom Kim <sbkim73@samsung.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Avi Kivity <avi@redhat.com>
2012-08-20 14:51:24 -07:00
Jens Axboe 79df9b40b3 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus 2012-08-16 19:27:18 +02:00
Philipp Reisner d1aa4d04da drbd: Write all pages of the bitmap after an online resize
We need to write the whole bitmap after we moved the meta data
due to an online resize operation.

With the support for one peta byte devices bitmap IO was optimized
to only write out touched pages. This optimization must be turned
off when writing the bitmap after an online resize.

This issue was introduced with drbd-8.3.10.

The impact of this bug is that after an online resize, the next
resync could become larger than expected.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:17:35 +02:00
Philipp Reisner 509fc019e5 drbd: Finish requests that completed while IO was frozen
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.

When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.

Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.

I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:14:45 +02:00
Lars Ellenberg 227f052f47 drbd: fix drbd wire compatibility for empty flushes
DRBD has a concept of request epochs or reorder-domains,
which are separated on the wire by P_BARRIER packets.

Older DRBD is not able to handle zero-sized requests at all,
so we need to map empty flushes to these drbd barriers.

These are the equivalent of empty flushes, and
by default trigger flushes on the receiving side anyways
(unless not supported or explicitly disabled),
so there is no need to handle this differently in newer drbd either.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-08-16 17:12:56 +02:00
Stefano Stabellini e79affc3f2 xen/arm: compile blkfront and blkback
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-08-08 17:21:14 +00:00
Matthew Wilcox a09115b23e NVMe: Cancel outstanding IOs on queue deletion
If the device is hot-unplugged while there are active commands, we should
time out the I/Os so that upper layers don't just see the I/Os disappear.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-08-07 15:56:23 -04:00
Artem Bityutskiy d97482ede2 drbd: nuke pdflush from comments
The pdflush thread is long gone, so this patch removes references to pdflush
from drbd comments.

Cc: drbd-dev@lists.linbit.com
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:39 +04:00
Matthew Wilcox 9e866774aa NVMe: Free admin queue memory on initialisation failure
If the adapter fails initialisation, the memory allocated for the
admin queue may not be freed.  Split the memory freeing part of
nvme_free_queue() into nvme_free_queue_mem() and call it in the case of
initialisation failure.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
2012-08-03 13:55:56 -04:00
Linus Torvalds eff0d13f38 Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:

 - Making the plugging support for drivers a bit more sane from Neil.
   This supersedes the plugging change from Shaohua as well.

 - The usual round of drbd updates.

 - Using a tail add instead of a head add in the request completion for
   ndb, making us find the most completed request more quickly.

 - A few floppy changes, getting rid of a duplicated flag and also
   running the floppy init async (since it takes forever in boot terms)
   from Andi.

* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
  floppy: remove duplicated flag FD_RAW_NEED_DISK
  blk: pass from_schedule to non-request unplug functions.
  block: stack unplug
  blk: centralize non-request unplug handling.
  md: remove plug_cnt feature of plugging.
  block/nbd: micro-optimization in nbd request completion
  drbd: announce FLUSH/FUA capability to upper layers
  drbd: fix max_bio_size to be unsigned
  drbd: flush drbd work queue before invalidate/invalidate remote
  drbd: fix potential access after free
  drbd: call local-io-error handler early
  drbd: do not reset rs_pending_cnt too early
  drbd: reset congestion information before reporting it in /proc/drbd
  drbd: report congestion if we are waiting for some userland callback
  drbd: differentiate between normal and forced detach
  drbd: cleanup, remove two unused global flags
  floppy: Run floppy initialization asynchronous
2012-08-01 09:06:47 -07:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds 3e9a97082f This patch series contains a major revamp of how we collect entropy
from interrupts for /dev/random and /dev/urandom.  The goal is to
 addresses weaknesses discussed in the paper "Mining your Ps and Qs:
 Detection of Widespread Weak Keys in Network Devices", by Nadia
 Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman, which will
 be published in the Proceedings of the 21st Usenix Security Symposium,
 August 2012.  (See https://factorable.net for more information and an
 extended version of the paper.)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJQF/0DAAoJENNvdpvBGATwIowQAOep9QKtLrBvb2lwIRVmeiy8
 lRf7V/tYZnz4FePbR0W92JQfKYkCV8yyOO0bmeRzWL3v4m+lRwDTSyA1DDyQMoH+
 LOMzvDKSLJMSXTXdSOIr1WYACphViCR/9CrbMBCKSkYfZLJ1MdaEDxT3rcpTGD0T
 6iknUweiSkHHhkerU5yQL7FKzD5kYUe0hsF47w7QVlHRHJsW2fsZqkFoh+RpnhNw
 03u+djxNGBo9qV81vZ9D1b0vA9uRlEjoWOOEG2XE4M2iq6TUySueA72dQnCwunfi
 3kG/u1Swv2dgq6aRrP3H7zdwhYSourGxziu3jNhEKwKEohrxYY7xjNX3RVeTqP67
 AzlKsOTWpRLIDrzjSLlb8VxRQiZewu8Unex3e1G+eo20sbcIObHGrxNp7K00zZvd
 QZiMHhOwItwFTe4lBO+XbqH2JKbL9/uJmwh5EipMpQTraKO9E6N3CJiUHjzBLo2K
 iGDZxRMKf4gVJRwDxbbP6D70JPVu8ZJ09XVIpsXQ3Z1xNqaMF0QdCmP3ty56q1o0
 NvkSXxPKrijZs8Sk0rVDqnJ3ll8PuDnXMv5eDtL42VT818I5WxESn9djjwEanGv0
 TYxbFub/NRxmPEE5B2Js5FBpqsLf5f282OSMeS/5WLBbnHJR1OoPoAhGVpHvxntC
 bi5FC1OolqhvzVIdsqgt
 =u7KM
 -----END PGP SIGNATURE-----

Merge tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random

Pull random subsystem patches from Ted Ts'o:
 "This patch series contains a major revamp of how we collect entropy
  from interrupts for /dev/random and /dev/urandom.

  The goal is to addresses weaknesses discussed in the paper "Mining
  your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
  by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J.  Alex Halderman,
  which will be published in the Proceedings of the 21st Usenix Security
  Symposium, August 2012.  (See https://factorable.net for more
  information and an extended version of the paper.)"

Fix up trivial conflicts due to nearby changes in
drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

* tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
  random: mix in architectural randomness in extract_buf()
  dmi: Feed DMI table to /dev/random driver
  random: Add comment to random_initialize()
  random: final removal of IRQF_SAMPLE_RANDOM
  um: remove IRQF_SAMPLE_RANDOM which is now a no-op
  sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
  board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
  isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
  goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
  uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
  drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
  xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
  n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
  pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
  i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
  input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
  mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
  ...
2012-07-31 19:07:42 -07:00
Mel Gorman 7f338fe454 nbd: set SOCK_MEMALLOC for access to PFMEMALLOC reserves
Set SOCK_MEMALLOC on the NBD socket to allow access to PFMEMALLOC reserves
so pages backed by NBD, particularly if swap related, can be cleaned to
prevent the machine being deadlocked.  It is still possible that the
PFMEMALLOC reserves get depleted resulting in deadlock but this can be
resolved by the administrator by increasing min_free_kbytes.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:46 -07:00
Linus Torvalds cc8362b1f6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph changes from Sage Weil:
 "Lots of stuff this time around:

   - lots of cleanup and refactoring in the libceph messenger code, and
     many hard to hit races and bugs closed as a result.
   - lots of cleanup and refactoring in the rbd code from Alex Elder,
     mostly in preparation for the layering functionality that will be
     coming in 3.7.
   - some misc rbd cleanups from Josh Durgin that are finally going
     upstream
   - support for CRUSH tunables (used by newer clusters to improve the
     data placement)
   - some cleanup in our use of d_parent that Al brought up a while back
   - a random collection of fixes across the tree

  There is another patch coming that fixes up our ->atomic_open()
  behavior, but I'm going to hammer on it a bit more before sending it."

Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
  rbd: create rbd_refresh_helper()
  rbd: return obj version in __rbd_refresh_header()
  rbd: fixes in rbd_header_from_disk()
  rbd: always pass ops array to rbd_req_sync_op()
  rbd: pass null version pointer in add_snap()
  rbd: make rbd_create_rw_ops() return a pointer
  rbd: have __rbd_add_snap_dev() return a pointer
  libceph: recheck con state after allocating incoming message
  libceph: change ceph_con_in_msg_alloc convention to be less weird
  libceph: avoid dropping con mutex before fault
  libceph: verify state after retaking con lock after dispatch
  libceph: revoke mon_client messages on session restart
  libceph: fix handling of immediate socket connect failure
  ceph: update MAINTAINERS file
  libceph: be less chatty about stray replies
  libceph: clear all flags on con_close
  libceph: clean up con flags
  libceph: replace connection state bits with states
  libceph: drop unnecessary CLOSED check in socket state change callback
  libceph: close socket directly from ceph_con_close()
  ...
2012-07-31 14:35:28 -07:00
Quoc-Son Anh cd58ad7d18 NVMe: Use ida for nvme device instance
Signed-off-by: Quoc-Son Anh <quoc-sonx.anh@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:31:34 -04:00
Matthew Wilcox 0ac13140d7 NVMe: Fix whitespace damage in nvme_init
Commit 5c42ea1643 used spaces instead of tabs.
Also remove the unnecessary initialisation of the 'result' variable.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:31:16 -04:00
Dan Carpenter 22fff826e7 NVMe: handle allocation failure in nvme_map_user_pages()
We should return here and avoid a NULL dereference.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-31 13:15:43 -04:00
Jens Axboe 10af8138eb Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers 2012-07-31 11:47:36 +02:00
Fengguang Wu 2fb2ca6f5b floppy: remove duplicated flag FD_RAW_NEED_DISK
Fix coccinelle warning (without behavior change):

drivers/block/floppy.c:2518:32-48: duplicated argument to & or |

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-31 11:38:39 +02:00
NeilBrown 74018dc306 blk: pass from_schedule to non-request unplug functions.
This will allow md/raid to know why the unplug was called,
and will be able to act according - if !from_schedule it
is safe to perform tasks which could themselves schedule.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:15 +02:00
NeilBrown 9cbb175088 blk: centralize non-request unplug handling.
Both md and umem has similar code for getting notified on an
blk_finish_plug event.
Centralize this code in block/ and allow each driver to
provide its distinctive difference.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 09:08:14 +02:00
Chetan Loke 01ff5dbc09 block/nbd: micro-optimization in nbd request completion
Add in-flight cmds to the tail. That way while searching
(during request completion),we will always get a hit on the
first element.

Signed-off-by: Chetan Loke <loke.chetan@gmail.com>
Acked-by: Paul.Clements@steeleye.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-07-31 08:47:13 +02:00
Alex Elder 1fe5e99321 rbd: create rbd_refresh_helper()
Create a simple helper that handles the common case of calling
__rbd_refresh_header() while holding the ctl_mutex.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder b813623ab9 rbd: return obj version in __rbd_refresh_header()
Add a new parameter to __rbd_refresh_header() through which the
version of the header object is passed back to the caller.  In most
cases this isn't needed.  The main motivation is to normalize
(almost) all calls to __rbd_refresh_header() so they are all
wrapped immediately by mutex_lock()/mutex_unlock().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder ccece235d3 rbd: fixes in rbd_header_from_disk()
This fixes a few issues in rbd_header_from_disk():
    - There is a check intended to catch overflow, but it's wrong in
      two ways.
	- First, the type we don't want to overflow is size_t, not
	  unsigned int, and there is now a SIZE_MAX we can use for
	  use with that type.
	- Second, we're allocating the snapshot ids and snapshot
	  image sizes separately (each has type u64; on disk they
          grouped together as a rbd_image_header_ondisk structure).
	  So we can use the size of u64 in this overflow check.
    - If there are no snapshots, then there should be no snapshot
      names.  Enforce this, and issue a warning if we encounter a
      header with no snapshots but a non-zero snap_names_len.
    - When saving the snapshot names into the header, be more direct
      in defining the offset in the on-disk structure from which
      they're being copied by using "snap_count" rather than "i"
      in the array index.
    - If an error occurs, the "snapc" and "snap_names" fields are
      freed at the end of the function.  Make those fields be null
      pointers after they're freed, to be explicit that they are
      no longer valid.
    - Finally, move the definition of the local variable "i" to the
      innermost scope in which it's needed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 913d2fdcf6 rbd: always pass ops array to rbd_req_sync_op()
All of the callers of rbd_req_sync_op() except one pass a non-null
"ops" pointer.  The only one that does not is rbd_req_sync_read(),
which passes CEPH_OSD_OP_READ as its "opcode" and, CEPH_OSD_FLAG_READ
for "flags".

By allocating the ops array in rbd_req_sync_read() and moving the
special case code for the null ops pointer into it, it becomes
clear that much of that code is not even necessary.

In addition, the "opcode" argument to rbd_req_sync_op() is never
actually used, so get rid of that.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder d67d4be56a rbd: pass null version pointer in add_snap()
rbd_header_add_snap() passes the address of a version variable to
rbd_req_sync_exec(), but it ignores the result.  Just pass a null
pointer instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 57cfc1060f rbd: make rbd_create_rw_ops() return a pointer
Either rbd_create_rw_ops() will succeed, or it will fail because a
memory allocation failed.  Have it just return a valid pointer or
null rather than stuffing a pointer into a provided address and
returning an errno.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:46 -07:00
Alex Elder 4e891e0af0 rbd: have __rbd_add_snap_dev() return a pointer
It's not obvious whether the snapshot pointer whose address is
provided to __rbd_add_snap_dev() will be assigned by that function.
Change it to return the snapshot, or a pointer-coded errno in the
event of a failure.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:21:45 -07:00
Alex Elder 070c633f60 rbd: drop "object_name" from rbd_req_sync_unwatch()
rbd_req_sync_unwatch() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:54 -07:00
Alex Elder 7f0a24d855 rbd: drop "object_name" from rbd_req_sync_notify_ack()
rbd_req_sync_notify_ack() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:53 -07:00
Alex Elder 4cb162508a rbd: drop "object_name" from rbd_req_sync_notify()
rbd_req_sync_notify() only ever uses rbd_dev->header_name as the
value of its "object_name" parameter, and that value is available
within the function already.  So get rid of the parameter.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:53 -07:00
Alex Elder 0e6f322d55 rbd: drop "object_name" from rbd_req_sync_watch()
rbd_req_sync_watch() is only called in one place, and in that place
it passes rbd_dev->header_name as the value of the "object_name"
parameter.  This value is available within the function already.

Having the extra parameter leaves the impression the object name
could take on different values, but it does not.

So get rid of the parameter.  We can always add it back again if
we find we want to watch some other object in the future.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:52 -07:00
Alex Elder 14e7085d84 rbd: drop rbd_dev parameter in snap functions
Both rbd_register_snap_dev() and __rbd_remove_snap_dev() have
rbd_dev parameters that are unused.  Remove them.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:51 -07:00
Alex Elder ed63f4fd9a rbd: drop rbd_header_from_disk() gfp_flags parameter
The function rbd_header_from_disk() is only called in one spot, and
it passes GFP_KERNEL as its value for the gfp_flags parameter.

Just drop that parameter and substitute GFP_KERNEL everywhere within
that function it had been used.  (If we find we need the parameter
again in the future it's easy enough to add back again.)

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:50 -07:00
Alex Elder 9a5d690b08 rbd: snapc is unused in rbd_req_sync_read()
The "snapc" parameter to in rbd_req_sync_read() is not used, so
get rid of it.

Reported-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:49 -07:00
Alex Elder de71a2970d rbd: rename rbd_device->id
The "id" field of an rbd device structure represents the unique
client-local device id mapped to the underlying rbd image.  Each rbd
image will have another id--the image id--and each snapshot has its
own id as well.  The simple name "id" no longer conveys the
information one might like to have.

Rename the device "id" field in struct rbd_dev to be "dev_id" to
make it a little more obvious what we're dealing with without having
to think more about context.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:49 -07:00
Alex Elder 8e94af8e2b rbd: encapsulate header validity test
If an rbd image header is read and it doesn't begin with the
expected magic information, a warning is displayed.  This is
a fairly simple test, but it could be extended at some point.
Fix the comparison so it actually looks at the "text" field
rather than the front of the structure.

In any case, encapsulate the validity test in its own function.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:48 -07:00
Alex Elder bd919d45aa rbd: clean up a few dout() calls
There was a dout() call in rbd_do_request() that was reporting
the reporting the offset as the length and vice versa.  While
fixing that I did a quick scan of other dout() calls and fixed
a couple of other minor things.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:46 -07:00
Alex Elder a059329056 rbd: simplify __rbd_remove_all_snaps()
This just replaces a while loop with list_for_each_entry_safe()
in __rbd_remove_all_snaps().

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:45 -07:00
Alex Elder a66f8c97a3 rbd: drop extra header_rwsem init
In commit c666601a there was inadvertently added an extra
initialization of rbd_dev->header_rwsem.  This gets rid of the
duplicate.

Reported-by: Guangliang Zhao <gzhao@suse.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:45 -07:00
Alex Elder 9e15dc735a rbd: kill rbd_image_header->snap_seq
The snap_seq field in an rbd_image_header structure held the value
from the rbd image header when it was last refreshed.  We now
maintain this value in the snapc->seq field.  So get rid of the
other one.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:44 -07:00
Alex Elder 505cbb9bed rbd: set snapc->seq only when refreshing header
In rbd_header_add_snap() there is code to set snapc->seq to the
just-added snapshot id.  This is the only remnant left of the
use of that field for recording which snapshot an rbd_dev was
associated with.  That functionality is no longer supported,
so get rid of that final bit of code.

Doing so means we never actually set snapc->seq any more.  On the
server, the snapshot context's sequence value represents the highest
snapshot id ever issued for a particular rbd image.  So we'll make
it have that meaning here as well.  To do so, set this value
whenever the rbd header is (re-)read.  That way it will always be
consistent with the rest of the snapshot context we maintain.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:43 -07:00
Alex Elder 78dc447d3c rbd: preserve snapc->seq in rbd_header_set_snap()
In rbd_header_set_snap(), there is logic to make the snap context's
seq field get set to a particular snapshot id, or 0 if there is no
snapshot for the rbd image.

This seems to be an artifact of how the current snapshot id for an
rbd_dev was recorded before the rbd_dev->snap_id field began to be
used for that purpose.

There's no need to update the value of snapc->seq here any more, so
stop doing it.  Tidy up a few local variables in that function
while we're at it.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:42 -07:00
Alex Elder 75fe9e1981 rbd: don't use snapc->seq that way
In what appears to be an artifact of a different way of encoding
whether an rbd image maps a snapshot, __rbd_refresh_header() has
code that arranges to update the seq value in an rbd image's
snapshot context to point to the first entry in its snapshot
array if that's where it was pointing initially.

We now use rbd_dev->snap_id to record the snapshot id--using the
special value CEPH_NOSNAP to indicate the rbd_dev is not mapping a
snapshot at all.

There is therefore no need to check for this case, nor to update the
seq value, in __rbd_refresh_header().  Just preserve the seq value
that rbd_read_header() provides (which, at the moment, is nothing).

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 18:15:41 -07:00
Josh Durgin a71b891bc7 rbd: send header version when notifying
Previously the original header version was sent. Now, we update it
when the header changes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:40 -07:00
Josh Durgin d1d2564654 rbd: use reference counting for the snap context
This prevents a race between requests with a given snap context and
header updates that free it. The osd client was already expecting the
snap context to be reference counted, since it get()s it in
ceph_osdc_build_request and put()s it when the request completes.

Also remove the second down_read()/up_read() on header_rwsem in
rbd_do_request, which wasn't actually preventing this race or
protecting any other data.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:40 -07:00
Josh Durgin 93a24e084d rbd: set image size when header is updated
The image may have been resized.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:39 -07:00
Josh Durgin a51aa0c042 rbd: expose the correct size of the device in sysfs
If an image was mapped to a snapshot, the size of the head version
would be shown. Protect capacity with header_rwsem, since it may
change.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:38 -07:00
Josh Durgin 474ef7ce83 rbd: only reset capacity when pointing to head
Snapshots cannot be resized, and the new capacity of head should not
be reflected by the snapshot.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:37 -07:00
Josh Durgin e88a36ec96 rbd: return errors for mapped but deleted snapshot
When a snapshot is deleted, the OSD will return ENOENT when reading
from it. This is normally interpreted as a hole by rbd, which will
return zeroes. To minimize the time in which this can happen, stop
requests early when we are notified that our snapshot no longer
exists.

[elder@inktank.com: updated __rbd_init_snaps_header() logic]

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-07-30 18:15:36 -07:00
Alex Elder d1f57ea663 rbd: kill num_reply parameters
Several functions include a num_reply parameter, but it is never
used.  Just get rid of it everywhere--it seems to be something
that never got fully implemented.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:09 -07:00
Alex Elder 43ae470112 rbd: option symbol renames
Use the name "ceph_opts" consistently (rather than just "opt") for
pointers to a ceph_options structure.

Change the few spots that don't use "rbd_opts" for a rbd_options
pointer to match the rest.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:08 -07:00
Alex Elder aded07ea9f rbd: more symbol renames
Rename variables named "obj" which represent object names so they're
consistently named "object_name".

Rename the "cls" and "method" parameters in rbd_req_sync_exec()
to be "class_name" and "method_name", and make similar changes
to the names of local variables in that function representing
the lengths of those names.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:07 -07:00
Alex Elder 0bed54dc9a rbd: rename some fields in struct rbd_dev
An rbd image is not a single object, but a logical construct made up
of an aggregation of objects.

Rename some fields in struct rbd_dev, in hopes of reinforcing this.
    obj         --> image_name
    obj_len     --> image_name_len
    obj_md_name --> header_name

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:06 -07:00
Alex Elder 0ce1a79413 rbd: use rbd_dev consistently
Most variables that represent a struct rbd_device are named
"rbd_dev", but in some cases "dev" is used instead.  Change all the
"dev" references so they use "rbd_dev" consistently, to make it
clear from the name that we're working with an RBD device (as
opposed to, for example, a struct device).  Similarly, change the
name of the "dev" field in struct rbd_notify_info to be "rbd_dev".

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:05 -07:00
Alex Elder 820a5f3e94 rbd: dynamically allocate snapshot name
There is no need to impose a small limit the length of the snapshot
name recorded for an rbd image in a struct rbd_dev.  Remove the
limitation by allocating space for the snapshot name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:04 -07:00
Alex Elder bf3e5ae112 rbd: dynamically allocate image name
There is no need to impose a small limit the length of the rbd image
name recorded in a struct rbd_dev.  Remove the limitation by
allocating space for the image name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:04 -07:00
Alex Elder cb8627c76d rbd: dynamically allocate image header name
There is no need to impose a small limit the length of the header
name recorded for an rbd image in a struct rbd_dev.  Remove the
limitation by allocating space for the header name dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:03 -07:00
Alex Elder 849b4260d4 rbd: dynamically allocate object prefix
There is no need to impose a small limit the length of the object
prefix recorded for an rbd image in a struct rbd_image_header.
Remove the limitation by allocating space for the object prefix
dynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:02 -07:00
Alex Elder d22f76e703 rbd: dynamically allocate pool name
There is no need to impose a small limit the length of the pool name
recorded for an rbd image in a struct rbd_device.  Remove the
limitation by allocating space for the pool name ynamically.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:01 -07:00
Alex Elder 9bb2f334b9 rbd: create pool_id device attribute
Add an entry under /sys/bus/rbd/devices/<N>/ named "pool_id" that
provides the id for the pool the rbd image is assocatied with.  This
is in addition to the pool name already provided.

Rename the "poolid" field in struct rbd_device  to be "pool_id".

Update the documentation to reflect the addition of this new entry.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:30:00 -07:00
Alex Elder ca1e49a6af rbd: rename rbd_dev->block_name
Each rbd image has a name that forms the basis of all data objects
backing the device.  Old (format 1) images refer to this name as the
"block name," while new (format 2) images use the term "object
prefix" for this.

Change the field name in the in-core rbd image header structure to
reflect the more modern usage.  We intentionally keep the the name
"block_name" in the on-disk definition for format 1 image headers.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:59 -07:00
Alex Elder ea3352f4aa rbd: define dup_token()
Define a new function dup_token(), to be used during argument
parsing for making dynamically-allocated copies of tokens being
parsed.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:58 -07:00
Alex Elder ad4f232f28 rbd: drop a useless local variable
In rbd_req_sync_notify_ack(), a local variable was needlessly being
used to hold a null pointer.  Just pass NULL instead.

Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2012-07-30 09:29:56 -07:00
Jens Axboe 72ea1f74fc Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-3.6/drivers 2012-07-30 09:03:10 +02:00
Paolo Bonzini cd5d503862 virtio-blk: allow toggling host cache between writeback and writethrough
This patch adds support for the new VIRTIO_BLK_F_CONFIG_WCE feature,
which exposes the cache mode in the configuration space and lets the
driver modify it.  The cache mode is exposed via sysfs.

Even if the host does not support the new feature, the cache mode is
visible (thanks to the existing VIRTIO_BLK_F_WCE), but not modifiable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:52 +09:30
Asias He 2c95a32909 virtio-blk: Use block layer provided spinlock
Block layer will allocate a spinlock for the queue if the driver does
not provide one in blk_init_queue().

The reason to use the internal spinlock is that blk_cleanup_queue() will
switch to use the internal spinlock in the cleanup code path.

        if (q->queue_lock != &q->__queue_lock)
                q->queue_lock = &q->__queue_lock;

However, processes which are in D state might have taken the driver
provided spinlock, when the processes wake up, they would release the
block provided spinlock.

=====================================
[ BUG: bad unlock balance detected! ]
3.4.0-rc7+ #238 Not tainted
-------------------------------------
fio/3587 is trying to release lock (&(&q->__queue_lock)->rlock) at:
[<ffffffff813274d2>] blk_queue_bio+0x2a2/0x380
but there are no more locks to release!

other info that might help us debug this:
1 lock held by fio/3587:
 #0:  (&(&vblk->lock)->rlock){......}, at:
[<ffffffff8132661a>] get_request_wait+0x19a/0x250

Other drivers use block layer provided spinlock as well, e.g. SCSI.

Switching to the block layer provided spinlock saves a bit of memory and
does not increase lock contention. Performance test shows no real
difference is observed before and after this patch.

Changes in v2: Improve commit log as Michael suggested.

Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Asias He 483001c765 virtio-blk: Reset device after blk_cleanup_queue()
blk_cleanup_queue() will call blk_drian_queue() to drain all the
requests before queue DEAD marking. If we reset the device before
blk_cleanup_queue() the drain would fail.

1) if the queue is stopped in do_virtblk_request() because device is
full, the q->request_fn() will not be called.

blk_drain_queue() {
   while(true) {
      ...
      if (!list_empty(&q->queue_head))
        __blk_run_queue(q) {
	    if (queue is not stoped)
		q->request_fn()
	}
      ...
   }
}

Do no reset the device before blk_cleanup_queue() gives the chance to
start the queue in interrupt handler blk_done().

2) In commit b79d866c8b, We abort requests
dispatched to driver before blk_cleanup_queue(). There is a race if
requests are dispatched to driver after the abort and before the queue
DEAD mark. To fix this, instead of aborting the requests explicitly, we
can just reset the device after after blk_cleanup_queue so that the
device can complete all the requests before queue DEAD marking in the
drain process.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Asias He 02e2b12494 virtio-blk: Call del_gendisk() before disable guest kick
del_gendisk() might not return due to failing to remove the
/sys/block/vda/serial sysfs entry when another thread (udev) is
trying to read it.

virtblk_remove()
  vdev->config->reset() : guest will not kick us through interrupt
    del_gendisk()
      device_del()
        kobject_del(): got stuck, sysfs entry ref count non zero

sysfs_open_file(): user space process read /sys/block/vda/serial
   sysfs_get_active() : got sysfs entry ref count
      dev_attr_show()
        virtblk_serial_show()
           blk_execute_rq() : got stuck, interrupt is disabled
                              request cannot be finished

This patch fixes it by calling del_gendisk() before we disable guest's
interrupt so that the request sent in virtblk_serial_show() will be
finished and del_gendisk() will success.

This fixes another race in hot-unplug process.

It is save to call del_gendisk(vblk->disk) before
flush_work(&vblk->config_work) which might access vblk->disk, because
vblk->disk is not freed until put_disk(vblk->disk).

Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: stable@kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-07-30 13:30:51 +09:30
Keith Busch c7d36ab8fa NVMe: Fix uninitialized iod compiler warning
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-27 14:25:22 -04:00
Keith Busch a0cadb85b8 NVMe: Do not set IO queue depth beyond device max
Set the depth for IO queues to the device's maximum supported queue
entries if the requested depth exceeds the device's capabilities.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-27 13:57:23 -04:00
Keith Busch 8fc23e032d NVMe: Set block queue max sectors
Set the max hw sectors in a namespace's request queue if the nvme device
has a max data transfer size.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 13:43:20 -04:00
Keith Busch a42ceccef0 NVMe: use namespace id for nvme_get_features
The specification does not provide a use for command dword11 in the NVMe
Get Features command, but does use the NSID for some features.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:46:36 -04:00
Keith Busch 50af8baec4 NVMe: replace nvme_ns with nvme_dev for user admin
The function nvme_user_admin_command does not require a namespace to
proceed.  Replace with the nvme_dev structure so that it can be called
from contexts that do not have a namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:24:36 -04:00
Keith Busch 5c42ea1643 NVMe: Fix nvme module init when nvme_major is set
register_blkdev returns 0 when given a valid major number.

Reported-by:Ross Zwisler <ross.zwisler@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-26 12:23:38 -04:00
Keith Busch e9ef46369f NVMe: Set request queue logical block size
Sets the request queue logical block size with the block size of the
namespace.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-07-25 14:52:49 -04:00
Lars Ellenberg a73ff3231d drbd: announce FLUSH/FUA capability to upper layers
Unconditionally announce FLUSH/FUA to upper layers.
If the lower layers on either node do not actually support this,
generic_make_request() will deal with it.

If this causes performance regressions on your setup,
make sure there are no volatile caches involved,
and mount -o nobarrier or equivalent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 15:14:28 +02:00
Lars Ellenberg db141b2f42 drbd: fix max_bio_size to be unsigned
We capped our max_bio_size respectively max_hw_sectors with
min_t(int, lower level limit, our limit);
unfortunately, some drivers, e.g. the kvm virtio block driver, initialize their
limits to "-1U", and that is of course a smaller "int" value than our limit.

Impact: we started to request 16 MB resync requests,
which lead to protocol error and a reconnect loop.

Fix all relevant constants and parameters to be unsigned int.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 15:14:00 +02:00
Lars Ellenberg 7ee1fb93f3 drbd: flush drbd work queue before invalidate/invalidate remote
If you do back to back wait-sync/invalidate on a Primary in a tight loop,
during application IO load, you could trigger a race:
  kernel: block drbd6: FIXME going to queue 'set_n_write from StartingSync'
	but 'write from resync_finished' still pending?

Fix this by changing the order of the drbd_queue_work() and
the wake_up() in dec_ap_pending(), and adding the additional
drbd_flush_workqueue() before requesting the full sync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:15:58 +02:00
Lars Ellenberg c12e9c8964 drbd: fix potential access after free
Occasionally, if we disconnect, we triggered this assert:
  block drbd7: ASSERT FAILED tl_hash[27] == c30b0f04, expected NULL

hlist_del() happens only on master bio completion.

We used to wait for pending IO to complete before freeing tl_hash
on disconnect. We no longer do so, since we learned to "freeze"
IO on disconnect.

If the local disk is too slow, we may reach C_STANDALONE early,
and there are still some requests pending locally when we call
drbd_free_tl_hash().

If we now free the tl_hash, and later the local IO completion completes
the master bio, which then does hlist_del() and clobbers freed memory.

Do hlist_del_init() and hlist_add_fake() before kfree(tl_hash),
so the hlist_del() on master bio completion is harmless.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:15:16 +02:00
Lars Ellenberg 63a6d0bb3d drbd: call local-io-error handler early
In case we want to hard-reset from the local-io-error handler,
we need to call it before notifying the peer or aborting local IO.
Otherwise the peer will advance its data generation UUIDs even
if secondary.

This way, local io error looks like a "regular" node crash,
which reduces the number of different failure cases.
This may be useful in a bigger picture where crashed or otherwise
"misbehaving" nodes are automatically re-deployed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:10:41 +02:00
Lars Ellenberg 0029d62434 drbd: do not reset rs_pending_cnt too early
Fix asserts like
  block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !

We reset the resync lru cache and related information (rs_pending_cnt),
once we successfully finished a resync or online verify, or if the
replication connection is lost.

We also need to reset it if a resync or online verify is aborted
because a lower level disk failed.

In that case the replication link is still established,
and we may still have packets queued in the network buffers
which want to touch rs_pending_cnt.

We do not have any synchronization mechanism to know for sure when all
such pending resync related packets have been drained.

To avoid this counter to go negative (and violate the ASSERT that it
will always be >= 0), just do not reset it when we lose a disk.

It is good enough to make sure it is re-initialized before the next
resync can start: reset it when we re-attach a disk.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:09:53 +02:00
Lars Ellenberg 88437879fb drbd: reset congestion information before reporting it in /proc/drbd
We cache the congestion status in mdev->congestion_reason whenever
drbd_congested() was called.
Reset this cached info before reporting it when reading /proc/drbd.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:07:48 +02:00
Lars Ellenberg c2ba686f35 drbd: report congestion if we are waiting for some userland callback
If the drbd worker thread is synchronously waiting for some userland
callback, we don't want some casual pageout to block on us.
Have drbd_congested() report congestion in that case.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:07:18 +02:00
Lars Ellenberg 383606e0de drbd: differentiate between normal and forced detach
Aborting local requests (not waiting for completion from the lower level
disk) is dangerous: if the master bio has been completed to upper
layers, data pages may be re-used for other things already.
If local IO is still pending and later completes,
this may cause crashes or corrupt unrelated data.

Only abort local IO if explicitly requested.
Intended use case is a lower level device that turned into a tarpit,
not completing io requests, not even doing error completion.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:06:18 +02:00
Lars Ellenberg d264580145 drbd: cleanup, remove two unused global flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-07-24 14:02:41 +02:00
Jens Axboe b1af9be5ef Merge branch 'upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/floppy into for-3.6/drivers 2012-07-24 13:15:08 +02:00
Linus Torvalds 7100e505b7 Power management updates for 3.6
* ACPI conversion to PM handling based on struct dev_pm_ops.
 * Conversion of a number of platform drivers to PM handling based on struct
   dev_pm_ops and removal of empty legacy PM callbacks from a couple of PCI
   drivers.
 * Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 * cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti U Murthy.
 * cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 * Suspend and hibernate fixes from Srivatsa S. Bhat and Colin Cross.
 * Generic PM domains framework updates.
 * RTC CMOS wakeup signaling update from Paul Fox.
 * sparse warnings fixes from Sachin Kamat.
 * Build warnings fixes for the generic PM domains framework and PM sysfs code.
 * sysfs switch for printing device suspend times from Sameer Nanda.
 * Documentation fix from Oskar Schirmer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDF5eAAoJEKhOf7ml8uNsEaAP/2wg4faoOGob5A0/7tLqG3Cw
 xnTmGsfL7wG07Q8ykCL1BSlBb1VeJz8L6LTmUpaABI4M//oIBlcYQKyCE0Tat1AO
 9bJXFzK7qcHMhkTz6d6LDqtVzR3NGM3ypjZqj8aEXBov07LMR1AXvgNwXXhv25zM
 0unwrh1XNinBN3n+oaktpWk1YHUjsa5IMU+2tQJrocuHXcgK30vGXZVrZ4g9w1c2
 eS+ED1oKUqOYtFzIUX+aCtaDDheGaPlugk/GOtIB7Sae0s0vMlxH/T5ncB4SxRC+
 v3s4OykqQc5Dc8+0bNlBH7ykSVNB0PoQiyKDY67CxtH+q1xQSc9/f3XJqnGMaVDE
 17eZUZsL4qSyzRuCbPCGAgwBHmx3qNCMu1i1BcmnSxU+ikPUeCR7mYOP0mRThwPH
 OSfs+c/vZ+Ow6CwVE4UFrbm9Jve7ADnCrlZzT2m6XjhHGyjKP7SJlzP9TPsZ0LRk
 oxgQDYHmxbo50t9tBCz5L4ZTMKkDp28e78x84/CteP85srcW3GqDxrPyp2uzJu5O
 tvIEBvVlc4ucq8sG83RkugQwrG/2cQwG2HO9ERAwq01HHA1BYsuU3A961Jqf5CZo
 nFRSnByvVj/imPf47OWpDPAbVEs7jxufJuLEbPwGj1MkttTGDBIRu3zldXt2S6kP
 Q4qYU6fDaQQHFc90pqxQ
 =vC4/
 -----END PGP SIGNATURE-----

Merge tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:

 - ACPI conversion to PM handling based on struct dev_pm_ops.
 - Conversion of a number of platform drivers to PM handling based on
   struct dev_pm_ops and removal of empty legacy PM callbacks from a
   couple of PCI drivers.
 - Suspend-to-both for in-kernel hibernation from Bojan Smojver.
 - cpuidle fixes and cleanups from ShuoX Liu, Daniel Lezcano and Preeti
   Murthy.
 - cpufreq bug fixes from Jonghwa Lee and Stephen Boyd.
 - Suspend and hibernate fixes from Srivatsa Bhat and Colin Cross.
 - Generic PM domains framework updates.
 - RTC CMOS wakeup signaling update from Paul Fox.
 - sparse warnings fixes from Sachin Kamat.
 - Build warnings fixes for the generic PM domains framework and PM
   sysfs code.
 - sysfs switch for printing device suspend times from Sameer Nanda.
 - Documentation fix from Oskar Schirmer.

* tag 'pm-for-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
  cpufreq: Fix sysfs deadlock with concurrent hotplug/frequency switch
  EXYNOS: bugfix on retrieving old_index from freqs.old
  PM / Sleep: call early resume handlers when suspend_noirq fails
  PM / QoS: Use NULL pointer instead of plain integer in qos.c
  PM / QoS: Use NULL pointer instead of plain integer in pm_qos.h
  PM / Sleep: Require CAP_BLOCK_SUSPEND to use wake_lock/wake_unlock
  PM / Sleep: Add missing static storage class specifiers in main.c
  cpuilde / ACPI: remove time from acpi_processor_cx structure
  cpuidle / ACPI: remove usage from acpi_processor_cx structure
  cpuidle / ACPI : remove latency_ticks from acpi_processor_cx structure
  rtc-cmos: report wakeups from interrupt handler
  PM / Sleep: Fix build warning in sysfs.c for CONFIG_PM_SLEEP unset
  PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
  olpc-xo15-sci: Use struct dev_pm_ops for power management
  PM / Domains: Replace plain integer with NULL pointer in domain.c file
  PM / Domains: Add missing static storage class specifier in domain.c file
  PM / crypto / ux500: Use struct dev_pm_ops for power management
  PM / IPMI: Remove empty legacy PCI PM callbacks
  tpm_nsc: Use struct dev_pm_ops for power management
  tpm_tis: Use struct dev_pm_ops for power management
  ...
2012-07-22 13:36:52 -07:00
Theodore Ts'o 89c30f161c xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
With the changes in the random tree, IRQF_SAMPLE_RANDOM is now a
no-op; interrupt randomness is now collected unconditionally in a very
low-overhead fashion; see commit 775f4b297b.  The IRQF_SAMPLE_RANDOM
flag was scheduled to be removed in 2009 on the
feature-removal-schedule, so this patch is preparation for the final
removal of this flag.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
2012-07-19 10:39:25 -04:00
Rafael J. Wysocki bfaa07bc32 Merge branch 'pm-drivers'
* pm-drivers:
  rtc-cmos: report wakeups from interrupt handler
  PM / crypto / ux500: Use struct dev_pm_ops for power management
  PM / IPMI: Remove empty legacy PCI PM callbacks
  tpm_nsc: Use struct dev_pm_ops for power management
  tpm_tis: Use struct dev_pm_ops for power management
  tpm_atmel: Use struct dev_pm_ops for power management
  PM / TPM: Drop unused pm_message_t argument from tpm_pm_suspend()
  omap-rng: Use struct dev_pm_ops for power management
  mg_disk: Use struct dev_pm_ops for power management
  msi-laptop: Use struct dev_pm_ops for power management
  hdaps: Use struct dev_pm_ops for power management
  sonypi: Use struct dev_pm_ops for power management
  intel_mid_thermal: Use struct dev_pm_ops for power management
  acer-wmi: Use struct dev_pm_ops for power management
  intel_ips: Remove empty legacy PM callbacks
  thinkpad_acpi: Use struct dev_pm_ops instead of legacy PM routines
  thinkpad_acpi: Drop pm_message_t arguments from suspend routines
2012-07-19 00:03:42 +02:00
Dan Carpenter 6a3ca4f188 rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit 895cfcc810)
2012-07-17 21:30:31 -07:00
Yan, Zheng 236df3755d rbd: Fix ceph_snap_context size calculation
ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
(cherry picked from commit f9f9a19044)
2012-07-17 21:30:19 -07:00
Silva Paulo 68d740d79c blk: fix wrong idr_pre_get() error check in loop.c
The idr_pre_get() function never returns a value < 0.  It returns 0 (no
memory) or 1 (OK).

Reported-by: Silva Paulo <psdasilva@yahoo.com>
[ Rewrote Silva's patch, but attributing it to Silva anyway  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-14 15:39:58 -07:00
Rafael J. Wysocki 156ffcb42a mg_disk: Use struct dev_pm_ops for power management
Make the mg_disk driver define its PM callbacks through
a struct dev_pm_ops object rather than by using legacy PM hooks
in struct platform_driver.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-07-06 19:07:00 +02:00
Andi Kleen 0cc15d03bc floppy: Run floppy initialization asynchronous
floppy_init is quite slow, 3s on my test system to determine
that there is no floppy. Run it asynchronous to the other
init calls to improve boot time.

[jkosina@suse.cz: fix modular build]

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-07-04 14:01:40 +02:00
Linus Torvalds dab058fd5f floppy: cancel any pending fd_timeouts before adding a new one
In commit 070ad7e793 ("floppy: convert to delayed work and
single-thread wq") the 'fd_timeout' timer was converted to a delayed
work.  However, the "del_timer(&fd_timeout)" was lost in the process,
and any previous pending timeouts would stay active when we then
re-queued the timeout.

This resulted in the floppy probe sequence having a (stale) 20s timeout
rather than the intended 3s timeout, and thus made booting with the
floppy driver (but no actual floppy controller) take much longer than it
should.

Of course, there's little reason for most people to compile the floppy
driver into the kernel at all, which is why most people never noticed.

Canceling the delayed work where we used to do the del_timer() fixes the
issue, and makes the floppy probing use the proper new timeout instead.
The three second timeout is still very wasteful, but better than the 20s
one.

Reported-and-tested-by: Andi Kleen <ak@linux.intel.com>
Reported-and-tested-by: Calvin Walton <calvin.walton@kepstin.ca>
Cc: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-03 15:51:22 -07:00
Sage Weil 9a64e8e0ac Linux 3.5-rc1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iQEcBAABAgAGBQJPyr4LAAoJEHm+PkMAQRiGhvMH/1uXaDmJiiyAtMhC9kQbLclK
 5RpUOV+ukRrPXBJhwWGEZvC9G/DiWAfZ/19Ee6qTGZbA46yxkgZklqO+bw7fuOLH
 dPf4MNXdhgOgbs0KkVAk6aXIYzIU836pcYg+LcapG8E8SZp3SWbJzrVbUPFwPM+m
 Sv11ZcpJfM2HH9wFRdKErUOiZHsMY+LZHcw0nx+BObytjgzBbzHNkpF57F714TUO
 QplYpIToO3XtGhIM1yRDxww+2zFlVNsCZ8IC57EDbLb8BMZWuyZoFgWZqLAnrU0u
 vy7CHLledMSvs855juJ9JxGo/EDnfwJpCnjmcp8BY+h4b5T/k5mGK6d9aeXYRf4=
 =CcWn
 -----END PGP SIGNATURE-----

Merge tag 'v3.5-rc1'

Linux 3.5-rc1

Conflicts:
	net/ceph/messenger.c
2012-06-15 12:32:04 -07:00
Jens Axboe 6d407cfaf5 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-linus 2012-06-13 21:19:42 +02:00
Jens Axboe 987751719b Merge branch 'stable/for-jens-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus 2012-06-13 21:19:06 +02:00
Tao Guo 32587371ad umem: fix up unplugging
Fix a regression introduced by 7eaceaccab ("block: remove per-queue
plugging").  In that patch, Jens removed the whole mm_unplug_device()
function, which used to be the trigger to make umem start to work.

We need to implement unplugging to make umem start to work, or I/O will
never be triggered.

Signed-off-by: Tao Guo <Tao.Guo@emc.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Shaohua Li <shli@kernel.org>
Cc: <stable@vger.kernel.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-13 21:17:21 +02:00
Lars Ellenberg 0d5934e3c2 drbd: fix null pointer dereference with on-congestion policy when diskless
We must not look at mdev->actlog, unless we have a get_ldev() reference.
It also does not make much sense to try to disconnect or pull-ahead of
the peer, if we don't have good local data.

Only even consider congestion policies, if our local disk is D_UP_TO_DATE.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:35:19 +02:00
Lars Ellenberg 1ed25b269e drbd: fix list corruption by failing but already aborted reads
If a read is aborted due to force-detach of a supposedly unresponsive
local backing device, and retried on the peer, it can happen that the
local request later still completes (hopefully with an error).
As it may already have been completed to upper layers meanwhile,
it must not be retried again now.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:34:51 +02:00
Lars Ellenberg 4eccc57979 drbd: fix access of unallocated pages and kernel panic
BUG: unable to handle kernel NULL pointer dereference at (null)
...
 [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
 [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]

This fixes an off-by-one error in the receive_bitmap() path,
if run-length encoded bitmap transfer is enabled.

If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
size), and bitmap compression (use-rle) is enabled (which became default
with 8.4), and the very last bit is dirty and reported in an rle
comressed bitmap packet, we ended up trying to kmap_atomic a page pointer
that does not exist (bitmap->bm_pages[last index + 1]).

bug introduced by:
    Date:   Fri Jul 24 15:33:24 2009 +0200
    set bits: optimize for complete last word, fix off-by-one-word corner case

made effective by:
    Date:   Thu Dec 16 00:32:38 2010 +0100
    drbd: get rid of unused debug code

    Long time ago, we had paranoia code in the bitmap that allocated one
    extra word, assigned a magic value, and checked on every occasion that
    the magic value was still unchanged.

    That debug code is unused, the extra long word complicates code a bit.
    Get rid of it.

No-one triggered this bug in the last few years, because a large subset
of our userbase is unaffected:
 * typically the last few blocks of a device are not modified
   frequently, and remain unset
 * use-rle was disabled by default in drbd < 8.4
 * those with slightly "odd" device sizes, or
 * drbd internal meta data (which will skew the device size slightly,
   thus makes it harder to have a bug relevant device size)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-06-12 14:32:48 +02:00
Konrad Rzeszutek Wilk 6878c32e5c xen/blkfront: Add WARN to deal with misbehaving backends.
Part of the ring structure is the 'id' field which is under
control of the frontend. The frontend stamps it with "some"
value (this some in this implementation being a value less
than BLK_RING_SIZE), and when it gets a response expects
said value to be in the response structure. We have a check
for the id field when spolling new requests but not when
de-spolling responses.

We also add an extra check in add_id_to_freelist to make
sure that the 'struct request' was not NULL - as we cannot
pass a NULL to __blk_end_request_all, otherwise that crashes
(and all the operations that the response is dealing with
end up with __blk_end_request_all).

Lastly we also print the name of the operation that failed.

[v1: s/BUG/WARN/ suggested by Stefano]
[v2: Add extra check in add_id_to_freelist]
[v3: Redid op_name per Jan's suggestion]
[v4: add const * and add WARN on failure returns]
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-06-12 08:29:04 -04:00
Dan Carpenter 895cfcc810 rbd: endian bug in rbd_req_cb()
Sparse complains about this because:
drivers/block/rbd.c:996:20: warning: cast to restricted __le32
drivers/block/rbd.c:996:20: warning: cast from restricted __le16

These are set in osd_req_encode_op() and they are le16.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:54 -05:00
Yan, Zheng f9f9a19044 rbd: Fix ceph_snap_context size calculation
ceph_snap_context->snaps is an u64 array

Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
2012-06-06 09:23:53 -05:00
Asai Thambi S P 7b421d24ea mtip32xx: Create debugfs entries for troubleshooting
On module load, creates a debugfs parent 'rssd' in debugfs root. Then for each
device, create a new node with corresponding disk name. Under the new node, two
entries 'registers' and 'flags' are created.

NOTE: These entries were removed from sysfs in the previous patch

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-05 09:13:49 +02:00
Asai Thambi S P 7412ff139d mtip32xx: Remove 'registers' and 'flags' from sysfs
This patch removes entries 'registers' and 'flags' from sysfs. Updated ABI file
to reflect this change.

Reported-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-06-05 09:13:48 +02:00
Sachin Kamat 87c9ea76a2 mtip32xx: Remove version.h header file inclusion
version.h header file inclusion is no longer required.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
2012-06-04 10:00:32 +02:00
Asai Thambi S P b77874c969 mtip32xx: Changes to sysfs entries
* Formatted the output of 'registers' entry
* Added "Commands in Q' to output of 'registers' entry
* Added a new entry 'flags'

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 8ce800935d mtip32xx: Convert macro definitions for flag bits to enum
Convert macro definitions for flags bits to enum

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 377b8fc6d7 mtip32xx: minor performance tweak
When checking for command completions if the register value is zero, proceed
to next register.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P e602878fd8 mtip32xx: Fix to support more than one sector in exec_drive_command()
Fix to support more than one sector in exec_drive_command().

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 0a07ab224a mtip32xx: Use plain spinlock for 'cmd_issue_lock'
'cmd_issue_lock' is for only acquiring a free slot, and it is not used
in interrupt context. So replaced irq version with non-irq version of spinlock.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P 6c8ab69818 mtip32xx: Set block queue boundary variables
Set the following block queue boundary variables
	* max_hw_sectors
	* max_segment_size

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>

Removed setting of q->nr_requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:46:50 +02:00
Asai Thambi S P d02e1f0ad0 mtip32xx: Fix to handle TFE for PIO(IOCTL/internal) commands
If a PIO (IOCTL/internal) command resulted in TFE, signal the wait event or break out of polling.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 971890f258 mtip32xx: Change HDIO_GET_IDENTITY to return stored data
For the ioctl command HDIO_GET_IDENTITY, return the stored copy of IDENTIFY
DATA instead of sending the command to the device - similar to libata.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 2df7aa96e7 mtip32xx: Set custom timeouts for PIO commands
This change sets custom timeouts depending on PIO command.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Asai Thambi S P 6bb688c048 mtip32xx: fix clearing an incorrect register in mtip_init_port
Fix clearing an incorrect register in mtip_init_port

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-31 08:36:55 +02:00
Konrad Rzeszutek Wilk 8c9ce606a6 xen/blkback: Copy id field when doing BLKIF_DISCARD.
We weren't copying the id field so when we sent the response
back to the frontend (especially with a 64-bit host and 32-bit
guest), we ended up using a random value. This lead to the
frontend crashing as it would try to pass to __blk_end_request_all
a NULL 'struct request' (b/c it would use the 'id' to find the
proper 'struct request' in its shadow array) and end up crashing:

BUG: unable to handle kernel NULL pointer dereference at 000000e4
IP: [<c0646d4c>] __blk_end_request_all+0xc/0x40
.. snip..
EIP is at __blk_end_request_all+0xc/0x40
.. snip..
 [<ed95db72>] blkif_interrupt+0x172/0x330 [xen_blkfront]

This fixes the bug by passing in the proper id for the response.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=824641

CC: stable@kernel.org
Tested-by: William Dauchy <wdauchy@gmail.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-30 17:20:04 -04:00
Linus Torvalds af56e0aa35 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "There are some updates and cleanups to the CRUSH placement code, a bug
  fix with incremental maps, several cleanups and fixes from Josh Durgin
  in the RBD block device code, a series of cleanups and bug fixes from
  Alex Elder in the messenger code, and some miscellaneous bounds
  checking and gfp cleanups/fixes."

Fix up trivial conflicts in net/ceph/{messenger.c,osdmap.c} due to the
networking people preferring "unsigned int" over just "unsigned".

* git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (45 commits)
  libceph: fix pg_temp updates
  libceph: avoid unregistering osd request when not registered
  ceph: add auth buf in prepare_write_connect()
  ceph: rename prepare_connect_authorizer()
  ceph: return pointer from prepare_connect_authorizer()
  ceph: use info returned by get_authorizer
  ceph: have get_authorizer methods return pointers
  ceph: ensure auth ops are defined before use
  ceph: messenger: reduce args to create_authorizer
  ceph: define ceph_auth_handshake type
  ceph: messenger: check return from get_authorizer
  ceph: messenger: rework prepare_connect_authorizer()
  ceph: messenger: check prepare_write_connect() result
  ceph: don't set WRITE_PENDING too early
  ceph: drop msgr argument from prepare_write_connect()
  ceph: messenger: send banner in process_connect()
  ceph: messenger: reset connection kvec caller
  libceph: don't reset kvec in prepare_write_banner()
  ceph: ignore preferred_osd field
  ceph: fully initialize new layout
  ...
2012-05-30 11:17:19 -07:00
Linus Torvalds a70f35af4e Merge branch 'for-3.5/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "Here are the driver related changes for 3.5.  It contains:

   - The floppy changes from Jiri.  Jiri is now also marked as the
     maintainer of floppy.c, I shall be publically branding his forehead
     with red hot iron at the next opportune moment.

   - A batch of drbd updates and fixes from the linbit crew, as well as
     fixes from others.

   - Two small fixes for xen-blkfront courtesy of Jan."

* 'for-3.5/drivers' of git://git.kernel.dk/linux-block: (70 commits)
  floppy: take over maintainership
  floppy: remove floppy-specific O_EXCL handling
  floppy: convert to delayed work and single-thread wq
  xen-blkfront: module exit handling adjustments
  xen-blkfront: properly name all devices
  drbd: grammar fix in log message
  drbd: check MODULE for THIS_MODULE
  drbd: Restore the request restart logic
  drbd: introduce a bio_set to allocate housekeeping bios from
  drbd: remove unused define
  drbd: bm_page_async_io: properly initialize page->private
  drbd: use the newly introduced page pool for bitmap IO
  drbd: add page pool to be used for meta data IO
  drbd: allow bitmap to change during writeout from resync_finished
  drbd: fix race between drbdadm invalidate/verify and finishing resync
  drbd: fix resend/resubmit of frozen IO
  drbd: Ensure that data_size is not 0 before using data_size-1 as index
  drbd: Delay/reject other state changes while establishing a connection
  drbd: move put_ldev from __req_mod() to the endio callback
  drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
  ...
2012-05-30 09:05:47 -07:00
Linus Torvalds 99262a3daf Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPuv35AAoJENkgDmzRrbjxUx4P/0uc+0oNnZv11vYQsqHuhURa
 zMlsVdlXGVkvPqQiLY0QkrK5LcO6KiSnSk8vEnOYFIPjL4wNqL/4RRRLnTAJwmE+
 lsrL9DblI8Ira/EZRv7d2L12QrP+F2ZGKOZr67uVxSaxH71fUqtiJ0jqA/I8AYH7
 /V7+DgdIB1DD28Ya/JEFEUi41F08A6MU10hpaQWy9kXv09gCc9apgvH7/S3s9DaQ
 G640YWkoKZAx/OFBb8XFvpu9LqZcVl02Nl8goMZOKnMctC4iU3km7HeVjfwCgLjO
 AdA5spLMhDkS/xrpI0mSQ/wT0k0+sSYW5vEdW9N4XLZza0NgH9GfU4RtEuK85Slj
 7bPviZOcpjtt0sGi4wXCaVjZyHROX6tyRvTMUAIj3D0oJglb5T9D3MCvQnadILb0
 I0+7gk3d9rHqkO6CmjNaZG9IwR9NpFkbuolcFQuEaZoUMoKd2pYNQyxpbFGl+jCl
 7ViFHAy+fydNqDoETKincld4A43KWxOV7jyEJd7hloKcCixsqI7ZdPS7X8amec72
 a0hfNgMJzarZkTgo61Hair/d+vKGRJPgEdF1Yq76SDhYKD1TeWeDjmboctsiMjqe
 f5M4C6IdNJj9cDIlCxMk+3bX250oy7KG77v7Ux0/7nvtSWVa3yEMowD57hnn1But
 0gNC8bjXDHRsho90rDRN
 =Kj9v
 -----END PGP SIGNATURE-----

Merge tag 'virtio-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus

Pull virtio updates from Rusty Russell.

* tag 'virtio-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  virtio: fix typo in comment
  virtio-mmio: Devices parameter parsing
  virtio_blk: Drop unused request tracking list
  virtio-blk: Fix hot-unplug race in remove method
  virtio: Use ida to allocate virtio index
  virtio: balloon: separate out common code between remove and freeze functions
  virtio: balloon: drop restore_common()
  9p: disconnect channel when PCI device is removed
  virtio: update documentation to v0.9.5 of spec
2012-05-21 20:20:23 -07:00
Asias He f65ca1dc6a virtio_blk: Drop unused request tracking list
Benchmark shows small performance improvement on fusion io device.

Before:
  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec

After:
  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-05-22 12:16:14 +09:30
Asias He b79d866c8b virtio-blk: Fix hot-unplug race in remove method
If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Test:
~1000 rounds of hot-plug/hot-unplug test passed with this patch.

Changes in v3:
- Drop blk_abort_queue and blk_abort_request
- Use __blk_end_request_all to complete request dispatched to driver

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He <asias@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-05-22 12:16:13 +09:30
David S. Miller 17eea0df5f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-20 21:53:04 -04:00
Linus Torvalds 14e931a264 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A few small, but important fixes.  Most of them are marked for stable
  as well

   - Fix failure to release a semaphore on error path in mtip32xx.
   - Fix crashable condition in bio_get_nr_vecs().
   - Don't mark end-of-disk buffers as mapped, limit it to i_size.
   - Fix for build problem with CONFIG_BLOCK=n on arm at least.
   - Fix for a buffer overlow on UUID partition printing.
   - Trivial removal of unused variables in dac960."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix buffer overflow when printing partition UUIDs
  Fix blkdev.h build errors when BLOCK=n
  bio allocation failure due to bio_get_nr_vecs()
  block: don't mark buffers beyond end of disk as mapped
  mtip32xx: release the semaphore on an error path
  dac960: Remove unused variables from DAC960_CreateProcEntries()
2012-05-19 10:12:17 -07:00
Jens Axboe 4fd1ffaa12 Merge branch 'for-jens' of git://git.drbd.org/linux-drbd into for-3.5/drivers
Philipp writes:

This are the updates we have in the drbd-8.3 tree. They are intended
for your "for-3.5/drivers" drivers branch.

These changes include one new feature:
 * Allow detach from frozen backing devices with the new --force option;
   configurable timeout for backing devices by the new disk-timeout option

And huge number of bug fixes:
 * Fixed a write ordering problem on SyncTarget nodes for a write
   to a block that gets resynced at the same time. The bug can
   only be triggered with a device that has a firmware that
   actually reorders writes to the same block
 * Fixed a race between disconnect and receive_state, that could cause
   a IO lockup
 * Fixed resend/resubmit for requests with disk or network timeout
 * Make sure that hard state changed do not disturb the connection
   establishing process (I.e. detach due to an IO error). When the
   bug was triggered it caused a retry in the connect process
 * Postpone soft state changes to no disturb the connection
   establishing process (I.e. becoming primary). When the bug
   was triggered it could cause both nodes going into SyncSource state
 * Fixed a refcount leak that could cause failures when trying to
   unload a protocol family modules, that was used by DRBD
 * Dedicated page pool for meta data IOs
 * Deny normal detach (as opposed to --forced) if the user tries
   to detach from the last UpToDate disk in the resource
 * Fixed a possible protocol error that could be caused by
   "unusual" BIOs.
 * Enforce the disk-timeout option also on meta-data IO operations
 * Implemented stable bitmap pages when we do a full write out of
   the bitmap
 * Fixed a rare compatibility issue with DRBD's older than 8.3.7
   when negotiating the bio_size
 * Fixed a rare race condition where an empty resync could stall with
   if pause/unpause events happen in parallel
 * Made the re-establishing of connections quicker, if it got a broken pipe
   once. Previously there was a bug in the code caused it to waste the first
   successful established connection after a broken pipe event.

PS: I am postponing the drbd-8.4 for mainline for one or two kernel
    development cycles more (the ~400 patchets set).
2012-05-18 16:20:06 +02:00
Jens Axboe 13828dec45 Merge branch 'stable/for-jens-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.5/drivers
Konrad writes:

Please git pull the following branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git stable/for-jens-3.5

in your for-3.5/drivers branch. The changes in it are rather simple - cleaning
up some code and adding proper mechanism to unload without leaking memory.
2012-05-18 16:17:41 +02:00
Jiri Kosina bfa10b8c98 floppy: remove floppy-specific O_EXCL handling
Block layer now handles O_EXCL in a generic way for block devices.

The semantics is however different for floppy and all other block devices,
as floppy driver contains its own O_EXCL handling.

The semantics for all-but-floppy bdevs is "there can be at most one O_EXCL
open of this file", while for floppy bdev the semantics is "if someone has
the bdev open with O_EXCL, noone else can open it".

There is actual userspace-observable change in behavior because of this
since commit e525fd89d3 ("block: make blkdev_get/put() handle exclusive
access") -- on kernels containing this commit, mount of /dev/fd0 causes
the fd0 block device be claimed with _EXCL, preventing subsequent
open(/dev/fd0).

Bring things back into shape, i.e.  make it possible, analogically to
other block devices, to mount the floppy and open() it afterwards --
remove the floppy-specific handling and let the generic bdev code O_EXCL
handling take over.

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-05-18 15:19:11 +02:00
Jiri Kosina 070ad7e793 floppy: convert to delayed work and single-thread wq
There are several races in floppy driver between bottom half
(scheduled_work) and timers (fd_timeout, fd_timer). Due to slowness
of the actual floppy devices, those races are never (at least to my
knowledge) triggered on a bare floppy metal. However on virtualized
(emulated) floppy drives, which are of course magnitudes faster
than the real ones, these races trigger reliably. They usually exhibit
themselves as NULL pointer dereferences during DMA setup, such as

	BUG: unable to handle kernel NULL pointer dereference at 0000000a
	[ ... snip ... ]
	EIP: 0060:[<c02053d5>] EFLAGS: 00010293 CPU: 0
	EAX: ffffe000 EBX: 0000000a ECX: 00000000 EDX: 0000000a
	ESI: c05d2718 EDI: 00000000 EBP: 00000000 ESP: f540fe44
	 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
	Process swapper (pid: 0, ti=f540e000 task=c082d5a0 task.ti=c0826000)
	Stack:
	 ffffe000 00001ffc 00000000 00000000 00000000 c05d2718 c0708b40 f540fe80
	 c020470f c05d2718 c0708b40 00000000 f540fe80 0000000a f540fee4 00000000
	 c0708b40 f540fee4 00000000 00000000 c020526b 00000000 c05d2718 c0708b40
	Call Trace:
	 [<c020470f>] dump_trace+0xaf/0x110
	 [<c020526b>] show_trace_log_lvl+0x4b/0x60
	 [<c0205298>] show_trace+0x18/0x20
	 [<c05c5811>] dump_stack+0x6d/0x72
	 [<c0248527>] warn_slowpath_common+0x77/0xb0
	 [<c02485f3>] warn_slowpath_fmt+0x33/0x40
	 [<f7ec593c>] setup_DMA+0x14c/0x210 [floppy]
	 [<f7ecaa95>] setup_rw_floppy+0x105/0x190 [floppy]
	 [<c0256d08>] run_timer_softirq+0x168/0x2a0
	 [<c024e762>] __do_softirq+0xc2/0x1c0
	 [<c02042ed>] do_softirq+0x7d/0xb0
	 [<f54d8a00>] 0xf54d89ff

but other instances can be easily seen as well. This can be observed at least under
VMWare, VirtualBox and KVM.

This patch converts all the timers and bottom halfs to be processed in a single
workqueue. This aproach has been already discussed back in 2010 if I remember
correctly, and Acked by Linus [1], but it then never made it to the tree.

This all is based on original idea and code of Stephen Hemminger.  I have
ported original Stepen's code to the current state of the floppy driver, and
performed quite some testing (on real hardware), which didn't reveal any issues
(this includes not only writing and reading data, but also formatting
(unfortunately I didn't find any Double-Density disks any more)). Ability to
handle errors properly (supplying known bad floppies) has also been verified.

[1] http://kerneltrap.org/mailarchive/linux-kernel/2010/6/11/4582092

Based-on-patch-by: Stephen Hemminger <shemminger@vyatta.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-05-18 15:19:10 +02:00
David S. Miller 028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Josh Durgin 263c6ca007 rbd: rename __rbd_update_snaps to __rbd_refresh_header
This function rereads the entire header and handles any changes in
it, not just changes in snapshots.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:13:09 -05:00
Josh Durgin 3591538fb2 rbd: fix snapshot size type
Snapshot sizes should be the same type as regular image sizes. This
only affects their displayed size in sysfs, not the reported size of
an actual block device sizes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:13:03 -05:00
Josh Durgin b06e6a6be7 rbd: remove conditional snapid parameters
The snapid parameters passed to rbd_do_op() and rbd_req_sync_op()
are now always either a valid snapid or an explicit CEPH_NOSNAP.

[elder@dreamhost.com: Rephrased the description]

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:58 -05:00
Josh Durgin 77dfe99fe3 rbd: store snapshot id instead of index
When a device was open at a snapshot, and snapshots were deleted or
added, data from the wrong snapshot could be read. Instead of
assuming the snap context is constant, store the actual snap id when
the device is initialized, and rely on the OSDs to signal an error
if we try reading from a snapshot that was deleted.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:52 -05:00
Josh Durgin 403f24d3d5 rbd: protect read of snapshot sequence number
This is updated whenever a snapshot is added or deleted, and the
snapc pointer is changed with every refresh of the header.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2012-05-14 12:12:46 -05:00
Xi Wang 50f7c4c967 rbd: fix integer overflow in rbd_header_from_disk()
ondisk->snap_count is read from disk via rbd_req_sync_read() and thus
needs validation.  Otherwise, a bogus `snap_count' could overflow the
kmalloc() size, leading to memory corruption.

Also use `u32' consistently for `snap_count'.

[elder@dreamhost.com: changed to use UINT_MAX rather than ULONG_MAX]

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-05-14 12:12:41 -05:00
Dan Carpenter f8ad495a8a rbd: use gfp_flags parameter in rbd_header_from_disk()
We should use the gfp_flags that the caller specified instead of
GFP_KERNEL here.

There is only one caller and it uses GFP_KERNEL, so this change is
just a cleanup and doesn't change how the code works.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-05-14 12:12:35 -05:00
Jan Beulich 8605067fb9 xen-blkfront: module exit handling adjustments
The blkdev major must be released upon exit, or else the module can't
attach to devices using the same majors upon being loaded again. Also
avoid leaking the minor tracking bitmap.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-11 16:11:54 -04:00
Jan Beulich e77c78c022 xen-blkfront: properly name all devices
- devices beyond xvdzz didn't get proper names assigned at all
- extended devices with minors not representable within the kernel's
  major/minor bit split spilled into foreign majors

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-05-11 16:11:52 -04:00
Asai Thambi S P a09ba13eef mtip32xx: release the semaphore on an error path
Release the semaphore in an error path in mtip_hw_get_scatterlist(). This
fixes the smatch warning inconsistent returns.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Jesper Juhl d88a440edd dac960: Remove unused variables from DAC960_CreateProcEntries()
The variables 'StatusProcEntry' and 'UserCommandProcEntry' are
assigned to once and then never used. This patch gets rid of the
variables.

While I was there I also fixed the indentation of the function to use
tabs rather than spaces for the lines that did not already do so.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Eric W. Biederman 38bf195398 connector/userns: replace netlink uses of cap_raised() with capable()
In 2009 Philip Reiser notied that a few users of netlink connector
interface needed a capability check and added the idiom
cap_raised(nsp->eff_cap, CAP_SYS_ADMIN) to a few of them, on the premise
that netlink was asynchronous.

In 2011 Patrick McHardy noticed we were being silly because netlink is
synchronous and removed eff_cap from the netlink_skb_params and changed
the idiom to cap_raised(current_cap(), CAP_SYS_ADMIN).

Looking at those spots with a fresh eye we should be calling
capable(CAP_SYS_ADMIN).  The only reason I can see for not calling capable
is that it once appeared we were not in the same task as the caller which
would have made calling capable() impossible.

In the initial user_namespace the only difference between between
cap_raised(current_cap(), CAP_SYS_ADMIN) and capable(CAP_SYS_ADMIN) are a
few sanity checks and the fact that capable(CAP_SYS_ADMIN) sets
PF_SUPERPRIV if we use the capability.

Since we are going to be using root privilege setting PF_SUPERPRIV seems
the right thing to do.

The motivation for this that patch is that in a child user namespace
cap_raised(current_cap(),...) tests your capabilities with respect to that
child user namespace not capabilities in the initial user namespace and
thus will allow processes that should be unprivielged to use the kernel
services that are only protected with cap_raised(current_cap(),..).

To fix possible user_namespace issues and to just clean up the code
replace cap_raised(current_cap(), CAP_SYS_ADMIN) with
capable(CAP_SYS_ADMIN).

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Acked-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10 23:21:39 -04:00
Lars Ellenberg 92b4ca291f drbd: grammar fix in log message
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-10 12:00:56 +02:00
Cong Wang bc4854bc91 drbd: check MODULE for THIS_MODULE
THIS_MODULE is NULL only when drbd is compiled as built-in,
so the #ifdef CONFIG_MODULES should be #ifdef MODULE instead.

This fixes the warning:

drivers/block/drbd/drbd_main.c: In function ‘drbd_buildtag’:
drivers/block/drbd/drbd_main.c:4187:24: warning: the comparison will always evaluate as ‘true’ for the address of ‘__this_module’ will never be NULL [-Waddress]

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-10 12:00:54 +02:00
Philipp Reisner f6d0a8dbfd drbd: Restore the request restart logic
It got lost with the commit 5a7bbad27a
"block: remove support for bio remapping from ->make_request"

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 17:20:59 +02:00
Lars Ellenberg 9476f39d66 drbd: introduce a bio_set to allocate housekeeping bios from
Don't rely on availability of bios from the global fs_bio_set,
we should use our own bio_set for meta data IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:07 +02:00
Lars Ellenberg 3c2f7a856f drbd: remove unused define
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:06 +02:00
Arne Redlich 0c7db27920 drbd: bm_page_async_io: properly initialize page->private
If bm_page_async_io is advised to use a new page for I/O
(BM_AIO_COPY_PAGES is set), it will get it from a mempool.
Once the mempool has to dip into its reserves the page is
not reinitialized, i.e. page->private contains garbage, which
will lead to various problems once the I/O completes (dereferences
of NULL pointers, the submitting thread getting stuck in D-state,
 ...).

Signed-off-by: Arne Redlich <arne.redlich@googlemail.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2012-05-09 15:17:04 +02:00
Lars Ellenberg 4d95a10f97 drbd: use the newly introduced page pool for bitmap IO
Conflicts:

	drbd/drbd_bitmap.c

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:03 +02:00
Lars Ellenberg 4281808fb3 drbd: add page pool to be used for meta data IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:02 +02:00
Lars Ellenberg 0e8488ade2 drbd: allow bitmap to change during writeout from resync_finished
Symptom: messages similar to
 "FIXME asender in bm_change_bits_to,
  bitmap locked for 'write from resync_finished' by worker"

If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered.  If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.

To fix this, introduce the drbd_bm_write_copy_pages() variant.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:17:00 +02:00
Lars Ellenberg a574daf5d7 drbd: fix race between drbdadm invalidate/verify and finishing resync
When a resync or online verify is finished or aborted,
drbd does a bulk write-out of changed bitmap pages.

If *in that very moment* a new verify or resync is triggered,
this can race:
 ASSERT( !test_bit(BITMAP_IO, &mdev->flags) ) in drbd_main.c
 FIXME going to queue 'set_n_write from StartingSync' but 'write from resync_finished' still pending?
and similar.

This can be observed with e.g. tight invalidate loops in test scripts,
and probably has no real-life implication.

Still, that race can be solved by first quiescen the device,
before starting a new resync or verify.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:59 +02:00
Lars Ellenberg ba280c092e drbd: fix resend/resubmit of frozen IO
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).

Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".

Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit
  drbd: Implemented real timeout checking for request processing time
if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.

This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:58 +02:00
Philipp Reisner 5de738272e drbd: Ensure that data_size is not 0 before using data_size-1 as index
This could be exploited by a peer which runs modified code.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:56 +02:00
Philipp Reisner 197296ffed drbd: Delay/reject other state changes while establishing a connection
Changes to the role and disk state should be delayed or rejected
while we establish a connection.

This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.

The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:55 +02:00
Lars Ellenberg 46385c84ac drbd: move put_ldev from __req_mod() to the endio callback
One invocation in the endio handler is good enough,
we don't need mention it for each of the different ways
it calls __req_mod().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:51 +02:00
Lars Ellenberg d64957c9a9 drbd: fix WRITE_ACKED_BY_PEER_AND_SIS to not set RQ_NET_DONE
Just because this request happened during a resync does
not mean it may pretend to have been barrier-acked.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:50 +02:00
Lars Ellenberg 41c4a0035b drbd: fix READ_RETRY_REMOTE_CANCELED to not complete if device is suspended
READ_RETRY_REMOTE_CANCELED needs to be grouped with the other _CANCELED
cases, not with CONNECTION_LOST_WHILE_PENDING, as that would complete
(fail) the bio even if the device became suspended.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:48 +02:00
Lars Ellenberg 6d49e101fd drbd: make OOS_HANDED_TO_NETWORK its own case
OOS_HANDED_TO_NETWORK should not be grouped with the various
*_CANCELED/*_FAILED cases.
Also, not only clear the RQ_NET_QUEUED flag, but also mark it RQ_NET_DONE,
so it can be distinguished from a local-only request even after that.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:47 +02:00
Lars Ellenberg c088b2d904 drbd: don't pretend that barrier_nr == 0 was special
We used to have a barrier implementation where barrier_nr 0 was
reserved. That is long gone. Just use the full sequence space.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:46 +02:00
Lars Ellenberg 7ffcaa7194 drbd: remove unused static helper function
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:44 +02:00
Lars Ellenberg a5d214f621 drbd: remove some very outdated comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:43 +02:00
Lars Ellenberg 1abc2af205 drbd: missing wakeup after drbd_rs_del_all
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:42 +02:00
Lars Ellenberg 671a74e749 drbd: remove now unused seq_num member from struct drbd_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:40 +02:00
Lars Ellenberg 001a88687a drbd: fix potential data corruption and protocol error
We assumed only bios with bi_idx == 0 would end up
in drbd_make_request().

That is wrong.

At least device mapper, in __clone_and_map(), may submit
clones only covering a partial bio, but sharing
the original bvec, by adjusting bi_idx and relevant
other bio members of the clone.

We used __bio_for_each_segment() in various places,
even though that is documented as
 * drivers should not use the __ version unless they _really_ want to
 * run through the entire bio and not just pending pieces

Impact: we would send the full bio bvec, even for the clone
with bi_idx > 0, which will cause data corruption on the
peer (because we submit wrong data at the clone offset),
and will cause a DRBD protocol error, disconnect/reconnect
and resync (thus fixing the corruption),
because the next package header would be expected right
in the middle of the sent data, causing DRBD magic mismatch.

Fix: drop the assert, and use bio_for_each_segment()
instead of the __ version.

Conflicts:

	drbd/drbd_tracing.c

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:39 +02:00
Philipp Reisner b6a370ba07 drbd: Fix a potential write ordering issue on SyncTarget nodes
If a SyncTarget node gets a P_RS_DATA_REPLY before a P_DATA packet
for the same sector, it simply submits these two IO requests.

  This is be possible because on the SyncSource node, the data of the
  P_RS_DATA_REPLY packet was read from disk.  Immediately after that a
  write request from upper layers came in.

The disk scheduler or even the "hardware" queues on the disk drive might
reorder these writes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:38 +02:00
Philipp Reisner fc28845bc0 drbd: Fix a potential race that could case data inconsistency
When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE
at the same time, and it happens that the line

	remote = remote && drbd_should_do_remote(s);

stills sees C_WF_BITMAP_S, and

	send_oos = rw == WRITE && drbd_should_send_oos(s);

already sees C_SYNC_SOURCE both are 0.

This causes the write to not be mirrored, but marked as out-of-sync on the
Sync_Source node.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:34 +02:00
Lars Ellenberg 031a7c173f drbd: add missing part_round_stats to _drbd_start_io_acct
Without this, iostat frequently sees bogus svctime and >= 100% "utilization".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:33 +02:00
Lars Ellenberg 47a4f1c1bb drbd: Fix module refcount leak in drbd_accept()
drbd_accept was modelled after kernel_accept
with drbd commit 53eb779 in July 2008.

Only, kernel_accept was then broken, and only fixed later
with kernel commit 1b08534e in Dec 2008:
net: Fix module refcount leak in kernel_accept()

Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp,
would soon have their reference count become negative, preventing
them from being unloaded (likely), or worse, hit zero without actually
being unused, allowing them to be unloaded while still in use (unlikely,
but if triggered, causing a kernel crash).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:32 +02:00
Philipp Reisner 7caacb69ac drbd: Consider the disk-timeout also for meta-data IO operations
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().

This patch adds the timeout to all drbd_md_sync_page_io().

Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:30 +02:00
Philipp Reisner 4afc433cf8 drbd: Do not send state packets while lower than C_CONNECTED cstate
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.

Actually triggered by the Testuite on kugel.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:29 +02:00
Lars Ellenberg 545752d5d8 drbd: fix race between disconnect and receive_state
If the asender thread, or request_timer_fn(), or some other part of
the code, decided to drop the connection (because of timeout or other),
but the receiver just now was processing a P_STATE packet, there was a
chance that receive_state() would do a hard state change
"re-establishing" an already failed connection without additional handshake.

Log excerpt:
  Remote failed to finish a request within ko-count * timeout
  peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
  asender terminated
  ...
  peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 )
  ...
  Connection closed
  peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 )
  receiver terminated

Impact:
while the connection state is erroneously "Connected",
requests may be queued and even sent,
which would never be acknowledged,
and may have been missed by the cleanup.
These requests would never be completed.

The next drbd_suspend_io() will then lock up,
waiting forever for these requests to complete.

Fixed in several code paths:
  Make sure the connection state is NetworkFailure or worse
  before starting the cleanup in drbd_disconnect().
  This should make sure the cleanup won't miss any requests.

  Disallow receive_state() to "upgrade" the connection state
  from an error state. This will make sure the "illegal" state
  transition won't happen.

  For all connection failure states,
  relax the safe-guard in sanitize_state() again
  to silently mask out those state changes
  (e.g. Timeout -> Connected becomes Timeout -> Timeout).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:16:01 +02:00
Lars Ellenberg 763eb63625 drbd: fix potential spinlock deadlock
drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks
left to be resynced (rs_left) in the current resync extent.
If it detects a mismatch, it complains, and forces a disconnect using
drbd_force_state(mdev, NS(conn, C_DISCONNECTING));

Unfortunately, this may be called while holding the req_lock,
and drbd_force_state() want's to aquire that lock itself. Deadlock.

Don't force a disconnect, but fix up rs_left by recounting and
reassigning the number of dirty blocks in that extent.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:58 +02:00
Philipp Reisner e89868a092 drbd: Fixed an obvious copy-n-paste mistake
This bug might have caused troubles if disk-barriers and the ahead-behind
more are enabled at the same time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:57 +02:00
Lars Ellenberg f479ea0661 drbd: send intermediate state change results to the peer
DRBD state changes schedule after_state_ch() actions to a worker thread,
which decides on the old and new states of that change, whether to send
an informational state update packet (P_STATE) to the peer.
If it decides to drbd_send_state(), it would however always send the
_curent_ state, which, if a second state change happens before the
after_state_ch() of the first ran, may "fast-forward" the peer's view
about this node.  In most cases that is harmless, but sometimes this can
confuse DRBD, for example into not actually starting a necessary resync
if you do a very tight detach/attach loop on a Connected Secondary.

Fix this by always sending the "new" state of the respective state
transition which scheduled this after_state_ch() work.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:56 +02:00
Lars Ellenberg a2e9138197 drbd: fix spurious meta data IO "error"
When detaching, even cleanly detaching due to administrator request,
we always go through D_FAILED before we become D_DISKLESS.

Don't let that state change race with an in-flight meta data IO,
or that one might think it actually experienced an IO error.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:54 +02:00
Philipp Reisner aaae506d54 drbd: Fixed a race condition between detach and start of resync
drbd_state_lock() is only there to serialize cluster wide state
changes. Testing the local disk state needs to happen while
holding the global_state_lock.

Otherwise you might see something like this (Oct 6 on kugel)
14:20:24 drbd0: conn( WFSyncUUID -> Connected ) disk( Inconsistent -> Failed )
14:20:24 drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
14:20:24 drbd0: conn( Connected -> SyncTarget ) disk( Failed -> Inconsistent )

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:53 +02:00
Lars Ellenberg 6a9a92f4ef drbd: fix harmless race to not trigger an ASSERT
We have one pre-allocated page to do certain synchronous meta data IO with,
using it is serialized like so:
	drbd_md_get_buffer();
	drbd_md_sync_page_io();
	drbd_md_sync_page_io();
	...
	drbd_md_put_buffer();

In drbd_md_sync_page_io() there is an
	ASSERT(atomic_read(&mdev->md_io_in_use) == 1);

We want to be able to timeout on unresponsive lower level devices, so we
can "detach" in that case. Inside drbd_md_sync_page_io() we grab an extra
reference, to not have a dangling pointer in case a delayed IO eventually
does still complete, even after we "detached" already.

We need to put the extra reference before we signal completion from the
completion handler, or the second drbd_md_sync_page_io() above may
trigger the assert (reference count still 2).

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:52 +02:00
Philipp Reisner 5ba3dac521 drbd: Derive sync-UUIDs only from the bitmap-uuid if it is non-zero
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:50 +02:00
Andreas Gruenbacher 7b4e4d3126 drbd: drbd_nl_resize(): Fix missing put_ldev() on error path
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:49 +02:00
Lars Ellenberg 40424e4a24 drbd: fix "stalled" empty resync
With sync-after dependencies, given "lucky" timing of pause/unpause
events, and the end of an empty (0 bits set) resync was sometimes not
detected on the SyncTarget, leading to a "stalled" SyncSource state.

Fixed this by expecting not only "Inconsistent -> UpToDate" but also
"Consistent -> UpToDate" transitions for the peer disk state
to end a resync.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:47 +02:00
Philipp Reisner 1e86ac48af drbd: Bugfix for the connection behavior
If we get into the C_BROKEN_PIPE cstate once, the state engine set the
thi->t_state of the receiver thread to restarting.  But with the while loop
in drbdd_init() a new connection gets established. After the call into
drbdd() returns immediately since the thi->t_state is not RUNNING.  The
restart of drbd_init() then resets thi->t_state to RUNNING.

I.e. after entering C_BROKEN_PIPE once, the next successful established
connection gets wasted.

The two parts of the fix:
  * Do not cause the thread to restart if we detect the issue
    with the sockets while we are in C_WF_CONNECTION.

  * Make sure that all actions that would have set us to C_BROKEN_PIPE
    happen before the state change to C_WF_REPORT_PARAMS.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:46 +02:00
Philipp Reisner 80f9fd55a6 drbd: Cleanup all epoch objects upon connection loss
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:44 +02:00
Philipp Reisner fd2491f4a4 drbd: detach must not try to abort non-local requests from drbd-8.4
Cherry picked form 8.4

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:43 +02:00
Philipp Reisner 79f16f5dbc drbd: Consider that the no-data-condition could be in connected state
...when the peer has inconsistent data. In that case we failed to
clear the susp_nod flag. When the local disk was attached again

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:15:42 +02:00
Philipp Reisner bca482e90b drbd: Fixed current UUID generation
Now, the new edition of the clause only fires if a diskless
peer gets promoted.

This is a fixup for "drbd: Delayed creation of current-UUID".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:50 +02:00
Lars Ellenberg 22f46ce2ef drbd: change some GFP_KERNEL to GFP_NOIO
Bitmap IO may happend in the context of an application write,
in the generic block IO path.  We need to use GFP_NOIO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:47 +02:00
Philipp Reisner dfa8bedbfe drbd: Implemented the disk-timeout option
When the disk-timeout is active, and it expires for a single request,
we consider the local disk as D_FAILED. Note: With this change,
I made both timeout based state transitions HARD state transitions.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:45 +02:00
Philipp Reisner 02ee8f95fa drbd: Force flag for the detach operation
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:38 +02:00
Philipp Reisner 5ca1de0384 drbd: Allow new IOs while the local disk in in FAILED state
The last bunch of commits prepared the 'detach from tar pit' feature.
With that we can be for long time in disk state FAILED. We need
to accept new IO requests during that time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:34 +02:00
Philipp Reisner 9e58c4dad7 drbd: Bitmap IO functions can now return prematurely if the disk breaks
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 15:10:33 +02:00
Philipp Reisner d1f3779bbe drbd: Added a kref to bm_aio_ctx
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:37:19 +02:00
Philipp Reisner b2057629ea drbd: Hold a reference to ldev while doing meta-data IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:31:11 +02:00
Philipp Reisner 4a2fe568b5 drbd: Keep a reference to the bio until the completion handler finished
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:28:51 +02:00
Philipp Reisner 0c46442515 drbd: Implemented wait_until_done_or_disk_failure()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:26:51 +02:00
Philipp Reisner e17117310b drbd: Replaced md_io_mutex by an atomic: md_io_in_use
The new function drbd_md_get_buffer() aborts waiting for the buffer
in case the disk failes in the meantime.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:22:31 +02:00
Philipp Reisner cc94c65015 drbd: moved md_io into mdev
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:17:24 +02:00
Philipp Reisner 2b4dd36fba drbd: Immediately allow completion of IOs, that wait for IO completions on a failed disk
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:16:04 +02:00
Philipp Reisner 6d7e32f568 drbd: Keep a reference to barrier acked requests
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:15:28 +02:00
Philipp Reisner 6809384c71 drbd: Improve compatibility with drbd's older than 8.3.7
Regression introduced with 8.3.11 commit:
drbd: Take a more conservative approach when deciding max_bio_size

Never ever tell an older drbd, that we support more than 32KiB
in a single data request (packet).
Never believe an older drbd, that is supports more than 32KiB
in a single data request (packet)

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:08:57 +02:00
Philipp Reisner 77e8fdfc18 drbd: Only print sanitize state's warnings, if the state change happens
The reason for this change is that, with when doing
'drbdadm invalidate' on a disconnected resource caused
an "implicitly set pdsk from UpToDate to DUnknown" message,
which was missleading.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:08:22 +02:00
Lars Ellenberg 07667347c8 drbd: downgraded error printk to info
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:05:25 +02:00
David Howells 5f138ce01a DRBD: Fix comparison always false warning due to long/long long compare
Fix warnings of the following nature in the drbd header:

In file included from drivers/block/drbd/drbd_bitmap.c:32:
drivers/block/drbd/drbd_int.h: In function 'drbd_get_syncer_progress':
drivers/block/drbd/drbd_int.h:2234: warning: comparison is always false due to limited range of data

where mdev->rs_total (an unsigned long) is being compared to 1ULL << 32, which
is always false on a 32-bit machine.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
2012-05-09 10:03:19 +02:00
Lars Ellenberg 7948bcdc38 drbd: spelling fix: too small
It is not "to small", but "too small".

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:02:22 +02:00
Lars Ellenberg 1381e9a496 drbd: cosmetic: fix accidental division instead of modulo when pretty printing
For large resync rates, seq_printf_with_thousands_grouping()
accidentally only produced Y,000,00Y, instead of the real numbers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 10:01:39 +02:00
Philipp Reisner ebd2b0cde5 drbd: Lower log priority for an event that is definitely not an error
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2012-05-09 09:59:29 +02:00
Sage Weil 3469ac1aa3 ceph: drop support for preferred_osd pgs
This was an ill-conceived feature that has been removed from Ceph.  Do
this gracefully:

 - reject attempts to specify a preferred_osd via the ioctl
 - stop exposing this information via virtual xattrs
 - always fill in -1 for requests, in case we talk to an older server
 - don't calculate preferred_osd placements/pgids

Reviewed-by: Alex Elder <elder@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2012-05-07 15:33:36 -07:00
David S. Miller f24001941c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Fix merge between commit 3adadc08cc ("net ax25: Reorder ax25_exit to
remove races") and commit 0ca7a4c87d ("net ax25: Simplify and
cleanup the ax25 sysctl handling")

The former moved around the sysctl register/unregister calls, the
later simply removed them.

With help from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-23 23:15:17 -04:00
Pavel Emelyanov 4a17fd5229 sock: Introduce named constants for sk_reuse
Name them in a "backward compatible" manner, i.e. reuse or not
are still 1 and 0 respectively. The reuse value of 2 means that
the socket with it will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-21 15:52:25 -04:00
Linus Torvalds c1acb0ba33 Fixes in various components:
* mechanism to work with misconfigured backends (where they are
    advertised but in reality don't exist).
  * two tiny compile warning fixes.
  * proper error handling in gnttab_resume
  * Not using VM_PFNMAP anymore to allow backends in the same domain.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPkYeqAAoJEFjIrFwIi8fJEXIIAI+PYLNMcHTc4bxa6pErpKaS
 rq5eCXL9+EaZOwTUqHRJjfrjnlAc+BWO8lN0H41oRQWFYh14hgfUVJ+ziEujb1kw
 N1eTMVHnH/XRJV6rIFX+TiBasnyoMmNfWEAb45UL1nEUTMPL1Jv7AiRY/GxUlHyg
 M+uFG52KP3ytXxcIiGW6pYEqJd6UgWrqnclaeg5TR5zvDlWfJbUIBEMQ/PyV0WSS
 4e7biiwi4XPWT2f1qewOmI+3r68CltU3GAs1XxjcSX+bYYuh00UtY39AsBWo2N8I
 1VORuq0QPs+GB22r3e47IqBcjXkBGRIf6w1e/5a6WLiq7TqVq4bYgCGUmWT8V7o=
 =olCU
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen fixes from Konrad Rzeszutek Wilk:
 - mechanism to work with misconfigured backends (where they are
   advertised but in reality don't exist).
 - two tiny compile warning fixes.
 - proper error handling in gnttab_resume
 - Not using VM_PFNMAP anymore to allow backends in the same domain.

* tag 'stable/for-linus-3.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  Revert "xen/p2m: m2p_find_override: use list_for_each_entry_safe"
  xen/resume: Fix compile warnings.
  xen/xenbus: Add quirk to deal with misconfigured backends.
  xen/blkback: Fix warning error.
  xen/p2m: m2p_find_override: use list_for_each_entry_safe
  xen/gntdev: do not set VM_PFNMAP
  xen/grant-table: add error-handling code on failure of gnttab_resume
2012-04-20 11:31:00 -07:00
Konrad Rzeszutek Wilk a71e23d992 xen/blkback: Fix warning error.
drivers/block/xen-blkback/xenbus.c: In function 'xen_blkbk_discard':
drivers/block/xen-blkback/xenbus.c:419:4: warning: passing argument 1 of 'dev_warn' makes pointer from integer without a cast
+[enabled by default]
include/linux/device.h:894:5: note: expected 'const struct device *' but argument is of type 'long int'

It is unclear how that mistake made it in. It surely is wrong.

Acked-by: Jens Axboe <axboe@kernel.dk>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-18 15:54:08 -04:00
Linus Torvalds cdd5983063 virtio: fixes on top of 3.4-rc2
Here are some virtio fixes for 3.4:
 a test build fix, a patch by Ren fixing naming for systems with a massive
 number of virtio blk devices, and balloon fixes for powerpc
 by David Gibson.
 
 There was some discussion about Ren's patch for virtio disc naming: some people
 wanted to move the legacy name mangling function to the block core.  But
 there's no concensus on that yet, and we can always deduplicate later.
 Added comments in the hope that this will stop people from
 copying this legacy naming scheme into future drivers.
 
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPio1GAAoJECgfDbjSjVRpGDAH/3C/bXm9mriuNauRHwktHgJe
 gmh2BfUgnxly6vheuz0Fv61lTe6V8kekHVolbUYwAUgXeWEKK1C59xehrMGRIPDG
 1XUiti50U3P+skhIfrbkS5nZ7L+5Hk0ToQ6dd9v0BM2GxDOvgwidlY1bZe+wJEZf
 Lvl6w/djBCr1e3k4qfRnpTcdJJ4FnOjGbikLQhSTGfUXeNo6uWS1hljYWnAhzFkd
 1xU8h5PP0TDR0nYb80CeB+9Lxw0w4qyNPJIBhNN6ucB/1U6R+55HpEpmrLUkn910
 sEFEFsc0cRVWr8FiOTlmzxLHnwTc8AY/Bsp9TMSmnTRu3ZQcoQMTQQCczRj04xI=
 =VmpJ
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael S. Tsirkin:
 "Here are some virtio fixes for 3.4: a test build fix, a patch by Ren
  fixing naming for systems with a massive number of virtio blk devices,
  and balloon fixes for powerpc by David Gibson.

  There was some discussion about Ren's patch for virtio disc naming:
  some people wanted to move the legacy name mangling function to the
  block core.  But there's no concensus on that yet, and we can always
  deduplicate later.  Added comments in the hope that this will stop
  people from copying this legacy naming scheme into future drivers."

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  virtio_balloon: fix handling of PAGE_SIZE != 4k
  virtio_balloon: Fix endian bug
  virtio_blk: helper function to format disk names
  tools/virtio: fix up vhost/test module build
2012-04-16 18:34:12 -07:00
Linus Torvalds c104f1fa1e Merge branch 'for-3.4/drivers' of git://git.kernel.dk/linux-block
Pull block driver bits from Jens Axboe:

 - A series of fixes for mtip32xx.  Most from Asai at Micron, but also
   one from Greg, getting rid of the dependency on PCIE_HOTPLUG.

 - A few bug fixes for xen-blkfront, and blkback.

 - A virtio-blk fix for Vivek, making resize actually work.

 - Two fixes from Stephen, making larger transfers possible on cciss.
   This is needed for tape drive support.

* 'for-3.4/drivers' of git://git.kernel.dk/linux-block:
  block: mtip32xx: remove HOTPLUG_PCI_PCIE dependancy
  mtip32xx: dump tagmap on failure
  mtip32xx: fix handling of commands in various scenarios
  mtip32xx: Shorten macro names
  mtip32xx: misc changes
  mtip32xx: Add new sysfs entry 'status'
  mtip32xx: make setting comp_time as common
  mtip32xx: Add new bitwise flag 'dd_flag'
  mtip32xx: fix error handling in mtip_init()
  virtio-blk: Call revalidate_disk() upon online disk resize
  xen/blkback: Make optional features be really optional.
  xen/blkback: Squash the discard support for 'file' and 'phy' type.
  mtip32xx: fix incorrect value set for drv_cleanup_done, and re-initialize and start port in mtip_restart_port()
  cciss: Fix scsi tape io with more than 255 scatter gather elements
  cciss: Initialize scsi host max_sectors for tape drive support
  xen-blkfront: make blkif_io_lock spinlock per-device
  xen/blkfront: don't put bdev right after getting it
  xen-blkfront: use bitmap_set() and bitmap_clear()
  xen/blkback: Enable blkback on HVM guests
  xen/blkback: use grant-table.c hypercall wrappers
2012-04-13 18:45:13 -07:00
Ren Mingxin c0aa3e0916 virtio_blk: helper function to format disk names
The current virtio block's naming algorithm just supports 18278
(26^3 + 26^2 + 26) disks. If there are more virtio blocks,
there will be disks with the same name.

Based on commit 3e1a7ff8a0, add
a function "virtblk_name_format()" for virtio block to support mass
of disks naming.

Notes:
- Our naming scheme is ugly. We are stuck with it
  for virtio but don't use it for any new driver:
  new drivers should name their devices PREFIX%d
  where the sequence number can be allocated by ida
- sd_format_disk_name has exactly the same logic.
  Moving it to a central place was deferred over worries
  that this will make people keep using the legacy naming
  in new drivers.
  We kept code idential in case someone wants to deduplicate later.

Signed-off-by: Ren Mingxin <renmx@cn.fujitsu.com>
Acked-by: Asias He <asias@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2012-04-12 10:37:05 +03:00
Greg Kroah-Hartman 6363480651 block: mtip32xx: remove HOTPLUG_PCI_PCIE dependancy
This removes the HOTPLUG_PCI_PCIE dependency on the driver and makes it
depend on PCI.

Cc: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-12 08:47:05 +02:00
Asai Thambi S P 95fea2f1d9 mtip32xx: dump tagmap on failure
Dump tagmap on failure, instead of individual tags.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:39 +02:00
Asai Thambi S P c74b0f586f mtip32xx: fix handling of commands in various scenarios
* If a ncq  command time out and a non-ncq command is active, skip restart port
* Queue(pause) ncq commands during operations spanning more than one non-ncq commands - secure erase, download microcode
* When a non-ncq command is active, allow incoming non-ncq commands to wait instead of failing back
* Changed timeout for download microcode and smart commands
* If the device in write protect mode, fail all writes (do not send to device)
* Set maximum retries to 2

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:39 +02:00
Asai Thambi S P 8a857a880b mtip32xx: Shorten macro names
Shortened macros used to represent mtip_port->flags and dd->dd_flag

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P 8182b49528 mtip32xx: misc changes
* Handle the interrupt completion of polled internal commands
* Do not check remove pending flag for standby command
* On rebuild failure,
    - set corresponding bit dd_flag
    - do not send standby command
* Free ida index in remove path

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P f65872177d mtip32xx: Add new sysfs entry 'status'
* Add support for detecting the following device status
        - write protect
        - over temp (thermal shutdown)
* Add new sysfs entry 'status', possible values - online, write_protect, thermal_shutdown
* Add new file 'sysfs-block-rssd' to document ABI (Reported-by: Greg Kroah-Hartman)

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P dad40f16ff mtip32xx: make setting comp_time as common
Moved setting completion time into mtip_issue_ncq_command()

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Asai Thambi S P 45038367c2 mtip32xx: Add new bitwise flag 'dd_flag'
* Merged the following flags into one variable 'dd_flag':
        * drv_cleanup_done
        * resumeflag
* Added the following flags into 'dd_flag'
        * remove pending
        * init done
* Removed 'ftlrebuildflag' (similar flag is already part of mti_port->flags)

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-09 08:35:38 +02:00
Linus Torvalds 9479f0f801 Two fixes for regressions:
* one is a workaround that will be removed in v3.5 with proper fix in the tip/x86 tree,
  * the other is to fix drivers to load on PV (a previous patch made them only
    load in PVonHVM mode).
 
 The rest are just minor fixes in the various drivers and some cleanup in the
 core code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPfyVUAAoJEFjIrFwIi8fJUjUH/jbY5JavRqSlNELZW2A4Ta76
 8p00LqLHw/C56iHZcWKke8mqtWNb+ZfcQt7ZYcxDIYa4QWBL28x0OLAO2tOBIt37
 ZjYESWSdFJaJvmpADluWtFyGyZ9TYJllDTBm/jWj1ZtKSZvR1YkhuMXCS0f4AmGQ
 xFzSWJZUDdiOAqpN+VQD8wP00gfR8knQLg16XE2fvFdQo4XwpCtqLfHV/5pMMGdy
 Cs/ep6rq/7cdv/nshKOcBnw7RW8l3Xoi/28ht8k3DvAQ2VtFq1Tugv2G9pcCHwQG
 DIBkB3SOU6/v6P5at5+egKS5xR1fJetCWlkMd8kkbcdz2NPI4UDMkvOW6Q8yQls=
 =6Ve+
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull xen fixes from Konrad Rzeszutek Wilk:
 "Two fixes for regressions:
   * one is a workaround that will be removed in v3.5 with proper fix in
     the tip/x86 tree,
   * the other is to fix drivers to load on PV (a previous patch made
     them only load in PVonHVM mode).

  The rest are just minor fixes in the various drivers and some cleanup
  in the core code."

* tag 'stable/for-linus-3.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pcifront: avoid pci_frontend_enable_msix() falsely returning success
  xen/pciback: fix XEN_PCI_OP_enable_msix result
  xen/smp: Remove unnecessary call to smp_processor_id()
  xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'
  xen: only check xen_platform_pci_unplug if hvm
2012-04-06 17:54:53 -07:00
Igor Mammedov e95ae5a493 xen: only check xen_platform_pci_unplug if hvm
commit b9136d207f08
  xen: initialize platform-pci even if xen_emul_unplug=never

breaks blkfront/netfront by not loading them because of
xen_platform_pci_unplug=0 and it is never set for PV guest.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-04-06 12:12:52 -04:00
Alex Elder cd9d9f5df6 rbd: don't hold spinlock during messenger flush
A recent change made changes to the rbd_client_list be protected by
a spinlock.  Unfortunately in rbd_put_client(), the lock is taken
before possibly dropping the last reference to an rbd_client, and on
the last reference that eventually calls flush_workqueue() which can
sleep.

The problem was flagged by a debug spinlock warning:
    BUG: spinlock wrong CPU on CPU#3, rbd/27814

The solution is to move the spinlock acquisition and release inside
rbd_client_release(), which is the spot where it's really needed for
protecting the removal of the rbd_client from the client list.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Reviewed-by: Sage Weil <sage@newdream.net>
2012-04-05 15:43:58 -05:00
Ryosuke Saito 6d27f09a63 mtip32xx: fix error handling in mtip_init()
Ensure that block device is properly unregistered, if
pci_register_driver() fails.

Signed-off-by: Ryosuke Saito <raitosyo@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-05 08:09:34 -06:00
Len Brown f6365201d8 x86: Remove the ancient and deprecated disable_hlt() and enable_hlt() facility
The X86_32-only disable_hlt/enable_hlt mechanism was used by the
32-bit floppy driver. Its effect was to replace the use of the
HLT instruction inside default_idle() with cpu_relax() - essentially
it turned off the use of HLT.

This workaround was commented in the code as:

 "disable hlt during certain critical i/o operations"

 "This halt magic was a workaround for ancient floppy DMA
  wreckage. It should be safe to remove."

H. Peter Anvin additionally adds:

 "To the best of my knowledge, no-hlt only existed because of
  flaky power distributions on 386/486 systems which were sold to
  run DOS.  Since DOS did no power management of any kind,
  including HLT, the power draw was fairly uniform; when exposed
  to the much hhigher noise levels you got when Linux used HLT
  caused some of these systems to fail.

  They were by far in the minority even back then."

Alan Cox further says:

 "Also for the Cyrix 5510 which tended to go castors up if a HLT
  occurred during a DMA cycle and on a few other boxes HLT during
  DMA tended to go astray.

  Do we care ? I doubt it. The 5510 was pretty obscure, the 5520
  fixed it, the 5530 is probably the oldest still in any kind of
  use."

So, let's finally drop this.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Stephen Hemminger <shemminger@vyatta.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/n/tip-3rhk9bzf0x9rljkv488tloib@git.kernel.org
[ If anyone cares then alternative instruction patching could be
  used to replace HLT with a one-byte NOP instruction. Much simpler. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-30 08:50:27 +02:00
Vivek Goyal e9986f303d virtio-blk: Call revalidate_disk() upon online disk resize
If a virtio disk is open in guest and a disk resize operation is done,
(virsh blockresize), new size is not visible to tools like "fdisk -l".
This seems to be happening as we update only part->nr_sects and not
bdev->bd_inode size.

Call revalidate_disk() which should take care of it. I tested growing disk
size of already open disk and it works for me.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-29 10:09:44 +02:00
Linus Torvalds 532bfc851a Merge branch 'akpm' (Andrew's patch-bomb)
Merge third batch of patches from Andrew Morton:
 - Some MM stragglers
 - core SMP library cleanups (on_each_cpu_mask)
 - Some IPI optimisations
 - kexec
 - kdump
 - IPMI
 - the radix-tree iterator work
 - various other misc bits.

 "That'll do for -rc1.  I still have ~10 patches for 3.4, will send
  those along when they've baked a little more."

* emailed from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  backlight: fix typo in tosa_lcd.c
  crc32: add help text for the algorithm select option
  mm: move hugepage test examples to tools/testing/selftests/vm
  mm: move slabinfo.c to tools/vm
  mm: move page-types.c from Documentation to tools/vm
  selftests/Makefile: make `run_tests' depend on `all'
  selftests: launch individual selftests from the main Makefile
  radix-tree: use iterators in find_get_pages* functions
  radix-tree: rewrite gang lookup using iterator
  radix-tree: introduce bit-optimized iterator
  fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
  nbd: rename the nbd_device variable from lo to nbd
  pidns: add reboot_pid_ns() to handle the reboot syscall
  sysctl: use bitmap library functions
  ipmi: use locks on watchdog timeout set on reboot
  ipmi: simplify locking
  ipmi: fix message handling during panics
  ipmi: use a tasklet for handling received messages
  ipmi: increase KCS timeouts
  ipmi: decrease the IPMI message transaction time in interrupt mode
  ...
2012-03-28 17:19:28 -07:00
Wanlong Gao f4507164e7 nbd: rename the nbd_device variable from lo to nbd
rename the nbd_device variable from "lo" to "nbd", since "lo" is just a name
copied from loop.c.

Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-28 17:14:37 -07:00
Linus Torvalds 0195c00244 Disintegrate and delete asm/system.h
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAT3NKzROxKuMESys7AQKElw/+JyDxJSlj+g+nymkx8IVVuU8CsEwNLgRk
 8KEnRfLhGtkXFLSJYWO6jzGo16F8Uqli1PdMFte/wagSv0285/HZaKlkkBVHdJ/m
 u40oSjgT013bBh6MQ0Oaf8pFezFUiQB5zPOA9QGaLVGDLXCmgqUgd7exaD5wRIwB
 ZmyItjZeAVnDfk1R+ZiNYytHAi8A5wSB+eFDCIQYgyulA1Igd1UnRtx+dRKbvc/m
 rWQ6KWbZHIdvP1ksd8wHHkrlUD2pEeJ8glJLsZUhMm/5oMf/8RmOCvmo8rvE/qwl
 eDQ1h4cGYlfjobxXZMHqAN9m7Jg2bI946HZjdb7/7oCeO6VW3FwPZ/Ic75p+wp45
 HXJTItufERYk6QxShiOKvA+QexnYwY0IT5oRP4DrhdVB/X9cl2MoaZHC+RbYLQy+
 /5VNZKi38iK4F9AbFamS7kd0i5QszA/ZzEzKZ6VMuOp3W/fagpn4ZJT1LIA3m4A9
 Q0cj24mqeyCfjysu0TMbPtaN+Yjeu1o1OFRvM8XffbZsp5bNzuTDEvviJ2NXw4vK
 4qUHulhYSEWcu9YgAZXvEWDEM78FXCkg2v/CrZXH5tyc95kUkMPcgG+QZBB5wElR
 FaOKpiC/BuNIGEf02IZQ4nfDxE90QwnDeoYeV+FvNj9UEOopJ5z5bMPoTHxm4cCD
 NypQthI85pc=
 =G9mT
 -----END PGP SIGNATURE-----

Merge tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system

Pull "Disintegrate and delete asm/system.h" from David Howells:
 "Here are a bunch of patches to disintegrate asm/system.h into a set of
  separate bits to relieve the problem of circular inclusion
  dependencies.

  I've built all the working defconfigs from all the arches that I can
  and made sure that they don't break.

  The reason for these patches is that I recently encountered a circular
  dependency problem that came about when I produced some patches to
  optimise get_order() by rewriting it to use ilog2().

  This uses bitops - and on the SH arch asm/bitops.h drags in
  asm-generic/get_order.h by a circuituous route involving asm/system.h.

  The main difficulty seems to be asm/system.h.  It holds a number of
  low level bits with no/few dependencies that are commonly used (eg.
  memory barriers) and a number of bits with more dependencies that
  aren't used in many places (eg.  switch_to()).

  These patches break asm/system.h up into the following core pieces:

    (1) asm/barrier.h

        Move memory barriers here.  This already done for MIPS and Alpha.

    (2) asm/switch_to.h

        Move switch_to() and related stuff here.

    (3) asm/exec.h

        Move arch_align_stack() here.  Other process execution related bits
        could perhaps go here from asm/processor.h.

    (4) asm/cmpxchg.h

        Move xchg() and cmpxchg() here as they're full word atomic ops and
        frequently used by atomic_xchg() and atomic_cmpxchg().

    (5) asm/bug.h

        Move die() and related bits.

    (6) asm/auxvec.h

        Move AT_VECTOR_SIZE_ARCH here.

  Other arch headers are created as needed on a per-arch basis."

Fixed up some conflicts from other header file cleanups and moving code
around that has happened in the meantime, so David's testing is somewhat
weakened by that.  We'll find out anything that got broken and fix it..

* tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits)
  Delete all instances of asm/system.h
  Remove all #inclusions of asm/system.h
  Add #includes needed to permit the removal of asm/system.h
  Move all declarations of free_initmem() to linux/mm.h
  Disintegrate asm/system.h for OpenRISC
  Split arch_align_stack() out from asm-generic/system.h
  Split the switch_to() wrapper out of asm-generic/system.h
  Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h
  Create asm-generic/barrier.h
  Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h
  Disintegrate asm/system.h for Xtensa
  Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt]
  Disintegrate asm/system.h for Tile
  Disintegrate asm/system.h for Sparc
  Disintegrate asm/system.h for SH
  Disintegrate asm/system.h for Score
  Disintegrate asm/system.h for S390
  Disintegrate asm/system.h for PowerPC
  Disintegrate asm/system.h for PA-RISC
  Disintegrate asm/system.h for MN10300
  ...
2012-03-28 15:58:21 -07:00
Linus Torvalds 47b816ff7d Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull a few more things for powerpc by Benjamin Herrenschmidt:
 - Anton's did some recent improvements to EPOW event reporting on
   pSeries (power supply failures and such).  The patches are self
   contained enough and replace really nasty code so I felt it should
   still go in
 - I did the vio driver registration change Greg requested, I don't see
   the point of leaving that til the next merge window
 - The remaining EEH changes I said were still pending to get rid of the
   EEH references from the generic struct device_node
 - A few more iSeries removal bits
 - A perf bug fix on 970

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc/perf: Fix instruction address sampling on 970 and Power4
  powerpc+sparc/vio: Modernize driver registration
  powerpc: Random little legacy iSeries removal tidy ups
  powerpc: Remove NO_IRQ_IGNORE
  powerpc/pseries: Cut down on enthusiastic use of defines in RAS code
  powerpc/pseries: Clean up ras_error_interrupt code
  powerpc/pseries: Remove RTAS_POWERMGM_EVENTS
  powerpc/pseries: Use rtas_get_sensor in RAS code
  powerpc/pseries: Parse and handle EPOW interrupts
  powerpc: Make function that parses RTAS error logs global
  powerpc/eeh: Retrieve PHB from global list
  powerpc/eeh: Remove eeh information from pci_dn
  powerpc/eeh: Remove eeh device from OF node
2012-03-28 14:41:36 -07:00
David Howells 9ffc93f203 Remove all #inclusions of asm/system.h
Remove all #inclusions of asm/system.h preparatory to splitting and killing
it.  Performed with the following command:

perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`

Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28 18:30:03 +01:00
Linus Torvalds 56b59b429b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates for 3.4-rc1 from Sage Weil:
 "Alex has been busy.  There are a range of rbd and libceph cleanups,
  especially surrounding device setup and teardown, and a few critical
  fixes in that code.  There are more cleanups in the messenger code,
  virtual xattrs, a fix for CRC calculation/checks, and lots of other
  miscellaneous stuff.

  There's a patch from Amon Ott to make inos behave a bit better on
  32-bit boxes, some decode check fixes from Xi Wang, and network
  throttling fix from Jim Schutt, and a couple RBD fixes from Josh
  Durgin.

  No new functionality, just a lot of cleanup and bug fixing."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
  rbd: move snap_rwsem to the device, rename to header_rwsem
  ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
  libceph: isolate kmap() call in write_partial_msg_pages()
  libceph: rename "page_shift" variable to something sensible
  libceph: get rid of zero_page_address
  libceph: only call kernel_sendpage() via helper
  libceph: use kernel_sendpage() for sending zeroes
  libceph: fix inverted crc option logic
  libceph: some simple changes
  libceph: small refactor in write_partial_kvec()
  libceph: do crc calculations outside loop
  libceph: separate CRC calculation from byte swapping
  libceph: use "do" in CRC-related Boolean variables
  ceph: ensure Boolean options support both senses
  libceph: a few small changes
  libceph: make ceph_tcp_connect() return int
  libceph: encapsulate some messenger cleanup code
  libceph: make ceph_msgr_wq private
  libceph: encapsulate connection kvec operations
  libceph: move prepare_write_banner()
  ...
2012-03-28 10:01:29 -07:00
Benjamin Herrenschmidt cb52d8970e powerpc+sparc/vio: Modernize driver registration
This makes vio_register_driver() get the module owner & name at compile
time like PCI drivers do, and adds a name pointer directly in struct
vio_driver to avoid having to explicitly initialize the embedded
struct device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David S. Miller <davem@davemloft.net>
2012-03-28 11:33:24 +11:00
Jens Axboe 6674fb79ca Merge branch 'stable/for-jens-3.4-bugfixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.4/drivers
Konrad writes:

I've two small fixes for the xen-blkback - and I think one more will show up
eventually (a partial revert), but not sure when. So in the spirit of keeping
the patches flowing, please git pull the following branch.
2012-03-26 09:13:14 +02:00
Linus Torvalds e22057c859 One tiny feature that accidentally got lost in the initial git pull:
* Add fast-EOI acking of interrupts (clear a bit instead of hypercall)
 And bug-fixes:
  * Fix CPU bring-up code missing a call to notify other subsystems.
  * Fix reading /sys/hypervisor even if PVonHVM drivers are not loaded.
  * In Xen ACPI processor driver: remove too verbose WARN messages, fix up
    the Kconfig dependency to be a module by default, and add dependency on
    CPU_FREQ.
  * Disable CPU frequency drivers from loading when booting under Xen
    (as we want the Xen ACPI processor to be used instead).
  * Cleanups in tmem code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQEcBAABAgAGBQJPbc3DAAoJEFjIrFwIi8fJTQkIAMnH2fPhcHAb4mNaz+3gdmsZ
 Flo6V1gMBcO8xKZlUkFgKKPYoOm7lLmvoceXLVSH5oOKSnSJo1zSinzKmcdJQo/D
 kPo4/EguNwtzcAcQh2dmT6/IM9O3ihMKUli7Oajif9PLCFFFqTaG3Y3YNBo/rxTY
 D3HAnJrIfmIyG0NpLnaFCWhCzUvcB4M7ysutECqcF8l5gnbHxRVeCKD0blM+n9GH
 Wyum00dQCwo6h6wTduhPOAxHAM4rncyR3heOB2vDxq9YJHSUhhcva5QCgQ+tdUVt
 6U2TQT1L2Px8iXXzr2w9YBpepOVajZReoKhajLjJ5VbkpBZFz5dVNfJ8LpF8RV8=
 =z8IB
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.4-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen

Pull more xen updates from Konrad Rzeszutek Wilk:
 "One tiny feature that accidentally got lost in the initial git pull:
   * Add fast-EOI acking of interrupts (clear a bit instead of
     hypercall)
  And bug-fixes:
   * Fix CPU bring-up code missing a call to notify other subsystems.
   * Fix reading /sys/hypervisor even if PVonHVM drivers are not loaded.
   * In Xen ACPI processor driver: remove too verbose WARN messages, fix
     up the Kconfig dependency to be a module by default, and add
     dependency on CPU_FREQ.
   * Disable CPU frequency drivers from loading when booting under Xen
     (as we want the Xen ACPI processor to be used instead).
   * Cleanups in tmem code."

* tag 'stable/for-linus-3.4-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/acpi: Fix Kconfig dependency on CPU_FREQ
  xen: initialize platform-pci even if xen_emul_unplug=never
  xen/smp: Fix bringup bug in AP code.
  xen/acpi: Remove the WARN's as they just create noise.
  xen/tmem: cleanup
  xen: support pirq_eoi_map
  xen/acpi-processor: Do not depend on CPU frequency scaling drivers.
  xen/cpufreq: Disable the cpu frequency scaling drivers from loading.
  provide disable_cpufreq() function to disable the API.
2012-03-24 12:20:25 -07:00
Konrad Rzeszutek Wilk 3389bb8bf7 xen/blkback: Make optional features be really optional.
They were using the xenbus_dev_fatal() function which would
change the state of the connection immediately. Which is not
what we want when we advertise optional features.

So make 'feature-discard','feature-barrier','feature-flush-cache'
optional.

Suggested-by: Jan Beulich <JBeulich@suse.com>
[v1: Made the discard function void and static]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-24 10:04:36 -04:00
Konrad Rzeszutek Wilk 4dae76705f xen/blkback: Squash the discard support for 'file' and 'phy' type.
The only reason for the distinction was for the special case of
'file' (which is assumed to be loopback device), was to reach inside
the loopback device, find the underlaying file, and call fallocate on it.
Fortunately "xen-blkback: convert hole punching to discard request on
loop devices" removes that use-case and we now based the discard
support based on blk_queue_discard(q) and extract all appropriate
parameters from the 'struct request_queue'.

CC: Li Dongyang <lidongyang@novell.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
[v1: Dropping pointless initializer and keeping blank line]
[v2: Remove the kfree as it is not used anymore]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-24 10:04:35 -04:00
Oleg Nesterov 70834d3070 usermodehelper: use UMH_WAIT_PROC consistently
A few call_usermodehelper() callers use the hardcoded constant instead of
the proper UMH_WAIT_PROC, fix them.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Januszewski <spock@gentoo.org>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23 16:58:41 -07:00
Asai Thambi S P 22be2e6e13 mtip32xx: fix incorrect value set for drv_cleanup_done, and re-initialize and start port in mtip_restart_port()
This patch includes two changes:
	* fix incorrect value set for drv_cleanup_done
	* re-initialize and start port in mtip_restart_port()

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23 12:33:03 +01:00
Stephen M. Cameron bc67f63650 cciss: Fix scsi tape io with more than 255 scatter gather elements
The total number of scatter gather elements in the CISS command
used by the scsi tape code was being cast to a u8, which can hold
at most 255 scatter gather elements.  It should have been cast to
a u16.  Without this patch the command gets rejected by the controller
since the total scatter gather count did not add up to the right
value resulting in an i/o error.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-22 21:40:09 +01:00
Stephen M. Cameron 395d287526 cciss: Initialize scsi host max_sectors for tape drive support
The default is too small (1024 blocks), use h->cciss_max_sectors (8192 blocks)
Without this change, if you try to set the block size of a tape drive above
512*1024, via "mt -f /dev/st0 setblk nnn" where nnn is greater than 524288,
it won't work right.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-22 21:40:08 +01:00
Josh Durgin c666601a93 rbd: move snap_rwsem to the device, rename to header_rwsem
A new temporary header is allocated each time the header changes, but
only the changed properties are copied over. We don't need a new
semaphore for each header update.

This addresses http://tracker.newdream.net/issues/2174

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:52 -05:00
Alex Elder 32eec68d2f rbd: don't drop the rbd_id too early
Currently an rbd device's id is released when it is removed, but it
is done before the code is run to clean up sysfs-related files (such
as /sys/bus/rbd/devices/1).

It's possible that an rbd is still in use after the rbd_remove()
call has been made.  It's essentially the same as an active inode
that stays around after it has been removed--until its final close
operation.  This means that the id shows up as free for reuse at a
time it should not be.

The effect of this was seen by Jens Rehpoehler, who:
    - had a filesystem mounted on an rbd device
    - unmapped that filesystem (without unmounting)
    - found that the mount still worked properly
    - but hit a panic when he attempted to re-map a new rbd device

This re-map attempt found the previously-unmapped id available.
The subsequent attempt to reuse it was met with a panic while
attempting to (re-)install the sysfs entry for the new mapped
device.

Fix this by holding off "putting" the rbd id, until the rbd_device
release function is called--when the last reference is finally
dropped.

Note: This fixes: http://tracker.newdream.net/issues/1907

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder 593a9e7b34 rbd: small changes
Here is another set of small code tidy-ups:
    - Define SECTOR_SHIFT and SECTOR_SIZE, and use these symbolic
      names throughout.  Tell the blk_queue system our physical
      block size, in the (unlikely) event we want to use something
      other than the default.
    - Delete the definition of struct rbd_info, which is never used.
    - Move the definition of dev_to_rbd() down in its source file,
      just above where it gets first used, and change its name to
      dev_to_rbd_dev().
    - Replace an open-coded operation in rbd_dev_release() to use
      dev_to_rbd_dev() instead.
    - Calculate the segment size for a given rbd_device just once in
      rbd_init_disk().
    - Use the '%zd' conversion specifier in rbd_snap_size_show(),
      since the value formatted is a size_t.
    - Switch to the '%llu' conversion specifier in rbd_snap_id_show().
      since the value formatted is unsigned.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder 00f1f36ffa rbd: do some refactoring
A few blocks of code are rearranged a bit here:
    - In rbd_header_from_disk():
	- Don't bother computing snap_count until we're sure the
	  on-disk header starts with a good signature.
	- Move a few independent lines of code so they are *after* a
	  check for a failed memory allocation.
	- Get rid of unnecessary local variable "ret".
    - Make a few other changes in rbd_read_header(), similar to the
      above--just moving things around a bit while preserving the
      functionality.
    - In rbd_rq_fn(), just assign rq in the while loop's controlling
      expression rather than duplicating it before and at the end of
      the loop body.  This allows the use of "continue" rather than
      "goto next" in a number of spots.
    - Rearrange the logic in snap_by_name().  End result is the same.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder fed4c143ba rbd: fix module sysfs setup/teardown code
Once rbd_bus_type is registered, it allows an "add" operation via
the /sys/bus/rbd/add bus attribute, and adding a new rbd device that
way establishes a connection between the device and rbd_root_dev.
But rbd_root_dev is not registered until after the rbd_bus_type
registration is complete.  This could (in principle anyway) result
in an invalid state.

Since rbd_root_dev has no tie to rbd_bus_type we can reorder these
two initializations and never be faced with this scenario.

In addition, unregister the device in the event the bus registration
fails at module init time.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:50 -05:00
Alex Elder 7ef3214af2 rbd: don't allocate mon_addrs buffer in rbd_add()
The mon_addrs buffer in rbd_add is used to hold a copy of the
monitor IP addresses supplied via /sys/bus/rbd/add.  That is
passed to rbd_get_client(), which never modifies it (nor do
any of the functions it gets passed to thereafter)--the mon_addr
parameter to rbd_get_client() is a pointer to constant data, so it
can't be modifed.  Furthermore, rbd_get_client() has the length of
the mon_addrs buffer and that is used to ensure nothing goes beyond
its end.

Based on all this, there is no reason that a buffer needs to
be used to hold a copy of the mon_addrs provided via
/sys/bus/rbd/add.   Instead, the location within that passed-in
buffer can be provided, along with the length of the "token"
therein which represents the monitor IP's.

A small change to rbd_add_parse_args() allows the address within the
buffer to be passed back, and the length is already returned.  This
now means that, at least from the perspective of this interface,
there is no such thing as a list of monitor addresses that is too
long.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:50 -05:00
Alex Elder 5214ecc45c rbd: have rbd_parse_args() report found mon_addrs size
The argument parsing routine already computes the size of the
mon_addrs buffer it extracts from the "command."  Pass it to the
caller so it can use it to provide the length to rbd_get_client().

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder 81a8979378 rbd: do a few checks at build time
This is a bit gratuitous, but there are a few things that can be
verified at build time rather than run time, so do that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder e28fff268e rbd: don't use sscanf() in rbd_add_parse_args()
Make use of a few simple helper routines to parse the arguments
rather than sscanf().  This will treat both missing and too-long
arguments as invalid input (rather than silently truncating the
input in the too-long case).  In time this can also be used by
rbd_add() to use the passed-in buffer in place, rather than copying
its contents into new buffers.

It appears to me that the sscanf() previously used would not
correctly handle a supplied snapshot--the two final "%s" conversion
specifications were not separated by a space, and I'm not sure
how sscanf() handles that situation.  It may not be well-defined.
So that may be a bug this change fixes (but I didn't verify that).

The sizes of the mon_addrs and options buffers are now passed to
rbd_add_parse_args(), so they can be supplied to copy_token().

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:49 -05:00
Alex Elder a725f65e52 rbd: encapsulate argument parsing for rbd_add()
Move the code that parses the arguments provided to rbd_add() (which
are supplied via /sys/bus/rbd/add) into a separate function.

Also rename the "mon_dev_name" variable in rbd_add() to be
"mon_addrs".   The variable represents a list of one or more
comma-separated monitor IP addresses, each with an optional port
number.  I think "mon_addrs" captures that notion a little better.

Signed-off-by: Alex Elder <elder@dreamhost.com>
2012-03-22 10:47:48 -05:00
Alex Elder 27cc25943f rbd: simplify error handling in rbd_add()
If a couple pointers are initialized to NULL then a single
"out_nomem" label can be used for all of the memory allocation
failure cases in rbd_add().

Also, get rid of the "irc" local variable there.  There is no
real need for "rc" to be type ssize_t, and it can be used in
the spot "irc" was.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder 60571c7d55 rbd: reduce memory used for rbd_dev fields
The length of the string containing the monitor address
specification(s) will never exceed the length of the string passed
in to rbd_add().  The same holds true for the ceph + rbd options
string.  So reduce the amount of memory allocated for these to
that length rather than the maximum (1024 bytes).

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder d720bcb0a8 rbd: have rbd_get_client() return a rbd_client
Since rbd_get_client() currently returns an error code.  It assigns
the rbd_client field of the rbd_device structure it is passed if
successful.  Instead, have it return the created rbd_client
structure and return a pointer-coded error if there is an error.
This makes the assignment of the client pointer more obvious at the
call site.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder f0f8cef5a3 rbd: a few simple changes
Here are a few very simple cleanups:
    - Add a "RBD_" prefix to the two driver name string definitions.
    - Move the definition of struct rbd_request below struct rbd_req_coll
      to avoid the need for an empty declaration of the latter.
    - Move and group the definitions of rbd_root_dev_release() and
      rbd_root_dev, as well as rbd_bus_type and rbd_bus_attrs[],
      close to the top of the file.  Arrange the latter so
      rbd_bus_type.bus_attrs can be initialized statically.
    - Get rid of an unnecessary local variable in rbd_open().
    - Rework some hokey logic in rbd_bus_add_dev(), so the value of
      "ret" at the end is either 0 or -ENOENT to avoid the need for
      the code duplication that was there.
    - Rename a goto target in rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder 432b858749 rbd: rename "node_lock"
The spinlock used to protect rbd_client_list is named "node_lock".
Rename it to "rbd_client_list_lock" to make it more obvious what
it's for.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:48 -05:00
Alex Elder bc534d86be rbd: move ctl_mutex lock inside rbd_client_create()
Since rbd_client_create() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder d97081b0c7 rbd: move ctl_mutex lock inside rbd_get_client()
Since rbd_get_client() is only called in one place, move the
acquisition of the mutex around that call inside that function.

Furthermore, within rbd_get_client(), it appears the mutex only
needs to be held while calling rbd_client_create().  (Moving
the lock inside that function will wait for the next patch.)

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder e6994d3dde rbd: release client list lock sooner
In rbd_get_client(), if a client is reused, a number of things
get done while still holding the list lock unnecessarily.

This just moves a few things that need no lock protection outside
the lock.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder d184f6bfde rbd: restore previous rbd id sequence behavior
It used to be that selecting a new unique identifier for an added
rbd device required searching all existing ones to find the highest
id is used.  A recent change made that unnecessary, but made it
so that id's used were monotonically non-decreasing.  It's a bit
more pleasant to have smaller rbd id's though, and this change
makes ids get allocated as they were before--each new id is one more
than the maximum currently in use.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 499afd5b8e rbd: tie rbd_dev_list changes to rbd_id operations
The only time entries are added to or removed from the global
rbd_dev_list is exactly when a "put" or "get" operation is being
performed on a rbd_dev's id.  So just move the list management code
into get/put routines.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder e124a82f3c rbd: protect the rbd_dev_list with a spinlock
The rbd_dev_list is just a simple list of all the current
rbd_devices.  Using the ctl_mutex as a concurrency guard is
overkill.  Instead, use a spinlock for that specific purpose.

This also reduces the window that the ctl_mutex needs to be held in
rbd_add().

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 1ddbe94eda rbd: rework calculation of new rbd id's
In order to select a new unique identifier for an added rbd device,
the list of all existing ones is searched and a value one greater
than the highest id is used.

The list search can be avoided by using an atomic variable that
keeps track of the current highest id.  Using a get/put model for
id's we can limit the boundless growth of id numbers a bit by
arranging to reuse the current highest id once it gets released.
Add these calls to "put" the id when an rbd is getting removed.

Note that this changes the pattern of device id's used--new values
will never be below the highest one seen so far (even if there
exists an unused lower one).  I assert this is OK because the key
property of an rbd id is its uniqueness, not its magnitude.

Regardless, a follow-on patch will restore the old way of doing
things, I just think this commit just makes the incremental change
to atomics a little easier to understand.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder b7f23c361b rbd: encapsulate new rbd id selection
Move the loop that finds a new unique rbd id to use into
its own helper function.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Josh Durgin cc9d734c3d rbd: use a single value of snap_name to mean no snap
There's already a constant for this anyway.

Since rbd_header_set_snap() is only used to set the rbd device
snap_name field, just do that within that function rather than
having it take the snap_name as an argument.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>

v2: Changed interface rbd_header_set_snap() so it explicitly updates
    the snap_name in the rbd_device.  Also added a BUILD_BUG_ON()
    to verify the size of the snap_name field is sufficient for
    SNAP_HEAD_NAME.
2012-03-22 10:47:47 -05:00
Alex Elder 1dbb439913 rbd: do not duplicate ceph_client pointer in rbd_device
The rbd_device structure maintains a duplicate copy of the
ceph_client pointer maintained in its rbd_client structure.  There
appears to be no good reason for this, and its presence presents a
risk of them getting out of synch or otherwise misused.  So kill it
off, and use the rbd_client copy only.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder ee57741c52 rbd: make ceph_parse_options() return a pointer
ceph_parse_options() takes the address of a pointer as an argument
and uses it to return the address of an allocated structure if
successful.  With this interface is not evident at call sites that
the pointer is always initialized.  Change the interface to return
the address instead (or a pointer-coded error code) to make the
validity of the returned pointer obvious.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:47 -05:00
Alex Elder 2107978668 rbd: a few small cleanups
Some minor cleanups in "drivers/block/rbd.c:
    - Use the more meaningful "RBD_MAX_OBJ_NAME_LEN" in place if "96"
      in the definition of RBD_MAX_MD_NAME_LEN.
    - Use DEFINE_SPINLOCK() to define and initialize node_lock.
    - Drop a needless (char *) cast in parse_rbd_opts_token().
    - Make a few minor formatting changes.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-03-22 10:47:46 -05:00
Igor Mammedov b9136d207f xen: initialize platform-pci even if xen_emul_unplug=never
When xen_emul_unplug=never is specified on kernel command line
reading files from /sys/hypervisor is broken (returns -EBUSY).
It is caused by xen_bus dependency on platform-pci and
platform-pci isn't initialized when xen_emul_unplug=never is
specified.

Fix it by allowing platform-pci to ignore xen_emul_unplug=never,
and do not intialize xen_[blk|net]front instead.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-22 11:37:11 -04:00
Linus Torvalds 5375871d43 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc merge from Benjamin Herrenschmidt:
 "Here's the powerpc batch for this merge window.  It is going to be a
  bit more nasty than usual as in touching things outside of
  arch/powerpc mostly due to the big iSeriesectomy :-) We finally got
  rid of the bugger (legacy iSeries support) which was a PITA to
  maintain and that nobody really used anymore.

  Here are some of the highlights:

   - Legacy iSeries is gone.  Thanks Stephen ! There's still some bits
     and pieces remaining if you do a grep -ir series arch/powerpc but
     they are harmless and will be removed in the next few weeks
     hopefully.

   - The 'fadump' functionality (Firmware Assisted Dump) replaces the
     previous (equivalent) "pHyp assisted dump"...  it's a rewrite of a
     mechanism to get the hypervisor to do crash dumps on pSeries, the
     new implementation hopefully being much more reliable.  Thanks
     Mahesh Salgaonkar.

   - The "EEH" code (pSeries PCI error handling & recovery) got a big
     spring cleaning, motivated by the need to be able to implement a
     new backend for it on top of some new different type of firwmare.

     The work isn't complete yet, but a good chunk of the cleanups is
     there.  Note that this adds a field to struct device_node which is
     not very nice and which Grant objects to.  I will have a patch soon
     that moves that to a powerpc private data structure (hopefully
     before rc1) and we'll improve things further later on (hopefully
     getting rid of the need for that pointer completely).  Thanks Gavin
     Shan.

   - I dug into our exception & interrupt handling code to improve the
     way we do lazy interrupt handling (and make it work properly with
     "edge" triggered interrupt sources), and while at it found & fixed
     a wagon of issues in those areas, including adding support for page
     fault retry & fatal signals on page faults.

   - Your usual random batch of small fixes & updates, including a bunch
     of new embedded boards, both Freescale and APM based ones, etc..."

I fixed up some conflicts with the generalized irq-domain changes from
Grant Likely, hopefully correctly.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (141 commits)
  powerpc/ps3: Do not adjust the wrapper load address
  powerpc: Remove the rest of the legacy iSeries include files
  powerpc: Remove the remaining CONFIG_PPC_ISERIES pieces
  init: Remove CONFIG_PPC_ISERIES
  powerpc: Remove FW_FEATURE ISERIES from arch code
  tty/hvc_vio: FW_FEATURE_ISERIES is no longer selectable
  powerpc/spufs: Fix double unlocks
  powerpc/5200: convert mpc5200 to use of_platform_populate()
  powerpc/mpc5200: add options to mpc5200_defconfig
  powerpc/mpc52xx: add a4m072 board support
  powerpc/mpc5200: update mpc5200_defconfig to fit for charon board
  Documentation/powerpc/mpc52xx.txt: Checkpatch cleanup
  powerpc/44x: Add additional device support for APM821xx SoC and Bluestone board
  powerpc/44x: Add support PCI-E for APM821xx SoC and Bluestone board
  MAINTAINERS: Update PowerPC 4xx tree
  powerpc/44x: The bug fixed support for APM821xx SoC and Bluestone board
  powerpc: document the FSL MPIC message register binding
  powerpc: add support for MPIC message register API
  powerpc/fsl: Added aliased MSIIR register address to MSI node in dts
  powerpc/85xx: mpc8548cds - add 36-bit dts
  ...
2012-03-21 18:55:10 -07:00
Linus Torvalds 9f3938346a Merge branch 'kmap_atomic' of git://github.com/congwang/linux
Pull kmap_atomic cleanup from Cong Wang.

It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().

Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.

* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
  feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
  highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
  drbd: remove the second argument of k[un]map_atomic()
  zcache: remove the second argument of k[un]map_atomic()
  gma500: remove the second argument of k[un]map_atomic()
  dm: remove the second argument of k[un]map_atomic()
  tomoyo: remove the second argument of k[un]map_atomic()
  sunrpc: remove the second argument of k[un]map_atomic()
  rds: remove the second argument of k[un]map_atomic()
  net: remove the second argument of k[un]map_atomic()
  mm: remove the second argument of k[un]map_atomic()
  lib: remove the second argument of k[un]map_atomic()
  power: remove the second argument of k[un]map_atomic()
  kdb: remove the second argument of k[un]map_atomic()
  udf: remove the second argument of k[un]map_atomic()
  ubifs: remove the second argument of k[un]map_atomic()
  squashfs: remove the second argument of k[un]map_atomic()
  reiserfs: remove the second argument of k[un]map_atomic()
  ocfs2: remove the second argument of k[un]map_atomic()
  ntfs: remove the second argument of k[un]map_atomic()
  ...
2012-03-21 09:40:26 -07:00
Linus Torvalds 69a7aebcf0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "It's indeed trivial -- mostly documentation updates and a bunch of
  typo fixes from Masanari.

  There are also several linux/version.h include removals from Jesper."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
  kcore: fix spelling in read_kcore() comment
  constify struct pci_dev * in obvious cases
  Revert "char: Fix typo in viotape.c"
  init: fix wording error in mm_init comment
  usb: gadget: Kconfig: fix typo for 'different'
  Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
  writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
  writeback: fix typo in the writeback_control comment
  Documentation: Fix multiple typo in Documentation
  tpm_tis: fix tis_lock with respect to RCU
  Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
  Doc: Update numastat.txt
  qla4xxx: Add missing spaces to error messages
  compiler.h: Fix typo
  security: struct security_operations kerneldoc fix
  Documentation: broken URL in libata.tmpl
  Documentation: broken URL in filesystems.tmpl
  mtd: simplify return logic in do_map_probe()
  mm: fix comment typo of truncate_inode_pages_range
  power: bq27x00: Fix typos in comment
  ...
2012-03-20 21:12:50 -07:00
Linus Torvalds ed378a52da USB merge for 3.4-rc1
Here's the big USB merge for the 3.4-rc1 merge window.
 
 Lots of gadget driver reworks here, driver updates, xhci changes, some
 new drivers added, usb-serial core reworking to fix some bugs, and other
 various minor things.
 
 There are some patches touching arch code, but they have all been acked
 by the various arch maintainers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.18 (GNU/Linux)
 
 iEYEABECAAYFAk9njL8ACgkQMUfUDdst+ylQ9wCfbBOnIT01lGOorkaE9pom0hhk
 HfMAoKq1xzCR2B+OS3UMyUQffk+Ri9Ri
 =KIQ2
 -----END PGP SIGNATURE-----

Merge tag 'usb-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB merge for 3.4-rc1 from Greg KH:
 "Here's the big USB merge for the 3.4-rc1 merge window.

  Lots of gadget driver reworks here, driver updates, xhci changes, some
  new drivers added, usb-serial core reworking to fix some bugs, and
  other various minor things.

  There are some patches touching arch code, but they have all been
  acked by the various arch maintainers."

* tag 'usb-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (302 commits)
  net: qmi_wwan: add support for ZTE MF820D
  USB: option: add ZTE MF820D
  usb: gadget: f_fs: Remove lock is held before freeing checks
  USB: option: make interface blacklist work again
  usb/ub: deprecate & schedule for removal the "Low Performance USB Block" driver
  USB: ohci-pxa27x: add clk_prepare/clk_unprepare calls
  USB: use generic platform driver on ath79
  USB: EHCI: Add a generic platform device driver
  USB: OHCI: Add a generic platform device driver
  USB: ftdi_sio: new PID: LUMEL PD12
  USB: ftdi_sio: add support for FT-X series devices
  USB: serial: mos7840: Fixed MCS7820 device attach problem
  usb: Don't make USB_ARCH_HAS_{XHCI,OHCI,EHCI} depend on USB_SUPPORT.
  usb gadget: fix a section mismatch when compiling g_ffs with CONFIG_USB_FUNCTIONFS_ETH
  USB: ohci-nxp: Remove i2c_write(), use smbus
  USB: ohci-nxp: Support for LPC32xx
  USB: ohci-nxp: Rename symbols from pnx4008 to nxp
  USB: OHCI-HCD: Rename ohci-pnx4008 to ohci-nxp
  usb: gadget: Kconfig: fix typo for 'different'
  usb: dwc3: pci: fix another failure path in dwc3_pci_probe()
  ...
2012-03-20 11:26:30 -07:00
Cong Wang 589973a704 drbd: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:29 +08:00
Cong Wang cfd8005c99 block: remove the second argument of k[un]map_atomic()
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-03-20 21:48:16 +08:00
Steven Noonan 3467811e26 xen-blkfront: make blkif_io_lock spinlock per-device
This patch moves the global blkif_io_lock to the per-device structure. The
spinlock seems to exists for two reasons: to disable IRQs when in the interrupt
handlers for blkfront, and to protect the blkfront VBDs when a detachment is
requested.

Having a global blkif_io_lock doesn't make sense given the use case, and it
drastically hinders performance due to contention. All VBDs with pending IOs
have to take the lock in order to get work done, which serializes everything
pretty badly.

Signed-off-by: Steven Noonan <snoonan@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Andrew Jones dad5cf659b xen/blkfront: don't put bdev right after getting it
We should hang onto bdev until we're done with it.

Signed-off-by: Andrew Jones <drjones@redhat.com>
[v1: Fixed up git commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Akinobu Mita 34ae2e47d9 xen-blkfront: use bitmap_set() and bitmap_clear()
Use bitmap_set and bitmap_clear rather than modifying individual bits
in a memory region.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xensource.com
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Daniel De Graaf b2167ba6dd xen/blkback: Enable blkback on HVM guests
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Daniel De Graaf 4f14faaab4 xen/blkback: use grant-table.c hypercall wrappers
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-03-20 12:52:41 +01:00
Sebastian Andrzej Siewior 7396bd9fa1 usb/ub: deprecate & schedule for removal the "Low Performance USB Block" driver
Deprecate this driver. All devices which can be handled by this driver
can also be handled by the usb-storage driver.

Acked-By: Pete Zaitcev <zaitcev@redhat.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-16 13:30:10 -07:00
Stephen Rothwell ba7a4822b4 powerpc: Remove some of the legacy iSeries specific device drivers
These drivers are specific to the PowerPC legacy iSeries platform and
their Kconfig is specified in arch/powerpc.  Legacy iSeries is being
removed, so these drivers can no longer be selected.

Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-16 09:28:05 +11:00
Linus Torvalds f1cbd03f5e Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Been sitting on this for a while, but lets get this out the door.
  This fixes various important bugs for 3.3 final, along with a few more
  trivial ones.  Please pull!"

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix ioc leak in put_io_context
  block, sx8: fix pointer math issue getting fw version
  Block: use a freezable workqueue for disk-event polling
  drivers/block/DAC960: fix -Wuninitialized warning
  drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
  block: fix __blkdev_get and add_disk race condition
  block: Fix setting bio flags in drivers (sd_dif/floppy)
  block: Fix NULL pointer dereference in sd_revalidate_disk
  block: exit_io_context() should call elevator_exit_icq_fn()
  block: simplify ioc_release_fn()
  block: replace icq->changed with icq->flags
2012-03-14 17:16:45 -07:00
Greg Kroah-Hartman f7a0d426f3 Merge 3.3-rc7 into usb-next
This resolves the conflict with drivers/usb/host/ehci-fsl.h that
happened with changes in Linus's and this branch at the same time.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-12 09:13:31 -07:00
Muthu Kumar 9354f1b8e6 floppy/scsi: fix setting of BIO flags
Fix setting bio flags in drivers (sd_dif/floppy).

Signed-off-by: Muthukumar R <muthur@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05 15:49:43 -08:00
Dan Carpenter ea5f4db8ec block, sx8: fix pointer math issue getting fw version
"mem" is type u8.  We need parenthesis here or it screws up the pointer
math probably leading to an oops.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@kernel.org
Acked-by: Jeff Garzik <jgarzik@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-03 19:44:39 +01:00
Danny Kukawka cecd353a02 drivers/block/DAC960: fix -Wuninitialized warning
Set CommandMailbox with memset before use it. Fix for:

drivers/block/DAC960.c: In function ‘DAC960_V1_EnableMemoryMailboxInterface’:
arch/x86/include/asm/io.h:61:1: warning: ‘CommandMailbox.Bytes[12]’
 may be used uninitialized in this function [-Wuninitialized]
drivers/block/DAC960.c:1175:30: note: ‘CommandMailbox.Bytes[12]’
 was declared here

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:48:35 +01:00
Danny Kukawka bca505f109 drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning
Fixed compiler warning:

comparison between ‘DAC960_V2_IOCTL_Opcode_T’ and ‘enum <anonymous>’

Renamed enum, added a new enum for SCSI_10.CommandOpcode in
DAC960_V2_ProcessCompletedCommand().

Signed-off-by: Danny Kukawka <danny.kukawka@bisect.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:48:32 +01:00
Muthukumar R 12ebffd146 block: Fix setting bio flags in drivers (sd_dif/floppy)
Fix setting bio flags in drivers (sd_dif/floppy).

Signed-off-by: Muthukumar R <muthur@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-02 10:40:58 +01:00
Sebastian Andrzej Siewior 7ac4704c09 usb/storage: a couple defines from drivers/usb/storage/transport.h to include/linux/usb/storage.h
This moves the BOT data structures for CBW and CSW from drivers internal
header file to global include able file in include/.
The storage gadget is using the same name for CSW but a different for
CBW so I fix it up properly. The same goes for the ub driver and keucr
driver in staging.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-28 11:05:18 -08:00
Hitoshi Mitake 797a796a13 asm-generic: architecture independent readq/writeq for 32bit environment
This provides unified readq()/writeq() helper functions for 32-bit
drivers.

For some cases, readq/writeq without atomicity is harmful, and order of
io access has to be specified explicitly.  So in this patch, new two
header files which contain non-atomic readq/writeq are added.

 - <asm-generic/io-64-nonatomic-lo-hi.h> provides non-atomic readq/
   writeq with the order of lower address -> higher address

 - <asm-generic/io-64-nonatomic-hi-lo.h> provides non-atomic readq/
   writeq with reversed order

This allows us to remove some readq()s that were added drivers when the
default non-atomic ones were removed in commit dbee8a0aff ("x86:
remove 32-bit versions of readq()/writeq()")

The drivers which need readq/writeq but can do with the non-atomic ones
must add the line:

  #include <asm-generic/io-64-nonatomic-lo-hi.h> /* or hi-lo.h */

But this will be nop in 64-bit environments, and no other #ifdefs are
required.  So I believe that this patch can solve the problem of
 1. driver-specific readq/writeq
 2. atomicity and order of io access

This patch is tested with building allyesconfig and allmodconfig as
ARCH=x86 and ARCH=i386 on top of tip/master.

Cc: Kashyap Desai <Kashyap.Desai@lsi.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ravi Anand <ravi.anand@qlogic.com>
Cc: Vikas Chaudhary <vikas.chaudhary@qlogic.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Jason Uhlenkott <juhlenko@akamai.com>
Cc: James Bottomley <James.Bottomley@parallels.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: James Bottomley <jbottomley@parallels.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-21 16:47:28 -08:00
Jesper Juhl d0156f4d62 NVM Express: Remove unneeded include of linux/version.h from nvme.c
There's no need for drivers/block/nvme.c to include linux/version.h,
so remove the include.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-02-21 11:48:54 +01:00
Linus Torvalds 3ec1e88b33 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Says Jens:

 "Time to push off some of the pending items.  I really wanted to wait
  until we had the regression nailed, but alas it's not quite there yet.
  But I'm very confident that it's "just" a missing expire on exit, so
  fix from Tejun should be fairly trivial.  I'm headed out for a week on
  the slopes.

  - Killing the barrier part of mtip32xx.  It doesn't really support
    barriers, and it doesn't need them (writes are fully ordered).

  - A few fixes from Dan Carpenter, preventing overflows of integer
    multiplication.

  - A fixup for loop, fixing a previous commit that didn't quite solve
    the partial read problem from Dave Young.

  - A bio integer overflow fix from Kent Overstreet.

  - Improvement/fix of the door "keep locked" part of the cdrom shared
    code from Paolo Benzini.

  - A few cfq fixes from Shaohua Li.

  - A fix for bsg sysfs warning when removing a file it did not create
    from Stanislaw Gruszka.

  - Two fixes for floppy from Vivek, preventing a crash.

  - A few block core fixes from Tejun.  One killing the over-optimized
    ioc exit path, cleaning that up nicely.  Two others fixing an oops
    on elevator switch, due to calling into the scheduler merge check
    code without holding the queue lock."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix lockdep warning on io_context release put_io_context()
  relay: prevent integer overflow in relay_open()
  loop: zero fill bio instead of return -EIO for partial read
  bio: don't overflow in bio_get_nr_vecs()
  floppy: Fix a crash during rmmod
  floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
  cdrom: move shared static to cdrom_device_info
  bsg: fix sysfs link remove warning
  block: don't call elevator callbacks for plug merges
  block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions
  mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
  block: strip out locking optimization in put_io_context()
  cdrom: use copy_to_user() without the underscores
  block: fix ioc locking warning
  block: fix NULL icq_cache reference
  block,cfq: change code order
2012-02-11 10:07:11 -08:00
Dave Young 306df0716a loop: zero fill bio instead of return -EIO for partial read
commit 8268f5a741 ("deny partial write for loop dev fd") tried to fix the
loop device partial read information leak problem.  But it changed the
semantics of read behavior.  When we read beyond the end of the device we
should get 0 bytes, which is normal behavior, we should not just return
-EIO

Instead of returning -EIO, zero out the bio to avoid information leak in
case of partail read.

Signed-off-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Dmitry Monakhov <dmonakhov@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 22:07:19 +01:00
Vivek Goyal 4609dff6b5 floppy: Fix a crash during rmmod
floppy driver does not call add_disk() on all the drives hence we don't take
gendisk reference on request queue for these drives. Don't call put_disk()
with disk->queue set, otherwise we try to put the reference we never took.

Reported-and-tested-by: Dirk Gouders <gouders@et.bocholt.fh-gelsenkirchen.de>
Signed-off-by: Vivek Goyal<vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 20:03:39 +01:00
Vivek Goyal 3f9a5aabd0 floppy: Cleanup disk->queue before caling put_disk() if add_disk() was never called
add_disk() takes gendisk reference on request queue. If driver failed during
initialization and never called add_disk() then that extra reference is not
taken. That reference is put in put_disk(). floppy driver allocates the
disk, allocates queue, sets disk->queue and then relizes that floppy
controller is not present. It tries to tear down everything and tries to
put a reference down in put_disk() which was never taken.

In such error cases cleanup disk->queue before calling put_disk() so that
we never try to put down a reference which was never taken in first place.

Reported-and-tested-by: Suresh Jayaraman <sjayaraman@suse.com>
Tested-by: Dirk Gouders <gouders@et.bocholt.fh-gelsenkirchen.de>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-08 20:03:38 +01:00
Asai Thambi S P 4e8670e261 mtip32xx: removed the irrelevant argument of mtip_hw_submit_io() and the unused member of struct driver_data
Removed the following:
	* irrelevant argument 'barrier' of mtip_hw_submit_io()
	* unused member 'eh_active' of struct driver_data

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-02-07 07:54:31 +01:00
Linus Torvalds 6c073a7ee2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  rbd: fix safety of rbd_put_client()
  rbd: fix a memory leak in rbd_get_client()
  ceph: create a new session lock to avoid lock inversion
  ceph: fix length validation in parse_reply_info()
  ceph: initialize client debugfs outside of monc->mutex
  ceph: change "ceph.layout" xattr to be "ceph.file.layout"
2012-02-02 15:47:33 -08:00
Alex Elder d23a4b3fd6 rbd: fix safety of rbd_put_client()
The rbd_client structure uses a kref to arrange for cleaning up and
freeing an instance when its last reference is dropped.  The cleanup
routine is rbd_client_release(), and one of the things it does is
delete the rbd_client from rbd_client_list.  It acquires node_lock
to do so, but the way it is done is still not safe.

The problem is that when attempting to reuse an existing rbd_client,
the structure found might already be in the process of getting
destroyed and cleaned up.

Here's the scenario, with "CLIENT" representing an existing
rbd_client that's involved in the race:

 Thread on CPU A                | Thread on CPU B
 ---------------                | ---------------
 rbd_put_client(CLIENT)         | rbd_get_client()
   kref_put()                   |   (acquires node_lock)
     kref->refcount becomes 0   |   __rbd_client_find() returns CLIENT
     calls rbd_client_release() |   kref_get(&CLIENT->kref);
                                |   (releases node_lock)
       (acquires node_lock)     |
       deletes CLIENT from list | ...and starts using CLIENT...
       (releases node_lock)     |
       and frees CLIENT         | <-- but CLIENT gets freed here

Fix this by having rbd_put_client() acquire node_lock.  The result
could still be improved, but at least it avoids this problem.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:56:59 -08:00
Alex Elder 97bb59a03d rbd: fix a memory leak in rbd_get_client()
If an existing rbd client is found to be suitable for use in
rbd_get_client(), the rbd_options structure is not being
freed as it should.  Fix that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-02-02 12:49:27 -08:00
Linus Torvalds 93c3d65b28 nvme: fix merge error due to change of 'make_request_fn' fn type
The type of 'make_request_fn' changed in 5a7bbad27a ("block: remove
support for bio remapping from ->make_request"), but the merge of the
nvme driver didn't take that into account, and as a result the driver
would compile with a warning:

  drivers/block/nvme.c: In function 'nvme_alloc_ns':
  drivers/block/nvme.c:1336:2: warning: passing argument 2 of 'blk_queue_make_request' from incompatible pointer type [enabled by default]
  include/linux/blkdev.h:830:13: note: expected 'void (*)(struct request_queue *, struct bio *)' but argument is of type 'int (*)(struct request_queue *, struct bio *)'

It's benign, but the warning is annoying.

Reported-by: Stephen Rothwell <sfr@canb.auug.org>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-18 15:41:27 -08:00
Linus Torvalds 92b5abbb44 Merge git://git.infradead.org/users/willy/linux-nvme
* git://git.infradead.org/users/willy/linux-nvme: (105 commits)
  NVMe: Set number of queues correctly
  NVMe: Version 0.8
  NVMe: Set queue flags correctly
  NVMe: Simplify nvme_unmap_user_pages
  NVMe: Mark the end of the sg list
  NVMe: Fix DMA mapping for admin commands
  NVMe: Rename IO_TIMEOUT to NVME_IO_TIMEOUT
  NVMe: Merge the nvme_bio and nvme_prp data structures
  NVMe: Change nvme_completion_fn to take a dev
  NVMe: Change get_nvmeq to take a dev instead of a namespace
  NVMe: Simplify completion handling
  NVMe: Update Identify Controller data structure
  NVMe: Implement doorbell stride capability
  NVMe: Version 0.7
  NVMe: Don't probe namespace 0
  Fix calculation of number of pages in a PRP List
  NVMe: Create nvme_identify and nvme_get_features functions
  NVMe: Fix memory leak in nvme_dev_add()
  NVMe: Fix calls to dma_unmap_sg
  NVMe: Correct sg list setup in nvme_map_user_pages
  ...
2012-01-18 12:34:09 -08:00
Linus Torvalds 16008d6416 Merge branch 'for-3.3/drivers' of git://git.kernel.dk/linux-block
* 'for-3.3/drivers' of git://git.kernel.dk/linux-block:
  mtip32xx: do rebuild monitoring asynchronously
  xen-blkfront: Use kcalloc instead of kzalloc to allocate array
  mtip32xx: uninitialized variable in mtip_quiesce_io()
  mtip32xx: updates based on feedback
  xen-blkback: convert hole punching to discard request on loop devices
  xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
  xen/blk[front|back]: Enhance discard support with secure erasing support.
  xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together
  mtip32xx: update to new ->make_request() API
  mtip32xx: add module.h include to avoid conflict with moduleh tree
  mtip32xx: mark a few more items static
  mtip32xx: ensure that all local functions are static
  mtip32xx: cleanup compat ioctl handling
  mtip32xx: fix warnings/errors on 32-bit compiles
  block: Add driver for Micron RealSSD pcie flash cards
2012-01-15 12:48:41 -08:00
Linus Torvalds b3c9dd182e Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
  Revert "block: recursive merge requests"
  block: Stop using macro stubs for the bio data integrity calls
  blockdev: convert some macros to static inlines
  fs: remove unneeded plug in mpage_readpages()
  block: Add BLKROTATIONAL ioctl
  block: Introduce blk_set_stacking_limits function
  block: remove WARN_ON_ONCE() in exit_io_context()
  block: an exiting task should be allowed to create io_context
  block: ioc_cgroup_changed() needs to be exported
  block: recursive merge requests
  block, cfq: fix empty queue crash caused by request merge
  block, cfq: move icq creation and rq->elv.icq association to block core
  block, cfq: restructure io_cq creation path for io_context interface cleanup
  block, cfq: move io_cq exit/release to blk-ioc.c
  block, cfq: move icq cache management to block core
  block, cfq: move io_cq lookup to blk-ioc.c
  block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
  block, cfq: reorganize cfq_io_context into generic and cfq specific parts
  block: remove elevator_queue->ops
  block: reorder elevator switch sequence
  ...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)
2012-01-15 12:24:45 -08:00
Jens Axboe 85a0f7b220 Merge branch 'for-3.3/mtip32xx' into for-3.3/drivers 2012-01-15 10:39:35 +01:00
Paolo Bonzini 577ebb374c block: add and use scsi_blk_cmd_ioctl
Introduce a wrapper around scsi_cmd_ioctl that takes a block device.

The function will then be enhanced to detect partition block devices
and, in that case, subject the ioctls to whitelisting.

Cc: linux-scsi@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: James Bottomley <JBottomley@parallels.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-14 15:07:24 -08:00
Linus Torvalds 0a80939b3e Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJPD2aFAAoJENkgDmzRrbjxNzsQAIeYbbrXYLjr6kQzUSngj/eC
 FzjaTEfYTQIeuQCFJHcHthyc5lXV4sQbo3jOezW+Bp5yuDJL2aWIHesSfWZe7imu
 zQdM4VshOYdAmUR9Q0AW5zhB8Smbs7/AyABiF2jm4p0ZPOuyMDSlei9sjvE9Vjvt
 B7g5ht7L6kz0JbDnwwy0u5gs+tEitwpXYId9Y4ysZIBzIbL0qkPX8veOddGTMy0N
 8xhWXaKtufpjvxFD2ORLDsw3AkoF1xXSNuFd/5nzCNpbeE7TW931jfkPoqJumuAO
 7GLxcU9kKYl+IICobC6wBtsj/RrB7w+cBXMvPGwdBliam1qaRhUcJZi5FLM/Ha5d
 2A9QDYNUpoXiO8JbPXrV9Z+Y0+Co8RilsQj7R/rjZh6AbbYCWt9nxzx2Svl/RfTr
 xfiimHuB2P3rHjOvpCXULwOOuE5c8MzPuWncpdjiD3uGXOY/aY+X1m+if/quJw9D
 grPlKL0+YiRakEYUeGG4M77KCqyKFZaF7L7UQPbqfZcj8V/9AW3/7U5I/B9RlAjs
 idsr4fcf5s0N+oKUyTCW1ncpUDQNiwbU2NyJQqeu1ZxaRGj72AgyvsaNeyIPDyK+
 f6x95Bi7i8KLjXc9Z1KvJwh2Nxt25gNUiTYVha/9H2NpJGd1cfI15kTOGXrgddVv
 1pvuGcJDZwYiwfiXr3FL
 =HHrh
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://github.com/rustyrussell/linux

Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999  BFCB D920 0E6C D1AD B8F1

* tag 'for-linus' of git://github.com/rustyrussell/linux:
  module_param: check that bool parameters really are bool.
  intelfbdrv.c: bailearly is an int module_param
  paride/pcd: fix bool verbose module parameter.
  module_param: make bool parameters really bool (drivers & misc)
  module_param: make bool parameters really bool (arch)
  module_param: make bool parameters really bool (core code)
  kernel/async: remove redundant declaration.
  printk: fix unnecessary module_param_name.
  lirc_parallel: fix module parameter description.
  module_param: avoid bool abuse, add bint for special cases.
  module_param: check type correctness for module_param_array
  modpost: use linker section to generate table.
  modpost: use a table rather than a giant if/else statement.
  modules: sysfs - export: taint, coresize, initsize
  kernel/params: replace DEBUGP with pr_debug
  module: replace DEBUGP with pr_debug
  module: struct module_ref should contains long fields
  module: Fix performance regression on modules with large symbol tables
  module: Add comments describing how the "strmap" logic works

Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
generated table approach to adding __mod_*_device_table entries.  The
ARM sa11x0 mcp bus needed to be converted to that too.
2012-01-14 12:32:16 -08:00
Linus Torvalds 1a52bb0b68 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: ensure prealloc_blob is in place when removing xattr
  rbd: initialize snap_rwsem in rbd_add()
  ceph: enable/disable dentry complete flags via mount option
  vfs: export symbol d_find_any_alias()
  ceph: always initialize the dentry in open_root_dentry()
  libceph: remove useless return value for osd_client __send_request()
  ceph: avoid iput() while holding spinlock in ceph_dir_fsync
  ceph: avoid useless dget/dput in encode_fh
  ceph: dereference pointer after checking for NULL
  crush: fix force for non-root TAKE
  ceph: remove unnecessary d_fsdata conditional checks
  ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
2012-01-13 10:29:21 -08:00
Rusty Russell 1b9fbafb3a paride/pcd: fix bool verbose module parameter.
Dan Carpenter points out that it's an int, not a bool:

pcd.c:427:				if (verbose > 1)
pcd.c:433:				if (verbose > 1)
pcd.c:437:				if (verbose < 2)
pcd.c:506:#define DBMSG(msg)	((verbose>1)?(msg):NULL)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
2012-01-13 09:32:26 +10:30
Rusty Russell 90ab5ee941 module_param: make bool parameters really bool (drivers & misc)
module_param(bool) used to counter-intuitively take an int.  In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option.  For this version
it'll simply give a warning, but it'll break next kernel version.

Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-13 09:32:20 +10:30
Alex Elder 0e805a1d85 rbd: initialize snap_rwsem in rbd_add()
New rbd device structures get initialized in rbd_add().  Many of
the fields rely on being initially zero-filled.  However we lockdep
was noticing that the rw_semaphore embedded in the header field
was not getting properly initialized.  Fix that.

Signed-off-by: Alex Elder <elder@dreamhost.com>
Signed-off-by: Sage Weil <sage@newdream.net>
2012-01-12 11:00:50 -08:00
Amit Shah f8fb5bc23a virtio: blk: Add freeze, restore handlers to support S4
Delete the vq and flush any pending requests from the block queue on the
freeze callback to prepare for hibernation.

Re-create the vq in the restore callback to resume normal function.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:45 +10:30
Amit Shah 6abd6e5a44 virtio: blk: Move vq initialization to separate function
The probe and PM restore functions will share this code.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:45 +10:30
Michael S. Tsirkin 4678d6f970 virtio_blk: fix config handler race
Fix a theoretical race related to config work
handler: a config interrupt might happen
after we flush config work but before we
reset the device. It will then cause the
config work to run during or after reset.

Two problems with this:
- if this runs after device is gone we will get use after free
- access of config while reset is in progress is racy
(as layout is changing).

As a solution
1. flush after reset when we know there will be no more interrupts
2. add a flag to disable config access before reset

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-01-12 15:44:44 +10:30
Rusty Russell f96fde41f7 virtio: rename virtqueue_add_buf_gfp to virtqueue_add_buf
Remove wrapper functions. This makes the allocation type explicit in
all callers; I used GPF_KERNEL where it seemed obvious, left it at
GFP_ATOMIC otherwise.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2012-01-12 15:44:42 +10:30
Matthew Wilcox df34813990 NVMe: Set number of queues correctly
The number of submission & completion queues should be set by calling
Set Features, not Get Features.

Reported-by: Kwok Kong <Kwok.Kong@idt.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-11 09:22:24 -05:00
Linus Torvalds 4690dfa8cd Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblaze
* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
  microblaze: Wire-up new system calls
  microblaze: Remove NO_IRQ from architecture
  input: xilinx_ps2: Don't use NO_IRQ
  block: xsysace: Don't use NO_IRQ
  microblaze: Trivial asm fix
  microblaze: Fix debug message in module
  microblaze: Remove eprintk macro
  microblaze: Send CR before LF for early console
  microblaze: Change NO_IRQ to 0
  microblaze: Use irq_of_parse_and_map for timer
  microblaze: intc: Change variable name
  microblaze: Use of_find_compatible_node for timer and intc
  microblaze: Add __cmpdi2
  microblaze: Synchronize __pa __va macros
2012-01-10 17:37:49 -08:00
Matthew Wilcox 366e8217e5 NVMe: Version 0.8
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 16:30:15 -05:00
Matthew Wilcox 4eeb9215a0 NVMe: Set queue flags correctly
QUEUE_FLAG_* are flags (other than QUEUE_FLAG_DEFAULT), so they cannot
be ORed together.  Set the queue flags using queue_flag_set_unlocked().

Reported-by: Donald Wood <donald.e.wood@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 16:29:23 -05:00
Matthew Wilcox 1c2ad9faaf NVMe: Simplify nvme_unmap_user_pages
By using the iod->nents field (the same way other I/O paths do), we can
avoid recalculating the number of sg entries at unmap time, and make
nvme_unmap_user_pages() easier to call.

Also, use the 'write' parameter instead of assuming DMA_FROM_DEVICE.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:54:22 -05:00
Matthew Wilcox fe304c43c6 NVMe: Mark the end of the sg list
For user I/O and admin commands, we were forgetting to mark the end of
the SG list.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:54:14 -05:00
Matthew Wilcox 497421880a NVMe: Fix DMA mapping for admin commands
We were always mapping as DMA_FROM_DEVICE then unmapping with
DMA_TO_DEVICE which was clearly not correct.  Follow the same pattern as
nvme_submit_io() and key off the bottom bit of the opcode to determine
whether this is a read or a write.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:54:05 -05:00
Matthew Wilcox ff976d724a NVMe: Rename IO_TIMEOUT to NVME_IO_TIMEOUT
IO_TIMEOUT is a little too generic and might be used by other parts of
the kernel in the future.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:53:54 -05:00
Matthew Wilcox eca18b2394 NVMe: Merge the nvme_bio and nvme_prp data structures
The new merged data structure is called nvme_iod.  This improves performance
for mid-sized I/Os (in the 16k range) since we save a memory allocation.
It is also a slightly simpler interface to use.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:51:20 -05:00
Matthew Wilcox 5c1281a3bf NVMe: Change nvme_completion_fn to take a dev
The queue is only needed for some rare occasions, and it's more consistent
to pass the device around.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:51:00 -05:00
Matthew Wilcox 040a93b52a NVMe: Change get_nvmeq to take a dev instead of a namespace
Upcoming patches require calling get_nvmeq when we don't have a namespace.
Some callers already have the device in a local variable anyway.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:49:18 -05:00
Matthew Wilcox c2f5b65020 NVMe: Simplify completion handling
Instead of encoding the handler type in the bottom two bits of the
per-completion context pointer, store the handler function as well
as the context pointer.  This gives us more flexibility and the code
is clearer.  It comes at the cost of an extra 8k of memory per queue,
but this feels like a reasonable price to pay.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2012-01-10 14:47:46 -05:00
Linus Torvalds 90160371b3 Merge branch 'stable/for-linus-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/for-linus-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: (37 commits)
  xen/pciback: Expand the warning message to include domain id.
  xen/pciback: Fix "device has been assigned to X domain!" warning
  xen/pciback: Move the PCI_DEV_FLAGS_ASSIGNED ops to the "[un|]bind"
  xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
  xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
  xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
  Xen: consolidate and simplify struct xenbus_driver instantiation
  xen-gntalloc: introduce missing kfree
  xen/xenbus: Fix compile error - missing header for xen_initial_domain()
  xen/netback: Enable netback on HVM guests
  xen/grant-table: Support mappings required by blkback
  xenbus: Use grant-table wrapper functions
  xenbus: Support HVM backends
  xen/xenbus-frontend: Fix compile error with randconfig
  xen/xenbus-frontend: Make error message more clear
  xen/privcmd: Remove unused support for arch specific privcmp mmap
  xen: Add xenbus_backend device
  xen: Add xenbus device driver
  xen: Add privcmd device driver
  xen/gntalloc: fix reference counts on multi-page mappings
  ...
2012-01-10 10:09:59 -08:00
Linus Torvalds 98793265b4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
  Kconfig: acpi: Fix typo in comment.
  misc latin1 to utf8 conversions
  devres: Fix a typo in devm_kfree comment
  btrfs: free-space-cache.c: remove extra semicolon.
  fat: Spelling s/obsolate/obsolete/g
  SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
  tools/power turbostat: update fields in manpage
  mac80211: drop spelling fix
  types.h: fix comment spelling for 'architectures'
  typo fixes: aera -> area, exntension -> extension
  devices.txt: Fix typo of 'VMware'.
  sis900: Fix enum typo 'sis900_rx_bufer_status'
  decompress_bunzip2: remove invalid vi modeline
  treewide: Fix comment and string typo 'bufer'
  hyper-v: Update MAINTAINERS
  treewide: Fix typos in various parts of the kernel, and fix some comments.
  clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
  gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
  leds: Kconfig: Fix typo 'D2NET_V2'
  sound: Kconfig: drop unknown symbol ARCH_CLPS7500
  ...

Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
2012-01-08 13:21:22 -08:00
Linus Torvalds 972b2c7199 Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (165 commits)
  reiserfs: Properly display mount options in /proc/mounts
  vfs: prevent remount read-only if pending removes
  vfs: count unlinked inodes
  vfs: protect remounting superblock read-only
  vfs: keep list of mounts for each superblock
  vfs: switch ->show_options() to struct dentry *
  vfs: switch ->show_path() to struct dentry *
  vfs: switch ->show_devname() to struct dentry *
  vfs: switch ->show_stats to struct dentry *
  switch security_path_chmod() to struct path *
  vfs: prefer ->dentry->d_sb to ->mnt->mnt_sb
  vfs: trim includes a bit
  switch mnt_namespace ->root to struct mount
  vfs: take /proc/*/mounts and friends to fs/proc_namespace.c
  vfs: opencode mntget() mnt_set_mountpoint()
  vfs: spread struct mount - remaining argument of next_mnt()
  vfs: move fsnotify junk to struct mount
  vfs: move mnt_devname
  vfs: move mnt_list to struct mount
  vfs: switch pnode.h macros to struct mount *
  ...
2012-01-08 12:19:57 -08:00
Al Viro ece2ccb668 Merge branches 'vfsmount-guts', 'umode_t' and 'partitions' into Z 2012-01-06 23:15:54 -05:00
Linus Torvalds 356b95424c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k: (21 commits)
  m68k/mac: Make CONFIG_HEARTBEAT unavailable on Mac
  m68k/serial: Remove references to obsolete serial config options
  m68k/net: Remove obsolete IRQ_FLG_* users
  m68k: Don't comment out syscalls used by glibc
  m68k/atari: Move declaration of atari_SCC_reset_done to header file
  m68k/serial: Remove references to obsolete CONFIG_SERIAL167
  m68k/hp300: Export hp300_ledstate
  m68k: Initconst section fixes
  m68k/mac: cleanup macro case
  mac_scsi: fix mac_scsi on some powerbooks
  m68k/mac: fix powerbook 150 adb_type
  m68k/mac: fix baboon irq disable and shutdown
  m68k/mac: oss irq fixes
  m68k/mac: fix nubus slot irq disable and shutdown
  m68k/mac: enable via_alt_mapping on performa 580
  m68k/mac: cleanup forward declarations
  m68k/mac: cleanup mac_irq_pending
  m68k/mac: cleanup mac_clear_irq
  m68k/mac: early console
  m68k/mvme16x: Add support for EARLY_PRINTK
  ...

Fix up trivial conflict in arch/m68k/Kconfig.debug due to new
EARLY_PRINTK config option addition clashing with movement of the
BOOTPARAM options.
2012-01-06 18:28:12 -08:00
Michal Simek ba2d5affde block: xsysace: Don't use NO_IRQ
Drivers shouldn't use NO_IRQ. Microblaze and PPC
define NO_IRQ as 0 and this reference will be removed
in near future.

Signed-off-by: Michal Simek <monstr@monstr.eu>
Reviewed-by: Ryan Mallon <rmallon@gmail.com>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
CC: Rob Herring <rob.herring@calxeda.com>
2012-01-05 08:34:29 +01:00
Jan Beulich 73db144b58 Xen: consolidate and simplify struct xenbus_driver instantiation
The 'name', 'owner', and 'mod_name' members are redundant with the
identically named fields in the 'driver' sub-structure. Rather than
switching each instance to specify these fields explicitly, introduce
a macro to simplify this.

Eliminate further redundancy by allowing the drvname argument to
DEFINE_XENBUS_DRIVER() to be blank (in which case the first entry from
the ID table will be used for .driver.name).

Also eliminate the questionable xenbus_register_{back,front}end()
wrappers - their sole remaining purpose was the checking of the
'owner' field, proper setting of which shouldn't be an issue anymore
when the macro gets used.

v2: Restore DRV_NAME for the driver name in xen-pciback.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-01-04 17:01:17 -05:00
Asai Thambi S P 62ee8c13e2 mtip32xx: do rebuild monitoring asynchronously
Earlier, rebuild monitoring was done in the context of probe. Now the service
thread takes the responsibility of rebuild monitoring, and probe returns good
status.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-01-04 22:01:32 +01:00
Al Viro 2c9ede55ec switch device_get_devnode() and ->devnode() to umode_t *
both callers of device_get_devnode() are only interested in lower 16bits
and nobody tries to return anything wider than 16bit anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:55 -05:00
Al Viro ff01bb4832 fs: move code out of buffer.c
Move invalidate_bdev, block_sync_page into fs/block_dev.c.  Export
kill_bdev as well, so brd doesn't have to open code it.  Reduce
buffer_head.h requirement accordingly.

Removed a rather large comment from invalidate_bdev, as it looked a bit
obsolete to bother moving.  The small comment replacing it says enough.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:07 -05:00
Jens Axboe f748040bb8 Merge branch 'stable/for-jens-3.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.3/drivers 2011-12-25 16:46:46 +01:00
Linus Torvalds b0d78ee89c Merge branch 'for-linus' of git://git.kernel.dk/linux-block
* 'for-linus' of git://git.kernel.dk/linux-block:
  block: don't kick empty queue in blk_drain_queue()
  block/swim3: Locking fixes
  loop: Fix discard_alignment default setting
  cfq-iosched: fix cfq_cic_link() race confition
  cfq-iosched: free cic_index if blkio_alloc_blkg_stats fails
  cciss: fix flush cache transfer length
  cciss: Add IRQF_SHARED back in for the non-MSI(X) interrupt handler
  loop: fix loop block driver discard and encryption comment
  block: initialize request_queue's numa node during
2011-12-16 10:05:14 -08:00
Thomas Meyer f094148a17 xen-blkfront: Use kcalloc instead of kzalloc to allocate array
The advantage of kcalloc is, that will prevent integer overflows which could
result from the multiplication of number of elements and size and it is also
a bit nicer to read.

The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
[v1: Seperated the drivers/block/cciss_scsi.c out of this patch]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-12-16 12:36:52 -05:00
Tejun Heo 1ba64edef6 block, sx8: kill blk_insert_request()
The only user left for blk_insert_request() is sx8 and it can be
trivially switched to use blk_execute_rq_nowait() - special requests
aren't included in io stat and sx8 doesn't use block layer tagging.
Switch sx8 and kill blk_insert_requeset().

This patch doesn't introduce any functional difference.

Only compile tested.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:37 +01:00
Linus Torvalds 653f42f6b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: add missing spin_unlock at ceph_mdsc_build_path()
  ceph: fix SEEK_CUR, SEEK_SET regression
  crush: fix mapping calculation when force argument doesn't exist
  ceph: use i_ceph_lock instead of i_lock
  rbd: remove buggy rollback functionality
  rbd: return an error when an invalid header is read
  ceph: fix rasize reporting by ceph_show_options
2011-12-13 14:59:42 -08:00
Benjamin Herrenschmidt b302545744 block/swim3: Locking fixes
The old PowerMac swim3 driver has some "interesting" locking issues,
using a private lock and failing to lock the queue before completing
requests, which triggered WARN_ONs among others.

This rips out the private lock, makes everything operate under the
block queue lock, and generally makes things simpler.

We used to also share a queue between the two possible instances which
was problematic since we might pick the wrong controller in some cases,
so make the queue and the current request per-instance and use
queuedata to point to our private data which is a lot cleaner.

We still share the queue lock but then, it's nearly impossible to actually
use 2 swim3's simultaneously: one would need to have a Wallstreet
PowerBook, the only machine afaik with two of these on the motherboard,
and populate both hotswap bays with a floppy drive (the machine ships
only with one), so nobody cares...

While at it, add a little fix to clear up stale interrupts when loading
the driver or plugging a floppy drive in a bay.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-12 12:42:12 +01:00
Finn Thain ed04c97d51 m68k/mac: cleanup forward declarations
Move some forward declarations into header files and adjust includes.

Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
2011-12-10 19:52:46 +01:00
Josh Durgin 51703306b3 rbd: remove buggy rollback functionality
This doesn't interact with resizing well, since it doesn't set the
size of the device to the size at the snapshot. It's also an expensive
operation to be synchronous. Rollback can still be done with the
userspace rbd tool.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
2011-12-07 10:46:19 -08:00
Josh Durgin 81e759fbf7 rbd: return an error when an invalid header is read
This protects against opening future rbd images that have incompatible format changes.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
2011-12-07 10:46:10 -08:00
Justin P. Mattock 42b2aa86c6 treewide: Fix typos in various parts of the kernel, and fix some comments.
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-12-02 14:57:31 +01:00
Lukas Czerner dfaf3c036c loop: Fix discard_alignment default setting
discard_alignment is not relevant to the loop driver since it is
supposed to be set as a workaround for the old sector 63 alignments.
So set it to zero rather than block size.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-02 14:47:03 +01:00
Stephen M. Cameron 59bd71a81b cciss: fix flush cache transfer length
We weren't filling in the transfer length of the
flush cache command (it transfers 4 bytes of zeroes).
Firmware didn't seem to be bothered by this, but it
should be fixed.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-28 20:12:05 +01:00
Stephen M. Cameron 6225da4815 cciss: Add IRQF_SHARED back in for the non-MSI(X) interrupt handler
IRQF_SHARED is required for older controllers that don't support MSI(X)
and which may end up sharing an interrupt.

Also remove deprecated IRQF_DISABLED.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-28 20:12:05 +01:00
Dave Young ae95757a90 loop: fix loop block driver discard and encryption comment
The loop driver does not support discard if encryption is enabled,
fix the comment.

Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-25 09:41:25 +01:00
Dan Carpenter 3e54a3d1b8 mtip32xx: uninitialized variable in mtip_quiesce_io()
We recently introduce new continue in the loop which make gcc complain.
In theory if MTIP_FLAG_SVC_THD_ACTIVE_BIT is set, we could hit continue
over and over until eventually we time out of the loop.  In that case
"active" should be set as true, but right now it's uninitialized.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-24 12:59:00 +01:00
Asai Thambi S P 60ec0eecfa mtip32xx: updates based on feedback
* queue ncq commands when a non-ncq is in progress or error handling is active
* merge variables 'internal_cmd_in_progress' and 'eh_active' into new variable 'flags'
* get rid of read/write semaphore 'internal_sem'
* new service thread to issue queued commands
* use macros from ata.h for command codes
* return ENOTTY for BLKFLSBUF ioctl
* style changes

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-23 08:29:24 +01:00
Li Dongyang ae18be11b5 xen-blkback: convert hole punching to discard request on loop devices
As of dfaa2ef68e, loop devices support
discard request now. We could just issue a discard request, and
the loop driver will punch the hole for us, so we don't need to touch
the internals of loop device and punch the hole ourselves, Thanks.

V0->V1: rebased on devel/for-jens-3.3

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-11-18 13:28:05 -05:00
Konrad Rzeszutek Wilk 421463526f xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
.. and move it to its own function that will deal with the
discard operation.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-11-18 13:28:03 -05:00
Konrad Rzeszutek Wilk 5ea4298669 xen/blk[front|back]: Enhance discard support with secure erasing support.
Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.

CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-11-18 13:28:01 -05:00
Konrad Rzeszutek Wilk 97e36834f5 xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together
In a union type structure to deal with the overlapping
attributes in a easier manner.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-11-18 13:27:59 -05:00
Dan Carpenter a2c2a0e668 paride: fix potential information leak in pg_read()
Smatch has a new check for Rosenberg type information leaks where structs
are copied to the user with uninitialized stack data in them.  i In this
case, the pg_write_hdr struct has a hole in it.

struct pg_write_hdr {
        char                       magic;                /*     0     1 */
        char                       func;                 /*     1     1 */
        /* XXX 2 bytes hole, try to pack */
        int                        dlen;                 /*     4     4 */

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:50 +01:00
Stephen M. Cameron 0007a4c90a cciss: auto engage SCSI mid layer at driver load time
A long time ago, probably in 2002, one of the distros, or maybe more than
one, loaded block drivers prior to loading the SCSI mid layer.  This meant
that the cciss driver, being a block driver, could not engage the SCSI mid
layer at init time without panicking, and relied on being poked by a
userland program after the system was up (and the SCSI mid layer was
therefore present) to engage the SCSI mid layer.

This is no longer the case, and cciss can safely rely on the SCSI mid
layer being present at init time and engage the SCSI mid layer straight
away.  This means that users will see their tape drives and medium
changers at driver load time without need for a script in /etc/rc.d that
does this:

for x in /proc/driver/cciss/cciss*
do
	echo "engage scsi" > $x
done

However, if no tape drives or medium changers are detected, the SCSI mid
layer will not be engaged.  If a tape drive or medium change is later
hot-added to the system it will then be necessary to use the above script
or similar for the device(s) to be acceesible.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:49 +01:00
Dmitry Monakhov 7035b5df3c loop: cleanup set_status interface
1) Anyone who has read access to loopdev has permission to call set_status
   and may change important parameters such as lo_offset, lo_sizelimit and
   so on, which contradicts to read access pattern and definitely equals
   to write access pattern.
2) Add lo_offset over i_size check to prevent blkdev_size overflow.
   ##Testcase_bagin
   #dd if=/dev/zero of=./file bs=1k count=1
   #losetup /dev/loop0 ./file
   /* userspace_application */
   struct loop_info64 loinf;
   fd = open("/dev/loop0", O_RDONLY);
   ioctl(fd, LOOP_GET_STATUS64, &loinf);
   /* Set offset to any value which is bigger than i_size, and sizelimit
    * to nonzero value*/
   loinf.lo_offset = 4096*1024;
   loinf.lo_sizelimit = 1024;
   ioctl(fd, LOOP_SET_STATUS64, &loinf);
   /* After this loop device will have size similar to 0x7fffffffffxxxx */
   #blockdev --getsz /dev/loop0
   ##OUTPUT: 36028797018955968
   ##Testcase_end

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:49 +01:00
Dmitry Monakhov 3bb9068278 loop: prevent information leak after failed read
If read was not fully successful we have to fail whole bio to prevent
information leak of old pages

##Testcase_begin
dd if=/dev/zero of=./file bs=1M count=1
losetup /dev/loop0 ./file -o 4096
truncate -s 0 ./file
# OOps loop offset is now beyond i_size, so read will silently fail.
# So bio's pages would not be cleared, may which result in information leak.
hexdump -C /dev/loop0
##testcase_end

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-16 09:21:48 +01:00
Matthew Garrett 1937335856 The Windows driver .inf disables ASPM on all cciss devices. Do the same.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: iss_storagedev@hp.com
Acked-by: Mike Miller <mike.miller@hp.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-11 22:05:54 +01:00
Linus Torvalds 32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds 06d381484f Merge branch 'stable/vmalloc-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/vmalloc-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  net: xen-netback: use API provided by xenbus module to map rings
  block: xen-blkback: use API provided by xenbus module to map rings
  xen: use generic functions instead of xen_{alloc, free}_vm_area()
2011-11-06 18:31:36 -08:00
Jens Axboe a71f483d79 mtip32xx: update to new ->make_request() API
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:36:21 +01:00
Jens Axboe 0e838c624e mtip32xx: add module.h include to avoid conflict with moduleh tree
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:35:10 +01:00
Jens Axboe 3ff147d3a8 mtip32xx: mark a few more items static
Missed two items: mtip_major, and mtip_pci_driver.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:35:10 +01:00
Jens Axboe 6316668fbc mtip32xx: ensure that all local functions are static
Kill the declarations in the header file and mark them as static.
Reshuffle a few functions to ensure that everything is properly
declared before being used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:35:10 +01:00
Jens Axboe ef0f158734 mtip32xx: cleanup compat ioctl handling
Do the conversion/copy up front instead of passing in a compat flag
to the ioctl handler and subsequently to the exec_drive_taskfile()
function.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:35:10 +01:00
Jens Axboe 16d02c040b mtip32xx: fix warnings/errors on 32-bit compiles
We need to clean up the compat ioctl handling, but this makes it
work for now at least.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-11-05 08:35:10 +01:00
Sam Bradshaw 88523a6155 block: Add driver for Micron RealSSD pcie flash cards
This adds mtip32xx, a driver supporting Microns line of
pci-express flash storage cards.

Signed-off-by: Asai Thambi S P <asamymuthupa@micron.com>
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-11-05 08:35:10 +01:00
Linus Torvalds 3d0a8d10cf Merge branch 'for-3.2/drivers' of git://git.kernel.dk/linux-block
* 'for-3.2/drivers' of git://git.kernel.dk/linux-block: (30 commits)
  virtio-blk: use ida to allocate disk index
  hpsa: add small delay when using PCI Power Management to reset for kump
  cciss: add small delay when using PCI Power Management to reset for kump
  xen/blkback: Fix two races in the handling of barrier requests.
  xen/blkback: Check for proper operation.
  xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
  xen/blkback: Report VBD_WSECT (wr_sect) properly.
  xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
  xen-blkfront: plug device number leak in xlblk_init() error path
  xen-blkfront: If no barrier or flush is supported, use invalid operation.
  xen-blkback: use kzalloc() in favor of kmalloc()+memset()
  xen-blkback: fixed indentation and comments
  xen-blkfront: fix a deadlock while handling discard response
  xen-blkfront: Handle discard requests.
  xen-blkback: Implement discard requests ('feature-discard')
  xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
  drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
  drivers/block/loop.c: emit uevent on auto release
  drivers/block/cpqarray.c: use pci_dev->revision
  loop: always allow userspace partitions and optionally support automatic scanning
  ...

Fic up trivial header file includsion conflict in drivers/block/loop.c
2011-11-04 17:22:14 -07:00
Linus Torvalds b4fdcb02f1 Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
  block: don't call blk_drain_queue() if elevator is not up
  blk-throttle: use queue_is_locked() instead of lockdep_is_held()
  blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
  blk-throttle: Free up policy node associated with deleted rule
  block: warn if tag is greater than real_max_depth.
  block: make gendisk hold a reference to its queue
  blk-flush: move the queue kick into
  blk-flush: fix invalid BUG_ON in blk_insert_flush
  block: Remove the control of complete cpu from bio.
  block: fix a typo in the blk-cgroup.h file
  block: initialize the bounce pool if high memory may be added later
  block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
  block: drop @tsk from attempt_plug_merge() and explain sync rules
  block: make get_request[_wait]() fail if queue is dead
  block: reorganize throtl_get_tg() and blk_throtl_bio()
  block: reorganize queue draining
  block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
  block: pass around REQ_* flags instead of broken down booleans during request alloc/free
  block: move blk_throtl prototypes to block/blk.h
  block: fix genhd refcounting in blkio_policy_parse_and_set()
  ...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
 - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
 - drivers/staging/zram/zram_drv.c
2011-11-04 17:06:58 -07:00
Matthew Wilcox f1938f6e1e NVMe: Implement doorbell stride capability
The doorbell stride allows devices to spread out their doorbells instead
of packing them tightly.  This feature was added as part of ECN 003.

This patch also enables support for more than 512 queues :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:05 -04:00
Matthew Wilcox ce38c14957 NVMe: Version 0.7
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:05 -04:00
Matthew Wilcox 2b2c189687 NVMe: Don't probe namespace 0
ECN 001 documented that namespace 0 is not valid.  Sending an Identify
with CNS of 0 and Namespace of 0 is an undefined command.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:04 -04:00
Nisheeth Bhat 0d1bc91258 Fix calculation of number of pages in a PRP List
The existing calculation underestimated the number of pages required
as it did not take into account the pointer at the end of each page.
The replacement calculation may overestimate the number of pages required
if the last page in the PRP List is entirely full.  By using ->npages
as a counter as we fill in the pages, we ensure that we don't try to
free a page that was never allocated.

Signed-off-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:04 -04:00
Matthew Wilcox bc5fc7e4b2 NVMe: Create nvme_identify and nvme_get_features functions
Instead of open-coding calls to nvme_submit_admin_cmd, these
small wrappers are simpler to use (the patch removes 14 lines from
nvme_dev_add() for example).

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:04 -04:00
Matthew Wilcox 684f5c2025 NVMe: Fix memory leak in nvme_dev_add()
The driver was allocating 8k of memory, then freeing 4k of it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:04 -04:00
Nisheeth Bhat d1a490e026 NVMe: Fix calls to dma_unmap_sg
dma_unmap_sg() must be called with the same 'nents' passed to
dma_map_sg(), not the number returned from dma_map_sg().

Signed-off-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:04 -04:00
Matthew Wilcox d0ba1e497b NVMe: Correct sg list setup in nvme_map_user_pages
Our SG list was constructed to always fill the entire first page, even
if that was more than the length of the I/O.  This is probably harmless,
but some IOMMUs might do something bad.

Correcting the first call to sg_set_page() made it look a lot closer to
the sg_set_page() in the loop, so fold the first call to sg_set_page()
into the loop.

Reported-by: Nisheeth Bhat <nisheeth.bhat@intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2011-11-04 15:53:04 -04:00
Matthew Wilcox 6413214c5d Fix bug in NVME_IOCTL_SUBMIT_IO
Missing 'break' in the switch statement meant that we'd fall through
to the 'return -EINVAL' case.
2011-11-04 15:53:04 -04:00
Matthew Wilcox 6bbf1acdde NVMe: Rework ioctls
Remove the special-purpose IDENTIFY, GET_RANGE_TYPE, DOWNLOAD_FIRMWARE
and ACTIVATE_FIRMWARE commands.  Replace them with a generic ADMIN_CMD
ioctl that can submit any admin command.

Add a new ID ioctl that returns the namespace ID of the queried device.
It corresponds to the SCSI Idlun ioctl.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox eac623ba7a NVMe: Add the nvme thread to the wait queue before waking it up
If the I/O was not completed by a single NVMe command, we add the
bio to the congestion list and wake up the kthread to resubmit it.
But the kthread calls remove_wait_queue() unconditionally, which
will oops if it's not on the wait queue.  So add the kthread to
the wait queue before waking it up.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox 6f0f54499f NVMe: Return real error from nvme_create_queue
nvme_setup_io_queues() was assuming that a NULL return from
nvme_create_queue() was an out-of-memory error.  That's not necessarily
true; the adapter might return -EIO, for example.  Change the calling
convention to return an ERR_PTR on failure instead of NULL.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox be5e094840 NVMe: Version 0.6
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox 184d2944cb NVMe: Add a few calling convention notes
For the benefit of reviewers, add comments to a few functions describing
their calling context

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox b77954cbdd NVMe: Handle failures from memory allocations in nvme_setup_prps
If any of the memory allocations in nvme_setup_prps fail, handle it by
modifying the passed-in data length to reflect the number of bytes we are
actually able to send.  Also allow the caller to specify the GFP flags
they need; for user-initiated commands, we can use GFP_KERNEL allocations.

The various callers are updated to handle this possibility; the main
I/O path is already prepared for this possibility (as it may happen
due to nvme_map_bio being unable to map all the segments of the I/O).
The other callers return -ENOMEM instead of doing partial I/Os.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox 5aff9382dd NVMe: Use an IDA to allocate minor numbers
The current approach of using the namespace ID as the minor number
doesn't work when there are multiple adapters in the machine.  Rather
than statically partitioning the number of namespaces between adapters,
dynamically allocate minor numbers to namespaces as they are detected.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:03 -04:00
Matthew Wilcox fd63e9ceee NVMe: Add include of delay.h for msleep
Previously it was being implicitly included through some other header file

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox 8de055350f NVMe: Add support for timing out I/Os
In the kthread, walk the list of outstanding I/Os and check they've not
hit the timeout.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox 21075bdee0 NVMe: Rename cancel_cmdid_data to cancel_cmdid
The trailing '_data' on the end was annoying and inconsistent.  Also, make
it actually return the data since this is needed for timing out commands.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox 09a58f5364 NVMe: Fix bug in error handling
When an I/O completed with an error, we would call bio_endio twice
(once with -EIO and once with 0).  Found by inspection.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox 22605f9681 NVMe: Time out initialisation after a few seconds
THe device reports (in its capability register) how long it will take
to initialise.  If that time elapses before the ready bit becomes set,
conclude the device is broken and refuse to initialise it.  Log a nice
error message so the user knows why we did nothing.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox aba2080f3f NVMe: Fix warning in free_irq
We need to clear the affinity mask before calling free_irq()

Reported-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:02 -04:00
Matthew Wilcox 7f53f9d242 NVMe: Correct the Controller Configuration settings
The arbitration field was extended by one bit, shifting the shutdown
notification bits by one.  Also, the SQ/CQ entry size was made
configurable for future extensions.

Reported-by: Paul Luse <paul.e.luse@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:01 -04:00
Matthew Wilcox 8ef700678f NVMe: Version 0.5
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:01 -04:00
Matthew Wilcox 6c7d49455c NVMe: Change the definition of nvme_user_io
The read and write commands don't define a 'result', so there's no need
to copy it back to userspace.

Remove the ability of the ioctl to submit commands to a different
namespace; it's just asking for trouble, and the use case I have in mind
will be addressed througha  different ioctl in the future.  That removes
the need for both the block_shift and nsid arguments.

Check that the opcode is one of 'read' or 'write'.  Future opcodes may
be added in the future, but we will need a different structure definition
for them.

The nblocks field is redefined to be 0-based.  This allows the user to
request the full 65536 blocks.

Don't byteswap the reftag, apptag and appmask.  Martin Petersen tells
me these are calculated in big-endian and are transmitted to the device
in big-endian.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:01 -04:00
Matthew Wilcox 4948168280 NVMe: Add compat_ioctl
Make ioctls work for 32-bit applications on 64-bit kernels.  The structures
are defined to be the same for both 32- and 64-bit applications, so
we can use the same handler for both.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:01 -04:00
Matthew Wilcox 9ecdc94621 NVMe: Simplify queue lookup
Fill in all the num_possible_cpus() entries with duplicate pointers.
This reduces the complexity of the frequently-called get_nvmeq(), as
well as avoiding a bug in it when there are fewer queues than CPUs.

Reported-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:01 -04:00
Matthew Wilcox 3cb967c039 NVMe: Remove the kthread from the wait queue
Once there are no more bios on the congestion list, we can stop waking
up the nvme kthread every time a completion happens.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox 7523d834dd NVMe: Fix off-by-one when filling in PRP lists
If the last element in the PRP list fits on the end of the page, there's
no need to allocate an extra page to put that single element in.  It can
fit on the end of the page.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox ac88c36a38 NVMe: Fix interpretation of 'Number of Namespaces' field
The spec says this is a 0s based value.  We don't need to handle the
maximal value because it's reserved to mean "every namespace".

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox 19e899b2f9 NVMe: Remove outdated comments
The head can never overrun the tail since we won't allocate enough command
IDs to let that happen.  The status codes are in sync with the spec.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox fa92282149 NVMe: Fix comment formatting
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox 714a7a2288 NVMe: Convert comments to kernel-doc notation
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:53:00 -04:00
Matthew Wilcox b57ab0fada NVMe: Version 0.4
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox e6d15f79f9 NVMe: Reduce maximum queue depth by 1
The spec says we're not allowed to completely fill the submission queue.
Solve this by reducing the number of allocatable cmdids by 1.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox d8ee9d69f2 NVMe: Fix discontiguous accesses
When we submit subsequent portions of the I/O, we need to access the
updated block, not start reading again from the original position.
This was showing up as miscompares in the XFS randholes testcase.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox 1ad2f8932a NVMe: Handle bios that contain non-virtually contiguous addresses
NVMe scatterlists must be virtually contiguous, like almost all I/Os.
However, when the filesystem lays out files with a hole, it can be that
adjacent LBAs map to non-adjacent virtual addresses.  Handle this by
submitting one NVMe command at a time for each virtually discontiguous
range.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox 00df5cb4eb NVMe: Implement Flush
Linux implements Flush as a bit in the bio.  That means there may also be
data associated with the flush; if so the flush should be sent before the
data.  To avoid completing the bio twice, I add CMD_CTX_FLUSH to indicate
the completion routine should do nothing.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox c42705592b NVMe: Mark CMD_CTX_CANCELLED as being unlikely
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:59 -04:00
Matthew Wilcox 7547881d09 NVMe: Correct SQ doorbell semantics
The value written to the doorbell needs to be the first free index in
the queue, not the most recently used index in the queue.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox 740216fc59 NVMe: Let the kthread take care of devices earlier
If interrupts are misconfigured, the kthread will be needed to process
admin queue completions.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox b348b7d543 NVMe: Rename nr_queues to nr_io_queues
I got confused about whether this included the admin queue or not, and
had to resort to reading the spec.  It doesn't include the admin queue,
so make that clear in the name.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox ca1615424c NVMe: Remove setting of 'flags' in rw command
This was the data transfer bit until spec rev 0.92

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox ad8a5df97c NVMe: Release 0.3
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox 1fa6aeadf1 NVMe: Add a kthread to handle the congestion list
Instead of trying to resubmit I/Os in the I/O completion path (in
interrupt context), wake up a kthread which will resubmit I/O from
user context.  This allows mke2fs to run to completion.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox eeee322647 NVMe: Handle failures differently in nvme_submit_bio_queue()
Return -EBUSY if the queue is full or -ENOMEM if we failed to allocate
memory (or map a scatterlist).  Also use GFP_ATOMIC to allocate the
nvme_bio and move the locking to the callers of nvme_submit_bio_queue().

In nvme_make_request(), don't permit an I/O to jump the queue -- if the
congestion list already has an entry, just add to the tail, rather than
trying to submit.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:58 -04:00
Matthew Wilcox 768308400f NVMe: Handle physical merging of bvec entries
In order to not overrun the sg array, we have to merge physically
contiguous pages into a single sg entry.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:57 -04:00
Matthew Wilcox 1974b1ae88 NVMe: Check for DMA mapping failure
If dma_map_sg returns 0 (failure), we need to fail the I/O.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:57 -04:00
Matthew Wilcox d567760c40 NVMe: Pass the nvme_dev to nvme_free_prps and nvme_setup_prps
We were passing the nvme_queue to access the q_dmadev for the
dma_alloc_coherent calls, but since we moved to the dma pool API,
we really only need the nvme_dev.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:57 -04:00
Matthew Wilcox 99802a7aee NVMe: Optimise memory usage for I/Os between 4k and 128k
Add a second memory pool for smaller I/Os.  We can pack 16 of these on a
single page instead of using an entire page for each one.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:57 -04:00
Matthew Wilcox 091b609258 NVMe: Switch to use DMA Pool API
Calling dma_free_coherent from interrupt context causes warnings.
Using the DMA pools delays freeing until pool destruction, so avoids
the problem.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:57 -04:00
Matthew Wilcox d534df3c73 NVMe: Rename nvme_req_info to nvme_bio
There are too many things called 'info' in this driver.  This data
structure is auxiliary information for a struct bio, so call it nvme_bio,
or nbio when used as a variable.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Shane Michael Matthews e025344c56 NVMe: Initial PRP List support
Add a pointer to the nvme_req_info to hold a new data structure
(nvme_prps) which contains a list of the pages allocated to this
particular request for holding PRP list entries.  nvme_setup_prps()
now returns this pointer.

To allocate and free the memory used for PRP lists, we need a struct
device, so we need to pass the nvme_queue pointer to many functions
which didn't use to need it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox 51882d00f0 NVMe: Advance the sg pointer when filling in an sg list
For multipage BIOs, we were always using sg[0] instead of advancing
through the list.  Oops :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox d2d8703481 NVMe: Renumber the special context values
If POISON_POINTER_DELTA isn't defined, ensure they're in page 0 which
should never be mapped.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox 9294bbed78 NVMe: Handle the congestion list a little better
In the bio completion handler, check for bios on the congestion list
for this NVM queue.  Also, lock the congestion list in the make_request
function as the queue may end up being shared between multiple CPUs.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox e85248e516 NVMe: Record the timeout for each command
In addition to recording the completion data for each command, record
the anticipated completion time.  Choose a timeout of 5 seconds for
normal I/Os and 60 seconds for admin I/Os.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox ec6ce618d6 NVMe: Need to lock queue during interrupt handling
If we're sharing a queue between multiple CPUs and we cancel a sync I/O,
we must have the queue locked to avoid corrupting the stack of the thread
that submitted the I/O.  It turns out this is the same locking that's needed
for the threaded irq handler, so share that code.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:56 -04:00
Matthew Wilcox 48e3d39816 NVMe: Detect command IDs completing that are out of range
If the adapter completes a command ID that is outside the bounds of
the array, return CMD_CTX_INVALID instead of random data, and print a
message in the sync_completion handler (which is rapidly becoming the
misc completion handler :-)

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox b36235df01 NVMe: Detect commands that are completed twice
Set the context value to CMD_CTX_COMPLETED, and print a message in the
sync_completion handler if we see it.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox be7b62754e NVMe: Use a symbolic name to represent cancelled commands instead of 0
I have plans for other special values in sync_completion.  Plus, this
is more self-documenting, and lets us detect bogus usages.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox 58ffacb545 NVMe: Add a module parameter to use a threaded interrupt
We're currently calling bio_endio from hard interrupt context.  This is
not a good idea for preemptible kernels as it will cause longer latencies.
Using a threaded interrupt will run the entire queue processing mechanism
(including bio_endio) in a thread, which can be preempted.  Unfortuantely,
it also adds about 7us of latency to the single-I/O case, so make it a
module parameter for the moment.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox b1ad37efca NVMe: Call put_nvmeq() before calling nvme_submit_sync_cmd()
We can't have preemption disabled when we call schedule().  Accept the
possibility that we'll get preempted, and it'll cost us some cacheline
bounces.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox 3c0cf138d7 NVMe: Allow fatal signals to interrupt I/O
If the user sends a fatal signal, sleeping in the TASK_KILLABLE state
permits the task to be aborted.  The only wrinkle is making sure that
if/when the command completes later that it doesn't upset anything.
Handle this by setting the data pointer to 0, and checking the value
isn't NULL in the sync completion path.  Eventually, bios can be cancelled
through this path too.  Note that the cmdid isn't freed to prevent reuse.

We should also abort the command in the future, but this is a good start.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:55 -04:00
Matthew Wilcox db5d0c198d NVMe: Release 0.2
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:54 -04:00
Matthew Wilcox 6ee44cdced NVMe: Add download / activate firmware ioctls
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:54 -04:00
Matthew Wilcox 388f037f4e NVMe: Move sysfs entries to the right place
Because I wasn't setting driverfs_dev, the devices were showing up under
/sys/devices/virtual/block.  Now they appear underneath the PCI device
which they belong to.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:54 -04:00
Shane Michael Matthews 5911f20039 NVMe: Disable the device before we write the admin queues
In case the card has been left in a partially-configured state,
write 0 to the Enable bit.

Signed-off-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:54 -04:00
Matthew Wilcox 574e8b95bc NVMe: Request I/O regions
Calling pci_request_selected_regions() reserves these regions for our use.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:54 -04:00
Matthew Wilcox 2930353f9f NVMe: Allow queues to be allocated above 4GB
Need to call dma_set_coherent_mask() to allow queues to be allocated
above 4GB.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox f64d3365a3 NVMe: Enable device DMA
Need to call pci_set_master() to enable device DMA

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Shane Michael Matthews 0ee5a7d7cb NVMe: Enable and disable the PCI device
Call pci_enable_device_mem() at initialisation and pci_disable_device
at exit.

Signed-off-by: Shane Michael Matthews <shane.matthews@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox 3f85d50b60 NVMe: Check returns from nvme_alloc_queue()
It can return NULL, so handle that.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox 8e9f0e7115 NVMe: Remove 'node' from nvme_dev
We don't keep a list of nvme_dev any more

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox 51814232ec NVMe: Read the model, serial & firmware rev from the controller
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox a53295b699 NVMe: Add NVME_IOCTL_SUBMIT_IO
Allow userspace to submit synchronous I/O like the SCSI sg interface does.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:53 -04:00
Matthew Wilcox 7fc3cdabba NVMe: Create nvme_map_user_pages() and nvme_unmap_user_pages()
These are generalisations of the code that was in
nvme_submit_user_admin_command().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:52 -04:00
Matthew Wilcox bd38c5557c NVMe: Change NVME_IOCTL_GET_RANGE_TYPE to return all the ranges
Factor out most of nvme_identify() into a new nvme_submit_user_admin_command()
function.  Change nvme_get_range_type() to call it and change nvme_ioctl to
realise that it's getting back all 64 ranges.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:52 -04:00
Matthew Wilcox b8deb62cf2 NVMe: Zero the command before we send it
Make sure there's no left-over bits set from previous commands that used
this slot.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:52 -04:00
Matthew Wilcox ff22b54fda NVMe: Add nvme_setup_prps()
Generalise the code from nvme_identify() that sets PRP1 & PRP2 so that
it's usable for commands sent by nvme_submit_bio_queue().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:52 -04:00
Matthew Wilcox 36c14ed9ca NVMe: Use PRP2 for the nvme_identify ioctl
DMA the result straight to userspace instead of bounce-buffering in the
kernel.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:52 -04:00
Matthew Wilcox 53c9577e9c NVMe: Fix admin IRQ claim on real hardware
The admin IRQ is supposed to use the pin-based (or single message MSI)
interrupt.  Accomplish this by filling in entry[0]'s vector with the
INTx irq number.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Matthew Wilcox 821234603b NVMe: Rename 'cycle' to 'phase'
It's called the phase bit in the current draft

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Matthew Wilcox 1b23484bd0 NVMe: Implement per-CPU queues
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Matthew Wilcox b3b06812e1 NVMe: Reduce set_queue_count arguments by one
sq_count and cq_count are always the same, so just call it 'count'.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Matthew Wilcox 3001082cac NVMe: Factor out queue_request_irq()
Two callers with an almost identical long string of arguments, and
introducing a third soon.  Time to factor out the commonalities.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Matthew Wilcox b60503ba43 NVMe: New driver
This driver is for devices that follow the NVM Express standard

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
2011-11-04 15:52:51 -04:00
Michael S. Tsirkin 5087a50e66 virtio-blk: use ida to allocate disk index
Based on a patch by Mark Wu <dwu@redhat.com>

Current index allocation in virtio-blk is based on a monotonically
increasing variable "index". This means we'll run out of numbers
after a while.  It also could cause confusion about the disk
name in the case of hot-plugging disks.
Change virtio-blk to use ida to allocate index, instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-11-02 11:41:02 +10:30
Paul Gortmaker 0c8d44f239 block: Fix files that are modules and hence need module.h
We want to remove the implicit everywhere presence of module.h
so fix up the people relying on that implicit presence in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:13 -04:00
Paul Gortmaker d5decd3b95 block: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macros
These files were getting <linux/module.h> via an implicit include
path, but we want to crush those out of existence since they cost
time during compiles of processing thousands of lines of headers
for no reason.  Give them the lightweight header that just contains
the EXPORT_SYMBOL infrastructure.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:31:12 -04:00
Michael S. Tsirkin a0eda62552 virtio-blk: use ida to allocate disk index
Based on a patch by Mark Wu <dwu@redhat.com>

Current index allocation in virtio-blk is based on a monotonically
increasing variable "index". This means we'll run out of numbers
after a while.  It also could cause confusion about the disk
name in the case of hot-plugging disks.
Change virtio-blk to use ida to allocate index, instead.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-31 08:05:36 +01:00
Linus Torvalds 97d2eb13a0 Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-client
* 'for-linus' of git://ceph.newdream.net/git/ceph-client:
  libceph: fix double-free of page vector
  ceph: fix 32-bit ino numbers
  libceph: force resend of osd requests if we skip an osdmap
  ceph: use kernel DNS resolver
  ceph: fix ceph_monc_init memory leak
  ceph: let the set_layout ioctl set single traits
  Revert "ceph: don't truncate dirty pages in invalidate work thread"
  ceph: replace leading spaces with tabs
  libceph: warn on msg allocation failures
  libceph: don't complain on msgpool alloc failures
  libceph: always preallocate mon connection
  libceph: create messenger with client
  ceph: document ioctls
  ceph: implement (optional) max read size
  ceph: rename rsize -> rasize
  ceph: make readpages fully async
2011-10-28 16:42:18 -07:00
David Vrabel 2d073846b8 block: xen-blkback: use API provided by xenbus module to map rings
The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree().  Use these to map the ring pages granted by
the frontend.

Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-26 10:02:35 -04:00
Sage Weil 6ab00d465a libceph: create messenger with client
This simplifies the init/shutdown paths, and makes client->msgr available
during the rest of the setup process.

Signed-off-by: Sage Weil <sage@newdream.net>
2011-10-25 16:10:15 -07:00
Linus Torvalds 59e5253417 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (59 commits)
  MAINTAINERS: linux-m32r is moderated for non-subscribers
  linux@lists.openrisc.net is moderated for non-subscribers
  Drop default from "DM365 codec select" choice
  parisc: Kconfig: cleanup Kernel page size default
  Kconfig: remove redundant CONFIG_ prefix on two symbols
  cris: remove arch/cris/arch-v32/lib/nand_init.S
  microblaze: add missing CONFIG_ prefixes
  h8300: drop puzzling Kconfig dependencies
  MAINTAINERS: microblaze-uclinux@itee.uq.edu.au is moderated for non-subscribers
  tty: drop superfluous dependency in Kconfig
  ARM: mxc: fix Kconfig typo 'i.MX51'
  Fix file references in Kconfig files
  aic7xxx: fix Kconfig references to READMEs
  Fix file references in drivers/ide/
  thinkpad_acpi: Fix printk typo 'bluestooth'
  bcmring: drop commented out line in Kconfig
  btmrvl_sdio: fix typo 'btmrvl_sdio_sd6888'
  doc: raw1394: Trivial typo fix
  CIFS: Don't free volume_info->UNC until we are entirely done with it.
  treewide: Correct spelling of successfully in comments
  ...
2011-10-25 12:11:02 +02:00
Linus Torvalds 31018acd4c Merge branches 'stable/bug.fixes-3.2' and 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug.fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/p2m/debugfs: Make type_name more obvious.
  xen/p2m/debugfs: Fix potential pointer exception.
  xen/enlighten: Fix compile warnings and set cx to known value.
  xen/xenbus: Remove the unnecessary check.
  xen/irq: If we fail during msi_capability_init return proper error code.
  xen/events: Don't check the info for NULL as it is already done.
  xen/events: BUG() when we can't allocate our event->irq array.

* 'stable/mmu.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen: Fix selfballooning and ensure it doesn't go too far
  xen/gntdev: Fix sleep-inside-spinlock
  xen: modify kernel mappings corresponding to granted pages
  xen: add an "highmem" parameter to alloc_xenballooned_pages
  xen/p2m: Use SetPagePrivate and its friends for M2P overrides.
  xen/p2m: Make debug/xen/mmu/p2m visible again.
  Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."
2011-10-25 09:17:47 +02:00
Jens Axboe 83157223de Merge branch 'for-linus' into for-3.2/core 2011-10-24 16:24:38 +02:00
Mike Miller ab5dbebe33 cciss: add small delay when using PCI Power Management to reset for kump
The P600 requires a small delay when changing states. Otherwise we may think
the board did not reset and we bail. This for kdump only and is particular
to the P600.

Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-20 22:21:52 +02:00
Jens Axboe b8d8bdfe31 Merge branch 'stable/for-jens-3.2' of git://oss.oracle.com/git/kwilk/xen into for-3.2/drivers 2011-10-20 15:10:59 +02:00
Jens Axboe 5c04b426f2 Merge branch 'v3.1-rc10' into for-3.2/core
Conflicts:
	block/blk-core.c
	include/linux/blkdev.h

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19 14:30:42 +02:00
Konrad Rzeszutek Wilk 6927d92091 xen/blkback: Fix two races in the handling of barrier requests.
There are two windows of opportunity to cause a race when
processing a barrier request. This patch fixes this.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-17 14:28:57 -04:00
Christoph Hellwig 456be1484f loop: remove the incorrect write_begin/write_end shortcut
Currently the loop device tries to call directly into write_begin/write_end
instead of going through ->write if it can.  This is a fairly nasty shortcut
as write_begin and write_end are only callbacks for the generic write code
and expect to be called with filesystem specific locks held.

This code currently causes various issues for clustered filesystems as it
doesn't take the required cluster locks, and it also causes issues for XFS
as it doesn't properly lock against the swapext ioctl as called by the
defragmentation tools.  This in case causes data corruption if
defragmentation hits a busy loop device in the wrong time window, as
reported by RH QA.

The reason why we have this shortcut is that it saves a data copy when
doing a transformation on the loop device, which is the technical term
for using cryptoloop (or an XOR transformation).  Given that cryptoloop
has been deprecated in favour of dm-crypt my opinion is that we should
simply drop this shortcut instead of finding complicated ways to to
introduce a formal interface for this shortcut.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-17 12:57:20 +02:00
Konrad Rzeszutek Wilk dda1852802 xen/blkback: Check for proper operation.
The patch titled: "xen/blkback: Fix the inhibition to map pages
when discarding sector ranges." had the right idea except that
it used the wrong comparison operator. It had == instead of !=.

This fixes the bug where all (except discard) operations would
have been ignored.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-14 12:29:55 -04:00
Lars Ellenberg 3cb7a2a90f drbd: get rid of drbd_bcast_ee, it is of no use anymore
This function was used to broadcast the (leading part of the)
bio payload in case we see a data integrity error.  It could be received
from userland with the drbdsetup events subcommand,
to have a peek into the payload that caused the checksum mismatch,
and guess from there what may have caused the mismatch,
mainly to guess wether it was modification of in-flight data,
or data corruption by broken hardware or software bugs.

Meanwhile we support bios that are larger than the maximum payload a
netlink datagram can carry.
And we have means to reliably detect modification of in-flight data by
calculating, and comparing, the checksum before and after sendmsg.
There is no need to carry this around anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:08 +02:00
Lars Ellenberg 569083c08d drbd: fix drbd_delete_device: remove vnr from volumes; idr_remove(); synchronize_rcu(); before cleanup
Still missing: rcu_readlock() on the various call sites that
access/iterate over those idrs.

We don't need a specific write lock, as we only modify from
configuration context, which is already strictly serialized.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:07 +02:00
Lars Ellenberg da4a75d2ef drbd: introduce a bio_set to allocate housekeeping bios from
Don't rely on availability of bios from the global fs_bio_set,
we should use our own bio_set for meta data IO.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:06 +02:00
Lars Ellenberg 9db4e77f8c drbd: use the newly introduced page pool for bitmap IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:05 +02:00
Lars Ellenberg 35abf59424 drbd: add page pool to be used for meta data IO
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:04 +02:00
Lars Ellenberg 3c13b680ce drbd: only wakeup if something changed in update_peer_seq
This commit got it wrong:
    drbd: Make the peer_seq updating code more obvious

    Make it more clear that update_peer_seq() is supposed to wake up the
    seq_wait queue whenever the sequence number changes.

We don't need to wake up everytime we receive a sequence number
that is _different_ from our currently stored "newest" sequence number,
but only if we receive a sequence number _newer_ than what we already
have, when we actually change mdev->peer_seq.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:04 +02:00
Lars Ellenberg 2c4a48d097 drbd: remove unused define
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:02 +02:00
Philipp Reisner 81a5d60ecf drbd: Replaced the minor_table array by an idr
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:01 +02:00
Philipp Reisner 774b305518 drbd: Implemented new commands to create/delete connections/minors
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:00 +02:00
Philipp Reisner 80883197da drbd: Converted drbd_nl_(net_conf|disconnect)() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:48:00 +02:00
Philipp Reisner 1aba4d7fcf drbd: Preparing the connector interface to operator on connections
Up to now it only operated on minor numbers. Now it can work also
on named connections.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:59 +02:00
Philipp Reisner 2f5cdd0b2c drbd: Converted the transfer log from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:58 +02:00
Philipp Reisner 49559d87fd drbd: Improved the dec_*() macros
Now those can be used with a struct drbd_conf * that has an other
name than 'mdev'.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:57 +02:00
Philipp Reisner 3f9cbe937e drbd: Removed the mdev parameter from the ..to_tags() and ...from_tags() functions
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:56 +02:00
Philipp Reisner 0e29d163f7 drbd: Reworked the unconfiguring and thread stopping code
* Moved CONFIG_PENDING and DEVICE_DYING from mdev to tconn.
* Renamed drbd_reconfig_start() and drbd_reconfig_done() to
  conn_reconfig_start() and conn_reconfig_done().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:55 +02:00
Andreas Gruenbacher c66342d949 drbd: Remove left-over function prototypes
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:55 +02:00
Andreas Gruenbacher 7201b972de drbd: Replace get_asender_cmd() with its implementation
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:54 +02:00
Andreas Gruenbacher 6e849ce88c drbd: Get rid of P_MAX_CMD
Instead of artificially enlarging the command decoding arrays to
P_MAX_CMD entries, check if an index is within the valid range using the
ARRAY_SIZE() macro.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:53 +02:00
Andreas Gruenbacher 1b3bb47d52 drbd: Remove redundant check
Opening a device only succeeds on a primary node, or when explicitly
setting the allow_oos module parameter to allow opening the device
read-only on a secondary node.  There is no other way that a request can
get into drbd_make_request(), so this code cannot trigger.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:52 +02:00
Andreas Gruenbacher 7be8da0798 drbd: Improve how conflicting writes are handled
The previous algorithm for dealing with overlapping concurrent writes
was generating unnecessary warnings for scenarios which could be
legitimate, and did not always handle partially overlapping requests
correctly.  Improve it algorithm as follows:

* While local or remote write requests are in progress, conflicting new
  local write requests will be delayed (commit 82172f7).

* When a conflict between a local and remote write request is detected,
  the node with the discard flag decides how to resolve the conflict: It
  will ask its peer to discard conflicting requests which are fully
  contained in the local request and retry requests which overlap only
  partially.  This involves a protocol change.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:51 +02:00
Andreas Gruenbacher 71b1c1eb9c drbd: Use ping-timeout when waiting for missing ack packets
When the node with the discard flag resolves write conflicts in
dual-primary mode, it may determine that its peer has sent ack packets
on the metadata socket which did not arrive, yet.  Wait for the next ack
with ping-timeout instead of a hard-coded 30 seconds.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:51 +02:00
Andreas Gruenbacher 8ccf218e9f drbd: Replace atomic_add_return with atomic_inc_return
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:50 +02:00
Andreas Gruenbacher 206d358941 drbd: Concurrent write detection fix
Commit 9b1e63e changed the concurrent write detection algorithm to only insert
peer requests into write_requests tree after determining that there is no
conflict.  With this change, new conflicting local requests could be added
while the algorithm runs, but this case was not handled correctly.  Instead of
making the algorithm deal with this case, switch back to adding peer requests
to the write_requests tree immediately: this improves fairness.

When a peer request is discarded, remove that request from the write_requests

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:49 +02:00
Andreas Gruenbacher 8050e6d005 drbd: Use container_of() instead of casting
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:48 +02:00
Lars Ellenberg 9676c76097 drbd: fix a wrong likely(), updated comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:47 +02:00
Lars Ellenberg c9d963a46d drbd: silence some log messages on bitmap IO
Summary log messages meant for global bitmap IO
should not be printed for bitmap IO caused by
activity log transactions.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:47 +02:00
Lars Ellenberg 7ad651b522 drbd: new on-disk activity log transaction format
Use a new on-disk transaction format for the activity log, which allows
for multiple changes to the active set per transaction.

Using 4k transaction blocks, we can now get rid of the work-around code
to deal with devices not supporting 512 byte logical block size.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:46 +02:00
Lars Ellenberg 46a15bc3ec lru_cache: allow multiple changes per transaction
Allow multiple changes to the active set of elements in lru_cache.
The only current user of lru_cache, drbd, is driving this generalisation.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:45 +02:00
Lars Ellenberg 45dfffebd0 drbd: allow to select specific bitmap pages for writeout
We are about to allow several changes to the active set in one activity
log transaction. We have to write out the corresponding bitmap pages as
well, if changed.

Introduce drbd_bm_mark_for_writeout(), then re-use the existing bitmap
writeout path to submit all marked pages in one go.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:44 +02:00
Lars Ellenberg 4738fa1690 drbd: use clear_bit_unlock() where appropriate
Some open-coded clear_bit(); smp_mb__after_clear_bit();
should in fact have been smp_mb__before_clear_bit(); clear_bit();

Instead, use clear_bit_unlock() to annotate the intention,
and have it do the right thing.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:42 +02:00
Lars Ellenberg 61610420f7 drbd: in drbd_suspend_al, set AL_SUSPENDED before unlocking the activity log
As using an empty activity log is the whole point of the excercise,
make sure it is still empty when setting AL_SUSPENDED.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:41 +02:00
Lars Ellenberg 867f57483b drbd: fix typo in comment
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:40 +02:00
Lars Ellenberg 8c387def58 drbd: simplify condition in drbd_may_do_local_read()
fold
	if (x >= (N+1))
		return 0;
	if (x < N)
		return 0;
into
	if (x != N)
		return 0;

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:39 +02:00
Andreas Gruenbacher c670a39867 drbd: Use the IS_ALIGNED() macro in some more places
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:39 +02:00
Andreas Gruenbacher 8ca9844f10 drbd: Remove obsolete comment
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:38 +02:00
Andreas Gruenbacher d0e22a260c drbd: Iterate over all overlapping intervals in a tree
Add a macro and helper function for doing that.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:37 +02:00
Andreas Gruenbacher fcefa62e4c drbd: Rename drbd_endio_{pri,sec} -> drbd_{,peer_}request_endio
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:36 +02:00
Andreas Gruenbacher fbe29dec98 drbd: Rename drbd_submit_ee -> drbd_submit_peer_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:35 +02:00
Philipp Reisner df24aa45f4 drbd: Implemented connection wide state changes
That is used for graceful disconnect only

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:47:32 +02:00
Philipp Reisner 047cd4a682 drbd: implemented receiving of P_CONN_ST_CHG_REQ
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:05 +02:00
Philipp Reisner fc3b10a45f drbd: Implemented receiving of P_CONN_ST_CHG_REPLY
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:04 +02:00
Philipp Reisner 5aabf467e3 drbd: Global_state_lock not necessary here...
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:03 +02:00
Philipp Reisner cf29c9d8c8 drbd: Implemented conn_send_state_req()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:02 +02:00
Philipp Reisner 8410da8f0e drbd: Introduced tconn->cstate_mutex
In compatibility mode with old DRBDs, use that as the state_mutex
as well.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:01 +02:00
Philipp Reisner dad2055481 drbd: Removed drbd_state_lock() and drbd_state_unlock()
The lock they constructed is only taken when the state_mutex
was already taken. It is superficial.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:45:01 +02:00
Philipp Reisner bbeb641c3e drbd: Killed volume0; last step of multi-volume-enablement
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-10-14 16:44:58 +02:00
Konrad Rzeszutek Wilk 64391b2536 xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
The 'operation' parameters are the ones provided to the bio layer while
the req->operation are the ones passed in between the backend and
frontend. We used the wrong 'operation' value to squash the
call to map pages when processing the discard operation resulting
in an hypercall that did nothing. Lets guard against going in the
mapping function by checking for the proper operation type.

CC: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:38 -04:00
Konrad Rzeszutek Wilk 5c62cb4860 xen/blkback: Report VBD_WSECT (wr_sect) properly.
We did not increment the amount of sectors written to disk
b/c we tested for the == WRITE which is incorrect - as the
operations are more of WRITE_FLUSH, WRITE_ODIRECT. This patch
fixes it by doing a & WRITE check.

CC: stable@kernel.org
Reported-by: Andy Burns <xen.lists@burns.me.uk>
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:37 -04:00
Konrad Rzeszutek Wilk 29bde09378 xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
We emulate the barrier requests by draining the outstanding bio's
and then sending the WRITE_FLUSH command. To drain the I/Os
we use the refcnt that is used during disconnect to wait for all
the I/Os before disconnecting from the frontend. We latch on its
value and if it reaches either the threshold for disconnect or when
there are no more outstanding I/Os, then we have drained all I/Os.

Suggested-by: Christopher Hellwig <hch@infradead.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:36 -04:00
Laszlo Ersek 469738e675 xen-blkfront: plug device number leak in xlblk_init() error path
... though after a failed xenbus_register_frontend() all may be lost.

Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:35 -04:00
Konrad Rzeszutek Wilk d11e615830 xen-blkfront: If no barrier or flush is supported, use invalid operation.
Guard against issuing BLKIF_OP_WRITE_BARRIER or BLKIF_OP_FLUSH_CACHE
by checking whether we successfully negotiated with the backend.
The negotiation with the backend also sets the q->flush_flags which
fortunately for us is also used when submitting an bio to us. If
we don't support barriers or flushes it would be set to zero so
we should never end up having to deal with REQ_FLUSH | REQ_FUA.

However, other third party implementations of __make_request that
might be stacked on top of us might not be so smart, so lets fix this up.

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:35 -04:00
Jan Beulich 8e6dc6fe51 xen-blkback: use kzalloc() in favor of kmalloc()+memset()
This fixes the problem of three of those four memset()-s having
improper size arguments passed: Sizeof a pointer-typed expression
returns the size of the pointer, not that of the pointed to data.

It also reverts using kmalloc() instead of kzalloc() for the allocation
of the pending grant handles array, as that array gets fully
initialized in a subsequent loop.

Reported-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:34 -04:00
Joe Jin c555aab97d xen-blkback: fixed indentation and comments
This patch fixes belows:

1. Fix code style issue.
2. Fix incorrect functions name in comments.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:33 -04:00
Li Dongyang 69ef68cef9 xen-blkfront: fix a deadlock while handling discard response
When we get -EOPNOTSUPP response for a discard request, we will clear
the discard flag on the request queue so we won't attempt to send discard
requests to backend again, and this should be protected under rq->queue_lock.
However, when we setup the request queue, we pass blkif_io_lock to
blk_init_queue so rq->queue_lock is blkif_io_lock indeed, and this lock
is already taken when we are in blkif_interrpt, so remove the
spin_lock/spin_unlock when we clear the discard flag or we will end up
with deadlock here

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Updated description a bit and removed comment from source]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:32 -04:00
Li Dongyang ed30bf317c xen-blkfront: Handle discard requests.
If the backend advertises 'feature-discard', then interrogate
the backend for alignment and granularity. Setup the request
queue with the appropiate values and send the discard operation
as required.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Amended commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:31 -04:00
Li Dongyang b3cb0d6adc xen-blkback: Implement discard requests ('feature-discard')
..aka ATA TRIM/SCSI UNMAP command to be passed through the frontend
and used as appropiately by the backend. We also advertise
certain granulity parameters to the frontend so it can plug them in.
If the backend is a realy device - we just end up using
'blkdev_issue_discard' while for loopback devices - we just punch
a hole in the image file.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Fixed up pr_debug and commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-10-13 09:48:30 -04:00
Stefano Stabellini 0930bba674 xen: modify kernel mappings corresponding to granted pages
If we want to use granted pages for AIO, changing the mappings of a user
vma and the corresponding p2m is not enough, we also need to update the
kernel mappings accordingly.
Currently this is only needed for pages that are created for user usages
through /dev/xen/gntdev. As in, pages that have been in use by the
kernel and use the P2M will not need this special mapping.
However there are no guarantees that in the future the kernel won't
start accessing pages through the 1:1 even for internal usage.

In order to avoid the complexity of dealing with highmem, we allocated
the pages lowmem.
We issue a HYPERVISOR_grant_table_op right away in
m2p_add_override and we remove the mappings using another
HYPERVISOR_grant_table_op in m2p_remove_override.
Considering that m2p_add_override and m2p_remove_override are called
once per page we use multicalls and hypercall batching.

Use the kmap_op pointer directly as argument to do the mapping as it is
guaranteed to be present up until the unmapping is done.
Before issuing any unmapping multicalls, we need to make sure that the
mapping has already being done, because we need the kmap->handle to be
set correctly.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Removed GRANT_FRAME_BIT usage]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-09-29 10:32:58 -04:00
Philipp Reisner 56707f9e87 drbd: Code de-duplication; new function apply_mask_val()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:22 +02:00
Philipp Reisner 4308a0a390 drbd: Removed the os parameter form sanitize_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:21 +02:00
Philipp Reisner fda74117dc drbd: Extracted is_valid_conn_transition() out of is_valid_transition()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:20 +02:00
Philipp Reisner 3509502dc8 drbd: Extracted is_valid_transition() out of sanitize_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:19 +02:00
Philipp Reisner a75f34ad0c drbd: Renamed is_valid_state_transition() to is_valid_soft_transition()
And removed the unused mdev parameter, and made the order of
the state parameters: os, ns

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:18 +02:00
Philipp Reisner d50eee21c4 drbd: Extracted after_conn_state_ch() out of after_state_ch()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:17 +02:00
Philipp Reisner 2a67d8b93b drbd: Converted drbd_send_ping() and related functions from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:16 +02:00
Philipp Reisner 00d56944ff drbd: Generalized the work callbacks
No longer work callbacks must operate on a mdev. From now on they
can also operate on a tconn.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:15 +02:00
Philipp Reisner 6699b65533 drbd: Moved some initializing code into drbd_new_tconn()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:14 +02:00
Philipp Reisner 392c880192 drbd: drbd_thread has now a pointer to a tconn instead of to a mdev
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:13 +02:00
Philipp Reisner 19393e105f drbd: Converted drbd_worker() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:12 +02:00
Philipp Reisner 32862ec705 drbd: Converted drbd_asender() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:11 +02:00
Philipp Reisner 4d641dd7b0 drbd: Converted drbdd_init() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:10 +02:00
Philipp Reisner f1b3a6ec7d drbd: Consolidated the setup of the thread name into the framework
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:09 +02:00
Philipp Reisner a21e929827 drbd: Moved the mdev member into drbd_work (from drbd_request and drbd_peer_request)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:08 +02:00
Philipp Reisner 360cc74052 drbd: Converted drbd_free_sock() and drbd_disconnect() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:06 +02:00
Philipp Reisner eefc2f7de2 drbd: Converted drbdd() from mdev to tconn
The drbd_md_sync(mdev) happens in the after state change anyways...

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:05 +02:00
Philipp Reisner 808222845d drbd: Converted drbd_calc_cpu_mask() and drbd_thread_current_set_cpu() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:04 +02:00
Philipp Reisner 907599e044 drbd: Converted drbd_connect() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:03 +02:00
Philipp Reisner 062e879c8b drbd: Use and idr data structure to map volume numbers to mdev pointers
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:02 +02:00
Philipp Reisner dc8228d107 drbd: Converted drbd_send_protocol() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:01 +02:00
Philipp Reisner 13e6037dc9 drbd: Converted drbd_do_auth() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:33:00 +02:00
Philipp Reisner 611208706f drbd: Converted drbd_(get|put)_data_sock() and drbd_send_cmd2() to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:32:59 +02:00
Philipp Reisner 65d11ed6f2 drbd: Converted drbd_do_handshake() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:32:58 +02:00
Philipp Reisner 9ba7aa00ae drbd: Converted drbd_recv_header() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:32:57 +02:00
Philipp Reisner ce24385342 drbd: Converted decode_header() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:32:56 +02:00
Philipp Reisner 77351055b5 drbd: struct packet_info to hold information of decoded packets
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:32:50 +02:00
Philipp Reisner de0ff338d6 drbd: Converted drbd_recv() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:29:52 +02:00
Philipp Reisner 8a22cccc20 drbd: Converted drbd_send_handshake() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:29:50 +02:00
Philipp Reisner a25b63f1e7 drbd: Converted drbd_recv_fp() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:29:49 +02:00
Philipp Reisner dbd9eea094 drbd: Removed unused mdev argument from drbd_recv_short() and drbd_socket_okay()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:29:48 +02:00
Philipp Reisner d38e787ecc drbd: Converted drbd_send_fp() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:29:45 +02:00
Philipp Reisner bedbd2a53a drbd: Converted drbd_send() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:27:00 +02:00
Philipp Reisner 1a7ba646e9 drbd: Converted helper functions for drbd_send() to tconn
* drbd_update_congested()
* we_should_drop_the_connection()

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:59 +02:00
Philipp Reisner 0625ac190d drbd: Converted wake_asender() and request_ping() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:58 +02:00
Philipp Reisner 808e37b803 drbd: Moved SIGNAL_ASENDER to the per connection (tconn) flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:57 +02:00
Philipp Reisner e43ef195f8 drbd: Moved SEND_PING to the per connection (tconn) flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:56 +02:00
Philipp Reisner 25703f8320 drbd: Moved DISCARD_CONCURRENT to the per connection (tconn) flags
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:55 +02:00
Philipp Reisner 01a311a589 drbd: Started to separated connection flags (tconn) from block device flags (mdev)
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:54 +02:00
Philipp Reisner 7653620de3 drbd: Converted drbd_wait_for_connect() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:53 +02:00
Philipp Reisner eac3e990e4 drbd: Converted drbd_try_connect() from mdev to tconn
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:52 +02:00
Philipp Reisner 60ae496626 drbd: conn_printk() a dev_printk() alike for drbd's connections
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:50 +02:00
Philipp Reisner b53339fce2 drbd: Moving state related macros to drbd_state.h
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:49 +02:00
Philipp Reisner 8ea62f5464 drbd: Revert "Make sure we dont send state if a cluster wide state change is in progress"
This reverts commit 6e9fdc92b77915d5c7ab8fea751f48378f8b0080.

1) This did not fixed the issue
2) Long sleeping work items can cause IO requests to take as long as
   the longest work item

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:48 +02:00
Philipp Reisner e64a329459 drbd: Do no sleep long in drbd_start_resync
Work items that sleep too long can cause requests to take as
long as the longest sleeping work item.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:47 +02:00
Philipp Reisner 1f04af33fe drbd: Moved code
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:46 +02:00
Philipp Reisner bc31fe3352 drbd: Eliminated the user of drbd_task_to_thread()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:45 +02:00
Philipp Reisner bed879ae90 drbd: Moved the thread name into the data structure
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:44 +02:00
Philipp Reisner b890733953 drbd: Moved the state functions into its own source file
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:43 +02:00
Andreas Gruenbacher db830c464b drbd: Local variable renames: e -> peer_req
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:42 +02:00
Andreas Gruenbacher 6c852beca1 drbd: Update some comments
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:41 +02:00
Andreas Gruenbacher 18b75d756b drbd: Clean up some left-overs
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:40 +02:00
Andreas Gruenbacher f6ffca9f42 drbd: Rename struct drbd_epoch_entry to struct drbd_peer_request
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:39 +02:00
Andreas Gruenbacher c6f7df42c9 drbd: Remove unused variable in struct drbd_conf
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:37 +02:00
Andreas Gruenbacher 70b1987663 drbd: Improve the drbd_find_overlap() documentation
Describe how to reach any further overlapping intervals from the first
overlap found.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:36 +02:00
Andreas Gruenbacher 43ae077d0a drbd: Make the peer_seq updating code more obvious
Make it more clear that update_peer_seq() is supposed to wake up the
seq_wait queue whenever the sequence number changes.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:35 +02:00
Andreas Gruenbacher 6024fece73 drbd: Defer new writes when detecting conflicting writes
Before submitting a new local write request, wait for any conflicting
local or remote requests to complete.

We could assume that the new request occurred first and that the
conflicting requests overwrote it (and therefore discard the new
reques), but we know for sure that the new request occurred after the
conflicting requests and so this behavior would we weird.  We would also
end up with the wrong result if the new request is not fully contained
within the conflicting requests.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:34 +02:00
Andreas Gruenbacher ddd8877d31 drbd: Remove unnecessary reference counting left-over
Nothing in this function accesses mdev->tconn->net_conf, so there is no
need for get_net_conf() / put_net_conf() anymore.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:33 +02:00
Andreas Gruenbacher 5e4722645a drbd: _req_conflicts(): Get rid of the epoch_entries tree
Instead of keeping a separate tree for local and remote write requests
for finding requests and for conflict detection, use the same tree for
both purposes.  Introduce a flag to allow distinguishing the two
possible types of entries in this tree.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:32 +02:00
Andreas Gruenbacher 53840641bb drbd: Allow to wait for the completion of an epoch entry as well
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:31 +02:00
Andreas Gruenbacher 3e05146f0a drbd: Remove redundant check from drbd_contains_interval()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:30 +02:00
Andreas Gruenbacher a500c2efbb drbd: struct drbd_request: Introduce a new collision flag
This flag is set when a processes puts itself to sleep to wait for a
conflicting request to complete.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:29 +02:00
Andreas Gruenbacher 9e204cddaf drbd: Move some functions to where they are used
Move drbd_update_congested() to drbd_main.c, and drbd_req_new() and
drbd_req_free() to drbd_req.c: those functions are not used anywhere
else.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:28 +02:00
Andreas Gruenbacher 3e394da184 drbd: Move sequence number logic into drbd_receiver.c and simplify it
These things are only used there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:27 +02:00
Andreas Gruenbacher cc378270e4 drbd: Initialize the sequence number sent over the network even when not used
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:26 +02:00
Andreas Gruenbacher bdc7adb006 drbd: Remove redundant initialization
packet_seq is initialized by both sides of a connection in
drbd_connect().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:25 +02:00
Andreas Gruenbacher d876302306 drbd: Rename "enum drbd_packets" to "enum drbd_packet"
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:24 +02:00
Andreas Gruenbacher f2ad906379 drbd: Move cmdname() out of drbd_int.h
There is no good reason for cmdname() to be an inline function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:23 +02:00
Philipp Reisner b42a70ad32 drbd: Do not access tconn after it was freed
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:22 +02:00
Philipp Reisner 257d0af689 drbd: Implemented receiving of new style packets on meta socket
Now drbd communication with protocol 100 actually works.
Replaced the remaining p_header80 with p_header where we
no longer know which header it is.

In the places where p_header80 is still in use, it is on
purpose, because we know that it is an old style header
there.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:26:18 +02:00
Philipp Reisner fd340c12c9 drbd: Use new header layout
The new header layout will only be used if the peer supports
it of course.

For the first packet and the handshake packet the old (h80)
layout is used for compatibility reasons.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-09-28 10:23:03 +02:00
Carsten Emde 6c4867f646 floppy: use del_timer_sync() in init cleanup
When no floppy is found the module code can be released while a timer
function is pending or about to be executed.

CPU0                                  CPU1
				      floppy_init()
timer_softirq()
   spin_lock_irq(&base->lock);
   detach_timer();
   spin_unlock_irq(&base->lock);
   -> Interrupt
					del_timer();
				        return -ENODEV;
                                      module_cleanup();
   <- EOI
   call_timer_fn();
   OOPS

Use del_timer_sync() to prevent this.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:22:11 +02:00
Ayan George 4c823cc3d5 drivers/block/loop.c: remove unnecessary bdev argument from loop_clr_fd()
If the loop device is associated (lo->lo_state == Lo_bound), it will have
a valid bdev pointed to by lo->lo_device.  There is no reason to ever pass
an additional block_device pointer.

Signed-off-by: Ayan George <ayan.george@canonical.com>
Cc: Phillip Susi <psusi@cfl.rr.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:02:13 +02:00
Phillip Susi 8a9c594422 drivers/block/loop.c: emit uevent on auto release
The loopback driver failed to emit the change uevent when auto releasing
the device.  Fixed lo_release() to pass the bdev to loop_clr_fd() so it
can emit the event.

Signed-off-by: Phillip Susi <psusi@cfl.rr.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ayan George <ayan@ayan.net>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:02:13 +02:00
Sergei Shtylyov 5a3a76e6c3 drivers/block/cpqarray.c: use pci_dev->revision
This driver uses PCI_CLASS_REVISION instead of PCI_REVISION_ID, so it
wasn't converted by commit 44c10138fd ("PCI: Change all drivers to
use pci_device->revision").

Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Acked-by: Mike Miller <mike.miller@hp.com>
Cc: Chirag Kantharia <chirag.kantharia@hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-09-21 10:02:13 +02:00
Jiri Kosina e060c38434 Merge branch 'master' into for-next
Fast-forward merge with Linus to be able to merge patches
based on more recent version of the tree.
2011-09-15 15:08:18 +02:00
Jesper Juhl e5de063016 Remove unneeded version.h includes from drivers/block/
It was pointed out by 'make versioncheck' that some includes of
linux/version.h are not needed in drivers/block/.
This patch removes them.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-15 14:57:06 +02:00
Justin P. Mattock 699324871f treewide: remove extra semicolons from various parts of the kernel
This is a resend from the original, changing the title from PATCH to
RFC(since this is a review for commit, and I should have put that the first go around).
and also removing some of the commit's with ia64 and bash since it is significant.
let me know if I might have missed anything etc..

Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-15 14:50:49 +02:00
Joe Perches 1d273b929c drbd: Use angle brackets for system includes
Use the normal include style.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-15 14:02:57 +02:00
Joe Perches 57f3224c3f drbd: Convert vmalloc/memset to vzalloc
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-09-15 13:55:02 +02:00
Christoph Hellwig 5a7bbad27a block: remove support for bio remapping from ->make_request
There is very little benefit in allowing to let a ->make_request
instance update the bios device and sector and loop around it in
__generic_make_request when we can archive the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-12 12:12:01 +02:00
Philipp Reisner c012949a40 drbd: Replaced all p_header80 with a generic p_header
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:30:26 +02:00
Philipp Reisner c6d25cfe52 drbd: Preparing to use p_header96 for all packets
recv_bm_rle_bits() should not make any assumptions abou the layout
of the packet header

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:30:25 +02:00
Philipp Reisner 191d3cc8d9 drbd: Made drbd_flush_workqueue() to take a tconn instead of an mdev
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:30:24 +02:00
Philipp Reisner a0638456c6 drbd: moved crypto transformations and friends from mdev to tconn
sed -i \
       -e 's/mdev->cram_hmac_tfm/mdev->tconn->cram_hmac_tfm/g' \
       -e 's/mdev->integrity_w_tfm/mdev->tconn->integrity_w_tfm/g' \
       -e 's/mdev->integrity_r_tfm/mdev->tconn->integrity_r_tfm/g' \
       -e 's/mdev->int_dig_out/mdev->tconn->int_dig_out/g' \
       -e 's/mdev->int_dig_in/mdev->tconn->int_dig_in/g' \
       -e 's/mdev->int_dig_vv/mdev->tconn->int_dig_vv/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:30:23 +02:00
Philipp Reisner 87eeee41f8 drbd: moved req_lock and transfer log from mdev to tconn
sed -i \
       -e 's/mdev->req_lock/mdev->tconn->req_lock/g' \
       -e 's/mdev->unused_spare_tle/mdev->tconn->unused_spare_tle/g' \
       -e 's/mdev->newest_tle/mdev->tconn->newest_tle/g' \
       -e 's/mdev->oldest_tle/mdev->tconn->oldest_tle/g' \
       -e 's/mdev->out_of_sequence_requests/mdev->tconn->out_of_sequence_requests/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:30:15 +02:00
Philipp Reisner 31890f4ab2 drbd: moved agreed_pro_version, last_received and ko_count to tconn
sed -i \
       -e 's/mdev->agreed_pro_version/mdev->tconn->agreed_pro_version/g' \
       -e 's/mdev->last_received/mdev->tconn->last_received/g' \
       -e 's/mdev->ko_count/mdev->tconn->ko_count/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:07 +02:00
Philipp Reisner e6b3ea83bc drbd: moved receiver, worker and asender from mdev to tconn
Patch mostly:
sed -i -e 's/mdev->receiver/mdev->tconn->receiver/g' \
       -e 's/mdev->worker/mdev->tconn->worker/g' \
       -e 's/mdev->asender/mdev->tconn->asender/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:06 +02:00
Philipp Reisner e42325a576 drbd: moved data and meta from mdev to tconn
Patch mostly:

sed -i -e 's/mdev->data/mdev->tconn->data/g' \
       -e 's/mdev->meta/mdev->tconn->meta/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:05 +02:00
Philipp Reisner b2fb6dbe52 drbd: moved net_cont and net_cnt_wait from mdev to tconn
Patch partly generated by:

sed -i -e 's/get_net_conf(mdev)/get_net_conf(mdev->tconn)/g' \
       -e 's/put_net_conf(mdev)/put_net_conf(mdev->tconn)/g' \
       -e 's/get_net_conf(odev)/get_net_conf(odev->tconn)/g' \
       -e 's/put_net_conf(odev)/put_net_conf(odev->tconn)/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:04 +02:00
Philipp Reisner 89e58e755e drbd: moved net_conf from mdev to tconn
Besides moving the struct member, everything else is generated by:

sed -i -e 's/mdev->net_conf/mdev->tconn->net_conf/g' \
       -e 's/odev->net_conf/odev->tconn->net_conf/g' \
       *.[ch]

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:03 +02:00
Philipp Reisner 2111438b30 drbd: Minimal struct drbd_tconn
Starting to dissolve the network connection from the actual
block devices.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:02 +02:00
Andreas Gruenbacher 6618bf1638 drbd: Interval tree bugfix
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:27:00 +02:00
Andreas Gruenbacher e3cfa7b26a drbd: Inline function overlaps() is now unused
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:59 +02:00
Andreas Gruenbacher 70dc65e1b3 drbd: Remove some useless paranoia code
The open_cnt check is an open-coded D_ASSERT() check.

In case the data.work queue is not empty, it does not really help to
know which drbd_work elements remained on that list: they will be freed
immediately afterwards, anyway.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:58 +02:00
Andreas Gruenbacher 841ce241fa drbd: Replace the ERR_IF macro with an assert-like macro
Remove the file name and line number from the syslog messages generated:
we have no duplicate function names, and no function contains the same
assertion more than once.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:57 +02:00
Andreas Gruenbacher e77a0a5cc1 drbd: Convert all constants in enum drbd_thread_state to upper case
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:56 +02:00
Andreas Gruenbacher 8554df1c6d drbd: Convert all constants in enum drbd_req_event to upper case
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:55 +02:00
Andreas Gruenbacher bb3bfe9614 drbd: Remove the unused hash tables
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:54 +02:00
Andreas Gruenbacher 8b946255f8 drbd: Use interval tree for overlapping epoch entry detection
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:53 +02:00
Andreas Gruenbacher 010f6e678f drbd: Put sector and size in struct drbd_epoch_entry into struct drbd_interval
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:52 +02:00
Andreas Gruenbacher bc9c5c4118 drbd: Use the read and write request trees for request lookups
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:50 +02:00
Andreas Gruenbacher dac1389ccc drbd: Add read_requests tree
We do not do collision detection for read requests, but we still need to
look up the request objects when we receive a package over the network.
Using the same data structure for read and write requests results in
simpler code once the tl_hash and app_reads_hash tables are removed.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-29 11:26:31 +02:00
Andreas Gruenbacher de696716e8 drbd: Use interval tree for overlapping write request detection
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:06 +02:00
Andreas Gruenbacher ace652acf2 drbd: Put sector and size in struct drbd_request into struct drbd_interval
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:05 +02:00
Andreas Gruenbacher 0939b0e5cd drbd: Add interval tree data structure
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:04 +02:00
Andreas Gruenbacher c3afd8f568 drbd: Request lookup code cleanup (4)
Factor out duplicate code in got_NegAck().

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:03 +02:00
Andreas Gruenbacher ae3388daae drbd: Request lookup code cleanup (3)
Get rid of the ar_id_to_req() and ack_id_to_req() wrappers.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:02 +02:00
Andreas Gruenbacher 668eebc6a1 drbd: Request lookup code cleanup (2)
Unify the ar_id_to_req() and ack_id_to_req() functions: make both fail
if the consistency check fails.  Move the request lookup code now
duplicated in both functions into its own function.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:01 +02:00
Andreas Gruenbacher 5162458564 drbd: Request lookup code cleanup (1)
Move _ar_id_to_req() to drbd_receiver.c and mark it non-inline.  Remove
the leading underscores from _ar_id_to_req() and _ack_id_to_req().  Mark
ar_hash_slot() inline.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:58:00 +02:00
Andreas Gruenbacher 9c50842a35 drbd: Update outdated comment
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:59 +02:00
Andreas Gruenbacher d628769b3c drbd: Move drbd_free_tl_hash() to drbd_main()
This is the only place where this function is used.  Make it static.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:58 +02:00
Andreas Gruenbacher 579b57ed73 drbd: Magic reserved block_id value cleanup
The ID_VACANT definition has become entirely irrelevant by now.

The is_syncer_block_id() macro does not improve the code, so eliminated
it.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:57 +02:00
Andreas Gruenbacher e7fad8af75 drbd: Endianness convert the constants instead of the variables
Converting the constants happens at compile time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:57 +02:00
Andreas Gruenbacher ca9bc12b90 drbd: Get rid of BE_DRBD_MAGIC and BE_DRBD_MAGIC_BIG
Converting the constants happens at compile time.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:56 +02:00
Andreas Gruenbacher 9a8e77530f drbd: Consistently use block_id == ID_SYNCER for checksum based resync and online verify
DRBD_MAGIC has nothing to do with block ids and the funny values
computed were not actually used, anyway.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:55 +02:00
Andreas Gruenbacher 3980485361 drbd: Remove superfluous declaration
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:43 +02:00
Andreas Gruenbacher 28c455ceb2 drbd: Get rid of req_validator_fn typedef
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
2011-08-25 14:57:31 +02:00
Kay Sievers e03c8dd149 loop: always allow userspace partitions and optionally support automatic scanning
Automatic partition scanning can be requested individually per loop
device during its setup by setting LO_FLAGS_PARTSCAN. By default, no
partition tables are scanned.

Userspace can now always add and remove partitions from all loop
devices, regardless if the in-kernel partition scanner is enabled or
not.

The needed partition minor numbers are allocated from the extended
minors space, the main loop device numbers will continue to match the
loop minors, regardless of the number of partitions used.

  # grep . /sys/class/block/loop1/loop/*
  /sys/block/loop1/loop/autoclear:0
  /sys/block/loop1/loop/backing_file:/home/kay/data/stuff/part.img
  /sys/block/loop1/loop/offset:0
  /sys/block/loop1/loop/partscan:1
  /sys/block/loop1/loop/sizelimit:0

  # ls -l /dev/loop*
  brw-rw---- 1 root disk   7,   0 Aug 14 20:22 /dev/loop0
  brw-rw---- 1 root disk   7,   1 Aug 14 20:23 /dev/loop1
  brw-rw---- 1 root disk 259,   0 Aug 14 20:23 /dev/loop1p1
  brw-rw---- 1 root disk 259,   1 Aug 14 20:23 /dev/loop1p2
  brw-rw---- 1 root disk   7,  99 Aug 14 20:23 /dev/loop99
  brw-rw---- 1 root disk 259,   2 Aug 14 20:23 /dev/loop99p1
  brw-rw---- 1 root disk 259,   3 Aug 14 20:23 /dev/loop99p2
  crw------T 1 root root  10, 237 Aug 14 20:22 /dev/loop-control

Cc: Karel Zak  <kzak@redhat.com>
Cc: Davidlohr Bueso <dave@gnu.org>
Acked-By: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-23 20:12:04 +02:00
Jens Axboe 89c63a8ef3 Merge branch 'stable/for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus 2011-08-23 15:09:13 +02:00
Joe Jin 1bc05b0ae6 xen-blkback: fixed indentation and comments
This patch fixes belows:

1. Fix code style issue.
2. Fix incorrect functions name in comments.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-08-22 11:35:36 -04:00
Joe Jin 6f5986bce5 xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
When do block-attach/block-detach test with below steps, umount hangs
in the guest. Furthermore shutdown ends up being stuck when umounting file-systems.

1. start guest.
2. attach new block device by xm block-attach in Dom0.
3. mount new disk in guest.
4. execute xm block-detach to detach the block device in dom0 until timeout
5. Any request to the disk will hung.

Root cause:
This issue is caused when setting backend device's state to
'XenbusStateClosing', which sends to the frontend the XenbusStateClosing
notification. When frontend receives the notification it tries to release
the disk in blkfront_closing(), but at that moment the disk is still in use
by guest, so frontend refuses to close. Specifically it sets the disk state to
XenbusStateClosing and sends the notification to backend - when backend receives the
event, it disconnects the vbd from real device, and sets the vbd device state to
XenbusStateClosing. The backend disconnects the real device/file, and any IO
requests to the disk in guest will end up in ether, leaving disk DEAD and set to
XenbusStateClosing. When the guest wants to disconnect the disk, umount will
hang on blkif_release()->xlvbd_release_gendisk() as it is unable to send any IO
to the disk, which prevents clean system shutdown.

Solution:
Don't disconnect backend until frontend state switched to XenbusStateClosed.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
[v1: Modified description a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-08-22 11:35:35 -04:00
Lukas Czerner dfaa2ef68e loop: add discard support for loop devices
This commit adds discard support for loop devices. Discard is usually
supported by SSD and thinly provisioned devices as a method for
reclaiming unused space. This is no different than trying to reclaim
back space which is not used by the file system on the image, but it
still occupies space on the host file system.

We can do the reclamation on file system which does support hole
punching. So when discard request gets to the loop driver we can
translate that to punch a hole to the underlying file, hence reclaim
the free space.

This is very useful for trimming down the size of the image to only what
is really used by the file system on that image. Fstrim may be used for
that purpose.

It has been tested on ext4, xfs and btrfs with the image file systems
ext4, ext3, xfs and btrfs. ext4, or ext6 image on ext4 file system has
some problems but it seems that ext4 punch hole implementation is
somewhat flawed and it is unrelated to this commit.

Also this is a very good method of validating file systems punch hole
implementation.

Note that when encryption is used, discard support is disabled, because
using it might leak some information useful for possible attacker.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:50:46 +02:00
Andrew Morton 548ef6cc26 nbd-replace-some-printk-with-dev_warn-and-dev_info-checkpatch-fixes
ERROR: code indent should use tabs where possible
#30: FILE: drivers/block/nbd.c:578:
+^I        dev_info(disk_to_dev(lo->disk), "NBD_DISCONNECT\n");$

total: 1 errors, 0 warnings, 35 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
      scripts/cleanfile

./patches/nbd-replace-some-printk-with-dev_warn-and-dev_info.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Paul Clements <Paul.Clements@steeleye.com>
Cc: WANG Cong <amwang@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:28 +02:00
WANG Cong 5eedf5415c nbd: replace some printk with dev_warn() and dev_info()
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:28 +02:00
WANG Cong 7742ce4ab4 nbd: lower the loglevel of an error message
This is only an error, no need to use KERN_CRIT log level.

Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:28 +02:00
WANG Cong 7f1b90f99a nbd: replace printk KERN_ERR with dev_err()
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:22 +02:00
WANG Cong 1695b87f7d nbd: replace sysfs_create_file() with device_create_file()
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:21 +02:00
WANG Cong 25ac0c2b97 nbd: use task_pid_nr() to get current pid
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Paul Clements <Paul.Clements@steeleye.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-19 14:48:17 +02:00
Jens Axboe 40bb96ade4 Merge branch 'stable/for-jens' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus 2011-08-09 20:43:26 +02:00
Konrad Rzeszutek Wilk ea5e116162 xen/blkback: Make description more obvious.
With the frontend having Xen but the backend not, it just looks odd:

  <*>   Xen virtual block device support
  <*>   Block-device backend driver

Fix it to have the 'Xen' in front of it.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-08-09 11:12:14 -04:00
Joe Handzik f963d270cb cciss: add transport mode attribute to sys
Signed-off-by: Joseph Handzik <joseph.t.handzik@beardog.cce.hp.com>
Acked-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-08 11:40:17 +02:00
Joseph Handzik 1304953700 cciss: Adds simple mode functionality
Signed-off-by: Joseph Handzik <joseph.t.handzik@beardog.cce.hp.com>
Acked-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-08 11:40:15 +02:00
Axel Lin f41c53a569 block: swim3: fix unterminated of_device_id table
of_device_id structures need a NULL terminating entry, add it.

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-03 15:02:55 +02:00
H Hartley Sweeten ddad9ef582 drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
The buffer 'sc.cpu_mask' is a kernel buffer.  If bitmap_parse is used
instead of __bitmap_parse the extra parameter that indicates a kernel
buffer is not needed.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-08-02 12:43:49 +02:00
Kay Sievers 05eb0f252b loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
LOOP_CLR_FD takes lo->lo_ctl_mutex and tries to remove the loop sysfs
files. Sysfs calls show() and waits for lo->lo_ctl_mutex. LOOP_CLR_FD
waits for show() to finish to remove the sysfs file.

  cat /sys/class/block/loop0/loop/backing_file
    mutex_lock_nested+0x176/0x350
    ? loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
    ? loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
    loop_attr_do_show_backing_file+0x2f/0xd0 [loop]
    dev_attr_show+0x1b/0x60
    ? sysfs_read_file+0x86/0x1a0
    ? __get_free_pages+0x12/0x50
    sysfs_read_file+0xaf/0x1a0

  ioctl(LOOP_CLR_FD):
    wait_for_common+0x12c/0x180
    ? try_to_wake_up+0x2a0/0x2a0
    wait_for_completion+0x18/0x20
    sysfs_deactivate+0x178/0x180
    ? sysfs_addrm_finish+0x43/0x70
    ? sysfs_addrm_start+0x1d/0x20
    sysfs_addrm_finish+0x43/0x70
    sysfs_hash_and_remove+0x85/0xa0
    sysfs_remove_group+0x59/0x100
    loop_clr_fd+0x1dc/0x3f0 [loop]
    lo_ioctl+0x223/0x7a0 [loop]

Instead of taking the lo_ctl_mutex from sysfs code, take the inner
lo->lo_lock, to protect the access to the backing_file data.

Thanks to Tejun for help debugging and finding a solution.

Cc: Milan Broz <mbroz@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:21:35 +02:00
Kay Sievers d134b00b9a loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
Instead of unconditionally creating a fixed number of dead loop
devices which need to be investigated by storage handling services,
even when they are never used, we allow distros start with 0
loop devices and have losetup(8) and similar switch to the dynamic
/dev/loop-control interface instead of searching /dev/loop%i for free
devices.

Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:08:04 +02:00
Kay Sievers 770fe30a46 loop: add management interface for on-demand device allocation
Loop devices today have a fixed pre-allocated number of usually 8.
The number can only be changed at module init time. To find a free
device to use, /dev/loop%i needs to be scanned, and all devices need
to be opened until a free one is possibly found.

This adds a new /dev/loop-control device node, that allows to
dynamically find or allocate a free device, and to add and remove loop
devices from the running system:
 LOOP_CTL_ADD adds a specific device. Arg is the number
 of the device. It returns the device i or a negative
 error code.

 LOOP_CTL_REMOVE removes a specific device, Arg is the
 number the device. It returns the device i or a negative
 error code.

 LOOP_CTL_GET_FREE finds the next unbound device or allocates
 a new one. No arg is given. It returns the device i or a
 negative error code.

The loop kernel module gets automatically loaded when
/dev/loop-control is accessed the first time. The alias
specified in the module, instructs udev to create this
'dead' device node, even when the module is not loaded.

Example:
 cfd = open("/dev/loop-control", O_RDWR);

 # add a new specific loop device
 err = ioctl(cfd, LOOP_CTL_ADD, devnr);

 # remove a specific loop device
 err = ioctl(cfd, LOOP_CTL_REMOVE, devnr);

 # find or allocate a free loop device to use
 devnr = ioctl(cfd, LOOP_CTL_GET_FREE);

 sprintf(loopname, "/dev/loop%i", devnr);
 ffd = open("backing-file", O_RDWR);
 lfd = open(loopname, O_RDWR);
 err = ioctl(lfd, LOOP_SET_FD, ffd);

Cc: Tejun Heo <tj@kernel.org>
Cc: Karel Zak  <kzak@redhat.com>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:08:04 +02:00
Kay Sievers 34dd82afd2 loop: replace linked list of allocated devices with an idr index
Replace the linked list, that keeps track of allocated devices, with an
idr index to allow a more efficient lookup of devices.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-31 22:08:04 +02:00
Arun Sharma 60063497a9 atomic: use <linux/atomic.h>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>

Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26 16:49:47 -07:00
Linus Torvalds ba5b56cb3e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
  ceph: document unlocked d_parent accesses
  ceph: explicitly reference rename old_dentry parent dir in request
  ceph: document locking for ceph_set_dentry_offset
  ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug
  ceph: protect d_parent access in ceph_d_revalidate
  ceph: protect access to d_parent
  ceph: handle racing calls to ceph_init_dentry
  ceph: set dir complete frag after adding capability
  rbd: set blk_queue request sizes to object size
  ceph: set up readahead size when rsize is not passed
  rbd: cancel watch request when releasing the device
  ceph: ignore lease mask
  ceph: fix ceph_lookup_open intent usage
  ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC
  ceph: fix bad parent_inode calc in ceph_lookup_open
  ceph: avoid carrying Fw cap during write into page cache
  libceph: don't time out osd requests that haven't been received
  ceph: report f_bfree based on kb_avail rather than diffing.
  ceph: only queue capsnap if caps are dirty
  ceph: fix snap writeback when racing with writes
  ...
2011-07-26 13:38:50 -07:00
Josh Durgin 029bcbd8b0 rbd: set blk_queue request sizes to object size
This improves performance since more requests can be merged.

Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
2011-07-26 11:29:35 -07:00
Yehuda Sadeh 79e3057c4c rbd: cancel watch request when releasing the device
We were missing this cleanup, so when a device was released
the osd didn't clean up its watchers list, so following notifications
could be slow as osd needed to timeout on the client.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2011-07-26 11:29:04 -07:00
Linus Torvalds 8ded371f81 Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-block
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block:
  cciss: do not attempt to read from a write-only register
  xen/blkback: Add module alias for autoloading
  xen/blkback: Don't let in-flight requests defer pending ones.
  bsg: fix address space warning from sparse
  bsg: remove unnecessary conditional expressions
  bsg: fix bsg_poll() to return POLLOUT properly
2011-07-25 10:38:18 -07:00
Linus Torvalds bbd9d6f7fb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
  vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
  isofs: Remove global fs lock
  jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
  fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
  mm/truncate.c: fix build for CONFIG_BLOCK not enabled
  fs:update the NOTE of the file_operations structure
  Remove dead code in dget_parent()
  AFS: Fix silly characters in a comment
  switch d_add_ci() to d_splice_alias() in "found negative" case as well
  simplify gfs2_lookup()
  jfs_lookup(): don't bother with . or ..
  get rid of useless dget_parent() in btrfs rename() and link()
  get rid of useless dget_parent() in fs/btrfs/ioctl.c
  fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
  drivers: fix up various ->llseek() implementations
  fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
  Ext4: handle SEEK_HOLE/SEEK_DATA generically
  Btrfs: implement our own ->llseek
  fs: add SEEK_HOLE and SEEK_DATA flags
  reiserfs: make reiserfs default to barrier=flush
  ...

Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
2011-07-22 19:02:39 -07:00
Linus Torvalds a99a7d1436 Merge branch 'timers-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-cleanup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  mips: Fix i8253 clockevent fallout
  i8253: Cleanup outb/inb magic
  arm: Footbridge: Use common i8253 clockevent
  mips: Use common i8253 clockevent
  x86: Use common i8253 clockevent
  i8253: Create common clockevent implementation
  i8253: Export i8253_lock unconditionally
  pcpskr: MIPS: Make config dependencies finer grained
  pcspkr: Cleanup Kconfig dependencies
  i8253: Move remaining content and delete asm/i8253.h
  i8253: Consolidate definitions of PIT_LATCH
  x86: i8253: Consolidate definitions of global_clock_event
  i8253: Alpha, PowerPC: Remove unused asm/8253pit.h
  alpha: i8253: Cleanup remaining users of i8253pit.h
  i8253: Remove I8253_LOCK config
  i8253: Make pcsp sound driver use the shared i8253_lock
  i8253: Make pcspkr input driver use the shared i8253_lock
  i8253: Consolidate all kernel definitions of i8253_lock
  i8253: Unify all kernel declarations of i8253_lock
  i8253: Create linux/i8253.h and use it in all 8253 related files
2011-07-22 16:51:56 -07:00
Linus Torvalds 8181780c16 Merge branch 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6
* 'devicetree/next' of git://git.secretlab.ca/git/linux-2.6:
  dt: include linux/errno.h in linux/of_address.h
  of/address: Add of_find_matching_node_by_address helper
  dt: remove extra xsysace platform_driver registration
  tty/serial: Add devicetree support for nVidia Tegra serial ports
  dt: add empty of_property_read_u32[_array] for non-dt
  dt: bindings: move SEC node under new crypto/
  dt: add helper function to read u32 arrays
  tty/serial: change of_serial to use new of_property_read_u32() api
  dt: add 'const' for of_property_read_string parameter **out_string
  dt: add helper functions to read u32 and string property values
  tty: of_serial: support for 32 bit accesses
  dt: document the of_serial bindings
  dt/platform: allow device name to be overridden
  drivers/amba: create devices from device tree
  dt: add of_platform_populate() for creating device from the device tree
  dt: Add default match table for bus ids
2011-07-22 14:53:38 -07:00
Linus Torvalds 111ad119d1 Merge branch 'stable/drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  xen/pciback: Have 'passthrough' option instead of XEN_PCIDEV_BACKEND_PASS and XEN_PCIDEV_BACKEND_VPCI
  xen/pciback: Remove the DEBUG option.
  xen/pciback: Drop two backends, squash and cleanup some code.
  xen/pciback: Print out the MSI/MSI-X (PIRQ) values
  xen/pciback: Don't setup an fake IRQ handler for SR-IOV devices.
  xen: rename pciback module to xen-pciback.
  xen/pciback: Fine-grain the spinlocks and fix BUG: scheduling while atomic cases.
  xen/pciback: Allocate IRQ handler for device that is shared with guest.
  xen/pciback: Disable MSI/MSI-X when reseting a device
  xen/pciback: guest SR-IOV support for PV guest
  xen/pciback: Register the owner (domain) of the PCI device.
  xen/pciback: Cleanup the driver based on checkpatch warnings and errors.
  xen/pciback: xen pci backend driver.
  xen: tmem: self-ballooning and frontswap-selfshrinking
  xen: Add module alias to autoload backend drivers
  xen: Populate xenbus device attributes
  xen: Add __attribute__((format(printf... where appropriate
  xen: prepare tmem shim to handle frontswap
  xen: allow enable use of VGA console on dom0
2011-07-22 13:45:15 -07:00
Al Viro e7f5909707 kill useless checks for sb->s_op == NULL
never is...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:21 -04:00
Grant Likely 8c11642a50 Merge commit 'v3.0-rc7' into devicetree/next 2011-07-15 20:11:34 -06:00
Stefan Bader 89153b5cae xen-blkfront: Fix one off warning about name clash
Avoid telling users to use xvde and onwards when using xvde.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-07-14 14:19:51 -04:00
Stefan Bader 196cfe2ae8 xen-blkfront: Drop name and minor adjustments for emulated scsi devices
These were intended to avoid the namespace clash when representing
emulated IDE and SCSI devices. However that seems to confuse users
more than expected (a disk defined as sda becomes xvde).
So for now go back to the scheme which does no adjustments. This
will break when mixing IDE and SCSI names in the configuration of
guests but should be by now expected.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2011-07-14 14:19:33 -04:00