Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Instead of referring from inside drivers/cdrom/Makefile to all the
drivers that use this driver, let these drivers select the cdrom
driver. This change makes the cdrom build code follow the approach
that is used for most other drivers, namely refer from the higher
layers to the lower layer instead of from the lower layer to the
higher layers.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid that submitting an SG_IO ioctl triggers a kernel oops that
is preceded by:
usercopy: kernel memory overwrite attempt detected to (null) (<null>) (6 bytes)
kernel BUG at mm/usercopy.c:72!
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Fixes: commit ca18d6f769 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Dann Frazier <dann.frazier@canonical.com>
Cc: <stable@vger.kernel.org> # v4.13
Reviewed-by: Christoph Hellwig <hch@lst.de>
Moved virtblk_initialize_rq() inside CONFIG_VIRTIO_BLK_SCSI.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If you do not set sk_sndtimeo you will get -ERESTARTSYS if there is a
pending signal when you enter sendmsg, which we handle properly.
However if you set a timeout for your commands we'll set sk_sndtimeo to
that timeout, which means that sendmsg will start returning -EINTR
instead of -ERESTARTSYS. Fix this by checking either cases and doing
the correct thing.
Cc: stable@vger.kernel.org
Fixes: dc88e34d69 ("nbd: set sk->sk_sndtimeo for our sockets")
Reported-and-tested-by: Daniel Xu <dlxu@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only mips defines this helper, so remove all the other arch definitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Make these structures const as they are either passed to the functions
having the argument as const or stored as a reference in the "ci_type"
const field of a config_item structure.
Done using Coccienlle.
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Several timer users needlessly reset their .function/.data fields during
their timer callback, but nothing else changes them. Some users do not
use their .data field at all. Each instance is removed here.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> # for staging
Acked-by: Krzysztof Halasa <khc@pm.waw.pl> # for wan/hdlc*
Acked-by: Jens Axboe <axboe@kernel.dk> # for amiflop
Cc: devel@driverdev.osuosl.org
Cc: netdev@vger.kernel.org
Cc: linux-wireless@vger.kernel.org
Cc: Jens Axboe <axboe@fb.com>
Cc: Ganesh Krishna <ganesh.krishna@microchip.com>
Cc: Aditya Shankar <aditya.shankar@microchip.com>
Link: https://lkml.kernel.org/r/20171010001032.GA119829@beast
Fix to return error code -ENOMEM from the null_alloc_dev() error
handling case instead of 0, as done elsewhere in this function.
Fixes: 2984c8684f ("nullb: factor disk parameters")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A user reported a regression with using the normal ioctl interface on
newer kernels. This happens because I was setting the device size
before the device was actually connected, which caused us to error out
and close everything down. This didn't happen on netlink because we
hold the device lock the whole time we're setting things up, but we
don't do that for the ioctl path. This fixes the problem.
Cc: stable@vger.kernel.org
Fixes: 29eaadc ("nbd: stop using the bdev everywhere")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use kmem_cache_free instead of kfree for freeing the memory previously
allocated with kmem_cache_zalloc/kmem_cache_alloc/kmem_cache_node.
Signed-off-by: Himanshu Jha <himanshujha199640@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"A collection of fixes for this series. This contains:
- NVMe pull request from Christoph, one uuid attribute fix, and one
fix for the controller memory buffer address for remapped BARs.
- use-after-free fix for bsg, from Benjamin Block.
- bcache race/use-after-free fix for a list traversal, fixing a
regression in this merge window. From Coly Li.
- null_blk change configfs dependency change from a 'depends' to a
'select'. This is a change from this merge window as well. From me.
- nbd signal fix from Josef, fixing a regression introduced with the
status code changes.
- nbd MAINTAINERS mailing list entry update.
- blk-throttle stall fix from Joseph Qi.
- blk-mq-debugfs fix from Omar, fixing an issue where we don't
register the IO scheduler debugfs directory, if the driver is
loaded with it. Only shows up if you switch through the sysfs
interface"
* 'for-linus' of git://git.kernel.dk/linux-block:
bsg-lib: fix use-after-free under memory-pressure
nvme-pci: Use PCI bus address for data/queues in CMB
blk-mq-debugfs: fix device sched directory for default scheduler
null_blk: change configfs dependency to select
blk-throttle: fix possible io stall when upgrade to max
MAINTAINERS: update list for NBD
nbd: fix -ERESTARTSYS handling
nvme: fix visibility of "uuid" ns attribute
bcache: use llist_for_each_entry_safe() in __closure_wake_up()
In testing I found handle passed to zs_map_object in __zram_bvec_read is
NULL so eh kernel goes oops in pin_object().
The reason is there is no routine to check the slot's freeing after
getting the slot's lock. This patch fixes it.
[minchan@kernel.org: v2]
Link: http://lkml.kernel.org/r/1505887347-10881-1-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/1505788488-26723-1-git-send-email-minchan@kernel.org
Fixes: 1f7319c742 ("zram: partial IO refactoring")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A recent commit made null_blk depend on configfs, which is kind of
annoying since you now have to find this dependency and enable that
as well. Discovered this since I no longer had null_blk available
on a box I needed to debug, since it got killed when the config
updated after the configfs change was merged.
Fixes: 3bf2bd2073 ("nullb: add configfs interface")
Reviewed-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph made it so that if we return'ed BLK_STS_RESOURCE whenever we
got ERESTARTSYS from sending our packets we'd return BLK_STS_OK, which
means we'd never requeue and just hang. We really need to return the
right value from the upper layer.
Fixes: fc17b6534e ("blk-mq: switch ->queue_rq return value to blk_status_t")
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code is only for blkcg not for all cgroups
Fixes: d4478e92d6 ("block/loop: make loop cgroup aware")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch fix the following build warning:
drivers/block/cryptoloop.c:46:8: warning: variable 'cipher' set but not used [-Wunused-but-set-variable]
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
loop block device handles IO in a separate thread. The actual IO
dispatched isn't cloned from the IO loop device received, so the
dispatched IO loses the cgroup context.
I'm ignoring buffer IO case now, which is quite complicated. Making the
loop thread aware cgroup context doesn't really help. The loop device
only writes to a single file. In current writeback cgroup
implementation, the file can only belong to one cgroup.
For direct IO case, we could workaround the issue in theory. For
example, say we assign cgroup1 5M/s BW for loop device and cgroup2
10M/s. We can create a special cgroup for loop thread and assign at
least 15M/s for the underlayer disk. In this way, we correctly throttle
the two cgroups. But this is tricky to setup.
This patch tries to address the issue. We record bio's css in loop
command. When loop thread is handling the command, we then use the API
provided in patch 1 to set the css for current task. The bio layer will
use the css for new IO (from patch 3).
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When the request is completed, lo_complete_rq() checks cmd->use_aio.
However, if this is in fact an aio request, cmd->use_aio will have
already been reused as cmd->ref by lo_rw_aio*. Fix it by not using a
union. On x86_64, there's a hole after the union anyways, so this
doesn't make struct loop_cmd any bigger.
Fixes: 92d773324b ("block/loop: fix use after free")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In testing we noticed that nbd would spew if you ran a fio job against
the raw device itself. This is because fio calls a block device
specific ioctl, however the block layer will first pass this back to the
driver ioctl handler in case the driver wants to do something special.
Since the device was setup using netlink this caused us to spew every
time fio called this ioctl. Since we don't have special handling, just
error out for any non-nbd specific ioctl's that come in. This fixes the
spew.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The code in __brd_direct_access multiplies the pgoff variable by page size
and divides it by 512. It can cause overflow on 32-bit architectures. The
overflow happens if we create ramdisk larger than 4G and use it as a
sparse device.
This patch replaces multiplication and division with multiplication by the
number of sectors per page.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 1647b9b959 ("brd: add dax_operations support")
Cc: stable@vger.kernel.org # 4.12+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
* a large series of fixes and improvements to the snapshot-handling
code (Zheng Yan)
* individual read/write OSD requests passed down to libceph are now
limited to 16M in size to avoid hitting OSD-side limits (Zheng Yan)
* encode MStatfs v2 message to allow for more accurate space usage
reporting (Douglas Fuller)
* switch to the new writeback error tracking infrastructure (Jeff
Layton)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZuAC0AAoJEEp/3jgCEfOLb14H/REYq4fDDkUa70L4leKWWdCa
n71ipkKeoorfivts71iOtGMJfK+Z6ax+dq1PvBWMy6PtzXS/+2B+t2XwILvLiwWH
h87i44bY68aLWRTSusgTfB+I7gyVrWN0WMLznZ5rfM9XuyPv+RPyJYh3EhxWI5+U
2kOHFEc+cPL6mAshGmB8lIzKOWTfmBiw28ulICwlcazm79hh39aNBQE546lS8gA3
kXuJ55odojPgXOYh+vs60raIBnm6flek1jLxBGYG3MU4gv0VVWOyW0eWeuqW+EcR
6dVYlzg1xGlPp+vRmDZQuv/E2MafBxdcil/RrdLeqcx/Hf1KJBzcLgUzIMbnOAI=
=YDZP
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.14-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The highlights include:
- a large series of fixes and improvements to the snapshot-handling
code (Zheng Yan)
- individual read/write OSD requests passed down to libceph are now
limited to 16M in size to avoid hitting OSD-side limits (Zheng Yan)
- encode MStatfs v2 message to allow for more accurate space usage
reporting (Douglas Fuller)
- switch to the new writeback error tracking infrastructure (Jeff
Layton)"
* tag 'ceph-for-4.14-rc1' of git://github.com/ceph/ceph-client: (35 commits)
ceph: stop on-going cached readdir if mds revokes FILE_SHARED cap
ceph: wait on writeback after writing snapshot data
ceph: fix capsnap dirty pages accounting
ceph: ignore wbc->range_{start,end} when write back snapshot data
ceph: fix "range cyclic" mode writepages
ceph: cleanup local variables in ceph_writepages_start()
ceph: optimize pagevec iterating in ceph_writepages_start()
ceph: make writepage_nounlock() invalidate page that beyonds EOF
ceph: properly get capsnap's size in get_oldest_context()
ceph: remove stale check in ceph_invalidatepage()
ceph: queue cap snap only when snap realm's context changes
ceph: handle race between vmtruncate and queuing cap snap
ceph: fix message order check in handle_cap_export()
ceph: fix NULL pointer dereference in ceph_flush_snaps()
ceph: adjust 36 checks for NULL pointers
ceph: delete an unnecessary return statement in update_dentry_lease()
ceph: ENOMEM pr_err in __get_or_create_frag() is redundant
ceph: check negative offsets in ceph_llseek()
ceph: more accurate statfs
ceph: properly set snap follows for cap reconnect
...
Pull followup block layer updates from Jens Axboe:
"I ended up splitting the main pull request for this series into two,
mainly because of clashes between NVMe fixes that went into 4.13 after
the for-4.14 branches were split off. This pull request is mostly
NVMe, but not exclusively. In detail, it contains:
- Two pull request for NVMe changes from Christoph. Nothing new on
the feature front, basically just fixes all over the map for the
core bits, transport, rdma, etc.
- Series from Bart, cleaning up various bits in the BFQ scheduler.
- Series of bcache fixes, which has been lingering for a release or
two. Coly sent this in, but patches from various people in this
area.
- Set of patches for BFQ from Paolo himself, updating both
documentation and fixing some corner cases in performance.
- Series from Omar, attempting to now get the 4k loop support
correct. Our confidence level is higher this time.
- Series from Shaohua for loop as well, improving O_DIRECT
performance and fixing a use-after-free"
* 'for-4.14/block-postmerge' of git://git.kernel.dk/linux-block: (74 commits)
bcache: initialize dirty stripes in flash_dev_run()
loop: set physical block size to logical block size
bcache: fix bch_hprint crash and improve output
bcache: Update continue_at() documentation
bcache: silence static checker warning
bcache: fix for gc and write-back race
bcache: increase the number of open buckets
bcache: Correct return value for sysfs attach errors
bcache: correct cache_dirty_target in __update_writeback_rate()
bcache: gc does not work when triggering by manual command
bcache: Don't reinvent the wheel but use existing llist API
bcache: do not subtract sectors_to_gc for bypassed IO
bcache: fix sequential large write IO bypass
bcache: Fix leak of bdev reference
block/loop: remove unused field
block/loop: fix use after free
bfq: Use icq_to_bic() consistently
bfq: Suppress compiler warnings about comparisons
bfq: Check kstrtoul() return value
bfq: Declare local functions static
...
zram was the motivation for creating memset_l(). Minchan Kim sees a 7%
performance improvement on x86 with 100MB of non-zero deduplicatable
data:
perf stat -r 10 dd if=/dev/zram0 of=/dev/null
vanilla: 0.232050465 seconds time elapsed ( +- 0.51% )
memset_l: 0.217219387 seconds time elapsed ( +- 0.07% )
Link: http://lkml.kernel.org/r/20170720184539.31609-7-willy@infradead.org
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Tested-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas, megaraid_sas, zfcp and a host of minor updates.
The major driver change here is the elimination of the block based
cciss driver in favour of the SCSI based hpsa driver (which now drives
all the legacy cases cciss used to be required for). Plus a reset
handler clean up and the redo of the SAS SMP handler to use bsg lib.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJZscDNAAoJEAVr7HOZEZN4DWIQAK/UkkrvKpV/jLATM/yi7CoL
QidY86Hmwwl7A9HQ+2fjLfAsye0xcCzRwkucKK90IP5b4pefHhiJJfiMKAAe3TUW
xstnY5z5jaOhDG4nyJFoSm5fH5qXkMnJ8NZRK8f6Qg5yBN5dStEKqoBboNsz4KBI
md7idw0mbp5i2GXlJwSpc5eDS97GiPL6WkwgGaGKfXF1NDau0GbEdjijfz55haCD
pMhY7WJh/71RfOq/1ThXT1Z3khOlVcKXrkdO+602n7zh/klRBRtBC8m2a6xCfZPj
n7Pb/s0jhCQPd+e/Xtv7WEbY8uNOCrGoVgZ6U5EGrT5IeTfep24ackYqerjMhE63
esi4BJY8lUP9SGleLMgjYWyCHdmxBJRa7UI614DWN/H0QoGP6j/2EzGoi5Fw04vC
H8/+aqPPWZc9KUBioRYo8xWO8YgMqL2eyXY+Tc9cwxqAe2T6k/NC1zJVgDFKXfzb
QoWW4v9NNmYwf5vL/7tNgkeTMFQV66yUR7dR3SGTSk8UIrJ40ok0JyUAsDg86ZAH
BfMkWwhWQ6Byoel0Y7Ti88T49Cox/64r/I0ux06Qgg99+KpRLT7z20+GLIEHgXxg
116C39rgvYKqzc7W8RCyj8qSROuMVzg6QFbB6n+1PEsYIX2O8A2Re3jdS34q2LbX
aBDm/Lfdl4kkJrV9xY6P
=nQUG
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This is mostly updates of the usual suspects: lpfc, qla2xxx, hisi_sas,
megaraid_sas, zfcp and a host of minor updates.
The major driver change here is the elimination of the block based
cciss driver in favour of the SCSI based hpsa driver (which now drives
all the legacy cases cciss used to be required for). Plus a reset
handler clean up and the redo of the SAS SMP handler to use bsg lib"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (279 commits)
scsi: scsi-mq: Always unprepare before requeuing a request
scsi: Show .retries and .jiffies_at_alloc in debugfs
scsi: Improve requeuing behavior
scsi: Call scsi_initialize_rq() for filesystem requests
scsi: qla2xxx: Reset the logo flag, after target re-login.
scsi: qla2xxx: Fix slow mem alloc behind lock
scsi: qla2xxx: Clear fc4f_nvme flag
scsi: qla2xxx: add missing includes for qla_isr
scsi: qla2xxx: Fix an integer overflow in sysfs code
scsi: aacraid: report -ENOMEM to upper layer from aac_convert_sgraw2()
scsi: aacraid: get rid of one level of indentation
scsi: aacraid: fix indentation errors
scsi: storvsc: fix memory leak on ring buffer busy
scsi: scsi_transport_sas: switch to bsg-lib for SMP passthrough
scsi: smartpqi: remove the smp_handler stub
scsi: hpsa: remove the smp_handler stub
scsi: bsg-lib: pass the release callback through bsg_setup_queue
scsi: Rework handling of scsi_device.vpd_pg8[03]
scsi: Rework the code for caching Vital Product Data (VPD)
scsi: rcu: Introduce rcu_swap_protected()
...
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
Nothing really major this release, despite quite a lot of activity. Just lots of
things all over the place.
Some things of note include:
- Access via perf to a new type of PMU (IMC) on Power9, which can count both
core events as well as nest unit events (Memory controller etc).
- Optimisations to the radix MMU TLB flushing, mostly to avoid unnecessary Page
Walk Cache (PWC) flushes when the structure of the tree is not changing.
- Reworks/cleanups of do_page_fault() to modernise it and bring it closer to
other architectures where possible.
- Rework of our page table walking so that THP updates only need to send IPIs
to CPUs where the affected mm has run, rather than all CPUs.
- The size of our vmalloc area is increased to 56T on 64-bit hash MMU systems.
This avoids problems with the percpu allocator on systems with very sparse
NUMA layouts.
- STRICT_KERNEL_RWX support on PPC32.
- A new sched domain topology for Power9, to capture the fact that pairs of
cores may share an L2 cache.
- Power9 support for VAS, which is a new mechanism for accessing coprocessors,
and initial support for using it with the NX compression accelerator.
- Major work on the instruction emulation support, adding support for many new
instructions, and reworking it so it can be used to implement the emulation
needed to fixup alignment faults.
- Support for guests under PowerVM to use the Power9 XIVE interrupt controller.
And probably that many things again that are almost as interesting, but I had to
keep the list short. Plus the usual fixes and cleanups as always.
Thanks to:
Alexey Kardashevskiy, Alistair Popple, Andreas Schwab, Aneesh Kumar K.V, Anju
T Sudhakar, Arvind Yadav, Balbir Singh, Benjamin Herrenschmidt, Bhumika Goyal,
Breno Leitao, Bryant G. Ly, Christophe Leroy, Cédric Le Goater, Dan Carpenter,
Dou Liyang, Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand,
Hannes Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall, LABBE
Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring, Masahiro
Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo, Nathan Fontenot,
Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Rashmica
Gupta, Rob Herring, Rui Teng, Sam Bobroff, Santosh Sivaraj, Scott Wood,
Shilpasri G Bhat, Sukadev Bhattiprolu, Suraj Jitindar Singh, Tobin C. Harding,
Victor Aoqui.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZr83SAAoJEFHr6jzI4aWA6pUP/3CEaj2bSxNzWIwidqyYjuoS
O1moEsP0oYH7eBEWVHalYxvo0QPIIAhbFPaFyrOrgtfDH01Szwu9LcCALGb8orC5
Hg3IY8mpNG3Q1T8wEtTa56Ik4b5ZFty35S5+X9qLNSFoDUqSvGlSsLzhPNN7f2tl
XFm2hWqd8wXCwDsuVSFBCF61M3SAm+g6NMVNJ+VL2KIDCwBrOZLhKDPRoxLTAuMa
jjSdjVIozWyXjUrBFi8HVcoOWLxcT1HsNF0tRs51LwY/+Mlj2jAtFtsx+a06HZa6
f2p/Kcp/MEispSTk064Ap9cC1seXWI18zwZKpCUFqu0Ec2yTAiGdjOWDyYQldIp+
ttVPSHQ01YrVKwDFTtM9CiA0EET6fVPhWgAPkPfvH5TvtKwGkNdy0b+nQLuWrYip
BUmOXmjdIG3nujCzA9sv6/uNNhjhj2y+HWwuV7Qo002VFkhgZFL67u2SSUQLpYPj
PxdkY8pPVq+O+in94oDV3c36dYFF6+g6A6505Vn6eKUm/TLpszRFGkS3bKKA5vtn
74FR+guV/5RwYJcdZbfm04DgAocl7AfUDxpwRxibt6KtAK2VZKQuw4ugUTgYEd7W
mL2+AMmPKuajWXAMTHjCZPbUp9gFNyYyBQTFfGVX/XLiM8erKBnGfoa1/KzUJkhr
fVZLYIO/gzl34PiTIfgD
=UJtt
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Nothing really major this release, despite quite a lot of activity.
Just lots of things all over the place.
Some things of note include:
- Access via perf to a new type of PMU (IMC) on Power9, which can
count both core events as well as nest unit events (Memory
controller etc).
- Optimisations to the radix MMU TLB flushing, mostly to avoid
unnecessary Page Walk Cache (PWC) flushes when the structure of the
tree is not changing.
- Reworks/cleanups of do_page_fault() to modernise it and bring it
closer to other architectures where possible.
- Rework of our page table walking so that THP updates only need to
send IPIs to CPUs where the affected mm has run, rather than all
CPUs.
- The size of our vmalloc area is increased to 56T on 64-bit hash MMU
systems. This avoids problems with the percpu allocator on systems
with very sparse NUMA layouts.
- STRICT_KERNEL_RWX support on PPC32.
- A new sched domain topology for Power9, to capture the fact that
pairs of cores may share an L2 cache.
- Power9 support for VAS, which is a new mechanism for accessing
coprocessors, and initial support for using it with the NX
compression accelerator.
- Major work on the instruction emulation support, adding support for
many new instructions, and reworking it so it can be used to
implement the emulation needed to fixup alignment faults.
- Support for guests under PowerVM to use the Power9 XIVE interrupt
controller.
And probably that many things again that are almost as interesting,
but I had to keep the list short. Plus the usual fixes and cleanups as
always.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andreas Schwab,
Aneesh Kumar K.V, Anju T Sudhakar, Arvind Yadav, Balbir Singh,
Benjamin Herrenschmidt, Bhumika Goyal, Breno Leitao, Bryant G. Ly,
Christophe Leroy, Cédric Le Goater, Dan Carpenter, Dou Liyang,
Frederic Barrat, Gautham R. Shenoy, Geliang Tang, Geoff Levand, Hannes
Reinecke, Haren Myneni, Ivan Mikhaylov, John Allen, Julia Lawall,
LABBE Corentin, Laurentiu Tudor, Madhavan Srinivasan, Markus Elfring,
Masahiro Yamada, Matt Brown, Michael Neuling, Murilo Opsfelder Araujo,
Nathan Fontenot, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran,
Paul Mackerras, Rashmica Gupta, Rob Herring, Rui Teng, Sam Bobroff,
Santosh Sivaraj, Scott Wood, Shilpasri G Bhat, Sukadev Bhattiprolu,
Suraj Jitindar Singh, Tobin C. Harding, Victor Aoqui"
* tag 'powerpc-4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (321 commits)
powerpc/xive: Fix section __init warning
powerpc: Fix kernel crash in emulation of vector loads and stores
powerpc/xive: improve debugging macros
powerpc/xive: add XIVE Exploitation Mode to CAS
powerpc/xive: introduce H_INT_ESB hcall
powerpc/xive: add the HW IRQ number under xive_irq_data
powerpc/xive: introduce xive_esb_write()
powerpc/xive: rename xive_poke_esb() in xive_esb_read()
powerpc/xive: guest exploitation of the XIVE interrupt controller
powerpc/xive: introduce a common routine xive_queue_page_alloc()
powerpc/sstep: Avoid used uninitialized error
axonram: Return directly after a failed kzalloc() in axon_ram_probe()
axonram: Improve a size determination in axon_ram_probe()
axonram: Delete an error message for a failed memory allocation in axon_ram_probe()
powerpc/powernv/npu: Move tlb flush before launching ATSD
powerpc/macintosh: constify wf_sensor_ops structures
powerpc/iommu: Use permission-specific DEVICE_ATTR variants
powerpc/eeh: Delete an error out of memory message at init time
powerpc/mm: Use seq_putc() in two functions
macintosh: Convert to using %pOF instead of full_name
...
The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, the .rw_page is enhanced to support to
read/write THP if possible.
Link: http://lkml.kernel.org/r/20170724051840.2309-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds document and kconfig for using of writeback feature.
Link: http://lkml.kernel.org/r/1498459987-24562-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables read IO from backing device. For the feature, it
implements two IO read functions to transfer data from backing storage.
One is asynchronous IO function and other is synchronous one.
A reason I need synchrnous IO is due to partial write which need to
complete read IO before the overwriting partial data.
We can make the partial IO's case asynchronous, too but at the moment, I
don't feel adding more complexity to support such rare use cases so want
to go with simple.
[xieyisheng1@huawei.com: read_from_bdev_async(): return 1 to avoid call page_endio() in zram_rw_page()]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch enables write IO to transfer data to backing device. For
that, it implements write_to_bdev function which creates new bio and
chaining with parent bio to make the parent bio asynchrnous.
For rw_page which don't have parent bio, it submit owned bio and handle
IO completion by zram_page_end_io.
Also, this patch defines new flag ZRAM_WB to mark written page for later
read IO.
[xieyisheng1@huawei.com: fix typo in comment]
Link: http://lkml.kernel.org/r/1502707447-6944-2-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1498459987-24562-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For upcoming asynchronous IO like writeback, zram_rw_page should be
aware of that whether requested IO was completed or submitted
successfully, otherwise error.
For the goal, zram_bvec_rw has three return values.
-errno: returns error number
0: IO request is done synchronously
1: IO request is issued successfully.
Link: http://lkml.kernel.org/r/1498459987-24562-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With backing device, zram needs management of free space of backing
device.
This patch adds bitmap logic to manage free space which is very naive.
However, it would be simple enough as considering uncompressible pages's
frequenty in zram.
Link: http://lkml.kernel.org/r/1498459987-24562-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For writeback feature, user should set up backing device before the zram
working.
This patch enables the interface via /sys/block/zramX/backing_dev.
Currently, it supports block device only but it could be enhanced for
file as well.
Link: http://lkml.kernel.org/r/1498459987-24562-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_decompress_page naming is not proper because it doesn't decompress
if page was dedup hit or stored with compression.
Use more abstract term and consistent with write path function
__zram_bvec_write.
Link: http://lkml.kernel.org/r/1498459987-24562-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_compress does several things, compress, entry alloc and check
limitation. I did for just readbility but it hurts modulization.:(
So this patch removes zram_compress functions and inline it in
__zram_bvec_write for upcoming patches.
Link: http://lkml.kernel.org/r/1498459987-24562-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "writeback incompressible pages to storage", v1.
zRam is useful for memory saving with compressible pages but sometime,
workload can be changed and system has lots of incompressible pages
which is very harmful for zram.
This patch supports writeback feature of zram so admin can set up a
block device and with it, zram can save the memory via writing out the
incompressile pages once it found it's incompressible pages (1/4 comp
ratio) instead of keeping the page in memory.
[1-3] is just clean up and [4-8] is step by step feature enablement.
[4-8] is logically not bisectable(ie, logical unit separation)
although I tried to compiled out without breaking but I think it would
be better to review.
This patch (of 9):
__zram_bvec_write has some of duplicated logic for zram meta data
handling of same_page|compressed_page. This patch aims to clean it up
without behavior change.
[xieyisheng1@huawei.com: fix compr_data_size stat]
Link: http://lkml.kernel.org/r/1502707447-6944-1-git-send-email-xieyisheng1@huawei.com
Link: http://lkml.kernel.org/r/1496019048-27016-1-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/1498459987-24562-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Juneho Choi <juno.choi@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/block/rbd.c: In function 'rbd_acquire_lock':
drivers/block/rbd.c:3602:44: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Silence the warning, found it when built old kernel(3.10) with
OBS(opensuse build service).
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 6c6b6f28b3 ("loop: set physical block size to PAGE_SIZE")
caused mkfs.xfs to barf on ppc64 [1]. Always using PAGE_SIZE as the
physical block size still makes the most sense semantically, but let's
just lie and always set it to the same value as the logical block size
(same goes for io_min). In the future we might want to at least bump up
io_min to PAGE_SIZE but I'm sick of these stupid changes so let's play
it safe.
1: https://marc.info/?l=linux-xfs&m=150459024723753&w=2
Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull locking updates from Ingo Molnar:
- Add 'cross-release' support to lockdep, which allows APIs like
completions, where it's not the 'owner' who releases the lock, to be
tracked. It's all activated automatically under
CONFIG_PROVE_LOCKING=y.
- Clean up (restructure) the x86 atomics op implementation to be more
readable, in preparation of KASAN annotations. (Dmitry Vyukov)
- Fix static keys (Paolo Bonzini)
- Add killable versions of down_read() et al (Kirill Tkhai)
- Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)
- Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)
- Remove smp_mb__before_spinlock() and convert its usages, introduce
smp_mb__after_spinlock() (Peter Zijlstra)
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
locking/lockdep/selftests: Fix mixed read-write ABBA tests
sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
smp: Avoid using two cache lines for struct call_single_data
locking/lockdep: Untangle xhlock history save/restore from task independence
locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
futex: Remove duplicated code and fix undefined behaviour
Documentation/locking/atomic: Finish the document...
locking/lockdep: Fix workqueue crossrelease annotation
workqueue/lockdep: 'Fix' flush_work() annotation
locking/lockdep/selftests: Add mixed read-write ABBA tests
mm, locking/barriers: Clarify tlb_flush_pending() barriers
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
locking/lockdep: Explicitly initialize wq_barrier::done::map
locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
locking/refcounts, x86/asm: Implement fast refcount overflow protection
locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
...
Pull x86 asm updates from Ingo Molnar:
- Introduce the ORC unwinder, which can be enabled via
CONFIG_ORC_UNWINDER=y.
The ORC unwinder is a lightweight, Linux kernel specific debuginfo
implementation, which aims to be DWARF done right for unwinding.
Objtool is used to generate the ORC unwinder tables during build, so
the data format is flexible and kernel internal: there's no
dependency on debuginfo created by an external toolchain.
The ORC unwinder is almost two orders of magnitude faster than the
(out of tree) DWARF unwinder - which is important for perf call graph
profiling. It is also significantly simpler and is coded defensively:
there has not been a single ORC related kernel crash so far, even
with early versions. (knock on wood!)
But the main advantage is that enabling the ORC unwinder allows
CONFIG_FRAME_POINTERS to be turned off - which speeds up the kernel
measurably:
With frame pointers disabled, GCC does not have to add frame pointer
instrumentation code to every function in the kernel. The kernel's
.text size decreases by about 3.2%, resulting in better cache
utilization and fewer instructions executed, resulting in a broad
kernel-wide speedup. Average speedup of system calls should be
roughly in the 1-3% range - measurements by Mel Gorman [1] have shown
a speedup of 5-10% for some function execution intense workloads.
The main cost of the unwinder is that the unwinder data has to be
stored in RAM: the memory cost is 2-4MB of RAM, depending on kernel
config - which is a modest cost on modern x86 systems.
Given how young the ORC unwinder code is it's not enabled by default
- but given the performance advantages the plan is to eventually make
it the default unwinder on x86.
See Documentation/x86/orc-unwinder.txt for more details.
- Remove lguest support: its intended role was that of a temporary
proof of concept for virtualization, plus its removal will enable the
reduction (removal) of the paravirt API as well, so Rusty agreed to
its removal. (Juergen Gross)
- Clean up and fix FSGS related functionality (Andy Lutomirski)
- Clean up IO access APIs (Andy Shevchenko)
- Enhance the symbol namespace (Jiri Slaby)
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
objtool: Handle GCC stack pointer adjustment bug
x86/entry/64: Use ENTRY() instead of ALIGN+GLOBAL for stub32_clone()
x86/fpu/math-emu: Add ENDPROC to functions
x86/boot/64: Extract efi_pe_entry() from startup_64()
x86/boot/32: Extract efi_pe_entry() from startup_32()
x86/lguest: Remove lguest support
x86/paravirt/xen: Remove xen_patch()
objtool: Fix objtool fallthrough detection with function padding
x86/xen/64: Fix the reported SS and CS in SYSCALL
objtool: Track DRAP separately from callee-saved registers
objtool: Fix validate_branch() return codes
x86: Clarify/fix no-op barriers for text_poke_bp()
x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs
selftests/x86/fsgsbase: Test selectors 1, 2, and 3
x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps
x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common
x86/asm: Fix UNWIND_HINT_REGS macro for older binutils
x86/asm/32: Fix regs_get_register() on segment registers
x86/xen/64: Rearrange the SYSCALL entries
x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads
...
lo_rw_aio->call_read_iter->
1 aops->direct_IO
2 iov_iter_revert
lo_rw_aio_complete could happen between 1 and 2, the bio and bvec could
be freed before 2, which accesses bvec.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently loop disables merge. While it makes sense for buffer IO mode,
directio mode can benefit from request merge. Without merge, loop could
send small size IO to underlayer disk and harm performance.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Loop can handle any size of request. Limiting it to 255 sectors just
burns the CPU for bio split and request merge for underlayer disk and
also cause bad fs block allocation in directio mode.
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Unfortunately a few issues that warrant sending another pull request,
even if I had hoped to avoid it. This contains:
- A fix for multiqueue xen-blkback, on tear down / disconnect.
- A few fixups for NVMe, including a wrong bit definition, fix for
host memory buffers, and an nvme rdma page size fix"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvme: fix the definition of the doorbell buffer config support bit
nvme-pci: use dma memory for the host memory buffer descriptors
nvme-rdma: default MR page size to 4k
xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect
The comments here are really outdated, and blk-mq made flushing much
simpler, so just fold the two cases into the callers.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a different approach from the first attempt in f2c6df7dbf
("loop: support 4k physical blocksize"). Rather than extending
LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
size.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The physical block size is "the lowest possible sector size that the
hardware can operate on without reverting to read-modify-write
operations" (from the comment on blk_queue_physical_block_size()). Since
loop does buffered I/O on the backing file by default, the RMW unit is a
page. This isn't the case for direct I/O mode, but let's keep it simple.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is only used for setting the soft block size on the struct
block_device once and then never used again.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Careful analysis shows that this flag is not needed.
The RESCUER flag is only needed when a make_request_fn might:
- allocate a bio from the bioset
- submit it with generic_make_request() or similar
- allocate another bio from the bioset
The second allocation can block until the first bio is processed, so
a rescuer is needed to ensure the first bio does get processed. With
a rescuer it will only get processed when the make_request_fn completes.
In drbd, allocations from drbd_io_bio_set happen from drbd_new_req()
or w_restart_disk_io() which is only called to handle
RESTART_FROZEN_DISK_IO.
In former is called precisely once from the make_request_fn.
The later is never called by within the make_request_fn.
So there cannot be two allocations in the same call to the
make_request_fn, so a rescuer is not needed.
Allocations from drbd_md_io_bio_set are used for IO to the bitmap and
the activity log. There are only accessed from worker threads and
workqueues, never directly from make_request_fn.
Again, the rescuer isn't needed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Globals where prefixed with drbd_, that was missed in the
in #ifdef'nd code when it is built-in.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Fixes: 183ece3005 ("drbd: move global variables to drbd namespace and make some static")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We had one call to kmalloc that actually allocates an array. Switch that
one to the kmalloc_array() function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This was found by a static analysis tool. While highly unlikely, be sure
to return without dereferencing the NULL pointer.
Reported-by: Shaobo <shaobo@cs.utah.edu>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is a follow-up to Gregs complaints that drbd clutteres the global
namespace.
Some of DRBD's module parameters are only used within one compilation
unit. Make these static.
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nothing like having a very generic global variable in a tiny driver
subsystem to make a mess of the global namespace...
Note, there are many other "generic" named global variables in the drbd
subsystem, someone should fix those up one day before they hit a linking
error.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
conn_try_disconnect() could potentialy hit the BUG_ON()
in _conn_set_state() where it iterates over _drbd_set_state()
and "asserts" via BUG_ON() that the latter was successful.
If the STATE_SENT bit was not yet visible to conn_is_valid_transition()
early in _conn_request_state(), but became visible before conn_set_state()
later in that call path, we could hit the BUG_ON() after _drbd_set_state(),
because it returned SS_IN_TRANSIENT_STATE.
To avoid that race, we better protect set_bit(SENT_STATE) with the spinlock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When requesting a detach, we first suspend IO, and also inhibit meta-data IO
by means of drbd_md_get_buffer(), because we don't want to "fail" the disk
while there is IO in-flight: the transition into D_FAILED for detach purposes
may get misinterpreted as actual IO error in a confused endio function.
We wrap it all into wait_event(), to retry in case the drbd_req_state()
returns SS_IN_TRANSIENT_STATE, as it does for example during an ongoing
connection handshake.
In that example, the receiver thread may need to grab drbd_md_get_buffer()
during the handshake to make progress. To avoid potential deadlock with
detach, detach needs to grab and release the meta data buffer inside of
that wait_event retry loop. To avoid lock inversion between
mutex_lock(&device->state_mutex) and drbd_md_get_buffer(device),
introduce a new enum chg_state_flag CS_INHIBIT_MD_IO, and move the
call to drbd_md_get_buffer() inside the state_mutex grabbed in
drbd_req_state().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thus use the corresponding function "seq_putc".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If there are still resources defined, but "empty", no more volumes
or connections configured, they don't hold module reference counts,
so rmmod is possible.
To avoid DRBD leftovers in debugfs, we need to call our global
drbd_debugfs_cleanup() only after all resources have been cleaned up.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Race:
drbd_adm_attach() | async drbd_md_endio()
|
device->ldev is still NULL. |
|
drbd_md_read( |
.endio = drbd_md_endio; |
submit; |
.... |
wait for done == 1; | done = 1;
); | wake_up();
.. lot of other stuff, |
.. includeing taking and |
...giving up locks, |
.. doing further IO, |
.. stuff that takes "some time" |
| while in this context,
| this is the next statement.
| which means this context was scheduled
.. only then, finally, | away for "some time".
device->ldev = nbc; |
| if (device->ldev)
| put_ldev()
Unlikely, but possible. I was able to provoke it "reliably"
by adding an mdelay(500); after the wake_up().
Fixed by moving the if (!NULL) put_ldev() before done = 1;
Impact of the bug was that the resulting refcount imbalance
could lead to premature destruction of the object, potentially
causing a NULL pointer dereference during a subsequent detach.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some backend devices claim to support write-same,
but would fail actual write-same requests.
Allow to set (or toggle) whether or not DRBD tries to support write-same.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The conn_higest_role() (a terribly misnamed function) returns
the role of the resource. It returned R_UNKNOWN as long as the
resource has not a single device.
Resources without devices are short living objects.
But it matters for the NOTIFY_CREATE netwlink message. It makes
a lot more sense to report R_SECONDARY for the newly created
resource than R_UNKNOWN.
I reviewd all call sites of conn_highest_role(), that change
does not matter for the other call sites.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We get a few warnings when building kernel with W=1:
drbd/drbd_receiver.c:1224:6: warning: no previous prototype for 'one_flush_endio' [-Wmissing-prototypes]
drbd/drbd_req.c:1450:6: warning: no previous prototype for 'send_and_submit_pending' [-Wmissing-prototypes]
drbd/drbd_main.c:924:6: warning: no previous prototype for 'assign_p_sizes_qlim' [-Wmissing-prototypes]
....
In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
So this patch marks these functions with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In protocol != C, we forgot to send the P_NEG_ACK for failing writes.
Once we no longer submit to local disk, because we already "detached",
due to the typical "on-io-error detach;" config setting,
we already send the neg acks right away.
Only those requests that have been submitted,
and have been error-completed by the local disk,
would forget to send the neg-ack,
and only in asynchronous replication (protocol != C).
Unless this happened during resync,
where we already always send acks, regardless of protocol.
The primary side needs the P_NEG_ACK in order to mark
the affected block(s) for resync in its out-of-sync bitmap.
If the blocks in question are not re-written again,
we may miss to resync them later, causing data inconsistencies.
This patch will always send the neg-acks, and also at least try to
persist the out-of-sync status on the local node already.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When submitting batches of requests which had been queued on the
submitter thread, typically because they needed to wait for an
activity log transactions, use explicit plugging to help potential
merging of requests in the backend io-scheduler.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Two instances of list_for_each_safe can drop their tmp element, they
really just peel off each element in turn from the start of the list.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Recently, drbd_recv_header() was changed to potentially
implicitly "unplug" the backend device(s), in case there
is currently nothing to receive.
Be more explicit about it: re-introduce the original drbd_recv_header(),
and introduce a new drbd_recv_header_maybe_unplug() for use by the
receiver "main loop".
Using explicit plugging via blk_start_plug(); blk_finish_plug();
really helps the io-scheduler of the backend with merging requests.
Wrap the receiver "main loop" with such a plug.
Also catch unplug events on the Primary,
and try to propagate.
This is performance relevant. Without this, if the receiving side does
not merge requests, number of IOPS on the peer can me significantly
higher than IOPS on the Primary, and can easily become the bottleneck.
Together, both changes should help to reduce the number of IOPS
as seen on the backend of the receiving side, by increasing
the chance of merging mergable requests, without trading latency
for more throughput.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since blk_mq_init_queue() initializes .nr_requests to the tag set
size and since that value is a good default for the skd driver, do
not overwrite the value set by blk_mq_init_queue(). This change
doubles the default value of .nr_requests.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since sTec s1120 devices support 64-bit DMA it is not necessary
to request data buffer bouncing. Hence remove the
blk_queue_bounce_limit() call.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull xen-blkback fix from Konrad:
"[...] A bug-fix when shutting down xen block backend driver with
multiple queues and the driver not clearing all of them."
struct call_single_data is used in IPIs to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than
cache line size. Currently it is not allocated with any explicit alignment
requirements. This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.
This can be fixed by requiring call_single_data to be aligned with the
size of call_single_data. Currently the size of call_single_data is the
power of 2. If we add new fields to call_single_data, we may need to
add padding to make sure the size of new definition is the power of 2
as well.
Fortunately, this is enforced by GCC, which will report bad sizes.
To set alignment requirements of call_single_data to the size of
call_single_data, a struct definition and a typedef is used.
To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test will create multiple
threads and each thread will eat memory until all RAM and part of swap
is used, so that huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Huang, Ying <ying.huang@intel.com>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make this const as is is only passed as an argument to the
function device_create_file and device_remove_file and the corresponding
arguments are of type const.
Done using Coccinelle
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit 2984c86(nullb: factor disk parameters) has a typo. The
nullb_device allocation/free is done outside of null_add_dev. The commit
accidentally frees the nullb_device in error code path.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Fixes two obvious bugs in virtio pci.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZoG+uAAoJECgfDbjSjVRpvAMIAIoONNPd53SPKDVuyU1ycz7H
hRVJ9dgVqsCyJV7UQNXznTkk1Te+todM3eBOnnWGxBUPyyjjn+nRJY8ObzvPZNtr
GZjBHhuCeWAi1HPcGk3VKFCXB9yzVc7x91YoSZRWRveB1hOoqWCNccuXMlOf1mLC
AAYMdBR7JH9CTA5v73z0n4XmfDPFja9g5qhv3JxYypzS3IrWglsVV8RFFG94zJys
qsg3Ys6SdYnC4whdtT0sdj6zcVV3STqLtutUcWzpBJiPwL+TYprOtGxhjhjG/YdP
vurTYmMk1FZyTlxflfzH0yIRQVZyxARcPGrchhvFv9eE4qN0y4E72FkN8UyyKpU=
=qTWW
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"Fixes two obvious bugs in virtio pci"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_pci: fix cpu affinity support
virtio_blk: fix incorrect message when disk is resized
The SKD_ID_INCR flag in skd_request_context.id duplicates information
that is already available otherwise, e.g. through the block layer
request state and through skd_request_context.state. Hence remove
the code that manipulates this flag and also the flag itself.
Since skd_isr_completion_posted() only uses the lower bits of
skd_request_context.id as hardware tag, this patch does not change
the behavior of the skd driver. I'm referring to the following code:
tag = req_id & SKD_ID_SLOT_AND_TABLE_MASK;
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Although it is easy to see that skdev->disk != NULL if skdev->queue
!= NULL, add a test for skdev->disk to avoid that smatch reports the
following warning:
drivers/block/skd_main.c:3080 skd_free_disk()
error: we previously assumed 'disk' could be null (see line 3074)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It is not worth to keep the debug statements in skd_end_request().
Without debug statements that function only consists of two
statements. Hence inline skd_end_request().
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The latter name follows more closely the function names used in
other blk-mq drivers.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan reported this:
The patch 2984c8684f96: "nullb: factor disk parameters" from Aug 14,
2017, leads to the following Smatch complaint:
drivers/block/null_blk.c:1759 null_init_tag_set()
error: we previously assumed 'nullb' could be null (see line
1750)
1755 set->cmd_size = sizeof(struct nullb_cmd);
1756 set->flags = BLK_MQ_F_SHOULD_MERGE;
1757 set->driver_data = NULL;
1758
1759 if (nullb->dev->blocking)
^^^^^^^^^^^^^^^^^^^^
And an unchecked dereference.
nullb could be NULL here.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In xen_blkif_disconnect, before checking inflight I/O, following code
stops the blkback thread,
if (ring->xenblkd) {
kthread_stop(ring->xenblkd);
wake_up(&ring->shutdown_wq);
}
If there is inflight I/O in any non-last queue, blkback returns -EBUSY
directly, and above code would not be called to stop thread of remaining
queue and processs them. When removing vbd device with lots of disk I/O
load, some queues with inflight I/O still have blkback thread running even
though the corresponding vbd device or guest is gone.
And this could cause some problems, for example, if the backend device type
is file, some loop devices and blkback thread always lingers there forever
after guest is destroyed, and this causes failure of umounting repositories
unless rebooting the dom0.
This patch allows thread of every queue has the chance to get stopped.
Otherwise, only thread of queue previous to(including) first busy one get
stopped, blkthread of remaining queue will still run. So stop all threads
properly and return -EBUSY if any queue has inflight I/O.
Signed-off-by: Annie Li <annie.li@oracle.com>
Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Adnan Misherfi <adnan.misherfi@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The message printed on disk resize is incorrect. The following is
printed when resizing to 2 GiB:
$ truncate -s 1G test.img
$ qemu -device virtio-blk-pci,logical_block_size=4096,...
(qemu) block_resize drive1 2G
virtio_blk virtio0: new size: 4194304 4096-byte logical blocks (17.2 GB/16.0 GiB)
The virtio_blk capacity config field is in 512-byte sector units
regardless of logical_block_size as per the VIRTIO specification.
Therefore the message should read:
virtio_blk virtio0: new size: 524288 4096-byte logical blocks (2.15 GB/2.0 GiB)
Note that this only affects the printed message. Thankfully the actual
block device has the correct size because the block layer expects
capacity in sectors.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The hpsa driver now has support for all boards the cciss driver
used to support, so this patch removes the cciss driver and
make hpsa an alias to cciss.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Lguest seems to be rather unused these days. It has seen only patches
ensuring it still builds the last two years and its official state is
"Odd Fixes".
Remove it in order to be able to clean up the paravirt code.
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: boris.ostrovsky@oracle.com
Cc: lguest@lists.ozlabs.org
Cc: rusty@rustcorp.com.au
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/20170816173157.8633-3-jgross@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's some stuff still up in the air, let's not get stuck with a
subpar ABI. I'll follow up with something better for 4.14.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since MSI support on some motherboards is unreliable, change the
default interrupt mode from MSI to MSI-X. This patch avoids that
the following message appears sporadially in the kernel logs of
my test setup:
do_IRQ: 3.193 No irq handler for vector
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid that normal request completion and the timeout handler can
run concurrently by calling blk_mq_complete_request() instead of
blk_mq_end_request() from skd_end_request(). Avoid that the block
layer can reuse a request while the firmware is still processing
it. Convert skd_softirq_done() to blk-mq. Pass the pointer to
skd_softirq_done() to the block layer core through
blk_mq_ops.complete instead of by calling blk_queue_softirq_done().
Pass the pointer to skd_timed_out() to the block layer core
through blk_mq_ops.timeout instead of by calling
blk_queue_timed_out(). The timeout handler has been tested as
follows:
echo 1 > /sys/block/skd0/io-timeout-fail &&
(cd /sys/kernel/debug/fail_io_timeout &&
echo 100 > probability &&
echo N > task-filter &&
echo 1 > times)
Fixes: commit a74d5b76fa ("skd: Switch to block layer timeout mechanism")
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but makes the skd
driver code more similar to that of other blk-mq kernel drivers.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch removes one debug statement but otherwise does not change
any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sometime disk could have tracks broken and data there is inaccessable,
but data in other parts can be accessed in normal way. MD RAID supports
such disks. But we don't have a good way to test it, because we can't
control which part of a physical disk is bad. For a virtual disk, this
can be easily controlled.
This patch adds a new 'badblock' attribute. Configure it in this way:
echo "+1-100" > xxx/badblock, this will make sector [1-100] as bad
blocks.
echo "-20-30" > xxx/badblock, this will make sector [20-30] good
If badblocks are accessed, the nullb disk will return IO error. Other
parts of the disk can accessed in normal way.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Software must flush disk cache to guarantee data safety. To check if
software correctly does disk cache flush, we must know the behavior of
disk. But physical disk behavior is uncontrollable. Even software
doesn't do the flush, the disk probably does the flush. This patch tries
to emulate a cache in the test disk.
All write will go to a cache first, when the cache is full, we then
flush some data to disk storage. A flush request will flush all data of
the cache to disk storage. A FUA write will write to memory store
directly and revalidate data in cache. If there is a power failure (by
writing to power attribute, 'echo 0 > disk_name/power'), we discard all
data in the cache, but preserve the data in disk storage. Later we can
power on the disk again as usual (write 1 to 'power' attribute), then we
can check data integrity and very if software does everything correctly.
A new attribute 'cache_size' (in MB) is added to configure cache size.
Based on original patch from Kyungchan Koh
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In test, we usually expect controllable disk speed. For example, in a
raid array, we'd like some disks are fast and some are slow. MD RAID
actually has a feature for this. To test the feature, we'd like to make
the disk run in specific speed.
block throttling probably can be used for this purpose, but it requires
cgroup setup. Here we just implement a simple throttling mechanism in
the driver. There is slight fluctuation in the mechanism, but it's good
enough for test.
To configure the bandwidth cap, user sets the 'mbps' attribute. mbps is
MB/s.
Based on original patch from Kyungchan Koh
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
discard makes sense for memory backed disk. And also it's useful to test
if upper layer supports dicard correctly.
User configures 'discard' attribute to enable/disable dicard support.
Based on original patch from Kyungchan Koh
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This adds memory backed store in nullb.
User configure 'memory_backed' attribute for this. By default, nullb
disk doesn't use memory backed store.
Based on original patch from Kyungchan Koh
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We now dynamically create disks. Managing the disk index with ida to
avoid bump up the index too much.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The device created in nullb configfs interface isn't power on by
default. After user configures the device, user can do 'echo 1 >
xxx/nullb/device_name/power' to power on the device, which will create a
disk. the xxx/nullb/device_name/index is the disk index, so if the index
is 2, the new created disk should be named as /dev/nullb2. Note, the
'index' is only valid after disk is power on.
'echo 0 > xxx/nullb/device_name/power' will remove the disk. Note, this
doesn't remove the device. To remove the device, user should do 'rmdir
xxx/nullb/device_name'. Removing the device will remove the disk too.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add configfs interface for nullb. configfs interface is more flexible
and easy to configure in a per-disk basis.
Configuration is something like this:
mount -t configfs none /mnt
Checking which features the driver supports:
cat /mnt/nullb/features
The 'features' attribute is for future extension. We probably will add
new features into the driver, userspace can check this attribute to find
the supported features.
Create/remove a device:
mkdir/rmdir /mnt/nullb/a
Then configure the device by setting attributes under /mnt/nullb/a, most
of nullb supported module parameters are converted to attributes:
size; /* device size in MB */
completion_nsec; /* time in ns to complete a request */
submit_queues; /* number of submission queues */
home_node; /* home node for the device */
queue_mode; /* block interface */
blocksize; /* block size */
irqmode; /* IRQ completion handler */
hw_queue_depth; /* queue depth */
use_lightnvm; /* register as a LightNVM device */
blocking; /* blocking blk-mq device */
use_per_node_hctx; /* use per-node allocation for hardware context */
Note, creating a device doesn't create a disk immediately. Creating a
disk is done in two phases: create a device and then power on the
device. Next patch will introduce device power on.
Based on original patch from Kyungchan Koh
Signed-off-by: Kyungchan Koh <kkc6196@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we switch to configfs interface, each disk could have different
configuration. To prepare for the change, we move most disk setting to a
separate data structure. The existing module parameter interface is
kept. The 'nr_devices' and 'shared_tags' don't make sense for per-disk
setting, so they are remained as global settings.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
My initial impulse was to check for IS_ERR_OR_NULL() but when I looked
at this code a bit more closely, we should only need to check for
IS_ERR().
The blk_mq_alloc_tag_set() returns negative error codes and zero on
success so we can just do an "if (rc) goto err_out;". It's better to
preserve the error code anyhow. The blk_mq_init_queue() returns error
pointers on failure, it never returns NULL. We can also remove the
"q = NULL;" at the start because that's no longer needed.
Fixes: ca33dd9296 ("skd: Convert to blk-mq")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Someone got too agressive about removing initializations and
accidentally removed the "rc = 0;" which is required.
Fixes: c830da8cbc ("skd: Remove superfluous initializations from skd_isr_completion_posted()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the driver version information because this information
is not useful in an upstream kernel driver.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bump the driver version. Remove the build ID because build IDs do
not make sense for an upstream kernel driver. Keep the driver
version in the module information but do not report it during every
load, unload or probe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Only take skdev->lock if necessary.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but makes the code
more brief.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Every single coherent DMA memory buffer occupies at least one page.
Reduce memory usage by switching from coherent buffers to streaming
DMA for I/O requests (struct skd_fitmsg_context) and S/G-lists
(struct fit_sg_descriptor[]).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since skd_device.in_flight is only used to display the number of
in-flight requests in debug messages, remove that member and
introduce skd_in_flight(). That last function relies on the block
layer to determine the number of in flight requests.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove the timeout slot variables and rely on the block layer to
detect request timeouts.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce a tag set and a blk_mq_ops structure. Set .cmd_size such
that struct request and struct skd_request_context are allocated
through a single allocation. Remove the skd_request_context.req
pointer. Make queue starting asynchronous such that this can occur
safely from interrupt context. Use locking to protect skdev->skmsg
and *skdev->skmsg against concurrent access from concurrent
.queue_rq() calls. Introduce the functions skd_init_request() and
skd_exit_request() to set up / clean up the per-request S/G-list.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Set request_queue.cmd_size, introduce skd_init_rq() and skd_exit_rq()
and remove skd_device.skreq_table.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Issue a warning if a NULL argument is passed to skd_free_sg_list().
Move this function up to make the blk-mq conversion patch easier
to read.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but makes the blk-mq
conversion patch easier to read.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only functional change in this patch is that the skd_fitmsg_context
in which requests are accumulated is changed from a local variable into
a member of struct skd_device. This patch will make the blk-mq conversion
easier.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert the per-device scalar variables that are protected by the
queue lock into atomics such that it becomes safe to access these
variables without holding the queue lock.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use the request tag when allocating a skd_fitmsg_context or
skd_request_context such that the lists used to track free elements
can be eliminated. Swap the skd_end_request() and skd_release_req()
calls to avoid triggering a use-after-free. Remove
skd_fitmsg_context.state and .outstanding because FIT messages are
shared among requests and because updating a FIT message after a
request has finished whould trigger a use-after-free.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The debug code in skd_send_special_fitmsg() assumes that req.n_sg
represents the number of S/G descriptors. However, skd_construct()
initializes that member variable to zero. Set req.n_sg to one such
that the debugging code in skd_send_special_fitmsg() works as
expected.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Removing the SG IO code also removed the code that sets
SKD_REQ_STATE_ABORTED. Hence also remove the code that checks for
this state.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The skd SG IO support duplicates the functionality of the bsg driver.
Hence remove it.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This will make it easier to convert this driver to the blk-mq
approach. This patch also reduces interrupt latency by moving
skd_request_fn() calls out of the skd_isr() interrupt.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move the skd_fail_all_pending() call out of skd_request_fn_not_online()
such that this function can be reused in the blk-mq code path.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but makes the next
patch in this series easier to read.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
mem_map[i] is accessed through readl() / writel() hence declaring
mem_map as volatile is not necessary.
Remove the volatile declarations from struct fit_completion_entry_v1
pointers and struct fit_comp_error_info since reading these structures
multiple times is safe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since setup_timer() invokes init_timer(), invoking init_timer()
just before setup_timer() is redundant. Hence remove the init_timer()
call.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change makes skd_preop_sg_list() support chained sg-lists.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since all callers pass zero as second argument to skd_recover_requests(),
drop that second argument.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The value of skcmp, cmp_cntxt etc. is overwritten during every
loop iteration and is not used after the loop has finished. Hence
initializing these variables outside the loop is not necessary.
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use DMA_FROM_DEVICE and DMA_TO_DEVICE directly instead of
introducing driver-private constants with the same numerical
value.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use ARRAY_SIZE() instead of open-coding it. This patch does not
change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since needless use of __packed slows down access to data structures,
only use __packed when needed.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch will help to verify the changes made by the next patch.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change makes the source code easier to read.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dma_alloc_coherent() guarantees alignment on a page boundary so
no explicit alignment is needed to align on a 64 byte boundary.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Due to the previous patch it is guaranteed that the FIT msg contains
at least one request after the for-loop has finished. Use this to
simplify the code.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Prepare the S/G-list before allocating a FIT msg such that the FIT
msg always contains at least one request after the for-loop is
finished.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pass the correct size to pci_free_consistent() in skd_free_skcomp().
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that sparse does not report any warnings when building the
skd driver with sparse verification enabled (C=1 or C=2).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use dev_err() and dev_info() instead of pr_err() and pr_info().
Since dev_dbg() is able to report file name and line number
information, remove __FILE__ and __LINE__ from the dev_dbg() calls.
Remove the struct skd_device members and the function (skd_name())
that became superfluous due to these changes.
This patch removes the device name and serial number from log
statements. An example of the old log line format:
(skd0:STM000196603:[0000:00:09.0]): Driver state STARTING(3)=>ONLINE(4)
An example of the new log line format:
skd:0000:00:09.0: Driver state STARTING(3)=>ONLINE(4)
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The purpose of barrier() is to prevent reordering by the compiler.
Since the compiler does not reorder calls to non-pure functions,
remove the barrier() calls from skd_reg_{read,write}{32,64}().
Since pr_debug() is able to report file name and line number
information, remove __FILE__ and __LINE__ from the pr_debug() calls.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
These variables have been detected by building with W=1. Declare
'acc' as __maybe_unused because most access_ok() implementations
ignore their first argument. This patch does not change any
functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no function skd_completion_posted_isr() in the skd driver
but there is a function called skd_isr_completion_posted(). Fix
the function name in the comment.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Change "ptimal" into "optimal" and remove the misleading reference
to sysfs.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality but makes the skd
driver source code more uniform with that of other kernel drivers.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since the code guarded by #ifdef SKD_VMK_POLL_HANDLER / #endif
is never built on Linux systems, remove it.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
E-mails sent to support@stec-inc.com bounce. Hence remove that
e-mail address from the driver. Add an entry to the MAINTAINERS
file instead.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This change does not affect any skd driver version derived from a
dual licensed code base but makes all code derived from future
upstream skd driver versions GPLv2 only.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure that the members of struct skd_msg_buf have been transferred
to the PCIe adapter before the doorbell is triggered. This patch
avoids that I/O fails sporadically and that the following error
message is reported:
(skd0:STM000196603:[0000:00:09.0]): Completion mismatch comp_id=0x0000 skreq=0x0400 new=0x0000
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Avoid that smatch reports the following warning when building with
C=2 CHECK="smatch -p=kernel":
drivers/block/xen-blkback/blkback.c:710 xen_blkbk_unmap_prepare() warn: inconsistent indenting
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monn303251 <roger.pau@citrix.com>
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to have partitions disabled for nbd by default, it costs us
nothing to have it enabled and is just confusing/obnoxious to users who try to
use partitions with nbd.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If users really want to use a particular index for their nbd device and it
doesn't already exist there's no reason we can't just create it for them. Do
this instead of erroring out.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Markus Elfring <elfring@users.sourceforge.net>
Cc: Jim Paris <jim@jtan.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Omit an extra message for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Link: http://events.linuxfoundation.org/sites/events/files/slides/LCJ16-Refactor_Strings-WSang_0.pdf
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Cc: Jim Paris <jim@jtan.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull xen block changes from Konrad:
Two fixes, both of them spotted by Amazon:
1) Fix in Xen-blkfront caused by the re-write in 4.8 time-frame.
2) Fix in the xen_biovec_phys_mergeable which allowed guest
requests when using NVMe - to slurp up more data than allowed
leading to an XSA (which has been made public today).
The early device registration made possible a race leading to allocations
of disks with wrong minors.
This patch moves the device registration further down the loop_init
function to make the race infeasible.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Anton Volkov <avolkov@ispras.ru>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit d05d7f4079 ("Merge branch 'for-4.8/core' of
git://git.kernel.dk/linux-block") and 3fc9d69093 ("Merge branch
'for-4.8/drivers' of git://git.kernel.dk/linux-block"), blkfront_resume()
has been using an index for iterating ring_info to check request when
iterating blk_shadow in an inner loop. This seems to have been
accidentally introduced during the massive rewrite of the block layer
macros in the commits.
This may cause crash like this:
[11798.057074] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
[11798.058832] IP: [<ffffffff814411fa>] blkfront_resume+0x10a/0x610
....
[11798.061063] Call Trace:
[11798.061063] [<ffffffff8139ce93>] xenbus_dev_resume+0x53/0x140
[11798.061063] [<ffffffff8139ce40>] ? xenbus_dev_probe+0x150/0x150
[11798.061063] [<ffffffff813f359e>] dpm_run_callback+0x3e/0x110
[11798.061063] [<ffffffff813f3a08>] device_resume+0x88/0x190
[11798.061063] [<ffffffff813f4cc0>] dpm_resume+0x100/0x2d0
[11798.061063] [<ffffffff813f5221>] dpm_resume_end+0x11/0x20
[11798.061063] [<ffffffff813950a8>] do_suspend+0xe8/0x1a0
[11798.061063] [<ffffffff813954bd>] shutdown_handler+0xfd/0x130
[11798.061063] [<ffffffff8139aba0>] ? split+0x110/0x110
[11798.061063] [<ffffffff8139ac26>] xenwatch_thread+0x86/0x120
[11798.061063] [<ffffffff810b4570>] ? prepare_to_wait_event+0x110/0x110
[11798.061063] [<ffffffff8108fe57>] kthread+0xd7/0xf0
[11798.061063] [<ffffffff811da811>] ? kfree+0x121/0x170
[11798.061063] [<ffffffff8108fd80>] ? kthread_park+0x60/0x60
[11798.061063] [<ffffffff810863b0>] ? call_usermodehelper_exec_work+0xb0/0xb0
[11798.061063] [<ffffffff810864ea>] ? call_usermodehelper_exec_async+0x13a/0x140
[11798.061063] [<ffffffff81534a45>] ret_from_fork+0x25/0x30
Use the right index in the inner loop.
Fixes: d05d7f4079 ("Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block")
Fixes: 3fc9d69093 ("Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Reviewed-by: Thomas Friebel <friebelt@amazon.de>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Cc: xen-devel@lists.xenproject.org
Cc: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
comp_algorithm_store() passes the size of the source buffer to strlcpy()
instead of the destination buffer size. Make it explicit that the two
buffers have the same size and use strcpy() instead of strlcpy(). The
latter can be done safely since the function ensures that the string in
the source buffer is terminated.
Link: http://lkml.kernel.org/r/20170803163350.45245-1-mka@chromium.org
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using mpgroup to define multiple paths for a virtual disk causes multiple
virtual-device-port ports to be created for that virtual device.
Each virtual-device-port port then gets a vdisk created for it by the Linux
sunvdc driver. As mpgroup is not supported by the Linux sunvdc driver it
cannot handle multiple ports for a single vdisk, leading to a kernel panic
at startup.
This fix prevents more than one vdisk per virtual-device-port being created
until full virtual disk multipathing (mpgroup) support is implemented.
Signed-off-by: Jim Quigley <Jim.Quigley@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
set submit_queues to 1 by default, and make sure it's value > 0.
Signed-off-by: weiping zhang <zhangweiping@didichuxing.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gcc-7 points out that a large controller number would overflow the
string length for the procfs name and the firmware version string:
drivers/block/DAC960.c: In function 'DAC960_Probe':
drivers/block/DAC960.c:6591:38: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=]
drivers/block/DAC960.c: In function 'DAC960_V1_ReadControllerConfiguration':
drivers/block/DAC960.c:1681:40: error: '%02d' directive writing between 2 and 3 bytes into a region of size between 2 and 5 [-Werror=format-overflow=]
drivers/block/DAC960.c:1681:40: note: directive argument in the range [0, 255]
drivers/block/DAC960.c:1681:3: note: 'sprintf' output between 10 and 14 bytes into a destination of size 12
Both of these seem appropriately sized, and using snprintf()
instead of sprintf() improves this by ensuring that even
incorrect data won't cause undefined behavior here.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"A small collection of fixes that should go into this series. This
contains:
- NVMe pull request from Christoph, with various fixes for nvme
proper and nvme-fc.
- disable runtime PM for blk-mq for now.
With scsi now defaulting to using blk-mq, this reared its head as
an issue. Longer term we'll fix up runtime PM for blk-mq, for now
just disable it to prevent a hang on laptop resume for some folks.
- blk-mq CPU <-> hw queue map fix from Christoph.
- xen/blkfront pull request from Konrad, with two small fixes for the
blkfront driver.
- a few fixups for nbd from Joseph.
- a stable fix for pblk from Javier"
* 'for-linus' of git://git.kernel.dk/linux-block:
lightnvm: pblk: advance bio according to lba index
nvme: validate admin queue before unquiesce
nbd: clear disconnected on reconnect
nvme-pci: fix HMB size calculation
nvme-fc: revise TRADDR parsing
nvme-fc: address target disconnect race conditions in fcp io submit
nvme: fabrics commands should use the fctype field for data direction
nvme: also provide a UUID in the WWID sysfs attribute
xen/blkfront: always allocate grants first from per-queue persistent grants
xen-blkfront: fix mq start/stop race
blk-mq: map queues to all present CPUs
block: disable runtime-pm for blk-mq
xen-blkfront: Fix handling of non-supported operations
nbd: only set sndtimeo if we have a timeout set
nbd: take tx_lock before disconnecting
nbd: allow multiple disconnects to be sent
If our device loses its connection for longer than the dead timeout we
will set NBD_DISCONNECTED in order to quickly fail any pending IO's that
flood in after the IO's that were waiting during the dead timer.
However if we re-connect at some point in the future we'll still see
this DISCONNECTED flag set if we then lose our connection again after
that, which means we won't get notifications for our newly lost
connections. Fix this by just clearing the DISCONNECTED flag on
reconnect in order to make sure everything works as it's supposed to.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch partially reverts 3df0e50 ("xen/blkfront: pseudo support for
multi hardware queues/rings"). The xen-blkfront queue/ring might hang due
to grants allocation failure in the situation when gnttab_free_head is
almost empty while many persistent grants are reserved for this queue/ring.
As persistent grants management was per-queue since 73716df ("xen/blkfront:
make persistent grants pool per-queue"), we should always allocate from
persistent grants first.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When ring buf full, hw queue will be stopped. While blkif interrupt consume
request and make free space in ring buf, hw queue will be started again.
But since start queue is protected by spin lock while stop not, that will
cause a race.
interrupt: process:
blkif_interrupt() blkif_queue_rq()
kick_pending_request_queues_locked()
blk_mq_start_stopped_hw_queues()
clear_bit(BLK_MQ_S_STOPPED, &hctx->state)
blk_mq_stop_hw_queue(hctx)
blk_mq_run_hw_queue(hctx, async)
If ring buf is made empty in this case, interrupt will never come, then the
hw queue will be stopped forever, all processes waiting for the pending io
in the queue will hung.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Use sysfs_match_string() helper instead of open coded variant.
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
A user reported that he was getting immediate disconnects with my
sndtimeo patch applied. This is because by default the OSS nbd client
doesn't set a timeout, so we end up setting the sndtimeo to 0, which of
course means we have send errors a lot. Instead only set our sndtimeo
if the user specified a timeout, otherwise we'll just wait forever like
we did previously.
Fixes: dc88e34d69 ("nbd: set sk->sk_sndtimeo for our sockets")
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We need to take the tx_lock so we don't interleave our disconnect
request between real data going down the wire.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's no reason to limit ourselves to one disconnect message per
socket. Sometimes networks do strange things, might as well let
sysadmins hit the panic button as much as they want.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No need to return value in queue work, kill ret variable.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull more block updates from Jens Axboe:
"This is a followup for block changes, that didn't make the initial
pull request. It's a bit of a mixed bag, this contains:
- A followup pull request from Sagi for NVMe. Outside of fixups for
NVMe, it also includes a series for ensuring that we properly
quiesce hardware queues when browsing live tags.
- Set of integrity fixes from Dmitry (mostly), fixing various issues
for folks using DIF/DIX.
- Fix for a bug introduced in cciss, with the req init changes. From
Christoph.
- Fix for a bug in BFQ, from Paolo.
- Two followup fixes for lightnvm/pblk from Javier.
- Depth fix from Ming for blk-mq-sched.
- Also from Ming, performance fix for mtip32xx that was introduced
with the dynamic initialization of commands"
* 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
block: call bio_uninit in bio_endio
nvmet: avoid unneeded assignment of submit_bio return value
nvme-pci: add module parameter for io queue depth
nvme-pci: compile warnings in nvme_alloc_host_mem()
nvmet_fc: Accept variable pad lengths on Create Association LS
nvme_fc/nvmet_fc: revise Create Association descriptor length
lightnvm: pblk: remove unnecessary checks
lightnvm: pblk: control I/O flow also on tear down
cciss: initialize struct scsi_req
null_blk: fix error flow for shared tags during module_init
block: Fix __blkdev_issue_zeroout loop
nvme-rdma: unconditionally recycle the request mr
nvme: split nvme_uninit_ctrl into stop and uninit
virtio_blk: quiesce/unquiesce live IO when entering PM states
mtip32xx: quiesce request queues to make sure no submissions are inflight
nbd: quiesce request queues to make sure no submissions are inflight
nvme: kick requeue list when requeueing a request instead of when starting the queues
nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
...
The global variable 'rd_size' is declared as 'int' in source file
arch/arm/kernel/atags_parse.c and as 'unsigned long' in
drivers/block/brd.c. Fix this inconsistency.
Additionally, remove the declarations of rd_image_start, rd_prompt and
rd_doload from parse_tag_ramdisk() since these duplicate existing
declarations in <linux/initrd.h>.
Link: http://lkml.kernel.org/r/20170627065024.12347-1-bart.vanassche@wdc.com
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Zhaohongjiang <zhaohongjiang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
attribute_groups are not supposed to change at runtime. All functions
working with attribute_groups provided by <linux/sysfs.h> work with
const attribute_group. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
8293 841 4 9138 23b2 drivers/block/zram/zram_drv.o
File size After adding 'const':
text data bss dec hex filename
8357 777 4 9138 23b2 drivers/block/zram/zram_drv.o
Link: http://lkml.kernel.org/r/65680c1c4d85818f7094cbfa31c91bf28185ba1b.1499061182.git.arvind.yadav.cs@gmail.com
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull followup NVMe (mostly) changes from Sagi:
I added the quiesce/unquiesce patches in here as it's
easy for me easily apply changes on top. It has accumulated
reviews and includes mostly nvme anyway, please tell me if
you don't want to take them with this.
This includes:
- quiesce/unquiesce fixes in nvme and others from me
- nvme-fc add create association padding spec updates from James
- some more quirking from MKP
- nvmet nit cleanup from Max
- Fix nvme-rdma racy RDMA completion signalling from Marta
- some centralization patches from me
- add tagset nr_hw_queues updates on controller resets in
nvme drivers from me
- nvme-rdma fix resources recycling when doing error recovery from me
- minor cleanups in nvme-fc from me
* Introduce the _flushcache() family of memory copy helpers and use them
for persistent memory write operations on x86. The _flushcache()
semantic indicates that the cache is either bypassed for the copy
operation (movnt) or any lines dirtied by the copy operation are
written back (clwb, clflushopt, or clflush).
* Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
* Add support for the new NVDIMM platform/firmware mechanisms introduced
in ACPI 6.2 and UEFI 2.7. This support includes the v1.2 namespace
label format, extensions to the address-range-scrub command set, new
error injection commands, and a new BTT (block-translation-table)
layout. These updates support inter-OS and pre-OS compatibility.
* Fix a longstanding memory corruption bug in nfit_test.
* Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
* Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed:
commit 6aa734a2f3 "libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime"
Reviewed-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZXsUtAAoJEB7SkWpmfYgCOXcP/06bncqTEvtgrOF2b7O8w+8e
mTySD51RUn6UpkFd37SMRch+rmbojuqj465TAE7XIXgyLgIOJixKaTlHYUoEnP3X
rC4Q/g5mN0nittMDwL+vQaa1lQWd2kbjOlrqCgnLHVEEJpHmiQussunjvir4G1U7
5ROooP8W+qMK5y5XPLJAg/gyGhYkjpRSlDg3Eo5meZZ0IdURbI7+WCLKrPcQUERT
WmDc9gLhJdSQVxBV/0m2gdAER4ADmFjcrlm8kjXRBhdlUmEFjM0zpvlHJutHTkks
rNZWCmCJs0Sas+DmRKszFmvVFHRHqUVA3dWK4P6PJEX+tl7BwlPcxpbfacHTG2EZ
btArFc584DZ+EIrim1cXXRvLFlxnKOFBtBeteFs7l2kZjEcN6S4I5OZgTyeDpe/i
2WDpHWLQWibkcIzH9y1EuMBkYnQjTJl1pecHzJoTaC+jAQ+opLiY7EecjLmCmQS6
MBYUeQZNufLGfT5b8KXfpKeiXhpFkYrAGp+ErfoH/6RKy2zqTdagN1yVhos2y+a7
JJu/Weetpn8qv+KTGUShO8TGyWv3wU46YkG2rKWl0FL1+C+6LMMw1/L0A97lwVlg
BpypVVyaNu1D22ifZ8O5wbqPIYghoZ5akA0CiduhX19cpl5rTeTd8EvLjvcYhZEZ
pMHuMAqIcIyLhIe2/sRF
=xKQB
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"libnvdimm updates for the latest ACPI and UEFI specifications. This
pull request also includes new 'struct dax_operations' enabling to
undo the abuse of copy_user_nocache() for copy operations to pmem.
The dax work originally missed 4.12 to address concerns raised by Al.
Summary:
- Introduce the _flushcache() family of memory copy helpers and use
them for persistent memory write operations on x86. The
_flushcache() semantic indicates that the cache is either bypassed
for the copy operation (movnt) or any lines dirtied by the copy
operation are written back (clwb, clflushopt, or clflush).
- Extend dax_operations with ->copy_from_iter() and ->flush()
operations. These operations and other infrastructure updates allow
all persistent memory specific dax functionality to be pushed into
libnvdimm and the pmem driver directly. It also allows dax-specific
sysfs attributes to be linked to a host device, for example:
/sys/block/pmem0/dax/write_cache
- Add support for the new NVDIMM platform/firmware mechanisms
introduced in ACPI 6.2 and UEFI 2.7. This support includes the v1.2
namespace label format, extensions to the address-range-scrub
command set, new error injection commands, and a new BTT
(block-translation-table) layout. These updates support inter-OS
and pre-OS compatibility.
- Fix a longstanding memory corruption bug in nfit_test.
- Make the pmem and nvdimm-region 'badblocks' sysfs files poll(2)
capable.
- Miscellaneous fixes and small updates across libnvdimm and the nfit
driver.
Acknowledgements that came after the branch was pushed: commit
6aa734a2f3 ("libnvdimm, region, pmem: fix 'badblocks'
sysfs_get_dirent() reference lifetime") was reviewed by Toshi Kani
<toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (42 commits)
libnvdimm, namespace: record 'lbasize' for pmem namespaces
acpi/nfit: Issue Start ARS to retrieve existing records
libnvdimm: New ACPI 6.2 DSM functions
acpi, nfit: Show bus_dsm_mask in sysfs
libnvdimm, acpi, nfit: Add bus level dsm mask for pass thru.
acpi, nfit: Enable DSM pass thru for root functions.
libnvdimm: passthru functions clear to send
libnvdimm, btt: convert some info messages to warn/err
libnvdimm, region, pmem: fix 'badblocks' sysfs_get_dirent() reference lifetime
libnvdimm: fix the clear-error check in nsio_rw_bytes
libnvdimm, btt: fix btt_rw_page not returning errors
acpi, nfit: quiet invalid block-aperture-region warnings
libnvdimm, btt: BTT updates for UEFI 2.7 format
acpi, nfit: constify *_attribute_group
libnvdimm, pmem: disable dax flushing when pmem is fronting a volatile region
libnvdimm, pmem, dax: export a cache control attribute
dax: convert to bitmask for flags
dax: remove default copy_from_iter fallback
libnvdimm, nfit: enable support for volatile ranges
libnvdimm, pmem: fix persistence warning
...
Merge misc updates from Andrew Morton:
- a few hotfixes
- various misc updates
- ocfs2 updates
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (108 commits)
mm, memory_hotplug: move movable_node to the hotplug proper
mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
mm, memory_hotplug: drop artificial restriction on online/offline
mm: memcontrol: account slab stats per lruvec
mm: memcontrol: per-lruvec stats infrastructure
mm: memcontrol: use generic mod_memcg_page_state for kmem pages
mm: memcontrol: use the node-native slab memory counters
mm: vmstat: move slab statistics from zone to node counters
mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare()
mm/zswap.c: improve a size determination in zswap_frontswap_init()
mm/zswap.c: delete an error message for a failed memory allocation in zswap_pool_create()
mm/swapfile.c: sort swap entries before free
mm/oom_kill: count global and memory cgroup oom kills
mm: per-cgroup memory reclaim stats
mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects
mm: kmemleak: factor object reference updating out of scan_block()
mm: kmemleak: slightly reduce the size of some structures on 64-bit architectures
mm, mempolicy: don't check cpuset seqlock where it doesn't matter
mm, cpuset: always use seqlock when changing task's nodemask
mm, mempolicy: simplify rebinding mempolicies when updating cpusets
...
Pull misc compat stuff updates from Al Viro:
"This part is basically untangling various compat stuff. Compat
syscalls moved to their native counterparts, getting rid of quite a
bit of double-copying and/or set_fs() uses. A lot of field-by-field
copyin/copyout killed off.
- kernel/compat.c is much closer to containing just the
copyin/copyout of compat structs. Not all compat syscalls are gone
from it yet, but it's getting there.
- ipc/compat_mq.c killed off completely.
- block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
drivers/block/floppy.c where they belong. Yes, there are several
drivers that implement some of the same ioctls. Some are m68k and
one is 32bit-only pmac. drivers/block/floppy.c is the only one in
that bunch that can be built on biarch"
* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
mqueue: move compat syscalls to native ones
usbdevfs: get rid of field-by-field copyin
compat_hdio_ioctl: get rid of set_fs()
take floppy compat ioctls to sodding floppy.c
ipmi: get rid of field-by-field __get_user()
ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
rt_sigtimedwait(): move compat to native
select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
put_compat_rusage(): switch to copy_to_user()
sigpending(): move compat to native
getrlimit()/setrlimit(): move compat to native
times(2): move compat to native
compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
do_sigaltstack(): lift copying to/from userland into callers
take compat_sys_old_getrlimit() to native syscall
trim __ARCH_WANT_SYS_OLD_GETRLIMIT
Regardless of whether it is same page or not, it's surely write and
stored to zram so we should increase pages_stored stat. Otherwise, user
can see zero value via mm_stats although he writes a lot of pages to
zram.
Link: http://lkml.kernel.org/r/1494834068-27004-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The changes in "block: Make most scsi_req_init() calls implicit" mean
that every driver that supports the generic scsi ioctls needs to
call scsi_req_init on newly allocated requests, but that commit didn't
add the call to the ccіss driver. Fix that to avoid crashes when
udev issues SG_IO commands.
Fixes: ca18d6f7 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case we use shared tags feature, blk_mq_alloc_tag_set might fail
during module initialization. In that case, fail the load with a
suitable error code. Also move the tagset initialization process after
defining the amount of submission queues.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Without it its not guaranteed that no .queue_rq is inflight.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Cc: Jason Wang <jasowang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Unlike blk_mq_stop_hw_queues, blk_mq_quiesce_queue respects the
submission path rcu grace. quiesce the queue before iterating
on live tags, or performing device io quiescing.
While were at it, verify that the request started in mtip_abort_cmd
amd mtip_queue_cmd tag iteration calls.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Unlike blk_mq_stop_hw_queues, blk_mq_quiesce_queue respects the
submission path rcu grace. quiesce the queue before iterating
on live tags.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Pull read/write updates from Al Viro:
"Christoph's fs/read_write.c series - consolidation and cleanups"
* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: remove nfsd_vfs_read
nfsd: use vfs_iter_read/write
fs: implement vfs_iter_write using do_iter_write
fs: implement vfs_iter_read using do_iter_read
fs: move more code into do_iter_read/do_iter_write
fs: remove __do_readv_writev
fs: remove do_compat_readv_writev
fs: remove do_readv_writev
Pull misc user access cleanups from Al Viro:
"The first pile is assorted getting rid of cargo-culted access_ok(),
cargo-culted set_fs() and field-by-field copyouts.
The same description applies to a lot of stuff in other branches -
this is just the stuff that didn't fit into a more specific topical
branch"
* 'work.misc-set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Switch flock copyin/copyout primitives to copy_{from,to}_user()
fs/fcntl: return -ESRCH in f_setown when pid/pgid can't be found
fs/fcntl: f_setown, avoid undefined behaviour
fs/fcntl: f_setown, allow returning error
lpfc debugfs: get rid of pointless access_ok()
adb: get rid of pointless access_ok()
isdn: get rid of pointless access_ok()
compat statfs: switch to copy_to_user()
fs/locks: don't mess with the address limit in compat_fcntl64
nfsd_readlink(): switch to vfs_get_link()
drbd: ->sendpage() never needed set_fs()
fs/locks: pass kernel struct flock to fcntl_getlk/setlk
fs: locks: Fix some troubles at kernel-doc comments
It is observed reading the register from HW takes a bit long,
for example in my box, the following difference of 'perf report
--no-children fio ...' can be seen when running I/O:
1) V4.12 without patch
+ 9.28% fio [mtip32xx] [k] mtip_irq_handler
+ 8.48% fio [mtip32xx] [k] mtip_init_cmd_header
2) V4.12 with the following patch
+ 9.14% fio [mtip32xx] [k] mtip_irq_handler
......
+ 1.14% fio [mtip32xx] [k] mtip_init_cmd_header
IOPS can be increased by ~5% with this patch too.
Fixes: a4e84aae8139(mtip32xx: use runtime tag to initialize command header)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Here is the big driver core update for 4.13-rc1.
The large majority of this is a lot of cleanup of old fields in the
driver core structures and their remaining usages in random drivers.
All of those fixes have been reviewed by the various subsystem
maintainers. There's also some small firmware updates in here, a new
kobject uevent api interface that makes userspace interaction easier,
and a few other minor things.
All of these have been in linux-next for a long while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWVpX4A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ymobgCfd0d13IfpZoq1N41wc6z2Z0xD7cwAnRMeH1/p
kEeISGpHPYP9f8PBh9FO
=Hfqt
-----END PGP SIGNATURE-----
Merge tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big driver core update for 4.13-rc1.
The large majority of this is a lot of cleanup of old fields in the
driver core structures and their remaining usages in random drivers.
All of those fixes have been reviewed by the various subsystem
maintainers. There's also some small firmware updates in here, a new
kobject uevent api interface that makes userspace interaction easier,
and a few other minor things.
All of these have been in linux-next for a long while with no reported
issues"
* tag 'driver-core-4.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (56 commits)
arm: mach-rpc: ecard: fix build error
zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()
driver-core: remove struct bus_type.dev_attrs
powerpc: vio_cmo: use dev_groups and not dev_attrs for bus_type
powerpc: vio: use dev_groups and not dev_attrs for bus_type
USB: usbip: convert to use DRIVER_ATTR_RW
s390: drivers: convert to use DRIVER_ATTR_RO/WO
platform: thinkpad_acpi: convert to use DRIVER_ATTR_RO/RW
pcmcia: ds: convert to use DRIVER_ATTR_RO
wireless: ipw2x00: convert to use DRIVER_ATTR_RW
net: ehea: convert to use DRIVER_ATTR_RO
net: caif: convert to use DRIVER_ATTR_RO
TTY: hvc: convert to use DRIVER_ATTR_RW
PCI: pci-driver: convert to use DRIVER_ATTR_WO
IB: nes: convert to use DRIVER_ATTR_RW
HID: hid-core: convert to use DRIVER_ATTR_RO and drv_groups
arm: ecard: fix dev_groups patch typo
tty: serdev: use dev_groups and not dev_attrs for bus_type
sparc: vio: use dev_groups and not dev_attrs for bus_type
hid: intel-ish-hid: use dev_groups and not dev_attrs for bus_type
...
De-dupliate some code and allow for passing the flags argument to
vfs_iter_write. Additionally it now properly updates timestamps.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
De-dupliate some code and allow for passing the flags argument to
vfs_iter_read. Additional it properly updates atime now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Drop static on a local variable, when the variable is initialized before
any use, on every possible execution path through the function. The
static has no benefit, and dropping it reduces the code size.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@bad exists@
position p;
identifier x;
type T;
@@
static T x@p;
...
x = <+...x...+>
@@
identifier x;
expression e;
type T;
position p != bad.p;
@@
-static
T x@p;
... when != x
when strict
?x = e;
// </smpl>
The change in code size is indicates by the following output from the size
command.
before:
text data bss dec hex filename
67299 2291 1056 70646 113f6 drivers/block/drbd/drbd_nl.o
after:
text data bss dec hex filename
67283 2291 1056 70630 113e6 drivers/block/drbd/drbd_nl.o
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Require all dax-drivers to register a ->copy_from_iter() operation so
that it is clear which dax_operations are optional and which must be
implemented for filesystem-dax to operate.
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Instead move it to the callers. Those that either don't use bio_data() or
page_address() or are specific to architectures that do not support highmem
are skipped.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For historical reasons we default to bouncing highmem pages for all block
queues. But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
pktcdvd is a make_request based stacking driver and thus doesn't have any
addressing limits on it's own. It also doesn't use bio_data() or
page_address(), so it doesn't need a lowmem bounce either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This fixes up two commits that have touched this driver. The
command status field is now a blk_status_t, so we can't check
for < 0 and we definitely can't assume it's holding -Exxxx error
values. All we care about here is whether ->status is zero or not.
Check for that, and remove the various attempts at smart error
reporting. Just log to dmesg what command failed, and the
blk_status_t value.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 2a842acab1 ("block: introduce new block status code type")
Fixes: 3f5e6a3577 ("mtip32xx: convert internal command issue to block IO path")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull in the fix for shared tags, as it conflicts with the pending
changes in for-4.13/block. We already pulled in v4.12-rc5 to solve
other conflicts or get fixes that went into 4.12, so not a lot
of changes in this merge.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of explicitly calling scsi_req_init() after blk_get_request(),
call that function from inside blk_get_request(). Add an
.initialize_rq_fn() callback function to the block drivers that need
it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
because it is too small to keep it as a separate function. Keep the
scsi_req_init() call in ide_prep_sense() because it follows a
blk_rq_init() call.
References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some storage drivers need to share tag sets between devices. It's
useful to be able to model that with null_blk, to find hangs or
performance issues.
Add a 'shared_tags' bool module parameter that. If that is set to
true and nr_devices is bigger than 1, all devices allocated will
share the same tag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bios that are re-submitted will pass through blk_queue_split() when
blk_queue_bio() is called, and this will split the bio if necessary.
There is no longer any need to do this splitting in xen-blkfront.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesytems use.
This new bio_set, pkt_bio_set, can also be use for the bio_split()
call as the two allocations (bio_clone_fast, and bio_split) are
independent, neither can block a bio allocated by the other.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part. So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.
So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().
Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.
This requires that we provide a bio_set. bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks. As filesystems
use fs_bio_set, block devices shouldn't.
As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use. So allocate that when the module is
initialised, and use it with bio_clone_fast().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch converts bioset_create() to not create a workqueue by
default, so alloctions will never trigger punt_bios_to_rescuer(). It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.
All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.
biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios. This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.
biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.
It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes)
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: Javier González <javier@cnexlabs.com>
Tested-by: Javier González <javier@cnexlabs.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file. This double-throttling can trigger
positive feedback loops that create significant delays. The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.
The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server. It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.
To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
time taken.
When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable. 52-460 seconds. Half below 100seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.
There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
of_device_ids are not supposed to change at runtime. All functions
working with of_device_ids provided by <linux/of.h> work with const
of_device_ids. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
8908 1096 624 10628 2984 drivers/block/swim3.o
File size after constify swim3_match:
text data bss dec hex filename
9708 296 624 10628 2984 drivers/block/swim3.o
Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Rather than constructing a local structure instance on the stack, fill
the fields directly on the shared ring, just like other backends do.
Build on the fact that all response structure flavors are actually
identical (the old code did make this assumption too).
This is XSA-216.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There is no need to use xen_blkif_get()/xen_blkif_put() in the kthread
of xen-blkback. Thread stopping is synchronous and using the blkif
reference counting in the kthread will avoid to ever let the reference
count drop to zero at the end of an I/O running concurrent to
disconnecting and multiple rings.
Setting ring->xenblkd to NULL after stopping the kthread isn't needed
as the kthread does this already.
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The be structure must not be freed when freeing the blkif structure
isn't done. Otherwise a use-after-free of be when unmapping the ring
used for communicating with the frontend will occur in case of a
late call of xenblk_disconnect() (e.g. due to an I/O still active
when trying to disconnect).
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Today disconnecting xen-blkback is broken in case there are still
I/Os in flight: xen_blkif_disconnect() will bail out early without
releasing all resources in the hope it will be called again when
the last request has terminated. This, however, won't happen as
xen_blkif_free() won't be called on termination of the last running
request: xen_blkif_put() won't decrement the blkif refcnt to 0 as
xen_blkif_disconnect() didn't finish before thus some xen_blkif_put()
calls in xen_blkif_disconnect() didn't happen.
To solve this deadlock xen_blkif_disconnect() and
xen_blkif_alloc_rings() shouldn't use xen_blkif_put() and
xen_blkif_get() but use some other way to do their accounting of
resources.
This at once fixes another error in xen_blkif_disconnect(): when it
returned early with -EBUSY for another ring than 0 it would call
xen_blkif_put() again for already handled rings on a subsequent call.
This will lead to inconsistencies in the refcnt handling.
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Steven Haigh <netwiz@crc.id.au>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
I missed converting the last zram attribute to CLASS_ATTR_RO() after
removing CLASS_ATTR() from the kernel, causing a build breakage. This
patch fixes that problem.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
=m2Gx
-----END PGP SIGNATURE-----
Merge tag 'v4.12-rc5' into for-4.13/block
We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.
Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use the same values for use for request completion errors as the return
value from ->queue_rq. BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings. This patch
instead introduces a new blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning. Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.
For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.
blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the nbd server stops receiving packets altogether we will get stuck
waiting for them to receive indefinitely as the tcp buffer will never
empty, which looks like a deadlock. Fix this by setting the sk send
timeout to our configured timeout, that way if the server really
misbehaves we'll disconnect cleanly instead of waiting forever.
Reported-by: Dan Melnic <dmm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
gcc points out an unusual indentation:
drivers/block/loop.c: In function 'loop_set_status':
drivers/block/loop.c:1149:3: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
if (figure_loop_size(lo, info->lo_offset, info->lo_sizelimit,
^~
drivers/block/loop.c:1152:4: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'
goto exit;
This was introduced by a new feature that accidentally moved the opening
braces from one condition to another. Adding a second pair of braces
makes it work correctly again and also more readable.
Fixes: f2c6df7dbf ("loop: support 4k physical blocksize")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The class_attrs pointer is long depreciated, and is about to be finally
removed, so move to use the class_groups pointer instead.
Cc: <linux-block@vger.kernel.org>
Acked-by: Jens Axboe <axboe@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The class_attrs pointer is long depreciated, and is about to be finally
removed, so move to use the class_groups pointer instead.
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When generating bootable VM images certain systems (most notably
s390x) require devices with 4k blocksize. This patch implements
a new flag 'LO_FLAGS_BLOCKSIZE' which will set the physical
blocksize to that of the underlying device, and allow to change
the logical blocksize for up to the physical blocksize.
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
While installing SLES-12 (based on v4.4), I found that the installer
will stall for 60+ seconds during LVM disk scan. The root cause was
determined to be the removal of a bound device check in loop_flush()
by commit b5dd2f6047 ("block: loop: improve performance via blk-mq").
Restoring this check, examining ->lo_state as set by loop_set_fd()
eliminates the bad behavior.
Test method:
modprobe loop max_loop=64
dd if=/dev/zero of=disk bs=512 count=200K
for((i=0;i<4;i++))do losetup -f disk; done
mkfs.ext4 -F /dev/loop0
for((i=0;i<4;i++))do mkdir t$i; mount /dev/loop$i t$i;done
for f in `ls /dev/loop[0-9]*|sort`; do \
echo $f; dd if=$f of=/dev/null bs=512 count=1; \
done
Test output: stock patched
/dev/loop0 18.1217e-05 8.3842e-05
/dev/loop1 6.1114e-05 0.000147979
/dev/loop10 0.414701 0.000116564
/dev/loop11 0.7474 6.7942e-05
/dev/loop12 0.747986 8.9082e-05
/dev/loop13 0.746532 7.4799e-05
/dev/loop14 0.480041 9.3926e-05
/dev/loop15 1.26453 7.2522e-05
Note that from loop10 onward, the device is not mounted, yet the
stock kernel consumes several orders of magnitude more wall time
than it does for a mounted device.
(Thanks for Mike Galbraith <efault@gmx.de>, give a changelog review.)
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: James Wang <jnwang@suse.com>
Fixes: b5dd2f6047 ("block: loop: improve performance via blk-mq")
Signed-off-by: Jens Axboe <axboe@fb.com>
introduced in -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZMYN+AAoJEEp/3jgCEfOL6u0H/Rp2NdbW4QPoJWZmvL0vD77B
4N3Y+QCG7CSG3TBHWFlIUL7MsHY2wS/3D6rcIPlLhXVb3PV+IhBw5Aqu/OwzF1j6
YuEAoJZWAS4nP3eTDVz1eFSJnk/M4jrIqgPFFdKdh+pumxTnEwl6vXeKD8SSX4xx
2mlxf+EavqeNYppTyBdIWYlVjF03klWPp42G0CRpvjficO+5tvkP4cOrI8cRkgoK
fYFZDTp7Zz0Y1+FQfxntfRtFgksvNplvdnenDnjsPxty+YPUGETpRfo72R8DMU8n
A16g9hkHZNL3CQWcKwkwQmHX/IJxpKzC0xNMNaZO+8dJqAWOdB9h9FS+roAO1p0=
=WpjE
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.12-rc4' of git://github.com/ceph/ceph-client
Pull ceph fix from Ilya Dryomov:
"A small fix for rbd FALLOC_FL_ZERO_RANGE/PUNCH_HOLE handling breakage
introduced in -rc1"
* tag 'ceph-for-4.12-rc4' of git://github.com/ceph/ceph-client:
rbd: implement REQ_OP_WRITE_ZEROES
Since the pktcdvd driver only supports request queues for which
struct scsi_request is the first member of their private request
data, refuse to register block layer queues for which struct
scsi_request is not the first member of the private data.
References: commit 82ed4db499 ("block: split scsi_request out of struct request")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
From the context where a SCSI command is submitted it is not always
possible to figure out whether or not the queue the command is
submitted to has struct scsi_request as the first member of its
private data. Hence introduce the flag QUEUE_FLAG_SCSI_PASSTHROUGH.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
NBD userland client and server have FUA (forced unit access) support
and flags defined. Make NBD kernel module recognize NBD_FLAG_SEND_FUA,
enable FUA on the queue, and forward FUA requests to the server.
Signed-off-by: Shaun McDowell <shaunjmcdowell@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
nbd_config is allocated in nbd_alloc_config(), but never freed.
Fixes: 5ea8d10802 ("nbd: separate out the config information")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is nothing to clear -- nbd_device has just been allocated.
Fold nbd_reset() into its other caller, nbd_config_put().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 93c1defedc ("rbd: remove the discard_zeroes_data flag")
explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
following commit 48920ff2a5 ("block: remove the discard_zeroes_data
flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
release either some or all blocks depending on whether the zeroing
request is rbd_obj_bytes() aligned. This is how we currently implement
discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
now. Caveats:
- REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
current implementations - nvme and loop
- there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
is hence less helpful than blk_bio_discard_split(), but this can (and
should) be fixed on the rbd side
In the future we will split these into two code paths to respect
REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
released on discard.
Fixes: 93c1defedc ("rbd: remove the discard_zeroes_data flag")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pull block fixes from Jens Axboe:
"A small collection of fixes that should go into this cycle.
- a pull request from Christoph for NVMe, which ended up being
manually applied to avoid pulling in newer bits in master. Mostly
fibre channel fixes from James, but also a few fixes from Jon and
Vijay
- a pull request from Konrad, with just a single fix for xen-blkback
from Gustavo.
- a fuseblk bdi fix from Jan, fixing a regression in this series with
the dynamic backing devices.
- a blktrace fix from Shaohua, replacing sscanf() with kstrtoull().
- a request leak fix for drbd from Lars, fixing a regression in the
last series with the kref changes. This will go to stable as well"
* 'for-linus' of git://git.kernel.dk/linux-block:
nvmet: release the sq ref on rdma read errors
nvmet-fc: remove target cpu scheduling flag
nvme-fc: stop queues on error detection
nvme-fc: require target or discovery role for fc-nvme targets
nvme-fc: correct port role bits
nvme: unmap CMB and remove sysfs file in reset path
blktrace: fix integer parse
fuseblk: Fix warning in super_setup_bdi_name()
block: xen-blkback: add null check to avoid null pointer dereference
drbd: fix request leak introduced by locking/atomic, kref: Kill kref_sub()
When killing kref_sub(), the unconditional additional kref_get()
was not properly paired with the necessary kref_put(), causing
a leak of struct drbd_requests (~ 224 Bytes) per submitted bio,
and breaking DRBD in general, as the destructor of those "drbd_requests"
does more than just the mempoll_free().
Fixes: bdfafc4ffd ("locking/atomic, kref: Kill kref_sub()")
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: stable@vger.kernel.org # v4.11
Signed-off-by: Jens Axboe <axboe@fb.com>
A bunch of changes to virtio, most affecting virtio net.
ptr_ring batched zeroing - first of batching enhancements
that seems ready.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZEceWAAoJECgfDbjSjVRpg8YIAIbB1UJZkrHh/fdCQjM2O53T
ESS62W91LBT+weYH/N79RrfnGWzDOHrCQ8Or1nAKQZx2vU1aroqYXeJ9o0liRBYr
hGZB1/bg8obA5XkKUfME2+XClakvuXABUJbky08iDL9nILlrvIVLoUgZ9ouL0vTj
oUv4n6jtguNFV7tk/injGNRparEVdcefX0dbPxXomx5tSeD2fOE96/Q4q2eD3f7r
NHD4DailEJC7ndJGa6b4M9g8shkXzgEnSw+OJHpcJcxCnAzkVT94vsU7OluiDvmG
bfdGZNb0ohDYZLbJDR8aiMkoad8bIVLyGlhqnMBiZQEOF4oiWM9UJLvp5Lq9g7A=
=Sb7L
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
"Fixes, cleanups, performance
A bunch of changes to virtio, most affecting virtio net. Also ptr_ring
batched zeroing - first of batching enhancements that seems ready."
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
s390/virtio: change maintainership
tools/virtio: fix spelling mistake: "wakeus" -> "wakeups"
virtio_net: tidy a couple debug statements
ptr_ring: support testing different batching sizes
ringtest: support test specific parameters
ptr_ring: batch ring zeroing
virtio: virtio_driver doc
virtio_net: don't reset twice on XDP on/off
virtio_net: fix support for small rings
virtio_net: reduce alignment for buffers
virtio_net: rework mergeable buffer handling
virtio_net: allow specifying context for rx
virtio: allow extra context per descriptor
tools/virtio: fix build breakage
virtio: add context flag to find vqs
virtio: wrap find_vqs
ringtest: fix an assert statement
lock transfers from myself and the long awaited -ENOSPC handling series
from Jeff. The former will allow rbd users to take advantage of
exclusive lock's built-in blacklist/break-lock functionality while
staying in control of who owns the lock. With the latter in place, we
will abort filesystem writes on -ENOSPC instead of having them block
indefinitely.
Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJZEt/kAAoJEEp/3jgCEfOLpzAIAIld0N06DuHKG2F9mHEnLeGl
Y60BZ3Ajo32i9qPT/u9ntI99ZMlkuHcNWg6WpCCh8umbwk2eiAKRP/KcfGcWmmp9
EHj9COCmBR9TRM1pNS1lSMzljDnxf9sQmbIO9cwMQBUya5g19O0OpApzxF1YQhCR
V9B/FYV5IXELC3b/NH45oeDAD9oy/WgwbhQ2feTBQJmzIVJx+Je9hdhR1PH1rI06
ysyg3VujnUi/hoDhvPTBznNOxnHx/HQEecHH8b01MkbaCgxPH88jsUK/h7PYF3Gh
DE/sCN69HXeu1D/al3zKoZdahsJ5GWkj9Q+vvBoQJm+ZPsndC+qpgSj761n9v38=
=vamy
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"The two main items are support for disabling automatic rbd exclusive
lock transfers from myself and the long awaited -ENOSPC handling
series from Jeff.
The former will allow rbd users to take advantage of exclusive lock's
built-in blacklist/break-lock functionality while staying in control
of who owns the lock. With the latter in place, we will abort
filesystem writes on -ENOSPC instead of having them block
indefinitely.
Beyond that we've got the usual pile of filesystem fixes from Zheng,
some refcount_t conversion patches from Elena and a patch for an
ancient open() flags handling bug from Alexander"
* tag 'ceph-for-4.12-rc1' of git://github.com/ceph/ceph-client: (31 commits)
ceph: fix memory leak in __ceph_setxattr()
ceph: fix file open flags on ppc64
ceph: choose readdir frag based on previous readdir reply
rbd: exclusive map option
rbd: return ResponseMessage result from rbd_handle_request_lock()
rbd: kill rbd_is_lock_supported()
rbd: support updating the lock cookie without releasing the lock
rbd: store lock cookie
rbd: ignore unlock errors
rbd: fix error handling around rbd_init_disk()
rbd: move rbd_unregister_watch() call into rbd_dev_image_release()
rbd: move rbd_dev_destroy() call out of rbd_dev_image_release()
ceph: when seeing write errors on an inode, switch to sync writes
Revert "ceph: SetPageError() for writeback pages if writepages fails"
ceph: handle epoch barriers in cap messages
libceph: add an epoch_barrier field to struct ceph_osd_client
libceph: abort already submitted but abortable requests when map or pool goes full
libceph: allow requests to return immediately on full conditions if caller wishes
libceph: remove req->r_replay_version
ceph: make seeky readdir more efficient
...
We now have memalloc_noreclaim_{save,restore} helpers for robust setting
and clearing of PF_MEMALLOC. Let's convert the code which was using the
generic tsk_restore_flags(). No functional change.
[vbabka@suse.cz: in net/core/sock.c the hunk is missing]
Link: http://lkml.kernel.org/r/20170405074700.29871-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Chris Leech <cleech@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Wouter Verhelst <w@uter.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CURRENT_TIME is not y2038 safe. The macro will be deleted and all the
references to it will be replaced by ktime_get_* apis.
struct timespec is also not y2038 safe. Retain timespec for timestamp
representation here as ceph uses it internally everywhere. These
references will be changed to use struct timespec64 in a separate patch.
The current_fs_time() api is being changed to use vfs struct inode* as
an argument instead of struct super_block*.
Set the new mds client request r_stamp field using ktime_get_real_ts()
instead of using current_fs_time().
Also, since r_stamp is used as mtime on the server, use timespec_trunc()
to truncate the timestamp, using the right granularity from the
superblock.
This api will be transitioned to be y2038 safe along with vfs.
Link: http://lkml.kernel.org/r/1491613030-11599-5-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
M: Ilya Dryomov <idryomov@gmail.com>
M: "Yan, Zheng" <zyan@redhat.com>
M: Sage Weil <sage@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__vmalloc* allows users to provide gfp flags for the underlying
allocation. This API is quite popular
$ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
77
The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space. About half of users don't
use this flag, though. This signals that we make the API unnecessarily
too complex.
This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
are simplified and drop the flag.
Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block fixes and updates from Jens Axboe:
"Some fixes and followup features/changes that should go in, in this
merge window. This contains:
- Two fixes for lightnvm from Javier, fixing problems in the new code
merge previously in this merge window.
- A fix from Jan for the backing device changes, fixing an issue in
NFS that causes a failure to mount on certain setups.
- A change from Christoph, cleaning up the blk-mq init and exit
request paths.
- Remove elevator_change(), which is now unused. From Bart.
- A fix for queue operation invocation on a dead queue, from Bart.
- A series fixing up mtip32xx for blk-mq scheduling, removing a
bandaid we previously had in place for this. From me.
- A regression fix for this series, fixing a case where we wait on
workqueue flushing from an invalid (non-blocking) context. From me.
- A fix/optimization from Ming, ensuring that we don't both quiesce
and freeze a queue at the same time.
- A fix from Peter on lock ordering for CPU hotplug. Not a real
problem right now, but will be once the CPU hotplug rework goes in.
- A series from Omar, cleaning up out blk-mq debugfs support, and
adding support for exporting info from schedulers in debugfs as
well. This is really useful in debugging stalls or livelocks. From
Omar"
* 'for-linus' of git://git.kernel.dk/linux-block: (28 commits)
mq-deadline: add debugfs attributes
kyber: add debugfs attributes
blk-mq-debugfs: allow schedulers to register debugfs attributes
blk-mq: untangle debugfs and sysfs
blk-mq: move debugfs declarations to a separate header file
blk-mq: Do not invoke queue operations on a dead queue
blk-mq-debugfs: get rid of a bunch of boilerplate
blk-mq-debugfs: rename hw queue directories from <n> to hctx<n>
blk-mq-debugfs: don't open code strstrip()
blk-mq-debugfs: error on long write to queue "state" file
blk-mq-debugfs: clean up flag definitions
blk-mq-debugfs: separate flags with |
nfs: Fix bdi handling for cloned superblocks
block/mq: Cure cpu hotplug lock inversion
lightnvm: fix bad back free on error path
lightnvm: create cmd before allocating request
blk-mq: don't use sync workqueue flushing from drivers
mtip32xx: convert internal commands to regular block infrastructure
mtip32xx: cleanup internal tag assumptions
block: don't call blk_mq_quiesce_queue() after queue is frozen
...
* Region media error reporting: A libnvdimm region device is the parent
to one or more namespaces. To date, media errors have been reported via
the "badblocks" attribute attached to pmem block devices for namespaces
in "raw" or "memory" mode. Given that namespaces can be in "device-dax"
or "btt-sector" mode this new interface reports media errors
generically, i.e. independent of namespace modes or state. This
subsequently allows userspace tooling to craft "ACPI 6.1 Section
9.20.7.6 Function Index 4 - Clear Uncorrectable Error" requests and
submit them via the ioctl path for NVDIMM root bus devices.
* Introduce 'struct dax_device' and 'struct dax_operations': Prompted by
a request from Linus and feedback from Christoph this allows for dax
capable drivers to publish their own custom dax operations. This fixes
the broken assumption that all dax operations are related to a
persistent memory device, and makes it easier for other architectures
and platforms to add customized persistent memory support.
* 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger memory
controllers to drain write-pending buffers that would otherwise be
flushed automatically by the platform ADR (asynchronous-DRAM-refresh)
mechanism at a power loss event. Support for "locked" DIMMs is included
to prevent namespaces from surfacing when the namespace label data area
is locked. Finally, fixes for various reported deadlocks and crashes,
also tagged for -stable.
* ACPI / nfit driver updates: General updates of the nfit driver to add
DSM command overrides, ACPI 6.1 health state flags support, DSM payload
debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
commmit 565851c972 "device-dax: fix sysfs attribute deadlock"
Tested-by: Yi Zhang <yizhan@redhat.com>
commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZDONJAAoJEB7SkWpmfYgC3SsP/2KrLvTUcz646ViuPOgZ2cC4
W6wAx6cvDSt+H52kLnFEsYoFt7WAj20ggPirb/Bc5jkGlvwE0lT9Xtmso9GpVkYT
J9ZJ9pP/4YaAD3II1gmTwaUjYi0FxoOdx3Eb92yuWkO/8ylz4b2Nu3cBpYwyziGQ
nIfEVwDXRLE86u6x0bWuf6TlVuvsbdiAI55CDqDMVQC6xIOLbSez7b8QIHlpiKEb
Mw+xqdQva0esoreZEOXEhWNO+qtfILx8/ceBEGTNMp4e/JjZ2FbrSNplM+9bH5k7
ywqP8lW+mBEw0fmBBkYoVG/xyesiiBb55JLnbi8Ew+7IUxw8a3iV7wftRi62lHcK
zAjsHe4L+MansgtZsCL8wluvIPaktAdtB4xr7l9VNLKRYRUG73jEWU0gcUNryHIL
BkQJ52pUS1PkClyAsWbBBHl1I/CvzVPd21VW0YELmLR4OywKy1c+eKw2bcYgjrb4
59HZSv6S6EoKaQC+2qvVNpePil7cdfg5V2ubH/ki9HoYVyoxDptEWHnvf0NNatIH
Y7mNcOPvhOksJmnKSyHbDjtRur7WoHIlC9D7UjEFkSBWsKPjxJHoidN4SnCMRtjQ
WKQU0seoaKj04b68Bs/Qm9NozVgnsPFIUDZeLMikLFX2Jt7YSPu+Jmi2s4re6WLh
TmJQ3Ly9t3o3/weHSzmn
=Ox0s
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"The bulk of this has been in multiple -next releases. There were a few
late breaking fixes and small features that got added in the last
couple days, but the whole set has received a build success
notification from the kbuild robot.
Change summary:
- Region media error reporting: A libnvdimm region device is the
parent to one or more namespaces. To date, media errors have been
reported via the "badblocks" attribute attached to pmem block
devices for namespaces in "raw" or "memory" mode. Given that
namespaces can be in "device-dax" or "btt-sector" mode this new
interface reports media errors generically, i.e. independent of
namespace modes or state.
This subsequently allows userspace tooling to craft "ACPI 6.1
Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
requests and submit them via the ioctl path for NVDIMM root bus
devices.
- Introduce 'struct dax_device' and 'struct dax_operations': Prompted
by a request from Linus and feedback from Christoph this allows for
dax capable drivers to publish their own custom dax operations.
This fixes the broken assumption that all dax operations are
related to a persistent memory device, and makes it easier for
other architectures and platforms to add customized persistent
memory support.
- 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
available for storage appliance applications to manually trigger
memory controllers to drain write-pending buffers that would
otherwise be flushed automatically by the platform ADR
(asynchronous-DRAM-refresh) mechanism at a power loss event.
Support for "locked" DIMMs is included to prevent namespaces from
surfacing when the namespace label data area is locked. Finally,
fixes for various reported deadlocks and crashes, also tagged for
-stable.
- ACPI / nfit driver updates: General updates of the nfit driver to
add DSM command overrides, ACPI 6.1 health state flags support, DSM
payload debug available by default, and various fixes.
Acknowledgements that came after the branch was pushed:
- commmit 565851c972 "device-dax: fix sysfs attribute deadlock":
Tested-by: Yi Zhang <yizhan@redhat.com>
- commit 23f4984483 "libnvdimm: rework region badblocks clearing"
Tested-by: Toshi Kani <toshi.kani@hpe.com>"
* tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
libnvdimm, pfn: fix 'npfns' vs section alignment
libnvdimm: handle locked label storage areas
libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
brd: fix uninitialized use of brd->dax_dev
block, dax: use correct format string in bdev_dax_supported
device-dax: fix sysfs attribute deadlock
libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
libnvdimm: rework region badblocks clearing
acpi, nfit: kill ACPI_NFIT_DEBUG
libnvdimm: fix clear length of nvdimm_forget_poison()
libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
libnvdimm, region: sysfs trigger for nvdimm_flush()
libnvdimm: fix phys_addr for nvdimm_clear_poison
x86, dax, pmem: remove indirection around memcpy_from_pmem()
block: remove block_device_operations ->direct_access()
block, dax: convert bdev_dax_supported() to dax_direct_access()
filesystem-dax: convert to dax_direct_access()
Revert "block: use DAX for partition table reads"
ext2, ext4, xfs: retrieve dax_device for iomap operations
...
Support disabling automatic exclusive lock transfers to allow users
to be in charge of which node should own the lock while being able to
reuse exclusive lock's built-in blacklist/break-lock functionality.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Right now it's just 0, but "no automatic exclusive lock transfers" mode
code will need -EROFS.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Currently the exclusive lock is acquired only if the mapping is
writable, i.e. an image HEAD mapped in rw mode. This means that we
don't acquire the lock for executing a read from a snapshot or an image
HEAD mapped in ro mode, even if lock_on_read is set. This is somewhat
weird and inconsistent with "no automatic exclusive lock transfers"
mode, where the lock is acquired unconditionally.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
As we no longer release the lock before potentially raising BLACKLISTED
in rbd_reregister_watch(), the "either locked or blacklisted" assert in
rbd_queue_workfn() needs to go: we can be both locked and blacklisted
at that point now.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
In preparation for supporting set_cookie method (or rather set_cookie
fallback for older OSDs), store the lock cookie on lock and use it on
unlock instead of recalculating from rbd_dev->watch_cookie.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Currently the lock_state is set to UNLOCKED (preventing further I/O),
but RELEASED_LOCK notification isn't sent. Be consistent with userspace
and treat ceph_cls_unlock() errors as the image is unlocked.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
add_disk() takes an extra reference on disk->queue, which is put in
put_disk() -> disk_release(). Avoiding blk_cleanup_queue() (which also
puts the queue) until add_disk() sets GENHD_FL_UP works for the queue
itself, but leaks various queue internals. Conditioning tag_set freeing
on GENHD_FL_UP is wrong too: all error paths after rbd_init_disk() leak
the tag_set.
Move the final "announce" steps out of rbd_dev_device_setup() so that
it can be unwound like any other function. Leave "announce" steps to
do_rbd_add/remove().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
rbd_dev->disk tear down vs rbd_watch_cb() race shouldn't be a problem
anymore thanks to EXISTS and REMOVING checks in rbd_dev_update_size().
A similar race could occur on "rbd map", see commit 811c668877
("rbd: fix rbd map vs notify races").
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
No reason to hide CephFS-specific features in the rbd case. Recent
feature bits mix RADOS and CephFS-specific stuff together anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In page_same_filled function, all elements in the page is compared with
next index value. The current comparison routine compares the (i)th and
(i+1)th values of the page.
In this case, two load operaions occur for each comparison. But if we
store first value of the page stores at 'val' variable and using it to
compare with others, the load opearation is reduced. It reduce load
operation per page by up to 64times.
Link: http://lkml.kernel.org/r/1488428104-7257-1-git-send-email-sangwoo2.park@lge.com
Signed-off-by: Sangwoo Park <sangwoo2.park@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The zram_free_page already handles NULL handle case and same page so use
it to reduce error probability. (Acutaully, I made a mistake when I
handled same page feature)
Link: http://lkml.kernel.org/r/1492052365-16169-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With element, sometime I got confused handle and element access. It
might be my bad but I think it's time to introduce accessor to prevent
future idiot like me. This patch is just clean-up patch so it shouldn't
change any behavior.
Link: http://lkml.kernel.org/r/1492052365-16169-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With this clean-up phase, I want to use zram's wrapper function to lock
table access which is more consistent with other zram's functions.
Link: http://lkml.kernel.org/r/1492052365-16169-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For architecture(PAGE_SIZE > 4K), zram have supported partial IO.
However, the mixed code for handling normal/partial IO is too mess,
error-prone to modify IO handler functions with upcoming feature so this
patch aims for cleaning up zram's IO handling functions.
Link: http://lkml.kernel.org/r/1492052365-16169-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "zram clean up", v2.
This patchset aims to clean up zram .
[1] clean up multiple pages's bvec handling.
[2] clean up partial IO handling
[3-6] clean up zram via using accessor and removing pointless structure.
With [2-6] applied, we can get a few hundred bytes as well as huge
readibility enhance.
x86: 708 byte save
add/remove: 1/1 grow/shrink: 0/11 up/down: 478/-1186 (-708)
function old new delta
zram_special_page_read - 478 +478
zram_reset_device 317 314 -3
mem_used_max_store 131 128 -3
compact_store 96 93 -3
mm_stat_show 203 197 -6
zram_add 719 712 -7
zram_slot_free_notify 229 214 -15
zram_make_request 819 803 -16
zram_meta_free 128 111 -17
zram_free_page 180 151 -29
disksize_store 432 361 -71
zram_decompress_page.isra 504 - -504
zram_bvec_rw 2592 2080 -512
Total: Before=25350773, After=25350065, chg -0.00%
ppc64: 231 byte save
add/remove: 2/0 grow/shrink: 1/9 up/down: 681/-912 (-231)
function old new delta
zram_special_page_read - 480 +480
zram_slot_lock - 200 +200
vermagic 39 40 +1
mm_stat_show 256 248 -8
zram_meta_free 200 184 -16
zram_add 944 912 -32
zram_free_page 348 308 -40
disksize_store 572 492 -80
zram_decompress_page 664 564 -100
zram_slot_free_notify 292 160 -132
zram_make_request 1132 1000 -132
zram_bvec_rw 2768 2396 -372
Total: Before=17565825, After=17565594, chg -0.00%
This patch (of 6):
Johannes Thumshirn reported system goes the panic when using NVMe over
Fabrics loopback target with zram.
The reason is zram expects each bvec in bio contains a single page
but nvme can attach a huge bulk of pages attached to the bio's bvec
so that zram's index arithmetic could be wrong so that out-of-bound
access makes system panic.
[1] in mainline solved solved the problem by limiting max_sectors with
SECTORS_PER_PAGE but it makes zram slow because bio should split with
each pages so this patch makes zram aware of multiple pages in a bvec
so it could solve without any regression(ie, bio split).
[1] 0bc315381f, zram: set physical queue limits to avoid array out of
bounds accesses
Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 1647b9b9 "brd: add dax_operations support" introduced the allocation
and freeing of a dax_device, but the allocated dax_device is not stored
into the brd_device, so brd_del_one() will eventually operate on an
uninitialized brd->dax_dev.
Fix this by storing the allocated dax_device to brd->dax_dev.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Get rid of the private end_io handlers and data, and just use the
regular block IO path for these requests. This removes a lot of
redundant code.
Signed-off-by: Jens Axboe <axboe@fb.com>
We don't decode the internal tag to the proper group or tag
indx. This works fine because we have hard wired it as 0 for now,
but could break if we get rid of that.
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull networking updates from David Millar:
"Here are some highlights from the 2065 networking commits that
happened this development cycle:
1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)
2) Add a generic XDP driver, so that anyone can test XDP even if they
lack a networking device whose driver has explicit XDP support
(me).
3) Sparc64 now has an eBPF JIT too (me)
4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
Starovoitov)
5) Make netfitler network namespace teardown less expensive (Florian
Westphal)
6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)
7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)
8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)
9) Multiqueue support in stmmac driver (Joao Pinto)
10) Remove TCP timewait recycling, it never really could possibly work
well in the real world and timestamp randomization really zaps any
hint of usability this feature had (Soheil Hassas Yeganeh)
11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
Aleksandrov)
12) Add socket busy poll support to epoll (Sridhar Samudrala)
13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
and several others)
14) IPSEC hw offload infrastructure (Steffen Klassert)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
tipc: refactor function tipc_sk_recv_stream()
tipc: refactor function tipc_sk_recvmsg()
net: thunderx: Optimize page recycling for XDP
net: thunderx: Support for XDP header adjustment
net: thunderx: Add support for XDP_TX
net: thunderx: Add support for XDP_DROP
net: thunderx: Add basic XDP support
net: thunderx: Cleanup receive buffer allocation
net: thunderx: Optimize CQE_TX handling
net: thunderx: Optimize RBDR descriptor handling
net: thunderx: Support for page recycling
ipx: call ipxitf_put() in ioctl error path
net: sched: add helpers to handle extended actions
qed*: Fix issues in the ptp filter config implementation.
qede: Fix concurrency issue in PTP Tx path processing.
stmmac: Add support for SIMATIC IOT2000 platform
net: hns: fix ethtool_get_strings overflow in hns driver
tcp: fix wraparound issue in tcp_lp
bpf, arm64: fix jit branch offset related to ldimm64
bpf, arm64: implement jiting of BPF_XADD
...
We are going to add more parameters to find_vqs, let's wrap the call so
we don't need to tweak all drivers every time.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Remove the request_idx parameter, which can't be used safely now that we
support I/O schedulers with blk-mq. Except for a superflous check in
mtip32xx it was unused anyway.
Also pass the tag_set instead of just the driver data - this allows drivers
to avoid some code duplication in a follow on cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This reverts commit 4981d04dd8.
The driver has been converted to using the proper infrastructure
for issuing internal commands. This means it's now safe to use with
the scheduling infrastruture, so we can now revert the change
that turned off scheduling for mtip32xx.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver special cases certain things for command issue, depending
on whether it's an internal command or not. Make the internal commands
use the regular infrastructure for issuing IO.
Since this is an 8-group souped up AHCI variant, we have to deal
with NCQ vs non-queueable commands. Do this from the queue_rq
handler, by backing off unless the drive is idle.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is a prep patch for backoff in ->queue_rq() for non-ncq commands.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
All callers now pass in GFP_KERNEL, get rid of the argument.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
All callers can safely block. Kill the atomic/block argument, and
remove the argument from all callers.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block layer updates from Jens Axboe:
- Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
was initially a fork of CFQ, but subsequently changed to implement
fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
to be used on desktop type single drives, providing good fairness.
From Paolo.
- Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
using a scalable token based algorithm that throttles IO based on
live completion IO stats, similary to blk-wbt. From Omar.
- A series from Jan, moving users to separately allocated backing
devices. This continues the work of separating backing device life
times, solving various problems with hot removal.
- A series of updates for lightnvm, mostly from Javier. Includes a
'pblk' target that exposes an open channel SSD as a physical block
device.
- A series of fixes and improvements for nbd from Josef.
- A series from Omar, removing queue sharing between devices on mostly
legacy drivers. This helps us clean up other bits, if we know that a
queue only has a single device backing. This has been overdue for
more than a decade.
- Fixes for the blk-stats, and improvements to unify the stats and user
windows. This both improves blk-wbt, and enables other users to
register a need to receive IO stats for a device. From Omar.
- blk-throttle improvements from Shaohua. This provides a scalable
framework for implementing scalable priotization - particularly for
blk-mq, but applicable to any type of block device. The interface is
marked experimental for now.
- Bucketized IO stats for IO polling from Stephen Bates. This improves
efficiency of polled workloads in the presence of mixed block size
IO.
- A few fixes for opal, from Scott.
- A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
From a variety of folks, mostly Sagi and James Smart.
- A series from Bart, improving our exposed info and capabilities from
the blk-mq debugfs support.
- A series from Christoph, cleaning up how handle WRITE_ZEROES.
- A series from Christoph, cleaning up the block layer handling of how
we track errors in a request. On top of being a nice cleanup, it also
shrinks the size of struct request a bit.
- Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
never used by platforms, and the latter has outlived it's usefulness.
- Various little bug fixes and cleanups from a wide variety of folks.
* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
block: hide badblocks attribute by default
blk-mq: unify hctx delay_work and run_work
block: add kblock_mod_delayed_work_on()
blk-mq: unify hctx delayed_run_work and run_work
nbd: fix use after free on module unload
MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
blk-mq-sched: alloate reserved tags out of normal pool
mtip32xx: use runtime tag to initialize command header
scsi: Implement blk_mq_ops.show_rq()
blk-mq: Add blk_mq_ops.show_rq()
blk-mq: Show operation, cmd_flags and rq_flags names
blk-mq: Make blk_flags_show() callers append a newline character
blk-mq: Move the "state" debugfs attribute one level down
blk-mq: Unregister debugfs attributes earlier
blk-mq: Only unregister hctxs for which registration succeeded
blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
blk-mq: Let blk_mq_debugfs_register() look up the queue name
blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
ide-pm: always pass 0 error to __blk_end_request_all
..
list_for_each_entry() isn't super safe if we're freeing the objects
while we traverse the list. Also don't bother taking the extra
reference, the module refcounting stuff will save us from having anybody
messing with the device while we're trying to unload.
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
mtip32xx supposes that 'request_idx' passed to .init_request()
is tag of the request, and use that as request's tag to initialize
command header.
After MQ IO scheduler is in, request tag assigned isn't same with
the request index anymore, so cause strange hardware failure on
mtip32xx, even whole system panic is triggered.
This patch fixes the issue by initializing command header via
request's real tag.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that all the producers and consumers of dax interfaces have been
converted to using dax_operations on a dax_device, remove the block
device direct_access enabling.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Fixes: 97b50a654d ("virtio_blk: make SCSI passthrough support configurable")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Both conflict were simple overlapping changes.
In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the
conversion of the driver in net-next to use in-netdev stats.
Signed-off-by: David S. Miller <davem@davemloft.net>
We need to get the command payload from the request before
we attempt to dereference it.
Fixes: 4dda4735c5 ("mtip32xx: add a status field to struct mtip_cmd")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
I lack the basic understanding of what segments mean, so we were being
limited to 512kib requests even with higher max_sectors sizes set.
Setting the maximum number of segments to unlimited allows us to
actually have arbitrarily large IO's go through NBD.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
xen-blkfron is the last users using rq->errros for passing back error to
blk-mq, and I'd like to get rid of that. In the longer run the driver
should be moving more of the completion processing into .complete, but
this is the minimal change to move forward for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of using req->errors, which will go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Add a nbd-specific field instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In thruth I've just audited which blk-mq drivers don't currently have a
complete callback, but I think this change is at least borderline useful.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This passes on the scsi_cmnd result field to users of passthrough
requests. Currently we abuse req->errors for this purpose, but that
field will go away in its current form.
Note that the old IDE code abuses the errors field in very creative
ways and stores all kinds of different values in it. I didn't dare
to touch this magic, so the abuses are brought forward 1:1.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove passing req->errors (which at that point is always 0) to
blk_mq_complete_request, and rely on the virtio status code for the
serial number passthrough request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The function only returns -EIO if rq->errors is non-zero, which is not
very useful and lets a large number of callers ignore the return value.
Just let the callers figure out their error themselves.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver never sets req->errors, so blk_execute_rq will always return 0.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch changes the behavior of the null_blk driver for the
LightNVM mode as follows:
* REQ_FAILFAST_MASK is set for read-ahead requests.
* If no I/O priority has been set in the bio, the I/O priority is
copied from the I/O context.
* The rq_disk member is initialized if bio->bi_bdev != NULL.
* req->errors is initialized to zero.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Matias Bjørling <m@bjorling.me>
Cc: Adam Manzanares <adam.manzanares@wdc.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Setup a dax_inode to have the same lifetime as the brd block device and
add a ->direct_access() method that is equivalent to
brd_direct_access(). Once fs/dax.c has been converted to use
dax_operations the old brd_direct_access() will be removed.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The recent introduced MQ IO scheduler breaks mtip32xx in the
following way.
mtip32xx use the 'request_index' passed to .init_request() as
hardware tag index for initializing hardware queue, and it
actually require that rq->tag is always same with 'request_index'
passed to .init_request(). Current blk-mq IO scheduler can't
guarantee this point, so this patch passes BLK_MQ_F_NO_SCHED
and at least make mtip32xx working.
This patch fixes the following strange hardware failure. The
issue can be triggered easily when doing I/O with mq-deadline
enabled.
[ 186.972578] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 32993
[ 186.972578] {1}[Hardware Error]: event severity: fatal
[ 186.972579] {1}[Hardware Error]: Error 0, type: fatal
[ 186.972580] {1}[Hardware Error]: section_type: PCIe error
[ 186.972580] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 186.972581] {1}[Hardware Error]: version: 1.0
[ 186.972581] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 186.972582] {1}[Hardware Error]: device_id: 0000:07:00.0
[ 186.972582] {1}[Hardware Error]: slot: 4
[ 186.972583] {1}[Hardware Error]: secondary_bus: 0x00
[ 186.972583] {1}[Hardware Error]: vendor_id: 0x1344, device_id: 0x5150
[ 186.972584] {1}[Hardware Error]: class_code: 008001
[ 186.972585] Kernel panic - not syncing: Fatal hardware error!
Reported-by: Jozef Mikovic <jmikovic@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was just a proof of concept user for the SCSI OSD library, and
never had any real users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
NBD doesn't care about limiting the segment size, let the user push the
largest bio's they want. This allows us to control the request size
solely through max_sectors_kb.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When a blkfront device is resized from dom0, emit a KOBJ_CHANGE uevent to
notify the guest about the change. This allows for custom udev rules, such
as automatically resizing a filesystem, when an event occurs.
With this patch you get these udev
KERNEL[577.206230] change /devices/vbd-51728/block/xvdb (block)
UDEV [577.226218] change /devices/vbd-51728/block/xvdb (block)
Signed-off-by: Marc Olson <marcolso@amazon.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
For ease of management it would be nice for users to specify that the
device node for a nbd device is destroyed once it is disconnected and
there are no more users. Add a client flag and enable this operation to
happen.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to support deleting the device on disconnect we need to
refcount the actual nbd_device struct. So add the refcounting framework
and change how we free the normal devices at rmmod time so we can catch
reference leaks.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Allow users to query the status of existing nbd devices. Right now this
only returns whether or not the device is connected, but could be
extended in the future to include more information.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Sometimes we like to upgrade our server without making all of our
clients freak out and reconnect. This patch provides a way to specify a
dead connection timeout to allow us to pause all requests and wait for
new connections to be opened. With this in place I can take down the
nbd server for less than the dead connection timeout time and bring it
back up and everything resumes gracefully.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When running a disconnect torture test I noticed that sometimes we would
crash with a negative ref count on our queue. This was because we were
ending the same request twice. Turns out we were racing with
NBD_CLEAR_SOCK clearing the requests as well as the teardown of the
device clearing the requests. So instead make the ioctl only shutdown
the sockets and make it so that we only ever run nbd_clear_que from the
device teardown.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Provide a mechanism to notify userspace that there's been a link problem
on a NBD device. This will allow userspace to re-establish a connection
and provide the new socket to the device without disrupting the device.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We want to be able to reconnect dead connections to existing block
devices, so add a reconfigure netlink command. We will also allow users
to change their timeout on the fly, but everything else will require a
disconnect and reconnect. You won't be able to add more connections
either, simply replace dead connections with new more lively
connections.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The existing ioctl interface for configuring NBD devices is a bit
cumbersome and hard to extend. The other problem is we leave a
userspace app sitting in it's syscall until the device disconnects,
which is less than ideal.
This patch introduces a netlink interface for adding and disconnecting
nbd devices. This has the benefits of being easily extendable without
breaking older userspace applications, and allows us to configure a nbd
device without leaving a userspace app sitting waiting for the device to
disconnect.
With this interface we also gain the ability to configure more devices
than are preallocated at insmod time. We also have gained the ability
to not specify a particular device and be provided one for us so that
userspace doesn't need to find a free device to configure.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In preparation for the upcoming netlink interface we need to not rely on
already having the bdev for the NBD device we are doing operations on.
Instead of passing the bdev around, just use it in places where we know
we already have the bdev.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In order to properly refcount the various aspects of a NBD device we
need to separate out the configuration elements of the nbd device. The
configuration of a NBD device has a different lifetime from the actual
device, so it doesn't make sense to bundle these two concepts. Add a
config_refs to keep track of the configuration structure, that way we
can be sure that we never access it when we've torn down the device.
Add a new nbd_config structure to hold all of the transient
configuration information. Finally create this when we open the device
so that it is in place when we start to configure the device. This has
a nice side-effect of fixing a long standing problem where you could end
up with a half-configured nbd device that needed to be "disconnected" in
order to be usable again. Now once we close our device the
configuration will be discarded.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently if we have multiple connections and one of them goes down we will tear
down the whole device. However there's no reason we need to do this as we
could have other connections that are working fine. Deal with this by keeping
track of the state of the different connections, and if we lose one we mark it
as dead and send all IO destined for that socket to one of the other healthy
sockets. Any outstanding requests that were on the dead socket will timeout and
be re-submitted properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When adding a new socket we look it up and then try to add it to our
configuration. If any of those steps fail we need to make sure we put
the socket so we don't leak them.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts were simply overlapping changes. In the net/ipv4/route.c
case the code had simply moved around a little bit and the same fix
was made in both 'net' and 'net-next'.
In the net/sched/sch_generic.c case a fix in 'net' happened at
the same time that a new argument was added to qdisc_hash_add().
Signed-off-by: David S. Miller <davem@davemloft.net>
This drivers was added in 2008, but as far as a I can tell we never had a
single platform that actually registered resources for the platform driver.
It's also been unmaintained for a long time and apparently has a ATA mode
that can be driven using the IDE/libata subsystem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The copy_page is optimized memcpy for page-alinged address. If it is
used with non-page aligned address, it can corrupt memory which means
system corruption. With zram, it can happen with
1. 64K architecture
2. partial IO
3. slub debug
Partial IO need to allocate a page and zram allocates it via kmalloc.
With slub debug, kmalloc(PAGE_SIZE) doesn't return page-size aligned
address. And finally, copy_page(mem, cmem) corrupts memory.
So, this patch changes it to memcpy.
Actuaully, we don't need to change zram_bvec_write part because zsmalloc
returns page-aligned address in case of PAGE_SIZE class but it's not
good to rely on the internal of zsmalloc.
Note:
When this patch is merged to stable, clear_page should be fixed, too.
Unfortunately, recent zram removes it by "same page merge" feature so
it's hard to backport this patch to -stable tree.
I will handle it when I receive the mail from stable tree maintainer to
merge this patch to backport.
Fixes: 42e99bd ("zram: optimize memory operations with clear_page()/copy_page()")
Link: http://lkml.kernel.org/r/1492042622-12074-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In zram_rw_page, the logic to get offset is wrong by operator precedence
(i.e., "<<" is higher than "&"). With wrong offset, zram can corrupt
the user's data. This patch fixes it.
Fixes: 8c7f01025 ("zram: implement rw_page operation of zram")
Link: http://lkml.kernel.org/r/1492042622-12074-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not safe for one thread to modify the ->flags
of another thread as there is no locking that can protect
the update.
So tsk_restore_flags(), which takes a task pointer and modifies
the flags, is an invitation to do the wrong thing.
All current users pass "current" as the task, so no developers have
accepted that invitation. It would be best to ensure it remains
that way.
So rename tsk_restore_flags() to current_restore_flags() and don't
pass in a task_struct pointer. Always operate on current->flags.
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It seems like DRBD assumes its on the wire TRIM request always zeroes data.
Use that fact to implement REQ_OP_WRITE_ZEROES.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
rsxx only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
rbd only supports discarding on large alignments, so the zeroing code
would always fall back to explicit writings of zeroes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's just a in-driver reimplementation of writing zeroes to the pages,
which fails if the discards aren't page aligned.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It's identical to discard as hole punches will always leave us with
zeroes on reads.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Just the same as discard if the block size equals the system page size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
Signed-off-by: Jens Axboe <axboe@fb.com>
This driver is for pre-IDE hardisk that are only found in PC from the
stoneage of personal computing, and which we don't support elsewhere
in the kernel these days.
It's also been marked broken forever.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Constify all instances of blk_mq_ops, as they are never modified.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a new module parameter to null_blk, blocking. If set, null_blk
will set the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
needs to block in its ->queue_rq() function. The intent is to help find
regressions in blocking drivers, since not many of them exist.
If null_blk is loaded with submit_queues > 1 and blocking=1, this
shows the regression recently fixed by bf4907c05e.
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().
So rename it and we can use this function in both
paths.
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
URLs to ftp.kernel.org are still exist though the service is closed [0].
This commit fixes the URLs to use www.kernel.org instead.
[0] https://www.kernel.org/shutting-down-ftp-services.html
Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
When a filesystem is mounted on a nbd device and on a disconnect, because
of kill_bdev(), and resetting bdev size to zero, buffer_head mappings are
getting destroyed under mounted filesystem.
After a bdev size reset(i.e bdev->bd_inode->i_size = 0) on a disconnect,
followed by a sys_umount(),
generic_shutdown_super()->...
->__sync_blockdev()->...
-blkdev_writepages()->...
->do_invalidatepage()->...
-discard_buffer() is discarding superblock buffer_head assumed
to be in mapped state by ext4_commit_super().
[mlin: ported to 4.11-rc2]
Signed-off-by: Ratna Manoj Bolla <manoj.br@gmail.com
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can't just set the timeout on the tagset, we have to set it on the
queue as it would have been setup already at this point.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We've been relying on the block layer to assume rq->errors being set
translates into -EIO. I noticed in testing that sometimes this isn't
true, and really there's not much of a reason to have a counter instead
of just using -EIO. So set it properly so we don't leak random numbers
to unsuspecting victims.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can submit IO in a processes context, which means there can be
pending signals. This isn't a fatal error for NBD, but it does require
some finesse. If the signal happens before we transmit anything then we
are ok, just requeue the request and carry on. However if we've done a
partial transmit we can't allow anything else to be transmitted on this
socket until we transmit the remaining part of the request. Deal with
this by keeping track of how much we've sent for the current request,
and if we get an ERESTARTSYS during any part of our transmission save
the state of that request and requeue the IO. If anybody tries to
submit a request that isn't our pending request then requeue that
request until we are able to service the one that is pending.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
jewel and later on the server side and all stable kernels, a fixup for
-rc1 CRUSH changes and two usability enhancements: osd_request_timeout
option and supported_features bus attribute.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYwsEIAAoJEEp/3jgCEfOL34sH+wbYyT6uXQ3hlIoRt2FQNh5b
F6qmvH4jYRI+YyjJHgE7lLEv7cq/PESPej2hrw9U7GAso0KEsazOv+qpj4AcW+u1
arXYTIQQa2w9sCuj7/BrbEzDtnNOVnGyD3Ng0wAfvbxg/37xzqumkbccuWJm6GdH
Vjk31G4ZmaOOr38jeo0AkYWgs7kgfthLMFo73TgHTBBO9fkQQQL1xZH5D/Irzf8P
1ytfVyGeTl8D3szdkkOnc4eUFMwJ35wqesL+gAsQntx1/wDnGqa2IabXRs4oqr8F
oT88LXSP8w2PaFKI1FrwOuMov6ngg38tir2SMxGDIQ6TdxtK8lW37Cx3eHavqtE=
=f4Bs
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc2' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- a fix for the recently discovered misdirected requests bug present in
jewel and later on the server side and all stable kernels
- a fixup for -rc1 CRUSH changes
- two usability enhancements: osd_request_timeout option and
supported_features bus attribute.
* tag 'ceph-for-4.11-rc2' of git://github.com/ceph/ceph-client:
libceph: osd_request_timeout option
rbd: supported_features bus attribute
libceph: don't set weight to IN when OSD is destroyed
libceph: fix crush_decode() for older maps
Merge fixes from Andrew Morton:
"26 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits)
userfaultfd: remove wrong comment from userfaultfd_ctx_get()
fat: fix using uninitialized fields of fat_inode/fsinfo_inode
sh: cayman: IDE support fix
kasan: fix races in quarantine_remove_cache()
kasan: resched in quarantine_remove_cache()
mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
thp: fix another corner case of munlock() vs. THPs
rmap: fix NULL-pointer dereference on THP munlocking
mm/memblock.c: fix memblock_next_valid_pfn()
userfaultfd: selftest: vm: allow to build in vm/ directory
userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
userfaultfd: non-cooperative: fix fork fctx->new memleak
mm/cgroup: avoid panic when init with low memory
drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
mm/vmstats: add thp_split_pud event for clarity
include/linux/fs.h: fix unsigned enum warning with gcc-4.2
userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
userfaultfd: non-cooperative: robustness check
userfaultfd: non-cooperative: rollback userfaultfd_exit
x86, mm: unify exit paths in gup_pte_range()
...
Fix typos and add the following to the scripts/spelling.txt:
overide||override
While we are here, fix the doubled "address" in the touched line
Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.
Also, fix the comment block style in the touched hunks in
drivers/media/dvb-frontends/drx39xyj/drx_driver.h.
Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
the NVMe over Fabrics loopback target which potentially sends a huge bulk of
pages attached to the bio's bvec this results in a kernel panic because of
array out of bounds accesses in zram_decompress_page().
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
... so that userspace can generate meaningful error messages and spell
out unsupported features that need to be disabled.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
This is the set of stuff that didn't quite make the initial pull and a
set of fixes for stuff which did. The new stuff is basically lpfc
(nvme), qedi and aacraid. The fixes cover a lot of previously
submitted stuff, the most important of which probably covers some of
the failing irq vectors allocation and other fallout from having the
SCSI command allocated as part of the block allocation functions.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYug/3AAoJEAVr7HOZEZN4bRwP/RvAW/NWlvs812gUeHNahUHx
HF3EtY5YHneZev1L4fa1Myd8cNcB3gic53e1WvQhN5f+LczeI7iXHH8xIQK/B4GT
OI5W7ZCGlhIamgsU8tUFGrtIOM6N/LIMZPMsaE62dcev0jC0Db4p/5yv5bUg1yuL
JOLWqIaTYTx7b5FwR2cXKv7KGYpXonvcoVUDV6s8aMYToJhJiQ6JtWmjaAA+/OJ7
HTcqNX/KSWQBG3ERjE3A/rGBFxPR9KPq8kpqVJlw39KEN9tMjXb0QAGvAeauNye2
xaIjV41tXH1Ys3aoh6ZU+TpzN3HLlH/gegRe8671fUwqu/RnSzFC8Eidoddzgqne
z5BF0Dy36FA5ol2HTFEwS5WM1gRM+z6YGnbhpE/XfyTL2sLHRynSXKxb02IdjRNi
gCdV89AL8xdfPPFJjsx4hGWUJhalW/RwuPayAupePKGQHePabHa19McI2ssg0WBg
OC6uuEk7vT95xuTZVJlrwXTJPnc9dxWJwzh21oEnvMfWM0VMW06EKT4AcX8FOsjL
RswiYp9wl8JDmfNxyJnQJj0+QXeIE/gh35vBQCGsLVKY4+xzbi3yZfb1sZB0XOum
yLK2LFQ+Ltk1TaPPSZsgXQZfodot0dFJPTnWr+whOfY1kepNlFRwQHltXnCWzgRs
QG//WHsk0u1Mj2gehwmQ
=e+6v
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"This is the set of stuff that didn't quite make the initial pull and a
set of fixes for stuff which did.
The new stuff is basically lpfc (nvme), qedi and aacraid. The fixes
cover a lot of previously submitted stuff, the most important of which
probably covers some of the failing irq vectors allocation and other
fallout from having the SCSI command allocated as part of the block
allocation functions"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (59 commits)
scsi: qedi: Fix memory leak in tmf response processing.
scsi: aacraid: remove redundant zero check on ret
scsi: lpfc: use proper format string for dma_addr_t
scsi: lpfc: use div_u64 for 64-bit division
scsi: mac_scsi: Fix MAC_SCSI=m option when SCSI=m
scsi: cciss: correct check map error.
scsi: qla2xxx: fix spelling mistake: "seperator" -> "separator"
scsi: aacraid: Fixed expander hotplug for SMART family
scsi: mpt3sas: switch to pci_alloc_irq_vectors
scsi: qedf: fixup compilation warning about atomic_t usage
scsi: remove scsi_execute_req_flags
scsi: merge __scsi_execute into scsi_execute
scsi: simplify scsi_execute_req_flags
scsi: make the sense header argument to scsi_test_unit_ready mandatory
scsi: sd: improve TUR handling in sd_check_events
scsi: always zero sshdr in scsi_normalize_sense
scsi: scsi_dh_emc: return success in clariion_std_inquiry()
scsi: fix memory leak of sdpk on when gd fails to allocate
scsi: sd: make sd_devt_release() static
scsi: qedf: Add QLogic FastLinQ offload FCoE driver framework.
...
Pull vfs 'statx()' update from Al Viro.
This adds the new extended stat() interface that internally subsumes our
previous stat interfaces, and allows user mode to specify in more detail
what kind of information it wants.
It also allows for some explicit synchronization information to be
passed to the filesystem, which can be relevant for network filesystems:
is the cached value ok, or do you need open/close consistency, or what?
From David Howells.
Andreas Dilger points out that the first version of the extended statx
interface was posted June 29, 2010:
https://www.spinics.net/lists/linux-fsdevel/msg33831.html
* 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
statx: Add a system call to make enhanced file info available
Pull block layer fixes from Jens Axboe:
"A collection of fixes for this merge window, either fixes for existing
issues, or parts that were waiting for acks to come in. This pull
request contains:
- Allocation of nvme queues on the right node from Shaohua.
This was ready long before the merge window, but waiting on an ack
from Bjorn on the PCI bit. Now that we have that, the three patches
can go in.
- Two fixes for blk-mq-sched with nvmeof, which uses hctx specific
request allocations. This caused an oops. One part from Sagi, one
part from Omar.
- A loop partition scan deadlock fix from Omar, fixing a regression
in this merge window.
- A three-patch series from Keith, closing up a hole on clearing out
requests on shutdown/resume.
- A stable fix for nbd from Josef, fixing a leak of sockets.
- Two fixes for a regression in this window from Jan, fixing a
problem with one of his earlier patches dealing with queue vs bdi
life times.
- A fix for a regression with virtio-blk, causing an IO stall if
scheduling is used. From me.
- A fix for an io context lock ordering problem. From me"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: Move bdi_unregister() to del_gendisk()
blk-mq: ensure that bd->last is always set correctly
block: don't call ioc_exit_icq() with the queue lock held for blk-mq
block: Initialize bd_bdi on inode initialization
loop: fix LO_FLAGS_PARTSCAN hang
nvme: Complete all stuck requests
blk-mq: Provide freeze queue timeout
blk-mq: Export blk_mq_freeze_queue_wait
nbd: stop leaking sockets
blk-mq: move update of tags->rqs to __blk_mq_alloc_request()
blk-mq: kill blk_mq_set_alloc_data()
blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request
blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset
nvme: allocate nvme_queue in correct node
PCI: add an API to get node from vector
blk-mq: allocate blk_mq_tags and requests in correct node
Pull sched.h split-up from Ingo Molnar:
"The point of these changes is to significantly reduce the
<linux/sched.h> header footprint, to speed up the kernel build and to
have a cleaner header structure.
After these changes the new <linux/sched.h>'s typical preprocessed
size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K
lines), which is around 40% faster to build on typical configs.
Not much changed from the last version (-v2) posted three weeks ago: I
eliminated quirks, backmerged fixes plus I rebased it to an upstream
SHA1 from yesterday that includes most changes queued up in -next plus
all sched.h changes that were pending from Andrew.
I've re-tested the series both on x86 and on cross-arch defconfigs,
and did a bisectability test at a number of random points.
I tried to test as many build configurations as possible, but some
build breakage is probably still left - but it should be mostly
limited to architectures that have no cross-compiler binaries
available on kernel.org, and non-default configurations"
* 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits)
sched/headers: Clean up <linux/sched.h>
sched/headers: Remove #ifdefs from <linux/sched.h>
sched/headers: Remove the <linux/topology.h> include from <linux/sched.h>
sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h>
sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h>
sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h>
sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/init.h>
sched/core: Remove unused prefetch_stack()
sched/headers: Remove <linux/rculist.h> from <linux/sched.h>
sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h>
sched/headers: Remove <linux/signal.h> from <linux/sched.h>
sched/headers: Remove <linux/rwsem.h> from <linux/sched.h>
sched/headers: Remove the runqueue_is_locked() prototype
sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h>
sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h>
sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h>
sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h>
...
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.
The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode. This change is propagated to the vfs_getattr*()
function.
Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.
========
OVERVIEW
========
The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.
A number of requests were gathered for features to be included. The
following have been included:
(1) Make the fields a consistent size on all arches and make them large.
(2) Spare space, request flags and information flags are provided for
future expansion.
(3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
__s64).
(4) Creation time: The SMB protocol carries the creation time, which could
be exported by Samba, which will in turn help CIFS make use of
FS-Cache as that can be used for coherency data (stx_btime).
This is also specified in NFSv4 as a recommended attribute and could
be exported by NFSD [Steve French].
(5) Lightweight stat: Ask for just those details of interest, and allow a
netfs (such as NFS) to approximate anything not of interest, possibly
without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
Dilger] (AT_STATX_DONT_SYNC).
(6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
its cached attributes are up to date [Trond Myklebust]
(AT_STATX_FORCE_SYNC).
And the following have been left out for future extension:
(7) Data version number: Could be used by userspace NFS servers [Aneesh
Kumar].
Can also be used to modify fill_post_wcc() in NFSD which retrieves
i_version directly, but has just called vfs_getattr(). It could get
it from the kstat struct if it used vfs_xgetattr() instead.
(There's disagreement on the exact semantics of a single field, since
not all filesystems do this the same way).
(8) BSD stat compatibility: Including more fields from the BSD stat such
as creation time (st_btime) and inode generation number (st_gen)
[Jeremy Allison, Bernd Schubert].
(9) Inode generation number: Useful for FUSE and userspace NFS servers
[Bernd Schubert].
(This was asked for but later deemed unnecessary with the
open-by-handle capability available and caused disagreement as to
whether it's a security hole or not).
(10) Extra coherency data may be useful in making backups [Andreas Dilger].
(No particular data were offered, but things like last backup
timestamp, the data version number and the DOS archive bit would come
into this category).
(11) Allow the filesystem to indicate what it can/cannot provide: A
filesystem can now say it doesn't support a standard stat feature if
that isn't available, so if, for instance, inode numbers or UIDs don't
exist or are fabricated locally...
(This requires a separate system call - I have an fsinfo() call idea
for this).
(12) Store a 16-byte volume ID in the superblock that can be returned in
struct xstat [Steve French].
(Deferred to fsinfo).
(13) Include granularity fields in the time data to indicate the
granularity of each of the times (NFSv4 time_delta) [Steve French].
(Deferred to fsinfo).
(14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
Note that the Linux IOC flags are a mess and filesystems such as Ext4
define flags that aren't in linux/fs.h, so translation in the kernel
may be a necessity (or, possibly, we provide the filesystem type too).
(Some attributes are made available in stx_attributes, but the general
feeling was that the IOC flags were to ext[234]-specific and shouldn't
be exposed through statx this way).
(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
Michael Kerrisk].
(Deferred, probably to fsinfo. Finding out if there's an ACL or
seclabal might require extra filesystem operations).
(16) Femtosecond-resolution timestamps [Dave Chinner].
(A __reserved field has been left in the statx_timestamp struct for
this - if there proves to be a need).
(17) A set multiple attributes syscall to go with this.
===============
NEW SYSTEM CALL
===============
The new system call is:
int ret = statx(int dfd,
const char *filename,
unsigned int flags,
unsigned int mask,
struct statx *buffer);
The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat(). There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.
Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):
(1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
respect.
(2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
its attributes with the server - which might require data writeback to
occur to get the timestamps correct.
(3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
network filesystem. The resulting values should be considered
approximate.
mask is a bitmask indicating the fields in struct statx that are of
interest to the caller. The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat(). It should be noted that asking for
more information may entail extra I/O operations.
buffer points to the destination for the data. This must be 256 bytes in
size.
======================
MAIN ATTRIBUTES RECORD
======================
The following structures are defined in which to return the main attribute
set:
struct statx_timestamp {
__s64 tv_sec;
__s32 tv_nsec;
__s32 __reserved;
};
struct statx {
__u32 stx_mask;
__u32 stx_blksize;
__u64 stx_attributes;
__u32 stx_nlink;
__u32 stx_uid;
__u32 stx_gid;
__u16 stx_mode;
__u16 __spare0[1];
__u64 stx_ino;
__u64 stx_size;
__u64 stx_blocks;
__u64 __spare1[1];
struct statx_timestamp stx_atime;
struct statx_timestamp stx_btime;
struct statx_timestamp stx_ctime;
struct statx_timestamp stx_mtime;
__u32 stx_rdev_major;
__u32 stx_rdev_minor;
__u32 stx_dev_major;
__u32 stx_dev_minor;
__u64 __spare2[14];
};
The defined bits in request_mask and stx_mask are:
STATX_TYPE Want/got stx_mode & S_IFMT
STATX_MODE Want/got stx_mode & ~S_IFMT
STATX_NLINK Want/got stx_nlink
STATX_UID Want/got stx_uid
STATX_GID Want/got stx_gid
STATX_ATIME Want/got stx_atime{,_ns}
STATX_MTIME Want/got stx_mtime{,_ns}
STATX_CTIME Want/got stx_ctime{,_ns}
STATX_INO Want/got stx_ino
STATX_SIZE Want/got stx_size
STATX_BLOCKS Want/got stx_blocks
STATX_BASIC_STATS [The stuff in the normal stat struct]
STATX_BTIME Want/got stx_btime{,_ns}
STATX_ALL [All currently available stuff]
stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.
Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution. Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.
The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does. The following
attributes map to FS_*_FL flags and are the same numerical value:
STATX_ATTR_COMPRESSED File is compressed by the fs
STATX_ATTR_IMMUTABLE File is marked immutable
STATX_ATTR_APPEND File is append-only
STATX_ATTR_NODUMP File is not to be dumped
STATX_ATTR_ENCRYPTED File requires key to decrypt in fs
Within the kernel, the supported flags are listed by:
KSTAT_ATTR_FS_IOC_FLAGS
[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]
New flags include:
STATX_ATTR_AUTOMOUNT Object is an automount trigger
These are for the use of GUI tools that might want to mark files specially,
depending on what they are.
Fields in struct statx come in a number of classes:
(0) stx_dev_*, stx_blksize.
These are local system information and are always available.
(1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
stx_size, stx_blocks.
These will be returned whether the caller asks for them or not. The
corresponding bits in stx_mask will be set to indicate whether they
actually have valid values.
If the caller didn't ask for them, then they may be approximated. For
example, NFS won't waste any time updating them from the server,
unless as a byproduct of updating something requested.
If the values don't actually exist for the underlying object (such as
UID or GID on a DOS file), then the bit won't be set in the stx_mask,
even if the caller asked for the value. In such a case, the returned
value will be a fabrication.
Note that there are instances where the type might not be valid, for
instance Windows reparse points.
(2) stx_rdev_*.
This will be set only if stx_mode indicates we're looking at a
blockdev or a chardev, otherwise will be 0.
(3) stx_btime.
Similar to (1), except this will be set to 0 if it doesn't exist.
=======
TESTING
=======
The following test program can be used to test the statx system call:
samples/statx/test-statx.c
Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.
Here's some example output. Firstly, an NFS directory that crosses to
another FSID. Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.
[root@andromeda ~]# /tmp/test-statx -A /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:26 Inode: 1703937 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
Secondly, the result of automounting on that directory.
[root@andromeda ~]# /tmp/test-statx /warthog/data
statx(/warthog/data) = 0
results=7ff
Size: 4096 Blocks: 8 IO Block: 1048576 directory
Device: 00:27 Inode: 2 Links: 125
Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
Access: 2016-11-24 09:02:12.219699527+0000
Modify: 2016-11-17 10:44:36.225653653+0000
Change: 2016-11-17 10:44:36.225653653+0000
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs pile two from Al Viro:
- orangefs fix
- series of fs/namei.c cleanups from me
- VFS stuff coming from overlayfs tree
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
orangefs: Use RCU for destroy_inode
vfs: use helper for calling f_op->fsync()
mm: use helper for calling f_op->mmap()
vfs: use helpers for calling f_op->{read,write}_iter()
vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
vfs: extract common parts of {compat_,}do_readv_writev()
vfs: wrap write f_ops with file_{start,end}_write()
vfs: deny copy_file_range() for non regular files
vfs: deny fallocate() on directory
vfs: create vfs helper vfs_tmpfile()
namei.c: split unlazy_walk()
namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
lookup_fast(): clean up the logics around the fallback to non-rcu mode
namei: fold unlazy_link() into its sole caller
Pull vfs sendmsg updates from Al Viro:
"More sendmsg work.
This is a fairly separate isolated stuff (there's a continuation
around lustre, but that one was too late to soak in -next), thus the
separate pull request"
* 'work.sendmsg' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ncpfs: switch to sock_sendmsg()
ncpfs: don't mess with manually advancing iovec on send
ncpfs: sendmsg does *not* bugger iovec these days
ceph_tcp_sendpage(): use ITER_BVEC sendmsg
afs_send_pages(): use ITER_BVEC
rds: remove dead code
ceph: switch to sock_recvmsg()
usbip_recv(): switch to sock_recvmsg()
iscsi_target: deal with short writes on the tx side
[nbd] pass iov_iter to nbd_xmit()
[nbd] switch sock_xmit() to sock_{send,recv}msg()
[drbd] use sock_sendmsg()
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJYt1rRAAoJECgfDbjSjVRpEZsIALSHevdXWtRHBZUb0ZkqPLQb
/x2Vn49CcALS1p7iSuP9L027MPeaLKyr0NBT9hptBChp/4b9lnZWyyAo6vYQrzfx
Ia/hLBYsK4ml6lEwbyfLwqkF2cmYCrZhBSVAILifn84lTPoN7CT0PlYDfA+OCaNR
geo75qF8KR+AUO0aqchwMRL3RV3OxZKxQr2AR6LttCuhiBgnV3Xqxffg/M3x6ONM
0ffFFdodm6slem3hIEiGUMwKj4NKQhcOleV+y0fVBzWfLQG9210pZbQyRBRikIL0
7IsaarpaUr7OrLAZFMGF6nJnyRAaRrt6WknTHZkyvyggrePrGcmGgPm4jrODwY4=
=2zwv
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost updates from Michael Tsirkin:
"virtio, vhost: optimizations, fixes
Looks like a quiet cycle for vhost/virtio, just a couple of minor
tweaks. Most notable is automatic interrupt affinity for blk and scsi.
Hopefully other devices are not far behind"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio-console: avoid DMA from stack
vhost: introduce O(1) vq metadata cache
virtio_scsi: use virtio IRQ affinity
virtio_blk: use virtio IRQ affinity
blk-mq: provide a default queue mapping for virtio device
virtio: provide a method to get the IRQ affinity mask for a virtqueue
virtio: allow drivers to request IRQ affinity when creating VQs
virtio_pci: simplify MSI-X setup
virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
virtio_pci: use shared interrupts for virtqueues
virtio_pci: remove struct virtio_pci_vq_info
vhost: try avoiding avail index access when getting descriptor
virtio_mmio: expose header to userspace
loop_reread_partitions() needs to do I/O, but we just froze the queue,
so we end up waiting forever. This can easily be reproduced with losetup
-P. Fix it by moving the reread to after we unfreeze the queue.
Fixes: ecdd09597a ("block/loop: fix race between I/O and set_status")
Reported-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This was introduced in the multi-connection patch, we've been leaking
socket's ever since.
Fixes: 9561a7a ("nbd: add multi-connection support")
cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are going to move scheduler ABI details to <uapi/linux/sched/types.h>,
which will be used from a number of .c files.
Create empty placeholder header that maps to <linux/types.h>.
Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull IDR rewrite from Matthew Wilcox:
"The most significant part of the following is the patch to rewrite the
IDR & IDA to be clients of the radix tree. But there's much more,
including an enhancement of the IDA to be significantly more space
efficient, an IDR & IDA test suite, some improvements to the IDR API
(and driver changes to take advantage of those improvements), several
improvements to the radix tree test suite and RCU annotations.
The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
for most of the last cycle. Coupled with the IDR test suite, I feel
pretty confident that any remaining bugs are quite hard to hit. 0-day
did a great job of watching my git tree and pointing out problems; as
it hit them, I added new test-cases to be sure not to be caught the
same way twice"
Willy goes on to expand a bit on the IDR rewrite rationale:
"The radix tree and the IDR use very similar data structures.
Merging the two codebases lets us share the memory allocation pools,
and results in a net deletion of 500 lines of code. It also opens up
the possibility of exposing more of the features of the radix tree to
users of the IDR (and I have some interesting patches along those
lines waiting for 4.12)
It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
will shrink a fair few data structures that embed an IDR"
* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
radix tree test suite: Add config option for map shift
idr: Add missing __rcu annotations
radix-tree: Fix __rcu annotations
radix-tree: Add rcu_dereference and rcu_assign_pointer calls
radix tree test suite: Run iteration tests for longer
radix tree test suite: Fix split/join memory leaks
radix tree test suite: Fix leaks in regression2.c
radix tree test suite: Fix leaky tests
radix tree test suite: Enable address sanitizer
radix_tree_iter_resume: Fix out of bounds error
radix-tree: Store a pointer to the root in each node
radix-tree: Chain preallocated nodes through ->parent
radix tree test suite: Dial down verbosity with -v
radix tree test suite: Introduce kmalloc_verbose
idr: Return the deleted entry from idr_remove
radix tree test suite: Build separate binaries for some tests
ida: Use exceptional entries for small IDAs
ida: Move ida_bitmap to a percpu variable
Reimplement IDR and IDA using the radix tree
radix-tree: Add radix_tree_iter_delete
...
- support for rbd data-pool feature, which enables rbd images on
erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
allow erasure-coded profiles with k+m up to 32.
- a patch for ceph_d_revalidate() performance regression introduced in
4.9, along with some cleanups in the area (Jeff Layton)
- a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff Layton)
- buffered reads are now processed in rsize windows instead of rasize
windows (Andreas Gerstmayr). The new default for rsize mount option
is 64M.
- ack vs commit distinction is gone, greatly simplifying ->fsync() and
MOSDOpReply handling code (myself)
Also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
computations are still serialized though) and several minor fixes and
cleanups all over.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYtY0rAAoJEEp/3jgCEfOLQioH/36QKsalquY1FCdJnJve9qj0
q19OohamIedhv76AYvXhJzBBHlVwerjicE51/bSzuUhxV+ApdATrPPcLC22oLd3i
h0R9NAUMYjiris1yN/Z9JRiPCSdsxvHuRycsUMRSRbxZhnyP9XdTxFD1A+fLfisU
Z4osyTzadabVL5Um9maRBbAtXCWh3d9JZzPa5xIvWTEO4CWWk87GtEIIQDcgx+Y6
8ZSMmrVFDNtskUp9js+LnFYW7/xBsEXyqgsqKaecf5uQqwu1WKRXSKtv9PUmGAIb
HBrlUdV1PQaCzTYtaoztJshNdYcphM5L7gePzxRG0nXrTNsq8J5eCzI8en5qS8w=
=CPL/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"This time around we have:
- support for rbd data-pool feature, which enables rbd images on
erasure-coded pools (myself). CEPH_PG_MAX_SIZE has been bumped to
allow erasure-coded profiles with k+m up to 32.
- a patch for ceph_d_revalidate() performance regression introduced
in 4.9, along with some cleanups in the area (Jeff Layton)
- a set of fixes for unsafe ->d_parent accesses in CephFS (Jeff
Layton)
- buffered reads are now processed in rsize windows instead of rasize
windows (Andreas Gerstmayr). The new default for rsize mount option
is 64M.
- ack vs commit distinction is gone, greatly simplifying ->fsync()
and MOSDOpReply handling code (myself)
... also a few filesystem bug fixes from Zheng, a CRUSH sync up (CRUSH
computations are still serialized though) and several minor fixes and
cleanups all over"
* tag 'ceph-for-4.11-rc1' of git://github.com/ceph/ceph-client: (52 commits)
libceph, rbd, ceph: WRITE | ONDISK -> WRITE
libceph: get rid of ack vs commit
ceph: remove special ack vs commit behavior
ceph: tidy some white space in get_nonsnap_parent()
crush: fix dprintk compilation
crush: do is_out test only if we do not collide
ceph: remove req from unsafe list when unregistering it
rbd: constify device_type structure
rbd: kill obj_request->object_name and rbd_segment_name_cache
rbd: store and use obj_request->object_no
rbd: RBD_V{1,2}_DATA_FORMAT macros
rbd: factor out __rbd_osd_req_create()
rbd: set offset and length outside of rbd_obj_request_create()
rbd: support for data-pool feature
rbd: introduce rbd_init_layout()
rbd: use rbd_obj_bytes() more
rbd: remove now unused rbd_obj_request_wait() and helpers
rbd: switch rbd_obj_method_sync() to ceph_osdc_call()
libceph: pass reply buffer length through ceph_osdc_call()
rbd: do away with obj_request in rbd_obj_read_sync()
...
Fix typos and add the following to the scripts/spelling.txt:
algined||aligned
While we are here, fix the "appplication" in the touched line in
drivers/block/loop.c. Also, fix the "may not naturally ..." to
"may not be naturally ..." in the touched line in mm/page_alloc.
Link: http://lkml.kernel.org/r/1481573103-11329-9-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use automatic IRQ affinity assignment in the virtio layer if available,
and build the blk-mq queues based on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Add a struct irq_affinity pointer to the find_vqs methods, which if set
is used to tell the PCI layer to create the MSI-X vectors for our I/O
virtqueues with the proper affinity from the start. Compared to after
the fact affinity hints this gives us an instantly working setup and
allows to allocate the irq descritors node-local and avoid interconnect
traffic. Last but not least this will allow blk-mq queues are created
based on the interrupt affinity for storage drivers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Merge more updates from Andrew Morton:
- almost all of the rest of MM
- misc bits
- KASAN updates
- procfs
- lib/ updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (124 commits)
checkpatch: remove false unbalanced braces warning
checkpatch: notice unbalanced else braces in a patch
checkpatch: add another old address for the FSF
checkpatch: update $logFunctions
checkpatch: warn on logging continuations
checkpatch: warn on embedded function names
lib/lz4: remove back-compat wrappers
fs/pstore: fs/squashfs: change usage of LZ4 to work with new LZ4 version
crypto: change LZ4 modules to work with new LZ4 module version
lib/decompress_unlz4: change module to work with new LZ4 module version
lib: update LZ4 compressor module
lib/test_sort.c: make it explicitly non-modular
lib: add CONFIG_TEST_SORT to enable self-test of sort()
rbtree: use designated initializers
linux/kernel.h: fix DIV_ROUND_CLOSEST to support negative divisors
lib/find_bit.c: micro-optimise find_next_*_bit
lib: add module support to atomic64 tests
lib: add module support to glob tests
lib: add module support to crc32 tests
kernel/ksysfs.c: add __ro_after_init to bin_attribute structure
...
The idea is that without doing more calculations we extend zero pages to
same element pages for zram. zero page is special case of same element
page with zero element.
1. the test is done under android 7.0
2. startup too many applications circularly
3. sample the zero pages, same pages (none-zero element)
and total pages in function page_zero_filled
the result is listed as below:
ZERO SAME TOTAL
36214 17842 598196
ZERO/TOTAL SAME/TOTAL (ZERO+SAME)/TOTAL ZERO/SAME
AVERAGE 0.060631909 0.024990816 0.085622726 2.663825038
STDEV 0.00674612 0.005887625 0.009707034 2.115881328
MAX 0.069698422 0.030046087 0.094975336 7.56043956
MIN 0.03959586 0.007332205 0.056055193 1.928985507
from the above data, the benefit is about 2.5% and up to 3% of total
swapout pages.
The defect of the patch is that when we recovery a page from non-zero
element the operations are low efficient for partial read.
This patch extends zero_page to same_page so if there is any user to
have monitored zero_pages, he will be surprised if the number is
increased but it's not harmful, I believe.
[minchan@kernel.org: do not free same element pages in zram_meta_free]
Link: http://lkml.kernel.org/r/20170207065741.GA2567@bbox
Link: http://lkml.kernel.org/r/1483692145-75357-1-git-send-email-zhouxianrong@huawei.com
Link: http://lkml.kernel.org/r/1486307804-27903-1-git-send-email-minchan@kernel.org
Signed-off-by: zhouxianrong <zhouxianrong@huawei.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zram_reset_device() waits for ongoing writepage pages to be completed by
zram->refcount logic. However, it's pointless because before the reset,
we prevent further opening of zram by zram->claim and flush all of
pending IO by fsync_bdev so there should be no pending IO at the
zram_reset_device().
So let's remove that code which is even broken due to the lack of
wake_up elsewhere.
Link: http://lkml.kernel.org/r/1485145031-11661-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull sparc updates from David Miller:
1) Support multiple huge page sizes, from Nitin Gupta.
2) Improve boot time on large memory configurations, from Pavel
Tatashin.
3) Make BRK handling more consistent and documented, from Vijay Kumar.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Fix build error in flush_tsb_user_page
sparc64: memblock resizes are not handled properly
sparc64: use latency groups to improve add_node_ranges speed
sparc64: Add 64K page size support
sparc64: Multi-page size support
Documentation/sparc: Steps for sending break on sunhv console
sparc64: Send break twice from console to return to boot prom
sparc64: Migrate hvcons irq to panicked cpu
sparc64: Set cpu state to offline when stopped
sunvdc: Add support for setting physical sector size
sparc64: fix for user probes in high memory
sparc: topology_64.h: Fix condition for including cpudata.h
sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures
Pull block updates and fixes from Jens Axboe:
- NVMe updates and fixes that missed the first pull request. This
includes bug fixes, and support for autonomous power management.
- Fix from Christoph for missing clear of the request payload, causing
a problem with (at least) the storvsc driver.
- Further fixes for the queue/bdi life time issues from Jan.
- The Kconfig mq scheduler update from me.
- Fixing a use-after-free in dm-rq, spotted by Bart, introduced in this
merge window.
- Three fixes for nbd from Josef.
- Bug fix from Omar, fixing a bug in sas transport code that oopses
when bsg ioctls were used. From Omar.
- Improvements to the queue restart and tag wait from from Omar.
- Set of fixes for the sed/opal code from Scott.
- Three trivial patches to cciss from Tobin
* 'for-linus' of git://git.kernel.dk/linux-block: (41 commits)
dm-rq: don't dereference request payload after ending request
blk-mq-sched: separate mark hctx and queue restart operations
blk-mq: use sbq wait queues instead of restart for driver tags
block/sed-opal: Propagate original error message to userland.
nvme/pci: re-check security protocol support after reset
block/sed-opal: Introduce free_opal_dev to free the structure and clean up state
nvme: detect NVMe controller in recent MacBooks
nvme-rdma: add support for host_traddr
nvmet-rdma: Fix error handling
nvmet-rdma: use nvme cm status helper
nvme-rdma: move nvme cm status helper to .h file
nvme-fc: don't bother to validate ioccsz and iorcsz
nvme/pci: No special case for queue busy on IO
nvme/core: Fix race kicking freed request_queue
nvme/pci: Disable on removal when disconnected
nvme: Enable autonomous power state transitions
nvme: Add a quirk mechanism that uses identify_ctrl
nvme: make nvmf_register_transport require a create_ctrl callback
nvme: Use CNS as 8-bit field and avoid endianness conversion
nvme: add semicolon in nvme_command setting
...
CEPH_OSD_FLAG_ONDISK is set in account_request().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Remove device driver failed to check map error messages
Reported-by: Johnny Bieren <jbieren@redhat.com>
Tested-by: Johnny Bieren <jbieren@redhat.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Physical sector size is supported in v1.2 of the vDisk protocol and
should be set if available. If protocol version 1.2 is used and the
physical disk size is unavailable, then the disk is considered busy.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We had a deprecated_attr_warn() warning for 2 years and now the time has
come and we finally can do the cleanup.
The plan was as follows:
: per-stat sysfs attributes are considered to be deprecated.
: The basic strategy is:
: -- the existing RW nodes will be downgraded to WO nodes (in linux 4.11)
: -- deprecated RO sysfs nodes will eventually be removed (in linux 4.11)
:
: The list of deprecated attributes can be found here:
: Documentation/ABI/obsolete/sysfs-block-zram
:
: Basically, every attribute that has its own read accessible sysfs
: node (e.g. num_reads) *AND* is accessible via one of the stat files
: (zram<id>/stat or zram<id>/io_stat or zram<id>/mm_stat) is considered
: to be deprecated.
The patch also removes `obsolete/sysfs-block-zram', clean ups
`testing/sysfs-block-zram' and tweaks zram.txt files.
Link: http://lkml.kernel.org/r/20170118035838.11090-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Coccinelle emits a warning about casting the return value of
kmalloc(). Coccinelle suggests removing the cast as do
kerneljanitors.
Remove cast from kmalloc() call.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Checkpatch emits ERROR:OPEN_BRACE: that open brace { should be on the
previous line.
Move open brace to new line. Also add space after if/switch statement
since we introduce more checkpatch errors if not fixed at the same
time.
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYrElpAAoJELDendYovxMvNFQH/RJU7lwDSf7rF7ZzFGvdcsfi
T4DDuowkYoJm2+GypoRVzZZ0lxJlxr0mNKPvgGDvuTogMY7pvjAf6B7/xCvTFsNU
UoO2I7ljgXxCXFRiXH50nAjS7PC2PFW3Qx+8XPIWeZmnUPeJi4Q43fiSloUt+a6l
JgS/autOCflGasR5MihCZXkvdVF81K6GuEd3hCh9GKZ/8RiwNPaY50vHnMv/hfqq
SJNRKOTSRXioYlTohLnjuPWHDMayRJEO48IXl3c7aNxDTkHjn78yaoPhJ7m+M0Bq
s2GyaQA4tCwADhP+tilmI5H1vFpt6w9x7O0dgWiSm7TB91lwfZOen2WhTPPp6L8=
=gktV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes:
- a series from Boris Ostrovsky adding support for booting Linux as
Xen PVH guest
- a series from Juergen Gross streamlining the xenbus driver
- a series from Paul Durrant adding support for the new device model
hypercall
- several small corrections"
* tag 'for-linus-4.11-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add IOCTL_PRIVCMD_RESTRICT
xen/privcmd: Add IOCTL_PRIVCMD_DM_OP
xen/privcmd: return -ENOTTY for unimplemented IOCTLs
xen: optimize xenbus driver for multiple concurrent xenstore accesses
xen: modify xenstore watch event interface
xen: clean up xenbus internal headers
xenbus: Neaten xenbus_va_dev_error
xen/pvh: Use Xen's emergency_restart op for PVH guests
xen/pvh: Enable CPU hotplug
xen/pvh: PVH guests always have PV devices
xen/pvh: Initialize grant table for PVH guests
xen/pvh: Make sure we don't use ACPI_IRQ_MODEL_PIC for SCI
xen/pvh: Bootstrap PVH guest
xen/pvh: Import PVH-related Xen public interfaces
xen/x86: Remove PVH support
x86/boot/32: Convert the 32-bit pgtable setup code from assembly to C
xen/manage: correct return value check on xenbus_scanf()
x86/xen: Fix APIC id mismatch warning on Intel
xen/netback: set default upper limit of tx/rx queues to 8
xen/netfront: set default upper limit of tx/rx queues to 8
If we fail to register the blockdev we need to make sure to destroy the
recv workqueue.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We noticed when trying to do O_DIRECT to an export on the server side
that we were getting requests smaller than the 4k sectorsize of the
device. This is because the client isn't setting the logical and
physical blocksizes properly for the underlying device. Fix this up by
setting the queue blocksizes and then calling bd_set_size.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Break the ioctl handling out into helper functions, some of these things
are getting pretty big and unwieldy.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This update includes the usual round of major driver updates (ncr5380,
ufs, lpfc, be2iscsi, hisi_sas, storvsc, cxlflash, aacraid,
megaraid_sas, ). There's also an assortment of minor fixes and the
major update of switching a bunch of drivers to pci_alloc_irq_vectors
from Christoph.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYq5adAAoJEAVr7HOZEZN4bjUP/Atk7CSZVnC75pcYmncbEGCx
ysOlEHK4uW2HhiAYk3PlYMk+pKrMHet2zsbbM9PHJfopdOHZ7Sq1+UZZVeqE1Zun
8pe0NhON+fZx7XAnevdEvnSSULQZ+AGfjZO72iUwkJiN3ozYaFtCITOyn49l4GpR
ra9emskBh7CQOFW2voGn1AKeDijPYGx3+TO4AUrWjVMiByR06gb1bmImx+ljiUrs
jzRJPfrt90ORcTdpMateyN2EXxudcASMhX03SJ6fRI84hPAhMCROMbTv8RnzOTE4
DPbnvbYUowlHt43iUhJHSwGdkRRaRBnkzQENBp1fNrNzZgF6vB7+kShxbonrYB2p
gC4ewaJr0BNj+HsUnvTpe3WseiPOcfsnBsKilPLKBlm2dCKEXqFox/dj/T1uexxg
HoyFrl3u8fyEqVHrzRS4M9t/njWh0NFmXxb0wBdj+lkVFTRErGSKQ8SfOqshuSGs
P8NN88jy8vC7uqgzKBJ+UH3ehzn3qfBxasFHIC/e2awY9FqKjHGTxKMmSVpjXVxy
wCvE2FQ3k/qEj2XSM6f7/NGytlSOlju5q1rFtHPW2M+TFSh0LJWCnmVjR/Zle9em
pBWmtIgCv8W5b41zL2H94nLWAZbfdrrNU/XnX88l47LKnmorte/PGhpxu36NEsMS
VCgreQmFMdMRY+WzDWl1
=cBQx
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
ufs, lpfc, be2iscsi, hisi_sas, storvsc, cxlflash, aacraid,
megaraid_sas, ...).
There's also an assortment of minor fixes and the major update of
switching a bunch of drivers to pci_alloc_irq_vectors from Christoph"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (188 commits)
scsi: megaraid_sas: handle dma_addr_t right on 32-bit
scsi: megaraid_sas: array overflow in megasas_dump_frame()
scsi: snic: switch to pci_irq_alloc_vectors
scsi: megaraid_sas: driver version upgrade
scsi: megaraid_sas: Change RAID_1_10_RMW_CMDS to RAID_1_PEER_CMDS and set value to 2
scsi: megaraid_sas: Indentation and smatch warning fixes
scsi: megaraid_sas: Cleanup VD_EXT_DEBUG and SPAN_DEBUG related debug prints
scsi: megaraid_sas: Increase internal command pool
scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete
scsi: megaraid_sas: Bail out the driver load if ld_list_query fails
scsi: megaraid_sas: Change build_mpt_mfi_pass_thru to return void
scsi: megaraid_sas: During OCR, if get_ctrl_info fails do not continue with OCR
scsi: megaraid_sas: Do not set fp_possible if TM capable for non-RW syspdIO, change fp_possible to bool
scsi: megaraid_sas: Remove unused pd_index from megasas_build_ld_nonrw_fusion
scsi: megaraid_sas: megasas_return_cmd does not memset IO frame to zero
scsi: megaraid_sas: max_fw_cmds are decremented twice, remove duplicate
scsi: megaraid_sas: update can_queue only if the new value is less
scsi: megaraid_sas: Change max_cmd from u32 to u16 in all functions
scsi: megaraid_sas: set pd_after_lb from MR_BuildRaidContext and initialize pDevHandle to MR_DEVHANDLE_INVALID
scsi: megaraid_sas: latest controller OCR capability from FW before sending shutdown DCMD
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJYqeb8AAoJEPfTWPspceCmB3UP/3UtcPrzEm8w2cxB9MaWhZN3
J+jiwlO4vaqhm2HVzQtoJqfaqRlud/iDx5cIXE2S7FnIM54ZKs3CANbKu8X+b1zm
eJije3zMI8A8qyftigbz6a/Y2kWE4ZqFEc9WU5CWawfTl3ImCVUi8+F5X0wOLU/h
r50zAQOEyURH4G5usNl9q0olF6FonJ82AcYm1iJ0QP2wYWZRJauC0rRn8IT93tyK
bZPHnGKdkd7km8yi3zr2GNWOfuZZuA0HWAaF4qfrHPZQ883gITFAUIlFb1f+2TNl
DkQzRrBB2wPWPnlbfb9KejMkvL94hflzsLb5rHt835DyVXFRyjxsgyAI8A+LPGSz
vqZ3rsbWj6H4F9z2CkZ+T+AP/ZSWDNjwc0RXPm9HYdR5CDeTxIUVvnFQ44YNsmTv
Xd5BKrUJ2oKegAxQG6zcuFx23p8JzhT70l+mNrMdtyeKnDD9FRdDvhKG9AHeTipn
o/DnGivhS3UMQoQ7D68KOO+kuhLDeo7my5XGsnjzMO/iHqg++7IP2HyYYs/Ba4qZ
cYaCtSDQW71Zt0vsqa6dvPuXBveu4h8Qh8R7uAGjSGS9IAFFb4Cab2tiUdISE6PE
YnMWzY+G6pT8imlLVOL5/QFuo2Q4pUsaL0AHpXMCN9TZnQtbqXa8eqwnKnQ0m2KN
7ut0IYYEPaYUX5xFn1K6
=z7AL
-----END PGP SIGNATURE-----
Merge tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
- blk-mq scheduling framework from me and Omar, with a port of the
deadline scheduler for this framework. A port of BFQ from Paolo is in
the works, and should be ready for 4.12.
- Various fixups and improvements to the above scheduling framework
from Omar, Paolo, Bart, me, others.
- Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
This allows us to export more information that helps debug hangs or
performance issues, without cluttering or abusing the sysfs API.
- Fixes for the sbitmap code, the scalable bitmap code that was
migrated from blk-mq, from Omar.
- Removal of the BLOCK_PC support in struct request, and refactoring of
carrying SCSI payloads in the block layer. This cleans up the code
nicely, and enables us to kill the SCSI specific parts of struct
request, shrinking it down nicely. From Christoph mainly, with help
from Hannes.
- Support for ranged discard requests and discard merging, also from
Christoph.
- Support for OPAL in the block layer, and for NVMe as well. Mainly
from Scott Bauer, with fixes/updates from various others folks.
- Error code fixup for gdrom from Christophe.
- cciss pci irq allocation cleanup from Christoph.
- Making the cdrom device operations read only, from Kees Cook.
- Fixes for duplicate bdi registrations and bdi/queue life time
problems from Jan and Dan.
- Set of fixes and updates for lightnvm, from Matias and Javier.
- A few fixes for nbd from Josef, using idr to name devices and a
workqueue deadlock fix on receive. Also marks Josef as the current
maintainer of nbd.
- Fix from Josef, overwriting queue settings when the number of
hardware queues is updated for a blk-mq device.
- NVMe fix from Keith, ensuring that we don't repeatedly mark and IO
aborted, if we didn't end up aborting it.
- SG gap merging fix from Ming Lei for block.
- Loop fix also from Ming, fixing a race and crash between setting loop
status and IO.
- Two block race fixes from Tahsin, fixing request list iteration and
fixing a race between device registration and udev device add
notifiations.
- Double free fix from cgroup writeback, from Tejun.
- Another double free fix in blkcg, from Hou Tao.
- Partition overflow fix for EFI from Alden Tondettar.
* tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
nvme: Check for Security send/recv support before issuing commands.
block/sed-opal: allocate struct opal_dev dynamically
block/sed-opal: tone down not supported warnings
block: don't defer flushes on blk-mq + scheduling
blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
blk-mq: don't special case flush inserts for blk-mq-sched
blk-mq-sched: don't add flushes to the head of requeue queue
blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
block: do not allow updates through sysfs until registration completes
lightnvm: set default lun range when no luns are specified
lightnvm: fix off-by-one error on target initialization
Maintainers: Modify SED list from nvme to block
Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
uapi: sed-opal fix IOW for activate lsp to use correct struct
cdrom: Make device operations read-only
elevator: fix loading wrong elevator type for blk-mq devices
cciss: switch to pci_irq_alloc_vectors
block/loop: fix race between I/O and set_status
blk-mq-sched: don't hold queue_lock when calling exit_icq
block: set make_request_fn manually in blk_mq_update_nr_hw_queues
...
Pull locking updates from Ingo Molnar:
"The main changes in this cycle were:
- Implement wraparound-safe refcount_t and kref_t types based on
generic atomic primitives (Peter Zijlstra)
- Improve and fix the ww_mutex code (Nicolai Hähnle)
- Add self-tests to the ww_mutex code (Chris Wilson)
- Optimize percpu-rwsems with the 'rcuwait' mechanism (Davidlohr
Bueso)
- Micro-optimize the current-task logic all around the core kernel
(Davidlohr Bueso)
- Tidy up after recent optimizations: remove stale code and APIs,
clean up the code (Waiman Long)
- ... plus misc fixes, updates and cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
fork: Fix task_struct alignment
locking/spinlock/debug: Remove spinlock lockup detection code
lockdep: Fix incorrect condition to print bug msgs for MAX_LOCKDEP_CHAIN_HLOCKS
lkdtm: Convert to refcount_t testing
kref: Implement 'struct kref' using refcount_t
refcount_t: Introduce a special purpose refcount type
sched/wake_q: Clarify queue reinit comment
sched/wait, rcuwait: Fix typo in comment
locking/mutex: Fix lockdep_assert_held() fail
locking/rtmutex: Flip unlikely() branch to likely() in __rt_mutex_slowlock()
locking/rwsem: Reinit wake_q after use
locking/rwsem: Remove unnecessary atomic_long_t casts
jump_labels: Move header guard #endif down where it belongs
locking/atomic, kref: Implement kref_put_lock()
locking/ww_mutex: Turn off __must_check for now
locking/atomic, kref: Avoid more abuse
locking/atomic, kref: Use kref_get_unless_zero() more
locking/atomic, kref: Kill kref_sub()
locking/atomic, kref: Add kref_read()
locking/atomic, kref: Add KREF_INIT()
...
Declare device_type structure as const as it is only stored in the
type field of a device structure. This field is of type const, so add
const to the declaration of device_type structure.
File size before:
text data bss dec hex filename
61546 11610 208 73364 11e94 drivers/block/rbd.o
File size after:
text data bss dec hex filename
61610 11578 208 73396 11eb4 drivers/block/rbd.o
Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
object_no can be trivially formatted into an object name. We already
store object names in OSD requests with special care to avoid dynamic
allocations for short names. Storing a name in obj_request, obtained
as below (!), is a waste and will be removed in the next commit.
name = kmem_cache_alloc(rbd_segment_name_cache, ...);
snprintf(name, ...);
obj_request->object_name = kstrdup(name);
kmem_cache_free(rbd_segment_name_cache, name);
...
ceph_oid_aprintf(..., "%s", obj_request->object_name);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
... and also fix up the comment -- format 1 data objects have always
been 12 hex digits long.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Factor OSD request allocation and initialization code out into
__rbd_osd_req_create().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
The allocation doesn't depend on offset and length. Both offset and
length can be changed after obj_request is allocated, too.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Add support for RBD_FEATURE_DATA_POOL feature. rbd_dev->layout.pool_id
now stores the data pool id.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Rather than initializing layout fields with some made up values in
__rbd_dev_create(), move the initialization into rbd_init_layout() and
call it after the header is actually populated.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Returning u64 doesn't make sense: max header->obj_order is 25 and
ceph_file_layout::object_size is u32.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
As explained in the previous commit, rbd_obj_request machinery (and
rbd_osd_req_create() in particular) shouldn't be used for working with
metadata objects.
Switch to the recently added ceph_osdc_call(). It assumes single pages
for outbound and inbound buffers, but that's OK - none of the callers
need more than that. These pages need to be allocated (messenger is in
dire need of proper iterator interface!), but we are swapping for
pages[] and pagelist allocations in the existing code.
Kill class_name argument - all rbd methods are under "rbd".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
rbd_obj_request machinery is completely unnecessary here; all that's
being done is fetching a metadata object - no striping, cloning, etc.
More importantly, rbd_osd_req_create() grabs pool id from layout and
that is becoming a data pool id.
Kill offset argument - all metadata objects are small and read in full.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
No reason to delay it until image_id is known. This will be required
by some rbd_obj_method_sync() callers, after rbd_obj_method_sync() is
changed to take oloc.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Image format 1 is deprecated and format 2 doesn't have these. Also,
__rbd_dev_create() takes care of zeroing (or otherwise initializing)
format 2 specific fields.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Since function tables are a common target for attackers, it's best to keep
them in read-only memory. As such, this makes the CDROM device ops tables
const. This drops additionally n_minors, since it isn't used meaningfully,
and sets the only user of cdrom_dummy_generic_packet explicitly so the
variables can all be const.
Inspired by similar changes in grsecurity/PaX.
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
It is a relatively common idiom (8 instances) to first look up an IDR
entry, and then remove it from the tree if it is found, possibly doing
further operations upon the entry afterwards. If we change idr_remove()
to return the removed object, all of these users can save themselves a
walk of the IDR tree.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Simple cleanup to use the new APIs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Don Brace <don.brace@microsemi.com>
Tested-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Today a Xenstore watch event is delivered via a callback function
declared as:
void (*callback)(struct xenbus_watch *,
const char **vec, unsigned int len);
As all watch events only ever come with two parameters (path and token)
changing the prototype to:
void (*callback)(struct xenbus_watch *,
const char *path, const char *token);
is the natural thing to do.
Apply this change and adapt all users.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Cc: wei.liu2@citrix.com
Cc: paul.durrant@citrix.com
Cc: netdev@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
A previous commit made the bdi embedded in the request queue
a pointer, but neglected to fixup zram. Fix it up.
Fixes: dc3b17cc8b ("block: Use pointer to backing_dev_info from request_queue")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will want to have struct backing_dev_info allocated separately from
struct request_queue. As the first step add pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
To prepare for dynamically adding new nbd devices to the system switch
from using an array for the nbd devices and instead use an idr. This
copies what loop does for keeping track of its devices.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since we are in the memory reclaim path we need our recv work to be on a
workqueue that has WQ_MEM_RECLAIM set so we can avoid deadlocks. Also
set WQ_HIGHPRI since we are in the completion path for IO.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of keeping two levels of indirection for requests types, fold it
all into the operations. The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.
Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This is where we do the rest of the request handling, which will
become much simpler soon, too.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Disconnects don't use block layer requests these days, so all handling
of private requests is dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The SCSI passthrough idea was a a bad idea to start with (guess who came
up with it?), and has been removed from the virtio 1.O spec, and is not
enabled by defauly by any host I know of. Add a separate config option
for it so that we don't need to enable it for most setups. That way
any bugs related to it (like the one recently fixed for vmapped stacks)
do not affect other users, and the size of the virtblk_req structure
also shrinks significantly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can simply use blk_mq_rq_from_pdu to get back at the request at
I/O completion time.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We only need this code to support scsi, ide, cciss and virtio. And at
least for virtio it's a deprecated feature to start with.
This should shrink the kernel size for embedded device that only use,
say eMMC a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
This way there is no need to drag in a dependency on the
BLOCK_PC code, which is going to become optional.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
When the lightnvm core had the "gennvm" layer between the device and the
target, there was a need for the core to be able to figure out which
target it should send an end_io callback to. Leading to a "double"
end_io, first for the media manager instance, and then for the target
instance. Now that core and gennvm is merged, there is no longer a need
for this, and a single end_io callback will do.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The number of configuration groups has been limited to one in current
code, even if there is support for up to four. With the introduction
of the open-channel SSD 1.3 specification, only a single
group is exposed onwards. Reflect this in the nvm_id structure.
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
And require all drivers that want to support BLOCK_PC to allocate it
as the first thing of their private data. To support this the legacy
IDE and BSG code is switched to set cmd_size on their queues to let
the block layer allocate the additional space.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Making use of "max_indirect_segments=" has issues:
- blkfront_setup_indirect() may end up with zero psegs when PAGE_SIZE
is sufficiently much larger than XEN_PAGE_SIZE
- the variable driven by the command line option
(xen_blkif_max_segments) has a somewhat different purpose, and hence
should namely never end up being zero
- as long as the specified value is lower than the legacy default,
we better don't use indirect segments at all (or we'd in fact lower
throughput)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Don't truncate the "feature-persistent" value read from xenstore: Any
non-zero value is supposed to enable the feature, just like is already
being done for feature_secdiscard.
Just like the other feature_* fields, feature_flush and feature_fua are
boolean flags, and hence fit well into a single bit.
Keep all bit fields together to limit gaps.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
A user noticed that write performance was horrible over loopback and we
traced it to an inversion of when we need to set MSG_MORE. It should be
set when we have more bvec's to send, not when we are on the last bvec.
This patch made the test go from 20 iops to 78k iops.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Fixes: 429a787be6 ("nbd: fix use-after-free of rq/bio in the xmit path")
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
- the virtio_blk stack DMA corruption fix from Christoph, fixing and
issue with VMAP stacks.
- O_DIRECT blkbits calculation fix from Chandan.
- discard regression fix from Christoph.
- queue init error handling fixes for nbd and virtio_blk, from Omar and
Jeff.
- two small nvme fixes, from Christoph and Guilherme.
- rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
to more closely follow what we do in other places in the block layer.
This interface is new for this series, so let's get the naming right
before releasing a kernel with this feature. From Damien.
* 'for-linus' of git://git.kernel.dk/linux-block:
block: don't try to discard from __blkdev_issue_zeroout
sd: remove __data_len hack for WRITE SAME
nvme: use blk_rq_payload_bytes
scsi: use blk_rq_payload_bytes
block: add blk_rq_payload_bytes
block: Rename blk_queue_zone_size and bdev_zone_size
nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
nvme-rdma: fix nvme_rdma_queue_is_ready
virtio_blk: fix panic in initialization error path
nbd: blk_mq_init_queue returns an error code on failure, not NULL
virtio_blk: avoid DMA to stack for the sense buffer
do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
By general sentiment kref_sub() is a bad interface, make it go away.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide kref_read() to read the current reference count; typically
used for debug messages.
Kills two anti-patterns:
atomic_read(&kref->refcount)
kref->refcount.counter
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we need to change the implementation, stop exposing internals.
Provide KREF_INIT() to allow static initialization of struct kref.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The raw_cmd_copyin() function does a kmalloc() with GFP_USER, although the
allocated structure is obviously not mapped to userspace, just copied from/to.
In this case GFP_KERNEL is more appropriate, so let's use it, although in the
current implementation this does not manifest as any error.
Reported-by: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
zram has used per-cpu stream feature from v4.7. It aims for increasing
cache hit ratio of scratch buffer for compressing. Downside of that
approach is that zram should ask memory space for compressed page in
per-cpu context which requires stricted gfp flag which could be failed.
If so, it retries to allocate memory space out of per-cpu context so it
could get memory this time and compress the data again, copies it to the
memory space.
In this scenario, zram assumes the data should never be changed but it is
not true without stable page support. So, If the data is changed under
us, zram can make buffer overrun so that zsmalloc free object chain is
broken so system goes crash like below
https://bugzilla.suse.com/show_bug.cgi?id=997574
This patch adds BDI_CAP_STABLE_WRITES to zram for declaring "I am block
device needing *stable write*".
Fixes: da9556a236 ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-4-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b4c5c60920 ("zram: avoid lockdep splat by revalidate_disk")
moved revalidate_disk call out of init_lock to avoid lockdep
false-positive splat. However, commit 08eee69fcf ("zram: remove
init_lock in zram_make_request") removed init_lock in IO path so there
is no worry about lockdep splat. So, let's restore it.
This patch is needed to set BDI_CAP_STABLE_WRITES atomically in next
patch.
Fixes: da9556a236 ("zram: user per-cpu compression streams")
Link: http://lkml.kernel.org/r/1482366980-3782-3-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hyeoncheol Lee <cheol.lee@lge.com>
Cc: <yjay.kim@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: <stable@vger.kernel.org> [4.7+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If blk_mq_init_queue() returns an error, it gets assigned to
vblk->disk->queue. Then, when we call put_disk(), we end up calling
blk_put_queue() with the ERR_PTR, causing a bad dereference. Fix it by
only assigning to vblk->disk->queue on success.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Additionally, don't assign directly to disk->queue, otherwise
blk_put_queue (called via put_disk) will choke (panic) on the errno
stored there.
Bug found by code inspection after Omar found a similar issue in
virtio_blk. Compile-tested only.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel. Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Jens Axboe <axboe@fb.com>
Prepare to mark sensitive kernel structures for randomization by making
sure they're using designated initializers. These were identified during
allyesconfig builds of x86, arm, and arm64, with most initializer fixes
extracted from grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
This was entirely automated, using the script by Al:
PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
$(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)
to do the replacement at the end of the merge window.
Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan). On mount, we
will now check whether any of the MDSes are available and bail rather
than block if none are. This check can be avoided by specifying the
"no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJYVByGAAoJEEp/3jgCEfOLBqkH/A7nVf7ObSDYmLuYgg1gJ8zq
4zDDE42S4yZwayAVpn3UjbfPuez5J44lsdXitExdfiHOdIQZDa/WqAbSqQ48HCSg
7sG6ecRWg3G5zG0psPZnB+S5wGMvsLXmj2hvzV1lt2t0lI5bDLSlNRSnElbhilD/
8Z7+Ni2go8DMC9o49SJU32lBW7IByKl4p4flveItgwUvGkIFNd8OT3CyPBUqonQs
lRCeImRYU8Jghb+ifnRxWSbuDf7pZAPc9kL0vibpUUT/1bH6iHsedKp37WQKqc/w
KDSNnKiZcz0gY/hJeLqE3ymCIKO6SU+JkMQSaYNTouLO5fQsRr8/uWQXSe6S5oc=
=ypWx
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A varied set of changes:
- a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
(myself). Also fixed a deadlock caused by a bogus allocation on the
writeback path and authorize reply verification.
- a fix for long stalls during fsync (Jeff Layton). The client now
has a way to force the MDS log flush, leading to ~100x speedups in
some synthetic tests.
- a new [no]require_active_mds mount option (Zheng Yan).
On mount, we will now check whether any of the MDSes are available
and bail rather than block if none are. This check can be avoided
by specifying the "no" option.
- a couple of MDS cap handling fixes and a few assorted patches
throughout"
* tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
libceph: remove now unused finish_request() wrapper
libceph: always signal completion when done
ceph: avoid creating orphan object when checking pool permission
ceph: properly set issue_seq for cap release
ceph: add flags parameter to send_cap_msg
ceph: update cap message struct version to 10
ceph: define new argument structure for send_cap_msg
ceph: move xattr initialzation before the encoding past the ceph_mds_caps
ceph: fix minor typo in unsafe_request_wait
ceph: record truncate size/seq for snap data writeback
ceph: check availability of mds cluster on mount
ceph: fix splice read for no Fc capability case
ceph: try getting buffer capability for readahead/fadvise
ceph: fix scheduler warning due to nested blocking
ceph: fix printing wrong return variable in ceph_direct_read_write()
crush: include mapper.h in mapper.c
rbd: silence bogus -Wmaybe-uninitialized warning
libceph: no need to drop con->mutex for ->get_authorizer()
libceph: drop len argument of *verify_authorizer_reply()
libceph: verify authorize reply on connect
...
This update includes the usual round of major driver updates (ncr5380,
lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas). There's also
an assortment of minor fixes, mostly in error legs or other not very
user visible stuff. The major change is the pci_alloc_irq_vectors
replacement for the old pci_msix_.. calls; this effectively makes IRQ
mapping generic for the drivers and allows blk_mq to use the
information.
Signed-off-by: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJYUJOvAAoJEAVr7HOZEZN42B0P/1lj1W2N7y0LOAbR2MasyQvT
fMD/SSip/v+R/zJiTv+5M/IDQT6ez62JnQGWyX3HZTob9VEfoqagbUuHH6y+fmib
tHyqiYc9FC/NZSRX/0ib+rpnSVhC/YRSVV7RrAqilbpOKAzeU25FlN/vbz+Nv/XL
ReVBl+2nGjJtHyWqUN45Zuf74c0zXOWPPUD0hRaNclK5CfZv5wDYupzHzTNSQTkj
nWvwPYT0OgSMNe7mR+IDFyOe3UQ/OYyeJB0yBNqO63IiaUabT8/hgrWR9qJAvWh8
LiH+iSQ69+sDUnvWvFjuls/GzcFuuTljwJbm+FyTsmNHONPVY8JRCLNq7CNDJ6Vx
HwpNuJdTSJpne4lAVBGPwgjs+GhlMvUP/xYVLWAXdaBxU9XGePrwqQDcFu1Rbx3P
yfMiVaY1+e45OEjLRCbDAwFnMPevC3kyymIvSsTySJxhTbYrOsyrrWt5kwWsvE3r
SKANsub+xUnpCkyg57nXRQStJSCiSfGIDsydKmMX+pf1SR4k6gCUQZlcchUX0uOa
dcY6re0c7EJIQQiT7qeGP5TRBblxARocCA/Igx6b5U5HmuQ48tDFlMCps7/TE84V
JBsBnmkXcEi/ALShL/Tui+3YKA1DfOtEnwHtXx/9Ecx/nxP2Sjr9LJwCKiONv8NY
RgLpGfccrix34lQumOk5
=sPXh
-----END PGP SIGNATURE-----
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This update includes the usual round of major driver updates (ncr5380,
lpfc, hisi_sas, megaraid_sas, ufs, ibmvscsis, mpt3sas).
There's also an assortment of minor fixes, mostly in error legs or
other not very user visible stuff. The major change is the
pci_alloc_irq_vectors replacement for the old pci_msix_.. calls; this
effectively makes IRQ mapping generic for the drivers and allows
blk_mq to use the information"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (256 commits)
scsi: qla4xxx: switch to pci_alloc_irq_vectors
scsi: hisi_sas: support deferred probe for v2 hw
scsi: megaraid_sas: switch to pci_alloc_irq_vectors
scsi: scsi_devinfo: remove synchronous ALUA for NETAPP devices
scsi: be2iscsi: set errno on error path
scsi: be2iscsi: set errno on error path
scsi: hpsa: fallback to use legacy REPORT PHYS command
scsi: scsi_dh_alua: Fix RCU annotations
scsi: hpsa: use %phN for short hex dumps
scsi: hisi_sas: fix free'ing in probe and remove
scsi: isci: switch to pci_alloc_irq_vectors
scsi: ipr: Fix runaway IRQs when falling back from MSI to LSI
scsi: dpt_i2o: double free on error path
scsi: cxlflash: Migrate scsi command pointer to AFU command
scsi: cxlflash: Migrate IOARRIN specific routines to function pointers
scsi: cxlflash: Cleanup queuecommand()
scsi: cxlflash: Cleanup send_tmf()
scsi: cxlflash: Remove AFU command lock
scsi: cxlflash: Wait for active AFU commands to timeout upon tear down
scsi: cxlflash: Remove private command pool
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJYT5HMAAoJELDendYovxMvhNQH/1g/3ahM4JKN8Z0SbjKBEdQm
yj2xOj6cE3l6wMSUblKjZD2DLLhpmcHT/E97Xro/lZQEfQJoMXXWWDFowMU/P1LA
mJxb7Fzq5Wr+6eGSAlIQB270MrpNi/luf+CWHMwVA3V7R3KRXwonOdGQSkISIzCd
tgIydEA3a9r2+HgeIBpZFZ4GcSrJQU75krMyl2tjD1C+jeYVd+zdoj2OnDsZQDZQ
hDWApMpNbpSBAn7JtSSdXWSTBsGH0lUECebeYPhPQ2sX2P6Y8+UCGwA7i6FFdbTa
agXfVSdRz8dCe3k19VcKDAw6nK9BTTMnEeEHmkmygIh6wuHPP44CzigTXIbJoXI=
=zjfm
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
"Xen features and fixes for 4.10
These are some fixes, a move of some arm related headers to share them
between arm and arm64 and a series introducing a helper to make code
more readable.
The most notable change is David stepping down as maintainer of the
Xen hypervisor interface. This results in me sending you the pull
requests for Xen related code from now on"
* tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits)
xen/balloon: Only mark a page as managed when it is released
xenbus: fix deadlock on writes to /proc/xen/xenbus
xen/scsifront: don't request a slot on the ring until request is ready
xen/x86: Increase xen_e820_map to E820_X_MAX possible entries
x86: Make E820_X_MAX unconditionally larger than E820MAX
xen/pci: Bubble up error and fix description.
xen: xenbus: set error code on failure
xen: set error code on failures
arm/xen: Use alloc_percpu rather than __alloc_percpu
arm/arm64: xen: Move shared architecture headers to include/xen/arm
xen/events: use xen_vcpu_id mapping for EVTCHNOP_status
xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing
xen-scsifront: Add a missing call to kfree
MAINTAINERS: update XEN HYPERVISOR INTERFACE
xenfs: Use proc_create_mount_point() to create /proc/xen
xen-platform: use builtin_pci_driver
xen-netback: fix error handling output
xen: make use of xenbus_read_unsigned() in xenbus
xen: make use of xenbus_read_unsigned() in xen-pciback
xen: make use of xenbus_read_unsigned() in xen-fbfront
...
Pull block layer updates from Jens Axboe:
"This is the main block pull request this series. Contrary to previous
release, I've kept the core and driver changes in the same branch. We
always ended up having dependencies between the two for obvious
reasons, so makes more sense to keep them together. That said, I'll
probably try and keep more topical branches going forward, especially
for cycles that end up being as busy as this one.
The major parts of this pull request is:
- Improved support for O_DIRECT on block devices, with a small
private implementation instead of using the pig that is
fs/direct-io.c. From Christoph.
- Request completion tracking in a scalable fashion. This is utilized
by two components in this pull, the new hybrid polling and the
writeback queue throttling code.
- Improved support for polling with O_DIRECT, adding a hybrid mode
that combines pure polling with an initial sleep. From me.
- Support for automatic throttling of writeback queues on the block
side. This uses feedback from the device completion latencies to
scale the queue on the block side up or down. From me.
- Support from SMR drives in the block layer and for SD. From Hannes
and Shaun.
- Multi-connection support for nbd. From Josef.
- Cleanup of request and bio flags, so we have a clear split between
which are bio (or rq) private, and which ones are shared. From
Christoph.
- A set of patches from Bart, that improve how we handle queue
stopping and starting in blk-mq.
- Support for WRITE_ZEROES from Chaitanya.
- Lightnvm updates from Javier/Matias.
- Supoort for FC for the nvme-over-fabrics code. From James Smart.
- A bunch of fixes from a whole slew of people, too many to name
here"
* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
blk-stat: fix a few cases of missing batch flushing
blk-flush: run the queue when inserting blk-mq flush
elevator: make the rqhash helpers exported
blk-mq: abstract out blk_mq_dispatch_rq_list() helper
blk-mq: add blk_mq_start_stopped_hw_queue()
block: improve handling of the magic discard payload
blk-wbt: don't throttle discard or write zeroes
nbd: use dev_err_ratelimited in io path
nbd: reset the setup task for NBD_CLEAR_SOCK
nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
nvme-fabrics: Add target support for FC transport
nvme-fabrics: Add host support for FC transport
nvme-fabrics: Add FC transport LLDD api definitions
nvme-fabrics: Add FC transport FC-NVME definitions
nvme-fabrics: Add FC transport error codes to nvme.h
Add type 0x28 NVME type code to scsi fc headers
nvme-fabrics: patch target code in prep for FC transport support
nvme-fabrics: set sqe.command_id in core not transports
parser: add u64 number parser
nvme-rdma: align to generic ib_event logging helper
...
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
drivers/block/rbd.c: In function ‘rbd_watch_cb’:
drivers/block/rbd.c:3690:5: error: ‘struct_v’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
drivers/block/rbd.c:3759:5: note: ‘struct_v’ was declared here
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While doing stress tests we noticed that we'd get a lot of dmesg spam if
we suddenly disconnected the nbd device out of band. Rate limit the
messages in the io path in order to deal with this.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If an app exits before running NBD_DO_IT but after adding sockets we can
end up not being allowed to do a new nbd device. Fix this by making
NBD_CLEAR_SOCK reset the setup_task.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
zram hot_add sysfs attribute is a very 'special' attribute - reading
from it creates a new uninitialized zram device. This file, by a
mistake, can be read by a 'normal' user at the moment, while only root
must be able to create a new zram device, therefore hot_add attribute
must have S_IRUSR mode, not S_IRUGO.
[akpm@linux-foundation.org: s/sence/sense/, reflow comment to use 80 cols]
Fixes: 6566d1a32b ("zram: add dynamic device add/remove functionality")
Link: http://lkml.kernel.org/r/20161205155845.20129-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Steven Allen <steven@stebalien.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have this:
ERROR: "__aeabi_ldivmod" [drivers/block/nbd.ko] undefined!
ERROR: "__divdi3" [drivers/block/nbd.ko] undefined!
nbd.c:(.text+0x247c72): undefined reference to `__divdi3'
due to a recent commit, that did 64-bit division. Use the proper
divider function so that 32-bit compiles don't break.
Fixes: ef77b51524 ("nbd: use loff_t for blocksize and nbd_set_size args")
Signed-off-by: Jens Axboe <axboe@fb.com>
If we have large devices (say like the 40t drive I was trying to test with) we
will end up overflowing the int arguments to nbd_set_size and not get the right
size for our device. Fix this by using loff_t everywhere so I don't have to
think about this again. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Install the callbacks via the state machine with multi instance support and let
the core invoke the callbacks on the already online CPUs.
[bigeasy: wire up the multi instance stuff]
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: rt@linutronix.de
Cc: Nitin Gupta <ngupta@vflare.org>
Link: http://lkml.kernel.org/r/20161126231350.10321-19-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix bug https://bugzilla.kernel.org/show_bug.cgi?id=188531. In function
mtip_block_initialize(), variable rv takes the return value, and its
value should be negative on errors. rv is initialized as 0 and is not
reset when the call to ida_pre_get() fails. So 0 may be returned.
The return value 0 indicates that there is no error, which may be
inconsistent with the execution status. This patch fixes the bug by
explicitly assigning -ENOMEM to rv on the branch that ida_pre_get()
fails.
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The zram hot removal code calls idr_remove() even when zram_remove()
returns an error (typically -EBUSY). This results in a leftover at the
device release, eventually leading to a crash when the module is
reloaded.
As described in the bug report below, the following procedure would
cause an Oops with zram:
- provision three zram devices via modprobe zram num_devices=3
- configure a size for each device
+ echo "1G" > /sys/block/$zram_name/disksize
- mkfs and mount zram0 only
- attempt to hot remove all three devices
+ echo 2 > /sys/class/zram-control/hot_remove
+ echo 1 > /sys/class/zram-control/hot_remove
+ echo 0 > /sys/class/zram-control/hot_remove
- zram0 removal fails with EBUSY, as expected
- unmount zram0
- try zram0 hot remove again
+ echo 0 > /sys/class/zram-control/hot_remove
- fails with ENODEV (unexpected)
- unload zram kernel module
+ completes successfully
- zram0 device node still exists
- attempt to mount /dev/zram0
+ mount command is killed
+ following BUG is encountered
BUG: unable to handle kernel paging request at ffffffffa0002ba0
IP: get_disk+0x16/0x50
Oops: 0000 [#1] SMP
CPU: 0 PID: 252 Comm: mount Not tainted 4.9.0-rc6 #176
Call Trace:
exact_lock+0xc/0x20
kobj_lookup+0xdc/0x160
get_gendisk+0x2f/0x110
__blkdev_get+0x10c/0x3c0
blkdev_get+0x19d/0x2e0
blkdev_open+0x56/0x70
do_dentry_open.isra.19+0x1ff/0x310
vfs_open+0x43/0x60
path_openat+0x2c9/0xf30
do_filp_open+0x79/0xd0
do_sys_open+0x114/0x1e0
SyS_open+0x19/0x20
entry_SYSCALL_64_fastpath+0x13/0x94
This patch adds the proper error check in hot_remove_store() not to call
idr_remove() unconditionally.
Fixes: 17ec4cd985 ("zram: don't call idr_remove() from zram_remove()")
Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1010970
Link: http://lkml.kernel.org/r/20161121132140.12683-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Reported-by: David Disseldorp <ddiss@suse.de>
Tested-by: David Disseldorp <ddiss@suse.de>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: <stable@vger.kernel.org> [4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Multiple paths don't set it properly, ensure that we do.
Fixes: 9561a7ade0 ("nbd: add multi-connection support")
Signed-off-by: Jens Axboe <axboe@fb.com>
NBD can become contended on its single connection. We have to serialize all
writes and we can only process one read response at a time. Fix this by
allowing userspace to provide multiple connections to a single nbd device. This
coupled with block-mq drastically increases performance in multi-process cases.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For a non-cloned bio, bio_add_page() only returns failure when
the io vec table is full, but in that case, bio->bi_vcnt can't
be zero at all.
So remove the impossible failure handling.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some drivers often use external bvec table, so introduce
this helper for this case. It is always safe to access the
bio->bi_io_vec in this way for this case.
After converting to this usage, it will becomes a bit easier
to evaluate the remaining direct access to bio->bi_io_vec,
so it can help to prepare for the following multipage bvec
support.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixed up the new O_DIRECT cases.
Signed-off-by: Jens Axboe <axboe@fb.com>
This driver is both orphaned, and not really useful anymore. Mark
it as such, and remove it in a future kernel after a release or
two.
Signed-off-by: Jens Axboe <axboe@fb.com>
For writes, we can get a completion in while we're still iterating
the request and bio chain. If that happens, we're reading freed
memory and we can crash.
Break out after the last segment and avoid having the iterator
read freed memory.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If CONFIG_NVM is disabled, loading null_block module with use_lightnvm=1
fails. But there are no messages and documents related to the failure.
Add the appropriate error message.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Massaged the text a bit.
Signed-off-by: Jens Axboe <axboe@fb.com>
->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
an errno.
f4aa4c7bba ("block: loop: convert to per-device workqueue")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
aoeblk contains some mysterious code, that wants to elevate the bio
vec page counts while it's under IO. That is not needed, it's
fragile, and it's causing kernel oopses for some.
Reported-by: Tested-by: Don Koch <kochd@us.ibm.com>
Tested-by: Tested-by: Don Koch <kochd@us.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
There is no need to call kfree() if memdup_user() fails, as no memory
was allocated and the error in the error-valued pointer should be returned.
Signed-off-by: Sachin Shukla <sachin.s5@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Building with W=1 shows a harmless warning for the skd driver:
drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration]
This changes the prototype to the expected formatting.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
As reported by gcc -Wmaybe-uninitialized, the cleanup path for
skd_acquire_msix tries to free the already allocated msi-x vectors
in reverse order, but the index variable may not have been
used yet:
drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
This changes the failure path to skip releasing the interrupts
if we have not started requesting them yet.
Fixes: 180b0ae77d ("skd: use pci_alloc_irq_vectors")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Don't pass a size larger than iov_len to kernel_sendmsg().
Otherwise it will cause a NULL pointer deref when kernel_sendmsg()
returns with rv < size.
DRBD as external module has been around in the kernel 2.4 days already.
We used to be compatible to 2.4 and very early 2.6 kernels,
we used to use
rv = sock_sendmsg(sock, &msg, iov.iov_len);
then later changed to
rv = kernel_sendmsg(sock, &msg, &iov, 1, size);
when we should have used
rv = kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
tcp_sendmsg() used to totally ignore the size parameter.
57be5bd ip: convert tcp_sendmsg() to iov_iter primitives
changes that, and exposes our long standing error.
Even with this error exposed, to trigger the bug, we would need to have
an environment (config or otherwise) causing us to not use sendpage()
for larger transfers, a failing connection, and have it fail "just at the
right time". Apparently that was unlikely enough for most, so this went
unnoticed for years.
Still, it is known to trigger at least some of these,
and suspected for the others:
[0] http://lists.linbit.com/pipermail/drbd-user/2016-July/023112.html
[1] http://lists.linbit.com/pipermail/drbd-dev/2016-March/003362.html
[2] https://forums.grsecurity.net/viewtopic.php?f=3&t=4546
[3] https://ubuntuforums.org/showthread.php?t=2336150
[4] http://e2.howsolveproblem.com/i/1175162/
This should go into 4.9,
and into all stable branches since and including v4.0,
which is the first to contain the exposing change.
It is correct for all stable branches older than that as well
(which contain the DRBD driver; which is 2.6.33 and up).
It requires a small "conflict" resolution for v4.4 and earlier, with v4.5
we dropped the comment block immediately preceding the kernel_sendmsg().
Fixes: b411b3637f ("The DRBD driver")
Cc: <stable@vger.kernel.org> # 2.6.33.x-
Cc: viro@zeniv.linux.org.uk
Cc: christoph.lechleitner@iteg.at
Cc: wolfgang.glas@iteg.at
Reported-by: Christoph Lechleitner <christoph.lechleitner@iteg.at>
Tested-by: Christoph Lechleitner <christoph.lechleitner@iteg.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
[changed oneliner to be "obvious" without context; more verbose message]
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For small buffers we may use %*ph[N] specifier, for the bigger blocks
print_hex_dump() call.
Cc: Don Brace <don.brace@microsemi.com>
Cc: esc.storagedev@microsemi.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Replace custom approach by %*ph specifier to dump small buffers in hex format.
Unfortunately we can't use print_hex_dump_bytes() here since tha gap is
present, though one familiar with the code may change this.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Switch the skd driver to use pci_alloc_irq_vectors. We need to two calls to
pci_alloc_irq_vectors as skd only supports multiple MSI-X vectors, but not
multiple MSI vectors.
Otherwise this cleans up a lot of cruft and allows to a lot more common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Hi Peter, hi Jens,
I've been looking over the multi page bio vec work again recently, and
one of the stumbling blocks is raw biovec access in the pktcdvd.
The first issue is that it directly sets up the page and offset pointers
in the biovec just before calling bio_add_page. As bio_add_page already
does the setup it's trivial to just switch it to stack variables for the
arguments.
The second issue is the copy code in pkt_make_local_copy, which
effectively is an opencoded version of bio_copy_data except that it
skips pages that already are the same in the ѕource and destination.
But we look at the only calleer we just set up the bio using bio_add_page
to point exactly to the page array that pkt_make_local_copy compares,
so the pages will always be the same and we can just remove this function.
Note that all this is done based on code inspection, I don't have any
packet writing hardware myself.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use xenbus_read_unsigned() instead of xenbus_scanf() when possible.
This requires to change the type of some reads from int to unsigned,
but these cases have been wrong before: negative values are not allowed
for the modified cases.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Use xenbus_read_unsigned() instead of xenbus_scanf() when possible.
This requires to change the type of one read from int to unsigned,
but this case has been wrong before: negative values are not allowed
for the modified case.
Cc: konrad.wilk@oracle.com
Cc: roger.pau@citrix.com
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: David Vrabel <david.vrabel@citrix.com>
'blk_mq_alloc_request()' returns an error pointer in case of error, not
NULL. So test it with IS_ERR.
Fixes: fd8383fd88 ("nbd: convert to blkmq")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: James Bottomley <jejb@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it. In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.
This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The local variable "err" will be set to an appropriate value
by a following statement.
Thus omit the explicit initialisation at the beginning.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
It makes the message hard to interpret correctly if a base 10 number is
prefixed by 0x. So change to a hex number.
Link: http://lkml.kernel.org/r/20161026125658.25728-3-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Discontinue having the brd driver destructively free all pages in the
ramdisk in response to the BLKFLSBUF ioctl. Doing so allows a BLKFLSBUF
ioctl issued to a logical partition to destroy pages of the parent brd
device (and all other partitions of that brd device).
This change breaks compatibility - but in this case the compatibility
breaks more than it helps.
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Currently rd_size was int which lead to overflow and bogus device size
once the requested ramdisk size was 1 TB or more. Although these days
ramdisks with 1 TB size are mostly a mistake, the days when they are
useful are not far.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 0eadf37afc ("nbd: allow block mq to deal with timeouts")
changed normal usage of nbd->sock_lock to use spin_lock/spin_unlock
rather than the *_irq variants, but it missed this unlock in an
error path.
Found by Coverity, CID 1373871.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Fixes: 0eadf37afc ("nbd: allow block mq to deal with timeouts")
Signed-off-by: Jens Axboe <axboe@fb.com>
If the header object gets deleted (perhaps along with the entire pool),
there is no point in attempting to reregister the watch. Treat this
the same as blacklisting: fail all pending and new I/Os requiring the
lock.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
-EBLACKLISTED from __rbd_register_watch() means that our ceph_client
got blacklisted - we won't be able to restore the watch and reacquire
the lock. Wake up and fail all outstanding requests waiting for the
lock and arrange for all new requests that require the lock to fail
immediately.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Mike Christie <mchristi@redhat.com>
A good practice is to prefix the names of functions by the name
of the subsystem.
The kthread worker API is a mix of classic kthreads and workqueues. Each
worker has a dedicated kthread. It runs a generic function that process
queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:
__init_kthread_worker() -> __kthread_init_worker()
init_kthread_worker() -> kthread_init_worker()
init_kthread_work() -> kthread_init_work()
insert_kthread_work() -> kthread_insert_work()
queue_kthread_work() -> kthread_queue_work()
flush_kthread_work() -> kthread_flush_work()
flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.
Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:
+ "init" in the function names stands for the verb "initialize"
aka "initialize worker". While "INIT" in the macro names
stands for the noun "INITIALIZER" aka "worker initializer".
+ INIT() macros are used only in DEFINE() macros
+ init() functions are used close to the other kthread()
functions. It looks much better if all the functions
use the same scheme.
+ There will be also kthread_destroy_worker() that will
be used close to kthread_cancel_work(). It is related
to the init() function. Again it looks better if all
functions use the same naming scheme.
+ there are several precedents for such init() function
names, e.g. amd_iommu_init_device(), free_area_init_node(),
jump_label_init_type(), regmap_init_mmio_clk(),
+ It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict]
Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
with maintenance operations offloaded to userspace (Douglas Fuller,
Mike Christie and myself). Another block device bullet is a series
fixing up layering error paths (myself).
On the filesystem side, we've got patches that improve our handling of
buffered vs dio write races (Neil Brown) and a few assorted fixes from
Zheng. Also included a couple of random cleanups and a minor CRUSH
update.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJX+PjZAAoJEEp/3jgCEfOLVuoH/RwtFLIb6/KZUYtBOrVVrTun
kReRlfq2xKYrGGtyQEqSuz7fBdwT1LVCVcL8kC4GFD4R67o+tNMAr6PfM/7pZABj
HRoRLgSZ9FLw4W5n0VpBIznih75QUbCdXiTCtH9eorMHU5q1YpTvVHHlF9W9Pm2I
eNGnBWpGyHVeiK66mpUCH+EQKQ4GkAVD9rneTNqLHgq2yotHkVl1j258+DL6JRGs
OBoh3RmNQaGOAS37Lss8erCSusAGEcAeGV6ubuK2lFUKyR41EkD3I0xkhNSPe+CD
RifFcpVziIeTu//cLgl0nnHGtmUytD7HgJubaPthArKIOen9ZDAfEkgI0o+JI2A=
=45O7
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client
Pull Ceph updates from Ilya Dryomov:
"The big ticket item here is support for rbd exclusive-lock feature,
with maintenance operations offloaded to userspace (Douglas Fuller,
Mike Christie and myself). Another block device bullet is a series
fixing up layering error paths (myself).
On the filesystem side, we've got patches that improve our handling of
buffered vs dio write races (Neil Brown) and a few assorted fixes from
Zheng. Also included a couple of random cleanups and a minor CRUSH
update"
* tag 'ceph-for-4.9-rc1' of git://github.com/ceph/ceph-client: (39 commits)
crush: remove redundant local variable
crush: don't normalize input of crush_ln iteratively
libceph: ceph_build_auth() doesn't need ceph_auth_build_hello()
libceph: use CEPH_AUTH_UNKNOWN in ceph_auth_build_hello()
ceph: fix description for rsize and rasize mount options
rbd: use kmalloc_array() in rbd_header_from_disk()
ceph: use list_move instead of list_del/list_add
ceph: handle CEPH_SESSION_REJECT message
ceph: avoid accessing / when mounting a subpath
ceph: fix mandatory flock check
ceph: remove warning when ceph_releasepage() is called on dirty page
ceph: ignore error from invalidate_inode_pages2_range() in direct write
ceph: fix error handling of start_read()
rbd: add rbd_obj_request_error() helper
rbd: img_data requests don't own their page array
rbd: don't call rbd_osd_req_format_read() for !img_data requests
rbd: rework rbd_img_obj_exists_submit() error paths
rbd: don't crash or leak on errors in rbd_img_obj_parent_read_full_callback()
rbd: move bumping img_request refcount into rbd_obj_request_submit()
rbd: mark the original request as done if stat request fails
...
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
"This is the block-irq topic branch for 4.9-rc. It's mostly from
Christoph, and it allows drivers to specify their own mappings, and
more importantly, to share the blk-mq mappings with the IRQ affinity
mappings. It's a good step towards making this work better out of the
box"
* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
blk_mq: linux/blk-mq.h does not include all the headers it depends on
blk-mq: kill unused blk_mq_create_mq_map()
blk-mq: get rid of the cpumask in struct blk_mq_tags
nvme: remove the post_scan callout
nvme: switch to use pci_alloc_irq_vectors
blk-mq: provide a default queue mapping for PCI device
blk-mq: allow the driver to pass in a queue mapping
blk-mq: remove ->map_queue
blk-mq: only allocate a single mq_map per tag_set
blk-mq: don't redistribute hardware queues on a CPU hotplug event
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
* Delete the local variable "size" which became unnecessary with
this refactoring.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Pull setting an error and marking a request done code into a new
helper. obj_request_img_data_test() check isn't strictly needed right
now, but makes it applicable to !img_data requests and a bit safer.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Move the check into rbd_obj_request_destroy() to avoid use-after-free
on errors in rbd_img_request_fill(..., OBJ_REQUEST_PAGES, ...), where
pages, owned by the caller, gets freed in rbd_img_request_fill().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Accessing obj_request->img_request union field is only valid for object
requests associated with an image (i.e. if obj_request_img_data_test()
returns true). rbd_osd_req_format_read() used to do more, but now it
just sets osd_req->snap_id. Standalone and stat object requests always
go to the HEAD revision and are fine with CEPH_NOSNAP set by libceph,
so get around the invalid union field use by simply not calling
rbd_osd_req_format_read() in those places.
Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
- don't put obj_request before rbd_obj_request_get() if
rbd_obj_request_create() fails
- don't leak pages if rbd_obj_request_create() fails
- don't leak stat_request if rbd_osd_req_create() fails
Reported-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
- fix parent_length == img_request->xferred assert to not fire on
copyup read failures
- don't leak pages if copyup read fails or we can't allocate a new osd
request
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Commit 0f2d5be792 ("rbd: use reference counts for image requests")
added rbd_img_request_get(), which rbd_img_request_fill() calls for
each obj_request added to img_request. It was an urgent band-aid for
the uglyness that is rbd_img_obj_callback() and none of the error paths
were updated.
Given that this img_request reference is meant to represent an
obj_request that hasn't passed through rbd_img_obj_callback() yet,
proper cleanup in appropriate destructors is a challenge. However,
noting that if we don't get a chance to call rbd_obj_request_complete(),
there is not going to be a call to rbd_img_obj_callback(), we can move
rbd_img_request_get() into rbd_obj_request_submit() and fixup the two
places that call rbd_obj_request_complete() directly and not through
rbd_obj_request_submit() to temporarily bump img_request, so that
rbd_img_obj_callback() can put as usual.
This takes care of img_request leaks on errors on the submit side.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
If stat request fails with something other than -ENOENT (which just
means that we need to copyup), the original object request is never
marked as done and therefore never completed. Fix this by moving the
mark done + complete snippet from rbd_img_obj_parent_read_full() into
rbd_img_obj_exists_callback(). The former remains covered, as the
latter is its only caller (through rbd_img_obj_request_submit()).
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Assert once in rbd_img_obj_request_submit().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Reviewed-by: David Disseldorp <ddiss@suse.de>
Add a per-device option to acquire exclusive lock on reads (in addition
to writes and discards). The use case is iSCSI, where it will be used
to prevent execution of stale writes after the implicit failover.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Mike Christie <mchristi@redhat.com>
We take a mutex when sending commands and send stuff over the network, we need
to have queue_rq called asynchronously.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Fixes: fd8383fd88 ("nbd: convert to blkmq")
Signed-off-by: Jens Axboe <axboe@fb.com>
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.
To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.
This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
With LightNVM enabled devices, the gendisk structure is not exposed
to the user. This hides the device driver specific sysfs entries, and
prevents binding of LightNVM geometry information to the device.
Refactor the device registration process, so that gendisk and
non-gendisk devices are easily managed.
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
All drivers use the default, so provide an inline version of it. If we
ever need other queue mapping we can add an optional method back,
although supporting will also require major changes to the queue setup
code.
This provides better code generation, and better debugability as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of rolling our own timer, just utilize the blk mq req timeout and do the
disconnect if any of our commands timeout.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
In preparation for some future changes, change a few of the state bools over to
normal bits to set/clear properly.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We hit a warning when shutting down the nbd connection because we have irq's
disabled. We don't really need to do the shutdown under the lock, just clear
the nbd->sock. So do the shutdown outside of the irq. This gets rid of the
warning.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This moves NBD over to using blkmq, which allows us to get rid of the NBD
wide queue lock and the async submit kthread. We will start with 1 hw
queue for now, but I plan to add multiple tcp connection support in the
future and we'll fix how we set the hwqueue's.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We get 1 warning when biuld kernel with W=1:
drivers/block/mtip32xx/mtip32xx.c:3689:6: warning: no previous prototype for
'mtip_block_release' [-Wmissing-prototypes]
In fact, this function is only used in the file in which it is declared
and don't need a declaration, but can be made static.
so this patch marks it 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a force close option, so we can force the unmapping
of a rbd device that is open. If a path/device is blacklisted, apps
like multipathd can map a new device and then unmap the old one.
The unmapping cleanup would then be handled by the generic hotunplug
code paths in multipahd like is done for iSCSI, FC/FCOE, SAS, etc.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export the info used to setup the rbd image, so it can be used to remap
the image.
Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: do_rbd_add() EH]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export snap id in sysfs, so tools like multipathd can use it in a uuid.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export the cluster fsid, so tools like udev and multipath-tools can use
it for part of the uuid.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Export client addr/nonce, so userspace can check if a image is being
blacklisted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
[idryomov@gmail.com: ceph_client_addr(), endianess fix]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
With exclusive-lock added and more to come, print features into dmesg.
Change capacity to decimal while at it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Add basic support for RBD_FEATURE_EXCLUSIVE_LOCK feature. Maintenance
operations (resize, snapshot create, etc) are offloaded to librbd via
returning -EOPNOTSUPP - librbd should request the lock and execute the
operation.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
Revamp watch code to support retrying watch re-registration:
- add rbd_dev->watch_state for more robust errcb handling
- store watch cookie separately to avoid dereferencing watch_handle
which is set to NULL on unwatch
- move re-register code into a delayed work and retry re-registration
every second, unless the client is blacklisted
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Tested-by: Mike Christie <mchristi@redhat.com>
This is going to be used for re-registering watch requests and
exclusive-lock tasks: acquire/request lock, notify-acquired, release
lock, notify-released. Some refactoring in the map/unmap paths was
necessary to give this workqueue a meaningful name: "rbdX-tasks".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
It's gid / global_id in other places.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Current code forgets to free resources in the failure path of
xlvbd_alloc_gendisk(), this patch fix it.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
blk_mq_update_nr_hw_queues() reset all queue limits to default which it's
not as xen-blkfront expected, introducing blkif_set_queue_limits() to reset
limits with initial correct values.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Two places didn't get updated when 64KB page granularity was introduced,
this patch fix them.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- Misc fixes and cleanups all over the place.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXq0ruAAoJECgfDbjSjVRp5P8H/2OlDJdSS1l+TwOXbY95ntQ1
vxUX4vGCX5IujC+Rbt7sQV2prE3b6IktFNagpbRoWn21JkpoDMvPtYJrn5BhLtoh
fvDkZE6Wo3QztFSjaUBZWEABBt03KPX0yrAIZplu8ne/Z8KAT3zK57BPnKfmxwv+
dpxt+1wlnqAvYsoUUQZBFT4Gmk2oDiTofiIbQq7W9W/fooznLtLB+ArYtdfNJizC
JnI/vJuWceEXfjT26HexCRhA2OZskrA4ZadDhOjAqkTPN5DHfweLDuHh7IsVfDd1
wXqjc4ks3cYG0CloJ2qY2K7RpDOFIxIizixeDIuAbn9aX4sPOYYfqRm+4iRwmqQ=
=9aUO
-----END PGP SIGNATURE-----
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio/vhost fixes and cleanups from Michael Tsirkin:
"Misc fixes and cleanups all over the place"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio/s390: deprecate old transport
virtio/s390: keep early_put_chars
virtio_blk: Fix a slient kernel panic
virtio-vsock: fix include guard typo
vhost/vsock: fix vhost virtio_vsock_pkt use-after-free
9p/trans_virtio: use kvfree() for iov_iter_get_pages_alloc()
virtio: fix error handling for debug builds
virtio: fix memory leak in virtqueue_add()
ceph_file_layout::pool_id is now s64. rbd_add_get_pool_id() and
ceph_pg_poolid_by_name() both return an int, so it's bogus anyway.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
We do a lot of memory allocation in function init_vq, and don't handle
the allocation failure properly. Then this function will return 0,
although initialization fails due to lacking memory. At that moment,
kernel will panic in guest machine, if virtio is used to drive disk.
To fix this bug, we should take care of allocation failure, and return
correct value to let caller know what happen.
Tested-by: Chao Fan <fanc.fnst@cn.fujitsu.com>
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Signed-off-by: Minfei Huang <minfei.hmf@alibaba-inc.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Purely cosmetic at this point, as rbd doesn't use RADOS namespaces and
hence rbd_dev->header_oloc->pool_ns is always NULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since commit 63a4cc2486, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.
Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")
Modified by me to:
1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK
Signed-off-by: Jens Axboe <axboe@fb.com>
Fix a fat-fingered conversion to the req_op accessors, and also
use a switch statement to make it more obvious what is being checked.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Fixes: c2df40 ("drivers: use req op accessor");
Reviewed-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Quentin ran into this bug:
WARNING: CPU: 64 PID: 10085 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x65/0x80
sysfs: cannot create duplicate filename '/devices/virtual/block/nbd3/pid'
Modules linked in: nbd
CPU: 64 PID: 10085 Comm: qemu-nbd Tainted: G D 4.6.0+ #7
0000000000000000 ffff8820330bba68 ffffffff814b8791 ffff8820330bbac8
0000000000000000 ffff8820330bbab8 ffffffff810d04ab ffff8820330bbaa8
0000001f00000296 0000000000017681 ffff8810380bf000 ffffffffa0001790
Call Trace:
[<ffffffff814b8791>] dump_stack+0x4d/0x6c
[<ffffffff810d04ab>] __warn+0xdb/0x100
[<ffffffff810d0574>] warn_slowpath_fmt+0x44/0x50
[<ffffffff81218c65>] sysfs_warn_dup+0x65/0x80
[<ffffffff81218a02>] sysfs_add_file_mode_ns+0x172/0x180
[<ffffffff81218a35>] sysfs_create_file_ns+0x25/0x30
[<ffffffff81594a76>] device_create_file+0x36/0x90
[<ffffffffa0000e8d>] __nbd_ioctl+0x32d/0x9b0 [nbd]
[<ffffffff814cc8e8>] ? find_next_bit+0x18/0x20
[<ffffffff810f7c29>] ? select_idle_sibling+0xe9/0x120
[<ffffffff810f6cd7>] ? __enqueue_entity+0x67/0x70
[<ffffffff810f9bf0>] ? enqueue_task_fair+0x630/0xe20
[<ffffffff810efa76>] ? resched_curr+0x36/0x70
[<ffffffff810f0078>] ? check_preempt_curr+0x78/0x90
[<ffffffff810f00a2>] ? ttwu_do_wakeup+0x12/0x80
[<ffffffff810f01b1>] ? ttwu_do_activate.constprop.86+0x61/0x70
[<ffffffff810f0c15>] ? try_to_wake_up+0x185/0x2d0
[<ffffffff810f0d6d>] ? default_wake_function+0xd/0x10
[<ffffffff81105471>] ? autoremove_wake_function+0x11/0x40
[<ffffffffa0001577>] nbd_ioctl+0x67/0x94 [nbd]
[<ffffffff814ac0fd>] blkdev_ioctl+0x14d/0x940
[<ffffffff811b0da2>] ? put_pipe_info+0x22/0x60
[<ffffffff811d96cc>] block_ioctl+0x3c/0x40
[<ffffffff811ba08d>] do_vfs_ioctl+0x8d/0x5e0
[<ffffffff811aa329>] ? ____fput+0x9/0x10
[<ffffffff810e9092>] ? task_work_run+0x72/0x90
[<ffffffff811ba627>] SyS_ioctl+0x47/0x80
[<ffffffff8185f5df>] entry_SYSCALL_64_fastpath+0x17/0x93
---[ end trace 7899b295e4f850c8 ]---
It seems fairly obvious that device_create_file() is not being protected
from being run concurrently on the same nbd.
Quentin found the following relevant commits:
1a2ad21 nbd: add locking to nbd_ioctl
90b8f28 [PATCH] end of methods switch: remove the old ones
d4430d6 [PATCH] beginning of methods conversion
08f8585 [PATCH] move block_device_operations to blkdev.h
It would seem that the race was introduced in the process of moving nbd
from BKL to unlocked ioctls.
By setting nbd->task_recv while the mutex is held, we can prevent other
processes from running concurrently (since nbd->task_recv is also checked
while the mutex is held).
Reported-and-tested-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Markus Pargmann <mpa@pengutronix.de>
Cc: Paul Clements <paul.clements@steeleye.com>
Cc: Pavel Machek <pavel@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 09954bad4 ("floppy: refactor open() flags handling"), as a
side-effect, causes open(/dev/fdX, O_ACCMODE) to fail. It turns out that
this is being used setfdprm userspace for ioctl-only open().
Reintroduce back the original behavior wrt !(FMODE_READ|FMODE_WRITE)
modes, while still keeping the original O_NDELAY bug fixed.
Cc: stable@vger.kernel.org # v4.5+
Reported-by: Wim Osterholt <wim@djo.tudelft.nl>
Tested-by: Wim Osterholt <wim@djo.tudelft.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Merge yet more updates from Andrew Morton:
- the rest of ocfs2
- various hotfixes, mainly MM
- quite a bit of misc stuff - drivers, fork, exec, signals, etc.
- printk updates
- firmware
- checkpatch
- nilfs2
- more kexec stuff than usual
- rapidio updates
- w1 things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
ipc: delete "nr_ipc_ns"
kcov: allow more fine-grained coverage instrumentation
init/Kconfig: add clarification for out-of-tree modules
config: add android config fragments
init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
relay: add global mode support for buffer-only channels
init: allow blacklisting of module_init functions
w1:omap_hdq: fix regression
w1: add helper macro module_w1_family
w1: remove need for ida and use PLATFORM_DEVID_AUTO
rapidio/switches: add driver for IDT gen3 switches
powerpc/fsl_rio: apply changes for RIO spec rev 3
rapidio: modify for rev.3 specification changes
rapidio: change inbound window size type to u64
rapidio/idt_gen2: fix locking warning
rapidio: fix error handling in mbox request/release functions
rapidio/tsi721_dma: advance queue processing from transfer submit call
rapidio/tsi721: add messaging mbox selector parameter
rapidio/tsi721: add PCIe MRRS override parameter
rapidio/tsi721_dma: add channel mask and queue size parameters
...
* RADOS namespace support in libceph and CephFS (Zheng Yan and myself).
The stopgaps added in 4.5 to deny access to inodes in namespaces are
removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully
supported.
* A large rework of the MDS cap flushing code (Zheng Yan).
* Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU.
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXoKLJAAoJEEp/3jgCEfOLDTUIAIcctpKUiNBokc95mQaXYl34
j7lPIaD0/Ur7JPt4nMdtlywYJYSVV2c+SglHztj/+fv0G4bWbLVEFRruh9SwKIci
PzttcmycIAqSn1f5gBZwyQbGuffd/F0EnBj7fFjcukt01i3s1ZQ7t4XtLGtAV0Ts
aIfFtx9SqWig57Z1OZqNgnhnOoh6IqNbic3FL5Hvdl5N5pFbBcQho6Vzoa5O1osH
URG6RmCcO4nykfSoxiivE7UZ+CImsXHkRD7rupBuIjqjZ8wvmZqQF5qxnkb9Dw2F
IkNhrHkTSIiv4EsNPLAETTnFSozrL1nEykKr2FBW+ti8nxNcav+8FgVapqLvFIw=
=gQ0/
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client
Pull Ceph updates from Ilya Dryomov:
"The highlights are:
- RADOS namespace support in libceph and CephFS (Zheng Yan and
myself). The stopgaps added in 4.5 to deny access to inodes in
namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature
bit is now fully supported
- A large rework of the MDS cap flushing code (Zheng Yan)
- Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were
overly pessimistic before, bailing at the first sight of LOOKUP_RCU
On top of that we've got a few CephFS bug fixes, a couple of cleanups
and Arnd's workaround for a weird genksyms issue"
* tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits)
ceph: fix symbol versioning for ceph_monc_do_statfs
ceph: Correctly return NXIO errors from ceph_llseek
ceph: Mark the file cache as unreclaimable
ceph: optimize cap flush waiting
ceph: cleanup ceph_flush_snaps()
ceph: kick cap flushes before sending other cap message
ceph: introduce an inode flag to indicates if snapflush is needed
ceph: avoid sending duplicated cap flush message
ceph: unify cap flush and snapcap flush
ceph: use list instead of rbtree to track cap flushes
ceph: update types of some local varibles
ceph: include 'follows' of pending snapflush in cap reconnect message
ceph: update cap reconnect message to version 3
ceph: mount non-default filesystem by name
libceph: fsmap.user subscription support
ceph: handle LOOKUP_RCU in ceph_d_revalidate
ceph: allow dentry_lease_is_valid to work under RCU walk
ceph: clear d_fsinfo pointer under d_lock
ceph: remove ceph_mdsc_lease_release
ceph: don't use ->d_time
...
kernel.h header doesn't directly use dynamic debug, instead we can
include it in module.c (which used it via kernel.h). printk.h only uses
it if CONFIG_DYNAMIC_DEBUG is on, changing the inclusion to only happen
in that case.
Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
[luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ Replace pcommit with ADR / directed-flushing:
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement either
ADR, or provide one or more flush addresses per nvdimm. ADR
(Asynchronous DRAM Refresh) flushes data in posted write buffers to the
memory controller on a power-fail event. Flush addresses are defined in
ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
"Flush Hint Address Structure". A flush hint is an mmio address that
when written and fenced assures that all previous posted writes
targeting a given dimm have been flushed to media.
2/ On-demand ARS (address range scrub):
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the media
to refresh the bad block list, userspace can also request a re-scrub at
any time.
3/ Support for the Microsoft DSM (device specific method) command format.
4/ Support for EDK2/OVMF virtual disk device memory ranges.
5/ Various fixes and cleanups across the subsystem.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
cN3edFVIdxvZeMgM5Ubq
=xCBG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Replace pcommit with ADR / directed-flushing.
The pcommit instruction, which has not shipped on any product, is
deprecated. Instead, the requirement is that platforms implement
either ADR, or provide one or more flush addresses per nvdimm.
ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
to the memory controller on a power-fail event.
Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
A flush hint is an mmio address that when written and fenced assures
that all previous posted writes targeting a given dimm have been
flushed to media.
- On-demand ARS (address range scrub).
Linux uses the results of the ACPI ARS commands to track bad blocks
in pmem devices. When latent errors are detected we re-scrub the
media to refresh the bad block list, userspace can also request a
re-scrub at any time.
- Support for the Microsoft DSM (device specific method) command
format.
- Support for EDK2/OVMF virtual disk device memory ranges.
- Various fixes and cleanups across the subsystem.
* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
nfit: do an ARS scrub on hitting a latent media error
nfit: move to nfit/ sub-directory
nfit, libnvdimm: allow an ARS scrub to be triggered on demand
libnvdimm: register nvdimm_bus devices with an nd_bus driver
pmem: clarify a debug print in pmem_clear_poison
x86/insn: remove pcommit
Revert "KVM: x86: add pcommit support"
nfit, tools/testing/nvdimm/: unify shutdown paths
libnvdimm: move ->module to struct nvdimm_bus_descriptor
nfit: cleanup acpi_nfit_init calling convention
nfit: fix _FIT evaluation memory leak + use after free
tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
tools/testing/nvdimm: add virtual ramdisk range
acpi, nfit: treat virtual ramdisk SPA as pmem region
pmem: kill __pmem address space
pmem: kill wmb_pmem()
libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
fs/dax: remove wmb_pmem()
libnvdimm, pmem: flush posted-write queues on shutdown
...
Pull vfs updates from Al Viro:
"Assorted cleanups and fixes.
Probably the most interesting part long-term is ->d_init() - that will
have a bunch of followups in (at least) ceph and lustre, but we'll
need to sort the barrier-related rules before it can get used for
really non-trivial stuff.
Another fun thing is the merge of ->d_iput() callers (dentry_iput()
and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
except the one in __d_lookup_lru())"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
fs/dcache.c: avoid soft-lockup in dput()
vfs: new d_init method
vfs: Update lookup_dcache() comment
bdev: get rid of ->bd_inodes
Remove last traces of ->sync_page
new helper: d_same_name()
dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
vfs: clean up documentation
vfs: document ->d_real()
vfs: merge .d_select_inode() into .d_real()
unify dentry_iput() and dentry_unlink_inode()
binfmt_misc: ->s_root is not going anywhere
drop redundant ->owner initializations
ufs: get rid of redundant checks
orangefs: constify inode_operations
missed comment updates from ->direct_IO() prototype change
file_inode(f)->i_mapping is f->f_mapping
trim fsnotify hooks a bit
9p: new helper - v9fs_parent_fid()
debugfs: ->d_parent is never NULL or negative
...
Add pool namesapce pointer to struct ceph_file_layout and struct
ceph_object_locator. Pool namespace is used by when mapping object
to PG, it's also used when composing OSD request.
The namespace pointer in struct ceph_file_layout is RCU protected.
So libceph can read namespace without taking lock.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
[idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes]
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Define new ceph_file_layout structure and rename old ceph_file_layout
to ceph_file_layout_legacy. This is preparation for adding namespace
to ceph_file_layout structure.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
- ACPI support for guests on ARM platforms.
- Generic steal time support for arm and x86.
- Support cases where kernel cpu is not Xen VCPU number (e.g., if
in-guest kexec is used).
- Use the system workqueue instead of a custom workqueue in various
places.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXmLlrAAoJEFxbo/MsZsTRvRQH/1wOMF8BmlbZfR7H3qwDfjst
ApNifCiZE08xDtWBlwUaBFAQxyflQS9BBiNZDVK0sysIdXeOdpWV7V0ZjRoLL+xr
czsaaGXDcmXxJxApoMDVuT7FeP6rEk6LVAYRoHpVjJjMZGW3BbX1vZaMW4DXl2WM
9YNaF2Lj+rpc1f8iG31nUxwkpmcXFog6ct4tu7HiyCFT3hDkHt/a4ghuBdQItCkd
vqBa1pTpcGtQBhSmWzlylN/PV2+NKcRd+kGiwd09/O/rNzogTMCTTWeHKAtMpPYb
Cu6oSqJtlK5o0vtr0qyLSWEGIoyjE2gE92s0wN3iCzFY1PldqdsxUO622nIj+6o=
=G6q3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Features and fixes for 4.8-rc0:
- ACPI support for guests on ARM platforms.
- Generic steal time support for arm and x86.
- Support cases where kernel cpu is not Xen VCPU number (e.g., if
in-guest kexec is used).
- Use the system workqueue instead of a custom workqueue in various
places"
* tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
xen: add static initialization of steal_clock op to xen_time_ops
xen/pvhvm: run xen_vcpu_setup() for the boot CPU
xen/evtchn: use xen_vcpu_id mapping
xen/events: fifo: use xen_vcpu_id mapping
xen/events: use xen_vcpu_id mapping in events_base
x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
xen: introduce xen_vcpu_id mapping
x86/acpi: store ACPI ids from MADT for future usage
x86/xen: update cpuid.h from Xen-4.7
xen/evtchn: add IOCTL_EVTCHN_RESTRICT
xen-blkback: really don't leak mode property
xen-blkback: constify instance of "struct attribute_group"
xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
xen-blkback: prefer xenbus_scanf() over xenbus_gather()
xen: support runqueue steal time on xen
arm/xen: add support for vm_assist hypercall
xen: update xen headers
xen-pciback: drop superfluous variables
xen-pciback: short-circuit read path used for merging write values
...
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
We now allocate streams from CPU_UP hot-plug path, there are no
context-dependent stream allocations anymore and we can schedule from
zcomp_strm_alloc(). Use GFP_KERNEL directly and drop a gfp_t parameter.
Link: http://lkml.kernel.org/r/20160531122017.2878-9-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add "deflate", "lz4hc", "842" algorithms to the list of known
compression backends. The real availability of those algorithms,
however, depends on the corresponding CONFIG_CRYPTO_FOO config options.
[sergey.senozhatsky@gmail.com: zram-add-more-compression-algorithms-v3]
Link: http://lkml.kernel.org/r/20160604024902.11778-7-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-8-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is no way to get a string with all the crypto comp algorithms
supported by the crypto comp engine, so we need to maintain our own
backends list. At the same time we additionally need to use
crypto_has_comp() to make sure that the user has requested a compression
algorithm that is recognized by the crypto comp engine. Relying on
/proc/crypto is not an options here, because it does not show
not-yet-inserted compression modules.
Example:
modprobe zram
cat /proc/crypto | grep -i lz4
modprobe lz4
cat /proc/crypto | grep -i lz4
name : lz4
driver : lz4-generic
module : lz4
So the user can't tell exactly if the lz4 is really supported from
/proc/crypto output, unless someone or something has loaded it.
This patch also adds crypto_has_comp() to zcomp_available_show(). We
store all the compression algorithms names in zcomp's `backends' array,
regardless the CONFIG_CRYPTO_FOO configuration, but show only those that
are also supported by crypto engine. This helps user to know the exact
list of compression algorithms that can be used.
Example:
module lz4 is not loaded yet, but is supported by the crypto
engine. /proc/crypto has no information on this module, while
zram's `comp_algorithm' lists it:
cat /proc/crypto | grep -i lz4
cat /sys/block/zram0/comp_algorithm
[lzo] lz4 deflate lz4hc 842
We still use the `backends' array to determine if the requested
compression backend is known to crypto api. This array, however, may not
contain some entries, therefore as the last step we call crypto_has_comp()
function which attempts to insmod the requested compression algorithm to
determine if crypto api supports it. The advantage of this method is that
now we permit the usage of out-of-tree crypto compression modules
(implementing S/W or H/W compression).
[sergey.senozhatsky@gmail.com: zram-use-crypto-api-to-check-alg-availability-v3]
Link: http://lkml.kernel.org/r/20160604024902.11778-4-sergey.senozhatsky@gmail.com
Link: http://lkml.kernel.org/r/20160531122017.2878-5-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This has started as a 'add zlib support' work, but after some thinking I
saw no blockers for a bigger change -- a switch to crypto API.
We don't have an idle zstreams list anymore and our write path now works
absolutely differently, preventing preemption during compression. This
removes possibilities of read paths preempting writes at wrong places
and opens the door for a move from custom LZO/LZ4 compression backends
implementation to a more generic one, using crypto compress API.
This patch set also eliminates the need of a new context-less crypto API
interface, which was quite hard to sell, so we can move along faster.
benchmarks:
(x86_64, 4GB, zram-perf script)
perf reported run-time fio (max jobs=3). I performed fio test with the
increasing number of parallel jobs (max to 3) on a 3G zram device, using
`static' data and the following crypto comp algorithms:
842, deflate, lz4, lz4hc, lzo
the output was:
- test running time (which can tell us what algorithms performs faster)
and
- zram mm_stat (which tells the compressed memory size, max used memory, etc).
It's just for information. for example, LZ4HC has twice the running
time of LZO, but the compressed memory size is: 23592960 vs 34603008
bytes.
test-fio-zram-842
197.907655282 seconds time elapsed
201.623142884 seconds time elapsed
226.854291345 seconds time elapsed
test-fio-zram-DEFLATE
253.259516155 seconds time elapsed
258.148563401 seconds time elapsed
290.251909365 seconds time elapsed
test-fio-zram-LZ4
27.022598717 seconds time elapsed
29.580522717 seconds time elapsed
33.293463430 seconds time elapsed
test-fio-zram-LZ4HC
56.393954615 seconds time elapsed
74.904659747 seconds time elapsed
101.940998564 seconds time elapsed
test-fio-zram-LZO
28.155948075 seconds time elapsed
30.390036330 seconds time elapsed
34.455773159 seconds time elapsed
zram mm_stat-s (max fio jobs=3)
test-fio-zram-842
mm_stat (jobs1): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs2): 3221225472 673185792 690266112 0 690266112 0 0
mm_stat (jobs3): 3221225472 673185792 690266112 0 690266112 0 0
test-fio-zram-DEFLATE
mm_stat (jobs1): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 24379392 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 24379392 37761024 0 37761024 0 0
test-fio-zram-LZ4
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZ4HC
mm_stat (jobs1): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs2): 3221225472 23592960 37761024 0 37761024 0 0
mm_stat (jobs3): 3221225472 23592960 37761024 0 37761024 0 0
test-fio-zram-LZO
mm_stat (jobs1): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs2): 3221225472 34603008 50335744 0 50335744 0 0
mm_stat (jobs3): 3221225472 34603008 50335744 0 50339840 0 0
This patch (of 8):
We don't perform any zstream idle list lookup anymore, so
zcomp_strm_find()/zcomp_strm_release() names are not representative.
Rename to zcomp_stream_get()/zcomp_stream_put().
Link: http://lkml.kernel.org/r/20160531122017.2878-2-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull block driver updates from Jens Axboe:
"This branch also contains core changes. I've come to the conclusion
that from 4.9 and forward, I'll be doing just a single branch. We
often have dependencies between core and drivers, and it's hard to
always split them up appropriately without pulling core into drivers
when that happens.
That said, this contains:
- separate secure erase type for the core block layer, from
Christoph.
- set of discard fixes, from Christoph.
- bio shrinking fixes from Christoph, as a followup up to the
op/flags change in the core branch.
- map and append request fixes from Christoph.
- NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
exciting!
- nvme-loop fixes from Arnd.
- removal of ->driverfs_dev from Dan, after providing a
device_add_disk() helper.
- bcache fixes from Bhaktipriya and Yijing.
- cdrom subchannel read fix from Vchannaiah.
- set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.
- set of drbd updates and fixes from Fabian, Lars, and Philipp.
- mg_disk error path fix from Bart.
- user notification for failed device add for loop, from Minfei.
- NVMe in general:
+ NVMe delay quirk from Guilherme.
+ SR-IOV support and command retry limits from Keith.
+ fix for memory-less NUMA node from Masayoshi.
+ use UINT_MAX for discard sectors, from Minfei.
+ cancel IO fixes from Ming.
+ don't allocate unused major, from Neil.
+ error code fixup from Dan.
+ use constants for PSDT/FUSE from James.
+ variable init fix from Jay.
+ fabrics fixes from Ming, Sagi, and Wei.
+ various fixes"
* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
nvme/pci: Provide SR-IOV support
nvme: initialize variable before logical OR'ing it
block: unexport various bio mapping helpers
scsi/osd: open code blk_make_request
target: stop using blk_make_request
block: simplify and export blk_rq_append_bio
block: ensure bios return from blk_get_request are properly initialized
virtio_blk: use blk_rq_map_kern
memstick: don't allow REQ_TYPE_BLOCK_PC requests
block: shrink bio size again
block: simplify and cleanup bvec pool handling
block: get rid of bio_rw and READA
block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
NVMe: don't allocate unused nvme_major
nvme: avoid crashes when node 0 is memoryless node.
nvme: Limit command retries
loop: Make user notify for adding loop device failed
nvme-loop: fix nvme-loop Kconfig dependencies
nvmet: fix return value check in nvmet_subsys_alloc()
...
Pull core block updates from Jens Axboe:
- the big change is the cleanup from Mike Christie, cleaning up our
uses of command types and modified flags. This is what will throw
some merge conflicts
- regression fix for the above for btrfs, from Vincent
- following up to the above, better packing of struct request from
Christoph
- a 2038 fix for blktrace from Arnd
- a few trivial/spelling fixes from Bart Van Assche
- a front merge check fix from Damien, which could cause issues on
SMR drives
- Atari partition fix from Gabriel
- convert cfq to highres timers, since jiffies isn't granular enough
for some devices these days. From Jan and Jeff
- CFQ priority boost fix idle classes, from me
- cleanup series from Ming, improving our bio/bvec iteration
- a direct issue fix for blk-mq from Omar
- fix for plug merging not involving the IO scheduler, like we do for
other types of merges. From Tahsin
- expose DAX type internally and through sysfs. From Toshi and Yigal
* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
block: Fix front merge check
block: do not merge requests without consulting with io scheduler
block: Fix spelling in a source code comment
block: expose QUEUE_FLAG_DAX in sysfs
block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
Btrfs: fix comparison in __btrfs_map_block()
block: atari: Return early for unsupported sector size
Doc: block: Fix a typo in queue-sysfs.txt
cfq-iosched: Charge at least 1 jiffie instead of 1 ns
cfq-iosched: Fix regression in bonnie++ rewrite performance
cfq-iosched: Convert slice_resid from u64 to s64
block: Convert fifo_time from ulong to u64
blktrace: avoid using timespec
block/blk-cgroup.c: Declare local symbols static
block/bio-integrity.c: Add #include "blk.h"
block/partition-generic.c: Remove a set-but-not-used variable
block: bio: kill BIO_MAX_SIZE
cfq-iosched: temporarily boost queue priority for idle classes
block: drbd: avoid to use BIO_MAX_SIZE
block: bio: remove BIO_MAX_SECTORS
...
Commit 9d092603cc ("xen-blkback: do not leak mode property") left one
path unfixed; correct this.
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The functions these get passed to have been taking pointers to const
since at least 2.6.16.
Acked-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
... for single items being collected: It is more typesafe (as the
compiler can check format string and to-be-written-to variable match)
and requires one less parameter to be passed.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Currently, presence of direct_access() in block_device_operations
indicates support of DAX on its block device. Because
block_device_operations is instantiated with 'const', this DAX
capablity may not be enabled conditinally.
In preparation for supporting DAX to device-mapper devices, add
QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
support. This will allow to set the DAX capability based on how
mapped device is composed.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API. Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Similar to how SCSI and NVMe prepare passthrough requests. This avoids
poking into request internals too much.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The __pmem address space was meant to annotate codepaths that touch
persistent memory and need to coordinate a call to wmb_pmem(). Now that
wmb_pmem() is gone, there is little need to keep this annotation.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
There is no error number returned if loop driver fails in function
alloc_disk to add new loop device. Add a correct error number to make
user notify in this case.
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Dan writes:
"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk(). See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.
It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1]. We would extend device_add_disk()
to take an attribute_group list.
This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.
[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
Pull block IO fixes from Jens Axboe:
"Three small fixes that have been queued up and tested for this series:
- A bug fix for xen-blkfront from Bob Liu, fixing an issue with
incomplete requests during migration.
- A fix for an ancient issue in retrieving the IO priority of a
different PID than self, preventing that task from going away while
we access it. From Omar.
- A writeback fix from Tahsin, fixing a case where we'd call ihold()
with a zero ref count inode"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: fix use-after-free in sys_ioprio_get()
writeback: inode cgroup wb switch should not call ihold()
xen-blkfront: save uncompleted reqs in blkfront_resume()
Uncompleted reqs used to be 'saved and resubmitted' in blkfront_recover() during
migration, but that's too late after multi-queue was introduced.
After a migrate to another host (which may not have multiqueue support), the
number of rings (block hardware queues) may be changed and the ring and shadow
structure will also be reallocated.
The blkfront_recover() then can't 'save and resubmit' the real
uncompleted reqs because shadow structure have been reallocated.
This patch fixes this issue by moving the 'save' logic out of
blkfront_recover() to earlier place in blkfront_resume().
The 'resubmit' is not changed and still in blkfront_recover().
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: stable@vger.kernel.org
MG_DISK_MAJ is defined as 0 so dynamic block major number
allocation is used by the driver and the assigned major
number is stored in host->major. This patch fixes error
path in mg_probe() to use host->major instead of using
MG_DISK_MAJ.
Cc: unsik Kim <donari75@gmail.com>
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For block drivers that specify a parent device, convert them to use
device_add_disk().
This conversion was done with the following semantic patch:
@@
struct gendisk *disk;
expression E;
@@
- disk->driverfs_dev = E;
...
- add_disk(disk);
+ device_add_disk(E, disk);
@@
struct gendisk *disk;
expression E1, E2;
@@
- disk->driverfs_dev = E1;
...
E2 = disk;
...
- add_disk(E2);
+ device_add_disk(E1, E2);
...plus some manual fixups for a few missed conversions.
Cc: Jens Axboe <axboe@fb.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
This is the third version of the patchset previously sent [1]. I have
basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
rid of superfluous gfp flags" which went through dm tree. I am sending
it now because it is tree wide and chances for conflicts are reduced
considerably when we want to target rc2. I plan to send the next step
and rename the flag and move to a better semantic later during this
release cycle so we will have a new semantic ready for 4.8 merge window
hopefully.
Motivation:
While working on something unrelated I've checked the current usage of
__GFP_REPEAT in the tree. It seems that a majority of the usage is and
always has been bogus because __GFP_REPEAT has always been about costly
high order allocations while we are using it for order-0 or very small
orders very often. It seems that a big pile of them is just a
copy&paste when a code has been adopted from one arch to another.
I think it makes some sense to get rid of them because they are just
making the semantic more unclear. Please note that GFP_REPEAT is
documented as
* __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
* _might_ fail. This depends upon the particular VM implementation.
while !costly requests have basically nofail semantic. So one could
reasonably expect that order-0 request with __GFP_REPEAT will not loop
for ever. This is not implemented right now though.
I would like to move on with __GFP_REPEAT and define a better semantic
for it.
$ git grep __GFP_REPEAT origin/master | wc -l
111
$ git grep __GFP_REPEAT | wc -l
36
So we are down to the third after this patch series. The remaining
places really seem to be relying on __GFP_REPEAT due to large allocation
requests. This still needs some double checking which I will do later
after all the simple ones are sorted out.
I am touching a lot of arch specific code here and I hope I got it right
but as a matter of fact I even didn't compile test for some archs as I
do not have cross compiler for them. Patches should be quite trivial to
review for stupid compile mistakes though. The tricky parts are usually
hidden by macro definitions and thats where I would appreciate help from
arch maintainers.
[1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
This patch (of 19):
__GFP_REPEAT has a rather weak semantic but since it has been introduced
around 2.6.12 it has been ignored for low order allocations. Yet we
have the full kernel tree with its usage for apparently order-0
allocations. This is really confusing because __GFP_REPEAT is
explicitly documented to allow allocation failures which is a weaker
semantic than the current order-0 has (basically nofail).
Let's simply drop __GFP_REPEAT from those places. This would allow to
identify place which really need allocator to retry harder and formulate
a more specific semantic for what the flag is supposed to do actually.
Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Crispin <blogic@openwrt.org>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
crypto_alloc_hash returns an ERR_PTR(), not NULL.
Also reset peer_integrity_tfm to NULL, to not call crypto_free_hash()
on an errno in the cleanup path.
Reported-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For larger devices, the array of bitmap page pointers can grow very
large (8000 pointers per TB of storage).
For each activity log transaction, we need to flush the associated
bitmap pages to stable storage. Currently, we just "mark" the respective
pages while setting up the transaction, then tell the bitmap code to
write out all marked pages, but skip unchanged pages.
But one such transaction can affect only a small number of bitmap pages,
there is no need to scan the full array of several (ten-)thousand
page pointers to find the few marked ones.
Instead, remember the index numbers of the few affected pages,
and later only re-check those to skip duplicates and unchanged ones.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Also skip the message unless bitmap IO took longer than 5 ms.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This should silence a warning about an empty statement. Thanks to Fabian
Frederick <fabf@skynet.be> who sent a patch I modified to be smaller and
avoids an additional indent level.
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This contains various cosmetic fixes ranging from simple typos to
const-ifying, and using booleans properly.
Original commit messages from Fabian's patch set:
drbd: debugfs: constify drbd_version_fops
drbd: use seq_put instead of seq_print where possible
drbd: include linux/uaccess.h instead of asm/uaccess.h
drbd: use const char * const for drbd strings
drbd: kerneldoc warning fix in w_e_end_data_req()
drbd: use unsigned for one bit fields
drbd: use bool for peer is_ states
drbd: fix typo
drbd: use | for bitmask combination
drbd: use true/false for bool
drbd: fix drbd_bm_init() comments
drbd: introduce peer state union
drbd: fix maybe_pull_ahead() locking comments
drbd: use bool for growing
drbd: remove redundant declarations
drbd: replace if/BUG by BUG_ON
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Scenario, starting with normal operation
Connected Primary/Secondary UpToDate/UpToDate
NetworkFailure Primary/Unknown UpToDate/DUnknown (frozen)
... more failures happen, secondary loses it's disk,
but eventually is able to re-establish the replication link ...
Connected Primary/Secondary UpToDate/Diskless (resumed; needs to bump uuid!)
We used to just resume/resent suspended requests,
without bumping the UUID.
Which will lead to problems later, when we want to re-attach the disk on
the peer, without first disconnecting, or if we experience additional
failures, because we now have diverging data without being able to
recognize it.
Make sure we also bump the current data generation UUID,
if we notice "peer disk unknown" -> "peer disk known bad".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We already serialize connection state changes,
and other, non-connection state changes (role changes)
while we are establishing a connection.
But if we have an established connection,
then trigger a resync handshake (by primary --force or similar),
until now we just had to be "lucky".
Consider this sequence (e.g. deployment scenario):
create-md; up;
-> Connected Secondary/Secondary Inconsistent/Inconsistent
then do a racy primary --force on both peers.
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> Connected ) pdsk( DUnknown -> Inconsistent )
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** HERE things go wrong. ***
block drbd0: role( Secondary -> Primary )
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000005:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer C90D2FC716D232AB:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: Becoming sync target due to disk states.
block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
block drbd0: Remote failed to finish a request within 6007ms > ko-count (2) * timeout (30 * 0.1s)
drbd s0: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
The problem here is that the local promotion happens before the sync handshake
triggered by the remote promotion was completed. Some assumptions elsewhere
become wrong, and when the expected resync handshake is then received and
processed, we get stuck in a deadlock, which can only be recovered by reboot :-(
Fix: if we know the peer has good data,
and our own disk is present, but NOT good,
and there is no resync going on yet,
we expect a sync handshake to happen "soon".
So reject a racy promotion with SS_IN_TRANSIENT_STATE.
Result:
... as above ...
block drbd0: peer( Secondary -> Primary ) pdsk( Inconsistent -> UpToDate )
*** local promotion being postponed until ... ***
block drbd0: drbd_sync_handshake:
block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:25590 flags:0
block drbd0: peer 77868BDA836E12A5:0000000000000004:0000000000000000:0000000000000000 bits:25590 flags:0
...
block drbd0: conn( WFBitMapT -> WFSyncUUID )
block drbd0: updated sync uuid 85D06D0E8887AD44:0000000000000000:0000000000000000:0000000000000000
block drbd0: conn( WFSyncUUID -> SyncTarget )
*** ... after the resync handshake ***
block drbd0: role( Secondary -> Primary )
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If in a two-primary scenario, we lost our peer, freeze IO,
and are still frozen (no UUID rotation) when the peer comes back
as Secondary after a hard crash, we will see identical UUIDs.
The "rule_nr = 40" chose to use the "CRASHED_PRIMARY" bit as
arbitration, but that would cause the still running (but frozen) Primary
to become SyncTarget (which it typically refuses), and the handshake is
declined.
Fix: check current roles.
If we have *one* current primary, the Primary wins.
(rule_nr = 41)
Since that is a protocol change, use the newly introduced DRBD_FF_WSAME
to determine if rule_nr = 41 can be applied.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We will support WRITE_SAME, if
* all peers support WRITE_SAME (both in kernel and DRBD version),
* all peer devices support WRITE_SAME
* logical_block_size is identical on all peers.
We may at some point introduce a fallback on the receiving side
for devices/kernels that do not support WRITE_SAME,
by open-coding a submit loop. But not yet.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Even if discard_zeroes_data != 0,
if discard_zeroes_if_aligned is set, we assume we can reliably
zero-out/discard using the drbd_issue_peer_discard() helper.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When re-attaching the local backend device to a C_STANDALONE D_DISKLESS
R_PRIMARY with OND_SUSPEND_IO, we may only resume IO if we recognize the
backend that is being attached as D_UP_TO_DATE.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If DRBD lost all path to good data,
and the on-no-data-accessible policy is OND_SUSPEND_IO,
all pending and new IO requests are suspended (will block).
If that setting is OND_IO_ERROR, IO will still be completed.
READ to "clean" areas (e.g. on an D_INCONSISTENT device,
and bitmap indicates a block is already in sync) will succeed.
READ to "unclean" areas (bitmap indicates block is out-of-sync),
will return EIO.
If we are already D_DISKLESS (or D_FAILED), we also return EIO.
Unfortunately, on a former R_PRIMARY C_SYNC_TARGET D_INCONSISTENT,
after replication link loss, new WRITE requests still went through OK.
The would also set the "out-of-sync" bit on their way, so READ after
WRITE would still return EIO. Also, the data generation UUIDs had not
been bumped, we would cause data divergence, without being able to
detect it on the next sync handshake, given the right sequence of events
in a multiple error scenario and "improper" order of recovery actions.
The right thing to do is to return EIO for all new writes,
unless we have access to good, current, D_UP_TO_DATE data.
The "established best practices" way to avoid these situations in the
first place is to set OND_SUSPEND_IO, or even do a hard-reset from
the pri-on-incon-degr policy helper hook.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Possibly sequence of events:
SyncTarget is made Primary, then loses replication link
(only path to good data on SyncSource).
Behavior is then controlled by the on-no-data-accessible policy,
which defaults to OND_IO_ERROR (may be set to OND_SUSPEND_IO).
If OND_IO_ERROR is in fact the current policy, we clear the susp_fen
(IO suspended due to fencing policy) flag, do NOT set the susp_nod
(IO suspended due to no data) flag.
But we forgot to call the IO error completion for all pending,
suspended, requests.
While at it, also add a race check for a theoretically possible
race with a new handshake (network hickup), we may be able to
re-send requests, and can avoid passing IO errors up the stack.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When resync is finished, we already call the "after-resync-target"
handler (on the former sync target, obviously), once per volume.
Paired with the before-resync-target handler, you can create snapshots,
before the resync causes the volumes to become inconsistent,
and discard those snapshots again, once they are no longer needed.
It was also overloaded to be paired with the "fence-peer" handler,
to "unfence" once the volumes are up-to-date and known good.
This has some disadvantages, though: we call "fence-peer" for the whole
connection (once for the group of volumes), but would call unfence as
side-effect of after-resync-target once for each volume.
Also, we fence on a (current, or about to become) Primary,
which will later become the sync-source.
Calling unfence only as a side effect of the after-resync-target
handler opens a race window, between a new fence on the Primary
(SyncTarget) and the unfence on the SyncTarget, which is difficult to
close without some kind of "cluster wide lock" in those handlers.
We would not need those handlers if we could still communicate.
Which makes trying to aquire a cluster wide lock from those handlers
seem like a very bad idea.
This introduces the "unfence-peer" handler, which will be called
per connection (once for the group of volumes), just like the fence
handler, only once all volumes are back in sync, and on the SyncSource.
Which is expected to be the node that previously called "fence", the
node that is currently allowed to be Primary, and thus the only node
that could trigger a new "fence" that could race with this unfence.
Which makes us not need any cluster wide synchronization here,
serializing two scripts running on the same node is trivial.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If the replication link breaks exactly during "resync finished" detection,
finishing too early on the sync source could again lead to UUIDs rotated
too fast, and potentially a spurious full resync on next handshake.
Always wait for explicit resync finished state change notification from
the sync target.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Make sure we have at least 67 (> AL_UPDATES_PER_TRANSACTION)
al-extents available, and allow up to half of that to be
discarded in one bio.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For consistency, also zero-out partial unaligned chunks of discard
requests on the local backend.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we have the discard_zeroes_if_aligned setting, we should also
check it when setting up our queue parameters on the primary,
not only on the receiving side.
We announce discard support,
UNLESS
* we are connected to a peer that does not support TRIM
on the DRBD protocol level. Otherwise, it would either discard, or
do a fallback to zero-out, depending on its backend and configuration.
* our local backend does not support discards,
or (discard_zeroes_data=0 AND discard_zeroes_if_aligned=no).
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We can avoid spurious data divergence caused by partially-ignored
discards on certain backends with discard_zeroes_data=0, if we
translate partial unaligned discard requests into explicit zero-out.
The relevant use case is LVM/DM thin.
If on different nodes, DRBD is backed by devices with differing
discard characteristics, discards may lead to data divergence
(old data or garbage left over on one backend, zeroes due to
unmapped areas on the other backend). Online verify would now
potentially report tons of spurious differences.
While probably harmless for most use cases (fstrim on a file system),
DRBD cannot have that, it would violate our promise to upper layers
that our data instances on the nodes are identical.
To be correct and play safe (make sure data is identical on both copies),
we would have to disable discard support, if our local backend (on a
Primary) does not support "discard_zeroes_data=true".
We'd also have to translate discards to explicit zero-out on the
receiving (typically: Secondary) side, unless the receiving side
supports "discard_zeroes_data=true".
Which both would allocate those blocks, instead of unmapping them,
in contrast with expectations.
LVM/DM thin does set discard_zeroes_data=0,
because it silently ignores discards to partial chunks.
We can work around this by checking the alignment first.
For unaligned (wrt. alignment and granularity) or too small discards,
we zero-out the initial (and/or) trailing unaligned partial chunks,
but discard all the aligned full chunks.
At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
Arguably it should behave this way internally, by default,
and we'll try to make that happen.
But our workaround is still valid for already deployed setups,
and for other devices that may behave this way.
Setting discard-zeroes-if-aligned=yes will allow DRBD to use
discards, and to announce discard_zeroes_data=true, even on
backends that announce discard_zeroes_data=false.
Setting discard-zeroes-if-aligned=no will cause DRBD to always
fall-back to zero-out on the receiving side, and to not even
announce discard capabilities on the Primary, if the respective
backend announces discard_zeroes_data=false.
We used to ignore the discard_zeroes_data setting completely.
To not break established and expected behaviour, and suddenly
cause fstrim on thin-provisioned LVs to run out-of-space,
instead of freeing up space, the default value is "yes".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To maintain write-order fidelity accros all volumes in a DRBD resource,
the receiver of a P_BARRIER needs to issue flushes to all volumes.
We used to do this by calling blkdev_issue_flush(), synchronously,
one volume at a time.
We now submit all flushes to all volumes in parallel, then wait for all
completions, to reduce worst-case latencies on multi-volume resources.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The command line parameter the kernel module uses to communicate the
device minor to userland helper is flawed in a way that the device
indentifier "minor-%d" is being truncated to minors with a maximum
of 5 digits.
But DRBD 8.4 allows 2^20 == 1048576 minors,
thus a minimum of 7 digits must be supported.
Reported by Veit Wahlich on drbd-dev.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Regression introduced with 8.4.5
drbd: application writes may set-in-sync in protocol != C
Overwriting the same block (LBA) while a former version is still
"in-flight" to the peer (to be exact: we did not receive the
P_BARRIER_ACK for its epoch yet) would wait for the full epoch of that
former version to be acknowledged by the peer.
In synchronous and quasi-synchronous protocols C and B,
this may double the latency on overwrites.
With protocol A, which is supposed to be asynchronous and only wait for
local completion, it is even worse: it would make overwrites
quasi-synchronous, they would be hit by the full RTT, which protocol A
was specifically meant to avoid, and possibly the additional time it
takes to drain the buffers first.
Particularly bad for databases, or anything else that
does frequent updates to the same blocks (various file system meta data).
No impact if >= rtt passes between updates to the same block.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If thinly provisioned volumes are used, during a resync the sync source
tries to find out if a block is deallocated. If it is deallocated, then
the resync target uses block_dev_issue_zeroout() on the range in
question.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As long as the value is 0 the feature is disabled. With setting
it to a positive value, DRBD limits and aligns its resync requests
to the rs-discard-granularity setting. If the sync source detects
all zeros in such a block, the resync target discards the range
on disk.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
If during resync we read only zeroes for a range of sectors assume
that these secotors can be discarded on the sync target node.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When leaving resync states because of disconnect,
do the bitmap write-out synchronously in the drbd_disconnected() path.
When leaving resync states because we go back to AHEAD/BEHIND, or
because resync actually finished, or some disk was lost during resync,
trigger the write-out from after_state_ch().
The bitmap write-out for resync -> ahead/behind was missing completely before.
Note that this is all only an optimization to avoid double-resyncs of
already completed blocks in case this node crashes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The intention was to only suspend IO if some normal bitmap operation is
supposed to be locked out, not always. If the bulk operation is flaged
as BM_LOCKED_CHANGE_ALLOWED, we do not need to suspend IO.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Use BIO_MAX_PAGES instead and we will remove BIO_MAX_SIZE.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Instead of overloading the discard support with the REQ_SECURE flag.
Use the opportunity to rename the queue flag as well, and remove the
dead checks for this flag in the RAID 1 and RAID 10 drivers that don't
claim support for secure erase.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Because we define WRITE/READ as REQ_OPs, we cannot do
switch (rq_data_dir(request))
case READ
....
case WRITE
...
without getting warnings about handling other REQ_OPs.
This just has mq_disk do a if/else like it does in other
places.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
After a migrate to another host (which may not have multiqueue
support), the number of rings (block hardware queues)
may be changed and the ring info structure will also be reallocated.
This patch fixes two related bugs:
* call blk_mq_update_nr_hw_queues() to make blk-core know the number
of hardware queues have been changed.
* Don't store rinfo pointer to hctx->driver_data, because rinfo may be
reallocated so use hctx->queue_num to get the rinfo structure instead.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Sometimes blkfront may twice receive blkback_changed() notification
(XenbusStateConnected) after migration, which will cause
talk_to_blkback() to be called twice too and confuse xen-blkback.
The flow is as follow:
blkfront blkback
blkfront_resume()
> talk_to_blkback()
> Set blkfront to XenbusStateInitialised
front changed()
> Connect()
> Set blkback to XenbusStateConnected
blkback_changed()
> Skip talk_to_blkback()
because frontstate == XenbusStateInitialised
> blkfront_connect()
> Set blkfront to XenbusStateConnected
-----
And here we get another XenbusStateConnected notification leading
to:
-----
blkback_changed()
> because now frontstate != XenbusStateInitialised
talk_to_blkback() is also called again
> blkfront state changed from
XenbusStateConnected to XenbusStateInitialised
(Which is not correct!)
front_changed():
> Do nothing because blkback
already in XenbusStateConnected
Now blkback is in XenbusStateConnected but blkfront is still
in XenbusStateInitialised - leading to no disks.
Poking of the XenbusStateConnected state is allowed (to deal with
block disk change) and has to be dealt with. The most likely
cause of this bug are custom udev scripts hooking up the disks
and then validating the size.
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We were passing in &nbd for the private data in debugfs_create_file() for the
flags entry. We expect it to just be nbd, fix this so we get proper output from
this debugfs entry.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The last patch added a REQ_OP_FLUSH for request_fn drivers
and the next patch renames REQ_FLUSH to REQ_PREFLUSH which
will be used by file systems and make_request_fn drivers so
they can send a write/flush combo.
This patch drops xen's use of REQ_FLUSH to track if it supports
REQ_OP_FLUSH requests, so REQ_FLUSH can be deleted.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Juergen Gross <kernel@pfupf.net>
Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a REQ_OP_FLUSH operation that is sent to request_fn
based drivers by the block layer's flush code, instead of
sending requests with the request->cmd_flags REQ_FLUSH bit set.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
The req operation REQ_OP is separated from the rq_flag_bits
definition. This converts the block layer drivers to
use req_op to get the op from the request struct.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have xen
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Separate the op from the rq_flag_bits and have drbd
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op
These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
We currently set REQ_WRITE/WRITE for all non READ IOs
like discard, flush, writesame, etc. In the next patches where we
no longer set up the op as a bitmap, we will not be able to
detect a operation direction like writesame by testing if REQ_WRITE is
set.
This patch converts the drivers and cgroup to use the
op_is_write helper. This should just cover the simple
cases. I did dm, md and bcache in their own patches
because they were more involved.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixed up fs/ext4/crypto.c
Signed-off-by: Jens Axboe <axboe@fb.com>
- Until now, dax has been disabled if media errors were found on
any device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition
is page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent reads/writes
would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX related to
zeroing, writeback, and some size checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
+yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
9R6JdrAj0Sc+FBa+cGzH
=v6Of
-----END PGP SIGNATURE-----
Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c
Pull Ceph updates from Sage Weil:
"This changeset has a few main parts:
- Ilya has finished a huge refactoring effort to sync up the
client-side logic in libceph with the user-space client code, which
has evolved significantly over the last couple years, with lots of
additional behaviors (e.g., how requests are handled when cluster
is full and transitions from full to non-full).
This structure of the code is more closely aligned with userspace
now such that it will be much easier to maintain going forward when
behavior changes take place. There are some locking improvements
bundled in as well.
- Zheng adds multi-filesystem support (multiple namespaces within the
same Ceph cluster)
- Zheng has changed the readdir offsets and directory enumeration so
that dentry offsets are hash-based and therefore stable across
directory fragmentation events on the MDS.
- Zheng has a smorgasbord of bug fixes across fs/ceph"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits)
ceph: fix wake_up_session_cb()
ceph: don't use truncate_pagecache() to invalidate read cache
ceph: SetPageError() for writeback pages if writepages fails
ceph: handle interrupted ceph_writepage()
ceph: make ceph_update_writeable_page() uninterruptible
libceph: make ceph_osdc_wait_request() uninterruptible
ceph: handle -EAGAIN returned by ceph_update_writeable_page()
ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM
ceph: block non-fatal signals for fault/page_mkwrite
ceph: make logical calculation functions return bool
ceph: tolerate bad i_size for symlink inode
ceph: improve fragtree change detection
ceph: keep leaf frag when updating fragtree
ceph: fix dir_auth check in ceph_fill_dirfrag()
ceph: don't assume frag tree splits in mds reply are sorted
ceph: fix inode reference leak
ceph: using hash value to compose dentry offset
ceph: don't forbid marking directory complete after forward seek
ceph: record 'offset' for each entry of readdir result
ceph: define 'end/complete' in readdir reply as bit flags
...
For map check, we are going to need to send CEPH_MSG_MON_GET_VERSION
messages asynchronously and get a callback on completion. Refactor MON
client to allow firing off generic requests asynchronously and add an
async variant of ceph_monc_get_version(). ceph_monc_do_statfs() is
switched over and remains sync.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This adds support and switches rbd to a new, more reliable version of
watch/notify protocol. As with the OSD client update, this is mostly
about getting the right structures linked into the right places so that
reconnects are properly sent when needed. watch/notify v2 also
requires sending regular pings to the OSDs - send_linger_ping().
A major change from the old watch/notify implementation is the
introduction of ceph_osd_linger_request - linger requests no longer
piggy back on ceph_osd_request. ceph_osd_event has been merged into
ceph_osd_linger_request.
All the details are now hidden within libceph, the interface consists
of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack().
ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep
the lifetime management simple.
ceph_osdc_notify_ack() accepts an optional data payload, which is
relayed back to the notifier.
Portions of this patch are loosely based on work by Douglas Fuller
<dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce __rbd_dev_header_unwatch_sync(), which doesn't flush notify
callbacks. This is for the new rados_watcherrcb_t, which would be
called from a notify callback.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
finish_read(), its only user, uses it to get to hdr.data_len, which is
what ->r_result is set to on success. This gains us the ability to
safely call callbacks from contexts other than reply, e.g. map check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The crux of this is getting rid of ceph_osdc_build_request(), so that
MOSDOp can be encoded not before but after calc_target() calculates the
actual target. Encoding now happens within ceph_osdc_start_request().
Also nuked is the accompanying bunch of pointers into the encoded
buffer that was used to update fields on each send - instead, the
entire front is re-encoded. If we want to support target->name_len !=
base->name_len in the future, there is no other way, because oid is
surrounded by other fields in the encoded buffer.
Encoding OSD ops and adding data items to the request message were
mixed together in osd_req_encode_op(). While we want to re-encode OSD
ops, we don't want to add duplicate data items to the message when
resending, so all call to ceph_osdc_msg_data_add() are factored out
into a new setup_request_data().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Switch to ceph_object_id and use ceph_oid_aprintf() instead of a bare
const char *. This reduces noise in rbd_dev_header_name().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently ceph_object_id can hold object names of up to 100
(CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases,
expect one - long rbd image names:
- a format 1 header is named "<imgname>.rbd"
- an object that points to a format 2 header is named "rbd_id.<imgname>"
We operate on these potentially long-named objects during rbd map, and,
for format 1 images, during header refresh. (A format 2 header name is
a small system-generated string.)
Lift this 100 character limit by making ceph_object_id be able to point
to an externally-allocated string. Apart from being able to work with
almost arbitrarily-long named objects, this allows us to reduce the
size of ceph_object_id from >100 bytes to 64 bytes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The size of ->r_request and ->r_reply messages depends on the size of
the object name (ceph_object_id), while the size of ceph_osd_request is
fixed. Move message allocation into a separate function that would
have to be called after ceph_object_id and ceph_object_locator (which
is also going to become variable in size with RADOS namespaces) have
been filled in:
req = ceph_osdc_alloc_request(...);
<fill in req->r_base_oid>
<fill in req->r_base_oloc>
ceph_osdc_alloc_messages(req);
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
By the time we get to checking for_each_obj_request_safe(img_request)
terminating condition, all obj_requests may be complete and img_request
ref, that rbd_img_request_submit() takes away from its caller, may be
put. Moving the next_obj_request cursor is then a use-after-free on
img_request.
It's totally benign, as the value that's read is never used, but
I think it's still worth fixing.
Cc: Alex Elder <elder@linaro.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
debug_stat sysfs is read-only and represents various debugging data that
zram developers may need. This file is not meant to be used by anyone
else: its content is not documented and will change any time w/o any
notice. Therefore, the output of debug_stat file contains a version
string. To avoid any confusion, we will increase the version number
every time we modify the output.
At the moment this file exports only one value -- the number of
re-compressions, IOW, the number of times compression fast path has
failed. This stat is temporary any will be useful in case if any
per-cpu compression streams regressions will be reported.
Link: http://lkml.kernel.org/r/20160513230834.GB26763@bbox
Link: http://lkml.kernel.org/r/20160511134553.12655-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove the internal part of max_comp_streams interface, since we
switched to per-cpu streams. We will keep RW max_comp_streams attr
around, because:
a) we may (silently) switch back to idle compression streams list and
don't want to disturb user space
b) max_comp_streams attr must wait for the next 'lay off cycle'; we
give user space 2 years to adjust before we remove/downgrade the attr,
and there are already several attrs scheduled for removal in 4.11, so
it's too late for max_comp_streams.
This slightly change a user visible behaviour:
- First, reading from max_comp_stream file now will always return the
number of online CPUs.
- Second, writing to max_comp_stream will not take any effect.
Link: http://lkml.kernel.org/r/20160503165546.25201-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
zs_create_pool(), so we can be more flexible, but, more importantly, we
need this to switch zram to per-cpu compression streams -- zram will try
to allocate handle with preemption disabled in a fast path and switch to
a slow path (using different gfp mask) if the fast one has failed.
Apart from that, this also align zs_malloc() interface with zspool/zbud.
[sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Many developers already know that field for reference count of the
struct page is _count and atomic type. They would try to handle it
directly and this could break the purpose of page reference count
tracepoint. To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code. After that, developer
who need to handle reference count will find that field should not be
accessed directly.
[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1/ If a mapping overlaps a bad sector fail the request.
2/ Do not opportunistically report more dax-capable capacity than is
requested when errors present.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
[vishal: fix a conflict with system RAM collision patches]
[vishal: add a 'size' parameter to ->direct_access]
[vishal: fix a conflict with DAX alignment check patches]
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Pull networking updates from David Miller:
"Highlights:
1) Support SPI based w5100 devices, from Akinobu Mita.
2) Partial Segmentation Offload, from Alexander Duyck.
3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.
4) Allow cls_flower stats offload, from Amir Vadai.
5) Implement bpf blinding, from Daniel Borkmann.
6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
actually using FASYNC these atomics are superfluous. From Eric
Dumazet.
7) Run TCP more preemptibly, also from Eric Dumazet.
8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
driver, from Gal Pressman.
9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.
10) Improve BPF usage documentation, from Jesper Dangaard Brouer.
11) Support tunneling offloads in qed, from Manish Chopra.
12) aRFS offloading in mlx5e, from Maor Gottlieb.
13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
Leitner.
14) Add MSG_EOR support to TCP, this allows controlling packet
coalescing on application record boundaries for more accurate
socket timestamp sampling. From Martin KaFai Lau.
15) Fix alignment of 64-bit netlink attributes across the board, from
Nicolas Dichtel.
16) Per-vlan stats in bridging, from Nikolay Aleksandrov.
17) Several conversions of drivers to ethtool ksettings, from Philippe
Reynes.
18) Checksum neutral ILA in ipv6, from Tom Herbert.
19) Factorize all of the various marvell dsa drivers into one, from
Vivien Didelot
20) Add VF support to qed driver, from Yuval Mintz"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
Revert "phy dp83867: Make rgmii parameters optional"
r8169: default to 64-bit DMA on recent PCIe chips
phy dp83867: Make rgmii parameters optional
phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
bpf: arm64: remove callee-save registers use for tmp registers
asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
switchdev: pass pointer to fib_info instead of copy
net_sched: close another race condition in tcf_mirred_release()
tipc: fix nametable publication field in nl compat
drivers: net: Don't print unpopulated net_device name
qed: add support for dcbx.
ravb: Add missing free_irq() calls to ravb_close()
qed: Remove a stray tab
net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fec-mpc52xx: use phydev from struct net_device
bpf, doc: fix typo on bpf_asm descriptions
stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
net: ethernet: fs-enet: use phydev from struct net_device
...
Pull block driver updates from Jens Axboe:
"On top of the core pull request, this is the drivers pull request for
this merge window. This contains:
- Switch drivers to the new write back cache API, and kill off the
flush flags. From me.
- Kill the discard support for the STEC pci-e flash driver. It's
trivially broken, and apparently unmaintained, so it's safer to
just remove it. From Jeff Moyer.
- A set of lightnvm updates from the usual suspects (Matias/Javier,
and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei
Tao.
- A set of updates for NVMe:
- Turn the controller state management into a proper state
machine. From Christoph.
- Shuffling of code in preparation for NVMe-over-fabrics, also
from Christoph.
- Cleanup of the command prep part from Ming Lin.
- Rewrite of the discard support from Ming Lin.
- Deadlock fix for namespace removal from Ming Lin.
- Use the now exported blk-mq tag helper for IO termination.
From Sagi.
- Various little fixes from Christoph, Guilherme, Keith, Ming
Lin, Wang Sheng-Hui.
- Convert mtip32xx to use the now exported blk-mq tag iter function,
from Keith"
* 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits)
lightnvm: reserved space calculation incorrect
lightnvm: rename nr_pages to nr_ppas on nvm_rq
lightnvm: add is_cached entry to struct ppa_addr
lightnvm: expose gennvm_mark_blk to targets
lightnvm: remove mgt targets on mgt removal
lightnvm: pass dma address to hardware rather than pointer
lightnvm: do not assume sequential lun alloc.
nvme/lightnvm: Log using the ctrl named device
lightnvm: rename dma helper functions
lightnvm: enable metadata to be sent to device
lightnvm: do not free unused metadata on rrpc
lightnvm: fix out of bound ppa lun id on bb tbl
lightnvm: refactor set_bb_tbl for accepting ppa list
lightnvm: move responsibility for bad blk mgmt to target
lightnvm: make nvm_set_rqd_ppalist() aware of vblks
lightnvm: remove struct factory_blks
lightnvm: refactor device ops->get_bb_tbl()
lightnvm: introduce nvm_for_each_lun_ppa() macro
lightnvm: refactor dev->online_target to global nvm_targets
lightnvm: rename nvm_targets to nvm_tgt_type
...
Pull core block layer updates from Jens Axboe:
"This is the core block IO changes for this merge window. Nothing
earth shattering in here, it's mostly just fixes. In detail:
- Fix for a long standing issue where wrong ordering in blk-mq caused
order_to_size() to spew a warning. From Bart.
- Async discard support from Christoph. Basically just splitting our
sync interface into a submit + wait part.
- Add a cleaner interface for flagging whether a device has a write
back cache or not. We've previously overloaded blk_queue_flush()
with this, but let's make it more explicit. Drivers cleaned up and
updated in the drivers pull request. From me.
- Fix for a double check for whether IO accounting is enabled or not.
From Michael Callahan.
- Fix for the async discard from Mike Snitzer, reinstating the early
EOPNOTSUPP return if the device doesn't support discards.
- Also from Mike, export bio_inc_remaining() so dm can drop it's
private copy of it.
- From Ming Lin, add support for passing in an offset for request
payloads.
- Tag function export from Sagi, which will be used in NVMe in the
drivers pull.
- Two blktrace related fixes from Shaohua.
- Propagate NOMERGE flag when making a request from a bio, also from
Shaohua.
- An optimization to not parse cgroup paths in blk-throttle, if we
don't need to. From Shaohua"
* 'for-4.7/core' of git://git.kernel.dk/linux-block:
blk-mq: fix undefined behaviour in order_to_size()
blk-throttle: don't parse cgroup path if trace isn't enabled
blktrace: add missed mask name
blktrace: delete garbage for message trace
block: make bio_inc_remaining() interface accessible again
block: reinstate early return of -EOPNOTSUPP from blkdev_issue_discard
block: Minor blk_account_io_start usage cleanup
block: add __blkdev_issue_discard
block: remove struct bio_batch
block: copy NOMERGE flag from bio to request
block: add ability to flag write back caching on a device
blk-mq: Export tagset iter function
block: add offset in blk_add_request_payload()
writeback: Fix performance regression in wb_over_bg_thresh()
The attribute 0 is never used in drbd, so let's use it as pad attribute
in netlink messages. This minimizes the patch.
Note that this patch is only compile-tested.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A while ago, commit 9875201e10 ("rbd: fix use-after free of
rbd_dev->disk") fixed rbd unmap vs notify race by introducing
an exported wrapper for flushing notifies and sticking it into
do_rbd_remove().
A similar problem exists on the rbd map path, though: the watch is
registered in rbd_dev_image_probe(), while the disk is set up quite
a few steps later, in rbd_dev_device_setup(). Nothing prevents
a notify from coming in and crashing on a NULL rbd_dev->disk:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
Call Trace:
[<ffffffffa0508344>] rbd_watch_cb+0x34/0x180 [rbd]
[<ffffffffa04bd290>] do_event_work+0x40/0xb0 [libceph]
[<ffffffff8109d5db>] process_one_work+0x17b/0x470
[<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[<ffffffff810a5acf>] kthread+0xcf/0xe0
[<ffffffff810b41b3>] ? finish_task_switch+0x53/0x170
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[<ffffffff81645dd8>] ret_from_fork+0x58/0x90
[<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
RIP [<ffffffffa050828a>] rbd_dev_refresh+0xfa/0x180 [rbd]
If an error occurs during rbd map, we have to error out, potentially
tearing down a watch. Just like on rbd unmap, notifies have to be
flushed, otherwise rbd_watch_cb() may end up trying to read in the
image header after rbd_dev_image_release() has run:
Assertion failure in rbd_dev_header_info() at line 4722:
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
Call Trace:
[<ffffffff81cccee0>] ? rbd_parent_request_create+0x150/0x150
[<ffffffff81cd4e59>] rbd_dev_refresh+0x59/0x390
[<ffffffff81cd5229>] rbd_watch_cb+0x69/0x290
[<ffffffff81fde9bf>] do_event_work+0x10f/0x1c0
[<ffffffff81107799>] process_one_work+0x689/0x1a80
[<ffffffff811076f7>] ? process_one_work+0x5e7/0x1a80
[<ffffffff81132065>] ? finish_task_switch+0x225/0x640
[<ffffffff81107110>] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
[<ffffffff81108c69>] worker_thread+0xd9/0x1320
[<ffffffff81108b90>] ? process_one_work+0x1a80/0x1a80
[<ffffffff8111b02d>] kthread+0x21d/0x2e0
[<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
[<ffffffff82022802>] ret_from_fork+0x22/0x40
[<ffffffff8111ae10>] ? kthread_stop+0x550/0x550
RIP [<ffffffff81ccd8f9>] rbd_dev_header_info+0xa19/0x1e30
To fix this, a) check if RBD_DEV_FLAG_EXISTS is set before calling
revalidate_disk(), b) move ceph_osdc_flush_notifies() call into
rbd_dev_header_unwatch_sync() to cover rbd map error paths and c) turn
header read-in into a critical section. The latter also happens to
take care of rbd map foo@bar vs rbd snap rm foo@bar race.
Fixes: http://tracker.ceph.com/issues/15490
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
Simply creating a file system on an skd device, followed by mount and
fstrim will result in errors in the logs and then a BUG(). Let's remove
discard support from that driver. As far as I can tell, it hasn't
worked right since it was merged. This patch also has a side-effect of
cleaning up an unintentional shadowed declaration inside of
skd_end_request.
I tested to ensure that I can still do I/O to the device using xfstests
./check -g quick. I didn't do anything more extensive than that,
though.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull block fixes from Jens Axboe:
"A few fixes for the current series. This contains:
- Two fixes for NVMe:
One fixes a reset race that can be triggered by repeated
insert/removal of the module.
The other fixes an issue on some platforms, where we get probe
timeouts since legacy interrupts isn't working. This used not to
be a problem since we had the worker thread poll for completions,
but since that was killed off, it means those poor souls can't
successfully probe their NVMe device. Use a proper IRQ check and
probe (msi-x -> msi ->legacy), like most other drivers to work
around this. Both from Keith.
- A loop corruption issue with offset in iters, from Ming Lei.
- A fix for not having the partition stat per cpu ref count
initialized before sending out the KOBJ_ADD, which could cause user
space to access the counter prior to initialization. Also from
Ming Lei.
- A fix for using the wrong congestion state, from Kaixu Xia"
* 'for-linus' of git://git.kernel.dk/linux-block:
block: loop: fix filesystem corruption in case of aio/dio
NVMe: Always use MSI/MSI-x interrupts
NVMe: Fix reset/remove race
writeback: fix the wrong congested state variable definition
block: partition: initialize percpuref before sending out KOBJ_ADD
Starting from commit e36f620428(block: split bios to max possible length),
block core starts to split bio in the middle of bvec.
Unfortunately loop dio/aio doesn't consider this situation, and
always treat 'iter.iov_offset' as zero. Then filesystem corruption
is observed.
This patch figures out the offset of the base bvevc via
'bio->bi_iter.bi_bvec_done' and fixes the issue by passing the offset
to iov iterator.
Fixes: e36f620428 (block: split bios to max possible length)
Cc: Keith Busch <keith.busch@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org (4.5)
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we converted everything to the newer block write cache
interface, kill off the queue flush_flags and queueable flush
entries.
Signed-off-by: Jens Axboe <axboe@fb.com>
The driver calls it with 0 for flags, since it doesn't have a writeback
cache. Just remove the call, as it's a no-op right now.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only a single tags array anyway.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
We could kmalloc() the payload, so need the offset in page.
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Pull Ceph fix from Sage Weil:
"This just fixes a few remaining memory allocations in RBD to use
GFP_NOIO instead of GFP_ATOMIC"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: use GFP_NOIO consistently for request allocations
As of 5a60e87603, RBD object request
allocations are made via rbd_obj_request_create() with GFP_NOIO.
However, subsequent OSD request allocations in rbd_osd_req_create*()
use GFP_ATOMIC.
With heavy page cache usage (e.g. OSDs running on same host as krbd
client), rbd_osd_req_create() order-1 GFP_ATOMIC allocations have been
observed to fail, where direct reclaim would have allowed GFP_NOIO
allocations to succeed.
Cc: stable@vger.kernel.org # 3.18+
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Neil Brown <neilb@suse.com>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Mostly direct substitution with occasional adjustment or removing
outdated comments.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull Ceph updates from Sage Weil:
"There is quite a bit here, including some overdue refactoring and
cleanup on the mon_client and osd_client code from Ilya, scattered
writeback support for CephFS and a pile of bug fixes from Zheng, and a
few random cleanups and fixes from others"
[ I already decided not to pull this because of it having been rebased
recently, but ended up changing my mind after all. Next time I'll
really hold people to it. Oh well. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits)
libceph: use KMEM_CACHE macro
ceph: use kmem_cache_zalloc
rbd: use KMEM_CACHE macro
ceph: use lookup request to revalidate dentry
ceph: kill ceph_get_dentry_parent_inode()
ceph: fix security xattr deadlock
ceph: don't request vxattrs from MDS
ceph: fix mounting same fs multiple times
ceph: remove unnecessary NULL check
ceph: avoid updating directory inode's i_size accidentally
ceph: fix race during filling readdir cache
libceph: use sizeof_footer() more
ceph: kill ceph_empty_snapc
ceph: fix a wrong comparison
ceph: replace CURRENT_TIME by current_fs_time()
ceph: scattered page writeback
libceph: add helper that duplicates last extent operation
libceph: enable large, variable-sized OSD requests
libceph: osdc->req_mempool should be backed by a slab pool
libceph: make r_request msg_size calculation clearer
...
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Turn r_ops into a flexible array member to enable large, consisting of
up to 16 ops, OSD requests. The use case is scattered writeback in
cephfs and, as far as the kernel client is concerned, 16 is just a made
up number.
r_ops had size 3 for copyup+hint+write, but copyup is really a special
case - it can only happen once. ceph_osd_request_cache is therefore
stuffed with num_ops=2 requests, anything bigger than that is allocated
with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which
means either num_ops=1 or num_ops=2 for use_mempool=true - all existing
users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with
that.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>