Linus 5.3-rc1
-----BEGIN PGP SIGNATURE-----

iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl0006weHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGaDUIAJ4oTyVWpMRZkfG6
vVY8qVMU3zlzEqRiyLYjkXoe/mGpuU/UVTyyStllxZ+Gg9da0mGwlugScKriPJof
4KRUDDTGX5DrfEOo+0brKvM+PYh9uGViPgKXzyv7i6BrnX2z3JdBR4bKNuEYlAJ9
N93Qg7v05SBHIq2Gfp3klrdWbsTTW2EaDTLbcgifXLnfKyFr47kwsmXAHPlTFP0p
dYsZHHmf14Y9n1+ToZeVINgjQFr6mFn6ygY/PqTnd6vCgEEfP9eENJ4BZCtN1ZL/
V0BO9MyJ5iZV0AfwSEKydk+kDEvO16TG/nyDrECVuur7AXsBx18ZplVc787f6GK+
dyCQJ3U=
=XLAF
-----END PGP SIGNATURE-----
gpgsig -----BEGIN PGP SIGNATURE-----

iHUEABYIAB0WIQRcEzekXsqa64kGDp7j7w1vZxhRxQUCXTYRHQAKCRDj7w1vZxhR
xY5IAQC0H/r62rlFq+JpbmksutMqvIferowP7HUk6yOaAKdVawD/c1qsTk/xxI0x
StrxRCDqeGA7D2R/ZNb/4sobnn7+oAM=
=k9CF
-----END PGP SIGNATURE-----

Merge v5.3-rc1 into drm-misc-next

Noralf needs some SPI patches in 5.3 to merge some work on tinydrm.

Signed-off-by: Maxime Ripard <maxime.ripard@bootlin.com>
commit
03b0f2ce73
16695 changed files with 979854 additions and 436075 deletions
1
.gitignore
vendored
|
@ -30,6 +30,7 @@
|
||||||
*.lz4
|
*.lz4
|
||||||
*.lzma
|
*.lzma
|
||||||
*.lzo
|
*.lzo
|
||||||
|
*.mod
|
||||||
*.mod.c
|
*.mod.c
|
||||||
*.o
|
*.o
|
||||||
*.o.*
|
*.o.*
|
||||||
|
|
2
.mailmap
|
@ -81,6 +81,7 @@ Greg Kroah-Hartman <greg@echidna.(none)>
|
||||||
Greg Kroah-Hartman <gregkh@suse.de>
|
Greg Kroah-Hartman <gregkh@suse.de>
|
||||||
Greg Kroah-Hartman <greg@kroah.com>
|
Greg Kroah-Hartman <greg@kroah.com>
|
||||||
Gregory CLEMENT <gregory.clement@bootlin.com> <gregory.clement@free-electrons.com>
|
Gregory CLEMENT <gregory.clement@bootlin.com> <gregory.clement@free-electrons.com>
|
||||||
|
Hanjun Guo <guohanjun@huawei.com> <hanjun.guo@linaro.org>
|
||||||
Henk Vergonet <Henk.Vergonet@gmail.com>
|
Henk Vergonet <Henk.Vergonet@gmail.com>
|
||||||
Henrik Kretzschmar <henne@nachtwindheim.de>
|
Henrik Kretzschmar <henne@nachtwindheim.de>
|
||||||
Henrik Rydberg <rydberg@bitmath.org>
|
Henrik Rydberg <rydberg@bitmath.org>
|
||||||
|
@ -238,6 +239,7 @@ Vlad Dogaru <ddvlad@gmail.com> <vlad.dogaru@intel.com>
|
||||||
Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@virtuozzo.com>
|
Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@virtuozzo.com>
|
||||||
Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@parallels.com>
|
Vladimir Davydov <vdavydov.dev@gmail.com> <vdavydov@parallels.com>
|
||||||
Takashi YOSHII <takashi.yoshii.zj@renesas.com>
|
Takashi YOSHII <takashi.yoshii.zj@renesas.com>
|
||||||
|
Will Deacon <will@kernel.org> <will.deacon@arm.com>
|
||||||
Yakir Yang <kuankuan.y@gmail.com> <ykk@rock-chips.com>
|
Yakir Yang <kuankuan.y@gmail.com> <ykk@rock-chips.com>
|
||||||
Yusuke Goda <goda.yusuke@renesas.com>
|
Yusuke Goda <goda.yusuke@renesas.com>
|
||||||
Gustavo Padovan <gustavo@las.ic.unicamp.br>
|
Gustavo Padovan <gustavo@las.ic.unicamp.br>
|
||||||
|
|
5
CREDITS
|
@ -1770,7 +1770,6 @@ S: USA
|
||||||
|
|
||||||
N: Dave Jones
|
N: Dave Jones
|
||||||
E: davej@codemonkey.org.uk
|
E: davej@codemonkey.org.uk
|
||||||
W: http://www.codemonkey.org.uk
|
|
||||||
D: Assorted VIA x86 support.
|
D: Assorted VIA x86 support.
|
||||||
D: 2.5 AGPGART overhaul.
|
D: 2.5 AGPGART overhaul.
|
||||||
D: CPUFREQ maintenance.
|
D: CPUFREQ maintenance.
|
||||||
|
@ -1800,7 +1799,7 @@ S: 2300 Copenhagen S.
|
||||||
S: Denmark
|
S: Denmark
|
||||||
|
|
||||||
N: Jozsef Kadlecsik
|
N: Jozsef Kadlecsik
|
||||||
E: kadlec@blackhole.kfki.hu
|
E: kadlec@netfilter.org
|
||||||
P: 1024D/470DB964 4CB3 1A05 713E 9BF7 FAC5 5809 DD8C B7B1 470D B964
|
P: 1024D/470DB964 4CB3 1A05 713E 9BF7 FAC5 5809 DD8C B7B1 470D B964
|
||||||
D: netfilter: TCP window tracking code
|
D: netfilter: TCP window tracking code
|
||||||
D: netfilter: raw table
|
D: netfilter: raw table
|
||||||
|
@ -3120,7 +3119,7 @@ S: France
|
||||||
N: Rik van Riel
|
N: Rik van Riel
|
||||||
E: riel@redhat.com
|
E: riel@redhat.com
|
||||||
W: http://www.surriel.com/
|
W: http://www.surriel.com/
|
||||||
D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround
|
D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround
|
||||||
D: kswapd fixes, random kernel hacker, rmap VM,
|
D: kswapd fixes, random kernel hacker, rmap VM,
|
||||||
D: nl.linux.org administrator, minor scheduler additions
|
D: nl.linux.org administrator, minor scheduler additions
|
||||||
S: Red Hat Boston
|
S: Red Hat Boston
|
||||||
|
|
|
@ -5,7 +5,7 @@ Description: It is possible to switch the cpi setting of the mouse with the
|
||||||
press of a button.
|
press of a button.
|
||||||
When read, this file returns the raw number of the actual cpi
|
When read, this file returns the raw number of the actual cpi
|
||||||
setting reported by the mouse. This number has to be further
|
setting reported by the mouse. This number has to be further
|
||||||
processed to receive the real dpi value.
|
processed to receive the real dpi value:
|
||||||
|
|
||||||
VALUE DPI
|
VALUE DPI
|
||||||
1 400
|
1 400
|
||||||
|
|
|
@ -11,7 +11,7 @@ Description:
|
||||||
Kernel code may export it for complete or partial access.
|
Kernel code may export it for complete or partial access.
|
||||||
|
|
||||||
GPIOs are identified as they are inside the kernel, using integers in
|
GPIOs are identified as they are inside the kernel, using integers in
|
||||||
the range 0..INT_MAX. See Documentation/gpio for more information.
|
the range 0..INT_MAX. See Documentation/admin-guide/gpio for more information.
|
||||||
|
|
||||||
/sys/class/gpio
|
/sys/class/gpio
|
||||||
/export ... asks the kernel to export a GPIO to userspace
|
/export ... asks the kernel to export a GPIO to userspace
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
rfkill - radio frequency (RF) connector kill switch support
|
rfkill - radio frequency (RF) connector kill switch support
|
||||||
|
|
||||||
For details to this subsystem look at Documentation/rfkill.txt.
|
For details to this subsystem look at Documentation/driver-api/rfkill.rst.
|
||||||
|
|
||||||
What: /sys/class/rfkill/rfkill[0-9]+/claim
|
What: /sys/class/rfkill/rfkill[0-9]+/claim
|
||||||
Date: 09-Jul-2007
|
Date: 09-Jul-2007
|
||||||
|
|
|
@ -423,23 +423,6 @@ Description:
|
||||||
(e.g. driver restart on the VM which owns the VF).
|
(e.g. driver restart on the VM which owns the VF).
|
||||||
|
|
||||||
|
|
||||||
sysfs interface for NetEffect RNIC Low-Level iWARP driver (nes)
|
|
||||||
---------------------------------------------------------------
|
|
||||||
|
|
||||||
What: /sys/class/infiniband/nesX/hw_rev
|
|
||||||
What: /sys/class/infiniband/nesX/hca_type
|
|
||||||
What: /sys/class/infiniband/nesX/board_id
|
|
||||||
Date: Feb, 2008
|
|
||||||
KernelVersion: v2.6.25
|
|
||||||
Contact: linux-rdma@vger.kernel.org
|
|
||||||
Description:
|
|
||||||
hw_rev: (RO) Hardware revision number
|
|
||||||
|
|
||||||
hca_type: (RO) Host Channel Adapter type (NEX020)
|
|
||||||
|
|
||||||
board_id: (RO) Manufacturing board id
|
|
||||||
|
|
||||||
|
|
||||||
sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
|
sysfs interface for Chelsio T4/T5 RDMA driver (cxgb4)
|
||||||
-----------------------------------------------------
|
-----------------------------------------------------
|
||||||
|
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
rfkill - radio frequency (RF) connector kill switch support
|
rfkill - radio frequency (RF) connector kill switch support
|
||||||
|
|
||||||
For details to this subsystem look at Documentation/rfkill.txt.
|
For details to this subsystem look at Documentation/driver-api/rfkill.rst.
|
||||||
|
|
||||||
For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
|
For the deprecated /sys/class/rfkill/*/claim knobs of this interface look in
|
||||||
Documentation/ABI/removed/sysfs-class-rfkill.
|
Documentation/ABI/removed/sysfs-class-rfkill.
|
||||||
|
|
|
@ -61,7 +61,7 @@ Date: October 2002
|
||||||
Contact: Linux Memory Management list <linux-mm@kvack.org>
|
Contact: Linux Memory Management list <linux-mm@kvack.org>
|
||||||
Description:
|
Description:
|
||||||
The node's hit/miss statistics, in units of pages.
|
The node's hit/miss statistics, in units of pages.
|
||||||
See Documentation/numastat.txt
|
See Documentation/admin-guide/numastat.rst
|
||||||
|
|
||||||
What: /sys/devices/system/node/nodeX/distance
|
What: /sys/devices/system/node/nodeX/distance
|
||||||
Date: October 2002
|
Date: October 2002
|
||||||
|
|
|
@ -1,5 +1,4 @@
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/asic_health
|
||||||
asic_health
|
|
||||||
|
|
||||||
Date: June 2018
|
Date: June 2018
|
||||||
KernelVersion: 4.19
|
KernelVersion: 4.19
|
||||||
|
@ -9,9 +8,8 @@ Description: This file shows ASIC health status. The possible values are:
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld1_version
|
||||||
cpld1_version
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/cpld2_version
|
||||||
cpld2_version
|
|
||||||
Date: June 2018
|
Date: June 2018
|
||||||
KernelVersion: 4.19
|
KernelVersion: 4.19
|
||||||
Contact: Vadim Pasternak <vadimpmellanox.com>
|
Contact: Vadim Pasternak <vadimpmellanox.com>
|
||||||
|
@ -20,8 +18,7 @@ Description: These files show with which CPLD versions have been burned
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/fan_dir
|
||||||
fan_dir
|
|
||||||
|
|
||||||
Date: December 2018
|
Date: December 2018
|
||||||
KernelVersion: 5.0
|
KernelVersion: 5.0
|
||||||
|
@ -32,8 +29,7 @@ Description: This file shows the system fans direction:
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/jtag_enable
|
||||||
jtag_enable
|
|
||||||
|
|
||||||
Date: November 2018
|
Date: November 2018
|
||||||
KernelVersion: 5.0
|
KernelVersion: 5.0
|
||||||
|
@ -43,8 +39,7 @@ Description: These files show with which CPLD versions have been burned
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/jtag_enable
|
||||||
jtag_enable
|
|
||||||
|
|
||||||
Date: November 2018
|
Date: November 2018
|
||||||
KernelVersion: 5.0
|
KernelVersion: 5.0
|
||||||
|
@ -87,16 +82,15 @@ Description: These files allow asserting system power cycling, switching
|
||||||
|
|
||||||
The files are write only.
|
The files are write only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_aux_pwr_or_ref
|
||||||
reset_aux_pwr_or_ref
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_asic_thermal
|
||||||
reset_asic_thermal
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_hotswap_or_halt
|
||||||
reset_hotswap_or_halt
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_hotswap_or_wd
|
||||||
reset_hotswap_or_wd
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_fw_reset
|
||||||
reset_fw_reset
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_long_pb
|
||||||
reset_long_pb
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_main_pwr_fail
|
||||||
reset_main_pwr_fail
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_short_pb
|
||||||
reset_short_pb
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sw_reset
|
||||||
reset_sw_reset
|
|
||||||
Date: June 2018
|
Date: June 2018
|
||||||
KernelVersion: 4.19
|
KernelVersion: 4.19
|
||||||
Contact: Vadim Pasternak <vadimpmellanox.com>
|
Contact: Vadim Pasternak <vadimpmellanox.com>
|
||||||
|
@ -110,11 +104,10 @@ Description: These files show the system reset cause, as following: power
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_pwr_fail
|
||||||
reset_comex_pwr_fail
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_comex
|
||||||
reset_from_comex
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_system
|
||||||
reset_system
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_voltmon_upgrade_fail
|
||||||
reset_voltmon_upgrade_fail
|
|
||||||
|
|
||||||
Date: November 2018
|
Date: November 2018
|
||||||
KernelVersion: 5.0
|
KernelVersion: 5.0
|
||||||
|
@ -127,3 +120,23 @@ Description: These files show the system reset cause, as following: ComEx
|
||||||
the last reset cause.
|
the last reset cause.
|
||||||
|
|
||||||
The files are read only.
|
The files are read only.
|
||||||
|
|
||||||
|
Date: June 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Contact: Vadim Pasternak <vadimpmellanox.com>
|
||||||
|
Description: These files show the system reset cause, as following:
|
||||||
|
COMEX thermal shutdown; wathchdog power off or reset was derived
|
||||||
|
by one of the next components: COMEX, switch board or by Small Form
|
||||||
|
Factor mezzanine, reset requested from ASIC, reset cuased by BIOS
|
||||||
|
reload. Value 1 in file means this is reset cause, 0 - otherwise.
|
||||||
|
Only one of the above causes could be 1 at the same time, representing
|
||||||
|
only last reset cause.
|
||||||
|
|
||||||
|
The files are read only.
|
||||||
|
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_thermal
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_comex_wd
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_from_asic
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_reload_bios
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_sff_wd
|
||||||
|
What: /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/reset_swb_wd
|
||||||
|
|
|
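The reset-cause attributes described above each hold 1 or 0, with at most one of them set to 1 at a time. A minimal sketch of how userspace might find the last reset cause follows. It assumes the caller passes a directory containing the `reset_*` files; the function name `last_reset_cause` is hypothetical and not part of any kernel or vendor tool.

```python
from pathlib import Path
from typing import Optional


def last_reset_cause(hwmon_dir: str) -> Optional[str]:
    """Return the name of the reset_* attribute that reads "1", or None.

    hwmon_dir is assumed to be a directory such as
    /sys/devices/platform/mlxplat/mlxreg-io/hwmon/hwmon*/ after glob expansion.
    """
    for attr in sorted(Path(hwmon_dir).glob("reset_*")):
        try:
            if attr.read_text().strip() == "1":
                return attr.name
        except OSError:
            # Attribute not readable on this platform; skip it.
            continue
    return None
```

Returning `None` covers the case where no attribute reports 1, for example on a platform that does not expose these files.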
@ -1,6 +1,6 @@
|
||||||
What: /sys/kernel/debug/cec/*/error-inj
|
What: /sys/kernel/debug/cec/*/error-inj
|
||||||
Date: March 2018
|
Date: March 2018
|
||||||
Contact: Hans Verkuil <hans.verkuil@cisco.com>
|
Contact: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
||||||
Description:
|
Description:
|
||||||
|
|
||||||
The CEC Framework allows for CEC error injection commands through
|
The CEC Framework allows for CEC error injection commands through
|
||||||
|
|
56
Documentation/ABI/testing/debugfs-cros-ec
Normal file
|
@ -0,0 +1,56 @@
|
||||||
|
What: /sys/kernel/debug/<cros-ec-device>/console_log
|
||||||
|
Date: September 2017
|
||||||
|
KernelVersion: 4.13
|
||||||
|
Description:
|
||||||
|
If the EC supports the CONSOLE_READ command type, this file
|
||||||
|
can be used to grab the EC logs. The kernel polls for the log
|
||||||
|
and keeps its own buffer but userspace should grab this and
|
||||||
|
write it out to some logs.
|
||||||
|
|
||||||
|
What: /sys/kernel/debug/<cros-ec-device>/panicinfo
|
||||||
|
Date: September 2017
|
||||||
|
KernelVersion: 4.13
|
||||||
|
Description:
|
||||||
|
This file dumps the EC panic information from the previous
|
||||||
|
reboot. This file will only exist if the PANIC_INFO command
|
||||||
|
type is supported by the EC.
|
||||||
|
|
||||||
|
What: /sys/kernel/debug/<cros-ec-device>/pdinfo
|
||||||
|
Date: June 2018
|
||||||
|
KernelVersion: 4.17
|
||||||
|
Description:
|
||||||
|
This file provides the port role, muxes and power debug
|
||||||
|
information for all the USB PD/type-C ports available. If
|
||||||
|
the are no ports available, this file will be just an empty
|
||||||
|
file.
|
||||||
|
|
||||||
|
What: /sys/kernel/debug/<cros-ec-device>/uptime
|
||||||
|
Date: June 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
A u32 providing the time since EC booted in ms. This is
|
||||||
|
is used for synchronizing the AP host time with the EC
|
||||||
|
log. An error is returned if the command is not supported
|
||||||
|
by the EC or there is a communication problem.
|
||||||
|
|
||||||
|
What: /sys/kernel/debug/<cros-ec-device>/last_resume_result
|
||||||
|
Date: June 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Some ECs have a feature where they will track transitions to
|
||||||
|
the (Intel) processor's SLP_S0 line, in order to detect cases
|
||||||
|
where a system failed to go into S0ix. When the system resumes,
|
||||||
|
an EC with this feature will return a summary of SLP_S0
|
||||||
|
transitions that occurred. The last_resume_result file returns
|
||||||
|
the most recent response from the AP's resume message to the EC.
|
||||||
|
|
||||||
|
The bottom 31 bits contain a count of the number of SLP_S0
|
||||||
|
transitions that occurred since the suspend message was
|
||||||
|
received. Bit 31 is set if the EC attempted to wake the
|
||||||
|
system due to a timeout when watching for SLP_S0 transitions.
|
||||||
|
Callers can use this to detect a wake from the EC due to
|
||||||
|
S0ix timeouts. The result will be zero if no suspend
|
||||||
|
transitions have been attempted, or the EC does not support
|
||||||
|
this feature.
|
||||||
|
|
||||||
|
Output will be in the format: "0x%08x\n".
|
|
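The bit layout of last_resume_result described above (bottom 31 bits as a transition count, bit 31 as the EC timeout flag, printed as "0x%08x\n") can be decoded as in the sketch below. The function name `decode_last_resume_result` is an assumption made for illustration; only the bit layout and output format come from the description.

```python
from typing import Tuple


def decode_last_resume_result(text: str) -> Tuple[int, bool]:
    """Split a last_resume_result value into (SLP_S0 transition count, timeout flag)."""
    value = int(text.strip(), 16)          # file content looks like "0x%08x\n"
    count = value & 0x7FFFFFFF             # bottom 31 bits: SLP_S0 transition count
    timed_out = bool(value & 0x80000000)   # bit 31: EC tried to wake the system on timeout
    return count, timed_out
```

A caller would read the debugfs file as text and pass the contents in. A result of `(0, False)` matches the documented case where no suspend was attempted or the EC lacks the feature.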
@ -3,7 +3,10 @@ Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Sets the device address to be used for read or write through
|
Description: Sets the device address to be used for read or write through
|
||||||
PCI bar. The acceptable value is a string that starts with "0x"
|
PCI bar, or the device VA of a host mapped memory to be read or
|
||||||
|
written directly from the host. The latter option is allowed
|
||||||
|
only when the IOMMU is disabled.
|
||||||
|
The acceptable value is a string that starts with "0x"
|
||||||
|
|
||||||
What: /sys/kernel/debug/habanalabs/hl<n>/command_buffers
|
What: /sys/kernel/debug/habanalabs/hl<n>/command_buffers
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
|
@ -33,10 +36,12 @@ Contact: oded.gabbay@gmail.com
|
||||||
Description: Allows the root user to read or write directly through the
|
Description: Allows the root user to read or write directly through the
|
||||||
device's PCI bar. Writing to this file generates a write
|
device's PCI bar. Writing to this file generates a write
|
||||||
transaction while reading from the file generates a read
|
transaction while reading from the file generates a read
|
||||||
transcation. This custom interface is needed (instead of using
|
transaction. This custom interface is needed (instead of using
|
||||||
the generic Linux user-space PCI mapping) because the DDR bar
|
the generic Linux user-space PCI mapping) because the DDR bar
|
||||||
is very small compared to the DDR memory and only the driver can
|
is very small compared to the DDR memory and only the driver can
|
||||||
move the bar before and after the transaction
|
move the bar before and after the transaction.
|
||||||
|
If the IOMMU is disabled, it also allows the root user to read
|
||||||
|
or write from the host a device VA of a host mapped memory
|
||||||
|
|
||||||
What: /sys/kernel/debug/habanalabs/hl<n>/device
|
What: /sys/kernel/debug/habanalabs/hl<n>/device
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
|
@ -46,6 +51,13 @@ Description: Enables the root user to set the device to specific state.
|
||||||
Valid values are "disable", "enable", "suspend", "resume".
|
Valid values are "disable", "enable", "suspend", "resume".
|
||||||
User can read this property to see the valid values
|
User can read this property to see the valid values
|
||||||
|
|
||||||
|
What: /sys/kernel/debug/habanalabs/hl<n>/engines
|
||||||
|
Date: Jul 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Contact: oded.gabbay@gmail.com
|
||||||
|
Description: Displays the status registers values of the device engines and
|
||||||
|
their derived idle status
|
||||||
|
|
||||||
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_addr
|
What: /sys/kernel/debug/habanalabs/hl<n>/i2c_addr
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
|
|
|
@ -23,11 +23,9 @@ Description:
|
||||||
|
|
||||||
For writing, bytes 0-1 indicate the message type, one of enum
|
For writing, bytes 0-1 indicate the message type, one of enum
|
||||||
wilco_ec_msg_type. Byte 2+ consist of the data passed in the
|
wilco_ec_msg_type. Byte 2+ consist of the data passed in the
|
||||||
request, starting at MBOX[0]
|
request, starting at MBOX[0]. At least three bytes are required
|
||||||
|
for writing, two for the type and at least a single byte of
|
||||||
At least three bytes are required for writing, two for the type
|
data.
|
||||||
and at least a single byte of data. Only the first
|
|
||||||
EC_MAILBOX_DATA_SIZE bytes of MBOX will be used.
|
|
||||||
|
|
||||||
Example:
|
Example:
|
||||||
// Request EC info type 3 (EC firmware build date)
|
// Request EC info type 3 (EC firmware build date)
|
||||||
|
@ -40,7 +38,7 @@ Description:
|
||||||
$ cat /sys/kernel/debug/wilco_ec/raw
|
$ cat /sys/kernel/debug/wilco_ec/raw
|
||||||
00 00 31 32 2f 32 31 2f 31 38 00 38 00 01 00 2f 00 ..12/21/18.8...
|
00 00 31 32 2f 32 31 2f 31 38 00 38 00 01 00 2f 00 ..12/21/18.8...
|
||||||
|
|
||||||
Note that the first 32 bytes of the received MBOX[] will be
|
Note that the first 16 bytes of the received MBOX[] will be
|
||||||
printed, even if some of the data is junk. It is up to you to
|
printed, even if some of the data is junk, and skipping bytes
|
||||||
know how many of the first bytes of data are the actual
|
17 to 32. It is up to you to know how many of the first bytes of
|
||||||
response.
|
data are the actual response.
|
||||||
|
|
|
@ -24,11 +24,11 @@ Description:
|
||||||
[euid=] [fowner=] [fsname=]]
|
[euid=] [fowner=] [fsname=]]
|
||||||
lsm: [[subj_user=] [subj_role=] [subj_type=]
|
lsm: [[subj_user=] [subj_role=] [subj_type=]
|
||||||
[obj_user=] [obj_role=] [obj_type=]]
|
[obj_user=] [obj_role=] [obj_type=]]
|
||||||
option: [[appraise_type=]] [permit_directio]
|
option: [[appraise_type=]] [template=] [permit_directio]
|
||||||
|
|
||||||
base: func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
|
base: func:= [BPRM_CHECK][MMAP_CHECK][CREDS_CHECK][FILE_CHECK][MODULE_CHECK]
|
||||||
[FIRMWARE_CHECK]
|
[FIRMWARE_CHECK]
|
||||||
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
|
[KEXEC_KERNEL_CHECK] [KEXEC_INITRAMFS_CHECK]
|
||||||
|
[KEXEC_CMDLINE]
|
||||||
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
|
mask:= [[^]MAY_READ] [[^]MAY_WRITE] [[^]MAY_APPEND]
|
||||||
[[^]MAY_EXEC]
|
[[^]MAY_EXEC]
|
||||||
fsmagic:= hex value
|
fsmagic:= hex value
|
||||||
|
@ -38,6 +38,8 @@ Description:
|
||||||
fowner:= decimal value
|
fowner:= decimal value
|
||||||
lsm: are LSM specific
|
lsm: are LSM specific
|
||||||
option: appraise_type:= [imasig]
|
option: appraise_type:= [imasig]
|
||||||
|
template:= name of a defined IMA template type
|
||||||
|
(eg, ima-ng). Only valid when action is "measure".
|
||||||
pcr:= decimal value
|
pcr:= decimal value
|
||||||
|
|
||||||
default policy:
|
default policy:
|
||||||
|
|
|
@ -29,4 +29,4 @@ Description:
|
||||||
17 - sectors discarded
|
17 - sectors discarded
|
||||||
18 - time spent discarding
|
18 - time spent discarding
|
||||||
|
|
||||||
For more details refer to Documentation/iostats.txt
|
For more details refer to Documentation/admin-guide/iostats.rst
|
||||||
|
|
|
@ -3,18 +3,28 @@ Date: August 2017
|
||||||
Contact: Daniel Colascione <dancol@google.com>
|
Contact: Daniel Colascione <dancol@google.com>
|
||||||
Description:
|
Description:
|
||||||
This file provides pre-summed memory information for a
|
This file provides pre-summed memory information for a
|
||||||
process. The format is identical to /proc/pid/smaps,
|
process. The format is almost identical to /proc/pid/smaps,
|
||||||
except instead of an entry for each VMA in a process,
|
except instead of an entry for each VMA in a process,
|
||||||
smaps_rollup has a single entry (tagged "[rollup]")
|
smaps_rollup has a single entry (tagged "[rollup]")
|
||||||
for which each field is the sum of the corresponding
|
for which each field is the sum of the corresponding
|
||||||
fields from all the maps in /proc/pid/smaps.
|
fields from all the maps in /proc/pid/smaps.
|
||||||
For more details, see the procfs man page.
|
Additionally, the fields Pss_Anon, Pss_File and Pss_Shmem
|
||||||
|
are not present in /proc/pid/smaps. These fields represent
|
||||||
|
the sum of the Pss field of each type (anon, file, shmem).
|
||||||
|
For more details, see Documentation/filesystems/proc.txt
|
||||||
|
and the procfs man page.
|
||||||
|
|
||||||
Typical output looks like this:
|
Typical output looks like this:
|
||||||
|
|
||||||
00100000-ff709000 ---p 00000000 00:00 0 [rollup]
|
00100000-ff709000 ---p 00000000 00:00 0 [rollup]
|
||||||
|
Size: 1192 kB
|
||||||
|
KernelPageSize: 4 kB
|
||||||
|
MMUPageSize: 4 kB
|
||||||
Rss: 884 kB
|
Rss: 884 kB
|
||||||
Pss: 385 kB
|
Pss: 385 kB
|
||||||
|
Pss_Anon: 301 kB
|
||||||
|
Pss_File: 80 kB
|
||||||
|
Pss_Shmem: 4 kB
|
||||||
Shared_Clean: 696 kB
|
Shared_Clean: 696 kB
|
||||||
Shared_Dirty: 0 kB
|
Shared_Dirty: 0 kB
|
||||||
Private_Clean: 120 kB
|
Private_Clean: 120 kB
|
||||||
|
|
|
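As a rough illustration of the smaps_rollup format shown above, the sketch below parses the "Name: value kB" lines into a dictionary. The helper name `parse_smaps_rollup` is hypothetical. In the sample output above, Pss_Anon, Pss_File and Pss_Shmem (301 + 80 + 4) add up to the Pss value of 385 kB; the sketch makes no claim that this holds exactly on every system.

```python
from typing import Dict


def parse_smaps_rollup(text: str) -> Dict[str, int]:
    """Parse 'Field:  N kB' lines into {field: N}; other lines are ignored."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines shaped like "<name>: <digits> kB"; this skips the
        # "[rollup]" header line, whose device field also contains a colon.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields
```

Typical use would be `parse_smaps_rollup(open("/proc/self/smaps_rollup").read())` on a kernel that provides the file.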
@ -1,6 +1,6 @@
|
||||||
Where: /sys/fs/pstore/... (or /dev/pstore/...)
|
What: /sys/fs/pstore/... (or /dev/pstore/...)
|
||||||
Date: March 2011
|
Date: March 2011
|
||||||
Kernel Version: 2.6.39
|
KernelVersion: 2.6.39
|
||||||
Contact: tony.luck@intel.com
|
Contact: tony.luck@intel.com
|
||||||
Description: Generic interface to platform dependent persistent storage.
|
Description: Generic interface to platform dependent persistent storage.
|
||||||
|
|
||||||
|
|
|
@ -15,7 +15,7 @@ Description:
|
||||||
9 - I/Os currently in progress
|
9 - I/Os currently in progress
|
||||||
10 - time spent doing I/Os (ms)
|
10 - time spent doing I/Os (ms)
|
||||||
11 - weighted time spent doing I/Os (ms)
|
11 - weighted time spent doing I/Os (ms)
|
||||||
For more details refer Documentation/iostats.txt
|
For more details refer Documentation/admin-guide/iostats.rst
|
||||||
|
|
||||||
|
|
||||||
What: /sys/block/<disk>/<part>/stat
|
What: /sys/block/<disk>/<part>/stat
|
||||||
|
|
|
@ -45,7 +45,7 @@ Description:
|
||||||
- Values below -2 are rejected with -EINVAL
|
- Values below -2 are rejected with -EINVAL
|
||||||
|
|
||||||
For more information, see
|
For more information, see
|
||||||
Documentation/laptops/disk-shock-protection.txt
|
Documentation/admin-guide/laptops/disk-shock-protection.rst
|
||||||
|
|
||||||
|
|
||||||
What: /sys/block/*/device/ncq_prio_enable
|
What: /sys/block/*/device/ncq_prio_enable
|
||||||
|
|
|
@ -33,3 +33,26 @@ Description: Contains the PIM/PAM/POM values, as reported by the
|
||||||
in sync with the values current in the channel subsystem).
|
in sync with the values current in the channel subsystem).
|
||||||
Note: This is an I/O-subchannel specific attribute.
|
Note: This is an I/O-subchannel specific attribute.
|
||||||
Users: s390-tools, HAL
|
Users: s390-tools, HAL
|
||||||
|
|
||||||
|
What: /sys/bus/css/devices/.../driver_override
|
||||||
|
Date: June 2019
|
||||||
|
Contact: Cornelia Huck <cohuck@redhat.com>
|
||||||
|
linux-s390@vger.kernel.org
|
||||||
|
Description: This file allows the driver for a device to be specified. When
|
||||||
|
specified, only a driver with a name matching the value written
|
||||||
|
to driver_override will have an opportunity to bind to the
|
||||||
|
device. The override is specified by writing a string to the
|
||||||
|
driver_override file (echo vfio-ccw > driver_override) and
|
||||||
|
may be cleared with an empty string (echo > driver_override).
|
||||||
|
This returns the device to standard matching rules binding.
|
||||||
|
Writing to driver_override does not automatically unbind the
|
||||||
|
device from its current driver or make any attempt to
|
||||||
|
automatically load the specified driver. If no driver with a
|
||||||
|
matching name is currently loaded in the kernel, the device
|
||||||
|
will not bind to any driver. This also allows devices to
|
||||||
|
opt-out of driver binding using a driver_override name such as
|
||||||
|
"none". Only a single driver may be specified in the override,
|
||||||
|
there is no support for parsing delimiters.
|
||||||
|
Note that unlike the mechanism of the same name for pci, this
|
||||||
|
file does not allow to override basic matching rules. I.e.,
|
||||||
|
the driver must still match the subchannel type of the device.
|
||||||
|
|
|
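The two shell commands quoted in the description above (`echo vfio-ccw > driver_override` and `echo > driver_override`) can be mirrored from a script as sketched below. The function name `set_driver_override` and the `device_dir` parameter are assumptions for illustration; the caller is expected to supply the real subchannel directory, which is left elided in the description.

```python
from pathlib import Path


def set_driver_override(device_dir: str, driver: str = "") -> None:
    """Write a driver name to <device_dir>/driver_override.

    An empty name clears the override, like `echo > driver_override`.
    """
    # Only a single driver may be specified; delimiters are not parsed.
    if any(ch.isspace() for ch in driver):
        raise ValueError("driver_override accepts a single driver name")
    Path(device_dir, "driver_override").write_text(driver + "\n")
```

As the description notes, writing the override neither unbinds the current driver nor loads the named one, so a real script would still have to handle unbinding and probing separately.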
@ -1,6 +1,6 @@
|
||||||
Where: /sys/bus/event_source/devices/<dev>/format
|
What: /sys/bus/event_source/devices/<dev>/format
|
||||||
Date: January 2012
|
Date: January 2012
|
||||||
Kernel Version: 3.3
|
KernelVersion: 3.3
|
||||||
Contact: Jiri Olsa <jolsa@redhat.com>
|
Contact: Jiri Olsa <jolsa@redhat.com>
|
||||||
Description:
|
Description:
|
||||||
Attribute group to describe the magic bits that go into
|
Attribute group to describe the magic bits that go into
|
||||||
|
|
|
@ -1,20 +1,20 @@
|
||||||
Where: /sys/bus/i2c/devices/.../heading0_input
|
What: /sys/bus/i2c/devices/.../heading0_input
|
||||||
Date: April 2010
|
Date: April 2010
|
||||||
Kernel Version: 2.6.36?
|
KernelVersion: 2.6.36?
|
||||||
Contact: alan.cox@intel.com
|
Contact: alan.cox@intel.com
|
||||||
Description: Reports the current heading from the compass as a floating
|
Description: Reports the current heading from the compass as a floating
|
||||||
point value in degrees.
|
point value in degrees.
|
||||||
|
|
||||||
Where: /sys/bus/i2c/devices/.../power_state
|
What: /sys/bus/i2c/devices/.../power_state
|
||||||
Date: April 2010
|
Date: April 2010
|
||||||
Kernel Version: 2.6.36?
|
KernelVersion: 2.6.36?
|
||||||
Contact: alan.cox@intel.com
|
Contact: alan.cox@intel.com
|
||||||
Description: Sets the power state of the device. 0 sets the device into
|
Description: Sets the power state of the device. 0 sets the device into
|
||||||
sleep mode, 1 wakes it up.
|
sleep mode, 1 wakes it up.
|
||||||
|
|
||||||
Where: /sys/bus/i2c/devices/.../calibration
|
What: /sys/bus/i2c/devices/.../calibration
|
||||||
Date: April 2010
|
Date: April 2010
|
||||||
Kernel Version: 2.6.36?
|
KernelVersion: 2.6.36?
|
||||||
Contact: alan.cox@intel.com
|
Contact: alan.cox@intel.com
|
||||||
Description: Sets the calibration on or off (1 = on, 0 = off). See the
|
Description: Sets the calibration on or off (1 = on, 0 = off). See the
|
||||||
chip data sheet.
|
chip data sheet.
|
||||||
|
|
|
@ -61,8 +61,11 @@ What: /sys/bus/iio/devices/triggerX/sampling_frequency_available
|
||||||
KernelVersion: 2.6.35
|
KernelVersion: 2.6.35
|
||||||
Contact: linux-iio@vger.kernel.org
|
Contact: linux-iio@vger.kernel.org
|
||||||
Description:
|
Description:
|
||||||
When the internal sampling clock can only take a small
|
When the internal sampling clock can only take a specific set of
|
||||||
discrete set of values, this file lists those available.
|
frequencies, we can specify the available values with:
|
||||||
|
- a small discrete set of values like "0 2 4 6 8"
|
||||||
|
- a range with minimum, step and maximum frequencies like
|
||||||
|
"[min step max]"
|
||||||
|
|
||||||
What: /sys/bus/iio/devices/iio:deviceX/oversampling_ratio
|
What: /sys/bus/iio/devices/iio:deviceX/oversampling_ratio
|
||||||
KernelVersion: 2.6.38
|
KernelVersion: 2.6.38
|
||||||
|
|
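The new sampling_frequency_available description above allows two forms: a discrete list such as "0 2 4 6 8", or a bracketed "[min step max]" range. A minimal parsing sketch is given below, assuming whitespace-separated numeric values in both forms. The function name `parse_available` and the returned dictionary keys are hypothetical.

```python
from typing import Dict, List, Union


def parse_available(text: str) -> Union[List[float], Dict[str, float]]:
    """Parse an IIO *_available string as either a value list or a range."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        # Range form: "[min step max]"
        lo, step, hi = (float(x) for x in text[1:-1].split())
        return {"min": lo, "step": step, "max": hi}
    # Discrete form: "0 2 4 6 8"
    return [float(x) for x in text.split()]
```

Returning two different shapes keeps the sketch short; a caller can tell them apart with `isinstance(result, dict)`.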
|
@ -18,11 +18,11 @@ Description:
|
||||||
values are 'base' and 'lid'.
|
values are 'base' and 'lid'.
|
||||||
|
|
||||||
What: /sys/bus/iio/devices/iio:deviceX/id
|
What: /sys/bus/iio/devices/iio:deviceX/id
|
||||||
Date: Septembre 2017
|
Date: September 2017
|
||||||
KernelVersion: 4.14
|
KernelVersion: 4.14
|
||||||
Contact: linux-iio@vger.kernel.org
|
Contact: linux-iio@vger.kernel.org
|
||||||
Description:
|
Description:
|
||||||
This attribute is exposed by the CrOS EC legacy accelerometer
|
This attribute is exposed by the CrOS EC sensors driver and
|
||||||
driver and represents the sensor ID as exposed by the EC. This
|
represents the sensor ID as exposed by the EC. This ID is used
|
||||||
ID is used by the Android sensor service hardware abstraction
|
by the Android sensor service hardware abstraction layer (sensor
|
||||||
layer (sensor HAL) through the Android container on ChromeOS.
|
HAL) through the Android container on ChromeOS.
|
||||||
|
|
|
@ -1,4 +1,4 @@
|
||||||
What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
|
What: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
|
||||||
Date: January 2017
|
Date: January 2017
|
||||||
KernelVersion: 4.11
|
KernelVersion: 4.11
|
||||||
Contact: linux-iio@vger.kernel.org
|
Contact: linux-iio@vger.kernel.org
|
||||||
|
@ -6,7 +6,7 @@ Description:
|
||||||
Show or set the gain boost of the amp, from 0-31 range.
|
Show or set the gain boost of the amp, from 0-31 range.
|
||||||
default 31
|
default 31
|
||||||
|
|
||||||
What /sys/bus/iio/devices/iio:deviceX/sensor_max_range
|
What: /sys/bus/iio/devices/iio:deviceX/sensor_max_range
|
||||||
Date: January 2017
|
Date: January 2017
|
||||||
KernelVersion: 4.11
|
KernelVersion: 4.11
|
||||||
Contact: linux-iio@vger.kernel.org
|
Contact: linux-iio@vger.kernel.org
|
||||||
|
|
44
Documentation/ABI/testing/sysfs-bus-iio-frequency-adf4371
Normal file
|
@ -0,0 +1,44 @@
|
||||||
|
What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_frequency
|
||||||
|
KernelVersion:
|
||||||
|
Contact: linux-iio@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
Stores the PLL frequency in Hz for channel Y.
|
||||||
|
Reading returns the actual frequency in Hz.
|
||||||
|
The ADF4371 has an integrated VCO with fundamental output
|
||||||
|
frequency ranging from 4000000000 Hz to 8000000000 Hz.
|
||||||
|
|
||||||
|
out_altvoltage0_frequency:
|
||||||
|
A divide by 1, 2, 4, 8, 16, 32 or 64 circuit generates
|
||||||
|
frequencies from 62500000 Hz to 8000000000 Hz.
|
||||||
|
out_altvoltage1_frequency:
|
||||||
|
This channel duplicates the channel 0 frequency
|
||||||
|
out_altvoltage2_frequency:
|
||||||
|
A frequency doubler generates frequencies from
|
||||||
|
8000000000 Hz to 16000000000 Hz.
|
||||||
|
out_altvoltage3_frequency:
|
||||||
|
A frequency quadrupler generates frequencies from
|
||||||
|
16000000000 Hz to 32000000000 Hz.
|
||||||
|
|
||||||
|
Note: writes to one of the channels will affect the frequency of
|
||||||
|
all the other channels, since it involves changing the VCO
|
||||||
|
fundamental output frequency.
|
||||||
|
|
||||||
|
What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_name
|
||||||
|
KernelVersion:
|
||||||
|
Contact: linux-iio@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
Reading returns the datasheet name for channel Y:
|
||||||
|
|
||||||
|
out_altvoltage0_name: RF8x
|
||||||
|
out_altvoltage1_name: RFAUX8x
|
||||||
|
out_altvoltage2_name: RF16x
|
||||||
|
out_altvoltage3_name: RF32x
|
||||||
|
|
||||||
|
What: /sys/bus/iio/devices/iio:deviceX/out_altvoltageY_powerdown
|
||||||
|
KernelVersion:
|
||||||
|
Contact: linux-iio@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
This attribute allows the user to power down the PLL and its
|
||||||
|
RFOut buffers.
|
||||||
|
Writing 1 causes the specified channel to power down.
|
||||||
|
Clearing returns to normal operation.
|
|
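The channel ranges listed for out_altvoltageY_frequency above all derive from one VCO running between 4000000000 Hz and 8000000000 Hz, which is why a write to any channel moves the others. The Python sketch below illustrates that relationship under stated assumptions. It is not the driver's algorithm, and the name vco_frequency is hypothetical. The divider list is assumed to end at 64, because 4000000000 / 64 = 62500000 matches the lower bound given for channel 0.

```python
VCO_MIN_HZ = 4_000_000_000
VCO_MAX_HZ = 8_000_000_000

# Output ranges per channel, copied from the ABI description.
CHANNEL_RANGES_HZ = {
    0: (62_500_000, 8_000_000_000),      # RF8x: VCO through a divider
    1: (62_500_000, 8_000_000_000),      # RFAUX8x: duplicates channel 0
    2: (8_000_000_000, 16_000_000_000),  # RF16x: VCO through a doubler
    3: (16_000_000_000, 32_000_000_000), # RF32x: VCO through a quadrupler
}

# Assumption: power-of-two dividers up to 64 (see the note above).
DIVIDERS = (1, 2, 4, 8, 16, 32, 64)


def vco_frequency(channel, freq_hz):
    """Return the VCO frequency in Hz implied by a channel frequency.

    Raises ValueError if freq_hz is outside the range listed for the channel.
    """
    low, high = CHANNEL_RANGES_HZ[channel]
    if not low <= freq_hz <= high:
        raise ValueError(f"{freq_hz} Hz is outside channel {channel} range")
    if channel == 2:
        return freq_hz // 2  # doubler: output is 2 x VCO
    if channel == 3:
        return freq_hz // 4  # quadrupler: output is 4 x VCO
    # Channels 0 and 1: pick the smallest divider that puts the VCO in range.
    for divider in DIVIDERS:
        vco = freq_hz * divider
        if VCO_MIN_HZ <= vco <= VCO_MAX_HZ:
            return vco
    raise ValueError(f"no divider maps {freq_hz} Hz into the VCO range")
```

Since every channel maps back to the same VCO value, requesting 10000000000 Hz on channel 2 implies a 5000000000 Hz VCO, and channel 3 would then carry 20000000000 Hz.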
@ -1,4 +1,4 @@
|
||||||
What /sys/bus/iio/devices/iio:deviceX/in_proximity_input
|
What: /sys/bus/iio/devices/iio:deviceX/in_proximity_input
|
||||||
Date: March 2014
|
Date: March 2014
|
||||||
KernelVersion: 3.15
|
KernelVersion: 3.15
|
||||||
Contact: Matt Ranostay <matt.ranostay@konsulko.com>
|
Contact: Matt Ranostay <matt.ranostay@konsulko.com>
|
||||||
|
@ -6,7 +6,7 @@ Description:
|
||||||
Get the current distance in meters of storm (1km steps)
|
Get the current distance in meters of storm (1km steps)
|
||||||
1000-40000 = distance in meters
|
1000-40000 = distance in meters
|
||||||
|
|
||||||
What /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
|
What: /sys/bus/iio/devices/iio:deviceX/sensor_sensitivity
|
||||||
Date: March 2014
|
Date: March 2014
|
||||||
KernelVersion: 3.15
|
KernelVersion: 3.15
|
||||||
Contact: Matt Ranostay <matt.ranostay@konsulko.com>
|
Contact: Matt Ranostay <matt.ranostay@konsulko.com>
|
||||||
|
|
|
@ -9,9 +9,9 @@ errors may be "seen" / reported by the link partner and not the
|
||||||
problematic endpoint itself (which may report all counters as 0 as it never
|
problematic endpoint itself (which may report all counters as 0 as it never
|
||||||
saw any problems).
|
saw any problems).
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_dev_correctable
|
What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: List of correctable errors seen and reported by this
|
Description: List of correctable errors seen and reported by this
|
||||||
PCI device using ERR_COR. Note that since multiple errors may
|
PCI device using ERR_COR. Note that since multiple errors may
|
||||||
|
@ -31,9 +31,9 @@ Header Log Overflow 0
|
||||||
TOTAL_ERR_COR 2
|
TOTAL_ERR_COR 2
|
||||||
-------------------------------------------------------------------------
|
-------------------------------------------------------------------------
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_dev_fatal
|
What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: List of uncorrectable fatal errors seen and reported by this
|
Description: List of uncorrectable fatal errors seen and reported by this
|
||||||
PCI device using ERR_FATAL. Note that since multiple errors may
|
PCI device using ERR_FATAL. Note that since multiple errors may
|
||||||
|
@ -62,9 +62,9 @@ TLP Prefix Blocked Error 0
|
||||||
TOTAL_ERR_FATAL 0
|
TOTAL_ERR_FATAL 0
|
||||||
-------------------------------------------------------------------------
|
-------------------------------------------------------------------------
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
|
What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: List of uncorrectable nonfatal errors seen and reported by this
|
Description: List of uncorrectable nonfatal errors seen and reported by this
|
||||||
PCI device using ERR_NONFATAL. Note that since multiple errors
|
PCI device using ERR_NONFATAL. Note that since multiple errors
|
||||||
|
@ -103,20 +103,20 @@ collectors) that are AER capable. These indicate the number of error messages as
|
||||||
device, so these counters include them and are thus cumulative of all the error
|
device, so these counters include them and are thus cumulative of all the error
|
||||||
messages on the PCI hierarchy originating at that root port.
|
messages on the PCI hierarchy originating at that root port.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_cor
|
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_cor
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: Total number of ERR_COR messages reported to rootport.
|
Description: Total number of ERR_COR messages reported to rootport.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_fatal
|
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_fatal
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: Total number of ERR_FATAL messages reported to rootport.
|
Description: Total number of ERR_FATAL messages reported to rootport.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_nonfatal
|
What: /sys/bus/pci/devices/<dev>/aer_stats/aer_rootport_total_err_nonfatal
|
||||||
Date: July 2018
|
Date: July 2018
|
||||||
Kernel Version: 4.19.0
|
KernelVersion: 4.19.0
|
||||||
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
Contact: linux-pci@vger.kernel.org, rajatja@google.com
|
||||||
Description: Total number of ERR_NONFATAL messages reported to rootport.
|
Description: Total number of ERR_NONFATAL messages reported to rootport.
|
||||||
|
|
|
@ -1,68 +1,68 @@
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/model
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/model
|
||||||
Date: March 2009
|
Date: March 2009
|
||||||
Kernel Version: 2.6.30
|
KernelVersion: 2.6.30
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the SCSI INQUIRY page 0 model for logical drive
|
Description: Displays the SCSI INQUIRY page 0 model for logical drive
|
||||||
Y of controller X.
|
Y of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/rev
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/rev
|
||||||
Date: March 2009
|
Date: March 2009
|
||||||
Kernel Version: 2.6.30
|
KernelVersion: 2.6.30
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the SCSI INQUIRY page 0 revision for logical
|
Description: Displays the SCSI INQUIRY page 0 revision for logical
|
||||||
drive Y of controller X.
|
drive Y of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/unique_id
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/unique_id
|
||||||
Date: March 2009
|
Date: March 2009
|
||||||
Kernel Version: 2.6.30
|
KernelVersion: 2.6.30
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the SCSI INQUIRY page 83 serial number for logical
|
Description: Displays the SCSI INQUIRY page 83 serial number for logical
|
||||||
drive Y of controller X.
|
drive Y of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/vendor
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/vendor
|
||||||
Date: March 2009
|
Date: March 2009
|
||||||
Kernel Version: 2.6.30
|
KernelVersion: 2.6.30
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the SCSI INQUIRY page 0 vendor for logical drive
|
Description: Displays the SCSI INQUIRY page 0 vendor for logical drive
|
||||||
Y of controller X.
|
Y of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/block:cciss!cXdY
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/block:cciss!cXdY
|
||||||
Date: March 2009
|
Date: March 2009
|
||||||
Kernel Version: 2.6.30
|
KernelVersion: 2.6.30
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: A symbolic link to /sys/block/cciss!cXdY
|
Description: A symbolic link to /sys/block/cciss!cXdY
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/rescan
|
What: /sys/bus/pci/devices/<dev>/ccissX/rescan
|
||||||
Date: August 2009
|
Date: August 2009
|
||||||
Kernel Version: 2.6.31
|
KernelVersion: 2.6.31
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Kicks off a rescan of the controller to discover logical
|
Description: Kicks off a rescan of the controller to discover logical
|
||||||
drive topology changes.
|
drive topology changes.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/lunid
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/lunid
|
||||||
Date: August 2009
|
Date: August 2009
|
||||||
Kernel Version: 2.6.31
|
KernelVersion: 2.6.31
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the 8-byte LUN ID used to address logical
|
Description: Displays the 8-byte LUN ID used to address logical
|
||||||
drive Y of controller X.
|
drive Y of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/raid_level
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/raid_level
|
||||||
Date: August 2009
|
Date: August 2009
|
||||||
Kernel Version: 2.6.31
|
KernelVersion: 2.6.31
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the RAID level of logical drive Y of
|
Description: Displays the RAID level of logical drive Y of
|
||||||
controller X.
|
controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/cXdY/usage_count
|
What: /sys/bus/pci/devices/<dev>/ccissX/cXdY/usage_count
|
||||||
Date: August 2009
|
Date: August 2009
|
||||||
Kernel Version: 2.6.31
|
KernelVersion: 2.6.31
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Displays the usage count (number of opens) of logical drive Y
|
Description: Displays the usage count (number of opens) of logical drive Y
|
||||||
of controller X.
|
of controller X.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/resettable
|
What: /sys/bus/pci/devices/<dev>/ccissX/resettable
|
||||||
Date: February 2011
|
Date: February 2011
|
||||||
Kernel Version: 2.6.38
|
KernelVersion: 2.6.38
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Value of 1 indicates the controller can honor the reset_devices
|
Description: Value of 1 indicates the controller can honor the reset_devices
|
||||||
kernel parameter. Value of 0 indicates reset_devices cannot be
|
kernel parameter. Value of 0 indicates reset_devices cannot be
|
||||||
|
@ -71,9 +71,9 @@ Description: Value of 1 indicates the controller can honor the reset_devices
|
||||||
a dump device, as kdump requires resetting the device in order
|
a dump device, as kdump requires resetting the device in order
|
||||||
to work reliably.
|
to work reliably.
|
||||||
|
|
||||||
Where: /sys/bus/pci/devices/<dev>/ccissX/transport_mode
|
What: /sys/bus/pci/devices/<dev>/ccissX/transport_mode
|
||||||
Date: July 2011
|
Date: July 2011
|
||||||
Kernel Version: 3.0
|
KernelVersion: 3.0
|
||||||
Contact: iss_storagedev@hp.com
|
Contact: iss_storagedev@hp.com
|
||||||
Description: Value of "simple" indicates that the controller has been placed
|
Description: Value of "simple" indicates that the controller has been placed
|
||||||
in "simple mode". Value of "performant" indicates that the
|
in "simple mode". Value of "performant" indicates that the
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
What: /sys/bus/siox/devices/siox-X/active
|
What: /sys/bus/siox/devices/siox-X/active
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
On reading represents the current state of the bus. If it
|
On reading represents the current state of the bus. If it
|
||||||
contains a "0" the bus is stopped and connected devices are
|
contains a "0" the bus is stopped and connected devices are
|
||||||
|
@ -12,7 +12,7 @@ Description:
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X/device_add
|
What: /sys/bus/siox/devices/siox-X/device_add
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Write-only file. Write
|
Write-only file. Write
|
||||||
|
|
||||||
|
@ -27,13 +27,13 @@ Description:
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X/device_remove
|
What: /sys/bus/siox/devices/siox-X/device_remove
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Write-only file. A single write removes the last device in the siox chain.
|
Write-only file. A single write removes the last device in the siox chain.
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X/poll_interval_ns
|
What: /sys/bus/siox/devices/siox-X/poll_interval_ns
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Defines the interval between two poll cycles in nanoseconds.
|
Defines the interval between two poll cycles in nanoseconds.
|
||||||
Note this is rounded to jiffies on writing. On reading the current value
|
Note this is rounded to jiffies on writing. On reading the current value
|
||||||
|
@ -41,33 +41,33 @@ Description:
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/connected
|
What: /sys/bus/siox/devices/siox-X-Y/connected
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value. "0" means the Yth device on siox bus X isn't "connected" i.e.
|
Read-only value. "0" means the Yth device on siox bus X isn't "connected" i.e.
|
||||||
communication with it is not ensured. "1" signals a working connection.
|
communication with it is not ensured. "1" signals a working connection.
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/inbytes
|
What: /sys/bus/siox/devices/siox-X-Y/inbytes
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value reporting the inbytes value provided to siox-X/device_add
|
Read-only value reporting the inbytes value provided to siox-X/device_add
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/status_errors
|
What: /sys/bus/siox/devices/siox-X-Y/status_errors
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Counts the number of time intervals when the read status byte doesn't yield the
|
Counts the number of time intervals when the read status byte doesn't yield the
|
||||||
expected value.
|
expected value.
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/type
|
What: /sys/bus/siox/devices/siox-X-Y/type
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value reporting the type value provided to siox-X/device_add.
|
Read-only value reporting the type value provided to siox-X/device_add.
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/watchdog
|
What: /sys/bus/siox/devices/siox-X-Y/watchdog
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value reporting if the watchdog of the siox device is
|
Read-only value reporting if the watchdog of the siox device is
|
||||||
active. "0" means the watchdog is not active and the device is expected to
|
active. "0" means the watchdog is not active and the device is expected to
|
||||||
|
@ -75,13 +75,13 @@ Description:
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/watchdog_errors
|
What: /sys/bus/siox/devices/siox-X-Y/watchdog_errors
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value reporting the number of time intervals when the
|
Read-only value reporting the number of time intervals when the
|
||||||
watchdog was active.
|
watchdog was active.
|
||||||
|
|
||||||
What: /sys/bus/siox/devices/siox-X-Y/outbytes
|
What: /sys/bus/siox/devices/siox-X-Y/outbytes
|
||||||
KernelVersion: 4.16
|
KernelVersion: 4.16
|
||||||
Contact: Gavin Schenk <g.schenk@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
Contact: Thorsten Scherer <t.scherer@eckelmann.de>, Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
|
||||||
Description:
|
Description:
|
||||||
Read-only value reporting the outbytes value provided to siox-X/device_add.
|
Read-only value reporting the outbytes value provided to siox-X/device_add.
|
||||||
|
|
|
@ -1,14 +1,14 @@
|
||||||
Where: /sys/bus/usb/.../powered
|
What: /sys/bus/usb/.../powered
|
||||||
Date: August 2008
|
Date: August 2008
|
||||||
Kernel Version: 2.6.26
|
KernelVersion: 2.6.26
|
||||||
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
||||||
Description: Controls whether the device's display will be powered.
|
Description: Controls whether the device's display will be powered.
|
||||||
A value of 0 is off and a non-zero value is on.
|
A value of 0 is off and a non-zero value is on.
|
||||||
|
|
||||||
Where: /sys/bus/usb/.../mode_msb
|
What: /sys/bus/usb/.../mode_msb
|
||||||
Where: /sys/bus/usb/.../mode_lsb
|
What: /sys/bus/usb/.../mode_lsb
|
||||||
Date: August 2008
|
Date: August 2008
|
||||||
Kernel Version: 2.6.26
|
KernelVersion: 2.6.26
|
||||||
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
||||||
Description: Controls the devices display mode.
|
Description: Controls the devices display mode.
|
||||||
For a 6 character display the values are
|
For a 6 character display the values are
|
||||||
|
@ -16,24 +16,24 @@ Description: Controls the devices display mode.
|
||||||
for an 8 character display the values are
|
for an 8 character display the values are
|
||||||
MSB 0x08; LSB 0xFF.
|
MSB 0x08; LSB 0xFF.
|
||||||
|
|
||||||
Where: /sys/bus/usb/.../textmode
|
What: /sys/bus/usb/.../textmode
|
||||||
Date: August 2008
|
Date: August 2008
|
||||||
Kernel Version: 2.6.26
|
KernelVersion: 2.6.26
|
||||||
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
||||||
Description: Controls the way the device interprets its text buffer.
|
Description: Controls the way the device interprets its text buffer.
|
||||||
raw: each character controls its segment manually
|
raw: each character controls its segment manually
|
||||||
hex: each character is between 0-15
|
hex: each character is between 0-15
|
||||||
ascii: each character is between '0'-'9' and 'A'-'F'.
|
ascii: each character is between '0'-'9' and 'A'-'F'.
|
||||||
|
|
||||||
Where: /sys/bus/usb/.../text
|
What: /sys/bus/usb/.../text
|
||||||
Date: August 2008
|
Date: August 2008
|
||||||
Kernel Version: 2.6.26
|
KernelVersion: 2.6.26
|
||||||
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
||||||
Description: The text (or data) for the device to display
|
Description: The text (or data) for the device to display
|
||||||
|
|
||||||
Where: /sys/bus/usb/.../decimals
|
What: /sys/bus/usb/.../decimals
|
||||||
Date: August 2008
|
Date: August 2008
|
||||||
Kernel Version: 2.6.26
|
KernelVersion: 2.6.26
|
||||||
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
Contact: Harrison Metzger <harrisonmetz@gmail.com>
|
||||||
Description: Controls the decimal places on the device.
|
Description: Controls the decimal places on the device.
|
||||||
To set the nth decimal place, give this field
|
To set the nth decimal place, give this field
|
||||||
|
|
|
@ -4,7 +4,7 @@ KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Get the ALS output channel used as input in
|
Get the ALS output channel used as input in
|
||||||
ALS-current-control mode (0, 1), where
|
ALS-current-control mode (0, 1), where:
|
||||||
|
|
||||||
0 - out_current0 (backlight 0)
|
0 - out_current0 (backlight 0)
|
||||||
1 - out_current1 (backlight 1)
|
1 - out_current1 (backlight 1)
|
||||||
|
@ -28,7 +28,7 @@ Date: April 2012
|
||||||
KernelVersion: 3.5
|
KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the brightness-mapping mode (0, 1), where
|
Set the brightness-mapping mode (0, 1), where:
|
||||||
|
|
||||||
0 - exponential mode
|
0 - exponential mode
|
||||||
1 - linear mode
|
1 - linear mode
|
||||||
|
@ -38,7 +38,7 @@ Date: April 2012
|
||||||
KernelVersion: 3.5
|
KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the PWM-input control mask (5 bits), where
|
Set the PWM-input control mask (5 bits), where:
|
||||||
|
|
||||||
bit 5 - PWM-input enabled in Zone 4
|
bit 5 - PWM-input enabled in Zone 4
|
||||||
bit 4 - PWM-input enabled in Zone 3
|
bit 4 - PWM-input enabled in Zone 3
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
Note: Attributes that are shared between devices are stored in the directory
|
Please note that attributes that are shared between devices are stored in
|
||||||
pointed to by the symlink device/.
|
the directory pointed to by the symlink device/.
|
||||||
Example: The real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is
|
For example, the real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is
|
||||||
/sys/class/cxl/afu0.0s/device/irqs_max, i.e. /sys/class/cxl/afu0.0/irqs_max.
|
/sys/class/cxl/afu0.0s/device/irqs_max, i.e. /sys/class/cxl/afu0.0/irqs_max.
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -47,7 +47,7 @@ Description:
|
||||||
What: /sys/class/devfreq/.../trans_stat
|
What: /sys/class/devfreq/.../trans_stat
|
||||||
Date: October 2012
|
Date: October 2012
|
||||||
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
|
Contact: MyungJoo Ham <myungjoo.ham@samsung.com>
|
||||||
Descrtiption:
|
Description:
|
||||||
This ABI shows the statistics of devfreq behavior on a
|
This ABI shows the statistics of devfreq behavior on a
|
||||||
specific device. It shows the time spent in each state and
|
specific device. It shows the time spent in each state and
|
||||||
the number of transitions between states.
|
the number of transitions between states.
|
||||||
|
|
|
@ -4,7 +4,7 @@ KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the ALS output channel to use as input in
|
Set the ALS output channel to use as input in
|
||||||
ALS-current-control mode (1, 2), where
|
ALS-current-control mode (1, 2), where:
|
||||||
|
|
||||||
1 - out_current1
|
1 - out_current1
|
||||||
2 - out_current2
|
2 - out_current2
|
||||||
|
@ -22,7 +22,7 @@ Date: April 2012
|
||||||
KernelVersion: 3.5
|
KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the pattern generator fall and rise times (0..7), where
|
Set the pattern generator fall and rise times (0..7), where:
|
||||||
|
|
||||||
0 - 2048 us
|
0 - 2048 us
|
||||||
1 - 262 ms
|
1 - 262 ms
|
||||||
|
@ -45,7 +45,7 @@ Date: April 2012
|
||||||
KernelVersion: 3.5
|
KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the brightness-mapping mode (0, 1), where
|
Set the brightness-mapping mode (0, 1), where:
|
||||||
|
|
||||||
0 - exponential mode
|
0 - exponential mode
|
||||||
1 - linear mode
|
1 - linear mode
|
||||||
|
@ -55,7 +55,7 @@ Date: April 2012
|
||||||
KernelVersion: 3.5
|
KernelVersion: 3.5
|
||||||
Contact: Johan Hovold <jhovold@gmail.com>
|
Contact: Johan Hovold <jhovold@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the PWM-input control mask (5 bits), where
|
Set the PWM-input control mask (5 bits), where:
|
||||||
|
|
||||||
bit 5 - PWM-input enabled in Zone 4
|
bit 5 - PWM-input enabled in Zone 4
|
||||||
bit 4 - PWM-input enabled in Zone 3
|
bit 4 - PWM-input enabled in Zone 3
|
||||||
|
|
|
@ -5,7 +5,7 @@ Contact: Janne Kanniainen <janne.kanniainen@gmail.com>
|
||||||
Description:
|
Description:
|
||||||
Set the mode of LEDs. You should notice that changing the mode
|
Set the mode of LEDs. You should notice that changing the mode
|
||||||
of one LED will update the mode of its two sibling devices as
|
of one LED will update the mode of its two sibling devices as
|
||||||
well.
|
well. Possible values are:
|
||||||
|
|
||||||
0 - normal
|
0 - normal
|
||||||
1 - audio
|
1 - audio
|
||||||
|
@ -13,4 +13,4 @@ Description:
|
||||||
|
|
||||||
Normal: LEDs are fully on when enabled
|
Normal: LEDs are fully on when enabled
|
||||||
Audio: LEDs brightness depends on sound level
|
Audio: LEDs brightness depends on sound level
|
||||||
Breathing: LEDs brightness varies at human breathing rate
|
Breathing: LEDs brightness varies at human breathing rate
|
||||||
|
|
|
@ -41,3 +41,11 @@ Description:
|
||||||
xgmii, moca, qsgmii, trgmii, 1000base-x, 2500base-x, rxaui,
|
xgmii, moca, qsgmii, trgmii, 1000base-x, 2500base-x, rxaui,
|
||||||
xaui, 10gbase-kr, unknown
|
xaui, 10gbase-kr, unknown
|
||||||
|
|
||||||
|
What: /sys/class/mdio_bus/<bus>/<device>/phy_standalone
|
||||||
|
Date: May 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Contact: netdev@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
Boolean value indicating whether the PHY device is used in
|
||||||
|
standalone mode, without a net_device associated, by PHYLINK.
|
||||||
|
Attribute created only when this is the case.
|
||||||
|
|
|
@ -29,7 +29,7 @@ Contact: Bjørn Mork <bjorn@mork.no>
|
||||||
Description:
|
Description:
|
||||||
Unsigned integer.
|
Unsigned integer.
|
||||||
|
|
||||||
Write a number ranging from 1 to 127 to add a qmap mux
|
Write a number ranging from 1 to 254 to add a qmap mux
|
||||||
based network device, supported by recent Qualcomm based
|
based network device, supported by recent Qualcomm based
|
||||||
modems.
|
modems.
|
||||||
|
|
||||||
|
@ -46,5 +46,5 @@ Contact: Bjørn Mork <bjorn@mork.no>
|
||||||
Description:
|
Description:
|
||||||
Unsigned integer.
|
Unsigned integer.
|
||||||
|
|
||||||
Write a number ranging from 1 to 127 to delete a previously
|
Write a number ranging from 1 to 254 to delete a previously
|
||||||
created qmap mux based network device.
|
created qmap mux based network device.
|
||||||
|
|
|
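The range check implied by the updated add_mux/del_mux text can be sketched as below. This is only an illustration of the documented bounds (1 to 254 on the new side of the hunk); the function name `is_valid_mux_id` is hypothetical, and the sketch does not write to sysfs or reproduce the driver's own parsing.

```python
QMAP_MUX_ID_MIN = 1
QMAP_MUX_ID_MAX = 254  # upper bound documented on the new side of this hunk


def is_valid_mux_id(value: str) -> bool:
    """Return True if the text is a plain decimal integer in 1..254."""
    text = value.strip()
    # Accept only ASCII decimal digits; signs, hex and empty input are rejected.
    if not (text.isascii() and text.isdigit()):
        return False
    return QMAP_MUX_ID_MIN <= int(text) <= QMAP_MUX_ID_MAX
```

A caller would run this check before echoing the value into the add_mux or del_mux attribute.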
@ -376,10 +376,42 @@ Description:
|
||||||
supply. Normally this is configured based on the type of
|
supply. Normally this is configured based on the type of
|
||||||
connection made (e.g. A configured SDP should output a maximum
|
connection made (e.g. A configured SDP should output a maximum
|
||||||
of 500mA so the input current limit is set to the same value).
|
of 500mA so the input current limit is set to the same value).
|
||||||
|
Use preferably input_power_limit, and for problems that cannot be
|
||||||
|
solved using a power limit use input_current_limit.
|
||||||
|
|
||||||
Access: Read, Write
|
Access: Read, Write
|
||||||
Valid values: Represented in microamps
|
Valid values: Represented in microamps
|
||||||
|
|
||||||
|
What: /sys/class/power_supply/<supply_name>/input_voltage_limit
|
||||||
|
Date: May 2019
|
||||||
|
Contact: linux-pm@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
This entry configures the incoming VBUS voltage limit currently
|
||||||
|
set in the supply. Normally this is configured based on
|
||||||
|
system-level knowledge or user input (e.g. This is part of the
|
||||||
|
Pixel C's thermal management strategy to effectively limit the
|
||||||
|
input power to 5V when the screen is on to meet Google's skin
|
||||||
|
temperature targets). Note that this feature should not be
|
||||||
|
used for safety critical things.
|
||||||
|
Use preferably input_power_limit, and for problems that cannot be
|
||||||
|
solved using a power limit use input_voltage_limit.
|
||||||
|
|
||||||
|
Access: Read, Write
|
||||||
|
Valid values: Represented in microvolts
|
||||||
|
|
||||||
|
What: /sys/class/power_supply/<supply_name>/input_power_limit
|
||||||
|
Date: May 2019
|
||||||
|
Contact: linux-pm@vger.kernel.org
|
||||||
|
Description:
|
||||||
|
This entry configures the incoming power limit currently set
|
||||||
|
in the supply. Normally this is configured based on
|
||||||
|
system-level knowledge or user input. Use preferably this
|
||||||
|
feature to limit the incoming power and use current/voltage
|
||||||
|
limit only for problems that cannot be solved using a power limit.
|
||||||
|
|
||||||
|
Access: Read, Write
|
||||||
|
Valid values: Represented in microwatts
|
||||||
|
|
||||||
What: /sys/class/power_supply/<supply_name>/online,
|
What: /sys/class/power_supply/<supply_name>/online,
|
||||||
Date: May 2007
|
Date: May 2007
|
||||||
Contact: linux-pm@vger.kernel.org
|
Contact: linux-pm@vger.kernel.org
|
||||||
|
|
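The three limit attributes above use micro-units (microamps, microvolts, microwatts). As a rough aid for relating them, the unit arithmetic can be sketched as follows. The helper name `implied_power_limit_uw` is hypothetical, and nothing in the ABI text says a driver derives one limit from the others; this only shows P = I x V in the documented units.

```python
def implied_power_limit_uw(current_limit_ua: int, voltage_limit_uv: int) -> int:
    """Power in microwatts implied by a current limit (uA) and a voltage limit (uV).

    uA * uV yields picowatts (1e-12 W), so dividing by 1e6 gives microwatts.
    """
    return (current_limit_ua * voltage_limit_uv) // 1_000_000
```

For the 500mA SDP case mentioned above at an assumed 5 V, this gives 2,500,000 microwatts (2.5 W).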
30
Documentation/ABI/testing/sysfs-class-power-wilco
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
What: /sys/class/power_supply/wilco-charger/charge_type
|
||||||
|
Date: April 2019
|
||||||
|
KernelVersion: 5.2
|
||||||
|
Description:
|
||||||
|
What charging algorithm to use:
|
||||||
|
|
||||||
|
Standard: Fully charges battery at a standard rate.
|
||||||
|
Adaptive: Battery settings adaptively optimized based on
|
||||||
|
typical battery usage pattern.
|
||||||
|
Fast: Battery charges over a shorter period.
|
||||||
|
Trickle: Extends battery lifespan, intended for users who
|
||||||
|
primarily use their Chromebook while connected to AC.
|
||||||
|
Custom: A low and high threshold percentage is specified.
|
||||||
|
Charging begins when level drops below
|
||||||
|
charge_control_start_threshold, and ceases when
|
||||||
|
level is above charge_control_end_threshold.
|
||||||
|
|
||||||
|
What: /sys/class/power_supply/wilco-charger/charge_control_start_threshold
|
||||||
|
Date: April 2019
|
||||||
|
KernelVersion: 5.2
|
||||||
|
Description:
|
||||||
|
Used when charge_type="Custom", as described above. Measured in
|
||||||
|
percentages. The valid range is [50, 95].
|
||||||
|
|
||||||
|
What: /sys/class/power_supply/wilco-charger/charge_control_end_threshold
|
||||||
|
Date: April 2019
|
||||||
|
KernelVersion: 5.2
|
||||||
|
Description:
|
||||||
|
Used when charge_type="Custom", as described above. Measured in
|
||||||
|
percentages. The valid range is [55, 100].
|
|
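The documented ranges for the "Custom" charge type can be checked before writing the two threshold attributes. The sketch below encodes only the ranges stated above ([50, 95] and [55, 100]); the function name `check_custom_thresholds` is hypothetical, and any further constraint the driver may enforce between the two values is not covered here.

```python
START_RANGE = (50, 95)   # charge_control_start_threshold, in percent
END_RANGE = (55, 100)    # charge_control_end_threshold, in percent


def check_custom_thresholds(start_percent: int, end_percent: int) -> list:
    """Return a list of human-readable problems; an empty list means both values are in range."""
    problems = []
    if not START_RANGE[0] <= start_percent <= START_RANGE[1]:
        problems.append(f"start threshold {start_percent} outside {START_RANGE}")
    if not END_RANGE[0] <= end_percent <= END_RANGE[1]:
        problems.append(f"end threshold {end_percent} outside {END_RANGE}")
    return problems
```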
@ -5,7 +5,7 @@ Contact: linux-pm@vger.kernel.org
|
||||||
Description:
|
Description:
|
||||||
The powercap/ class sub directory belongs to the power cap
|
The powercap/ class sub directory belongs to the power cap
|
||||||
subsystem. Refer to
|
subsystem. Refer to
|
||||||
Documentation/power/powercap/powercap.txt for details.
|
Documentation/power/powercap/powercap.rst for details.
|
||||||
|
|
||||||
What: /sys/class/powercap/<control type>
|
What: /sys/class/powercap/<control type>
|
||||||
Date: September 2013
|
Date: September 2013
|
||||||
|
@ -147,6 +147,6 @@ What: /sys/class/powercap/.../<power zone>/enabled
|
||||||
Date: September 2013
|
Date: September 2013
|
||||||
KernelVersion: 3.13
|
KernelVersion: 3.13
|
||||||
Contact: linux-pm@vger.kernel.org
|
Contact: linux-pm@vger.kernel.org
|
||||||
Description
|
Description:
|
||||||
This allows to enable/disable power capping at power zone level.
|
This allows to enable/disable power capping at power zone level.
|
||||||
This applies to current power zone and its children.
|
This applies to current power zone and its children.
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
switchtec - Microsemi Switchtec PCI Switch Management Endpoint
|
switchtec - Microsemi Switchtec PCI Switch Management Endpoint
|
||||||
|
|
||||||
For details on this subsystem look at Documentation/switchtec.txt.
|
For details on this subsystem look at Documentation/driver-api/switchtec.rst.
|
||||||
|
|
||||||
What: /sys/class/switchtec
|
What: /sys/class/switchtec
|
||||||
Date: 05-Jan-2017
|
Date: 05-Jan-2017
|
||||||
|
|
|
@ -125,12 +125,6 @@ Description:
|
||||||
The EUI-48 of this device in colon separated hex
|
The EUI-48 of this device in colon separated hex
|
||||||
octets.
|
octets.
|
||||||
|
|
||||||
What: /sys/class/uwb_rc/uwbN/<EUI-48>/BPST
|
|
||||||
Date: July 2008
|
|
||||||
KernelVersion: 2.6.27
|
|
||||||
Contact: linux-usb@vger.kernel.org
|
|
||||||
Description:
|
|
||||||
|
|
||||||
What: /sys/class/uwb_rc/uwbN/<EUI-48>/IEs
|
What: /sys/class/uwb_rc/uwbN/<EUI-48>/IEs
|
||||||
Date: July 2008
|
Date: July 2008
|
||||||
KernelVersion: 2.6.27
|
KernelVersion: 2.6.27
|
||||||
|
|
|
@ -34,7 +34,7 @@ Description: CPU topology files that describe kernel limits related to
|
||||||
present: cpus that have been identified as being present in
|
present: cpus that have been identified as being present in
|
||||||
the system.
|
the system.
|
||||||
|
|
||||||
See Documentation/cputopology.txt for more information.
|
See Documentation/admin-guide/cputopology.rst for more information.
|
||||||
|
|
||||||
|
|
||||||
What: /sys/devices/system/cpu/probe
|
What: /sys/devices/system/cpu/probe
|
||||||
|
@ -103,7 +103,7 @@ Description: CPU topology files that describe a logical CPU's relationship
|
||||||
thread_siblings_list: human-readable list of cpu#'s hardware
|
thread_siblings_list: human-readable list of cpu#'s hardware
|
||||||
threads within the same core as cpu#
|
threads within the same core as cpu#
|
||||||
|
|
||||||
See Documentation/cputopology.txt for more information.
|
See Documentation/admin-guide/cputopology.rst for more information.
|
||||||
|
|
||||||
|
|
||||||
What: /sys/devices/system/cpu/cpuidle/current_driver
|
What: /sys/devices/system/cpu/cpuidle/current_driver
|
||||||
|
@ -137,7 +137,8 @@ Description: Discover cpuidle policy and mechanism
|
||||||
current_governor: (RW) displays current idle policy. Users can
|
current_governor: (RW) displays current idle policy. Users can
|
||||||
switch the governor at runtime by writing to this file.
|
switch the governor at runtime by writing to this file.
|
||||||
|
|
||||||
See files in Documentation/cpuidle/ for more information.
|
See Documentation/admin-guide/pm/cpuidle.rst and
|
||||||
|
Documentation/driver-api/pm/cpuidle.rst for more information.
|
||||||
|
|
||||||
|
|
||||||
What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/name
|
What: /sys/devices/system/cpu/cpuX/cpuidle/stateN/name
|
||||||
|
@ -538,3 +539,26 @@ Description: Intel Energy and Performance Bias Hint (EPB)
|
||||||
|
|
||||||
This attribute is present for all online CPUs supporting the
|
This attribute is present for all online CPUs supporting the
|
||||||
Intel EPB feature.
|
Intel EPB feature.
|
||||||
|
|
||||||
|
What: /sys/devices/system/cpu/umwait_control
|
||||||
|
/sys/devices/system/cpu/umwait_control/enable_c02
|
||||||
|
/sys/devices/system/cpu/umwait_control/max_time
|
||||||
|
Date: May 2019
|
||||||
|
Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
|
||||||
|
Description: Umwait control
|
||||||
|
|
||||||
|
enable_c02: Read/write interface to control umwait C0.2 state
|
||||||
|
Read returns C0.2 state status:
|
||||||
|
0: C0.2 is disabled
|
||||||
|
1: C0.2 is enabled
|
||||||
|
|
||||||
|
Write 'y' or '1' or 'on' to enable C0.2 state.
|
||||||
|
Write 'n' or '0' or 'off' to disable C0.2 state.
|
||||||
|
|
||||||
|
The interface is case insensitive.
|
||||||
|
|
||||||
|
max_time: Read/write interface to control umwait maximum time
|
||||||
|
in TSC-quanta that the CPU can reside in either C0.1
|
||||||
|
or C0.2 state. The time is an unsigned 32-bit number.
|
||||||
|
Note that a value of zero means there is no limit.
|
||||||
|
Low order two bits must be zero.
|
||||||
|
|
|
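The input rules described for umwait_control can be sketched in user space as below. This covers only what the text states: the accepted enable_c02 strings (case insensitive) and the max_time constraints (unsigned 32-bit, low two bits zero, zero meaning no limit). The function names are hypothetical, and the kernel's own parser may accept more spellings than this sketch does.

```python
_ENABLE = {"y", "1", "on"}
_DISABLE = {"n", "0", "off"}


def parse_enable_c02(text: str) -> bool:
    """Map a documented enable_c02 input string to True (enable) or False (disable)."""
    token = text.strip().lower()  # the interface is documented as case insensitive
    if token in _ENABLE:
        return True
    if token in _DISABLE:
        return False
    raise ValueError(f"unrecognized enable_c02 value: {text!r}")


def is_valid_max_time(value: int) -> bool:
    """Check the documented max_time constraints: fits in 32 bits, low two bits zero."""
    return 0 <= value <= 0xFFFFFFFF and (value & 0x3) == 0
```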
@ -1,6 +1,6 @@
|
||||||
What: /sys/bus/pci/drivers/altera-cvp/chkcfg
|
What: /sys/bus/pci/drivers/altera-cvp/chkcfg
|
||||||
Date: May 2017
|
Date: May 2017
|
||||||
Kernel Version: 4.13
|
KernelVersion: 4.13
|
||||||
Contact: Anatolij Gustschin <agust@denx.de>
|
Contact: Anatolij Gustschin <agust@denx.de>
|
||||||
Description:
|
Description:
|
||||||
Contains either 1 or 0 and controls if configuration
|
Contains either 1 or 0 and controls if configuration
|
||||||
|
|
|
@ -62,18 +62,20 @@ What: /sys/class/habanalabs/hl<n>/ic_clk
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Allows the user to set the maximum clock frequency of the
|
Description: Allows the user to set the maximum clock frequency, in Hz, of
|
||||||
Interconnect fabric. Writes to this parameter affect the device
|
the Interconnect fabric. Writes to this parameter affect the
|
||||||
only when the power management profile is set to "manual" mode.
|
device only when the power management profile is set to "manual"
|
||||||
The device IC clock might be set to lower value then the
|
mode. The device IC clock might be set to lower value than the
|
||||||
maximum. The user should read the ic_clk_curr to see the actual
|
maximum. The user should read the ic_clk_curr to see the actual
|
||||||
frequency value of the IC
|
frequency value of the IC. This property is valid only for the
|
||||||
|
Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/ic_clk_curr
|
What: /sys/class/habanalabs/hl<n>/ic_clk_curr
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Displays the current clock frequency of the Interconnect fabric
|
Description: Displays the current clock frequency, in Hz, of the Interconnect
|
||||||
|
fabric. This property is valid only for the Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/infineon_ver
|
What: /sys/class/habanalabs/hl<n>/infineon_ver
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
|
@ -92,18 +94,20 @@ What: /sys/class/habanalabs/hl<n>/mme_clk
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Allows the user to set the maximum clock frequency of the
|
Description: Allows the user to set the maximum clock frequency, in Hz, of
|
||||||
MME compute engine. Writes to this parameter affect the device
|
the MME compute engine. Writes to this parameter affect the
|
||||||
only when the power management profile is set to "manual" mode.
|
device only when the power management profile is set to "manual"
|
||||||
The device MME clock might be set to lower value then the
|
mode. The device MME clock might be set to lower value than the
|
||||||
maximum. The user should read the mme_clk_curr to see the actual
|
maximum. The user should read the mme_clk_curr to see the actual
|
||||||
frequency value of the MME
|
frequency value of the MME. This property is valid only for the
|
||||||
|
Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/mme_clk_curr
|
What: /sys/class/habanalabs/hl<n>/mme_clk_curr
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Displays the current clock frequency of the MME compute engine
|
Description: Displays the current clock frequency, in Hz, of the MME compute
|
||||||
|
engine. This property is valid only for the Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/pci_addr
|
What: /sys/class/habanalabs/hl<n>/pci_addr
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
|
@ -163,18 +167,20 @@ What: /sys/class/habanalabs/hl<n>/tpc_clk
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Allows the user to set the maximum clock frequency of the
|
Description: Allows the user to set the maximum clock frequency, in Hz, of
|
||||||
TPC compute engines. Writes to this parameter affect the device
|
the TPC compute engines. Writes to this parameter affect the
|
||||||
only when the power management profile is set to "manual" mode.
|
device only when the power management profile is set to "manual"
|
||||||
The device TPC clock might be set to lower value then the
|
mode. The device TPC clock might be set to lower value than the
|
||||||
maximum. The user should read the tpc_clk_curr to see the actual
|
maximum. The user should read the tpc_clk_curr to see the actual
|
||||||
frequency value of the TPC
|
frequency value of the TPC. This property is valid only for
|
||||||
|
Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/tpc_clk_curr
|
What: /sys/class/habanalabs/hl<n>/tpc_clk_curr
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
KernelVersion: 5.1
|
KernelVersion: 5.1
|
||||||
Contact: oded.gabbay@gmail.com
|
Contact: oded.gabbay@gmail.com
|
||||||
Description: Displays the current clock frequency of the TPC compute engines
|
Description: Displays the current clock frequency, in Hz, of the TPC compute
|
||||||
|
engines. This property is valid only for the Goya ASIC family
|
||||||
|
|
||||||
What: /sys/class/habanalabs/hl<n>/uboot_ver
|
What: /sys/class/habanalabs/hl<n>/uboot_ver
|
||||||
Date: Jan 2019
|
Date: Jan 2019
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
What: For USB devices : /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/report_descriptor
|
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/report_descriptor
|
||||||
For BT devices : /sys/class/bluetooth/hci<addr>/<hid-bus>:<vendor-id>:<product-id>.<num>/report_descriptor
|
What: /sys/class/bluetooth/hci<addr>/<hid-bus>:<vendor-id>:<product-id>.<num>/report_descriptor
|
||||||
Symlink : /sys/class/hidraw/hidraw<num>/device/report_descriptor
|
What: /sys/class/hidraw/hidraw<num>/device/report_descriptor
|
||||||
Date: Jan 2011
|
Date: Jan 2011
|
||||||
KernelVersion: 2.0.39
|
KernelVersion: 2.0.39
|
||||||
Contact: Alan Ott <alan@signal11.us>
|
Contact: Alan Ott <alan@signal11.us>
|
||||||
|
@ -9,9 +9,9 @@ Description: When read, this file returns the device's raw binary HID
|
||||||
This file cannot be written.
|
This file cannot be written.
|
||||||
Users: HIDAPI library (http://www.signal11.us/oss/hidapi)
|
Users: HIDAPI library (http://www.signal11.us/oss/hidapi)
|
||||||
|
|
||||||
What: For USB devices : /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/country
|
What: /sys/bus/usb/devices/<busnum>-<devnum>:<config num>.<interface num>/<hid-bus>:<vendor-id>:<product-id>.<num>/country
|
||||||
For BT devices : /sys/class/bluetooth/hci<addr>/<hid-bus>:<vendor-id>:<product-id>.<num>/country
|
What: /sys/class/bluetooth/hci<addr>/<hid-bus>:<vendor-id>:<product-id>.<num>/country
|
||||||
Symlink : /sys/class/hidraw/hidraw<num>/device/country
|
What: /sys/class/hidraw/hidraw<num>/device/country
|
||||||
Date: February 2015
|
Date: February 2015
|
||||||
KernelVersion: 3.19
|
KernelVersion: 3.19
|
||||||
Contact: Olivier Gay <ogay@logitech.com>
|
Contact: Olivier Gay <ogay@logitech.com>
|
||||||
|
|
|
@ -5,7 +5,7 @@ Description: It is possible to switch the dpi setting of the mouse with the
|
||||||
press of a button.
|
press of a button.
|
||||||
When read, this file returns the raw number of the actual dpi
|
When read, this file returns the raw number of the actual dpi
|
||||||
setting reported by the mouse. This number has to be further
|
setting reported by the mouse. This number has to be further
|
||||||
processed to receive the real dpi value.
|
processed to receive the real dpi value:
|
||||||
|
|
||||||
VALUE DPI
|
VALUE DPI
|
||||||
1 800
|
1 800
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
What: /sys/class/tpm/tpmX/ppi/
|
What: /sys/class/tpm/tpmX/ppi/
|
||||||
Date: August 2012
|
Date: August 2012
|
||||||
Kernel Version: 3.6
|
KernelVersion: 3.6
|
||||||
Contact: xiaoyan.zhang@intel.com
|
Contact: xiaoyan.zhang@intel.com
|
||||||
Description:
|
Description:
|
||||||
This folder includes the attributes related with PPI (Physical
|
This folder includes the attributes related with PPI (Physical
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
What: /sys/bus/scsi/drivers/st/debug_flag
|
What: /sys/bus/scsi/drivers/st/debug_flag
|
||||||
Date: October 2015
|
Date: October 2015
|
||||||
Kernel Version: ?.?
|
KernelVersion: ?.?
|
||||||
Contact: shane.seymour@hpe.com
|
Contact: shane.seymour@hpe.com
|
||||||
Description:
|
Description:
|
||||||
This file allows you to turn debug output from the st driver
|
This file allows you to turn debug output from the st driver
|
||||||
|
|
|
@ -1,6 +1,6 @@
|
||||||
What: /sys/bus/hid/devices/<bus>:<vid>:<pid>.<n>/speed
|
What: /sys/bus/hid/devices/<bus>:<vid>:<pid>.<n>/speed
|
||||||
Date: April 2010
|
Date: April 2010
|
||||||
Kernel Version: 2.6.35
|
KernelVersion: 2.6.35
|
||||||
Contact: linux-bluetooth@vger.kernel.org
|
Contact: linux-bluetooth@vger.kernel.org
|
||||||
Description:
|
Description:
|
||||||
The /sys/bus/hid/devices/<bus>:<vid>:<pid>.<n>/speed file
|
The /sys/bus/hid/devices/<bus>:<vid>:<pid>.<n>/speed file
|
||||||
|
|
|
@ -243,3 +243,11 @@ Description:
|
||||||
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
|
- Del: echo '[h/c]!extension' > /sys/fs/f2fs/<disk>/extension_list
|
||||||
- [h] means add/del hot file extension
|
- [h] means add/del hot file extension
|
||||||
- [c] means add/del cold file extension
|
- [c] means add/del cold file extension
|
||||||
|
|
||||||
|
What: /sys/fs/f2fs/<disk>/unusable
|
||||||
|
Date: April 2019
|
||||||
|
Contact: "Daniel Rosenberg" <drosen@google.com>
|
||||||
|
Description:
|
||||||
|
If checkpoint=disable, it displays the number of blocks that are unusable.
|
||||||
|
If checkpoint=enable, it displays the number of blocks that would be unusable
|
||||||
|
if checkpoint=disable were to be set.
|
||||||
|
|
|
@ -2,7 +2,7 @@ What: /sys/kernel/fscaps
|
||||||
Date: February 2011
|
Date: February 2011
|
||||||
KernelVersion: 2.6.38
|
KernelVersion: 2.6.38
|
||||||
Contact: Ludwig Nussel <ludwig.nussel@suse.de>
|
Contact: Ludwig Nussel <ludwig.nussel@suse.de>
|
||||||
Description
|
Description:
|
||||||
Shows whether file system capabilities are honored
|
Shows whether file system capabilities are honored
|
||||||
when executing a binary
|
when executing a binary
|
||||||
|
|
||||||
|
|
|
@ -24,3 +24,12 @@ Description: /sys/kernel/iommu_groups/reserved_regions list IOVA
|
||||||
region is described on a single line: the 1st field is
|
region is described on a single line: the 1st field is
|
||||||
the base IOVA, the second is the end IOVA and the third
|
the base IOVA, the second is the end IOVA and the third
|
||||||
field describes the type of the region.
|
field describes the type of the region.
|
||||||
|
|
||||||
|
What: /sys/kernel/iommu_groups/reserved_regions
|
||||||
|
Date: June 2019
|
||||||
|
KernelVersion: v5.3
|
||||||
|
Contact: Eric Auger <eric.auger@redhat.com>
|
||||||
|
Description: In case an RMRR is used only by graphics or USB devices
|
||||||
|
it is now exposed as "direct-relaxable" instead of "direct".
|
||||||
|
In device assignment use case, for instance, those RMRR
|
||||||
|
are considered to be relaxable and safe.
|
||||||
|
|
|
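The three-field line format described for reserved_regions (base IOVA, end IOVA, region type) lends itself to a small parser. The sketch below assumes the two addresses are printed in hexadecimal, which the text above does not state; the names `ReservedRegion` and `parse_reserved_regions` are hypothetical.

```python
from typing import List, NamedTuple


class ReservedRegion(NamedTuple):
    start: int  # base IOVA
    end: int    # end IOVA
    kind: str   # region type, e.g. "direct" or "direct-relaxable"


def parse_reserved_regions(text: str) -> List[ReservedRegion]:
    """Parse one region per line; lines without exactly three fields are skipped."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, kind = fields
        # Assumption: addresses are hexadecimal, with or without a 0x prefix.
        regions.append(ReservedRegion(int(start, 16), int(end, 16), kind))
    return regions
```

With this in place, a tool could filter for regions whose `kind` is "direct-relaxable" when deciding whether a device is a candidate for assignment.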
@ -11,4 +11,4 @@ Description:
|
||||||
example would be, if User A has shares = 1024 and user
|
example would be, if User A has shares = 1024 and user
|
||||||
B has shares = 2048, User B will get twice the CPU
|
B has shares = 2048, User B will get twice the CPU
|
||||||
bandwidth user A will. For more details refer
|
bandwidth user A will. For more details refer
|
||||||
Documentation/scheduler/sched-design-CFS.txt
|
Documentation/scheduler/sched-design-CFS.rst
|
||||||
|
|
|
@ -4,7 +4,7 @@ KernelVersion: 2.6.24
|
||||||
Contact: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
|
Contact: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
|
||||||
Kexec Mailing List <kexec@lists.infradead.org>
|
Kexec Mailing List <kexec@lists.infradead.org>
|
||||||
Vivek Goyal <vgoyal@redhat.com>
|
Vivek Goyal <vgoyal@redhat.com>
|
||||||
Description
|
Description:
|
||||||
Shows physical address and size of vmcoreinfo ELF note.
|
Shows physical address and size of vmcoreinfo ELF note.
|
||||||
First value contains physical address of note in hex and
|
First value contains physical address of note in hex and
|
||||||
second value contains the size of note in hex. This ELF
|
second value contains the size of note in hex. This ELF
|
||||||
|
|
|
@ -31,7 +31,7 @@ Description:
|
||||||
To control the LED display, use the following :
|
To control the LED display, use the following :
|
||||||
echo 0x0T000DDD > /sys/devices/platform/asus_laptop/
|
echo 0x0T000DDD > /sys/devices/platform/asus_laptop/
|
||||||
where T control the 3 letters display, and DDD the 3 digits display.
|
where T control the 3 letters display, and DDD the 3 digits display.
|
||||||
The DDD table can be found in Documentation/laptops/asus-laptop.txt
|
The DDD table can be found in Documentation/admin-guide/laptops/asus-laptop.rst
|
||||||
|
|
||||||
What: /sys/devices/platform/asus_laptop/bluetooth
|
What: /sys/devices/platform/asus_laptop/bluetooth
|
||||||
Date: January 2007
|
Date: January 2007
|
||||||
|
|
|
@ -36,3 +36,13 @@ KernelVersion: 3.5
|
||||||
Contact: "AceLan Kao" <acelan.kao@canonical.com>
|
Contact: "AceLan Kao" <acelan.kao@canonical.com>
|
||||||
Description:
|
Description:
|
||||||
Resume on lid open. 1 means on, 0 means off.
|
Resume on lid open. 1 means on, 0 means off.
|
||||||
|
|
||||||
|
What: /sys/devices/platform/<platform>/fan_boost_mode
|
||||||
|
Date: Sep 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Contact: "Yurii Pavlovskyi" <yurii.pavlovskyi@gmail.com>
|
||||||
|
Description:
|
||||||
|
Fan boost mode:
|
||||||
|
* 0 - normal,
|
||||||
|
* 1 - overboost,
|
||||||
|
* 2 - silent
|
||||||
|
|
|
@ -1,7 +1,7 @@
|
||||||
What: /sys/devices/platform/<i2c-demux-name>/available_masters
|
What: /sys/devices/platform/<i2c-demux-name>/available_masters
|
||||||
Date: January 2016
|
Date: January 2016
|
||||||
KernelVersion: 4.6
|
KernelVersion: 4.6
|
||||||
Contact: Wolfram Sang <wsa@the-dreams.de>
|
Contact: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
||||||
Description:
|
Description:
|
||||||
Reading the file will give you a list of masters which can be
|
Reading the file will give you a list of masters which can be
|
||||||
selected for a demultiplexed bus. The format is
|
selected for a demultiplexed bus. The format is
|
||||||
|
@ -12,7 +12,7 @@ Description:
|
||||||
What: /sys/devices/platform/<i2c-demux-name>/current_master
|
What: /sys/devices/platform/<i2c-demux-name>/current_master
|
||||||
Date: January 2016
|
Date: January 2016
|
||||||
KernelVersion: 4.6
|
KernelVersion: 4.6
|
||||||
Contact: Wolfram Sang <wsa@the-dreams.de>
|
Contact: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
||||||
Description:
|
Description:
|
||||||
This file selects/shows the active I2C master for a demultiplexed
|
This file selects/shows the active I2C master for a demultiplexed
|
||||||
bus. It uses the <index> value from the file 'available_masters'.
|
bus. It uses the <index> value from the file 'available_masters'.
|
||||||
|
|
40
Documentation/ABI/testing/sysfs-platform-wilco-ec
Normal file
|
@ -0,0 +1,40 @@
|
||||||
|
What: /sys/bus/platform/devices/GOOG000C\:00/boot_on_ac
|
||||||
|
Date: April 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Boot on AC is a policy which makes the device boot from S5
|
||||||
|
when AC power is connected. This is useful for users who
|
||||||
|
want to run their device headless or with a dock.
|
||||||
|
|
||||||
|
Input should be parseable by kstrtou8() to 0 or 1.
|
||||||
|
|
||||||
|
What: /sys/bus/platform/devices/GOOG000C\:00/build_date
|
||||||
|
Date: May 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Display Wilco Embedded Controller firmware build date.
|
||||||
|
Output will be a MM/DD/YY string.
|
||||||
|
|
||||||
|
What: /sys/bus/platform/devices/GOOG000C\:00/build_revision
|
||||||
|
Date: May 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Display Wilco Embedded Controller build revision.
|
||||||
|
Output will be a version string similar to the example below:
|
||||||
|
d2592cae0
|
||||||
|
|
||||||
|
What: /sys/bus/platform/devices/GOOG000C\:00/model_number
|
||||||
|
Date: May 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Display Wilco Embedded Controller model number.
|
||||||
|
Output will be a version string similar to the example below:
|
||||||
|
08B6
|
||||||
|
|
||||||
|
What: /sys/bus/platform/devices/GOOG000C\:00/version
|
||||||
|
Date: May 2019
|
||||||
|
KernelVersion: 5.3
|
||||||
|
Description:
|
||||||
|
Display Wilco Embedded Controller firmware version.
|
||||||
|
The format of the string is x.y.z, where x is major, y is minor
|
||||||
|
and z is the build number. For example: 95.00.06
|
|
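The x.y.z format described for the version attribute can be split into its components as sketched below. The function name `parse_ec_version` is hypothetical, and the sketch assumes all three fields are decimal numbers, as in the 95.00.06 example.

```python
from typing import Tuple


def parse_ec_version(text: str) -> Tuple[int, int, int]:
    """Split an 'x.y.z' firmware version string into (major, minor, build)."""
    parts = text.strip().split(".")
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {text!r}")
    major, minor, build = (int(part) for part in parts)
    return major, minor, build
```

Returning integers makes versions directly comparable as tuples, which a string comparison of zero-padded fields would not guarantee.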
@ -300,4 +300,4 @@ Description:
|
||||||
attempt.
|
attempt.
|
||||||
|
|
||||||
Using this sysfs file will override any values that were
|
Using this sysfs file will override any values that were
|
||||||
set using the kernel command line for disk offset.
|
set using the kernel command line for disk offset.
|
||||||
|
|
|
@ -212,7 +212,7 @@ The standard 64-bit addressing device would do something like this::
|
||||||
|
|
||||||
If the device only supports 32-bit addressing for descriptors in the
|
If the device only supports 32-bit addressing for descriptors in the
|
||||||
coherent allocations, but supports full 64-bits for streaming mappings
|
coherent allocations, but supports full 64-bits for streaming mappings
|
||||||
it would look like this:
|
it would look like this::
|
||||||
|
|
||||||
if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
|
if (dma_set_mask(dev, DMA_BIT_MASK(64))) {
|
||||||
dev_warn(dev, "mydev: No suitable DMA available\n");
|
dev_warn(dev, "mydev: No suitable DMA available\n");
|
||||||
|
|
|
@ -198,7 +198,7 @@ call to set the mask to the value returned.
|
||||||
::
|
::
|
||||||
|
|
||||||
size_t
|
size_t
|
||||||
dma_direct_max_mapping_size(struct device *dev);
|
dma_max_mapping_size(struct device *dev);
|
||||||
|
|
||||||
Returns the maximum size of a mapping for the device. The size parameter
|
Returns the maximum size of a mapping for the device. The size parameter
|
||||||
of the mapping functions like dma_map_single(), dma_map_page() and
|
of the mapping functions like dma_map_single(), dma_map_page() and
|
||||||
|
|
|
@ -1,49 +0,0 @@
|
||||||
In the good old days when graphics parameters were configured explicitly
|
|
||||||
in a file called xorg.conf, even broken hardware could be managed.
|
|
||||||
|
|
||||||
Today, with the advent of Kernel Mode Setting, a graphics board is
|
|
||||||
either correctly working because all components follow the standards -
|
|
||||||
or the computer is unusable, because the screen remains dark after
|
|
||||||
booting or it displays the wrong area. Cases when this happens are:
|
|
||||||
- The graphics board does not recognize the monitor.
|
|
||||||
- The graphics board is unable to detect any EDID data.
|
|
||||||
- The graphics board incorrectly forwards EDID data to the driver.
|
|
||||||
- The monitor sends no or bogus EDID data.
|
|
||||||
- A KVM sends its own EDID data instead of querying the connected monitor.
|
|
||||||
Adding the kernel parameter "nomodeset" helps in most cases, but causes
|
|
||||||
restrictions later on.
|
|
||||||
|
|
||||||
As a remedy for such situations, the kernel configuration item
|
|
||||||
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
|
|
||||||
individually prepared or corrected EDID data set in the /lib/firmware
|
|
||||||
directory from where it is loaded via the firmware interface. The code
|
|
||||||
(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
|
|
||||||
commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
|
|
||||||
1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
|
|
||||||
not contain code to create these data. In order to elucidate the origin
|
|
||||||
of the built-in binary EDID blobs and to facilitate the creation of
|
|
||||||
individual data for a specific misbehaving monitor, commented sources
|
|
||||||
and a Makefile environment are given here.
|
|
||||||
|
|
||||||
To create binary EDID and C source code files from the existing data
|
|
||||||
material, simply type "make".
|
|
||||||
|
|
||||||
If you want to create your own EDID file, copy the file 1024x768.S,
|
|
||||||
replace the settings with your own data and add a new target to the
|
|
||||||
Makefile. Please note that the EDID data structure expects the timing
|
|
||||||
values in a different way as compared to the standard X11 format.
|
|
||||||
|
|
||||||
X11:
|
|
||||||
HTimings: hdisp hsyncstart hsyncend htotal
|
|
||||||
VTimings: vdisp vsyncstart vsyncend vtotal
|
|
||||||
|
|
||||||
EDID:
|
|
||||||
#define XPIX hdisp
|
|
||||||
#define XBLANK htotal-hdisp
|
|
||||||
#define XOFFSET hsyncstart-hdisp
|
|
||||||
#define XPULSE hsyncend-hsyncstart
|
|
||||||
|
|
||||||
#define YPIX vdisp
|
|
||||||
#define YBLANK vtotal-vdisp
|
|
||||||
#define YOFFSET vsyncstart-vdisp
|
|
||||||
#define YPULSE vsyncend-vsyncstart
|
|
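The X11-to-EDID mapping listed in the removed text above is plain subtraction, and can be sketched as a helper that takes the eight X11 timing values and returns the EDID-style values. The function name `x11_to_edid` is hypothetical; the formulas are exactly the #define lines shown above.

```python
def x11_to_edid(hdisp, hsyncstart, hsyncend, htotal,
                vdisp, vsyncstart, vsyncend, vtotal):
    """Convert X11 HTimings/VTimings into the values used by the EDID sources."""
    return {
        "XPIX": hdisp,
        "XBLANK": htotal - hdisp,
        "XOFFSET": hsyncstart - hdisp,
        "XPULSE": hsyncend - hsyncstart,
        "YPIX": vdisp,
        "YBLANK": vtotal - vdisp,
        "YOFFSET": vsyncstart - vdisp,
        "YPULSE": vsyncend - vsyncstart,
    }
```

For the standard 1024x768 at 60 Hz timing (1024 1048 1184 1344 horizontally, 768 771 777 806 vertically), this yields XBLANK 320, XOFFSET 24, XPULSE 136, YBLANK 38, YOFFSET 3 and YPULSE 6.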
13
Documentation/Kconfig
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
config WARN_MISSING_DOCUMENTS
|
||||||
|
|
||||||
|
bool "Warn if there's a missing documentation file"
|
||||||
|
depends on COMPILE_TEST
|
||||||
|
help
|
||||||
|
It is not uncommon that a document gets renamed.
|
||||||
|
This option makes the Kernel check for missing dependencies,
|
||||||
|
warning when something is missing. Works only if the Kernel
|
||||||
|
is built from a git tree.
|
||||||
|
|
||||||
|
If unsure, select 'N'.
|
||||||
|
|
||||||
|
|
|
@ -4,6 +4,11 @@
|
||||||
|
|
||||||
subdir-y := devicetree/bindings/
|
subdir-y := devicetree/bindings/
|
||||||
|
|
||||||
|
# Check for broken documentation file references
|
||||||
|
ifeq ($(CONFIG_WARN_MISSING_DOCUMENTS),y)
|
||||||
|
$(shell $(srctree)/scripts/documentation-file-ref-check --warn)
|
||||||
|
endif
|
||||||
|
|
||||||
# You can set these variables from the command line.
|
# You can set these variables from the command line.
|
||||||
SPHINXBUILD = sphinx-build
|
SPHINXBUILD = sphinx-build
|
||||||
SPHINXOPTS =
|
SPHINXOPTS =
|
||||||
|
@ -23,11 +28,13 @@ ifeq ($(HAVE_SPHINX),0)
|
||||||
.DEFAULT:
|
.DEFAULT:
|
||||||
$(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
|
$(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
|
||||||
@echo
|
@echo
|
||||||
@./scripts/sphinx-pre-install
|
@$(srctree)/scripts/sphinx-pre-install
|
||||||
@echo " SKIP Sphinx $@ target."
|
@echo " SKIP Sphinx $@ target."
|
||||||
|
|
||||||
else # HAVE_SPHINX
|
else # HAVE_SPHINX
|
||||||
|
|
||||||
|
export SPHINXOPTS = $(shell perl -e 'open IN,"sphinx-build --version 2>&1 |"; while (<IN>) { if (m/([\d\.]+)/) { print "-jauto" if ($$1 >= "1.7") } ;} close IN')
|
||||||
|
|
||||||
# User-friendly check for pdflatex and latexmk
|
# User-friendly check for pdflatex and latexmk
|
||||||
HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi)
|
HAVE_PDFLATEX := $(shell if which $(PDFLATEX) >/dev/null 2>&1; then echo 1; else echo 0; fi)
|
||||||
HAVE_LATEXMK := $(shell if which latexmk >/dev/null 2>&1; then echo 1; else echo 0; fi)
|
HAVE_LATEXMK := $(shell if which latexmk >/dev/null 2>&1; then echo 1; else echo 0; fi)
|
||||||
|
@ -70,12 +77,14 @@ quiet_cmd_sphinx = SPHINX $@ --> file://$(abspath $(BUILDDIR)/$3/$4)
|
||||||
$(abspath $(BUILDDIR)/$3/$4)
|
$(abspath $(BUILDDIR)/$3/$4)
|
||||||
|
|
||||||
htmldocs:
|
htmldocs:
|
||||||
|
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
|
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,html,$(var),,$(var)))
|
||||||
|
|
||||||
linkcheckdocs:
|
linkcheckdocs:
|
||||||
@$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var)))
|
@$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,linkcheck,$(var),,$(var)))
|
||||||
|
|
||||||
latexdocs:
|
latexdocs:
|
||||||
|
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var)))
|
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,latex,$(var),latex,$(var)))
|
||||||
|
|
||||||
ifeq ($(HAVE_PDFLATEX),0)
|
ifeq ($(HAVE_PDFLATEX),0)
|
||||||
|
@ -87,14 +96,17 @@ pdfdocs:
|
||||||
else # HAVE_PDFLATEX
|
else # HAVE_PDFLATEX
|
||||||
|
|
||||||
pdfdocs: latexdocs
|
pdfdocs: latexdocs
|
||||||
|
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||||
$(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;)
|
$(foreach var,$(SPHINXDIRS), $(MAKE) PDFLATEX="$(PDFLATEX)" LATEXOPTS="$(LATEXOPTS)" -C $(BUILDDIR)/$(var)/latex || exit;)
|
||||||
|
|
||||||
endif # HAVE_PDFLATEX
|
endif # HAVE_PDFLATEX
|
||||||
|
|
||||||
epubdocs:
|
epubdocs:
|
||||||
|
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var)))
|
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,epub,$(var),epub,$(var)))
|
||||||
|
|
||||||
xmldocs:
|
xmldocs:
|
||||||
|
@$(srctree)/scripts/sphinx-pre-install --version-check
|
||||||
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var)))
|
@+$(foreach var,$(SPHINXDIRS),$(call loop_cmd,sphinx,xml,$(var),xml,$(var)))
|
||||||
|
|
||||||
endif # HAVE_SPHINX
|
endif # HAVE_SPHINX
|
||||||
|
|
|
@ -1,270 +0,0 @@
|
||||||
The MSI Driver Guide HOWTO
|
|
||||||
Tom L Nguyen tom.l.nguyen@intel.com
|
|
||||||
10/03/2003
|
|
||||||
Revised Feb 12, 2004 by Martine Silbermann
|
|
||||||
email: Martine.Silbermann@hp.com
|
|
||||||
Revised Jun 25, 2004 by Tom L Nguyen
|
|
||||||
Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com>
|
|
||||||
Copyright 2003, 2008 Intel Corporation
|
|
||||||
|
|
||||||
1. About this guide
|
|
||||||
|
|
||||||
This guide describes the basics of Message Signaled Interrupts (MSIs),
|
|
||||||
the advantages of using MSI over traditional interrupt mechanisms, how
|
|
||||||
to change your driver to use MSI or MSI-X and some basic diagnostics to
|
|
||||||
try if a device doesn't support MSIs.
|
|
||||||
|
|
||||||
|
|
||||||
2. What are MSIs?
|
|
||||||
|
|
||||||
A Message Signaled Interrupt is a write from the device to a special
|
|
||||||
address which causes an interrupt to be received by the CPU.
|
|
||||||
|
|
||||||
The MSI capability was first specified in PCI 2.2 and was later enhanced
|
|
||||||
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
|
|
||||||
capability was also introduced with PCI 3.0. It supports more interrupts
|
|
||||||
per device than MSI and allows interrupts to be independently configured.
|
|
||||||
|
|
||||||
Devices may support both MSI and MSI-X, but only one can be enabled at
|
|
||||||
a time.
|
|
||||||
|
|
||||||
|
|
||||||
3. Why use MSIs?
|
|
||||||
|
|
||||||
There are three reasons why using MSIs can give an advantage over
|
|
||||||
traditional pin-based interrupts.
|
|
||||||
|
|
||||||
Pin-based PCI interrupts are often shared amongst several devices.
|
|
||||||
To support this, the kernel must call each interrupt handler associated
|
|
||||||
with an interrupt, which leads to reduced performance for the system as
|
|
||||||
a whole. MSIs are never shared, so this problem cannot arise.
|
|
||||||
|
|
||||||
When a device writes data to memory, then raises a pin-based interrupt,
|
|
||||||
it is possible that the interrupt may arrive before all the data has
|
|
||||||
arrived in memory (this becomes more likely with devices behind PCI-PCI
|
|
||||||
bridges). In order to ensure that all the data has arrived in memory,
|
|
||||||
the interrupt handler must read a register on the device which raised
|
|
||||||
the interrupt. PCI transaction ordering rules require that all the data
|
|
||||||
arrive in memory before the value may be returned from the register.
|
|
||||||
Using MSIs avoids this problem as the interrupt-generating write cannot
|
|
||||||
pass the data writes, so by the time the interrupt is raised, the driver
|
|
||||||
knows that all the data has arrived in memory.
|
|
||||||
|
|
||||||
PCI devices can only support a single pin-based interrupt per function.
|
|
||||||
Often drivers have to query the device to find out what event has
|
|
||||||
occurred, slowing down interrupt handling for the common case. With
|
|
||||||
MSIs, a device can support more interrupts, allowing each interrupt
|
|
||||||
to be specialised to a different purpose. One possible design gives
|
|
||||||
infrequent conditions (such as errors) their own interrupt which allows
|
|
||||||
the driver to handle the normal interrupt handling path more efficiently.
|
|
||||||
Other possible designs include giving one interrupt to each packet queue
|
|
||||||
in a network card or each port in a storage controller.
|
|
||||||
|
|
||||||
|
|
||||||
4. How to use MSIs
|
|
||||||
|
|
||||||
PCI devices are initialised to use pin-based interrupts. The device
|
|
||||||
driver has to set up the device to use MSI or MSI-X. Not all machines
|
|
||||||
support MSIs correctly, and for those machines, the APIs described below
|
|
||||||
will simply fail and the device will continue to use pin-based interrupts.
|
|
||||||
|
|
||||||
4.1 Include kernel support for MSIs
|
|
||||||
|
|
||||||
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
|
|
||||||
option enabled. This option is only available on some architectures,
|
|
||||||
and it may depend on some other options also being set. For example,
|
|
||||||
on x86, you must also enable X86_UP_APIC or SMP in order to see the
|
|
||||||
CONFIG_PCI_MSI option.
|
|
||||||
|
|
||||||
4.2 Using MSI
|
|
||||||
|
|
||||||
Most of the hard work is done for the driver in the PCI layer. The driver
|
|
||||||
simply has to request that the PCI layer set up the MSI capability for this
|
|
||||||
device.
|
|
||||||
|
|
||||||
To automatically use MSI or MSI-X interrupt vectors, use the following
|
|
||||||
function:
|
|
||||||
|
|
||||||
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
|
|
||||||
unsigned int max_vecs, unsigned int flags);
|
|
||||||
|
|
||||||
which allocates up to max_vecs interrupt vectors for a PCI device. It
|
|
||||||
returns the number of vectors allocated or a negative error. If the device
|
|
||||||
has a requirement for a minimum number of vectors, the driver can pass a
|
|
||||||
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
|
|
||||||
if it can't meet the minimum number of vectors.
|
|
||||||
|
|
||||||
The flags argument is used to specify which type of interrupt can be used
|
|
||||||
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
|
|
||||||
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
|
|
||||||
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
|
|
||||||
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
|
|
||||||
|
|
||||||
To get the Linux IRQ numbers passed to request_irq() and free_irq() and the
|
|
||||||
vectors, use the following function:
|
|
||||||
|
|
||||||
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
|
|
||||||
|
|
||||||
Any allocated resources should be freed before removing the device using
|
|
||||||
the following function:
|
|
||||||
|
|
||||||
void pci_free_irq_vectors(struct pci_dev *dev);
|
|
||||||
|
|
||||||
If a device supports both MSI-X and MSI capabilities, this API will use the
|
|
||||||
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
|
|
||||||
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
|
|
||||||
a maximum of 32 interrupts (and must be a power of two). In addition, the
|
|
||||||
MSI interrupt vectors must be allocated consecutively, so the system might
|
|
||||||
not be able to allocate as many vectors for MSI as it could for MSI-X. On
|
|
||||||
some platforms, MSI interrupts must all be targeted at the same set of CPUs
|
|
||||||
whereas MSI-X interrupts can all be targeted at different CPUs.
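The vector-count limits described above can be written down as two small predicates. This is a user-space sketch of the stated constraints only, not code from the PCI core; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Limits quoted in the text above; the macro names are hypothetical. */
#define MSI_MAX_VECTORS  32u
#define MSIX_MAX_VECTORS 2048u

/* MSI: the count must be a power of two and at most 32. */
static bool msi_count_is_valid(unsigned int n)
{
	return n >= 1 && n <= MSI_MAX_VECTORS && (n & (n - 1)) == 0;
}

/* MSI-X: any count between 1 and 2048 is allowed. */
static bool msix_count_is_valid(unsigned int n)
{
	return n >= 1 && n <= MSIX_MAX_VECTORS;
}

/* Small self-check of the boundary cases. */
static void msi_limits_selfcheck(void)
{
	assert(msi_count_is_valid(MSI_MAX_VECTORS));
	assert(!msi_count_is_valid(MSI_MAX_VECTORS * 2));
	assert(msix_count_is_valid(MSIX_MAX_VECTORS));
	assert(!msix_count_is_valid(MSIX_MAX_VECTORS + 1));
}
```

For example, a request for 3 vectors is a valid MSI-X count but not a valid MSI count. A driver does not need such checks itself, since pci_alloc_irq_vectors() reports how many vectors were actually allocated.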
|
|
||||||
|
|
||||||
If a device supports neither MSI-X nor MSI, it will fall back to a single
|
|
||||||
legacy IRQ vector.
|
|
||||||
|
|
||||||
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
|
||||||
as possible, likely up to the limit supported by the device. If nvec is
|
|
||||||
larger than the number supported by the device it will automatically be
|
|
||||||
capped to the supported limit, so there is no need to query the number of
|
|
||||||
vectors supported beforehand:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
If a driver is unable or unwilling to deal with a variable number of MSI
|
|
||||||
interrupts it can request a particular number of interrupts by passing that
|
|
||||||
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
|
||||||
'max_vecs' parameters:
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
The best-known example of the request type described above is enabling
|
|
||||||
the single MSI mode for a device. It could be done by passing two 1s as
|
|
||||||
'min_vecs' and 'max_vecs':
|
|
||||||
|
|
||||||
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
|
||||||
if (ret < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
Some devices might not support using legacy line interrupts, in which case
|
|
||||||
the driver can specify that only MSI or MSI-X is acceptable:
|
|
||||||
|
|
||||||
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
|
||||||
if (nvec < 0)
|
|
||||||
goto out_err;
|
|
||||||
|
|
||||||
4.3 Legacy APIs
|
|
||||||
|
|
||||||
The following old APIs to enable and disable MSI or MSI-X interrupts should
|
|
||||||
not be used in new code:
|
|
||||||
|
|
||||||
pci_enable_msi() /* deprecated */
|
|
||||||
pci_disable_msi() /* deprecated */
|
|
||||||
pci_enable_msix_range() /* deprecated */
|
|
||||||
pci_enable_msix_exact() /* deprecated */
|
|
||||||
pci_disable_msix() /* deprecated */
|
|
||||||
|
|
||||||
Additionally there are APIs to provide the number of supported MSI or MSI-X
|
|
||||||
vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these
|
|
||||||
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
|
|
||||||
number of vectors. If you have a legitimate special use case for the count
|
|
||||||
of vectors we might have to revisit that decision and add a
|
|
||||||
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
|
|
||||||
|
|
||||||
4.4 Considerations when using MSIs
|
|
||||||
|
|
||||||
4.4.1 Spinlocks
|
|
||||||
|
|
||||||
Most device drivers have a per-device spinlock which is taken in the
|
|
||||||
interrupt handler. With pin-based interrupts or a single MSI, it is not
|
|
||||||
necessary to disable interrupts (Linux guarantees the same interrupt will
|
|
||||||
not be re-entered). If a device uses multiple interrupts, the driver
|
|
||||||
must disable interrupts while the lock is held. If the device sends
|
|
||||||
a different interrupt, the driver will deadlock trying to recursively
|
|
||||||
acquire the spinlock. Such deadlocks can be avoided by using
|
|
||||||
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
|
|
||||||
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
|
|
||||||
|
|
||||||
4.5 How to tell whether MSI/MSI-X is enabled on a device
|
|
||||||
|
|
||||||
Using 'lspci -v' (as root) may show some devices with "MSI", "Message
|
|
||||||
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
|
|
||||||
has an 'Enable' flag which is followed by either "+" (enabled)
|
|
||||||
or "-" (disabled).
|
|
||||||
|
|
||||||
|
|
||||||
5. MSI quirks
|
|
||||||
|
|
||||||
Several PCI chipsets or devices are known not to support MSIs.
|
|
||||||
The PCI stack provides three ways to disable MSIs:
|
|
||||||
|
|
||||||
1. globally
|
|
||||||
2. on all devices behind a specific bridge
|
|
||||||
3. on a single device
|
|
||||||
|
|
||||||
5.1. Disabling MSIs globally
|
|
||||||
|
|
||||||
Some host chipsets simply don't support MSIs properly. If we're
|
|
||||||
lucky, the manufacturer knows this and has indicated it in the ACPI
|
|
||||||
FADT table. In this case, Linux automatically disables MSIs.
|
|
||||||
Some boards don't include this information in the table and so we have
|
|
||||||
to detect them ourselves. The complete list of these is found near the
|
|
||||||
quirk_disable_all_msi() function in drivers/pci/quirks.c.
|
|
||||||
|
|
||||||
If you have a board which has problems with MSIs, you can pass pci=nomsi
|
|
||||||
on the kernel command line to disable MSIs on all devices. It would be
|
|
||||||
in your best interests to report the problem to linux-pci@vger.kernel.org
|
|
||||||
including a full 'lspci -v' so we can add the quirks to the kernel.
|
|
||||||
|
|
||||||
5.2. Disabling MSIs below a bridge
|
|
||||||
|
|
||||||
Some PCI bridges are not able to route MSIs between busses properly.
|
|
||||||
In this case, MSIs must be disabled on all devices behind the bridge.
|
|
||||||
|
|
||||||
Some bridges allow you to enable MSIs by changing some bits in their
|
|
||||||
PCI configuration space (especially the Hypertransport chipsets such
|
|
||||||
as the nVidia nForce and Serverworks HT2000). As with host chipsets,
|
|
||||||
Linux mostly knows about them and automatically enables MSIs if it can.
|
|
||||||
If you have a bridge unknown to Linux, you can enable
|
|
||||||
MSIs in configuration space using whatever method you know works, then
|
|
||||||
enable MSIs on that bridge by doing:
|
|
||||||
|
|
||||||
echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
|
|
||||||
|
|
||||||
where $bridge is the PCI address of the bridge you've enabled (e.g.
|
|
||||||
0000:00:0e.0).
|
|
||||||
|
|
||||||
To disable MSIs, echo 0 instead of 1. Changing this value should be
|
|
||||||
done with caution as it could break interrupt handling for all devices
|
|
||||||
below this bridge.
|
|
||||||
|
|
||||||
Again, please notify linux-pci@vger.kernel.org of any bridges that need
|
|
||||||
special handling.
|
|
||||||
|
|
||||||
5.3. Disabling MSIs on a single device
|
|
||||||
|
|
||||||
Some devices are known to have faulty MSI implementations. Usually this
|
|
||||||
is handled in the individual device driver, but occasionally it's necessary
|
|
||||||
to handle this with a quirk. Some drivers have an option to disable use
|
|
||||||
of MSI. While this is a convenient workaround for the driver author,
|
|
||||||
it is not good practice, and should not be emulated.
|
|
||||||
|
|
||||||
5.4. Finding why MSIs are disabled on a device
|
|
||||||
|
|
||||||
From the above three sections, you can see that there are many reasons
|
|
||||||
why MSIs may not be enabled for a given device. Your first step should
|
|
||||||
be to examine your dmesg carefully to determine whether MSIs are enabled
|
|
||||||
for your machine. You should also check your .config to be sure you
|
|
||||||
have enabled CONFIG_PCI_MSI.
|
|
||||||
|
|
||||||
Then, 'lspci -t' gives the list of bridges above a device. Reading
|
|
||||||
/sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1)
|
|
||||||
or disabled (0). If 0 is found in any of the msi_bus files belonging
|
|
||||||
to bridges between the PCI root and the device, MSIs are disabled.
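The rule above (a single 0 anywhere on the path disables MSIs) can be sketched as follows. This is an illustration only: the function name `msi_enabled_on_path()` is hypothetical, and the caller is assumed to have already read the msi_bus value of each bridge between the PCI root and the device into an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * msi_bus[] holds the values read from the msi_bus files of the
 * bridges between the PCI root and the device (1 = enabled,
 * 0 = disabled). MSIs are usable only if no bridge reports 0.
 */
static bool msi_enabled_on_path(const int *msi_bus, size_t n)
{
	size_t i;

	assert(msi_bus != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (msi_bus[i] == 0)
			return false;
	}
	return true;
}
```

A path with values {1, 0, 1} is reported as disabled, while {1, 1, 1} is reported as enabled. The device driver and the kernel configuration still have to be checked separately.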
|
|
||||||
|
|
||||||
It is also worth checking the device driver to see whether it supports MSIs.
|
|
||||||
For example, it may contain calls to pci_alloc_irq_vectors() with the
|
|
||||||
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
|
|
|
@ -1,198 +0,0 @@
|
||||||
The PCI Express Port Bus Driver Guide HOWTO
|
|
||||||
Tom L Nguyen tom.l.nguyen@intel.com
|
|
||||||
11/03/2004
|
|
||||||
|
|
||||||
1. About this guide
|
|
||||||
|
|
||||||
This guide describes the basics of the PCI Express Port Bus driver
|
|
||||||
and provides information on how to enable the service drivers to
|
|
||||||
register/unregister with the PCI Express Port Bus Driver.
|
|
||||||
|
|
||||||
2. Copyright 2004 Intel Corporation
|
|
||||||
|
|
||||||
3. What is the PCI Express Port Bus Driver
|
|
||||||
|
|
||||||
A PCI Express Port is a logical PCI-PCI Bridge structure. There
|
|
||||||
are two types of PCI Express Port: the Root Port and the Switch
|
|
||||||
Port. The Root Port originates a PCI Express link from a PCI Express
|
|
||||||
Root Complex and the Switch Port connects PCI Express links to
|
|
||||||
internal logical PCI buses. The Switch Port, which has its secondary
|
|
||||||
bus representing the switch's internal routing logic, is called the
|
|
||||||
switch's Upstream Port. The switch's Downstream Port is bridging from
|
|
||||||
switch's internal routing bus to a bus representing the downstream
|
|
||||||
PCI Express link from the PCI Express Switch.
|
|
||||||
|
|
||||||
A PCI Express Port can provide up to four distinct functions,
|
|
||||||
referred to in this document as services, depending on its port type.
|
|
||||||
PCI Express Port's services include native hotplug support (HP),
|
|
||||||
power management event support (PME), advanced error reporting
|
|
||||||
support (AER), and virtual channel support (VC). These services may
|
|
||||||
be handled by a single complex driver or be individually distributed
|
|
||||||
and handled by corresponding service drivers.
|
|
||||||
|
|
||||||
4. Why use the PCI Express Port Bus Driver?
|
|
||||||
|
|
||||||
In existing Linux kernels, the Linux Device Driver Model allows a
|
|
||||||
physical device to be handled by only a single driver. The PCI
|
|
||||||
Express Port is a PCI-PCI Bridge device with multiple distinct
|
|
||||||
services. To maintain a clean and simple solution each service
|
|
||||||
may have its own software service driver. In this case several
|
|
||||||
service drivers will compete for a single PCI-PCI Bridge device.
|
|
||||||
For example, if the PCI Express Root Port native hotplug service
|
|
||||||
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
|
|
||||||
kernel therefore does not load other service drivers for that Root
|
|
||||||
Port. In other words, it is impossible to have multiple service
|
|
||||||
drivers load and run on a PCI-PCI Bridge device simultaneously
|
|
||||||
using the current driver model.
|
|
||||||
|
|
||||||
To enable multiple service drivers running simultaneously requires
|
|
||||||
having a PCI Express Port Bus driver, which manages all populated
|
|
||||||
PCI Express Ports and distributes all provided service requests
|
|
||||||
to the corresponding service drivers as required. Some key
|
|
||||||
advantages of using the PCI Express Port Bus driver are listed below:
|
|
||||||
|
|
||||||
- Allow multiple service drivers to run simultaneously on
|
|
||||||
a PCI-PCI Bridge Port device.
|
|
||||||
|
|
||||||
- Allow service drivers to be implemented in an independent
|
|
||||||
staged approach.
|
|
||||||
|
|
||||||
- Allow one service driver to run on multiple PCI-PCI Bridge
|
|
||||||
Port devices.
|
|
||||||
|
|
||||||
- Manage and distribute resources of a PCI-PCI Bridge Port
|
|
||||||
device to requested service drivers.
|
|
||||||
|
|
||||||
5. Configuring the PCI Express Port Bus Driver vs. Service Drivers
|
|
||||||
|
|
||||||
5.1 Including the PCI Express Port Bus Driver Support into the Kernel
|
|
||||||
|
|
||||||
Including the PCI Express Port Bus driver depends on whether the PCI
|
|
||||||
Express support is included in the kernel config. The kernel will
|
|
||||||
automatically include the PCI Express Port Bus driver as a kernel
|
|
||||||
driver when the PCI Express support is enabled in the kernel.
|
|
||||||
|
|
||||||
5.2 Enabling Service Driver Support
|
|
||||||
|
|
||||||
PCI device drivers are implemented based on Linux Device Driver Model.
|
|
||||||
All service drivers are PCI device drivers. As discussed above, it is
|
|
||||||
impossible to load any service driver once the kernel has loaded the
|
|
||||||
PCI Express Port Bus Driver. To meet the PCI Express Port Bus Driver
|
|
||||||
Model requires some minimal changes on existing service drivers that
|
|
||||||
impose no impact on the functionality of existing service drivers.
|
|
||||||
|
|
||||||
A service driver is required to use the two APIs shown below to
|
|
||||||
register its service with the PCI Express Port Bus driver (see
|
|
||||||
section 5.2.1 & 5.2.2). It is important that a service driver
|
|
||||||
initializes the pcie_port_service_driver data structure, included in
|
|
||||||
header file /include/linux/pcieport_if.h, before calling these APIs.
|
|
||||||
Failure to do so will result in an identity mismatch, which prevents
|
|
||||||
the PCI Express Port Bus driver from loading a service driver.
|
|
||||||
|
|
||||||
5.2.1 pcie_port_service_register
|
|
||||||
|
|
||||||
int pcie_port_service_register(struct pcie_port_service_driver *new)
|
|
||||||
|
|
||||||
This API replaces the Linux Driver Model's pci_register_driver API. A
|
|
||||||
service driver should always call pcie_port_service_register at
|
|
||||||
module init. Note that after the service driver is loaded, calls
|
|
||||||
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
|
|
||||||
necessary since these calls are executed by the PCI Port Bus driver.
|
|
||||||
|
|
||||||
5.2.2 pcie_port_service_unregister
|
|
||||||
|
|
||||||
void pcie_port_service_unregister(struct pcie_port_service_driver *new)
|
|
||||||
|
|
||||||
pcie_port_service_unregister replaces the Linux Driver Model's
|
|
||||||
pci_unregister_driver. It is always called by the service driver when a
|
|
||||||
module exits.
|
|
||||||
|
|
||||||
5.2.3 Sample Code
|
|
||||||
|
|
||||||
Below is sample service driver code to initialize the port service
|
|
||||||
driver data structure.
|
|
||||||
|
|
||||||
static struct pcie_port_service_id service_id[] = { {
|
|
||||||
.vendor = PCI_ANY_ID,
|
|
||||||
.device = PCI_ANY_ID,
|
|
||||||
.port_type = PCIE_RC_PORT,
|
|
||||||
.service_type = PCIE_PORT_SERVICE_AER,
|
|
||||||
}, { /* end: all zeroes */ }
|
|
||||||
};
|
|
||||||
|
|
||||||
static struct pcie_port_service_driver root_aerdrv = {
|
|
||||||
.name = (char *)device_name,
|
|
||||||
.id_table = &service_id[0],
|
|
||||||
|
|
||||||
.probe = aerdrv_load,
|
|
||||||
.remove = aerdrv_unload,
|
|
||||||
|
|
||||||
.suspend = aerdrv_suspend,
|
|
||||||
.resume = aerdrv_resume,
|
|
||||||
};
|
|
||||||
|
|
||||||
Below is sample code for registering/unregistering a service
|
|
||||||
driver.
|
|
||||||
|
|
||||||
static int __init aerdrv_service_init(void)
|
|
||||||
{
|
|
||||||
int retval = 0;
|
|
||||||
|
|
||||||
retval = pcie_port_service_register(&root_aerdrv);
|
|
||||||
if (!retval) {
|
|
||||||
/*
|
|
||||||
* FIX ME
|
|
||||||
*/
|
|
||||||
}
|
|
||||||
return retval;
|
|
||||||
}
|
|
||||||
|
|
||||||
static void __exit aerdrv_service_exit(void)
|
|
||||||
{
|
|
||||||
pcie_port_service_unregister(&root_aerdrv);
|
|
||||||
}
|
|
||||||
|
|
||||||
module_init(aerdrv_service_init);
|
|
||||||
module_exit(aerdrv_service_exit);
|
|
||||||
|
|
||||||
6. Possible Resource Conflicts
|
|
||||||
|
|
||||||
Since all service drivers of a PCI-PCI Bridge Port device are
|
|
||||||
allowed to run simultaneously, the following are a few possible resource
|
|
||||||
conflicts with proposed solutions.
|
|
||||||
|
|
||||||
6.1 MSI and MSI-X Vector Resource
|
|
||||||
|
|
||||||
Once MSI or MSI-X interrupts are enabled on a device, it stays in this
|
|
||||||
mode until they are disabled again. Since service drivers of the same
|
|
||||||
PCI-PCI Bridge port share the same physical device, if an individual
|
|
||||||
service driver enables or disables MSI/MSI-X mode it may result in
|
|
||||||
unpredictable behavior.
|
|
||||||
|
|
||||||
To avoid this situation, a service driver is not permitted to
|
|
||||||
switch interrupt mode on its device. The PCI Express Port Bus driver
|
|
||||||
is responsible for determining the interrupt mode and this should be
|
|
||||||
transparent to service drivers. Service drivers need to know only
|
|
||||||
the vector IRQ assigned to the field irq of struct pcie_device, which
|
|
||||||
is passed in when the PCI Express Port Bus driver probes each service
|
|
||||||
driver. Service drivers should use (struct pcie_device*)dev->irq to
|
|
||||||
call request_irq/free_irq. In addition, the interrupt mode is stored
|
|
||||||
in the field interrupt_mode of struct pcie_device.
|
|
||||||
|
|
||||||
6.3 PCI Memory/IO Mapped Regions
|
|
||||||
|
|
||||||
Service drivers for PCI Express Power Management (PME), Advanced
|
|
||||||
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
|
|
||||||
PCI configuration space on the PCI Express port. In all cases the
|
|
||||||
registers accessed are independent of each other. This patch assumes
|
|
||||||
that all service drivers will be well behaved and not overwrite
|
|
||||||
other service driver's configuration settings.
|
|
||||||
|
|
||||||
6.4 PCI Config Registers
|
|
||||||
|
|
||||||
Each service driver runs its PCI config operations on its own
|
|
||||||
capability structure except the PCI Express capability structure, in
|
|
||||||
which Root Control register and Device Control register are shared
|
|
||||||
between PME and AER. This patch assumes that all service drivers
|
|
||||||
will be well behaved and not overwrite other service driver's
|
|
||||||
configuration settings.
|
|
192
Documentation/PCI/acpi-info.rst
Normal file
|
@ -0,0 +1,192 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
========================================
|
||||||
|
ACPI considerations for PCI host bridges
|
||||||
|
========================================
|
||||||
|
|
||||||
|
The general rule is that the ACPI namespace should describe everything the
|
||||||
|
OS might use unless there's another way for the OS to find it [1, 2].
|
||||||
|
|
||||||
|
For example, there's no standard hardware mechanism for enumerating PCI
|
||||||
|
host bridges, so the ACPI namespace must describe each host bridge, the
|
||||||
|
method for accessing PCI config space below it, the address space windows
|
||||||
|
the host bridge forwards to PCI (using _CRS), and the routing of legacy
|
||||||
|
INTx interrupts (using _PRT).
|
||||||
|
|
||||||
|
PCI devices, which are below the host bridge, generally do not need to be
|
||||||
|
described via ACPI. The OS can discover them via the standard PCI
|
||||||
|
enumeration mechanism, using config accesses to discover and identify
|
||||||
|
devices and read and size their BARs. However, ACPI may describe PCI
|
||||||
|
devices if it provides power management or hotplug functionality for them
|
||||||
|
or if the device has INTx interrupts connected by platform interrupt
|
||||||
|
controllers and a _PRT is needed to describe those connections.
|
||||||
|
|
||||||
|
ACPI resource description is done via _CRS objects of devices in the ACPI
|
||||||
|
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
|
||||||
|
_CRS and figure out what resource is being consumed even if it doesn't have
|
||||||
|
a driver for the device [3]. That's important because it means an old OS
|
||||||
|
can work correctly even on a system with new devices unknown to the OS.
|
||||||
|
The new devices might not do anything, but the OS can at least make sure no
|
||||||
|
resources conflict with them.
|
||||||
|
|
||||||
|
Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
|
||||||
|
reserving address space. The static tables are for things the OS needs to
|
||||||
|
know early in boot, before it can parse the ACPI namespace. If a new table
|
||||||
|
is defined, an old OS needs to operate correctly even though it ignores the
|
||||||
|
table. _CRS allows that because it is generic and understood by the old
|
||||||
|
OS; a static table does not.
|
||||||
|
|
||||||
|
If the OS is expected to manage a non-discoverable device described via
|
||||||
|
ACPI, that device will have a specific _HID/_CID that tells the OS what
|
||||||
|
driver to bind to it, and the _CRS tells the OS and the driver where the
|
||||||
|
device's registers are.
|
||||||
|
|
||||||
|
PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
|
||||||
|
describe all the address space they consume. This includes all the windows
|
||||||
|
they forward down to the PCI bus, as well as registers of the host bridge
|
||||||
|
itself that are not forwarded to PCI. The host bridge registers include
|
||||||
|
things like secondary/subordinate bus registers that determine the bus
|
||||||
|
range below the bridge, window registers that describe the apertures, etc.
|
||||||
|
These are all device-specific, non-architected things, so the only way a
|
||||||
|
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
|
||||||
|
the device-specific details. The host bridge registers also include ECAM
|
||||||
|
space, since it is consumed by the host bridge.
|
||||||
|
|
||||||
|
ACPI defines a Consumer/Producer bit to distinguish the bridge registers
|
||||||
|
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
|
||||||
|
BIOSes didn't use that bit correctly. The result is that the current ACPI
|
||||||
|
spec defines Consumer/Producer only for the Extended Address Space
|
||||||
|
descriptors; the bit should be ignored in the older QWord/DWord/Word
|
||||||
|
Address Space descriptors. Consequently, OSes have to assume all
|
||||||
|
QWord/DWord/Word descriptors are windows.
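A minimal sketch of the rule above, assuming the raw bytes of a single address space resource descriptor are available. The helper name `is_bridge_window` is hypothetical; the tag values and the position of the General Flags byte follow the descriptor layouts in ACPI 6.2, sec 6.4.3.5, and should be checked against the spec before being relied on.

```python
# Large resource descriptor tag bytes (ACPI 6.2, sec 6.4.3.5).
DWORD_ADDRESS_SPACE = 0x87
WORD_ADDRESS_SPACE = 0x88
QWORD_ADDRESS_SPACE = 0x8A
EXTENDED_ADDRESS_SPACE = 0x8B

CONSUMER_BIT = 0x01  # General Flags, bit [0]


def is_bridge_window(descriptor: bytes) -> bool:
    """Decide whether a host bridge descriptor should be treated as a window.

    Byte 0 is the descriptor tag and byte 4 holds the General Flags.
    """
    tag = descriptor[0]
    general_flags = descriptor[4]
    if tag in (WORD_ADDRESS_SPACE, DWORD_ADDRESS_SPACE, QWORD_ADDRESS_SPACE):
        # Bit [0] is ignored for these descriptors, so assume a window.
        return True
    if tag == EXTENDED_ADDRESS_SPACE:
        # 1 = consumed by the bridge itself, 0 = produced (forwarded to PCI).
        return not (general_flags & CONSUMER_BIT)
    raise ValueError(f"not an address space descriptor: tag {tag:#x}")
```

Note that this sketch honours the bit for Extended Address Space descriptors; an OS that treats every descriptor as a window would return True unconditionally.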
|
||||||
|
|
||||||
|
Prior to the addition of Extended Address Space descriptors, the failure of
|
||||||
|
Consumer/Producer meant there was no way to describe bridge registers in
|
||||||
|
the PNP0A03/PNP0A08 device itself. The workaround was to describe the
|
||||||
|
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
|
||||||
|
With the exception of ECAM, the bridge register space is device-specific
|
||||||
|
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
|
||||||
|
know about it.
|
||||||
|
|
||||||
|
New architectures should be able to use "Consumer" Extended Address Space
|
||||||
|
descriptors in the PNP0A03 device for bridge registers, including ECAM,
|
||||||
|
although a strict interpretation of [6] might prohibit this. Old x86 and
|
||||||
|
ia64 kernels assume all address space descriptors, including "Consumer"
|
||||||
|
Extended Address Space ones, are windows, so it would not be safe to
|
||||||
|
describe bridge registers this way on those architectures.
|
||||||
|
|
||||||
|
PNP0C02 "motherboard" devices are basically a catch-all. There's no
|
||||||
|
programming model for them other than "don't use these resources for
|
||||||
|
anything else." So a PNP0C02 _CRS should claim any address space that is
|
||||||
|
(1) not claimed by _CRS under any other device object in the ACPI namespace
|
||||||
|
and (2) should not be assigned by the OS to something else.
|
||||||
|
|
||||||
|
The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
|
||||||
|
unless there's a standard firmware interface for config access, e.g., the
|
||||||
|
ia64 SAL interface [7]. A host bridge consumes ECAM memory address space
|
||||||
|
and converts memory accesses into PCI configuration accesses. The spec
|
||||||
|
defines the ECAM address space layout and functionality; only the base of
|
||||||
|
the address space is device-specific. An ACPI OS learns the base address
|
||||||
|
from either the static MCFG table or a _CBA method in the PNP0A03 device.
|
||||||
|
|
||||||
|
The MCFG table must describe the ECAM space of non-hot pluggable host
|
||||||
|
bridges [8]. Since MCFG is a static table and can't be updated by hotplug,
|
||||||
|
a _CBA method in the PNP0A03 device describes the ECAM space of a
|
||||||
|
hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base
|
||||||
|
address always corresponds to bus 0, even if the bus range below the bridge
|
||||||
|
(which is reported via _CRS) doesn't start at 0.
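The bus 0 rule above can be illustrated with the ECAM address layout from the PCIe spec, in which the bus, device, function, and register offset occupy address bits [27:20], [19:15], [14:12], and [11:0] respectively. This is a sketch only: the function name `ecam_address` and the example base address are hypothetical.

```python
def ecam_address(base, bus, device, function, offset=0):
    """Return the memory address of a config register inside an ECAM region.

    `base` is the address reported by MCFG or _CBA. It always corresponds to
    bus 0, so the full bus number is used even when the bridge's bus range
    starts higher.
    """
    if not 0 <= bus <= 0xFF:
        raise ValueError("bus must be in 0..255")
    if not 0 <= device <= 0x1F:
        raise ValueError("device must be in 0..31")
    if not 0 <= function <= 0x7:
        raise ValueError("function must be in 0..7")
    if not 0 <= offset <= 0xFFF:
        raise ValueError("offset must be in 0..4095")
    return base + ((bus << 20) | (device << 15) | (function << 12) | offset)


# Hypothetical bridge whose bus range starts at 0x40: the first bus is still
# located 0x40 MiB above the reported base, not at the base itself.
print(hex(ecam_address(0xE0000000, 0x40, 0, 0)))  # 0xe4000000
```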
|
||||||
|
|
||||||
|
|
||||||
|
[1] ACPI 6.2, sec 6.1:
|
||||||
|
For any device that is on a non-enumerable type of bus (for example, an
|
||||||
|
ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
|
||||||
|
system firmware must supply an _HID object ... for each device to
|
||||||
|
enable OSPM to do that.
|
||||||
|
|
||||||
|
[2] ACPI 6.2, sec 3.7:
|
||||||
|
The OS enumerates motherboard devices simply by reading through the
|
||||||
|
ACPI Namespace looking for devices with hardware IDs.
|
||||||
|
|
||||||
|
Each device enumerated by ACPI includes ACPI-defined objects in the
|
||||||
|
ACPI Namespace that report the hardware resources the device could
|
||||||
|
occupy [_PRS], an object that reports the resources that are currently
|
||||||
|
used by the device [_CRS], and objects for configuring those resources
|
||||||
|
[_SRS]. The information is used by the Plug and Play OS (OSPM) to
|
||||||
|
configure the devices.
|
||||||
|
|
||||||
|
[3] ACPI 6.2, sec 6.2:
|
||||||
|
OSPM uses device configuration objects to configure hardware resources
|
||||||
|
for devices enumerated via ACPI. Device configuration objects provide
|
||||||
|
information about current and possible resource requirements, the
|
||||||
|
relationship between shared resources, and methods for configuring
|
||||||
|
hardware resources.
|
||||||
|
|
||||||
|
When OSPM enumerates a device, it calls _PRS to determine the resource
|
||||||
|
requirements of the device. It may also call _CRS to find the current
|
||||||
|
resource settings for the device. Using this information, the Plug and
|
||||||
|
Play system determines what resources the device should consume and
|
||||||
|
sets those resources by calling the device’s _SRS control method.
|
||||||
|
|
||||||
|
In ACPI, devices can consume resources (for example, legacy keyboards),
|
||||||
|
provide resources (for example, a proprietary PCI bridge), or do both.
|
||||||
|
Unless otherwise specified, resources for a device are assumed to be
|
||||||
|
taken from the nearest matching resource above the device in the device
|
||||||
|
hierarchy.
|
||||||
|
|
||||||
|
[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
|
||||||
|
QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
|
||||||
|
General Flags: Bit [0] Ignored
|
||||||
|
|
||||||
|
Extended Address Space Descriptor (.4)
|
||||||
|
General Flags: Bit [0] Consumer/Producer:
|
||||||
|
|
||||||
|
* 1 – This device consumes this resource
|
||||||
|
* 0 – This device produces and consumes this resource
|
||||||
|
|
||||||
|
[5] ACPI 6.2, sec 19.6.43:
|
||||||
|
ResourceUsage specifies whether the Memory range is consumed by
|
||||||
|
this device (ResourceConsumer) or passed on to child devices
|
||||||
|
(ResourceProducer). If nothing is specified, then
|
||||||
|
ResourceConsumer is assumed.
|
||||||
|
|
||||||
|
[6] PCI Firmware 3.2, sec 4.1.2:
|
||||||
|
If the operating system does not natively comprehend reserving the
|
||||||
|
MMCFG region, the MMCFG region must be reserved by firmware. The
|
||||||
|
address range reported in the MCFG table or by _CBA method (see Section
|
||||||
|
4.1.3) must be reserved by declaring a motherboard resource. For most
|
||||||
|
systems, the motherboard resource would appear at the root of the ACPI
|
||||||
|
namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
|
||||||
|
the resources in this case should not be claimed in the root PCI bus’s
|
||||||
|
_CRS. The resources can optionally be returned in Int15 E820 or
|
||||||
|
EFIGetMemoryMap as reserved memory but must always be reported through
|
||||||
|
ACPI as a motherboard resource.
|
||||||
|
|
||||||
|
[7] PCI Express 4.0, sec 7.2.2:
|
||||||
|
For systems that are PC-compatible, or that do not implement a
|
||||||
|
processor-architecture-specific firmware interface standard that allows
|
||||||
|
access to the Configuration Space, the ECAM is required as defined in
|
||||||
|
this section.
|
||||||
|
|
||||||
|
[8] PCI Firmware 3.2, sec 4.1.2:
|
||||||
|
The MCFG table is an ACPI table that is used to communicate the base
|
||||||
|
addresses corresponding to the non-hot removable PCI Segment Groups
|
||||||
|
range within a PCI Segment Group available to the operating system at
|
||||||
|
boot. This is required for the PC-compatible systems.
|
||||||
|
|
||||||
|
The MCFG table is only used to communicate the base addresses
|
||||||
|
corresponding to the PCI Segment Groups available to the system at
|
||||||
|
boot.
|
||||||
|
|
||||||
|
[9] PCI Firmware 3.2, sec 4.1.3:
|
||||||
|
The _CBA (Memory mapped Configuration Base Address) control method is
|
||||||
|
an optional ACPI object that returns the 64-bit memory mapped
|
||||||
|
configuration base address for the hot plug capable host bridge. The
|
||||||
|
base address returned by _CBA is processor-relative address. The _CBA
|
||||||
|
control method evaluates to an Integer.
|
||||||
|
|
||||||
|
This control method appears under a host bridge object. When the _CBA
|
||||||
|
method appears under an active host bridge object, the operating system
|
||||||
|
evaluates this structure to identify the memory mapped configuration
|
||||||
|
base address corresponding to the PCI Segment Group for the bus number
|
||||||
|
range specified in _CRS method. An ACPI name space object that contains
|
||||||
|
the _CBA method must also contain a corresponding _SEG method.
|
|
@ -1,187 +0,0 @@
|
||||||
ACPI considerations for PCI host bridges
|
|
||||||
|
|
||||||
The general rule is that the ACPI namespace should describe everything the
|
|
||||||
OS might use unless there's another way for the OS to find it [1, 2].
|
|
||||||
|
|
||||||
For example, there's no standard hardware mechanism for enumerating PCI
|
|
||||||
host bridges, so the ACPI namespace must describe each host bridge, the
|
|
||||||
method for accessing PCI config space below it, the address space windows
|
|
||||||
the host bridge forwards to PCI (using _CRS), and the routing of legacy
|
|
||||||
INTx interrupts (using _PRT).
|
|
||||||
|
|
||||||
PCI devices, which are below the host bridge, generally do not need to be
|
|
||||||
described via ACPI. The OS can discover them via the standard PCI
|
|
||||||
enumeration mechanism, using config accesses to discover and identify
|
|
||||||
devices and read and size their BARs. However, ACPI may describe PCI
|
|
||||||
devices if it provides power management or hotplug functionality for them
|
|
||||||
or if the device has INTx interrupts connected by platform interrupt
|
|
||||||
controllers and a _PRT is needed to describe those connections.
|
|
||||||
|
|
||||||
ACPI resource description is done via _CRS objects of devices in the ACPI
|
|
||||||
namespace [2]. The _CRS is like a generalized PCI BAR: the OS can read
|
|
||||||
_CRS and figure out what resource is being consumed even if it doesn't have
|
|
||||||
a driver for the device [3]. That's important because it means an old OS
|
|
||||||
can work correctly even on a system with new devices unknown to the OS.
|
|
||||||
The new devices might not do anything, but the OS can at least make sure no
|
|
||||||
resources conflict with them.
|
|
||||||
|
|
||||||
Static tables like MCFG, HPET, ECDT, etc., are *not* mechanisms for
|
|
||||||
reserving address space. The static tables are for things the OS needs to
|
|
||||||
know early in boot, before it can parse the ACPI namespace. If a new table
|
|
||||||
is defined, an old OS needs to operate correctly even though it ignores the
|
|
||||||
table. _CRS allows that because it is generic and understood by the old
|
|
||||||
OS; a static table does not.
|
|
||||||
|
|
||||||
If the OS is expected to manage a non-discoverable device described via
|
|
||||||
ACPI, that device will have a specific _HID/_CID that tells the OS what
|
|
||||||
driver to bind to it, and the _CRS tells the OS and the driver where the
|
|
||||||
device's registers are.
|
|
||||||
|
|
||||||
PCI host bridges are PNP0A03 or PNP0A08 devices. Their _CRS should
|
|
||||||
describe all the address space they consume. This includes all the windows
|
|
||||||
they forward down to the PCI bus, as well as registers of the host bridge
|
|
||||||
itself that are not forwarded to PCI. The host bridge registers include
|
|
||||||
things like secondary/subordinate bus registers that determine the bus
|
|
||||||
range below the bridge, window registers that describe the apertures, etc.
|
|
||||||
These are all device-specific, non-architected things, so the only way a
|
|
||||||
PNP0A03/PNP0A08 driver can manage them is via _PRS/_CRS/_SRS, which contain
|
|
||||||
the device-specific details. The host bridge registers also include ECAM
|
|
||||||
space, since it is consumed by the host bridge.
|
|
||||||
|
|
||||||
ACPI defines a Consumer/Producer bit to distinguish the bridge registers
|
|
||||||
("Consumer") from the bridge apertures ("Producer") [4, 5], but early
|
|
||||||
BIOSes didn't use that bit correctly. The result is that the current ACPI
|
|
||||||
spec defines Consumer/Producer only for the Extended Address Space
|
|
||||||
descriptors; the bit should be ignored in the older QWord/DWord/Word
|
|
||||||
Address Space descriptors. Consequently, OSes have to assume all
|
|
||||||
QWord/DWord/Word descriptors are windows.
|
|
||||||
|
|
||||||
Prior to the addition of Extended Address Space descriptors, the failure of
|
|
||||||
Consumer/Producer meant there was no way to describe bridge registers in
|
|
||||||
the PNP0A03/PNP0A08 device itself. The workaround was to describe the
|
|
||||||
bridge registers (including ECAM space) in PNP0C02 catch-all devices [6].
|
|
||||||
With the exception of ECAM, the bridge register space is device-specific
|
|
||||||
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need to
|
|
||||||
know about it.
|
|
||||||
|
|
||||||
New architectures should be able to use "Consumer" Extended Address Space
|
|
||||||
descriptors in the PNP0A03 device for bridge registers, including ECAM,
|
|
||||||
although a strict interpretation of [6] might prohibit this. Old x86 and
|
|
||||||
ia64 kernels assume all address space descriptors, including "Consumer"
|
|
||||||
Extended Address Space ones, are windows, so it would not be safe to
|
|
||||||
describe bridge registers this way on those architectures.
|
|
||||||
|
|
||||||
PNP0C02 "motherboard" devices are basically a catch-all. There's no
|
|
||||||
programming model for them other than "don't use these resources for
|
|
||||||
anything else." So a PNP0C02 _CRS should claim any address space that is
|
|
||||||
(1) not claimed by _CRS under any other device object in the ACPI namespace
|
|
||||||
and (2) should not be assigned by the OS to something else.
|
|
||||||
|
|
||||||
The PCIe spec requires the Enhanced Configuration Access Method (ECAM)
|
|
||||||
unless there's a standard firmware interface for config access, e.g., the
|
|
||||||
ia64 SAL interface [7]. A host bridge consumes ECAM memory address space
|
|
||||||
and converts memory accesses into PCI configuration accesses. The spec
|
|
||||||
defines the ECAM address space layout and functionality; only the base of
|
|
||||||
the address space is device-specific. An ACPI OS learns the base address
|
|
||||||
from either the static MCFG table or a _CBA method in the PNP0A03 device.
|
|
||||||
|
|
||||||
The MCFG table must describe the ECAM space of non-hot pluggable host
|
|
||||||
bridges [8]. Since MCFG is a static table and can't be updated by hotplug,
|
|
||||||
a _CBA method in the PNP0A03 device describes the ECAM space of a
|
|
||||||
hot-pluggable host bridge [9]. Note that for both MCFG and _CBA, the base
|
|
||||||
address always corresponds to bus 0, even if the bus range below the bridge
|
|
||||||
(which is reported via _CRS) doesn't start at 0.
|
|
||||||
|
|
||||||
|
|
||||||
[1] ACPI 6.2, sec 6.1:
|
|
||||||
For any device that is on a non-enumerable type of bus (for example, an
|
|
||||||
ISA bus), OSPM enumerates the devices' identifier(s) and the ACPI
|
|
||||||
system firmware must supply an _HID object ... for each device to
|
|
||||||
enable OSPM to do that.
|
|
||||||
|
|
||||||
[2] ACPI 6.2, sec 3.7:
|
|
||||||
The OS enumerates motherboard devices simply by reading through the
|
|
||||||
ACPI Namespace looking for devices with hardware IDs.
|
|
||||||
|
|
||||||
Each device enumerated by ACPI includes ACPI-defined objects in the
|
|
||||||
ACPI Namespace that report the hardware resources the device could
|
|
||||||
occupy [_PRS], an object that reports the resources that are currently
|
|
||||||
used by the device [_CRS], and objects for configuring those resources
|
|
||||||
[_SRS]. The information is used by the Plug and Play OS (OSPM) to
|
|
||||||
configure the devices.
|
|
||||||
|
|
||||||
[3] ACPI 6.2, sec 6.2:
|
|
||||||
OSPM uses device configuration objects to configure hardware resources
|
|
||||||
for devices enumerated via ACPI. Device configuration objects provide
|
|
||||||
information about current and possible resource requirements, the
|
|
||||||
relationship between shared resources, and methods for configuring
|
|
||||||
hardware resources.
|
|
||||||
|
|
||||||
When OSPM enumerates a device, it calls _PRS to determine the resource
|
|
||||||
requirements of the device. It may also call _CRS to find the current
|
|
||||||
resource settings for the device. Using this information, the Plug and
|
|
||||||
Play system determines what resources the device should consume and
|
|
||||||
sets those resources by calling the device’s _SRS control method.
|
|
||||||
|
|
||||||
In ACPI, devices can consume resources (for example, legacy keyboards),
|
|
||||||
provide resources (for example, a proprietary PCI bridge), or do both.
|
|
||||||
Unless otherwise specified, resources for a device are assumed to be
|
|
||||||
taken from the nearest matching resource above the device in the device
|
|
||||||
hierarchy.
|
|
||||||
|
|
||||||
[4] ACPI 6.2, sec 6.4.3.5.1, 2, 3, 4:
|
|
||||||
QWord/DWord/Word Address Space Descriptor (.1, .2, .3)
|
|
||||||
General Flags: Bit [0] Ignored
|
|
||||||
|
|
||||||
Extended Address Space Descriptor (.4)
|
|
||||||
General Flags: Bit [0] Consumer/Producer:
|
|
||||||
1–This device consumes this resource
|
|
||||||
0–This device produces and consumes this resource
|
|
||||||
|
|
||||||
[5] ACPI 6.2, sec 19.6.43:
|
|
||||||
ResourceUsage specifies whether the Memory range is consumed by
|
|
||||||
this device (ResourceConsumer) or passed on to child devices
|
|
||||||
(ResourceProducer). If nothing is specified, then
|
|
||||||
ResourceConsumer is assumed.
|
|
||||||
|
|
||||||
[6] PCI Firmware 3.2, sec 4.1.2:
|
|
||||||
If the operating system does not natively comprehend reserving the
|
|
||||||
MMCFG region, the MMCFG region must be reserved by firmware. The
|
|
||||||
address range reported in the MCFG table or by _CBA method (see Section
|
|
||||||
4.1.3) must be reserved by declaring a motherboard resource. For most
|
|
||||||
systems, the motherboard resource would appear at the root of the ACPI
|
|
||||||
namespace (under \_SB) in a node with a _HID of EISAID (PNP0C02), and
|
|
||||||
the resources in this case should not be claimed in the root PCI bus’s
|
|
||||||
_CRS. The resources can optionally be returned in Int15 E820 or
|
|
||||||
EFIGetMemoryMap as reserved memory but must always be reported through
|
|
||||||
ACPI as a motherboard resource.
|
|
||||||
|
|
||||||
[7] PCI Express 4.0, sec 7.2.2:
|
|
||||||
For systems that are PC-compatible, or that do not implement a
|
|
||||||
processor-architecture-specific firmware interface standard that allows
|
|
||||||
access to the Configuration Space, the ECAM is required as defined in
|
|
||||||
this section.
|
|
||||||
|
|
||||||
[8] PCI Firmware 3.2, sec 4.1.2:
|
|
||||||
The MCFG table is an ACPI table that is used to communicate the base
|
|
||||||
addresses corresponding to the non-hot removable PCI Segment Groups
|
|
||||||
range within a PCI Segment Group available to the operating system at
|
|
||||||
boot. This is required for the PC-compatible systems.
|
|
||||||
|
|
||||||
The MCFG table is only used to communicate the base addresses
|
|
||||||
corresponding to the PCI Segment Groups available to the system at
|
|
||||||
boot.
|
|
||||||
|
|
||||||
[9] PCI Firmware 3.2, sec 4.1.3:
|
|
||||||
The _CBA (Memory mapped Configuration Base Address) control method is
|
|
||||||
an optional ACPI object that returns the 64-bit memory mapped
|
|
||||||
configuration base address for the hot plug capable host bridge. The
|
|
||||||
base address returned by _CBA is processor-relative address. The _CBA
|
|
||||||
control method evaluates to an Integer.
|
|
||||||
|
|
||||||
This control method appears under a host bridge object. When the _CBA
|
|
||||||
method appears under an active host bridge object, the operating system
|
|
||||||
evaluates this structure to identify the memory mapped configuration
|
|
||||||
base address corresponding to the PCI Segment Group for the bus number
|
|
||||||
range specified in _CRS method. An ACPI name space object that contains
|
|
||||||
the _CBA method must also contain a corresponding _SEG method.
|
|
13
Documentation/PCI/endpoint/index.rst
Normal file
|
@ -0,0 +1,13 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
======================
|
||||||
|
PCI Endpoint Framework
|
||||||
|
======================
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
|
||||||
|
pci-endpoint
|
||||||
|
pci-endpoint-cfs
|
||||||
|
pci-test-function
|
||||||
|
pci-test-howto
|
118
Documentation/PCI/endpoint/pci-endpoint-cfs.rst
Normal file
|
@ -0,0 +1,118 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=======================================
|
||||||
|
Configuring PCI Endpoint Using CONFIGFS
|
||||||
|
=======================================
|
||||||
|
|
||||||
|
:Author: Kishon Vijay Abraham I <kishon@ti.com>
|
||||||
|
|
||||||
|
The PCI Endpoint Core exposes a configfs entry (pci_ep) to configure the
|
||||||
|
PCI endpoint function and to bind the endpoint function
|
||||||
|
with the endpoint controller. (For introducing other mechanisms to
|
||||||
|
configure the PCI Endpoint Function refer to [1]).
|
||||||
|
|
||||||
|
Mounting configfs
|
||||||
|
=================
|
||||||
|
|
||||||
|
The PCI Endpoint Core layer creates the pci_ep directory in the mounted configfs
|
||||||
|
directory. configfs can be mounted using the following command::
|
||||||
|
|
||||||
|
mount -t configfs none /sys/kernel/config
|
||||||
|
|
||||||
|
Directory Structure
|
||||||
|
===================
|
||||||
|
|
||||||
|
The pci_ep configfs has two directories at its root: controllers and
|
||||||
|
functions. Every EPC device present in the system will have an entry in
|
||||||
|
the *controllers* directory and every EPF driver present in the system
|
||||||
|
will have an entry in the *functions* directory.
|
||||||
|
::
|
||||||
|
|
||||||
|
/sys/kernel/config/pci_ep/
|
||||||
|
.. controllers/
|
||||||
|
.. functions/
|
||||||
|
|
||||||
|
Creating EPF Device
|
||||||
|
===================
|
||||||
|
|
||||||
|
Every registered EPF driver will be listed in the functions directory. The
|
||||||
|
entries corresponding to EPF driver will be created by the EPF core.
|
||||||
|
::
|
||||||
|
|
||||||
|
/sys/kernel/config/pci_ep/functions/
|
||||||
|
.. <EPF Driver1>/
|
||||||
|
... <EPF Device 11>/
|
||||||
|
... <EPF Device 21>/
|
||||||
|
.. <EPF Driver2>/
|
||||||
|
... <EPF Device 12>/
|
||||||
|
... <EPF Device 22>/
|
||||||
|
|
||||||
|
In order to create an <EPF device> of the type probed by <EPF Driver>, the
|
||||||
|
user has to create a directory inside <EPF DriverN>.
|
||||||
|
|
||||||
|
Every <EPF device> directory consists of the following entries that can be
|
||||||
|
used to configure the standard configuration header of the endpoint function.
|
||||||
|
(These entries are created by the framework when any new <EPF Device> is
|
||||||
|
created)
|
||||||
|
::
|
||||||
|
|
||||||
|
.. <EPF Driver1>/
|
||||||
|
... <EPF Device 11>/
|
||||||
|
... vendorid
|
||||||
|
... deviceid
|
||||||
|
... revid
|
||||||
|
... progif_code
|
||||||
|
... subclass_code
|
||||||
|
... baseclass_code
|
||||||
|
... cache_line_size
|
||||||
|
... subsys_vendor_id
|
||||||
|
... subsys_id
|
||||||
|
... interrupt_pin
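The steps above (create a directory under the EPF driver, then write the header fields) can be sketched as follows. This is an illustration under assumptions: the helper name `create_epf_device`, the driver and device names, and the ID values are all hypothetical, and the `root` parameter exists only so the sketch can be exercised against an ordinary directory instead of a real configfs mount.

```python
from pathlib import Path

PCI_EP_ROOT = Path("/sys/kernel/config/pci_ep")


def create_epf_device(driver, name, header, root=PCI_EP_ROOT):
    """Create an EPF device directory and fill in configuration header fields.

    On a real configfs mount, the mkdir makes the framework create the
    attribute files (vendorid, deviceid, ...); here we only write to them.
    """
    epf_dir = Path(root) / "functions" / driver / name
    epf_dir.mkdir()  # the <EPF Driver> directory must already exist
    for attr, value in header.items():
        # Values are written as hexadecimal strings, e.g. "0x104c".
        (epf_dir / attr).write_text(f"{value:#x}\n")
    return epf_dir
```

A call might look like `create_epf_device("epf_driver", "func1", {"vendorid": 0x104c, "deviceid": 0xb500})`, where every argument is a placeholder for values that depend on the actual endpoint function.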
|
||||||
|
|
||||||
|
EPC Device
|
||||||
|
==========
|
||||||
|
|
||||||
|
Every registered EPC device will be listed in the controllers directory. The
|
||||||
|
entries corresponding to EPC device will be created by the EPC core.
|
||||||
|
::
|
||||||
|
|
||||||
|
/sys/kernel/config/pci_ep/controllers/
|
||||||
|
.. <EPC Device1>/
|
||||||
|
... <Symlink EPF Device11>/
|
||||||
|
... <Symlink EPF Device12>/
|
||||||
|
... start
|
||||||
|
.. <EPC Device2>/
|
||||||
|
... <Symlink EPF Device21>/
|
||||||
|
... <Symlink EPF Device22>/
|
||||||
|
... start
|
||||||
|
|
||||||
|
The <EPC Device> directory will have a list of symbolic links to
|
||||||
|
<EPF Device>. These symbolic links should be created by the user to
|
||||||
|
represent the functions present in the endpoint device.
|
||||||
|
|
||||||
|
The <EPC Device> directory will also have a *start* field. Once
|
||||||
|
"1" is written to this field, the endpoint device will be ready to
|
||||||
|
establish the link with the host. This is usually done after
|
||||||
|
all the EPF devices are created and linked with the EPC device.
|
||||||
|
::
|
||||||
|
|
||||||
|
| controllers/
|
||||||
|
| <Directory: EPC name>/
|
||||||
|
| <Symbolic Link: Function>
|
||||||
|
| start
|
||||||
|
| functions/
|
||||||
|
| <Directory: EPF driver>/
|
||||||
|
| <Directory: EPF device>/
|
||||||
|
| vendorid
|
||||||
|
| deviceid
|
||||||
|
| revid
|
||||||
|
| progif_code
|
||||||
|
| subclass_code
|
||||||
|
| baseclass_code
|
||||||
|
| cache_line_size
|
||||||
|
| subsys_vendor_id
|
||||||
|
| subsys_id
|
||||||
|
| interrupt_pin
|
||||||
|
| function
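Linking EPF devices to an EPC device and then writing "1" to *start*, as described above, can be sketched like this. The helper name `bind_and_start` and the EPC and EPF names are hypothetical, and the `root` parameter is an assumption added so the sketch can be run against an ordinary directory rather than a real configfs mount.

```python
import os
from pathlib import Path

PCI_EP_ROOT = Path("/sys/kernel/config/pci_ep")


def bind_and_start(epc, epf_dirs, root=PCI_EP_ROOT):
    """Symlink each EPF device into the EPC directory, then start the link."""
    epc_dir = Path(root) / "controllers" / epc
    for epf_dir in epf_dirs:
        # One symbolic link per function present in the endpoint device.
        os.symlink(epf_dir, epc_dir / Path(epf_dir).name)
    # Only after all functions are linked is the controller told to start.
    (epc_dir / "start").write_text("1\n")
```

The ordering matters: the symbolic links are created first, and *start* is written last, matching the note that this is usually done after all EPF devices are created and linked.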
|
||||||
|
|
||||||
|
[1] :doc:`pci-endpoint`
|
|
@ -1,105 +0,0 @@
|
||||||
CONFIGURING PCI ENDPOINT USING CONFIGFS
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
The PCI Endpoint Core exposes configfs entry (pci_ep) to configure the
|
|
||||||
PCI endpoint function and to bind the endpoint function
|
|
||||||
with the endpoint controller. (For introducing other mechanisms to
|
|
||||||
configure the PCI Endpoint Function refer to [1]).
|
|
||||||
|
|
||||||
*) Mounting configfs
|
|
||||||
|
|
||||||
The PCI Endpoint Core layer creates pci_ep directory in the mounted configfs
|
|
||||||
directory. configfs can be mounted using the following command.
|
|
||||||
|
|
||||||
mount -t configfs none /sys/kernel/config
|
|
||||||
|
|
||||||
*) Directory Structure
|
|
||||||
|
|
||||||
The pci_ep configfs has two directories at its root: controllers and
|
|
||||||
functions. Every EPC device present in the system will have an entry in
|
|
||||||
the *controllers* directory and and every EPF driver present in the system
|
|
||||||
will have an entry in the *functions* directory.
|
|
||||||
|
|
||||||
/sys/kernel/config/pci_ep/
|
|
||||||
.. controllers/
|
|
||||||
.. functions/
|
|
||||||
|
|
||||||
*) Creating EPF Device
|
|
||||||
|
|
||||||
Every registered EPF driver will be listed in controllers directory. The
|
|
||||||
entries corresponding to EPF driver will be created by the EPF core.
|
|
||||||
|
|
||||||
/sys/kernel/config/pci_ep/functions/
|
|
||||||
.. <EPF Driver1>/
|
|
||||||
... <EPF Device 11>/
|
|
||||||
... <EPF Device 21>/
|
|
||||||
.. <EPF Driver2>/
|
|
||||||
... <EPF Device 12>/
|
|
||||||
... <EPF Device 22>/
|
|
||||||
|
|
||||||
In order to create a <EPF device> of the type probed by <EPF Driver>, the
|
|
||||||
user has to create a directory inside <EPF DriverN>.
|
|
||||||
|
|
||||||
Every <EPF device> directory consists of the following entries that can be
|
|
||||||
used to configure the standard configuration header of the endpoint function.
|
|
||||||
(These entries are created by the framework when any new <EPF Device> is
|
|
||||||
created)
|
|
||||||
|
|
||||||
.. <EPF Driver1>/
|
|
||||||
... <EPF Device 11>/
|
|
||||||
... vendorid
|
|
||||||
... deviceid
|
|
||||||
... revid
|
|
||||||
... progif_code
|
|
||||||
... subclass_code
|
|
||||||
... baseclass_code
|
|
||||||
... cache_line_size
|
|
||||||
... subsys_vendor_id
|
|
||||||
... subsys_id
|
|
||||||
... interrupt_pin
|
|
||||||
|
|
||||||
*) EPC Device
|
|
||||||
|
|
||||||
Every registered EPC device will be listed in controllers directory. The
|
|
||||||
entries corresponding to EPC device will be created by the EPC core.
|
|
||||||
|
|
||||||
/sys/kernel/config/pci_ep/controllers/
|
|
||||||
.. <EPC Device1>/
|
|
||||||
... <Symlink EPF Device11>/
|
|
||||||
... <Symlink EPF Device12>/
|
|
||||||
... start
|
|
||||||
.. <EPC Device2>/
|
|
||||||
... <Symlink EPF Device21>/
|
|
||||||
... <Symlink EPF Device22>/
|
|
||||||
... start
|
|
||||||
|
|
||||||
The <EPC Device> directory will have a list of symbolic links to
|
|
||||||
<EPF Device>. These symbolic links should be created by the user to
|
|
||||||
represent the functions present in the endpoint device.
|
|
||||||
|
|
||||||
The <EPC Device> directory will also have a *start* field. Once
|
|
||||||
"1" is written to this field, the endpoint device will be ready to
|
|
||||||
establish the link with the host. This is usually done after
|
|
||||||
all the EPF devices are created and linked with the EPC device.
|
|
||||||
|
|
||||||
|
|
||||||
| controllers/
|
|
||||||
| <Directory: EPC name>/
|
|
||||||
| <Symbolic Link: Function>
|
|
||||||
| start
|
|
||||||
| functions/
|
|
||||||
| <Directory: EPF driver>/
|
|
||||||
| <Directory: EPF device>/
|
|
||||||
| vendorid
|
|
||||||
| deviceid
|
|
||||||
| revid
|
|
||||||
| progif_code
|
|
||||||
| subclass_code
|
|
||||||
| baseclass_code
|
|
||||||
| cache_line_size
|
|
||||||
| subsys_vendor_id
|
|
||||||
| subsys_id
|
|
||||||
| interrupt_pin
|
|
||||||
| function
|
|
||||||
|
|
||||||
[1] -> Documentation/PCI/endpoint/pci-endpoint.txt
|
|
231
Documentation/PCI/endpoint/pci-endpoint.rst
Normal file
|
@ -0,0 +1,231 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
:Author: Kishon Vijay Abraham I <kishon@ti.com>
|
||||||
|
|
||||||
|
This document is a guide to using the PCI Endpoint Framework to create an
|
||||||
|
endpoint controller driver and an endpoint function driver, and to using the
|
||||||
|
configfs interface to bind the function driver to the controller driver.
|
||||||
|
|
||||||
|
Introduction
|
||||||
|
============
|
||||||
|
|
||||||
|
Linux has a comprehensive PCI subsystem to support PCI controllers that
|
||||||
|
operate in Root Complex mode. The subsystem has the capability to scan the PCI bus,
|
||||||
|
assign memory resources and IRQ resources, load PCI driver (based on
|
||||||
|
vendor ID, device ID), support other services like hot-plug, power management,
|
||||||
|
advanced error reporting and virtual channels.
|
||||||
|
|
||||||
|
However, the PCI controller IP integrated in some SoCs is capable of operating
|
||||||
|
either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will
|
||||||
|
add endpoint mode support in Linux. This will help to run Linux in an
|
||||||
|
EP system, which has a wide variety of use cases such as testing or
|
||||||
|
validation, co-processor accelerators, etc.
|
||||||
|
|
||||||
|
PCI Endpoint Core
|
||||||
|
=================
|
||||||
|
|
||||||
|
The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
|
||||||
|
library, the Endpoint Function library, and the configfs layer to bind the
|
||||||
|
endpoint function with the endpoint controller.
|
||||||
|
|
||||||
|
PCI Endpoint Controller(EPC) Library
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
The EPC library provides APIs to be used by the controller that can operate
|
||||||
|
in endpoint mode. It also provides APIs to be used by function driver/library
|
||||||
|
in order to implement a particular endpoint function.
|
||||||
|
|
||||||
|
APIs for the PCI controller Driver
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This section lists the APIs that the PCI Endpoint core provides to be used
|
||||||
|
by the PCI controller driver.
|
||||||
|
|
||||||
|
* devm_pci_epc_create()/pci_epc_create()
|
||||||
|
|
||||||
|
The PCI controller driver should implement the following ops:
|
||||||
|
|
||||||
|
* write_header: ops to populate configuration space header
|
||||||
|
* set_bar: ops to configure the BAR
|
||||||
|
* clear_bar: ops to reset the BAR
|
||||||
|
* alloc_addr_space: ops to allocate in PCI controller address space
|
||||||
|
* free_addr_space: ops to free the allocated address space
|
||||||
|
* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
|
||||||
|
* start: ops to start the PCI link
|
||||||
|
* stop: ops to stop the PCI link
|
||||||
|
|
||||||
|
The PCI controller driver can then create a new EPC device by invoking
|
||||||
|
devm_pci_epc_create()/pci_epc_create().
|
||||||
|
|
||||||
|
* devm_pci_epc_destroy()/pci_epc_destroy()
|
||||||
|
|
||||||
|
The PCI controller driver can destroy the EPC device created by either
|
||||||
|
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
|
||||||
|
pci_epc_destroy().
|
||||||
|
|
||||||
|
* pci_epc_linkup()
|
||||||
|
|
||||||
|
In order to notify all the function devices that the EPC device to which
|
||||||
|
they are linked has established a link with the host, the PCI controller
|
||||||
|
driver should invoke pci_epc_linkup().
|
||||||
|
|
||||||
|
* pci_epc_mem_init()
|
||||||
|
|
||||||
|
Initialize the pci_epc_mem structure used for allocating EPC addr space.
|
||||||
|
|
||||||
|
* pci_epc_mem_exit()
|
||||||
|
|
||||||
|
Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
|
||||||
|
|
||||||
|
|
||||||
|
APIs for the PCI Endpoint Function Driver
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This section lists the APIs that the PCI Endpoint core provides to be used
|
||||||
|
by the PCI endpoint function driver.
|
||||||
|
|
||||||
|
* pci_epc_write_header()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_write_header() to
|
||||||
|
write the standard configuration header to the endpoint controller.
|
||||||
|
|
||||||
|
* pci_epc_set_bar()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_set_bar() to configure
|
||||||
|
the Base Address Register in order for the host to assign PCI addr space.
|
||||||
|
Register space of the function driver is usually configured
|
||||||
|
using this API.
|
||||||
|
|
||||||
|
* pci_epc_clear_bar()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_clear_bar() to reset
|
||||||
|
the BAR.
|
||||||
|
|
||||||
|
* pci_epc_raise_irq()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_raise_irq() to raise
|
||||||
|
Legacy Interrupt, MSI or MSI-X Interrupt.
|
||||||
|
|
||||||
|
* pci_epc_mem_alloc_addr()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_mem_alloc_addr() to
allocate a memory address from the EPC address space, which is required to
access the RC's buffer.
|
||||||
|
|
||||||
|
* pci_epc_mem_free_addr()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should use pci_epc_mem_free_addr() to
|
||||||
|
free the memory space allocated using pci_epc_mem_alloc_addr().
|
||||||
|
|
||||||
|
Other APIs
|
||||||
|
~~~~~~~~~~
|
||||||
|
|
||||||
|
There are other APIs provided by the EPC library. These are used for binding
|
||||||
|
the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
|
||||||
|
using these APIs.
|
||||||
|
|
||||||
|
* pci_epc_get()
|
||||||
|
|
||||||
|
Get a reference to the PCI endpoint controller based on the device name of
|
||||||
|
the controller.
|
||||||
|
|
||||||
|
* pci_epc_put()
|
||||||
|
|
||||||
|
Release the reference to the PCI endpoint controller obtained using
|
||||||
|
pci_epc_get().
|
||||||
|
|
||||||
|
* pci_epc_add_epf()
|
||||||
|
|
||||||
|
Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
|
||||||
|
can have up to 8 functions according to the specification.
|
||||||
|
|
||||||
|
* pci_epc_remove_epf()
|
||||||
|
|
||||||
|
Remove the PCI endpoint function from PCI endpoint controller.
|
||||||
|
|
||||||
|
* pci_epc_start()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should invoke pci_epc_start() once it
|
||||||
|
has configured the endpoint function and wants to start the PCI link.
|
||||||
|
|
||||||
|
* pci_epc_stop()
|
||||||
|
|
||||||
|
The PCI endpoint function driver should invoke pci_epc_stop() to stop
|
||||||
|
the PCI link.
|
||||||
|
|
||||||
|
|
||||||
|
PCI Endpoint Function(EPF) Library
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
The EPF library provides APIs to be used by the function driver and the EPC
|
||||||
|
library to provide endpoint mode functionality.
|
||||||
|
|
||||||
|
APIs for the PCI Endpoint Function Driver
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This section lists the APIs that the PCI Endpoint core provides to be used
|
||||||
|
by the PCI endpoint function driver.
|
||||||
|
|
||||||
|
* pci_epf_register_driver()
|
||||||
|
|
||||||
|
The PCI Endpoint Function driver should implement the following ops:
|
||||||
|
* bind: ops to perform when an EPC device has been bound to an EPF device
|
||||||
|
* unbind: ops to perform when a binding has been lost between an EPC
|
||||||
|
device and EPF device
|
||||||
|
* linkup: ops to perform when the EPC device has established a
|
||||||
|
connection with a host system
|
||||||
|
|
||||||
|
The PCI Function driver can then register the PCI EPF driver by using
|
||||||
|
pci_epf_register_driver().
|
||||||
|
|
||||||
|
* pci_epf_unregister_driver()
|
||||||
|
|
||||||
|
The PCI Function driver can unregister the PCI EPF driver by using
|
||||||
|
pci_epf_unregister_driver().
|
||||||
|
|
||||||
|
* pci_epf_alloc_space()
|
||||||
|
|
||||||
|
The PCI Function driver can allocate space for a particular BAR using
|
||||||
|
pci_epf_alloc_space().
|
||||||
|
|
||||||
|
* pci_epf_free_space()
|
||||||
|
|
||||||
|
The PCI Function driver can free the allocated space
|
||||||
|
(using pci_epf_alloc_space()) by invoking pci_epf_free_space().
|
||||||
|
|
||||||
|
APIs for the PCI Endpoint Controller Library
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This section lists the APIs that the PCI Endpoint core provides to be used
|
||||||
|
by the PCI endpoint controller library.
|
||||||
|
|
||||||
|
* pci_epf_linkup()
|
||||||
|
|
||||||
|
The PCI endpoint controller library invokes pci_epf_linkup() when the
|
||||||
|
EPC device has established the connection to the host.
|
||||||
|
|
||||||
|
Other APIs
|
||||||
|
~~~~~~~~~~
|
||||||
|
|
||||||
|
There are other APIs provided by the EPF library. These are used to notify
|
||||||
|
the function driver when the EPF device is bound to the EPC device.
|
||||||
|
pci-ep-cfs.c can be used as reference for using these APIs.
|
||||||
|
|
||||||
|
* pci_epf_create()
|
||||||
|
|
||||||
|
Create a new PCI EPF device by passing the name of the PCI EPF device.
|
||||||
|
This name will be used to bind the EPF device to an EPF driver.
|
||||||
|
|
||||||
|
* pci_epf_destroy()
|
||||||
|
|
||||||
|
Destroy the created PCI EPF device.
|
||||||
|
|
||||||
|
* pci_epf_bind()
|
||||||
|
|
||||||
|
pci_epf_bind() should be invoked when the EPF device has been bound to
|
||||||
|
an EPC device.
|
||||||
|
|
||||||
|
* pci_epf_unbind()
|
||||||
|
|
||||||
|
pci_epf_unbind() should be invoked when the binding between EPC device
|
||||||
|
and EPF device is lost.
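The callback flow described in this document (bind, linkup, unbind) can be pictured with a small user-space model. This is only a sketch under stated assumptions: the `Epc` and `Epf` classes are hypothetical Python stand-ins, not the kernel's C structures or APIs, and `add_epf()` here both attaches and binds a function for brevity. The limit of 8 functions is taken from the pci_epc_add_epf() description.

```python
MAX_FUNCTIONS = 8  # "up to 8 functions", per the pci_epc_add_epf() description


class Epf:
    """Toy endpoint function that records which callbacks were invoked."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def bind(self):
        self.events.append("bind")

    def unbind(self):
        self.events.append("unbind")

    def linkup(self):
        self.events.append("linkup")


class Epc:
    """Toy endpoint controller holding the functions attached to it."""

    def __init__(self):
        self.functions = []

    def add_epf(self, epf):
        # Simplification: attach and bind in one step.
        if len(self.functions) >= MAX_FUNCTIONS:
            raise ValueError("controller already has 8 functions")
        self.functions.append(epf)
        epf.bind()

    def remove_epf(self, epf):
        self.functions.remove(epf)
        epf.unbind()

    def linkup(self):
        # Notify every attached function that the link with the host is up.
        for epf in self.functions:
            epf.linkup()


epc = Epc()
func = Epf("pci_epf_test")
epc.add_epf(func)
epc.linkup()
epc.remove_epf(func)
print(func.events)
```

The recorded event order mirrors the life cycle in the text: a function is bound to a controller, is told when the link to the host comes up, and is unbound when the binding is lost.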
|
|
@ -1,215 +0,0 @@
|
||||||
PCI ENDPOINT FRAMEWORK
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
This document is a guide to use the PCI Endpoint Framework in order to create
|
|
||||||
endpoint controller driver, endpoint function driver, and using configfs
|
|
||||||
interface to bind the function driver to the controller driver.
|
|
||||||
|
|
||||||
1. Introduction
|
|
||||||
|
|
||||||
Linux has a comprehensive PCI subsystem to support PCI controllers that
|
|
||||||
operates in Root Complex mode. The subsystem has capability to scan PCI bus,
|
|
||||||
assign memory resources and IRQ resources, load PCI driver (based on
|
|
||||||
vendor ID, device ID), support other services like hot-plug, power management,
|
|
||||||
advanced error reporting and virtual channels.
|
|
||||||
|
|
||||||
However the PCI controller IP integrated in some SoCs is capable of operating
|
|
||||||
either in Root Complex mode or Endpoint mode. PCI Endpoint Framework will
|
|
||||||
add endpoint mode support in Linux. This will help to run Linux in an
|
|
||||||
EP system which can have a wide variety of use cases from testing or
|
|
||||||
validation, co-processor accelerator, etc.
|
|
||||||
|
|
||||||
2. PCI Endpoint Core
|
|
||||||
|
|
||||||
The PCI Endpoint Core layer comprises 3 components: the Endpoint Controller
|
|
||||||
library, the Endpoint Function library, and the configfs layer to bind the
|
|
||||||
endpoint function with the endpoint controller.
|
|
||||||
|
|
||||||
2.1 PCI Endpoint Controller(EPC) Library
|
|
||||||
|
|
||||||
The EPC library provides APIs to be used by the controller that can operate
|
|
||||||
in endpoint mode. It also provides APIs to be used by function driver/library
|
|
||||||
in order to implement a particular endpoint function.
|
|
||||||
|
|
||||||
2.1.1 APIs for the PCI controller Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI controller driver.
|
|
||||||
|
|
||||||
*) devm_pci_epc_create()/pci_epc_create()
|
|
||||||
|
|
||||||
The PCI controller driver should implement the following ops:
|
|
||||||
* write_header: ops to populate configuration space header
|
|
||||||
* set_bar: ops to configure the BAR
|
|
||||||
* clear_bar: ops to reset the BAR
|
|
||||||
* alloc_addr_space: ops to allocate in PCI controller address space
|
|
||||||
* free_addr_space: ops to free the allocated address space
|
|
||||||
* raise_irq: ops to raise a legacy, MSI or MSI-X interrupt
|
|
||||||
* start: ops to start the PCI link
|
|
||||||
* stop: ops to stop the PCI link
|
|
||||||
|
|
||||||
The PCI controller driver can then create a new EPC device by invoking
|
|
||||||
devm_pci_epc_create()/pci_epc_create().
|
|
||||||
|
|
||||||
*) devm_pci_epc_destroy()/pci_epc_destroy()
|
|
||||||
|
|
||||||
The PCI controller driver can destroy the EPC device created by either
|
|
||||||
devm_pci_epc_create() or pci_epc_create() using devm_pci_epc_destroy() or
|
|
||||||
pci_epc_destroy().
|
|
||||||
|
|
||||||
*) pci_epc_linkup()
|
|
||||||
|
|
||||||
In order to notify all the function devices that the EPC device to which
|
|
||||||
they are linked has established a link with the host, the PCI controller
|
|
||||||
driver should invoke pci_epc_linkup().
|
|
||||||
|
|
||||||
*) pci_epc_mem_init()
|
|
||||||
|
|
||||||
Initialize the pci_epc_mem structure used for allocating EPC addr space.
|
|
||||||
|
|
||||||
*) pci_epc_mem_exit()
|
|
||||||
|
|
||||||
Cleanup the pci_epc_mem structure allocated during pci_epc_mem_init().
|
|
||||||
|
|
||||||
2.1.2 APIs for the PCI Endpoint Function Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI endpoint function driver.
|
|
||||||
|
|
||||||
*) pci_epc_write_header()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_write_header() to
|
|
||||||
write the standard configuration header to the endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_set_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_set_bar() to configure
|
|
||||||
the Base Address Register in order for the host to assign PCI addr space.
|
|
||||||
Register space of the function driver is usually configured
|
|
||||||
using this API.
|
|
||||||
|
|
||||||
*) pci_epc_clear_bar()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_clear_bar() to reset
|
|
||||||
the BAR.
|
|
||||||
|
|
||||||
*) pci_epc_raise_irq()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_raise_irq() to raise
|
|
||||||
Legacy Interrupt, MSI or MSI-X Interrupt.
|
|
||||||
|
|
||||||
*) pci_epc_mem_alloc_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_alloc_addr(), to
|
|
||||||
allocate memory address from EPC addr space which is required to access
|
|
||||||
RC's buffer
|
|
||||||
|
|
||||||
*) pci_epc_mem_free_addr()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should use pci_epc_mem_free_addr() to
|
|
||||||
free the memory space allocated using pci_epc_mem_alloc_addr().
|
|
||||||
|
|
||||||
2.1.3 Other APIs
|
|
||||||
|
|
||||||
There are other APIs provided by the EPC library. These are used for binding
|
|
||||||
the EPF device with EPC device. pci-ep-cfs.c can be used as reference for
|
|
||||||
using these APIs.
|
|
||||||
|
|
||||||
*) pci_epc_get()
|
|
||||||
|
|
||||||
Get a reference to the PCI endpoint controller based on the device name of
|
|
||||||
the controller.
|
|
||||||
|
|
||||||
*) pci_epc_put()
|
|
||||||
|
|
||||||
Release the reference to the PCI endpoint controller obtained using
|
|
||||||
pci_epc_get()
|
|
||||||
|
|
||||||
*) pci_epc_add_epf()
|
|
||||||
|
|
||||||
Add a PCI endpoint function to a PCI endpoint controller. A PCIe device
|
|
||||||
can have up to 8 functions according to the specification.
|
|
||||||
|
|
||||||
*) pci_epc_remove_epf()
|
|
||||||
|
|
||||||
Remove the PCI endpoint function from PCI endpoint controller.
|
|
||||||
|
|
||||||
*) pci_epc_start()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_start() once it
|
|
||||||
has configured the endpoint function and wants to start the PCI link.
|
|
||||||
|
|
||||||
*) pci_epc_stop()
|
|
||||||
|
|
||||||
The PCI endpoint function driver should invoke pci_epc_stop() to stop
|
|
||||||
the PCI LINK.
|
|
||||||
|
|
||||||
2.2 PCI Endpoint Function(EPF) Library
|
|
||||||
|
|
||||||
The EPF library provides APIs to be used by the function driver and the EPC
|
|
||||||
library to provide endpoint mode functionality.
|
|
||||||
|
|
||||||
2.2.1 APIs for the PCI Endpoint Function Driver
|
|
||||||
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI endpoint function driver.
|
|
||||||
|
|
||||||
*) pci_epf_register_driver()
|
|
||||||
|
|
||||||
The PCI Endpoint Function driver should implement the following ops:
|
|
||||||
* bind: ops to perform when a EPC device has been bound to EPF device
|
|
||||||
* unbind: ops to perform when a binding has been lost between a EPC
|
|
||||||
device and EPF device
|
|
||||||
* linkup: ops to perform when the EPC device has established a
|
|
||||||
connection with a host system
|
|
||||||
|
|
||||||
The PCI Function driver can then register the PCI EPF driver by using
|
|
||||||
pci_epf_register_driver().
|
|
||||||
|
|
||||||
*) pci_epf_unregister_driver()
|
|
||||||
|
|
||||||
The PCI Function driver can unregister the PCI EPF driver by using
|
|
||||||
pci_epf_unregister_driver().
|
|
||||||
|
|
||||||
*) pci_epf_alloc_space()
|
|
||||||
|
|
||||||
The PCI Function driver can allocate space for a particular BAR using
|
|
||||||
pci_epf_alloc_space().
|
|
||||||
|
|
||||||
*) pci_epf_free_space()
|
|
||||||
|
|
||||||
The PCI Function driver can free the allocated space
|
|
||||||
(using pci_epf_alloc_space) by invoking pci_epf_free_space().
|
|
||||||
|
|
||||||
2.2.2 APIs for the PCI Endpoint Controller Library
|
|
||||||
This section lists the APIs that the PCI Endpoint core provides to be used
|
|
||||||
by the PCI endpoint controller library.
|
|
||||||
|
|
||||||
*) pci_epf_linkup()
|
|
||||||
|
|
||||||
The PCI endpoint controller library invokes pci_epf_linkup() when the
|
|
||||||
EPC device has established the connection to the host.
|
|
||||||
|
|
||||||
2.2.2 Other APIs
|
|
||||||
There are other APIs provided by the EPF library. These are used to notify
|
|
||||||
the function driver when the EPF device is bound to the EPC device.
|
|
||||||
pci-ep-cfs.c can be used as reference for using these APIs.
|
|
||||||
|
|
||||||
*) pci_epf_create()
|
|
||||||
|
|
||||||
Create a new PCI EPF device by passing the name of the PCI EPF device.
|
|
||||||
This name will be used to bind the the EPF device to a EPF driver.
|
|
||||||
|
|
||||||
*) pci_epf_destroy()
|
|
||||||
|
|
||||||
Destroy the created PCI EPF device.
|
|
||||||
|
|
||||||
*) pci_epf_bind()
|
|
||||||
|
|
||||||
pci_epf_bind() should be invoked when the EPF device has been bound to
|
|
||||||
a EPC device.
|
|
||||||
|
|
||||||
*) pci_epf_unbind()
|
|
||||||
|
|
||||||
pci_epf_unbind() should be invoked when the binding between EPC device
|
|
||||||
and EPF device is lost.
|
|
103
Documentation/PCI/endpoint/pci-test-function.rst
Normal file
|
@ -0,0 +1,103 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=================
|
||||||
|
PCI Test Function
|
||||||
|
=================
|
||||||
|
|
||||||
|
:Author: Kishon Vijay Abraham I <kishon@ti.com>
|
||||||
|
|
||||||
|
Traditionally, the PCI RC has been validated by using standard
|
||||||
|
PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards.
|
||||||
|
However, with the addition of EP-core in the Linux kernel, it is possible
|
||||||
|
to configure a PCI controller that can operate in EP mode to work as
|
||||||
|
a test device.
|
||||||
|
|
||||||
|
The PCI endpoint test device is a virtual device (defined in software)
|
||||||
|
used to test the endpoint functionality and serve as a sample driver
|
||||||
|
for other PCI endpoint devices (to use the EP framework).
|
||||||
|
|
||||||
|
The PCI endpoint test device has the following registers:
|
||||||
|
|
||||||
|
1) PCI_ENDPOINT_TEST_MAGIC
|
||||||
|
2) PCI_ENDPOINT_TEST_COMMAND
|
||||||
|
3) PCI_ENDPOINT_TEST_STATUS
|
||||||
|
4) PCI_ENDPOINT_TEST_SRC_ADDR
|
||||||
|
5) PCI_ENDPOINT_TEST_DST_ADDR
|
||||||
|
6) PCI_ENDPOINT_TEST_SIZE
|
||||||
|
7) PCI_ENDPOINT_TEST_CHECKSUM
|
||||||
|
8) PCI_ENDPOINT_TEST_IRQ_TYPE
|
||||||
|
9) PCI_ENDPOINT_TEST_IRQ_NUMBER
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_MAGIC
|
||||||
|
|
||||||
|
This register will be used to test BAR0. A known pattern will be written
|
||||||
|
and read back from MAGIC register to verify BAR0.
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_COMMAND
|
||||||
|
|
||||||
|
This register will be used by the host driver to indicate the function
|
||||||
|
that the endpoint device must perform.
|
||||||
|
|
||||||
|
========  ================================================================
Bitfield  Description
========  ================================================================
Bit 0     raise legacy IRQ
Bit 1     raise MSI IRQ
Bit 2     raise MSI-X IRQ
Bit 3     read command (read data from RC buffer)
Bit 4     write command (write data to RC buffer)
Bit 5     copy command (copy data from one RC buffer to another RC buffer)
========  ================================================================
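As an illustration of the COMMAND bit layout, the following Python sketch decodes a raw register value into the operations it requests. The names `EpTestCommand` and `decode_command` are hypothetical and do not appear in the kernel sources; only the bit positions are taken from the table.

```python
from enum import IntFlag


class EpTestCommand(IntFlag):
    """Bits of PCI_ENDPOINT_TEST_COMMAND, as listed in the table."""
    RAISE_LEGACY_IRQ = 1 << 0
    RAISE_MSI_IRQ = 1 << 1
    RAISE_MSIX_IRQ = 1 << 2
    READ = 1 << 3    # read data from RC buffer
    WRITE = 1 << 4   # write data to RC buffer
    COPY = 1 << 5    # copy data from one RC buffer to another


def decode_command(value):
    """Return the names of all command bits set in a raw register value."""
    return [flag.name for flag in EpTestCommand if value & flag]


print(decode_command(0x0A))  # bits 1 and 3 are set
```

For the value 0x0A, bits 1 and 3 are set, so the sketch reports a request to raise an MSI IRQ together with a read command.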
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_STATUS
|
||||||
|
|
||||||
|
This register reflects the status of the PCI endpoint device.
|
||||||
|
|
||||||
|
========  ==============================
Bitfield  Description
========  ==============================
Bit 0     read success
Bit 1     read fail
Bit 2     write success
Bit 3     write fail
Bit 4     copy success
Bit 5     copy fail
Bit 6     IRQ raised
Bit 7     source address is invalid
Bit 8     destination address is invalid
========  ==============================
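A host-side tool might interpret the STATUS register along the following lines. This is a minimal sketch, assuming the bit assignments in the table; `STATUS_BITS`, `describe_status` and `has_failure` are hypothetical helper names, not kernel symbols.

```python
# Bit position -> meaning, copied from the STATUS register table.
STATUS_BITS = {
    0: "read success",
    1: "read fail",
    2: "write success",
    3: "write fail",
    4: "copy success",
    5: "copy fail",
    6: "IRQ raised",
    7: "source address is invalid",
    8: "destination address is invalid",
}

# Bits that report a failed operation or an invalid address.
FAILURE_MASK = (1 << 1) | (1 << 3) | (1 << 5) | (1 << 7) | (1 << 8)


def describe_status(value):
    """List the meaning of every bit set in a raw STATUS value."""
    return [text for bit, text in STATUS_BITS.items() if value & (1 << bit)]


def has_failure(value):
    """Return True if any failure or invalid-address bit is set."""
    return bool(value & FAILURE_MASK)


print(describe_status(0x41), has_failure(0x41))
```

Here 0x41 has bits 0 and 6 set, which corresponds to a successful read followed by a raised IRQ, with no failure bits.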
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_SRC_ADDR
|
||||||
|
|
||||||
|
This register contains the source address (RC buffer address) for the
|
||||||
|
COPY/READ command.
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_DST_ADDR
|
||||||
|
|
||||||
|
This register contains the destination address (RC buffer address) for
|
||||||
|
the COPY/WRITE command.
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_IRQ_TYPE
|
||||||
|
|
||||||
|
This register contains the interrupt type (Legacy/MSI/MSI-X) triggered
for the READ/WRITE/COPY and raise IRQ (Legacy/MSI/MSI-X) commands.
|
||||||
|
|
||||||
|
Possible types:
|
||||||
|
|
||||||
|
======  ==
Legacy  0
MSI     1
MSI-X   2
======  ==
|
||||||
|
|
||||||
|
* PCI_ENDPOINT_TEST_IRQ_NUMBER
|
||||||
|
|
||||||
|
This register contains the ID of the triggered interrupt.
|
||||||
|
|
||||||
|
Admissible values:
|
||||||
|
|
||||||
|
======  ===========
Legacy  0
MSI     [1 .. 32]
MSI-X   [1 .. 2048]
======  ===========
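The IRQ_TYPE and IRQ_NUMBER values constrain each other, which can be checked before a request is issued. The sketch below assumes only the type codes and ranges listed in the two tables; the constant and function names are hypothetical.

```python
# Type codes from the PCI_ENDPOINT_TEST_IRQ_TYPE table.
IRQ_TYPE_LEGACY = 0
IRQ_TYPE_MSI = 1
IRQ_TYPE_MSIX = 2

# Inclusive (low, high) bounds from the PCI_ENDPOINT_TEST_IRQ_NUMBER table.
IRQ_NUMBER_RANGE = {
    IRQ_TYPE_LEGACY: (0, 0),
    IRQ_TYPE_MSI: (1, 32),
    IRQ_TYPE_MSIX: (1, 2048),
}


def irq_request_is_valid(irq_type, irq_number):
    """Check that irq_number is admissible for the given irq_type."""
    bounds = IRQ_NUMBER_RANGE.get(irq_type)
    if bounds is None:
        return False  # unknown interrupt type
    low, high = bounds
    return low <= irq_number <= high


print(irq_request_is_valid(IRQ_TYPE_MSI, 16))
print(irq_request_is_valid(IRQ_TYPE_MSI, 33))
```

Note that a number inside the admissible range can still fail at run time if the function was configured with fewer interrupts than the maximum.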
|
|
@ -1,87 +0,0 @@
|
||||||
PCI TEST
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
Traditionally PCI RC has always been validated by using standard
|
|
||||||
PCI cards like ethernet PCI cards or USB PCI cards or SATA PCI cards.
|
|
||||||
However with the addition of EP-core in linux kernel, it is possible
|
|
||||||
to configure a PCI controller that can operate in EP mode to work as
|
|
||||||
a test device.
|
|
||||||
|
|
||||||
The PCI endpoint test device is a virtual device (defined in software)
|
|
||||||
used to test the endpoint functionality and serve as a sample driver
|
|
||||||
for other PCI endpoint devices (to use the EP framework).
|
|
||||||
|
|
||||||
The PCI endpoint test device has the following registers:
|
|
||||||
|
|
||||||
1) PCI_ENDPOINT_TEST_MAGIC
|
|
||||||
2) PCI_ENDPOINT_TEST_COMMAND
|
|
||||||
3) PCI_ENDPOINT_TEST_STATUS
|
|
||||||
4) PCI_ENDPOINT_TEST_SRC_ADDR
|
|
||||||
5) PCI_ENDPOINT_TEST_DST_ADDR
|
|
||||||
6) PCI_ENDPOINT_TEST_SIZE
|
|
||||||
7) PCI_ENDPOINT_TEST_CHECKSUM
|
|
||||||
8) PCI_ENDPOINT_TEST_IRQ_TYPE
|
|
||||||
9) PCI_ENDPOINT_TEST_IRQ_NUMBER
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_MAGIC
|
|
||||||
|
|
||||||
This register will be used to test BAR0. A known pattern will be written
|
|
||||||
and read back from MAGIC register to verify BAR0.
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_COMMAND:
|
|
||||||
|
|
||||||
This register will be used by the host driver to indicate the function
|
|
||||||
that the endpoint device must perform.
|
|
||||||
|
|
||||||
Bitfield Description:
|
|
||||||
Bit 0 : raise legacy IRQ
|
|
||||||
Bit 1 : raise MSI IRQ
|
|
||||||
Bit 2 : raise MSI-X IRQ
|
|
||||||
Bit 3 : read command (read data from RC buffer)
|
|
||||||
Bit 4 : write command (write data to RC buffer)
|
|
||||||
Bit 5 : copy command (copy data from one RC buffer to another
|
|
||||||
RC buffer)
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_STATUS
|
|
||||||
|
|
||||||
This register reflects the status of the PCI endpoint device.
|
|
||||||
|
|
||||||
Bitfield Description:
|
|
||||||
Bit 0 : read success
|
|
||||||
Bit 1 : read fail
|
|
||||||
Bit 2 : write success
|
|
||||||
Bit 3 : write fail
|
|
||||||
Bit 4 : copy success
|
|
||||||
Bit 5 : copy fail
|
|
||||||
Bit 6 : IRQ raised
|
|
||||||
Bit 7 : source address is invalid
|
|
||||||
Bit 8 : destination address is invalid
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_SRC_ADDR
|
|
||||||
|
|
||||||
This register contains the source address (RC buffer address) for the
|
|
||||||
COPY/READ command.
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_DST_ADDR
|
|
||||||
|
|
||||||
This register contains the destination address (RC buffer address) for
|
|
||||||
the COPY/WRITE command.
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_IRQ_TYPE
|
|
||||||
|
|
||||||
This register contains the interrupt type (Legacy/MSI) triggered
|
|
||||||
for the READ/WRITE/COPY and raise IRQ (Legacy/MSI) commands.
|
|
||||||
|
|
||||||
Possible types:
|
|
||||||
- Legacy : 0
|
|
||||||
- MSI : 1
|
|
||||||
- MSI-X : 2
|
|
||||||
|
|
||||||
*) PCI_ENDPOINT_TEST_IRQ_NUMBER
|
|
||||||
|
|
||||||
This register contains the triggered ID interrupt.
|
|
||||||
|
|
||||||
Admissible values:
|
|
||||||
- Legacy : 0
|
|
||||||
- MSI : [1 .. 32]
|
|
||||||
- MSI-X : [1 .. 2048]
|
|
235
Documentation/PCI/endpoint/pci-test-howto.rst
Normal file
|
@ -0,0 +1,235 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
===================
|
||||||
|
PCI Test User Guide
|
||||||
|
===================
|
||||||
|
|
||||||
|
:Author: Kishon Vijay Abraham I <kishon@ti.com>
|
||||||
|
|
||||||
|
This document is a guide to help users use pci-epf-test function driver
|
||||||
|
and pci_endpoint_test host driver for testing PCI. The list of steps to
|
||||||
|
be followed on the host side and the EP side is given below.
|
||||||
|
|
||||||
|
Endpoint Device
|
||||||
|
===============
|
||||||
|
|
||||||
|
Endpoint Controller Devices
|
||||||
|
---------------------------
|
||||||
|
|
||||||
|
To find the list of endpoint controller devices in the system::
|
||||||
|
|
||||||
|
# ls /sys/class/pci_epc/
|
||||||
|
51000000.pcie_ep
|
||||||
|
|
||||||
|
If PCI_ENDPOINT_CONFIGFS is enabled::
|
||||||
|
|
||||||
|
# ls /sys/kernel/config/pci_ep/controllers
|
||||||
|
51000000.pcie_ep
|
||||||
|
|
||||||
|
|
||||||
|
Endpoint Function Drivers
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
To find the list of endpoint function drivers in the system::
|
||||||
|
|
||||||
|
# ls /sys/bus/pci-epf/drivers
|
||||||
|
pci_epf_test
|
||||||
|
|
||||||
|
If PCI_ENDPOINT_CONFIGFS is enabled::
|
||||||
|
|
||||||
|
# ls /sys/kernel/config/pci_ep/functions
|
||||||
|
pci_epf_test
|
||||||
|
|
||||||
|
|
||||||
|
Creating pci-epf-test Device
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
A PCI endpoint function device can be created using configfs. To create a
|
||||||
|
pci-epf-test device, the following commands can be used::
|
||||||
|
|
||||||
|
# mount -t configfs none /sys/kernel/config
|
||||||
|
# cd /sys/kernel/config/pci_ep/
|
||||||
|
# mkdir functions/pci_epf_test/func1
|
||||||
|
|
||||||
|
The "mkdir func1" above creates the pci-epf-test function device that will
|
||||||
|
be probed by the pci_epf_test driver.
|
||||||
|
|
||||||
|
The PCI endpoint framework populates the directory with the following
|
||||||
|
configurable fields::
|
||||||
|
|
||||||
|
# ls functions/pci_epf_test/func1
baseclass_code   interrupt_pin    progif_code    subsys_id
cache_line_size  msi_interrupts   revid          subsys_vendorid
deviceid         msix_interrupts  subclass_code  vendorid
|
||||||
|
|
||||||
|
The PCI endpoint function driver populates these entries with default values
|
||||||
|
when the device is bound to the driver. The pci-epf-test driver populates
|
||||||
|
vendorid with 0xffff and interrupt_pin with 0x0001::
|
||||||
|
|
||||||
|
# cat functions/pci_epf_test/func1/vendorid
|
||||||
|
0xffff
|
||||||
|
# cat functions/pci_epf_test/func1/interrupt_pin
|
||||||
|
0x0001
|
||||||
|
|
||||||
|
|
||||||
|
Configuring pci-epf-test Device
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
The user can configure the pci-epf-test device using its configfs entries. In order
|
||||||
|
to change the vendorid, the deviceid and the number of MSI and MSI-X interrupts
used by the function
|
||||||
|
device, the following commands can be used::
|
||||||
|
|
||||||
|
# echo 0x104c > functions/pci_epf_test/func1/vendorid
|
||||||
|
# echo 0xb500 > functions/pci_epf_test/func1/deviceid
|
||||||
|
# echo 16 > functions/pci_epf_test/func1/msi_interrupts
|
||||||
|
# echo 8 > functions/pci_epf_test/func1/msix_interrupts
|
||||||
|
|
||||||
|
|
||||||
|
Binding pci-epf-test Device to EP Controller
|
||||||
|
--------------------------------------------
|
||||||
|
|
||||||
|
In order for the endpoint function device to be useful, it has to be bound to
|
||||||
|
a PCI endpoint controller driver. Use the configfs to bind the function
|
||||||
|
device to one of the controller drivers present in the system::
|
||||||
|
|
||||||
|
# ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/
|
||||||
|
|
||||||
|
Once the above step is completed, the PCI endpoint is ready to establish a link
|
||||||
|
with the host.
|
||||||
|
|
||||||
|
|
||||||
|
Start the Link
|
||||||
|
--------------
|
||||||
|
|
||||||
|
In order for the endpoint device to establish a link with the host, the _start_
|
||||||
|
field should be populated with '1'::
|
||||||
|
|
||||||
|
# echo 1 > controllers/51000000.pcie_ep/start
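The endpoint-side steps (create, configure, bind, start) can also be scripted. The following Python sketch performs the same configfs operations as the shell commands in this guide. The function name `setup_pci_epf_test` and its `root` parameter are hypothetical and exist only for this example. It assumes configfs is already mounted and that the script runs with sufficient privileges. It passes an absolute path as the symlink target, whereas the shell example uses a relative one.

```python
import os


def setup_pci_epf_test(controller, func="func1",
                       root="/sys/kernel/config/pci_ep"):
    """Create, configure, bind and start a pci-epf-test function device."""
    func_dir = os.path.join(root, "functions", "pci_epf_test", func)
    ctrl_dir = os.path.join(root, "controllers", controller)

    # mkdir functions/pci_epf_test/func1
    os.mkdir(func_dir)

    # echo <value> > functions/pci_epf_test/func1/<field>
    fields = {
        "vendorid": "0x104c",
        "deviceid": "0xb500",
        "msi_interrupts": "16",
        "msix_interrupts": "8",
    }
    for name, value in fields.items():
        with open(os.path.join(func_dir, name), "w") as attr:
            attr.write(value + "\n")

    # ln -s functions/pci_epf_test/func1 controllers/<controller>/
    os.symlink(func_dir, os.path.join(ctrl_dir, func))

    # echo 1 > controllers/<controller>/start
    with open(os.path.join(ctrl_dir, "start"), "w") as attr:
        attr.write("1\n")
```

On the endpoint system this would be invoked as `setup_pci_epf_test("51000000.pcie_ep")`, using the controller name reported under /sys/class/pci_epc/.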
|
||||||
|
|
||||||
|
|
||||||
|
RootComplex Device
|
||||||
|
==================
|
||||||
|
|
||||||
|
lspci Output
|
||||||
|
------------
|
||||||
|
|
||||||
|
Note that the devices listed here correspond to the values populated in the
"Configuring pci-epf-test Device" section
|
||||||
|
above::
|
||||||
|
|
||||||
|
00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01)
|
||||||
|
01:00.0 Unassigned class [ff00]: Texas Instruments Device b500
|
||||||
|
|
||||||
|
|
||||||
|
Using Endpoint Test function Device
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
|
pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint
|
||||||
|
tests. To compile this tool the following commands should be used::
|
||||||
|
|
||||||
|
# cd <kernel-dir>
|
||||||
|
# make -C tools/pci
|
||||||
|
|
||||||
|
or if you desire to compile and install in your system::
|
||||||
|
|
||||||
|
# cd <kernel-dir>
|
||||||
|
# make -C tools/pci install
|
||||||
|
|
||||||
|
The tool and script will be located in <rootfs>/usr/bin/
|
||||||
|
|
||||||
|
|
||||||
|
pcitest.sh Output
|
||||||
|
~~~~~~~~~~~~~~~~~
|
||||||
|
::
|
||||||
|
|
||||||
|
# pcitest.sh
|
||||||
|
BAR tests
|
||||||
|
|
||||||
|
BAR0: OKAY
|
||||||
|
BAR1: OKAY
|
||||||
|
BAR2: OKAY
|
||||||
|
BAR3: OKAY
|
||||||
|
BAR4: NOT OKAY
|
||||||
|
BAR5: NOT OKAY
|
||||||
|
|
||||||
|
Interrupt tests
|
||||||
|
|
||||||
|
SET IRQ TYPE TO LEGACY: OKAY
|
||||||
|
LEGACY IRQ: NOT OKAY
|
||||||
|
SET IRQ TYPE TO MSI: OKAY
|
||||||
|
MSI1: OKAY
|
||||||
|
MSI2: OKAY
|
||||||
|
MSI3: OKAY
|
||||||
|
MSI4: OKAY
|
||||||
|
MSI5: OKAY
|
||||||
|
MSI6: OKAY
|
||||||
|
MSI7: OKAY
|
||||||
|
MSI8: OKAY
|
||||||
|
MSI9: OKAY
|
||||||
|
MSI10: OKAY
|
||||||
|
MSI11: OKAY
|
||||||
|
MSI12: OKAY
|
||||||
|
MSI13: OKAY
|
||||||
|
MSI14: OKAY
|
||||||
|
MSI15: OKAY
|
||||||
|
MSI16: OKAY
|
||||||
|
MSI17: NOT OKAY
|
||||||
|
MSI18: NOT OKAY
|
||||||
|
MSI19: NOT OKAY
|
||||||
|
MSI20: NOT OKAY
|
||||||
|
MSI21: NOT OKAY
|
||||||
|
MSI22: NOT OKAY
|
||||||
|
MSI23: NOT OKAY
|
||||||
|
MSI24: NOT OKAY
|
||||||
|
MSI25: NOT OKAY
|
||||||
|
MSI26: NOT OKAY
|
||||||
|
MSI27: NOT OKAY
|
||||||
|
MSI28: NOT OKAY
|
||||||
|
MSI29: NOT OKAY
|
||||||
|
MSI30: NOT OKAY
|
||||||
|
MSI31: NOT OKAY
|
||||||
|
MSI32: NOT OKAY
|
||||||
|
SET IRQ TYPE TO MSI-X: OKAY
|
||||||
|
MSI-X1: OKAY
|
||||||
|
MSI-X2: OKAY
|
||||||
|
MSI-X3: OKAY
|
||||||
|
MSI-X4: OKAY
|
||||||
|
MSI-X5: OKAY
|
||||||
|
MSI-X6: OKAY
|
||||||
|
MSI-X7: OKAY
|
||||||
|
MSI-X8: OKAY
|
||||||
|
MSI-X9: NOT OKAY
|
||||||
|
MSI-X10: NOT OKAY
|
||||||
|
MSI-X11: NOT OKAY
|
||||||
|
MSI-X12: NOT OKAY
|
||||||
|
MSI-X13: NOT OKAY
|
||||||
|
MSI-X14: NOT OKAY
|
||||||
|
MSI-X15: NOT OKAY
|
||||||
|
MSI-X16: NOT OKAY
|
||||||
|
[...]
|
||||||
|
MSI-X2047: NOT OKAY
|
||||||
|
MSI-X2048: NOT OKAY
|
||||||
|
|
||||||
|
Read Tests
|
||||||
|
|
||||||
|
SET IRQ TYPE TO MSI: OKAY
|
||||||
|
READ ( 1 bytes): OKAY
|
||||||
|
READ ( 1024 bytes): OKAY
|
||||||
|
READ ( 1025 bytes): OKAY
|
||||||
|
READ (1024000 bytes): OKAY
|
||||||
|
READ (1024001 bytes): OKAY
|
||||||
|
|
||||||
|
Write Tests
|
||||||
|
|
||||||
|
WRITE ( 1 bytes): OKAY
|
||||||
|
WRITE ( 1024 bytes): OKAY
|
||||||
|
WRITE ( 1025 bytes): OKAY
|
||||||
|
WRITE (1024000 bytes): OKAY
|
||||||
|
WRITE (1024001 bytes): OKAY
|
||||||
|
|
||||||
|
Copy Tests
|
||||||
|
|
||||||
|
COPY ( 1 bytes): OKAY
|
||||||
|
COPY ( 1024 bytes): OKAY
|
||||||
|
COPY ( 1025 bytes): OKAY
|
||||||
|
COPY (1024000 bytes): OKAY
|
||||||
|
COPY (1024001 bytes): OKAY
|
|
@ -1,206 +0,0 @@
|
||||||
PCI TEST USERGUIDE
|
|
||||||
Kishon Vijay Abraham I <kishon@ti.com>
|
|
||||||
|
|
||||||
This document is a guide to help users use pci-epf-test function driver
|
|
||||||
and pci_endpoint_test host driver for testing PCI. The list of steps to
|
|
||||||
be followed in the host side and EP side is given below.
|
|
||||||
|
|
||||||
1. Endpoint Device
|
|
||||||
|
|
||||||
1.1 Endpoint Controller Devices
|
|
||||||
|
|
||||||
To find the list of endpoint controller devices in the system:
|
|
||||||
|
|
||||||
# ls /sys/class/pci_epc/
|
|
||||||
51000000.pcie_ep
|
|
||||||
|
|
||||||
If PCI_ENDPOINT_CONFIGFS is enabled
|
|
||||||
# ls /sys/kernel/config/pci_ep/controllers
|
|
||||||
51000000.pcie_ep
|
|
||||||
|
|
||||||
1.2 Endpoint Function Drivers
|
|
||||||
|
|
||||||
To find the list of endpoint function drivers in the system:
|
|
||||||
|
|
||||||
# ls /sys/bus/pci-epf/drivers
|
|
||||||
pci_epf_test
|
|
||||||
|
|
||||||
If PCI_ENDPOINT_CONFIGFS is enabled
|
|
||||||
# ls /sys/kernel/config/pci_ep/functions
|
|
||||||
pci_epf_test
|
|
||||||
|
|
||||||
1.3 Creating pci-epf-test Device
|
|
||||||
|
|
||||||
PCI endpoint function device can be created using the configfs. To create
|
|
||||||
pci-epf-test device, the following commands can be used
|
|
||||||
|
|
||||||
# mount -t configfs none /sys/kernel/config
|
|
||||||
# cd /sys/kernel/config/pci_ep/
|
|
||||||
# mkdir functions/pci_epf_test/func1
|
|
||||||
|
|
||||||
The "mkdir func1" above creates the pci-epf-test function device that will
|
|
||||||
be probed by pci_epf_test driver.
|
|
||||||
|
|
||||||
The PCI endpoint framework populates the directory with the following
|
|
||||||
configurable fields.
|
|
||||||
|
|
||||||
# ls functions/pci_epf_test/func1
|
|
||||||
baseclass_code interrupt_pin progif_code subsys_id
|
|
||||||
cache_line_size msi_interrupts revid subsys_vendorid
|
|
||||||
deviceid msix_interrupts subclass_code vendorid
|
|
||||||
|
|
||||||
The PCI endpoint function driver populates these entries with default values
|
|
||||||
when the device is bound to the driver. The pci-epf-test driver populates
|
|
||||||
vendorid with 0xffff and interrupt_pin with 0x0001
|
|
||||||
|
|
||||||
# cat functions/pci_epf_test/func1/vendorid
|
|
||||||
0xffff
|
|
||||||
# cat functions/pci_epf_test/func1/interrupt_pin
|
|
||||||
0x0001
|
|
||||||
|
|
||||||
1.4 Configuring pci-epf-test Device
|
|
||||||
|
|
||||||
The user can configure the pci-epf-test device using configfs entry. In order
|
|
||||||
to change the vendorid and the number of MSI interrupts used by the function
|
|
||||||
device, the following commands can be used.
|
|
||||||
|
|
||||||
# echo 0x104c > functions/pci_epf_test/func1/vendorid
|
|
||||||
# echo 0xb500 > functions/pci_epf_test/func1/deviceid
|
|
||||||
# echo 16 > functions/pci_epf_test/func1/msi_interrupts
|
|
||||||
# echo 8 > functions/pci_epf_test/func1/msix_interrupts
|
|
||||||
|
|
||||||
1.5 Binding pci-epf-test Device to EP Controller
|
|
||||||
|
|
||||||
In order for the endpoint function device to be useful, it has to be bound to
|
|
||||||
a PCI endpoint controller driver. Use the configfs to bind the function
|
|
||||||
device to one of the controller driver present in the system.
|
|
||||||
|
|
||||||
# ln -s functions/pci_epf_test/func1 controllers/51000000.pcie_ep/
|
|
||||||
|
|
||||||
Once the above step is completed, the PCI endpoint is ready to establish a link
|
|
||||||
with the host.
|
|
||||||
|
|
||||||
1.6 Start the Link
|
|
||||||
|
|
||||||
In order for the endpoint device to establish a link with the host, the _start_
|
|
||||||
field should be populated with '1'.
|
|
||||||
|
|
||||||
# echo 1 > controllers/51000000.pcie_ep/start
|
|
||||||
|
|
||||||
2. RootComplex Device
|
|
||||||
|
|
||||||
2.1 lspci Output
|
|
||||||
|
|
||||||
Note that the devices listed here correspond to the value populated in 1.4 above
|
|
||||||
|
|
||||||
00:00.0 PCI bridge: Texas Instruments Device 8888 (rev 01)
|
|
||||||
01:00.0 Unassigned class [ff00]: Texas Instruments Device b500
|
|
||||||
|
|
||||||
2.2 Using Endpoint Test function Device
|
|
||||||
|
|
||||||
pcitest.sh added in tools/pci/ can be used to run all the default PCI endpoint
|
|
||||||
tests. To compile this tool the following commands should be used:
|
|
||||||
|
|
||||||
# cd <kernel-dir>
|
|
||||||
# make -C tools/pci
|
|
||||||
|
|
||||||
or if you desire to compile and install in your system:
|
|
||||||
|
|
||||||
# cd <kernel-dir>
|
|
||||||
# make -C tools/pci install
|
|
||||||
|
|
||||||
The tool and script will be located in <rootfs>/usr/bin/
|
|
||||||
|
|
||||||
2.2.1 pcitest.sh Output
|
|
||||||
# pcitest.sh
|
|
||||||
BAR tests
|
|
||||||
|
|
||||||
BAR0: OKAY
|
|
||||||
BAR1: OKAY
|
|
||||||
BAR2: OKAY
|
|
||||||
BAR3: OKAY
|
|
||||||
BAR4: NOT OKAY
|
|
||||||
BAR5: NOT OKAY
|
|
||||||
|
|
||||||
Interrupt tests
|
|
||||||
|
|
||||||
SET IRQ TYPE TO LEGACY: OKAY
|
|
||||||
LEGACY IRQ: NOT OKAY
|
|
||||||
SET IRQ TYPE TO MSI: OKAY
|
|
||||||
MSI1: OKAY
|
|
||||||
MSI2: OKAY
|
|
||||||
MSI3: OKAY
|
|
||||||
MSI4: OKAY
|
|
||||||
MSI5: OKAY
|
|
||||||
MSI6: OKAY
|
|
||||||
MSI7: OKAY
|
|
||||||
MSI8: OKAY
|
|
||||||
MSI9: OKAY
|
|
||||||
MSI10: OKAY
|
|
||||||
MSI11: OKAY
|
|
||||||
MSI12: OKAY
|
|
||||||
MSI13: OKAY
|
|
||||||
MSI14: OKAY
|
|
||||||
MSI15: OKAY
|
|
||||||
MSI16: OKAY
|
|
||||||
MSI17: NOT OKAY
|
|
||||||
MSI18: NOT OKAY
|
|
||||||
MSI19: NOT OKAY
|
|
||||||
MSI20: NOT OKAY
|
|
||||||
MSI21: NOT OKAY
|
|
||||||
MSI22: NOT OKAY
|
|
||||||
MSI23: NOT OKAY
|
|
||||||
MSI24: NOT OKAY
|
|
||||||
MSI25: NOT OKAY
|
|
||||||
MSI26: NOT OKAY
|
|
||||||
MSI27: NOT OKAY
|
|
||||||
MSI28: NOT OKAY
|
|
||||||
MSI29: NOT OKAY
|
|
||||||
MSI30: NOT OKAY
|
|
||||||
MSI31: NOT OKAY
|
|
||||||
MSI32: NOT OKAY
|
|
||||||
SET IRQ TYPE TO MSI-X: OKAY
|
|
||||||
MSI-X1: OKAY
|
|
||||||
MSI-X2: OKAY
|
|
||||||
MSI-X3: OKAY
|
|
||||||
MSI-X4: OKAY
|
|
||||||
MSI-X5: OKAY
|
|
||||||
MSI-X6: OKAY
|
|
||||||
MSI-X7: OKAY
|
|
||||||
MSI-X8: OKAY
|
|
||||||
MSI-X9: NOT OKAY
|
|
||||||
MSI-X10: NOT OKAY
|
|
||||||
MSI-X11: NOT OKAY
|
|
||||||
MSI-X12: NOT OKAY
|
|
||||||
MSI-X13: NOT OKAY
|
|
||||||
MSI-X14: NOT OKAY
|
|
||||||
MSI-X15: NOT OKAY
|
|
||||||
MSI-X16: NOT OKAY
|
|
||||||
[...]
|
|
||||||
MSI-X2047: NOT OKAY
|
|
||||||
MSI-X2048: NOT OKAY
|
|
||||||
|
|
||||||
Read Tests
|
|
||||||
|
|
||||||
SET IRQ TYPE TO MSI: OKAY
|
|
||||||
READ ( 1 bytes): OKAY
|
|
||||||
READ ( 1024 bytes): OKAY
|
|
||||||
READ ( 1025 bytes): OKAY
|
|
||||||
READ (1024000 bytes): OKAY
|
|
||||||
READ (1024001 bytes): OKAY
|
|
||||||
|
|
||||||
Write Tests
|
|
||||||
|
|
||||||
WRITE ( 1 bytes): OKAY
|
|
||||||
WRITE ( 1024 bytes): OKAY
|
|
||||||
WRITE ( 1025 bytes): OKAY
|
|
||||||
WRITE (1024000 bytes): OKAY
|
|
||||||
WRITE (1024001 bytes): OKAY
|
|
||||||
|
|
||||||
Copy Tests
|
|
||||||
|
|
||||||
COPY ( 1 bytes): OKAY
|
|
||||||
COPY ( 1024 bytes): OKAY
|
|
||||||
COPY ( 1025 bytes): OKAY
|
|
||||||
COPY (1024000 bytes): OKAY
|
|
||||||
COPY (1024001 bytes): OKAY
|
|
18
Documentation/PCI/index.rst
Normal file
|
@ -0,0 +1,18 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
=======================
|
||||||
|
Linux PCI Bus Subsystem
|
||||||
|
=======================
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
:numbered:
|
||||||
|
|
||||||
|
pci
|
||||||
|
picebus-howto
|
||||||
|
pci-iov-howto
|
||||||
|
msi-howto
|
||||||
|
acpi-info
|
||||||
|
pci-error-recovery
|
||||||
|
pcieaer-howto
|
||||||
|
endpoint/index
|
287
Documentation/PCI/msi-howto.rst
Normal file
|
@ -0,0 +1,287 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
.. include:: <isonum.txt>
|
||||||
|
|
||||||
|
==========================
|
||||||
|
The MSI Driver Guide HOWTO
|
||||||
|
==========================
|
||||||
|
|
||||||
|
:Authors: Tom L Nguyen; Martine Silbermann; Matthew Wilcox
|
||||||
|
|
||||||
|
:Copyright: 2003, 2008 Intel Corporation
|
||||||
|
|
||||||
|
About this guide
|
||||||
|
================
|
||||||
|
|
||||||
|
This guide describes the basics of Message Signaled Interrupts (MSIs),
|
||||||
|
the advantages of using MSI over traditional interrupt mechanisms, how
|
||||||
|
to change your driver to use MSI or MSI-X and some basic diagnostics to
|
||||||
|
try if a device doesn't support MSIs.
|
||||||
|
|
||||||
|
|
||||||
|
What are MSIs?
|
||||||
|
==============
|
||||||
|
|
||||||
|
A Message Signaled Interrupt is a write from the device to a special
|
||||||
|
address which causes an interrupt to be received by the CPU.
|
||||||
|
|
||||||
|
The MSI capability was first specified in PCI 2.2 and was later enhanced
|
||||||
|
in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X
|
||||||
|
capability was also introduced with PCI 3.0. It supports more interrupts
|
||||||
|
per device than MSI and allows interrupts to be independently configured.
|
||||||
|
|
||||||
|
Devices may support both MSI and MSI-X, but only one can be enabled at
|
||||||
|
a time.
|
||||||
|
|
||||||
|
|
||||||
|
Why use MSIs?
|
||||||
|
=============
|
||||||
|
|
||||||
|
There are three reasons why using MSIs can give an advantage over
|
||||||
|
traditional pin-based interrupts.
|
||||||
|
|
||||||
|
Pin-based PCI interrupts are often shared amongst several devices.
|
||||||
|
To support this, the kernel must call each interrupt handler associated
|
||||||
|
with an interrupt, which leads to reduced performance for the system as
|
||||||
|
a whole. MSIs are never shared, so this problem cannot arise.
|
||||||
|
|
||||||
|
When a device writes data to memory, then raises a pin-based interrupt,
|
||||||
|
it is possible that the interrupt may arrive before all the data has
|
||||||
|
arrived in memory (this becomes more likely with devices behind PCI-PCI
|
||||||
|
bridges). In order to ensure that all the data has arrived in memory,
|
||||||
|
the interrupt handler must read a register on the device which raised
|
||||||
|
the interrupt. PCI transaction ordering rules require that all the data
|
||||||
|
arrive in memory before the value may be returned from the register.
|
||||||
|
Using MSIs avoids this problem as the interrupt-generating write cannot
|
||||||
|
pass the data writes, so by the time the interrupt is raised, the driver
|
||||||
|
knows that all the data has arrived in memory.
|
||||||
|
|
||||||
|
PCI devices can only support a single pin-based interrupt per function.
|
||||||
|
Often drivers have to query the device to find out what event has
|
||||||
|
occurred, slowing down interrupt handling for the common case. With
|
||||||
|
MSIs, a device can support more interrupts, allowing each interrupt
|
||||||
|
to be specialised to a different purpose. One possible design gives
|
||||||
|
infrequent conditions (such as errors) their own interrupt which allows
|
||||||
|
the driver to handle the normal interrupt handling path more efficiently.
|
||||||
|
Other possible designs include giving one interrupt to each packet queue
|
||||||
|
in a network card or each port in a storage controller.
|
||||||
|
|
||||||
|
|
||||||
|
How to use MSIs
|
||||||
|
===============
|
||||||
|
|
||||||
|
PCI devices are initialised to use pin-based interrupts. The device
|
||||||
|
driver has to set up the device to use MSI or MSI-X. Not all machines
|
||||||
|
support MSIs correctly, and for those machines, the APIs described below
|
||||||
|
will simply fail and the device will continue to use pin-based interrupts.
|
||||||
|
|
||||||
|
Include kernel support for MSIs
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
|
||||||
|
option enabled. This option is only available on some architectures,
|
||||||
|
and it may depend on some other options also being set. For example,
|
||||||
|
on x86, you must also enable X86_UP_APIC or SMP in order to see the
|
||||||
|
CONFIG_PCI_MSI option.
|
||||||
|
|
||||||
|
Using MSI
|
||||||
|
---------
|
||||||
|
|
||||||
|
Most of the hard work is done for the driver in the PCI layer. The driver
|
||||||
|
simply has to request that the PCI layer set up the MSI capability for this
|
||||||
|
device.
|
||||||
|
|
||||||
|
To automatically use MSI or MSI-X interrupt vectors, use the following
|
||||||
|
function::
|
||||||
|
|
||||||
|
int pci_alloc_irq_vectors(struct pci_dev *dev, unsigned int min_vecs,
|
||||||
|
unsigned int max_vecs, unsigned int flags);
|
||||||
|
|
||||||
|
which allocates up to max_vecs interrupt vectors for a PCI device. It
|
||||||
|
returns the number of vectors allocated or a negative error. If the device
|
||||||
|
has a requirement for a minimum number of vectors, the driver can pass a
|
||||||
|
min_vecs argument set to this limit, and the PCI core will return -ENOSPC
|
||||||
|
if it can't meet the minimum number of vectors.
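
The min_vecs/max_vecs contract described above can be modelled in user space.
The sketch below is not kernel code: model_alloc_irq_vectors() and its
"supported" parameter are hypothetical names standing in for the PCI core and
for the device's vector limit. It only illustrates the capping and -ENOSPC
behaviour stated in the text.

```c
#include <assert.h>
#include <errno.h>

/*
 * User-space model of the contract described above (hypothetical helper,
 * not the kernel implementation): grant at most max_vecs vectors, capped
 * to what the device supports, and fail with -ENOSPC when fewer than
 * min_vecs can be granted.
 */
int model_alloc_irq_vectors(unsigned int supported,
                            unsigned int min_vecs,
                            unsigned int max_vecs)
{
        unsigned int granted;

        /* Precondition assumed by this sketch only. */
        assert(min_vecs >= 1 && min_vecs <= max_vecs);

        granted = max_vecs < supported ? max_vecs : supported;
        if (granted < min_vecs)
                return -ENOSPC;

        return (int)granted;
}
```

With a device limited to 8 vectors, a request for 1 to 32 vectors is granted
8, while a request for at least 16 fails with -ENOSPC.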
|
||||||
|
|
||||||
|
The flags argument is used to specify which type of interrupt can be used
|
||||||
|
by the device and the driver (PCI_IRQ_LEGACY, PCI_IRQ_MSI, PCI_IRQ_MSIX).
|
||||||
|
A convenient short-hand (PCI_IRQ_ALL_TYPES) is also available to ask for
|
||||||
|
any possible kind of interrupt. If the PCI_IRQ_AFFINITY flag is set,
|
||||||
|
pci_alloc_irq_vectors() will spread the interrupts around the available CPUs.
|
||||||
|
|
||||||
|
To get the Linux IRQ numbers to pass to request_irq() and free_irq() for the
|
||||||
|
vectors, use the following function::
|
||||||
|
|
||||||
|
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
|
||||||
|
|
||||||
|
Any allocated resources should be freed before removing the device using
|
||||||
|
the following function::
|
||||||
|
|
||||||
|
void pci_free_irq_vectors(struct pci_dev *dev);
|
||||||
|
|
||||||
|
If a device supports both MSI-X and MSI capabilities, this API will use the
|
||||||
|
MSI-X facilities in preference to the MSI facilities. MSI-X supports any
|
||||||
|
number of interrupts between 1 and 2048. In contrast, MSI is restricted to
|
||||||
|
a maximum of 32 interrupts (and must be a power of two). In addition, the
|
||||||
|
MSI interrupt vectors must be allocated consecutively, so the system might
|
||||||
|
not be able to allocate as many vectors for MSI as it could for MSI-X. On
|
||||||
|
some platforms, MSI interrupts must all be targeted at the same set of CPUs
|
||||||
|
whereas MSI-X interrupts can all be targeted at different CPUs.
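
The count limits quoted above can be written down as two small predicates.
This is an illustrative sketch only; the helper and macro names are
hypothetical and do not exist in the kernel.

```c
#include <stdbool.h>

/* Limits quoted in the text above. */
#define MODEL_MSI_MAX   32
#define MODEL_MSIX_MAX  2048

/* MSI: the count must be a power of two between 1 and 32. */
bool model_msi_count_ok(unsigned int n)
{
        return n >= 1 && n <= MODEL_MSI_MAX && (n & (n - 1)) == 0;
}

/* MSI-X: any count between 1 and 2048 is acceptable. */
bool model_msix_count_ok(unsigned int n)
{
        return n >= 1 && n <= MODEL_MSIX_MAX;
}
```

A count of 12 is therefore valid for MSI-X but not for MSI, which would need
8 or 16 instead.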
|
||||||
|
|
||||||
|
If a device supports neither MSI-X nor MSI, it will fall back to a single
|
||||||
|
legacy IRQ vector.
|
||||||
|
|
||||||
|
The typical usage of MSI or MSI-X interrupts is to allocate as many vectors
|
||||||
|
as possible, likely up to the limit supported by the device. If nvec is
|
||||||
|
larger than the number supported by the device it will automatically be
|
||||||
|
capped to the supported limit, so there is no need to query the number of
|
||||||
|
vectors supported beforehand::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
If a driver is unable or unwilling to deal with a variable number of MSI
|
||||||
|
interrupts it can request a particular number of interrupts by passing that
|
||||||
|
number to pci_alloc_irq_vectors() function as both 'min_vecs' and
|
||||||
|
'max_vecs' parameters::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, nvec, nvec, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
The best-known example of the request type described above is enabling
|
||||||
|
the single MSI mode for a device. It could be done by passing two 1s as
|
||||||
|
'min_vecs' and 'max_vecs'::
|
||||||
|
|
||||||
|
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
|
||||||
|
if (ret < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Some devices might not support using legacy line interrupts, in which case
|
||||||
|
the driver can specify that only MSI or MSI-X is acceptable::
|
||||||
|
|
||||||
|
nvec = pci_alloc_irq_vectors(pdev, 1, nvec, PCI_IRQ_MSI | PCI_IRQ_MSIX);
|
||||||
|
if (nvec < 0)
|
||||||
|
goto out_err;
|
||||||
|
|
||||||
|
Legacy APIs
|
||||||
|
-----------
|
||||||
|
|
||||||
|
The following old APIs to enable and disable MSI or MSI-X interrupts should
|
||||||
|
not be used in new code::
|
||||||
|
|
||||||
|
pci_enable_msi() /* deprecated */
|
||||||
|
pci_disable_msi() /* deprecated */
|
||||||
|
pci_enable_msix_range() /* deprecated */
|
||||||
|
pci_enable_msix_exact() /* deprecated */
|
||||||
|
pci_disable_msix() /* deprecated */
|
||||||
|
|
||||||
|
Additionally there are APIs to provide the number of supported MSI or MSI-X
|
||||||
|
vectors: pci_msi_vec_count() and pci_msix_vec_count(). In general these
|
||||||
|
should be avoided in favor of letting pci_alloc_irq_vectors() cap the
|
||||||
|
number of vectors. If you have a legitimate special use case for the count
|
||||||
|
of vectors we might have to revisit that decision and add a
|
||||||
|
pci_nr_irq_vectors() helper that handles MSI and MSI-X transparently.
|
||||||
|
|
||||||
|
Considerations when using MSIs
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
Spinlocks
|
||||||
|
~~~~~~~~~
|
||||||
|
|
||||||
|
Most device drivers have a per-device spinlock which is taken in the
|
||||||
|
interrupt handler. With pin-based interrupts or a single MSI, it is not
|
||||||
|
necessary to disable interrupts (Linux guarantees the same interrupt will
|
||||||
|
not be re-entered). If a device uses multiple interrupts, the driver
|
||||||
|
must disable interrupts while the lock is held. If the device sends
|
||||||
|
a different interrupt, the driver will deadlock trying to recursively
|
||||||
|
acquire the spinlock. Such deadlocks can be avoided by using
|
||||||
|
spin_lock_irqsave() or spin_lock_irq() which disable local interrupts
|
||||||
|
and acquire the lock (see Documentation/kernel-hacking/locking.rst).
|
||||||
|
|
||||||
|
How to tell whether MSI/MSI-X is enabled on a device
|
||||||
|
----------------------------------------------------
|
||||||
|
|
||||||
|
Using 'lspci -v' (as root) may show some devices with "MSI", "Message
|
||||||
|
Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities
|
||||||
|
has an 'Enable' flag which is followed with either "+" (enabled)
|
||||||
|
or "-" (disabled).
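
A script or small tool can pick out that flag mechanically. The sketch below
assumes the capability line contains the literal text "Enable" immediately
followed by "+" or "-", as described above; model_msi_enable_flag() is a
hypothetical helper, and the exact layout of lspci output may differ between
versions.

```c
#include <string.h>

/*
 * Returns 1 if the capability line carries "Enable+", 0 for "Enable-",
 * and -1 if no Enable flag is found on the line.
 */
int model_msi_enable_flag(const char *line)
{
        const char *p = strstr(line, "Enable");

        if (!p)
                return -1;

        p += strlen("Enable");
        if (*p == '+')
                return 1;
        if (*p == '-')
                return 0;

        return -1;
}
```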
|
||||||
|
|
||||||
|
|
||||||
|
MSI quirks
|
||||||
|
==========
|
||||||
|
|
||||||
|
Several PCI chipsets or devices are known not to support MSIs.
|
||||||
|
The PCI stack provides three ways to disable MSIs:
|
||||||
|
|
||||||
|
1. globally
|
||||||
|
2. on all devices behind a specific bridge
|
||||||
|
3. on a single device
|
||||||
|
|
||||||
|
Disabling MSIs globally
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
Some host chipsets simply don't support MSIs properly. If we're
|
||||||
|
lucky, the manufacturer knows this and has indicated it in the ACPI
|
||||||
|
FADT table. In this case, Linux automatically disables MSIs.
|
||||||
|
Some boards don't include this information in the table and so we have
|
||||||
|
to detect them ourselves. The complete list of these is found near the
|
||||||
|
quirk_disable_all_msi() function in drivers/pci/quirks.c.
|
||||||
|
|
||||||
|
If you have a board which has problems with MSIs, you can pass pci=nomsi
|
||||||
|
on the kernel command line to disable MSIs on all devices. It would be
|
||||||
|
in your best interests to report the problem to linux-pci@vger.kernel.org
|
||||||
|
including a full 'lspci -v' so we can add the quirks to the kernel.
|
||||||
|
|
||||||
|
Disabling MSIs below a bridge
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
|
Some PCI bridges are not able to route MSIs between busses properly.
|
||||||
|
In this case, MSIs must be disabled on all devices behind the bridge.
|
||||||
|
|
||||||
|
Some bridges allow you to enable MSIs by changing some bits in their
|
||||||
|
PCI configuration space (especially the Hypertransport chipsets such
|
||||||
|
as the nVidia nForce and Serverworks HT2000). As with host chipsets,
|
||||||
|
Linux mostly knows about them and automatically enables MSIs if it can.
|
||||||
|
If you have a bridge unknown to Linux, you can enable
|
||||||
|
MSIs in configuration space using whatever method you know works, then
|
||||||
|
enable MSIs on that bridge by doing::
|
||||||
|
|
||||||
|
echo 1 > /sys/bus/pci/devices/$bridge/msi_bus
|
||||||
|
|
||||||
|
where $bridge is the PCI address of the bridge you've enabled (e.g.
|
||||||
|
0000:00:0e.0).
|
||||||
|
|
||||||
|
To disable MSIs, echo 0 instead of 1. Changing this value should be
|
||||||
|
done with caution as it could break interrupt handling for all devices
|
||||||
|
below this bridge.
|
||||||
|
|
||||||
|
Again, please notify linux-pci@vger.kernel.org of any bridges that need
|
||||||
|
special handling.
|
||||||
|
|
||||||
|
Disabling MSIs on a single device
|
||||||
|
---------------------------------
|
||||||
|
|
||||||
|
Some devices are known to have faulty MSI implementations. Usually this
|
||||||
|
is handled in the individual device driver, but occasionally it's necessary
|
||||||
|
to handle this with a quirk. Some drivers have an option to disable use
|
||||||
|
of MSI. While this is a convenient workaround for the driver author,
|
||||||
|
it is not good practice, and should not be emulated.
|
||||||
|
|
||||||
|
Finding why MSIs are disabled on a device
|
||||||
|
-----------------------------------------
|
||||||
|
|
||||||
|
From the above three sections, you can see that there are many reasons
|
||||||
|
why MSIs may not be enabled for a given device. Your first step should
|
||||||
|
be to examine your dmesg carefully to determine whether MSIs are enabled
|
||||||
|
for your machine. You should also check your .config to be sure you
|
||||||
|
have enabled CONFIG_PCI_MSI.
|
||||||
|
|
||||||
|
Then, 'lspci -t' gives the list of bridges above a device. Reading
|
||||||
|
`/sys/bus/pci/devices/*/msi_bus` will tell you whether MSIs are enabled (1)
|
||||||
|
or disabled (0). If 0 is found in any of the msi_bus files belonging
|
||||||
|
to bridges between the PCI root and the device, MSIs are disabled.
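
The rule above amounts to a logical AND over the bridges on the path. A
minimal sketch, assuming the msi_bus values have already been read from sysfs
into an array (the helper name and the array are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * msi_bus[] holds the value read from the msi_bus file of each bridge
 * between the PCI root and the device (1 = enabled, 0 = disabled).
 * MSIs are usable only if no bridge on the path reports 0.
 */
bool model_msi_path_enabled(const int *msi_bus, size_t nr_bridges)
{
        for (size_t i = 0; i < nr_bridges; i++) {
                if (msi_bus[i] == 0)
                        return false;
        }

        return true;
}
```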
|
||||||
|
|
||||||
|
It is also worth checking the device driver to see whether it supports MSIs.
|
||||||
|
For example, it may contain calls to pci_alloc_irq_vectors() with the
|
||||||
|
PCI_IRQ_MSI or PCI_IRQ_MSIX flags.
|
424
Documentation/PCI/pci-error-recovery.rst
Normal file
|
@ -0,0 +1,424 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
==================
|
||||||
|
PCI Error Recovery
|
||||||
|
==================
|
||||||
|
|
||||||
|
|
||||||
|
:Authors: - Linas Vepstas <linasvepstas@gmail.com>
|
||||||
|
- Richard Lary <rlary@us.ibm.com>
|
||||||
|
- Mike Mason <mmlnx@us.ibm.com>
|
||||||
|
|
||||||
|
|
||||||
|
Many PCI bus controllers are able to detect a variety of hardware
|
||||||
|
PCI errors on the bus, such as parity errors on the data and address
|
||||||
|
buses, as well as SERR and PERR errors. Some of the more advanced
|
||||||
|
chipsets are able to deal with these errors; these include PCI-E chipsets,
|
||||||
|
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
|
||||||
|
pSeries boxes. A typical action taken is to disconnect the affected device,
|
||||||
|
halting all I/O to it. The goal of a disconnection is to avoid system
|
||||||
|
corruption; for example, to halt system memory corruption due to DMA's
|
||||||
|
to "wild" addresses. Typically, a reconnection mechanism is also
|
||||||
|
offered, so that the affected PCI device(s) are reset and put back
|
||||||
|
into working condition. The reset phase requires coordination
|
||||||
|
between the affected device drivers and the PCI controller chip.
|
||||||
|
This document describes a generic API for notifying device drivers
|
||||||
|
of a bus disconnection, and then performing error recovery.
|
||||||
|
This API is currently implemented in the 2.6.16 and later kernels.
|
||||||
|
|
||||||
|
Reporting and recovery is performed in several steps. First, when
|
||||||
|
a PCI hardware error has resulted in a bus disconnect, that event
|
||||||
|
is reported as soon as possible to all affected device drivers,
|
||||||
|
including multiple instances of a device driver on multi-function
|
||||||
|
cards. This allows device drivers to avoid deadlocking in spinloops,
|
||||||
|
waiting for some i/o-space register to change, when it never will.
|
||||||
|
It also gives the drivers a chance to defer incoming I/O as
|
||||||
|
needed.
|
||||||
|
|
||||||
|
Next, recovery is performed in several stages. Most of the complexity
|
||||||
|
is forced by the need to handle multi-function devices, that is,
|
||||||
|
devices that have multiple device drivers associated with them.
|
||||||
|
In the first stage, each driver is allowed to indicate what type
|
||||||
|
of reset it desires, the choices being a simple re-enabling of I/O
|
||||||
|
or requesting a slot reset.
|
||||||
|
|
||||||
|
If any driver requests a slot reset, that is what will be done.
|
||||||
|
|
||||||
|
After a reset and/or a re-enabling of I/O, all drivers are
|
||||||
|
again notified, so that they may then perform any device setup/config
|
||||||
|
that may be required. After these have all completed, a final
|
||||||
|
"resume normal operations" event is sent out.
|
||||||
|
|
||||||
|
The biggest reason for choosing a kernel-based implementation rather
|
||||||
|
than a user-space implementation was the need to deal with bus
|
||||||
|
disconnects of PCI devices attached to storage media, and, in particular,
|
||||||
|
disconnects from devices holding the root file system. If the root
|
||||||
|
file system is disconnected, a user-space mechanism would have to go
|
||||||
|
through a large number of contortions to complete recovery. Almost all
|
||||||
|
of the current Linux file systems are not tolerant of disconnection
|
||||||
|
from/reconnection to their underlying block device. By contrast,
|
||||||
|
bus errors are easy to manage in the device driver. Indeed, most
|
||||||
|
device drivers already handle very similar recovery procedures;
|
||||||
|
for example, the SCSI-generic layer already provides significant
|
||||||
|
mechanisms for dealing with SCSI bus errors and SCSI bus resets.
|
||||||
|
|
||||||
|
|
||||||
|
Detailed Design
|
||||||
|
===============
|
||||||
|
|
||||||
|
Design and implementation details below, based on a chain of
|
||||||
|
public email discussions with Ben Herrenschmidt, circa 5 April 2005.
|
||||||
|
|
||||||
|
The error recovery API support is exposed to the driver in the form of
|
||||||
|
a structure of function pointers pointed to by a new field in struct
|
||||||
|
pci_driver. A driver that fails to provide the structure is "non-aware",
|
||||||
|
and the actual recovery steps taken are platform dependent. The
|
||||||
|
arch/powerpc implementation will simulate a PCI hotplug remove/add.
|
||||||
|
|
||||||
|
This structure has the form::
|
||||||
|
|
||||||
|
struct pci_error_handlers
|
||||||
|
{
|
||||||
|
int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
|
||||||
|
int (*mmio_enabled)(struct pci_dev *dev);
|
||||||
|
int (*slot_reset)(struct pci_dev *dev);
|
||||||
|
void (*resume)(struct pci_dev *dev);
|
||||||
|
};
|
||||||
|
|
||||||
|
The possible channel states are::
|
||||||
|
|
||||||
|
enum pci_channel_state {
|
||||||
|
pci_channel_io_normal, /* I/O channel is in normal state */
|
||||||
|
pci_channel_io_frozen, /* I/O to channel is blocked */
|
||||||
|
pci_channel_io_perm_failure, /* PCI card is dead */
|
||||||
|
};
|
||||||
|
|
||||||
|
Possible return values are::
|
||||||
|
|
||||||
|
enum pci_ers_result {
|
||||||
|
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
|
||||||
|
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
|
||||||
|
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
|
||||||
|
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
|
||||||
|
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
|
||||||
|
};
|
||||||
|
|
||||||
|
A driver does not have to implement all of these callbacks; however,
|
||||||
|
if it implements any, it must implement error_detected(). If a callback
|
||||||
|
is not implemented, the corresponding feature is considered unsupported.
|
||||||
|
For example, if mmio_enabled() and resume() aren't there, then it
|
||||||
|
is assumed that the driver is not doing any direct recovery and requires
|
||||||
|
a slot reset. Typically a driver will want to know about
|
||||||
|
a slot_reset().
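
The two rules above (error_detected() is mandatory once any callback is
provided, and missing mmio_enabled() and resume() imply a slot reset) can be
checked mechanically. The sketch below is a user-space model: the structure
mirrors the one shown earlier, while the model_* helpers and example_*
callbacks are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

struct pci_dev;         /* opaque here; the real type lives in the kernel */

enum pci_channel_state {
        pci_channel_io_normal,
        pci_channel_io_frozen,
        pci_channel_io_perm_failure,
};

struct pci_error_handlers {
        int (*error_detected)(struct pci_dev *dev, enum pci_channel_state state);
        int (*mmio_enabled)(struct pci_dev *dev);
        int (*slot_reset)(struct pci_dev *dev);
        void (*resume)(struct pci_dev *dev);
};

/* A driver that implements any callback must implement error_detected(). */
bool model_handlers_valid(const struct pci_error_handlers *h)
{
        bool any = h->error_detected || h->mmio_enabled ||
                   h->slot_reset || h->resume;

        return !any || h->error_detected != NULL;
}

/* Without mmio_enabled() and resume(), a slot reset is assumed. */
bool model_assumes_slot_reset(const struct pci_error_handlers *h)
{
        return h->mmio_enabled == NULL && h->resume == NULL;
}

/* Placeholder callbacks for an imaginary driver; return values are dummies. */
int example_error_detected(struct pci_dev *dev, enum pci_channel_state state)
{
        (void)dev;
        (void)state;
        return 0;
}

int example_slot_reset(struct pci_dev *dev)
{
        (void)dev;
        return 0;
}
```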
|
||||||
|
|
||||||
|
The actual steps taken by a platform to recover from a PCI error
|
||||||
|
event will be platform-dependent, but will follow the general
|
||||||
|
sequence described below.
|
||||||
|
|
||||||
|
STEP 0: Error Event
|
||||||
|
-------------------
|
||||||
|
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
|
||||||
|
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
|
||||||
|
all writes are ignored.
|
||||||
|
|
||||||
|
|
||||||
|
STEP 1: Notification
|
||||||
|
--------------------
|
||||||
|
Platform calls the error_detected() callback on every instance of
|
||||||
|
every driver affected by the error.
|
||||||
|
|
||||||
|
At this point, the device might not be accessible anymore, depending on
|
||||||
|
the platform (the slot will be isolated on powerpc). The driver may
|
||||||
|
already have "noticed" the error because of a failing I/O, but this
|
||||||
|
is the proper "synchronization point", that is, it gives the driver
|
||||||
|
a chance to clean up, waiting for pending work (timers and the like)
|
||||||
|
to complete; it can take semaphores, schedule, etc... everything but
|
||||||
|
touch the device. Within this function and after it returns, the driver
|
||||||
|
shouldn't do any new IOs. Called in task context. This is sort of a
|
||||||
|
"quiesce" point. See note about interrupts at the end of this doc.
|
||||||
|
|
||||||
|
All drivers participating in this system must implement this call.
|
||||||
|
The driver must return one of the following result codes:
|
||||||
|
|
||||||
|
- PCI_ERS_RESULT_CAN_RECOVER
|
||||||
|
Driver returns this if it thinks it might be able to recover
|
||||||
|
the HW by just banging IOs or if it wants to be given
|
||||||
|
a chance to extract some diagnostic information (see
|
||||||
|
mmio_enabled, below).
|
||||||
|
- PCI_ERS_RESULT_NEED_RESET
|
||||||
|
Driver returns this if it can't recover without a
|
||||||
|
slot reset.
|
||||||
|
- PCI_ERS_RESULT_DISCONNECT
|
||||||
|
Driver returns this if it doesn't want to recover at all.
|
||||||
|
|
||||||
|
The next step taken will depend on the result codes returned by the
|
||||||
|
drivers.
|
||||||
|
|
||||||
|
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
|
||||||
|
then the platform should re-enable IOs on the slot (or do nothing in
|
||||||
|
particular, if the platform doesn't isolate slots), and recovery
|
||||||
|
proceeds to STEP 2 (MMIO Enabled).
|
||||||
|
|
||||||
|
If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
|
||||||
|
then recovery proceeds to STEP 4 (Slot Reset).
|
||||||
|
|
||||||
|
If the platform is unable to recover the slot, the next step
|
||||||
|
is STEP 6 (Permanent Failure).
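The decision described above can be sketched as a small function that merges the per-driver results into the next step. This is only an illustration under stated assumptions: the enum is a local copy of the one in this document, demo_next_step and demo_after_notification() are hypothetical names, and treating PCI_ERS_RESULT_DISCONNECT (or any unexpected value) as a permanent failure is a policy choice of this sketch, not something the API mandates.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in; a real platform uses the enum from <linux/pci.h>. */
enum pci_ers_result {
	PCI_ERS_RESULT_NONE,
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT,
	PCI_ERS_RESULT_RECOVERED,
};

/* Hypothetical names for the steps reachable from STEP 1. */
enum demo_next_step {
	DEMO_STEP_MMIO_ENABLED = 2,
	DEMO_STEP_SLOT_RESET = 4,
	DEMO_STEP_PERM_FAILURE = 6,
};

/* Merge the error_detected() results of all drivers on a segment. */
enum demo_next_step demo_after_notification(const enum pci_ers_result *results,
					    size_t n)
{
	size_t i;
	int need_reset = 0;

	assert(results != NULL && n > 0);

	for (i = 0; i < n; i++) {
		switch (results[i]) {
		case PCI_ERS_RESULT_CAN_RECOVER:
			break;
		case PCI_ERS_RESULT_NEED_RESET:
			/* One request is enough to reset the whole slot. */
			need_reset = 1;
			break;
		default:
			/* Sketch policy: give up on DISCONNECT or on any
			 * result that is not valid at this step. */
			return DEMO_STEP_PERM_FAILURE;
		}
	}

	return need_reset ? DEMO_STEP_SLOT_RESET : DEMO_STEP_MMIO_ENABLED;
}
```

The point of the sketch is the asymmetry: STEP 2 needs every driver to agree, while a single PCI_ERS_RESULT_NEED_RESET sends the whole slot to STEP 4.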
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The current powerpc implementation assumes that a device driver will
|
||||||
|
*not* schedule or semaphore in this routine; the current powerpc
|
||||||
|
implementation uses one kernel thread to notify all devices;
|
||||||
|
thus, if one device sleeps/schedules, all devices are affected.
|
||||||
|
Doing better requires complex multi-threaded logic in the error
|
||||||
|
recovery implementation (e.g. waiting for all notification threads
|
||||||
|
to "join" before proceeding with recovery.) This seems excessively
|
||||||
|
complex and not worth implementing.
|
||||||
|
|
||||||
|
The current powerpc implementation doesn't much care if the device
|
||||||
|
attempts I/O at this point, or not. I/O's will fail, returning
|
||||||
|
a value of 0xff on read, and writes will be dropped. If more than
|
||||||
|
EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
|
||||||
|
assumes that the device driver has gone into an infinite loop
|
||||||
|
and prints an error to syslog. A reboot is then required to
|
||||||
|
get the device working again.
|
||||||
|
|
||||||
|
STEP 2: MMIO Enabled
|
||||||
|
--------------------
|
||||||
|
The platform re-enables MMIO to the device (but typically not the
|
||||||
|
DMA), and then calls the mmio_enabled() callback on all affected
|
||||||
|
device drivers.
|
||||||
|
|
||||||
|
This is the "early recovery" call. IOs are allowed again, but DMA is
|
||||||
|
not, with some restrictions. This is NOT a callback for the driver to
|
||||||
|
start operations again, only to peek/poke at the device, extract diagnostic
|
||||||
|
information, if any, and eventually do things like trigger a device local
|
||||||
|
reset or some such, but not restart operations. This callback is made if
|
||||||
|
all drivers on a segment agree that they can try to recover and if no automatic
|
||||||
|
link reset was performed by the HW. If the platform can't just re-enable IOs
|
||||||
|
without a slot reset or a link reset, it will not call this callback, and
|
||||||
|
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The following is proposed; no platform implements this yet:
|
||||||
|
Proposal: All I/O's should be done _synchronously_ from within
|
||||||
|
this callback, errors triggered by them will be returned via
|
||||||
|
the normal pci_check_whatever() API, no new error_detected()
|
||||||
|
callback will be issued due to an error happening here. However,
|
||||||
|
such an error might cause IOs to be re-blocked for the whole
|
||||||
|
segment, and thus invalidate the recovery that other devices
|
||||||
|
on the same segment might have done, forcing the whole segment
|
||||||
|
into one of the next states, that is, link reset or slot reset.
|
||||||
|
|
||||||
|
The driver should return one of the following result codes:
|
||||||
|
- PCI_ERS_RESULT_RECOVERED
|
||||||
|
Driver returns this if it thinks the device is fully
|
||||||
|
functional and thinks it is ready to start
|
||||||
|
normal driver operations again. There is no
|
||||||
|
guarantee that the driver will actually be
|
||||||
|
allowed to proceed, as another driver on the
|
||||||
|
same segment might have failed and thus triggered a
|
||||||
|
slot reset on platforms that support it.
|
||||||
|
|
||||||
|
- PCI_ERS_RESULT_NEED_RESET
|
||||||
|
Driver returns this if it thinks the device is not
|
||||||
|
recoverable in its current state and it needs a slot
|
||||||
|
reset to proceed.
|
||||||
|
|
||||||
|
- PCI_ERS_RESULT_DISCONNECT
|
||||||
|
Same as above. Total failure, no recovery even after
|
||||||
|
reset; the driver is dead. (To be defined more precisely)
|
||||||
|
|
||||||
|
The next step taken depends on the results returned by the drivers.
|
||||||
|
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
|
||||||
|
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
|
||||||
|
|
||||||
|
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
|
||||||
|
proceeds to STEP 4 (Slot Reset).
|
||||||
|
|
||||||
|
STEP 3: Link Reset
|
||||||
|
------------------
|
||||||
|
The platform resets the link. This is a PCI-Express specific step
|
||||||
|
and is done whenever a fatal error has been detected that can be
|
||||||
|
"solved" by resetting the link.
|
||||||
|
|
||||||
|
STEP 4: Slot Reset
|
||||||
|
------------------
|
||||||
|
|
||||||
|
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
|
||||||
|
platform will perform a slot reset on the requesting PCI device(s).
|
||||||
|
The actual steps taken by a platform to perform a slot reset
|
||||||
|
will be platform-dependent. Upon completion of slot reset, the
|
||||||
|
platform will call the device slot_reset() callback.
|
||||||
|
|
||||||
|
Powerpc platforms implement two levels of slot reset:
|
||||||
|
soft reset (default) and fundamental (optional) reset.
|
||||||
|
|
||||||
|
Powerpc soft reset consists of asserting the adapter #RST line and then
|
||||||
|
restoring the PCI BAR's and PCI configuration header to a state
|
||||||
|
that is equivalent to what it would be after a fresh system
|
||||||
|
power-on followed by power-on BIOS/system firmware initialization.
|
||||||
|
Soft reset is also known as hot-reset.
|
||||||
|
|
||||||
|
Powerpc fundamental reset is supported by PCI Express cards only
|
||||||
|
and causes the device's state machines, hardware logic, port states and
|
||||||
|
configuration registers to initialize to their default conditions.
|
||||||
|
|
||||||
|
For most PCI devices, a soft reset will be sufficient for recovery.
|
||||||
|
Optional fundamental reset is provided to support a limited number
|
||||||
|
of PCI Express devices for which a soft reset is not sufficient
|
||||||
|
for recovery.
|
||||||
|
|
||||||
|
If the platform supports PCI hotplug, then the reset might be
|
||||||
|
performed by toggling the slot electrical power off/on.
|
||||||
|
|
||||||
|
It is important for the platform to restore the PCI config space
|
||||||
|
to the "fresh poweron" state, rather than the "last state". After
|
||||||
|
a slot reset, the device driver will almost always use its standard
|
||||||
|
device initialization routines, and an unusual config space setup
|
||||||
|
may result in hung devices, kernel panics, or silent data corruption.
|
||||||
|
|
||||||
|
This call gives drivers the chance to re-initialize the hardware
|
||||||
|
(re-download firmware, etc.). At this point, the driver may assume
|
||||||
|
that the card is in a fresh state and is fully functional. The slot
|
||||||
|
is unfrozen and the driver has full access to PCI config space,
|
||||||
|
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
|
||||||
|
will also be available.
|
||||||
|
|
||||||
|
Drivers should not restart normal I/O processing operations
|
||||||
|
at this point. If all device drivers report success on this
|
||||||
|
callback, the platform will call resume() to complete the sequence,
|
||||||
|
and let the driver restart normal I/O processing.
|
||||||
|
|
||||||
|
A driver can still return a critical failure for this function if
|
||||||
|
it can't get the device operational after reset. If the platform
|
||||||
|
previously tried a soft reset, it might now try a hard reset (power
|
||||||
|
cycle) and then call slot_reset() again. If the device still can't
|
||||||
|
be recovered, there is nothing more that can be done; the platform
|
||||||
|
will typically report a "permanent failure" in such a case. The
|
||||||
|
device will be considered "dead" in this case.
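The escalation described above (soft reset first, then a power cycle, then give up) can be modelled as follows. This is a toy model rather than platform code: struct demo_card, demo_slot_reset() and demo_recover_by_reset() are hypothetical, and the card's behaviour is simulated by a single field instead of real hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_reset_kind {
	DEMO_RESET_SOFT, /* hot reset */
	DEMO_RESET_HARD, /* power cycle */
};

/* Simulated card: min_reset is the weakest reset after which the
 * driver's slot_reset() succeeds; a value above DEMO_RESET_HARD means
 * the card never comes back. */
struct demo_card {
	int min_reset;
	int attempts; /* number of resets performed so far */
};

/* Stands in for "reset the slot, then call slot_reset() on the driver". */
static bool demo_slot_reset(struct demo_card *card, enum demo_reset_kind kind)
{
	card->attempts++;
	return (int)kind >= card->min_reset;
}

/* Returns true to proceed to STEP 5, false to proceed to STEP 6. */
bool demo_recover_by_reset(struct demo_card *card)
{
	assert(card != NULL);

	if (demo_slot_reset(card, DEMO_RESET_SOFT))
		return true;

	/* The soft reset was not enough: escalate once to a power cycle. */
	if (demo_slot_reset(card, DEMO_RESET_HARD))
		return true;

	/* Nothing more can be done; the device is considered dead. */
	return false;
}
```

Whether a platform escalates at all is its own policy; the sketch only shows the ordering the text describes.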
|
||||||
|
|
||||||
|
Drivers for multi-function cards will need to coordinate among
|
||||||
|
themselves as to which driver instance will perform any "one-shot"
|
||||||
|
or global device initialization. For example, the Symbios sym53c8xx_2
|
||||||
|
driver performs device init only from PCI function 0::
|
||||||
|
|
||||||
|
+ if (PCI_FUNC(pdev->devfn) == 0)
|
||||||
|
+ sym_reset_scsi_bus(np, 0);
|
||||||
|
|
||||||
|
Result codes:
|
||||||
|
- PCI_ERS_RESULT_DISCONNECT
|
||||||
|
Same as above.
|
||||||
|
|
||||||
|
Drivers for PCI Express cards that require a fundamental reset must
|
||||||
|
set the needs_freset bit in the pci_dev structure in their probe function.
|
||||||
|
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
|
||||||
|
PCI card types::
|
||||||
|
|
||||||
|
+ /* Set EEH reset type to fundamental if required by hba */
|
||||||
|
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
|
||||||
|
+ pdev->needs_freset = 1;
|
||||||
|
+
|
||||||
|
|
||||||
|
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
|
||||||
|
Failure).
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The current powerpc implementation does not try a power-cycle
|
||||||
|
reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
|
||||||
|
However, it probably should.
|
||||||
|
|
||||||
|
|
||||||
|
STEP 5: Resume Operations
|
||||||
|
-------------------------
|
||||||
|
The platform will call the resume() callback on all affected device
|
||||||
|
drivers if all drivers on the segment have returned
|
||||||
|
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
|
||||||
|
The goal of this callback is to tell the driver to restart activity,
|
||||||
|
that everything is back and running. This callback does not return
|
||||||
|
a result code.
|
||||||
|
|
||||||
|
At this point, if a new error happens, the platform will restart
|
||||||
|
a new error recovery sequence.
|
||||||
|
|
||||||
|
STEP 6: Permanent Failure
|
||||||
|
-------------------------
|
||||||
|
A "permanent failure" has occurred, and the platform cannot recover
|
||||||
|
the device. The platform will call error_detected() with a
|
||||||
|
pci_channel_state value of pci_channel_io_perm_failure.
|
||||||
|
|
||||||
|
The device driver should, at this point, assume the worst. It should
|
||||||
|
cancel all pending I/O, refuse all new I/O, returning -EIO to
|
||||||
|
higher layers. The device driver should then clean up all of its
|
||||||
|
memory and remove itself from kernel operations, much as it would
|
||||||
|
during system shutdown.
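A minimal sketch of the "assume the worst" behaviour, with hypothetical names throughout (struct demo_dev, demo_handle_perm_failure(), demo_submit_io()): once the device is marked dead, pending requests are dropped and every new request is refused with -EIO. A counter stands in for the driver's real request queue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; "pending" stands in for a request queue. */
struct demo_dev {
	bool dead;
	int pending;
};

/* Called when error_detected() reports pci_channel_io_perm_failure. */
void demo_handle_perm_failure(struct demo_dev *dev)
{
	assert(dev != NULL);

	dev->dead = true;
	dev->pending = 0; /* cancel everything that was in flight */
}

/* Entry point used by higher layers to queue a request. */
int demo_submit_io(struct demo_dev *dev)
{
	assert(dev != NULL);

	if (dev->dead)
		return -EIO; /* refuse all new I/O */

	dev->pending++;
	return 0;
}
```

In a real driver, cancelling pending I/O also means completing each outstanding request with an error so that no caller is left waiting.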
|
||||||
|
|
||||||
|
The platform will typically notify the system operator of the
|
||||||
|
permanent failure in some way. If the device is hotplug-capable,
|
||||||
|
the operator will probably want to remove and replace the device.
|
||||||
|
Note, however, not all failures are truly "permanent". Some are
|
||||||
|
caused by over-heating, some by a poorly seated card. Many
|
||||||
|
PCI error events are caused by software bugs, e.g. DMA's to
|
||||||
|
wild addresses or bogus split transactions due to programming
|
||||||
|
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
|
||||||
|
for additional detail on real-life experience of the causes of
|
||||||
|
software errors.
|
||||||
|
|
||||||
|
|
||||||
|
Conclusion; General Remarks
|
||||||
|
---------------------------
|
||||||
|
The way the callbacks are called is platform policy. A platform with
|
||||||
|
no slot reset capability may want to just "ignore" drivers that can't
|
||||||
|
recover (disconnect them) and try to let other cards on the same segment
|
||||||
|
recover. Keep in mind that in most real life cases, though, there will
|
||||||
|
be only one driver per segment.
|
||||||
|
|
||||||
|
Now, a note about interrupts. If you get an interrupt and your
|
||||||
|
device is dead or has been isolated, there is a problem :)
|
||||||
|
The current policy is to turn this into a platform policy.
|
||||||
|
That is, the recovery API only requires that:
|
||||||
|
|
||||||
|
- There is no guarantee that interrupt delivery can proceed from any
|
||||||
|
device on the segment starting from the error detection and until the
|
||||||
|
slot_reset callback is called, at which point interrupts are expected
|
||||||
|
to be fully operational.
|
||||||
|
|
||||||
|
- There is no guarantee that interrupt delivery is stopped, that is,
|
||||||
|
a driver that gets an interrupt after detecting an error, or that detects
|
||||||
|
an error within the interrupt handler such that it prevents proper
|
||||||
|
ack'ing of the interrupt (and thus removal of the source) should just
|
||||||
|
return IRQ_NONE. It's up to the platform to deal with that
|
||||||
|
condition, typically by masking the IRQ source during the duration of
|
||||||
|
the error handling. It is expected that the platform "knows" which
|
||||||
|
interrupts are routed to error-management capable slots and can deal
|
||||||
|
with temporarily disabling that IRQ number during error processing (this
|
||||||
|
isn't terribly complex). That means some IRQ latency for other devices
|
||||||
|
sharing the interrupt, but there is simply no other way. High end
|
||||||
|
platforms aren't supposed to share interrupts between many devices
|
||||||
|
anyway :)
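The second rule can be illustrated with a sketch of an interrupt handler that refuses to claim an interrupt while the device is isolated. The names are hypothetical: DEMO_IRQ_NONE and DEMO_IRQ_HANDLED stand in for the kernel's IRQ_NONE and IRQ_HANDLED, and a plain struct field stands in for an MMIO read of a status register. The all-ones check relies on the powerpc behaviour described in STEP 0, where reads from an isolated slot return 0xffffffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's IRQ_NONE and IRQ_HANDLED. */
enum demo_irqreturn {
	DEMO_IRQ_NONE = 0,
	DEMO_IRQ_HANDLED = 1,
};

/* Hypothetical device state. */
struct demo_dev {
	uint32_t status_reg;   /* stands in for an MMIO status register */
	int in_error_recovery; /* set by error_detected(), cleared by resume() */
};

enum demo_irqreturn demo_irq_handler(struct demo_dev *dev)
{
	uint32_t status;

	assert(dev != NULL);

	/* Recovery is in progress: do not touch the device at all. */
	if (dev->in_error_recovery)
		return DEMO_IRQ_NONE;

	status = dev->status_reg;

	/* An all-ones read suggests the slot is isolated, so the
	 * interrupt cannot be acknowledged properly. */
	if (status == 0xffffffffu)
		return DEMO_IRQ_NONE;

	/* Normal path: the source would be acknowledged here. */
	return DEMO_IRQ_HANDLED;
}
```

Returning "not handled" leaves the source asserted, which is why the text expects the platform to mask the IRQ for the duration of the error handling.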
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Implementation details for the powerpc platform are discussed in
|
||||||
|
the file Documentation/powerpc/eeh-pci-error-recovery.txt
|
||||||
|
|
||||||
|
As of this writing, there is a growing list of device drivers with
|
||||||
|
patches implementing error recovery. Not all of these patches are in
|
||||||
|
mainline yet. These may be used as "examples":
|
||||||
|
|
||||||
|
- drivers/scsi/ipr
|
||||||
|
- drivers/scsi/sym53c8xx_2
|
||||||
|
- drivers/scsi/qla2xxx
|
||||||
|
- drivers/scsi/lpfc
|
||||||
|
- drivers/net/bnx2.c
|
||||||
|
- drivers/net/e100.c
|
||||||
|
- drivers/net/e1000
|
||||||
|
- drivers/net/e1000e
|
||||||
|
- drivers/net/ixgb
|
||||||
|
- drivers/net/ixgbe
|
||||||
|
- drivers/net/cxgb3
|
||||||
|
- drivers/net/s2io.c
|
||||||
|
- drivers/net/qlge
|
|
@ -1,413 +0,0 @@
|
||||||
|
|
||||||
PCI Error Recovery
|
|
||||||
------------------
|
|
||||||
February 2, 2006
|
|
||||||
|
|
||||||
Current document maintainer:
|
|
||||||
Linas Vepstas <linasvepstas@gmail.com>
|
|
||||||
updated by Richard Lary <rlary@us.ibm.com>
|
|
||||||
and Mike Mason <mmlnx@us.ibm.com> on 27-Jul-2009
|
|
||||||
|
|
||||||
|
|
||||||
Many PCI bus controllers are able to detect a variety of hardware
|
|
||||||
PCI errors on the bus, such as parity errors on the data and address
|
|
||||||
buses, as well as SERR and PERR errors. Some of the more advanced
|
|
||||||
chipsets are able to deal with these errors; these include PCI-E chipsets,
|
|
||||||
and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
|
|
||||||
pSeries boxes. A typical action taken is to disconnect the affected device,
|
|
||||||
halting all I/O to it. The goal of a disconnection is to avoid system
|
|
||||||
corruption; for example, to halt system memory corruption due to DMA's
|
|
||||||
to "wild" addresses. Typically, a reconnection mechanism is also
|
|
||||||
offered, so that the affected PCI device(s) are reset and put back
|
|
||||||
into working condition. The reset phase requires coordination
|
|
||||||
between the affected device drivers and the PCI controller chip.
|
|
||||||
This document describes a generic API for notifying device drivers
|
|
||||||
of a bus disconnection, and then performing error recovery.
|
|
||||||
This API is currently implemented in the 2.6.16 and later kernels.
|
|
||||||
|
|
||||||
Reporting and recovery is performed in several steps. First, when
|
|
||||||
a PCI hardware error has resulted in a bus disconnect, that event
|
|
||||||
is reported as soon as possible to all affected device drivers,
|
|
||||||
including multiple instances of a device driver on multi-function
|
|
||||||
cards. This allows device drivers to avoid deadlocking in spinloops,
|
|
||||||
waiting for some i/o-space register to change, when it never will.
|
|
||||||
It also gives the drivers a chance to defer incoming I/O as
|
|
||||||
needed.
|
|
||||||
|
|
||||||
Next, recovery is performed in several stages. Most of the complexity
|
|
||||||
is forced by the need to handle multi-function devices, that is,
|
|
||||||
devices that have multiple device drivers associated with them.
|
|
||||||
In the first stage, each driver is allowed to indicate what type
|
|
||||||
of reset it desires, the choices being a simple re-enabling of I/O
|
|
||||||
or requesting a slot reset.
|
|
||||||
|
|
||||||
If any driver requests a slot reset, that is what will be done.
|
|
||||||
|
|
||||||
After a reset and/or a re-enabling of I/O, all drivers are
|
|
||||||
again notified, so that they may then perform any device setup/config
|
|
||||||
that may be required. After these have all completed, a final
|
|
||||||
"resume normal operations" event is sent out.
|
|
||||||
|
|
||||||
The biggest reason for choosing a kernel-based implementation rather
|
|
||||||
than a user-space implementation was the need to deal with bus
|
|
||||||
disconnects of PCI devices attached to storage media, and, in particular,
|
|
||||||
disconnects from devices holding the root file system. If the root
|
|
||||||
file system is disconnected, a user-space mechanism would have to go
|
|
||||||
through a large number of contortions to complete recovery. Almost all
|
|
||||||
of the current Linux file systems are not tolerant of disconnection
|
|
||||||
from/reconnection to their underlying block device. By contrast,
|
|
||||||
bus errors are easy to manage in the device driver. Indeed, most
|
|
||||||
device drivers already handle very similar recovery procedures;
|
|
||||||
for example, the SCSI-generic layer already provides significant
|
|
||||||
mechanisms for dealing with SCSI bus errors and SCSI bus resets.
|
|
||||||
|
|
||||||
|
|
||||||
Detailed Design
|
|
||||||
---------------
|
|
||||||
Design and implementation details below, based on a chain of
|
|
||||||
public email discussions with Ben Herrenschmidt, circa 5 April 2005.
|
|
||||||
|
|
||||||
The error recovery API support is exposed to the driver in the form of
|
|
||||||
a structure of function pointers pointed to by a new field in struct
|
|
||||||
pci_driver. A driver that fails to provide the structure is "non-aware",
|
|
||||||
and the actual recovery steps taken are platform dependent. The
|
|
||||||
arch/powerpc implementation will simulate a PCI hotplug remove/add.
|
|
||||||
|
|
||||||
This structure has the form:
|
|
||||||
struct pci_error_handlers
|
|
||||||
{
|
|
||||||
int (*error_detected)(struct pci_dev *dev, enum pci_channel_state);
|
|
||||||
int (*mmio_enabled)(struct pci_dev *dev);
|
|
||||||
int (*slot_reset)(struct pci_dev *dev);
|
|
||||||
void (*resume)(struct pci_dev *dev);
|
|
||||||
};
|
|
||||||
|
|
||||||
The possible channel states are:
|
|
||||||
enum pci_channel_state {
|
|
||||||
pci_channel_io_normal, /* I/O channel is in normal state */
|
|
||||||
pci_channel_io_frozen, /* I/O to channel is blocked */
|
|
||||||
pci_channel_io_perm_failure, /* PCI card is dead */
|
|
||||||
};
|
|
||||||
|
|
||||||
Possible return values are:
|
|
||||||
enum pci_ers_result {
|
|
||||||
PCI_ERS_RESULT_NONE, /* no result/none/not supported in device driver */
|
|
||||||
PCI_ERS_RESULT_CAN_RECOVER, /* Device driver can recover without slot reset */
|
|
||||||
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
|
|
||||||
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
|
|
||||||
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
|
|
||||||
};
|
|
||||||
|
|
||||||
A driver does not have to implement all of these callbacks; however,
|
|
||||||
if it implements any, it must implement error_detected(). If a callback
|
|
||||||
is not implemented, the corresponding feature is considered unsupported.
|
|
||||||
For example, if mmio_enabled() and resume() aren't there, then it
|
|
||||||
is assumed that the driver is not doing any direct recovery and requires
|
|
||||||
a slot reset. Typically a driver will want to know about
|
|
||||||
a slot_reset().
|
|
||||||
|
|
||||||
The actual steps taken by a platform to recover from a PCI error
|
|
||||||
event will be platform-dependent, but will follow the general
|
|
||||||
sequence described below.
|
|
||||||
|
|
||||||
STEP 0: Error Event
|
|
||||||
-------------------
|
|
||||||
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
|
|
||||||
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
|
|
||||||
all writes are ignored.
|
|
||||||
|
|
||||||
|
|
||||||
STEP 1: Notification
|
|
||||||
--------------------
|
|
||||||
Platform calls the error_detected() callback on every instance of
|
|
||||||
every driver affected by the error.
|
|
||||||
|
|
||||||
At this point, the device might not be accessible anymore, depending on
|
|
||||||
the platform (the slot will be isolated on powerpc). The driver may
|
|
||||||
already have "noticed" the error because of a failing I/O, but this
|
|
||||||
is the proper "synchronization point", that is, it gives the driver
|
|
||||||
a chance to cleanup, waiting for pending stuff (timers, whatever, etc...)
|
|
||||||
to complete; it can take semaphores, schedule, etc... everything but
|
|
||||||
touch the device. Within this function and after it returns, the driver
|
|
||||||
shouldn't do any new IOs. Called in task context. This is sort of a
|
|
||||||
"quiesce" point. See note about interrupts at the end of this doc.
|
|
||||||
|
|
||||||
All drivers participating in this system must implement this call.
|
|
||||||
The driver must return one of the following result codes:
|
|
||||||
- PCI_ERS_RESULT_CAN_RECOVER:
|
|
||||||
Driver returns this if it thinks it might be able to recover
|
|
||||||
the HW by just banging IOs or if it wants to be given
|
|
||||||
a chance to extract some diagnostic information (see
|
|
||||||
mmio_enabled, below).
|
|
||||||
- PCI_ERS_RESULT_NEED_RESET:
|
|
||||||
Driver returns this if it can't recover without a
|
|
||||||
slot reset.
|
|
||||||
- PCI_ERS_RESULT_DISCONNECT:
|
|
||||||
Driver returns this if it doesn't want to recover at all.
|
|
||||||
|
|
||||||
The next step taken will depend on the result codes returned by the
|
|
||||||
drivers.
|
|
||||||
|
|
||||||
If all drivers on the segment/slot return PCI_ERS_RESULT_CAN_RECOVER,
|
|
||||||
then the platform should re-enable IOs on the slot (or do nothing in
|
|
||||||
particular, if the platform doesn't isolate slots), and recovery
|
|
||||||
proceeds to STEP 2 (MMIO Enabled).
|
|
||||||
|
|
||||||
If any driver requested a slot reset (by returning PCI_ERS_RESULT_NEED_RESET),
|
|
||||||
then recovery proceeds to STEP 4 (Slot Reset).
|
|
||||||
|
|
||||||
If the platform is unable to recover the slot, the next step
|
|
||||||
is STEP 6 (Permanent Failure).
|
|
||||||
|
|
||||||
>>> The current powerpc implementation assumes that a device driver will
|
|
||||||
>>> *not* schedule or semaphore in this routine; the current powerpc
|
|
||||||
>>> implementation uses one kernel thread to notify all devices;
|
|
||||||
>>> thus, if one device sleeps/schedules, all devices are affected.
|
|
||||||
>>> Doing better requires complex multi-threaded logic in the error
|
|
||||||
>>> recovery implementation (e.g. waiting for all notification threads
|
|
||||||
>>> to "join" before proceeding with recovery.) This seems excessively
|
|
||||||
>>> complex and not worth implementing.
|
|
||||||
|
|
||||||
>>> The current powerpc implementation doesn't much care if the device
|
|
||||||
>>> attempts I/O at this point, or not. I/O's will fail, returning
|
|
||||||
>>> a value of 0xff on read, and writes will be dropped. If more than
|
|
||||||
>>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
|
|
||||||
>>> assumes that the device driver has gone into an infinite loop
|
|
||||||
>>> and prints an error to syslog. A reboot is then required to
|
|
||||||
>>> get the device working again.
|
|
||||||
|
|
||||||
STEP 2: MMIO Enabled
|
|
||||||
-------------------
|
|
||||||
The platform re-enables MMIO to the device (but typically not the
|
|
||||||
DMA), and then calls the mmio_enabled() callback on all affected
|
|
||||||
device drivers.
|
|
||||||
|
|
||||||
This is the "early recovery" call. IOs are allowed again, but DMA is
|
|
||||||
not, with some restrictions. This is NOT a callback for the driver to
|
|
||||||
start operations again, only to peek/poke at the device, extract diagnostic
|
|
||||||
information, if any, and eventually do things like trigger a device local
|
|
||||||
reset or some such, but not restart operations. This callback is made if
|
|
||||||
all drivers on a segment agree that they can try to recover and if no automatic
|
|
||||||
link reset was performed by the HW. If the platform can't just re-enable IOs
|
|
||||||
without a slot reset or a link reset, it will not call this callback, and
|
|
||||||
instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
|
|
||||||
|
|
||||||
>>> The following is proposed; no platform implements this yet:
|
|
||||||
>>> Proposal: All I/O's should be done _synchronously_ from within
|
|
||||||
>>> this callback, errors triggered by them will be returned via
|
|
||||||
>>> the normal pci_check_whatever() API, no new error_detected()
|
|
||||||
>>> callback will be issued due to an error happening here. However,
|
|
||||||
>>> such an error might cause IOs to be re-blocked for the whole
|
|
||||||
>>> segment, and thus invalidate the recovery that other devices
|
|
||||||
>>> on the same segment might have done, forcing the whole segment
|
|
||||||
>>> into one of the next states, that is, link reset or slot reset.
|
|
||||||
|
|
||||||
The driver should return one of the following result codes:
|
|
||||||
- PCI_ERS_RESULT_RECOVERED
|
|
||||||
Driver returns this if it thinks the device is fully
|
|
||||||
functional and thinks it is ready to start
|
|
||||||
normal driver operations again. There is no
|
|
||||||
guarantee that the driver will actually be
|
|
||||||
allowed to proceed, as another driver on the
|
|
||||||
same segment might have failed and thus triggered a
|
|
||||||
slot reset on platforms that support it.
|
|
||||||
|
|
||||||
- PCI_ERS_RESULT_NEED_RESET
|
|
||||||
Driver returns this if it thinks the device is not
|
|
||||||
recoverable in its current state and it needs a slot
|
|
||||||
reset to proceed.
|
|
||||||
|
|
||||||
- PCI_ERS_RESULT_DISCONNECT
|
|
||||||
Same as above. Total failure, no recovery even after
|
|
||||||
reset; the driver is dead. (To be defined more precisely)
|
|
||||||
|
|
||||||
The next step taken depends on the results returned by the drivers.
|
|
||||||
If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
|
|
||||||
proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
|
|
||||||
|
|
||||||
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
|
|
||||||
proceeds to STEP 4 (Slot Reset).
|
|
||||||
|
|
||||||
STEP 3: Link Reset
|
|
||||||
------------------
|
|
||||||
The platform resets the link. This is a PCI-Express specific step
|
|
||||||
and is done whenever a fatal error has been detected that can be
|
|
||||||
"solved" by resetting the link.
|
|
||||||
|
|
||||||
STEP 4: Slot Reset
|
|
||||||
------------------
|
|
||||||
|
|
||||||
In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
|
|
||||||
platform will perform a slot reset on the requesting PCI device(s).
|
|
||||||
The actual steps taken by a platform to perform a slot reset
|
|
||||||
will be platform-dependent. Upon completion of slot reset, the
|
|
||||||
platform will call the device slot_reset() callback.
|
|
||||||
|
|
||||||
Powerpc platforms implement two levels of slot reset:
|
|
||||||
soft reset (default) and fundamental (optional) reset.
|
|
||||||
|
|
||||||
Powerpc soft reset consists of asserting the adapter #RST line and then
|
|
||||||
restoring the PCI BAR's and PCI configuration header to a state
|
|
||||||
that is equivalent to what it would be after a fresh system
|
|
||||||
power-on followed by power-on BIOS/system firmware initialization.
|
|
||||||
Soft reset is also known as hot-reset.
|
|
||||||
|
|
||||||
Powerpc fundamental reset is supported by PCI Express cards only
|
|
||||||
and causes the device's state machines, hardware logic, port states and
|
|
||||||
configuration registers to initialize to their default conditions.
|
|
||||||
|
|
||||||
For most PCI devices, a soft reset will be sufficient for recovery.
|
|
||||||
Optional fundamental reset is provided to support a limited number
|
|
||||||
of PCI Express devices for which a soft reset is not sufficient
|
|
||||||
for recovery.
|
|
||||||
|
|
||||||
If the platform supports PCI hotplug, then the reset might be
|
|
||||||
performed by toggling the slot electrical power off/on.
|
|
||||||
|
|
||||||
It is important for the platform to restore the PCI config space
|
|
||||||
to the "fresh poweron" state, rather than the "last state". After
|
|
||||||
a slot reset, the device driver will almost always use its standard
|
|
||||||
device initialization routines, and an unusual config space setup
|
|
||||||
may result in hung devices, kernel panics, or silent data corruption.
|
|
||||||
|
|
||||||
This call gives drivers the chance to re-initialize the hardware
|
|
||||||
(re-download firmware, etc.). At this point, the driver may assume
|
|
||||||
that the card is in a fresh state and is fully functional. The slot
|
|
||||||
is unfrozen and the driver has full access to PCI config space,
|
|
||||||
memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
|
|
||||||
will also be available.
|
|
||||||
|
|
||||||
Drivers should not restart normal I/O processing operations
|
|
||||||
at this point. If all device drivers report success on this
|
|
||||||
callback, the platform will call resume() to complete the sequence,
|
|
||||||
and let the driver restart normal I/O processing.
|
|
||||||
|
|
||||||
A driver can still return a critical failure for this function if
|
|
||||||
it can't get the device operational after reset. If the platform
|
|
||||||
previously tried a soft reset, it might now try a hard reset (power
|
|
||||||
cycle) and then call slot_reset() again. If the device still can't
|
|
||||||
be recovered, there is nothing more that can be done; the platform
|
|
||||||
will typically report a "permanent failure" in such a case. The
|
|
||||||
device will be considered "dead" in this case.
|
|
||||||
|
|
||||||
Drivers for multi-function cards will need to coordinate among
|
|
||||||
themselves as to which driver instance will perform any "one-shot"
|
|
||||||
or global device initialization. For example, the Symbios sym53c8xx_2
|
|
||||||
driver performs device init only from PCI function 0:
|
|
||||||
|
|
||||||
+ if (PCI_FUNC(pdev->devfn) == 0)
|
|
||||||
+ sym_reset_scsi_bus(np, 0);
|
|
||||||
|
|
||||||
Result codes:
|
|
||||||
- PCI_ERS_RESULT_DISCONNECT
|
|
||||||
Same as above.
|
|
||||||
|
|
||||||
Drivers for PCI Express cards that require a fundamental reset must
|
|
||||||
set the needs_freset bit in the pci_dev structure in their probe function.
|
|
||||||
For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
|
|
||||||
PCI card types:
|
|
||||||
|
|
||||||
+ /* Set EEH reset type to fundamental if required by hba */
|
|
||||||
+ if (IS_QLA24XX(ha) || IS_QLA25XX(ha) || IS_QLA81XX(ha))
|
|
||||||
+ pdev->needs_freset = 1;
|
|
||||||
+
|
|
||||||
|
|
||||||
Platform proceeds either to STEP 5 (Resume Operations) or STEP 6 (Permanent
|
|
||||||
Failure).
|
|
||||||
|
|
||||||
>>> The current powerpc implementation does not try a power-cycle
|
|
||||||
>>> reset if the driver returned PCI_ERS_RESULT_DISCONNECT.
|
|
||||||
>>> However, it probably should.
|
|
||||||
|
|
||||||
|
|
||||||
STEP 5: Resume Operations
|
|
||||||
-------------------------
|
|
||||||
The platform will call the resume() callback on all affected device
|
|
||||||
drivers if all drivers on the segment have returned
|
|
||||||
PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks.
|
|
||||||
The goal of this callback is to tell the driver to restart activity,
|
|
||||||
that everything is back and running. This callback does not return
|
|
||||||
a result code.
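The choice between STEP 5 and STEP 6 can be modeled in a few lines. This is a simplified userspace sketch, not platform code: the enum names and values are invented for illustration, and it ignores the optional retry with a hard reset described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the result codes; the values are arbitrary. */
enum model_ers_result {
	MODEL_ERS_RESULT_DISCONNECT,
	MODEL_ERS_RESULT_RECOVERED,
};

enum model_next_step {
	MODEL_STEP_RESUME,		/* STEP 5: Resume Operations */
	MODEL_STEP_PERM_FAILURE,	/* STEP 6: Permanent Failure */
};

/*
 * Resume only if every driver on the segment reported success;
 * a single failing driver sends the whole segment to STEP 6.
 */
enum model_next_step model_after_slot_reset(const enum model_ers_result *results,
					    size_t n)
{
	size_t i;

	assert(results != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (results[i] != MODEL_ERS_RESULT_RECOVERED)
			return MODEL_STEP_PERM_FAILURE;
	}
	return MODEL_STEP_RESUME;
}
```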
|
|
||||||
|
|
||||||
At this point, if a new error happens, the platform will restart
|
|
||||||
a new error recovery sequence.
|
|
||||||
|
|
||||||
STEP 6: Permanent Failure
|
|
||||||
-------------------------
|
|
||||||
A "permanent failure" has occurred, and the platform cannot recover
|
|
||||||
the device. The platform will call error_detected() with a
|
|
||||||
pci_channel_state value of pci_channel_io_perm_failure.
|
|
||||||
|
|
||||||
The device driver should, at this point, assume the worst. It should
|
|
||||||
cancel all pending I/O, refuse all new I/O, returning -EIO to
|
|
||||||
higher layers. The device driver should then clean up all of its
|
|
||||||
memory and remove itself from kernel operations, much as it would
|
|
||||||
during system shutdown.
|
|
||||||
|
|
||||||
The platform will typically notify the system operator of the
|
|
||||||
permanent failure in some way. If the device is hotplug-capable,
|
|
||||||
the operator will probably want to remove and replace the device.
|
|
||||||
Note, however, not all failures are truly "permanent". Some are
|
|
||||||
caused by over-heating, some by a poorly seated card. Many
|
|
||||||
PCI error events are caused by software bugs, e.g. DMA's to
|
|
||||||
wild addresses or bogus split transactions due to programming
|
|
||||||
errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
|
|
||||||
for additional detail on real-life experience of the causes of
|
|
||||||
software errors.
|
|
||||||
|
|
||||||
|
|
||||||
Conclusion; General Remarks
|
|
||||||
---------------------------
|
|
||||||
The way the callbacks are called is platform policy. A platform with
|
|
||||||
no slot reset capability may want to just "ignore" drivers that can't
|
|
||||||
recover (disconnect them) and try to let other cards on the same segment
|
|
||||||
recover. Keep in mind that in most real life cases, though, there will
|
|
||||||
be only one driver per segment.
|
|
||||||
|
|
||||||
Now, a note about interrupts. If you get an interrupt and your
|
|
||||||
device is dead or has been isolated, there is a problem :)
|
|
||||||
The current policy is to turn this into a platform policy.
|
|
||||||
That is, the recovery API only requires that:
|
|
||||||
|
|
||||||
- There is no guarantee that interrupt delivery can proceed from any
|
|
||||||
device on the segment starting from the error detection and until the
|
|
||||||
slot_reset callback is called, at which point interrupts are expected
|
|
||||||
to be fully operational.
|
|
||||||
|
|
||||||
- There is no guarantee that interrupt delivery is stopped, that is,
|
|
||||||
a driver that gets an interrupt after detecting an error, or that detects
|
|
||||||
an error within the interrupt handler such that it prevents proper
|
|
||||||
ack'ing of the interrupt (and thus removal of the source) should just
|
|
||||||
return IRQ_NONE. It's up to the platform to deal with that
|
|
||||||
condition, typically by masking the IRQ source during the duration of
|
|
||||||
the error handling. It is expected that the platform "knows" which
|
|
||||||
interrupts are routed to error-management capable slots and can deal
|
|
||||||
with temporarily disabling that IRQ number during error processing (this
|
|
||||||
isn't terribly complex). That means some IRQ latency for other devices
|
|
||||||
sharing the interrupt, but there is simply no other way. High end
|
|
||||||
platforms aren't supposed to share interrupts between many devices
|
|
||||||
anyway :)
|
|
||||||
|
|
||||||
>>> Implementation details for the powerpc platform are discussed in
|
|
||||||
>>> the file Documentation/powerpc/eeh-pci-error-recovery.txt
|
|
||||||
|
|
||||||
>>> As of this writing, there is a growing list of device drivers with
|
|
||||||
>>> patches implementing error recovery. Not all of these patches are in
|
|
||||||
>>> mainline yet. These may be used as "examples":
|
|
||||||
>>>
|
|
||||||
>>> drivers/scsi/ipr
|
|
||||||
>>> drivers/scsi/sym53c8xx_2
|
|
||||||
>>> drivers/scsi/qla2xxx
|
|
||||||
>>> drivers/scsi/lpfc
|
|
||||||
>>> drivers/net/bnx2.c
|
|
||||||
>>> drivers/net/e100.c
|
|
||||||
>>> drivers/net/e1000
|
|
||||||
>>> drivers/net/e1000e
|
|
||||||
>>> drivers/net/ixgb
|
|
||||||
>>> drivers/net/ixgbe
|
|
||||||
>>> drivers/net/cxgb3
|
|
||||||
>>> drivers/net/s2io.c
|
|
||||||
>>> drivers/net/qlge
|
|
||||||
|
|
||||||
The End
|
|
||||||
-------
|
|
172
Documentation/PCI/pci-iov-howto.rst
Normal file
|
@ -0,0 +1,172 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
.. include:: <isonum.txt>
|
||||||
|
|
||||||
|
====================================
|
||||||
|
PCI Express I/O Virtualization Howto
|
||||||
|
====================================
|
||||||
|
|
||||||
|
:Copyright: |copy| 2009 Intel Corporation
|
||||||
|
:Authors: - Yu Zhao <yu.zhao@intel.com>
|
||||||
|
- Donald Dutile <ddutile@redhat.com>
|
||||||
|
|
||||||
|
Overview
|
||||||
|
========
|
||||||
|
|
||||||
|
What is SR-IOV
|
||||||
|
--------------
|
||||||
|
|
||||||
|
Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
|
||||||
|
capability which makes one physical device appear as multiple virtual
|
||||||
|
devices. The physical device is referred to as Physical Function (PF)
|
||||||
|
while the virtual devices are referred to as Virtual Functions (VF).
|
||||||
|
Allocation of the VF can be dynamically controlled by the PF via
|
||||||
|
registers encapsulated in the capability. By default, this feature is
|
||||||
|
not enabled and the PF behaves as a traditional PCIe device. Once it's
|
||||||
|
turned on, each VF's PCI configuration space can be accessed by its own
|
||||||
|
Bus, Device and Function Number (Routing ID). And each VF also has PCI
|
||||||
|
Memory Space, which is used to map its register set. The VF device driver
|
||||||
|
operates on the register set so it can be functional and appear as a
|
||||||
|
real existing PCI device.
|
||||||
|
|
||||||
|
User Guide
|
||||||
|
==========
|
||||||
|
|
||||||
|
How can I enable SR-IOV capability
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
Multiple methods are available for SR-IOV enablement.
|
||||||
|
In the first method, the device driver (PF driver) will control the
|
||||||
|
enabling and disabling of the capability via API provided by SR-IOV core.
|
||||||
|
If the hardware has SR-IOV capability, loading its PF driver would
|
||||||
|
enable it and all VFs associated with the PF. Some PF drivers require
|
||||||
|
a module parameter to be set to determine the number of VFs to enable.
|
||||||
|
In the second method, a write to the sysfs file sriov_numvfs will
|
||||||
|
enable and disable the VFs associated with a PCIe PF. This method
|
||||||
|
enables per-PF, VF enable/disable values versus the first method,
|
||||||
|
which applies to all PFs of the same device. Additionally, the
|
||||||
|
PCI SRIOV core support ensures that enable/disable operations are
|
||||||
|
valid to reduce duplication in multiple drivers for the same
|
||||||
|
checks, e.g., check numvfs == 0 if enabling VFs, ensure
|
||||||
|
numvfs <= totalvfs.
|
||||||
|
The second method is the recommended method for new/future VF devices.
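The checks mentioned above (numvfs must be 0 before enabling, and numvfs must not exceed totalvfs) can be sketched as a small userspace model. The function name, the action enum and the specific error values are assumptions chosen for illustration; this is not the SR-IOV core's code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of a write to sriov_numvfs. */
enum model_sriov_action {
	MODEL_SRIOV_NOOP,	/* requested count already in effect */
	MODEL_SRIOV_ENABLE,	/* enable 'requested' VFs */
	MODEL_SRIOV_DISABLE,	/* disable all VFs */
};

/*
 * Returns a model_sriov_action (>= 0) or a negative errno value.
 * cur_vfs is the number of VFs currently enabled, total_vfs the
 * maximum the PF supports.
 */
int model_check_numvfs(int cur_vfs, int total_vfs, int requested)
{
	assert(cur_vfs >= 0 && cur_vfs <= total_vfs);

	if (requested < 0 || requested > total_vfs)
		return -ERANGE;		/* numvfs <= totalvfs violated */
	if (requested == cur_vfs)
		return MODEL_SRIOV_NOOP;
	if (requested == 0)
		return MODEL_SRIOV_DISABLE;
	if (cur_vfs != 0)
		return -EBUSY;		/* must write 0 before a new count */
	return MODEL_SRIOV_ENABLE;
}
```

Centralizing these checks is what lets a PF driver's sriov_configure callback assume the requested value is already sane.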
|
||||||
|
|
||||||
|
How can I use the Virtual Functions
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
|
VFs are treated as hot-plugged PCI devices in the kernel, so they
|
||||||
|
should be able to work in the same way as real PCI devices. A VF
|
||||||
|
requires a device driver, the same as a normal PCI device.
|
||||||
|
|
||||||
|
Developer Guide
|
||||||
|
===============
|
||||||
|
|
||||||
|
SR-IOV API
|
||||||
|
----------
|
||||||
|
|
||||||
|
To enable SR-IOV capability:
|
||||||
|
|
||||||
|
(a) For the first method, in the driver::
|
||||||
|
|
||||||
|
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
|
||||||
|
|
||||||
|
'nr_virtfn' is the number of VFs to be enabled.
|
||||||
|
|
||||||
|
(b) For the second method, from sysfs::
|
||||||
|
|
||||||
|
echo 'nr_virtfn' > \
|
||||||
|
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
||||||
|
|
||||||
|
To disable SR-IOV capability:
|
||||||
|
|
||||||
|
(a) For the first method, in the driver::
|
||||||
|
|
||||||
|
void pci_disable_sriov(struct pci_dev *dev);
|
||||||
|
|
||||||
|
(b) For the second method, from sysfs::
|
||||||
|
|
||||||
|
echo 0 > \
|
||||||
|
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
||||||
|
|
||||||
|
To enable auto probing VFs by a compatible driver on the host, run
|
||||||
|
command below before enabling SR-IOV capabilities. This is the
|
||||||
|
default behavior.
|
||||||
|
::
|
||||||
|
|
||||||
|
echo 1 > \
|
||||||
|
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
|
||||||
|
|
||||||
|
To disable auto probing VFs by a compatible driver on the host, run
|
||||||
|
command below before enabling SR-IOV capabilities. Updating this
|
||||||
|
entry will not affect VFs which are already probed.
|
||||||
|
::
|
||||||
|
|
||||||
|
echo 0 > \
|
||||||
|
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
|
||||||
|
|
||||||
|
Usage example
|
||||||
|
-------------
|
||||||
|
|
||||||
|
The following piece of code illustrates the usage of the SR-IOV API.
|
||||||
|
::
|
||||||
|
|
||||||
|
static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
|
||||||
|
{
|
||||||
|
pci_enable_sriov(dev, NR_VIRTFN);
|
||||||
|
|
||||||
|
...
|
||||||
|
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void dev_remove(struct pci_dev *dev)
|
||||||
|
{
|
||||||
|
pci_disable_sriov(dev);
|
||||||
|
|
||||||
|
...
|
||||||
|
}
|
||||||
|
|
||||||
|
static int dev_suspend(struct pci_dev *dev, pm_message_t state)
|
||||||
|
{
|
||||||
|
...
|
||||||
|
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
static int dev_resume(struct pci_dev *dev)
|
||||||
|
{
|
||||||
|
...
|
||||||
|
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void dev_shutdown(struct pci_dev *dev)
|
||||||
|
{
|
||||||
|
...
|
||||||
|
}
|
||||||
|
|
||||||
|
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
|
||||||
|
{
|
||||||
|
if (numvfs > 0) {
|
||||||
|
...
|
||||||
|
pci_enable_sriov(dev, numvfs);
|
||||||
|
...
|
||||||
|
return numvfs;
|
||||||
|
}
|
||||||
|
if (numvfs == 0) {
|
||||||
|
....
|
||||||
|
pci_disable_sriov(dev);
|
||||||
|
...
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
static struct pci_driver dev_driver = {
|
||||||
|
.name = "SR-IOV Physical Function driver",
|
||||||
|
.id_table = dev_id_table,
|
||||||
|
.probe = dev_probe,
|
||||||
|
.remove = dev_remove,
|
||||||
|
.suspend = dev_suspend,
|
||||||
|
.resume = dev_resume,
|
||||||
|
.shutdown = dev_shutdown,
|
||||||
|
.sriov_configure = dev_sriov_configure,
|
||||||
|
};
|
|
@ -1,147 +0,0 @@
|
||||||
PCI Express I/O Virtualization Howto
|
|
||||||
Copyright (C) 2009 Intel Corporation
|
|
||||||
Yu Zhao <yu.zhao@intel.com>
|
|
||||||
|
|
||||||
Update: November 2012
|
|
||||||
-- sysfs-based SRIOV enable-/disable-ment
|
|
||||||
Donald Dutile <ddutile@redhat.com>
|
|
||||||
|
|
||||||
1. Overview
|
|
||||||
|
|
||||||
1.1 What is SR-IOV
|
|
||||||
|
|
||||||
Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended
|
|
||||||
capability which makes one physical device appear as multiple virtual
|
|
||||||
devices. The physical device is referred to as Physical Function (PF)
|
|
||||||
while the virtual devices are referred to as Virtual Functions (VF).
|
|
||||||
Allocation of the VF can be dynamically controlled by the PF via
|
|
||||||
registers encapsulated in the capability. By default, this feature is
|
|
||||||
not enabled and the PF behaves as traditional PCIe device. Once it's
|
|
||||||
turned on, each VF's PCI configuration space can be accessed by its own
|
|
||||||
Bus, Device and Function Number (Routing ID). And each VF also has PCI
|
|
||||||
Memory Space, which is used to map its register set. VF device driver
|
|
||||||
operates on the register set so it can be functional and appear as a
|
|
||||||
real existing PCI device.
|
|
||||||
|
|
||||||
2. User Guide
|
|
||||||
|
|
||||||
2.1 How can I enable SR-IOV capability
|
|
||||||
|
|
||||||
Multiple methods are available for SR-IOV enablement.
|
|
||||||
In the first method, the device driver (PF driver) will control the
|
|
||||||
enabling and disabling of the capability via API provided by SR-IOV core.
|
|
||||||
If the hardware has SR-IOV capability, loading its PF driver would
|
|
||||||
enable it and all VFs associated with the PF. Some PF drivers require
|
|
||||||
a module parameter to be set to determine the number of VFs to enable.
|
|
||||||
In the second method, a write to the sysfs file sriov_numvfs will
|
|
||||||
enable and disable the VFs associated with a PCIe PF. This method
|
|
||||||
enables per-PF, VF enable/disable values versus the first method,
|
|
||||||
which applies to all PFs of the same device. Additionally, the
|
|
||||||
PCI SRIOV core support ensures that enable/disable operations are
|
|
||||||
valid to reduce duplication in multiple drivers for the same
|
|
||||||
checks, e.g., check numvfs == 0 if enabling VFs, ensure
|
|
||||||
numvfs <= totalvfs.
|
|
||||||
The second method is the recommended method for new/future VF devices.
|
|
||||||
|
|
||||||
2.2 How can I use the Virtual Functions
|
|
||||||
|
|
||||||
The VF is treated as hot-plugged PCI devices in the kernel, so they
|
|
||||||
should be able to work in the same way as real PCI devices. The VF
|
|
||||||
requires device driver that is same as a normal PCI device's.
|
|
||||||
|
|
||||||
3. Developer Guide
|
|
||||||
|
|
||||||
3.1 SR-IOV API
|
|
||||||
|
|
||||||
To enable SR-IOV capability:
|
|
||||||
(a) For the first method, in the driver:
|
|
||||||
int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn);
|
|
||||||
'nr_virtfn' is number of VFs to be enabled.
|
|
||||||
(b) For the second method, from sysfs:
|
|
||||||
echo 'nr_virtfn' > \
|
|
||||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
|
||||||
|
|
||||||
To disable SR-IOV capability:
|
|
||||||
(a) For the first method, in the driver:
|
|
||||||
void pci_disable_sriov(struct pci_dev *dev);
|
|
||||||
(b) For the second method, from sysfs:
|
|
||||||
echo 0 > \
|
|
||||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_numvfs
|
|
||||||
|
|
||||||
To enable auto probing VFs by a compatible driver on the host, run
|
|
||||||
command below before enabling SR-IOV capabilities. This is the
|
|
||||||
default behavior.
|
|
||||||
echo 1 > \
|
|
||||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
|
|
||||||
|
|
||||||
To disable auto probing VFs by a compatible driver on the host, run
|
|
||||||
command below before enabling SR-IOV capabilities. Updating this
|
|
||||||
entry will not affect VFs which are already probed.
|
|
||||||
echo 0 > \
|
|
||||||
/sys/bus/pci/devices/<DOMAIN:BUS:DEVICE.FUNCTION>/sriov_drivers_autoprobe
|
|
||||||
|
|
||||||
3.2 Usage example
|
|
||||||
|
|
||||||
Following piece of code illustrates the usage of the SR-IOV API.
|
|
||||||
|
|
||||||
static int dev_probe(struct pci_dev *dev, const struct pci_device_id *id)
|
|
||||||
{
|
|
||||||
pci_enable_sriov(dev, NR_VIRTFN);
|
|
||||||
|
|
||||||
...
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
static void dev_remove(struct pci_dev *dev)
|
|
||||||
{
|
|
||||||
pci_disable_sriov(dev);
|
|
||||||
|
|
||||||
...
|
|
||||||
}
|
|
||||||
|
|
||||||
static int dev_suspend(struct pci_dev *dev, pm_message_t state)
|
|
||||||
{
|
|
||||||
...
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
static int dev_resume(struct pci_dev *dev)
|
|
||||||
{
|
|
||||||
...
|
|
||||||
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
static void dev_shutdown(struct pci_dev *dev)
|
|
||||||
{
|
|
||||||
...
|
|
||||||
}
|
|
||||||
|
|
||||||
static int dev_sriov_configure(struct pci_dev *dev, int numvfs)
|
|
||||||
{
|
|
||||||
if (numvfs > 0) {
|
|
||||||
...
|
|
||||||
pci_enable_sriov(dev, numvfs);
|
|
||||||
...
|
|
||||||
return numvfs;
|
|
||||||
}
|
|
||||||
if (numvfs == 0) {
|
|
||||||
....
|
|
||||||
pci_disable_sriov(dev);
|
|
||||||
...
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
static struct pci_driver dev_driver = {
|
|
||||||
.name = "SR-IOV Physical Function driver",
|
|
||||||
.id_table = dev_id_table,
|
|
||||||
.probe = dev_probe,
|
|
||||||
.remove = dev_remove,
|
|
||||||
.suspend = dev_suspend,
|
|
||||||
.resume = dev_resume,
|
|
||||||
.shutdown = dev_shutdown,
|
|
||||||
.sriov_configure = dev_sriov_configure,
|
|
||||||
};
|
|
578
Documentation/PCI/pci.rst
Normal file
|
@ -0,0 +1,578 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
|
||||||
|
==============================
|
||||||
|
How To Write Linux PCI Drivers
|
||||||
|
==============================
|
||||||
|
|
||||||
|
:Authors: - Martin Mares <mj@ucw.cz>
|
||||||
|
- Grant Grundler <grundler@parisc-linux.org>
|
||||||
|
|
||||||
|
The world of PCI is vast and full of (mostly unpleasant) surprises.
|
||||||
|
Since each CPU architecture implements different chip-sets and PCI devices
|
||||||
|
have different requirements (erm, "features"), the result is the PCI support
|
||||||
|
in the Linux kernel is not as trivial as one would wish. This short paper
|
||||||
|
tries to introduce all potential driver authors to Linux APIs for
|
||||||
|
PCI device drivers.
|
||||||
|
|
||||||
|
A more complete resource is the third edition of "Linux Device Drivers"
|
||||||
|
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
|
||||||
|
LDD3 is available for free (under Creative Commons License) from:
|
||||||
|
http://lwn.net/Kernel/LDD3/.
|
||||||
|
|
||||||
|
However, keep in mind that all documents are subject to "bit rot".
|
||||||
|
Refer to the source code if things are not working as described here.
|
||||||
|
|
||||||
|
Please send questions/comments/patches about Linux PCI API to the
|
||||||
|
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
|
||||||
|
|
||||||
|
|
||||||
|
Structure of PCI drivers
|
||||||
|
========================
|
||||||
|
PCI drivers "discover" PCI devices in a system via pci_register_driver().
|
||||||
|
Actually, it's the other way around. When the PCI generic code discovers
|
||||||
|
a new device, the driver with a matching "description" will be notified.
|
||||||
|
Details on this below.
|
||||||
|
|
||||||
|
pci_register_driver() leaves most of the probing for devices to
|
||||||
|
the PCI layer and supports online insertion/removal of devices [thus
|
||||||
|
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
|
||||||
|
pci_register_driver() call requires passing in a table of function
|
||||||
|
pointers and thus dictates the high level structure of a driver.
|
||||||
|
|
||||||
|
Once the driver knows about a PCI device and takes ownership, the
|
||||||
|
driver generally needs to perform the following initialization:
|
||||||
|
|
||||||
|
- Enable the device
|
||||||
|
- Request MMIO/IOP resources
|
||||||
|
- Set the DMA mask size (for both coherent and streaming DMA)
|
||||||
|
- Allocate and initialize shared control data (dma_alloc_coherent())
|
||||||
|
- Access device configuration space (if needed)
|
||||||
|
- Register IRQ handler (request_irq())
|
||||||
|
- Initialize non-PCI (i.e. LAN/SCSI/etc.) parts of the chip
|
||||||
|
- Enable DMA/processing engines
|
||||||
|
|
||||||
|
When done using the device, and perhaps the module needs to be unloaded,
|
||||||
|
the driver needs to take the following steps:
|
||||||
|
|
||||||
|
- Disable the device from generating IRQs
|
||||||
|
- Release the IRQ (free_irq())
|
||||||
|
- Stop all DMA activity
|
||||||
|
- Release DMA buffers (both streaming and coherent)
|
||||||
|
- Unregister from other subsystems (e.g. scsi or netdev)
|
||||||
|
- Release MMIO/IOP resources
|
||||||
|
- Disable the device
|
||||||
|
|
||||||
|
Most of these topics are covered in the following sections.
|
||||||
|
For the rest look at LDD3 or <linux/pci.h>.
|
||||||
|
|
||||||
|
If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
|
||||||
|
the PCI functions described below are defined as inline functions either
|
||||||
|
completely empty or just returning appropriate error codes to avoid
|
||||||
|
lots of ifdefs in the drivers.
|
||||||
|
|
||||||
|
|
||||||
|
pci_register_driver() call
|
||||||
|
==========================
|
||||||
|
|
||||||
|
PCI device drivers call ``pci_register_driver()`` during their
|
||||||
|
initialization with a pointer to a structure describing the driver
|
||||||
|
(``struct pci_driver``):
|
||||||
|
|
||||||
|
.. kernel-doc:: include/linux/pci.h
|
||||||
|
:functions: pci_driver
|
||||||
|
|
||||||
|
The ID table is an array of ``struct pci_device_id`` entries ending with an
|
||||||
|
all-zero entry. Definitions with static const are generally preferred.
|
||||||
|
|
||||||
|
.. kernel-doc:: include/linux/mod_devicetable.h
|
||||||
|
:functions: pci_device_id
|
||||||
|
|
||||||
|
Most drivers only need ``PCI_DEVICE()`` or ``PCI_DEVICE_CLASS()`` to set up
|
||||||
|
a pci_device_id table.
|
||||||
|
|
||||||
|
New PCI IDs may be added to a device driver pci_ids table at runtime
|
||||||
|
as shown below::
|
||||||
|
|
||||||
|
echo "vendor device subvendor subdevice class class_mask driver_data" > \
|
||||||
|
/sys/bus/pci/drivers/{driver}/new_id
|
||||||
|
|
||||||
|
All fields are passed in as hexadecimal values (no leading 0x).
|
||||||
|
The vendor and device fields are mandatory, the others are optional. Users
|
||||||
|
need pass only as many optional fields as necessary:
|
||||||
|
|
||||||
|
- subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
|
||||||
|
- class and class_mask fields default to 0
|
||||||
|
- driver_data defaults to 0UL.
|
||||||
|
|
||||||
|
Note that driver_data must match the value used by any of the pci_device_id
|
||||||
|
entries defined in the driver. This makes the driver_data field mandatory
|
||||||
|
if all the pci_device_id entries have a non-zero driver_data value.
|
||||||
|
|
||||||
|
Once added, the driver probe routine will be invoked for any unclaimed
|
||||||
|
PCI devices listed in its (newly updated) pci_ids list.
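The defaulting rules for the new_id fields can be illustrated with a small userspace parser. This is a sketch under stated assumptions: struct model_new_id, model_parse_new_id() and MODEL_PCI_ANY_ID are names invented here, and the code only models the documented behavior (hexadecimal fields, vendor and device mandatory, the rest optional).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_PCI_ANY_ID 0xffffffffu	/* wildcard, FFFFFFFF as noted above */

/* Illustrative container for one parsed new_id line. */
struct model_new_id {
	bool valid;			/* false if vendor/device were missing */
	unsigned int vendor, device;
	unsigned int subvendor, subdevice;
	unsigned int dev_class, class_mask;
	unsigned long driver_data;
};

struct model_new_id model_parse_new_id(const char *buf)
{
	/* Start from the documented defaults; unnamed members are zeroed. */
	struct model_new_id id = {
		.subvendor = MODEL_PCI_ANY_ID,
		.subdevice = MODEL_PCI_ANY_ID,
	};
	int fields;

	assert(buf != NULL);

	/* Fields not present in the input keep their default values. */
	fields = sscanf(buf, "%x %x %x %x %x %x %lx",
			&id.vendor, &id.device, &id.subvendor, &id.subdevice,
			&id.dev_class, &id.class_mask, &id.driver_data);
	id.valid = fields >= 2;		/* vendor and device are mandatory */
	return id;
}
```

For instance, parsing "8086 10f5" yields a valid entry whose subvendor and subdevice are wildcards and whose class, class_mask and driver_data are 0.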
|
||||||
|
|
||||||
|
When the driver exits, it just calls pci_unregister_driver() and the PCI layer
|
||||||
|
automatically calls the remove hook for all devices handled by the driver.
|
||||||
|
|
||||||
|
|
||||||
|
"Attributes" for driver functions/data
|
||||||
|
--------------------------------------
|
||||||
|
|
||||||
|
Please mark the initialization and cleanup functions where appropriate
|
||||||
|
(the corresponding macros are defined in <linux/init.h>):
|
||||||
|
|
||||||
|
====== =================================================
|
||||||
|
__init Initialization code. Thrown away after the driver
|
||||||
|
initializes.
|
||||||
|
__exit Exit code. Ignored for non-modular drivers.
|
||||||
|
====== =================================================
|
||||||
|
|
||||||
|
Tips on when/where to use the above attributes:
|
||||||
|
- The module_init()/module_exit() functions (and all
|
||||||
|
initialization functions called _only_ from these)
|
||||||
|
should be marked __init/__exit.
|
||||||
|
|
||||||
|
- Do not mark the struct pci_driver.
|
||||||
|
|
||||||
|
- Do NOT mark a function if you are not sure which mark to use.
|
||||||
|
Better to not mark the function than mark the function wrong.
|
||||||
|
|
||||||
|
|
||||||
|
How to find PCI devices manually
|
||||||
|
================================
|
||||||
|
|
||||||
|
PCI drivers should have a really good reason for not using the
|
||||||
|
pci_register_driver() interface to search for PCI devices.
|
||||||
|
The main reason PCI devices are controlled by multiple drivers
|
||||||
|
is that one PCI device implements several different HW services.
|
||||||
|
E.g. combined serial/parallel port/floppy controller.
|
||||||
|
|
||||||
|
A manual search may be performed using the following constructs:
|
||||||
|
|
||||||
|
Searching by vendor and device ID::
|
||||||
|
|
||||||
|
struct pci_dev *dev = NULL;
|
||||||
|
while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
|
||||||
|
configure_device(dev);
|
||||||
|
|
||||||
|
Searching by class ID (iterate in a similar way)::
|
||||||
|
|
||||||
|
pci_get_class(CLASS_ID, dev)
|
||||||
|
|
||||||
|
Searching by both vendor/device and subsystem vendor/device ID::
|
||||||
|
|
||||||
|
pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev)
|
||||||
|
|
||||||
|
You can use the constant PCI_ANY_ID as a wildcard replacement for
|
||||||
|
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
|
||||||
|
specific vendor, for example.
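The wildcard behavior can be shown with a small userspace model. The structures and function names below are invented for illustration and cover only the vendor and device fields; the sketch shows the matching rule, not the kernel's lookup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PCI_ANY_ID 0xffffffffu	/* matches any value in that field */

struct model_id  { uint32_t vendor, device; };	/* search key */
struct model_dev { uint16_t vendor, device; };	/* IDs read from a device */

/* A field matches if the key holds the wildcard or the exact value. */
bool model_id_matches(struct model_id id, struct model_dev dev)
{
	return (id.vendor == MODEL_PCI_ANY_ID || id.vendor == dev.vendor) &&
	       (id.device == MODEL_PCI_ANY_ID || id.device == dev.device);
}

/* Count the devices in 'devs' that the key selects. */
size_t model_count_matches(struct model_id id, const struct model_dev *devs,
			   size_t n)
{
	size_t i, count = 0;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (model_id_matches(id, devs[i]))
			count++;
	}
	return count;
}
```

A key of {0x8086, MODEL_PCI_ANY_ID} selects every device from vendor 0x8086, whatever its device ID.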
|
||||||
|
|
||||||
|
These functions are hotplug-safe. They increment the reference count on
|
||||||
|
the pci_dev that they return. You must eventually (possibly at module unload)
|
||||||
|
decrement the reference count on these devices by calling pci_dev_put().
|
||||||
|
|
||||||
|
|
||||||
|
Device Initialization Steps
|
||||||
|
===========================
|
||||||
|
|
||||||
|
As noted in the introduction, most PCI drivers need the following steps
|
||||||
|
for device initialization:
|
||||||
|
|
||||||
|
- Enable the device
|
||||||
|
- Request MMIO/IOP resources
|
||||||
|
- Set the DMA mask size (for both coherent and streaming DMA)
|
||||||
|
- Allocate and initialize shared control data (dma_alloc_coherent())
|
||||||
|
- Access device configuration space (if needed)
|
||||||
|
- Register IRQ handler (request_irq())
|
||||||
|
- Initialize non-PCI (i.e. LAN/SCSI/etc.) parts of the chip
|
||||||
|
- Enable DMA/processing engines.
|
||||||
|
|
||||||
|
The driver can access PCI config space registers at any time.
|
||||||
|
(Well, almost. When running BIST, config space can go away...but
|
||||||
|
that will just result in a PCI Bus Master Abort and config reads
|
||||||
|
will return garbage).
|
||||||
|
|
||||||
|
|
||||||
|
Enable the PCI device
|
||||||
|
---------------------
|
||||||
|
Before touching any device registers, the driver needs to enable
|
||||||
|
the PCI device by calling pci_enable_device(). This will:
|
||||||
|
|
||||||
|
- wake up the device if it was in suspended state,
|
||||||
|
- allocate I/O and memory regions of the device (if BIOS did not),
|
||||||
|
- allocate an IRQ (if BIOS did not).
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
pci_enable_device() can fail! Check the return value.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
OS BUG: we don't check resource allocations before enabling those
|
||||||
|
resources. The sequence would make more sense if we called
|
||||||
|
pci_request_resources() before calling pci_enable_device().
|
||||||
|
Currently, the device drivers can't detect the bug when two
|
||||||
|
devices have been allocated the same range. This is not a common
|
||||||
|
problem and unlikely to get fixed soon.
|
||||||
|
|
||||||
|
This has been discussed before but not changed as of 2.6.19:
|
||||||
|
http://lkml.org/lkml/2006/3/2/194
|
||||||
|
|
||||||
|
|
||||||
|
pci_set_master() will enable DMA by setting the bus master bit
|
||||||
|
in the PCI_COMMAND register. It also fixes the latency timer value if
|
||||||
|
it's set to something bogus by the BIOS. pci_clear_master() will
|
||||||
|
disable DMA by clearing the bus master bit.
|
||||||
|
|
||||||
|
If the PCI device can use the PCI Memory-Write-Invalidate transaction,
|
||||||
|
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
|
||||||
|
and also ensures that the cache line size register is set correctly.
|
||||||
|
Check the return value of pci_set_mwi() as not all architectures
|
||||||
|
or chip-sets may support Memory-Write-Invalidate. Alternatively,
|
||||||
|
if Mem-Wr-Inval would be nice to have but is not required, call
|
||||||
|
pci_try_set_mwi() to have the system do its best effort at enabling
|
||||||
|
Mem-Wr-Inval.
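At the register level, the calls above toggle individual bits of the 16-bit PCI_COMMAND register. A minimal userspace sketch of that bit manipulation follows; the two bit values are the standard PCI command register bits, while the model_* helpers are hypothetical and do not touch real hardware or perform the latency timer and cache line size fixups the kernel functions also do.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER	0x4	/* bus master enable */
#define PCI_COMMAND_INVALIDATE	0x10	/* Memory-Write-Invalidate enable */

/* Return 'cmd' with set_bits turned on and clear_bits turned off. */
uint16_t model_command_update(uint16_t cmd, uint16_t set_bits,
			      uint16_t clear_bits)
{
	assert((set_bits & clear_bits) == 0);	/* contradictory request */
	return (uint16_t)((cmd & ~clear_bits) | set_bits);
}

/* What the bus master bit looks like after enabling DMA. */
uint16_t model_set_master(uint16_t cmd)
{
	return model_command_update(cmd, PCI_COMMAND_MASTER, 0);
}

/* What it looks like after disabling DMA again. */
uint16_t model_clear_master(uint16_t cmd)
{
	return model_command_update(cmd, 0, PCI_COMMAND_MASTER);
}

/* Mem-Wr-Inval enabled; all other bits untouched. */
uint16_t model_set_mwi(uint16_t cmd)
{
	return model_command_update(cmd, PCI_COMMAND_INVALIDATE, 0);
}
```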
|
||||||
|
|
||||||
|
|
||||||
|
Request MMIO/IOP resources
|
||||||
|
--------------------------
|
||||||
|
Memory (MMIO), and I/O port addresses should NOT be read directly
|
||||||
|
from the PCI device config space. Use the values in the pci_dev structure
|
||||||
|
as the PCI "bus address" might have been remapped to a "host physical"
|
||||||
|
address by the arch/chip-set specific kernel support.
|
||||||
|
|
||||||
|
See Documentation/io-mapping.txt for how to access device registers
|
||||||
|
or device memory.
|
||||||
|
|
||||||
|
The device driver needs to call pci_request_region() to verify
|
||||||
|
no other device is already using the same address resource.
|
||||||
|
Conversely, drivers should call pci_release_region() AFTER
|
||||||
|
calling pci_disable_device().
|
||||||
|
The idea is to prevent two devices colliding on the same address range.
|
||||||
|
|
||||||
|
.. tip::
|
||||||
|
See OS BUG comment above. Currently (2.6.19), the driver can only
|
||||||
|
determine MMIO and IO Port resource availability _after_ calling
|
||||||
|
pci_enable_device().
|
||||||
|
|
||||||
|
Generic flavors of pci_request_region() are request_mem_region()
|
||||||
|
(for MMIO ranges) and request_region() (for IO Port ranges).
|
||||||
|
Use these for address resources that are not described by "normal" PCI
|
||||||
|
BARs.
|
||||||
|
|
||||||
|
Also see pci_request_selected_regions() below.
|
||||||
|
|
||||||
|
|
||||||
|
Set the DMA mask size
|
||||||
|
---------------------
|
||||||
|
.. note::
|
||||||
|
If anything below doesn't make sense, please refer to
|
||||||
|
Documentation/DMA-API.txt. This section is just a reminder that
|
||||||
|
drivers need to indicate DMA capabilities of the device and is not
|
||||||
|
an authoritative source for DMA interfaces.
|
||||||
|
|
||||||
|
While all drivers should explicitly indicate the DMA capability
|
||||||
|
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
|
||||||
|
32-bit bus master capability for streaming data need the driver
|
||||||
|
to "register" this capability by calling pci_set_dma_mask() with
|
||||||
|
appropriate parameters. In general this allows more efficient DMA
|
||||||
|
on systems where System RAM exists above 4G _physical_ address.
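To make the 4G boundary concrete, the sketch below shows what a 32-bit and a 64-bit mask look like and which addresses each one covers. DMA_BIT_MASK() is reproduced locally with the same definition the kernel uses for building such masks; model_addr_fits_mask() is a hypothetical helper for illustration, not a kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copy: a mask with the low n bits set (n = 64 handled specially). */
#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* True if a device limited to 'bits' address bits can reach 'addr'. */
bool model_addr_fits_mask(uint64_t addr, unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return addr <= DMA_BIT_MASK(bits);
}
```

A buffer at 0x100000000 (exactly 4G) is out of reach for a 32-bit mask but fine for a 64-bit one, which is why a 64-bit capable device should register the larger mask.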
|
||||||
|
|
||||||
|
Drivers for all PCI-X and PCIe compliant devices must call
|
||||||
|
pci_set_dma_mask() as they are 64-bit DMA devices.
|
||||||
|
|
||||||
|
Similarly, drivers must also "register" this capability if the device
|
||||||
|
can directly address "consistent memory" in System RAM above 4G physical
|
||||||
|
address by calling pci_set_consistent_dma_mask().
|
||||||
|
Again, this includes drivers for all PCI-X and PCIe compliant devices.
|
||||||
|
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
|
||||||
|
64-bit DMA capable for payload ("streaming") data but not control
|
||||||
|
("consistent") data.
|
||||||
|
|
||||||
|
|
||||||
|
Setup shared control data
|
||||||
|
-------------------------
|
||||||
|
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
|
||||||
|
memory. See Documentation/DMA-API.txt for a full description of
|
||||||
|
the DMA APIs. This section is just a reminder that it needs to be done
|
||||||
|
before enabling DMA on the device.
|
||||||
|
|
||||||
|
|
||||||
|
Initialize device registers
|
||||||
|
---------------------------
|
||||||
|
Some drivers will need specific "capability" fields programmed
|
||||||
|
or other "vendor specific" register initialized or reset.
|
||||||
|
E.g. clearing pending interrupts.
|
||||||
|
|
||||||
|
|
||||||
|
Register IRQ handler
|
||||||
|
--------------------
|
||||||
|
While calling request_irq() is the last step described here,
|
||||||
|
this is often just another intermediate step to initialize a device.
|
||||||
|
This step can often be deferred until the device is opened for use.
|
||||||
|
|
||||||
|
All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
|
||||||
|
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
|
||||||
|
can be shared).
|
||||||
|
|
||||||
|
request_irq() will associate an interrupt handler and device handle
|
||||||
|
with an interrupt number. Historically interrupt numbers represent
|
||||||
|
IRQ lines which run from the PCI device to the Interrupt controller.
|
||||||
|
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".
|
||||||
|
|
||||||
|
request_irq() also enables the interrupt. Make sure the device is
|
||||||
|
quiesced and does not have any interrupts pending before registering
|
||||||
|
the interrupt handler.
|
||||||
|
|
||||||
|
MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
|
||||||
|
which deliver interrupts to the CPU via a DMA write to a Local APIC.
|
||||||
|
The fundamental difference between MSI and MSI-X is how multiple
|
||||||
|
"vectors" get allocated. MSI requires contiguous blocks of vectors
|
||||||
|
while MSI-X can allocate several individual ones.
|
||||||
|
|
||||||
|
MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
|
||||||
|
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
|
||||||
|
causes the PCI support to program CPU vector data into the PCI device
|
||||||
|
capability registers. Many architectures, chip-sets, or BIOSes do NOT
|
||||||
|
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
|
||||||
|
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
|
||||||
|
specify PCI_IRQ_LEGACY as well.
|
||||||
|
|
||||||
|
Drivers that have different interrupt handlers for MSI/MSI-X and
|
||||||
|
legacy INTx should choose the right one based on the msi_enabled
|
||||||
|
and msix_enabled flags in the pci_dev structure after calling
|
||||||
|
pci_alloc_irq_vectors.
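Why PCI_IRQ_LEGACY should be part of the request can be shown with a toy userspace model. The three flag values are the ones defined in <linux/pci.h>; `struct fake_platform`, `enum irq_kind` and `model_alloc_irq_vectors()` are hypothetical stand-ins, and the fallback order (MSI-X, then MSI, then legacy INTx) is a simplification of what the PCI core does, not a copy of it.

```c
/* Flag values as defined in <linux/pci.h>. */
#define PCI_IRQ_LEGACY (1 << 0)
#define PCI_IRQ_MSI    (1 << 1)
#define PCI_IRQ_MSIX   (1 << 2)

/* Hypothetical: which interrupt type the model ended up with. */
enum irq_kind { KIND_NONE, KIND_INTX, KIND_MSI, KIND_MSIX };

/* Hypothetical: what device and platform together can deliver. */
struct fake_platform {
    unsigned int msix_vecs; /* 0 means MSI-X is unusable */
    unsigned int msi_vecs;  /* 0 means MSI is unusable */
};

unsigned int clamp_vecs(unsigned int avail, unsigned int max_vecs)
{
    return avail < max_vecs ? avail : max_vecs;
}

/* Returns the number of vectors granted, or -1 if nothing could be set up. */
int model_alloc_irq_vectors(const struct fake_platform *p,
                            unsigned int min_vecs, unsigned int max_vecs,
                            unsigned int flags, enum irq_kind *kind)
{
    *kind = KIND_NONE;
    if (min_vecs == 0 || max_vecs < min_vecs)
        return -1;
    if ((flags & PCI_IRQ_MSIX) && p->msix_vecs >= min_vecs) {
        *kind = KIND_MSIX;
        return (int)clamp_vecs(p->msix_vecs, max_vecs);
    }
    if ((flags & PCI_IRQ_MSI) && p->msi_vecs >= min_vecs) {
        *kind = KIND_MSI;
        return (int)clamp_vecs(p->msi_vecs, max_vecs);
    }
    /* A legacy INTx line can only ever provide a single vector. */
    if ((flags & PCI_IRQ_LEGACY) && min_vecs == 1) {
        *kind = KIND_INTX;
        return 1;
    }
    return -1;
}
```

On a platform without MSI or MSI-X support the request fails unless PCI_IRQ_LEGACY is included, in which case it degrades to a single INTx vector. The resulting kind plays the role of the msi_enabled and msix_enabled flags when the driver picks its handler.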
|
||||||
|
|
||||||
|
There are (at least) two really good reasons for using MSI:
|
||||||
|
|
||||||
|
1) MSI is an exclusive interrupt vector by definition.
|
||||||
|
This means the interrupt handler doesn't have to verify
|
||||||
|
its device caused the interrupt.
|
||||||
|
|
||||||
|
2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
|
||||||
|
to be visible to the host CPU(s) when the MSI is delivered. This
|
||||||
|
is important for both data coherency and avoiding stale control data.
|
||||||
|
This guarantee allows the driver to omit MMIO reads to flush
|
||||||
|
the DMA stream.
|
||||||
|
|
||||||
|
See drivers/infiniband/hw/mthca/ or drivers/net/tg3.c for examples
|
||||||
|
of MSI/MSI-X usage.
|
||||||
|
|
||||||
|
|
||||||
|
PCI device shutdown
|
||||||
|
===================
|
||||||
|
|
||||||
|
When a PCI device driver is being unloaded, most of the following
|
||||||
|
steps need to be performed:
|
||||||
|
|
||||||
|
- Disable the device from generating IRQs
|
||||||
|
- Release the IRQ (free_irq())
|
||||||
|
- Stop all DMA activity
|
||||||
|
- Release DMA buffers (both streaming and consistent)
|
||||||
|
- Unregister from other subsystems (e.g. scsi or netdev)
|
||||||
|
- Disable device from responding to MMIO/IO Port addresses
|
||||||
|
- Release MMIO/IO Port resource(s)
|
||||||
|
|
||||||
|
|
||||||
|
Stop IRQs on the device
|
||||||
|
-----------------------
|
||||||
|
How to do this is chip/device specific. If it's not done, it opens
|
||||||
|
the possibility of a "screaming interrupt" if (and only if)
|
||||||
|
the IRQ is shared with another device.
|
||||||
|
|
||||||
|
When the shared IRQ handler is "unhooked", the remaining devices
|
||||||
|
using the same IRQ line will still need the IRQ enabled. Thus if the
|
||||||
|
"unhooked" device asserts the IRQ line, the system will respond assuming
|
||||||
|
that one of the remaining devices asserted the IRQ line. Since none
|
||||||
|
of the other devices will handle the IRQ, the system will "hang" until
|
||||||
|
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
|
||||||
|
iterations later). Once the shared IRQ is masked, the remaining devices
|
||||||
|
will stop functioning properly. Not a nice situation.
|
||||||
|
|
||||||
|
This is another reason to use MSI or MSI-X if it's available.
|
||||||
|
MSI and MSI-X are defined to be exclusive interrupts and thus
|
||||||
|
are not susceptible to the "screaming interrupt" problem.
|
||||||
|
|
||||||
|
|
||||||
|
Release the IRQ
|
||||||
|
---------------
|
||||||
|
Once the device is quiesced (no more IRQs), one can call free_irq().
|
||||||
|
This function will return control once any pending IRQs are handled,
|
||||||
|
"unhook" the driver's IRQ handler from that IRQ, and finally release
|
||||||
|
the IRQ if no one else is using it.
|
||||||
|
|
||||||
|
|
||||||
|
Stop all DMA activity
|
||||||
|
---------------------
|
||||||
|
It's extremely important to stop all DMA operations BEFORE attempting
|
||||||
|
to deallocate DMA control data. Failure to do so can result in memory
|
||||||
|
corruption, hangs, and on some chip-sets a hard crash.
|
||||||
|
|
||||||
|
Stopping DMA after stopping the IRQs can avoid races where the
|
||||||
|
IRQ handler might restart DMA engines.
|
||||||
|
|
||||||
|
While this step sounds obvious and trivial, several "mature" drivers
|
||||||
|
didn't get this step right in the past.
|
||||||
|
|
||||||
|
|
||||||
|
Release DMA buffers
|
||||||
|
-------------------
|
||||||
|
Once DMA is stopped, clean up streaming DMA first.
|
||||||
|
I.e. unmap data buffers and return buffers to "upstream"
|
||||||
|
owners if there are any.
|
||||||
|
|
||||||
|
Then clean up "consistent" buffers which contain the control data.
|
||||||
|
|
||||||
|
See Documentation/DMA-API.txt for details on unmapping interfaces.
|
||||||
|
|
||||||
|
|
||||||
|
Unregister from other subsystems
|
||||||
|
--------------------------------
|
||||||
|
Most low level PCI device drivers support some other subsystem
|
||||||
|
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
|
||||||
|
driver isn't losing resources from that other subsystem.
|
||||||
|
If this happens, typically the symptom is an Oops (panic) when
|
||||||
|
the subsystem attempts to call into a driver that has been unloaded.
|
||||||
|
|
||||||
|
|
||||||
|
Disable Device from responding to MMIO/IO Port addresses
|
||||||
|
--------------------------------------------------------
|
||||||
|
iounmap() MMIO or IO Port resources and then call pci_disable_device().
|
||||||
|
This is the symmetric opposite of pci_enable_device().
|
||||||
|
Do not access device registers after calling pci_disable_device().
|
||||||
|
|
||||||
|
|
||||||
|
Release MMIO/IO Port Resource(s)
|
||||||
|
--------------------------------
|
||||||
|
Call pci_release_region() to mark the MMIO or IO Port range as available.
|
||||||
|
Failure to do so usually results in the inability to reload the driver.
|
||||||
|
|
||||||
|
|
||||||
|
How to access PCI config space
|
||||||
|
==============================
|
||||||
|
|
||||||
|
You can use `pci_(read|write)_config_(byte|word|dword)` to access the config
|
||||||
|
space of a device represented by `struct pci_dev *`. All these functions return
|
||||||
|
0 when successful or an error code (`PCIBIOS_...`) which can be translated to a
|
||||||
|
text string by pcibios_strerror. Most drivers expect that accesses to valid PCI
|
||||||
|
devices don't fail.
|
||||||
|
|
||||||
|
If you don't have a struct pci_dev available, you can call
|
||||||
|
`pci_bus_(read|write)_config_(byte|word|dword)` to access a given device
|
||||||
|
and function on that bus.
|
||||||
|
|
||||||
|
If you access fields in the standard portion of the config header, please
|
||||||
|
use symbolic names of locations and bits declared in <linux/pci.h>.
|
||||||
|
|
||||||
|
If you need to access Extended PCI Capability registers, just call
|
||||||
|
pci_find_capability() for the particular capability and it will find the
|
||||||
|
corresponding register block for you.
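How such a lookup works for the standard capability list can be sketched in userspace over a 256-byte copy of config space. The register offsets and the status bit are the values from the PCI specification; `find_capability()` is a hypothetical model written for this illustration, not the kernel's implementation, and it only covers the standard list in the first 256 bytes.

```c
#include <stdint.h>

#define PCI_STATUS          0x06 /* status register (low byte) */
#define PCI_STATUS_CAP_LIST 0x10 /* device implements a capability list */
#define PCI_CAPABILITY_LIST 0x34 /* offset of the first capability */

/*
 * Hypothetical model: walk the linked list of capabilities and return the
 * config space offset of the one with the given ID, or 0 if it is absent.
 */
int find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
    int ttl = 48; /* bound the walk in case the list loops */
    uint8_t pos;

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc; /* entries are dword aligned */
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];

        if (id == 0xff) /* device is not responding */
            break;
        if (id == cap_id)
            return pos;
        pos = cfg[pos + 1] & 0xfc; /* next pointer follows the ID byte */
    }
    return 0;
}
```

Each capability starts with an ID byte followed by a pointer to the next entry, so the walk ends when the pointer drops below the end of the standard header (0x40).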
|
||||||
|
|
||||||
|
|
||||||
|
Other interesting functions
|
||||||
|
===========================
|
||||||
|
|
||||||
|
============================= ================================================
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
                              bus, and slot/function number. If the device is
                              found, its reference count is increased.
pci_set_power_state()         Set PCI Power Management state (0=D0 ... 3=D3)
pci_find_capability()         Find specified capability in device's capability
                              list.
pci_resource_start()          Returns bus start address for a given PCI region
pci_resource_end()            Returns bus end address for a given PCI region
pci_resource_len()            Returns the byte length of a PCI region
pci_set_drvdata()             Set private driver data pointer for a pci_dev
pci_get_drvdata()             Return private driver data pointer for a pci_dev
pci_set_mwi()                 Enable Memory-Write-Invalidate transactions.
pci_clear_mwi()               Disable Memory-Write-Invalidate transactions.
============================= ================================================
|
||||||
|
|
||||||
|
|
||||||
|
Miscellaneous hints
|
||||||
|
===================
|
||||||
|
|
||||||
|
When displaying PCI device names to the user (for example when a driver wants
|
||||||
|
to tell the user what card it has found), please use pci_name(pci_dev).
|
||||||
|
|
||||||
|
Always refer to the PCI devices by a pointer to the pci_dev structure.
|
||||||
|
All PCI layer functions use this identification and it's the only
|
||||||
|
reasonable one. Don't use bus/slot/function numbers except for very
|
||||||
|
special purposes -- on systems with multiple primary buses their semantics
|
||||||
|
can be pretty complex.
|
||||||
|
|
||||||
|
Don't try to turn on Fast Back to Back writes in your driver. All devices
|
||||||
|
on the bus need to be capable of doing it, so this is something which needs
|
||||||
|
to be handled by platform and generic code, not individual drivers.
|
||||||
|
|
||||||
|
|
||||||
|
Vendor and device identifications
|
||||||
|
=================================
|
||||||
|
|
||||||
|
Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
|
||||||
|
are shared across multiple drivers. You can add private definitions in
|
||||||
|
your driver if they're helpful, or just use plain hex constants.
|
||||||
|
|
||||||
|
The device IDs are arbitrary hex numbers (vendor controlled) and normally used
|
||||||
|
only in a single location, the pci_device_id table.
|
||||||
|
|
||||||
|
Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
|
||||||
|
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
|
||||||
|
and https://github.com/pciutils/pciids.
|
||||||
|
|
||||||
|
|
||||||
|
Obsolete functions
|
||||||
|
==================
|
||||||
|
|
||||||
|
There are several functions which you might come across when trying to
|
||||||
|
port an old driver to the new PCI interface. They are no longer present
|
||||||
|
in the kernel as they aren't compatible with hotplug or PCI domains or
|
||||||
|
having sane locking.
|
||||||
|
|
||||||
|
================= ===========================================
pci_find_device() Superseded by pci_get_device()
pci_find_subsys() Superseded by pci_get_subsys()
pci_find_slot()   Superseded by pci_get_domain_bus_and_slot()
pci_get_slot()    Superseded by pci_get_domain_bus_and_slot()
================= ===========================================
|
||||||
|
|
||||||
|
The alternative is the traditional PCI device driver that walks PCI
|
||||||
|
device lists. This is still possible but discouraged.
|
||||||
|
|
||||||
|
|
||||||
|
MMIO Space and "Write Posting"
|
||||||
|
==============================
|
||||||
|
|
||||||
|
Converting a driver from using I/O Port space to using MMIO space
|
||||||
|
often requires some additional changes. Specifically, "write posting"
|
||||||
|
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
|
||||||
|
already do this. I/O Port space guarantees write transactions reach the PCI
|
||||||
|
device before the CPU can continue. Writes to MMIO space allow the CPU
|
||||||
|
to continue before the transaction reaches the PCI device. HW weenies
|
||||||
|
call this "Write Posting" because the write completion is "posted" to
|
||||||
|
the CPU before the transaction has reached its destination.
|
||||||
|
|
||||||
|
Thus, timing sensitive code should add readl() where the CPU is
|
||||||
|
expected to wait before doing other work. The classic "bit banging"
|
||||||
|
sequence works fine for I/O Port space::
|
||||||
|
|
||||||
|
for (i = 8; --i; val >>= 1) {
|
||||||
|
outb(val & 1, ioport_reg); /* write bit */
|
||||||
|
udelay(10);
|
||||||
|
}
|
||||||
|
|
||||||
|
The same sequence for MMIO space should be::
|
||||||
|
|
||||||
|
for (i = 8; --i; val >>= 1) {
|
||||||
|
writeb(val & 1, mmio_reg); /* write bit */
|
||||||
|
readb(safe_mmio_reg); /* flush posted write */
|
||||||
|
udelay(10);
|
||||||
|
}
|
||||||
|
|
||||||
|
It is important that "safe_mmio_reg" not have any side effects that
|
||||||
|
interfere with the correct operation of the device.
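The effect of the extra read can be modeled in userspace. Everything in this sketch is hypothetical (`struct fake_bridge`, `fake_writeb()`, `fake_readb()`, the queue depth); it only mimics the ordering rule that a read from the device forces all earlier posted writes to complete first.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_NREGS 4 /* registers on the pretend device */
#define FAKE_QLEN  8 /* depth of the pretend posting buffer */

struct fake_posted_write {
    unsigned int reg;
    uint8_t val;
};

struct fake_bridge {
    uint8_t regs[FAKE_NREGS];                  /* what the device has seen */
    struct fake_posted_write queue[FAKE_QLEN]; /* writes still in flight */
    size_t pending;
};

/* Deliver all queued writes to the device, in order. */
void fake_flush(struct fake_bridge *b)
{
    size_t i;

    for (i = 0; i < b->pending; i++)
        b->regs[b->queue[i].reg % FAKE_NREGS] = b->queue[i].val;
    b->pending = 0;
}

/* Posted write: returns to the "CPU" before the device has seen the value. */
void fake_writeb(struct fake_bridge *b, uint8_t val, unsigned int reg)
{
    if (b->pending == FAKE_QLEN)
        fake_flush(b);
    b->queue[b->pending].reg = reg;
    b->queue[b->pending].val = val;
    b->pending++;
}

/* A read cannot pass earlier writes, so it flushes them before returning. */
uint8_t fake_readb(struct fake_bridge *b, unsigned int reg)
{
    fake_flush(b);
    return b->regs[reg % FAKE_NREGS];
}
```

In this model, a delay that starts right after `fake_writeb()` may elapse while the write is still queued; inserting `fake_readb()` on a side-effect-free register guarantees the device has the new value before the delay begins.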
|
||||||
|
|
||||||
|
Another case to watch out for is when resetting a PCI device. Use PCI
|
||||||
|
Configuration space reads to flush the writel(). This will gracefully
|
||||||
|
handle the PCI master abort on all platforms if the PCI device is
|
||||||
|
expected to not respond to a readl(). Most x86 platforms will allow
|
||||||
|
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
|
||||||
|
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
|
|
@ -1,636 +0,0 @@
|
||||||
|
|
||||||
How To Write Linux PCI Drivers
|
|
||||||
|
|
||||||
by Martin Mares <mj@ucw.cz> on 07-Feb-2000
|
|
||||||
updated by Grant Grundler <grundler@parisc-linux.org> on 23-Dec-2006
|
|
||||||
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
The world of PCI is vast and full of (mostly unpleasant) surprises.
|
|
||||||
Since each CPU architecture implements different chip-sets and PCI devices
|
|
||||||
have different requirements (erm, "features"), the result is the PCI support
|
|
||||||
in the Linux kernel is not as trivial as one would wish. This short paper
|
|
||||||
tries to introduce all potential driver authors to Linux APIs for
|
|
||||||
PCI device drivers.
|
|
||||||
|
|
||||||
A more complete resource is the third edition of "Linux Device Drivers"
|
|
||||||
by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman.
|
|
||||||
LDD3 is available for free (under Creative Commons License) from:
|
|
||||||
|
|
||||||
http://lwn.net/Kernel/LDD3/
|
|
||||||
|
|
||||||
However, keep in mind that all documents are subject to "bit rot".
|
|
||||||
Refer to the source code if things are not working as described here.
|
|
||||||
|
|
||||||
Please send questions/comments/patches about Linux PCI API to the
|
|
||||||
"Linux PCI" <linux-pci@atrey.karlin.mff.cuni.cz> mailing list.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
0. Structure of PCI drivers
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
PCI drivers "discover" PCI devices in a system via pci_register_driver().
|
|
||||||
Actually, it's the other way around. When the PCI generic code discovers
|
|
||||||
a new device, the driver with a matching "description" will be notified.
|
|
||||||
Details on this below.
|
|
||||||
|
|
||||||
pci_register_driver() leaves most of the probing for devices to
|
|
||||||
the PCI layer and supports online insertion/removal of devices [thus
|
|
||||||
supporting hot-pluggable PCI, CardBus, and Express-Card in a single driver].
|
|
||||||
The pci_register_driver() call requires passing in a table of function
|
|
||||||
pointers and thus dictates the high level structure of a driver.
|
|
||||||
|
|
||||||
Once the driver knows about a PCI device and takes ownership, the
|
|
||||||
driver generally needs to perform the following initialization:
|
|
||||||
|
|
||||||
Enable the device
|
|
||||||
Request MMIO/IOP resources
|
|
||||||
Set the DMA mask size (for both coherent and streaming DMA)
|
|
||||||
Allocate and initialize shared control data (pci_alloc_consistent())
|
|
||||||
Access device configuration space (if needed)
|
|
||||||
Register IRQ handler (request_irq())
|
|
||||||
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
|
|
||||||
Enable DMA/processing engines
|
|
||||||
|
|
||||||
When done using the device, and perhaps the module needs to be unloaded,
|
|
||||||
the driver needs to take the following steps:
|
|
||||||
Disable the device from generating IRQs
|
|
||||||
Release the IRQ (free_irq())
|
|
||||||
Stop all DMA activity
|
|
||||||
Release DMA buffers (both streaming and coherent)
|
|
||||||
Unregister from other subsystems (e.g. scsi or netdev)
|
|
||||||
Release MMIO/IOP resources
|
|
||||||
Disable the device
|
|
||||||
|
|
||||||
Most of these topics are covered in the following sections.
|
|
||||||
For the rest look at LDD3 or <linux/pci.h> .
|
|
||||||
|
|
||||||
If the PCI subsystem is not configured (CONFIG_PCI is not set), most of
|
|
||||||
the PCI functions described below are defined as inline functions either
|
|
||||||
completely empty or just returning an appropriate error code to avoid
|
|
||||||
lots of ifdefs in the drivers.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
1. pci_register_driver() call
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
PCI device drivers call pci_register_driver() during their
|
|
||||||
initialization with a pointer to a structure describing the driver
|
|
||||||
(struct pci_driver):
|
|
||||||
|
|
||||||
field name Description
|
|
||||||
---------- ------------------------------------------------------
|
|
||||||
id_table Pointer to table of device IDs the driver is
|
|
||||||
interested in. Most drivers should export this
|
|
||||||
table using MODULE_DEVICE_TABLE(pci,...).
|
|
||||||
|
|
||||||
probe This probing function gets called (during execution
|
|
||||||
of pci_register_driver() for already existing
|
|
||||||
devices or later if a new device gets inserted) for
|
|
||||||
all PCI devices which match the ID table and are not
|
|
||||||
"owned" by the other drivers yet. This function gets
|
|
||||||
passed a "struct pci_dev *" for each device whose
|
|
||||||
entry in the ID table matches the device. The probe
|
|
||||||
function returns zero when the driver chooses to
|
|
||||||
take "ownership" of the device or an error code
|
|
||||||
(negative number) otherwise.
|
|
||||||
The probe function always gets called from process
|
|
||||||
context, so it can sleep.
|
|
||||||
|
|
||||||
remove The remove() function gets called whenever a device
|
|
||||||
being handled by this driver is removed (either during
|
|
||||||
deregistration of the driver or when it's manually
|
|
||||||
pulled out of a hot-pluggable slot).
|
|
||||||
The remove function always gets called from process
|
|
||||||
context, so it can sleep.
|
|
||||||
|
|
||||||
suspend Put device into low power state.
|
|
||||||
suspend_late Put device into low power state.
|
|
||||||
|
|
||||||
resume_early Wake device from low power state.
|
|
||||||
resume Wake device from low power state.
|
|
||||||
|
|
||||||
(Please see Documentation/power/pci.txt for descriptions
|
|
||||||
of PCI Power Management and the related functions.)
|
|
||||||
|
|
||||||
shutdown Hook into reboot_notifier_list (kernel/sys.c).
|
|
||||||
Intended to stop any idling DMA operations.
|
|
||||||
Useful for enabling wake-on-lan (NIC) or changing
|
|
||||||
the power state of a device before reboot.
|
|
||||||
e.g. drivers/net/e100.c.
|
|
||||||
|
|
||||||
err_handler See Documentation/PCI/pci-error-recovery.txt
|
|
||||||
|
|
||||||
|
|
||||||
The ID table is an array of struct pci_device_id entries ending with an
|
|
||||||
all-zero entry. Definitions with static const are generally preferred.
|
|
||||||
|
|
||||||
Each entry consists of:
|
|
||||||
|
|
||||||
vendor,device Vendor and device ID to match (or PCI_ANY_ID)
|
|
||||||
|
|
||||||
subvendor, Subsystem vendor and device ID to match (or PCI_ANY_ID)
|
|
||||||
subdevice,
|
|
||||||
|
|
||||||
class Device class, subclass, and "interface" to match.
|
|
||||||
See Appendix D of the PCI Local Bus Spec or
|
|
||||||
include/linux/pci_ids.h for a full list of classes.
|
|
||||||
Most drivers do not need to specify class/class_mask
|
|
||||||
as vendor/device is normally sufficient.
|
|
||||||
|
|
||||||
class_mask limit which sub-fields of the class field are compared.
|
|
||||||
See drivers/scsi/sym53c8xx_2/ for example of usage.
|
|
||||||
|
|
||||||
driver_data Data private to the driver.
|
|
||||||
Most drivers don't need to use driver_data field.
|
|
||||||
Best practice is to use driver_data as an index
|
|
||||||
into a static list of equivalent device types,
|
|
||||||
instead of using it as a pointer.
|
|
||||||
|
|
||||||
|
|
||||||
Most drivers only need PCI_DEVICE() or PCI_DEVICE_CLASS() to set up
|
|
||||||
a pci_device_id table.
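How the fields of an ID table entry are compared against a device can be sketched in userspace as follows. The structures are simplified stand-ins for struct pci_device_id and struct pci_dev (the field `class_code` corresponds to the kernel's `class` field), and `id_matches()` and `match_table()` are hypothetical names; the comparison rule itself (PCI_ANY_ID as wildcard, class compared only under class_mask) follows the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ANY_ID (~0u) /* wildcard for vendor/device/subsystem fields */

/* Simplified stand-in for struct pci_device_id. */
struct id_entry {
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;
    unsigned long driver_data;
};

/* Simplified stand-in for the ID fields of struct pci_dev. */
struct dev_ids {
    uint16_t vendor, device;
    uint16_t subsystem_vendor, subsystem_device;
    uint32_t class_code; /* 24 bits: class, subclass, interface */
};

bool id_matches(const struct id_entry *id, const struct dev_ids *dev)
{
    return (id->vendor == PCI_ANY_ID || id->vendor == dev->vendor) &&
           (id->device == PCI_ANY_ID || id->device == dev->device) &&
           (id->subvendor == PCI_ANY_ID ||
            id->subvendor == dev->subsystem_vendor) &&
           (id->subdevice == PCI_ANY_ID ||
            id->subdevice == dev->subsystem_device) &&
           /* Only the class bits selected by class_mask are compared. */
           !((id->class_code ^ dev->class_code) & id->class_mask);
}

/* Return the first matching entry; the table ends with an all-zero entry. */
const struct id_entry *match_table(const struct id_entry *ids,
                                   const struct dev_ids *dev)
{
    for (; ids->vendor || ids->subvendor || ids->class_mask; ids++)
        if (id_matches(ids, dev))
            return ids;
    return NULL;
}
```

An entry with a class_mask of 0 ignores the class entirely, which is why most drivers can leave class and class_mask unset and match on vendor and device alone.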
|
|
||||||
|
|
||||||
New PCI IDs may be added to a device driver pci_ids table at runtime
|
|
||||||
as shown below:
|
|
||||||
|
|
||||||
echo "vendor device subvendor subdevice class class_mask driver_data" > \
|
|
||||||
/sys/bus/pci/drivers/{driver}/new_id
|
|
||||||
|
|
||||||
All fields are passed in as hexadecimal values (no leading 0x).
|
|
||||||
The vendor and device fields are mandatory, the others are optional. Users
|
|
||||||
need pass only as many optional fields as necessary:
|
|
||||||
o subvendor and subdevice fields default to PCI_ANY_ID (FFFFFFFF)
|
|
||||||
o class and class_mask fields default to 0
|
|
||||||
o driver_data defaults to 0UL.
|
|
||||||
|
|
||||||
Note that driver_data must match the value used by any of the pci_device_id
|
|
||||||
entries defined in the driver. This makes the driver_data field mandatory
|
|
||||||
if all the pci_device_id entries have a non-zero driver_data value.
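The defaulting rules for the new_id fields can be modeled with a short userspace parser. `struct new_id` and `parse_new_id()` are hypothetical names for this sketch, which assumes the fields are whitespace-separated hexadecimal values as described above; it is not the kernel's sysfs handler.

```c
#include <stdio.h>

#define PCI_ANY_ID (~0u)

/* Hypothetical container for the seven new_id fields. */
struct new_id {
    unsigned int vendor, device;
    unsigned int subvendor, subdevice;
    unsigned int class_code, class_mask;
    unsigned long driver_data;
};

/* Returns 0 on success, -1 if the mandatory vendor/device pair is missing. */
int parse_new_id(const char *buf, struct new_id *id)
{
    int fields;

    /* Preload the documented defaults; sscanf() overwrites what is given. */
    id->vendor = 0;
    id->device = 0;
    id->subvendor = PCI_ANY_ID;
    id->subdevice = PCI_ANY_ID;
    id->class_code = 0;
    id->class_mask = 0;
    id->driver_data = 0UL;

    fields = sscanf(buf, "%x %x %x %x %x %x %lx",
                    &id->vendor, &id->device,
                    &id->subvendor, &id->subdevice,
                    &id->class_code, &id->class_mask,
                    &id->driver_data);
    return fields < 2 ? -1 : 0;
}
```

Passing only "8086 100e" therefore yields an entry that matches any subsystem and any class, with driver_data left at 0.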
|
|
||||||
|
|
||||||
Once added, the driver probe routine will be invoked for any unclaimed
|
|
||||||
PCI devices listed in its (newly updated) pci_ids list.
|
|
||||||
|
|
||||||
When the driver exits, it just calls pci_unregister_driver() and the PCI layer
|
|
||||||
automatically calls the remove hook for all devices handled by the driver.
|
|
||||||
|
|
||||||
|
|
||||||
1.1 "Attributes" for driver functions/data
|
|
||||||
|
|
||||||
Please mark the initialization and cleanup functions where appropriate
|
|
||||||
(the corresponding macros are defined in <linux/init.h>):
|
|
||||||
|
|
||||||
__init Initialization code. Thrown away after the driver
|
|
||||||
initializes.
|
|
||||||
__exit Exit code. Ignored for non-modular drivers.
|
|
||||||
|
|
||||||
Tips on when/where to use the above attributes:
|
|
||||||
o The module_init()/module_exit() functions (and all
|
|
||||||
initialization functions called _only_ from these)
|
|
||||||
should be marked __init/__exit.
|
|
||||||
|
|
||||||
o Do not mark the struct pci_driver.
|
|
||||||
|
|
||||||
o Do NOT mark a function if you are not sure which mark to use.
|
|
||||||
Better to not mark the function than mark the function wrong.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
2. How to find PCI devices manually
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
PCI drivers should have a really good reason for not using the
|
|
||||||
pci_register_driver() interface to search for PCI devices.
|
|
||||||
The main reason PCI devices are controlled by multiple drivers
|
|
||||||
is that one PCI device implements several different HW services.
|
|
||||||
E.g. combined serial/parallel port/floppy controller.
|
|
||||||
|
|
||||||
A manual search may be performed using the following constructs:
|
|
||||||
|
|
||||||
Searching by vendor and device ID:
|
|
||||||
|
|
||||||
struct pci_dev *dev = NULL;
|
|
||||||
while ((dev = pci_get_device(VENDOR_ID, DEVICE_ID, dev)))
|
|
||||||
configure_device(dev);
|
|
||||||
|
|
||||||
Searching by class ID (iterate in a similar way):
|
|
||||||
|
|
||||||
pci_get_class(CLASS_ID, dev)
|
|
||||||
|
|
||||||
Searching by both vendor/device and subsystem vendor/device ID:
|
|
||||||
|
|
||||||
pci_get_subsys(VENDOR_ID, DEVICE_ID, SUBSYS_VENDOR_ID, SUBSYS_DEVICE_ID, dev).
|
|
||||||
|
|
||||||
You can use the constant PCI_ANY_ID as a wildcard replacement for
|
|
||||||
VENDOR_ID or DEVICE_ID. This allows searching for any device from a
|
|
||||||
specific vendor, for example.
|
|
||||||
|
|
||||||
These functions are hotplug-safe. They increment the reference count on
|
|
||||||
the pci_dev that they return. You must eventually (possibly at module unload)
|
|
||||||
decrement the reference count on these devices by calling pci_dev_put().
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
3. Device Initialization Steps
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
As noted in the introduction, most PCI drivers need the following steps
|
|
||||||
for device initialization:
|
|
||||||
|
|
||||||
Enable the device
|
|
||||||
Request MMIO/IOP resources
|
|
||||||
Set the DMA mask size (for both coherent and streaming DMA)
|
|
||||||
Allocate and initialize shared control data (pci_alloc_consistent())
|
|
||||||
Access device configuration space (if needed)
|
|
||||||
Register IRQ handler (request_irq())
|
|
||||||
Initialize non-PCI (i.e. LAN/SCSI/etc parts of the chip)
|
|
||||||
Enable DMA/processing engines.
|
|
||||||
|
|
||||||
The driver can access PCI config space registers at any time.
|
|
||||||
(Well, almost. When running BIST, config space can go away...but
|
|
||||||
that will just result in a PCI Bus Master Abort and config reads
|
|
||||||
will return garbage).
|
|
||||||
|
|
||||||
|
|
||||||
3.1 Enable the PCI device
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Before touching any device registers, the driver needs to enable
|
|
||||||
the PCI device by calling pci_enable_device(). This will:
|
|
||||||
o wake up the device if it was in suspended state,
|
|
||||||
o allocate I/O and memory regions of the device (if BIOS did not),
|
|
||||||
o allocate an IRQ (if BIOS did not).
|
|
||||||
|
|
||||||
NOTE: pci_enable_device() can fail! Check the return value.
|
|
||||||
|
|
||||||
[ OS BUG: we don't check resource allocations before enabling those
|
|
||||||
resources. The sequence would make more sense if we called
|
|
||||||
pci_request_resources() before calling pci_enable_device().
|
|
||||||
Currently, the device drivers can't detect the bug when two
|
|
||||||
devices have been allocated the same range. This is not a common
|
|
||||||
problem and unlikely to get fixed soon.
|
|
||||||
|
|
||||||
This has been discussed before but not changed as of 2.6.19:
|
|
||||||
http://lkml.org/lkml/2006/3/2/194
|
|
||||||
]
|
|
||||||
|
|
||||||
pci_set_master() will enable DMA by setting the bus master bit
|
|
||||||
in the PCI_COMMAND register. It also fixes the latency timer value if
|
|
||||||
it's set to something bogus by the BIOS. pci_clear_master() will
|
|
||||||
disable DMA by clearing the bus master bit.
|
|
||||||
|
|
||||||
If the PCI device can use the PCI Memory-Write-Invalidate transaction,
|
|
||||||
call pci_set_mwi(). This enables the PCI_COMMAND bit for Mem-Wr-Inval
|
|
||||||
and also ensures that the cache line size register is set correctly.
|
|
||||||
Check the return value of pci_set_mwi() as not all architectures
|
|
||||||
or chip-sets may support Memory-Write-Invalidate. Alternatively,
|
|
||||||
if Mem-Wr-Inval would be nice to have but is not required, call
|
|
||||||
pci_try_set_mwi() to have the system make its best effort at enabling
|
|
||||||
Mem-Wr-Inval.
|
|
||||||
|
|
||||||
|
|
||||||
3.2 Request MMIO/IOP resources
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Memory (MMIO) and I/O port addresses should NOT be read directly
|
|
||||||
from the PCI device config space. Use the values in the pci_dev structure
|
|
||||||
as the PCI "bus address" might have been remapped to a "host physical"
|
|
||||||
address by the arch/chip-set specific kernel support.
|
|
||||||
|
|
||||||
See Documentation/io-mapping.txt for how to access device registers
|
|
||||||
or device memory.
|
|
||||||
|
|
||||||
The device driver needs to call pci_request_region() to verify
|
|
||||||
no other device is already using the same address resource.
|
|
||||||
Conversely, drivers should call pci_release_region() AFTER
|
|
||||||
calling pci_disable_device().
|
|
||||||
The idea is to prevent two devices colliding on the same address range.
|
|
||||||
|
|
||||||
[ See OS BUG comment above. Currently (2.6.19), the driver can only
|
|
||||||
determine MMIO and IO Port resource availability _after_ calling
|
|
||||||
pci_enable_device(). ]
|
|
||||||
|
|
||||||
Generic flavors of pci_request_region() are request_mem_region()
|
|
||||||
(for MMIO ranges) and request_region() (for IO Port ranges).
|
|
||||||
Use these for address resources that are not described by "normal" PCI
|
|
||||||
BARs.
|
|
||||||
|
|
||||||
Also see pci_request_selected_regions() below.
|
|
||||||
|
|
||||||
|
|
||||||
3.3 Set the DMA mask size
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
[ If anything below doesn't make sense, please refer to
|
|
||||||
Documentation/DMA-API.txt. This section is just a reminder that
|
|
||||||
drivers need to indicate DMA capabilities of the device and is not
|
|
||||||
an authoritative source for DMA interfaces. ]
|
|
||||||
|
|
||||||
While all drivers should explicitly indicate the DMA capability
|
|
||||||
(e.g. 32 or 64 bit) of the PCI bus master, devices with more than
|
|
||||||
32-bit bus master capability for streaming data need the driver
|
|
||||||
to "register" this capability by calling pci_set_dma_mask() with
|
|
||||||
appropriate parameters. In general this allows more efficient DMA
|
|
||||||
on systems where System RAM exists above 4G _physical_ address.
|
|
||||||
|
|
||||||
Drivers for all PCI-X and PCIe compliant devices must call
|
|
||||||
pci_set_dma_mask() as they are 64-bit DMA devices.
|
|
||||||
|
|
||||||
Similarly, drivers must also "register" this capability if the device
|
|
||||||
can directly address "consistent memory" in System RAM above 4G physical
|
|
||||||
address by calling pci_set_consistent_dma_mask().
|
|
||||||
Again, this includes drivers for all PCI-X and PCIe compliant devices.
|
|
||||||
Many 64-bit "PCI" devices (before PCI-X) and some PCI-X devices are
|
|
||||||
64-bit DMA capable for payload ("streaming") data but not control
|
|
||||||
("consistent") data.
|
|
||||||
|
|
||||||
|
|
||||||
3.4 Setup shared control data
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Once the DMA masks are set, the driver can allocate "consistent" (a.k.a. shared)
|
|
||||||
memory. See Documentation/DMA-API.txt for a full description of
|
|
||||||
the DMA APIs. This section is just a reminder that it needs to be done
|
|
||||||
before enabling DMA on the device.
|
|
||||||
|
|
||||||
|
|
||||||
3.5 Initialize device registers
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Some drivers will need specific "capability" fields programmed
|
|
||||||
or other "vendor specific" register initialized or reset.
|
|
||||||
E.g. clearing pending interrupts.
|
|
||||||
|
|
||||||
|
|
||||||
3.6 Register IRQ handler
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
While calling request_irq() is the last step described here,
|
|
||||||
this is often just another intermediate step to initialize a device.
|
|
||||||
This step can often be deferred until the device is opened for use.
|
|
||||||
|
|
||||||
All interrupt handlers for IRQ lines should be registered with IRQF_SHARED
|
|
||||||
and use the devid to map IRQs to devices (remember that all PCI IRQ lines
|
|
||||||
can be shared).
|
|
||||||
|
|
||||||
request_irq() will associate an interrupt handler and device handle
|
|
||||||
with an interrupt number. Historically interrupt numbers represent
|
|
||||||
IRQ lines which run from the PCI device to the Interrupt controller.
|
|
||||||
With MSI and MSI-X (more below) the interrupt number is a CPU "vector".
|
|
||||||
|
|
||||||
request_irq() also enables the interrupt. Make sure the device is
|
|
||||||
quiesced and does not have any interrupts pending before registering
|
|
||||||
the interrupt handler.
|
|
||||||
|
|
||||||
MSI and MSI-X are PCI capabilities. Both are "Message Signaled Interrupts"
|
|
||||||
which deliver interrupts to the CPU via a DMA write to a Local APIC.
|
|
||||||
The fundamental difference between MSI and MSI-X is how multiple
|
|
||||||
"vectors" get allocated. MSI requires contiguous blocks of vectors
|
|
||||||
while MSI-X can allocate several individual ones.
|
|
||||||
|
|
||||||
MSI capability can be enabled by calling pci_alloc_irq_vectors() with the
|
|
||||||
PCI_IRQ_MSI and/or PCI_IRQ_MSIX flags before calling request_irq(). This
|
|
||||||
causes the PCI support to program CPU vector data into the PCI device
|
|
||||||
capability registers. Many architectures, chip-sets, or BIOSes do NOT
|
|
||||||
support MSI or MSI-X and a call to pci_alloc_irq_vectors with just
|
|
||||||
the PCI_IRQ_MSI and PCI_IRQ_MSIX flags will fail, so try to always
|
|
||||||
specify PCI_IRQ_LEGACY as well.
|
|
||||||
|
|
||||||
Drivers that have different interrupt handlers for MSI/MSI-X and
|
|
||||||
legacy INTx should choose the right one based on the msi_enabled
|
|
||||||
and msix_enabled flags in the pci_dev structure after calling
|
|
||||||
pci_alloc_irq_vectors.
|
|
||||||
|
|
||||||
There are (at least) two really good reasons for using MSI:
|
|
||||||
1) MSI is an exclusive interrupt vector by definition.
|
|
||||||
This means the interrupt handler doesn't have to verify
|
|
||||||
its device caused the interrupt.
|
|
||||||
|
|
||||||
2) MSI avoids DMA/IRQ race conditions. DMA to host memory is guaranteed
|
|
||||||
to be visible to the host CPU(s) when the MSI is delivered. This
|
|
||||||
is important for both data coherency and avoiding stale control data.
|
|
||||||
This guarantee allows the driver to omit MMIO reads to flush
|
|
||||||
the DMA stream.
|
|
||||||
|
|
||||||
See drivers/infiniband/hw/mthca/ or drivers/net/ethernet/broadcom/tg3.c for examples
|
|
||||||
of MSI/MSI-X usage.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
4. PCI device shutdown
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
When a PCI device driver is being unloaded, most of the following
|
|
||||||
steps need to be performed:
|
|
||||||
|
|
||||||
Disable the device from generating IRQs
|
|
||||||
Release the IRQ (free_irq())
|
|
||||||
Stop all DMA activity
|
|
||||||
Release DMA buffers (both streaming and consistent)
|
|
||||||
Unregister from other subsystems (e.g. scsi or netdev)
|
|
||||||
Disable device from responding to MMIO/IO Port addresses
|
|
||||||
Release MMIO/IO Port resource(s)
|
|
||||||
|
|
||||||
|
|
||||||
4.1 Stop IRQs on the device
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
How to do this is chip/device specific. If it's not done, it opens
|
|
||||||
the possibility of a "screaming interrupt" if (and only if)
|
|
||||||
the IRQ is shared with another device.
|
|
||||||
|
|
||||||
When the shared IRQ handler is "unhooked", the remaining devices
|
|
||||||
using the same IRQ line will still need the IRQ enabled. Thus if the
|
|
||||||
"unhooked" device asserts IRQ line, the system will respond assuming
|
|
||||||
it was one of the remaining devices that asserted the IRQ line. Since none
|
|
||||||
of the other devices will handle the IRQ, the system will "hang" until
|
|
||||||
it decides the IRQ isn't going to get handled and masks the IRQ (100,000
|
|
||||||
iterations later). Once the shared IRQ is masked, the remaining devices
|
|
||||||
will stop functioning properly. Not a nice situation.
|
|
||||||
|
|
||||||
This is another reason to use MSI or MSI-X if it's available.
|
|
||||||
MSI and MSI-X are defined to be exclusive interrupts and thus
|
|
||||||
are not susceptible to the "screaming interrupt" problem.
|
|
||||||
|
|
||||||
|
|
||||||
4.2 Release the IRQ
|
|
||||||
~~~~~~~~~~~~~~~~~~~
|
|
||||||
Once the device is quiesced (no more IRQs), one can call free_irq().
|
|
||||||
This function will return control once any pending IRQs are handled,
|
|
||||||
"unhook" the driver's IRQ handler from that IRQ, and finally release
|
|
||||||
the IRQ if no one else is using it.
|
|
||||||
|
|
||||||
|
|
||||||
4.3 Stop all DMA activity
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
It's extremely important to stop all DMA operations BEFORE attempting
|
|
||||||
to deallocate DMA control data. Failure to do so can result in memory
|
|
||||||
corruption, hangs, and on some chip-sets a hard crash.
|
|
||||||
|
|
||||||
Stopping DMA after stopping the IRQs can avoid races where the
|
|
||||||
IRQ handler might restart DMA engines.
|
|
||||||
|
|
||||||
While this step sounds obvious and trivial, several "mature" drivers
|
|
||||||
didn't get this step right in the past.
|
|
||||||
|
|
||||||
|
|
||||||
4.4 Release DMA buffers
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Once DMA is stopped, clean up streaming DMA first.
|
|
||||||
I.e. unmap data buffers and return buffers to "upstream"
|
|
||||||
owners if there is one.
|
|
||||||
|
|
||||||
Then clean up "consistent" buffers which contain the control data.
|
|
||||||
|
|
||||||
See Documentation/DMA-API.txt for details on unmapping interfaces.
|
|
||||||
|
|
||||||
|
|
||||||
4.5 Unregister from other subsystems
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Most low level PCI device drivers support some other subsystem
|
|
||||||
like USB, ALSA, SCSI, NetDev, Infiniband, etc. Make sure your
|
|
||||||
driver isn't losing resources from that other subsystem.
|
|
||||||
If this happens, typically the symptom is an Oops (panic) when
|
|
||||||
the subsystem attempts to call into a driver that has been unloaded.
|
|
||||||
|
|
||||||
|
|
||||||
4.6 Disable Device from responding to MMIO/IO Port addresses
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
iounmap() MMIO or IO Port resources and then call pci_disable_device().
|
|
||||||
This is the symmetric opposite of pci_enable_device().
|
|
||||||
Do not access device registers after calling pci_disable_device().
|
|
||||||
|
|
||||||
|
|
||||||
4.7 Release MMIO/IO Port Resource(s)
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
Call pci_release_region() to mark the MMIO or IO Port range as available.
|
|
||||||
Failure to do so usually results in the inability to reload the driver.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
5. How to access PCI config space
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
You can use pci_(read|write)_config_(byte|word|dword) to access the config
|
|
||||||
space of a device represented by struct pci_dev *. All these functions return 0
|
|
||||||
when successful or an error code (PCIBIOS_...) which can be translated to a text
|
|
||||||
string by pcibios_strerror. Most drivers expect that accesses to valid PCI
|
|
||||||
devices don't fail.
|
|
||||||
|
|
||||||
If you don't have a struct pci_dev available, you can call
|
|
||||||
pci_bus_(read|write)_config_(byte|word|dword) to access a given device
|
|
||||||
and function on that bus.
|
|
||||||
|
|
||||||
If you access fields in the standard portion of the config header, please
|
|
||||||
use symbolic names of locations and bits declared in <linux/pci.h>.
|
|
||||||
|
|
||||||
If you need to access Extended PCI Capability registers, just call
|
|
||||||
pci_find_capability() for the particular capability and it will find the
|
|
||||||
corresponding register block for you.
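The list that pci_find_capability() searches is a linked list stored in the first 256 bytes of config space. The user-space sketch below walks such a list in a byte array. It is a simplified model for illustration, not the kernel's implementation: find_capability() and the CFG_* names are invented for this example, although the numeric values mirror PCI_STATUS, PCI_STATUS_CAP_LIST and PCI_CAPABILITY_LIST.

```c
#include <stdint.h>

#define CFG_STATUS          0x06  /* offset of the Status register */
#define CFG_STATUS_CAP_LIST 0x10  /* Status bit: capability list present */
#define CFG_CAP_PTR         0x34  /* offset of the first capability pointer */

/*
 * Return the config-space offset of capability cap_id,
 * or 0 if the device does not advertise it.
 */
static uint8_t find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
	uint16_t status = cfg[CFG_STATUS] | (cfg[CFG_STATUS + 1] << 8);
	uint8_t pos;
	int ttl = 48;	/* bound the walk in case the list loops */

	if (!(status & CFG_STATUS_CAP_LIST))
		return 0;

	pos = cfg[CFG_CAP_PTR] & 0xfc;		/* pointers are dword aligned */
	while (pos >= 0x40 && ttl--) {		/* 0x00-0x3f is the standard header */
		if (cfg[pos] == cap_id)		/* byte 0: capability ID */
			return pos;
		pos = cfg[pos + 1] & 0xfc;	/* byte 1: next pointer */
	}
	return 0;
}
```

In a driver, the returned offset is what one would add to a register offset before calling pci_read_config_word() and friends.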
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
6. Other interesting functions
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
pci_get_domain_bus_and_slot() Find pci_dev corresponding to given domain,
|
|
||||||
bus, slot and function number. If the device is
|
|
||||||
found, its reference count is increased.
|
|
||||||
pci_set_power_state() Set PCI Power Management state (0=D0 ... 3=D3)
|
|
||||||
pci_find_capability() Find specified capability in device's capability
|
|
||||||
list.
|
|
||||||
pci_resource_start() Returns bus start address for a given PCI region
|
|
||||||
pci_resource_end() Returns bus end address for a given PCI region
|
|
||||||
pci_resource_len() Returns the byte length of a PCI region
|
|
||||||
pci_set_drvdata() Set private driver data pointer for a pci_dev
|
|
||||||
pci_get_drvdata() Return private driver data pointer for a pci_dev
|
|
||||||
pci_set_mwi() Enable Memory-Write-Invalidate transactions.
|
|
||||||
pci_clear_mwi() Disable Memory-Write-Invalidate transactions.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
7. Miscellaneous hints
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
When displaying PCI device names to the user (for example when a driver wants
|
|
||||||
to tell the user what card it has found), please use pci_name(pci_dev).
|
|
||||||
|
|
||||||
Always refer to the PCI devices by a pointer to the pci_dev structure.
|
|
||||||
All PCI layer functions use this identification and it's the only
|
|
||||||
reasonable one. Don't use bus/slot/function numbers except for very
|
|
||||||
special purposes -- on systems with multiple primary buses their semantics
|
|
||||||
can be pretty complex.
|
|
||||||
|
|
||||||
Don't try to turn on Fast Back to Back writes in your driver. All devices
|
|
||||||
on the bus need to be capable of doing it, so this is something which needs
|
|
||||||
to be handled by platform and generic code, not individual drivers.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
8. Vendor and device identifications
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Do not add new device or vendor IDs to include/linux/pci_ids.h unless they
|
|
||||||
are shared across multiple drivers. You can add private definitions in
|
|
||||||
your driver if they're helpful, or just use plain hex constants.
|
|
||||||
|
|
||||||
The device IDs are arbitrary hex numbers (vendor controlled) and normally used
|
|
||||||
only in a single location, the pci_device_id table.
|
|
||||||
|
|
||||||
Please DO submit new vendor/device IDs to http://pci-ids.ucw.cz/.
|
|
||||||
There are mirrors of the pci.ids file at http://pciids.sourceforge.net/
|
|
||||||
and https://github.com/pciutils/pciids.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
9. Obsolete functions
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
There are several functions which you might come across when trying to
|
|
||||||
port an old driver to the new PCI interface. They are no longer present
|
|
||||||
in the kernel as they aren't compatible with hotplug or PCI domains or
|
|
||||||
having sane locking.
|
|
||||||
|
|
||||||
pci_find_device() Superseded by pci_get_device()
|
|
||||||
pci_find_subsys() Superseded by pci_get_subsys()
|
|
||||||
pci_find_slot() Superseded by pci_get_domain_bus_and_slot()
|
|
||||||
pci_get_slot() Superseded by pci_get_domain_bus_and_slot()
|
|
||||||
|
|
||||||
|
|
||||||
The alternative is the traditional PCI device driver that walks PCI
|
|
||||||
device lists. This is still possible but discouraged.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
10. MMIO Space and "Write Posting"
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
|
|
||||||
Converting a driver from using I/O Port space to using MMIO space
|
|
||||||
often requires some additional changes. Specifically, "write posting"
|
|
||||||
needs to be handled. Many drivers (e.g. tg3, acenic, sym53c8xx_2)
|
|
||||||
already do this. I/O Port space guarantees write transactions reach the PCI
|
|
||||||
device before the CPU can continue. Writes to MMIO space allow the CPU
|
|
||||||
to continue before the transaction reaches the PCI device. HW weenies
|
|
||||||
call this "Write Posting" because the write completion is "posted" to
|
|
||||||
the CPU before the transaction has reached its destination.
|
|
||||||
|
|
||||||
Thus, timing sensitive code should add readl() where the CPU is
|
|
||||||
expected to wait before doing other work. The classic "bit banging"
|
|
||||||
sequence works fine for I/O Port space:
|
|
||||||
|
|
||||||
for (i = 8; i--; val >>= 1) {
|
|
||||||
outb(val & 1, ioport_reg); /* write bit */
|
|
||||||
udelay(10);
|
|
||||||
}
|
|
||||||
|
|
||||||
The same sequence for MMIO space should be:
|
|
||||||
|
|
||||||
for (i = 8; i--; val >>= 1) {
|
|
||||||
writeb(val & 1, mmio_reg); /* write bit */
|
|
||||||
readb(safe_mmio_reg); /* flush posted write */
|
|
||||||
udelay(10);
|
|
||||||
}
|
|
||||||
|
|
||||||
It is important that "safe_mmio_reg" not have any side effects that
|
|
||||||
interfere with the correct operation of the device.
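To make the effect of write posting concrete, here is a toy user-space model. It assumes a single device register and a small queue of in-flight writes; struct toy_dev and the toy_*() functions are invented for this sketch and do not correspond to any kernel interface. The point it shows is that a write is only queued, and a read from the same device forces the queue to drain first.

```c
#include <stdint.h>

#define POSTED_MAX 8

struct toy_dev {
	uint8_t reg;			/* what the device has actually seen */
	uint8_t posted[POSTED_MAX];	/* writes still in flight */
	unsigned int nposted;
};

/* Deliver every queued write to the device, in order. */
static void toy_flush(struct toy_dev *d)
{
	unsigned int i;

	for (i = 0; i < d->nposted; i++)
		d->reg = d->posted[i];
	d->nposted = 0;
}

/* A posted write returns at once; the value is only queued. */
static void toy_writeb(struct toy_dev *d, uint8_t val)
{
	if (d->nposted == POSTED_MAX)
		toy_flush(d);
	d->posted[d->nposted++] = val;
}

/* A read cannot pass earlier writes, so it drains the queue first. */
static uint8_t toy_readb(struct toy_dev *d)
{
	toy_flush(d);
	return d->reg;
}
```

In this model, a delay placed right after toy_writeb() would start counting before the device has seen the new value, which is why the read is needed before udelay() in the MMIO loop.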
|
|
||||||
|
|
||||||
Another case to watch out for is when resetting a PCI device. Use PCI
|
|
||||||
Configuration space reads to flush the writel(). This will gracefully
|
|
||||||
handle the PCI master abort on all platforms if the PCI device is
|
|
||||||
expected to not respond to a readl(). Most x86 platforms will allow
|
|
||||||
MMIO reads to master abort (a.k.a. "Soft Fail") and return garbage
|
|
||||||
(e.g. ~0). But many RISC platforms will crash (a.k.a. "Hard Fail").
|
|
||||||
|
|
311
Documentation/PCI/pcieaer-howto.rst
Normal file
|
@ -0,0 +1,311 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
.. include:: <isonum.txt>
|
||||||
|
|
||||||
|
===========================================================
|
||||||
|
The PCI Express Advanced Error Reporting Driver Guide HOWTO
|
||||||
|
===========================================================
|
||||||
|
|
||||||
|
:Authors: - T. Long Nguyen <tom.l.nguyen@intel.com>
|
||||||
|
- Yanmin Zhang <yanmin.zhang@intel.com>
|
||||||
|
|
||||||
|
:Copyright: |copy| 2006 Intel Corporation
|
||||||
|
|
||||||
|
Overview
|
||||||
|
===========
|
||||||
|
|
||||||
|
About this guide
|
||||||
|
----------------
|
||||||
|
|
||||||
|
This guide describes the basics of the PCI Express Advanced Error
|
||||||
|
Reporting (AER) driver and provides information on how to use it, as
|
||||||
|
well as how to enable the drivers of endpoint devices to conform with
|
||||||
|
PCI Express AER driver.
|
||||||
|
|
||||||
|
|
||||||
|
What is the PCI Express AER Driver?
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
|
PCI Express error signaling can occur on the PCI Express link itself
|
||||||
|
or on behalf of transactions initiated on the link. PCI Express
|
||||||
|
defines two error reporting paradigms: the baseline capability and
|
||||||
|
the Advanced Error Reporting capability. The baseline capability is
|
||||||
|
required of all PCI Express components providing a minimum defined
|
||||||
|
set of error reporting requirements. Advanced Error Reporting
|
||||||
|
capability is implemented with a PCI Express advanced error reporting
|
||||||
|
extended capability structure providing more robust error reporting.
|
||||||
|
|
||||||
|
The PCI Express AER driver provides the infrastructure to support PCI
|
||||||
|
Express Advanced Error Reporting capability. The PCI Express AER
|
||||||
|
driver provides three basic functions:
|
||||||
|
|
||||||
|
- Gathers the comprehensive error information if errors occurred.
|
||||||
|
- Reports error to the users.
|
||||||
|
- Performs error recovery actions.
|
||||||
|
|
||||||
|
The AER driver only attaches to root ports which support PCI Express AER
|
||||||
|
capability.
|
||||||
|
|
||||||
|
|
||||||
|
User Guide
|
||||||
|
==========
|
||||||
|
|
||||||
|
Include the PCI Express AER Root Driver into the Linux Kernel
|
||||||
|
-------------------------------------------------------------
|
||||||
|
|
||||||
|
The PCI Express AER Root driver is a Root Port service driver attached
|
||||||
|
to the PCI Express Port Bus driver. If a user wants to use it, the driver
|
||||||
|
has to be compiled. Option CONFIG_PCIEAER supports this capability. It
|
||||||
|
depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
|
||||||
|
CONFIG_PCIEAER=y.
|
||||||
|
|
||||||
|
Load PCI Express AER Root Driver
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Some systems have AER support in firmware. Enabling Linux AER support at
|
||||||
|
the same time the firmware handles AER may result in unpredictable
|
||||||
|
behavior. Therefore, Linux does not handle AER events unless the firmware
|
||||||
|
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
|
||||||
|
Specification for details regarding _OSC usage.
|
||||||
|
|
||||||
|
AER error output
|
||||||
|
----------------
|
||||||
|
|
||||||
|
When a PCIe AER error is captured, an error message will be output to
|
||||||
|
console. If it's a correctable error, it is output as a warning.
|
||||||
|
Otherwise, it is printed as an error. Users can therefore choose a different
|
||||||
|
log level to filter out correctable error messages.
|
||||||
|
|
||||||
|
An example is shown below::
|
||||||
|
|
||||||
|
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
|
||||||
|
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
|
||||||
|
0000:50:00.0: [20] Unsupported Request (First)
|
||||||
|
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
|
||||||
|
|
||||||
|
In the example, 'Requester ID' means the ID of the device that sends
|
||||||
|
the error message to the root port. Please refer to the PCI Express specifications for
|
||||||
|
other fields.
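Two fields of the example output can be cross-checked with a little bit arithmetic. A Requester ID packs the bus, device and function numbers into 16 bits (bus in bits 15:8, device in bits 7:3, function in bits 2:0), and the "[20]" on the third line is the position of the bit set in the status value 00100000. The sketch below is plain user-space C; the helper names are invented for this illustration and are not kernel interfaces.

```c
#include <stdint.h>

/* Index of the lowest set bit in an error status word, or -1 if none. */
static int first_error_bit(uint32_t status)
{
	int bit;

	for (bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Split a 16-bit Requester ID into bus, device and function numbers. */
static unsigned int rid_bus(uint16_t id) { return id >> 8; }
static unsigned int rid_dev(uint16_t id) { return (id >> 3) & 0x1f; }
static unsigned int rid_fn(uint16_t id)  { return id & 0x7; }
```

Applied to the example, first_error_bit(0x00100000) yields 20 and the ID 0500 splits into bus 5, device 0, function 0.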
|
||||||
|
|
||||||
|
AER Statistics / Counters
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
When PCIe AER errors are captured, the counters / statistics are also exposed
|
||||||
|
in the form of sysfs attributes which are documented at
|
||||||
|
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
|
||||||
|
|
||||||
|
Developer Guide
|
||||||
|
===============
|
||||||
|
|
||||||
|
Enabling AER-aware support requires a software driver to configure
|
||||||
|
the AER capability structure within its device and to provide callbacks.
|
||||||
|
|
||||||
|
To support AER better, developers first need to understand how AER works.
|
||||||
|
|
||||||
|
PCI Express errors are classified into two types: correctable errors
|
||||||
|
and uncorrectable errors. This classification is based on the impacts
|
||||||
|
of those errors, which may result in degraded performance or function
|
||||||
|
failure.
|
||||||
|
|
||||||
|
Correctable errors have no impact on the functionality of the
|
||||||
|
interface. The PCI Express protocol can recover without any software
|
||||||
|
intervention or any loss of data. These errors are detected and
|
||||||
|
corrected by hardware. Unlike correctable errors, uncorrectable
|
||||||
|
errors impact functionality of the interface. Uncorrectable errors
|
||||||
|
can cause a particular transaction or a particular PCI Express link
|
||||||
|
to be unreliable. Depending on those error conditions, uncorrectable
|
||||||
|
errors are further classified into non-fatal errors and fatal errors.
|
||||||
|
Non-fatal errors cause the particular transaction to be unreliable,
|
||||||
|
but the PCI Express link itself is fully functional. Fatal errors, on
|
||||||
|
the other hand, cause the link to be unreliable.
|
||||||
|
|
||||||
|
When AER is enabled, a PCI Express device will automatically send an
|
||||||
|
error message to the PCIe root port above it when the device captures
|
||||||
|
an error. The Root Port, upon receiving an error reporting message,
|
||||||
|
internally processes and logs the error message in its PCI Express
|
||||||
|
capability structure. Error information being logged includes storing
|
||||||
|
the error reporting agent's requester ID into the Error Source
|
||||||
|
Identification Registers and setting the error bits of the Root Error
|
||||||
|
Status Register accordingly. If AER error reporting is enabled in Root
|
||||||
|
Error Command Register, the Root Port generates an interrupt if an
|
||||||
|
error is detected.
|
||||||
|
|
||||||
|
Note that the errors as described above are related to the PCI Express
|
||||||
|
hierarchy and links. These errors do not include any device specific
|
||||||
|
errors because device specific errors will still get sent directly to
|
||||||
|
the device driver.
|
||||||
|
|
||||||
|
Configure the AER capability structure
|
||||||
|
--------------------------------------
|
||||||
|
|
||||||
|
AER-aware drivers of PCI Express components need to change the device
|
||||||
|
control registers to enable AER. They can also change AER registers,
|
||||||
|
including mask and severity registers. The helper function
|
||||||
|
pci_enable_pcie_error_reporting can be used to enable AER. See
|
||||||
|
the "helper functions" section below.
|
||||||
|
|
||||||
|
Provide callbacks
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Callback reset_link to reset PCI Express link
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This callback is used to reset the PCI Express physical link when a
|
||||||
|
fatal error happens. The root port AER service driver provides a
|
||||||
|
default reset_link function, but different upstream ports might
|
||||||
|
have different specifications to reset the PCI Express link, so all
|
||||||
|
upstream ports should provide their own reset_link functions.
|
||||||
|
|
||||||
|
In struct pcie_port_service_driver, a new pointer, reset_link, is
|
||||||
|
added.
|
||||||
|
::
|
||||||
|
|
||||||
|
pci_ers_result_t (*reset_link) (struct pci_dev *dev);
|
||||||
|
|
||||||
|
The section on non-correctable errors below provides more detail on when to call
|
||||||
|
reset_link.
|
||||||
|
|
||||||
|
PCI error-recovery callbacks
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The PCI Express AER Root driver uses error callbacks to coordinate
|
||||||
|
with downstream device drivers associated with a hierarchy in question
|
||||||
|
when performing error recovery actions.
|
||||||
|
|
||||||
|
The pci_driver structure has a pointer, err_handler, that points to
|
||||||
|
pci_error_handlers, which consists of a couple of callback function
|
||||||
|
pointers. The AER driver follows the rules defined in
|
||||||
|
pci-error-recovery.txt except for PCI Express-specific parts (e.g.
|
||||||
|
reset_link). Please refer to pci-error-recovery.txt for detailed
|
||||||
|
definitions of the callbacks.
|
||||||
|
|
||||||
|
The sections below specify when to call the error callback functions.
|
||||||
|
|
||||||
|
Correctable errors
|
||||||
|
~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Correctable errors have no impact on the functionality of
|
||||||
|
the interface. The PCI Express protocol can recover without any
|
||||||
|
software intervention or any loss of data. These errors do not
|
||||||
|
require any recovery actions. The AER driver clears the device's
|
||||||
|
correctable error status register accordingly and logs these errors.
|
||||||
|
|
||||||
|
Non-correctable (non-fatal and fatal) errors
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
If an error message indicates a non-fatal error, performing link reset
|
||||||
|
at upstream is not required. The AER driver calls error_detected(dev,
|
||||||
|
pci_channel_io_normal) to all drivers associated within a hierarchy in
|
||||||
|
question. For example::
|
||||||
|
|
||||||
|
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort
|
||||||
|
|
||||||
|
If Upstream port A captures an AER error, the hierarchy consists of
|
||||||
|
Downstream port B and EndPoint.
|
||||||
|
|
||||||
|
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
|
||||||
|
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
|
||||||
|
whether it can recover or the AER driver calls mmio_enabled next.
|
||||||
|
|
||||||
|
If an error message indicates a fatal error, the kernel will broadcast
|
||||||
|
error_detected(dev, pci_channel_io_frozen) to all drivers within
|
||||||
|
a hierarchy in question. Then, performing link reset at upstream is
|
||||||
|
necessary. As different kinds of devices might use different approaches
|
||||||
|
to reset a link, the AER port service driver is required to provide the
|
||||||
|
function to reset the link. First, the kernel checks whether the upstream
|
||||||
|
component has an AER driver. If it does, the kernel uses the reset_link
|
||||||
|
callback of the AER driver. If the upstream component has no AER driver
|
||||||
|
and the port is a downstream port, we will perform a hot reset as the
|
||||||
|
default by setting the Secondary Bus Reset bit of the Bridge Control
|
||||||
|
register associated with the downstream port. As for upstream ports,
|
||||||
|
they should provide their own AER service drivers with a reset_link
|
||||||
|
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
|
||||||
|
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
|
||||||
|
to mmio_enabled.
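The decision logic described in this section can be summarized in a few lines of C. This is a user-space sketch of the documented behavior only, not code from the AER driver; every name in it (toy_severity, toy_plan, plan_recovery() and so on) is hypothetical and chosen for this example.

```c
/* Hypothetical names for this sketch; they are not the kernel's own enums. */
enum toy_severity { TOY_CORRECTABLE, TOY_NONFATAL, TOY_FATAL };
enum toy_channel { TOY_IO_NORMAL, TOY_IO_FROZEN };

struct toy_plan {
	int notify_drivers;	/* 1 if error_detected() is called on drivers */
	enum toy_channel state;	/* channel state passed to error_detected() */
	int reset_link;		/* 1 if the upstream link is reset */
};

/* Map an error severity to the recovery actions described in the text. */
static struct toy_plan plan_recovery(enum toy_severity sev)
{
	struct toy_plan p = { 0, TOY_IO_NORMAL, 0 };

	switch (sev) {
	case TOY_CORRECTABLE:
		/* Only logged and cleared; no driver callbacks. */
		break;
	case TOY_NONFATAL:
		p.notify_drivers = 1;	/* link still works, no reset */
		break;
	case TOY_FATAL:
		p.notify_drivers = 1;
		p.state = TOY_IO_FROZEN;
		p.reset_link = 1;	/* via reset_link or a hot reset */
		break;
	}
	return p;
}
```

The model deliberately stops at the first step; what happens after error_detected() returns depends on the result codes of the drivers involved.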
|
||||||
|
|
||||||
|
helper functions
|
||||||
|
----------------
|
||||||
|
::
|
||||||
|
|
||||||
|
int pci_enable_pcie_error_reporting(struct pci_dev *dev);
|
||||||
|
|
||||||
|
pci_enable_pcie_error_reporting enables the device to send error
|
||||||
|
messages to the root port when an error is detected. Note that devices
|
||||||
|
don't enable the error reporting by default, so device drivers need to
|
||||||
|
call this function to enable it.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
int pci_disable_pcie_error_reporting(struct pci_dev *dev);
|
||||||
|
|
||||||
|
pci_disable_pcie_error_reporting stops the device from sending error
|
||||||
|
messages to the root port when an error is detected.
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
|
||||||
|
|
||||||
|
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
|
||||||
|
error status register.
|
||||||
|
|
||||||
|
Frequently Asked Questions
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
Q:
|
||||||
|
What happens if a PCI Express device driver does not provide an
|
||||||
|
error recovery handler (pci_driver->err_handler is equal to NULL)?
|
||||||
|
|
||||||
|
A:
|
||||||
|
The devices attached to the driver won't be recovered. If the
|
||||||
|
error is fatal, the kernel will print out warning messages. Please refer
|
||||||
|
to the Developer Guide section for more information.
|
||||||
|
|
||||||
|
Q:
|
||||||
|
What happens if an upstream port service driver does not provide
|
||||||
|
callback reset_link?
|
||||||
|
|
||||||
|
A:
|
||||||
|
Fatal error recovery will fail if the errors are reported by the
|
||||||
|
upstream ports to which the service driver is attached.
|
||||||
|
|
||||||
|
Q:
|
||||||
|
How does this infrastructure deal with a driver that is not PCI
|
||||||
|
Express aware?
|
||||||
|
|
||||||
|
A:
|
||||||
|
This infrastructure calls the error callback functions of the
|
||||||
|
driver when an error happens. But if the driver is not aware of
|
||||||
|
PCI Express, the device might not report its own errors to the root
|
||||||
|
port.
|
||||||
|
|
||||||
|
Q:
|
||||||
|
What modifications will that driver need to make it compatible
|
||||||
|
with the PCI Express AER Root driver?
|
||||||
|
|
||||||
|
A:
|
||||||
|
It could call the helper functions to enable AER in devices and
|
||||||
|
clean up the uncorrectable status register. Please refer to the "helper functions" section.
|
||||||
|
|
||||||
|
|
||||||
|
Software error injection
|
||||||
|
========================
|
||||||
|
|
||||||
|
Debugging PCIe AER error recovery code is quite difficult because it
|
||||||
|
is hard to trigger real hardware errors. Software-based error
|
||||||
|
injection can be used to fake various kinds of PCIe errors.
|
||||||
|
|
||||||
|
First you should enable PCIe AER software error injection in kernel
|
||||||
|
configuration, that is, the following item should be in your .config.
|
||||||
|
|
||||||
|
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
|
||||||
|
|
||||||
|
After rebooting with the new kernel or inserting the module, a device file named
|
||||||
|
/dev/aer_inject should be created.
|
||||||
|
|
||||||
|
Then, you need a user space tool named aer-inject, which can be obtained
|
||||||
|
from:
|
||||||
|
|
||||||
|
https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
|
||||||
|
|
||||||
|
More information about aer-inject can be found in the documentation that comes
|
||||||
|
with its source code.
|
|
@ -1,267 +0,0 @@
|
||||||
The PCI Express Advanced Error Reporting Driver Guide HOWTO
|
|
||||||
T. Long Nguyen <tom.l.nguyen@intel.com>
|
|
||||||
Yanmin Zhang <yanmin.zhang@intel.com>
|
|
||||||
07/29/2006
|
|
||||||
|
|
||||||
|
|
||||||
1. Overview
|
|
||||||
|
|
||||||
1.1 About this guide
|
|
||||||
|
|
||||||
This guide describes the basics of the PCI Express Advanced Error
|
|
||||||
Reporting (AER) driver and provides information on how to use it, as
|
|
||||||
well as how to enable the drivers of endpoint devices to conform with
|
|
||||||
PCI Express AER driver.
|
|
||||||
|
|
||||||
1.2 Copyright (C) Intel Corporation 2006.
|
|
||||||
|
|
||||||
1.3 What is the PCI Express AER Driver?
|
|
||||||
|
|
||||||
PCI Express error signaling can occur on the PCI Express link itself
|
|
||||||
or on behalf of transactions initiated on the link. PCI Express
|
|
||||||
defines two error reporting paradigms: the baseline capability and
|
|
||||||
the Advanced Error Reporting capability. The baseline capability is
|
|
||||||
required of all PCI Express components providing a minimum defined
|
|
||||||
set of error reporting requirements. Advanced Error Reporting
|
|
||||||
capability is implemented with a PCI Express advanced error reporting
|
|
||||||
extended capability structure providing more robust error reporting.
|
|
||||||
|
|
||||||
The PCI Express AER driver provides the infrastructure to support PCI
|
|
||||||
Express Advanced Error Reporting capability. The PCI Express AER
|
|
||||||
driver provides three basic functions:
|
|
||||||
|
|
||||||
- Gathers the comprehensive error information if errors occurred.
|
|
||||||
- Reports error to the users.
|
|
||||||
- Performs error recovery actions.
|
|
||||||
|
|
||||||
The AER driver only attaches to root ports which support PCI Express AER
|
|
||||||
capability.
|
|
||||||
|
|
||||||
|
|
||||||
2. User Guide
|
|
||||||
|
|
||||||
2.1 Include the PCI Express AER Root Driver into the Linux Kernel
|
|
||||||
|
|
||||||
The PCI Express AER Root driver is a Root Port service driver attached
|
|
||||||
to the PCI Express Port Bus driver. If a user wants to use it, the driver
|
|
||||||
has to be compiled. Option CONFIG_PCIEAER supports this capability. It
|
|
||||||
depends on CONFIG_PCIEPORTBUS, so set CONFIG_PCIEPORTBUS=y and
|
|
||||||
CONFIG_PCIEAER=y.
|
|
||||||
|
|
||||||
2.2 Load PCI Express AER Root Driver
|
|
||||||
|
|
||||||
Some systems have AER support in firmware. Enabling Linux AER support at
|
|
||||||
the same time the firmware handles AER may result in unpredictable
|
|
||||||
behavior. Therefore, Linux does not handle AER events unless the firmware
|
|
||||||
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
|
|
||||||
Specification for details regarding _OSC usage.
|
|
||||||
|
|
||||||
2.3 AER error output
|
|
||||||
|
|
||||||
When a PCIe AER error is captured, an error message is output to the
|
|
||||||
console. If it's a correctable error, it is output as a warning.
|
|
||||||
Otherwise, it is printed as an error. So users can choose a different
|
|
||||||
log level to filter out correctable error messages.
|
|
||||||
|
|
||||||
An example is shown below:
|
|
||||||
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
|
|
||||||
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
|
|
||||||
0000:50:00.0: [20] Unsupported Request (First)
|
|
||||||
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100
|
|
||||||
|
|
||||||
In the example, 'Requester ID' means the ID of the device that sent
|
|
||||||
the error message to the root port. Please refer to the PCI Express specs for
|
|
||||||
other fields.
|
|
||||||
|
|
||||||
2.4 AER Statistics / Counters
|
|
||||||
|
|
||||||
When PCIe AER errors are captured, the counters / statistics are also exposed
|
|
||||||
in the form of sysfs attributes which are documented at
|
|
||||||
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
|
|
||||||
|
|
||||||
3. Developer Guide
|
|
||||||
|
|
||||||
Enabling AER-aware support requires a software driver to configure
|
|
||||||
the AER capability structure within its device and to provide callbacks.
|
|
||||||
|
|
||||||
To support AER better, developers first need to understand how AER works.
|
|
||||||
|
|
||||||
PCI Express errors are classified into two types: correctable errors
|
|
||||||
and uncorrectable errors. This classification is based on the impacts
|
|
||||||
of those errors, which may result in degraded performance or function
|
|
||||||
failure.
|
|
||||||
|
|
||||||
Correctable errors pose no impacts on the functionality of the
|
|
||||||
interface. The PCI Express protocol can recover without any software
|
|
||||||
intervention or any loss of data. These errors are detected and
|
|
||||||
corrected by hardware. Unlike correctable errors, uncorrectable
|
|
||||||
errors impact functionality of the interface. Uncorrectable errors
|
|
||||||
can cause a particular transaction or a particular PCI Express link
|
|
||||||
to be unreliable. Depending on those error conditions, uncorrectable
|
|
||||||
errors are further classified into non-fatal errors and fatal errors.
|
|
||||||
Non-fatal errors cause the particular transaction to be unreliable,
|
|
||||||
but the PCI Express link itself is fully functional. Fatal errors, on
|
|
||||||
the other hand, cause the link to be unreliable.
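
The classification above can be sketched as a small decision function. This is a user-space illustration only: the names aer_class, aer_classify, aer_needs_recovery and aer_needs_link_reset are hypothetical and are not the kernel's own definitions. It assumes the two properties named in the text (whether hardware corrected the error, and whether the link itself became unreliable) are already known.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the kernel's own definitions. */
enum aer_class {
	AER_CLASS_CORRECTABLE,	/* fixed by hardware, no data lost */
	AER_CLASS_NONFATAL,	/* transaction unreliable, link still works */
	AER_CLASS_FATAL,	/* the link itself is unreliable */
};

/* Classify an error from the two properties the text describes. */
static enum aer_class aer_classify(bool uncorrectable, bool link_unreliable)
{
	/* A correctable error never leaves the link unreliable. */
	assert(uncorrectable || !link_unreliable);

	if (!uncorrectable)
		return AER_CLASS_CORRECTABLE;
	return link_unreliable ? AER_CLASS_FATAL : AER_CLASS_NONFATAL;
}

/* Only uncorrectable errors call for software recovery. */
static bool aer_needs_recovery(enum aer_class c)
{
	return c != AER_CLASS_CORRECTABLE;
}

/* Only fatal errors leave the link in a state that needs a reset. */
static bool aer_needs_link_reset(enum aer_class c)
{
	return c == AER_CLASS_FATAL;
}
```

Keeping "needs recovery" and "needs link reset" as separate predicates mirrors the text: non-fatal errors still involve the driver callbacks, but only fatal errors involve the link.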
|
|
||||||
|
|
||||||
When AER is enabled, a PCI Express device will automatically send an
|
|
||||||
error message to the PCIe root port above it when the device captures
|
|
||||||
an error. The Root Port, upon receiving an error reporting message,
|
|
||||||
internally processes and logs the error message in its PCI Express
|
|
||||||
capability structure. Error information being logged includes storing
|
|
||||||
the error reporting agent's Requester ID into the Error Source
|
|
||||||
Identification Registers and setting the error bits of the Root Error
|
|
||||||
Status Register accordingly. If AER error reporting is enabled in the Root
|
|
||||||
Error Command Register, the Root Port generates an interrupt if an
|
|
||||||
error is detected.
|
|
||||||
|
|
||||||
Note that the errors as described above are related to the PCI Express
|
|
||||||
hierarchy and links. These errors do not include any device specific
|
|
||||||
errors because device specific errors will still get sent directly to
|
|
||||||
the device driver.
|
|
||||||
|
|
||||||
3.1 Configure the AER capability structure
|
|
||||||
|
|
||||||
AER-aware drivers of PCI Express components need to change the device
|
|
||||||
control registers to enable AER. They can also change AER registers,
|
|
||||||
including mask and severity registers. Helper function
|
|
||||||
pci_enable_pcie_error_reporting can be used to enable AER. See
|
|
||||||
section 3.3.
|
|
||||||
|
|
||||||
3.2 Provide callbacks
|
|
||||||
|
|
||||||
3.2.1 Callback reset_link to reset the PCI Express link
|
|
||||||
|
|
||||||
This callback is used to reset the PCI Express physical link when a
|
|
||||||
fatal error happens. The Root Port AER service driver provides a
|
|
||||||
default reset_link function, but different upstream ports might
|
|
||||||
have different specifications for resetting the PCI Express link, so all
|
|
||||||
upstream ports should provide their own reset_link functions.
|
|
||||||
|
|
||||||
In struct pcie_port_service_driver, a new pointer, reset_link, is
|
|
||||||
added.
|
|
||||||
|
|
||||||
pci_ers_result_t (*reset_link) (struct pci_dev *dev);
|
|
||||||
|
|
||||||
Section 3.2.2.2 provides more detailed info on when to call
|
|
||||||
reset_link.
|
|
||||||
|
|
||||||
3.2.2 PCI error-recovery callbacks
|
|
||||||
|
|
||||||
The PCI Express AER Root driver uses error callbacks to coordinate
|
|
||||||
with downstream device drivers associated with a hierarchy in question
|
|
||||||
when performing error recovery actions.
|
|
||||||
|
|
||||||
Data struct pci_driver has a pointer, err_handler, to point to
|
|
||||||
pci_error_handlers, which consists of several callback function
|
|
||||||
pointers. The AER driver follows the rules defined in
|
|
||||||
pci-error-recovery.txt except for the PCI Express-specific parts (e.g.
|
|
||||||
reset_link). Please refer to pci-error-recovery.txt for detailed
|
|
||||||
definitions of the callbacks.
|
|
||||||
|
|
||||||
The sections below specify when to call the error callback functions.
|
|
||||||
|
|
||||||
3.2.2.1 Correctable errors
|
|
||||||
|
|
||||||
Correctable errors pose no impacts on the functionality of
|
|
||||||
the interface. The PCI Express protocol can recover without any
|
|
||||||
software intervention or any loss of data. These errors do not
|
|
||||||
require any recovery actions. The AER driver clears the device's
|
|
||||||
correctable error status register accordingly and logs these errors.
|
|
||||||
|
|
||||||
3.2.2.2 Non-correctable (non-fatal and fatal) errors
|
|
||||||
|
|
||||||
If an error message indicates a non-fatal error, performing link reset
|
|
||||||
at upstream is not required. The AER driver calls error_detected(dev,
|
|
||||||
pci_channel_io_normal) to all drivers associated within a hierarchy in
|
|
||||||
question. For example,
|
|
||||||
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
|
|
||||||
If Upstream port A captures an AER error, the hierarchy consists of
|
|
||||||
Downstream port B and EndPoint.
|
|
||||||
|
|
||||||
A driver may return PCI_ERS_RESULT_CAN_RECOVER,
|
|
||||||
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
|
|
||||||
whether it can recover. If it can, the AER driver calls mmio_enabled next.
|
|
||||||
|
|
||||||
If an error message indicates a fatal error, the kernel will broadcast
|
|
||||||
error_detected(dev, pci_channel_io_frozen) to all drivers within
|
|
||||||
a hierarchy in question. Then, performing link reset at upstream is
|
|
||||||
necessary. As different kinds of devices might use different approaches
|
|
||||||
to reset the link, the AER port service driver is required to provide the
|
|
||||||
function to reset the link. First, the kernel checks whether the upstream
|
|
||||||
component has an AER driver. If it does, the kernel uses the reset_link
|
|
||||||
callback of the AER driver. If the upstream component has no AER driver
|
|
||||||
and the port is a downstream port, the kernel performs a hot reset as the
|
|
||||||
default by setting the Secondary Bus Reset bit of the Bridge Control
|
|
||||||
register associated with the downstream port. As for upstream ports,
|
|
||||||
they should provide their own AER service drivers with a reset_link
|
|
||||||
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
|
|
||||||
reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
|
|
||||||
to mmio_enabled.
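
The two paths that lead to mmio_enabled can be summarized in a short user-space sketch. The enum values and the function aer_next_step are illustrative stand-ins, not the kernel's pci_ers_result_t API, and reading the non-fatal paragraph as "CAN_RECOVER leads to mmio_enabled" is an assumption. Outcomes this section does not spell out are reported as STEP_NOT_COVERED rather than guessed.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the PCI_ERS_RESULT_* values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

enum next_step {
	STEP_MMIO_ENABLED,	/* continue with the mmio_enabled callback */
	STEP_NOT_COVERED,	/* outcome this section does not spell out */
};

/*
 * fatal:    true if the error message indicated a fatal error
 * detected: merged result of the error_detected callbacks
 * reset:    result of reset_link; only consulted for fatal errors
 */
static enum next_step aer_next_step(bool fatal, enum ers_result detected,
				    enum ers_result reset)
{
	if (detected != ERS_CAN_RECOVER)
		return STEP_NOT_COVERED;

	/* Non-fatal: no link reset, go straight on to mmio_enabled. */
	if (!fatal)
		return STEP_MMIO_ENABLED;

	/* Fatal: the link reset must have succeeded first. */
	return reset == ERS_RECOVERED ? STEP_MMIO_ENABLED : STEP_NOT_COVERED;
}
```

For example, aer_next_step(true, ERS_CAN_RECOVER, ERS_RECOVERED) yields STEP_MMIO_ENABLED, matching the last sentence of this section.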
|
|
||||||
|
|
||||||
3.3 Helper functions
|
|
||||||
|
|
||||||
3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
|
|
||||||
pci_enable_pcie_error_reporting enables the device to send error
|
|
||||||
messages to the root port when an error is detected. Note that devices
|
|
||||||
don't enable error reporting by default, so device drivers need to
|
|
||||||
call this function to enable it.
|
|
||||||
|
|
||||||
3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
|
|
||||||
pci_disable_pcie_error_reporting stops the device from sending error
|
|
||||||
messages to the root port when an error is detected.
|
|
||||||
|
|
||||||
3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
|
|
||||||
pci_cleanup_aer_uncorrect_error_status clears the uncorrectable
|
|
||||||
error status register.
|
|
||||||
|
|
||||||
3.4 Frequently Asked Questions
|
|
||||||
|
|
||||||
Q: What happens if a PCI Express device driver does not provide an
|
|
||||||
error recovery handler (pci_driver->err_handler is equal to NULL)?
|
|
||||||
|
|
||||||
A: The devices bound to the driver won't be recovered. If the
|
|
||||||
error is fatal, the kernel will print out warning messages. Please refer
|
|
||||||
to section 3 for more information.
|
|
||||||
|
|
||||||
Q: What happens if an upstream port service driver does not provide
|
|
||||||
callback reset_link?
|
|
||||||
|
|
||||||
A: Fatal error recovery will fail if the errors are reported by the
|
|
||||||
upstream ports to which the service driver is attached.
|
|
||||||
|
|
||||||
Q: How does this infrastructure deal with a driver that is not PCI
|
|
||||||
Express aware?
|
|
||||||
|
|
||||||
A: This infrastructure calls the error callback functions of the
|
|
||||||
driver when an error happens. But if the driver is not aware of
|
|
||||||
PCI Express, the device might not report its own errors to the root
|
|
||||||
port.
|
|
||||||
|
|
||||||
Q: What modifications will that driver need to make it compatible
|
|
||||||
with the PCI Express AER Root driver?
|
|
||||||
|
|
||||||
A: It can call the helper functions to enable AER in devices and
|
|
||||||
clean up the uncorrectable status register. Please refer to section 3.3.
|
|
||||||
|
|
||||||
|
|
||||||
4. Software error injection
|
|
||||||
|
|
||||||
Debugging PCIe AER error recovery code is quite difficult because it
|
|
||||||
is hard to trigger real hardware errors. Software-based error
|
|
||||||
injection can be used to fake various kinds of PCIe errors.
|
|
||||||
|
|
||||||
First, enable PCIe AER software error injection in the kernel
|
|
||||||
configuration, that is, the following item should be in your .config.
|
|
||||||
|
|
||||||
CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m
|
|
||||||
|
|
||||||
After rebooting with the new kernel or inserting the module, a device file named
|
|
||||||
/dev/aer_inject should be created.
|
|
||||||
|
|
||||||
Then, you need a user space tool named aer-inject, which can be obtained
|
|
||||||
from:
|
|
||||||
https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
|
|
||||||
|
|
||||||
More information about aer-inject can be found in the documentation that comes
|
|
||||||
with its source code.
|
|
220
Documentation/PCI/picebus-howto.rst
Normal file
|
@ -0,0 +1,220 @@
|
||||||
|
.. SPDX-License-Identifier: GPL-2.0
|
||||||
|
.. include:: <isonum.txt>
|
||||||
|
|
||||||
|
===========================================
|
||||||
|
The PCI Express Port Bus Driver Guide HOWTO
|
||||||
|
===========================================
|
||||||
|
|
||||||
|
:Author: Tom L Nguyen tom.l.nguyen@intel.com 11/03/2004
|
||||||
|
:Copyright: |copy| 2004 Intel Corporation
|
||||||
|
|
||||||
|
About this guide
|
||||||
|
================
|
||||||
|
|
||||||
|
This guide describes the basics of the PCI Express Port Bus driver
|
||||||
|
and provides information on how to enable the service drivers to
|
||||||
|
register/unregister with the PCI Express Port Bus Driver.
|
||||||
|
|
||||||
|
|
||||||
|
What is the PCI Express Port Bus Driver
|
||||||
|
=======================================
|
||||||
|
|
||||||
|
A PCI Express Port is a logical PCI-PCI Bridge structure. There
|
||||||
|
are two types of PCI Express Port: the Root Port and the Switch
|
||||||
|
Port. The Root Port originates a PCI Express link from a PCI Express
|
||||||
|
Root Complex and the Switch Port connects PCI Express links to
|
||||||
|
internal logical PCI buses. The Switch Port, which has its secondary
|
||||||
|
bus representing the switch's internal routing logic, is called the
|
||||||
|
switch's Upstream Port. The switch's Downstream Port is bridging from
|
||||||
|
switch's internal routing bus to a bus representing the downstream
|
||||||
|
PCI Express link from the PCI Express Switch.
|
||||||
|
|
||||||
|
A PCI Express Port can provide up to four distinct functions,
|
||||||
|
referred to in this document as services, depending on its port type.
|
||||||
|
PCI Express Port's services include native hotplug support (HP),
|
||||||
|
power management event support (PME), advanced error reporting
|
||||||
|
support (AER), and virtual channel support (VC). These services may
|
||||||
|
be handled by a single complex driver or be individually distributed
|
||||||
|
and handled by corresponding service drivers.
|
||||||
|
|
||||||
|
Why use the PCI Express Port Bus Driver?
|
||||||
|
========================================
|
||||||
|
|
||||||
|
In existing Linux kernels, the Linux Device Driver Model allows a
|
||||||
|
physical device to be handled by only a single driver. The PCI
|
||||||
|
Express Port is a PCI-PCI Bridge device with multiple distinct
|
||||||
|
services. To maintain a clean and simple solution each service
|
||||||
|
may have its own software service driver. In this case several
|
||||||
|
service drivers will compete for a single PCI-PCI Bridge device.
|
||||||
|
For example, if the PCI Express Root Port native hotplug service
|
||||||
|
driver is loaded first, it claims a PCI-PCI Bridge Root Port. The
|
||||||
|
kernel therefore does not load other service drivers for that Root
|
||||||
|
Port. In other words, it is impossible to have multiple service
|
||||||
|
drivers load and run on a PCI-PCI Bridge device simultaneously
|
||||||
|
using the current driver model.
|
||||||
|
|
||||||
|
Enabling multiple service drivers to run simultaneously requires
|
||||||
|
having a PCI Express Port Bus driver, which manages all populated
|
||||||
|
PCI Express Ports and distributes all provided service requests
|
||||||
|
to the corresponding service drivers as required. Some key
|
||||||
|
advantages of using the PCI Express Port Bus driver are listed below:
|
||||||
|
|
||||||
|
- Allow multiple service drivers to run simultaneously on
|
||||||
|
a PCI-PCI Bridge Port device.
|
||||||
|
|
||||||
|
- Allow service drivers to be implemented in an independent
|
||||||
|
staged approach.
|
||||||
|
|
||||||
|
- Allow one service driver to run on multiple PCI-PCI Bridge
|
||||||
|
Port devices.
|
||||||
|
|
||||||
|
- Manage and distribute resources of a PCI-PCI Bridge Port
|
||||||
|
device to requested service drivers.
|
||||||
|
|
||||||
|
Configuring the PCI Express Port Bus Driver vs. Service Drivers
|
||||||
|
===============================================================
|
||||||
|
|
||||||
|
Including the PCI Express Port Bus Driver Support into the Kernel
|
||||||
|
-----------------------------------------------------------------
|
||||||
|
|
||||||
|
Including the PCI Express Port Bus driver depends on whether the PCI
|
||||||
|
Express support is included in the kernel config. The kernel will
|
||||||
|
automatically include the PCI Express Port Bus driver as a kernel
|
||||||
|
driver when the PCI Express support is enabled in the kernel.
|
||||||
|
|
||||||
|
Enabling Service Driver Support
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
PCI device drivers are implemented based on the Linux Device Driver Model.
|
||||||
|
All service drivers are PCI device drivers. As discussed above, it is
|
||||||
|
impossible to load any service driver once the kernel has loaded the
|
||||||
|
PCI Express Port Bus Driver. Meeting the PCI Express Port Bus Driver
|
||||||
|
Model requires some minimal changes to existing service drivers that
|
||||||
|
impose no impact on the functionality of existing service drivers.
|
||||||
|
|
||||||
|
A service driver is required to use the two APIs shown below to
|
||||||
|
register its service with the PCI Express Port Bus driver (see
|
||||||
|
the two subsections below). It is important that a service driver
|
||||||
|
initializes the pcie_port_service_driver data structure, included in
|
||||||
|
header file /include/linux/pcieport_if.h, before calling these APIs.
|
||||||
|
Failure to do so will result in an identity mismatch, which prevents
|
||||||
|
the PCI Express Port Bus driver from loading a service driver.
|
||||||
|
|
||||||
|
pcie_port_service_register
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
::
|
||||||
|
|
||||||
|
int pcie_port_service_register(struct pcie_port_service_driver *new)
|
||||||
|
|
||||||
|
This API replaces the Linux Driver Model's pci_register_driver API. A
|
||||||
|
service driver should always call pcie_port_service_register at
|
||||||
|
module init. Note that after the service driver is loaded, calls
|
||||||
|
such as pci_enable_device(dev) and pci_set_master(dev) are no longer
|
||||||
|
necessary since these calls are executed by the PCI Express Port Bus driver.
|
||||||
|
|
||||||
|
pcie_port_service_unregister
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
::
|
||||||
|
|
||||||
|
void pcie_port_service_unregister(struct pcie_port_service_driver *new)
|
||||||
|
|
||||||
|
pcie_port_service_unregister replaces the Linux Driver Model's
|
||||||
|
pci_unregister_driver. It is always called by the service driver when a
|
||||||
|
module exits.
|
||||||
|
|
||||||
|
Sample Code
|
||||||
|
~~~~~~~~~~~
|
||||||
|
|
||||||
|
Below is sample service driver code to initialize the port service
|
||||||
|
driver data structure.
|
||||||
|
::
|
||||||
|
|
||||||
|
static struct pcie_port_service_id service_id[] = { {
|
||||||
|
.vendor = PCI_ANY_ID,
|
||||||
|
.device = PCI_ANY_ID,
|
||||||
|
.port_type = PCIE_RC_PORT,
|
||||||
|
.service_type = PCIE_PORT_SERVICE_AER,
|
||||||
|
}, { /* end: all zeroes */ }
|
||||||
|
};
|
||||||
|
|
||||||
|
static struct pcie_port_service_driver root_aerdrv = {
|
||||||
|
.name = (char *)device_name,
|
||||||
|
.id_table = &service_id[0],
|
||||||
|
|
||||||
|
.probe = aerdrv_load,
|
||||||
|
.remove = aerdrv_unload,
|
||||||
|
|
||||||
|
.suspend = aerdrv_suspend,
|
||||||
|
.resume = aerdrv_resume,
|
||||||
|
};
|
||||||
|
|
||||||
|
Below is sample code for registering/unregistering a service
|
||||||
|
driver.
|
||||||
|
::
|
||||||
|
|
||||||
|
static int __init aerdrv_service_init(void)
|
||||||
|
{
|
||||||
|
int retval = 0;
|
||||||
|
|
||||||
|
retval = pcie_port_service_register(&root_aerdrv);
|
||||||
|
if (!retval) {
|
||||||
|
/*
|
||||||
|
* FIX ME
|
||||||
|
*/
|
||||||
|
}
|
||||||
|
return retval;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void __exit aerdrv_service_exit(void)
|
||||||
|
{
|
||||||
|
pcie_port_service_unregister(&root_aerdrv);
|
||||||
|
}
|
||||||
|
|
||||||
|
module_init(aerdrv_service_init);
|
||||||
|
module_exit(aerdrv_service_exit);
|
||||||
|
|
||||||
|
Possible Resource Conflicts
|
||||||
|
===========================
|
||||||
|
|
||||||
|
Since all service drivers of a PCI-PCI Bridge Port device are
|
||||||
|
allowed to run simultaneously, the list below covers a few possible resource
|
||||||
|
conflicts with proposed solutions.
|
||||||
|
|
||||||
|
MSI and MSI-X Vector Resource
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
|
Once MSI or MSI-X interrupts are enabled on a device, it stays in this
|
||||||
|
mode until they are disabled again. Since service drivers of the same
|
||||||
|
PCI-PCI Bridge port share the same physical device, if an individual
|
||||||
|
service driver enables or disables MSI/MSI-X mode it may result in
|
||||||
|
unpredictable behavior.
|
||||||
|
|
||||||
|
To avoid this situation, no service driver is permitted to
|
||||||
|
switch interrupt mode on its device. The PCI Express Port Bus driver
|
||||||
|
is responsible for determining the interrupt mode and this should be
|
||||||
|
transparent to service drivers. Service drivers need to know only
|
||||||
|
the vector IRQ assigned to the field irq of struct pcie_device, which
|
||||||
|
is passed in when the PCI Express Port Bus driver probes each service
|
||||||
|
driver. Service drivers should use (struct pcie_device*)dev->irq to
|
||||||
|
call request_irq/free_irq. In addition, the interrupt mode is stored
|
||||||
|
in the field interrupt_mode of struct pcie_device.
|
||||||
|
|
||||||
|
PCI Memory/IO Mapped Regions
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
Service drivers for PCI Express Power Management (PME), Advanced
|
||||||
|
Error Reporting (AER), Hot-Plug (HP) and Virtual Channel (VC) access
|
||||||
|
PCI configuration space on the PCI Express port. In all cases the
|
||||||
|
registers accessed are independent of each other. This patch assumes
|
||||||
|
that all service drivers will be well behaved and not overwrite
|
||||||
|
other service drivers' configuration settings.
|
||||||
|
|
||||||
|
PCI Config Registers
|
||||||
|
--------------------
|
||||||
|
|
||||||
|
Each service driver runs its PCI config operations on its own
|
||||||
|
capability structure except the PCI Express capability structure, in
|
||||||
|
which Root Control register and Device Control register are shared
|
||||||
|
between PME and AER. This patch assumes that all service drivers
|
||||||
|
will be well behaved and not overwrite other service drivers'
|
||||||
|
configuration settings.
|
143
Documentation/RCU/UP.rst
Normal file
|
@ -0,0 +1,143 @@
|
||||||
|
.. _up_doc:
|
||||||
|
|
||||||
|
RCU on Uniprocessor Systems
|
||||||
|
===========================
|
||||||
|
|
||||||
|
A common misconception is that, on UP systems, the call_rcu() primitive
|
||||||
|
may immediately invoke its function. The basis of this misconception
|
||||||
|
is that since there is only one CPU, it should not be necessary to
|
||||||
|
wait for anything else to get done, since there are no other CPUs for
|
||||||
|
anything else to be happening on. Although this approach will *sort of*
|
||||||
|
work a surprising amount of the time, it is a very bad idea in general.
|
||||||
|
This document presents three examples that demonstrate exactly how bad
|
||||||
|
an idea this is.
|
||||||
|
|
||||||
|
Example 1: softirq Suicide
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
Suppose that an RCU-based algorithm scans a linked list containing
|
||||||
|
elements A, B, and C in process context, and can delete elements from
|
||||||
|
this same list in softirq context. Suppose that the process-context scan
|
||||||
|
is referencing element B when it is interrupted by softirq processing,
|
||||||
|
which deletes element B, and then invokes call_rcu() to free element B
|
||||||
|
after a grace period.
|
||||||
|
|
||||||
|
Now, if call_rcu() were to directly invoke its arguments, then upon return
|
||||||
|
from softirq, the list scan would find itself referencing a newly freed
|
||||||
|
element B. This situation can greatly decrease the life expectancy of
|
||||||
|
your kernel.
|
||||||
|
|
||||||
|
This same problem can occur if call_rcu() is invoked from a hardware
|
||||||
|
interrupt handler.
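
The safe behavior described in Example 1 can be modeled in user space as a queue that holds callbacks until the grace period is over. This is a toy sketch, not the kernel implementation: every toy_ name, struct elem, and the freed flag (standing in for an actual free) are hypothetical, and "end of grace period" is reduced to an explicit function call made once no reader can still hold a reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins, prefixed toy_ to avoid confusion with the kernel API. */
struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *head);
};

/* Callbacks still waiting for a grace period to elapse. */
static struct toy_rcu_head *toy_pending;

/* Queue the callback; never invoke it from here. */
static void toy_call_rcu(struct toy_rcu_head *head,
			 void (*func)(struct toy_rcu_head *head))
{
	head->func = func;
	head->next = toy_pending;
	toy_pending = head;
}

/* Run queued callbacks; call only once no reader can hold a reference. */
static void toy_grace_period_end(void)
{
	while (toy_pending) {
		struct toy_rcu_head *head = toy_pending;

		toy_pending = head->next;	/* unlink before func may free head */
		head->func(head);
	}
}

struct elem {
	struct toy_rcu_head rcu;	/* first member, so the cast below is valid */
	int freed;			/* stands in for freeing the element */
};

static void elem_free(struct toy_rcu_head *head)
{
	((struct elem *)head)->freed = 1;
}
```

With this model, a scan that was referencing element B when the softirq ran still sees freed == 0 after toy_call_rcu() returns; the element is only marked freed once toy_grace_period_end() runs.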
|
||||||
|
|
||||||
|
Example 2: Function-Call Fatality
|
||||||
|
---------------------------------
|
||||||
|
|
||||||
|
Of course, one could avert the suicide described in the preceding example
|
||||||
|
by having call_rcu() directly invoke its arguments only if it was called
|
||||||
|
from process context. However, this can fail in a similar manner.
|
||||||
|
|
||||||
|
Suppose that an RCU-based algorithm again scans a linked list containing
|
||||||
|
elements A, B, and C in process context, but that it invokes a function
|
||||||
|
on each element as it is scanned. Suppose further that this function
|
||||||
|
deletes element B from the list, then passes it to call_rcu() for deferred
|
||||||
|
freeing. This may be a bit unconventional, but it is perfectly legal
|
||||||
|
RCU usage, since call_rcu() must wait for a grace period to elapse.
|
||||||
|
Therefore, in this case, allowing call_rcu() to immediately invoke
|
||||||
|
its arguments would cause it to fail to make the fundamental guarantee
|
||||||
|
underlying RCU, namely that call_rcu() defers invoking its arguments until
|
||||||
|
all RCU read-side critical sections currently executing have completed.
|
||||||
|
|
||||||
|
Quick Quiz #1:
|
||||||
|
Why is it *not* legal to invoke synchronize_rcu() in this case?
|
||||||
|
|
||||||
|
:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`
|
||||||
|
|
||||||
|
Example 3: Death by Deadlock
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
Suppose that call_rcu() is invoked while holding a lock, and that the
|
||||||
|
callback function must acquire this same lock. In this case, if
|
||||||
|
call_rcu() were to directly invoke the callback, the result would
|
||||||
|
be self-deadlock.
|
||||||
|
|
||||||
|
In some cases, it would be possible to restructure the code so that
|
||||||
|
the call_rcu() is delayed until after the lock is released. However,
|
||||||
|
there are cases where this can be quite ugly:
|
||||||
|
|
||||||
|
1. If a number of items need to be passed to call_rcu() within
|
||||||
|
the same critical section, then the code would need to create
|
||||||
|
a list of them, then traverse the list once the lock was
|
||||||
|
released.
|
||||||
|
|
||||||
|
2. In some cases, the lock will be held across some kernel API,
|
||||||
|
so that delaying the call_rcu() until the lock is released
|
||||||
|
requires that the data item be passed up via a common API.
|
||||||
|
It is far better to guarantee that callbacks are invoked
|
||||||
|
with no locks held than to have to modify such APIs to allow
|
||||||
|
arbitrary data items to be passed back up through them.
|
||||||
|
|
||||||
|
If call_rcu() directly invokes the callback, painful locking restrictions
|
||||||
|
or API changes would be required.
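
The self-deadlock in Example 3 can be demonstrated with a toy non-recursive lock in user space. Everything here is hypothetical (toy_lock, toy_callback, toy_update), and where a real spinlock would spin forever, the toy lock simply reports failure so that the outcome can be observed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: a second acquire while held can never succeed. */
struct toy_lock {
	bool held;
};

/* Returns false where a real spinlock would spin forever. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* The callback needs the same lock the updater holds. */
static bool toy_callback(struct toy_lock *l)
{
	if (!toy_lock_acquire(l))
		return false;		/* would self-deadlock */
	toy_lock_release(l);
	return true;
}

/*
 * Updater: takes the lock, then hands the callback over.  With
 * direct == true the callback runs immediately, under the lock;
 * otherwise it runs after the lock is dropped.  Returns whether the
 * callback was able to take the lock.
 */
static bool toy_update(struct toy_lock *l, bool direct)
{
	bool ok = false;

	if (!toy_lock_acquire(l))
		return false;
	if (direct)
		ok = toy_callback(l);	/* invoked with the lock held */
	toy_lock_release(l);
	if (!direct)
		ok = toy_callback(l);	/* deferred until the lock is free */
	return ok;
}
```

Calling toy_update() with direct set to true makes the callback fail to take the lock, while the deferred variant succeeds, which is the reason callbacks must be invoked from an environment where no locks are held.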
|
||||||
|
|
||||||
|
Quick Quiz #2:
|
||||||
|
What locking restriction must RCU callbacks respect?
|
||||||
|
|
||||||
|
:ref:`Answers to Quick Quiz <answer_quick_quiz_up>`
|
||||||
|
|
||||||
|
Summary
|
||||||
|
-------
|
||||||
|
|
||||||
|
Permitting call_rcu() to immediately invoke its arguments breaks RCU,
|
||||||
|
even on a UP system. So do not do it! Even on a UP system, the RCU
|
||||||
|
infrastructure *must* respect grace periods, and *must* invoke callbacks
|
||||||
|
from a known environment in which no locks are held.
|
||||||
|
|
||||||
|
Note that it *is* safe for synchronize_rcu() to return immediately on
|
||||||
|
UP systems, including PREEMPT SMP builds running on UP systems.
|
||||||
|
|
||||||
|
Quick Quiz #3:
|
||||||
|
Why can't synchronize_rcu() return immediately on UP systems running
|
||||||
|
preemptable RCU?
|
||||||
|
|
||||||
|
.. _answer_quick_quiz_up:
|
||||||
|
|
||||||
|
Answer to Quick Quiz #1:
|
||||||
|
Why is it *not* legal to invoke synchronize_rcu() in this case?
|
||||||
|
|
||||||
|
Because the calling function is scanning an RCU-protected linked
|
||||||
|
list, and is therefore within an RCU read-side critical section.
|
||||||
|
Therefore, the called function has been invoked within an RCU
|
||||||
|
read-side critical section, and is not permitted to block.
|
||||||
|
|
||||||
|
Answer to Quick Quiz #2:
|
||||||
|
What locking restriction must RCU callbacks respect?
|
||||||
|
|
||||||
|
Any lock that is acquired within an RCU callback must be acquired
|
||||||
|
elsewhere using an _bh variant of the spinlock primitive.
|
||||||
|
For example, if "mylock" is acquired by an RCU callback, then
|
||||||
|
a process-context acquisition of this lock must use something
|
||||||
|
like spin_lock_bh() to acquire the lock. Please note that
|
||||||
|
it is also OK to use _irq variants of spinlocks, for example,
|
||||||
|
spin_lock_irqsave().
|
||||||
|
|
||||||
|
If the process-context code were to simply use spin_lock(),
|
||||||
|
then, since RCU callbacks can be invoked from softirq context,
|
||||||
|
the callback might be called from a softirq that interrupted
|
||||||
|
the process-context critical section. This would result in
|
||||||
|
self-deadlock.
|
||||||
|
|
||||||
|
This restriction might seem gratuitous, since very few RCU
|
||||||
|
callbacks acquire locks directly. However, a great many RCU
|
||||||
|
callbacks do acquire locks *indirectly*, for example, via
|
||||||
|
the kfree() primitive.
|
||||||
|
|
||||||
|
Answer to Quick Quiz #3:
|
||||||
|
Why can't synchronize_rcu() return immediately on UP systems
|
||||||
|
running preemptable RCU?
|
||||||
|
|
||||||
|
Because some other task might have been preempted in the middle
|
||||||
|
of an RCU read-side critical section. If synchronize_rcu()
|
||||||
|
simply immediately returned, it would prematurely signal the
|
||||||
|
end of the grace period, which would come as a nasty shock to
|
||||||
|
that other thread when it started running again.
|
|
@ -1,133 +0,0 @@
|
||||||
RCU on Uniprocessor Systems
|
|
||||||
|
|
||||||
|
|
||||||
A common misconception is that, on UP systems, the call_rcu() primitive
|
|
||||||
may immediately invoke its function. The basis of this misconception
|
|
||||||
is that since there is only one CPU, it should not be necessary to
|
|
||||||
wait for anything else to get done, since there are no other CPUs for
|
|
||||||
anything else to be happening on. Although this approach will -sort- -of-
|
|
||||||
work a surprising amount of the time, it is a very bad idea in general.
|
|
||||||
This document presents three examples that demonstrate exactly how bad
|
|
||||||
an idea this is.
|
|
||||||
|
|
||||||
|
|
||||||
Example 1: softirq Suicide
|
|
||||||
|
|
||||||
Suppose that an RCU-based algorithm scans a linked list containing
|
|
||||||
elements A, B, and C in process context, and can delete elements from
|
|
||||||
this same list in softirq context. Suppose that the process-context scan
|
|
||||||
is referencing element B when it is interrupted by softirq processing,
|
|
||||||
which deletes element B, and then invokes call_rcu() to free element B
|
|
||||||
after a grace period.
|
|
||||||
|
|
||||||
Now, if call_rcu() were to directly invoke its arguments, then upon return
|
|
||||||
from softirq, the list scan would find itself referencing a newly freed
|
|
||||||
element B. This situation can greatly decrease the life expectancy of
|
|
||||||
your kernel.
|
|
||||||
|
|
||||||
This same problem can occur if call_rcu() is invoked from a hardware
|
|
||||||
interrupt handler.
|
|
||||||
|
|
||||||
|
|
||||||
Example 2: Function-Call Fatality
|
|
||||||
|
|
||||||
Of course, one could avert the suicide described in the preceding example
|
|
||||||
by having call_rcu() directly invoke its arguments only if it was called
|
|
||||||
from process context. However, this can fail in a similar manner.
|
|
||||||
|
|
||||||
Suppose that an RCU-based algorithm again scans a linked list containing
|
|
||||||
elements A, B, and C in process context, but that it invokes a function
|
|
||||||
on each element as it is scanned. Suppose further that this function
|
|
||||||
deletes element B from the list, then passes it to call_rcu() for deferred
|
|
||||||
freeing. This may be a bit unconventional, but it is perfectly legal
|
|
||||||
RCU usage, since call_rcu() must wait for a grace period to elapse.
|
|
||||||
Therefore, in this case, allowing call_rcu() to immediately invoke
|
|
||||||
its arguments would cause it to fail to make the fundamental guarantee
|
|
||||||
underlying RCU, namely that call_rcu() defers invoking its arguments until
|
|
||||||
all RCU read-side critical sections currently executing have completed.
|
|
||||||
|
|
||||||
Quick Quiz #1: Why is it -not- legal to invoke synchronize_rcu() in
|
|
||||||
this case?
|
|
||||||
|
|
||||||
|
|
||||||
Example 3: Death by Deadlock
|
|
||||||
|
|
||||||
Suppose that call_rcu() is invoked while holding a lock, and that the
|
|
||||||
callback function must acquire this same lock. In this case, if
|
|
||||||
call_rcu() were to directly invoke the callback, the result would
|
|
||||||
be self-deadlock.
|
|
||||||
|
|
||||||
In some cases, it would be possible to restructure the code so that
|
|
||||||
the call_rcu() is delayed until after the lock is released. However,
|
|
||||||
there are cases where this can be quite ugly:
|
|
||||||
|
|
||||||
1. If a number of items need to be passed to call_rcu() within
|
|
||||||
the same critical section, then the code would need to create
|
|
||||||
a list of them, then traverse the list once the lock was
|
|
||||||
released.
|
|
||||||
|
|
||||||
2. In some cases, the lock will be held across some kernel API,
|
|
||||||
so that delaying the call_rcu() until the lock is released
|
|
||||||
requires that the data item be passed up via a common API.
|
|
||||||
It is far better to guarantee that callbacks are invoked
|
|
||||||
with no locks held than to have to modify such APIs to allow
|
|
||||||
arbitrary data items to be passed back up through them.
|
|
||||||
|
|
||||||
If call_rcu() directly invokes the callback, painful locking restrictions
|
|
||||||
or API changes would be required.
|
|
||||||
|
|
||||||
Quick Quiz #2: What locking restriction must RCU callbacks respect?
|
|
||||||
|
|
||||||
|
|
||||||
Summary
|
|
||||||
|
|
||||||
Permitting call_rcu() to immediately invoke its arguments breaks RCU,
|
|
||||||
even on a UP system. So do not do it! Even on a UP system, the RCU
|
|
||||||
infrastructure -must- respect grace periods, and -must- invoke callbacks
|
|
||||||
from a known environment in which no locks are held.
|
|
||||||
|
|
||||||
Note that it -is- safe for synchronize_rcu() to return immediately on
|
|
||||||
UP systems, including !PREEMPT SMP builds running on UP systems.
|
|
||||||
|
|
||||||
Quick Quiz #3: Why can't synchronize_rcu() return immediately on
|
|
||||||
UP systems running preemptable RCU?
|
|
||||||
|
|
||||||
|
|
||||||
Answer to Quick Quiz #1:
|
|
||||||
Why is it -not- legal to invoke synchronize_rcu() in this case?
|
|
||||||
|
|
||||||
Because the calling function is scanning an RCU-protected linked
|
|
||||||
list, and is therefore within an RCU read-side critical section.
|
|
||||||
Therefore, the called function has been invoked within an RCU
|
|
||||||
read-side critical section, and is not permitted to block.
|
|
||||||
|
|
||||||
Answer to Quick Quiz #2:
|
|
||||||
What locking restriction must RCU callbacks respect?
|
|
||||||
|
|
||||||
Any lock that is acquired within an RCU callback must be
|
|
||||||
acquired elsewhere using an _irq variant of the spinlock
|
|
||||||
primitive. For example, if "mylock" is acquired by an
|
|
||||||
RCU callback, then a process-context acquisition of this
|
|
||||||
lock must use something like spin_lock_irqsave() to
|
|
||||||
acquire the lock.
|
|
||||||
|
|
||||||
If the process-context code were to simply use spin_lock(),
|
|
||||||
then, since RCU callbacks can be invoked from softirq context,
|
|
||||||
the callback might be called from a softirq that interrupted
|
|
||||||
the process-context critical section. This would result in
|
|
||||||
self-deadlock.
|
|
||||||
|
|
||||||
This restriction might seem gratuitous, since very few RCU
|
|
||||||
callbacks acquire locks directly. However, a great many RCU
|
|
||||||
callbacks do acquire locks -indirectly-, for example, via
|
|
||||||
the kfree() primitive.
|
|
||||||
|
|
||||||
Answer to Quick Quiz #3:
|
|
||||||
Why can't synchronize_rcu() return immediately on UP systems
|
|
||||||
running preemptable RCU?
|
|
||||||
|
|
||||||
Because some other task might have been preempted in the middle
|
|
||||||
of an RCU read-side critical section. If synchronize_rcu()
|
|
||||||
simply immediately returned, it would prematurely signal the
|
|
||||||
end of the grace period, which would come as a nasty shock to
|
|
||||||
that other thread when it started running again.
|
|
19
Documentation/RCU/index.rst
Normal file
|
@ -0,0 +1,19 @@
|
||||||
|
.. _rcu_concepts:
|
||||||
|
|
||||||
|
============
|
||||||
|
RCU concepts
|
||||||
|
============
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
rcu
|
||||||
|
listRCU
|
||||||
|
UP
|
||||||
|
|
||||||
|
.. only:: subproject and html
|
||||||
|
|
||||||
|
Indices
|
||||||
|
=======
|
||||||
|
|
||||||
|
* :ref:`genindex`
|
321
Documentation/RCU/listRCU.rst
Normal file
|
@ -0,0 +1,321 @@
|
||||||
|
.. _list_rcu_doc:
|
||||||
|
|
||||||
|
Using RCU to Protect Read-Mostly Linked Lists
|
||||||
|
=============================================
|
||||||
|
|
||||||
|
One of the best applications of RCU is to protect read-mostly linked lists
|
||||||
|
("struct list_head" in list.h). One big advantage of this approach
|
||||||
|
is that all of the required memory barriers are included for you in
|
||||||
|
the list macros. This document describes several applications of RCU,
|
||||||
|
with the best fits first.
|
||||||
|
|
||||||
|
Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
|
||||||
|
----------------------------------------------------------------------
|
||||||
|
|
||||||
|
The best applications are cases where, if reader-writer locking were
|
||||||
|
used, the read-side lock would be dropped before taking any action
|
||||||
|
based on the results of the search. The most celebrated example is
|
||||||
|
the routing table. Because the routing table is tracking the state of
|
||||||
|
equipment outside of the computer, it will at times contain stale data.
|
||||||
|
Therefore, once the route has been computed, there is no need to hold
|
||||||
|
the routing table static during transmission of the packet. After all,
|
||||||
|
you can hold the routing table static all you want, but that won't keep
|
||||||
|
the external Internet from changing, and it is the state of the external
|
||||||
|
Internet that really matters. In addition, routing entries are typically
|
||||||
|
added or deleted, rather than being modified in place.
|
||||||
|
|
||||||
|
A straightforward example of this use of RCU may be found in the
|
||||||
|
system-call auditing support. For example, a reader-writer locked
|
||||||
|
implementation of audit_filter_task() might be as follows::
|
||||||
|
|
||||||
|
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
enum audit_state state;
|
||||||
|
|
||||||
|
read_lock(&auditsc_lock);
|
||||||
|
/* Note: audit_netlink_sem held by caller. */
|
||||||
|
list_for_each_entry(e, &audit_tsklist, list) {
|
||||||
|
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||||
|
read_unlock(&auditsc_lock);
|
||||||
|
return state;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
read_unlock(&auditsc_lock);
|
||||||
|
return AUDIT_BUILD_CONTEXT;
|
||||||
|
}
|
||||||
|
|
||||||
|
Here the list is searched under the lock, but the lock is dropped before
|
||||||
|
the corresponding value is returned. By the time that this value is acted
|
||||||
|
on, the list may well have been modified. This makes sense, since if
|
||||||
|
you are turning auditing off, it is OK to audit a few extra system calls.
|
||||||
|
|
||||||
|
This means that RCU can be easily applied to the read side, as follows::
|
||||||
|
|
||||||
|
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
enum audit_state state;
|
||||||
|
|
||||||
|
rcu_read_lock();
|
||||||
|
/* Note: audit_netlink_sem held by caller. */
|
||||||
|
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
||||||
|
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||||
|
rcu_read_unlock();
|
||||||
|
return state;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
rcu_read_unlock();
|
||||||
|
return AUDIT_BUILD_CONTEXT;
|
||||||
|
}
|
||||||
|
|
||||||
|
The read_lock() and read_unlock() calls have become rcu_read_lock()
|
||||||
|
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
|
||||||
|
become list_for_each_entry_rcu(). The _rcu() list-traversal primitives
|
||||||
|
insert the read-side memory barriers that are required on DEC Alpha CPUs.
|
||||||
|
|
||||||
|
The changes to the update side are also straightforward. A reader-writer
|
||||||
|
lock might be used as follows for deletion and insertion::
|
||||||
|
|
||||||
|
static inline int audit_del_rule(struct audit_rule *rule,
|
||||||
|
struct list_head *list)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
|
||||||
|
write_lock(&auditsc_lock);
|
||||||
|
list_for_each_entry(e, list, list) {
|
||||||
|
if (!audit_compare_rule(rule, &e->rule)) {
|
||||||
|
list_del(&e->list);
|
||||||
|
write_unlock(&auditsc_lock);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
write_unlock(&auditsc_lock);
|
||||||
|
return -EFAULT; /* No matching rule */
|
||||||
|
}
|
||||||
|
|
||||||
|
static inline int audit_add_rule(struct audit_entry *entry,
|
||||||
|
struct list_head *list)
|
||||||
|
{
|
||||||
|
write_lock(&auditsc_lock);
|
||||||
|
if (entry->rule.flags & AUDIT_PREPEND) {
|
||||||
|
entry->rule.flags &= ~AUDIT_PREPEND;
|
||||||
|
list_add(&entry->list, list);
|
||||||
|
} else {
|
||||||
|
list_add_tail(&entry->list, list);
|
||||||
|
}
|
||||||
|
write_unlock(&auditsc_lock);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
Following are the RCU equivalents for these two functions::
|
||||||
|
|
||||||
|
static inline int audit_del_rule(struct audit_rule *rule,
|
||||||
|
struct list_head *list)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
|
||||||
|
/* No need to use the _rcu iterator here, since this is the only
|
||||||
|
* deletion routine. */
|
||||||
|
list_for_each_entry(e, list, list) {
|
||||||
|
if (!audit_compare_rule(rule, &e->rule)) {
|
||||||
|
list_del_rcu(&e->list);
|
||||||
|
call_rcu(&e->rcu, audit_free_rule);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return -EFAULT; /* No matching rule */
|
||||||
|
}
|
||||||
|
|
||||||
|
static inline int audit_add_rule(struct audit_entry *entry,
|
||||||
|
struct list_head *list)
|
||||||
|
{
|
||||||
|
if (entry->rule.flags & AUDIT_PREPEND) {
|
||||||
|
entry->rule.flags &= ~AUDIT_PREPEND;
|
||||||
|
list_add_rcu(&entry->list, list);
|
||||||
|
} else {
|
||||||
|
list_add_tail_rcu(&entry->list, list);
|
||||||
|
}
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
Normally, the write_lock() and write_unlock() would be replaced by
|
||||||
|
a spin_lock() and a spin_unlock(), but in this case, all callers hold
|
||||||
|
audit_netlink_sem, so no additional locking is required. The auditsc_lock
|
||||||
|
can therefore be eliminated, since use of RCU eliminates the need for
|
||||||
|
writers to exclude readers.
|
||||||
|
|
||||||
|
The list_del(), list_add(), and list_add_tail() primitives have been
|
||||||
|
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
|
||||||
|
The _rcu() list-manipulation primitives add memory barriers that are
|
||||||
|
needed on weakly ordered CPUs (most of them!). The list_del_rcu()
|
||||||
|
primitive omits the pointer poisoning debug-assist code that would
|
||||||
|
otherwise cause concurrent readers to fail spectacularly.
|
||||||
|
|
||||||
|
So, when readers can tolerate stale data and when entries are either added
|
||||||
|
or deleted, without in-place modification, it is very easy to use RCU!
|
||||||
|
|
||||||
|
Example 2: Handling In-Place Updates
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
The system-call auditing code does not update auditing rules in place.
|
||||||
|
However, if it did, reader-writer-locked code to do so might look as
|
||||||
|
follows (presumably, the field_count is only permitted to decrease,
|
||||||
|
otherwise, the added fields would need to be filled in)::
|
||||||
|
|
||||||
|
static inline int audit_upd_rule(struct audit_rule *rule,
|
||||||
|
struct list_head *list,
|
||||||
|
__u32 newaction,
|
||||||
|
__u32 newfield_count)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
struct audit_newentry *ne;
|
||||||
|
|
||||||
|
write_lock(&auditsc_lock);
|
||||||
|
/* Note: audit_netlink_sem held by caller. */
|
||||||
|
list_for_each_entry(e, list, list) {
|
||||||
|
if (!audit_compare_rule(rule, &e->rule)) {
|
||||||
|
e->rule.action = newaction;
|
||||||
|
e->rule.field_count = newfield_count;
|
||||||
|
write_unlock(&auditsc_lock);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
write_unlock(&auditsc_lock);
|
||||||
|
return -EFAULT; /* No matching rule */
|
||||||
|
}
|
||||||
|
|
||||||
|
The RCU version creates a copy, updates the copy, then replaces the old
|
||||||
|
entry with the newly updated entry. This sequence of actions, allowing
|
||||||
|
concurrent reads while doing a copy to perform an update, is what gives
|
||||||
|
RCU ("read-copy update") its name. The RCU code is as follows::
|
||||||
|
|
||||||
|
static inline int audit_upd_rule(struct audit_rule *rule,
|
||||||
|
struct list_head *list,
|
||||||
|
__u32 newaction,
|
||||||
|
__u32 newfield_count)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
struct audit_entry *ne;
|
||||||
|
|
||||||
|
list_for_each_entry(e, list, list) {
|
||||||
|
if (!audit_compare_rule(rule, &e->rule)) {
|
||||||
|
ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
|
||||||
|
if (ne == NULL)
|
||||||
|
return -ENOMEM;
|
||||||
|
audit_copy_rule(&ne->rule, &e->rule);
|
||||||
|
ne->rule.action = newaction;
|
||||||
|
ne->rule.field_count = newfield_count;
|
||||||
|
list_replace_rcu(&e->list, &ne->list);
|
||||||
|
call_rcu(&e->rcu, audit_free_rule);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return -EFAULT; /* No matching rule */
|
||||||
|
}
|
||||||
|
|
||||||
|
Again, this assumes that the caller holds audit_netlink_sem. Normally,
|
||||||
|
the reader-writer lock would become a spinlock in this sort of code.
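The copy, update, replace sequence can be sketched outside the kernel as follows. This is a single-threaded userspace model under stated assumptions: struct toy_entry, toy_upd_entry(), and toy_upd_demo() are hypothetical names, a plain pointer store stands in for list_replace_rcu(), and a free() issued after the reader is done stands in for call_rcu().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for an audit entry; not the kernel struct. */
struct toy_entry {
	struct toy_entry *next;
	int action;
	int field_count;
};

/*
 * Replace the entry that *slot points at with an updated copy.
 * Returns the old entry, which the caller may free only once no reader
 * can still hold it (the kernel would use call_rcu() for this), or NULL
 * on allocation failure, in which case the list is left unchanged.
 */
static struct toy_entry *toy_upd_entry(struct toy_entry **slot,
				       int newaction, int newfield_count)
{
	struct toy_entry *old = *slot;
	struct toy_entry *ne = malloc(sizeof(*ne));

	if (ne == NULL)
		return NULL;
	*ne = *old;				/* copy */
	ne->action = newaction;			/* update the copy, not the original */
	ne->field_count = newfield_count;
	*slot = ne;				/* replace: new readers see the copy */
	return old;
}

/* Returns 1 if the old reader and new readers each saw a consistent entry. */
static int toy_upd_demo(void)
{
	struct toy_entry *head = malloc(sizeof(*head));
	struct toy_entry *reader_view, *old;
	int ok;

	if (head == NULL)
		return 0;
	head->next = NULL;
	head->action = 1;
	head->field_count = 3;

	reader_view = head;			/* a reader picked up the old entry */
	old = toy_upd_entry(&head, 2, 1);
	if (old == NULL) {
		free(head);
		return 0;
	}
	assert(old == reader_view);

	/* The old reader still sees the old values; new readers see the new ones. */
	ok = reader_view->action == 1 && reader_view->field_count == 3 &&
	     head->action == 2 && head->field_count == 1;

	free(old);				/* stands in for the deferred free */
	free(head);
	return ok;
}
```

The point of the design is that a reader never observes a half-updated entry: it sees either the old entry or the fully initialized copy.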
|
||||||
|
|
||||||
|
Example 3: Eliminating Stale Data
|
||||||
|
---------------------------------
|
||||||
|
|
||||||
|
The auditing examples above tolerate stale data, as do most algorithms
|
||||||
|
that are tracking external state. Because there is a delay from the
|
||||||
|
time the external state changes until Linux becomes aware of the change,
|
||||||
|
additional RCU-induced staleness is normally not a problem.
|
||||||
|
|
||||||
|
However, there are many examples where stale data cannot be tolerated.
|
||||||
|
One example in the Linux kernel is the System V IPC (see the ipc_lock()
|
||||||
|
function in ipc/util.c). This code checks a "deleted" flag under a
|
||||||
|
per-entry spinlock, and, if the "deleted" flag is set, pretends that the
|
||||||
|
entry does not exist. For this to be helpful, the search function must
|
||||||
|
return holding the per-entry spinlock, as ipc_lock() does in fact do.
|
||||||
|
|
||||||
|
Quick Quiz:
|
||||||
|
Why does the search function need to return holding the per-entry lock for
|
||||||
|
this deleted-flag technique to be helpful?
|
||||||
|
|
||||||
|
:ref:`Answer to Quick Quiz <answer_quick_quiz_list>`
|
||||||
|
|
||||||
|
If the system-call audit module were to ever need to reject stale data,
|
||||||
|
one way to accomplish this would be to add a "deleted" flag and a "lock"
|
||||||
|
spinlock to the audit_entry structure, and modify audit_filter_task()
|
||||||
|
as follows::
|
||||||
|
|
||||||
|
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
enum audit_state state;
|
||||||
|
|
||||||
|
rcu_read_lock();
|
||||||
|
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
||||||
|
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
||||||
|
spin_lock(&e->lock);
|
||||||
|
if (e->deleted) {
|
||||||
|
spin_unlock(&e->lock);
|
||||||
|
rcu_read_unlock();
|
||||||
|
return AUDIT_BUILD_CONTEXT;
|
||||||
|
}
|
||||||
|
rcu_read_unlock();
|
||||||
|
return state;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
rcu_read_unlock();
|
||||||
|
return AUDIT_BUILD_CONTEXT;
|
||||||
|
}
|
||||||
|
|
||||||
|
Note that this example assumes that entries are only added and deleted.
|
||||||
|
Additional mechanism is required to deal correctly with the
|
||||||
|
update-in-place performed by audit_upd_rule(). For one thing,
|
||||||
|
audit_upd_rule() would need additional memory barriers to ensure
|
||||||
|
that the list_add_rcu() was really executed before the list_del_rcu().
|
||||||
|
|
||||||
|
The audit_del_rule() function would need to set the "deleted"
|
||||||
|
flag under the spinlock as follows::
|
||||||
|
|
||||||
|
static inline int audit_del_rule(struct audit_rule *rule,
|
||||||
|
struct list_head *list)
|
||||||
|
{
|
||||||
|
struct audit_entry *e;
|
||||||
|
|
||||||
|
/* Do not need to use the _rcu iterator here, since this
|
||||||
|
* is the only deletion routine. */
|
||||||
|
list_for_each_entry(e, list, list) {
|
||||||
|
if (!audit_compare_rule(rule, &e->rule)) {
|
||||||
|
spin_lock(&e->lock);
|
||||||
|
list_del_rcu(&e->list);
|
||||||
|
e->deleted = 1;
|
||||||
|
spin_unlock(&e->lock);
|
||||||
|
call_rcu(&e->rcu, audit_free_rule);
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return -EFAULT; /* No matching rule */
|
||||||
|
}
|
||||||
|
|
||||||
|
Summary
|
||||||
|
-------
|
||||||
|
|
||||||
|
Read-mostly list-based data structures that can tolerate stale data are
|
||||||
|
the most amenable to use of RCU. The simplest case is where entries are
|
||||||
|
either added to or deleted from the data structure (or atomically modified
|
||||||
|
in place), but non-atomic in-place modifications can be handled by making
|
||||||
|
a copy, updating the copy, then replacing the original with the copy.
|
||||||
|
If stale data cannot be tolerated, then a "deleted" flag may be used
|
||||||
|
in conjunction with a per-entry spinlock in order to allow the search
|
||||||
|
function to reject newly deleted data.
|
||||||
|
|
||||||
|
.. _answer_quick_quiz_list:
|
||||||
|
|
||||||
|
Answer to Quick Quiz:
|
||||||
|
Why does the search function need to return holding the per-entry
|
||||||
|
lock for this deleted-flag technique to be helpful?
|
||||||
|
|
||||||
|
If the search function drops the per-entry lock before returning,
|
||||||
|
then the caller will be processing stale data in any case. If it
|
||||||
|
is really OK to be processing stale data, then you don't need a
|
||||||
|
"deleted" flag. If processing stale data really is a problem,
|
||||||
|
then you need to hold the per-entry lock across all of the code
|
||||||
|
that uses the value that was returned.
|
|
@ -1,315 +0,0 @@
|
||||||
Using RCU to Protect Read-Mostly Linked Lists
|
|
||||||
|
|
||||||
|
|
||||||
One of the best applications of RCU is to protect read-mostly linked lists
|
|
||||||
("struct list_head" in list.h). One big advantage of this approach
|
|
||||||
is that all of the required memory barriers are included for you in
|
|
||||||
the list macros. This document describes several applications of RCU,
|
|
||||||
with the best fits first.
|
|
||||||
|
|
||||||
|
|
||||||
Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
|
|
||||||
|
|
||||||
The best applications are cases where, if reader-writer locking were
|
|
||||||
used, the read-side lock would be dropped before taking any action
|
|
||||||
based on the results of the search. The most celebrated example is
|
|
||||||
the routing table. Because the routing table is tracking the state of
|
|
||||||
equipment outside of the computer, it will at times contain stale data.
|
|
||||||
Therefore, once the route has been computed, there is no need to hold
|
|
||||||
the routing table static during transmission of the packet. After all,
|
|
||||||
you can hold the routing table static all you want, but that won't keep
|
|
||||||
the external Internet from changing, and it is the state of the external
|
|
||||||
Internet that really matters. In addition, routing entries are typically
|
|
||||||
added or deleted, rather than being modified in place.
|
|
||||||
|
|
||||||
A straightforward example of this use of RCU may be found in the
|
|
||||||
system-call auditing support. For example, a reader-writer locked
|
|
||||||
implementation of audit_filter_task() might be as follows:
|
|
||||||
|
|
||||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
enum audit_state state;
|
|
||||||
|
|
||||||
read_lock(&auditsc_lock);
|
|
||||||
/* Note: audit_netlink_sem held by caller. */
|
|
||||||
list_for_each_entry(e, &audit_tsklist, list) {
|
|
||||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
|
||||||
read_unlock(&auditsc_lock);
|
|
||||||
return state;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
read_unlock(&auditsc_lock);
|
|
||||||
return AUDIT_BUILD_CONTEXT;
|
|
||||||
}
|
|
||||||
|
|
||||||
Here the list is searched under the lock, but the lock is dropped before
|
|
||||||
the corresponding value is returned. By the time that this value is acted
|
|
||||||
on, the list may well have been modified. This makes sense, since if
|
|
||||||
you are turning auditing off, it is OK to audit a few extra system calls.
|
|
||||||
|
|
||||||
This means that RCU can be easily applied to the read side, as follows:
|
|
||||||
|
|
||||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
enum audit_state state;
|
|
||||||
|
|
||||||
rcu_read_lock();
|
|
||||||
/* Note: audit_netlink_sem held by caller. */
|
|
||||||
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
|
||||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
|
||||||
rcu_read_unlock();
|
|
||||||
return state;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
rcu_read_unlock();
|
|
||||||
return AUDIT_BUILD_CONTEXT;
|
|
||||||
}
|
|
||||||
|
|
||||||
The read_lock() and read_unlock() calls have become rcu_read_lock()
|
|
||||||
and rcu_read_unlock(), respectively, and the list_for_each_entry() has
|
|
||||||
become list_for_each_entry_rcu(). The _rcu() list-traversal primitives
|
|
||||||
insert the read-side memory barriers that are required on DEC Alpha CPUs.
|
|
||||||
|
|
||||||
The changes to the update side are also straightforward. A reader-writer
|
|
||||||
lock might be used as follows for deletion and insertion:
|
|
||||||
|
|
||||||
static inline int audit_del_rule(struct audit_rule *rule,
|
|
||||||
struct list_head *list)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
|
|
||||||
write_lock(&auditsc_lock);
|
|
||||||
list_for_each_entry(e, list, list) {
|
|
||||||
if (!audit_compare_rule(rule, &e->rule)) {
|
|
||||||
list_del(&e->list);
|
|
||||||
write_unlock(&auditsc_lock);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
write_unlock(&auditsc_lock);
|
|
||||||
return -EFAULT; /* No matching rule */
|
|
||||||
}
|
|
||||||
|
|
||||||
static inline int audit_add_rule(struct audit_entry *entry,
|
|
||||||
struct list_head *list)
|
|
||||||
{
|
|
||||||
write_lock(&auditsc_lock);
|
|
||||||
if (entry->rule.flags & AUDIT_PREPEND) {
|
|
||||||
entry->rule.flags &= ~AUDIT_PREPEND;
|
|
||||||
list_add(&entry->list, list);
|
|
||||||
} else {
|
|
||||||
list_add_tail(&entry->list, list);
|
|
||||||
}
|
|
||||||
write_unlock(&auditsc_lock);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
Following are the RCU equivalents for these two functions:
|
|
||||||
|
|
||||||
static inline int audit_del_rule(struct audit_rule *rule,
|
|
||||||
struct list_head *list)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
|
|
||||||
/* No need to use the _rcu iterator here, since this is the only
|
|
||||||
* deletion routine. */
|
|
||||||
list_for_each_entry(e, list, list) {
|
|
||||||
if (!audit_compare_rule(rule, &e->rule)) {
|
|
||||||
list_del_rcu(&e->list);
|
|
||||||
call_rcu(&e->rcu, audit_free_rule);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return -EFAULT; /* No matching rule */
|
|
||||||
}
|
|
||||||
|
|
||||||
static inline int audit_add_rule(struct audit_entry *entry,
|
|
||||||
struct list_head *list)
|
|
||||||
{
|
|
||||||
if (entry->rule.flags & AUDIT_PREPEND) {
|
|
||||||
entry->rule.flags &= ~AUDIT_PREPEND;
|
|
||||||
list_add_rcu(&entry->list, list);
|
|
||||||
} else {
|
|
||||||
list_add_tail_rcu(&entry->list, list);
|
|
||||||
}
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
Normally, the write_lock() and write_unlock() would be replaced by
|
|
||||||
a spin_lock() and a spin_unlock(), but in this case, all callers hold
|
|
||||||
audit_netlink_sem, so no additional locking is required. The auditsc_lock
|
|
||||||
can therefore be eliminated, since use of RCU eliminates the need for
|
|
||||||
writers to exclude readers.
|
|
||||||
|
|
||||||
The list_del(), list_add(), and list_add_tail() primitives have been
|
|
||||||
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
|
|
||||||
The _rcu() list-manipulation primitives add memory barriers that are
|
|
||||||
needed on weakly ordered CPUs (most of them!). The list_del_rcu()
|
|
||||||
primitive omits the pointer poisoning debug-assist code that would
|
|
||||||
otherwise cause concurrent readers to fail spectacularly.
|
|
||||||
|
|
||||||
So, when readers can tolerate stale data and when entries are either added
|
|
||||||
or deleted, without in-place modification, it is very easy to use RCU!
|
|
||||||
|
|
||||||
|
|
||||||
Example 2: Handling In-Place Updates
|
|
||||||
|
|
||||||
The system-call auditing code does not update auditing rules in place.
|
|
||||||
However, if it did, reader-writer-locked code to do so might look as
|
|
||||||
follows (presumably, the field_count is only permitted to decrease,
|
|
||||||
otherwise, the added fields would need to be filled in):
|
|
||||||
|
|
||||||
static inline int audit_upd_rule(struct audit_rule *rule,
|
|
||||||
struct list_head *list,
|
|
||||||
__u32 newaction,
|
|
||||||
__u32 newfield_count)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
struct audit_newentry *ne;
|
|
||||||
|
|
||||||
write_lock(&auditsc_lock);
|
|
||||||
/* Note: audit_netlink_sem held by caller. */
|
|
||||||
list_for_each_entry(e, list, list) {
|
|
||||||
if (!audit_compare_rule(rule, &e->rule)) {
|
|
||||||
e->rule.action = newaction;
|
|
||||||
e->rule.field_count = newfield_count;
|
|
||||||
write_unlock(&auditsc_lock);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
write_unlock(&auditsc_lock);
|
|
||||||
return -EFAULT; /* No matching rule */
|
|
||||||
}
|
|
||||||
|
|
||||||
The RCU version creates a copy, updates the copy, then replaces the old
|
|
||||||
entry with the newly updated entry. This sequence of actions, allowing
|
|
||||||
concurrent reads while doing a copy to perform an update, is what gives
|
|
||||||
RCU ("read-copy update") its name. The RCU code is as follows:
|
|
||||||
|
|
||||||
static inline int audit_upd_rule(struct audit_rule *rule,
|
|
||||||
struct list_head *list,
|
|
||||||
__u32 newaction,
|
|
||||||
__u32 newfield_count)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
struct audit_entry *ne;
|
|
||||||
|
|
||||||
list_for_each_entry(e, list, list) {
|
|
||||||
if (!audit_compare_rule(rule, &e->rule)) {
|
|
||||||
ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
|
|
||||||
if (ne == NULL)
|
|
||||||
return -ENOMEM;
|
|
||||||
audit_copy_rule(&ne->rule, &e->rule);
|
|
||||||
ne->rule.action = newaction;
|
|
||||||
ne->rule.field_count = newfield_count;
|
|
||||||
list_replace_rcu(&e->list, &ne->list);
|
|
||||||
call_rcu(&e->rcu, audit_free_rule);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return -EFAULT; /* No matching rule */
|
|
||||||
}
|
|
||||||
|
|
||||||
Again, this assumes that the caller holds audit_netlink_sem. Normally,
|
|
||||||
the reader-writer lock would become a spinlock in this sort of code.
|
|
||||||
|
|
||||||
|
|
||||||
Example 3: Eliminating Stale Data
|
|
||||||
|
|
||||||
The auditing examples above tolerate stale data, as do most algorithms
|
|
||||||
that are tracking external state. Because there is a delay from the
|
|
||||||
time the external state changes until Linux becomes aware of the change,
|
|
||||||
additional RCU-induced staleness is normally not a problem.
|
|
||||||
|
|
||||||
However, there are many examples where stale data cannot be tolerated.
|
|
||||||
One example in the Linux kernel is the System V IPC (see the ipc_lock()
|
|
||||||
function in ipc/util.c). This code checks a "deleted" flag under a
|
|
||||||
per-entry spinlock, and, if the "deleted" flag is set, pretends that the
|
|
||||||
entry does not exist. For this to be helpful, the search function must
|
|
||||||
return holding the per-entry spinlock, as ipc_lock() does in fact do.
|
|
||||||
|
|
||||||
Quick Quiz: Why does the search function need to return holding the
|
|
||||||
per-entry lock for this deleted-flag technique to be helpful?
|
|
||||||
|
|
||||||
If the system-call audit module were to ever need to reject stale data,
|
|
||||||
one way to accomplish this would be to add a "deleted" flag and a "lock"
|
|
||||||
spinlock to the audit_entry structure, and modify audit_filter_task()
|
|
||||||
as follows:
|
|
||||||
|
|
||||||
static enum audit_state audit_filter_task(struct task_struct *tsk)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
enum audit_state state;
|
|
||||||
|
|
||||||
rcu_read_lock();
|
|
||||||
list_for_each_entry_rcu(e, &audit_tsklist, list) {
|
|
||||||
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
|
|
||||||
spin_lock(&e->lock);
|
|
||||||
if (e->deleted) {
|
|
||||||
spin_unlock(&e->lock);
|
|
||||||
rcu_read_unlock();
|
|
||||||
return AUDIT_BUILD_CONTEXT;
|
|
||||||
}
|
|
||||||
rcu_read_unlock();
|
|
||||||
return state;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
rcu_read_unlock();
|
|
||||||
return AUDIT_BUILD_CONTEXT;
|
|
||||||
}
|
|
||||||
|
|
||||||
Note that this example assumes that entries are only added and deleted.
|
|
||||||
Additional mechanism is required to deal correctly with the
|
|
||||||
update-in-place performed by audit_upd_rule(). For one thing,
|
|
||||||
audit_upd_rule() would need additional memory barriers to ensure
|
|
||||||
that the list_add_rcu() was really executed before the list_del_rcu().
|
|
||||||
|
|
||||||
The audit_del_rule() function would need to set the "deleted"
|
|
||||||
flag under the spinlock as follows:
|
|
||||||
|
|
||||||
static inline int audit_del_rule(struct audit_rule *rule,
|
|
||||||
struct list_head *list)
|
|
||||||
{
|
|
||||||
struct audit_entry *e;
|
|
||||||
|
|
||||||
/* Do not need to use the _rcu iterator here, since this
|
|
||||||
* is the only deletion routine. */
|
|
||||||
list_for_each_entry(e, list, list) {
|
|
||||||
if (!audit_compare_rule(rule, &e->rule)) {
|
|
||||||
spin_lock(&e->lock);
|
|
||||||
list_del_rcu(&e->list);
|
|
||||||
e->deleted = 1;
|
|
||||||
spin_unlock(&e->lock);
|
|
||||||
call_rcu(&e->rcu, audit_free_rule);
|
|
||||||
return 0;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
return -EFAULT; /* No matching rule */
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
Summary
|
|
||||||
|
|
||||||
Read-mostly list-based data structures that can tolerate stale data are
|
|
||||||
the most amenable to use of RCU. The simplest case is where entries are
|
|
||||||
either added or deleted from the data structure (or atomically modified
|
|
||||||
in place), but non-atomic in-place modifications can be handled by making
|
|
||||||
a copy, updating the copy, then replacing the original with the copy.
|
|
||||||
If stale data cannot be tolerated, then a "deleted" flag may be used
|
|
||||||
in conjunction with a per-entry spinlock in order to allow the search
|
|
||||||
function to reject newly deleted data.
|
|
||||||
|
|
||||||
|
|
||||||
Answer to Quick Quiz
|
|
||||||
Why does the search function need to return holding the per-entry
|
|
||||||
lock for this deleted-flag technique to be helpful?
|
|
||||||
|
|
||||||
If the search function drops the per-entry lock before returning,
|
|
||||||
then the caller will be processing stale data in any case. If it
|
|
||||||
is really OK to be processing stale data, then you don't need a
|
|
||||||
"deleted" flag. If processing stale data really is a problem,
|
|
||||||
then you need to hold the per-entry lock across all of the code
|
|
||||||
that uses the value that was returned.
|
|
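The deleted-flag technique described at the end of the listRCU.txt text above can be sketched in user-space C. This is an illustration only and is not part of the commit: `struct entry`, `entry_lookup_locked()`, and `entry_mark_deleted()` are hypothetical names, a fixed array stands in for the RCU-protected list, and a pthread mutex stands in for the kernel spinlock. RCU traversal and deferred freeing are deliberately left out, so the sketch shows only why the lookup returns with the per-entry lock held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entry type: one lock and one "deleted" flag per entry. */
struct entry {
	pthread_mutex_t lock;
	bool deleted;
	int key;
	int value;
};

/*
 * Look up key in tab[0..n-1].  On success the entry is returned with
 * its lock HELD, so it cannot be marked deleted while the caller is
 * still using it.  Returns NULL if the key is absent or the matching
 * entry has already been marked deleted.
 */
static struct entry *entry_lookup_locked(struct entry *tab, size_t n, int key)
{
	assert(tab != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct entry *e = &tab[i];

		if (e->key != key)
			continue;
		pthread_mutex_lock(&e->lock);
		if (e->deleted) {
			/* Pretend the entry does not exist. */
			pthread_mutex_unlock(&e->lock);
			return NULL;
		}
		return e;	/* caller must unlock e->lock when done */
	}
	return NULL;
}

/* Deletion side: the flag is only ever set under the per-entry lock. */
static void entry_mark_deleted(struct entry *e)
{
	pthread_mutex_lock(&e->lock);
	e->deleted = true;
	pthread_mutex_unlock(&e->lock);
}
```

A caller would use the returned entry and then call `pthread_mutex_unlock(&e->lock)`. Dropping the lock inside the lookup instead would reintroduce the stale-data window that the Quick Quiz answer describes. Build with `-pthread`.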
92
Documentation/RCU/rcu.rst
Normal file
|
@ -0,0 +1,92 @@
|
||||||
|
.. _rcu_doc:
|
||||||
|
|
||||||
|
RCU Concepts
|
||||||
|
============
|
||||||
|
|
||||||
|
The basic idea behind RCU (read-copy update) is to split destructive
|
||||||
|
operations into two parts, one that prevents anyone from seeing the data
|
||||||
|
item being destroyed, and one that actually carries out the destruction.
|
||||||
|
A "grace period" must elapse between the two parts, and this grace period
|
||||||
|
must be long enough that any readers accessing the item being deleted have
|
||||||
|
since dropped their references. For example, an RCU-protected deletion
|
||||||
|
from a linked list would first remove the item from the list, wait for
|
||||||
|
a grace period to elapse, then free the element. See the
|
||||||
|
Documentation/RCU/listRCU.rst file for more information on using RCU with
|
||||||
|
linked lists.
|
||||||
|
|
||||||
|
Frequently Asked Questions
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
- Why would anyone want to use RCU?
|
||||||
|
|
||||||
|
The advantage of RCU's two-part approach is that RCU readers need
|
||||||
|
not acquire any locks, perform any atomic instructions, write to
|
||||||
|
shared memory, or (on CPUs other than Alpha) execute any memory
|
||||||
|
barriers. The fact that these operations are quite expensive
|
||||||
|
on modern CPUs is what gives RCU its performance advantages
|
||||||
|
in read-mostly situations. The fact that RCU readers need not
|
||||||
|
acquire locks can also greatly simplify deadlock-avoidance code.
|
||||||
|
|
||||||
|
- How can the updater tell when a grace period has completed
|
||||||
|
if the RCU readers give no indication when they are done?
|
||||||
|
|
||||||
|
Just as with spinlocks, RCU readers are not permitted to
|
||||||
|
block, switch to user-mode execution, or enter the idle loop.
|
||||||
|
Therefore, as soon as a CPU is seen passing through any of these
|
||||||
|
three states, we know that that CPU has exited any previous RCU
|
||||||
|
read-side critical sections. So, if we remove an item from a
|
||||||
|
linked list, and then wait until all CPUs have switched context,
|
||||||
|
executed in user mode, or executed in the idle loop, we can
|
||||||
|
safely free up that item.
|
||||||
|
|
||||||
|
Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
|
||||||
|
same effect, but require that the readers manipulate CPU-local
|
||||||
|
counters. These counters allow limited types of blocking within
|
||||||
|
RCU read-side critical sections. SRCU also uses CPU-local
|
||||||
|
counters, and permits general blocking within RCU read-side
|
||||||
|
critical sections. These variants of RCU detect grace periods
|
||||||
|
by sampling these counters.
|
||||||
|
|
||||||
|
- If I am running on a uniprocessor kernel, which can only do one
|
||||||
|
thing at a time, why should I wait for a grace period?
|
||||||
|
|
||||||
|
See the Documentation/RCU/UP.rst file for more information.
|
||||||
|
|
||||||
|
- How can I see where RCU is currently used in the Linux kernel?
|
||||||
|
|
||||||
|
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
|
||||||
|
"rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock",
|
||||||
|
"srcu_read_unlock", "synchronize_rcu", "synchronize_net",
|
||||||
|
"synchronize_srcu", and the other RCU primitives. Or grab one
|
||||||
|
of the cscope databases from:
|
||||||
|
|
||||||
|
(http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html).
|
||||||
|
|
||||||
|
- What guidelines should I follow when writing code that uses RCU?
|
||||||
|
|
||||||
|
See the checklist.txt file in this directory.
|
||||||
|
|
||||||
|
- Why the name "RCU"?
|
||||||
|
|
||||||
|
"RCU" stands for "read-copy update". The file Documentation/RCU/listRCU.rst
|
||||||
|
has more information on where this name came from, search for
|
||||||
|
"read-copy update" to find it.
|
||||||
|
|
||||||
|
- I hear that RCU is patented? What is with that?
|
||||||
|
|
||||||
|
Yes, it is. There are several known patents related to RCU,
|
||||||
|
search for the string "Patent" in RTFP.txt to find them.
|
||||||
|
Of these, one was allowed to lapse by the assignee, and the
|
||||||
|
others have been contributed to the Linux kernel under GPL.
|
||||||
|
There are now also LGPL implementations of user-level RCU
|
||||||
|
available (http://liburcu.org/).
|
||||||
|
|
||||||
|
- I hear that RCU needs work in order to support realtime kernels?
|
||||||
|
|
||||||
|
Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
|
||||||
|
kernel configuration parameter.
|
||||||
|
|
||||||
|
- Where can I find more information on RCU?
|
||||||
|
|
||||||
|
See the RTFP.txt file in this directory.
|
||||||
|
Or point your browser at (http://www.rdrop.com/users/paulmck/RCU/).
|
|
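The two-part destruction that the rcu.rst text above describes (hide the item, wait for a grace period, then destroy it) can be sketched in user-space C11. This is an illustration only and is not part of the commit. The names `shared_item`, `active_readers`, `reader_peek()`, and `updater_remove()` are hypothetical, and the grace period is a deliberately naive reader counter. Real RCU readers do not write to shared memory like this, which is exactly the cost the document says RCU avoids.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct item {
	int value;
};

/* Pointer published to readers; NULL means "no item". */
static _Atomic(struct item *) shared_item;

/* Toy stand-in for RCU's reader tracking (assumption, not kernel API). */
static atomic_int active_readers;

/* Reader: return the item's value, or fallback if no item is published. */
static int reader_peek(int fallback)
{
	atomic_fetch_add(&active_readers, 1);	/* plays the role of rcu_read_lock() */
	struct item *p = atomic_load(&shared_item);
	int v = p ? p->value : fallback;
	atomic_fetch_sub(&active_readers, 1);	/* plays the role of rcu_read_unlock() */
	return v;
}

/* Updater: part one hides the item, part two destroys it. */
static void updater_remove(void)
{
	/* Part 1: no new reader can see the item after this exchange. */
	struct item *old = atomic_exchange(&shared_item, (struct item *)NULL);

	/*
	 * Naive grace period: wait until an instant with no active
	 * readers, so every reader that might hold "old" has finished.
	 * Real RCU detects this without spinning on a shared counter.
	 */
	while (atomic_load(&active_readers) != 0)
		;

	/* Part 2: no reader can still hold a reference, so free it. */
	free(old);
}
```

The default sequentially consistent atomics make the ordering argument simple: a reader that loaded the old pointer incremented `active_readers` before the updater's exchange, so the updater cannot observe zero until that reader has finished.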
@ -1,89 +0,0 @@
|
||||||
RCU Concepts
|
|
||||||
|
|
||||||
|
|
||||||
The basic idea behind RCU (read-copy update) is to split destructive
|
|
||||||
operations into two parts, one that prevents anyone from seeing the data
|
|
||||||
item being destroyed, and one that actually carries out the destruction.
|
|
||||||
A "grace period" must elapse between the two parts, and this grace period
|
|
||||||
must be long enough that any readers accessing the item being deleted have
|
|
||||||
since dropped their references. For example, an RCU-protected deletion
|
|
||||||
from a linked list would first remove the item from the list, wait for
|
|
||||||
a grace period to elapse, then free the element. See the listRCU.txt
|
|
||||||
file for more information on using RCU with linked lists.
|
|
||||||
|
|
||||||
|
|
||||||
Frequently Asked Questions
|
|
||||||
|
|
||||||
o Why would anyone want to use RCU?
|
|
||||||
|
|
||||||
The advantage of RCU's two-part approach is that RCU readers need
|
|
||||||
not acquire any locks, perform any atomic instructions, write to
|
|
||||||
shared memory, or (on CPUs other than Alpha) execute any memory
|
|
||||||
barriers. The fact that these operations are quite expensive
|
|
||||||
on modern CPUs is what gives RCU its performance advantages
|
|
||||||
in read-mostly situations. The fact that RCU readers need not
|
|
||||||
acquire locks can also greatly simplify deadlock-avoidance code.
|
|
||||||
|
|
||||||
o How can the updater tell when a grace period has completed
|
|
||||||
if the RCU readers give no indication when they are done?
|
|
||||||
|
|
||||||
Just as with spinlocks, RCU readers are not permitted to
|
|
||||||
block, switch to user-mode execution, or enter the idle loop.
|
|
||||||
Therefore, as soon as a CPU is seen passing through any of these
|
|
||||||
three states, we know that that CPU has exited any previous RCU
|
|
||||||
read-side critical sections. So, if we remove an item from a
|
|
||||||
linked list, and then wait until all CPUs have switched context,
|
|
||||||
executed in user mode, or executed in the idle loop, we can
|
|
||||||
safely free up that item.
|
|
||||||
|
|
||||||
Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
|
|
||||||
same effect, but require that the readers manipulate CPU-local
|
|
||||||
counters. These counters allow limited types of blocking within
|
|
||||||
RCU read-side critical sections. SRCU also uses CPU-local
|
|
||||||
counters, and permits general blocking within RCU read-side
|
|
||||||
critical sections. These variants of RCU detect grace periods
|
|
||||||
by sampling these counters.
|
|
||||||
|
|
||||||
o If I am running on a uniprocessor kernel, which can only do one
|
|
||||||
thing at a time, why should I wait for a grace period?
|
|
||||||
|
|
||||||
See the UP.txt file in this directory.
|
|
||||||
|
|
||||||
o How can I see where RCU is currently used in the Linux kernel?
|
|
||||||
|
|
||||||
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
|
|
||||||
"rcu_read_lock_bh", "rcu_read_unlock_bh", "srcu_read_lock",
|
|
||||||
"srcu_read_unlock", "synchronize_rcu", "synchronize_net",
|
|
||||||
"synchronize_srcu", and the other RCU primitives. Or grab one
|
|
||||||
of the cscope databases from:
|
|
||||||
|
|
||||||
http://www.rdrop.com/users/paulmck/RCU/linuxusage/rculocktab.html
|
|
||||||
|
|
||||||
o What guidelines should I follow when writing code that uses RCU?
|
|
||||||
|
|
||||||
See the checklist.txt file in this directory.
|
|
||||||
|
|
||||||
o Why the name "RCU"?
|
|
||||||
|
|
||||||
"RCU" stands for "read-copy update". The file listRCU.txt has
|
|
||||||
more information on where this name came from, search for
|
|
||||||
"read-copy update" to find it.
|
|
||||||
|
|
||||||
o I hear that RCU is patented? What is with that?
|
|
||||||
|
|
||||||
Yes, it is. There are several known patents related to RCU,
|
|
||||||
search for the string "Patent" in RTFP.txt to find them.
|
|
||||||
Of these, one was allowed to lapse by the assignee, and the
|
|
||||||
others have been contributed to the Linux kernel under GPL.
|
|
||||||
There are now also LGPL implementations of user-level RCU
|
|
||||||
available (http://liburcu.org/).
|
|
||||||
|
|
||||||
o I hear that RCU needs work in order to support realtime kernels?
|
|
||||||
|
|
||||||
Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
|
|
||||||
kernel configuration parameter.
|
|
||||||
|
|
||||||
o Where can I find more information on RCU?
|
|
||||||
|
|
||||||
See the RTFP.txt file in this directory.
|
|
||||||
Or point your browser at http://www.rdrop.com/users/paulmck/RCU/.
|
|
|
@ -12,6 +12,7 @@ please read on.
|
||||||
Reference counting on elements of lists which are protected by traditional
|
Reference counting on elements of lists which are protected by traditional
|
||||||
reader/writer spinlocks or semaphores are straightforward:
|
reader/writer spinlocks or semaphores are straightforward:
|
||||||
|
|
||||||
|
CODE LISTING A:
|
||||||
1. 2.
|
1. 2.
|
||||||
add() search_and_reference()
|
add() search_and_reference()
|
||||||
{ {
|
{ {
|
||||||
|
@ -28,7 +29,8 @@ add() search_and_reference()
|
||||||
release_referenced() delete()
|
release_referenced() delete()
|
||||||
{ {
|
{ {
|
||||||
... write_lock(&list_lock);
|
... write_lock(&list_lock);
|
||||||
atomic_dec(&el->rc, relfunc) ...
|
if(atomic_dec_and_test(&el->rc)) ...
|
||||||
|
kfree(el);
|
||||||
... remove_element
|
... remove_element
|
||||||
} write_unlock(&list_lock);
|
} write_unlock(&list_lock);
|
||||||
...
|
...
|
||||||
|
@ -44,6 +46,7 @@ search_and_reference() could potentially hold reference to an element which
|
||||||
has already been deleted from the list/array. Use atomic_inc_not_zero()
|
has already been deleted from the list/array. Use atomic_inc_not_zero()
|
||||||
in this scenario as follows:
|
in this scenario as follows:
|
||||||
|
|
||||||
|
CODE LISTING B:
|
||||||
1. 2.
|
1. 2.
|
||||||
add() search_and_reference()
|
add() search_and_reference()
|
||||||
{ {
|
{ {
|
||||||
|
@ -79,6 +82,7 @@ search_and_reference() code path. In such cases, the
|
||||||
atomic_dec_and_test() may be moved from delete() to el_free()
|
atomic_dec_and_test() may be moved from delete() to el_free()
|
||||||
as follows:
|
as follows:
|
||||||
|
|
||||||
|
CODE LISTING C:
|
||||||
1. 2.
|
1. 2.
|
||||||
add() search_and_reference()
|
add() search_and_reference()
|
||||||
{ {
|
{ {
|
||||||
|
@ -114,6 +118,17 @@ element can therefore safely be freed. This in turn guarantees that if
|
||||||
any reader finds the element, that reader may safely acquire a reference
|
any reader finds the element, that reader may safely acquire a reference
|
||||||
without checking the value of the reference counter.
|
without checking the value of the reference counter.
|
||||||
|
|
||||||
|
A clear advantage of the RCU-based pattern in listing C over the one
|
||||||
|
in listing B is that any call to search_and_reference() that locates
|
||||||
|
a given object will succeed in obtaining a reference to that object,
|
||||||
|
even given a concurrent invocation of delete() for that same object.
|
||||||
|
Similarly, a clear advantage of both listings B and C over listing A is
|
||||||
|
that a call to delete() is not delayed even if there are an arbitrarily
|
||||||
|
large number of calls to search_and_reference() searching for the same
|
||||||
|
object that delete() was invoked on. Instead, all that is delayed is
|
||||||
|
the eventual invocation of kfree(), which is usually not a problem on
|
||||||
|
modern computer systems, even the small ones.
|
||||||
|
|
||||||
In cases where delete() can sleep, synchronize_rcu() can be called from
|
In cases where delete() can sleep, synchronize_rcu() can be called from
|
||||||
delete(), so that el_free() can be subsumed into delete as follows:
|
delete(), so that el_free() can be subsumed into delete as follows:
|
||||||
|
|
||||||
|
@ -130,3 +145,7 @@ delete()
|
||||||
kfree(el);
|
kfree(el);
|
||||||
...
|
...
|
||||||
}
|
}
|
||||||
|
|
||||||
|
As additional examples in the kernel, the pattern in listing C is used by
|
||||||
|
reference counting of struct pid, while the pattern in listing B is used by
|
||||||
|
struct posix_acl.
|
||||||
|
|
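The hunks above name atomic_inc_not_zero() and atomic_dec_and_test() as the building blocks of listing B. As an illustration that is not part of the commit, here is a user-space C11 sketch of those two operations. `struct el`, `el_get_unless_zero()`, and `el_put()` are hypothetical stand-ins rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical element; only the reference counter matters here. */
struct el {
	atomic_int rc;
};

/*
 * Stand-in for atomic_inc_not_zero(): take a reference only if the
 * counter has not already dropped to zero.  Returns true on success.
 */
static bool el_get_unless_zero(struct el *e)
{
	int old = atomic_load(&e->rc);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&e->rc, &old, old + 1))
			return true;
	}
	return false;	/* element is already on its way to being freed */
}

/*
 * Stand-in for atomic_dec_and_test(): drop a reference and report
 * whether it was the last one, in which case the caller frees.
 */
static bool el_put(struct el *e)
{
	int prev = atomic_fetch_sub(&e->rc, 1);

	assert(prev > 0);	/* dropping an unheld reference is a bug */
	return prev == 1;
}
```

In listing B the lookup must check the return value of the inc-not-zero step and treat failure as "not found". In listing C the grace period guarantees that any element a reader can still find has a nonzero count, so a plain increment suffices there.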
Some files were not shown because too many files have changed in this diff.