PaulSD.com
RAID / LVM / Filesystem Alignment Notes
(Created: 07/06/2017)


RAIDs have a "Chunk Size" or "Stripe Size" or "Stripe Element Size" that is set when the RAID is created. This determines the amount of data on each disk that is covered by each parity calculation, which has an effect on performance. There is also a "Stripe Width" or "Full Stripe Size", which is the "Chunk Size" times the number of data disks. In addition, some disks/controllers may require an additional initial alignment offset (this is apparently used by some drives to compensate for quirks in Windows).

Every abstraction layer on top of the RAID (including layers within VMs stored on the RAID) must be aligned to a multiple of the "Stripe Width", plus any required initial alignment offset. This ensures that a read or write of a single block at the filesystem level never spans a stripe boundary, which would cause multiple stripes to be read/written at the disk level and have a severe performance impact. If the block size of every layer on top of the RAID is <= the Chunk Size, it may be sufficient to align to a multiple of the Chunk Size instead of a multiple of the Stripe Width. However, for simplicity, it is best to always align to the Stripe Width.

The following notes were developed specifically for use on RHEL7 servers with HP/LSI MegaRAID controllers. However, most of this should easily translate to other Linux distributions and other RAID controllers.


Determining RAID Chunk Size / Stripe Width / Initial Alignment Offset (HP/LSI MegaRAID)

First, determine which RAID tool is installed and run either:
alias hpcli=hpacucli
or:
alias hpcli=hpssacli

To print the current Chunk Size ("Strip Size") and Stripe Width ("Full Stripe Size"):
hpcli ctrl slot=X logicaldrive 1 show

The Stripe Width can also be determined using:
cat /sys/block/sdX/queue/optimal_io_size

To determine the required initial alignment offset:
cat /sys/block/sdX/alignment_offset

It is possible to change the Chunk Size when creating the array, but this has a number of implications, so it is probably best to use the default in most cases.
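
(If a non-default Chunk Size is needed anyway, it is specified when creating the array. Assumption: the create command takes a strip size option of the form ss=<Chunk Size in KB>; verify this against the hpssacli/hpacucli documentation for the installed version before relying on it.)
hpcli ctrl slot=X create type=ld drives=all raid=6 ss=<Chunk Size in KB>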

Example:
$ which hpacucli
/usr/bin/which: no hpacucli in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
$ which hpssacli
/usr/sbin/hpssacli
$ alias hpcli=hpssacli
$ hpcli ctrl all show
Smart Array P440ar in Slot 0 (Embedded) (sn: XXXXXXXXXXXXXX)
Smart Array P840 in Slot 3 (sn: XXXXXXXXXXXXXX)
$ hpcli ctrl slot=3 logicaldrive 1 delete forced
$ hpcli ctrl slot=3 create type=ld drives=all raid=6
$ hpcli ctrl slot=3 logicaldrive 1 show
Strip Size: 256 KB
Full Stripe Size: 1536 KB
$ cat /sys/block/sdb/queue/optimal_io_size
1572864
$ cat /sys/block/sdb/alignment_offset
0


Other RAID parameters that affect performance

These parameters are not directly related to alignment, but should also be adjusted when setting up the system.

Elevator Sort

The "Elevator Sort" feature on HP/LSI RAID controllers will re-order I/O requests to eliminate seeks when sequential operations are interspersed with non-sequential operations. This effectively prioritizes sequential operations over non-sequential operations, which may be a good or bad thing depending on the workload.

In our environment, most user operations are non-sequential and most offline operations (backups, security scans, virus scans, etc) are sequential, so we generally want to prioritize non-sequential operations over sequential operations. Therefore Elevator Sort should generally be turned off in our environment.

To determine whether Elevator Sort is enabled:
hpcli ctrl slot=X show

To change the Elevator Sort setting:
hpcli ctrl slot=X modify elevatorsort=<enable|disable>

Example:
$ hpcli ctrl slot=3 show
Elevator Sort: Enabled
$ hpcli ctrl slot=3 modify elevatorsort=disable
$ hpcli ctrl slot=3 show
Elevator Sort: Disabled

Cache Ratio

The RAID "Cache Ratio" setting on HP/LSI RAID controllers adjusts the amount of the cache to use for read operations vs write operations.

In our environment, our servers generally have enormous amounts of spare RAM available which the Linux kernel automatically uses as a read cache, so the RAID cache should generally be dedicated to write operations.

However, read performance on these RAID controllers drops drastically if read caching is disabled entirely, and the minimum increment for this setting is 5%, so 5%/95% should be used.

To determine the current Cache Ratio:
hpcli ctrl slot=X show

To change the Cache Ratio:
hpcli ctrl slot=X modify cacheratio=<read %>/<write %>

Example:
$ hpcli ctrl slot=3 show
Cache Ratio: 10% Read / 90% Write
$ hpcli ctrl slot=3 modify cacheratio=5/95
$ hpcli ctrl slot=3 show
Cache Ratio: 5% Read / 95% Write

Linux I/O Request Size

/sys/block/sdX/queue/max_sectors_kb determines the maximum size (in KB) of each I/O request that the Linux kernel will send to the controller. By default in RHEL7, this value is 1536. However, for some reason, values >= 1536 make rewrites on Thin LVs perform at about 1/3 the speed of writes, while values <= 1535 make rewrites on Thin LVs perform similarly to writes.

To determine the current setting:
cat /sys/block/sdX/queue/max_sectors_kb

To temporarily adjust the setting:
echo 1024 > /sys/block/sdX/queue/max_sectors_kb

To permanently adjust the setting, add the above line to /etc/rc.d/rc.local or add a udev rule as described at https://access.redhat.com/solutions/43861 . Note that this value seems to be reset during any LVM operations, so it must be manually reconfigured after every LVM change.
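
For example, a udev rule along these lines should apply the setting automatically (a sketch only; the rules file name is arbitrary and the device name must be adjusted to match the system, see the Red Hat article above for details):
# /etc/udev/rules.d/71-block-max-sectors.rules (example file name)
SUBSYSTEM=="block", KERNEL=="sdb", ACTION=="add|change", ATTR{queue/max_sectors_kb}="1024"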


Partition Alignment

When using LVM, a non-LVM partition is typically needed to hold boot images, so the first LVM PV must be placed in a partition alongside the non-LVM partition required for boot. However, if there is more than one RAID in the system, only one RAID needs this boot partition; the other RAIDs can typically have the LVM PV created directly on the disk without a partition, which eliminates that additional abstraction layer and the slight overhead associated with it.

If partitions are used, each partition should start on a multiple of the Stripe Width plus any required initial alignment offset (as printed by `cat /sys/block/sdX/alignment_offset`).

To create an aligned partition, simply specify an appropriate start offset in bytes or kiB:
(WARNING: parted uses units of k=kB=1000 and kiB=1024, so kiB must be used to match the KB units used by most other tools)
parted /dev/sdX mkpart primary <Stripe Width>kiB 100%
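
If the required initial alignment offset is non-zero, add it to the start position and specify the start in bytes. For example, with a 1536kiB Stripe Width and a hypothetical 3584 byte alignment offset (1572864 + 3584 = 1576448):
parted /dev/sdX mkpart primary 1576448B 100%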

To print the current locations of all partitions:
parted /dev/sdX unit B print

Example:
$ parted /dev/sdb mklabel msdos
$ parted /dev/sdb mkpart primary 1536kiB 100%
$ parted /dev/sdb unit B print
Number  Start     End            Size           Type     File system  Flags
1       1572864B  880690593791B  880689020928B  primary  ext4
$ blockdev --rereadpt /dev/sdb


LVM PV Alignment

The first Physical Extent allocated within the PV should be aligned to a multiple of the Stripe Width. If the PV is created directly on a raw disk (rather than in a partition that is already offset as needed), then it should also be offset by any required initial alignment offset (as printed by `cat /sys/block/sdX/alignment_offset`).

To create a PV that will properly align the first Physical Extent:
pvcreate --dataalignment <Stripe Width>k /dev/sdXY

pvcreate automatically reads /sys/block/<device>/alignment_offset and adjusts the position of the first Physical Extent accordingly. However, this can be manually overridden using:
pvcreate --dataalignment <Stripe Width>k --dataalignmentoffset <offset>k /dev/sdXY

To print the location of the first Physical Extent:
pvs --units b -o +pe_start

Example:
$ pvcreate --dataalignment 1536k /dev/sdb
$ pvs --units b -o +pe_start
PV        VG  Fmt   Attr  PSize          PFree          1st PE
/dev/sdb      lvm2  ---   880691011584B  880691011584B  1572864B


LVM VG Alignment

The Physical Extent Size is configured on the Volume Group. LVM requires this size to be a power of 2. The default size is 4MB.

Ideally, this should be set to the Stripe Width or a multiple of it. However, if the number of data disks is not a power of 2, then the Stripe Width will not be a power of 2, so the Physical Extent Size cannot be set to the Stripe Width or a multiple of it. In this case, it should be set to a value such that the least common multiple of the Stripe Width and the Physical Extent Size is a reasonable number: LVs will need to start on a multiple of this number, so it should be easy to calculate, and it should not be so large that it forces LVs to be larger than necessary. For example, with a 1536KB Stripe Width, the default 4MB Physical Extent Size gives a least common multiple of 12MB, which is easy to work with.

To create a VG with a specific Physical Extent Size:
vgcreate --physicalextentsize <PE Size>M <VG> /dev/sdXY

To print the current Physical Extent Size:
vgs --units b -o +vg_extent_size

Example:
$ vgcreate --physicalextentsize 4M testvg /dev/sdb
$ vgs --units b -o +vg_extent_size
VG      #PV  #LV  #SN  Attr    VSize          VFree          Ext
testvg  1    0    0    wz--n-  880686399488B  880686399488B  4194304B


LVM LV Alignment

Physical Extents are the units used to allocate space from a VG/PV to a non-Thin LV or to a Thin Pool/Metadata/Spare LV. Physical Extents are numbered starting at 0.

Contiguous Allocation

By default, new LVs are allocated a contiguous range of Physical Extents when possible; however, they may be allocated non-contiguous ranges if no contiguous range of the necessary size is available. Since a non-contiguous LV could impact performance, you can optionally prevent non-contiguous allocation using:
lvcreate --contiguous y ...

The --contiguous argument also sets the LV's "Allocation Policy" to prevent future expansion of the LV if expansion would cause the LV to become non-contiguous.

To reset the Allocation Policy:
lvchange --contiguous n ...

Similarly, if an existing LV is currently contiguous but was not created with --contiguous, the Allocation Policy can be set at a later time:
lvchange --contiguous y ...

To determine the current allocation policy ("inherit" or "contiguous"):
lvs -o +lv_allocation_policy

LV Alignment

If the VG Physical Extent Size is not a multiple of the Stripe Width, then each non-Thin LV or Thin Pool/Metadata/Spare LV should start on a Physical Extent whose number is a multiple of the number of Physical Extents in the least common multiple of the Stripe Width and the Physical Extent Size.

For example, if the Stripe Width is 1.5MB and the Physical Extent Size is 4MB, then the least common multiple is 12MB. There are three 4MB Physical Extents in 12MB, so each non-Thin LV (or Thin Pool/Metadata/Spare LV) must start on a Physical Extent that is a multiple of three: 0, 3, 6, etc.

To determine the number of Physical Extents in each PV:
pvs -o +pe_count

To create an LV that starts on a specific Physical Extent on a PV:
lvcreate --contiguous y --name <LV> <VG> <PV>:<Start PE>-<End PE>

For simplicity, (number of Physical Extents - 1) may be used for <End PE> and --size or --extents may be provided to lvcreate to avoid the need to manually calculate an appropriate <End PE>. If the LV must be split across multiple PVs or across multiple non-contiguous PE ranges, "--contiguous y" may be removed and additional <PV>:<Start PE>-<End PE> strings may be appended to the end of the command.

To print the PE count and PE ranges used by each existing LV:
lvs -o +seg_size_pe,seg_le_ranges --all

Example:
$ pvs -o +pe_count
PV        VG      Fmt   Attr  PSize    PFree    PE
/dev/sdb  testvg  lvm2  a--   820.20g  820.20g  209972
$ lvcreate --contiguous y --size 200G -n testlv testvg /dev/sdb:15-209971
$ lvs -o +devices,seg_size_pe --all
LV      VG      Attr        LSize    Pool  Origin  Data%  Meta%  Move  Log  Cpy%Sync  Convert  Devices       SSize
testlv  testvg  -wc-a-----  200.00g                                                            /dev/sdb(15)  51215

Thin Pool LV Chunk Size and Zeroing Mode

Thin Pool Chunks are the units used to allocate space from a Thin Pool to a Thin LV.

When creating a Thin Pool, the Chunk Size should be set to the Stripe Width:
lvcreate --chunksize <Stripe Width>k ...

To print the current Chunk Size:
lvs --units b -o +chunk_size

Setting the Chunk Size to the Stripe Width ensures that all Thin LVs created within a Thin Pool will be properly aligned. No further alignment is needed when creating the Thin LV itself within the Thin Pool.

In addition to setting the Chunk Size, if `cat /sys/block/sdX/queue/discard_zeroes_data` is 0 then Thin Pool Zeroing Mode should be set to 'n' after creating the Thin Pool:
lvchange --zero n <VG>/<TP>

To check the current Thin Pool Zeroing Mode, run `lvs`. If 'z' is listed under 'Attr', then Zeroing Mode is 'y', otherwise it is 'n'.

See the "Thin Pool LV Alignment" section below for an example.

Thin Pool LV Alignment

A Thin Pool actually consists of three separate non-Thin LVs: a data LV, a metadata LV, and a spare LV (which may be used to repair the metadata LV).

When creating a Thin Pool using `lvcreate --thinpool ...`, LVM automatically calculates the appropriate size for the metadata and spare LVs based on the size of the data LV. Unfortunately, it cannot automatically align each of these LVs as described under "LV Alignment" above.

To create and remove a Thin Pool to determine appropriate metadata/spare LV sizes before manual alignment:
lvcreate --extents 100%FREE --thinpool testtp <VG> --chunksize <Stripe Width>k
lvs -o +seg_size_pe,seg_le_ranges --all
lvremove <VG>/testtp

Note that if all free space is allocated as shown above, you may need to reduce the size of the data LV by a few PEs to allow the metadata and spare LVs to be properly aligned.

To manually create/align these LVs, create two non-Thin LVs as described under "LV Alignment" above, naming the data LV with the desired Thin Pool name, naming the metadata LV with "meta" appended to the desired Thin Pool name, and leaving properly-aligned empty space for the spare LV before the data LV. (The spare LV cannot be manually created but will be automatically created.) Then run:
lvconvert --type thin-pool --chunksize <Stripe Width>k --poolmetadata <VG>/<TP>meta <VG>/<TP>

Example:
$ lvcreate --chunksize 1536k --extents 100%FREE --thinpool testtp testvg
$ lvs -o +devices,seg_size_pe --all
LV               VG      Attr        LSize    Pool  Origin  Data%  Meta%  Move  Log  Cpy%Sync  Convert  Devices           SSize
[lvol0_pmspare]  testvg  ewi-------  36.00m                                                             /dev/sdb(0)       9
testtp           testvg  twi-a-tz--  820.13g                0.00   0.47                                 testtp_tdata(0)   209954
[testtp_tdata]   testvg  Twi-ao----  820.13g                                                            /dev/sdb(9)       209954
[testtp_tmeta]   testvg  ewi-ao----  36.00m                                                             /dev/sdb(209963)  9
$ # Note that lvol0_pmspare is properly aligned (it starts at PE 0), and testtp_tdata is also properly aligned (it starts at PE 9, which is a multiple of 3 as is needed when Stripe Width is 1.5M and Physical Extent Size is 4MB), but testtp_tmeta is not properly aligned (it starts at PE 209963 which is not a multiple of 3).
$ lvremove testvg/testtp
$ pvs -o +pe_count
PV        VG      Fmt   Attr  PSize    PFree    PE
/dev/sdb  testvg  lvm2  a--   820.20g  820.20g  209972
$ lvcreate --contiguous y --extents 209952 -n testtp testvg /dev/sdb:9-209971
$ lvcreate --contiguous y --extents 9 -n testtpmeta testvg /dev/sdb:209961-209971
$ lvconvert -f --type thin-pool --chunksize 1536k --poolmetadata testvg/testtpmeta testvg/testtp
$ lvs -o +devices,seg_size_pe --all
LV               VG      Attr        LSize    Pool  Origin  Data%  Meta%  Move  Log  Cpy%Sync  Convert  Devices           SSize
[lvol0_pmspare]  testvg  ewi-------  36.00m                                                             /dev/sdb(0)       9
testtp           testvg  twi-a-tz--  820.12g                0.00   0.47                                 testtp_tdata(0)   209952
[testtp_tdata]   testvg  Twc-ao----  820.12g                                                            /dev/sdb(9)       209952
[testtp_tmeta]   testvg  ewc-ao----  36.00m                                                             /dev/sdb(209961)  9
$ lvs --units b -o +chunk_size
LV      VG      Attr        LSize          Pool  Origin  Data%  Meta%  Move  Log  Cpy%Sync  Convert  Chunk
testtp  testvg  twi-a-tz--  880602513408B                0.00   0.47                                 1572864B
$ cat /sys/block/sdb/queue/discard_zeroes_data
0
$ lvchange --zero n testvg/testtp
Logical volume testvg/testtp changed.
$ lvs
LV      VG      Attr        LSize    Pool  Origin  Data%  Meta%  Move  Log  Cpy%Sync  Convert
testtp  testvg  twi-a-t---  820.12g                0.00   0.47
$ lvcreate --virtualsize 200G --thin --name testlv testvg/testtp
$ echo 1024 > /sys/block/sdb/queue/max_sectors_kb


Filesystem Alignment

The Chunk Size and Stripe Width should be provided to mkfs.extX to ensure that filesystem metadata is placed efficiently:
mkfs.ext4 -b 4096 -E stride=<Chunk Size / 4096>,stripe_width=<Stripe Width / 4096> /dev/...
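
For example, with the 256KB Chunk Size and 1536KB Stripe Width determined above and a 4096 byte filesystem block size:
stride = 256KB / 4KB = 64
stripe_width = 1536KB / 4KB = 384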

To print the settings the filesystem was created with:
tune2fs -l /dev/...

`mount` should automatically use a matching "-o stripe=..." option when mounting the filesystem. This can be verified by running `mount` after mounting the filesystem.

Example:
$ mkfs.ext4 -b 4096 -E stride=64,stripe_width=384 /dev/testvg/testlv
$ tune2fs -l /dev/testvg/testlv
RAID stride: 64
RAID stripe width: 384
$ mount /dev/testvg/testlv /test
$ mount
/dev/mapper/testvg-testlv on /test type ext4 (rw,relatime,seclabel,stripe=384,data=ordered)