fix: deprecate old settings and elaborate on tuning (#9542)

fixes #9538
Harshavardhana committed 5 years ago (via GitHub)
branch master · commit 53f4c0fdc0 · parent 7e3ea77fdf
Changed files:

1. docs/deployment/kernel-tuning/README.md — 117 changes
2. docs/deployment/kernel-tuning/disk-tuning.sh — 50 deletions (file removed)

docs/deployment/kernel-tuning/README.md @@ -1,81 +1,98 @@

# Kernel Tuning for MinIO Production Deployment on Linux Servers [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io) [![Docker Pulls](https://img.shields.io/docker/pulls/minio/minio.svg?maxAge=604800)](https://hub.docker.com/r/minio/minio/)

This commit replaces the README's per-parameter tuning walkthrough with a single consolidated script. The removed guidance follows; the new content appears after it.

## Tuning Network Parameters

The following network parameter settings can help ensure optimal MinIO server performance on production workloads.

- *`tcp_fin_timeout`* : A socket left in memory takes approximately 1.5KB. It makes sense to close unused sockets preemptively to avoid leaking this memory; even if a peer fails to close its end for some reason, the system itself closes the socket after a timeout. The `tcp_fin_timeout` variable defines this timeout and tells the kernel how long to keep sockets in the FIN-WAIT-2 state. We recommend setting it to 30. You can set it as shown below:

```sh
sysctl -w net.ipv4.tcp_fin_timeout=30
```

- *`tcp_keepalive_probes`* : This variable defines the number of unacknowledged probes to send before considering a connection dead. You can set it as shown below:

```sh
sysctl -w net.ipv4.tcp_keepalive_probes=5
```

- *`wmem_max`* : This parameter sets the maximum OS send buffer size for all types of connections.

```sh
sysctl -w net.core.wmem_max=540000
```

- *`rmem_max`* : This parameter sets the maximum OS receive buffer size for all types of connections.

```sh
sysctl -w net.core.rmem_max=540000
```

## Tuning Virtual Memory

Recommended virtual memory settings are as follows.

- *`swappiness`* : This parameter controls the relative weight given to swapping out runtime memory, as opposed to dropping pages from the system page cache. It takes values from 0 to 100, both inclusive. We recommend setting it to 1.

```sh
sysctl -w vm.swappiness=1
```

- *`dirty_background_ratio`* : This is the percentage of system memory that can be filled with `dirty` pages, i.e. memory pages that still need to be written to disk. We recommend writing the data to the disk as soon as possible. To do this, set `dirty_background_ratio` to 1.

```sh
sysctl -w vm.dirty_background_ratio=1
```

- *`dirty_ratio`* : This defines the absolute maximum amount of system memory that can be filled with dirty pages before everything must be committed to disk.

```sh
sysctl -w vm.dirty_ratio=5
```

- *`Transparent Hugepage Support`* : This is a Linux kernel feature intended to improve performance by making more efficient use of the processor's memory-mapping hardware, but it may cause [problems](https://blogs.oracle.com/linux/performance-issues-with-transparent-huge-pages-thp) for non-optimized applications. Since most Linux distributions set it to `enabled=always` by default, we recommend changing this to `enabled=madvise`. This allows applications optimized for transparent hugepages to obtain the performance benefits, while preventing the associated problems for everything else.

```sh
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```

Also, set `transparent_hugepage=madvise` on your kernel command line (e.g. in /etc/default/grub) to persistently set this value.

All these system level tunings are conveniently packaged in a [shell script](https://github.com/minio/minio/blob/master/docs/deployment/kernel-tuning/sysctl.sh). Please review the shell script for our recommendations.

## Tuning Scheduler

Proper scheduler configuration makes sure the MinIO process gets adequate CPU time. Here are the recommended scheduler settings:

- *`sched_min_granularity_ns`* : This parameter decides the minimum time a task is allowed to run on a CPU before being preempted. We recommend setting it to 10ms.

```sh
sysctl -w kernel.sched_min_granularity_ns=10000000
```

- *`sched_wakeup_granularity_ns`* : Lowering this parameter improves wake-up latency and throughput for latency-critical tasks, particularly when a short duty cycle load component must compete with CPU-bound components.

```sh
sysctl -w kernel.sched_wakeup_granularity_ns=15000000
```

## Tuning Disks

The recommendations for disk tuning were packaged in a well commented [shell script](https://github.com/minio/minio/blob/master/docs/deployment/kernel-tuning/disk-tuning.sh), which this commit also removes (see below).

The new README content:

Following are the recommended settings; a copy of this [script](https://github.com/minio/minio/blob/master/docs/deployment/kernel-tuning/sysctl.sh) is available to be applied on Linux servers.

> NOTE: Although these settings are generally good on Linux servers, be careful about premature tuning. These tunings are generally good to have but not mandatory; they do not fix any hardware issues and should not be treated as a shortcut to boost performance. Under most circumstances, apply this tuning only after performing baseline performance tests for the hardware and confirming the expected results.

```sh
#!/bin/bash

cat > sysctl.conf <<EOF
# maximum number of open files/file descriptors
fs.file-max = 4194303

# use as little swap space as possible
vm.swappiness = 1

# prioritize application RAM against disk/swap cache
vm.vfs_cache_pressure = 10

# minimum free memory
vm.min_free_kbytes = 1000000

# maximum receive socket buffer (bytes)
net.core.rmem_max = 268435456

# maximum send socket buffer (bytes)
net.core.wmem_max = 268435456

# default receive socket buffer size (bytes)
net.core.rmem_default = 67108864

# default send socket buffer size (bytes)
net.core.wmem_default = 67108864

# maximum number of packets in one poll cycle
net.core.netdev_budget = 1200

# maximum ancillary buffer size per socket
net.core.optmem_max = 134217728

# maximum number of incoming connections
net.core.somaxconn = 65535

# maximum number of packets queued
net.core.netdev_max_backlog = 250000

# maximum read buffer space
net.ipv4.tcp_rmem = 67108864 134217728 268435456

# maximum write buffer space
net.ipv4.tcp_wmem = 67108864 134217728 268435456

# enable low latency mode
net.ipv4.tcp_low_latency = 1

# socket buffer portion used for TCP window
net.ipv4.tcp_adv_win_scale = 1

# queue length of completely established sockets waiting for accept
net.ipv4.tcp_max_syn_backlog = 30000

# maximum number of sockets in TIME_WAIT state
net.ipv4.tcp_max_tw_buckets = 2000000

# reuse sockets in TIME_WAIT state when safe
net.ipv4.tcp_tw_reuse = 1

# time to wait (seconds) for FIN packet
net.ipv4.tcp_fin_timeout = 5

# disable icmp send redirects
net.ipv4.conf.all.send_redirects = 0

# disable icmp accept redirects
net.ipv4.conf.all.accept_redirects = 0

# drop packets with LSR or SSR
net.ipv4.conf.all.accept_source_route = 0

# MTU discovery, only enable when ICMP blackhole detected
net.ipv4.tcp_mtu_probing = 1
EOF

echo "Enabling system level tuning params"
sysctl --quiet --load sysctl.conf && rm -f sysctl.conf

# Transparent Hugepage Support: a Linux kernel feature intended to improve
# performance by making more efficient use of the processor's memory-mapping
# hardware. It may cause problems for non-optimized applications, see
# https://blogs.oracle.com/linux/performance-issues-with-transparent-huge-pages-thp
# As most Linux distributions set it to `enabled=always` by default, we
# recommend changing this to `enabled=madvise`. This allows applications
# optimized for transparent hugepages to obtain the performance benefits,
# while preventing the associated problems otherwise. Also, set
# transparent_hugepage=madvise on your kernel command line (e.g. in
# /etc/default/grub) to persistently set this value.
echo "Enabling THP madvise"
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
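In the spirit of the NOTE above about baselining before tuning, it helps to snapshot current values and spot-check the result after running the script. A minimal sketch; the parameter names are exactly those the script writes:

```sh
# Snapshot current kernel parameters first, so changes can be compared or rolled back
sysctl -a > sysctl-baseline.txt 2>/dev/null

# After running the tuning script, spot-check a few values
sysctl net.core.rmem_max net.core.somaxconn net.ipv4.tcp_fin_timeout vm.swappiness

# The active THP mode is shown in brackets, e.g. "always [madvise] never"
cat /sys/kernel/mm/transparent_hugepage/enabled
```

Note that the script loads a temporary sysctl.conf and then deletes it, so the settings do not survive a reboot on their own; one way to persist them would be installing the same contents under /etc/sysctl.d/ (for example a file named 90-minio-tuning.conf, an illustrative name, not one the script creates) and running `sysctl --system`.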

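The script comments above say to set `transparent_hugepage=madvise` on the kernel command line for persistence, but do not show how. A minimal sketch, assuming a GRUB-based distribution; file locations and the config-regeneration command vary by distro:

```sh
# Append transparent_hugepage=madvise to the kernel command line.
# Assumes a GRUB_CMDLINE_LINUX="..." line exists in /etc/default/grub.
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/GRUB_CMDLINE_LINUX="\1 transparent_hugepage=madvise"/' /etc/default/grub

# Regenerate the bootloader configuration, then reboot.
sudo update-grub                               # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # RHEL/CentOS alternative
```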
docs/deployment/kernel-tuning/disk-tuning.sh @@ -1,50 +0,0 @@ (file deleted)

The standalone disk tuning script removed by this commit:

```sh
#!/bin/bash
## MinIO Cloud Storage, (C) 2017, 2018 MinIO, Inc.
##
## Licensed under the Apache License, Version 2.0 (the "License");
## you may not use this file except in compliance with the License.
## You may obtain a copy of the License at
##
##     http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
## See the License for the specific language governing permissions and
## limitations under the License.

# This script changes protected files, and must be run as root
for i in $(echo /sys/block/*/queue/iosched 2>/dev/null); do
    iosched_dir=$(echo "${i}" | awk '/iosched/ {print $1}')
    [ -z "${iosched_dir}" ] && {
        continue
    }
    ## Change each disk ioscheduler to be "deadline".
    ## Deadline dispatches I/Os in batches. A batch is a
    ## sequence of either read or write I/Os which are in
    ## increasing LBA order (the one-way elevator). After
    ## processing each batch, the I/O scheduler checks to
    ## see whether write requests have been starved for too
    ## long, and then decides whether to start a new batch
    ## of reads or writes.
    path=$(dirname "${iosched_dir}")
    [ -f "${path}/scheduler" ] && {
        echo "deadline" > "${path}/scheduler" 2>/dev/null || true
    }
    ## This controls how many requests may be allocated
    ## in the block layer for read or write requests.
    ## Note that the total allocated number may be twice
    ## this amount, since it applies only to reads or
    ## writes (not the accumulated sum).
    [ -f "${path}/nr_requests" ] && {
        echo "256" > "${path}/nr_requests" 2>/dev/null || true
    }
    ## This is the maximum number of kilobytes
    ## supported in a single data transfer at
    ## the block layer.
    [ -f "${path}/max_sectors_kb" ] && {
        echo "1024" > "${path}/max_sectors_kb" 2>/dev/null || true
    }
done
```
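The queue settings written by this script also reset on reboot. A quick verification sketch; the device name `sda` is only an example, and on kernels that use blk-mq the equivalent scheduler is named `mq-deadline`:

```sh
# The active I/O scheduler is shown in brackets, e.g. "noop [deadline] cfq"
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
cat /sys/block/sda/queue/max_sectors_kb

# One common way to persist the scheduler choice is a udev rule, e.g. in a
# file such as /etc/udev/rules.d/60-io-scheduler.rules (illustrative name):
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="deadline"
```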