this is to detect situations of disk corruption,
format errors etc. quickly, and to keep the disk online
in such scenarios so that requests fail appropriately.
This PR adds support for healing older
content, i.e. from 2yrs, 1yr ago. It also handles
other situations where our config was
not yet encrypted.
This PR also ensures that our Listing
is consistent and quorum friendly,
such that we don't list partial objects
- admin info node offline check is now quicker
- admin info no longer duplicates code
doing the same checks for disks
- rely on StorageInfo to return appropriate errors
instead of calling locally.
- diskID checks now return proper errors distinguishing
disk not found v/s format.json missing.
- add more disk states for more clarity on the
underlying disk errors, as sketched below.
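For illustration, the new states might be modeled as simple string constants - the names here are assumptions, not the exact identifiers:

```go
package main

// Illustrative disk state constants; the actual identifiers in the
// codebase may differ. The point is to distinguish the underlying
// failure rather than reporting a single "offline" state.
const (
	diskStateOK          = "ok"          // disk healthy and formatted
	diskStateOffline     = "offline"     // node or disk unreachable
	diskStateCorrupt     = "corrupt"     // format.json present but unreadable
	diskStateMissing     = "missing"     // disk not found at the expected path
	diskStateUnformatted = "unformatted" // disk found but format.json missing
)
```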
while we handle all situations for writes and reads
on the older format, what we didn't yet cater for
properly was delete, where we only ended up deleting
just `xl.meta` - instead we should allow all
deletes to go through for the older format on
buckets without versioning enabled.
Currently, lifecycle expiry deletes all object versions, which is not
correct unless the noncurrent versions field is specified.
Also, only delete the delete marker if it is the only version of the
given object.
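A minimal sketch of the corrected decision, with assumed field and helper names:

```go
package main

// Minimal illustrative type; the real ObjectInfo carries many more fields.
type ObjectInfo struct {
	DeleteMarker bool
	IsLatest     bool
}

// shouldExpire sketches the corrected behavior: noncurrent versions
// expire only when the noncurrent versions field is set in the rule,
// and a delete marker is removed only when it is the sole remaining
// version of the object.
func shouldExpire(obj ObjectInfo, hasNoncurrentRule bool, numVersions int) bool {
	if obj.DeleteMarker {
		return numVersions == 1
	}
	if !obj.IsLatest {
		return hasNoncurrentRule
	}
	return true
}
```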
The S3 specification says that versions are ordered in the response of
list object versions.
mc snapshot needs this to know which version comes first, especially
when two versions have the exact same last-modified field.
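One way to make that ordering deterministic - a hedged sketch, with assumed field names and a version-ID tie-break:

```go
package main

import (
	"sort"
	"time"
)

// Illustrative version record; field names are assumptions.
type objVersion struct {
	VersionID string
	ModTime   time.Time
}

// sortVersions orders versions newest-first, as a list object versions
// response requires. When two versions share the exact same ModTime,
// fall back to comparing version IDs so consumers like mc snapshot
// always see a deterministic order.
func sortVersions(versions []objVersion) {
	sort.SliceStable(versions, func(i, j int) bool {
		if versions[i].ModTime.Equal(versions[j].ModTime) {
			return versions[i].VersionID > versions[j].VersionID
		}
		return versions[i].ModTime.After(versions[j].ModTime)
	})
}
```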
Looking into full disk errors on zoned setup. We don't take the
5% space requirement into account when selecting a zone.
The tricky part is that even taking this into account,
we don't know the size of the object the user wants
to upload when they do multipart uploads.
It also seems overly defensive to always upload multiparts
to the zone with the most free space, since all load would
be directed to one part of the cluster.
In these cases we make sure the zone can at least hold
a 1GiB file, and we disadvantage fuller zones by subtracting
the expected size before weighing.
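A minimal sketch of that weighting, assuming per-zone available space is known; the names and the proportional pick are illustrative:

```go
package main

import "math/rand"

const oneGiB = 1 << 30

// pickZone disadvantages fuller zones by subtracting the expected
// object size (at least 1GiB when the size is unknown, as in
// multipart uploads) from each zone's available space, then picks a
// zone with probability proportional to the remaining space.
func pickZone(available []uint64, expectedSize int64) int {
	if expectedSize < oneGiB {
		expectedSize = oneGiB // must at least hold a 1GiB file
	}
	weights := make([]uint64, len(available))
	var total uint64
	for i, avail := range available {
		if avail > uint64(expectedSize) {
			weights[i] = avail - uint64(expectedSize)
		}
		total += weights[i]
	}
	if total == 0 {
		return 0 // no zone has headroom; fall back to the first zone
	}
	r := rand.Uint64() % total
	for i, w := range weights {
		if r < w {
			return i
		}
		r -= w
	}
	return len(available) - 1
}
```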
This PR fixes all the below scenarios
and handles them correctly.
- existing data in a bucket is replaced with
new content; with no versioning enabled, the old
structure vanishes.
- existing data in a bucket - enable versioning
before uploading any data; once versioning is
enabled, upload new content and the old content
is preserved.
- suspend versioning on the bucket again, then
upload content again; the old content is purged,
since that is the default "null" version.
Additionally sync data after xl.json -> xl.meta
rename(), to avoid any surprises if there is a
crash during this rename operation.
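A sketch of that sync step, assuming a POSIX backend; the helper name is illustrative:

```go
package main

import (
	"os"
	"path/filepath"
)

// renameAndSync renames xl.json to xl.meta and then fsyncs the parent
// directory: on POSIX filesystems the rename is only guaranteed to be
// durable across a crash once the containing directory is synced.
func renameAndSync(dir string) error {
	if err := os.Rename(filepath.Join(dir, "xl.json"), filepath.Join(dir, "xl.meta")); err != nil {
		return err
	}
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}
```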
- Implement a new xl.json 2.0.0 format to support versioning;
this moves the entire marshaling logic to the POSIX
layer, and the top layer always consumes a common FileInfo
construct, which simplifies metadata reads (see the sketch
after this list).
- Implement list object versions
- Migrate from crchash to siphash for object placement
in new deployments
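For illustration, the common FileInfo construct might look roughly like this - field names are assumptions, and the real struct carries far more (erasure info, checksums, parts):

```go
package main

import "time"

// Illustrative shape of the common FileInfo construct the top layer
// consumes; the real struct is richer.
type FileInfo struct {
	Volume       string
	Name         string
	VersionID    string
	IsLatest     bool
	DeleteMarker bool
	ModTime      time.Time
	Size         int64
	Metadata     map[string]string
}
```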
Fixes#2111
Uploading files with names that could not be written to disk
would result in "reduce your request" errors being returned.
Instead, check explicitly for disallowed characters and reject
such files with `Object name contains unsupported characters.`
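A hedged sketch of such an explicit check; the exact character set is an assumption (Windows in particular rejects several printable characters):

```go
package main

import "strings"

// containsUnsupportedChars pre-checks object names that cannot be
// written to the backend filesystem, so the request can be rejected
// with a clear error instead of a misleading "reduce your request".
func containsUnsupportedChars(object string) bool {
	const unsupported = `\:*?"<>|` // assumed set; invalid on Windows
	if strings.ContainsAny(object, unsupported) {
		return true
	}
	for _, r := range object {
		if r < 0x20 || r == 0x7f { // control characters
			return true
		}
	}
	return false
}
```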
size calculation in the crawler was using the real size
of the object instead of its actual size, i.e. either
the decrypted or uncompressed size.
this is needed to make sure all other accounting,
such as bucket quota and the mcs UI, displays the
correct values.
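A minimal illustration of the distinction, with assumed field names:

```go
package main

// Minimal illustrative type; the real ObjectInfo exposes helpers for
// this, and the field names below are assumptions.
type ObjectInfo struct {
	Size       int64 // real size as stored on disk
	ActualSize int64 // decrypted/uncompressed size, when applicable
}

// crawlerSize returns the size that accounting (bucket quota, mcs UI)
// should use: the actual, user-visible size rather than the on-disk size.
func crawlerSize(oi ObjectInfo) int64 {
	if oi.ActualSize > 0 {
		return oi.ActualSize
	}
	return oi.Size
}
```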
Shuffling the arguments that we pass to the MinIO server is supported.
However, when that happens, Prometheus returns wrong information about
disk usage and online/offline status.
This commit fixes the issue by avoiding reliance on xl.endpoints, since
it is not ordered.
the data usage tracker and crawler seem to be logging
non-actionable information on the console, which is not
useful and resolves on its own in almost all deployments;
let's keep this logging to a minimum.
make the rest of the Walk() function more predictable;
it was observed that in nominal deployments, even
without much workload, drives are generally
slow to respond to readdir operations. With a
sleepDuration factor of 10 this can cause
unexpected slowness in Listing calls - while the
pacing is good for all other I/O, it may simply slow
down Listing immensely, which is not useful.
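For context, the pacing works roughly like this sketch - the factor multiplies each operation's own latency, so slow readdirs compound:

```go
package main

import "time"

// crawlerSleep illustrates the pacing: sleep for factor × the time the
// previous operation took. With factor 10, a 100ms readdir adds a full
// second of sleep, which is why Listing slows down so much.
func crawlerSleep(opTime time.Duration, factor float64) {
	time.Sleep(time.Duration(float64(opTime) * factor))
}
```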
fixes#9261
In FS mode under Windows, removing an object will not automatically
remove empty parent prefixes.
The reason is that path.Dir() was used; however, filepath.Dir() is
more appropriate since filepath is physical (meaning it operates
on OS filesystem paths).
This was not caught because Windows CI failures are not caught.
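The difference is easy to demonstrate:

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

func main() {
	p := `C:\minio\data\bucket\object`
	// path.Dir only understands '/' separators, so on Windows paths
	// it finds no separator at all:
	fmt.Println(path.Dir(p)) // "."
	// filepath.Dir is physical and uses the OS separator:
	fmt.Println(filepath.Dir(p)) // `C:\minio\data\bucket` on Windows
}
```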
- acquire a single leader lock for all background operations
- healing, crawling and applying lifecycle policies.
- simplify lifecycle to avoid network calls, which was a
bug in the implementation - we should hold a leader and
do everything from there, since we have access to the entire
namespace.
- make listing and walking avoid interfering with other I/O
by slowing themselves down like the crawler.
- effectively use global context everywhere to ensure
proper shutdown, in cache, lifecycle, healing
- don't read `format.json` for prometheus metrics in
StorageInfo() call.
The Bulk delete API was using cleanupObjectsBulk(), which calls the posix
listing and delete APIs to remove objects' internal files in the
backend (xl.json and parts) one by one.
Add DeletePrefixes to the storage API to remove the content
of a directory in a single call.
Also use a removal goroutine per disk to accelerate removal, as
sketched below.
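A sketch of the fan-out, with an assumed interface shape for the new call:

```go
package main

import "sync"

// StorageAPI here is a stand-in for the storage interface carrying the
// new DeletePrefixes call; the signature is an assumption.
type StorageAPI interface {
	DeletePrefixes(volume string, prefixes []string) error
}

// deletePrefixesAll runs one removal goroutine per disk so the delete
// proceeds in parallel instead of disk by disk.
func deletePrefixesAll(disks []StorageAPI, volume string, prefixes []string) []error {
	errs := make([]error, len(disks))
	var wg sync.WaitGroup
	for i, disk := range disks {
		wg.Add(1)
		go func(i int, disk StorageAPI) {
			defer wg.Done()
			errs[i] = disk.DeletePrefixes(volume, prefixes)
		}(i, disk)
	}
	wg.Wait()
	return errs
}
```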
- Remove the requirement to honor storage class for deletes
- Improve `posix.DeleteFileBulk` code to Stat the volumeDir
only once per call, rather than for all object paths.
Metrics used to have its own code to calculate offline disks.
StorageInfo() was avoided because it is an expensive operation
that sends calls to all nodes.
To make metrics & server info share the same code, a new
argument `local` is added to StorageInfo() so it will only
query local disks when needed.
Metrics now calls StorageInfo() as the server info handler does,
but with the local flag set to false.
Co-authored-by: Praveen raj Mani <praveen@minio.io>
Co-authored-by: Harshavardhana <harsha@minio.io>
Streams return a ReadCloser, and returning would decrement
the I/O count instantly; fix it so the count drops only once
the stream is closed.
Change maxActiveIOCount to 3, meaning crawling will pause
if 3 operations are running.
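A rough sketch of the pause logic; the counter and polling interval are illustrative:

```go
package main

import (
	"sync/atomic"
	"time"
)

const maxActiveIOCount = 3

var activeIOCount int64 // incremented per read/write, decremented on Close

// waitForLowIO blocks the crawler while 3 or more I/O operations are
// in flight on this disk.
func waitForLowIO() {
	for atomic.LoadInt64(&activeIOCount) >= maxActiveIOCount {
		time.Sleep(100 * time.Millisecond)
	}
}
```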
On every restart of the server, usage was being
recalculated, which is not useful; instead, wait for
a sufficient interval before starting the crawling routine.
This PR also avoids many double allocations
through strings, optimizes usage of string builders,
and avoids crawling through symbolic links.
Fixes#8844
Choosing maxAllowedIOError is arbitrary and
prone to errors, since drives might be perfectly
capable of taking I/O with only a few locations
returning I/O errors. This is a hindrance
where backend filesystems like ZFS can automatically
fix and handle these scenarios.
The added problem with the current approach is that we
take the drive offline, making it virtually impossible
to bring it back online without restarting the server,
which is not desirable on a busy cluster. Remove this
state, let the backend return errors appropriately
to the caller, and let the caller decide what to do with
the error.
instead, perform a liveness check call to
verify whether the server is online and print relevant
errors.
Also introduce a StorageErr string error type
instead of errors.New(), and deprecate usage of
VerifyFileError and DeleteFileError for gob;
the change in data structures also requires a bump
of the storage REST version to v13.
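A minimal version of the described type:

```go
package main

// StorageErr is a string-based error type; unlike values created with
// errors.New(), it survives gob encoding across the storage REST layer
// and can be compared after decoding.
type StorageErr string

func (h StorageErr) Error() string {
	return string(h)
}

// Errors can then be declared as plain values, for example:
var errDiskNotFound = StorageErr("disk not found")
```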
Fixes#8811
When formatting a set, validate whether a host failure would likely lead
to data loss.
While we don't know what config will be set in the future,
evaluate to the best of our knowledge, assuming default settings.
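A hedged sketch of such a validation, assuming default settings (parity = half the drives of a set); the names are illustrative:

```go
package main

// hostFailureLosesData reports whether any single host holds more than
// half the drives of an erasure set. With default parity (N/2), losing
// such a host would drop below read quorum and likely lose data.
func hostFailureLosesData(driveHosts []string) bool {
	perHost := make(map[string]int)
	for _, host := range driveHosts {
		perHost[host]++
	}
	for _, n := range perHost {
		if n > len(driveHosts)/2 {
			return true
		}
	}
	return false
}
```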