Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c
Gist of improvements:
* Cross-server caching and listing will use the same data across servers and requests.
* Lists can be arbitrarily resumed at a constant speed.
* Metadata for all files scanned is stored for streaming retrieval.
* The existing bloom filters controlled by the crawler is used for validating caches.
* Concurrent requests for the same data (or parts of it) will not spawn additional walkers.
* Listing a subdirectory of an existing recursive cache will use the cache.
* All listing operations are fully streamable so the number of objects in a bucket no
longer dictates the amount of memory.
* Listings can be handled by any server within the cluster.
* Caches are cleaned up when out of date or superseded by a more recent one.
`mc admin info` on busy setups will not move HDD
heads unnecessarily for repeated calls, provides
a better responsiveness for the call overall.
Bonus change allow listTolerancePerSet be N-1
for good entries, to avoid skipping entries
for some reason one of the disk went offline.
add a hint on the disk to allow for tracking fresh disk
being healed, to allow for restartable heals, and also
use this as a way to track and remove disks.
There are more pending changes where we should move
all the disk formatting logic to backend drives, this
PR doesn't deal with this refactor instead makes it
easier to track healing in the future.
Add context to all (non-trivial) calls to the storage layer.
Contexts are propagated through the REST client.
- `context.TODO()` is left in place for the places where it needs to be added to the caller.
- `endWalkCh` could probably be removed from the walkers, but no changes so far.
The "dangerous" part is that now a caller disconnecting *will* propagate down, so a
"delete" operation will now be interrupted. In some cases we might want to disconnect
this functionality so the operation completes if it has started, leaving the system in a cleaner state.
We can reduce this further in the future, but this is a good
value to keep around. With the advent of continuous healing,
we can be assured that namespace will eventually be
consistent so we are okay to avoid the necessity to
a list across all drives on all sets.
Bonus Pop()'s in parallel seem to have the potential to
wait too on large drive setups and cause more slowness
instead of gaining any performance remove it for now.
Also, implement load balanced reply for local disks,
ensuring that local disks have an affinity for
- cleanupStaleMultipartUploads()
- Implement a new xl.json 2.0.0 format to support,
this moves the entire marshaling logic to POSIX
layer, top layer always consumes a common FileInfo
construct which simplifies the metadata reads.
- Implement list object versions
- Migrate to siphash from crchash for new deployments
for object placements.
Fixes#2111
Too many deployments come up with an odd number
of hosts or drives, to facilitate even distribution
among those setups allow for odd and prime numbers
based packs.
- adding oauth support to MinIO browser (#8400) by @kanagaraj
- supports multi-line get/set/del for all config fields
- add support for comments, allow toggle
- add extensive validation of config before saving
- support MinIO browser to support proper claims, using STS tokens
- env support for all config parameters, legacy envs are also
supported with all documentation now pointing to latest ENVs
- preserve accessKey/secretKey from FS mode setups
- add history support implements three APIs
- ClearHistory
- RestoreHistory
- ListHistory
- add help command support for each config parameters
- all the bug fixes after migration to KV, and other bug
fixes encountered during testing.
xl.isObject() returns 'nil' for not found disks when
calculating the existance of xl.json for a given object,
which what StatFile() is also doing (setting nil) if
xl.json exists.
This commit avoids this confusion by setting errDiskNotFound
error when the storage disk is not found.
Other listing optimizations include
- remove double sorting while filtering object entries
- improve error message when upload-id is not in quorum
- use jsoniter for full unmarshal json, instead of gjson
- remove unused code
We should change the logic for both isObject()
and isObjectDir() leaf detection to be done
with quorum, due to how our directory navigation
works - this allows for properly deleting all
the dangling directories or objects if any.
foo.CORRUPTED should never be created because when
multiple sets are involved we would hash the file
to wrong a location, this PR removes the code.
But allows DeleteBucket() to work properly to delete
dangling buckets/objects. Also adds another option
to Healing where a user needs to specify `--remove`
such that all dangling objects will be deleted with
user confirmation.
Continuing from PR 157ed65c35
Our posix.go implementation did not handle I/O errors
properly on the disks, this led to situations where
top-level callers such as ListObjects might return early
without even verifying all the available disks.
This commit tries to address this in Kubernetes, drbd/nbd based
persistent volumes which can disconnect under load and
result in the situations with disks return I/O errors.
This commit also simplifies listing operation, listing
never returns any error. We can avoid this since we pretty
much ignore most of the errors anyways. When objects are
accessed directly we return proper errors.
Better support of HEAD and listing of zero sized objects with trailing
slash (a.k.a empty directory). For that, isLeafDir function is added
to indicate if the specified object is an empty directory or not. Each
backend (xl, fs) has the responsibility to store that information.
Currently, in both of XL & FS, an empty directory is represented by
an empty directory in the backend.
isLeafDir() checks if the given path is an empty directory or not,
since dir listing is costly if the latter contains too many objects,
readDirN() is added in this PR to list only N number of entries.
In isLeadDir(), we will only list one entry to check if a directory
is empty or not.
- Data from disk was being read after bitrot verification to return
data for GetObject. Strictly speaking this does not guarantee bitrot
protection, as disks may return bad data even temporarily.
- This fix reads data from disk, verifies data for bitrot and then
returns data to the client directly.
This PR implements an object layer which
combines input erasure sets of XL layers
into a unified namespace.
This object layer extends the existing
erasure coded implementation, it is assumed
in this design that providing > 16 disks is
a static configuration as well i.e if you started
the setup with 32 disks with 4 sets 8 disks per
pack then you would need to provide 4 sets always.
Some design details and restrictions:
- Objects are distributed using consistent ordering
to a unique erasure coded layer.
- Each pack has its own dsync so locks are synchronized
properly at pack (erasure layer).
- Each pack still has a maximum of 16 disks
requirement, you can start with multiple
such sets statically.
- Static sets set of disks and cannot be
changed, there is no elastic expansion allowed.
- Static sets set of disks and cannot be
changed, there is no elastic removal allowed.
- ListObjects() across sets can be noticeably
slower since List happens on all servers,
and is merged at this sets layer.
Fixes#5465Fixes#5464Fixes#5461Fixes#5460Fixes#5459Fixes#5458Fixes#5460Fixes#5488Fixes#5489Fixes#5497Fixes#5496
It can happen that an incoming PutObject() request might
have inputs of following form eg:-
- bucketName is 'testbucket'
- objectName is '/'
bucketName exists and was previously created but there
are no other objects in this bucket. In a situation like
this parentDirIsObject() goes into an infinite loop.
Verifying that if '/' is an object fails on both backends
but the resulting `path.Dir('/')` returns `'/'` this causes
the closure to loop onto itself.
Fixes#4940
The reason is any function relying on `getLoadBalancedQuorumDisks`
cannot possibly have an idempotent behavior.
The problem comes from given a set of N disks returning just a
shuffled N/2 disks. In case of a scenario where we have N/2
number of failed disks, the returned value of `getLoadBalancedQuorumDisks`
is not equal to the same failed disks so essentially calls using such
disks might succeed or fail randomly at different intervals in time.
This proposal change is we move to `getLoadBalancedDisks()`
and use the shuffled N disks as a whole. Since most of the time we might
hit a good disk since we are not reducing our solution space. This
also provides consistent behavior for all the functions which rely
on shuffled disks.
Fixes#2242
Each metadata ops have a list of errors which can be
ignored, this is essentially needed when
- disks are not found
- disks are found but cannot be accessed (permission denied)
- disks are there but fresh disks were added
This is needed since we don't have healing code in place where
it would have healed the fresh disks added.
Fixes#2072
Previously xl.isObject() returns false if one of the disk doesn't have
the object. Its possible that object may be present in another disk.
This patch fixes the issue by returning false only if given prefix
doesn't exist in all disks.
Fixes#1855