minio

Commit Graph

Author	SHA1	Message	Date
Harshavardhana	3ca6330661	fix: optimize parentDirIsObject by moving isObject to storage layer (#11291 ) For objects with `N` prefix depth, this PR replaces the `N` network operations with one by converting `CheckFile` into a single bulk operation. Reduction in chattiness here would allow disks to be utilized more cleanly, while maintaining the same functionality. Additionally, one extra volume check stat() call is removed. Update tests to test multiple sets scenario	4 years ago
Harshavardhana	f903cae6ff	Support variable server pools (#11256 ) Current implementation requires server pools to have same erasure stripe sizes, to facilitate same SLA and expectations. This PR allows server pools to be variadic, i.e they do not have to be same erasure stripe sizes - instead they should have SLA for parity ratio. If the parity ratio cannot be guaranteed by the new server pool, the deployment is rejected i.e server pool expansion is not allowed.	4 years ago
Harshavardhana	1a5775e2e8	enable small and large file optimization (#11260 ) - for large objects we found that a 1MiB block works well for r/w. - for small objects we found that a 128KiB block works well for r/w.	4 years ago
Harshavardhana	e4e117faab	fix: enable xl.json to xl.meta only if legacy drive is found (#11255 ) another optimization is renameLegacyMetadata() never needs to validate bucket with os.Stat() again, leading to reduction in one extra syscall.	4 years ago
Harshavardhana	4593b146be	fix: print errors only when metacache status has errors (#11248 )	4 years ago
Harshavardhana	f21d650ed4	fix: readData in bulk call using messagepack byte wrappers (#11228 ) This PR refactors the way we use buffers for O_DIRECT and re-uses those buffers for the messagepack reader and writer. Extensive benchmarking showed that not all objects benefit; only objects smaller than 64KiB see a benefit overall. Benefits are seen for almost all objects from 1KiB to 32KiB. Beyond this, no objects benefit from the bulk call approach, as the latency of bytes sent over the wire versus streaming content directly from disk negate each other with no remarkable benefits. Other optimizations include reuse of msgp.Reader and msgp.Writer using sync.Pool for all internode calls.	4 years ago
Harshavardhana	76e2713ffe	fix: use buffers only when necessary for io.Copy() (#11229 ) Use separate sync.Pool for writes/reads. Avoid passing buffers to io.CopyBuffer(): if the reader or writer implements io.WriterTo or io.ReaderFrom respectively, then it's useless for sync.Pool to allocate buffers on its own, since they will be completely ignored by the io.CopyBuffer Go implementation. Improve this wherever we see this to be optimal. This allows us to be more efficient on memory usage. ``` // copyBuffer is the actual implementation of Copy and CopyBuffer. // if buf is nil, one is allocated. func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) { // If the reader has a WriteTo method, use it to do the copy. // Avoids an allocation and a copy. if wt, ok := src.(WriterTo); ok { return wt.WriteTo(dst) } // Similarly, if the writer has a ReadFrom method, use it to do the copy. if rt, ok := dst.(ReaderFrom); ok { return rt.ReadFrom(src) } ``` From the readahead package: ``` // WriteTo writes data to w until there's no more data to write or when an error occurs. // The return value n is the number of bytes written. // Any error encountered during the write is also returned. func (a *reader) WriteTo(w io.Writer) (n int64, err error) { if a.err != nil { return 0, a.err } n = 0 for { err = a.fill() if err != nil { return n, err } n2, err := w.Write(a.cur.buffer()) a.cur.inc(n2) n += int64(n2) if err != nil { return n, err } ```	4 years ago
Harshavardhana	d0027c3c41	do not use large buffers if not necessary (#11220 ) Without this change, there is a performance regression for small-object GETs; this change brings the overall speed back to what it was before commit '59d363'.	4 years ago
Harshavardhana	c4b1d394d6	erasure: avoid io.Copy in hotpaths to reduce allocation (#11213 )	4 years ago
Harshavardhana	c4131c2798	feat: Small object optimization read data in single bulk call (#11207 )	4 years ago
Anis Elleuch	c9d502e6fa	parentDirIsObject() to return quickly with nonexistent parent (#11204 ) Rewrite parentIsObject() function. Currently if a client uploads a/b/c/d, we always check if c, b, a are actual objects or not. The new code checks in the reverse order and quits quickly if a segment doesn't exist. So if a, b, c in 'a/b/c' do not exist in the first place, it returns false quickly.	4 years ago
Anis Elleuch	677e80c0f8	xl: Remove check-dir in ReadVersion (#11200 ) The only purpose of the check-dir flag in ReadVersion is to return 404 when an object has xl.meta but no data. This causes an extra call to the disk, which can be penalizing on a busy system where disks receive many concurrent accesses.	4 years ago
Anis Elleuch	a317d220ed	xl-storage: Do not stat bucket assuming the object exists (#11201 ) In HEAD/GET, only STAT the bucket if the object does not exist to return the correct error response.	4 years ago
Harshavardhana	cc457f1798	fix: enhance logging in crawler use console.Debug instead of logger.Info (#11179 )	4 years ago
Harshavardhana	445a9bd827	fix: heal optimizations in crawler to avoid multiple healing attempts (#11173 ) Fixes two problems - Double healing when bitrot is enabled, instead heal attempt once in applyActions() before lifecycle is applied. - If applyActions() is successful and getSize() returns proper value, then object is accounted for and should be removed from the oldCache namespace map to avoid double heal attempts.	4 years ago
Harshavardhana	c19e6ce773	avoid a crash in crawler when lifecycle is not initialized (#11170 ) Bonus for static buffers use bytes.NewReader instead of bytes.NewBuffer, to use a more reader friendly implementation	4 years ago
Harshavardhana	a773cf48d8	fix: overlapping object and prefix rejected (#11130 ) fixes #11129	4 years ago
Harshavardhana	3e83643320	lifecycle improvements and additional debug logging (#11096 ) Bonus change fix browser assets	4 years ago
Anis Elleuch	f164085227	xl: Always set root disk to true in test environment (#11094 ) Test environments (go test or manual testing) should always consider the passed disks to be root disks and should not rely on the disk.IsRootDisk() function, because the latter can return a false negative when called on a busy system. However, returning a false negative will only occur in a testing environment and not in production, so we can accept this trade-off for now.	4 years ago
Harshavardhana	d8c1f93de6	reject mixed drive situations with drives on root disks (#11057 ) Until now we matched the inode number of the root drive against that of the drive path minio would use; if they matched, we knew that it's a root disk. This may not be true in all situations, such as running inside a container environment where the container might be mounted from a different partition altogether, so root disk detection might fail.	4 years ago
Ritesh H Shukla	038bcd9079	Add replication capacity metrics support in crawler (#10786 )	4 years ago
Klaus Post	a896125490	Add crawler delay config + dynamic config values (#11018 )	4 years ago
Harshavardhana	96c0ce1f0c	add support for tuning healing to make healing more aggressive (#11003 ) supports `mc admin config set <alias> heal sleep=100ms` to enable more aggressive healing under certain times. also optimize some areas that were doing extra checks than necessary when bitrotscan was enabled, avoid double sleeps make healing more predictable. fixes #10497	4 years ago
Harshavardhana	bdd094bc39	fix: avoid sending errors on missing objects on locked buckets (#10994 ) make sure multi-object delete returned errors that are AWS S3 compatible	4 years ago
Harshavardhana	df93102235	fix: unwrapping issues with os.Is* functions (#10949 ) reduces 3 stat calls, reducing the overall startup time significantly.	4 years ago
Poorna Krishnamoorthy	39f3d5493b	Show Delete replication status header (#10946 ) X-Minio-Replication-Delete-Status header shows the status of the replication of a permanent delete of a version. All GETs are disallowed and return 405 on this object version. In the case of replicating delete markers, X-Minio-Replication-DeleteMarker-Status shows the status of replication, and would similarly return 405. Additionally, this PR adds reporting of delete marker event completion and updates documentation	4 years ago
Poorna Krishnamoorthy	1ebf6f146a	Add support for ILM transition (#10565 ) This PR adds transition support for ILM to transition data to another MinIO target represented by a storage class ARN. Subsequent GET or HEAD for that object will be streamed from the transition tier. If PostRestoreObject API is invoked, the transitioned object can be restored for duration specified to the source cluster.	4 years ago
Harshavardhana	9a34fd5c4a	Revert "Revert "Add delete marker replication support (#10396 )"" This reverts commit `267d7bf0a9`.	4 years ago
Harshavardhana	267d7bf0a9	Revert "Add delete marker replication support (#10396 )" This reverts commit `50c10a5087`. PR is moved to origin/dev branch	4 years ago
Poorna Krishnamoorthy	50c10a5087	Add delete marker replication support (#10396 ) Delete marker replication is implemented for V2 configuration specified in AWS spec (though AWS allows it only in the V1 configuration). This PR also brings in a MinIO only extension of replicating permanent deletes, i.e. deletes specifying version id are replicated to target cluster.	4 years ago
Harshavardhana	fde3299bf3	re-use optimized readdir for isDirEmpty() (#10829 ) reduces effective memory usage by an order of magnitude, also increases performance for small objects	4 years ago
Harshavardhana	1a1f00fa15	fix: use internode data for DisksInfo, VolsInfo in message pack (#10821 ) Similar to #10775 for fewer memory allocations, since we use getOnlineDisks() extensively for listing we should optimize it further. Additionally, remove all unused walkers from the storage layer	4 years ago
Klaus Post	37749f4623	Optimize FileInfo(Version) transfer (#10775 ) File Info decoding, in particular, is showing up as a major allocator and time consumer for internode data transfers Switch to message pack for cross-server transfers: ``` MSGP: Size: 945 bytes BenchmarkEncodeFileInfoMsgp-32 1558444 866 ns/op 1.16 MB/s 0 B/op 0 allocs/op BenchmarkDecodeFileInfoMsgp-32 479968 2487 ns/op 0.40 MB/s 848 B/op 18 allocs/op GOB: Size: 1409 bytes BenchmarkEncodeFileInfoGOB-32 333339 3237 ns/op 0.31 MB/s 576 B/op 19 allocs/op BenchmarkDecodeFileInfoGOB-32 20869 57837 ns/op 0.02 MB/s 16439 B/op 428 allocs/op ```	4 years ago
Klaus Post	86e0d272f3	Reduce WriteAll allocs (#10810 ) WriteAll saw 127GB allocs in a 5 minute timeframe for 4MiB buffers used by `io.CopyBuffer` even if they are pooled. Since all writers appear to write byte buffers, just send those instead and write directly. The files are opened through the `os` package so they have no special properties anyway. This removes the alloc and copy for each operation. REST sends content length so a precise alloc can be made.	4 years ago
Krishna Srinivas	3a2f89b3c0	fix: add support for O_DIRECT reads for erasure backends (#10718 )	4 years ago
Klaus Post	a982baff27	ListObjects Metadata Caching (#10648 ) Design: https://gist.github.com/klauspost/025c09b48ed4a1293c917cecfabdf21c Gist of improvements: * Cross-server caching and listing will use the same data across servers and requests. * Lists can be arbitrarily resumed at a constant speed. * Metadata for all files scanned is stored for streaming retrieval. * The existing bloom filters controlled by the crawler are used for validating caches. * Concurrent requests for the same data (or parts of it) will not spawn additional walkers. * Listing a subdirectory of an existing recursive cache will use the cache. * All listing operations are fully streamable so the number of objects in a bucket no longer dictates the amount of memory. * Listings can be handled by any server within the cluster. * Caches are cleaned up when out of date or superseded by a more recent one.	4 years ago
Anis Elleuch	eb95353cb1	fix: Get/HeadObject return 404 on non quorum objects (#10753 )	4 years ago
Anis Elleuch	00124c56d9	erasure: Commit data before xl.meta in RenameData() (#10734 ) This will reduce the chance to have updated xl.meta without data.	4 years ago
Harshavardhana	2042d4873c	rename crawler config option to heal (#10678 )	4 years ago
Klaus Post	03991c5d41	crawler: Remove waitForLowActiveIO (#10667 ) Only use dynamic delays for the crawler. Even though the max wait was 1 second the number of waits could severely impact crawler speed. Instead of relying on a global metric, we use the stateless local delays to keep the crawler running at a speed more adjusted to current conditions. The only case we keep it is before bitrot checks when enabled.	4 years ago
Harshavardhana	a0d0645128	remove safeMode behavior in startup (#10645 ) In almost all scenarios MinIO is now mostly ready for all sub-systems independently; safe-mode is not useful anymore and does not serve its original intended purpose. Allow the server to be fully functional even with config partially configured; this is to cater for availability of actual I/O versus manually fixing the server. In k8s-like environments it will never make sense to take a pod into safe-mode state, because there is no real access to perform any remote operation on them.	4 years ago
Harshavardhana	736e58dd68	fix: handle concurrent lockers with multiple optimizations (#10640 ) - select lockers which are non-local and online to have affinity towards remote servers for lock contention - optimize lock retry interval to avoid sending too many messages during lock contention, reduces average CPU usage as well - if bucket is not set, when deleteObject fails make sure setPutObjHeaders() honors lifecycle only if bucket name is set. - fix top locks to always list the oldest lockers, avoid getting bogged down by the map's unordered nature.	4 years ago
Harshavardhana	2b4eb87d77	pick disks which are common maximally used (#10600 ) further optimization to ensure that good disks are always used for listing, other than healing we only use disks that are maximally used.	4 years ago
Harshavardhana	00eb6f6bc9	cache DiskInfo at storage layer for performance (#10586 ) `mc admin info` on busy setups will not move HDD heads unnecessarily for repeated calls, which provides better responsiveness for the call overall. Bonus change: allow listTolerancePerSet to be N-1 for good entries, to avoid skipping entries when one of the disks has gone offline for some reason.	4 years ago
Harshavardhana	66174692a2	add '.healing.bin' for tracking currently healing disk (#10573 ) add a hint on the disk to allow for tracking fresh disk being healed, to allow for restartable heals, and also use this as a way to track and remove disks. There are more pending changes where we should move all the disk formatting logic to backend drives, this PR doesn't deal with this refactor instead makes it easier to track healing in the future.	4 years ago
Harshavardhana	7f9498f43f	fix: ignore faulty drives and continue (#10511 ) drives might return different types of errors handle them individually, and for some errors just log an error and continue	4 years ago
Klaus Post	34859c6d4b	Preallocate (safe) slices when we know the size (#10459 )	4 years ago
Klaus Post	fa01e640f5	Continuous healing: add optional bitrot check (#10417 )	4 years ago
Anis Elleuch	af88772a78	lifecycle: NoncurrentVersionExpiration considers noncurrent version age (#10444 ) From https://docs.aws.amazon.com/AmazonS3/latest/dev/intro-lifecycle-rules.html#intro-lifecycle-rules-actions ``` When specifying the number of days in the NoncurrentVersionTransition and NoncurrentVersionExpiration actions in a Lifecycle configuration, note the following: It is the number of days from when the version of the object becomes noncurrent (that is, when the object is overwritten or deleted), that Amazon S3 will perform the action on the specified object or objects. Amazon S3 calculates the time by adding the number of days specified in the rule to the time when the new successor version of the object is created and rounding the resulting time to the next day midnight UTC. For example, in your bucket, suppose that you have a current version of an object that was created at 1/1/2014 10:30 AM UTC. If the new version of the object that replaces the current version is created at 1/15/2014 10:30 AM UTC, and you specify 3 days in a transition rule, the transition date of the object is calculated as 1/19/2014 00:00 UTC. ```	4 years ago
Klaus Post	2d58a8d861	Add storage layer contexts (#10321 ) Add context to all (non-trivial) calls to the storage layer. Contexts are propagated through the REST client. - `context.TODO()` is left in place for the places where it needs to be added to the caller. - `endWalkCh` could probably be removed from the walkers, but no changes so far. The "dangerous" part is that now a caller disconnecting will propagate down, so a "delete" operation will now be interrupted. In some cases we might want to disconnect this functionality so the operation completes if it has started, leaving the system in a cleaner state.	4 years ago


78 Commits (6bfa162342245d59c6934c94a070c9f40a0e4349)