minio

Commit Graph

Author	SHA1	Message	Date
Harshavardhana	b0e1d4ce78	re-attach offline drive after new drive replacement (#10416): drive healing was inconsistent when one of the drives was offline while a new drive was being replaced; this change ensures that the offline drive can be added back into the mix by healing it again.	4 years ago
Harshavardhana	8a291e1dc0	Cluster healthcheck improvements (#10408) - do not fail the healthcheck if heal status was not obtained from one of the nodes; if many nodes fail, report this as a catastrophic error. - add "x-minio-write-quorum" value to match the write tolerance supported by the server. - admin info now states if a drive is healing, where madmin.Disk.Healing is set to true and madmin.Disk.State is "ok"	4 years ago
Daniel Valdivia	7d1734d033	indicate through HTTP header cluster healing in progress (#10342)	4 years ago
Harshavardhana	fe157166ca	fix: Pass context all the way down to the network call in lockers (#10161) Context timeouts might race with each other when timeouts are low, i.e. when two lock attempts happen very quickly on the same resource while the servers are still trying to establish quorum. This situation can lead to held locks that are never unlocked, so subsequent lock attempts fail and a complete server restart is required. This issue can occur when the server is booting up and we are trying to hold a 'transaction.lock' in quick bursts of timeouts.	4 years ago
Harshavardhana	ec06089eda	fix: re-implement cluster healthcheck (#10101)	4 years ago
Harshavardhana	c0ac25bfff	fix: readiness needs to be like liveness (#9941) Readiness has no reason to be cluster scoped because that is not how k8s networking works for pods: the pods of a deployment do not share the network as a singleton. Instead they run in their own local scope; on readiness failure a pod is potentially taken out of the network and stops being resolvable, which affects the distributed setup in a myriad of ways. Readiness should therefore behave like liveness, with local scope alone, and should be a dummy implementation. With this PR, startup times and overall k8s startup time improve dramatically. Added another handler, `/minio/health/cluster`, to report cluster-scope health.	4 years ago
Harshavardhana	4790868878	allow background IAM load to speed up startup (#9796) Also fix the healthcheck handler to return success only if the object layer has fully initialized for S3 API access calls.	4 years ago
Harshavardhana	5e529a1c96	simplify context timeout for readiness (#9772) Additionally add CORS support to restrict access to a specific origin; adds a new config and updates the documentation as well.	5 years ago
Krishna Srinivas	7d19ab9f62	readiness returns error quickly if any of the sets is down (#9662) This PR adds a new configuration parameter which allows the readiness check to respond within 10 seconds; this can be reduced to a lower value if necessary using `mc admin config set api ready_deadline=5s` or `export MINIO_API_READY_DEADLINE=5s`	5 years ago
Anis Elleuch	d4dcf1d722	metrics: Use StorageInfo() instead to have consistent info (#9006) Metrics used to have its own code to calculate offline disks. StorageInfo() was avoided because it is an expensive operation that sends calls to all nodes. To make metrics & server info share the same code, a new argument `local` is added to StorageInfo() so it will only query local disks when needed. Metrics now calls StorageInfo() as server info handler does but with the local flag set to false. Co-authored-by: Praveen raj Mani <praveen@minio.io> Co-authored-by: Harshavardhana <harsha@minio.io>	5 years ago
Harshavardhana	0879a4f743	rest/storage: Remove racy LastError usage (#8817) Instead perform a liveness check call to verify if the server is online and print relevant errors. Also introduce a StorageErr string error type instead of errors.New(), and deprecate usage of VerifyFileError and DeleteFileError for gob; the change in data structure also requires a bump in storage REST version to v13. Fixes #8811	5 years ago
Praveen raj Mani	157721f694	Fix readiness to return 200 for read-only mode (#8728) - We should declare a cluster ready even if only read quorum is achieved (at least n/2 disks are online). - All the zones must have enough read quorum, thus making the cluster ready for reads.	5 years ago
Praveen raj Mani	5d09233115	Fix Readiness check (#8681) - Remove goroutine-check in Readiness check - Bring in quorum check for readiness Fixes #8385 Co-authored-by: Harshavardhana <harsha@minio.io>	5 years ago
Harshavardhana	347b29d059	Implement bucket expansion (#8509)	5 years ago
Harshavardhana	822eb5ddc7	Bring in safe mode support (#8478) This PR refactors object layer handling such that upon failure in sub-system initialization the server reaches a stage of safe-mode operation wherein only certain API operations are enabled and available. This allows for fixing many scenarios such as - incorrect configuration in vault, etcd, notification targets - missing files, incomplete config migrations, unable to read encrypted content etc - any other issues related to notification, policies, lifecycle etc	5 years ago
Harshavardhana	07a556a10b	Avoid ListBuckets() call, instead rely on simple HTTP GET (#8475) This avoids making calls to the backend and requiring gateways to allow permissions for the ListBuckets() operation just for liveness checks; avoiding this also makes our liveness checks more performant.	5 years ago
Harshavardhana	9e7a3e6adc	Extend further validation of config values (#8469) - This PR allows config KVS to be validated properly without being affected by ENV overrides, rejects invalid values during set operation - Expands unit tests and refactors the error handling for notification targets, returns error instead of ignoring targets for invalid KVS - Does all the prep-work for implementing safe-mode style operation for MinIO server, introduces a new global variable to toggle safe mode based operations NOTE: this PR itself doesn't provide safe mode operations	5 years ago
Nitish Tiwari	496fba3e9a	Return 200 OK for liveness checks while distributed cluster starts (#8176) With this PR, the liveness check responds with 200 OK and a "server-not-initialized" header while objectLayer gets initialized. The header is removed once objectLayer is initialized. This allows a MinIO distributed cluster to get started when running on orchestration platforms like Docker Swarm. This PR also updates sample Swarm yaml files to use correct values for healthcheck fields. Fixes #8140	5 years ago
Harshavardhana	5a28ef0d47	Bump readiness check up to 10000 go-routines (#8057) Most of our current workloads reach this value regularly, so it doesn't make sense to keep the 1000 go-routine limit.	5 years ago
Anis Elleuch	e857b6741d	Add one log in health checker liveness code (#7861)	5 years ago
kannappanr	5ecac91a55	Replace Minio refs in docs with MinIO and links (#7494)	6 years ago
Krishna Srinivas	267f183fc8	Do not do StorageInfo() and ListBuckets() for FS/Erasure in health check handler (#7090) Health checking programs very frequently use /minio/health/live to check health, hence we can avoid doing StorageInfo() and ListBuckets() for FS/Erasure backend.	6 years ago
Harshavardhana	166e998788	Fix healthcheck for NAS gateway (#6452) It was expected that in gateway mode we do not know the backend types, whereas the NAS gateway is an extension of FS mode (standalone); this leads to an issue in LivenessCheckHandler(), which would perpetually return 503, affecting all kubernetes and openshift deployments of the NAS gateway.	6 years ago
Nitish Tiwari	197af49c99	Fix healthcheck handler to verify gateway backend liveness (#6218) Fixes #6217	6 years ago
Harshavardhana	157ed65c35	Fix healthcheck handler to check errors in local disks only (#6184) The healthcheck handler in the current implementation was performing ListBuckets() to check for liveness of the Minio service. The ListBuckets() implementation, on the other hand, doesn't do quorum based listing, and if one of the disks returned an I/O error it would lead to kubernetes taking the minio pod down prematurely even if the disk is not local to that minio server. The reason is that the ListBuckets() call cannot be trusted to provide the valid information that we need; Minio is a clustered application which is designed to handle disk failures. An error on one of the disks doesn't mean the pod should become fully non-operational. This PR attempts to fix this by only checking for alive disks which are local to each setup, and by simply performing a Stat() operation: if Stat() returned an error on all disks local to a particular server then we can let kubernetes safely take it down; until then we should be operational.	6 years ago
Krishna Srinivas	9ede179a21	Use context.Background() instead of nil. Rename Context[Get|Set] -> [Get|Set]Context	7 years ago
Krishna Srinivas	e452377b24	Add context to the object-interface methods. Make necessary changes to xl, fs, azure, sia	7 years ago
Nitish Tiwari	10b01ac836	Add healthcheck endpoints (#5543) This PR adds readiness and liveness endpoints to probe Minio server instance health. Endpoints can only be accessed without authentication and the paths are /minio/health/live and /minio/health/ready for liveness and readiness respectively. The new healthcheck liveness endpoint is used for Docker healthcheck now. Fixes #5357 Fixes #5514	7 years ago

28 Commits (0104af6bccce0f8fce537a713c424f4aed2041f0)
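
Commit 157721f694 describes readiness in terms of read quorum: a cluster counts as ready for reads when every zone has at least n/2 of its disks online. The rule can be sketched in Go as below. The `zone` type and both function names are hypothetical and are not taken from the MinIO source; treating n/2 as an exact fraction (so 2 of 3 disks are needed) and treating an empty cluster as not ready are both assumptions.

```go
package main

import "fmt"

// zone is a hypothetical summary of one zone: how many disks are online
// out of how many exist in total.
type zone struct {
	online int
	total  int
}

// hasReadQuorum reports whether at least half of the disks are online.
// Comparing online*2 with total avoids integer-division rounding.
func hasReadQuorum(online, total int) bool {
	return total > 0 && online*2 >= total
}

// readyForReads reports whether every zone has read quorum.
// An empty zone list is treated as not ready (an assumption).
func readyForReads(zones []zone) bool {
	if len(zones) == 0 {
		return false
	}
	for _, z := range zones {
		if !hasReadQuorum(z.online, z.total) {
			return false
		}
	}
	return true
}

func main() {
	healthy := []zone{{online: 4, total: 4}, {online: 2, total: 4}}
	degraded := []zone{{online: 4, total: 4}, {online: 1, total: 4}}
	fmt.Println("healthy ready for reads:", readyForReads(healthy))
	fmt.Println("degraded ready for reads:", readyForReads(degraded))
}
```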