[RFC] Riak CS and Stanchion metrics 2.1
Summary: Riak CS 2.1 introduces a new metrics system for monitoring
the system in more detail and diagnosing issues more effectively.
The new metrics fall into three major categories: frontend API
performance, backend Riak performance, and CS internal performance.
Stanchion adopts the same metrics system as Riak CS, including a
newly introduced stanchion-admin command and a /stats HTTP endpoint.
Status of this document: Request for comments. Although most of the implementation is finished, it is still easy to add or remove items to cover minor improvements that reflect real operational needs. Suggestions for renaming items into clearer English are also appreciated.
The new metrics system has more items than the previous one, covering:
- Statistics of API requests - count, latency, success and errors about frontend API performance
- Statistics of Riak PB API requests - count, latency, success and errors about backend Riak performance
- Internal statistics inside system
- Statistics of accessing Stanchion
- Waiting time and service time of Stanchion serialization queue
- System metrics such as OTP and memory
- Mochiweb metrics
This document describes the basic ideas behind the new metrics system and is intended to help maintain the metrics system of Riak CS and Stanchion 2.1.
The interfaces for retrieving stats from Riak CS are still the
riak-cs-admin status command and the HTTP/JSON API /riak-cs/stats;
neither has changed. However, many new detailed items have been added,
along with system metrics. See the example results of
riak-cs-admin status and of the HTTP/JSON API /riak-cs/stats.
A corresponding new command and HTTP/JSON API endpoint have been added
to Stanchion. They provide the same granularity of items as Riak CS.
See the example results of stanchion-admin status and of the
HTTP/JSON API /stats.
Terms:
- in, out - for each S3 API call, metrics taken when a request starts and when it finishes.
- error - if the term 'error' appears in an item name, the item covers requests that failed. Successful requests are covered by the corresponding item without 'error'. Note that most irregular responses with code 50x are not counted in either.
- one, total - the major suffixes for counts. 'one' stands for the value over a time window determined by exometer, while 'total' stands for the value accumulated since the node started.
- time - latency stats measured from when a request starts until it finishes. Always followed by one of the suffixes 95, 99, 100, mean and median.
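The difference between the one and total suffixes can be illustrated with a small sketch. It is not how exometer implements its windows: the WindowedCounter class, its method names, and the 60-second window are all assumptions made up for illustration.

```python
import time
from collections import deque


class WindowedCounter:
    """Hypothetical counter exposing a windowed value ('one') and a lifetime value ('total')."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        # window_seconds is a placeholder; in Riak CS the window is decided by exometer.
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = deque()  # timestamps of updates still inside the window
        self.total = 0         # accumulated since the counter was created

    def update(self, now=None):
        """Record one event, e.g. one finished request."""
        now = self.clock() if now is None else now
        self.events.append(now)
        self.total += 1

    def one(self, now=None):
        """Return the number of events recorded within the last window."""
        now = self.clock() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)


counter = WindowedCounter(window_seconds=60.0)
for timestamp in (0.0, 30.0, 65.0):
    counter.update(now=timestamp)
```

With these three updates, reading the counter at time 70.0 yields 2 for the windowed value, because the update at 0.0 has aged out, while total stays at 3.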
Categories:
- S3 API stats - items starting with the prefixes service, bucket, list, multiple_delete, object and multipart (names for S3 APIs) are stats for those APIs, typically followed by a term like put, get or delete. See the S3 API documentation for the full set of APIs.
- Stanchion access stats - items starting with the prefix velvet stand for latency and counts of accessing the Stanchion process for creating/updating/deleting buckets or creating users. They are useful for telling whether the major latency of slow requests lies in Stanchion. The protocol between Riak CS and Stanchion is undocumented, but the names are hopefully mostly self-describing.
- riakc - items starting with the prefix riakc stand for latency and call counts of the Riak PB API. riakc is usually followed by an operation like put or get and its target like manifests or blocks. They are also useful for finding where the major latency comes from, such as getting a user record, getting a bucket record, or updating manifests. The correspondence with S3 APIs is also undocumented, but the following section tries to describe it.
The example below shows the metrics for frontend performance of the
GET Object API.
object_get_in_one : 0
object_get_in_total : 0
object_get_out_one : 0
object_get_out_total : 0
object_get_time_mean : 0
object_get_time_median : 0
object_get_time_95 : 0
object_get_time_99 : 0
object_get_time_100 : 0
object_get_out_error_one : 0
object_get_out_error_total : 0
object_get_time_error_mean : 0
object_get_time_error_median : 0
object_get_time_error_95 : 0
object_get_time_error_99 : 0
object_get_time_error_100 : 0
The prefix object_get indicates these are about GET Object. The
prefix is followed by one of three patterns: in, out, time. As
described above, in counts incoming requests that have started being
processed, out counts requests that finished successfully, and
out_error counts requests that finished with expected (sometimes
unexpected) errors.
The term time covers everything about latency. The suffixes 95, 99
and 100 stand for the 95th, 99th and 100th percentiles. mean is the
mean over the last time window, and median is the median.
object_get_out_error_one is a time-window-based count of failed
requests. object_get_time_error_* are time-window-based latency
stats of failed requests.
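The naming scheme in the GET Object example can be handled mechanically. The following is a minimal sketch, assuming the status output consists of "name : value" lines as in that example. The function names parse_status, split_item and error_ratio are hypothetical helpers, not part of Riak CS.

```python
import re

# Patterns and suffixes taken from the GET Object example above.
ITEM_RE = re.compile(
    r"^(?P<prefix>.+?)_"
    r"(?P<kind>in|out_error|out|time_error|time)_"
    r"(?P<suffix>one|total|mean|median|95|99|100)$"
)


def parse_status(text):
    """Parse 'name : value' lines into a dict of item name -> integer value."""
    stats = {}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip items whose value is not an integer
    return stats


def split_item(name):
    """Split an item name into (prefix, kind, suffix), or return None if it does not fit."""
    match = ITEM_RE.match(name)
    if match is None:
        return None
    return match.group("prefix"), match.group("kind"), match.group("suffix")


def error_ratio(stats, prefix):
    """Share of finished requests that failed, based on the *_total counters."""
    succeeded = stats.get(f"{prefix}_out_total", 0)
    failed = stats.get(f"{prefix}_out_error_total", 0)
    finished = succeeded + failed
    return failed / finished if finished else 0.0
```

For instance, split_item("object_get_time_error_95") gives ("object_get", "time_error", "95"), and error_ratio can be used to watch the failure share of a single API without listing every item by hand.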
As there are more than 1000 items, this section describes only the major prefixes.
S3 API stats
- service_get - GET Service
- bucket_(put|head|delete) - PUT, HEAD, DELETE Bucket
- bucket_acl_(get|put) - PUT, GET Bucket ACL
- bucket_policy_(get|put|delete) - PUT, GET, DELETE Bucket Policy
- bucket_location_get - GET Bucket Location
- list_uploads - listing all multipart uploads
- multiple_delete - Delete Multiple Objects
- list_objects - listing all objects in a bucket, equivalent to GET Bucket
- object_(get|put|delete) - GET, PUT, DELETE, HEAD Objects
- object_put_copy - PUT Copy Object
- object_acl - GET, PUT Object ACL
- multipart_post - Initiate a multipart upload
- multipart_upload_put - PUT Multipart Upload, putting a part of an object by copying from an existing object
- multipart_upload_post - complete a multipart upload
- multipart_upload_delete - delete a part of a multipart upload
- multipart_upload_get - get a list of parts in a multipart upload
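Several entries in this document use a shorthand such as bucket_(put|head|delete) to stand for multiple prefixes. As a convenience for tooling, the shorthand can be expanded into concrete prefixes. The sketch below assumes the shorthand only ever uses flat, non-nested groups; expand is a hypothetical helper name.

```python
import itertools
import re

# Matches one flat "(a|b|c)" group; nested groups are not supported.
GROUP_RE = re.compile(r"\(([^()]*)\)")


def expand(pattern):
    """Expand 'bucket_(put|head|delete)' into the concrete prefixes it stands for."""
    # re.split with one capturing group alternates literal text and group bodies:
    # even indexes are literal text, odd indexes are the contents of a group.
    parts = GROUP_RE.split(pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combination) for combination in itertools.product(*choices)]
```

Here expand("bucket_(put|head|delete)") returns bucket_put, bucket_head and bucket_delete, and a name without a group, such as service_get, is returned unchanged as a single-element list.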
Stanchion Access
- velvet_create_user - requesting Stanchion to create a user
- velvet_update_user - requesting Stanchion to update a user
- velvet_create_bucket - requesting Stanchion to create a bucket
- velvet_delete_bucket - requesting Stanchion to delete a bucket
- velvet_set_bucket_acl - requesting Stanchion to update a bucket ACL
- velvet_set_bucket_policy - requesting Stanchion to put a new bucket policy
- velvet_delete_bucket_policy - requesting Stanchion to delete the policy of a bucket
Riak Access
- riakc_ping - ping PB API, invoked by /riak-cs/ping
- riakc_get_cs_bucket - getting a bucket record
- riakc_get_cs_user_strong - getting a user record with PR=all
- riakc_get_cs_user - getting a user record with R=quorum and PR=one
- riakc_put_cs_user - putting a user record after creating/deleting a bucket
- riakc_get_manifest - getting a manifest
- riakc_put_manifest - putting a manifest
- riakc_delete_manifest - deleting a manifest (invoked via GC)
- riakc_get_block_n_one - getting a block with N=1 without sloppy quorum
- riakc_get_block_n_all - getting a block with N=3 after N=1 get failed
- riakc_get_block_remote - getting a block after N=3 get resulted in not found
- riakc_get_block_legacy - getting a block when N=1 get is turned off
- riakc_put_block - putting a block
- riakc_put_block_resolved - putting a block when block siblings resolution is invoked
- riakc_head_block - heading a block, invoked via GC
- riakc_delete_block_constrained - first attempt to delete a block with PW=all
- riakc_delete_block_secondary - second attempt to delete a block with PW=quorum, after PW=all failed
- riakc_(get|put)_gc_manifest_set - invoked when a manifest is being moved to the GC bucket
- riakc_(get|delete)_gc_manifest_set - invoked when manifests are being collected
- riakc_(get|put)_access - getting access stats, putting access stats
- riakc_(get|put)_storage - getting storage stats, putting storage stats
- riakc_fold_manifest_objs - invoked inside GET Bucket (listing objects within a bucket)
- riakc_mapred_storage - stats on each MapReduce job performance
- riakc_list_all_user_keys - all users are listed out when starting storage calculation
- riakc_list_all_manifest_keys - only used when deleting a bucket to verify it's empty
- riakc_list_users_receive_chunk - listing users, invoked via the /riak-cs/users API
- riakc_get_uploads_by_index
- riakc_get_user_by_index
- riakc_get_gc_keys_by_index
- riakc_get_cs_buckets_by_index
- riakc_get_clusterid - invoked the first time a proxy_get is performed
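The descriptions of the riakc_get_block_* items imply a fallback order for block reads: N=1 first, then N=3, then a remote read. The sketch below only illustrates which items a single read may touch. It is not the Riak CS code: the fetchers mapping and the get_block function are hypothetical, "failed" and "not found" are both collapsed into None, and the legacy path (N=1 turned off) is not modelled.

```python
# Order suggested by the item descriptions above.
BLOCK_READ_ORDER = (
    "riakc_get_block_n_one",   # N=1 without sloppy quorum
    "riakc_get_block_n_all",   # N=3, after the N=1 get failed
    "riakc_get_block_remote",  # after the N=3 get resulted in not found
)


def get_block(fetchers):
    """Try each read in order and report which metric items were exercised.

    `fetchers` maps an item name to a zero-argument callable that returns the
    block bytes, or None when that read did not produce a block (assumption).
    Returns (block_or_None, list_of_item_names_attempted).
    """
    attempted = []
    for item in BLOCK_READ_ORDER:
        fetch = fetchers.get(item)
        if fetch is None:
            continue  # this read path is not available in the sketch
        attempted.append(item)
        block = fetch()
        if block is not None:
            return block, attempted
    return None, attempted
```

Under these assumptions, a read whose N=1 attempt returns nothing but whose N=3 attempt succeeds would show up in both riakc_get_block_n_one and riakc_get_block_n_all, which is one way latency in these items can add up for a single GET Object.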
Others
- manifest_siblings_bp_sleep
- *_pool_*
- object_web_active_sockets
- object_web_waiting_acceptors - number of inactive acceptor processes in mochiweb
- object_web_port
- memory_* - memory stats same as Riak
- nodename, connected_nodes - same as Riak, but useless in CS
- sys_* - system stats same as Riak
Access by Riak CS process
- bucket_create, bucket_delete - bucket creation and deletion
- bucket_put_acl - updating bucket ACL or Policy
- user_create, user_update - user creation and update
Riak Access
- riakc_ping - Not used.
- riakc_(get|put)_cs_bucket - Updated via PUT Bucket and DELETE Bucket calls.
- riakc_get_cs_user_strong, riakc_(get|put)_cs_user - Updated on updating user.
- riakc_get_manifest - Updated via DELETE Bucket and PUT Bucket to verify the bucket is really empty.
- riakc_list_all_user_keys - Ditto.
- riakc_list_all_manifest_keys - Ditto.
- riakc_list_users_receive_chunk
- riakc_get_user_by_index
- riakc_get_gc_keys_by_index
- riakc_get_cs_buckets_by_index
Others
- stanchion_server_msgq_len - Message queue length of the serialiser process in Stanchion. This should be very close to 0 for immediate response on creating/deleting/updating buckets.
- waiting_time - Waiting time of a request, from when the request arrives at Stanchion until the serialiser process starts processing it. This should not be long compared to Riak access latency.
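To make the meaning of waiting_time concrete, the sketch below models a single serialiser that handles requests strictly one at a time and computes how long each request waits before processing starts. It is a toy model with made-up inputs, not Stanchion's implementation; the waiting_times function and its input format are assumptions.

```python
def waiting_times(requests):
    """Compute per-request waiting time for one serialised worker.

    `requests` is a list of (arrival_time, service_time) tuples sorted by
    arrival time. Waiting time is the gap between a request's arrival and
    the moment the worker starts processing it.
    """
    free_at = 0.0  # time at which the worker finishes its current request
    waits = []
    for arrival, service in requests:
        start = max(arrival, free_at)  # wait if the worker is still busy
        waits.append(start - arrival)
        free_at = start + service
    return waits
```

In this model a request only waits when it arrives while an earlier one is still being served, so long waiting times relative to service times indicate that requests are queueing up behind the serialiser.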
- Q. Why is item XXXX not here? A. We just haven't considered it. Open to discussion.
- Q. Item YYYY should be removed. A. We just haven't considered it. Open to discussion.
- Q. What was item ZZZZ introduced for? A. If it's not described above, we will add more description.
- Q. More than a thousand! Too many trivial metrics. A. Given that these stats are retrieved around once per minute, we anticipate that there isn't much overhead. We will run a load test before shipping to make sure.
- Q. How much performance overhead is observed for collecting this many metrics? A. For updating a stat, a microbenchmark indicated that more than hundreds of thousands of updates could be handled.
Riak CS
- Enrich stats items [JIRA: RCS-217] #961
- Introduce Exometer #1165
- Add latency stats items to S3 API and velvet calls [JIRA: RCS-220] #1180
- Add latency stats for riak pb client operations [JIRA: RCS-243] #1189
- Add status around PB pools, memory, system and mochiweb [RCS-244] #1194
- Add stanchion stats test for API and command #1199
Stanchion
- Introduce metrics. #92
- Add stats to Stanchion #98
- Feature/stats2 #99