feat: Add per-feature-view metrics for online read path#365
feat: Add per-feature-view metrics for online read path#365vanitabhagwat wants to merge 3 commits into
Conversation
…uests, errors, hit rate) Emit per-feature-view latency, request count, error count, and total lookup requests on every online read in the Go feature server (HTTP + gRPC). This enables per-FV hit-rate computation in Datadog and allows filtering latency/error distributions by feature view. Key changes: - Add Distribution() to StatsdClient interface - New FeatureViewReadMetrics emitter (fv_read_latency_ms, fv_read_requests, fv_read_errors) - Extend LookupMetricsAggregator with totalByFV for feature_lookup_requests - Extract FV names from request (works with fullFeatureNames=false) - New unified flag ENABLE_FV_LEVEL_METRICS (backward compat with ENABLE_MISSING_KEY_METRICS) - Instrument GetOnlineFeatures and GetOnlineFeaturesRange in both HTTP and gRPC handlers Co-Authored-By: Claude Opus 4.6 <[email protected]>
… emission - Define constants for all metric names (FVReadLatencyMetric, FVReadRequestsMetric, FVReadErrorsMetric, LookupNotFoundMetric, LookupNullOrExpiredMetric, LookupRequestsMetric) - Extract emitFVReadMetrics helper into server_commons.go to eliminate 4 identical nil-check + construct + emit patterns across HTTP and gRPC handlers - Update tests to reference constants instead of string literals Co-Authored-By: Claude Opus 4.6 <[email protected]>
…te at startup Extract sample-rate parsing into a single ParseSampleRate() helper in config.go and introduce MetricsContext that builds FeatureViewReadMetrics and baseTags once at server startup. This eliminates repeated env reads and per-request tag allocations on the hot path in both gRPC and HTTP servers. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
104b30c to
0bfdf07
Compare
| if m.sampleRate < 1.0 && rand.Float64() > m.sampleRate { | ||
| return | ||
| } |
There was a problem hiding this comment.
Isn't the last argument of m.client.Count() the sampling rate?
Couldn't we drop this guard and then do m.client.Count(FVReadRequestsMetric, 1, tags, m.sampleRate) (same with the other m.client.*() calls?
Worth aligning lookup_metrics.go to the same approach so the two stay consistent.
| request.GetRequestContext(), | ||
| request.GetFullFeatureNames()) | ||
|
|
||
| latencyMs := float64(time.Since(t0).Milliseconds()) |
There was a problem hiding this comment.
time.Duration.Milliseconds() returns a truncated integer, so any fractional latency is floored (e.g. 1.4ms and 1.9ms both record as 1, and anything under 1ms records as 0). That collapses the latency distribution into coarse integer buckets and skews the percentiles. Could we use float64(time.Since(t0).Microseconds()) / 1000.0 to keep sub-ms precision?
| request.GetFullFeatureNames(), | ||
| ) | ||
|
|
||
| latencyMs := float64(time.Since(t0).Milliseconds()) |
There was a problem hiding this comment.
| requestContextProto, | ||
| request.FullFeatureNames) | ||
|
|
||
| latencyMs := float64(time.Since(t0).Milliseconds()) |
There was a problem hiding this comment.
| requestContextProto, | ||
| request.FullFeatureNames) | ||
|
|
||
| latencyMs := float64(time.Since(t0).Milliseconds()) |
There was a problem hiding this comment.
| "github.com/feast-dev/feast/go/internal/feast/registry" | ||
| ) | ||
|
|
||
| const DefaultSampleRate = 0.01 |
There was a problem hiding this comment.
someone who previously ran with ENABLE_MISSING_KEY_METRICS=true and no FEAST_METRICS_SAMPLE_RATE set used to get exact counts (old default 1.0), and after this change they silently get 1% sampling with ×100 extrapolation on the same eature_lookup_* metrics. So existing dashboards for that feature go from exact to coarse/noisy without any config change on their end.
There was a problem hiding this comment.
I agree we shouldn't change the existing default sample rate. Since this new metric is very different, requiring a new default I think it's best to create a separate sapling rate config and default for it.
| func IsFVMetricsEnabled() bool { | ||
| return strings.ToLower(os.Getenv("ENABLE_FV_LEVEL_METRICS")) == "true" || | ||
| IsMissingKeyMetricsEnabled() | ||
| } |
There was a problem hiding this comment.
This couples two separate features. Enabling ENABLE_MISSING_KEY_METRICS now also turns on the new FV-level metrics. Could we keep IsFVMetricsEnabled() checking only its own flag, and introduce a separate IsMetricsClientEnabled() (the OR of both) for the main.go gate that decides whether to build the statsd client? That keeps each flag controlling its own feature and the function name matching its behavior.
There was a problem hiding this comment.
Agreed, these metrics should be enabled independently.
| } | ||
| } | ||
|
|
||
| func extractFVNamesFromRequest(features []string, featureService *model.FeatureService) []string { |
There was a problem hiding this comment.
Move this to server_commons since it is used in both http and grpc.
Also would be good to include unit test coverage for this.
Emit per-feature-view latency, request count, error count, and total lookup requests on every online read in the Go feature server (HTTP + gRPC). This enables per-FV hit-rate computation in Datadog and allows filtering latency/error distributions by feature view.
Key changes:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Misc