Skip to content

feat: Add per-feature-view metrics for online read path#365

Open
vanitabhagwat wants to merge 3 commits into
masterfrom
feat-per-fv-read-metrics
Open

feat: Add per-feature-view metrics for online read path#365
vanitabhagwat wants to merge 3 commits into
masterfrom
feat-per-fv-read-metrics

Conversation

@vanitabhagwat

@vanitabhagwat vanitabhagwat commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Emit per-feature-view latency, request count, error count, and total lookup requests on every online read in the Go feature server (HTTP + gRPC). This enables per-FV hit-rate computation in Datadog and allows filtering latency/error distributions by feature view.

Key changes:

  • Add Distribution() to StatsdClient interface
  • New FeatureViewReadMetrics emitter (fv_read_latency_ms, fv_read_requests, fv_read_errors)
  • Extend LookupMetricsAggregator with totalByFV for feature_lookup_requests
  • Extract FV names from request (works with fullFeatureNames=false)
  • New unified flag ENABLE_FV_LEVEL_METRICS (backward compatible with ENABLE_MISSING_KEY_METRICS)
  • Instrument GetOnlineFeatures and GetOnlineFeaturesRange in both HTTP and gRPC handlers

What this PR does / why we need it:

Which issue(s) this PR fixes:

Misc

vanitabhagwat and others added 2 commits June 2, 2026 22:36
…uests, errors, hit rate)

Emit per-feature-view latency, request count, error count, and total
lookup requests on every online read in the Go feature server (HTTP + gRPC).
This enables per-FV hit-rate computation in Datadog and allows filtering
latency/error distributions by feature view.

Key changes:
- Add Distribution() to StatsdClient interface
- New FeatureViewReadMetrics emitter (fv_read_latency_ms, fv_read_requests, fv_read_errors)
- Extend LookupMetricsAggregator with totalByFV for feature_lookup_requests
- Extract FV names from request (works with fullFeatureNames=false)
- New unified flag ENABLE_FV_LEVEL_METRICS (backward compat with ENABLE_MISSING_KEY_METRICS)
- Instrument GetOnlineFeatures and GetOnlineFeaturesRange in both HTTP and gRPC handlers

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… emission

- Define constants for all metric names (FVReadLatencyMetric, FVReadRequestsMetric,
  FVReadErrorsMetric, LookupNotFoundMetric, LookupNullOrExpiredMetric, LookupRequestsMetric)
- Extract emitFVReadMetrics helper into server_commons.go to eliminate 4 identical
  nil-check + construct + emit patterns across HTTP and gRPC handlers
- Update tests to reference constants instead of string literals

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@vanitabhagwat vanitabhagwat changed the title feat: Add per-feature-view metrics for online read path (latency, req… feat: Add per-feature-view metrics for online read path Jun 3, 2026
…te at startup

Extract sample-rate parsing into a single ParseSampleRate() helper in config.go
and introduce MetricsContext that builds FeatureViewReadMetrics and baseTags once
at server startup. This eliminates repeated env reads and per-request tag
allocations on the hot path in both gRPC and HTTP servers.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@vanitabhagwat vanitabhagwat force-pushed the feat-per-fv-read-metrics branch from 104b30c to 0bfdf07 Compare June 11, 2026 17:47
Comment on lines +45 to +47
if m.sampleRate < 1.0 && rand.Float64() > m.sampleRate {
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the last argument of m.client.Count() the sampling rate?

Couldn't we drop this guard and then do m.client.Count(FVReadRequestsMetric, 1, tags, m.sampleRate) (same with the other m.client.*() calls?

Worth aligning lookup_metrics.go to the same approach so the two stay consistent.

request.GetRequestContext(),
request.GetFullFeatureNames())

latencyMs := float64(time.Since(t0).Milliseconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.Duration.Milliseconds() returns a truncated integer, so any fractional latency is floored (e.g. 1.4ms and 1.9ms both record as 1, and anything under 1ms records as 0). That collapses the latency distribution into coarse integer buckets and skews the percentiles. Could we use float64(time.Since(t0).Microseconds()) / 1000.0 to keep sub-ms precision?

request.GetFullFeatureNames(),
)

latencyMs := float64(time.Since(t0).Milliseconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestContextProto,
request.FullFeatureNames)

latencyMs := float64(time.Since(t0).Milliseconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requestContextProto,
request.FullFeatureNames)

latencyMs := float64(time.Since(t0).Milliseconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"github.com/feast-dev/feast/go/internal/feast/registry"
)

const DefaultSampleRate = 0.01

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

someone who previously ran with ENABLE_MISSING_KEY_METRICS=true and no FEAST_METRICS_SAMPLE_RATE set used to get exact counts (old default 1.0), and after this change they silently get 1% sampling with ×100 extrapolation on the same eature_lookup_* metrics. So existing dashboards for that feature go from exact to coarse/noisy without any config change on their end.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we shouldn't change the existing default sample rate. Since this new metric is very different, requiring a new default I think it's best to create a separate sapling rate config and default for it.

Comment on lines +18 to +21
func IsFVMetricsEnabled() bool {
return strings.ToLower(os.Getenv("ENABLE_FV_LEVEL_METRICS")) == "true" ||
IsMissingKeyMetricsEnabled()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This couples two separate features. Enabling ENABLE_MISSING_KEY_METRICS now also turns on the new FV-level metrics. Could we keep IsFVMetricsEnabled() checking only its own flag, and introduce a separate IsMetricsClientEnabled() (the OR of both) for the main.go gate that decides whether to build the statsd client? That keeps each flag controlling its own feature and the function name matching its behavior.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, these metrics should be enabled independently.

}
}

func extractFVNamesFromRequest(features []string, featureService *model.FeatureService) []string {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to server_commons since it is used in both http and grpc.
Also would be good to include unit test coverage for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants