Fix duplicate plan limit overage emails #2269
Pull request overview
This PR addresses duplicate “plan limit overage” organization emails by adding queue-level duplicate detection (via a stable work-item unique identifier) and by tightening handler-side suppression logic so stale duplicates can’t trigger extra monthly overage emails later.
Changes:
- Add a stable UniqueIdentifier to OrganizationNotificationWorkItem to enable cross-pod/work-queue deduplication.
- Update OrganizationNotificationWorkItemHandler to (a) ignore hourly-only items and (b) suppress repeat monthly sends using a per-organization 24h “monthly-sent” cache marker plus a monthly-only lock.
- Add regression tests covering delayed duplicate processing, hourly-then-monthly ordering, org isolation, and queue dedup behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/Exceptionless.Tests/Mail/CountingMailer.cs | Adds a test mailer that records organization-notice sends for assertions. |
| tests/Exceptionless.Tests/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandlerTests.cs | Adds regression tests for duplicate enqueue/processing and correct monthly notification behavior. |
| src/Exceptionless.Core/Models/WorkItems/OrganizationNotificationWorkItem.cs | Implements IHaveUniqueIdentifier to provide a stable dedup key per org + overage type. |
| src/Exceptionless.Core/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandler.cs | Reworks handler throttling/suppression: monthly-only lock + 24h “sent” marker; hourly-only items no longer suppress monthly. |
| src/Exceptionless.Core/Bootstrapper.cs | Registers DuplicateDetectionQueueBehavior<WorkItemData> and wires queue behaviors into queue creation. |
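
The queue-level deduplication summarized above can be illustrated with a small self-contained sketch. It does not use Foundatio: `NotificationWorkItem` and `DedupQueue` are hypothetical stand-ins, and only the key shape (`Organization:{orgId}:notification:{type}`) is taken from the PR description. The real `DuplicateDetectionQueueBehavior` may track pending identifiers differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the PR's work item; only the key shape comes from the PR description.
public sealed record NotificationWorkItem(string OrganizationId, string NotificationType)
{
    public string UniqueIdentifier => $"Organization:{OrganizationId}:notification:{NotificationType}";
}

// Toy queue that rejects an enqueue while an item with the same identifier is still pending.
public sealed class DedupQueue
{
    private readonly HashSet<string> _pending = new();
    private readonly Queue<NotificationWorkItem> _items = new();

    public int Count => _items.Count;

    public bool TryEnqueue(NotificationWorkItem item)
    {
        if (!_pending.Add(item.UniqueIdentifier))
            return false; // duplicate of a pending item: dropped

        _items.Enqueue(item);
        return true;
    }

    public NotificationWorkItem Dequeue()
    {
        var item = _items.Dequeue();
        _pending.Remove(item.UniqueIdentifier); // assumption: the key is freed on dequeue
        return item;
    }
}

public static class DedupDemo
{
    public static void Main()
    {
        var queue = new DedupQueue();

        // Six subscribers (one per pod) observe the same overage event.
        for (int pod = 0; pod < 6; pod++)
            queue.TryEnqueue(new NotificationWorkItem("org-1", "monthly"));

        // A different organization is not affected by org-1's pending item.
        queue.TryEnqueue(new NotificationWorkItem("org-2", "monthly"));

        Console.WriteLine(queue.Count); // prints 2
    }
}
```

Because the identifier depends only on the organization and notification type, it is the same on every pod, which is what lets the fanout enqueues collapse.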
Force-pushed from 268ea30 to 78bf071.
Updated this PR with the deeper RCA and coverage. The most likely failure mode is: (1) every web pod subscribes to …

    services.ReplaceSingleton<ICacheClient>(sp => new InMemoryCacheClient(new InMemoryCacheClientOptions
    {
        TimeProvider = sp.GetRequiredService<TimeProvider>(),
        LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
    }));

    services.ReplaceSingleton<IMessageBus>(sp => new InMemoryMessageBus(new InMemoryMessageBusOptions
    {
        Serializer = sp.GetRequiredService<ISerializer>(),
        TimeProvider = sp.GetRequiredService<TimeProvider>(),
        LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
    }));

    services.ReplaceSingleton<IMessagePublisher>(sp => sp.GetRequiredService<IMessageBus>());
    services.ReplaceSingleton<IMessageSubscriber>(sp => sp.GetRequiredService<IMessageBus>());

This should already be the default. Why are we registering it again?

    }

    [Fact]
    public async Task RunAsync_WhenOnePlanOverageIsObservedBySixSubscribersWithQueueDedup_ShouldEnqueueOneWorkItem()

Three-part name; check the PR.

    public override Task<ILock?> GetWorkItemLockAsync(object workItem, CancellationToken cancellationToken = default)
    {
        var wi = (OrganizationNotificationWorkItem)workItem;
        if (!ShouldSendNotificationEmail(wi))
            return Task.FromResult<ILock?>(null);

        return _lockProvider.TryAcquireAsync(GetLegacyNotificationLockKey(wi.OrganizationId, wi.NotificationType), TimeSpan.FromMinutes(15), cancellationToken);
    }

    public static string GetLegacyNotificationLockKey(string organizationId, string notificationType)
    {
        return notificationType == OrganizationNotificationWorkItem.MonthlyNotificationType
            ? $"{nameof(OrganizationNotificationWorkItemHandler)}:{organizationId}:{notificationType}-lock"
            : GetNotificationLockKey(organizationId, notificationType);
    }

Force-pushed from 78bf071 to 97940c8.
Force-pushed from 97940c8 to 22de520.
Bug found (and fixed): hourly work items looped forever in the queue. When …

    if (lockValue == null)
    {
        await queueEntry.AbandonAsync();
        return JobResult.CancelledWithMessage("Unable to acquire work item lock...");
    }

Our handler was returning … The tests didn't catch this because the integration test helper called …

Fix: Return …

Two new regression tests added: …
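
The failure mode in this comment can be reproduced in miniature. The sketch below is a toy model, not the Foundatio `WorkItemJob`: it assumes, as the `GetWorkItemLockAsync` snippet earlier in the thread suggests, that hourly items received a `null` lock, and that a `null` lock causes the entry to be abandoned back to the queue. `AbandonLoopDemo`, `GetWorkItemLock`, and `Run` are hypothetical names, and the pass cap exists only so the demo terminates.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class AbandonLoopDemo
{
    // Hypothetical lock hook: hourly items get no lock, every other type gets one.
    public static object? GetWorkItemLock(string notificationType) =>
        notificationType == "hourly" ? null : new object();

    // Processes the queue for at most maxPasses dequeues and reports what happened.
    public static (int Abandons, int StillQueued) Run(Queue<string> queue, int maxPasses)
    {
        int abandons = 0;
        for (int pass = 0; pass < maxPasses && queue.Count > 0; pass++)
        {
            string item = queue.Dequeue();
            if (GetWorkItemLock(item) is null)
            {
                queue.Enqueue(item); // abandon: the entry returns to the queue
                abandons++;
            }
            // With a lock, the item would be handled and completed (not re-enqueued).
        }
        return (abandons, queue.Count);
    }

    public static void Main()
    {
        var queue = new Queue<string>(new[] { "hourly", "monthly" });
        var (abandons, stillQueued) = Run(queue, maxPasses: 5);
        Console.WriteLine($"abandons={abandons}, stillQueued={stillQueued}");
        // prints abandons=4, stillQueued=1
    }
}
```

The monthly item completes on its first pass, while the hourly item is abandoned on every pass it gets and never leaves the queue.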
Force-pushed from 22de520 to 5b7e9c6.
Second bug found (and fixed): The …

    do
    {
        gotLock = await _cacheClient.AddAsync(resource, lockId, timeUntilExpires);
        if (gotLock) break;
        // waits up to 3s for reset event before retrying
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        linked.CancelAfter(timeSpan); // timeSpan = min(remaining TTL, 3s)
        await autoResetEvent.WaitAsync(linked.Token);
    }

With a 30-minute lock TTL, the polling loop could run for up to 30 minutes, stalling a work item job slot for no correctness benefit (the sent marker already prevents duplicate emails once the first worker finishes). The new test …

Fix: Use the …
Force-pushed from 5b7e9c6 to 074c92d.
Third bug found (and fixed): The test helper was calling …

    // Before (wrong): always calls HandleItemAsync, even without the lock
    private async Task HandleWorkItemAsync(...)
    {
        await using var workItemLock = await Handler.GetWorkItemLockAsync(...);
        var context = new WorkItemContext(workItem, "test-job", workItemLock, ...);
        await Handler.HandleItemAsync(context); // called even when workItemLock == null!
    }

    // After (correct): mirrors production WorkItemJob behavior
    private async Task HandleWorkItemAsync(...)
    {
        await using var workItemLock = await Handler.GetWorkItemLockAsync(...);
        if (workItemLock is null)
            return; // WorkItemJob calls AbandonAsync here, not HandleItemAsync
        var context = new WorkItemContext(workItem, "test-job", workItemLock, ...);
        await Handler.HandleItemAsync(context);
    }

Why coverage didn't detect it: every call to …

New regression test added: …

Also fixed: …
Root cause: every web pod subscribes to PlanOverage at startup via EnqueueOrganizationNotificationOnPlanOverage. Foundatio pub/sub delivers each message to all subscribers, so a single monthly overage event enqueued one work item per running web pod. The original ThrottlingLockProvider(1/hour) allowed exactly one item through per calendar-hour bucket; abandoned duplicates were re-queued and reprocessed once each new bucket opened, producing one email per hour for each duplicate item.

Fix:
- Queue-level dedup: OrganizationNotificationWorkItem implements IHaveUniqueIdentifier and DuplicateDetectionQueueBehavior is registered so fanout enqueues collapse to one item.
- Handler-level idempotency: per-org distributed lock (30 min) + 24-hour sent marker ensure stale duplicates already in the queue at deploy time cannot retrigger an email.
- Hourly items short-circuit at GetWorkItemLockAsync and never enter the lock/sent-key path, preventing hourly overages from suppressing subsequent monthly notifications.

Also add RCA-pinning unit tests (TestWithServices) and integration tests (IntegrationTestsBase) covering fanout dedup, legacy hourly throttle regression, per-org isolation, 24h resend window, hourly-before-monthly ordering, and idempotency via existing sent marker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 074c92d to f10f791.
Root cause
Every web pod registers EnqueueOrganizationNotificationOnPlanOverage at startup. Foundatio pub/sub delivers each PlanOverage message to all subscribers, so a single monthly overage event enqueued one work item per running web pod (e.g. 6 pods → 6 identical items).

The previous ThrottlingLockProvider(slotsPerPeriod: 1, period: 1 hour) allowed exactly one item through per calendar-hour bucket. When a duplicate item lost the lock race it was abandoned back to the queue (not discarded). Once the next hour bucket opened, the item was reprocessed and acquired a fresh lock, producing one email per hour for each duplicate and matching the reported six-emails-over-a-day pattern.

The TimeSpan.FromMinutes(15) in the old GetWorkItemLockAsync was the work-item processing timeout (how long the lock was held during execution), not the throttle window; these are independent parameters.

Could the bot-cleanup job have retriggered the edge? Unlikely: bot event deletion removes documents from Elasticsearch but does not decrement the Redis usage counters that IncrementTotalAsync uses for edge detection, so the monthly overage edge would not re-fire from cleanup.

Fix
Two independent layers, both required:
1. Queue-level dedup: OrganizationNotificationWorkItem implements IHaveUniqueIdentifier and DuplicateDetectionQueueBehavior<WorkItemData> is registered in Bootstrapper. The unique identifier is Organization:{orgId}:notification:{type} (via GetNotificationKey). Fanout enqueues from all pods collapse to a single queue entry.
2. Handler-level idempotency: In OrganizationNotificationWorkItemHandler:
   - A per-org distributed lock (30 min) plus a 24-hour sent marker (Organization:{orgId}:notification:monthly-sent) ensures stale duplicates already in the queue at deploy time cannot retrigger an email.
   - Hourly items short-circuit at GetWorkItemLockAsync (return null lock) so they never occupy the lock/sent-key path and cannot suppress a later monthly notification.

Known limitations (acceptable trade-offs, documented in code comments):
- If SendOverageNotificationsAsync throws mid-loop, some recipients will already have received the email and will receive it again on retry. This is intentional: suppressing retries on partial failure would silently skip un-notified users.

Changes
| File | Change |
|---|---|
| OrganizationNotificationWorkItemHandler.cs | Replaced ThrottlingLockProvider with handler lock + 24h sent marker; hourly items bypass email path; 30-min lock lease; class XML doc with RCA |
| OrganizationNotificationWorkItem.cs | IHaveUniqueIdentifier, NotificationType constants, GetNotificationKey static helper; removed all legacy key helpers |
| Bootstrapper.cs | Registered DuplicateDetectionQueueBehavior<WorkItemData> via DI |
| OrganizationNotificationWorkItemHandlerTests.cs | … |
| OrganizationNotificationWorkItemHandlerIntegrationTests.cs | … |
| CountingMailer.cs | … |

Testing
All 10 notification tests pass. Full build is clean.
Breaking changes
None.
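
For reference, the handler-level sent marker described under Fix can be sketched without any cache library. This is a minimal model under stated assumptions: `SentMarkerStore` and `TryMark` are hypothetical names, the marker is written at the moment the send is approved (the real handler may set it at a different point), and time is passed in explicitly so the 24-hour window is easy to exercise. Only the key format and the 24-hour TTL come from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a cache entry with a time-to-live.
public sealed class SentMarkerStore
{
    private readonly Dictionary<string, DateTimeOffset> _expiresAt = new();

    // Returns true (and records a marker) only when no unexpired marker exists for the key.
    public bool TryMark(string key, DateTimeOffset now, TimeSpan ttl)
    {
        if (_expiresAt.TryGetValue(key, out var expires) && expires > now)
            return false; // already sent within the window: suppress

        _expiresAt[key] = now + ttl;
        return true;
    }
}

public static class SentMarkerDemo
{
    public static string MonthlySentKey(string orgId) =>
        $"Organization:{orgId}:notification:monthly-sent";

    public static void Main()
    {
        var store = new SentMarkerStore();
        var ttl = TimeSpan.FromHours(24);
        var start = new DateTimeOffset(2025, 1, 1, 0, 0, 0, TimeSpan.Zero);
        int emails = 0;

        // Six stale duplicates for the same org, processed one hour apart.
        for (int hour = 0; hour < 6; hour++)
            if (store.TryMark(MonthlySentKey("org-1"), start.AddHours(hour), ttl))
                emails++;

        // Another organization is tracked under its own key.
        if (store.TryMark(MonthlySentKey("org-2"), start, ttl))
            emails++;

        // Once the 24-hour window has passed, org-1 can be notified again.
        if (store.TryMark(MonthlySentKey("org-1"), start.AddHours(25), ttl))
            emails++;

        Console.WriteLine(emails); // prints 3
    }
}
```

With the old hourly throttle the six duplicates would each have produced an email; here only the first one does, while per-organization isolation and the 24-hour resend window are preserved.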