Skip to content

Fix duplicate plan limit overage emails#2269

Open
niemyjski wants to merge 1 commit into
mainfrom
niemyjski/fix-plan-limit-email-spam
Open

Fix duplicate plan limit overage emails#2269
niemyjski wants to merge 1 commit into
mainfrom
niemyjski/fix-plan-limit-email-spam

Conversation

@niemyjski
Copy link
Copy Markdown
Member

@niemyjski niemyjski commented May 29, 2026

Root cause

Every web pod registers EnqueueOrganizationNotificationOnPlanOverage at startup. Foundatio pub/sub delivers each PlanOverage message to all subscribers, so a single monthly overage event enqueued one work item per running web pod (e.g. 6 pods → 6 identical items).

The previous ThrottlingLockProvider(slotsPerPeriod: 1, period: 1 hour) allowed exactly one item through per calendar-hour bucket. When a duplicate item lost the lock race it was abandoned back to the queue (not discarded). Once the next hour bucket opened the item was reprocessed and acquired a fresh lock — producing one email per hour for each duplicate, matching the reported six-emails-over-a-day pattern.

The TimeSpan.FromMinutes(15) in the old GetWorkItemLockAsync was the work-item processing timeout (how long the lock was held during execution), not the throttle window — these are independent parameters.

Could the bot-cleanup job have retriggered the edge? Unlikely: bot event deletion removes documents from Elasticsearch but does not decrement the Redis usage counters that IncrementTotalAsync uses for edge detection, so the monthly overage edge would not re-fire from cleanup.

Fix

Two independent layers, both required:

1. Queue-level dedupOrganizationNotificationWorkItem implements IHaveUniqueIdentifier and DuplicateDetectionQueueBehavior<WorkItemData> is registered in Bootstrapper. The unique identifier is Organization:{orgId}:notification:{type} (via GetNotificationKey). Fanout enqueues from all pods collapse to a single queue entry.

2. Handler-level idempotency — In OrganizationNotificationWorkItemHandler:

  • Per-org distributed lock (30 min lease) serialises concurrent processing. The 30-minute lease comfortably exceeds the maximum realistic send-loop duration, eliminating the race window where a second worker could acquire a lock before the sent marker is written.
  • 24-hour sent marker (Organization:{orgId}:notification:monthly-sent) ensures stale duplicates already in the queue at deploy time cannot retrigger an email.
  • Hourly items short-circuit at GetWorkItemLockAsync (return null lock) so they never occupy the lock/sent-key path and cannot suppress a later monthly notification.

Known limitations (acceptable trade-offs, documented in code comments):

  • If SendOverageNotificationsAsync throws mid-loop, some recipients will already have received the email and will receive it again on retry. This is intentional: suppressing retries on partial failure would silently skip un-notified users.
  • The 24h sent window is not billing-period aware. An overage within 24h of a billing-cycle reset could be delayed. Follow-up if this becomes an issue.

Changes

File What changed
OrganizationNotificationWorkItemHandler.cs Replaced ThrottlingLockProvider with handler lock + 24h sent marker; hourly items bypass email path; 30-min lock lease; class XML doc with RCA
OrganizationNotificationWorkItem.cs Added IHaveUniqueIdentifier, NotificationType constants, GetNotificationKey static helper; removed all legacy key helpers
Bootstrapper.cs Registered DuplicateDetectionQueueBehavior<WorkItemData> via DI
OrganizationNotificationWorkItemHandlerTests.cs RCA-pinning unit tests: fanout without/with dedup, UniqueIdentifier format, legacy throttle regression
OrganizationNotificationWorkItemHandlerIntegrationTests.cs Integration tests against real handler: dedup across hours, per-org isolation, 24h resend, hourly-only no-send, hourly-before-monthly, sent-marker idempotency
CountingMailer.cs Thread-safe test mailer for counting/asserting sends

Testing

dotnet test tests/Exceptionless.Tests/ -- --filter-class Exceptionless.Tests.Jobs.WorkItemHandlers.OrganizationNotificationWorkItemHandlerTests
dotnet test tests/Exceptionless.Tests/ -- --filter-class Exceptionless.Tests.Jobs.WorkItemHandlers.OrganizationNotificationWorkItemHandlerIntegrationTests
dotnet test tests/Exceptionless.Tests/ -- --filter-class Exceptionless.Tests.Services.UsageServiceTests

All 10 notification tests pass. Full build is clean.

Breaking changes

None.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses duplicate “plan limit overage” organization emails by adding queue-level duplicate detection (via a stable work-item unique identifier) and by tightening handler-side suppression logic so stale duplicates can’t trigger extra monthly overage emails later.

Changes:

  • Add a stable UniqueIdentifier to OrganizationNotificationWorkItem to enable cross-pod/work-queue deduplication.
  • Update OrganizationNotificationWorkItemHandler to (a) ignore hourly-only items and (b) suppress repeat monthly sends using a per-organization 24h “monthly-sent” cache marker plus a monthly-only lock.
  • Add regression tests covering delayed duplicate processing, hourly-then-monthly ordering, org isolation, and queue dedup behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
tests/Exceptionless.Tests/Mail/CountingMailer.cs Adds a test mailer that records organization-notice sends for assertions.
tests/Exceptionless.Tests/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandlerTests.cs Adds regression tests for duplicate enqueue/processing and correct monthly notification behavior.
src/Exceptionless.Core/Models/WorkItems/OrganizationNotificationWorkItem.cs Implements IHaveUniqueIdentifier to provide a stable dedup key per org + overage type.
src/Exceptionless.Core/Jobs/WorkItemHandlers/OrganizationNotificationWorkItemHandler.cs Reworks handler throttling/suppression: monthly-only lock + 24h “sent” marker; hourly-only items no longer suppress monthly.
src/Exceptionless.Core/Bootstrapper.cs Registers DuplicateDetectionQueueBehavior<WorkItemData> and wires queue behaviors into queue creation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/Exceptionless.Core/Models/WorkItems/OrganizationNotificationWorkItem.cs Outdated
@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 268ea30 to 78bf071 Compare May 30, 2026 02:17
@niemyjski
Copy link
Copy Markdown
Member Author

Updated this PR with the deeper RCA and coverage. The most likely failure mode is: (1) every web pod subscribes to PlanOverage, (2) a monthly PlanOverage fans out to each pod and each pod enqueues the same notification work item, and (3) the old ThrottlingLockProvider(1/hour) buckets by hour, so once duplicate monthly work items existed it could leak one email per hour bucket. I couldnt support the bot-cleanup theory from the code because removing bot events doesnt decrement organization usage totals, and monthly PlanOverage is edge-triggered in UsageService.IncrementTotalAsync. The updated tests now cover the fanout path, the legacy hourly leak, the real handler suppression path, hourly-before-monthly ordering, 24-hour resend behavior, and legacy sent-key compatibility.

Comment on lines +54 to +68
services.ReplaceSingleton<ICacheClient>(sp => new InMemoryCacheClient(new InMemoryCacheClientOptions
{
TimeProvider = sp.GetRequiredService<TimeProvider>(),
LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
}));

services.ReplaceSingleton<IMessageBus>(sp => new InMemoryMessageBus(new InMemoryMessageBusOptions
{
Serializer = sp.GetRequiredService<ISerializer>(),
TimeProvider = sp.GetRequiredService<TimeProvider>(),
LoggerFactory = sp.GetRequiredService<ILoggerFactory>()
}));

services.ReplaceSingleton<IMessagePublisher>(sp => sp.GetRequiredService<IMessageBus>());
services.ReplaceSingleton<IMessageSubscriber>(sp => sp.GetRequiredService<IMessageBus>());
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should already be the default, why are we registering it again?

}

[Fact]
public async Task RunAsync_WhenOnePlanOverageIsObservedBySixSubscribersWithQueueDedup_ShouldEnqueueOneWorkItem()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

three part name.. check pr

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment on lines +65 to +72
public override Task<ILock?> GetWorkItemLockAsync(object workItem, CancellationToken cancellationToken = default)
{
var wi = (OrganizationNotificationWorkItem)workItem;
if (!ShouldSendNotificationEmail(wi))
return Task.FromResult<ILock?>(null);

return _lockProvider.TryAcquireAsync(GetLegacyNotificationLockKey(wi.OrganizationId, wi.NotificationType), TimeSpan.FromMinutes(15), cancellationToken);
}
Comment on lines +150 to +155
public static string GetLegacyNotificationLockKey(string organizationId, string notificationType)
{
return notificationType == OrganizationNotificationWorkItem.MonthlyNotificationType
? $"{nameof(OrganizationNotificationWorkItemHandler)}:{organizationId}:{notificationType}-lock"
: GetNotificationLockKey(organizationId, notificationType);
}
Comment on lines +65 to +72
public override Task<ILock?> GetWorkItemLockAsync(object workItem, CancellationToken cancellationToken = default)
{
var wi = (OrganizationNotificationWorkItem)workItem;
if (!ShouldSendNotificationEmail(wi))
return Task.FromResult<ILock?>(null);

return _lockProvider.TryAcquireAsync(GetLegacyNotificationLockKey(wi.OrganizationId, wi.NotificationType), TimeSpan.FromMinutes(15), cancellationToken);
}
Comment on lines +150 to +155
public static string GetLegacyNotificationLockKey(string organizationId, string notificationType)
{
return notificationType == OrganizationNotificationWorkItem.MonthlyNotificationType
? $"{nameof(OrganizationNotificationWorkItemHandler)}:{organizationId}:{notificationType}-lock"
: GetNotificationLockKey(organizationId, notificationType);
}
Comment on lines +65 to +72
public override Task<ILock?> GetWorkItemLockAsync(object workItem, CancellationToken cancellationToken = default)
{
var wi = (OrganizationNotificationWorkItem)workItem;
if (!ShouldSendNotificationEmail(wi))
return Task.FromResult<ILock?>(null);

return _lockProvider.TryAcquireAsync(GetLegacyNotificationLockKey(wi.OrganizationId, wi.NotificationType), TimeSpan.FromMinutes(15), cancellationToken);
}
Comment on lines +150 to +155
public static string GetLegacyNotificationLockKey(string organizationId, string notificationType)
{
return notificationType == OrganizationNotificationWorkItem.MonthlyNotificationType
? $"{nameof(OrganizationNotificationWorkItemHandler)}:{organizationId}:{notificationType}-lock"
: GetNotificationLockKey(organizationId, notificationType);
}
@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 78bf071 to 97940c8 Compare May 30, 2026 03:17
@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 97940c8 to 22de520 Compare May 30, 2026 12:29
@niemyjski
Copy link
Copy Markdown
Member Author

Bug found (and fixed): hourly work items looped forever in the queue.

When GetWorkItemLockAsync returns null, Foundatio's WorkItemJob calls AbandonAsync on the queue entry — confirmed by decompiling the Foundatio 13.0.1 DLL:

if (lockValue == null)
{
    await queueEntry.AbandonAsync();
    return JobResult.CancelledWithMessage("Unable to acquire work item lock...");
}

Our handler was returning null for hourly items (to skip the distributed lock), but that caused the WorkItemJob to abandon them repeatedly — they would retry forever and HandleItemAsync was never called in production.

The tests didn't catch this because the integration test helper called HandleItemAsync directly even when GetWorkItemLockAsync returned null, bypassing the abandon path entirely.

Fix: Return base.GetWorkItemLockAsync(workItem, cancellationToken) (= Disposable.EmptyLock) for hourly items so WorkItemJob proceeds normally, HandleItemAsync is called and exits early at the ShouldSendNotificationEmail guard, and the item is completed.

Two new regression tests added:

  • GetWorkItemLockAsync_WhenWorkItemIsHourly_ShouldReturnNonNullLockSoItemIsNotAbandoned — would have failed on the old code
  • GetWorkItemLockAsync_WhenMonthlyLockIsAlreadyHeld_ShouldReturnNullSoConcurrentItemIsAbandoned — verifies monthly concurrent items are correctly abandoned

@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 22de520 to 5b7e9c6 Compare May 30, 2026 12:42
@niemyjski
Copy link
Copy Markdown
Member Author

Second bug found (and fixed): GetWorkItemLockAsync blocked indefinitely instead of failing fast.

The TryAcquireAsync(key, timeUntilExpires, cancellationToken) overload loops and polls every ≤3 s until the lock becomes available or the cancellation token fires — confirmed via decompilation:

do
{
    gotLock = await _cacheClient.AddAsync(resource, lockId, timeUntilExpires);
    if (gotLock) break;
    // waits up to 3s for reset event before retrying
    using var linked = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    linked.CancelAfter(timeSpan); // timeSpan = min(remaining TTL, 3s)
    await autoResetEvent.WaitAsync(linked.Token);
}

With a 30-minute lock TTL, the polling loop could run for up to 30 minutes, stalling a work item job slot for no correctness benefit (the sent marker already prevents duplicate emails once the first worker finishes).

The new test GetWorkItemLockAsync_WhenMonthlyLockIsAlreadyHeld_ShouldReturnNullSoConcurrentItemIsAbandoned would have hung with the old code — the second lock call would block until the test timeout fired.

Fix: Use the TryAcquireAsync(key, timeUntilExpires, acquireTimeout: TimeSpan.Zero) overload. This creates an already-cancelled CancellationTokenSource, causing the do-while to try once and return null immediately if the lock is held. The work item is abandoned and retried later, at which point the sent marker is already set and the email is skipped.

@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 5b7e9c6 to 074c92d Compare May 30, 2026 13:43
@niemyjski
Copy link
Copy Markdown
Member Author

Third bug found (and fixed): HandleWorkItemAsync test helper bypassed the null-lock contract.

The test helper was calling HandleItemAsync even when GetWorkItemLockAsync returned null. In production, Foundatio's WorkItemJob calls AbandonAsync when the lock is null and never calls HandleItemAsync. The helper got the semantics wrong.

// Before (wrong): always calls HandleItemAsync, even without the lock
private async Task HandleWorkItemAsync(...)
{
    await using var workItemLock = await Handler.GetWorkItemLockAsync(...);
    var context = new WorkItemContext(workItem, "test-job", workItemLock, ...);
    await Handler.HandleItemAsync(context); // called even when workItemLock == null!
}

// After (correct): mirrors production WorkItemJob behavior
private async Task HandleWorkItemAsync(...)
{
    await using var workItemLock = await Handler.GetWorkItemLockAsync(...);
    if (workItemLock is null)
        return; // WorkItemJob calls AbandonAsync here, not HandleItemAsync
    var context = new WorkItemContext(workItem, "test-job", workItemLock, ...);
    await Handler.HandleItemAsync(context);
}

Why coverage didn't detect it: every call to HandleWorkItemAsync in the existing tests used sequential execution (lock released between calls), so GetWorkItemLockAsync never returned null in practice. The test that verified the null return (GetWorkItemLockAsync_WhenMonthlyLockIsAlreadyHeld_ShouldReturnNullSoConcurrentItemIsAbandoned) only checked the return value — it didn't test what happened after.

New regression test added: HandleItemAsync_WhenConcurrentWorkerHoldsMonthlyLock_ShouldNotSendEmail — pre-acquires the lock, then calls the helper for a second identical item, and asserts count = 0. This test would fail with the old helper (HandleItemAsync would bypass the lock, find no sent marker, and send a duplicate email).

Also fixed: DuplicateDetectionQueueBehavior was created inline in the unit test without being captured in a using var declaration. The type implements IDisposable and the CodeQL report flagged this. Fixed by capturing it: using var dedupBehavior = new DuplicateDetectionQueueBehavior<WorkItemData>(...).

Root cause: every web pod subscribes to PlanOverage at startup via
EnqueueOrganizationNotificationOnPlanOverage. Foundatio pub/sub delivers each message to all
subscribers, so a single monthly overage event enqueued one work item per running web pod. The
original ThrottlingLockProvider(1/hour) allowed exactly one item through per calendar-hour bucket;
abandoned duplicates were re-queued and reprocessed once each new bucket opened — producing one
email per hour for each duplicate item.

Fix:
- Queue-level dedup: OrganizationNotificationWorkItem implements IHaveUniqueIdentifier and
  DuplicateDetectionQueueBehavior is registered so fanout enqueues collapse to one item.
- Handler-level idempotency: per-org distributed lock (30 min) + 24-hour sent marker ensure
  stale duplicates already in the queue at deploy time cannot retrigger an email.
- Hourly items short-circuit at GetWorkItemLockAsync and never enter the lock/sent-key path,
  preventing hourly overages from suppressing subsequent monthly notifications.

Also add RCA-pinning unit tests (TestWithServices) and integration tests (IntegrationTestsBase)
covering fanout dedup, legacy hourly throttle regression, per-org isolation, 24h resend window,
hourly-before-monthly ordering, and idempotency via existing sent marker.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@niemyjski niemyjski force-pushed the niemyjski/fix-plan-limit-email-spam branch from 074c92d to f10f791 Compare May 30, 2026 14:21
@github-actions
Copy link
Copy Markdown

Code Coverage

Package Line Rate Branch Rate Complexity Health
Exceptionless.Insulation 25% 23% 203
Exceptionless.Web 73% 62% 3922
Exceptionless.AppHost 18% 9% 82
Exceptionless.Core 69% 63% 7895
Summary 68% (13657 / 19951) 62% (7198 / 11648) 12102

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants