[Cosmos] NPE 'rootUri is null' in RxGatewayStoreModel.getUri when ThinClientStoreModel resolves to defaultRoutingContext (unmatched preferredRegions on multi-master thin-client account) #49299

@jeet1995

Summary

ThinClientStoreModel.getRootUri() returns null whenever the routing fallback chain bottoms out at defaultRoutingContext, because defaultRoutingContext.thinclientRegionalEndpoint is never populated. The null then triggers NullPointerException: Cannot invoke "java.net.URI.getHost()" because "rootUri" is null at RxGatewayStoreModel.getUri:421.

This systematically breaks ~19 query tests (every *QueryTest* / ReadFeed*Test in the query profile) running against thin-client-enabled multi-master accounts in CI, blocking PR merges (e.g. #49090, #49258, and any PR that runs Public_Cosmos_Live_Test_ThinClient_MultiRegion).

Stack trace (from CI log)

java.lang.NullPointerException: Cannot invoke "java.net.URI.getHost()" because "rootUri" is null
    at com.azure.cosmos.implementation.RxGatewayStoreModel.getUri(RxGatewayStoreModel.java:421)
    at com.azure.cosmos.implementation.RxGatewayStoreModel.performRequest(RxGatewayStoreModel.java:301)
    at com.azure.cosmos.implementation.RxGatewayStoreModel.query(RxGatewayStoreModel.java:281)
    at com.azure.cosmos.implementation.RxGatewayStoreModel.invokeAsyncInternal(RxGatewayStoreModel.java:789)
    at com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$invokeAsync$0(RxGatewayStoreModel.java:797)
    at com.azure.cosmos.implementation.BackoffRetryUtility.lambda$executeRetry$0(BackoffRetryUtility.java:36)

Trigger conditions (all must hold)

  1. SDK client has COSMOS.THINCLIENT_ENABLED=true + HTTP2 enabled → useThinClient=true
  2. Account has thinClientReadableLocations (federation has IsThinClientEnabled=true server-side)
  3. Account enableMultipleWriteLocations=true AND client multipleWriteRegionsEnabled=true
  4. Client preferredRegions does not match any account region (e.g. [East US 2] against an account with [West Central US, East US 3])
  5. Request is ResourceType.Document (Document queries / bulk delete during truncateCollection)

Root cause

  1. LocationCache constructor (line 72) builds defaultRoutingContext = new RegionalRoutingContext(defaultEndpoint) from the global account URL. RegionalRoutingContext constructor only sets gatewayRegionalEndpoint; thinclientRegionalEndpoint remains null.
  2. LocationCache.addRoutingContexts() (lines 947-963) is the only place setThinclientRegionalEndpoint(...) is ever called, and it iterates regional endpoints only — defaultRoutingContext is never threaded through it.
  3. With preferred regions that do not match, getPreferredAvailableRoutingContexts() returns an empty endpoint list and falls back to fallbackRegionalRoutingContext (line 887/903). For the WRITE path the fallback is defaultRoutingContext; for the READ path the fallback is writeRegionalRoutingContexts.get(0), which itself fell back to defaultRoutingContext.
  4. For a Document request, RxDocumentClientImpl.useThinClientStoreModel(request) returns true (useThinClient + hasThinClientReadLocations() + ResourceType.Document), so the request is routed through ThinClientStoreModel.
  5. ThinClientStoreModel.getRootUri() returns resolveServiceEndpoint(req).getThinclientRegionalEndpoint() → reads null from defaultRoutingContext → rootUri is null at RxGatewayStoreModel.getUri:421.
// ThinClientStoreModel.java:96
@Override
public URI getRootUri(RxDocumentServiceRequest request) {
    // need to have thin client endpoint here
    return this.globalEndpointManager.resolveServiceEndpoint(request).getThinclientRegionalEndpoint();
}
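
The chain in steps 1 to 5 can be illustrated without the SDK. The following is a minimal sketch, not the SDK's code: RoutingContext is a hypothetical stand-in for RegionalRoutingContext, and resolve() is a deliberately simplified model of the preferred-region fallback (it ignores the read/write distinction and availability tracking). It only shows why a context built from the global account URL carries no thin-client endpoint.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FallbackSketch {

    // Hypothetical stand-in for RegionalRoutingContext: the constructor
    // sets only the gateway endpoint, as described in step 1.
    static final class RoutingContext {
        final URI gateway;
        URI thinClient; // remains null unless explicitly populated

        RoutingContext(URI gateway) {
            this.gateway = gateway;
        }
    }

    // Regional contexts receive a thin-client endpoint (step 2).
    static Map<String, RoutingContext> regionalContexts() {
        Map<String, RoutingContext> byRegion = new LinkedHashMap<>();
        RoutingContext wcus = new RoutingContext(
            URI.create("https://acct-westcentralus.documents.azure.com:443/"));
        wcus.thinClient = URI.create("https://acct-westcentralus.documents.azure.com:10250/");
        byRegion.put("West Central US", wcus);
        RoutingContext eus3 = new RoutingContext(
            URI.create("https://acct-eastus3.documents.azure.com:443/"));
        eus3.thinClient = URI.create("https://acct-eastus3.documents.azure.com:10250/");
        byRegion.put("East US 3", eus3);
        return byRegion;
    }

    // Simplified resolution: the first preferred region that exists wins,
    // otherwise the default context is returned (step 3).
    static RoutingContext resolve(List<String> preferredRegions) {
        // The default context is built from the global account URL only.
        RoutingContext defaultContext =
            new RoutingContext(URI.create("https://acct.documents.azure.com:443/"));
        Map<String, RoutingContext> byRegion = regionalContexts();
        for (String region : preferredRegions) {
            RoutingContext ctx = byRegion.get(region);
            if (ctx != null) {
                return ctx;
            }
        }
        return defaultContext;
    }

    public static void main(String[] args) {
        RoutingContext matched = resolve(List.of("East US 3"));
        RoutingContext unmatched = resolve(List.of("East US 2"));
        System.out.println("matched thin-client endpoint:   " + matched.thinClient);
        System.out.println("unmatched thin-client endpoint: " + unmatched.thinClient);
    }
}
```

In this model, any caller that dereferences thinClient on the unmatched result would hit the same kind of NullPointerException as the stack trace above.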

Minimal reproduction

Verified on both the current main (no PR) and on PR #49090 — byte-identical output, no network required:

// loc(name, endpoint) is a helper (not shown) that builds a DatabaseAccountLocation.
DatabaseAccount dbAccount = new DatabaseAccount();
dbAccount.setEnableMultipleWriteLocations(true);
List<DatabaseAccountLocation> readable = Arrays.asList(
    loc("West Central US", "https://acct-westcentralus.documents.azure.com:443/"),
    loc("East US 3",       "https://acct-eastus3.documents.azure.com:443/"));
dbAccount.setReadableLocations(readable);
dbAccount.setWritableLocations(readable);
List<DatabaseAccountLocation> tcLocs = Arrays.asList(
    loc("West Central US", "https://acct-westcentralus.documents.azure.com:10250/"),
    loc("East US 3",       "https://acct-eastus3.documents.azure.com:10250/"));
dbAccount.set(Constants.Properties.THINCLIENT_READABLE_LOCATIONS, tcLocs);
dbAccount.set(Constants.Properties.THINCLIENT_WRITABLE_LOCATIONS, tcLocs);

ConnectionPolicy policy = new ConnectionPolicy(DirectConnectionConfig.getDefaultConfig());
policy.setEndpointDiscoveryEnabled(true);
policy.setMultipleWriteRegionsEnabled(true);
policy.setPreferredRegions(Arrays.asList("East US 2"));   // unmatched

LocationCache cache = new LocationCache(policy,
    new URI("https://acct.documents.azure.com:443/"), new Configs());
cache.onDatabaseAccountRead(dbAccount);

RxDocumentServiceRequest req = RxDocumentServiceRequest.create(
    null, OperationType.Query, ResourceType.Document,
    "/dbs/db1/colls/col1/docs", new HashMap<>());

RegionalRoutingContext resolved = cache.resolveServiceEndpoint(req);
assert resolved.getGatewayRegionalEndpoint() != null;     // OK
assert resolved.getThinclientRegionalEndpoint() == null;  // BUG

Output:

Resolved gateway endpoint:    https://acct.documents.azure.com:443/
Resolved thinclient endpoint: null
Matches defaultEndpoint? true

Suggested fixes (any one)

  1. Null-check in ThinClientStoreModel.getRootUri() (smallest, defensive):
    public URI getRootUri(RxDocumentServiceRequest request) {
        RegionalRoutingContext ctx = this.globalEndpointManager.resolveServiceEndpoint(request);
        URI tc = ctx.getThinclientRegionalEndpoint();
        return tc != null ? tc : ctx.getGatewayRegionalEndpoint();
    }
  2. Populate defaultRoutingContext.thinclientRegionalEndpoint when thin-client locations are present (correct fix, keeps RegionalRoutingContext invariants consistent).
  3. Tighten useThinClientStoreModel(request) to also require that the resolved context has a non-null thinclient endpoint.
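
Option 3 can be sketched as a pure predicate. This is an illustration under stated assumptions, not the SDK's implementation: RoutingContext is a hypothetical stand-in for RegionalRoutingContext, the three boolean parameters stand for the existing conditions named in the root cause (useThinClient, hasThinClientReadLocations(), ResourceType.Document), and how the real method would obtain the resolved context is left open.

```java
import java.net.URI;

public class ThinClientGateSketch {

    // Hypothetical stand-in for RegionalRoutingContext.
    static final class RoutingContext {
        final URI gateway;
        final URI thinClient; // may be null for the default context

        RoutingContext(URI gateway, URI thinClient) {
            this.gateway = gateway;
            this.thinClient = thinClient;
        }
    }

    // Existing conditions plus the proposed extra check: only route through
    // the thin-client store model if the resolved context has a thin-client endpoint.
    static boolean useThinClientStoreModel(boolean useThinClient,
                                           boolean hasThinClientReadLocations,
                                           boolean isDocumentRequest,
                                           RoutingContext resolved) {
        return useThinClient
            && hasThinClientReadLocations
            && isDocumentRequest
            && resolved != null
            && resolved.thinClient != null;
    }

    public static void main(String[] args) {
        RoutingContext defaultCtx = new RoutingContext(
            URI.create("https://acct.documents.azure.com:443/"), null);
        RoutingContext regionalCtx = new RoutingContext(
            URI.create("https://acct-eastus3.documents.azure.com:443/"),
            URI.create("https://acct-eastus3.documents.azure.com:10250/"));
        System.out.println("default context:  " + useThinClientStoreModel(true, true, true, defaultCtx));
        System.out.println("regional context: " + useThinClientStoreModel(true, true, true, regionalCtx));
    }
}
```

With this gate, a request that resolves to the default context would fall through to the regular gateway store model instead of failing, at the cost of silently not using thin client for that request.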

Concurrent test-infrastructure issue

sdk/cosmos/live-platform-matrix.json, live-thinclient-platform-matrix.json, and live-http2-platform-matrix.json hard-code "PREFERRED_LOCATIONS": "[\"East US 2\"]" under MultiMaster_MultiRegion ArmConfig entries. The live thin-client static accounts (thin-client-multi-writer-ci, thin-client-multi-region-ci) only have [West Central US, East US 3] — no East US 2 — so the hardcoded preferred region never matches and unconditionally triggers the fallback path that exposes this bug. Even after the SDK fix lands, the matrix should be updated to use a region the static account actually has, otherwise we are masking other test signal.
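
A guard against this kind of mismatch could be a small pre-flight check in the test setup. The sketch below is hypothetical and not part of the repository: the class and method names are invented for illustration, and the normalization rule (case-insensitive, spaces ignored) is an assumption that may differ from how the SDK compares region names.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

public class PreferredRegionCheck {

    // Assumption: region names are compared case-insensitively, ignoring spaces.
    static String normalize(String region) {
        return region.replace(" ", "").toLowerCase(Locale.ROOT);
    }

    // Returns the preferred regions that the account actually has.
    static List<String> matchingRegions(List<String> preferred, List<String> accountRegions) {
        Set<String> known = accountRegions.stream()
            .map(PreferredRegionCheck::normalize)
            .collect(Collectors.toSet());
        return preferred.stream()
            .filter(region -> known.contains(normalize(region)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> accountRegions = List.of("West Central US", "East US 3");
        List<String> preferred = List.of("East US 2"); // value hard-coded in the matrix
        if (matchingRegions(preferred, accountRegions).isEmpty()) {
            System.out.println("No preferred region matches the account; "
                + "requests will take the fallback path.");
        }
    }
}
```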

Impact

cc @FabianMeiswinkel

Labels: Cosmos, bug, needs-team-attention
