Skip to content

SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577

Open
linxiaokun528 wants to merge 2 commits into
apache:mainfrom
linxiaokun528:SOLR-18298
Open

SOLR-18298: ZkController.OnReconnect Is Trigger Unnecessarily#4577
linxiaokun528 wants to merge 2 commits into
apache:mainfrom
linxiaokun528:SOLR-18298

Conversation

@linxiaokun528

@linxiaokun528 linxiaokun528 commented Jul 1, 2026

Copy link
Copy Markdown

Description

Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following

"Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...");

But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance.

Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out.

Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.

Solution

I created a SolrCuratorEvent class as a Bridge. Now it only has two event types: EXPIRED_RECONNECTION and SESSION_EXPIRATION.
EXPIRED_RECONNECTION: Will only be triggered when Solr reconnects Zookeeper from a session expiration
SESSION_EXPIRATION: Will be triggered when a session expiration happens (The same to Curator's LOST event.)

We should use SolrCuratorEvent instead of Curator's event and we can add more types into it whenever needed.

I also renamed some methods to make them clearer:
ZkController.onReconnect -> ZkController.onExpiredReconnection
ZkController.onDisconnect -> ZkController.onSessionExpiration

Tests

In ZkControllerTest.java,

  1. I added a method testZkDisconnectionEvents to make sure when there is a session expiration, ZkController.onSessionExpiration is triggered.
  2. I added a method testZkReconnectionEvents to make sure when Solr reconnects to ZooKeeper from a session expiration, ZkController.onExpiredReconnection is triggered.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide (This change restores the original behavior (a performance fix), and does not introduce any new user-facing features or configuration changes. Therefore, I believe no Reference Guide update is required.)
  • I have added a changelog entry for my change

@ercsonusharma ercsonusharma left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down. It's fine to keep the current direction if you want the clean removal, but flagging that the same fix is achievable with much less surface change IMO.

SolrCuratorEvent is more machinery than needed. It's an enum + abstract of() + a new EventAction type that's really just Runnable renamed. The same behavior can live directly on the existing OnReconnect/OnDisconnect callbacks and no new type.
The AtomicBoolean that makes it correct just lives in the returned closure.

FWIW, The public OnReconnect/OnDisconnect interfaces published in solrj-zookeeper, so removing them is a breaking change. The bug is purely in the detection logic which is fixable while keeping the interfaces intact.

@dsmiley dsmiley requested a review from HoustonPutman July 3, 2026 06:25
@dsmiley

dsmiley commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Just a minor response to Sonu's comment:

FWIW, The public OnReconnect/OnDisconnect interfaces published in solrj-zookeeper, so removing them is a breaking change.

solrj-zookeeper is largely SolrCloud internals that only needs to be a client consumable JAR for those users who use it indirectly via a Zk based ClusterStateProvider. Maybe there are some gray areas. So if we think a (technically) breaking change here is better -- go for it.

Curator's RECONNECTED event is different from the previous RECONNECTED event. Before Solr10, the OnReconnect is only triggered after a reconnection from a session expiration. Check the following

https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L165
https://github.com/apache/solr/blob/fdb5314279657f7895a90123436d834e81ea3157/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/ConnectionManager.java#L199

But Curator's RECONNECTED event is triggered every time a Solr node is disconnected from a ZooKeeper instance and reconnected to another ZooKeeper instance.

Therefore, currently ZkController.onReconnect is invoked every time a Solr node reconnects to a ZooKeeper instance without a session expiration, which is a huge overhead, especially when we need to rolling-restart a ZooKeeper Cluster. It can take more than 10 minutes for a small Solr cluster to level out.

Similiarly, now ZkController.onDisconnect is triggered just after a disconnection from a Zookeeper instance. It should only be triggered after a session expiration.
@linxiaokun528

linxiaokun528 commented Jul 4, 2026

Copy link
Copy Markdown
Author

Thanks for tracking this down. It's fine to keep the current direction if you want the clean removal, but flagging that the same fix is achievable with much less surface change IMO.

SolrCuratorEvent is more machinery than needed. It's an enum + abstract of() + a new EventAction type that's really just Runnable renamed. The same behavior can live directly on the existing OnReconnect/OnDisconnect callbacks and no new type. The AtomicBoolean that makes it correct just lives in the returned closure.

FWIW, The public OnReconnect/OnDisconnect interfaces published in solrj-zookeeper, so removing them is a breaking change. The bug is purely in the detection logic which is fixable while keeping the interfaces intact.

@ercsonusharma Thanks for reviewing.
I created the new type EventAction/SolrCuratorEvent mainly for future usage. Curator does not contain events like EXPIRED_RECONNECTION. Since we need different events from those provided by Curator, I think we should create our own event types. In the future, we may also need something like UNEXPIRED_RECONNECTION(reconnect from SUSPEND instead of LOST).

The same behavior can live directly on the existing OnReconnect/OnDisconnect callbacks and no new type.

The only way I can figure out this is to have an AtomicBoolean variable named disconnected. When OnDisconnect is triggered, the variable is set to true. While OnReconnect is triggered, we use AtomicBoolean.compareAndSet to set it back to false. But in this way, the two methods are tightly coupled. Besides, as I mentioned above, we may need an event like expiredReconnect again in the future (or maybe now. We may have other places suffering the same kind of issue.) I think we should have an easier way to deal with this kind of event, instead of using AtomicBoolean every time.

@linxiaokun528

Copy link
Copy Markdown
Author

Sorry, I had to force push because I used the wrong author/email. But the code is exactly the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants