Incorrect Master/Slave Node Down Emails


  • #1

    After I performed the first NxFilter update since creating the cluster, going from 4.6.4.1 to 4.6.4.3 and the following day to 4.6.4.5, I am receiving tons of incorrect Master/Slave Node Down emails and I am not sure how to stop them. NxFilter is installed on Linux Mint on both nodes, and both are up and running just fine. We have the line "cluster_double_check = 1" in the cfg.properties file. As of 12:00 AM I had received 150 emails. The last one arrived at 7:45 AM, saying the slave was down at 7:15 AM; I had received that same email every 5 minutes starting at 7:20 AM. The whole time our slave node was up and running, which the Cluster section of the master node also showed. I also receive emails from the slave node saying "Cluster master node down at null". Some log information is below:

    INFO [01-04 07:09:59] - NodeListener.SockHandler.dealCltSock, workCnt-- = 0 by 10.104.1.34.
    INFO [01-04 07:10:00] - AlertMan.writeSlaveDownEmail, Adding a slave node down alert email.
    INFO [01-04 07:10:00] - PostBox.addEmail, An email added, Slave node down!.
    INFO [01-04 07:10:04] - NodeListener.SockHandler.dealHello, line = /HELLO 24891 4192 1 454, cltIp = 10.104.1.34.
    INFO [01-04 07:10:04] - PostBox.sendEmail, An alert email has been sent, Slave node down!.
    INFO [01-04 07:45:04] - NLSHr, workCnt++ = 1 by 10.104.1.34.
    INFO [01-04 07:45:04] - NodeListener.SockHandler.dealHello, line = /HELLO 28475 4593 1 492, cltIp = 10.104.1.34.
    INFO [01-04 07:45:05] - PostBox.sendEmail, An alert email has been sent, Slave node down!.

  • #2
    This one, "NodeListener.SockHandler.dealCltSock, workCnt-- = 0 by 10.104.1.34", means that 10.104.1.34 lost its connection to your master node, and then it connected again. Are you sure that it was running between [01-04 07:09:59] and [01-04 07:45:04]?
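The workCnt bookkeeping those log lines suggest can be sketched as a simple connection counter: increment on a slave socket connect, decrement on a socket close, and queue a "Slave node down" alert when the count reaches zero. This is a minimal illustration of the pattern, not NxFilter's actual code; the class and method names are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the master's slave-connection tracking.
public class NodeTracker {
    private final AtomicInteger workCnt = new AtomicInteger(0);

    // Called when a slave opens its persistent socket ("workCnt++" in the log).
    public int onConnect(String cltIp) {
        return workCnt.incrementAndGet();
    }

    // Called on a socket-close event ("workCnt-- = 0" in the log).
    // Returns true when the last slave connection is gone, i.e. when
    // a "Slave node down" alert email would be queued.
    public boolean onDisconnect(String cltIp) {
        return workCnt.decrementAndGet() == 0;
    }
}
```

Under this model, a single close/reconnect cycle is enough to queue one alert even though the slave process never stopped, which matches the 07:09:59 close followed by the 07:10:04 HELLO in the log above.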

    Comment


    • #3
      Yes, it was up and running the whole time. I should have taken screenshots of the Master and Slave Cluster pages, but I forgot to do that.

      Comment


      • #4
        There could still be a connection problem even if it's running. 'NodeListener.SockHandler.dealCltSock, workCnt-- = 0' happens when a socket connection is closed.

        And the connection was restored afterwards: INFO [01-04 07:45:04] - NLSHr, workCnt++ = 1 by 10.104.1.34.

        Comment


        • #5
          And I don't think it's related to the update. I am testing my own cluster: when I stop my slave node I get that email, but after I restart it I don't get any more.

          Did you get a Master Down email from your slave node as well? Did you get both Slave Down and Master Down emails?

          Comment


          • #6
            Yes, I get them both, but more Slave Node than Master Node; the Master Node emails say "Cluster master node down at null". The last one was at 7:45, but they start up again after taking a "break". Last night the last one was at 12:50 AM, and then it didn't start up again after 4:05 AM. I created the cluster on December 2 and it ran fine (no Node Down emails) until I did the update 2 days ago.

            [Attached screenshot: image.png]

            Comment


            • #7
              They check each other every 5 minutes. Until the connection is restored, they will keep sending you emails. If you get both kinds of emails, it means both nodes were running but their socket connection had been closed. The slave node also checks its DB connection to its master node. Anyway, something happened between them that closed their socket connection; I don't know what caused it though.
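The 5-minute repetition described above can be sketched as a scheduled check that re-sends the alert on every pass while the node still looks down. The interval and method names here are assumptions for illustration, not NxFilter internals.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the periodic down-check behind the repeated emails.
public class DownChecker {
    static final Duration CHECK_INTERVAL = Duration.ofMinutes(5);

    // True when the node has gone more than one check interval without
    // contact. A scheduler calling this every 5 minutes sends one alert
    // email per pass until contact resumes, which is why the poster sees
    // the same "down at 12:50" email at 12:55, 13:00, 13:05, and 13:10.
    public static boolean shouldAlert(Instant lastContact, Instant now) {
        return Duration.between(lastContact, now).compareTo(CHECK_INTERVAL) > 0;
    }
}
```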

              Comment


              • #8
                I got some screenshots now. I received an email at 12:55, 13:00, 13:05, and 13:10 stating "Cluster slave node down at 01-04 12:50, 10.104.1.34", yet the screenshots below (the first from the Master, the second from the Slave) show that it was up in NxFilter at 12:55. So why three more emails saying it was down at 12:50?

                [Attached screenshots: master and slave cluster status pages]

                Comment


                • #9
                  So, you still get those emails even if your slave is running and you see the Last Contact from the slave getting updated?

                  Comment


                  • #10
                    Yes, indeed; even on the NxFilter Dashboard you can see that they are connected, yet it keeps sending emails. After I posted my message I received another wave of emails. It drove me nuts, so I turned them off (I wrote an Outlook inbox rule) and we are now checking their status with Nagios instead. Nagios performs a DNS query every 5 minutes and lets me know if they are not responding, so in a way I've created my own fix.
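The poster's workaround (an external monitor sending a DNS query to each NxFilter node every few minutes) could be sketched with the JDK's built-in JNDI DNS provider. The IP and hostname below are placeholders; this is an illustration of the approach, not the poster's actual Nagios check.

```java
import java.util.Hashtable;
import javax.naming.directory.InitialDirContext;

// Hypothetical external DNS health check, in the spirit of the
// Nagios check the poster set up.
public class DnsHealthCheck {
    // Returns true if the DNS server at serverIp answers a lookup
    // for the given name; false on timeout or any other error.
    public static boolean isUp(String serverIp, String name) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put("java.naming.factory.initial",
                "com.sun.jndi.dns.DnsContextFactory");
        env.put("java.naming.provider.url", "dns://" + serverIp);
        env.put("com.sun.jndi.dns.timeout.initial", "2000"); // 2 s timeout
        env.put("com.sun.jndi.dns.timeout.retries", "1");    // single retry
        try {
            new InitialDirContext(env).getAttributes(name, new String[] {"A"});
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Note the limitation the developer raises later in the thread: this only proves each node answers DNS, not that the cluster's own node-to-node socket is healthy.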

                    Comment


                    • #11
                      Does your Last Contact get updated every minute, or does it sometimes miss several updates?

                      I don't think it's related to the update. I have been running a cluster since last night and I don't get any email unless I stop my slave node.

                      Do you still get both Master Down and Slave Down emails? If both nodes are running but don't see each other, it's a communication problem between the nodes.

                      Comment


                      • #12
                        If you send me all the log files from both nodes I will have a look at it. support @ nxfilter.org.

                        Comment


                        • #13
                          They see each other just fine; the "Last Contact" gets updated every minute, even during the waves of down emails I receive.

                          Comment


                          • #14
                            Don't worry about it, I found another solution (Nagios) that truly shows me when they are down. I know they communicate well, because load balancing works between the two and the Last Contact updates as well; I just don't know what triggers the emails. Anyway, again, no worries, I have another solution now.

                            Comment


                            • #15
                              There is a persistent TCP connection between the 2 nodes, and when it is closed each one thinks it has lost the other side. What I am saying is that it's a socket connection; we don't close it. There's a hello message, or ping, every minute over that socket connection, and that is what writes the Last Contact time. It might be an intermittent connection problem; if the connection keeps opening and closing, you could end up in your current situation, I guess.

                              Cluster node checking is not just health checking. Your slave node needs to sync its policy config from the master node, and the nodes share IP sessions. If it is an intermittent connection problem, your Nagios health check will still pass, and I guess clustering would mostly keep working. Still, there might be another problem if it really is a connection issue.

                              You can keep using it as it is if you are OK with that, but if you run into another issue, look into the other possibilities I mentioned.

                              Comment
