Skip to content

Add Watchdog timer reset in kill() #3077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Blue-Marlin
Copy link
Contributor

Add watchdog reset in the kill loop to simplify recovering with Arduino Mega based boards..

Fix for #3076

Add watchdog reset in the kill loop to simplify recovering.
@thinkyhead
Copy link
Member

Does this do what we want in all cases? From what I can see the watchdog timer is set up to have the board reset if watchdog_reset is not called within a 4 second period. So it looks like the empty loop in kill() is intended to cause the board to reset after 4 seconds. The change you propose will cause the board to hang forever, until a manual reset. If that is what we're after, great.

Some interesting things keep coming up about resetting an Arduino with Watchdog enabled. This of course:

Another thing I found in the datasheet:

Quote:

Note: If the Watchdog is accidentally enabled, for example by a runaway pointer or brown-out condition, the device will be reset and the Watchdog Timer will stay enabled. If the code is not set up to handle the Watchdog, this might lead to an eternal loop of time-out resets. To avoid this situation, the application software should always clear the Watchdog System Reset Flag (WDRF) and the WDE control bit in the initialization routine, even if the Watchdog is not in use.

And… from this forum post at AVR Freaks we have:

The critical step missing in your original code is to reset the WDRF bit in MCUSR. From the ATmega328 data sheet:

WDE is overridden by WDRF in MCUSR. This means that WDE is always set when WDRF is set. To clear WDE, WDRF must be cleared first.

The sequence I use is:

MCUSR = 0;
wdt_disable( );

So that is why WDRF must be cleared before you can clear WDE.

The current procedure at startup for setting the Watchdog Timer is as follows:

// Initialize watchdog with a 4 sec interrupt time
void watchdog_init() {
  #if ENABLED(WATCHDOG_RESET_MANUAL)
    // We enable the watchdog timer, but only for the interrupt.
    // Take care, as this requires the correct order of operation, with interrupts disabled. See the datasheet of any AVR chip for details.
    wdt_reset();
    _WD_CONTROL_REG = _BV(_WD_CHANGE_BIT) | _BV(WDE);
    _WD_CONTROL_REG = _BV(WDIE) | WDTO_4S;
  #else
    wdt_enable(WDTO_4S);
  #endif
}

If this is equivalent to the above discussion, then all is well. I think the main concern is that the watchdog reset apparatus might not always be started, and that this condition should be avoided. Marlin avoids it by doing watchdog reset pretty frequently, and on every reset. I haven't heard any complaints where resetting a board caused endless resets, but it's something to watch out for.

@thinkyhead thinkyhead added Needs: Testing Testing is needed for this change Needs: More Data We need more data in order to proceed C: Boards/Pins labels Mar 6, 2016
@Blue-Marlin
Copy link
Contributor Author

@thinkyhead
If we have had a watchdog reset, there is nothing we can do in Marlin. The reset loop is in the Arduino bootloader (at least for the Arduino Mega - all RAMPS boards, ++).
The only thing we can do is to avoid unnecessary watchdog resets.

The watchdog is ought to catch unexpected endless loops in the code. The kill() loop is not unexpected.

I suggest to make a simple test.
Take a machine with activated watchdog. Run it. Unconnect a thermistor. Mintemp is detected, kill called, the watchdog resets the machine 4 seconds later. Reconnect the thermistor. Try serial reset (unconnect-reconnect). Try the reset button. Power cycle the printer.
Now reflash without the activated watchdog and try the same.
Got it? Do you see the difference?

Lets wait for some days. The day where Octoprint users will begin to complain about having to run down 3 stairs, to power cycle their machines, is not far.

@thinkyhead
Copy link
Member

@Blue-Marlin Customarily when users have different needs we will make such options configurable through Configuration.h and/or Configuration_adv.h. If users are connected to a remote host then it may be more desirable for them to leave the timeout loop in place. Consider packaging this as a new feature, as in:

#define MANUAL_RESET_ON_KILL // Enable this option to require a manual reset on kill()

@jbrazio
Copy link
Contributor

jbrazio commented Mar 6, 2016

This was exactly my suggestion on #3045 when this reset thing came to discussion.

@jbrazio jbrazio mentioned this pull request Mar 6, 2016
@thinkyhead
Copy link
Member

I'm seriously considering including this change even though it redefines the default behavior. I personally would prefer that if my printer locks up, it should stay locked up and not reboot on its own. Your points made in #3085 are compelling, especially surrounding the response of hosts and the AUTOSTART feature. So my new suggestion is something more along the lines of this:

In Configuration.h: a sub-option with USE_WATCHDOG disabled by default:

//#define AUTO_RESET_ON_KILL // Enable this option to automatically reset the board on kill()

In SanityCheck.h::

#if ENABLED(USE_WATCHDOG) && ENABLED(AUTO_RESET_ON_KILL) && ENABLED(AUTOSTART)
  #error AUTO_RESET_ON_KILL combined with AUTOSTART considered risky. Comment this line if you only use AUTOSTART for harmless things.
#endif

And finally in Marlin_main.cpp

  #if ENABLED(USE_WATCHDOG) && DISABLED(AUTO_RESET_ON_KILL)
    watchdog_reset();
  #endif

Another thought occurs to me, which is of course "Why would anyone use USE_WATCHDOG without AUTO_RESET_ON_KILL?" It looks like the whole point of USE_WATCHDOG is to get that automatic reset in the first place. So perhaps no new option is needed. If you don't want an automatic reset, then there's no other reason to use USE_WATCHDOG, is there?

@thinkyhead
Copy link
Member

After reviewing the code I see that, yes, in fact, the whole point of USE_WATCHDOG is to get an Automatic Reset, whether due to kill() or some uncontrolled situation. But I can also see why you might want an automatic reset in some situations — especially where there is an unknown hang potentially leaving heaters on — but you might not want it in the case of a proper kill(), since the heaters will (one hopes) be turned off. Leaving the machine hung forever might also help by leaving the last LCD screen up so you can see the state of the machine at the time of failure.

Hopefully few users will have to deal with these things in the first place, but it's good to look at them once in a while.

Indication of a failure state is always helpful. A hung machine that isn't responsive has to be prodded to determine the failed state. I would prefer to give some clear indication, if only toggling the configured LED pin as a visual alarm. (Audio alarms have been discussed and are considered … too alarming.) If it's immediately apparent by a flashing light that the machine has gone down then users can respond by resetting right away instead of trying the controller knob.

Hosts assume that a machine is hung if it stops responding long enough. They can also look for the Marlin startup message and by that assume that a machine has been reset. But these have not been formalized. We have received requests from makers of host software to send more messages out about Marlin's states ("busy", "awaiting input", etc.). So, even in the "precarious" position of a machine that has gone into the kill() loop, one could argue perhaps for sending a last "hung" or "resetting" message to the host and showing a 4-second countdown on the LCD screen before resetting, as well.

@jbrazio
Copy link
Contributor

jbrazio commented Mar 6, 2016

Sounds sane your suggestion but I would still prefer to ship with USE_WATCHDOG active; if for some reason the AVR locks up and the heater pin is active.. it will never shut it down until everything is on fire. USE_WATCHDOG is the security feature.

The issue is not Marlin to go into a panic mode and need reset, the issue is when the watchdog kicks in people are obliged to power cycle the printer and for some people this may mean remove the plug from mains and plug it back again..

As it is today without USE_WATCHDOG active when kill() is called Marlin locks up in a while (1) {} waiting for any type of reset. The user can then soft-reset or hard-reset it by power cycling the board.

When USE_WATCHDOG is active when kill() is called Marlin locks up on the same while (1) {} but now the watchdog will kick in due to wdt_enable(WDTO_4S); in watchdog.cpp.

When WDTO_4S kicks in the ISR(WDT_vect) traps the interrupt and will once again call kill() and try to lock up Marlin (again) in a while(1) loop. So far so good.

The issue is when the user manually soft-resets the board. The Arduino bootloader is expected to disable the watchdog which it is not.. which means that after 4S of booting Marlin will go again into ISR(WDT_vect). The only way to "fix" this is to hard-reset the board which will reset the watchdog.

This bugfix will still lockup Marlin on the while (1) {} loop but before locking up the uC it will disable the watchdog, then if the user soft-reset the board it will not get trapped again inside ISR(WDT_vect).

@Blue-Marlin Blue-Marlin changed the title Add Watchdog reset in kill() Add Watchdog timwer reset in kill() Mar 6, 2016
@Blue-Marlin Blue-Marlin changed the title Add Watchdog timwer reset in kill() Add Watchdog timer reset in kill() Mar 6, 2016
@thinkyhead
Copy link
Member

The Arduino bootloader is expected to disable the watchdog which it is not

This cannot be counted upon, even according to Arduino's own documentation.

The only way to "fix" this is to hard-reset the board which will reset the watchdog.

As I described above, during program initialization (in setup()) the watchdog must always be explicitly turned off, because a watchdog can stay alive through various kinds of reset. The note above suggests MCUSR = 0; wdt_disable( ); and that's easy enough to test.

This bugfix will still lockup Marlin on the while (1) {} loop but before locking up the uC it will disable the watchdog

I'm not seeing how this PR "disables" the watchdog. It just signals watchdog to stay alive with watchdog_reset() so the board won't self-reset.

thinkyhead added a commit that referenced this pull request Mar 7, 2016
@thinkyhead thinkyhead merged commit 501f638 into MarlinFirmware:RCBugFix Mar 7, 2016
@thinkyhead
Copy link
Member

Merged this as basically an obvious bug-fix. Expected behavior for kill() is to stay in a loop until manually reset. Keep it simple.

@Blue-Marlin
Copy link
Contributor Author

@jbrazio
Thanks for you detailed explanation.

I guess language is the key for our initial misunderstanding.
"Watchdog reset" can mean a reset caused by the watchdog.
Calling "watchdog_reset()" means the opposite - reset of the watchdog timer to prevent a reset by the watchdog.

I don't think it needs an additional configuration option. The behaviour of kill() with activated watchdog is, with this change, the same as if the watchdog would not be activated. The advantage of the watchdog, to catch unexpected endless loops, is untouched.
WATCHDOG_RESET_MANUAL is evil. Not because its calling kill, but because of redefining the interrupt vector. It makes a completely hardware handled reset to a software function. That alone would not be so bad, but if our endless loop is in an interrupt all further software interrupts are disabled and therefore the watchdog interrupt routine never executed. So with WATCHDOG_RESET_MANUAL enabled we lose the ability to protect from hangs in interrupts. For the same reason we do not get a endless chain of kill() calls. If we are once in an interrupt, the interrupt_disable_flag is set until, we reset it ourselves, or leave the interrupt routine. If that would not be the case we'd have a stack overflow very soon. Ergo the watchdog reset with WATCHDOG_RESET_MANUAL did and does the same as kil(), when fired at all.

Have an other look at the kill() function. We explicitly disable software interrupts but the pure hardware watchdog timer interrupt (without WATCHDOG_RESET_MANUAL) still works!

@jbrazio
Copy link
Contributor

jbrazio commented Mar 7, 2016

I'm not seeing how this PR disables the watchdog. It just signals watchdog to stay alive with watchdog_reset() so the board won't self-reset.

You're correct, watchdog_reset() is just a wrapper for wdt_reset()thus only "pings" the watchdog.

@Blue-Marlin Blue-Marlin deleted the watchdogreset-in-kill branch March 9, 2016 13:07
@jbrazio jbrazio modified the milestone: 1.1.0 Jul 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: Boards/Pins Needs: More Data We need more data in order to proceed Needs: Testing Testing is needed for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants