Disaster recovery failure revisited

NethServer Version: 7 final
Module:

@davidep @giacomo

Production server build… built it to a point with several modules… many daily backups accumulated, proceeded to test disaster restore…

Restored a snapshot from the beginning… rc4, did an update after bootup, rebooted, installed backup, restore config… everything restored, no users or groups, error msg.
AccountProvider_error_1

account provider start dc button is grayed out.

domain accounts is partial with a msg “Enter SERVER7C$@domain.COM’s password:Join to domain is not valid: NT code 0xfffffff6”

entered shorewall stop… no effect, other than the fw off error msg, rebooted, continued account error.

Feb 14 17:31:29 server7c sshd[1968]: Accepted password for root from 192.168.124.126 port 61751 ssh2
Feb 14 17:31:29 server7c systemd: Created slice user-0.slice.
Feb 14 17:31:29 server7c systemd: Starting user-0.slice.
Feb 14 17:31:29 server7c systemd: Started Session 1 of user root.
Feb 14 17:31:29 server7c systemd-logind: New session 1 of user root.
Feb 14 17:31:29 server7c systemd: Starting Session 1 of user root.
Feb 14 17:34:37 server7c root: Shorewall Stopped
Feb 14 17:34:46 server7c httpd: [ERROR] NethServer\Tool\GroupProvider: AccountProvider_Error_1
Feb 14 17:34:46 server7c httpd: [ERROR] Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012(1) 00002020: Operation unavailable without authentication
Feb 14 17:34:47 server7c admin-todos: Traceback (most recent call last):
Feb 14 17:34:47 server7c admin-todos:  File "<stdin>", line 3, in <module>
Feb 14 17:34:47 server7c admin-todos: KeyError: 'SECRETS/MACHINE_PASSWORD/domain'
Feb 14 17:34:47 server7c admin-todos: Traceback (most recent call last):
Feb 14 17:34:47 server7c admin-todos:  File "<stdin>", line 3, in <module>
Feb 14 17:34:47 server7c admin-todos: KeyError: 'SECRETS/MACHINE_PASSWORD/domain'
Feb 14 17:34:47 server7c admin-todos: (1) 00002020: Operation unavailable without authentication
Feb 14 17:35:09 server7c httpd: [ERROR] NethServer\Tool\UserProvider: AccountProvider_Error_1
Feb 14 17:35:09 server7c httpd: [ERROR] Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012(1) 00002020: Operation unavailable without authentication
Feb 14 17:35:40 server7c httpd: [ERROR] NethServer\Tool\GroupProvider: AccountProvider_Error_1
Feb 14 17:35:40 server7c httpd: [ERROR] Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012(1) 00002020: Operation unavailable without authentication
Feb 14 17:35:41 server7c admin-todos: Traceback (most recent call last):
Feb 14 17:35:41 server7c admin-todos:  File "<stdin>", line 3, in <module>
Feb 14 17:35:41 server7c admin-todos: KeyError: 'SECRETS/MACHINE_PASSWORD/domain'
Feb 14 17:35:41 server7c admin-todos: Traceback (most recent call last):
Feb 14 17:35:41 server7c admin-todos:  File "<stdin>", line 3, in <module>
Feb 14 17:35:41 server7c admin-todos: KeyError: 'SECRETS/MACHINE_PASSWORD/domain'
Feb 14 17:35:41 server7c admin-todos: (1) 00002020: Operation unavailable without authentication
Feb 14 17:36:01 server7c httpd: [ERROR] NethServer\Tool\UserProvider: AccountProvider_Error_1
Feb 14 17:36:01 server7c httpd: [ERROR] Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012Traceback (most recent call last):#012  File "<stdin>", line 3, in <module>#012KeyError: 'SECRETS/MACHINE_PASSWORD/domain'#012(1) 00002020: Operation unavailable without authentication
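
For anyone triaging a similar restore: the recurring signature in the log above is the `KeyError: 'SECRETS/MACHINE_PASSWORD/...'` traceback. Below is a minimal sketch for counting those lines in a log file. The function name is made up for illustration, and `/var/log/messages` is only an assumed log location.

```shell
#!/bin/sh
# Hypothetical helper (not part of NethServer): count the log lines that
# carry the missing-machine-password signature. $1 is the log file to scan.
count_machine_password_errors() {
    # grep -c prints the number of matching lines (0 when nothing matches);
    # "|| true" keeps the no-match case from returning a failure status.
    grep -c "KeyError: 'SECRETS/MACHINE_PASSWORD/" "$1" || true
}

# Usage (the log path is an assumption):
#   count_machine_password_errors /var/log/messages
```

A non-zero count after a config restore would suggest the machine password never made it back into Samba's secrets database.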

sigh… I thought we were good.

NetBIOS domain name: domain
LDAP server: 192.168.124.228
LDAP server name: nsdc-server7c.domain.com
Realm: domain.COM
Bind Path: dc=domain,dc=COM
LDAP port: 389
Server time: Tue, 14 Feb 2017 17:52:42 MST
KDC server: 192.168.124.228
Server time offset: 0
Last machine account password change: Wed, 31 Dec 1969 17:00:00 MST

Enter SERVER7C$@domain.COM's password:Join to domain is not valid: NT code 0xfffffff6

I’m guessing I need to reset the… admin? password at the CLI; I can’t do it from the GUI while logged in as root. Is there a command for that in one of the other threads? Shouldn’t it be automatic during config restore? Do we need to add it to the docs?

Hi @fasttech, thank you for testing the restore procedure again!

It is important to know the exact date of the backup, because we fixed a bug (do you remember 5188?) during rc4. See this comment from Jan 13th:

The Samba secrets.tdb backup action was moved from nethserver-samba to nethserver-sssd. Old backups might not include secrets.tdb. In that case, the automatic config restore is incomplete and a manual re-join to the DC is required. See this procedure.

Same symptom reported here:

@davidep ok, dropping my notes here;

Below is the current yum log. It includes the date of the last update of the image, then the post-restore updates, and then the start of the update triggered by restore-config.

Jan 23 14:53:13 Erased: python-pyasn1-0.1.6-2.el7.noarch
Feb 14 16:17:48 Updated: nethserver-base-3.0.17-1.ns7.noarch
Feb 14 16:17:48 Updated: nethserver-lang-en-1.1.8-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-httpd-admin-2.0.7-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-duc-1.4.2-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-dnsmasq-1.6.3-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-lsm-1.2.2-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-sssd-1.1.7-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-release-7-2.ns7.noarch
Feb 14 16:20:22 Installed: libtirpc-0.2.4-0.8.el7.x86_64
Feb 14 16:20:23 Installed: rpcbind-0.2.0-38.el7.x86_64 

then log of last backup from built machine…

Extract from log file /var/log/last-backup.log:

Reading globbing filelist /tmp/SWdfZnDRV2
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Fri Feb  3 20:30:37 2017
--------------[ Backup Statistics ]--------------
StartTime 1486783840.40 (Fri Feb 10 20:30:40 2017)
EndTime 1486783915.85 (Fri Feb 10 20:31:55 2017)
ElapsedTime 75.45 (1 minute 15.45 seconds)
SourceFiles 1492
SourceFileSize 512014201 (488 MB)
NewFiles 1492
NewFileSize 512014201 (488 MB)

and here is the current machine state as it was backing up.

Jan 24 11:20:27 Installed: nextcloud-10.0.2-1.ns7.noarch
Jan 24 11:20:27 Installed: nethserver-nextcloud-1.0.4-1.ns7.noarch
Jan 25 14:10:07 Updated: nethserver-restore-data-1.2.3-1.ns7.noarch
Jan 26 12:35:51 Updated: nethserver-base-3.0.16-1.ns7.noarch
Jan 26 12:35:52 Updated: nethserver-dnsmasq-1.6.3-1.ns7.noarch
Jan 26 12:35:52 Updated: nethserver-lsm-1.2.2-1.ns7.noarch
Jan 26 12:35:53 Updated: duplicity-0.7.11-2.el7.x86_64
Jan 30 10:52:43 Updated: nethserver-base-3.0.17-1.ns7.noarch
Jan 30 10:52:43 Updated: nethserver-lang-en-1.1.7-1.ns7.noarch
Jan 30 10:52:45 Updated: nethserver-httpd-admin-2.0.7-1.ns7.noarch
Jan 30 10:52:46 Updated: nethserver-sssd-1.1.6-1.ns7.noarch
Jan 30 10:52:46 Updated: nethserver-release-7-1.ns7.noarch
Feb 06 09:12:48 Updated: python2-crypto-2.6.1-13.el7.x86_64

@davidep
Yesterday I restored the built snapshot and checked for updates from the software center; there were none. I left it to run and it backed up successfully. This morning I restored the old rc4 snapshot fresh, ensured net config, ran software update, rebooted, installed backup, rebooted, restored… same error.

here is the message log from after restore package install…

https://my.smbitech.com:12458/owncloud/public.php?service=files&t=7be25fe86799f6a36fccbc8713dfec22

I’m quite busy these days, sorry for the delay.

I’d try to reproduce it with another backup!

Meanwhile if somebody else from @quality_team can reproduce it, would be appreciated!

2 posts were split to a new topic: Restore procedure with POP3 connector leads to duplicate messages

So, has anyone else run into this issue?
Has anyone had a successful disaster recovery of a samba dc setup?
I would like to find out if this is just me and I need to figure out what I’m doing wrong, or if there’s an underlying problem with restoring a samba dc.

I deleted the backups.
I fired up the populated machine… samba, file sharing, nextcloud.
I set off a full backup, verified by email.
I did a fresh install from iso in virtualbox. I set up network, fqdn, and updated… huge, rebooted.
I installed backup, set up backup to point at the previous full backup.
Checked backup was connected… restored…environment restored… auth fail.

So… I fooled around at the cli, nothing worked.

[root@server7c ~]# systemd-run -t -M nsdc /bin/bash
Running as unit run-6473.service.
Press ^] three times within 1s to disconnect TTY.
bash-4.2# samba-tool user enable administrator
Enabled user 'administrator'
bash-4.2# samba-tool user setpassword administrator --newpassword=Nethesis,1234
Changed password OK
bash-4.2# ^^^
bash: :s^^^: no previous substitution
bash-4.2# exit
exit
[root@server7c ~]# net ads info
LDAP server: 192.168.124.228
LDAP server name: nsdc-server7c.domain.com
Realm: domain.COM
Bind Path: dc=domain,dc=COM
LDAP port: 389
Server time: Wed, 22 Feb 2017 14:07:43 MST
KDC server: 192.168.124.228
Server time offset: 0
Last machine account password change: Wed, 31 Dec 1969 17:00:00 MST
[root@server7c ~]# getent passwd administrator@`config get DomainName`
administrator@domain.com:*:1318000500:1318000513:Administrator:/var/lib/nethserver/home/administrator:/usr/libexec/openssh/sftp-server
[root@server7c ~]# host -t SRV _ldap._tcp.`config get DomainName`
_ldap._tcp.domain.com has SRV record 0 100 389 nsdc-server7c.domain.com.
[root@server7c ~]# > /etc/sssd/sssd.conf
[root@server7c ~]# realm join `config get DomainName`
realm: Already joined to this domain
[root@server7c ~]# expand-template /etc/sssd/sssd.conf

current domain accounts

NetBIOS domain name: domain
LDAP server: 192.168.124.228
LDAP server name: nsdc-server7c.domain.com
Realm: domain.COM
Bind Path: dc=domain,dc=COM
LDAP port: 389
Server time: Wed, 22 Feb 2017 14:13:18 MST
KDC server: 192.168.124.228
Server time offset: 0
Last machine account password change: Wed, 31 Dec 1969 17:00:00 MST

Enter SERVER7C$@domain.COM's password:Join to domain is not valid: NT code 0xfffffff6

I’m currently on mail-server and docs. I hope to be back on DC soon! I’m really missing it :cry:

You’re right, the backup/restore of configuration fails: I can reproduce it!

Luckily, you can work around the issue with the following commands:

signal-event nethserver-sssd-leave
realm join -U admin $(hostname -d)

    ...[enter admin's password]

signal-event nethserver-sssd-save

The problem is caused by a little oversight in the backup procedure. The samba secrets.tdb backup is not actually executed! That file contains the machine password. It can be obtained again with the leave/join workaround above.
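
The three workaround steps above could be wrapped so that a failed step stops the sequence before the save event fires. This is only a sketch: `run`, `rejoin_dc` and the `DRY_RUN` switch are hypothetical names, not part of NethServer, and it defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Sketch only: chains the leave / join / save steps from the workaround.
# DRY_RUN and both function names are made up for illustration.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default): print the command instead of executing it.
        echo "$*"
    else
        "$@"
    fi
}

rejoin_dc() {
    # $1 is the domain to join, e.g. the output of: hostname -d
    domain=$1
    # Each step runs only if the previous one succeeded.
    # The realm join step prompts for admin's password when really executed.
    run signal-event nethserver-sssd-leave &&
    run realm join -U admin "$domain" &&
    run signal-event nethserver-sssd-save
}

# Usage:
#   rejoin_dc "$(hostname -d)"              # prints the three commands
#   DRY_RUN=0; rejoin_dc "$(hostname -d)"   # actually runs them, as root
```

Whether the sequence should stop when the leave step fails is a judgment call; running the three commands by hand, as posted above, works just as well.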

All existing systems are affected by this bug.

A package is ready from nethserver-testing repo: /cc @quality_team

yum --enablerepo=nethserver-testing update nethserver-sssd-1.1.7-1.5.g514186f.ns7.noarch

The fix involves the backup procedure:

  • existing backups continue to cause the same problem after restoring them - apply the workaround;
  • backups generated by the fixed procedure should not lead to the problem any more.
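
To tell the two cases apart, one option is to look for secrets.tdb in the listing of a configuration backup archive. A minimal sketch follows; the function name is hypothetical, and the archive path in the usage comment is an assumption that should be adjusted to wherever your config backup actually lives.

```shell
#!/bin/sh
# Hypothetical check: does an archive listing (one path per line on stdin)
# contain Samba's secrets.tdb?
listing_has_secrets_tdb() {
    # Match secrets.tdb as the final path component, with or without
    # leading directories; -q makes grep report only via its exit status.
    grep -Eq '(^|/)secrets\.tdb$'
}

# Usage (the archive path is an assumption):
#   tar -tf /var/lib/nethserver/backup/backup-config.tar.xz | listing_has_secrets_tdb \
#       && echo "secrets.tdb present" \
#       || echo "secrets.tdb missing: expect to re-join after restore"
```

If the file is missing from the listing, the backup most likely predates the fix and the leave/join workaround will be needed after restoring it.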

I beg your pardon but… shouldn’t the backup and restore procedure be tested before releasing a stable OS?
I mean: nothing is perfect, but having a stable release with totally broken backup function sounds frightening to me…

Well, the backup is not “totally broken”: as you see there’s a simple workaround. I see no reason to be frightened!

… and it has been tested, as you can see here:

https://github.com/NethServer/dev/issues/5188

What’s the point? How could we improve the QA test? More helpful people?

Davide: the AD feature is a core one… it’s maybe the most important one NS has (compared to NS6.8).
When you released NS7 as STABLE, the backup/restore function was broken (hence this topic), wasn’t it?
And a restore which doesn’t work is useless… you’ve found a workaround, but this is the kind of thing that should have been done earlier, in the RC stage.

for the point, see above…
for the other questions, I answer using a picture (if you follow Rugby you’ll know what I mean)

given enough eyeballs, all bugs are shallow

Said someone once. Thanks @fasttech for your eyes :slight_smile:

Over the last year, we had more than 1k posts about bug/testing discussions, some hundreds of bugs fixed, more than 50 people involved in testing. Check out #development:testing #bug for further information.
Do we need more people? Of course! @Stefano_Zamboni feel free to offer your support
I would really love to see your participation in testing and reporting, the more hands on deck the better :yum:

Not following. Who are you blaming? :slight_smile:
We’re fixing it now and @davidep is offering all the help he can. So? What’s your point?
You are free to state whatever you want. I feel obligated to point out that we don’t need referees, we need people who are willing to help in a constructive manner.
Again, feel free to support and help us to improve the product.

I’m already doing so… I’m not the only one who noticed that NS was released “a bit” in a hurry (just before FOSDEM)

I think you did not get the point of the picture (strangely enough, since you’re the biggest fan of pictures here…)

hint:
> yum install irony

Sorry! My fault. I didn’t get the irony. You know, it can happen online.

Going back to the topic, @fasttech could you please verify @davidep’s fix?