Disaster recovery failure revisited

Hi @fasttech, thank you for testing the restore procedure again!

It is important to know the exact date of the backup, because we fixed a bug (do you remember 5188?) during rc4. See this comment from January 13th:

The Samba secrets.tdb backup action was moved from nethserver-samba to nethserver-sssd. Old backups may not include secrets.tdb. In that case, the automatic config restore is incomplete and a manual re-join to the DC is required. See this procedure.

Same symptom reported here:


@davidep OK, dropping my notes here.

Below is the current yum log. It includes the date of the last update of the image, then the post-restore updates, and then the start of the update triggered by restore-config.

Jan 23 14:53:13 Erased: python-pyasn1-0.1.6-2.el7.noarch
Feb 14 16:17:48 Updated: nethserver-base-3.0.17-1.ns7.noarch
Feb 14 16:17:48 Updated: nethserver-lang-en-1.1.8-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-httpd-admin-2.0.7-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-duc-1.4.2-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-dnsmasq-1.6.3-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-lsm-1.2.2-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-sssd-1.1.7-1.ns7.noarch
Feb 14 16:17:49 Updated: nethserver-release-7-2.ns7.noarch
Feb 14 16:20:22 Installed: libtirpc-0.2.4-0.8.el7.x86_64
Feb 14 16:20:23 Installed: rpcbind-0.2.0-38.el7.x86_64 

then log of last backup from built machine…

Extract from log file /var/log/last-backup.log:

Reading globbing filelist /tmp/SWdfZnDRV2
Local and Remote metadata are synchronized, no sync needed.
Last full backup date: Fri Feb  3 20:30:37 2017
--------------[ Backup Statistics ]--------------
StartTime 1486783840.40 (Fri Feb 10 20:30:40 2017)
EndTime 1486783915.85 (Fri Feb 10 20:31:55 2017)
ElapsedTime 75.45 (1 minute 15.45 seconds)
SourceFiles 1492
SourceFileSize 512014201 (488 MB)
NewFiles 1492
NewFileSize 512014201 (488 MB)

and here is the current machine state as it was backing up.

Jan 24 11:20:27 Installed: nextcloud-10.0.2-1.ns7.noarch
Jan 24 11:20:27 Installed: nethserver-nextcloud-1.0.4-1.ns7.noarch
Jan 25 14:10:07 Updated: nethserver-restore-data-1.2.3-1.ns7.noarch
Jan 26 12:35:51 Updated: nethserver-base-3.0.16-1.ns7.noarch
Jan 26 12:35:52 Updated: nethserver-dnsmasq-1.6.3-1.ns7.noarch
Jan 26 12:35:52 Updated: nethserver-lsm-1.2.2-1.ns7.noarch
Jan 26 12:35:53 Updated: duplicity-0.7.11-2.el7.x86_64
Jan 30 10:52:43 Updated: nethserver-base-3.0.17-1.ns7.noarch
Jan 30 10:52:43 Updated: nethserver-lang-en-1.1.7-1.ns7.noarch
Jan 30 10:52:45 Updated: nethserver-httpd-admin-2.0.7-1.ns7.noarch
Jan 30 10:52:46 Updated: nethserver-sssd-1.1.6-1.ns7.noarch
Jan 30 10:52:46 Updated: nethserver-release-7-1.ns7.noarch
Feb 06 09:12:48 Updated: python2-crypto-2.6.1-13.el7.x86_64
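
The two yum logs above differ in the nethserver-sssd version (1.1.6 on the machine being backed up, 1.1.7 after the restore). A minimal sketch for pulling that out of a log, assuming the standard yum.log line layout shown above; the function name `last_pkg_event` is made up for illustration:

```shell
# Print the last yum.log entry for a given package name.
# Reads yum.log-style lines on stdin:
#   "Mon DD HH:MM:SS Action: name-version-release.arch"
last_pkg_event() {
    awk -v pkg="$1" '
        # Field 5 is the package string; require a digit after "name-" so that
        # "nethserver-sssd" does not also match a differently named package.
        $5 ~ ("^" pkg "-[0-9]") { last = $5 }
        END { if (last != "") print last }
    '
}

# Example (run on each machine, then compare the results):
#   last_pkg_event nethserver-sssd < /var/log/yum.log
```

Running it on both the source machine and the restored one gives a quick way to confirm which package version produced the backup and which one performed the restore.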

@davidep
Yesterday, I restored the built snapshot and checked for updates from the software center (there were none). I left it to run, and it backed up successfully. This morning I restored the old rc4 fresh, ensured net config, ran software update, rebooted, installed backup, rebooted, restored… same error.

Here is the messages log from after the restore package install…

https://my.smbitech.com:12458/owncloud/public.php?service=files&t=7be25fe86799f6a36fccbc8713dfec22


I’m quite busy these days, sorry for the delay.

I’ll try to reproduce it with another backup!

Meanwhile, if somebody else from @quality_team can reproduce it, that would be appreciated!


So, has anyone else run into this issue?
Has anyone had a successful disaster recovery of a samba dc setup?
I would like to find out whether this is just me and I need to figure out what I’m doing wrong, or whether there’s an underlying problem with restoring a samba dc.


I deleted the backups.
I fired up the populated machine… samba, file sharing, nextcloud.
I set off a full backup, verified by email.
I did a fresh install from iso in virtualbox. I set up network, fqdn, and updated… huge, rebooted.
I installed backup, set up backup to point to the previous full backup.
Checked backup was connected… restored…environment restored… auth fail.

So… I fooled around at the cli, nothing worked.

[root@server7c ~]# systemd-run -t -M nsdc /bin/bash
Running as unit run-6473.service.
Press ^] three times within 1s to disconnect TTY.
bash-4.2# samba-tool user enable administrator
Enabled user 'administrator'
bash-4.2# samba-tool user setpassword administrator --newpassword=Nethesis,1234
Changed password OK
bash-4.2# ^^^
bash: :s^^^: no previous substitution
bash-4.2# exit
exit
[root@server7c ~]# net ads info
LDAP server: 192.168.124.228
LDAP server name: nsdc-server7c.domain.com
Realm: domain.COM
Bind Path: dc=domain,dc=COM
LDAP port: 389
Server time: Wed, 22 Feb 2017 14:07:43 MST
KDC server: 192.168.124.228
Server time offset: 0
Last machine account password change: Wed, 31 Dec 1969 17:00:00 MST
[root@server7c ~]# getent passwd administrator@`config get DomainName`
administrator@domain.com:*:1318000500:1318000513:Administrator:/var/lib/nethserver/home/administrator:/usr/libexec/openssh/sftp-server
[root@server7c ~]# host -t SRV _ldap._tcp.`config get DomainName`
_ldap._tcp.domain.com has SRV record 0 100 389 nsdc-server7c.domain.com.
[root@server7c ~]# > /etc/sssd/sssd.conf
[root@server7c ~]# realm join `config get DomainName`
realm: Already joined to this domain
[root@server7c ~]# expand-template /etc/sssd/sssd.conf

current domain accounts

NetBIOS domain name: domain
LDAP server: 192.168.124.228
LDAP server name: nsdc-server7c.domain.com
Realm: domain.COM
Bind Path: dc=domain,dc=COM
LDAP port: 389
Server time: Wed, 22 Feb 2017 14:13:18 MST
KDC server: 192.168.124.228
Server time offset: 0
Last machine account password change: Wed, 31 Dec 1969 17:00:00 MST

Enter SERVER7C$@domain.COM's password:Join to domain is not valid: NT code 0xfffffff6
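
The "Last machine account password change" value in the output above, Wed, 31 Dec 1969 17:00:00 MST, is Unix time 0 rendered in MST (UTC-7), which suggests that no password change time is recorded for the machine account. A minimal sketch for spotting that in `net ads info` output; the function name is hypothetical, and matching on the date text is a simplification that assumes the output format shown above:

```shell
# Succeeds (exit 0) when `net ads info` output on stdin reports the machine
# account password change time as Unix epoch 0, which local time zones
# render as 31 Dec 1969 or 1 Jan 1970.
machine_pw_change_is_epoch() {
    grep -E '^Last machine account password change:' \
        | grep -Eq '(31 Dec 1969|[^0-9]0?1 Jan 1970)'
}

# Example:
#   net ads info | machine_pw_change_is_epoch && echo "no change time recorded"
```

This is only a hint, not a diagnosis: it says the timestamp is unset, not why.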

I’m currently on mail-server and docs. I hope to be back on DC soon! I’m really missing it :cry:


You’re right, the backup/restore of configuration fails: I can reproduce it!

Luckily, you can work around the issue with the following commands:

signal-event nethserver-sssd-leave
realm join -U admin $(hostname -d)

    ...[enter admin's password]

signal-event nethserver-sssd-save

The problem is caused by a little oversight in the backup procedure. The samba secrets.tdb backup is not actually executed! That file contains the machine password. It can be obtained again with the leave/join workaround above.

All existing systems are affected by this bug.


A package is available from the nethserver-testing repo: /cc @quality_team

yum --enablerepo=nethserver-testing update nethserver-sssd-1.1.7-1.5.g514186f.ns7.noarch

The fix involves the backup procedure:

  • existing backups continue to cause the same problem after restoring them - apply the workaround;
  • backups generated by the fixed procedure should not lead to the problem any more.
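
To tell which of the two cases a given backup falls into, one option is to look for secrets.tdb in the backup's file listing. A minimal sketch, assuming you can get a listing with one entry per line where each line ends with the path (how to produce that listing depends on the backup tool and is not shown here); the function name is hypothetical:

```shell
# Succeeds (exit 0) when the file listing on stdin contains an entry whose
# last path component is exactly "secrets.tdb".
backup_lists_secrets_tdb() {
    grep -Eq '(^|/)secrets\.tdb$'
}

# Example, with listing.txt standing in for a saved file listing:
#   backup_lists_secrets_tdb < listing.txt || echo "apply the leave/join workaround after restore"
```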

I beg your pardon but… shouldn’t the backup and restore procedure be tested before releasing a stable OS?
I mean: nothing is perfect, but having a stable release with a totally broken backup function sounds frightening to me…

Well, the backup is not “totally broken”: as you see there’s a simple workaround. I see no reason to be frightened!

… and it has been tested, as you can see here:

https://github.com/NethServer/dev/issues/5188

What’s the point? How could we improve the QA test? More helpful people?

Davide: the AD feature is a core one… it’s, maybe, the most important one NS has (compared to NS6.8).
When you released NS7 as STABLE, the backup/restore function was broken (hence this topic), wasn’t it?
And a restore which doesn’t work is useless… you’ve found a workaround, but this is the kind of thing that should have been done earlier, in the RC stage.

for the point, see above…
for the other questions, I answer using a picture (if you follow Rugby you’ll know what I mean)

given enough eyeballs, all bugs are shallow

Said someone once. Thanks @fasttech for your eyes :slight_smile:

Over the last year, we had more than 1k posts about bug/testing discussions, some hundreds of bugs fixed, more than 50 people involved in testing. Check out Testing Bug for further information.
Do we need more people? Of course! @Stefano_Zamboni feel free to offer your support
I would really love to see your participation in testing and reporting, the more hands on deck the better :yum:


Not following. Who are you blaming? :slight_smile:
We’re fixing it now and @davidep is offering all the help he can. So? What’s your point?
You are free to state whatever you want. I feel obligated to point out that we don’t need referees, we need people who are willing to help in a constructive manner.
Again, feel free to support and help us to improve the product.

I’m already doing so… I’m not the only one who noticed that NS was released “a bit” in a hurry (just before FOSDEM)

I think you did not get the point of the picture (strange enough, you’re the biggest fan of pictures here…)

hint:
> yum install irony

Sorry! My fault. I didn’t get the irony. You know, it can happen online.

Going back to the topic, @fasttech could you please verify @davidep’s fix?

Soon as I can get a block of time, I’ll get on it.

@Stefano_Zamboni you’re a serious drag. I happen to agree with you, as does everyone else I would guess, that disaster recovery is probably one of the most important functions… that’s why I’m testing it, when I have time, documenting the issues and passing them on to the devs… instead of sniveling and whining about it, and beating a dead horse. In the words of an actor in a movie… “Son, you have an attitude problem”.


thank you, I’m proud of it :slight_smile:
and, be sure, I won’t stop

fine… just some questions (and you’re in the quality team, so I guess you’re the right person to ask):

  1. why is such a feature being tested only now?
  2. why weren’t the setup issues (mainly with the network) tested during the beta stage, or in RC (which should be just a phase where everything already works)?
  3. who decided the release of RC and STABLE? I can’t find a discussion about it anywhere (point me in the right direction, please), but I’d say that something went wrong…

I know, but it’s not an issue on my side, and I can live with it :wink: