

1. What problems arise when a single Redis instance is used as a distributed lock?



2. What is the idea behind the newer Redlock algorithm?

  1. It gets the current time in milliseconds.
  2. It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances. During step 2, when setting the lock in each instance, the client uses a timeout which is small compared to the total lock auto-release time in order to acquire it. For example if the auto-release time is 10 seconds, the timeout could be in the ~ 5-50 milliseconds range. This prevents the client from remaining blocked for a long time trying to talk with a Redis node which is down: if an instance is not available, we should try to talk with the next instance ASAP.
  3. The client computes how much time elapsed in order to acquire the lock, by subtracting from the current time the timestamp obtained in step 1. If and only if the client was able to acquire the lock in the majority of the instances (at least N/2+1, i.e. 3 when N = 5), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.
  4. If the lock was acquired, its validity time is considered to be the initial validity time minus the time elapsed, as computed in step 3.
  5. If the client failed to acquire the lock for some reason (either it was not able to lock N/2+1 instances or the validity time is negative), it will try to unlock all the instances (even the instances it believed it was not able to lock).
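
The five steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `FakeInstance` is a hypothetical in-memory stand-in for a Redis node (its `set_nx_px` mimics `SET key value NX PX ttl`), the names `acquire` and `release` are made up for this note, and the per-instance network timeout from step 2 is modeled simply as an immediate `ConnectionError`. A monotonic clock is used for the elapsed-time measurement.

```python
import time
import uuid


class FakeInstance:
    """Hypothetical in-memory stand-in for one Redis instance."""

    def __init__(self, available=True):
        self.available = available
        self.store = {}  # key -> (value, expiry in monotonic milliseconds)

    def set_nx_px(self, key, value, ttl_ms):
        """Mimics SET key value NX PX ttl_ms: set only if absent or expired."""
        if not self.available:
            raise ConnectionError("instance down")
        now = time.monotonic() * 1000
        current = self.store.get(key)
        if current is not None and current[1] > now:
            return False
        self.store[key] = (value, now + ttl_ms)
        return True

    def delete_if_equal(self, key, value):
        """Deletes the key only if it still holds the caller's random value."""
        if not self.available:
            raise ConnectionError("instance down")
        current = self.store.get(key)
        if current is not None and current[0] == value:
            del self.store[key]
            return True
        return False


def release(instances, key, value):
    """Step 5: try to unlock every instance, ignoring unreachable ones."""
    for inst in instances:
        try:
            inst.delete_if_equal(key, value)
        except ConnectionError:
            pass


def acquire(instances, key, ttl_ms):
    """Returns (value, validity_ms) on success, or None on failure."""
    value = uuid.uuid4().hex                   # same random value everywhere
    start = time.monotonic() * 1000            # step 1
    locked = 0
    for inst in instances:                     # step 2: sequential attempts
        try:
            if inst.set_nx_px(key, value, ttl_ms):
                locked += 1
        except ConnectionError:
            continue                           # move on to the next instance
    elapsed = time.monotonic() * 1000 - start  # step 3
    validity = ttl_ms - elapsed                # step 4
    quorum = len(instances) // 2 + 1
    if locked >= quorum and validity > 0:
        return value, validity
    release(instances, key, value)             # step 5
    return None
```

With five fake instances of which two are down, `acquire(nodes, "res", 10000)` still succeeds because three instances form a majority; a second caller gets `None` until `release` is called with the first caller's value.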

3. What problems does Martin see in the algorithm above?


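The mutual exclusion problem. See the figure below. After Client1 acquires the lock it enters a stop-the-world (STW) pause; by the time the pause ends the lock has already expired, yet Client1 still believes it holds the lock. antirez replied on his blog that a double check of the time can avoid this problem, but Martin pointed out that an STW pause can happen at any moment, and worse, network-induced delays are something a program can hardly guard against, so checking the time does not help at all. Martin proposes fencing as the solution on his blog; see his post for details.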
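
A minimal sketch of the fencing idea, assuming the lock service returns a monotonically increasing token with every grant and the protected storage rejects any write whose token is older than the highest one it has seen. `LockService` and `FencedStorage` are hypothetical names for illustration, not part of Redis or Redlock.

```python
class LockService:
    """Hypothetical lock service: each grant carries a larger token."""

    def __init__(self):
        self._token = 0

    def grant(self):
        self._token += 1
        return self._token


class FencedStorage:
    """Hypothetical storage that checks the fencing token on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = None

    def write(self, token, data):
        # Reject writes from a holder whose lock has since been re-granted.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data = data
        return True


locks = LockService()
storage = FencedStorage()

token1 = locks.grant()   # Client1 gets the lock, then stalls in an STW pause
token2 = locks.grant()   # the lock expires and Client2 acquires it
storage.write(token2, "from client2")             # accepted
late_write = storage.write(token1, "from client1")  # rejected: stale token
```

The point of the design is that safety no longer depends on Client1 noticing its own pause: the storage side refuses the late write regardless of what Client1 believes.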




4. What are the highlights of antirez's response?


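After a client fails to acquire the Redlock lock, it should sleep for a random delay before retrying. Otherwise multiple clients would compete for the lock again at the same moment and the contention would never ease.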
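
The retry rule could look something like this. `retry_with_random_delay` and its parameters are hypothetical names chosen for this note; `try_acquire` is assumed to be any callable that returns `None` on failure and some lock handle on success.

```python
import random
import time


def retry_with_random_delay(try_acquire, max_attempts=5, max_delay_ms=50,
                            sleep=time.sleep, rng=random.random):
    """Calls try_acquire up to max_attempts times with random sleeps between."""
    for attempt in range(max_attempts):
        result = try_acquire()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # A random delay desynchronizes clients that failed together.
            sleep(rng() * max_delay_ms / 1000.0)
    return None
```

The `sleep` and `rng` parameters exist only so that the delay can be replaced in tests; in normal use the defaults are enough.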

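If one of the N Redis instances crashes, you can give it a delayed start so that it does not suddenly rejoin after recovery and upset the existing balance. You can also consider enabling fsync (for example `appendfsync always` with AOF persistence), so that every Redis data modification is flushed to disk.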