Distributed bugs are often latent: a Troubleshooting Story

Posted on 2020-02-28 In openstack

Abstract

OpenStack is a nested distributed system (some of its subsystems are themselves distributed systems), so troubleshooting it can be quite tricky, not to mention its interdependence with other distributed systems in the real world.

We all know how hard it is to solve a hidden issue in a distributed system: each call in a chain of invocations is unpredictable, and the methods and tools for sniffing or tracking each part are too hard to use, incapable, outside of our control, or too expensive.

What triggered me to write this post is the article Challenges with distributed systems by Jacob Gabrielson, in which Jacob expounds how and why designing and managing a request/response distributed system is challenging. I recommend reading it 😊.

The story in this post is a troubleshooting case I solved in a customer's staging network years ago, where OpenStack clusters were integrated with several other (virtual) Network Function clusters.

Background

There were many OpenStack clusters on site, with different VNFs running on top of them.

Two months before I arrived at the customer site, the alarm [NTP Upstream Server Failure] was raised in one of the OpenStack clusters.
Every party involved had looked into the issue jointly, and each claimed that their own part was fine. There was no clue to move on with, and after the whole site's yearly power-down and power-up activity, the issue was gone.
One week ago, after a reboot of the control plane nodes, this alarm was raised again on another OpenStack cluster.

I reviewed everything to find the pattern of the issue:

Two types of servers acted as upstream NTP for the OpenStack clusters. Let's call them NTP-A and NTP-B. They were not dedicated NTP servers but heavier network functions (NFs):
- NTP-A, the one associated with the affected OpenStack clusters, was a VNF running on Linux.
- NTP-B was a native NF on Solaris, to be replaced by NTP-A in the future. #The customer had reason to worry about the issue even more 😜.
The OpenStack side claimed they were fine because they had captured the NTP client packets sent to NTP-A and saw no response received.
The other side checked everything around the NTP configuration, the network, etc., and found nothing abnormal. So they blamed the OpenStack side, since other systems used NTP-A as their upstream NTP and had encountered no issues.
Then both sides suspected that the packets were somehow dropped between NTP-A and the OpenStack cluster, in the networking/switches.
- The switch team observed nothing in the traffic pointing to packet drops.
- Even if they might not have captured the drop at the right time, why did only OpenStack's NTP requests go unanswered?
- Was this issue happening at a low frequency?
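
To make the "request sent, no response received" symptom concrete, here is a minimal sketch of an SNTP probe in Python. It is not the tooling used on site (the teams worked from packet captures); the function names are made up for illustration, and the host name in the usage note is a placeholder. It sends one client request and reports either the server's transmit time or None when nothing comes back within the timeout.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2208988800


def build_request(version=4):
    """Return a 48-byte SNTP client request (LI=0, VN=version, Mode=3)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_transmit_time(packet):
    """Extract the server transmit timestamp (bytes 40..47) as Unix seconds."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_EPOCH_DELTA + fraction / 2**32


def probe(server, port=123, timeout=2.0):
    """Send one request; return the server time, or None if no reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, port))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and name-resolution failures
            return None
    return parse_transmit_time(data)
```

Calling probe("ntp-a.example") in a loop and counting the None results would give a rough answer to the frequency question, under the assumption that the probe runs from the same network position as the affected nodes.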

jupyter pyenv and virtualenv in macOS

Posted on 2019-03-27 In random notes

By combining pyenv, virtualenv and the Jupyter kernel specification, we can have project-specific Python wrappers of different versions, as below.

This article includes the information needed to enable this.

pyenv

ref: https://github.com/pyenv/pyenv/issues/1219

zlib not available: one thing to highlight is that, to enable zlib, we have to do this:

$ xcode-select --install
$ sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /

Below are the steps needed to make sure pyenv has modified our environment variables to point to the specific version of Python.

$ brew install pyenv
$ cd <project_path>
$ pyenv install 3.6.8
$ pyenv local 3.6.8
$ pyenv init
# Load pyenv automatically by appending
# the following to ~/.zshrc:

eval "$(pyenv init -)"

$ which python
/usr/local/bin/python
$ eval "$(pyenv init -)"

$ which python
/Users/<user>/.pyenv/shims/python
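
Besides `which python`, the interpreter itself can report where it lives. The snippet below is a small sketch for that check; the helper names are mine, and the pyenv test is only a heuristic that assumes the default pyenv root directory named `.pyenv`.

```python
import sys


def interpreter_info():
    """Return the running interpreter's path and its major.minor.micro version."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return sys.executable, version


def is_pyenv_python(executable):
    """Heuristic: pyenv-managed interpreters live under a '.pyenv' directory."""
    return ".pyenv" in executable.replace("\\", "/").split("/")


if __name__ == "__main__":
    path, version = interpreter_info()
    print(path, version, "pyenv" if is_pyenv_python(path) else "not pyenv")
```

Run inside the project directory after `pyenv local 3.6.8`, this should report a 3.6.8 interpreter if the shims are active.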

jupyter/IPython kernel management

ref: https://ipython.readthedocs.io/en/latest/install/kernel_install.html

$ jupyter kernelspec list
Available kernels:
  python3    /usr/local/share/jupyter/kernels/python3

$ python -m virtualenv py36

$ source py36/bin/activate

$ pip install ipykernel

$ python -m ipykernel install --user --name py36 --display-name "ProjectName py36"

$ jupyter kernelspec list
Available kernels:
  py36       /Users/<user>/Library/Jupyter/kernels/py36
  python3    /usr/local/share/jupyter/kernels/python3

Then, from the Jupyter notebook web page, you can select a kernel via Kernel: Change Kernel
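
Each entry printed by `jupyter kernelspec list` is a directory containing a kernel.json file, whose argv list starts with the interpreter that the kernel will launch. As a rough sketch (the function name is my own, and it only reads the file with the standard library rather than going through Jupyter's API), this shows which Python a kernel is bound to:

```python
import json
from pathlib import Path


def kernel_interpreter(kernel_dir):
    """Return (display_name, interpreter) recorded in a kernelspec directory."""
    spec = json.loads((Path(kernel_dir) / "kernel.json").read_text())
    return spec.get("display_name"), spec["argv"][0]
```

For the kernel registered above, kernel_interpreter("/Users/<user>/Library/Jupyter/kernels/py36") should point at the python inside the py36 virtualenv, which is a quick way to confirm the notebook is not silently using the system interpreter.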

`matplotlib` calling `latex` failed on macOS

The plot call failed because latex could not be invoked.

Root cause

LaTeX was not installed, or it could not be called as latex (the command was not on PATH).

Minimal reproduction

>>> import subprocess; subprocess.check_call("latex")

# FileNotFoundError: [Errno 2] No such file or directory: 'latex': 'latex'

# or from shell

$ latex
# latex not found
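
A slightly friendlier check than letting check_call raise is to look the command up first. This is a minimal sketch using only the standard library; latex_version is a hypothetical helper name, not part of matplotlib.

```python
import shutil
import subprocess


def latex_version(command="latex"):
    """Return the first line of `<command> --version`, or None if not on PATH."""
    path = shutil.which(command)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

A None result corresponds to the FileNotFoundError above: the interpreter's PATH does not contain a latex executable.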

Fix

- Install TeX Live for macOS (MacTeX)
- Append /Library/TeX/texbin to PATH

$ wget http://tug.org/cgi-bin/mactex-download/MacTeX.pkg
# checksum and install MacTeX.pkg

# assuming you are using zsh
$ echo 'export PATH="$PATH:/Library/TeX/texbin"' >> ~/.zshrc

# assuming you are using bash
$ echo 'export PATH="$PATH:/Library/TeX/texbin"' >> ~/.bashrc
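
The shell profile change only affects processes started from a new shell. If a Python or Jupyter process is already running (my assumption about a common situation, not something verified in this note), the same directory can be appended to that process's own PATH, which child processes such as latex lookups then inherit. A minimal sketch, with ensure_on_path as a made-up helper name:

```python
import os


def ensure_on_path(directory, environ=None):
    """Append directory to PATH in environ (default: os.environ) if missing."""
    env = os.environ if environ is None else environ
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.append(directory)
        env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]
```

Calling ensure_on_path("/Library/TeX/texbin") at the top of a notebook is a per-session workaround; the profile edit above remains the permanent fix.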

Verify it

$ export PATH="$PATH:/Library/TeX/texbin"
$ latex --version
pdfTeX 3.14159265-2.6-1.40.19 (TeX Live 2018)
kpathsea version 6.3.0
Copyright 2018 Han The Thanh (pdfTeX) et al.
There is NO warranty.  Redistribution of this software is
covered by the terms of both the pdfTeX copyright and
the Lesser GNU General Public License.
For more information about these matters, see the file
named COPYING and the pdfTeX source.
Primary author of pdfTeX: Han The Thanh (pdfTeX) et al.
Compiled with libpng 1.6.34; using libpng 1.6.34
Compiled with zlib 1.2.11; using zlib 1.2.11
Compiled with xpdf version 4.00

A study on making nova scheduling smarter with machine learning

Posted on 2019-03-06

Update in June 2019: I managed to implement it, check here ;-) I will write more on how it was done…

This note is my brain dump of ideas for a machine-learning-enabled, optimized nova scheduler weighing.

How does the existing weighing work?

Short version conclusion

By default, it simply applies all existing weighers, each with a weighing factor (multiplier) of 1.0.

TL;DR

See the reference first: https://www.slideshare.net/guptapeeyush1/presentation1-23249150

The weighing was called by:

self.weight_handler.get_weighed_objects(self.weighers)
- self.weighers comes from CONF.host_mgr_sched_wgt_cls_opt
  - By default it’s all weighers
    
    default=["nova.scheduler.weights.all_weighers"]
- get_weighed_objects is doing this:

  for i, weight in enumerate(weights):
      obj = weighed_objs[i]
      obj.weight += weigher.weight_multiplier() * weight
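
To see what that accumulation does end to end, here is a self-contained sketch in plain Python. It is not nova's code: hosts are plain dicts, each weigher is a hypothetical (multiplier, function) pair, and the min-max normalization before multiplying is an assumption on my side that is not shown in the three lines above.

```python
def normalize(values, minval=None, maxval=None):
    """Scale values into [0.0, 1.0]; all-equal inputs map to 0.0."""
    lo = min(values) if minval is None else minval
    hi = max(values) if maxval is None else maxval
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def weigh_hosts(hosts, weighers):
    """Rank hosts by the sum of multiplier * normalized weight over all weighers.

    hosts: list of dicts describing candidate hosts (illustrative only).
    weighers: list of (multiplier, raw_weight_fn) pairs (illustrative only).
    """
    totals = [0.0] * len(hosts)
    for multiplier, raw_weight in weighers:
        weights = normalize([raw_weight(h) for h in hosts])
        # Same accumulation as above: each weigher adds its scaled share.
        for i, weight in enumerate(weights):
            totals[i] += multiplier * weight
    return sorted(zip(hosts, totals), key=lambda pair: pair[1], reverse=True)
```

With every multiplier left at 1.0, all weighers contribute equally, which is exactly the default behaviour that a learned set of multipliers would replace.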

Below are the subroutines mentioned above…