Tuesday, January 28, 2020

Putting the fun back in OpenStack CI/CD pipelines with RHOSP and Gitlab


Introduction

As a technical consultant, an important part of my daily activities is developing code or customizations for the customer I'm working with. Quite often, this means working extra hours in lab environments to make sure we have a 100% reliable, reproducible and idempotent software change.

If you've worked with 'Red Hat OpenStack Platform' after version 7, you probably know that a Director-based deployment of RHOSP can be quite time-consuming and may involve a fair amount of trial and error.

At the same time, if I only wanted to test a specific feature in a lab, it could take me a few hours to launch a complete environment just to try that feature out. This was becoming very time-consuming, even with powerful hypervisors.

So, I came up with the crazy idea of making my life easier and using a GitLab pipeline to deploy RHOSP, so I didn't have to perform the prerequisite steps manually.

Defining the Production Chain

Here's the process I came up with:

  1. select a hypervisor and boot it if it's turned off (see the sketch right after this list).
  2. select a RHOSP version (8, 9, 10, 13, 15 or 16)
  3. select an overcloud deployment size
  4. select whether the undercloud should be restored from a snapshot.
  5. block until all answers are provided, then proceed with the deployment.
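Step 1 deserves a quick illustration. My hypervisors are Dell boxes with iDRAC/IPMI, so waking a powered-off one from a pipeline job can be as simple as an IPMI chassis command. Here is a minimal sketch (the BMC address and password variables are placeholders, not the actual pipeline code):

# Power the hypervisor on through its BMC if it is currently off.
if ipmitool -I lanplus -H "${HYP_BMC}" -U root -P "${IPMI_PASS}" chassis power status | grep -q "is off"; then
    ipmitool -I lanplus -H "${HYP_BMC}" -U root -P "${IPMI_PASS}" chassis power on
fi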

Here is what the complete workflow looks like in a successful run:


It wasn't easy to come up with the complete setup and it took many attempts. A few weeks into this project, I got it working; here are some more details.

The GitLab setup

To make my life easier, I decided to run GitLab Omnibus in a VM on one of my hypervisors and then add runners as Docker-based gitlab-runners (one per hypervisor).
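In practice, bringing up one such Docker-based runner on a hypervisor boils down to something like this (a sketch only; the URL, registration token and default image are placeholders, not my exact commands):

# Run the GitLab runner itself as a container on the hypervisor...
docker run -d --name gitlab-runner --restart always \
    -v /srv/gitlab-runner/config:/etc/gitlab-runner \
    -v /var/run/docker.sock:/var/run/docker.sock \
    gitlab/gitlab-runner:latest

# ...then register it against the GitLab Omnibus VM.
docker exec gitlab-runner gitlab-runner register \
    --non-interactive \
    --url "https://gitlab.example.lan/" \
    --registration-token "REDACTED" \
    --executor docker \
    --docker-image "centos:7" \
    --description "runner-hypervisor1"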

This works because GitLab runners can come and go dynamically and GitLab handles this fine. Here's a diagram describing the setup:



The Pipeline design

When running a pipeline, the requirements were pretty simple:

The pipeline would block in stage 5 (after the 4 previous stages, one for each option) and wait for the user to provide the required answers:

The end-user would then go and select an answer for each of the previous stages in no particular order.


  • Choose a hypervisor to run the virtual deployment on:




Once selected, the stage changes to:

And then later to:



  • Next, choose a Red Hat OpenStack Platform version:


Once selected, the stage changes to:


  • Choose an overcloud deployment size:

(1 director, 1 controller, 1 ceph and 2 computes or something larger):


  • Choose whether the undercloud should be restored from a snapshot or not (a sketch of what this means under the hood follows)
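In my setup the undercloud is a VM living on the chosen hypervisor, so 'restoring from a snapshot' essentially means a libvirt snapshot revert. Conceptually, something like this (the VM and snapshot names are made up for the example):

# Roll the undercloud VM back to a known-good snapshot before re-deploying.
virsh snapshot-revert undercloud-16 --snapshotname fresh-install
# If the snapshot was taken while the VM was shut off, start it again.
virsh start undercloud-16 2>/dev/null || true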



When all 4 questions have been answered, the config injection task will terminate and unleash the rest of the stages:

We have a liftoff!

All this is implemented using a single GitLab CI/CD YAML here, which got added to my templates repository.
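The actual file is linked above; as a rough, heavily trimmed illustration of the blocking idea (job names, variables and the deploy.sh script below are made up), GitLab's manual jobs with allow_failure: false are what keep the later stages from starting until an answer has been clicked:

stages:
  - choices
  - deploy

# One manual job per possible answer; "allow_failure: false" makes the manual
# job blocking, so the deploy stage cannot start until the choices are made.
choose-hypervisor-kvm01:
  stage: choices
  when: manual
  allow_failure: false
  script:
    - echo "TARGET_HYPERVISOR=kvm01" > hypervisor.env
  artifacts:
    paths: [hypervisor.env]

choose-rhosp-16:
  stage: choices
  when: manual
  allow_failure: false
  script:
    - echo "RHOSP_VERSION=16" > version.env
  artifacts:
    paths: [version.env]

deploy-rhosp:
  stage: deploy
  script:
    # the *.env files are fetched automatically as artifacts of the earlier stage
    - source hypervisor.env && source version.env
    - ./deploy.sh "${TARGET_HYPERVISOR}" "${RHOSP_VERSION}"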

The whole idea was that I could keep developing RHOSP TripleO Heat Templates and easily spin up a full-fledged environment when I needed to test a change or reproduce a customer issue.

All that was required was to take my existing git repo...

..and then add a GitLab CI/CD configuration file:







This configuration file is a little long and contains some specific commands, but the whole process should be straightforward with the help of the documentation: https://docs.gitlab.com/ee/ci/










Friday, December 20, 2019

The case of the missing filler (Dell PowerEdge Rack to Tower conversion)

So you've bought some PowerEdge Tower servers but these were originally rack servers.
No problem, you thought, as you removed the rack 'ears' and sourced a top cover + feet to turn your T430 or T630 back into a tower.

But one problem remained: on rack servers, the front panel is actually 15mm shorter, leaving some of the internals/electronics exposed to dust and spills. Here's the top of a T630 converted back to 'Tower':




Then I realized we were in 2019 and that 3D printing is now common: there are companies on the Internet that will print your CAD files and deliver the 3D-printed parts to your door.

So, I started FreeCAD's AppImage on my RHEL machine and designed a part that fit the gap and borders I had just measured:


The CAD files I created are located here:
FreeCAD

After a couple of weeks, the parts (one for my T430, another for my T630) arrived. I had gotten the color wrong but at least they fit nicely:




I didn't have any black paint so I used a 'Sharpie' marker to make the surfaces darker:









And the final result was fine (no more gaps):




Monday, November 25, 2019

A new addition to the FrankenStation family, the Dell PowerEdge T630 runs RHEL (and Windows 10 Pro for DCS World) just fine!

A little while ago, I decided to retire my main workstation (a Dell Precision T7910) and go for a real server with lots of spare bays.
I had mostly been using the T7910 as a RHEL hypervisor and I kept a small dual-boot Windows 10 Pro partition on it for the sole purpose of playing DCS World.

After much shopping around, I found a used PowerEdge T630 with 18 3.5" HDD bays. The machine took a few days to get to my home and transferring the H/W between the two systems took about an hour.

This included:
- all of my RAM (8 * 32Gb DDR4 LRDIMMs)
- Two excellent E5-2682 V4 cpus I got from a nice Lady in Shanghai.
- An H730P (2Gb) with several 2Tb SSDs and one HDD.
- An EVGA GTX 1660 Ti (6Gb DDR6)
- A SuperMicro dual-NVMe PCI-E card with a 970 EVO drive for the KVM guest datastore.
- A Quad-1G i350-T4 NIC (internal NICs, external interface, heartbeats).


New H/W included a pair of brand new high-wattage heatsinks (up to 160W per cpu), a GPU power distribution board and a pair of GPU cables.

I powered up the T630 and went upstairs. When I returned, the system had booted from the boot drive, found all of its hardware and the machine had re-joined the cluster.

Even more surprising was the nice discovery that the Windows 10 Pro install (on a small partition of the boot SSD) was able to reconfigure itself and boot successfully after that. I started with Windows 10 1903 and it was able to update itself to Windows 10 1909 without any kind of trouble.

A few days into this, I can say that I'm very happy with the transfer: the T630 does -everything- the T7910 did and is even quieter than its former sibling. The T630 has more drive bays, more DIMM slots and more PCI-E slots, which allowed me to re-install the dual QLogic 8Gb FC HBA I had had in the T7910.


To celebrate, I decided to take the Viper out for a small flight.. DCS World @ 4k on a Dell PowerEdge T630 with a GTX 1660 Ti works so well it is almost unbelievable:



Dell System Update was even able to patch the system (it applied an iDRAC update just fine), but the system icon was all wrong (my T630 isn't exactly a laptop):




And of course, it runs RHEL7 and OWacomp when I'm not flying over Nevada:

Monday, March 18, 2019

An NVidia GTX 1050 Ti in that PowerEdge T440 without the GPU Kit.

Having recently upgraded the GPU in my Dell Precision T7910, I found myself with this card lying around (ASUS ROG STRIX 1050 Ti Gaming):














So I had this crazy thought: what if I tried to use that card in my PowerEdge T440? This would provide a decent (and silent) upgrade to the MSI GT 1030 currently in the server.

One problem was that the card required external GPU power and I had ordered my T440 without the GPU kit, which can only be installed at Point-Of-Sale and cannot be retro-fitted afterwards.
I tried using the 1050 Ti with just the power provided by the x16 PCI-E slot but the server failed to recognize the GPU.

So I went looking into my T440 (that GPU Kit must draw power from somewhere, right?) and found a white connector on the PCB directly attached to the PSU cage:

A close inspection using my phone revealed something looking almost like an 8-pin GPU connector, with an informative label on the side:


Did you read "GPU_PWR" too? I surely did, but that white connector was a little different from the GPU power connectors I was used to seeing.

Then I remembered I had seen similar connectors very recently.. in my Dell Precision T7910!!
(Why change a good design when you've got one?)

Luckily, the Precision T7910, with its 1300W PSU, had lots of GPU power cables (enough for 2 power-hungry 6-pin or 8-pin GPUs) and I was pretty sure I'd never use more than one GPU in my T7910; a nice GTX 1660 was good enough for me.
So I went ahead and pulled one of the two GPU cables from the Precision T7910. Unfortunately, the 8-pin cable from the T7910 didn't fit the T440 PSU connector due to mismatched diameters on the two bottom-right slots.

After trying to make those 'fingers' thinner using a cutter, I realized that those two didn't even have electrical wiring so I just cut them off:


Once this (not so) delicate surgery took place, the GPU cable fit perfectly into the GPU Power connector on the PSU PCB of my T440. The other end of the cable (6pin) made its way to the GPU card and I powered up the server, which came up perfectly.

Such cables can be ordered on ebay for about 10USD.

Here are a few pictures of the finished assembly:



In conclusion, I'll state that although I like Dell PowerEdge and Precision hardware, I very much dislike the FUD surrounding those systems:
- I didn't buy my T440 with a GPU Kit (my own mistake) and Dell wasn't able to help retrofit a kit afterwards (no such solution exists).
- I still managed to power up that GTX card using an extra cable I borrowed from another system without Dell's help.
- Notwithstanding their (Dell's) desire to sell me Platinum 795W or 1100W PSU units, my complete T440 system still idles at 88 Watts and never seems to exceed 200W. IMHO that 495W PSU might be just fine.

# ipmitool sdr list full|grep Watt
Pwr Consumption  | 88 Watts          | ok
Complete Specs here:
  • Two 4110 Xeon Silver cpus (8C/16T)
  • 96 (3 * 32) Gb RAM
  • Two Samsung 860 SSD Evos
  • One WD Red 8Tb drive
  • One H730P HBA
  • One i350-4 Quad Gigabit NIC
  • One 495W PSU
This server is probably one of the best workstations I've ever had!

Friday, March 8, 2019

Some Tips about PowerEdge as Workstation (Revisited for 14th Gen servers)


A new computer: Dell PowerEdge T440 server.

As much as I consider it a very fine machine now, the road there wasn't easy. Some of the previous 12th Gen and 13th Gen tips didn't apply and had to be revisited.

Also, because I was unaware of some of the 'quirks', I ran into some issues after purchase and it took me a while to add in the extra hardware to make the T440 experience more enjoyable.

3rd Party PCI Fan response

There is no longer a one-size-fits-all 3rd-party PCI fan response setting.
Instead, this is now done per slot. Look for it in the iDRAC GUI under
'Hardware Settings':

Or, for the CLI-minded:
/admin1-> racadm get System.PCIESlotLFM.1
[Key=System.Embedded.1#PCIeSlotLFM.1]
#3rdPartyCard=Yes
#CardType=NIC
CustomLFM=0
LFMMode=Disabled
#MaxLFM=310
#SlotState=Defined
#TargetLFM=-
/admin1-> racadm set System.PCIESlotLFM.1.LFMMode 2
[Key=System.Embedded.1#PCIeSlotLFM.1]
Object value modified successfully
/admin1-> racadm get System.PCIESlotLFM.1 
[Key=System.Embedded.1#PCIeSlotLFM.1]
#3rdPartyCard=Yes
#CardType=NIC
CustomLFM=0
LFMMode=Custom
#MaxLFM=310
#SlotState=Defined
#TargetLFM=-

Fan Speed

The server is -very- picky about component health (there are a lot more sensors). At one point it was pushing 100% fan speed because of the lack of a temperature sensor on the 850 EVO SSD sitting behind the H730P. I upgraded to 860 EVOs, problem solved. The PowerEdge T130, which had both the H730P and that SSD, never had a single issue with this.

I decided that I liked it better when the fans stayed around 1080 RPM, so I added a script to my RHEL7 system:

# gmake install
chkconfig --add dellfanctl
(II) -------
/etc/rc.d/rc1.d/K02dellfanctl
/etc/rc.d/init.d/dellfanctl
/etc/rc.d/rc3.d/S75dellfanctl
/etc/rc.d/rc2.d/S75dellfanctl
/etc/rc.d/rc0.d/K02dellfanctl
/etc/rc.d/rc4.d/S75dellfanctl
/etc/rc.d/rc6.d/K02dellfanctl
/etc/rc.d/rc5.d/S75dellfanctl
(II) -------
You have new mail in /var/spool/mail/root
# systemctl -al|grep dellf
  dellfanctl.service                                                                                             loaded    active     exited    SYSV: Enables manual IPMI Dell Fan control after boot
# crontab -l|grep dellf
*/35 * * * * /etc/init.d/dellfanctl start > /dev/null 2>&1
# /etc/init.d/dellfanctl status
(II) MAX T: 65C, Current T: 30C, Fan: 1080 (+/- 120) RPM   [  OK  ]
# /etc/init.d/dellfanctl start
(II) Enabled Manual fan Control on host daltigoth          [  OK  ]


This script can be downloaded here (adapt script for your hostnames):
dellfanctl
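For the curious, the script is essentially a wrapper around the widely documented Dell raw IPMI fan-control commands; here is a stripped-down sketch of the idea (the duty-cycle value is just an example, this is not the full dellfanctl script):

#!/bin/bash
# Take manual control of the fans, then pin them to a low duty cycle.
ipmitool raw 0x30 0x30 0x01 0x00        # disable automatic (iDRAC) fan control
ipmitool raw 0x30 0x30 0x02 0xff 0x0f   # set all fans (0xff) to ~15% duty cycle
# Hand control back to the iDRAC if temperatures climb too high:
# ipmitool raw 0x30 0x30 0x01 0x01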

GPU cards

The single x16 GPU-capable slot only gets enabled when you have two Xeons, not one.
If you want a GPU that draws more power than the PCI-E slot provides, please remember to order your server with the GPU kit, as it cannot be retrofitted/ordered afterwards. I am planning to research this soon.

I'm currently using an MSI GeForce GT 1030 single-width card in the machine.

Power Draw


With one Xeon Silver 4110, my T440 idled around 66 Watts. With two cpus it idles around 88 Watts. That's quite decent.

Wednesday, March 22, 2017

LVM2 bootdisk encapsulation on RHEL7/Centos7

Introduction


Hi everyone,
Life on overcloud nodes was simple back then and everybody loved that single 'root' partition on the (currently less than 2Tb) bootdisk. This gave us overcloud nodes partitioned like this:

[root@msccld2-l-rh-cmp-12 ~]# df -h -t xfs 
 Filesystem Size Used Avail Use% Mounted on 
/dev/sda2 1.1T 4.6G 1.1T 1% /

The problem with this approach is that anything filling up any subdirectory on the boot disk can cause services to fail. This story is almost 30 years old.
For that reason, most security policies (think SCAP) insist that /var, /tmp and /home be separate logical volumes and that the boot disk use LVM2 so that additional logical volumes can be added.

To solve this problem, whole-disk image support is coming to Ironic. It landed in 5.6.0 (see [1]) but missed the OSP10 release. With whole-disk image support in Ironic, we could easily change overcloud-full.qcow2 into a full-disk image with LVM and separate volumes. This work is a tremendous advance, thanks to Yolanda Robla. I hope it gets backported to stable/Newton (OSP10, our first LTS release).

I wanted to solve this issue for OSP10 (and maybe for previous versions too), so I started working on a tool to 'encapsulate' the existing overcloud partition into LVM2 during deployment. This is now working reliably and I wanted to present the result here so it can be re-used for other purposes.

Resulting configuration

The resulting config is fully configurable and automated. It will carve an arbitrary number of logical volumes out of the boot disk of your freshly deployed overcloud node.
Here's an example for a compute node with a 64Gb boot disk and an 8Tb secondary disk:

[root@krynn-cmpt-1 ~]# df -t xfs
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rootdg-lv_root 16766976 3157044 13609932 19% /
/dev/mapper/rootdg-lv_tmp 2086912 33052 2053860 2% /tmp
/dev/mapper/rootdg-lv_var 33538048 428144 33109904 2% /var
/dev/mapper/rootdg-lv_home 2086912 33056 2053856 2% /home

[root@krynn-cmpt-1 ~]# pvs
PV VG Fmt Attr PSize PFree
/dev/sda2 rootdg lvm2 a-- 63.99g 11.99g

[root@krynn-cmpt-1 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
rootdg 1 4 0 wz--n- 63.99g 11.99g

Implementation

The tool (mostly a big fat shell script) will come into action at the end of firstboot and use a temporary disk to create the LVM2 structures and volumes. It will then set the root to this newly-created LV and will reboot the system.

When the system boots, it wipes clean the partition the system was originally installed on. Then it proceeds to mirror the LVs and the VG back to that single partition. Once finished, everything is back where it was before, except for the temporary disk, which has been wiped clean too.
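To make the mechanics clearer, here is a heavily simplified sketch of the first phase (illustrative only: device names, sizes and rsync options are examples, and the real script does a lot more bookkeeping):

# Phase 1 (end of firstboot): build the LVM2 layout on the temporary disk
# and copy the running root filesystem into it.
pvcreate /dev/sdc1                  # temporary disk partition: WILL BE WIPED
vgcreate rootdg /dev/sdc1
lvcreate -L 16g -n lv_root rootdg
lvcreate -L 32g -n lv_var  rootdg
mkfs.xfs /dev/rootdg/lv_root
mkfs.xfs /dev/rootdg/lv_var
mount /dev/rootdg/lv_root /mnt
mkdir -p /mnt/var
mount /dev/rootdg/lv_var /mnt/var
rsync -aAXH --exclude='/proc/*' --exclude='/sys/*' --exclude='/dev/*' \
      --exclude='/run/*' --exclude='/mnt/*' / /mnt/
# Next: update /mnt/etc/fstab and the GRUB config to boot from rootdg/lv_root,
# inject the run-once service, then reboot.  Phase 2 mirrors the VG back onto
# the original /dev/sda2 and removes the temporary disk from the VG.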

Logs of all actions are kept on the nodes themselves:

[root@krynn-cmpt-1 ~]# ls -lrt /var/log/ospd/*root*log
-rw-r--r--. 1 root root 15835 Mar 20 16:53 /var/log/ospd/firstboot-encapsulate_rootvol.log
-rw-r--r--. 1 root root 2645 Mar 20 17:02 /var/log/ospd/firstboot-lvmroot-relocate.log



The first log details the execution of the initial part of the encapsulation: creating the VG and the LVs, setting up GRUB, injecting the boot run-once service, etc.
The second log details the execution of the run-once service that mirrors the volumes back to the original partition carved by TripleO during a deploy.

It is called by the global multi-FirstBoot template here:

Which we called from the main environment file:


Configuration

The tool lets you change the name of the Volume Group, how many volumes are needed, what size they should be, etc. The only way to change this is to edit your copy of the script and modify the lines marked 'EDITABLE' at the top. E.g.:

boot_dg=rootdg                                 # EDITABLE
boot_lv=lv_root                                # EDITABLE
# ${temp_disk} is the target disk. This disk will be wiped clean, be careful.
temp_disk=/dev/sdc                             # EDITABLE
temp_part="${temp_disk}1"
# Size the volume
declare -A boot_vols
boot_vols["${boot_lv}"]="16g"                   # EDITABLE
boot_vols["lv_var"]="32g"                       # EDITABLE
boot_vols["lv_home"]="2g"                       # EDITABLE
boot_vols["lv_tmp"]="2g"                        # EDITABLE
declare -A vol_mounts
vol_mounts["${boot_lv}"]="/"
vol_mounts["lv_var"]="/var"                     # EDITABLE
vol_mounts["lv_home"]="/home"                   # EDITABLE
vol_mounts["lv_tmp"]="/tmp"                     # EDITABLE


All of the fields marked 'EDITABLE' can be changed. Any new LV can be added by inserting a new entry in both boot_vols and vol_mounts, as in the example below.
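For instance, adding a hypothetical lv_log volume mounted on /var/log only requires two extra lines:

boot_vols["lv_log"]="4g"                        # EDITABLE (hypothetical example)
vol_mounts["lv_log"]="/var/log"                 # EDITABLE (hypothetical example)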

Warnings, Caveats and Limitations


Please be aware of the following warnings
  • The tool will WIPE/ERASE/DESTROY whatever temporary disk you give it. (I use /dev/sdc because /dev/sdb is used for something else). This is less than ideal but I haven't found something better yet.
  • This tool has only been used on RHEL7.3 and above. It should work fine on Centos7.
  • The tool -REQUIRES- a temporary disk. It will not function without it. It will WIPE THAT DISK.
  • This tool can be used outside of OSP-Director. In fact this is how I developed this script but you still REQUIRE a temporary disk. 
  • This tool can be used with OSP-Director but it MUST be invoked in firstboot and it MUST execute last. One way to do this is to make it 'depend' on all of the previous first boot scripts. For my templates, it involved doing the following:
  • It lengthens your deployment time and causes an I/O storm on your machines as the data blocks are copied back and forth. For virtual environments, I have added 'rootdelay' and 'scsi_mod.scan=sync' to help the nodes find their 'root' after the reboot (see the example right after this list). If some nodes complain that they couldn't mount 'root' on unknown(0,0), this is likely caused by that issue and resetting the node manually should get everything back on track.
  • The resulting final configuration is fully RHEL-supported, nothing specific there.

  • THIS IS A WORK IN PROGRESS, feel free to report back success and/or failure.
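For reference, the two kernel arguments mentioned above can be added on a running RHEL7/CentOS7 node with grubby (the rootdelay value below is just an example):

# Append the boot parameters to every installed kernel entry.
grubby --update-kernel=ALL --args="rootdelay=90 scsi_mod.scan=sync"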

Tuesday, November 15, 2016

Some Tips about running a Dell PowerEdge Tower Server as your workstation

Some use workstations as servers.
I'm using servers as workstations.

Over the years, I've changed computing gear on quite a few occasions. I've been using Tower Servers for the past 5 years and would like to share some tips to help others.




  • But why would anyone want to do that??


- Servers are well integrated systems and are usually seriously designed and tested.

- They offer greater expandability (6x3.5" hotswap bays in my previous T410, now 8x3.5" in my T430).

- They usually include some kind of Remote Access Card (RAC) which is great for remote'ing in when all else has failed.

- I can get tons of server equipment on ebay that will be compatible with that system.

- Where else can I get 192Gb of ECC DDR4 RDIMM, dual 6-core Xeons and 8 hotswap bays?

  • Tip #1 : Choose your chassis with care.
Not all servers are created equal:

- Rack servers are usually thin and noisy (those 8k RPM fans have the job of cooling that 2U enclosure). It is not uncommon for them to be in the 60-70dBA range.

- Tower servers are much bigger and less noisy. They are also more expensive -but- your electricity bill will be lower than with a comparable rack server, so the price difference will shrink after a few months. And having a server that makes less noise and draws less power is more environment-friendly!

- Most pre-2011 tower servers from Dell and HP (before Dell's 11th Gen and before HP's Gen8) are less quiet than their modern counterparts.

In 2016, I'd recommend getting a 12th or 13th Gen from Dell.. If you are into HP Gear, get a Gen8 or a Gen9. I've never done Lenovo or Cisco gear, so I can't help here.

- Most modern towers from Dell feature a single 120mm PWM fan to cool the entire chassis. That's the T410, T420 or T430. I assume the T310, T320 and T330 are similar since they feature the same chassis.

- The environmental ratings for current and past servers can usually be obtained from the manufacturer. Check the specs carefully. I found the spec for most recent Dell Tower servers here:

Dell-13G PowerEdge Acoustical Performance and Dependencies

  • Tip #2 : Choose your components carefully.

Now that you've selected the system, let's pick the components.

- CPUs

- Most recent tower servers feature PWM (4-pin) fans that are controlled by the iDRAC/iLO controller. The sensors on these systems feed the controller with temperature readings, which it uses to drive the speed of the fans.

- Consequently, even if you want plenty of Xeon cores, you probably don't want one of their 145W 12-core monsters. Such a chip (or a pair of them) will increase the thermal response under load, which will result in increased fan speed. On the other hand, lower-wattage Xeons usually have a low core frequency that might make the user experience in interactive sessions oh-not-so-great.

I usually pick Xeons in the 65W-85W range. These typically feature decent punch while keeping heat (and noise) tolerable.

Wikipedia has a great list of all Xeon processors with Wattage, Cores, etc.. here:
List of Intel Xeon Microprocessors

- Graphics!

The bundled graphics adapter in your server will not let you run much more than a 2D environment. This can be solved by adding a PCI-E GPU, which will give you decent 3D performance.

Forget about the latest Radeon or NVidia monster, it's not going to work at all.
When I tried my NVidia Quadro K2000 (a 65W card) in my Dell PowerEdge T130, the system simply refused to boot and told me that the card was drawing too much power to power on all components.

GPUs usually work fine if they are in the 45W-or-below range. I've used NVidia Quadro K620 and K600 cards with great success in my PowerEdge. The passive GeForce cards from the previous generations (GT730, etc...) can also be used successfully.

Here's my Poweredge T130:
# lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [GeForce GT 620] (rev a1)
That card was replaced by a GT 730:
# lspci |grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1)


And here's the Poweredge T430:
# lspci |grep VGA
03:00.0 VGA compatible controller: NVIDIA Corporation GF108GL [Quadro 600] (rev a1)

- Sound

Servers don't have sound cards.. but I've used USB audio adapters with much success to get sound from videos and games on my Linux servers/workstations.
These can usually be obtained for about USD 10 on Amazon or eBay:


  • Tip #3 : Use the right settings
Dell servers need some parameters passed to the iDRAC in order to keep noise to a minimum, even when using 3rd-party PCI-E cards.

Disable the PCI-E 3rd-party thermal response (this can also be done from the iDRAC submenu of the BIOS GUI):

Here's a 13th Gen server. I highlighted the most important fields.

/admin1-> racadm get System.ThermalSettings
[Key=System.Embedded.1#ThermalSettings.1]
#FanSpeedHighOffsetVal=75
#FanSpeedLowOffsetVal=15
#FanSpeedMaxOffsetVal=100
#FanSpeedMediumOffsetVal=45
FanSpeedOffset=Off

#MFSMaximumLimit=100
#MFSMinimumLimit=5
MinimumFanSpeed=255
ThermalProfile=Minimum Power
ThirdPartyPCIFanResponse=Disabled 


Some of these can be modified by using the iDrac CLI:
/admin1-> racadm set  System.ThermalSettings.FanSpeedOffset Off      
/admin1-> racadm set  System.ThermalSettings.ThirdPartyPCIFanResponse 0
[Key=System.Embedded.1#ThermalSettings.1]
Object value modified successfully

To be continued...
