Archive for the 'vmware' Category

Oct 16 2008

Enabling SSH on ESXi

Published by Brian under Computers, Geeky, Linux, vmware

So, I finally had a chance to play with VMware ESXi.   It’s pretty much what I expected, a straight-up version of ESX.  Very, very nice… I’ll start moving more servers over from VMware Server 1.x and report back on my progress.

One of the things that annoyed me out of the gate is the lack of SSH support.   It’s there in the underlying operating system, just not enabled.   Here’s how to turn it on:

  1. Get on the console of the ESXi server.
  2. Press ALT-F1 to get to the OS system console
  3. Type “unsupported”
  4. Enter the root password at the password prompt.
  5. Edit /etc/inetd.conf with vi, and uncomment the SSH line
  6. Run:  kill -1 $(cat /var/run/inetd.pid)
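
If you would rather not fire up vi, step 5 can be sketched with sed. This is only a rough sketch: it assumes the SSH entry is an ordinary inetd.conf line that starts with "#ssh", the function name enable_inetd_ssh is made up for illustration, and the two sample entries are placeholders, not the real contents of an ESXi inetd.conf.

```shell
# Uncomment the ssh entry in an inetd.conf-style file, keeping a backup.
# enable_inetd_ssh is a hypothetical helper name, not an ESXi tool.
enable_inetd_ssh() {
    conf=$1
    cp "$conf" "$conf.bak"
    # Strip the leading '#' only from the line whose service name is ssh.
    sed 's/^#[[:space:]]*\(ssh[[:space:]]\)/\1/' "$conf.bak" > "$conf"
}

# Demonstration on a scratch file with placeholder entries.
demo=$(mktemp)
printf '%s\n' \
    '#ssh    stream tcp nowait root /path/to/sshd    sshd -i' \
    '#telnet stream tcp nowait root /path/to/telnetd telnetd' > "$demo"
enable_inetd_ssh "$demo"
cat "$demo"
```

On the real box you would point the function at /etc/inetd.conf, check the result by eye, and then send inetd the signal from step 6.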

And voilà! You can now SSH to your ESXi box. Enjoy!


Apr 08 2008

VMware Server Tips ‘n Tricks

Published by Brian under Computers, Linux, Scripting, vmware

As anyone who reads this blog regularly knows, I’m a happy VMware Server user. In using it, I’ve come across some handy methods for administering it and the virtual machines created with it. Without further ado, here they are!

Tip #1 - Start and stop your VMs from the command line

If your VMware server is headless and GUI-less (you didn’t install a GUI, did you?), it’s handy to be able to start and stop your VMs with a command-line tool over SSH. Use the vmware-cmd tool for this:

vmware-cmd /path/to/vmxfile.vmx stop <hard|soft>

or

vmware-cmd /path/to/vmxfile.vmx start

The third argument is the powerop mode: ‘soft’ works through the VMware Tools within the guest OS, while ‘hard’ simply powers the VM on or off without the tools.
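
Building on that, here is a rough sketch of stopping every registered VM in one go, for example before rebooting the host. It leans on `vmware-cmd -l`, which lists the registered .vmx files. The function name stop_all_vms and the VMWARE_CMD override are my own inventions for illustration (the override just makes a dry run possible).

```shell
# VMWARE_CMD is a hypothetical override so the loop can be dry-run
# with a stand-in command; by default the real vmware-cmd is used.
VMWARE_CMD=${VMWARE_CMD:-vmware-cmd}

# Stop every registered VM, using the given powerop mode (default: soft).
stop_all_vms() {
    mode=${1:-soft}
    "$VMWARE_CMD" -l | while IFS= read -r vmx; do
        [ -n "$vmx" ] || continue
        echo "Stopping $vmx ($mode)"
        # Redirect stdin so the command cannot swallow the rest of the list.
        "$VMWARE_CMD" "$vmx" stop "$mode" < /dev/null
    done
}
```

Run `stop_all_vms` for a soft stop of everything, or `stop_all_vms hard` if the guests have no tools installed.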

Tip #2 - Re-install your VM Tools quickly

After upgrading the kernel on a Linux-based virtual machine, you’ll also have to recompile the VMware Tools kernel modules. Upon initial installation, you probably executed the usual:

/usr/bin/vmware-config-tools.pl

But did you know you can speed up the process and make it automatic by accepting the default options? The next time you need to recompile your tools, use this instead:

/usr/bin/vmware-config-tools.pl -default

Tip #3 - Fine-grain your VM’s priority

VMware Server does not provide the flexibility of ESX, but you can get part of the way there by using the Linux scheduler to prioritize your virtual machines. By default, VMware Server gives all vmware-vmx processes a nice value of “-10”. In Linux, nice values run from “-20” (highest priority for system resources) to “19” (lowest). By moving your busy VMs to a more negative value (e.g. -15) and your less-intensive VMs to a higher value (e.g. 0), you can tune your server’s performance more finely and ensure timeslices on the host are granted more accurately.

To do this, use the ‘renice’ command. First, find the PIDs of your vmware-vmx processes by using ‘ps’:

[root@tlfvm5 ~]# ps -ef | grep vmware-vmx
root      3374     1 13 Mar18 ?        2-20:03:36 /usr/lib/vmware/bin/vmware-vmx -C /vmware/tlfmonitor/tlfmonitor.vmx -@ ""
root      4833     1 15 Mar18 ?        3-04:09:11 /usr/lib/vmware/bin/vmware-vmx -C /vmware/DellMonitor/DellMonitor.vmx -@ ""

Then renice the appropriate PID. For example, to give “tlfmonitor” a bit of a bump to “-12”:

renice -12 3374

As with all good things, moderation is key. Start with small increments and note the change, then bump it again if needed. Note that reniced values are lost as soon as the process terminates. You can also give a VM a higher default priority via the priority.grabbed and priority.ungrabbed directives in its .vmx file (see http://sanbarrow.com/vmx/vmx-config-ini.html).
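
To save eyeballing the ps output each time, the PID lookup can be scripted. This is a minimal sketch that assumes the `ps -ef` column layout shown above (PID in column 2, executable in column 8). The function name vmx_pid is made up for illustration.

```shell
# Read `ps -ef` output on stdin and print the PID of the vmware-vmx
# process whose command line mentions the given .vmx file name.
# Matching on column 8 keeps awk from matching its own command line.
vmx_pid() {
    awk -v vmx="$1" '$8 ~ /\/vmware-vmx$/ && index($0, vmx) { print $2 }'
}

# Typical use on the host (needs root for negative nice values):
#   renice -12 -p "$(ps -ef | vmx_pid tlfmonitor.vmx)"
```

Since the reniced value goes away whenever the VM process restarts, a lookup like this is easy to drop into whatever script starts your VMs.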

Tip #4 - Manage and extend your virtual disks

VMware Server comes with a tool to completely manage your .vmdk disks. The vmware-vdiskmanager tool can create, defrag, extend, and convert vmdks from one type to another. For example, to expand a vmdk from 10GB to 15GB, power off the VM and issue this command:

vmware-vdiskmanager -x 15Gb /path/to/vmdkfile.vmdk

Note that this extends the raw disk, but not the guest file system. For instance, after extending a disk that holds an ext3 file system in Linux, you still have to grow the partition (or logical volume) inside the guest and then use “resize2fs” to grow the file system to match. You may want to run the vmware-vdiskmanager command without arguments to see some help on the different options, as well as some examples.

Tip #5 - Install VMware tools from the command line

You don’t need to click “VM -> Install VMware Tools…” on the Server Console to mount the virtual media. Do it from the command line!

vmrun installtools /path/to/vmxfile.vmx

This does precisely what clicking in the GUI does. Once this has been run from the host, go to your VM and mount up the /dev/cdrom device and find your tools RPM ready to go.

That’s it for now. Do you have any tips that are useful for other VMware Server administrators? If so, let me know!


Jan 31 2008

More VMware Server in Production

Published by Brian under Clustering, Linux, Tlf, vmware

I happened to run across this fella today, who also runs a very large VMware Server farm in a production environment. He makes a few mentions of his architecture, which is slightly different from our approach; however, it could prove handy to someone else building out such a thing.


Aug 08 2007

VMware Server on NFS & RedHat Cluster Suite

Published by Brian under Computers, Linux, Tlf, vmware

Over the past few weeks I’ve managed to get a pretty darn stable NFS / VMware Server setup running.

The basic specs are as follows:

  • VMware Host: Dell PowerEdge 1950 (dual quad core, 8 gigs RAM)
  • NFS Cluster: Two Dell PowerEdge 860s (single quad core, 4 gigs RAM each)
  • Networking: Dell PowerConnect 5324
  • Centralized storage: EonStor A24F-R2224
  • Internal DRAC on each node for cluster fencing (not ideal, read below).
  • All the VMDKs are stored on an NFS mount from the cluster.

Through quite a bit of experimentation and trial and error, I have it running pretty solid. Some of the key points:

  1. Used RedHat EL 5 all the way around, with the related GFS/RHCS packages
  2. Mounted GFS on the NFS nodes with noatime,noquota for some minor speed improvements.
  3. NFS4 and TCP for everything. Makes failover to the other node more reliable.
  4. Use ‘hard’ mounts on the VMware server, with timeo=600,retrans=2 in your options. This lets TCP, rather than NFS, handle transmission delays during a failover.
  5. On the export side, craft /etc/exports so that every export carries the same ‘fsid=’ value on each node. This gets around stale file handles.
  6. Use the GFS/shared storage/floating IP technique as documented in the Cluster NFS Cookbook versus managed NFS (read why below).
  7. Bonded the NICs on the NFS nodes for higher throughput (in our case they will be exporting to more than one server when in production, so this was necessary).
  8. Spanning tree delays on the PowerConnect can get you in trouble with a fencing loop in a two-node setup. When one of the nodes reboots, its NICs come up more quickly during Linux sysinit than the corresponding ports do on the switch. Linux therefore thinks the interface is reachable when it is not, and when fenced attempts to initialize, it cannot reach the other node and consequently fences it. The solution is either to add “LINKDELAY” to /etc/sysconfig/network or to disable spanning tree on the switch.
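
To make points 3 through 5 a bit more concrete, here is a rough sketch of the client-side mount and a matching export line. Everything site-specific is a placeholder I made up for illustration: the floating-IP hostname nfs-vip, the paths, the subnet, the fsid value, and the function name mount_vm_store. The export path an NFSv4 client sees also depends on how your exports are laid out, so treat this as a starting point only.

```shell
# /etc/exports, identical on BOTH NFS nodes (placeholder path, subnet, fsid):
#   /mnt/gfs/vmware  192.168.10.0/24(rw,sync,no_root_squash,fsid=101)

# Hard-mount the VM store over NFSv4/TCP with the timeouts from point 4.
# $1 = server:/export (the cluster's floating IP), $2 = local mount point.
mount_vm_store() {
    mount -t nfs4 -o hard,proto=tcp,timeo=600,retrans=2 "$1" "$2"
}

# Example call on the VMware Server host (run as root):
#   mount_vm_store nfs-vip:/vmware /vmware
```

Because the mount is ‘hard’, I/O from the VMs blocks and retries during a failover instead of returning errors, which is what you want for VMDKs.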

I initially tried the managed NFS setup in Cluster Suite (check the cookbook); however, there are two major problems. At this time, managed NFS appears to be limited to NFSv2 and v3, as there is no opportunity to modify the export options via Cluster Suite. Also, there are timing delays with how Cluster Suite manages the NFS daemons…

Of course, during a failover speed is of the essence. So, when I had this rig configured for managed NFS failover, I was experiencing 12+ second delays in failover. Why?

Well, it turns out RedHat has a sleep command in /usr/share/cluster/ip.sh (the virtual IP management script) that adds 10 seconds to the failover so NFSD can clear its cache (!?). Pretty hackish, and it results in an unacceptable delay during a failover. Unfortunately, if you’re running managed NFS, there’s no real way around this unless you want to risk corruption from NFSD going down without flushing its cache to disk.

I found this in the Cluster Project FAQ. With the ‘sleep 10’ command gone, failover is much, much quicker. As long as you’re doing the GFS thing rather than the managed NFS setup, this works quite nicely, and fast enough that VMware doesn’t seem to notice what is going on.

Performance-wise, it’s pretty darn good. I have a dedicated PowerConnect 5324 for use as an “ethernet SAN” to interconnect the NFS nodes and VMware Server. That being said, 20 concurrent, lightly loaded VMs result in nothing abnormal in terms of performance or reliability. In fact, it’s hard to tell the difference from local disk, even during a failover. The NIC used for “front-end” access to the VMware Server Console even seems to see more traffic than the NFS one according to the PowerConnect’s interface reports, though that leaves me a bit skeptical.

It would have been interesting to see whether the TOE (TCP Offload Engine) on the 1950’s NetXtreme II NICs would have improved performance, but it works in Windows 2003 only. Bummer.

Another “gotcha” to watch for is using DRAC cards on the servers for fencing purposes. In most cases it works fine; however, when power is lost to the entire server, the DRAC goes down with it and becomes unreachable. This leaves the surviving node stuck trying to fence the dead one, and failover never occurs. A better option would be to use a manageable PDU (which we’ll do ultimately).

Bottom line: it seems to work very well in almost all the failure situations I’ve tested. The only time I was able to make it fail (badly) was by yanking the power cord out of one of the cluster nodes, which brought the entire cluster crunching to a halt because of the DRAC fencing problem mentioned above. I did this while installing X Windows on 50% of the VMs to simulate a lot of NFS write activity.

After bringing the entire cluster back up manually, the only damage was a corrupt RPM DB on one of the VMs. The others came back up fine after a fsck on boot. Not bad!

After my testing, I’m confident this setup will work in a production environment. If you wish to try the same, ensure your testing plan includes every outage situation you can fathom. Odd stuff can come up (for example, the spanning tree thing), and of course it is far better to nail those down in R&D than in production!
