Windows Failover Cluster (WFC) is a component of the Windows Server OS that allows several machines (called nodes here) to function together as a failover cluster. The purpose of a cluster is to administer the resources assigned to it. A cluster monitors the health of its resources and can restart them or migrate them to another node if needed. Resources can depend on other resources and also belong to resource groups, so that all resources in a group run on the same node. WFC supports many resource types; three of them are of particular interest to us: Generic Application, Generic Script, and IP Address.
A Generic Application is a resource backed by a customer-specified executable. WFC starts the executable on one of its nodes when the resource should go online and then tracks the process. If the process terminates or the node becomes unresponsive, the cluster takes corrective action, such as restarting the process or migrating it to another node.
A Generic Script resource is a customer-provided Windows Script Host (WSH) script. The script manages some underlying resource: the cluster calls it for tasks such as bringing the resource online or offline and retrieving the resource's current state.
An IP Address resource is, as its name implies, just an IP address. The node on which this resource is online gets the address assigned to it.
WFC also provides Cluster Shared Volumes (CSV). A CSV is shared, synchronized storage that is available to all cluster nodes and is presented to each node as a regular NTFS volume. It provides all the regular storage services, for example file-system locks, which effectively become distributed locks in a cluster.
*Note: `WebApp` is just the name of a .NET Core Starcounter 3.0 web application.*
Now we've covered all the resources we need to set up a Starcounter failover cluster. First, we start with an easy setup and show how it recovers from possible faults. Then we point out the drawbacks of this setup and show how we address them with a more advanced approach that is appropriate for real deployments.
An easy setup could look like this:
Group: Starcounter
└─ Resource: WebApp (Type: Generic Application, executable: WebApp.exe)
   ├─ Resource: IP Address
   └─ Resource: Database (Type: Generic Application, executable: scdata.exe)
Here "Starcounter" is a resource group containing three resources: `WebApp`, a Starcounter application we want to make highly available, and two other resources that `WebApp` depends on: an IP Address and a Database.
When starting a resource group, the cluster assigns it to some cluster node. On this node, it first starts `WebApp`'s dependencies, i.e. the IP Address and Database resources, and then the application.
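The dependency-first start order described above can be illustrated with a small sketch. It is only a model of the idea, not how WFC is implemented: the `dependencies` table and the `startOrder` function are hypothetical names introduced for this example, and the relative order of the two independent dependencies is an artifact of the sketch.

```javascript
// Hypothetical model of the "Starcounter" group: each resource lists the
// resources it depends on. Names mirror the easy setup described above.
var dependencies = {
  "WebApp": ["IP Address", "Database"],
  "IP Address": [],
  "Database": []
};

// Returns the resources in an order where every dependency precedes the
// resource that needs it (depth-first traversal; throws on a cycle).
function startOrder(deps) {
  var order = [];
  var state = {}; // undefined = unvisited, 1 = in progress, 2 = done

  function visit(resource) {
    if (state[resource] === 2) { return; }
    if (state[resource] === 1) {
      throw new Error("Dependency cycle at " + resource);
    }
    state[resource] = 1;
    var needed = deps[resource] || [];
    for (var i = 0; i < needed.length; i++) {
      visit(needed[i]);
    }
    state[resource] = 2;
    order.push(resource);
  }

  for (var key in deps) {
    if (Object.prototype.hasOwnProperty.call(deps, key)) {
      visit(key);
    }
  }
  return order;
}

var order = startOrder(dependencies);
console.log(order.join(" -> "));
```

In this model `WebApp` always comes last, which matches the behaviour described: the application is started only after both of its dependencies are online.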
The Database resource is a Generic Application resource. Starting it just starts the `scdata` process, thus making the database available to connect to. The `scdata` process locks the transaction log on CSV, reads it and then gets ready to serve incoming requests.
All write transactions also go to CSV. Once its prerequisite resources are started, the cluster starts the `WebApp` resource by launching `WebApp.exe`. `WebApp.exe`, being a web app, binds to all local addresses, including the IP Address resource we assigned to it.
Now the group is fully started and ready to serve requests.
Let's consider the possible faults and the corresponding recovery actions:
Fault | Recovery action |
| Cluster detects that |
| Cluster detects |
Node goes offline (network failure or power outage) | Cluster detects that the node is offline and decides to move the role to another node. First it dismounts CSV from the old node, so that all locks are released. Then it selects a new hosting node and starts all resources on it. Due to the locks being released, |
*Note: the basic setup is not recommended for production use and is described for educational purposes only.*
The setup described above has an important drawback. In certain cases, recovery will require a fresh start of the `scdata` process, which can take a significant amount of time. To overcome this, we need to run our `scdata` instance in a special standby mode, in which it can:
Function without locking the transaction log.
Periodically read and apply the transaction log.
Switch to active mode upon request.
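To make the benefit of standby mode concrete, here is a toy model of the three capabilities listed above. It is a sketch under stated assumptions, not Starcounter code: `createInstance`, `poll`, and `promote` are hypothetical names, the transaction log is a plain array, and the assumption that promotion takes the log lock is inferred from the fact that an active `scdata` locks the transaction log.

```javascript
// Toy model of an scdata instance; all names here are hypothetical.
function createInstance(log) {
  var mode = "standby";
  var applied = 0; // number of log records applied so far

  function catchUp() {
    applied = log.records.length;
  }

  return {
    mode: function () { return mode; },
    applied: function () { return applied; },
    // Standby: read and apply new records without taking the log lock.
    poll: function () {
      if (mode === "standby") { catchUp(); }
    },
    // Promotion (assumed behaviour): take the lock, apply the remaining
    // tail of the log, then switch to active mode.
    promote: function (owner) {
      if (log.lockedBy !== null && log.lockedBy !== owner) {
        throw new Error("transaction log is locked by " + log.lockedBy);
      }
      log.lockedBy = owner;
      catchUp();
      mode = "active";
    }
  };
}

// The active instance on node1 holds the lock; node2 runs in standby.
var log = { records: ["tx1", "tx2"], lockedBy: "node1" };
var standby = createInstance(log);

standby.poll();            // tx1 and tx2 are applied ahead of time
log.records.push("tx3");   // written by the active instance on node1
log.lockedBy = null;       // node1 fails and its lock is released
standby.promote("node2");  // only tx3 is left to apply

console.log(standby.mode(), standby.applied());
```

The point of the model is that at promotion time only the records written since the last poll remain to be applied, instead of the whole log.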
To keep `scdata` running on a non-active cluster node we can't use cluster resources, as WFC ensures that all resources in a group are online on a single node. Instead we must provide an auto-started Windows service, `starservice`, that administers `scdata.exe` for us.
This is how it works:
Event | Reaction |
|
|
|
|
| OS kills |
|
|
|
|
To properly control `starservice`, i.e. starting, stopping and sending promotion requests, we use a Generic Script resource.
Now the setup looks like this:
Every cluster node has a configured instance of `starservice`.
We configure the cluster resources as follows:
Group: Starcounter
└─ Resource: WebApp (Type: Generic Application, executable: WebApp.exe)
   ├─ Resource: IP Address
   └─ Resource: Database (Type: Generic Script)
The Database script has the following workflow:
Cluster event | Action |
Go Online | Start |
Go Offline | Restart starservice ². |
¹ As a safety measure. The normal condition for a service is to be always started.
² As of now we can't switch `scdata` from active to standby mode, so we restart the service and thus `scdata`, so that it restarts in standby mode. This is not a problem: the resource most likely goes offline because we're transferring the group to another node, so we have enough time to load the database on this node. The next time the cluster decides to host the group on this node, `scdata` will already be prepared.
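The workflow can be sketched as a set of handlers. This is a minimal illustration, not the script shipped with Starcounter: the `service` object and its methods (`isRunning`, `isActive`, `start`, `restart`, `promote`) are hypothetical stand-ins for whatever mechanism controls `starservice`, the health check is an assumption, and the return values are simplified to booleans. The handler names follow the entry points a Generic Script resource exposes, and the code is plain JavaScript so it can be tried outside a cluster.

```javascript
// Sketch only: `service` is a hypothetical wrapper around starservice.
// None of its methods are a documented Starcounter API.
function createDatabaseScript(service) {
  return {
    // Go Online: make sure the service runs, then ask scdata to go active.
    Online: function () {
      if (!service.isRunning()) {
        service.start(); // safety measure; normally it is already started
      }
      service.promote();
      return service.isActive();
    },
    // Go Offline: restart the service so scdata comes back in standby mode.
    Offline: function () {
      service.restart();
      return true;
    },
    // Health check (assumed): healthy while scdata runs in active mode.
    IsAlive: function () {
      return service.isRunning() && service.isActive();
    }
  };
}

// In-memory stand-in for starservice, used only to exercise the handlers.
function createFakeService() {
  var running = false;
  var active = false;
  return {
    isRunning: function () { return running; },
    isActive: function () { return active; },
    start: function () { running = true; active = false; },
    restart: function () { running = true; active = false; },
    promote: function () { if (running) { active = true; } }
  };
}

var service = createFakeService();
var script = createDatabaseScript(service);

script.Online();
console.log("after Online: alive =", script.IsAlive());

script.Offline();
console.log("after Offline: running =", service.isRunning(),
            "active =", service.isActive());
```

Passing the service controller in as a parameter keeps the handlers testable with a fake; in a real script the controller would talk to the Windows service instead.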
Now, instead of starting `scdata` when the group moves to a new node, the cluster will start the Database script resource, which in turn ensures that `scdata` is started and active. WFC will handle the migration of CSV, IP Address, and `WebApp`.
This new setup shares one drawback with the first one: if `scdata` crashes, the cluster will first restart it on the same node, which might take time. This issue could be seen as marginal, however, as `scdata` should never crash. A crashing `scdata` process is in and of itself a more severe problem than a slow recovery.
We plan to design `scdata` to allow it to serve read requests in standby mode. With this feature, every cluster node will become an eventually consistent read-only replica.
*Note: It is important to specify the database path using exactly the same value in all places where it occurs. Values such as `C:\Path\To\Db` and `C:/Path/To/Db` are treated as different.*
See also the article about the Database connection string.
Using the `star` tool:
star new /csv/path/to/db
Using the native `sccreatedb` tool:
sccreatedb -ip <path on csv volume> <database name>
Download, unzip, and copy the `starservice` files to all nodes. These files should have the same location on all nodes.
Create a service to start the database. The service name should be the same on all nodes.
Using the `sc.exe` tool:
sc.exe create <database name> start= auto binPath= "<path to starservice.exe> service <path to the database>"
Using the `starservice.exe` tool itself:
starservice.exe install <database name> <path to the database>
Add-ClusterGroup Starcounter
Add-ClusterResource -Name "IP Address" -ResourceType "IP Address" -Group "Starcounter"
Get-ClusterResource "IP Address" | Set-ClusterParameter -Multiple @{"Address"="<ip address>";"SubnetMask"="<subnet mask>";"EnableDhcp"=0}
Copy the `scripts` folder from the previously downloaded archive to all nodes. The local path should be the same on all nodes. Don't use the CSV volume, as it will complicate resource upgrades and troubleshooting.
Then:
Add-ClusterResource -Name "Database" -Group "Starcounter" -ResourceType "Generic Script"
Get-ClusterResource "Database" | Set-ClusterParameter -Name ScriptFilepath -Value "<path to script.js>"
Get-ClusterResource "Database" | Set-ClusterParameter -Name DbName -Value "<database name>" -Create
Copy the `WebApp` files to all nodes. The local path should be the same on all nodes. Don't use the CSV volume, as it will complicate resource upgrades and troubleshooting.
Then:
Add-ClusterResource -Name "WebApp" -Group "Starcounter" -ResourceType "Generic Application"
Get-ClusterResource "WebApp" | Set-ClusterParameter -Name CommandLine -Value "<path to webapp.exe>"
Set-ClusterResourceDependency -Resource WebApp -Dependency "([IP Address]) AND ([Database])"
Start-ClusterGroup Starcounter
Make sure to specify a reasonable configuration for the maximum allowed failures over time for the required resources. The default configuration is very limiting.
Make sure to have at least three nodes in a cluster, or a file share witness to keep the cluster alive when a node goes down.
Starcounter 3 Release Candidate does not yet support failover on Linux operating systems out of the box. If you have a Linux production environment which requires failover, please contact us.