Deploying the SignalFx Smart Gateway

Summary of steps

To deploy the SignalFx Smart Gateway, start by reviewing the sizing information below to determine the recommended hardware for running the Smart Gateway. You then need to install and configure the gateway and, if you have determined it is necessary, install and configure a clustered gateway. Finally, you will start the gateway and point your applications to the correct locations for reporting their trace spans.

Instance sizing

The recommended instance sizes for running the Smart Gateway are as follows, based on your expected volume of trace spans per minute (SPM).

| SPM | AWS EC2 Type |
| --- | --- |
| Up to 6M | c5.18xlarge |
| Up to 3M | c5.9xlarge |
| Up to 1.5M | c5.4xlarge |
| Up to 750k | c5.2xlarge |
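As a rough illustration of how the sizing table above can be applied, the following Python sketch picks the smallest listed instance type for an expected span volume. The function name `recommend_instance` is hypothetical and not part of any SignalFx tooling; the thresholds are copied from the table.

```python
# Thresholds copied from the sizing table (spans per minute -> EC2 type),
# ordered from smallest to largest.
SIZING = [
    (750_000, "c5.2xlarge"),
    (1_500_000, "c5.4xlarge"),
    (3_000_000, "c5.9xlarge"),
    (6_000_000, "c5.18xlarge"),
]


def recommend_instance(spans_per_minute):
    """Return the smallest listed instance type that covers the volume.

    Returns None when the volume exceeds the largest row in the table,
    which is the case where a clustered gateway is worth considering.
    """
    for limit, instance_type in SIZING:
        if spans_per_minute <= limit:
            return instance_type
    return None


print(recommend_instance(1_000_000))
```

A result of `None` only means the volume is beyond the published table; it does not say how many cluster members you would need.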

If a single Smart Gateway is insufficient for your needs (that is, you expect to send more traces than the gateway can handle), you can either use a bigger instance or set up a clustered gateway behind an HTTP load balancer. The Smart Gateway embeds etcd for cluster coordination. Deploy a minimum of three gateways when clustering, and prefer an odd number of members. You can read more on best practices for etcd clustering.

Start by following the installation and configuration instructions below. You will then be able to implement a clustered gateway.

Install and configure the Smart Gateway

If you have not already installed and configured the SignalFx Gateway (formerly called the Metric Proxy), download the SignalFx Gateway binary and then install and configure it. Be sure to correctly configure forwarders (ForwardTo) and listeners (ListenFrom) (see this section).

The following instructions enable our NoSample™ Tail-Based Distributed Tracing feature, and “transform” the SignalFx Gateway into the SignalFx Smart Gateway.

Add a stanza to your config file as shown in the example below. The only value you should have to configure is where the Smart Gateway writes out state when restarting (the BackupLocation). This directory must exist, and if you’re inside a container, it should be persisted outside the container. By default the Smart Gateway will look for the configuration file at /etc/gateway.conf; to specify a different location, use the command line flag --configfile.

Notes about the example:

/var/log/gateway is where you want logs to go, and /var/config/gateway/data is a good location to save data for persistent restarts. You can also change the Name on the forwarder and ServerName in the main section to some identifier of your choosing that is meaningful to your organization. Do not set ServerName to the same as another Smart Gateway in the same cluster or organization. Also, do not set the Name of a forwarder to be the same as any other forwarder on the same gateway instance. Both of these will result in errors.

json
{
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ServerName": "smart-gateway",
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "DefaultAuthToken": "PUTYOURORGTOKENHERE",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data"
      }
    }
  ]
}
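The two constraints mentioned in the notes (the BackupLocation directory must exist, and forwarder names must be unique on a gateway instance) can be checked before starting the gateway. The following Python sketch is one way to do that; `check_gateway_config` is a hypothetical helper, not something shipped with the Smart Gateway, and it only inspects the keys shown in the example config.

```python
import json
import os


def check_gateway_config(config):
    """Return a list of problems found in a parsed gateway config dict."""
    problems = []
    forwarders = config.get("ForwardTo", [])

    # Forwarder names must not repeat on the same gateway instance.
    names = [f.get("Name") for f in forwarders if f.get("Name") is not None]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate forwarder Name: {name}")

    # The BackupLocation directory must already exist.
    for forwarder in forwarders:
        backup = forwarder.get("TraceSample", {}).get("BackupLocation")
        if backup is not None and not os.path.isdir(backup):
            problems.append(f"BackupLocation does not exist: {backup}")
    return problems


def check_gateway_config_file(path="/etc/gateway.conf"):
    """Load a config file (default path taken from the text) and check it."""
    with open(path) as handle:
        return check_gateway_config(json.load(handle))
```

An empty list means neither of these two problems was found; it is not a full validation of the config.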

If you plan to implement a clustered Smart Gateway, skip to Install and configure a clustered Smart Gateway. Otherwise, start the gateway by running ./smart-gateway. To start it with a config file at a different path, run ./smart-gateway --configfile /path/to/your/gateway.conf.

Finally, make sure your deployed SignalFx Smart Agents are configured to send data through your Smart Gateway by configuring their ingestUrl to http://<your-gateway>:8080/ (as per the Smart Agent Configuration documentation).

SignalFx is now ready to accept traces, datapoints and events. For information about metrics, see Metrics emitted by the SignalFx Smart Gateway.

Install and configure a clustered Smart Gateway

To benefit from high availability or to handle large trace volumes, you can deploy multiple instances of the SignalFx Smart Gateway that work together as a cluster. To configure a clustered gateway, perform the following steps after completing the initial installation and configuration steps. Once your Smart Gateway instances are installed and configured, you will need to deploy an HTTP load balancer in front of them (HAProxy and Nginx are good options).

Cluster configuration options

The configurations for clustering the Smart Gateway are defined in the gateway.conf file, but can be overridden at start up using their corresponding environment variables.

Please note that all cluster configurations will be ignored if a cluster operation is not specified (“seed” or “join”).

Review the configuration options below, and then continue to Configuring the clustered Smart Gateway.

| Configuration | Environment Variable | Description | Default Value |
| --- | --- | --- | --- |
| ServerName | SFX_SERVER_NAME | The name the server should be identified by. This value should be unique for each instance in the cluster. | <none> |
| ClusterOperation | SFX_CLUSTER_OPERATION | The cluster operation that the Smart Gateway should perform on startup. Options are “join” or “seed”. If left blank, the Smart Gateway will not operate in cluster mode. Please note that the command line flag “-cluster-op” will override both the config file and the environment variable. | <none> |
| TargetClusterAddresses | SFX_TARGET_CLUSTER_ADDRESSES | A comma-separated list of peer addresses and ports for the Smart Gateway to join. If using the environment variable, assign the list as a single string of addresses with ports separated by commas. These addresses are static. For example: SFX_TARGET_CLUSTER_ADDRESSES="127.0.0.1:2379,127.0.0.1:2380" | <none> |
| ListenOnPeerAddress | SFX_LISTEN_ON_PEER_ADDRESS | The address and port that the etcd server listens on for peer connections. This address is static. | 127.0.0.1:2380 |
| AdvertisePeerAddress | SFX_ADVERTISE_PEER_ADDRESS | The address and port advertised by the etcd server for peer connections. This address is static. | 127.0.0.1:2380 |
| ListenOnClientAddress | SFX_LISTEN_ON_CLIENT_ADDRESS | The address and port that the etcd server listens on for client connections. This address is static. | 127.0.0.1:2379 |
| AdvertiseClientAddress | SFX_ADVERTISE_CLIENT_ADDRESS | The address and port advertised by the etcd server for client connections. This address is static. | 127.0.0.1:2379 |
| ETCDMetricsAddress | SFX_ETCD_METRICS_ADDRESS | The address and port used to expose Prometheus-style metrics about the embedded etcd server. This address is static. | 127.0.0.1:2381 |
| ClusterDataDir | SFX_CLUSTER_DATA_DIR | A file system path for the etcd server to store data in. NOTE: If running in a container, make sure this is persisted outside the container. | ./etcd-data |
| UnhealthyMemberTTL | SFX_UNHEALTHY_MEMBER_TTL | The duration after which an etcd member should be removed from the cluster when it is presumed unhealthy. | 5s |
| RemoveMemberTimeout | SFX_REMOVE_MEMBER_TIMEOUT | The time to wait for the instance to remove itself from the etcd cluster when shutting down the Smart Gateway instance. | 1s |
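To illustrate the environment variable overrides, the following Python sketch builds the variables for one cluster member, joining the target addresses into the single comma-separated string described above. The function `cluster_env` is a hypothetical helper; the variable names are taken from the table, and only three of them are shown.

```python
def cluster_env(server_name, operation, target_addresses):
    """Build environment variable overrides for one cluster member.

    operation must be "seed" or "join"; without one of these the
    cluster settings are ignored, so anything else is rejected here.
    """
    if operation not in ("seed", "join"):
        raise ValueError("operation must be 'seed' or 'join'")
    return {
        "SFX_SERVER_NAME": server_name,
        "SFX_CLUSTER_OPERATION": operation,
        # The environment variable takes one comma-separated string.
        "SFX_TARGET_CLUSTER_ADDRESSES": ",".join(target_addresses),
    }


env = cluster_env(
    "smart-gateway-bbaa",
    "join",
    ["10.1.77.44:2379", "10.5.88.55:2379", "10.0.140.232:2379"],
)
print(env["SFX_TARGET_CLUSTER_ADDRESSES"])
```

The resulting dictionary could be merged into the process environment before launching the gateway, for example through the `env` argument of `subprocess.Popen`.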

Configuring the clustered Smart Gateway

After reviewing the information in Cluster configuration options above, create a config similar to the one below for each member of the cluster. Notice that you must set ServerName to something different for each member of the cluster; the name should also be globally unique.

The example below is a config file for a cluster of three members, all of which should be listed in the config file. If you leave out any of the addresses, they will default to listening on localhost.

If a cluster operation is not specified then the Smart Gateway will ignore all other cluster specific configurations and start in standalone mode.

Note that the cluster mode requires the SignalFx listener to be configured, and for the IngestAddress and ListenRebalanceAddress options to be set in the TraceSample stanza. If you’re inside a container you will probably want to specify the AdvertiseRebalanceAddress so that you can listen on a different host/port combination from what the real machine exposes. If you don’t specify the AdvertiseRebalanceAddress the listener will advertise the ListenRebalanceAddress.

Here is an example of what might go on a non-containerized machine, where the IP of that machine is 10.1.77.44, in a cluster of three machines.

json
{
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ServerName": "smart-gateway-bbaa",
  "ListenOnPeerAddress": "10.1.77.44:2380",
  "AdvertisePeerAddress": "10.1.77.44:2380",
  "ListenOnClientAddress": "10.1.77.44:2379",
  "AdvertiseClientAddress": "10.1.77.44:2379",
  "ETCDMetricsAddress": "10.1.77.44:2381",
  "ClusterDataDir": "/var/config/gateway/etcd",
  "ClusterOperation": "join",
  "UnhealthyMemberTTL": "5s",
  "RemoveMemberTimeout": "1s",
  "TargetClusterAddresses": [
    "10.1.77.44:2379",
    "10.5.88.55:2379",
    "10.0.140.232:2379"
  ],
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "DefaultAuthToken": "PUTYOURTOKENHERE",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        "IngestAddress": "http://10.1.77.44:8080"
      }
    }
  ]
}

Here is an example of what might go onto a containerized machine where the external IP of the machine is still 10.1.77.44 but internal ports are mapped externally with a 2 in front of them.

json
{
  "StatsDelay": "10s",
  "LogDir": "/var/log/gateway",
  "ServerName": "smart-gateway-bbaa",
  "ListenOnPeerAddress": "0.0.0.0:2380",
  "AdvertisePeerAddress": "10.1.77.44:22380",
  "ListenOnClientAddress": "0.0.0.0:2379",
  "AdvertiseClientAddress": "10.1.77.44:22379",
  "ETCDMetricsAddress": "0.0.0.0:2381",
  "ClusterDataDir": "/var/config/gateway/etcd",
  "ClusterOperation": "join",
  "UnhealthyMemberTTL": "5s",
  "RemoveMemberTimeout": "1s",
  "AdditionalDimensions": {"cluster":"bb"},
  "TargetClusterAddresses": [
    "10.1.77.44:22379",
    "10.5.88.55:22379",
    "10.0.140.232:22379"
  ],
  "ListenFrom": [
    {
      "Type": "signalfx",
      "ListenAddr": "0.0.0.0:8080"
    }
  ],
  "ForwardTo": [
    {
      "Type": "signalfx",
      "DefaultAuthToken": "PUTYOURTOKENHERE",
      "Name": "smart-gateway-forwarder",
      "TraceSample": {
        "BackupLocation": "/var/config/gateway/data",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        "AdvertiseRebalanceAddress": "10.1.77.44:22382",
        "IngestAddress": "http://10.1.77.44:28080"
      }
    }
  ]
}
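The difference between the two examples is the split between listen addresses (inside the container) and advertised addresses (the external IP with the mapped port). The Python sketch below derives the containerized values from the external IP, assuming the same port mapping as the example (each internal port is exposed externally with a "2" in front). The helper names are hypothetical, and the result is a flat dictionary for illustration only: in the real config the rebalance and ingest settings belong in the TraceSample stanza.

```python
def external_port(internal_port, prefix="2"):
    """Map an internal port to its external port, e.g. 2380 -> 22380."""
    return int(f"{prefix}{internal_port}")


def container_addresses(external_ip, prefix="2"):
    """Derive listen/advertise settings for a containerized gateway."""

    def advertised(port):
        return f"{external_ip}:{external_port(port, prefix)}"

    return {
        # Inside the container, listen on all interfaces.
        "ListenOnPeerAddress": "0.0.0.0:2380",
        "ListenOnClientAddress": "0.0.0.0:2379",
        "ListenRebalanceAddress": "0.0.0.0:2382",
        # Advertise what other machines can actually reach.
        "AdvertisePeerAddress": advertised(2380),
        "AdvertiseClientAddress": advertised(2379),
        "AdvertiseRebalanceAddress": advertised(2382),
        "IngestAddress": "http://" + advertised(8080),
    }


addresses = container_addresses("10.1.77.44")
print(addresses["AdvertisePeerAddress"])
```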

Start the gateway

To start the first instance in the cluster, you will need to override the join operation and seed the cluster from the command line, for example: ./smart-gateway --cluster-op seed --configfile /var/config/gateway/gateway.conf. The remaining instances, and any restarts, should only require the --configfile parameter.

After the first node has completely stood up, start up each additional node in the cluster one by one, waiting for each node to completely stand up before starting the next node. If StatsDelay is configured on the gateway, then you can verify that the node joined the cluster by looking at the reported cluster size via the metric proxy.tracing.sampler.clusterSize.
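The startup order described above can be summarized as: seed the first node, then start each remaining node with only its config file. The Python sketch below builds the corresponding command lines; `start_command` is a hypothetical helper that uses only the flags named in this guide, and it builds the command without running it.

```python
def start_command(config_path, seed=False, binary="./smart-gateway"):
    """Build the command line for starting one gateway instance.

    Only the first instance of a new cluster is started with
    --cluster-op seed; every other start uses the config file alone.
    """
    command = [binary]
    if seed:
        command += ["--cluster-op", "seed"]
    command += ["--configfile", config_path]
    return command


config = "/var/config/gateway/gateway.conf"
print(" ".join(start_command(config, seed=True)))   # first node
print(" ".join(start_command(config)))              # remaining nodes
```

Each node still has to be started on its own machine, and only after the previous node has completely stood up.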

Finally, make sure your deployed SignalFx Smart Agents are configured to send data through your Smart Gateway by configuring their ingestUrl to http://<your-gateway>:8080/ (as per the Smart Agent Configuration documentation).

Stop the gateway

To stop the gateway, send a SIGTERM to the gateway process and wait for the process to complete. You must do this one by one in the cluster.
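As a minimal sketch of "send SIGTERM and wait, one by one", the following Python code stops a list of locally running processes in sequence. It assumes you hold `subprocess.Popen` handles for the gateway processes on a POSIX system, which is an assumption made for illustration; in a real cluster the instances usually run on separate machines and are stopped through your own process manager.

```python
import signal
import subprocess
import sys


def stop_gracefully(process, timeout=30.0):
    """Send SIGTERM to one process and wait for it to exit.

    Returns the exit code. Raises subprocess.TimeoutExpired if the
    process has not exited within the timeout.
    """
    process.send_signal(signal.SIGTERM)
    return process.wait(timeout=timeout)


def stop_one_by_one(processes, timeout=30.0):
    """Stop each process in turn, waiting for each before the next."""
    return [stop_gracefully(process, timeout) for process in processes]
```

The 30-second timeout is an arbitrary placeholder; choose a value that gives the instance enough time to remove itself from the cluster.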

Restart the gateway

To restart the gateway, stop and start the gateway one by one and ensure that the gateway has stood up before continuing to the next one.

Metrics emitted by the SignalFx Smart Gateway

In addition to the metrics listed in Traces, spans, and SignalFx metrics, we also emit the metrics listed below. All metrics sent by the Smart Gateway have the dimensions host:ServerName and source:gateway on them.

| Metric Name | Additional Dimensions | Description |
| --- | --- | --- |
| gateway.commit | samplerCommit: the sampler’s commit SHA | Gauge that emits the value 1 and contains the SHAs of the components that make up the Smart Gateway. |
| gateway.processedTraces | none | Cumulative counter of all traces processed by this gateway or cluster |
| gateway.processedSpans | none | Cumulative counter of all spans processed by this gateway or cluster |
| gateway.sentTraces | none | Cumulative counter of all traces that were selected by the Smart Gateway and sent to SignalFx |
| gateway.sentSpans | none | Cumulative counter of all spans that were selected by the Smart Gateway and sent to SignalFx |
| dropped_spans | reason: the reason the span was dropped | Cumulative counter of all spans dropped by the Smart Gateway |
| dropped_traces | reason: the reason the trace was dropped | Cumulative counter of all traces dropped by the Smart Gateway |

Troubleshooting

Etcd is embedded inside of the Smart Gateway and is used for cluster management.

In some circumstances the etcd cluster may become unhealthy if a Smart Gateway is terminated but etcd was unable to cleanly remove the member from the cluster.

In this situation, the remaining nodes should eventually remove the member from etcd.

You can verify this by pointing etcdctl at one of the other cluster members. etcdctl is distributed as a binary executable in the etcd GitHub repository.

Verify that the member that was terminated has been removed by listing the members in the cluster:

$ ./etcdctl --endpoints=http://<client address>:2379 member list
8a052b9d07b922c8: name=gateway-1 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=true
abb83521a48373b5: name=gateway-2 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=false
d936b01b7ddff746: name=gateway-3 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=false

You can also check the health of each cluster member using etcdctl:

$ ./etcdctl --endpoints=http://<client address>:<client port> cluster-health
member 8a052b9d07b922c8 is healthy: got healthy result from http://<host>:<port>
member abb83521a48373b5 is healthy: got healthy result from http://<host>:<port>
member d936b01b7ddff746 is healthy: got healthy result from http://<host>:<port>

If a Smart Gateway is failing to restart because the cluster is “unhealthy,” check that the Smart Gateway is no longer listed as a member of the cluster using the above two commands. If the Smart Gateway still appears in the list of members, try removing the Smart Gateway manually via etcdctl using the Smart Gateway instance’s etcd member ID. The member ID is printed first on each entry in the member list.

$ ./etcdctl member remove 8a052b9d07b922c8
# Member 8a052b9d07b922c8 removed from cluster ef37ad9dc622a7c4

Once the member has been successfully removed from the etcd cluster, try restarting the Smart Gateway instance.
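If you need to look up a member ID by name, the member list output can be parsed with a few lines of Python. This sketch assumes the output format shown in the member list example above (member ID, a colon, then key=value fields); other etcdctl versions may print a different format. The function `parse_member_list` is a hypothetical helper.

```python
def parse_member_list(output):
    """Map member name -> member ID from `etcdctl member list` output."""
    members = {}
    for line in output.splitlines():
        member_id, separator, rest = line.strip().partition(": ")
        if not separator or "name=" not in rest:
            continue  # skip the command line itself and blank lines
        fields = dict(item.split("=", 1) for item in rest.split() if "=" in item)
        members[fields["name"]] = member_id
    return members


sample = """\
8a052b9d07b922c8: name=gateway-1 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=true
abb83521a48373b5: name=gateway-2 peerURLs=http://<host>:<port> clientURLs=http://<host>:<port> isLeader=false
"""
print(parse_member_list(sample)["gateway-1"])
```

The returned ID is the value to pass to etcdctl member remove.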

High cardinality span identities generated by variables in span names

Many applications are instrumented with variable span names. This is an anti-pattern and will lead to poor performance and sampling accuracy of the Smart Gateway; it creates a very large number of span and trace identities, which results in the Smart Gateway’s inability to construct consistent baselines for those spans and traces while also consuming more memory resources. This pattern will also impact the performance and user experience of the SignalFx APM UI.

We recommend using span tags instead of variable span names. However, if you are unable to modify your application so that it stops emitting variable span names and uses tags for those variable elements, the Smart Gateway can turn these high cardinality names into tags using a configurable set of replacement rules.

Doing this will make the span identity space much smaller and will allow the Smart Gateway to establish span-level and trace-level baselines accurately, restoring the quality and accuracy of the trace selection algorithm while retaining all the required information on the spans, making them available for analysis by our Outlier Analyzer.
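To make the idea concrete, the Python sketch below shows what a replacement rule does conceptually: a variable part of the span name is moved into a tag, and the span name becomes a fixed template. This is an illustration only and does not show the Smart Gateway's actual rule syntax or configuration; the rule format, the example span name, and the `normalize_span` function are all assumptions.

```python
import re

# Hypothetical rules: (pattern with named groups, fixed span name).
RULES = [
    (
        re.compile(r"^/api/users/(?P<user_id>\d+)/orders$"),
        "/api/users/{user_id}/orders",
    ),
]


def normalize_span(name, tags, rules=RULES):
    """Return (span name, tags) with variable name parts moved to tags.

    The first matching rule wins. Spans that match no rule are
    returned unchanged (with a copy of their tags).
    """
    for pattern, fixed_name in rules:
        match = pattern.match(name)
        if match:
            new_tags = dict(tags)
            new_tags.update(match.groupdict())
            return fixed_name, new_tags
    return name, dict(tags)


print(normalize_span("/api/users/42/orders", {"env": "prod"}))
```

With a rule like this, every user ID maps to the same span name, so the number of span identities no longer grows with the number of users, while the ID itself remains available as a tag.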

What’s Next?

Continue to Deploying the SignalFx Smart Agent.