Monitoring AWS with Telegraf & InfluxDB Cloud
Recently I’ve been playing with InfluxDB Cloud and Telegraf for synthetic monitoring of Amazon Web Services API endpoints, and thought I’d share my configuration notes.
What are InfluxDB and Telegraf?
InfluxDB is an open-source time-series database, and Telegraf is a software agent for sending time-series data to InfluxDB. Telegraf has hundreds of plugins to collect different types of time-series data.
One of these plugins, inputs.http_response, sends http/https requests to a list of URLs, and sends metrics about their responses — return code, response time, etc. — to InfluxDB.
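To get a feel for what the plugin measures, here is a rough shell sketch of a single probe using curl. It is only an illustration, not how Telegraf is implemented; probe_url is a hypothetical helper name, and the five-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: request one URL and print "<url> <status code> <seconds>".
probe_url() {
  # -s silences progress output, -o /dev/null discards the response body,
  # and -w prints only the HTTP status code and the total request time.
  result=$(curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}' "$1")
  echo "$1 $result"
}

# Example (needs network access):
#   probe_url https://ec2.us-east-1.amazonaws.com
```

Telegraf runs this kind of check on a schedule and ships the numbers to InfluxDB for you, which is the part worth automating.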
Why monitor AWS?
Let’s suppose that you run your applications on Amazon Web Services, and you want to have a record of whether the specific AWS services you use are running or not. Sure, you can check the AWS Service Health Dashboard. But while plenty of good people work at AWS, relying on it alone is a bit like the fox guarding the henhouse.
For this reason, it’s good to do your own independent monitoring of AWS services — and for that matter, any cloud services or SaaS apps you depend on. So, here are some quick tips on how to get that going with InfluxDB Cloud and Telegraf http_response.
Telegraf configuration for Synthetic Monitoring
First, get Telegraf running on your machine and set up an InfluxDB Cloud instance. Here’s a step-by-step tutorial of how to do that. Next, review this overview of synthetic monitoring with Telegraf and InfluxDB.
After doing so, copy this Telegraf configuration file to your local machine, and run it using this command in your terminal:
telegraf --config ./telegraf-synthetic-aws.conf --debug
Assuming all starts well, you’ll see terminal output similar to this:
From here, you can log into your InfluxDB Cloud Data Explorer (second icon from the top) and start graphing your data.
Let’s explore some key parts of our Telegraf configuration file, telegraf-synthetic-aws.conf.
First, we tag our time series to make it easier to query in the Data Explorer. We state that these metrics come from a company called Amazon, and a service called AWS. This is helpful if we’re also tracking metrics from, say, Google Cloud or Azure.
# Global tags can be specified here in key="value" format.
[global_tags]
company = "Amazon"
service = "AWS" # will tag all metrics with service=AWS
Since there are potentially lots of endpoints we could monitor, I like to conserve my data usage. For this reason, I keep the hostname out of my data streams, since I don’t query on that information.
## If set to true, do not set the "host" tag in the telegraf agent.
omit_hostname = true
We need to specify that we’re using InfluxDB Cloud version 2, as opposed to the version 1 cloud or a local instance:
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb_v2]]
We state where we can find our InfluxDB Cloud instance:
## The URLs of the InfluxDB cluster nodes.
urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
We list our secret token to authenticate Telegraf to post data to our instance of InfluxDB.
## Token for authentication.
token = "$INFLUX_TOKEN"
$INFLUX_TOKEN means that the value comes from an environment variable of the same name. If you haven’t already set this, you can do so via this terminal command:
export INFLUX_TOKEN=[your unique API token]
Here’s how to find your InfluxDB Cloud token if you don’t already have it.
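If the variable is missing when Telegraf starts, writes to InfluxDB Cloud are likely to fail with an authorization error, so it can be worth checking first. A minimal sketch, assuming a POSIX shell; check_token is a hypothetical helper name, not part of Telegraf.

```shell
# Hypothetical pre-flight check: fail early if INFLUX_TOKEN is empty or unset.
check_token() {
  if [ -z "${INFLUX_TOKEN:-}" ]; then
    echo "INFLUX_TOKEN is not set; export it before starting Telegraf" >&2
    return 1
  fi
}

# Example: only start Telegraf when the token is present.
#   check_token && telegraf --config ./telegraf-synthetic-aws.conf --debug
```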
Anyways, back to telegraf-synthetic-aws.conf. The next few lines describe the organization and the bucket you’re writing to.
## Organization is the name of the organization you wish to write to; must exist.
organization = "your-org-name"
## Destination bucket to write into.
bucket = "your-bucket"
Your organization name is at the top of every screen in InfluxDB Cloud:
And you can find your list of buckets (bucket list?) under the Load Data command:
Alright, now we get to the meat of the matter: the Telegraf plugin we are using, and the URLs we want to monitor:
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default http://localhost)
## List of urls to query.
urls = [
"http://aws.amazon.com"
]
The line [[inputs.http_response]] tells Telegraf to run the http_response plugin. Each URL needs to be surrounded by quotes, and multiple URLs are separated by commas.
In the example above, we’re monitoring just the main AWS URL, http://aws.amazon.com. But if we wanted to monitor more, we could easily do so. For example, here’s how to monitor some EC2 endpoints:
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default http://localhost)
## List of urls to query.
urls = [
"https://ec2.ca-central-1.amazonaws.com",
"https://ec2.eu-central-1.amazonaws.com",
"https://ec2.eu-west-1.amazonaws.com",
"https://ec2.sa-east-1.amazonaws.com",
"https://ec2.us-east-1.amazonaws.com",
"https://ec2.us-gov-east-1.amazonaws.com",
"https://ec2.us-west-1.amazonaws.com",
]
And here’s how to monitor some S3 endpoints:
# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
## Server address (default http://localhost)
## List of urls to query.
urls = [
"https://s3.amazonaws.com",
"https://s3.ca-central-1.amazonaws.com",
"https://s3.eu-central-1.amazonaws.com",
"https://s3.sa-east-1.amazonaws.com",
"https://s3.us-east-1.amazonaws.com",
"https://s3.us-west-1.amazonaws.com",
]
You probably get the picture by now. Here’s a list of all AWS endpoints. Pick out the ones that you care about and plug them into the section above.
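Typing out long URL lists by hand gets tedious. As a rough sketch, assuming you have a plain-text file of endpoint hostnames, one per line, a small shell function can print the plugin block for you. endpoints_to_toml and the file name my-endpoints.txt are hypothetical names for illustration, and the helper assumes every endpoint should be probed over https.

```shell
# Hypothetical helper: read hostnames on stdin, print an [[inputs.http_response]] block.
endpoints_to_toml() {
  echo '[[inputs.http_response]]'
  echo '  urls = ['
  # Drop blank lines, then wrap each hostname as a quoted https URL with a trailing comma.
  sed -e '/^[[:space:]]*$/d' -e 's|^|    "https://|' -e 's|$|",|'
  echo '  ]'
}

# Example:
#   endpoints_to_toml < my-endpoints.txt >> telegraf-synthetic-aws.conf
```

The trailing comma after the last URL matches the style of the EC2 and S3 lists above.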
The default http_response configuration doesn’t follow redirects. I prefer to follow redirects to ensure that they don’t lead to a 5xx error, meaning that a service is down. Here’s how to do that:
## Whether to follow redirects from the server (defaults to false)
follow_redirects = true
Putting all the above together, here’s the entire Telegraf configuration file:
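As a minimal sketch, the fragments above can be stitched into one file straight from the terminal. It is assembled only from the snippets in this post, so a complete Telegraf config may carry more settings. The [agent] header is an assumption based on where Telegraf’s sample config places omit_hostname, and the organization and bucket names are placeholders you need to replace.

```shell
# Write a minimal telegraf-synthetic-aws.conf assembled from the fragments above.
# The quoted 'EOF' keeps $INFLUX_TOKEN literal, so Telegraf reads it from the
# environment at run time instead of the shell expanding it now.
cat > telegraf-synthetic-aws.conf <<'EOF'
[global_tags]
  company = "Amazon"
  service = "AWS"

# Assumed section header: omit_hostname is an agent-level setting.
[agent]
  omit_hostname = true

[[outputs.influxdb_v2]]
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$INFLUX_TOKEN"
  organization = "your-org-name"  # placeholder
  bucket = "your-bucket"          # placeholder

[[inputs.http_response]]
  urls = [
    "http://aws.amazon.com"
  ]
  follow_redirects = true
EOF
```

Swap in the EC2 or S3 URL lists from earlier if you want to watch more than the main AWS page.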
By this point, you should have everything you need to monitor AWS and other cloud services and SaaS apps that are available via the web.
Addendum: List of AWS endpoints
While writing this post, I was amazed at how many Amazon Web Services there now are. Even more staggering: AWS is delivered through over 2000 HTTP endpoints for its various APIs.
Now, if you’re only using a few AWS endpoints, you can skip this section. But if you want to monitor AWS endpoints en masse, read on.
(Caveat: monitoring dozens or more URLs may put you out of the InfluxDB Cloud free tier and require you to be on their paid plan.)
Here’s a list of AWS service endpoints in plain text format I compiled from this AWS documentation page.
Since this list will change over time, here’s how to compile a list of AWS endpoints using Terminal:
curl https://docs.aws.amazon.com/general/latest/gr/rande.html | grep amazonaws | sort -u > aws-endpoints.txt
Let’s break down what’s happening here:
- curl captures a web page.
- https://docs.aws.amazon.com/general/latest/gr/rande.html is the webpage of all of AWS endpoints.
- Using a Unix pipe, we send that webpage to the grep command, which in turn pulls out all lines with amazonaws. Thankfully, AWS has a convention, as far as I can tell, of always using amazonaws in the URLs of its service endpoints. This makes things easier.
- sort sorts all lines of text alphabetically, and the -u flag shows only the unique lines so we don’t have repeats.
- We then send the output of sort to a new file, called aws-endpoints.txt.
From here, you can use the text editor of your choice to remove any extra cruft on each line.
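If you’d rather not clean up by hand, one option is to have grep print only the hostnames instead of whole lines. A minimal sketch, assuming a grep that supports -o and -E (GNU and BSD grep both do); extract_endpoints is a hypothetical helper name, and the pattern is my guess at what an endpoint hostname looks like, so check the output before trusting it.

```shell
# Hypothetical helper: read HTML (or any text) on stdin, print unique AWS endpoint hostnames.
extract_endpoints() {
  # -o prints only the matching part of each line; the optional (\.cn)?
  # keeps China-region hostnames ending in .amazonaws.com.cn intact.
  grep -oE '[A-Za-z0-9.-]+\.amazonaws\.com(\.cn)?' | sort -u
}

# Example (needs network access):
#   curl -s https://docs.aws.amazon.com/general/latest/gr/rande.html | extract_endpoints > aws-endpoints.txt
```

The result is one bare hostname per line, with the surrounding HTML tags and any https:// prefix already stripped.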