API Details

Terminology

The first part of any documentation is to define exactly what we are talking about, to avoid confusion later. While there are many terms specific to TSDS, the most common and important ones are listed below:

Measurement Type - a set of measurements that are defined the same way. They have the same value types and same cardinality in their metadata. Examples of this would be "interface" or "cpu".

Measurement - a specific instance of a measurement type. This could be a particular interface on a router or a cpu on the device.

Required Metadata - the bits of data about a measurement that uniquely identify it. For an interface this might be the name of the interface and the name of the device it is on. These fields are marked as required because together they form the definition of the measurement.

Metadata - other bits of data about a measurement. This could be the circuits that ride across the interface or the name of the POP the node resides in. All of the non-required metadata is optional and is tracked historically per measurement.

Classifier - a metadata field may be set as a classifier on a measurement type. A classifier indicates that the field is a good grouping point for the raw measurements. For example, grouping interfaces by their circuit can show both the A and Z sides of the circuit.

Ordinal - the ordinal field on either metadata or values is used to indicate the level of importance of the field. This primarily drives what is shown by default in the UI. For example, input and output bits per second are marked as ordinal fields 1 and 2 for interfaces.

Values - the metrics that are being measured about a measurement type. For interfaces this might be input/output bits/errors/packets per second or for CPU this might be % utilization.

Query - the basic building block of the TSDS backend is its query language, which is used to pull data from the system and manipulate it. It functions very much like a query in an SQL-based system.


Metadata

Before writing a successful data query, a user must know what measurement types exist and what metadata and values exist for each type. A metadata webservice provides this information in an easily parsed JSON format.

Below are some examples of using this webservice:

Getting the available measurement types
tsds/services/metadata.cgi?method=get_measurement_types

{
    "results": [
        {
            "name": "interface",
            "label": "Interface"
        },
        {
            "name": "cpu",
            "label": "CPU"
        }
        ...
    ]
}

Getting the possible values for a given measurement type
tsds/services/metadata.cgi?method=get_measurement_type_values&measurement_type=interface

{
    "results": [
        {
            "ordinal": 1,
            "name": "input",
            "description": "Input Bits/s",
            "units": "bps"
        },
        {
            "name": "outerror",
            "description": "Output Errors/s",
            "units": "pps"
        }
        ...
    ]
}

Getting the known metadata fields for a given measurement type

tsds/services/metadata.cgi?method=get_meta_fields&measurement_type=interface

{
    "results": [
        {
            "required": null,
            "name": "network"
        },
        {
            "required": null,
            "name": "max_bandwidth"
        },
        {
            "required": null,
            "name": "parent_interface"
        },
        {
            "ordinal": 2,
            "required": 1,
            "name": "node"
        }
        ...
    ]
}

Getting existing values for a given metadata field in a given measurement type
tsds/services/metadata.cgi?method=get_meta_field_values&measurement_type=interface&meta_field=node&limit=10&offset=0

{
    "total": 337,
    "results": [
        {
            "value": "rtr1.foo.net"
        },
        {
            "value": "rtr2.foo.net"
        },
        {
            "value": "rtr3.foo.net"
        }
    ]
}
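A client might walk these endpoints to discover what can be queried. The sketch below is illustrative only: the base URL and the use of the Python requests library are assumptions, not part of the API.

import requests

# Assumed base URL; replace with the host actually running the TSDS services.
BASE = "https://tsds.example.net/tsds/services"

# Discover the available measurement types.
types = requests.get(BASE + "/metadata.cgi",
                     params={"method": "get_measurement_types"}).json()

for mt in types["results"]:
    name = mt["name"]

    # Fetch the value fields and metadata fields defined for each type.
    values = requests.get(BASE + "/metadata.cgi",
                          params={"method": "get_measurement_type_values",
                                  "measurement_type": name}).json()
    fields = requests.get(BASE + "/metadata.cgi",
                          params={"method": "get_meta_fields",
                                  "measurement_type": name}).json()

    print(name,
          [v["name"] for v in values["results"]],
          [f["name"] for f in fields["results"]])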

Fetching Data

All data is fetched from the system by issuing queries to the query webservice. Since queries are by definition very flexible, there is only one service to retrieve data from, instead of the several services found in previous versions of SNAPP.

tsds/services/query.cgi?method=query

Query Structure

A query is composed of 6 principal parts:

1. The fields to fetch, including any manipulation functions on them

get <meta_field1>,
<meta_field2>,
<value_field1>,
<value_field2>...

2. A time clause

between("<start_date>", "<end_date>")


3. A grouping by clause

by <meta_field1>,
<meta_field2>...

4. Where to fetch the data from

from <measurement_type OR subquery>

5. A set of filters

where ( (<meta_field1> <operator> "foo" <and|or> <meta_field2> <operator> "bar") )

6. Any limit or order clauses

limit <number_to_limit_by> offset <number_to_offset_by>
ordered by <field_1> <asc|desc>, <field_2> <asc|desc>

A basic query would look like this:

get intf, 
node,
aggregate(values.input, 300, average),
aggregate(values.output, 300, average)
between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
by intf,
node
from interface
where ( intf = "ae0" and node = "rtr.foo.net")


The above query gets the interface name, node name, and the input and output bits per second for interface "ae0" on node "rtr.foo.net", arranged as 5-minute averages over a day. The "by" clause ensures that all of the results are arranged into a single series. In the most basic queries this may seem a little redundant, since the result set only matches a single thing, but its power becomes clear when the query is modified slightly.


get intf,
node,
aggregate(values.input, 300, average),
aggregate(values.output, 300, average)
between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
by intf,
node
from interface
where ( node = "rtr.foo.net")


By removing the "intf" match from the where clause, the query above now fetches every interface on the node "rtr.foo.net" and breaks them out into their corresponding series of unique node+interface pairings. Modifying the query slightly again changes that behavior further.


get intf,
node,
aggregate(values.input, 300, average),
aggregate(values.output, 300, average)
between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
by node
from interface
where ( node = "rtr.foo.net")


By removing the "intf" field from the by clause, the fetched results are merged into a single series based on the node name instead of individual series based on node+interface. As a result, the above query averages the 5-minute buckets across ALL interfaces on the node into a single result set.


Queries may also contain subqueries, where data is selected from an inner query instead of directly from a measurement type. This nesting can be arbitrarily deep. Outer queries do not specify a between time or a grouping clause, but are otherwise identical to any other query. Using one of the earlier queries as an example, the added outer query here calculates the 95th percentile of input and output after the data has been aggregated into 5-minute average buckets.


get intf,
    node,
    percentile(avg_input, 95),
    percentile(avg_output, 95)
from (
    get intf,
        node,
        aggregate(values.input, 300, average) as avg_input,
        aggregate(values.output, 300, average) as avg_output
    between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
    by node
    from interface
    where ( node = "rtr.foo.net")
)
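
However deeply nested, the entire query is submitted as a single string to the query webservice shown earlier. The sketch below is illustrative: the host name, and the assumption that the query text is passed in a "query" parameter, should be verified against your deployment.

import requests

# Assumed base URL and parameter name; verify against your deployment.
BASE = "https://tsds.example.net/tsds/services"

query = '''
get intf,
    node,
    aggregate(values.input, 300, average),
    aggregate(values.output, 300, average)
between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
by intf, node
from interface
where ( intf = "ae0" and node = "rtr.foo.net")
'''

# The requests library URL-encodes the query text automatically.
response = requests.get(BASE + "/query.cgi",
                        params={"method": "query", "query": query})
print(response.json())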

Data Manipulation Functions

Below is a listing of the manipulation functions available for value data; an example combining several of them follows the list:

  • average(<value_field>) - returns the average of the data points for the specified value
  • aggregate(<value_field>, <seconds_to_group_by>, <aggregator_function>) - applies the "aggregator_function" (most of the other functions noted here) to each series of data points, grouped into buckets of "seconds_to_group_by" seconds
  • percentile(<value_field>|<aggregate>, <nth_percentile>) - calculates the nth percentile of the value field or aggregate clause passed in
  • count(<value_field>|<aggregate>) - returns the number of data points in the series for a value field or an aggregate
  • min(<value_field>|<aggregate>) - returns the minimum value in the value field's or aggregator function's series of data points
  • max(<value_field>|<aggregate>) - returns the maximum value in the value field's or aggregator function's series of data points
  • sum(<value_field>|<aggregate>) - returns the sum of the value field's or aggregator function's series of data points
  • histogram(<value_field>, <bin_size>) - returns an object representing a histogram of the data fit to the specified bin size
  • extrapolate(<field>|<aggregate>, <date>|<number>) - estimates what the value's or aggregate clause's value will be at the provided future date, or when the value will reach the provided number
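
As a hypothetical illustration combining several of these functions, the query below computes the minimum, maximum, and 95th percentile of a node's input rate over a day, reusing the field and measurement names from the earlier examples:

get node,
    min(values.input),
    max(values.input),
    percentile(values.input, 95)
between("04/26/2015 13:00:00 UTC", "04/27/2015 13:00:00 UTC")
by node
from interface
where ( node = "rtr.foo.net")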

Time Specification


Time for a between clause may be specified in any of the following formats. When an abbreviated format is used, the TIMEZONE defaults to the server's local timezone and the time of day defaults to 00:00:00; an example follows the list.

  • MM/DD/YYYY HH:MM:SS TIMEZONE
  • MM/DD/YYYY HH:MM:SS
  • MM/DD/YYYY TIMEZONE
  • MM/DD/YYYY
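
For example, assuming a server whose local timezone is UTC, the following two clauses are equivalent, with the second relying on the 00:00:00 and local-timezone defaults:

between("04/26/2015 00:00:00 UTC", "04/27/2015 00:00:00 UTC")
between("04/26/2015", "04/27/2015")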

Filter Operators


The possible operators for filters in the where clause are listed below; a combined example follows the list.

  • = - returns measurements whose meta field or value equals the specified value exactly
  • != - returns measurements whose meta field or value does not equal the specified value exactly
  • > - returns measurements whose value is greater than the specified value (can be a date)
  • >= - returns measurements whose value is greater than or equal to the specified value (can be a date)
  • < - returns measurements whose value is less than the specified value (can be a date)
  • <= - returns measurements whose value is less than or equal to the specified value (can be a date)
  • in - returns measurements whose value or meta field is contained in a parenthesized, comma-separated list of double-quoted values, e.g. in ("foo", "bar", "biz")
  • between - returns measurements whose values fall within the numeric range given as a parenthesized, comma-separated pair of values, e.g. between (1, 10)
  • not like - returns measurements whose meta fields do not match the double-quoted word or regular expression provided
  • like - returns measurements whose meta fields match the double-quoted word or regular expression provided
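
Putting several of these together, a hypothetical where clause for the interface examples above might look like the following (the max_bandwidth threshold is purely illustrative):

where ( node in ("rtr1.foo.net", "rtr2.foo.net")
        and intf like "ae"
        and max_bandwidth >= 10000 )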

Pushing Data

Data can be added to the system either by sending directly to the RabbitMQ interface (high performance / throughput) or via the webservice interface (lower performance / throughput). In almost all cases the webservice interface is easier to use, offers more than sufficient performance, and is the recommended way to get started.

tsds/services/push.cgi?method=add_data

Data messages look like the following and must be properly JSON encoded when sent:

[
    {
        "interval": 20,
        "meta": {
            "intf": "xe-11/0/4.71",
            "node": "rtr.ipiu.ilight.net"
        },
        "time": 1409670371,
        "type": "interface",
        "values": {
            "inUcast": 46.50,
            "inerror": null,
            "input": 72419.20,
            "outUcast": 58.65,
            "outerror": null,
            "output": 62798.40,
            "status": 1
        }
    }
]

The message is always an array of objects, even if there is only a single element. This means that multiple updates can be sent in a single call, even for unrelated measurements or unrelated measurement types. When grouping updates together, it is generally best to cap a batch at around 50 objects, since larger requests might get dropped by the underlying RabbitMQ system.
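
A minimal sketch of pushing updates over the webservice interface in batches of 50 follows. The base URL and the assumption that the JSON payload is sent as a "data" form parameter are not confirmed by this document and should be verified against your deployment.

import json
import requests

# Assumed base URL and parameter name; verify against your deployment.
BASE = "https://tsds.example.net/tsds/services"

# One object per measurement update, mirroring the example message above.
updates = [
    {
        "interval": 20,
        "meta": {"intf": "xe-11/0/4.71", "node": "rtr.ipiu.ilight.net"},
        "time": 1409670371,
        "type": "interface",
        "values": {"input": 72419.20, "output": 62798.40},
    },
    # ... more update objects, possibly for other measurement types ...
]

# Send in batches of at most 50 objects, per the guidance above.
for i in range(0, len(updates), 50):
    batch = updates[i:i + 50]
    requests.post(BASE + "/push.cgi",
                  data={"method": "add_data", "data": json.dumps(batch)})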

Messages for unknown measurement types, or messages missing required fields, will be dropped and an error logged on the system.

Sending as many updates as desired is perfectly fine. Updates for time periods that have already had an update will simply overwrite the previous value.

The fields that are required in the message (case sensitive):

interval - the interval at which data is regularly expected for this measurement. This number can change in the future; it helps the system create the right amount of storage.

meta - an object whose keys are all of the required meta fields for the measurement type and whose values will uniquely identify this measurement.

time - the timestamp for this update. The system will align this timestamp onto the interval.

type - what measurement type this is. This type must have been already created and seeded with the required meta fields.

values - an object whose keys are the names of the values being measured and whose values are the measured numbers. No interpolation or calculation is done on these numbers - it is the responsibility of any data pusher to calculate its values correctly.
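
Because no calculation is done server-side, counter-style metrics (such as interface octet counters) must be converted to rates by the pusher before sending. A minimal sketch of that conversion, using hypothetical sample numbers chosen to match the example message above:

# Hypothetical helper: convert two consecutive octet counter samples to bits/s.
def octets_to_bps(prev_octets, prev_time, cur_octets, cur_time):
    delta_octets = cur_octets - prev_octets  # counter wrap handling omitted for brevity
    delta_seconds = cur_time - prev_time
    return (delta_octets * 8) / delta_seconds

# Two samples taken 20 seconds apart yield the "input" value from the example:
rate = octets_to_bps(1000000, 1409670351, 1181048, 1409670371)  # -> 72419.2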