Architecture

How is VHS implemented internally?

Introduction

vhs is a network traffic utility that works by chaining modules loaded from plugins. Its concurrent architecture routes data between modules and executes them, so users can configure and extend vhs for purposes such as traffic recording, replay, and live metrics collection.

Concepts

The architecture of vhs is built around the concept of a data flow, a directed graph that represents the routing of a stream of data through software components that act on that data stream. In a data flow graph, nodes represent software components that originate, terminate, or modify the data stream passing through them, and edges represent the data stream passing between components.

In vhs, the data flow graph looks something like this:

graph LR
  src[Source]
  in_mod[[Input Modifier]]
  in_fmt[[Input Format]]
  out_fmt[[Output Format]]
  out_mod[[Output Modifier]]
  sink[Sink]

  src --> in_mod
  in_mod --> in_fmt
  in_fmt --> out_fmt
  out_fmt --> out_mod
  out_mod --> sink

In this graph, each node represents a particular type of software component, and each edge represents a connection between two components. The following two subsections describe these components and connections in more detail.

Nodes: VHS Components

Each node in the graph represents a concurrently-executed software component. In vhs, these components fall into four categories, as listed below:

source: source components originate data streams. A source brings data into vhs from somewhere else. This could mean capturing data from a network interface, reading from cloud storage, reading from a local file, etc. More information on the internal architecture of source components can be found here: source architecture. Information about the sources currently available in vhs can be found here: sources.
modifier: modifier components modify the data passing through them in its raw format (stream of bytes). Modifiers may exist in either the input chain or the output chain of the vhs data flow, and input and output modifiers are implemented separately. More information about the architecture of modifiers can be found here: modifier architecture. Information about the input and output modifiers currently available in vhs can be found here: input modifiers and output modifiers.
format: format components modify or interpret the data passing through them by imposing a format on it, usually in terms of native Go datatypes. Like modifiers, formats may exist in either the input chain or output chain, and input and output formats are implemented separately. More information about the architecture of formats can be found here: format architecture. Information about the input and output formats currently available in vhs can be found here: input formats and output formats.
sink: sink components terminate data streams. A sink provides a way for data to leave vhs. This data could be written to a file, stored on cloud storage, transmitted to a network location, etc. More information on the internal architecture of sink components can be found here: sink architecture. Information about the sinks currently available in vhs can be found here: sinks.

Additionally, vhs provides an optional facility for middleware. The middleware facility allows users to place their own external modifier code into the vhs data flow. If used, the middleware is placed into the data flow between the input format and the output format, as shown in the diagram below. The external middleware receives formatted data as stringified JSON on stdin and must write the modified data to stdout.

graph LR
  src[Source]
  in_mod[[Input Modifier]]
  in_fmt[[Input Format]]
  out_fmt[[Output Format]]
  mware[[Middleware]]
  out_mod[[Output Modifier]]
  sink[Sink]

  src --> in_mod
  in_mod --> in_fmt
  in_fmt -.-> mware 
  mware -.-> out_fmt
  in_fmt --> out_fmt 
  out_fmt --> out_mod
  out_mod --> sink

  style mware fill:#D55E00

Edges: Connections Between Components

The edges of the data flow graph represent the data streams that pass between the components described in the previous section. These connections are implemented using channels, a facility for communication between concurrent software components provided by the Go language. Communication between components follows two strategies. Where raw data streams are needed, vhs uses the io.ReadCloser interface from the io package of the Go standard library. Where structured data must be passed between components, vhs uses the empty interface type interface{} for maximum flexibility.
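
The two strategies can be sketched in a few lines of Go. This is an illustration only, not the vhs implementation: the `source`, `format`, and `collect` functions are hypothetical stand-ins for components, connected by one channel of `io.ReadCloser` values and one channel of `interface{}` values.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// source emits one raw stream on the returned channel and then closes it.
// The name and signature are hypothetical, not the vhs API.
func source(payload string) <-chan io.ReadCloser {
	out := make(chan io.ReadCloser)
	go func() {
		defer close(out)
		out <- io.NopCloser(strings.NewReader(payload))
	}()
	return out
}

// format reads each raw stream to the end and emits its content as a
// structured value (here a []string of fields) on an interface{} channel.
func format(in <-chan io.ReadCloser) <-chan interface{} {
	out := make(chan interface{})
	go func() {
		defer close(out)
		for rc := range in {
			b, err := io.ReadAll(rc)
			rc.Close()
			if err != nil {
				out <- err
				continue
			}
			out <- strings.Fields(string(b))
		}
	}()
	return out
}

// collect drains the structured channel, standing in for a downstream stage.
func collect(in <-chan interface{}) []interface{} {
	var got []interface{}
	for v := range in {
		got = append(got, v)
	}
	return got
}

func main() {
	for _, v := range collect(format(source("GET /index.html HTTP/1.1"))) {
		fmt.Printf("%T %v\n", v, v)
	}
}
```

Because the structured channel carries `interface{}`, the receiving side has to use a type assertion or type switch to recover the concrete value.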

Metadata

Sometimes it is useful to pass descriptive information about a data stream between two connected components. For example, the tcp source tracks information about source and destination IP addresses and ports for the tcp streams it captures. This information may be useful to components downstream in the vhs data flow, so vhs provides a key-value metadata facility for recording this type of information and passing it between components. This metadata facility takes the form of a construct called Meta that is implemented in core/meta.go.

To pass Meta between components, vhs bundles it with an io.ReadCloser in a single type. For example, the InputReader interface connects a source to an input modifier. It is defined as follows:

// InputReader is an input reader.
type InputReader interface {
  io.ReadCloser
  Meta() *Meta
}

Metadata is not currently supported on the output chain of the vhs data flow, so the corresponding output interface OutputWriter is much simpler:

// OutputWriter is an output writer.
type OutputWriter interface {
  io.WriteCloser
}
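
On the output side, a component can wrap the next OutputWriter in the chain in the same way. The sketch below is an assumption-laden example rather than code from vhs: `gzipModifier`, `bufferSink`, and `roundTrip` are hypothetical names, and the compression uses `compress/gzip` from the Go standard library.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// OutputWriter mirrors the interface shown above.
type OutputWriter interface {
	io.WriteCloser
}

// bufferSink is a toy sink that collects everything written to it in memory.
type bufferSink struct {
	bytes.Buffer
}

func (s *bufferSink) Close() error { return nil }

// gzipModifier is a hypothetical output modifier that compresses data
// before handing it to the next writer in the chain.
type gzipModifier struct {
	zw   *gzip.Writer
	next OutputWriter
}

func newGzipModifier(next OutputWriter) OutputWriter {
	return &gzipModifier{zw: gzip.NewWriter(next), next: next}
}

func (g *gzipModifier) Write(p []byte) (int, error) { return g.zw.Write(p) }

// Close flushes the gzip trailer and then closes the downstream writer.
func (g *gzipModifier) Close() error {
	if err := g.zw.Close(); err != nil {
		g.next.Close()
		return err
	}
	return g.next.Close()
}

// roundTrip writes s through the modifier into a sink, then decompresses
// the sink's contents to check that nothing was lost.
func roundTrip(s string) (string, error) {
	sink := &bufferSink{}
	w := newGzipModifier(sink)
	if _, err := io.WriteString(w, s); err != nil {
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}
	zr, err := gzip.NewReader(&sink.Buffer)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

func main() {
	got, err := roundTrip("hello from the output chain")
	fmt.Println(got, err)
}
```

Closing in order matters here: the modifier must flush its own buffered data before the downstream writer is closed.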

Putting this all together, we can see the components and their connections on the following data flow graph:

graph LR
  src[Source]
  in_mod[[Input Modifier]]
  in_fmt[[Input Format]]
  out_fmt[[Output Format]]
  mware[[Middleware]]
  out_mod[[Output Modifier]]
  sink[Sink]

  src -- InputReader --> in_mod
  in_mod -- InputReader --> in_fmt
  in_fmt -. JSON string .-> mware 
  mware -. JSON string .-> out_fmt
  in_fmt -- "interface{}" --> out_fmt 
  out_fmt -- OutputWriter --> out_mod
  out_mod -- OutputWriter --> sink

  style mware fill:#D55E00

More details on the implementation of each component and the connections between them can be found on their pages in this section.

Supporting Infrastructure

vhs provides several software constructs to implement and manage its data flow and modules.

flow: Flow is the highest level construct. It contains the input chain and output chains and manages the execution of all the modules for a given vhs session. Defined in flow/flow.go.
input: The input construct contains and manages the input chain of a vhs data flow. vhs supports a single input chain per session. Defined in flow/input.go.
output: The output construct contains and manages the output chain(s) of a vhs data flow. vhs supports multiple output chains per session. Defined in flow/output.go.

The conceptual arrangement of these constructs is shown in the figure below. In most cases, it should not be necessary to modify these portions of the codebase when adding new data flow components, but a general understanding of these constructs should be helpful for both vhs developers and end users.

graph LR
  src[Source]
  in_mod[[Input Modifier]]
  in_fmt[[Input Format]]
  out_fmt[[Output Format]]
  out_mod[[Output Modifier]]
  sink[Sink]

  src --> in_mod
  in_mod --> in_fmt
  in_fmt --> out_fmt
  out_fmt --> out_mod
  out_mod --> sink

  subgraph in [Input]
    src
    in_mod
    in_fmt
  end

  subgraph out [Output]
    out_fmt
    out_mod
    sink
  end

  subgraph flow [Flow]
    in
    out
  end

  style in fill:#F0E442
  style flow fill:#009E73
  style out fill:#56B4E9
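
One way to picture a single input chain feeding several output chains is a fan-out over Go channels. The sketch below is only a conceptual model under that assumption; it does not reproduce the code in flow/flow.go, and the `fanOut` and `run` functions are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// fanOut copies every value received on in to each of n output channels
// and closes all of them once in is closed.
func fanOut(in <-chan interface{}, n int) []<-chan interface{} {
	outs := make([]chan interface{}, n)
	readOnly := make([]<-chan interface{}, n)
	for i := range outs {
		outs[i] = make(chan interface{})
		readOnly[i] = outs[i]
	}
	go func() {
		for v := range in {
			for _, o := range outs {
				o <- v
			}
		}
		for _, o := range outs {
			close(o)
		}
	}()
	return readOnly
}

// run feeds values through one input and collects what each output saw.
func run(values []interface{}, outputs int) [][]interface{} {
	in := make(chan interface{})
	go func() {
		defer close(in)
		for _, v := range values {
			in <- v
		}
	}()

	results := make([][]interface{}, outputs)
	var wg sync.WaitGroup
	for i, ch := range fanOut(in, outputs) {
		wg.Add(1)
		go func(i int, ch <-chan interface{}) {
			defer wg.Done()
			for v := range ch {
				results[i] = append(results[i], v) // each goroutine owns results[i]
			}
		}(i, ch)
	}
	wg.Wait()
	return results
}

func main() {
	for i, r := range run([]interface{}{"request", "response"}, 2) {
		fmt.Printf("output %d received %v\n", i, r)
	}
}
```

With unbuffered channels, as used here, the slowest output sets the pace for all the others; a real implementation might buffer or otherwise decouple the outputs.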

Parser: Specifying VHS data flows

vhs data flows are defined at runtime with command line flags as described on the reference page (Inputs and Outputs). Internally, these flags are processed by a parser. This parser reads the specified input and output chain descriptions and instantiates a Flow that contains the specified components. All components must register with the parser by calling the appropriate Load... function with the identifying token and the constructor for the component. The parser is implemented in flow/parser.go and component registration with the default Parser is done in cmd/vhs/main.go.
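
The registration pattern described above, an identifying token paired with a constructor, can be sketched as a small registry. Everything in this example is hypothetical: the `Registry`, `Register`, and `Build` names are stand-ins for the parser and its Load... functions, whose real signatures are not shown on this page, and the `Component` interface exists only to make the sketch self-contained.

```go
package main

import "fmt"

// Component is a hypothetical placeholder for any data flow component.
type Component interface {
	Name() string
}

// Ctor is a hypothetical constructor signature.
type Ctor func() (Component, error)

// Registry maps identifying tokens to component constructors.
type Registry struct {
	ctors map[string]Ctor
}

func NewRegistry() *Registry {
	return &Registry{ctors: make(map[string]Ctor)}
}

// Register associates token with ctor. A later registration for the same
// token replaces the earlier one.
func (r *Registry) Register(token string, ctor Ctor) {
	r.ctors[token] = ctor
}

// Build instantiates one component per token, in order, and fails on the
// first token that was never registered.
func (r *Registry) Build(tokens ...string) ([]Component, error) {
	chain := make([]Component, 0, len(tokens))
	for _, t := range tokens {
		ctor, ok := r.ctors[t]
		if !ok {
			return nil, fmt.Errorf("unknown component %q", t)
		}
		c, err := ctor()
		if err != nil {
			return nil, fmt.Errorf("constructing %q: %w", t, err)
		}
		chain = append(chain, c)
	}
	return chain, nil
}

// named is a trivial Component used for the demonstration.
type named string

func (n named) Name() string { return string(n) }

func main() {
	reg := NewRegistry()
	reg.Register("tcp", func() (Component, error) { return named("tcp source"), nil })

	chain, err := reg.Build("tcp")
	fmt.Println(len(chain), err)

	_, err = reg.Build("missing")
	fmt.Println(err)
}
```

Registering constructors rather than instances lets the parser create a fresh component for every chain that names the token.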