Surround

Welcome to Surround

Welcome to Surround’s documentation! Surround is a framework for building machine learning pipelines in Python.

For a quick rundown on getting started with Surround see Getting Started. For more information on the aim and philosophy of Surround see About. Just need to learn more about a particular method or class? See API Reference.

About

What is Surround?

Surround is an open-source framework developed by the Applied Artificial Intelligence Institute (A2I2) to take machine learning solutions from exploration all the way to production. For this reason, it is developed with both research engineers and software developers in mind. Because it is designed to play nicely with existing machine learning frameworks (Tensorflow, MXNet, PyTorch, etc.) and cloud services (Google Cloud AI, SageMaker, Rekognition, etc.), engineers have the freedom to use whatever is necessary to solve their problem.

A Philosophy

Surround isn’t just a framework, it’s also a philosophy. From the moment data lands on our desk, we need to be thinking about the final use case for the solutions we are developing. Surround was built to reduce the time between data exploration and a containerised proof-of-concept web application ready to be deployed, and to resolve some competing requirements of researchers and engineers: in general, researchers want to dive into the data and leave code quality until later, while engineers prefer well-structured code from the start. Surround attempts to solve this problem by introducing a “production first” mindset while still providing conventions for researchers (such as a separate folder for data exploration scripts).

Long ago, web frameworks realised that there is a set of concerns almost all web applications must deal with, such as connecting to databases, managing configuration, rendering static and dynamic content, and handling security. Machine learning projects have similar concerns, but also their own set of special concerns such as:

  • Experimentation is a first class citizen
  • Data and models need to be versioned and managed
  • Model performance needs to be visualized
  • Training infrastructure is required
  • Etc..

Surround strives to provide a single place for every concern that arises when building an ML project. Ideally there will be a single solution to any concern that occurs to either the research engineer or the software developer. But to be the single place for ML projects, we have to support as many existing frameworks, libraries and APIs as we can. This is reflected in the design of Surround, where the Core framework could be used to build:

  • A solution based on cloud APIs
  • A custom Docker image for SageMaker
  • Part of a batch process running on an internal Kubernetes cluster

By playing nice with others we hope the core Surround framework can continue to be used as the ML ecosystem evolves.

A set of conventions

Surround attempts to enforce a set of conventions to help researchers keep their solutions structured for software developers, and it implements solutions for common ML project concerns, such as managing configuration, so that researchers don’t have to.

These conventions are adhered to through the use of a project generator and a project linter that checks for the core conventions. For example, during project generation the following structure is used:

package name
├── Dockerfile
├── README.md
├── data
├── package name
│   ├── stages
│   │   ├── __init__.py
│   │   ├── input_validator.py
│   │   ├── baseline.py
│   │   └── assembler_state.py
│   ├── __init__.py
│   ├── __main__.py
│   ├── web_runner.py
│   ├── file_system_runner.py
│   └── config.yaml
├── docs
├── dodo.py
├── models
├── notebooks
├── output
├── requirements.txt
├── scripts
├── spikes
└── tests

Every Surround project has the following characteristics:

  • Dockerfile for bundling up the project as a Docker container.
  • dodo.py file containing useful tasks such as train, batch predict and test for a project.
  • Tests for catching training-serving skew.
  • A single entry point for running the application, __main__.py.
  • A place for data exploration with Jupyter notebooks and miscellaneous scripts.
  • A single place for output files, data, and model storage.

A command line tool

Surround also comes with a command line tool (CLI) which can perform a variety of tasks such as project generation and running the project in Docker. The tools included are shown below:

  • init - Used to generate a new Surround project.
  • lint - Used to run the Surround Linter which checks if Surround conventions are being used correctly.
  • run - Used to run a task defined in dodo.py.

The run command is essentially a wrapper around the doit library, and the Surround Linter performs multiple checks on the current project to see if it is following the standard conventions. The intention is for the Surround Linter to become more of an assistant when building ML projects. These tools are automatically added to your environment path so they can be used anywhere in your preferred terminal application.
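
Because the tasks are plain doit tasks, you can also add your own to dodo.py and invoke them through surround run. A minimal sketch (the task name and the action it runs are hypothetical, not part of the generated project):

def task_lint_data():
    """Hypothetical task: lint a data container sitting in the input folder."""
    return {
        "actions": ["surround data lint input/container.data.zip"],
        "verbosity": 2,
    }

With the above added to dodo.py, the task could then be run with $ surround run lint_data.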

A Python library

The last component of Surround is the Python library. We developed the Python library to provide a flexible way of running an ML pipeline in a variety of situations, whether that be from a queue, an HTTP endpoint, or the file system. We found that during development the research engineer often needed to run the pipeline over data from a file, something that is not always needed in a production environment. Surround’s Python library was designed to leverage the conventions outlined above to provide the maximum productivity boost to research engineers, provided the conventions are followed. Surround also provides wrappers around libraries such as the Tornado web server to provide advanced functionality. These third-party dependencies are not installed by default and need to be added to the project before Surround will make the wrappers available.

How does Surround work at its core?

At its core, there are four main concepts that you need to understand while using Surround:

  • Assembler
  • Stages
  • Configuration
  • State

The most important are the first two, since together they make up the actual pipeline that is responsible for taking in data and producing a prediction from that input.

Assembler

Assembler flow diagram

The Assembler is responsible for constructing and executing a pipeline on data. How the pipeline is constructed (and where/how data is loaded) depends on which execution mode is being used. The above diagram describes a simple Surround pipeline showing three different modes of execution. These modes are described below.
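
As a rough sketch, assembling and running a pipeline by hand looks something like this (using the InputValidator and Baseline stages that surround init generates, and assuming Assembler and Config are importable from the surround package):

from surround import Assembler, Config
from stages import AssemblerState, InputValidator, Baseline

# Build the pipeline, give it the project configuration, and initialise every stage
assembler = Assembler("baseline")
assembler.set_config(Config(auto_load=True))
assembler.set_stages([InputValidator(), Baseline()])
assembler.init_assembler()

# Put some input on the state object and run the pipeline (predict mode by default)
state = AssemblerState()
state.input_data = "raw data to process"
assembler.run(state)

print(state.output_data)

In practice the generated __main__.py and the project’s runners take care of this wiring for you.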

Training
Training flow diagram

Primarily built for training: training data is loaded from disk (usually in bulk) and then fed through the pipeline with the estimator set to fit mode. Once training of the pipeline is complete, the data is fed to a visualiser which helps display useful information about the training operation.

Batch-predict
Batch-predict flow diagram

Primarily built for evaluation, data is loaded from disk (also usually in bulk) then fed through the pipeline with the estimator set to estimate mode. Once processing is complete the data is then fed to a visualiser which will help summarise and visualise the overall results / performance.

Web / Predict
Web / Predict flow diagram

This mode is built for production. Once your pipeline is set up, training has been completed, and evaluation shows the model performs well enough for use, this mode is used to serve your pipeline. Depending on the type of project you generated initially, the input data may come from your local disk or from the body of an HTTP POST request, and the result may be saved locally or returned to the client who sent the request.

Stages

A stage, at its base, can do three things:

  • Initialise anything needed to complete its function. This may include loading a Tensorflow graph or loading configuration data.
  • Perform its intended operation, whether that be feeding data through a model or checking that the data is correct.
  • Dump output from the operation to the console (if requested; used for debugging).

During processing, two objects are passed between stages (see the sketch below):

  • State object which contains the input data, has a field for errors (which stops the execution when added to) and holds the output of each stage (if any).
  • Configuration object which contains all the settings loaded in from YAML files plus paths to folders in the project such as input/ and output/.
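
A minimal sketch of a stage using both objects (this assumes Stage can be imported from the surround package, like Validator and Estimator, and that min_length is a key you have added to config.yaml):

from surround import Stage

class CheckLength(Stage):
    def operate(self, state, config):
        # Appending to state.errors stops the pipeline
        if state.input_data is None:
            state.errors.append("No input data provided")
            return

        # Settings from config.yaml are read through the config object
        if len(state.input_data) < config["min_length"]:
            state.warnings.append("Input is shorter than expected")
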
Validators

Validators are stages responsible for checking that the input data about to be fed through the pipeline is valid: that it is in the correct format, and that there is no detectable reason the data would cause issues while being processed. This stage is positioned first in the execution of the pipeline. Validators are not intended to create any output, only errors or warnings.
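
For example, a validator that only records errors or warnings might look like this (a sketch; the specific checks are arbitrary):

from surround import Validator

class BytesValidator(Validator):
    def validate(self, state, config):
        # Errors stop the pipeline before any later stage runs
        if not isinstance(state.input_data, bytes):
            state.errors.append("Input must be raw bytes")
        elif len(state.input_data) == 0:
            state.warnings.append("Input is empty")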

Filters

Filters are stages that are responsible for getting data ready for the next stage of execution. These are typically placed before or after Estimators. There are generally two types of filters: Wranglers (Pre-filters) and Deciders (Post-filters).

Wranglers (Pre-filters)

Wranglers perform data wrangling operations on the data, converting it from one format into another that is useful for the next stage (typically an Estimator). For example, the input data might be a str formatted as JSON, but the estimator next in the pipeline might only accept a Python dict, so a Wrangler would be used to parse the str into a dict.
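
A sketch of that exact example, written here as a plain stage (the base class Surround provides for filters may differ between versions):

import json
from surround import Stage

class ParseJSON(Stage):
    def operate(self, state, config):
        try:
            # Replace the JSON string with a dict for the estimator that follows
            state.input_data = json.loads(state.input_data)
        except json.JSONDecodeError:
            state.errors.append("Input is not valid JSON")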

Deciders (Post-filters)

Deciders, placed after Estimators, are stages which make decisions based on the Estimator’s output. For example, in a Voice Activity Detection pipeline we may have an estimator that outputs confidence values indicating whether the input audio was speech or not; a Decider placed after it might then apply a threshold to those confidence values.
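
A sketch of such a Decider, again written as a plain stage (vad_threshold is a hypothetical config.yaml key, and the estimator is assumed to have written a list of confidence values to output_data):

from surround import Stage

class SpeechDecider(Stage):
    def operate(self, state, config):
        threshold = config["vad_threshold"]
        # Turn raw confidences into boolean speech / no-speech decisions
        state.output_data = [score >= threshold for score in state.output_data]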

Estimators

Estimators are stages where the actual prediction or training of an ML model takes place. Depending on the pipeline configuration, the estimator will either use the input data to make a prediction or use it as training data. This stage should have some form of output and is typically placed between two Filters during execution. For example, you may be using Tensorflow to run your model: the estimator would load the model and create a Tensorflow session during initialization, and the session would be run with the input data during execution of the stage.

In more complex pipelines, these stages may be composed of an entirely separate Surround pipeline (another Assembler instance). Surround is designed this way to allow pipelines as complex as required.
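
A sketch of that composition: an estimator that wraps a nested Assembler (the inner stages here are hypothetical placeholders):

from surround import Assembler, Estimator, Stage

# Hypothetical inner stages, defined only to make the example self-contained
class CropFaces(Stage):
    def operate(self, state, config):
        pass

class RecogniseFaces(Stage):
    def operate(self, state, config):
        pass

class FaceRecognition(Estimator):
    def initialise(self, config):
        # Build and initialise the nested pipeline once, up front
        self.inner = Assembler("face-recognition", config)
        self.inner.set_stages([CropFaces(), RecogniseFaces()])
        self.inner.init_assembler()

    def estimate(self, state, config):
        # Delegate the work to the nested pipeline, sharing the same state object
        self.inner.run(state)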

Visualisers

Visualisers are stages that do what their name entails: visualise the data. Typically used during training and evaluation of the model, these stages generate reports on how the model is performing. For example, in a Facial Detection pipeline, during evaluation the visualiser may display an example image it processed and render boxes around the faces it detected.
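
As a sketch, a visualiser can be as simple as a stage that logs summary statistics (this assumes the estimator left a list of per-image bounding boxes on output_data):

import logging
from surround import Stage

LOGGER = logging.getLogger(__name__)

class DetectionSummary(Stage):
    def operate(self, state, config):
        total_faces = sum(len(boxes) for boxes in state.output_data)
        LOGGER.info("Processed %d images, detected %d faces",
                    len(state.output_data), total_faces)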

Configuration

Every instance of Assembler has a configuration object constructed from the project’s configuration file. This configuration object is passed to each stage of the pipeline during initialisation and execution. The configuration file uses the YAML data-serialization language.

Example configuration file:

pathToModels: ../models
model: hog                                                       # 'hog' or 'cnn'
minFaceWidth: 100                                                # Threshold for the width of a face bounding box in pixels
minFaceHeight: 125                                               # Threshold for the height of a face bounding box in pixels
useAllFaces: true                                                # If false, only extract encodings for the largest face
imageTooDark: 23                                                 # Threshold for determining if an image is too dark, lower values = darker image
blurryThreshold: 4                                               # Smaller values indicate a "more" blurry image
gpuDynamicMemoryAllocation: true                                 # If true, Tensorflow will allocate GPU memory on an as-needs basis. perProcessGpuMemoryFraction will have no effect.
perProcessGpuMemoryFraction: 0.5                                 # Fraction of GPU memory Tensorflow should acquire. Has no effect if gpuDynamicMemoryAllocation is true.
rotateImageModelFile: image-rotator/image-rotator-2018-04-05.pb  # Model used to detect the orientation of the image
rotateImageModelLabels: image-rotator/labels.txt                 # Model used to detect the orientation of the image
rotateImageInputLayer: conv2d_1_input                            # Tensorflow input layer
rotateImageOutputLayer: activation_5/Softmax                     # Tensorflow output layer
rotateImageInputHeight: 100                                      # Input image height to the image stage neural network
rotateImageInputWidth: 100                                       # Input image width to the image stage neural network
rotateImageThreshold: 0.5                                        # Rotate image if the orientation is above this threshold
rotateImageSkip: false                                           # Option to skip image rotation step
imageSizeMax: 700                                                # Maximum allowable image size (width or height). Images larger than this will be downsized.
postgres:                                                        # Postgres database options
    user: postgres                                               #   Postgres username
    password: postgres                                           #   Postgres password
    host: localhost                                              #   Postgres server host
    port: 5432                                                   #   Postgres server port
    db: face_recognition                                         #   Which database to connect to
webcamStream:                                                    # Webcam stream options
    drawBox: true                                                #   Whether to draw a box around detected faces
    minConfidence: 0.5                                           #   Discard detections below this confidence level
    highConfidence: 0.9                                          #   Confidence values at or above this level are deemed to be 'highly confident'
celery:
    broker: pyamqp://guest@localhost
    backend: redis://localhost
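
Inside a stage, these values can then be read through the configuration object, either with the [] operator or with get_path for nested keys. A sketch using some of the keys above:

from surround import Stage

class FaceDetector(Stage):
    def initialise(self, config):
        self.model_type = config["model"]                # 'hog' or 'cnn'
        self.min_width = config["minFaceWidth"]
        self.db_host = config.get_path("postgres.host")  # nested value via dot notation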

State

Every time an Assembler is run, it requires an object that will be used to store the input data and, eventually, the output. This object is passed between stages during execution and can also be used to store any intermediate data between stages.
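
A sketch of a typical state object: any attributes the stages will use should be declared up front, because the state object is frozen while the pipeline is running:

from surround import State

class AssemblerState(State):
    input_data = None
    output_data = None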

Getting Started

Installation

Prerequisites

  • Python 3+ (Tested on 3.6.5)
  • Docker
  • Supports MacOS, Linux, and Windows

Install via Pip

Run the following command to install the latest version of Surround:

$ pip3 install surround

Note

If this doesn’t work make sure you have pip installed. See here on how to install it.

Now the Surround library and command-line tool should be installed! To make sure, run the following command:

$ surround

If it works then you are ready for the Project Setup stage.

Project Setup

Before we can create our first pipeline, we need to generate an empty Surround project. Use the following command to generate a new project:

$ surround init -p test_project -d "Our first pipeline"

When it asks the following, respond with n (we’ll cover this in later sections):

Does it require a web runner? (y/n) n

This will create a new folder called test_project with the following file structure:

test_project
├── test_project/
│   ├── stages
│   │   ├── __init__.py
│   │   ├── input_validator.py
│   │   ├── baseline.py
│   │   └── assembler_state.py
│   ├── __main__.py
│   ├── __init__.py
│   ├── config.yaml
│   └── file_system_runner.py
├── input/
├── docs/
├── models/
├── notebooks/
├── output/
├── scripts/
├── spikes/
├── tests/
├── __main__.py
├── __init__.py
├── dodo.py
├── Dockerfile
├── requirements.txt
└── README.md

The generated project comes with an example pipeline that can be run straight away using the following commands:

$ cd test_project
$ surround run batchLocal

Which should output the following:

INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Validator InputValidator took 0:00:00 secs
INFO:surround.assembler:Estimator Baseline took 0:00:00 secs

Now you are ready for Creating your first pipeline.

See also

Not sure what a pipeline is? Checkout our About section first!

Creating your first pipeline

For our first Surround pipeline, we are going to do some very basic data transformation and convert the input string from lower case to upper case. This pipeline is going to consist of two stages, InputValidator and MakeUpperCase.

Open the script stages/input_validator.py and you should see the following code already generated:

from surround import Validator

class InputValidator(Validator):
    def validate(self, state, config):
        if not state.input_data:
            raise ValueError("'input_data' is None")

As you can see, we are already given the InputValidator stage; we just need to edit the validate method to check that the input data is the correct data type (str):

def validate(self, state, config):
    if not isinstance(state.input_data, str):
        # Raise an exception, this will stop the pipeline
        raise ValueError('Input is not a string!')

Now we need to create our MakeUpperCase stage, so head to stages/baseline.py, you should see:

from surround import Estimator

class Baseline(Estimator):
    def estimate(self, state, config):
        state.output_data = state.input_data

    def fit(self, state, config):
        LOGGER.info("TODO: Train your model here")

Make the following changes:

class MakeUpperCase(Estimator):
    def estimate(self, state, config):
        # Convert the input into upper case
        state.output_data = state.input_data.upper()

        # Print the output to the terminal (to check it's working)
        LOGGER.info("Output: %s" % state.output_data)

    def fit(self, state, config):
        # Leave the fit method the same
        # We aren't doing any training in this guide
        LOGGER.info("TODO: Train your model here")

Since we renamed the estimator, we need to reflect that change when we create the Assembler.

First head to the stages/__init__.py file and rename Baseline to MakeUpperCase:

from .baseline import MakeUpperCase
from .input_validator import InputValidator
from .assembler_state import AssemblerState

Then in __main__.py where the estimator is imported make sure it looks like so:

from stages import MakeUpperCase, InputValidator

And where the assembler is created, make sure it looks like so:

assemblies = [
    Assembler("baseline")
        .set_stages([InputValidator(), MakeUpperCase()])
]

That’s it for the pipeline! To test the pipeline with default input ("TODO Load raw data here" string) just run the following command:

$ surround run batchLocal

The output should be the following:

INFO:surround.assembler:Starting 'baseline'
INFO:stages.baseline:Output: TODO: LOAD RAW DATA HERE
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs

To change what input is fed through the pipeline, modify file_system_runner.py and change what is assigned to state.input_data:

import logging
from surround import Runner
from stages import AssemblerState

logging.basicConfig(level=logging.INFO)

class FileSystemRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblerState()

        # Load data to be processed
        raw_data = "This daTa wiLL end UP captializED"

        # Setup input data
        state.input_data = raw_data

        return state

Note

To test training mode (the estimator's fit method will be called instead), run the following command: $ surround run trainLocal

Running your first pipeline in a container

First you must build an image for your container. To do this just run the following command:

$ surround run build

Then to run the container in dev mode just use the following command:

$ surround run dev

This will run the container, linking the folder test_project/test_project with the working directory in the container. So during development, when you make small changes there is no need to rebuild the image; just run this command again.

Then when you are ready for production you can use the following command:

$ surround run prod

Which will first build the image and then run the container without any linking to the host machine. The image created in the build can also then be committed to a Docker Hub repository and shared.

Note

Both dev and prod will use the default mode of the project, which in non-web projects is RunMode.BATCH_PREDICT, otherwise it’s RunMode.WEB.

The following commands will force which mode to use:

$ surround run batch
$ surround run train

Note

To see a list of available tasks, just run the command $ surround run

Serving your first pipeline via Web Endpoint

When generating a project, you get asked:

Does it require a web runner? (y/n)

If we say yes to this, Surround will still generate a generic file_system_runner.py, but it will also generate a new script called web_runner.py.

This script contains a new Runner which uses Tornado to host a web server, allowing your pipeline to be accessed via HTTP requests. By default the WebRunner will host two endpoints:

  • /info - access via GET request, will return {'version': '0.0.1'}

  • /estimate - access via POST request, body must have a JSON document containing input data:

    {
        "message": "this text will be processed"
    }
    

So let’s create a new pipeline that does the same data processing as the one in Creating your first pipeline, but this time we will send strings via the web endpoint and get the results in the response of the request.

First generate a new project, this time saying yes to the require web prompt, then make all the changes we did in Creating your first pipeline and check it still works locally.

Next we are going to build an image for our pipeline using the command:

$ surround run build

Then we are going to run our default server using the command:

$ surround run web

You should get output like so:

INFO:root:Server started at http://localhost:8080

Note

If you would like to run it on the host machine instead of in a container, you must install Tornado using this command: $ pip3 install tornado==6.0.2

Now, if you load http://localhost:8080/info in your preferred browser, you should see the following:

{"version": "0.0.1"}

Note

If you are running this on Windows and don’t see the above, try using http://192.168.99.100:8080/info instead.

Next we are going to test the /estimate endpoint by using the following command in another terminal:

On Linux/MacOS:

$ curl -d "{ \"message\": \"test phrase\" }" http://localhost:8080/estimate

On Windows (in Powershell):

$ Invoke-WebRequest http://192.168.99.100:8080/estimate -Method POST -Body "{ ""message"": ""test phrase"" }"

You should see the following output in the terminal running the pipeline:

INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
INFO:root:Message: TEST PHRASE
INFO:tornado.access:200 POST /estimate (::1) 1.95ms

So our data is successfully being processed! But what if we need the result?

Head to the script web_runner.py and append the following to the post method of EstimateHandler:

# Return the result of the processing
self.write({"output": self.data.output_data})

Restart the web server, run the same request command as before, and you should see the following output:

On Linux/MacOS:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100    53  100    25  100    28    806    903 --:--:-- --:--:-- --:--:--  1709
{"output": "TEST PHRASE"}

On Windows (in Powershell):

StatusCode        : 200
StatusDescription : OK
Content           : {"output": "TEST PHRASE"}
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 25
                    Content-Type: application/json; charset=UTF-8
                    Date: Mon, 17 Jun 2019 06:43:54 GMT
                    Server: TornadoServer/6.0.2

                    {"output": "TEST PHRASE"}
Forms             : {}
Headers           : {[Content-Length, 25], [Content-Type, application/json; charset=UTF-8], [Date, Mon, 17 Jun 2019 06:43:54 GMT], [Server, TornadoServer/6.0.2]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 25

That’s it, you are now serving a Surround pipeline! You could now use this pipeline in virtually any application.

Note

Since this project was generated with a web runner, the default mode is web. To run the pipeline using the FileSystemRunner instead, use the command $ surround run batch or $ surround run train.

Command-line Interface

The following is a list of the sub-commands contained in Surround’s CLI tool.

surround

The Surround Command Line Interface

usage: surround [-h] [-v]
                {init,run,lint,store,config,experimentation,split,viz,data}
                ...

Named Arguments

-v, --version

Show the current version of Surround

Default: False

init

Initialize a new Surround project.

usage: surround init [-h] [-p PROJECT_NAME] [-d DESCRIPTION] [-w REQUIRE_WEB]
                     [path]

Positional Arguments

path

Path for creating a Surround project

Default: “./”

Named Arguments

-p, --project-name
 Name of the project
-d, --description
 A description for the project
-w, --require-web
 Is web service required for the project

run

Run a Surround project assembler and task.

Without any arguments, all tasks will be listed.

Assemblers are defined in the __main__.py file of the current project. The default assembler that comes with every project is called baseline.

Tasks are defined in the dodo.py file of the current project. Each project comes with a set of default tasks listed below.

Containerised Tasks:

  • build - Build a Docker image for your Surround project.
  • dev - Run the specified assembler in a Docker container with the current source code (via drive mount, no build necessary).
  • prod - Build the Docker image and run the specified assembler inside a container (no drive mounting).
  • batch - Run the specified assembler in a Docker container (mounting input and output folders) set to batch mode.
  • train - Run the specified assembler in a Docker container (mounting input and output folders) set to train mode.
  • web - Serve the specified assembler via HTTP endpoints inside a Docker container.
  • remove - Remove the Docker image built for this project (if any).
  • jupyter - Run a jupyter notebook server in a Docker container (mounting the whole project).

Local Tasks:

  • batchLocal - Run the specified assembler locally set to batch-predict mode.
  • trainLocal - Run the specified assembler locally set to train mode.
  • webLocal - Serve the specified assembler via HTTP endpoints locally.

usage: surround run [-h] [task]

Positional Arguments

task

Task defined in dodo.py file of your project

lint

Run the Surround Linter on the current project.

For more information on what this does, see linter.

usage: surround lint [-h] [-l | path]

Positional Arguments

path

Path for running the Surround linter

Default: ./

Named Arguments

-l, --list

List all Surround checkers

Default: False

data

usage: surround data [-h] {create,inspect,lint} ...

subcommands

This tool must be called with one of the following commands

command Possible choices: create, inspect, lint

Sub-commands:

create

Create a data container from a file or directory

usage: surround data create [-h] (-f FILE | -d DIRECTORY | -m) [-o OUTPUT]
                            [-e EXPORT_METADATA]

Named Arguments

-f, --file

Path to file to import into container

-d, --directory

Path to directory to import into container

-m, --metadata-only

Generate metadata without a file system

Default: False

-o, --output

Path to file to export container to (default: specified-path.data.zip)

-e, --export-metadata

Path to JSON file to export metadata to

inspect

Inspect the metadata and/or contents of a data container

usage: surround data inspect [-h] [-m | -c] container_file

Positional Arguments

container_file

Path to the data container to inspect

Named Arguments

-m, --metadata-only

Inspect the metadata of the container only

Default: False

-c, --content-only

Inspect the contents of the container only

Default: False

lint

Check the validity of a data container

usage: surround data lint [-h] [-l] [-c CHECK_ID] container_path

Positional Arguments

container_path

Path to the container to perform checks on

Named Arguments

-l, --list

List the checks the linter will perform

Default: False

-c, --check-id

Specify a single check to perform (get id from --list)

split

Tool to randomly split data into test, train, and validation sets.

Supports splitting:

  • Directory of files
  • CSV files
  • Text files (just ensure you use the --no-header flag)

Example - Split a directory of images into train/test/validate:

$ surround split -d images -e png

Example - Reset a split directory:

$ surround split --reset images

Example - Splitting and resetting a CSV file:

$ surround split -t test.csv
$ surround split --reset .

usage: surround split [-h] (-t TEXT_FILE | -d DIRECTORY | -r RESET)
                      [-e EXTENSION] [-tr TRAIN] [-te TEST] [-va VALIDATE]
                      [-nv] [-ns] [-nh]

Named Arguments

-t, --text-file

Split text file into train/test/validate sets

-d, --directory

Split directory into train/test/validate sets

-r, --reset

Path to directory containing train/test/validate folders to reset

-e, --extension

File extension of the files to process (default: *)

Default: “*”

-tr, --train

Percentage of files for training (default: 80%)

Default: 80

-te, --test

Percentage of files for test (default: 10%)

Default: 10

-va, --validate

Percentage of files for validate (default: 10%)

Default: 10

-nv, --no-validate

Don’t produce a validation set when splitting

Default: False

-ns, --no-shuffle

Don’t randomise when splitting data

Default: False

-nh, --no-header

Use this flag when the text file has no headers

Default: False

API Reference

Here you can find documentation on all classes and their methods in Surround.

Assembler

class surround.assembler.Assembler(assembler_name='', config=None)[source]

Class responsible for assembling and executing a Surround pipeline.

Responsibilities:

  • Encapsulate the configuration data and pipeline stages
  • Load configuration from a specified module
  • Run the pipeline with input data in predict/batch/train mode

For more information on this process, see the About page.

Example:

assembler = Assembler("Example pipeline")
assembler.set_stages([PreFilter(), PredictStage(), PostFilter()])
assembler.init_assembler(batch_mode=False)

data = AssemblyState("some data")
assembler.run(data, is_training=False)

Batch-predict mode:

assembler.init_assembler(batch_mode=True)
assembler.run(data, is_training=False)

Training mode:

assembler.init_assembler(batch_mode=True)
assembler.run(data, is_training=True)

Predict/Estimate mode:

assembler.init_assembler(batch_mode=False)
assembler.run(data, is_training=False)

Constructor for an Assembler pipeline:

Parameters:
  • assembler_name (str) – The name of the pipeline
  • config – Surround Config object
init_assembler()[source]

Initializes the assembler and all of its stages.

Calls the surround.stage.Stage.initialise() method of all stages and the estimator.

Note

Should be called after surround.assembler.Assembler.set_config().

Returns:whether the initialisation was successful
Return type:bool
load_config(module)[source]

Given a module contained in the root of the project, create an instance of surround.config.Config loading configuration data from the config.yaml found in the project, and use this configuration for the pipeline.

Note

Should be called before surround.assembler.Assembler.init_assembler().

Parameters:module (str) – name of the module
run(state=None, mode=<RunMode.PREDICT: 2>)[source]

Run the pipeline using the input data provided.

If is_training is set to True then when it gets to the execution of the estimator, it will use the surround.stage.Estimator.fit() method instead.

If surround.enable_stage_output_dump is enabled in the Config instance then each stage and estimator’s surround.stage.Stage.dump_output() method will be called.

This method doesn’t return anything, instead results should be stored in the state object passed in the parameters.

Parameters:
  • state (surround.State) – Data passed between each stage in the pipeline
  • is_training (bool) – Run the pipeline in training mode or not
set_config(config)[source]

Set the configuration data to be used during pipeline execution.

Note

Should be called before surround.assembler.Assembler.init_assembler().

Parameters:config (surround.config.Config) – the configuration data
set_finaliser(finaliser)[source]

Set the final stage that will be executed no matter how the pipeline runs. This will be executed even when the pipeline fails or throws an error.

Parameters:finaliser (surround.stage.Stage) – the final stage instance
set_stages(stages)[source]

Set the stages to be executed one after the other in the pipeline.

Parameters:stages (list of surround.stage.Stage) – list of stages to execute

Config

class surround.config.Config(project_root=None, package_path=None, auto_load=False)[source]

An iterable dictionary class that loads and stores all the configuration settings from both default and project YAML files and environment variables. Primarily used in stages to retrieve configuration data set for development/production.

Responsibilities:

  • Parse the config.yaml file and store the data as key-value pairs.
  • Allow environment variables to override data loaded from file/dict (must be prefixed with SURROUND_).
  • Provide READ-ONLY access to the stored config values via [] operator and iteration.

Example usage:

config = Config()
config.read_from_dict({ "debug": True })
config.read_config_files(["config.yaml"])

if config["debug"]:
    # Do debug stuff
    pass

for key, value in config:
    # Iterate over all data
    print(key, value)

You could then override the above configuration using the system’s environment variables; just prefix the variable with SURROUND_ like so:

SURROUND_DEBUG=False

It also supports overriding nested configuration data, for example with the following config:

predict:
    debug: True

We can override the above with the following environment variable:

SURROUND_PREDICT_DEBUG=False

Constructor of the Config class, loads the default YAML file into storage. If the project_root is provided then the project’s config.yaml file is also loaded into configuration.

The default config file (defaults.yaml) can be found in the same directory as the config.py script. The project config file (config.yaml) can be found in the root of the project folder.

Parameters:
  • project_root (str) – path to the root directory of the surround project (default: None)
  • package_path (str) – path to the root directory of the package that contains the surround project (default: None)
  • auto_load (bool) – Attempt to load the config.yaml file from the Surround project in the current directory (default: False)
get_dict()[source]

Returns the configuration data in a dictionary

Returns:dictionary of the configuration data
Return type:dict
get_path(path)[source]

Returns value that can be found at the key path provided (useful for nested values).

For example:

config.get_path('surround.stages') == config['surround']['stages']
--> True
Parameters:path (str) – path to the value in storage
Returns:the value found at the path or none if not found
Return type:any
static instance()[source]

Static method which returns a singleton instance of Config.

read_config_files(yaml_files)[source]

Parses the YAML files provided and stores their key-value pairs in config.

Parameters:yaml_files (list) – multiple paths to the YAML files to load
Returns:true on success, throws IOError on failure
Return type:bool
read_from_dict(config_dict)[source]

Retrieve all key-value pairs from the dict provided and store in config.

Parameters:config_dict (dict) – configuration settings to be added to storage
Returns:true on success, throws exception on failure (TypeError)
Return type:bool

State

class surround.State[source]

Stores the data to be passed between each stage in a pipeline. Each stage is responsible for setting attributes on this class.

Formerly known as SurroundData.

Attributes:

  • stage_metadata (list) - information that can be used to identify the stage
  • execution_time (str) - how long it took to execute the entire pipeline
  • errors (list) - list of error messages (stops the pipeline when appended to)
  • warnings (list) - list of warning messages (displayed in console)

Example:

class AssemblyState(State):
    # Extra attributes must be defined before the pipeline is run!
    input_data = None
    output_data = None

    def __init__(self, input_data):
        self.input_data = input_data


class Predict(Estimator):
    def estimate(self, state, config):
        # Do prediction here
        state.output_data = state.input_data

pipeline = Assembler("Example")
pipeline.set_stages([Predict()])
pipeline.init_assembler()

data = AssemblyState("received data")
pipeline.run(data)

print(data.output_data)

Note

This class is frozen when the pipeline is being ran. This means that an exception will be thrown if a new attribute is added during pipeline execution.

Stage

class surround.stage.Stage[source]

Base class of all stages in a Surround pipeline.

See the following class for more information:

dump_output(state, config)[source]

Dump the output of the stage after the stage has transformed the data.

Note

This is called by surround.assembler.Assembler.run() (when dumping output is requested).

Parameters:
  • state (surround.State) – Stores intermediate data from each stage in the pipeline
  • config (surround.config.Config) – Contains the settings for each stage
initialise(config)[source]

Initialise the stage, this may be loading a model or loading data.

Parameters:config (surround.config.Config) – Contains the settings for each stage
operate(state, config)[source]

Main function to be called in an assembly.

Parameters:
  • state – Contains all pipeline state including input and output data
  • config – Config for the assembly

Estimator

class surround.stage.Estimator[source]

Base class for an estimator in a Surround pipeline. Responsible for performing estimation or training using the input data.

This stage is executed by surround.assembler.Assembler.run().

Example:

class Predict(Estimator):
    def initialise(self, config):
        self.model = load_model(os.path.join(config["models_path"], "model.pb"))

    def estimate(self, state, config):
        state.output_data = run_model(self.model)

    def fit(self, state, config):
        state.output_data = train_model(self.model)
estimate(state, config)[source]

Process input data and store estimated values.

Note

This method is ONLY called by surround.assembler.Assembler.run() when running in predict/batch-predict mode.

Parameters:
  • state (Instance or child of the surround.State class) – Stores intermediate data from each stage in the pipeline
  • config (surround.config.Config) – Contains the settings for each stage
fit(state, config)[source]

Train a model using the input data.

Note

This method is ONLY called by surround.assembler.Assembler.run() when running in training mode.

Parameters:
  • state (Instance or child of the surround.State class) – Stores intermediate data from each stage in the pipeline
  • config (surround.config.Config) – Contains the settings for each stage

Runner

class surround.runners.Runner(assembler=None)[source]

Base class for runners which are responsible for:

  • Loading the data and preparing it to be fed into the pipeline
  • Running the assembler with the loaded data

Example batch runner:

class BatchRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblyState()

        if mode == RunMode.TRAIN:
            state.input_data = load_files('training_set')
        else:
            state.input_data = load_files('predict_set')

        return state

Note

You get a Batch Runner and Web Runner (if web requested) when you generate a project using the CLI tool.

Parameters:assembler (surround.assembler.Assembler) – The assembler the runner will execute
load_data(mode, config)[source]

Load the data and prepare it to be fed into the surround.assembler.Assembler.

Parameters:
  • mode (surround.runners.RunMode) – the mode the assembly was run in (batch, train, predict, web)
  • config (surround.config.Config) – the configuration of the assembly
run(mode=<RunMode.PREDICT: 2>)[source]

Prepare data and execute the surround.assembler.Assembler.

Parameters:is_training (bool) – Run the pipeline in training mode or not
set_assembler(assembler)[source]

Set the Assembler instance the runner will execute.

Parameters:assembler (surround.assembler.Assembler) – the Assembler instance

Data Container

class surround.data.container.DataContainer(path=None, metadata_version='v0.1')[source]

Represents a data container which holds both data and metadata.

Responsibilities:

  • Import files into a container and export
  • Load existing containers
  • Extract files
Parameters:
  • path (str) – path for container to load (default: None)
  • metadata_version (str) – the version of metadata being used (default: v0.1)
export(export_to)[source]

Import all staged files into the container, hash the contents, set the hash to the metadata and import the metadata file.

Parameters:export_to (str) – path to export the file to
extract_all(extract_to)[source]

Extract all files in the current data container to a path on disk

Parameters:extract_to (str) – path to extract files to
Returns:true on success, false otherwise
Return type:bool
extract_file(internal_path, extract_path='.')[source]

Extract a file in the current data container to a path on disk

Parameters:
  • internal_path (str) – path inside the container
  • extract_path – path to extract file to
Returns:

true on success, false otherwise

Return type:

bool

extract_file_bytes(path)[source]

Extract the bytes of a file in the current data container

Parameters:path (str) – path inside the container
Returns:the bytes extracted or None if it doesn’t exist
Return type:bytes
extract_files(internal_paths, extract_path='.')[source]

Extract files in the current data container to a path on disk

Parameters:
  • internal_paths (list) – list of files to extract
  • extract_path (str) – path to extract files to
Returns:

true on success, false otherwise

Return type:

bool

file_exists(path)[source]

Checks whether file exists in current data container

Returns:true if the file exists
Return type:bool
get_files()[source]

Returns all the files in the current data container

Returns:list of the files
Return type:list
import_directory(path, generate_metadata=True, reimport=True)[source]

Stage the directory provided for importing when export is requested.

Parameters:
  • path (str) – the directory of files to import
  • generate_metadata (bool) – whether metadata should be generated for this folder
  • reimport (bool) – whether or not files that are already staged should be staged again
import_file(import_path, internal_path, generate_metadata=True)[source]

Stage file for importing when the next export operation is called.

Parameters:
  • import_path (str) – path to the file on the users drive
  • internal_path (str) – path to the file inside the container
  • generate_metadata (bool) – whether metadata should be generated for this file
import_files(files, generate_metadata=True)[source]

Stage the list of files for importing when export is requested.

Parameters:
  • files (list) – list of files to import
  • generate_metadata (bool) – whether metadata should be generated for this file
load(path)[source]

Load an existing data container, preparing it for extracting files.

Parameters:path (str) – path to the container

Metadata

class surround.data.metadata.Metadata(version='v0.1')[source]

Represents metadata of a Data Container.

Responsibilities:

  • Create metadata, exporting to YAML string and/or file
  • Generate default metadata as per schema
  • Automatically generate values for fields based on the files given
  • Get/set properties
Parameters:version (str) – the version of the schema to use (default: v0.1)
generate_default(version)[source]

Generate a dictionary with all required fields created as per the schema.

Parameters:version (str) – which version of the schema to use
Returns:the dictionary with default values
Return type:dict
generate_from_directory(directory)[source]

Automatically generate metadata from a directory, such as:

  • Formats (mime types)
  • Types (types from vocab)
  • Group manifests (each root level directory is considered a group)
Parameters:directory (str) – path to the directory to generate from
generate_from_file(filepath)[source]

Automatically generate metadata from a single file

Parameters:filepath (str) – path to the file
generate_from_files(files, root, root_level_dirs)[source]

Automatically generate metadata from a list of files such as:

  • Formats (mime types)
  • Types (types from vocab)
  • Group manifests (each root level directory is considered a group)
Parameters:
  • files (list) – list of files to generate from
  • root (str) – path to the root of the folder containing the files
  • root_level_dirs (list) – list of directories in the root
generate_manifest_for_group(group_name, files, formats=None)[source]

Generate a manifest for a group of files where the manifest contains:

  • path
  • description
  • language
  • formats (mime types)
  • types (from vocab)

Store the manifest in the metadata storage plus return it.

Parameters:
  • group_name (str) – name of the group
  • files (list) – list of files in the group
  • formats (list) – list of formats in the group
Returns:

the manifest created

Return type:

dict

get_property(path)[source]

Get the value of a property given a path in dot notation e.g. summary.title

metadata.get_property('summary.title') would retrieve Test name from the following:

summary:
    title: Test name
Parameters:path (str) – path to the property using dot notation
Returns:the value of the property, none otherwise
Return type:any
load_from_data(data)[source]

Load metadata from a YAML string

Parameters:data (str) – YAML string
load_from_path(path)[source]

Load metadata from file (YAML)

Parameters:path (str) – path to the YAML file
save_to_data()[source]

Returns metadata as string formatted in YAML

Returns:the data in YAML string
Return type:str
save_to_json(indent=4)[source]

Returns metadata as string formatted in JSON

Parameters:indent (int) – number of spaces in indentations
Returns:the data in JSON format
Return type:str
save_to_json_file(path, indent=4)[source]

Saves metadata to JSON file

Parameters:
  • path (str) – path to file to export to
  • indent (int) – number of spaces in indentations
save_to_path(path)[source]

Save metadata to YAML file

Parameters:path (str) – path to save file to
set_property(path, value)[source]

Set the value of a property given a path in dot notation e.g. summary.title

For example, metadata.set_property('summary.title', 'New title') would set the title of the data container.

Parameters:
  • path (str) – path to the property in dot notation
  • value (any) – value to set to the property