
Welcome to Surround¶
Welcome to Surround’s documentation! Surround is a framework for building machine learning pipelines in Python.
For a quick rundown on getting started with Surround see Getting Started. For more information on the aim and philosophy of Surround see About. Just need to learn more about a particular method or class? See API Reference.
About¶
What is Surround?¶
Surround is an open-source framework developed by the Applied Artificial Intelligence Institute (A2I2) to take machine learning solutions through from exploration all the way to production. For this reason, it is developed with both research engineers and software developers in mind. Designed to play nice with existing machine learning frameworks (Tensorflow, MXNet, PyTorch, etc) and cloud services (Google Cloud AI, SageMaker, Rekognition etc), engineers have the freedom to use whatever necessary to solve their problem.
A Philosophy¶
Surround isn’t just a framework, it’s also a philosophy. From the moment data lands on our desk we need to be thinking about the final use case for the solutions we are developing. Surround was built to reduce the time between data exploration and a containerised proof-of-concept web application ready to be deployed, and to resolve some competing requirements of researchers and engineers: in general, researchers want to dive into the data and leave code quality until later, while engineers prefer well-structured code from the start. Surround attempts to solve this problem by introducing a “production first” mindset and providing conventions for researchers (such as a separate folder for data exploration scripts).
Long ago, web frameworks realised that there is a set of concerns almost all web applications must deal with, such as connecting to databases, managing configuration, rendering static and dynamic content, and handling security. Machine learning projects have similar concerns, but also their own special set, such as:
- Experimentation is a first class citizen
- Data and models need to be versioned and managed
- Model performance needs to be visualized
- Training infrastructure is required
- Etc..
Surround strives to provide a single place for every concern that arises when building an ML project. Ideally there will be a single solution to any concern that arises for either the research engineer or the software developer. But to be the single place for ML projects, we need to support as many existing frameworks, libraries, and APIs as we can. This is reflected in the design of Surround, where the core framework could be used to build:
- A solution based on cloud APIs
- A custom Docker image for SageMaker
- Part of a batch process running on an internal Kubernetes cluster
By playing nice with others we hope the core Surround framework can continue to be used as the ML ecosystem evolves.
A set of conventions¶
Surround attempts to enforce a set of conventions that help researchers keep their solutions structured for software developers, and it implements solutions for common ML project concerns, such as managing configuration, so that they don’t have to.
These conventions are adhered to through the use of a project generator and a project linter that checks for the core conventions. For example, during project generation the following structure is used:
package name
├── Dockerfile
├── README.md
├── data
├── package name
│   ├── stages
│   │   ├── __init__.py
│   │   ├── input_validator.py
│   │   ├── baseline.py
│   │   └── assembler_state.py
│   ├── __init__.py
│   ├── __main__.py
│   ├── web_runner.py
│   ├── file_system_runner.py
│   └── config.yaml
├── docs
├── dodo.py
├── models
├── notebooks
├── output
├── requirements.txt
├── scripts
├── spikes
└── tests
Every Surround project has the following characteristics:
- A Dockerfile for bundling up the project as a Docker container.
- A dodo.py file containing useful tasks for the project, such as train, batch predict, and test.
- Tests for catching training/serving skew.
- A single entry point for running the application, __main__.py.
- A place for data exploration with Jupyter notebooks and miscellaneous scripts.
- A single place for output files, data, and model storage.
A command line tool¶
Surround also comes with a command line tool (CLI) which can perform a variety of tasks such as project generation and running the project in Docker. The tools included are shown below:
- init - Used to generate a new Surround project.
- lint - Used to run the Surround Linter, which checks whether Surround conventions are being followed correctly.
- run - Used to run a task defined in dodo.py.
The run command is essentially a wrapper around the doit library, and the Surround Linter performs multiple checks on the current project to see if it is following standard conventions. The intention is for the Surround Linter to become more of an assistant when building ML projects. These tools are automatically added to your environment path so they can be used anywhere in your preferred terminal application.
A Python library¶
The last component of Surround is the Python library. We developed the Python library to provide a flexible way of running an ML pipeline in a variety of situations, whether that be from a queue, an HTTP endpoint, or a file system. We found that during development the research engineer often needs to run the pipeline over data from the file system, something that is not always needed in a production environment. Surround’s Python library was designed to leverage the conventions outlined above to provide the maximum productivity boost to research engineers, provided the conventions are followed. Surround also provides wrappers around libraries such as the Tornado web server to provide advanced functionality. These 3rd party dependencies are not installed by default and need to be added to the project before Surround will make the wrappers available.
How does Surround work at its core?¶
At its core, there are four main concepts you need to understand while using Surround; they are described in the sections below.
The most important are the first two, since they make up the actual pipeline responsible for taking in data and producing a prediction based on that input.
Assembler¶

The Assembler is responsible for constructing and executing a pipeline on data. How the pipeline is constructed (and where/how data is loaded) depends on which execution mode is being used. The above diagram describes a simple Surround pipeline showing three different modes of execution. These modes are described below.
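Before looking at each mode, here is a rough sketch of constructing and running an Assembler directly from the Python library, using the methods documented in the API Reference. The stage, estimator, and state names are illustrative, and the import paths assume everything is exported from the top-level surround package as in the tutorial below:
from surround import Assembler, Estimator, State, Validator

class ExampleState(State):
    # Extra attributes must be declared before the pipeline is run
    input_data = None
    output_data = None

class ExampleValidator(Validator):
    def validate(self, state, config):
        # Raising an exception here stops the pipeline
        if state.input_data is None:
            raise ValueError("'input_data' is None")

class ExampleEstimator(Estimator):
    def estimate(self, state, config):
        # A trivial "model": echo the input
        state.output_data = state.input_data

    def fit(self, state, config):
        pass  # No training in this sketch

# Construct the pipeline, initialise every stage, then run it on some data
assembler = Assembler("example").set_stages([ExampleValidator(), ExampleEstimator()])
assembler.init_assembler()

state = ExampleState()
state.input_data = "some data"
assembler.run(state)
print(state.output_data)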
Training¶

Primarily built for training: training data is loaded from disk (usually in bulk) and then fed through the pipeline with the estimator set to fit mode. Once training of the pipeline is complete, the data is then fed to a visualiser which helps display useful information about the training operation.
Batch-predict¶

Primarily built for evaluation: data is loaded from disk (also usually in bulk) and then fed through the pipeline with the estimator set to estimate mode. Once processing is complete, the data is then fed to a visualiser which helps summarise and visualise the overall results and performance.
Web / Predict¶

This mode is built for production. Once your pipeline is set up, training has been completed, and evaluation shows the model performs well enough for use, this mode is used to serve your pipeline. Depending on the type of project you generated initially, the input data may come from your local disk or from the body of an HTTP POST request, and the result may be saved locally or returned to the client that sent the request.
Stages¶
A stage, at its base, can do three things:
- Initialize anything needed to complete its function. This may include loading a TensorFlow graph or loading configuration data.
- Perform its intended operation. Whether that be feeding data through a model or checking if the data is correct.
- Dump output from the operation to the console (if requested, used for debugging).
During processing, two objects are passed between each stage (see the sketch below):
- A State object, which contains the input data, has a field for errors (which stops the execution when appended to), and holds the output of each stage (if any).
- A Configuration object, which contains all the settings loaded from YAML files plus paths to folders in the project such as input/ and output/.
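As a rough illustration, a project-specific State subclass (like the one generated in assembler_state.py) might look like the sketch below; the field names are illustrative, while errors and warnings come from the base class documented in the API Reference:
from surround import State

class AssemblerState(State):
    # Extra attributes must be declared before the pipeline is run,
    # because State is frozen during execution
    input_data = None
    output_data = None

# Inside a stage you would typically read settings from the config object
# and record problems on the state, for example:
#
#   threshold = config["minFaceWidth"]            # plain key lookup
#   state.errors.append("bad input")              # appending an error stops the pipeline
#   state.warnings.append("image is quite dark")  # warnings are only displayed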
Validators¶
Validators are stages responsible for checking whether the input data about to be fed through the pipeline is valid: is it in the correct format, and is there any detectable reason it would cause issues while being processed? This stage is positioned first in the execution of the pipeline; validators are not intended to create any output, only errors or warnings.
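For example, a minimal validator (mirroring the validate method used in the Getting Started tutorial, with the expected input type assumed to be a string purely for illustration) could look like:
from surround import Validator

class InputValidator(Validator):
    def validate(self, state, config):
        # Raise an exception (or append to state.errors) to stop the pipeline
        if not isinstance(state.input_data, str):
            raise ValueError("Input is not a string!")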
Filters¶
Filters are stages that are responsible for getting data ready for the next stage of execution. These are typically placed before or after Estimators. There are generally two types of filters: Wranglers (Pre-filters) and Deciders (Post-filters).
Wranglers (Pre-filters)¶
Wranglers perform data wrangling operations on the data, that is, getting the data from one format into another that is useful for the next stage (typically an Estimator). For example, the input data might be a str formatted as JSON, but the next estimator in the pipeline might only accept a Python dict, so a Wrangler would be used to parse the str into a dict.
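A sketch of such a wrangler is shown below. It assumes your version of Surround exposes a Filter base class whose operate(state, config) method is called during execution; if your version names the base class or method differently, adjust accordingly:
import json

from surround import Filter

class ParseJsonWrangler(Filter):
    def operate(self, state, config):
        # Convert the raw JSON string into a dict for the estimator that follows
        state.input_data = json.loads(state.input_data)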
Deciders (Post-filters)¶
Deciders, placed after Estimators, are stages which make decisions based on the estimator’s output. For example, in a Voice Activity Detection pipeline we may have an estimator that outputs confidence values indicating whether the input audio was speech or not; a Decider placed after it might then apply a threshold to those confidence values.
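Under the same assumption about a Filter base class with an operate(state, config) method, a thresholding decider for that example could be sketched as follows (the confidence field and config key are illustrative):
from surround import Filter

class SpeechDecider(Filter):
    def operate(self, state, config):
        # Turn the estimator's confidence score into a yes/no decision
        threshold = config.get_path("decider.threshold") or 0.5
        state.output_data = state.output_data >= threshold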
Estimators¶
Estimators are stages where the actual prediction or training of an ML model takes place. Depending on the pipeline configuration, the estimator will either use the input data to make a prediction or use it as training data. This stage should have some form of output and is typically placed between two Filters during execution. For example, if you are using TensorFlow to run your model, you would create an estimator that loads the model and creates a TensorFlow session during initialization, and runs the session with the input data during execution of the stage.
In more complex pipelines, these stages may be composed of an entirely separate Surround pipeline (another Assembler instance). Surround is designed this way to allow pipelines to be as complex as required.
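A rough sketch of an estimator along those lines is shown below; load_model and run_model are hypothetical helpers standing in for whatever ML framework you use, and models_path is an assumed config key:
import os

from surround import Estimator

def load_model(path):
    # Hypothetical helper: a real project would load a serialized model
    # (e.g. a TensorFlow graph) from the given path
    return str.upper

def run_model(model, data):
    # Hypothetical helper: run the loaded model over the input data
    return model(data)

class Predict(Estimator):
    def initialise(self, config):
        # One-off setup: load the model once, before any data flows through
        self.model = load_model(os.path.join(config["models_path"], "model.pb"))

    def estimate(self, state, config):
        # Called in predict / batch-predict mode
        state.output_data = run_model(self.model, state.input_data)

    def fit(self, state, config):
        # Called in training mode; no real training in this sketch
        pass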
Visualisers¶
Visualisers are stages that do what their name entails: visualise the data. Typically used during training and evaluation of the model, these stages generate reports on how the model is performing. For example, in a Facial Detection pipeline the visualiser might, during evaluation of the model, display an example image it processed and render boxes around the faces it detected.
Configuration¶
Every instance of Assembler has a configuration object constructed from the project’s configuration file. This configuration object is passed between each stage of the pipeline during initialization and execution. The configuration file uses the YAML data-serialization language.
Example configuration file:
pathToModels: ../models
model: hog # 'hog' or 'cnn'
minFaceWidth: 100 # Threshold for the width of a face bounding box in pixels
minFaceHeight: 125 # Threshold for the height of a face bounding box in pixels
useAllFaces: true # If false, only extract encodings for the largest face
imageTooDark: 23 # Threshold for determining if an image is too dark, lower values = darker image
blurryThreshold: 4 # Smaller values indicate a "more" blurry image
gpuDynamicMemoryAllocation: true # If true, Tensorflow will allocate GPU memory on an as-needs basis. perProcessGpuMemoryFraction will have no effect.
perProcessGpuMemoryFraction: 0.5 # Fraction of GPU memory Tensorflow should acquire. Has no effect if gpuDynamicMemoryAllocation is true.
rotateImageModelFile: image-rotator/image-rotator-2018-04-05.pb # Model used to detect the orientation of the image
rotateImageModelLabels: image-rotator/labels.txt # Labels for the image orientation model
rotateImageInputLayer: conv2d_1_input # Tensorflow input layer
rotateImageOutputLayer: activation_5/Softmax # Tensorflow output layer
rotateImageInputHeight: 100 # Input image height to the image stage neural network
rotateImageInputWidth: 100 # Input image width to the image stage neural network
rotateImageThreshold: 0.5 # Rotate image if the orientation is above this threshold
rotateImageSkip: false # Option to skip image rotation step
imageSizeMax: 700 # Maximum allowable image size (width or height). Images larger than this will be downsized.
postgres: # Postgres database options
  user: postgres # Postgres username
  password: postgres # Postgres password
  host: localhost # Postgres server host
  port: 5432 # Postgres server port
  db: face_recognition # Which database to connect to
webcamStream: # Webcam stream options
  drawBox: true # Whether to draw a box around detected faces
  minConfidence: 0.5 # Discard detections below this confidence level
  highConfidence: 0.9 # Confidence values at or above this level are deemed to be 'highly confident'
celery:
  broker: pyamqp://guest@localhost
  backend: redis://localhost
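As a rough sketch of how a stage might read these settings at runtime (using the access patterns documented for Config in the API Reference; the stage and field names are illustrative):
from surround import Validator

class FaceInputValidator(Validator):
    def validate(self, state, config):
        # Plain key lookup, as with a dictionary
        min_width = config["minFaceWidth"]
        min_height = config["minFaceHeight"]

        # Dot-notation lookup for nested values
        db_host = config.get_path("postgres.host")

        if state.input_data is None:
            raise ValueError("'input_data' is None")

        state.warnings.append(
            "Validating against a %sx%s face size threshold (DB at %s)"
            % (min_width, min_height, db_host)
        )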
Getting Started¶
Installation¶
Prerequisites¶
- Python 3+ (Tested on 3.6.5)
- Docker
- Supports MacOS, Linux, and Windows
Install via Pip¶
Run the following command to install the latest version of Surround:
$ pip3 install surround
Note
If this doesn’t work, make sure you have pip installed. See the pip documentation for instructions on installing it.
Now the Surround library and command-line tool should be installed! To make sure, run the following command:
$ surround
If it works then you are ready for the Project Setup stage.
Project Setup¶
Before we can create our first pipeline, we need to generate an empty Surround project. Use the following command to generate a new project:
$ surround init -p test_project -d "Our first pipeline"
When it asks the following, respond with n
(we’ll cover this in later sections):
Does it require a web runner? (y/n) n
This will create a new folder called test_project
with the following file structure:
test_project
├── test_project/
│   ├── stages
│   │   ├── __init__.py
│   │   ├── input_validator.py
│   │   ├── baseline.py
│   │   └── assembler_state.py
│   ├── __main__.py
│   ├── __init__.py
│   ├── config.yaml
│   └── file_system_runner.py
├── input/
├── docs/
├── models/
├── notebooks/
├── output/
├── scripts/
├── spikes/
├── tests/
├── __main__.py
├── __init__.py
├── dodo.py
├── Dockerfile
├── requirements.txt
└── README.md
The generated project comes with an example pipeline that can be run straight away using the following commands:
$ cd test_project
$ surround run batchLocal
This should output the following:
INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Validator InputValidator took 0:00:00 secs
INFO:surround.assembler:Estimator Baseline took 0:00:00 secs
Now you are ready for Creating your first pipeline.
See also
Not sure what a pipeline is? Checkout our About section first!
Creating your first pipeline¶
For our first Surround pipeline, we are going to do some very basic data transformation and convert the input string from lower case to upper case. This pipeline is going to consist of two stages, InputValidator and MakeUpperCase.
Open the script stages/input_validator.py and you should see the following code already generated:
from surround import Validator

class InputValidator(Validator):
    def validate(self, state, config):
        if not state.input_data:
            raise ValueError("'input_data' is None")
As you can see, we are already given the InputValidator stage; we just need to edit the validate method to check that the input data is the correct data type (str):
def validate(self, state, config):
    if not isinstance(state.input_data, str):
        # Raise an exception, this will stop the pipeline
        raise ValueError('Input is not a string!')
Now we need to create our MakeUpperCase stage, so head to stages/baseline.py, where you should see:
from surround import Estimator

class Baseline(Estimator):
    def estimate(self, state, config):
        state.output_data = state.input_data

    def fit(self, state, config):
        LOGGER.info("TODO: Train your model here")
Make the following changes:
class MakeUpperCase(Estimator):
    def estimate(self, state, config):
        # Convert the input into upper case
        state.output_data = state.input_data.upper()

        # Print the output to the terminal (to check it's working)
        LOGGER.info("Output: %s" % state.output_data)

    def fit(self, state, config):
        # Leave the fit method the same
        # We aren't doing any training in this guide
        LOGGER.info("TODO: Train your model here")
Since we renamed the estimator, we need to reflect that change when we create the Assembler.
First head to the stages/__init__.py file and rename Baseline to MakeUpperCase:
from .baseline import MakeUpperCase
from .input_validator import InputValidator
from .assembler_state import AssemblerState
Then in __main__.py, where the estimator is imported, make sure it looks like so:
from stages import MakeUpperCase, InputValidator
And where the assembler is created, make sure it looks like so:
assemblies = [
    Assembler("baseline")
        .set_stages([InputValidator(), MakeUpperCase()])
]
That’s it for the pipeline!
To test the pipeline with the default input (the "TODO: Load raw data here" string), just run the following command:
$ surround run batchLocal
The output should be the following:
INFO:surround.assembler:Starting 'baseline'
INFO:stages.baseline:Output: TODO: LOAD RAW DATA HERE
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
To change what input is fed through the pipeline, modify file_system_runner.py and change what is assigned to state.input_data:
import logging
from surround import Runner
from stages import AssemblerState

logging.basicConfig(level=logging.INFO)

class FileSystemRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblerState()

        # Load data to be processed
        raw_data = "This daTa wiLL end UP capitalizED"

        # Setup input data
        state.input_data = raw_data

        return state
Note
To test training mode (the estimator's fit method will be called instead of estimate), run the following command:
$ surround run trainLocal
Running your first pipeline in a container¶
First you must build an image for your container. To do this just run the following command:
$ surround run build
Then to run the container in dev mode just use the following command:
$ surround run dev
This will run the container, linking the folder test_project/test_project with the working directory in the container. So during development, when you make small changes there is no need to rebuild the image; just run this command again.
Then when you are ready for production you can use the following command:
$ surround run prod
This will first build the image and then run the container without any linking to the host machine. The image created in the build can then be pushed to a Docker Hub repository and shared.
Note
Both dev and prod will use the default mode of the project, which in non-web projects is RunMode.BATCH_PREDICT, otherwise it's RunMode.WEB.
The following commands will force which mode to use:
$ surround run batch
$ surround run train
Note
To see a list of available tasks, just run the command $ surround run
Serving your first pipeline via Web Endpoint¶
When generating a project, you get asked:
Does it require a web runner? (y/n)
If we say yes to this, Surround will still generate the generic file_system_runner.py, but it will also generate a new script called web_runner.py.
This script contains a new Runner which uses Tornado to host a web server, allowing your pipeline to be accessed via HTTP requests. By default the WebRunner will host two endpoints:
- /info - accessed via GET request; returns {'version': '0.0.1'}
- /estimate - accessed via POST request; the body must be a JSON document containing the input data: { "message": "this text will be processed" }
So let's create a new pipeline that does the same data processing as the one in Creating your first pipeline, but this time we will send strings via a web endpoint and get the results in the response of the request.
First generate a new project, this time saying yes to the require web prompt, make all the changes we did in Creating your first pipeline, and test that it is still working locally.
Next we are going to build an image for our pipeline using the command:
$ surround run build
Then we are going to run our default server using the command:
$ surround run web
You should get output like so:
INFO:root:Server started at http://localhost:8080
Note
If you would like to run it on the host machine instead of in a container, you must install Tornado using
this command: $ pip3 install tornado==6.0.2
Now hopefully if you load http://localhost:8080/info
in your preferred browser, you should see the following:
{"version": "0.0.1"}
Note
If you are running this on Windows and don’t see the above, try using http://192.168.99.100:8080/info
instead.
Next we are going to test the /estimate
endpoint by using the following command in another terminal:
On Linux/MacOS:
$ curl -d "{ \"message\": \"test phrase\" }" http://localhost:8080/estimate
On Windows (in Powershell):
$ Invoke-WebRequest http://192.168.99.100:8080/estimate -Method POST -Body "{ ""message"": ""test phrase"" }"
You should see the following output in the terminal running the pipeline:
INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
INFO:root:Message: TEST PHRASE
INFO:tornado.access:200 POST /estimate (::1) 1.95ms
So our data is successfully being processed! But what if we need the result?
Head to the script web_runner.py and append the following to the post method of EstimateHandler:
# Return the result of the processing
self.write({"output": self.data.output_data})
Restart the web server, send the same request as before, and you should see the following output:
On Linux/MacOS:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 53 100 25 100 28 806 903 --:--:-- --:--:-- --:--:-- 1709
{"output": "TEST PHRASE"}
On Windows (in Powershell):
StatusCode : 200
StatusDescription : OK
Content : {"output": "TEST PHRASE"}
RawContent : HTTP/1.1 200 OK
Content-Length: 25
Content-Type: application/json; charset=UTF-8
Date: Mon, 17 Jun 2019 06:43:54 GMT
Server: TornadoServer/6.0.2
{"output": "TEST PHRASE"}
Forms : {}
Headers : {[Content-Length, 25], [Content-Type, application/json; charset=UTF-8], [Date, Mon, 17 Jun 2019 06:43:54 GMT], [Server, TornadoServer/6.0.2]}
Images : {}
InputFields : {}
Links : {}
ParsedHtml : mshtml.HTMLDocumentClass
RawContentLength : 25
That's it, you are now serving a Surround pipeline! You could potentially use this pipeline in virtually any application.
Note
Since this project was generated with a web runner, the default mode is web. To run the pipeline using the FileSystemRunner instead, use the command $ surround run batch or $ surround run train.
Command-line Interface¶
The following is a list of the sub-commands contained in Surround’s CLI tool.
surround¶
The Surround Command Line Interface
usage: surround [-h] [-v]
{init,run,lint,store,config,experimentation,split,viz,data}
...
Named Arguments¶
-v, --version | Show the current version of Surround Default: False |
init¶
Initialize a new Surround project.
usage: surround init [-h] [-p PROJECT_NAME] [-d DESCRIPTION] [-w REQUIRE_WEB]
[path]
Positional Arguments¶
path | Path for creating a Surround project Default: “./” |
Named Arguments¶
-p, --project-name | Name of the project |
-d, --description | A description for the project |
-w, --require-web | Is web service required for the project |
run¶
Run a Surround project assembler and task.
Without any arguments, all tasks will be listed.
Assemblers are defined in the __main__.py file of the current project. The default assembler that comes with every project is called baseline.
Tasks are defined in the dodo.py file of the current project. Each project comes with the set of default tasks listed below.
Containerised Tasks:
- build - Build a Docker image for your Surround project.
- dev - Run the specified assembler in a Docker container with the current source code (via drive mount, no build necessary).
- prod - Build the Docker image and run the specified assembler inside a container (no drive mounting).
- batch - Run the specified assembler in a Docker container (mounting the input and output folders) set to batch mode.
- train - Run the specified assembler in a Docker container (mounting the input and output folders) set to train mode.
- web - Serve the specified assembler via HTTP endpoints inside a Docker container.
- remove - Remove the Docker image built for this project (if any).
- jupyter - Run a Jupyter notebook server in a Docker container (mounting the whole project).
Local Tasks:
- batchLocal - Run the specified assembler locally set to batch-predict mode.
- trainLocal - Run the specified assembler locally set to train mode.
- webLocal - Serve the specified assembler via HTTP endpoints locally.
usage: surround run [-h] [task]
Positional Arguments¶
task | Task defined in dodo.py file of your project |
lint¶
Run the Surround Linter on the current project.
For more information on what this does, see linter.
usage: surround lint [-h] [-l | path]
Positional Arguments¶
path | Path for running the Surround linter Default: ./ |
Named Arguments¶
-l, --list | List all Surround checkers Default: False |
data¶
usage: surround data [-h] {create,inspect,lint} ...
subcommands¶
This tool must be called with one of the following commands
command | Possible choices: create, inspect, lint |
Sub-commands:¶
create¶
Create a data container from a file or directory
surround data create [-h] (-f FILE | -d DIRECTORY | -m) [-o OUTPUT]
[-e EXPORT_METADATA]
Named Arguments¶
-f, --file | Path to file to import into container |
-d, --directory | Path to directory to import into container |
-m, --metadata-only | Generate metadata without a file system Default: False |
-o, --output | Path to file to export container to (default: specified-path.data.zip) |
-e, --export-metadata | Path to JSON file to export metadata to |
inspect¶
Inspect the metadata and/or contents of a data container
surround data inspect [-h] [-m | -c] container_file
Positional Arguments¶
container_file | Path to the data container to inspect |
Named Arguments¶
-m, --metadata-only | |
Inspect the metadata of the container only Default: False | |
-c, --content-only | |
Inspect the contents of the container only Default: False |
lint¶
Check the validity of a data container
surround data lint [-h] [-l] [-c CHECK_ID] container_path
Positional Arguments¶
container_path | Path to the container to perform checks on |
Named Arguments¶
-l, --list | List the checks the linter will perform Default: False |
-c, --check-id | Specify a single check to perform (get the id from --list) |
split¶
Tool to randomly split data into test, train, and validation sets.
Supports splitting:
- Directory of files
- CSV files
- Text files (just ensure you use the --no-header flag)
Example - Split a directory of images into train/test/validate:
$ surround split -d images -e png
Example - Reset a split directory:
$ surround split --reset images
Example - Splitting and resetting a CSV file:
$ surround split -t test.csv
$ surround split --reset .
usage: surround split [-h] (-t TEXT_FILE | -d DIRECTORY | -r RESET)
[-e EXTENSION] [-tr TRAIN] [-te TEST] [-va VALIDATE]
[-nv] [-ns] [-nh]
Named Arguments¶
-t, --text-file | Split text file into train/test/validate sets |
-d, --directory | Split directory into train/test/validate sets |
-r, --reset | Path to directory containing train/test/validate folders to reset |
-e, --extension | File extension of the files to process (default: *) Default: “*” |
-tr, --train | Percentage of files for training (default: 80%) Default: 80 |
-te, --test | Percentage of files for test (default: 10%) Default: 10 |
-va, --validate | Percentage of files for validate (default: 10%) Default: 10 |
-nv, --no-validate | Don’t produce a validation set when splitting Default: False |
-ns, --no-shuffle | Don’t randomise when splitting data Default: False |
-nh, --no-header | Use this flag when the text file has no headers Default: False |
API Reference¶
Here you can find documentation on all classes and their methods in Surround.
Assembler¶
class surround.assembler.Assembler(assembler_name='', config=None)¶
Class responsible for assembling and executing a Surround pipeline.
Responsibilities:
- Encapsulate the configuration data and pipeline stages
- Load configuration from a specified module
- Run the pipeline with input data in predict/batch/train mode
For more information on this process, see the About page.
Example:
assembler = Assembler("Example pipeline")
assembler.set_stages([PreFilter(), PredictStage(), PostFilter()])
assembler.init_assembler(batch_mode=False)
data = AssemblyState("some data")
assembler.run(data, is_training=False)
Batch-predict mode:
assembler.init_assembler(batch_mode=True)
assembler.run(data, is_training=False)
Training mode:
assembler.init_assembler(batch_mode=True)
assembler.run(data, is_training=True)
Predict/Estimate mode:
assembler.init_assembler(batch_mode=False)
assembler.run(data, is_training=False)
Constructor for an Assembler pipeline:
Parameters: - assembler_name (str) – The name of the pipeline
- config – Surround Config object
init_assembler()¶
Initializes the assembler and all of its stages.
Calls the surround.stage.Stage.initialise() method of all stages and the estimator.
Note
Should be called after surround.assembler.Assembler.set_config().
Returns: whether the initialisation was successful
Return type: bool
load_config(module)¶
Given a module contained in the root of the project, create an instance of surround.config.Config, loading configuration data from the config.yaml found in the project, and use this configuration for the pipeline.
Note
Should be called before surround.assembler.Assembler.init_assembler().
Parameters: module (str) – name of the module
run(state=None, mode=<RunMode.PREDICT: 2>)¶
Run the pipeline using the input data provided.
If the pipeline is run in training mode then, when it gets to the execution of the estimator, the surround.stage.Estimator.fit() method will be used instead.
If surround.enable_stage_output_dump is enabled in the Config instance then each stage and estimator's surround.stage.Stage.dump_output() method will be called.
This method doesn't return anything; instead, results should be stored in the state object passed in the parameters.
Parameters:
- state (surround.State) – Data passed between each stage in the pipeline
- mode (surround.runners.RunMode) – the mode to run the pipeline in (predict, batch-predict, or train)
set_config(config)¶
Set the configuration data to be used during pipeline execution.
Note
Should be called before surround.assembler.Assembler.init_assembler().
Parameters: config (surround.config.Config) – the configuration data
set_finaliser(finaliser)¶
Set the final stage that will be executed no matter how the pipeline runs. This will be executed even when the pipeline fails or throws an error.
Parameters: finaliser (surround.stage.Stage) – the final stage instance
set_stages(stages)¶
Set the stages to be executed one after the other in the pipeline.
Parameters: stages (list of surround.stage.Stage) – list of stages to execute
Config¶
class surround.config.Config(project_root=None, package_path=None, auto_load=False)¶
An iterable dictionary class that loads and stores all the configuration settings from both default and project YAML files and environment variables. Primarily used in stages to retrieve configuration data set for development/production.
Responsibilities:
- Parse the config.yaml file and store the data as key-value pairs.
- Allow environment variables to override data loaded from file/dict (must be prefixed with SURROUND_).
- Provide READ-ONLY access to the stored config values via the [] operator and iteration.
Example usage:
config = Config()
config.read_from_dict({ "debug": True })
config.read_config_files(["config.yaml"])

if config["debug"]:
    pass  # Do debug stuff

for key, value in config:
    pass  # Iterate over all data
You could then override the above configuration using the system's environment variables; just prefix the variable with SURROUND_ like so:
SURROUND_DEBUG=False
It also supports overriding nested configuration data, for example with the following config:
predict:
  debug: True
We can override the above with the following environment variable:
SURROUND_PREDICT_DEBUG=False
Constructor of the Config class; loads the default YAML file into storage. If the project_root is provided then the project's config.yaml file is also loaded into the configuration.
The default config file (defaults.yaml) can be found in the same directory as the config.py script. The project config file (config.yaml) can be found in the root of the project folder.
Parameters: - project_root (str) – path to the root directory of the surround project (default: None)
- package_path (str) – path to the root directory of the package that contains the surround project (default: None)
- auto_load (bool) – Attempt to load the config.yaml file from the Surround project in the current directory (default: False)
get_dict()¶
Returns the configuration data as a dictionary.
Returns: dictionary of the configuration data
Return type: dict
get_path(path)¶
Returns the value that can be found at the key path provided (useful for nested values).
For example:
config.get_path('surround.stages') == config['surround']['stages'] --> True
Parameters: path (str) – path to the value in storage
Returns: the value found at the path, or None if not found
Return type: any
State¶
class surround.State¶
Stores the data to be passed between each stage in a pipeline. Each stage is responsible for setting attributes on this class.
Formerly known as SurroundData.
Attributes:
- stage_metadata (list) - information that can be used to identify the stage
- execution_time (str) - how long it took to execute the entire pipeline
- errors (list) - list of error messages (stops the pipeline when appended to)
- warnings (list) - list of warning messages (displayed in console)
Example:
class AssemblyState(State):
    # Extra attributes must be defined before the pipeline is run!
    input_data = None
    output_data = None

    def __init__(self, input_data):
        self.input_data = input_data

class Predict(Estimator):
    # Do prediction here
    ...

pipeline = Assembler("Example").set_stages([Predict()])
pipeline.init_assembler()

data = AssemblyState("received data")
pipeline.run(data)

print(data.output_data)
Note
This class is frozen while the pipeline is being run. This means that an exception will be thrown if a new attribute is added during pipeline execution.
Stage¶
class surround.stage.Stage¶
Base class of all stages in a Surround pipeline.
See the following class for more information:
dump_output(state, config)¶
Dump the output of the stage after the stage has transformed the data.
Note
This is called by surround.assembler.Assembler.run() (when dumping output is requested).
Parameters:
- state (instance or child of the surround.State class) – Stores intermediate data from each stage in the pipeline
- config (surround.config.Config) – Config of the pipeline
initialise(config)¶
Initialise the stage; this may be loading a model or loading data.
Note
This is called by surround.assembler.Assembler.init_assembler().
Parameters: config (surround.config.Config) – Contains the settings for each stage
Estimator¶
class surround.stage.Estimator¶
Base class for an estimator in a Surround pipeline. Responsible for performing estimation or training using the input data.
This stage is executed by surround.assembler.Assembler.run().
Example:
class Predict(Estimator):
    def initialise(self, config):
        self.model = load_model(os.path.join(config["models_path"], "model.pb"))

    def estimate(self, state, config):
        state.output_data = run_model(self.model)

    def fit(self, state, config):
        state.output_data = train_model(self.model)
estimate(state, config)¶
Process input data and store estimated values.
Note
This method is ONLY called by surround.assembler.Assembler.run() when running in predict/batch-predict mode.
Parameters:
- state (instance or child of the surround.State class) – Stores intermediate data from each stage in the pipeline
- config (surround.config.Config) – Contains the settings for each stage
fit(state, config)¶
Train a model using the input data.
Note
This method is ONLY called by surround.assembler.Assembler.run() when running in training mode.
Parameters:
- state (instance or child of the surround.State class) – Stores intermediate data from each stage in the pipeline
- config (surround.config.Config) – Contains the settings for each stage
Runner¶
class surround.runners.Runner(assembler=None)¶
Base class for runners, which are responsible for:
- Initializing a surround.assembler.Assembler.
- Loading/preparing input data.
- Running the surround.assembler.Assembler.
Example batch runner:
class BatchRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblyState()

        if mode == RunMode.TRAIN:
            state.input_data = load_files('training_set')
        else:
            state.input_data = load_files('predict_set')

        return state
Note
You get a Batch Runner and Web Runner (if web requested) when you generate a project using the CLI tool.
Parameters: assembler (surround.assembler.Assembler) – The assembler the runner will execute
load_data(mode, config)¶
Load the data and prepare it to be fed into the surround.assembler.Assembler.
Parameters:
- mode (surround.runners.RunMode) – the mode the assembly was run in (batch, train, predict, web)
- config (surround.config.Config) – the configuration of the assembly
run(mode=<RunMode.PREDICT: 2>)¶
Prepare data and execute the surround.assembler.Assembler.
Parameters: mode (surround.runners.RunMode) – the mode to run the pipeline in (predict, batch-predict, train, or web)
set_assembler(assembler)¶
Set the Assembler instance the runner will execute.
Parameters: assembler (surround.assembler.Assembler) – the Assembler instance
Data Container¶
class surround.data.container.DataContainer(path=None, metadata_version='v0.1')¶
Represents a data container which holds both data and metadata.
Responsibilities:
- Import files into a container and export
- Load existing containers
- Extract files
Parameters:
- path (str) – path to an existing data container to load (default: None)
- metadata_version (str) – version of the metadata schema to use (default: v0.1)
export(export_to)¶
Import all staged files into the container, hash the contents, set the hash in the metadata, and import the metadata file.
Parameters: export_to (str) – path to export the file to
extract_all(extract_to)¶
Extract all files in the current data container to a path on disk.
Parameters: extract_to (str) – path to extract files to
Returns: true on success, false otherwise
Return type: bool
extract_file(internal_path, extract_path='.')¶
Extract a file in the current data container to a path on disk.
Parameters:
- internal_path (str) – path inside the container
- extract_path – path to extract the file to
Returns: true on success, false otherwise
Return type: bool
extract_file_bytes(path)¶
Extract the bytes of a file in the current data container.
Parameters: path (str) – path inside the container
Returns: the bytes extracted, or None if it doesn't exist
Return type: bytes
extract_files(internal_paths, extract_path='.')¶
Extract files in the current data container to a path on disk.
Parameters:
- internal_paths (list) – paths inside the container
- extract_path – path to extract the files to
Returns: true on success, false otherwise
Return type: bool
file_exists(path)¶
Checks whether a file exists in the current data container.
Returns: true if the file exists
Return type: bool
get_files()¶
Returns all the files in the current data container.
Returns: list of the files
Return type: list
import_directory(path, generate_metadata=True, reimport=True)¶
Stage the directory provided for importing when export is requested.
Parameters:
import_file(import_path, internal_path, generate_metadata=True)¶
Stage a file for importing when the next export operation is called.
Parameters:
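As a rough usage sketch based on the methods above (the file paths are illustrative, and exact behaviour may vary between Surround versions):
from surround.data.container import DataContainer

# Create a new container, stage a file and a directory, then export everything
container = DataContainer()
container.import_file("raw/notes.txt", "notes/notes.txt")
container.import_directory("raw/images")
container.export("my_dataset.data.zip")

# Later: load the container back and inspect/extract its contents
loaded = DataContainer("my_dataset.data.zip")
print(loaded.get_files())

if loaded.file_exists("notes/notes.txt"):
    loaded.extract_file("notes/notes.txt", extract_path="extracted")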
Metadata¶
class surround.data.metadata.Metadata(version='v0.1')¶
Represents the metadata of a Data Container.
Responsibilities:
- Create metadata, exporting to a YAML string and/or file
- Generate default metadata as per schema
- Automatically generate values to fields based on files given
- Get/set properties
Parameters: version (str) – the version of the schema to use (default: v0.1)

generate_default(version)¶
Generate a dictionary with all required fields created as per the schema.
Parameters: version (str) – which version of the schema to use Returns: the dictionary with default values Return type: dict
generate_from_directory(directory)¶
Automatically generate metadata from a directory, such as:
- Formats (mime types)
- Types (types from vocab)
- Group manifests (each root level directory is considered a group)
Parameters: directory (str) – path to the directory to generate from
generate_from_file(filepath)¶
Automatically generate metadata from a single file.
Parameters: filepath (str) – path to the file
generate_from_files(files, root, root_level_dirs)¶
Automatically generate metadata from a list of files, such as:
- Formats (mime types)
- Types (types from vocab)
- Group manifests (each root level directory is considered a group)
Parameters:
generate_manifest_for_group(group_name, files, formats=None)¶
Generate a manifest for a group of files, where the manifest contains:
- path
- description
- language
- formats (mime types)
- types (from vocab)
The manifest is stored in the metadata storage and also returned.
Parameters: Returns: the manifest created
Return type:
get_property(path)¶
Get the value of a property given a path in dot notation, e.g. summary.title.
For example, metadata.get_property('summary.title') would retrieve Test name from the following:
summary:
  title: Test name
Parameters: path (str) – path to the property using dot notation Returns: the value of the property, none otherwise Return type: any
load_from_path(path)¶
Load metadata from a file (YAML).
Parameters: path (str) – path to the YAML file
save_to_data()¶
Returns the metadata as a string formatted in YAML.
Returns: the data in YAML string Return type: str
save_to_json(indent=4)¶
Returns the metadata as a string formatted in JSON.
Parameters: indent (int) – number of spaces in indentations Returns: the data in JSON format Return type: str
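Similarly, a rough usage sketch of the Metadata class based on the methods above (the directory path is illustrative, and summary.title is the property from the get_property example):
from surround.data.metadata import Metadata

# Automatically generate metadata from a directory of files
metadata = Metadata()
metadata.generate_from_directory("raw/images")

# Read a nested property using dot notation
print(metadata.get_property("summary.title"))

# Export the metadata as JSON (or as YAML via save_to_data())
print(metadata.save_to_json(indent=2))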