Getting Started

Installation

Prerequisites

  • Python 3+ (Tested on 3.6.5)
  • Docker
  • Supports MacOS, Linux, and Windows

Install via Pip

Run the following command to install the latest version of Surround:

$ pip3 install surround

Note

If this doesn’t work, make sure you have pip installed. See here for instructions on how to install it.

Now the Surround library and command-line tool should be installed! To make sure, run the following command:

$ surround

If it works then you are ready for the Project Setup stage.

Project Setup

Before we can create our first pipeline, we need to generate an empty Surround project. Use the following command to generate a new project:

$ surround init -p test_project -d "Our first pipeline"

When it asks the following, respond with n (we’ll cover this in later sections):

Does it require a web runner? (y/n) n

This will create a new folder called test_project with the following file structure:

test_project
├── test_project/
│   ├── stages
│   │   ├── __init__.py
│   │   ├── input_validator.py
│   │   ├── baseline.py
│   │   └── assembler_state.py
│   ├── __main__.py
│   ├── __init__.py
│   ├── config.yaml
│   └── file_system_runner.py
├── input/
├── docs/
├── models/
├── notebooks/
├── output/
├── scripts/
├── spikes/
├── tests/
├── __main__.py
├── __init__.py
├── dodo.py
├── Dockerfile
├── requirements.txt
└── README.md

The generated project comes with an example pipeline that can be run straight away using the following commands:

$ cd test_project
$ surround run batchLocal

Which should output the following:

INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Validator InputValidator took 0:00:00 secs
INFO:surround.assembler:Estimator Baseline took 0:00:00 secs

Now you are ready for Creating your first pipeline.

See also

Not sure what a pipeline is? Check out our About section first!

Creating your first pipeline

For our first Surround pipeline, we are going to do some very basic data transformation: converting the input string to upper case. This pipeline is going to consist of two stages, InputValidator and MakeUpperCase.

Open the script stages/input_validator.py and you should see the following code already generated:

from surround import Validator

class InputValidator(Validator):
    def validate(self, state, config):
        if not state.input_data:
            raise ValueError("'input_data' is None")

As you can see, we are already given the InputValidator stage; we just need to edit the validate method to check that the input data is the correct type (str):

def validate(self, state, config):
    if not isinstance(state.input_data, str):
        # Raise an exception, this will stop the pipeline
        raise ValueError('Input is not a string!')

Now we need to create our MakeUpperCase stage, so head to stages/baseline.py, where you should see:

import logging
from surround import Estimator

LOGGER = logging.getLogger(__name__)

class Baseline(Estimator):
    def estimate(self, state, config):
        state.output_data = state.input_data

    def fit(self, state, config):
        LOGGER.info("TODO: Train your model here")

Make the following changes (keep the import and LOGGER lines at the top of the file as they are):

class MakeUpperCase(Estimator):
    def estimate(self, state, config):
        # Convert the input into upper case
        state.output_data = state.input_data.upper()

        # Print the output to the terminal (to check it's working)
        LOGGER.info("Output: %s", state.output_data)

    def fit(self, state, config):
        # Leave the fit method the same
        # We aren't doing any training in this guide
        LOGGER.info("TODO: Train your model here")

Since we renamed the estimator, we need to reflect that change when we create the Assembler.

First head to the stages/__init__.py file and rename Baseline to MakeUpperCase:

from .baseline import MakeUpperCase
from .input_validator import InputValidator
from .assembler_state import AssemblerState

Then in __main__.py where the estimator is imported make sure it looks like so:

from stages import MakeUpperCase, InputValidator

And where the assembler is created, make sure it looks like so:

assemblies = [
    Assembler("baseline")
        .set_stages([InputValidator(), MakeUpperCase()])
]
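
Putting those two edits together, the relevant part of __main__.py should now read roughly as follows (everything else in the generated file, such as the runner setup and the main entry point, stays as generated):

from surround import Assembler

from stages import MakeUpperCase, InputValidator

assemblies = [
    Assembler("baseline")
        .set_stages([InputValidator(), MakeUpperCase()])
]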

That’s it for the pipeline! To test the pipeline with the default input (the string "TODO: Load raw data here") just run the following command:

$ surround run batchLocal

The output should be the following:

INFO:surround.assembler:Starting 'baseline'
INFO:stages.baseline:Output: TODO: LOAD RAW DATA HERE
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs

To change what input is fed through the pipeline, modify file_system_runner.py and change what is assigned to state.input_data:

import logging
from surround import Runner
from stages import AssemblerState

logging.basicConfig(level=logging.INFO)

class FileSystemRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblerState()

        # Load data to be processed
        raw_data = "This daTa wiLL end UP captializED"

        # Setup input data
        state.input_data = raw_data

        return state
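
As a variation, you could load the raw text from a file in the project's input/ folder instead of hard-coding it. A minimal sketch of such a load_data (assuming a hypothetical file input/data.txt, with the path relative to the project root where surround run is executed):

import logging
from surround import Runner
from stages import AssemblerState

logging.basicConfig(level=logging.INFO)

class FileSystemRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblerState()

        # Read the raw text from input/data.txt (hypothetical file name)
        with open("input/data.txt", "r") as input_file:
            state.input_data = input_file.read()

        return state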

Note

To test training mode (the estimator’s fit method will be called instead of estimate), run the following command: $ surround run trainLocal

Running your first pipeline in a container

First you must build an image for your container. To do this just run the following command:

$ surround run build

Then to run the container in dev mode just use the following command:

$ surround run dev

This will run the container, linking the folder test_project/test_project with the working directory in the container. So during development, when you make small changes there is no need to rebuild the image; just run this command again.

Then when you are ready for production you can use the following command:

$ surround run prod

Which will first build the image and then run the container without any linking to the host machine. The image created in the build can also then be committed to a Docker Hub repository and shared.
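
For example, assuming the built image appears in docker images and you have a Docker Hub account (both <image> and <your-user> below are placeholders), pushing it could look something like:

$ docker login
$ docker tag <image> <your-user>/test_project:0.0.1
$ docker push <your-user>/test_project:0.0.1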

Note

Both dev and prod will use the default mode of the project, which is RunMode.BATCH_PREDICT for non-web projects and RunMode.WEB otherwise.

The following commands will force which mode to use:

$ surround run batch
$ surround run train

Note

To see a list of available tasks, just run the command $ surround run

Serving your first pipeline via Web Endpoint

When generating a project, you get asked:

Does it require a web runner? (y/n)

If we say yes to this, then Surround will still generate the generic file_system_runner.py, but it will also generate a new script called web_runner.py.

This script contains a new Runner that uses Tornado to host a web server, making your pipeline accessible via HTTP requests. By default the WebRunner will host two endpoints:

  • /info - access via GET request, will return {'version': '0.0.1'}

  • /estimate - access via POST request, body must have a JSON document containing input data:

    {
        "message": "this text will be processed"
    }
    

So let’s create a new pipeline that does the same data processing as the one in Creating your first pipeline, but this time we will send strings via a web endpoint and get the results back in the response to the request.

First generate a new project, this time answering yes to the web runner prompt, then make all the changes we did in Creating your first pipeline and test that it still works locally.

Next we are going to build an image for our pipeline using the command:

$ surround run build

Then we are going to run our default server using the command:

$ surround run web

You should get output like so:

INFO:root:Server started at http://localhost:8080

Note

If you would like to run it on the host machine instead of in a container, you must install Tornado using this command: $ pip3 install tornado==6.0.2

Now hopefully if you load http://localhost:8080/info in your preferred browser, you should see the following:

{"version": "0.0.1"}

Note

If you are running this on Windows and don’t see the above, try using http://192.168.99.100:8080/info instead.

Next we are going to test the /estimate endpoint by using the following command in another terminal:

On Linux/MacOS:

$ curl -d "{ \"message\": \"test phrase\" }" http://localhost:8080/estimate

On Windows (in Powershell):

$ Invoke-WebRequest http://192.168.99.100:8080/estimate -Method POST -Body "{ ""message"": ""test phrase"" }"

You should see the following output in the terminal running the pipeline:

INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
INFO:root:Message: TEST PHRASE
INFO:tornado.access:200 POST /estimate (::1) 1.95ms

So our data is successfully being processed! But what if we need the result?

Head to the script web_runner.py and append the following to the post method of EstimateHandler:

# Return the result of the processing
self.write({"output": self.data.output_data})

Restart the web server, run the same request command as before, and you should see the following output:

On Linux/MacOS:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100    53  100    25  100    28    806    903 --:--:-- --:--:-- --:--:--  1709
{"output": "TEST PHRASE"}

On Windows (in Powershell):

StatusCode        : 200
StatusDescription : OK
Content           : {"output": "TEST PHRASE"}
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 25
                    Content-Type: application/json; charset=UTF-8
                    Date: Mon, 17 Jun 2019 06:43:54 GMT
                    Server: TornadoServer/6.0.2

                    {"output": "TEST PHRASE"}
Forms             : {}
Headers           : {[Content-Length, 25], [Content-Type, application/json; charset=UTF-8], [Date, Mon, 17 Jun 2019 06:43:54 GMT], [Server, TornadoServer/6.0.2]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 25

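If you would rather call the endpoint from Python than from curl or PowerShell, a minimal client could look like this (standard library only; the URL and JSON shape match the /estimate endpoint described above):

import json
import urllib.request

# Body expected by the /estimate endpoint
body = json.dumps({"message": "test phrase"}).encode("utf-8")

request = urllib.request.Request(
    "http://localhost:8080/estimate",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    result = json.loads(response.read().decode("utf-8"))

# Prints the processed text, e.g. TEST PHRASE
print(result["output"])
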
That’s it, you are now serving a Surround pipeline! You could now use this pipeline from virtually any application.

Note

Since this project was generated with a web runner, the default mode is web. To run the pipeline using the FileSystemRunner instead, use the command $ surround run batch or $ surround run train.