Getting Started¶
Installation¶
Prerequisites¶
- Python 3+ (Tested on 3.6.5)
- Docker
- Supports MacOS, Linux, and Windows
Install via Pip¶
Run the following command to install the latest version of Surround:
$ pip3 install surround
Note
If this doesn’t work, make sure you have pip installed. See here on how to install it.
Now the Surround library and command-line tool should be installed. To check, run the following command:
$ surround
If it works then you are ready for the Project Setup stage.
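If you prefer to check the installation from a script, the following is a minimal sketch that only looks for the `surround` executable on your PATH using the standard library. The helper names `find_cli` and `describe_install` are hypothetical and are not part of Surround.

```python
import shutil


def find_cli(name="surround"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def describe_install(name="surround"):
    """Build a short, human-readable status message for the CLI."""
    path = find_cli(name)
    if path is None:
        return f"'{name}' was not found on PATH; check the pip install step."
    return f"'{name}' found at {path}"


print(describe_install())
```

This only confirms that the command can be located; running `$ surround` as described above is still the real test.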
Project Setup¶
Before we can create our first pipeline, we need to generate an empty Surround project. Use the following command to generate a new project:
$ surround init -p test_project -d "Our first pipeline"
When it asks the following, respond with n (we’ll cover this in later sections):
Does it require a web runner? (y/n) n
This will create a new folder called test_project with the following file structure:
test_project
├── test_project/
│ ├── stages
│ │ ├── __init__.py
│ │ ├── input_validator.py
│ │ ├── baseline.py
│ │ └── assembler_state.py
│ ├── __main__.py
│ ├── __init__.py
│ ├── config.yaml
│ └── file_system_runner.py
├── input/
├── docs/
├── models/
├── notebooks/
├── output/
├── scripts/
├── spikes/
├── tests/
├── __main__.py
├── __init__.py
├── dodo.py
├── Dockerfile
├── requirements.txt
└── README.md
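As a quick sanity check, a short script can confirm that the generated folder matches the tree above. This is a sketch under assumptions: `EXPECTED_PATHS` lists only a handful of entries copied from the tree, and `missing_paths` is a hypothetical helper, not a Surround feature.

```python
from pathlib import Path

# A subset of the paths shown in the tree above, relative to the project root.
EXPECTED_PATHS = [
    "test_project/stages/__init__.py",
    "test_project/stages/input_validator.py",
    "test_project/stages/baseline.py",
    "test_project/__main__.py",
    "test_project/config.yaml",
    "dodo.py",
    "Dockerfile",
    "requirements.txt",
]


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Run from the directory that contains the generated project folder.
print(missing_paths("test_project"))
```

An empty list means every listed file was found; anything else names the files that are missing.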
The generated project comes with an example pipeline that can be run straight away using the commands:
$ cd test_project
$ surround run batchLocal
Which should output the following:
INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Validator InputValidator took 0:00:00 secs
INFO:surround.assembler:Estimator Baseline took 0:00:00 secs
Now you are ready for Creating your first pipeline.
See also
Not sure what a pipeline is? Check out our About section first!
Creating your first pipeline¶
For our first Surround pipeline, we are going to do some very basic data transformation and convert the input string from lower case to upper case. This pipeline is going to consist of two stages, InputValidator and MakeUpperCase.
Open the script stages/input_validator.py and you should see the following code already generated:
from surround import Validator

class InputValidator(Validator):
    def validate(self, state, config):
        if not state.input_data:
            raise ValueError("'input_data' is None")
As you can see, we are already given the InputValidator stage. We just need to edit the validate method to check that the input data is the correct data type (str):
def validate(self, state, config):
    if not isinstance(state.input_data, str):
        # Raise an exception, this will stop the pipeline
        raise ValueError('Input is not a string!')
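To see how this check behaves without running the whole pipeline, here is a minimal standalone sketch. It assumes nothing about Surround itself: `SimpleNamespace` stands in for the generated state object, and the check is rewritten as a plain function.

```python
from types import SimpleNamespace


def validate(state):
    """Same type check as the stage above, written as a plain function."""
    if not isinstance(state.input_data, str):
        raise ValueError("Input is not a string!")


# SimpleNamespace is a stand-in for the generated state class (an assumption).
good = SimpleNamespace(input_data="hello")
bad = SimpleNamespace(input_data=42)

validate(good)  # a string passes silently

try:
    validate(bad)
except ValueError as error:
    print(f"Rejected: {error}")
```

Note that, unlike the generated check, `isinstance` also accepts an empty string; whether that is desirable depends on your data.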
Now we need to create our MakeUpperCase stage, so head to stages/baseline.py, where you should see:
import logging
from surround import Estimator

LOGGER = logging.getLogger(__name__)

class Baseline(Estimator):
    def estimate(self, state, config):
        state.output_data = state.input_data

    def fit(self, state, config):
        LOGGER.info("TODO: Train your model here")
Make the following changes:
class MakeUpperCase(Estimator):
    def estimate(self, state, config):
        # Convert the input into upper case
        state.output_data = state.input_data.upper()

        # Print the output to the terminal (to check it's working)
        LOGGER.info("Output: %s", state.output_data)

    def fit(self, state, config):
        # Leave the fit method the same
        # We aren't doing any training in this guide
        LOGGER.info("TODO: Train your model here")
Since we renamed the estimator, we need to reflect that change when we create the Assembler.
First head to the stages/__init__.py file and rename Baseline to MakeUpperCase:
from .baseline import MakeUpperCase
from .input_validator import InputValidator
from .assembler_state import AssemblerState
Then in __main__.py, where the estimator is imported, make sure it looks like so:
from stages import MakeUpperCase, InputValidator
And where the assembler is created, make sure it looks like so:
assemblies = [
    Assembler("baseline")
        .set_stages([InputValidator(), MakeUpperCase()])
]
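To make the flow of data through the two stages concrete, here is a minimal sketch that does not use Surround at all. The two classes are plain stand-ins with the same method names as the stages above, and `run_stages` is a hypothetical substitute for the assembler that simply calls them in order; the real Assembler may do considerably more.

```python
from types import SimpleNamespace


class InputValidator:
    """Stand-in validator: rejects anything that is not a string."""

    def validate(self, state, config):
        if not isinstance(state.input_data, str):
            raise ValueError("Input is not a string!")


class MakeUpperCase:
    """Stand-in estimator: writes the upper-cased input to output_data."""

    def estimate(self, state, config):
        state.output_data = state.input_data.upper()


def run_stages(state, config=None):
    """Hypothetical stand-in for the assembler: validate, then estimate."""
    InputValidator().validate(state, config)
    MakeUpperCase().estimate(state, config)
    return state


state = run_stages(
    SimpleNamespace(input_data="TODO: Load raw data here", output_data=None)
)
print(state.output_data)
```

Because the validator runs first, a non-string input raises before the estimator is ever called.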
That’s it for the pipeline!
To test the pipeline with the default input (the string "TODO: Load raw data here"), just run the following command:
$ surround run batchLocal
The output should be the following:
INFO:surround.assembler:Starting 'baseline'
INFO:stages.baseline:Output: TODO: LOAD RAW DATA HERE
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
To change what input is fed through the pipeline, modify file_system_runner.py and change what is assigned to state.input_data:
import logging
from surround import Runner
from stages import AssemblerState

logging.basicConfig(level=logging.INFO)

class FileSystemRunner(Runner):
    def load_data(self, mode, config):
        state = AssemblerState()

        # Load data to be processed
        raw_data = "This daTa wiLL end UP capitalizED"

        # Setup input data
        state.input_data = raw_data

        return state
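Instead of hard-coding the string, the raw data could be read from a text file. The following is only a sketch: `read_raw_text` is a hypothetical helper, and the file name `input/message.txt` is an assumption chosen to match the generated input/ folder.

```python
from pathlib import Path


def read_raw_text(path, default="TODO: Load raw data here"):
    """Return the stripped contents of `path`, or `default` if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        return default
    return file_path.read_text(encoding="utf-8").strip()


# Hypothetical location; falls back to the default string if the file is absent.
raw_data = read_raw_text("input/message.txt")
print(raw_data)
```

Inside load_data, the returned string would then be assigned to state.input_data in place of the literal.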
Note
To test training mode (the estimator's fit method will be called instead), run the following command:
$ surround run trainLocal
Running your first pipeline in a container¶
First you must build an image for your container. To do this just run the following command:
$ surround run build
Then to run the container in dev mode just use the following command:
$ surround run dev
This will run the container, linking the folder test_project/test_project to the working directory in the container. During development, when you make small changes, there is no need to rebuild the image; just run this command again.
Then when you are ready for production you can use the following command:
$ surround run prod
Which will first build the image and then run the container without any linking to the host machine. The image created in the build can also then be committed to a Docker Hub repository and shared.
Note
Both dev and prod will use the default mode of the project, which in non-web projects is RunMode.BATCH_PREDICT, otherwise it’s RunMode.WEB.
The following commands will force which mode to use:
$ surround run batch
$ surround run train
Note
To see a list of available tasks, just run the command $ surround run
Serving your first pipeline via Web Endpoint¶
When generating a project, you get asked:
Does it require a web runner? (y/n)
If we answer y, Surround will still generate the generic file_system_runner.py, but it will also generate a new script called web_runner.py.
This script contains a new Runner which uses Tornado to host a web server, making your pipeline accessible via HTTP requests. By default the WebRunner will host two endpoints:
- /info: accessed via GET request; returns {'version': '0.0.1'}
- /estimate: accessed via POST request; the body must be a JSON document containing the input data: { "message": "this text will be processed" }
So let's create a new pipeline that does the same data processing as the one in Creating your first pipeline, but this time we will send strings via the web endpoint and get the results back in the response.
First generate a new project, this time answering y to the web runner prompt. Then make all the changes we made in Creating your first pipeline and check that it still works locally.
Next we are going to build an image for our pipeline using the command:
$ surround run build
Then we are going to run our default server using the command:
$ surround run web
You should get output like so:
INFO:root:Server started at http://localhost:8080
Note
If you would like to run it on the host machine instead of in a container, you must install Tornado using
this command: $ pip3 install tornado==6.0.2
Now, if you load http://localhost:8080/info in your preferred browser, you should see the following:
{"version": "0.0.1"}
Note
If you are running this on Windows and don’t see the above, try using http://192.168.99.100:8080/info instead.
Next we are going to test the /estimate endpoint by using the following command in another terminal:
On Linux/MacOS:
$ curl -d "{ \"message\": \"test phrase\" }" http://localhost:8080/estimate
On Windows (in Powershell):
$ Invoke-WebRequest http://192.168.99.100:8080/estimate -Method POST -Body "{ ""message"": ""test phrase"" }"
You should see the following output in the terminal running the pipeline:
INFO:surround.assembler:Starting 'baseline'
INFO:surround.assembler:Estimator MakeUpperCase took 0:00:00 secs
INFO:root:Message: TEST PHRASE
INFO:tornado.access:200 POST /estimate (::1) 1.95ms
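The same POST request can also be built from Python using only the standard library. This is a sketch under assumptions: the base URL is the default shown above, the helper names are hypothetical, and the `Content-Type` header is an addition that the curl command above does not send.

```python
import json
import urllib.request


def build_estimate_request(message, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for the /estimate endpoint."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/estimate",
        data=body,
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_estimate_request("test phrase")
print(request.get_method(), request.full_url)
# Calling send(request) requires the web server from this section to be running.
```

On Windows, pass `base_url="http://192.168.99.100:8080"` if localhost does not respond.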
So our data is successfully being processed! But what if we need the result?
Head to the script web_runner.py and append the following to the post method of EstimateHandler:
# Return the result of the processing
self.write({"output": self.data.output_data})
Restart the web server and send the same request as before. You should see the following output:
On Linux/MacOS:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    53  100    25  100    28    806    903 --:--:-- --:--:-- --:--:--  1709
{"output": "TEST PHRASE"}
On Windows (in Powershell):
StatusCode : 200
StatusDescription : OK
Content : {"output": "TEST PHRASE"}
RawContent : HTTP/1.1 200 OK
Content-Length: 25
Content-Type: application/json; charset=UTF-8
Date: Mon, 17 Jun 2019 06:43:54 GMT
Server: TornadoServer/6.0.2
{"output": "TEST PHRASE"}
Forms : {}
Headers : {[Content-Length, 25], [Content-Type, application/json; charset=UTF-8], [Date, Mon, 17 Jun 2019 06:43:54 GMT], [Server, TornadoServer/6.0.2]}
Images : {}
InputFields : {}
Links : {}
ParsedHtml : mshtml.HTMLDocumentClass
RawContentLength : 25
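In an application you would normally parse the response body rather than read it from the terminal. A minimal sketch, assuming the body has exactly the shape shown above (`extract_output` is a hypothetical helper):

```python
import json


def extract_output(body):
    """Return the "output" field from the JSON body of an /estimate response."""
    return json.loads(body)["output"]


print(extract_output('{"output": "TEST PHRASE"}'))  # prints: TEST PHRASE
```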
That's it, you are now serving a Surround pipeline! You can now call this pipeline from virtually any application that can send HTTP requests.
Note
Since this project was generated with a web runner, the default mode is web. To run the pipeline using the FileSystemRunner instead, use the command $ surround run batch or $ surround run train.