.. _about:
About
=====
What is Surround?
*************************
Surround is an open-source framework developed by the `Applied Artificial Intelligence Institute `_ (`A`:superscript:`2`\ `I`:superscript:`2`) to take machine
learning solutions through from exploration all the way to production. For this reason, it is developed with both
research engineers and software developers in mind. Surround is designed to play nicely with existing machine learning frameworks
(TensorFlow, MXNet, PyTorch, etc.) and cloud services (Google Cloud AI, SageMaker, Rekognition, etc.), so engineers have the freedom
to use whatever is necessary to solve their problem.
A Philosophy
^^^^^^^^^^^^
Surround isn't just a framework; it's also a philosophy. From the moment data lands on our desk, we need to be thinking about the
final use case for the solutions we are developing. Surround was built to reduce the amount of time between data exploration and a
containerised proof-of-concept web application that is ready to be deployed, and to resolve some competing requirements of researchers
and engineers: in general, researchers want to dive into the data and leave code quality until later, while engineers prefer
well-structured code from the start. We attempt to solve this problem with Surround by introducing a "production first" mindset and
providing conventions for researchers (for example, a separate folder for data exploration scripts).
Long ago, web frameworks realised that there is a set of concerns that almost all web applications must deal with, such as connecting
to databases, managing configuration, rendering static and dynamic content, and handling security. Machine learning projects
have similar concerns, but also have their own set of special concerns, such as:
- Experimentation is a first class citizen
- Data and models need to be versioned and managed
- Model performance needs to be visualized
- Training infrastructure is required
- Etc.
Surround strives to provide a **single place** for every concern that arises when building an ML project. Ideally there will be a single
solution to any concern that occurs to either the research engineer or the software developer. But to be the **single place** for ML projects,
we are going to have to support as many existing frameworks, libraries and APIs as we can. This is reflected in the design of Surround,
where the core framework could be used to:
- Build a solution based on cloud APIs
- Build a custom Docker image for SageMaker
- Form part of a batch process running on an internal Kubernetes cluster
By **playing nice with others** we hope the core Surround framework can continue to be used as the ML ecosystem evolves.
A set of conventions
^^^^^^^^^^^^^^^^^^^^
Surround attempts to enforce a set of conventions that help researchers keep their solutions structured for software developers. It also
implements solutions for common ML project concepts, such as managing configuration, so that researchers don't have to.
These conventions are upheld through a project generator and a project linter that checks for the core conventions.
For example during project generation, the following structure is used::

    package name
    ├── Dockerfile
    ├── README.md
    ├── data
    ├── package name
    │   ├── stages
    │   │   ├── __init__.py
    │   │   ├── input_validator.py
    │   │   ├── baseline.py
    │   │   └── assembler_state.py
    │   ├── __init__.py
    │   ├── __main__.py
    │   ├── web_runner.py
    │   ├── file_system_runner.py
    │   └── config.yaml
    ├── docs
    ├── dodo.py
    ├── models
    ├── notebooks
    ├── output
    ├── requirements.txt
    ├── scripts
    ├── spikes
    └── tests
Every Surround project has the following characteristics:
- ``Dockerfile`` for bundling up the project as a Docker container.
- ``dodo.py`` file containing useful tasks such as train, batch predict and test for a project.
- Tests for catching training-serving skew.
- A single entry point for running the application, ``__main__.py``.
- A place for data exploration with Jupyter notebooks and miscellaneous scripts.
- A single place for output files, data, and model storage.
A command line tool
^^^^^^^^^^^^^^^^^^^
Surround also comes with a command line tool (CLI) which can perform a variety of tasks such as project generation and running
the project in Docker. The tools included are shown below:
- ``init`` - Used to generate a new Surround project.
- ``lint`` - Used to run the Surround Linter which checks if Surround conventions are being used correctly.
- ``run`` - Used to run a task defined in ``dodo.py``.
The ``run`` command is essentially a wrapper around the ``doit`` library, while the Surround Linter performs multiple checks
on the current project to see if it is following standard conventions. The intention is for the Surround Linter to become more
of an assistant when building ML projects. These tools are automatically added to your environment path so they can be used anywhere
in your preferred terminal application.
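As a rough illustration of the kind of file the ``run`` command works with: ``doit`` looks in ``dodo.py`` for functions whose names start with ``task_`` and which return a dictionary describing the task. The sketch below is not the file Surround generates. The package name ``my_package`` and the ``--mode`` flag are hypothetical and only stand in for whatever command a project actually uses.

```python
# dodo.py (illustrative sketch only; not the file Surround generates)
PACKAGE = "my_package"  # hypothetical package name


def task_train():
    """Train the pipeline."""
    return {
        # Hypothetical command line; the real entry point may take other flags.
        "actions": [f"python3 -m {PACKAGE} --mode train"],
        "verbosity": 2,  # show the command's output as it runs
    }


def task_batch():
    """Run the pipeline in batch-predict mode."""
    return {
        "actions": [f"python3 -m {PACKAGE} --mode batch"],
        "verbosity": 2,
    }
```

With ``doit`` itself, running ``doit list`` in the same folder prints the tasks defined in such a file, using each function's docstring as the task description.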
A Python library
^^^^^^^^^^^^^^^^
The last component of Surround is the Python library. We developed the Python library to provide a flexible way of running an ML
pipeline in a variety of situations, whether that be from a queue, an HTTP endpoint or a file system. We found that during
development the research engineer often needed to run results from a file, something that is not always needed in a production
environment. Surround's Python library was designed to leverage the conventions outlined above to provide the maximum productivity
boost to research engineers, provided the conventions are followed. Surround also provides wrappers around libraries such as
the Tornado web server to provide advanced functionality. These third-party dependencies are not installed by default and need
to be added to the project before Surround will make the wrappers available.
How does Surround work at its core?
***********************************
There are four main concepts that you need to understand when using Surround:
- :ref:`assembler`
- :ref:`stages`
- :ref:`configuration`
- :ref:`data`
The most **important** are the **first two**, since they make up the actual pipeline that is responsible for taking in data and spitting
out a prediction based on that input.
.. _assembler:
Assembler
^^^^^^^^^
.. image:: pipeline_flow_diagram.png
   :alt: Assembler flow diagram
   :align: center
The Assembler is responsible for constructing and executing a pipeline on data. How the pipeline is constructed (and where/how data is loaded) depends on which
execution mode is being used. The above diagram describes a simple Surround pipeline showing three different modes of
execution. These modes are described below.
Training
########
.. image:: train_diagram.png
   :alt: Training flow diagram
   :align: center
This mode is primarily built for **training**. Training data is loaded from disk (usually in bulk) and then fed through the pipeline
with the estimator set to ``fit`` mode. Once training of the pipeline is complete, the data is fed to a visualiser, which
helps display useful information about the training operation.
Batch-predict
#############
.. image:: batch_diagram.png
   :alt: Batch-predict flow diagram
   :align: center
This mode is primarily built for **evaluation**. Data is loaded from disk (also usually in bulk) and then fed through the pipeline with
the estimator set to ``estimate`` mode. Once processing is complete, the data is fed to a visualiser, which
helps summarise and visualise the overall results and performance.
Web / Predict
#############
.. image:: predict_diagram.png
   :alt: Web / Predict flow diagram
   :align: center
This mode is built for **production**. Once your pipeline is set up, training has been completed, and evaluation shows that the model
performs well and is ready for use, this mode is used to serve your pipeline. Depending on the type of project you generated
initially, the input data may come from your local disk or from the body of an HTTP POST request, and the result may be
saved locally or returned to the client that sent the request.
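Putting the three modes side by side, the difference comes down to which estimator method is called and whether a visualiser runs afterwards. The following is a minimal sketch of that idea in plain Python, not Surround's implementation. ``MiniAssembler`` and ``ThresholdEstimator`` are hypothetical names, and the choice to skip the visualiser in predict mode is an assumption read off the mode descriptions.

```python
class ThresholdEstimator:
    """Toy estimator: learns the mean of the training values as a threshold."""

    def __init__(self):
        self.threshold = 0.0

    def fit(self, state):
        values = state["input"]
        self.threshold = sum(values) / len(values)

    def estimate(self, state):
        state["output"] = [value > self.threshold for value in state["input"]]


class MiniAssembler:
    """Hypothetical stand-in for an assembler; not Surround's class."""

    def __init__(self, estimator, visualiser=None):
        self.estimator = estimator
        self.visualiser = visualiser

    def run(self, state, mode="predict"):
        if mode == "train":
            self.estimator.fit(state)
        elif mode in ("batch-predict", "predict"):
            self.estimator.estimate(state)
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Assumption: only the two bulk modes report to a visualiser.
        if mode != "predict" and self.visualiser is not None:
            self.visualiser(state, mode)
        return state


reports = []  # the "visualiser" here just records which mode it was called for
assembler = MiniAssembler(ThresholdEstimator(),
                          visualiser=lambda state, mode: reports.append(mode))
assembler.run({"input": [1.0, 2.0, 3.0]}, mode="train")
result = assembler.run({"input": [1.5, 2.5]}, mode="predict")
```

After training on ``[1.0, 2.0, 3.0]`` the learned threshold is the mean, ``2.0``, so the predict run marks ``1.5`` as ``False`` and ``2.5`` as ``True``.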
.. _stages:
Stages
^^^^^^
A stage, at its base, can do three things:
- **Initialize** anything needed to complete its function. This may include loading a TensorFlow graph or loading configuration data.
- **Perform** its intended operation, whether that be feeding data through a model or checking if the data is correct.
- **Dump** output from the operation to the console (if requested, used for debugging).
During processing, two objects are passed from stage to stage:
- The :ref:`data` object, which contains the input data, has a field for errors (execution stops when an error is added) and holds the output of each stage (if any).
- The :ref:`configuration` object, which contains all the settings loaded in from YAML files plus paths to folders in the project such as ``input/`` and ``output/``.
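The three responsibilities of a stage, and the rule that an error stops execution, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the class, the method names ``initialise``, ``operate`` and ``dump_output``, the ``run_stages`` helper, and the use of plain dictionaries for the state and configuration are all chosen for the example and may not match Surround's own interfaces.

```python
class GreetingStage:
    """Hypothetical stage with the three responsibilities described above."""

    def initialise(self, config):
        # Load whatever the stage needs once, before any data arrives.
        self.greeting = config.get("greeting", "Hello")

    def operate(self, state, config):
        # Perform the stage's operation, recording an error on bad input.
        name = state["input"]
        if not isinstance(name, str) or not name:
            state["errors"].append("input must be a non-empty string")
            return
        state["output"] = f"{self.greeting}, {name}!"

    def dump_output(self, state, config):
        # Print the stage's output to the console for debugging.
        print(f"GreetingStage output: {state['output']!r}")


def run_stages(stages, state, config, dump=False):
    """Initialise every stage, then run them in order until one reports an error."""
    for stage in stages:
        stage.initialise(config)
    for stage in stages:
        stage.operate(state, config)
        if state["errors"]:
            break
        if dump:
            stage.dump_output(state, config)
    return state


state = run_stages(
    [GreetingStage()],
    {"input": "Ada", "output": None, "errors": []},
    {"greeting": "Hi"},
)
```

Passing ``dump=True`` would additionally print each stage's output, which mirrors the "if requested" nature of the dump step.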
.. _validators:
Validators
##########
Validators are stages that are responsible for checking whether the input data that is about to be fed through the pipeline is valid,
that is, whether the data is in the correct format and whether there is any detectable reason why it would cause issues while
being processed. Validators are positioned first in the execution of the pipeline. They are not intended to create any output,
only errors or warnings.
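A validator along these lines might look like the sketch below. It is a hypothetical example, not part of Surround: the function name, the dictionary-based state with ``errors`` and ``warnings`` lists, and the ``max_chars`` limit are all assumptions made for illustration.

```python
def validate_text_input(state, max_chars=10_000):
    """Check the input and record problems; never write any output."""
    data = state["input"]
    if not isinstance(data, str):
        # Wrong format: later stages cannot work with this at all.
        state["errors"].append(f"expected str, got {type(data).__name__}")
        return
    if not data.strip():
        state["errors"].append("input is empty")
    elif len(data) > max_chars:
        # Suspicious but still processable, so only warn.
        state["warnings"].append(f"input is longer than {max_chars} characters")
```

Keeping errors and warnings in separate lists lets the pipeline stop on the former while still reporting the latter.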
.. _filters:
Filters
#######
Filters are stages that are responsible for getting data ready for the next stage of execution. These are typically placed before
or after :ref:`estimators`. There are generally two types of filters: :ref:`wranglers` and :ref:`deciders`.
.. _wranglers:
Wranglers (Pre-filters)
-----------------------
Wranglers perform data wrangling operations, that is, getting the data from one format into another that is useful
for the next stage (typically an Estimator). For example, the input data might be a :class:`str` formatted as JSON, but the estimator
next in the pipeline might only accept a Python :class:`dict`, so a Wrangler would be used to parse the :class:`str` into a :class:`dict`.
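The JSON example can be sketched with the standard library's ``json`` module. The function name and the dictionary-based state are hypothetical and only serve the illustration; a real Surround wrangler would work with the project's own state object.

```python
import json


def parse_json_input(state):
    """Replace a JSON-formatted str input with the dict it encodes."""
    try:
        parsed = json.loads(state["input"])
    except (TypeError, json.JSONDecodeError) as exc:
        state["errors"].append(f"could not parse input as JSON: {exc}")
        return
    if not isinstance(parsed, dict):
        # Valid JSON, but not the shape the next stage expects.
        state["errors"].append("expected a JSON object at the top level")
        return
    state["input"] = parsed
```

Recording a failure in ``errors`` rather than raising keeps the wrangler consistent with the idea that errors on the state object stop the pipeline.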
.. _deciders:
Deciders (Post-filters)
-----------------------
Deciders, placed after :ref:`estimators`, are stages which make decisions based on the estimators' output. For example, in a Voice Activity
Detection pipeline, we may have an estimator that outputs confidence values on whether the input audio data was speech or not. You would
then place a Decider after it, which may perform thresholding on the confidence values.
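A thresholding decider for the voice activity example could be as small as the sketch below. The names are hypothetical: ``speechThreshold`` is an invented configuration key, and the state is assumed to be a dictionary holding the estimator's confidence values under ``confidences``.

```python
def decide_speech(state, config):
    """Turn per-frame speech confidences into True/False decisions."""
    threshold = config.get("speechThreshold", 0.5)  # hypothetical setting
    state["output"] = [
        confidence >= threshold for confidence in state["confidences"]
    ]
```

Reading the threshold from the configuration object means it can be tuned without touching the stage's code.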
.. _estimators:
Estimators
##########
Estimators are stages where the actual prediction or training of an ML model takes place. Depending on the pipeline configuration,
the estimator will either use the input data to make a prediction or use the input data as training data. This stage should have
some form of output, and it is typically placed between two :ref:`filters` during execution. For example, if you are using TensorFlow to run your model,
you would create an estimator that loads the model and creates a TensorFlow session during initialization, and the session
would then be run with the input data during execution of the stage.
In more complex pipelines, these stages may be composed of an entirely separate Surround pipeline (another Assembler instance). Surround is designed this way
to allow pipelines to be as complex as required.
.. _visualisers:
Visualisers
###########
Visualisers are stages that do what their name entails: visualise the data. Typically used during training and evaluation
of the model, these stages generate reports on how the model is performing. For example, in a Facial Detection pipeline,
during evaluation of the model the visualiser may display an example image it processed and render boxes around the faces it detected.
.. _configuration:
Configuration
^^^^^^^^^^^^^
Every instance of :ref:`assembler` has a configuration object constructed from the project's configuration file. This
configuration object is passed between each stage of the pipeline during initialization and execution. The configuration
file uses the `YAML `_ data-serialization language.
Example configuration file::

    pathToModels: ../models
    model: hog # 'hog' or 'cnn'
    minFaceWidth: 100 # Threshold for the width of a face bounding box in pixels
    minFaceHeight: 125 # Threshold for the height of a face bounding box in pixels
    useAllFaces: true # If false, only extract encodings for the largest face
    imageTooDark: 23 # Threshold for determining if an image is too dark, lower values = darker image
    blurryThreshold: 4 # Smaller values indicate a "more" blurry image
    gpuDynamicMemoryAllocation: true # If true, Tensorflow will allocate GPU memory on an as-needs basis. perProcessGpuMemoryFraction will have no effect.
    perProcessGpuMemoryFraction: 0.5 # Fraction of GPU memory Tensorflow should acquire. Has no effect if gpuDynamicMemoryAllocation is true.
    rotateImageModelFile: image-rotator/image-rotator-2018-04-05.pb # Model used to detect the orientation of the image
    rotateImageModelLabels: image-rotator/labels.txt # Model used to detect the orientation of the image
    rotateImageInputLayer: conv2d_1_input # Tensorflow input layer
    rotateImageOutputLayer: activation_5/Softmax # Tensorflow output layer
    rotateImageInputHeight: 100 # Input image height to the image stage neural network
    rotateImageInputWidth: 100 # Input image width to the image stage neural network
    rotateImageThreshold: 0.5 # Rotate image if the orientation is above this threshold
    rotateImageSkip: false # Option to skip image rotation step
    imageSizeMax: 700 # Maximum allowable image size (width or height). Images larger than this will be downsized.
    postgres: # Postgres database options
      user: postgres # Postgres username
      password: postgres # Postgres password
      host: localhost # Postgres server host
      port: 5432 # Postgres server port
      db: face_recognition # Which database to connect to
    webcamStream: # Webcam stream options
      drawBox: true # Whether to draw a box around detected faces
      minConfidence: 0.5 # Discard detections below this confidence level
      highConfidence: 0.9 # Confidence values at or above this level are deemed to be 'highly confident'
    celery:
      broker: pyamqp://guest@localhost
      backend: redis://localhost
.. _data:
State
^^^^^
Every time an :ref:`assembler` is run, it requires an object that will be used to store the input data and eventually store
the output. This object is passed between stages during execution, so it can also be used to store any intermediate data between stages.
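One way to picture such an object is as a small data class with a slot for the input, a slot for the output, a list of errors, and room for intermediate results. The sketch below is an assumption made for illustration; the class name ``PipelineState`` and its field names are hypothetical and are not Surround's own state class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class PipelineState:
    """Hypothetical state object carried through a pipeline run."""

    input_data: Any
    output_data: Optional[Any] = None
    errors: List[str] = field(default_factory=list)
    # Scratch space for values one stage leaves for a later stage.
    intermediate: Dict[str, Any] = field(default_factory=dict)


state = PipelineState(input_data="some raw input")
state.intermediate["tokens"] = state.input_data.split()  # written by one stage
state.output_data = len(state.intermediate["tokens"])    # read by a later stage
```

Using ``default_factory`` gives every run its own ``errors`` list and ``intermediate`` dictionary, so state from one run cannot leak into the next.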