Getting Started 🌟

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites 📋

Before you begin with Baler, ensure you have the following tools installed and configured on your system:

Docker

Docker is a platform for developers and sysadmins to develop, deploy, and run applications with containers. The use of Linux containers to deploy applications is called containerization.

To install Docker, follow the instructions in the official Docker documentation for your operating system: Docker installation guide (https://docs.docker.com/get-docker/)
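
A quick way to verify that Docker is installed and working:

docker --version
docker run hello-world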

Minikube (or any equivalent Kubernetes setup)

Minikube is a tool that lets you run Kubernetes locally. Minikube runs a single-node Kubernetes cluster on your personal computer (including Windows, macOS, and Linux PCs) so that you can try out Kubernetes or use it for daily development work. If you’re using a different Kubernetes setup, ensure it’s configured correctly.

To install Minikube, follow the instructions in the official Minikube documentation: Minikube installation guide (https://minikube.sigs.k8s.io/docs/start/)

For other Kubernetes environments, consult your specific cloud provider’s documentation or the Kubernetes Getting Started Guide.
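
If you go with Minikube, starting a local cluster and checking its health looks like this:

minikube start
minikube status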

kubectl

The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.

To install kubectl, follow the instructions in the official Kubernetes documentation for your operating system: kubectl installation guide (https://kubernetes.io/docs/tasks/tools/)
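
To confirm kubectl is installed and can reach your cluster:

kubectl version --client
kubectl cluster-info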

Helm

Helm is a package manager for Kubernetes that allows you to define, install, and upgrade Kubernetes applications. Helm uses a packaging format called charts, which include all of the Kubernetes resources needed to deploy an application, such as deployments, services, and ingress rules.

To install Helm, follow the instructions in the official Helm documentation for your operating system: Helm installation guide (https://helm.sh/docs/intro/install/)
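
To confirm Helm is installed:

helm version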

Make utility

The Make utility automates the build process by reading files called Makefiles, which specify how to derive the target program. Although it’s used primarily for compiling programs, it can be used to manage any project where you need to execute arbitrary commands.

  • Linux: Usually available by default or can be installed via your package manager. For example, on Debian-based systems: sudo apt-get install make.

  • macOS: Included with Xcode Command Line Tools, which can be installed with xcode-select --install.

  • Windows: Install using Chocolatey: choco install make.
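
Whichever platform you are on, you can confirm make is available with:

make --version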

Ensure all these tools are installed and properly configured before proceeding with the setup and deployment of Baler.

Installing 💾

A step-by-step series of commands that gets a development environment running. (The full sequence is collected in one block after the list.)

  1. Clone the project repository

  2. Execute the make create-namespace command to create the baler namespace

  3. Execute the make deploy-operator command to deploy the operator

  4. Execute the kubectl get pods -n baler --watch command to watch the operator pod come up

  5. You should see something like this: baler-operator-pod   1/1     Running   0          106m

  6. This means that the operator is ready to receive our pipelines 🎉

  7. It was easy, wasn’t it? 😉
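
For reference, the whole sequence in one place (the repository URL and directory name below are placeholders; substitute the actual Baler repository):

# clone and enter the repository (placeholder URL and directory name)
git clone <baler-repository-url>
cd baler
make create-namespace
make deploy-operator
kubectl get pods -n baler --watch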

Let’s run a quick test to see our pipeline at work!

Our first pipeline deployment 🚀

  1. First we need to deploy our pipeline definition file. For this we will use a basic Haystack pipeline YAML definition.

  2. You can find the example pipelines in ./examples/

  3. Let’s use example_01.yaml, which defines a simple query pipeline that uses an ElasticsearchDocumentStore (one of the available documentstores in Haystack)

  4. Deploy the pipeline by executing the following command: kubectl apply -f ./examples/example_01.yaml

  5. Run kubectl get haystack -A and you should see your Haystack pipeline in the Pending state:

NAME                           STATUS
example-haystack-pipeline-01   Pending

After a couple of seconds you should be able to see a similar output if you use the kubectl get haystack -A command:

NAME                           STATUS
example-haystack-pipeline-01   Running

Let’s make sure that all underlying resources were created. Run the kubectl get pods -n default command and you should see similar output:

NAME                            READY   STATUS    RESTARTS       AGE
elasticsearch-57dc7c9df-vtwvq   1/1     Running   0              3m7s
indexing-5798465f9c-wjg96       1/1     Running   3 (117s ago)   3m7s
query-5687b797f-w8psm           1/1     Running   3 (2m1s ago)   3m8s

As you can see, besides Elasticsearch, two pods were created for the pipeline resource: one for query and one for indexing.

This is because we defined two pipelines in our examples/example_01.yaml.

...
pipelines:
    - name: query
      nodes: ...
    - name: indexing
      nodes: ...
...

Our operator is aware of multiple pipeline definitions in our manifest and will provision resources for each.

⚠️ Please note that documentstores are provisioned per Kubernetes resource, so both pipelines consume the same documentstore (Elasticsearch) in our example [TODO: link to automatic documentstore provisioning]. In a production use case you will most likely have a documentstore outside of your Kubernetes cluster.
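
For illustration, pointing a pipeline at an external documentstore is a matter of changing the DocumentStore component’s connection parameters in the pipeline definition (the hostname below is a made-up example; the exact parameters depend on your documentstore):

...
  components:
    - name: DocumentStore
      type: ElasticsearchDocumentStore
      params:
        host: 'es.my-company.example.com'   # hypothetical external host
        port: 9200
...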

Consuming our pipeline 🍿

📰 Read more about Haystack documentstores

📰 Read more about Haystack REST API

  1. Let’s start up a busybox pod so we can interact with our pipeline

  2. Execute kubectl run busybox --image=busybox --restart=Never -- /bin/sh -c "while true; do sleep 3600; done"

  3. Run kubectl get pods busybox -n default and you should see something similar: busybox   1/1     Running

  4. Get a shell in our newly created pod by running the following command: kubectl exec -it busybox -n default -- sh

  5. To interact with our service API we need to install curl inside our busybox pod:

wget https://github.com/moparisthebest/static-curl/releases/download/v7.80.0/curl-amd64
mv ./curl-amd64 /bin/curl
chmod +x /bin/curl
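
You can confirm the binary works inside the pod:

curl --version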

Our operator provides Kubernetes Service endpoints for our created pipelines. Run the kubectl get services -n default command and you should get an output similar to this:

NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
elasticsearch                        ClusterIP   10.100.178.197   <none>        9200/TCP   15m
haystack-pipeline-service-indexing   ClusterIP   10.111.41.114    <none>        80/TCP     15m
haystack-pipeline-service-query      ClusterIP   10.101.174.33    <none>        80/TCP     15m

The elasticsearch endpoint is created because of the autoprovision behaviour for documentstores in our operator. Our pipelines are configured to use this endpoint.

...
  components:
    - name: DocumentStore
      type: ElasticsearchDocumentStore
      params:
        host: 'elasticsearch'
        port: 9200
        embedding_dim: 384
...

The haystack-pipeline-service-indexing service exposes our indexing API for uploading files. The haystack-pipeline-service-query service exposes our query API for running queries.
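
The examples below run from inside the cluster (via the busybox pod). If you’d rather test from your local machine, kubectl port-forward is an alternative way to reach a service, e.g.:

# forward local port 8080 to port 80 of the query service
kubectl port-forward service/haystack-pipeline-service-query 8080:80 -n default

The query API is then reachable at http://localhost:8080 on your machine.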

Let’s focus on the indexing first:

  1. Let’s validate that our pipeline API is up and running

  2. Execute curl -XGET http://haystack-pipeline-service-indexing/initialized

  3. You should get true as a response if your pipeline is ready to be used

  4. Now we need to upload a file to our documentstore so that we have something to query

  5. Run the wget https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip command to download a sample dataset

  6. Run the unzip article_txt_countries_and_capitals.zip command to unzip the archive

  7. Now we are ready to upload the sample files:

find ./article_txt_countries_and_capitals -name '*.txt' -exec \
    curl --request POST \
         --url http://haystack-pipeline-service-indexing/file-upload \
         --header 'accept: application/json' \
         --header 'content-type: multipart/form-data' \
         --form files="@{}" \
         --form meta=null \;

  8. This can take some time depending on your local performance. Great time to have a coffee ☕️.

⚠️ Note that this way of uploading files is not suitable for indexing large numbers of files, but it is completely sufficient for our test case!

You can follow the progress by monitoring the logs from your pipeline with the kubectl logs deployment/haystack-pipeline-deployment-indexing -n default --follow command.

Finally we can make a query against our pipeline:

  1. Execute the following curl command to post a query against your pipeline:

curl --request POST \
     --url http://haystack-pipeline-service-query/query \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '{"query": "climate in Scandinavia"}'

  2. You should receive the relevant documents from your documentstore (an abridged illustration of the response shape follows this list)

  3. And now the journey begins ⛵️!
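
The exact payload depends on your pipeline, but the Haystack REST API’s query response is a JSON object roughly along these lines (abridged; all values are illustrative placeholders, not real results):

{
  "query": "climate in Scandinavia",
  "answers": [],
  "documents": [
    {
      "content": "…",
      "meta": { "name": "…" },
      "score": 0.71
    }
  ]
}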