## Getting Started 🌟
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
### Prerequisites 📋
Before you begin with Baler, ensure you have the following tools installed and configured on your system:
#### Docker
Docker is a platform for developers and sysadmins to develop, deploy, and run applications with containers. The use of Linux containers to deploy applications is called containerization.
To install Docker, follow the instructions in the official Docker documentation for your operating system: [Docker installation guide](https://docs.docker.com/get-docker/)
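Once installed, a quick way to confirm Docker works end to end is to check the version and run the official `hello-world` test image:

```sh
# Check that the Docker CLI is on your PATH
docker --version

# Pull and run Docker's official test image; it prints a greeting and exits
docker run --rm hello-world
```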
#### Minikube (or any equivalent Kubernetes setup)
Minikube is a tool that lets you run Kubernetes locally. Minikube runs a single-node Kubernetes cluster on your personal computer (including Windows, macOS, and Linux PCs) so that you can try out Kubernetes or use it for daily development work. If you’re using a different Kubernetes setup, ensure it’s configured correctly.
To install Minikube, follow the instructions in the official Minikube documentation: [Minikube installation guide](https://minikube.sigs.k8s.io/docs/start/)
For other Kubernetes environments, consult your specific cloud provider’s documentation or the Kubernetes Getting Started Guide.
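If you go with Minikube, you can start a local cluster and verify that it is healthy like this:

```sh
# Start a local single-node cluster
minikube start

# Confirm the cluster components are running
minikube status
```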
#### kubectl
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs.
To install kubectl, follow the instructions in the official Kubernetes documentation for your operating system: [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/)
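After installing, you can verify the client and, once your cluster is up, the connection to it:

```sh
# Print the kubectl client version
kubectl version --client

# List the nodes of the cluster your current context points at
kubectl get nodes
```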
#### Helm
Helm is a package manager for Kubernetes that allows you to define, install, and upgrade Kubernetes applications. Helm uses a packaging format called charts, which include all of the Kubernetes resources needed to deploy an application, such as deployments, services, and ingress rules.
To install Helm, follow the instructions in the official Helm documentation for your operating system: [Helm installation guide](https://helm.sh/docs/intro/install/)
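A quick sanity check after installing:

```sh
# Print the Helm client version
helm version
```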
#### Make utility
The Make utility automates the build process by reading files called Makefiles which specify how to derive the target program. Although it’s used primarily for compiling programs, it can be used for managing any project where you need to execute arbitrary commands.
- **Linux:** usually available by default, or installable via your package manager. For example, on Debian-based systems: `sudo apt-get install make`
- **macOS:** included with the Xcode Command Line Tools, which can be installed with `xcode-select --install`
- **Windows:** install it using Chocolatey: `choco install make`
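You can verify that Make is available with:

```sh
# GNU Make prints its version and exits
make --version
```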
Ensure all these tools are installed and properly configured before proceeding with the setup and deployment of Baler.
### Installing 💾
A step-by-step series of examples that tells you how to get a development environment running.
1. Clone the project repository
2. Execute the `make create-namespace` command to create the `baler` namespace
3. Execute the `make deploy-operator` command to deploy the operator
4. Execute the `kubectl get pods -n baler --watch` command to see the operator pod coming up

You should see something like this:

```
baler-operator-pod   1/1   Running   0   106m
```

This means that the operator is ready to receive our pipelines 🎉
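If the pod does not show up, it can help to confirm that the namespace and the operator's resources were actually created, for example:

```sh
# The namespace created by `make create-namespace` should exist
kubectl get namespace baler

# List everything the operator deployment brought up in that namespace
kubectl get all -n baler
```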
It was easy, wasn’t it? 😉
Let’s run a test to see our pipeline at work!
### Our first pipeline deployment 🚀
First we need to deploy our pipeline definition file. For this we will use a basic Haystack pipeline YAML definition.
1. You can find the example pipelines in `./examples/`
2. Let’s use `example_01.yaml`, which defines a simple query pipeline that uses an `ElasticsearchDocumentStore` (one of the available document stores in Haystack)
3. Deploy the pipeline by executing the following command: `kubectl apply -f ./examples/example_01.yaml`
4. Run `kubectl get haystack -A` and you should see that your Haystack pipeline is in the `Pending` state:
```
NAME                           STATUS
example-haystack-pipeline-01   Pending
```
After a couple of seconds you should be able to see a similar output if you use the `kubectl get haystack -A` command:
```
NAME                           STATUS
example-haystack-pipeline-01   Running
```
Let’s make sure that all underlying resources are created. Run the `kubectl get pods -n default` command and you should see a similar output:
```
NAME                            READY   STATUS    RESTARTS       AGE
elasticsearch-57dc7c9df-vtwvq   1/1     Running   0              3m7s
indexing-5798465f9c-wjg96       1/1     Running   3 (117s ago)   3m7s
query-5687b797f-w8psm           1/1     Running   3 (2m1s ago)   3m8s
```
As you can see, two pipeline pods were created: one for query and one for indexing. This is because we defined two pipelines in our `examples/example_01.yaml`:
```yaml
...
pipelines:
  - name: query
    nodes: ...
  - name: indexing
    nodes: ...
...
```
Our operator is aware of multiple pipeline definitions in our manifest and will provision resources for each.
⚠️ Please note that document stores are provisioned per Kubernetes resource, so both pipelines consume the same document store (Elasticsearch) in our example [TODO: link to automatic documentstore provisioning]. In a production use case you will most likely have a document store outside of your Kubernetes cluster.
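To make the structure more concrete, here is a rough sketch of how the Haystack pipeline definition inside `example_01.yaml` might fit together. Only the `DocumentStore` component and the two pipeline names come from the excerpts in this guide; the retriever and converter nodes (and the model name) are illustrative assumptions, and the custom-resource fields (`apiVersion`, `kind`, `metadata`) specific to Baler's CRD are omitted:

```yaml
components:
  # From the excerpt in this guide: the auto-provisioned Elasticsearch document store
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: 'elasticsearch'
      port: 9200
      embedding_dim: 384
  # Illustrative assumption: an embedding retriever with a 384-dimensional model
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: sentence-transformers/all-MiniLM-L6-v2
  # Illustrative assumption: a converter that turns uploaded .txt files into documents
  - name: TextConverter
    type: TextConverter
pipelines:
  - name: query
    nodes:
      - name: Retriever
        inputs: [Query]
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]
      - name: Retriever
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Retriever]
```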
### Consuming our pipeline 🍿
📰 Read more about Haystack document stores
📰 Read more about the Haystack REST API
Let’s start up a busybox pod so we can interact with our pipeline:
1. Execute: `kubectl run busybox --image=busybox --restart=Never -- /bin/sh -c "while true; do sleep 3600; done"`
2. Run `kubectl get pods busybox -n default` and you should see something similar: `busybox 1/1 Running`
3. Get a shell in our newly created pod by running the following command: `kubectl exec -it busybox -n default -- sh`
4. In order to talk to our service API we need to install `curl` into our busybox:

```sh
wget https://github.com/moparisthebest/static-curl/releases/download/v7.80.0/curl-amd64
mv ./curl-amd64 /bin/curl
chmod +x /bin/curl
```
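If you would rather skip the manual `curl` install, a pod based on an image that already ships curl, such as the community-maintained `curlimages/curl`, works just as well:

```sh
# Alternative to busybox: a pod whose image already contains curl
kubectl run curl --image=curlimages/curl --restart=Never --command -- sleep 3600
kubectl exec -it curl -n default -- sh
```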
Our operator provides Kubernetes Service endpoints for our created pipelines.
Run the `kubectl get services -n default` command and you should get an output similar to this:
```
NAME                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
elasticsearch                        ClusterIP   10.100.178.197   <none>        9200/TCP   15m
haystack-pipeline-service-indexing   ClusterIP   10.111.41.114    <none>        80/TCP     15m
haystack-pipeline-service-query      ClusterIP   10.101.174.33    <none>        80/TCP     15m
```
The `elasticsearch` endpoint is created by the operator's auto-provisioning behaviour for document stores. Our pipelines are configured to use this endpoint:
```yaml
...
components:
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: 'elasticsearch'
      port: 9200
      embedding_dim: 384
...
```
The `haystack-pipeline-service-indexing` service provides our indexing API to upload files.
The `haystack-pipeline-service-query` service provides our query API to run queries.
Let’s focus on the indexing first:
1. Let’s validate that our pipeline API is up and running. Execute `curl -XGET http://haystack-pipeline-service-indexing/initialized`; you should get `true` as a response if your pipeline is ready to be used
2. Now we need to upload a file to our document store in order to be able to query it
3. Run the `wget https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/article_txt_countries_and_capitals.zip` command to download a sample dataset
4. Run the `unzip article_txt_countries_and_capitals.zip` command to unzip the archive
5. Now we are ready to upload the sample files:

```sh
find ./article_txt_countries_and_capitals -name '*.txt' -exec \
  curl --request POST \
    --url http://haystack-pipeline-service-indexing/file-upload \
    --header 'accept: application/json' \
    --header 'content-type: multipart/form-data' \
    --form files="@{}" \
    --form meta=null \;
```

This can take some time depending on your local performance. Great time to have a coffee ☕️.
⚠️ Note that this way of uploading files is not suited for indexing a large amount of files, but it is completely sufficient for our test case!
You can follow the progress by monitoring the logs from your pipeline with the `kubectl logs deployment/haystack-pipeline-deployment-indexing -n default --follow` command, which will show you the indexing progress.
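Before querying, you can also double-check that documents actually landed in the document store by asking Elasticsearch directly from the busybox shell (`document` is Haystack's default index name; adjust it if your pipeline uses another one):

```sh
# Ask Elasticsearch how many documents the index holds
curl -XGET 'http://elasticsearch:9200/document/_count'
```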
Finally we can make a query against our pipeline. Execute the following curl command to post a query:

```sh
curl --request POST \
  --url http://haystack-pipeline-service-query/query \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --data '{"query": "climate in Scandinavia"}'
```

You should receive the relevant documents from your document store.
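For reference, the response body is JSON. With a retrieval pipeline like this one it should look roughly like the sketch below; all field values are illustrative and the exact shape depends on your Haystack version:

```json
{
  "query": "climate in Scandinavia",
  "answers": [],
  "documents": [
    {
      "content": "...",
      "meta": { "name": "some_article.txt" },
      "score": 0.78
    }
  ]
}
```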
And now the journey begins ⛵️!