## `docker-compose.yml`: llama.cpp CUDA server
```yml
services:
  gemma:
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    ports:
      - "8080:8080"
    volumes:
      - llama-cache:/root/.cache/llama.cpp
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    command: >
      --host 0.0.0.0
      --hf-repo bartowski/gemma-2-9b-it-GGUF
      --hf-file gemma-2-9b-it-IQ4_XS.gguf
      --gpu-layers 99
      --main-gpu 0
volumes:
  llama-cache:
```
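
On first start the container has to download the GGUF file and load it onto the GPU, so requests sent right after `docker compose up` may fail. The sketch below polls until the server is ready. It assumes the server exposes a `GET /health` endpoint that returns HTTP 200 once the model is loaded; the function names, timeout, and interval are illustrative choices, not part of the setup above.

```python
import time
import urllib.error
import urllib.request


def server_is_healthy(base_url="http://localhost:8080"):
    """Return True if GET <base_url>/health answers with HTTP 200.

    Assumption: the server exposes /health and answers non-200
    (or refuses the connection) while the model is still loading.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(check=server_is_healthy, timeout=600.0, interval=2.0):
    """Poll check() until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` before the first request returns `True` once the server answers, or `False` if the timeout expires first. The `check` parameter is injectable so the loop can be exercised without a running server.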

## cURL test

```bash
curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
```

```json
{"content":"\n\n**1. Define Your Purpose:**\n\n* What do you want to achieve with your website? (e.g., sell products, share information, build a community)\n* Who is your target audience?\n\n**2. Choose a Domain Name:**\n\n* Select a memorable and relevant name that reflects your website's purpose.\n* Check availability and register your domain name.\n\n**3. Select a Web Hosting Provider:**\n\n* Choose a reliable hosting provider that offers the necessary resources (storage, bandwidth, etc.) for your website.\n* Consider factors like price, uptime, and customer support.\n\n**4.","id_slot":0, ... }
```

## Python client

```python
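import requests

# POST the prompt to the server's /completion endpoint and decode the JSON reply.
response = requests.post(
    'http://localhost:8080/completion',
    json={"prompt": "Building a website can be done in 10 simple steps:",
          "n_predict": 128}  # cap the completion at 128 tokens
).json()

# The generated text is in the "content" field.
print(response['content'])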
```

```
**1. Define Your Purpose:**

* What do you want to achieve with your website? (e.g., sell products, share information, build a community)
* Who is your target audience?

**2. Choose a Domain Name:**

* Select a memorable and relevant name that reflects your website's purpose.
* Check availability and register your domain name.

**3. Select a Web Hosting Provider:**

* Choose a reliable hosting provider that offers the necessary resources (storage, bandwidth, etc.) for your website.
* Consider factors like price, uptime, and customer support.

**4.
```

## Single token prediction with probs

```sh
pip install requests polars
```

```python
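import polars as pl
import requests

pl.Config.set_tbl_rows(40)  # print all 40 rows instead of a truncated table

# Placeholder: the prompt that produced the table below is not shown here.
prompt = "..."

response = requests.post(
    'http://localhost:8080/completion',
    json={
        "prompt": prompt,
        "temperature": -1,  # negative: greedy pick, with probabilities still reported
        "n_predict": 1,     # generate a single token
        "n_probs": 40,      # return the 40 most likely candidates for that token
    },
).json()

# print(response['content'])  # the predicted token itself

# One entry per generated token; with n_predict=1 there is exactly one.
df = pl.DataFrame(response['completion_probabilities'][0]['probs'])
print(df)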
```

```
shape: (40, 2)
┌─────────┬──────────┐
│ tok_str ┆ prob     │
│ ---     ┆ ---      │
│ str     ┆ f64      │
╞═════════╪══════════╡
│ D       ┆ 0.996538 │
│ **      ┆ 0.002495 │
│  **     ┆ 0.000539 │
│ C       ┆ 0.000053 │
│ B       ┆ 0.000047 │
│ A       ┆ 0.000032 │
│  D      ┆ 0.000029 │
│ **(     ┆ 0.000024 │
│ d       ┆ 0.000024 │
│ **)     ┆ 0.00002  │
│ E       ┆ 0.000017 │
│ Seoul   ┆ 0.000015 │
│ ㄷ      ┆ 0.000008 │
│ ㄹ      ┆ 0.000007 │
│ 주      ┆ 0.000006 │
│ Д       ┆ 0.000004 │
│ **,     ┆ 0.000004 │
│ 답      ┆ 0.000004 │
│ 디      ┆ 0.000004 │
│ 도      ┆ 0.000004 │
│ ㅁ      ┆ 0.000003 │
│ Ｄ      ┆ 0.000003 │
│ Answer  ┆ 0.000003 │
│ 가      ┆ 0.000003 │
│  )      ┆ 0.000003 │
│ ④       ┆ 0.000003 │
│ )**     ┆ 0.000003 │
│ ד       ┆ 0.000003 │
│ ****    ┆ 0.000003 │
│ ㄱ      ┆ 0.000002 │
│ 다      ┆ 0.000002 │
│ 을      ┆ 0.000002 │
│ ㅇ      ┆ 0.000002 │
│ 유      ┆ 0.000002 │
│ Korean  ┆ 0.000002 │
│ 4       ┆ 0.000002 │
│ G       ┆ 0.000002 │
│ 이      ┆ 0.000001 │
│ ***     ┆ 0.000001 │
│ 하      ┆ 0.000001 │
└─────────┴──────────┘
```
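
The table looks like the answer to a multiple-choice question, with most of the mass on `D` and a little spread across formatting tokens and variants such as ` D`. A minimal sketch of collapsing such a list into a distribution over answer letters follows. The function name and the choice set `A` to `D` are assumptions for illustration; the input shape (`tok_str`, `prob`) matches the columns shown above.

```python
from collections import defaultdict


def choice_distribution(probs, choices=("A", "B", "C", "D")):
    """Sum probability per answer letter and renormalize over `choices`.

    `probs` is a list of {"tok_str": str, "prob": float} dicts.
    Tokens are matched after stripping whitespace, so "D" and " D"
    count toward the same letter. Everything else is ignored.
    """
    totals = defaultdict(float)
    for entry in probs:
        letter = entry["tok_str"].strip()
        if letter in choices:
            totals[letter] += entry["prob"]
    mass = sum(totals.values())
    if mass == 0:
        raise ValueError("none of the choices appear among the returned tokens")
    return {c: totals[c] / mass for c in choices}


# A few rows copied from the table above.
sample = [
    {"tok_str": "D", "prob": 0.996538},
    {"tok_str": "**", "prob": 0.002495},
    {"tok_str": "C", "prob": 0.000053},
    {"tok_str": "B", "prob": 0.000047},
    {"tok_str": "A", "prob": 0.000032},
    {"tok_str": " D", "prob": 0.000029},
]
dist = choice_distribution(sample)
print(max(dist, key=dist.get))  # prints D
```

With a live response, the same function can be called on `response['completion_probabilities'][0]['probs']` directly. Renormalizing over the letters discards the mass on tokens like `**`, which is a deliberate simplification here.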