
Benchmark Evaluations: A Guide to Testing Your MemMachine

Welcome to the MemMachine evaluation toolset! We've created a simple tool to help you measure the performance and response quality of your MemMachine instance and generate a LoCoMo score for your system.

Episodic Memory Tool Set: This tool measures how quickly and accurately MemMachine performs core episodic memory tasks. For a list of specific commands, check out the Episodic Memory Tool Set.

Getting Started

Before you run any benchmarks, you'll need to set up your environment.

General Prerequisites:

MemMachine Backend: The evaluation tools require that your MemMachine backend be installed and configured. If you need help with this, check out our QuickStart Guide.
Start the Backend: Once everything is set up, start MemMachine with this command:
```
memmachine-server
```

Tool-Specific Prerequisites:

Copy your cfg.yml file into your locomo directory (/memmachine/evaluation/locomo/) and rename the copy to locomo_config.yaml.
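If you prefer to script the copy-and-rename step, the following is a minimal sketch using only Python's standard library. The function name `install_locomo_config` is hypothetical and not part of MemMachine; the file names and directory come from the paragraph above and may need adjusting for your checkout.

```python
import shutil
from pathlib import Path


def install_locomo_config(cfg_path, locomo_dir):
    """Copy cfg.yml into the locomo directory as locomo_config.yaml.

    Hypothetical helper for illustration; not part of MemMachine.
    """
    src = Path(cfg_path)
    if not src.is_file():
        raise FileNotFoundError(f"Configuration file not found: {src}")
    dest_dir = Path(locomo_dir)
    if not dest_dir.is_dir():
        raise NotADirectoryError(f"locomo directory not found: {dest_dir}")
    dest = dest_dir / "locomo_config.yaml"
    shutil.copyfile(src, dest)  # overwrites an existing locomo_config.yaml
    return dest


# Example call, using the paths named in this guide:
# install_locomo_config("cfg.yml", "/memmachine/evaluation/locomo/")
```

The helper raises an error instead of creating a missing directory, so a mistyped path is caught before any benchmark runs.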

Running the Benchmark

Ready to go? Follow these simple steps:

A. All commands should be run from their respective tool directory (default locomo/episodic_memory/).

B. The path to your data file, locomo10.json, should be updated to match its location. By default, you can find it in /memmachine/evaluation/locomo/.

C. Once you have performed step 1 below, you can repeat the benchmark run by performing steps 2-4. Once you are finished benchmarking, run step 5.

Note: Please refer to the Episodic Memory Tool Set for exact commands.
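Before starting step 1, it can help to confirm that the data file path you plan to use is valid. The sketch below assumes only that locomo10.json is a JSON file, as its extension suggests. The helper name `check_data_file` is hypothetical, and nothing here depends on the file's internal structure.

```python
import json
from pathlib import Path


def check_data_file(path):
    """Return the parsed data file, or raise a clear error if it is unusable.

    Hypothetical helper for illustration; not part of MemMachine.
    """
    data_path = Path(path)
    if not data_path.is_file():
        raise FileNotFoundError(f"Data file not found: {data_path}")
    with data_path.open(encoding="utf-8") as f:
        return json.load(f)  # raises json.JSONDecodeError on malformed JSON


# Example call, using the default location named in this guide:
# check_data_file("/memmachine/evaluation/locomo/locomo10.json")
```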

Step 1: Ingest a Conversation

First, let's add conversation data to MemMachine. This only needs to be done once per test run.

Step 2: Search the Conversation

Let's search through the data you just added.

Step 3: Evaluate the Responses

Next, run a LoCoMo evaluation against the search results.

Step 4: Generate Your Final Score

Once the evaluation is complete, you can generate the final scores.

The output will be a table in your shell showing the mean scores for each category and an overall score, like the example below:

Mean Scores Per Category:
          llm_score  count         type
category                               
1            0.8050    282    multi_hop
2            0.7259    321     temporal
3            0.6458     96  open_domain
4            0.9334    841   single_hop

Overall Mean Scores:
llm_score    0.8487
dtype: float64
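To make the relationship between the two tables concrete, here is a minimal sketch of how per-category and overall means relate. It is not the tool's actual implementation: the `(category, score)` input format and the function names are assumptions for illustration, and only the category labels and sample numbers are taken from the output above. In the sample, the overall score matches the count-weighted mean of the category means, which is what you get when averaging over all questions at once.

```python
from collections import defaultdict

# Category labels as they appear in the sample output above.
CATEGORY_TYPES = {1: "multi_hop", 2: "temporal", 3: "open_domain", 4: "single_hop"}


def mean_scores(records):
    """Return (per-category stats, overall mean) for (category, score) pairs.

    The input format is an assumption made for this sketch.
    """
    if not records:
        raise ValueError("no scores to average")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, score in records:
        totals[category] += score
        counts[category] += 1
    per_category = {
        c: {
            "llm_score": totals[c] / counts[c],
            "count": counts[c],
            "type": CATEGORY_TYPES.get(c, "unknown"),
        }
        for c in sorted(counts)
    }
    # The overall mean is taken over all questions, not over the category means.
    overall = sum(totals.values()) / sum(counts.values())
    return per_category, overall


def weighted_overall(category_means):
    """Combine (mean, count) pairs into a count-weighted overall mean."""
    total = sum(mean * count for mean, count in category_means)
    return total / sum(count for _, count in category_means)


# Category means and counts copied from the sample output above.
sample = [(0.8050, 282), (0.7259, 321), (0.6458, 96), (0.9334, 841)]
print(round(weighted_overall(sample), 4))  # 0.8487
```

Because single_hop accounts for 841 of the 1,540 questions in the sample, it influences the overall score more than any other category.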

Step 5: Clean Up Your Data

When you're finished, you may want to delete the test data.
