Welcome to the MemMachine evaluation toolsets! We've created a simple tool to help you measure the performance and response quality of your MemMachine instance and generate a LoCoMo score for your system.
Episodic Memory Tool Set: This tool measures how quickly and accurately MemMachine performs core episodic memory tasks. For a list of specific commands, check out the Episodic Memory Tool Set.
Before you run any benchmarks, you'll need to set up your environment.
General Prerequisites:
- MemMachine Backend: Both tools require that your MemMachine backend be installed and configured. If you need help with this, you can check out our QuickStart Guide.
- Start the Backend: Once everything is set up, start MemMachine with this command:
    memmachine-server
Tool-Specific Prerequisites:
- Please ensure your cfg.yml file has been copied into your locomo directory (/memmachine/evaluation/locomo/) and renamed to locomo_config.yaml.
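If you prefer to script the copy-and-rename step, here is a minimal sketch. The helper name `install_locomo_config` is hypothetical, and the location of your source cfg.yml is an assumption you should adjust; only the target directory and file name come from this guide.

```python
import shutil
from pathlib import Path


def install_locomo_config(src: Path, locomo_dir: Path) -> Path:
    """Copy a config file into the locomo directory as locomo_config.yaml.

    Hypothetical helper: `src` is wherever your cfg.yml lives (an assumption),
    `locomo_dir` is your locomo directory.
    """
    if not src.is_file():
        raise FileNotFoundError(f"config file not found: {src}")
    if not locomo_dir.is_dir():
        raise NotADirectoryError(f"locomo directory not found: {locomo_dir}")
    dest = locomo_dir / "locomo_config.yaml"
    shutil.copyfile(src, dest)  # the original cfg.yml is left in place
    return dest


# Example call (paths are assumptions; the directory is the default from this guide):
# install_locomo_config(Path("cfg.yml"), Path("/memmachine/evaluation/locomo"))
```

The function raises early if either path is wrong, so a misplaced file shows up before you start a benchmark run.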
Ready to go? Follow these simple steps:
A. All commands should be run from their respective tool directory (default locomo/episodic_memory/).
B. The path to your data file, locomo10.json, should be updated to match its location. By default, you can find it in /memmachine/evaluation/locomo/.
C. Once you have performed step 1 below, you can repeat the benchmark run by performing steps 2-4. Once you are finished benchmarking, run step 5.
Note: Please refer to the Episodic Memory Tool Set for exact commands.
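Before starting a run, it can help to confirm that the data file path is correct. The sketch below is not part of the tool set; `load_locomo_data` is a hypothetical helper that only checks the path and parses the file, using the default location mentioned above.

```python
import json
from pathlib import Path

# Default location named in this guide; change it if your copy lives elsewhere.
DEFAULT_DATA_PATH = Path("/memmachine/evaluation/locomo/locomo10.json")


def load_locomo_data(data_path: Path = DEFAULT_DATA_PATH):
    """Parse the LoCoMo data file, failing early if the path is wrong.

    Hypothetical helper for a pre-flight check; it makes no assumption
    about the structure of the JSON beyond it being valid.
    """
    if not data_path.is_file():
        raise FileNotFoundError(
            f"{data_path} not found; update the data path to match your setup"
        )
    with data_path.open(encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_locomo_data()` with no arguments checks the default path; pass a different `Path` if you keep the file elsewhere.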
1. First, let's add conversation data to MemMachine. This only needs to be done once per test run.
2. Next, search through the data you just added.
3. Then run a LoCoMo evaluation against the search results.
4. Once the evaluation is complete, you can generate the final scores.
The output will be a table in your shell showing the mean scores for each category and an overall score, like the example below:
Mean Scores Per Category:
          llm_score  count         type
category
1            0.8050    282    multi_hop
2            0.7259    321     temporal
3            0.6458     96  open_domain
4            0.9334    841   single_hop

Overall Mean Scores:
llm_score    0.8487
dtype: float64

5. When you're finished, you may want to delete the test data.
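When reading the example output above, note that the overall score is not the plain average of the four category scores. The sketch below uses only the numbers from that example and shows that 0.8487 is consistent with a mean weighted by each category's question count. This is an observation about the example figures, not a statement of how the scoring script is implemented; the function names are hypothetical.

```python
# (category, type, mean llm_score, question count), copied from the example output.
ROWS = [
    (1, "multi_hop", 0.8050, 282),
    (2, "temporal", 0.7259, 321),
    (3, "open_domain", 0.6458, 96),
    (4, "single_hop", 0.9334, 841),
]


def weighted_mean(rows):
    """Mean of llm_score weighted by the number of questions per category."""
    total_score = sum(mean * count for _, _, mean, count in rows)
    total_count = sum(count for _, _, _, count in rows)
    return total_score / total_count


def unweighted_mean(rows):
    """Plain average of the per-category means, ignoring question counts."""
    return sum(mean for _, _, mean, _ in rows) / len(rows)


print(round(weighted_mean(ROWS), 4))    # 0.8487, matches the overall score
print(round(unweighted_mean(ROWS), 4))  # 0.7775, does not match
```

Because single_hop has by far the most questions (841 of 1540), it pulls the overall score toward its own category mean.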