This post might interest you if you run local models with ollama, want to connect them with your notes, and care about reliability. The MCP server and scripts from these experiments are available here on GitHub.
MCP in a few words
Model Context Protocol (MCP) was introduced by Anthropic in November 2024 as an open-source, universal language for Large Language Models to interact with external tools. A model by itself is capable of much, but it is prone to hallucinating and, for example, unable to do maths reliably or to remember a database of factual information (about a list of objects, clients, events…). It is therefore useful to augment the model with a set of tools that help it perform a variety of tasks.
In MCP, tools are actions, resources are data.
| Concept | Verb or noun? | Typical use | Example |
|---|---|---|---|
| Tool | Verb | Ask the model to do something | add(a, b), search_notes(...) |
| Resource | Noun | Let the model see some data / metadata | config://app, notes://list |
Use a tool when the model should do something: call an API, modify a file, write a note, run a search, compute a result. If it can be described with a verb (create, update, summarize, search), it is probably a tool.
Use a resource when the model should see something: read-only config, a directory listing, a static document, or a pre-computed index. If it can be described with a noun (config, notes list, embedding index), this is probably a resource.
FastMCP
The FastMCP framework is a Python package that helps build, manage and interact with MCP servers.
Defining your tools and resources
For the examples in this section, assume we are working in a Python file, server.py, that we will launch as our FastMCP server.
from fastmcp import FastMCP
from fastmcp.resources import DirectoryResource
from pathlib import Path
mcp = FastMCP("demo-server")
BASE_DIR = Path("./data").resolve()
And at the end:
def main() -> None:
mcp.run(transport="http", host="127.0.0.1", port=8000)
if __name__ == "__main__":
main()
Using MCP vocabulary, tools are functions the LLM calls to perform an external action: an API call, reading a file, creating a file, performing a mathematical operation. Here is a simple, straightforward tool:
@mcp.tool
def add(a: int, b: int) -> int:
"""Add two integers and return the result."""
return a + b
It is a typed function with a well-written docstring (important for guiding the model later) and the @mcp.tool decorator.
Resources are an important new concept to understand in MCP: they expose data, information or files at a URI. In real-world usage, they are typically attached to the model's context statically rather than fetched dynamically at the model's own initiative. The decorator defines the URI and accepts a MIME type that tells the host how to display or treat the content, as shown in the simple example below:
@mcp.resource("config://app", mime_type="application/json")
def app_config() -> dict:
"""Static application configuration."""
return {
"name": "Demo MCP Server",
"version": "1.0.0",
"env": "dev"
}
Resources can be parameterized, as shown here:
@mcp.resource("greetings://{name}")
def personalized_greeting(name: str) -> str:
"""Generates a personalized greeting for the given name."""
return f"Hello, {name}! Welcome to the MCP server."
Here is an example of calling these simple tools in order (check the config, do an addition, call the user greeting function) with llama3.2:3b-instruct-fp16, using the following prompt:
You have access to these MCP capabilities:
- Tool: add(a: int, b: int) -> int
- Resource: resource://config (JSON app configuration)
- Resource: resource://greetings/{name} (returns a greeting for the given name)
Please do the following, using the tools and resources instead of guessing:
1. Fetch the resource://config resource and briefly summarize the app’s name, version, and environment.
2. Use the add tool to compute 40 + 2, and tell me the result.
3. Fetch resource://greetings/Alice and include that greeting in your answer.
Finally, produce a short, friendly paragraph that:
- Mentions the configuration you found,
- States the result of 40 + 2,
- And naturally weaves in the greeting for Alice.
When EVERYTHING is done, send ONE final message that starts with the line:
END_OF_RUN
I’ve run this simple workflow over 50 runs for a few tool-calling models with ollama. The runtime versus error rate versus number of steps, shown below, already makes the variability and instability of the runs visible.
Even on this toy workflow, I saw runs that stopped early or hallucinated “I did X” when nothing had been written. Some runs chose to use every available tool instead of following the workflow.
Prompt and model greatly influence error patterns. Tool calling isn’t a solved problem, especially for small local models.
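For reference, the harness behind these numbers does not need to be fancy. The sketch below is only the skeleton: the --model flag and the runs.json output file are hypothetical names, and the actual scripts (which also parse the tool-call traces) live in the repository.
# Hypothetical benchmarking skeleton: run the bridge script N times per model
# and record wall-clock time, exit status and whether END_OF_RUN was reached.
import json
import subprocess
import time

MODELS = ["llama3.2:3b-instruct-fp16", "granite4:7b-a1b-h"]
N_RUNS = 50
results = []

for model in MODELS:
    for run in range(N_RUNS):
        start = time.perf_counter()
        proc = subprocess.run(
            ["python", "bridge_mcp_ollama.py", "--model", model],
            capture_output=True, text=True,
        )
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append({
            "model": model,
            "run": run,
            "wall_ms": round(elapsed_ms),
            "ok": proc.returncode == 0,
            "end_of_run": "END_OF_RUN" in proc.stdout,
        })

with open("runs.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)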
Here is a compiled table of metrics on these simple tasks for a bunch of models that run on my graphics card (NVIDIA RTX 5070 Ti, with 16 GB VRAM):
| Model | Median wall time (ms) | Success rate | All required tools called | Notes |
|---|---|---|---|---|
| llama3.2:3b-instruct-fp16 | 2 209 | 0.74 | 0.80 | Fast, but unstable / partial flows |
| llama3-groq-tool-use:8b | 1 856 | 1.00 | 0.00 | No errors… but doesn’t do the job |
| granite4:7b-a1b-h | 1 889 | 1.00 | 1.00 | Fast and reliably follows workflow |
| mistral-nemo:12b-instruct-2407-q8_0 | 17 356 | 1.00 | 1.00 | Slow but very reliable |
| qwen2.5-14b-lean:latest | 21 400 | 0.96 | 1.00 | Slow, but good on more complex flows |
| lukaspetrik/gemma3-tools:12b | 4 837 | 1.00 | 1.00 | Good generalist tool-caller |
In short: granite4:7b-a1b-h is very fast and always reliable. If I need a fast and reliable agent for a simple flow, I’ll think of this new IBM model for sure. llama3-groq-tool-use:8b reliably triggers no errors… but does not do what is needed at all. The larger models (Mistral, Qwen, Gemma) were slower but very reliable, and they shine when the workflow is more complex.
Since a success here simply means ‘the right tools were called’, we could also use an LLM as a judge to compare each model’s END_OF_RUN output with an ideal output. Skimming the outputs, most successful runs had no significant errors, only a bit of creativity in the formulation. Here are two example outputs:
qwen2.5-14b-lean:latest The application configuration indicates that this is a version 1.0 built by MyTeam in their specified environment. The computation of 40 + 2 using the add tool yields a result of 42. Alice has been warmly welcomed with the greeting “Hello, Alice! Welcome to the MCP server.” It’s great to see everything functioning smoothly and to have Alice join us on this journey through the capabilities of our system.
lukaspetrik/gemma3-tools:12b The application’s configuration reveals it’s version 1.0, authored by MyTeam. I was able to compute that 40 + 2 equals 42. And, as a friendly greeting, I’m including “Hello, Alice! Welcome to the MCP server.” It’s been a productive session!
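The LLM-as-a-judge check mentioned above could be sketched roughly as below, assuming the ollama Python package; the judge model, the reference answer and the PASS/FAIL format are all hypothetical choices, not something the benchmark currently does.
# A rough LLM-as-a-judge sketch: compare a run's END_OF_RUN output to a
# hand-written reference answer with a local model acting as the judge.
import ollama

REFERENCE = (
    "App 'Demo MCP Server' version 1.0.0 in the dev environment; "
    "40 + 2 = 42; greeting: 'Hello, Alice! Welcome to the MCP server.'"
)

def judge(candidate: str, model: str = "qwen2.5-14b-lean:latest") -> str:
    """Ask a local model whether a run's final output matches the reference."""
    prompt = (
        "You are a strict judge. Compare the CANDIDATE answer to the REFERENCE.\n"
        "Reply with PASS or FAIL followed by a one-sentence reason.\n\n"
        f"REFERENCE:\n{REFERENCE}\n\nCANDIDATE:\n{candidate}\n"
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]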
Some of my custom tools for interacting with Markdown notes
This code defines a resource based on my notes directory.
BASE_DIR.mkdir(parents=True, exist_ok=True)
notes_listing = DirectoryResource(
uri="notes://list",
path=BASE_DIR,
name="Demo notes directory",
description="Lists Markdown notes in ./data.",
recursive=False,
)
mcp.add_resource(notes_listing)
This tool uses the read_text method of pathlib.Path to read the content of a file in the notes directory.
@mcp.tool
def read_note(relpath: str) -> str:
"""
Read a UTF-8 text note from the demo directory.
Example:
{"relpath": "example.md"}
"""
p = (BASE_DIR / relpath).resolve()
if BASE_DIR not in p.parents and p != BASE_DIR:
raise ValueError("Path escapes BASE_DIR")
if not p.exists():
raise FileNotFoundError(relpath)
return p.read_text(encoding="utf-8", errors="ignore")
This tool performs a simple substring search over the notes, line by line, and returns up to the first 10 matches, each truncated to about 200 characters.
import re
from typing import List
from pydantic import BaseModel, Field
class NoteHit(BaseModel):
path: str = Field(..., description="Path relative to BASE_DIR")
line: int
snippet: str
@mcp.tool
def search_notes(query: str, max_results: int = 10) -> List[NoteHit]:
"""
Search all notes in BASE_DIR for a substring.
"""
pattern = re.escape(query)
rx = re.compile(pattern, re.IGNORECASE)
hits: List[NoteHit] = []
for p in BASE_DIR.rglob("*.md"):
if len(hits) >= max_results:
break
with p.open("r", encoding="utf-8", errors="ignore") as f:
for i, line in enumerate(f, start=1):
if rx.search(line):
rel = p.relative_to(BASE_DIR).as_posix()
snip = line.strip()
if len(snip) > 200:
snip = snip[:197] + "…"
hits.append(NoteHit(path=rel, line=i, snippet=snip))
if len(hits) >= max_results:
break
return hits
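The system prompt shown further down also references an append_note tool. The repository version is not reproduced here, but a minimal sketch of it could look like this, reusing the same path-escape guard as read_note:
@mcp.tool
def append_note(relpath: str, content: str) -> str:
    """
    Append UTF-8 text to a Markdown note under BASE_DIR, creating it if needed.

    Example:
        {"relpath": "ideas.md", "content": "- a new idea"}
    """
    p = (BASE_DIR / relpath).resolve()
    if BASE_DIR not in p.parents and p != BASE_DIR:
        raise ValueError("Path escapes BASE_DIR")
    p.parent.mkdir(parents=True, exist_ok=True)
    # Append, adding a trailing newline so successive appends stay readable.
    with p.open("a", encoding="utf-8") as f:
        f.write(content if content.endswith("\n") else content + "\n")
    return f"Appended {len(content)} characters to {relpath}"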
Running and calling a server
I am using ollama to run a local LLM and make it interact with the FastMCP tools. Here is the general architecture of how everything interacts in this experiment. I have defined tools and resources in a FastMCP server script (a few of them are shown above):
fastmcp run server.py:mcp --transport http --port 8000
I need to have ollama running, so let’s launch it with the desired model:
ollama run llama3.2:3b-instruct-fp16 --verbose
Now it is time to run an MCP client that glues it all together:
flowchart LR
subgraph Notes[📄 Markdown Notes]
end
subgraph Server[FastMCP Server]
direction TB
T1[tools]
T2[resources]
end
subgraph Client[FastMCP Client]
end
subgraph Ollama[Ollama API]
end
LLMs[LLMs / Embeddings]
%% Arrows to notes
Notes -->|"reads"| Server
Server -->|"creates"| Notes
%% Bidirectional arrows
Server <--> Client
Client <--> Ollama
Ollama --> LLMs
To go beyond the simple exact-word search, I also built an embedding index over chunked notes, which the model can query to perform a semantic search when needed:
flowchart LR
Notes[(📝 Markdown notes)]
Server["🧩 FastMCP Server
build_embed_index"]
Vectors[(🧊 vectors.npy)]
Ollama["🦙 Ollama API"]
Model{{"🤖 bge-m3:latest
embedding model"}}
%% Two stacked arrows Notes ↔ Server
Notes -->|"read & chunk"| Server
Server -->|"(optional) update"| Notes
%% Embedding flow
Server <-->|"embed chunks"| Ollama
Model -->|"loaded and called"| Ollama
Server -->|"write embeddings"| Vectors
Prompts, the overall workflow logic, the FastMCP client, retry loops and guardrails are launched via a Python script that follows the structure shown above:
python bridge_mcp_ollama.py
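The heart of that bridge is a loop that sends the conversation to ollama, executes any tool calls through the FastMCP client, feeds the results back, and stops on END_OF_RUN or after a maximum number of steps. The sketch below is stripped down and makes assumptions (the ollama Python package, the /mcp URL path, the chosen model); prompts, guardrails and error handling are omitted.
# Stripped-down sketch of the bridge loop: ollama decides which MCP tool to
# call, the FastMCP client executes it, and the result is fed back.
import asyncio
import ollama
from fastmcp import Client

MAX_STEPS = 15
MODEL = "granite4:7b-a1b-h"

async def run(messages: list) -> None:
    async with Client("http://127.0.0.1:8000/mcp") as mcp_client:
        mcp_tools = await mcp_client.list_tools()
        # Expose the MCP tools to ollama in its function-calling format.
        tools = [{
            "type": "function",
            "function": {
                "name": t.name,
                "description": t.description or "",
                "parameters": t.inputSchema,
            },
        } for t in mcp_tools]

        for _ in range(MAX_STEPS):
            response = ollama.chat(model=MODEL, messages=messages, tools=tools)
            message = response["message"]
            messages.append(message)
            if not message.get("tool_calls"):
                if "END_OF_RUN" in (message.get("content") or ""):
                    break  # the model declared it is done
                continue
            for call in message["tool_calls"]:
                result = await mcp_client.call_tool(
                    call["function"]["name"], call["function"]["arguments"]
                )
                messages.append({"role": "tool", "content": str(result)})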
Experiments with creating and curating notes
The main problem: even with simple prompts, small local models were highly variable. Some runs worked well, others confidently hallucinated notes that never existed.
In notes-search experiments, a good run has the model pull its list of tools and resources, read the relevant notes and output a summary. In a bad run it hallucinates a note without even having searched for anything.
In note-creation experiments (populating the notes sandbox with a variety of credible notes), a good run writes a few creative ideas into .md files in the folder. Sometimes the model creatively hallucinates having created notes. I had trouble reliably producing more than two or three notes. I will need more experiments with task flows, prompt semantics, nudging gates and various models before showing quantitative results; the runs I have done so far are not conclusive.
What was useful
- Setting up an overall loop that stops iterating after a maximum number of steps (for example 15 or 30), in case of infinite loops.
- Prompting for an explicit end-of-task tag (here, END_OF_RUN) to avoid stopping too early. At first, my only rule was that an empty response ended the loop.
- Guiding the model with a catalog that lists functions and tools, so that a lost model can fetch a list of usable tools if needed.
- Having a scaffolding of an example sequence of tools following a logical path, so that the model has an inkling of what to do and in which order.
- Having a set of tracked “must be called” tools that force the model to keep looping as long as that set has not been called: for example WRITE_TOOLS = {"append_note"} when I wanted to populate the notes sandbox with AI-generated notes (see the sketch after this list). This reduced hallucinated statements like “END_OF_RUN: I wrote 5 notes” when in fact nothing had been created.
- Having a model that is both reliable at tool calling (which calls for a rather low temperature) and creative enough to output interesting text is a hard compromise to reach, but it can be circumvented by calling a model dedicated to creative writing and using the tool-calling model as an orchestrator of that content-creating tool.
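Concretely, the “must be called” guardrail can be as small as this (names are illustrative; in the real bridge, called_tools is updated every time a tool call is executed):
# Illustrative guardrail: refuse an early END_OF_RUN while required tools
# are still missing; called_tools.add(name) happens wherever tool calls run.
WRITE_TOOLS = {"append_note"}
called_tools: set[str] = set()

def may_finish(message_content: str) -> bool:
    """Accept END_OF_RUN only once every required tool has actually been called."""
    wants_to_finish = "END_OF_RUN" in message_content
    missing = WRITE_TOOLS - called_tools
    return wants_to_finish and not missing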
Here is one of my system prompts exemplifying these different ideas:
{"role": "system",
"content":
"You have tools: list_resources, read_resource, catalog, search_notes, read_note, append_note, summarize_note, similar_notes, build_embed_index, refresh_list.\n"
"Strategy: call catalog(). call refresh_list() to register the notes listing resource. Call list_resources() and look for a URI like resource://notes-list. \n"
"Call read_resource on that URI to inspect the sandbox. append_note to write new notes.\n"
"ALWAYS produce a final natural-language answer ONLY AT THE END, starting the final message with the line 'END_OF_RUN'.\n"
"Until you reach END_OF_RUN, prefer issuing tool calls over plain text.\n"
"If more information is needed, call tools instead of asking the user.\n"
"Operator stance:\n"
"- Act-first. Prefer issuing tool calls over speculative text.\n"
"- If uncertain, choose a sensible default and proceed.\n"
"- Stop when the job is complete and you have created all required notes and the Operator Log.\n"},
In general, with the small local models I tried, there was a lot of variability in the results, even with simple prompts. Tool calling by small local models does not seem like a solved problem to me.
Good results come from choosing the right model for the job (orchestrator versus writer, for instance) and adding enough guardrails and explicit workflows.
I will keep iterating more on the subject, and will keep you updated once I have stronger quantitative results.