MCP glossary

Plain-English definitions of the terms you'll meet across the Model Context Protocol ecosystem.

Agent memory

Agent memory is persistent context that an AI agent can write to and read back across sessions, so it remembers facts, decisions, and preferences instead of starting cold every conversation.

Agent orchestration

Agent orchestration is the coordination of multiple AI agents or steps toward a goal, deciding which agent or tool runs when, how results pass between them, and how shared state and memory are kept in sync.

Agentic RAG

Agentic RAG is retrieval-augmented generation driven by an agent that decides when and what to retrieve, can issue multiple searches, refine its queries, and use tools, instead of running a single fixed retrieve-then-answer step.

Agentic workflow

An agentic workflow is a multi-step process driven by an AI agent that chains tool calls, decisions, and intermediate results to accomplish a task, rather than relying on a single model response.

AI agent

An AI agent is a system built around a language model that can pursue a goal over multiple steps, deciding which tools to call, observing results, and adjusting, rather than producing a single one-shot answer.

API key

An API key is a secret string that identifies and authenticates a caller to a service. Keys are simple to use but dangerous if leaked, so they should be scoped, rotated, and stored by the service as hashes, never in plaintext.
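
As a rough illustration of storing keys as hashes, the sketch below assumes a high-entropy random key and a plain SHA-256 digest; the function names are hypothetical and not taken from any particular service.

```python
import hashlib
import hmac
import secrets


def generate_api_key() -> str:
    """Create a high-entropy random key that is shown to the caller once."""
    return secrets.token_urlsafe(32)


def hash_api_key(key: str) -> str:
    """Return the hex SHA-256 digest the service stores instead of the key."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

With this layout a leaked database row exposes only the digest, not a usable key.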

Approximate nearest neighbor (ANN)

Approximate nearest neighbor is a class of algorithms that find the vectors most similar to a query quickly by trading a little accuracy for huge speed gains, making vector search practical at scale.

Bearer token

A bearer token is a credential that grants access to whoever holds it, sent in the HTTP Authorization header; some remote MCP servers accept a static one as a simpler alternative to a full OAuth flow.

Canary deployment

A canary deployment ships a new version to a small slice of traffic first, watches its metrics against the stable version, and only rolls out to everyone once the canary proves healthy.

Capability negotiation (MCP)

Capability negotiation is the MCP initialization handshake where client and server each declare which features they support, so both sides only use functionality the other side actually implements.

Chain of thought

Chain-of-thought (CoT) prompting elicits intermediate reasoning steps from an LLM before its final answer, improving accuracy on multi-step problems by letting the model work through them in tokens.

Change data capture (CDC)

Change data capture is a technique for detecting and streaming every insert, update, and delete in a database as it happens, so other systems can react to changes in near real time instead of repeatedly polling.

Chunking

Chunking is splitting a large document into smaller passages before embedding it, so retrieval can return focused, relevant pieces that fit a model's context window instead of whole files.
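
One simple way to chunk is by word count. The sketch below is an assumption-laden example: the sizes are arbitrary, and the overlap between neighbouring chunks is a common extra that the definition above does not require.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `size` words.

    Consecutive chunks share `overlap` words so a sentence cut at a
    boundary still appears whole in one of them (an optional refinement).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Real pipelines often split on sentences, headings, or tokens instead of raw words.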

Coding agent

A coding agent is an AI agent specialized for software work; it reads a codebase, edits files, runs commands and tests, and iterates toward a goal, usually inside an IDE or terminal.

Confused deputy attack

A confused deputy attack tricks a trusted intermediary into misusing its authority on an attacker's behalf. In MCP, it arises when a server forwards a token meant for itself to an upstream API that accepts it.

Context compaction

Context compaction shrinks a growing conversation or agent history so it fits the context window, typically by summarizing old turns, dropping low-value content, or offloading detail to external memory.

Context engineering

Context engineering is the practice of deliberately curating what goes into a model's context window (instructions, tools, retrieved data, and memory) so the model has exactly what it needs and nothing that distracts it.

Context rot

Context rot is the degradation in an LLM's answer quality as its context window fills up with stale, redundant, or low-signal tokens, so older and middle content gets effectively ignored even though it technically still fits.

Context window

A context window is the maximum amount of text, measured in tokens, that a language model can consider at once, covering the prompt, conversation history, retrieved data, and the model's own output.

Cosine similarity

Cosine similarity measures how alike two vectors are by the angle between them rather than their length, producing a score from -1 to 1 that is the standard way to compare embeddings in semantic search.
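
The score is the dot product of the two vectors divided by the product of their lengths. A minimal pure-Python sketch, with a hypothetical function name:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score close to 1 regardless of length, perpendicular vectors score 0, and opposite vectors score -1.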

Data lake

A data lake is a central store that holds raw data of any shape (structured, semi-structured, and unstructured) at scale and low cost, with schema applied on read rather than on write, unlike a data warehouse, which enforces its schema when data is written.

Dead-letter queue

A dead-letter queue holds messages that a system failed to process after exhausting retries, so they are quarantined for inspection and reprocessing instead of being lost or blocking the main queue forever.
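
The in-memory sketch below shows the idea under simplifying assumptions: a real broker would persist both queues, and `process_with_dlq` and its retry limit are hypothetical names chosen for illustration.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Run handler on each message; quarantine messages that keep failing."""
    main_queue = deque((message, 0) for message in messages)  # (message, attempts)
    processed = []
    dead_letters = []
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception as exc:
            attempts += 1
            if attempts >= max_attempts:
                # Retries exhausted: park the message with its last error.
                dead_letters.append((message, repr(exc)))
            else:
                # Requeue at the back so other messages are not blocked.
                main_queue.append((message, attempts))
    return processed, dead_letters
```

Messages that land in `dead_letters` can be inspected and replayed once the underlying problem is fixed.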

Distributed tracing

Distributed tracing follows a single request as it hops across services, stitching per-service spans into one end-to-end trace so you can see where time went and which hop failed in a microservice system.

Dynamic Client Registration

Dynamic Client Registration (DCR) is the OAuth mechanism that lets an MCP client register itself with a server's authorization server at runtime, so users do not have to manually create client credentials.

Elicitation (MCP)

Elicitation is a Model Context Protocol feature that lets a server pause mid-operation to ask the user for specific structured input, rather than failing or guessing when it needs more information.

ELT (Extract, Load, Transform)

ELT loads raw data into a warehouse first and transforms it there using the warehouse's compute, the modern inversion of ETL that suits cloud data warehouses and lets analysts model data in SQL after the fact.

Embedding

An embedding is a vector of numbers that captures the meaning of a piece of text or other data, positioning semantically similar items close together so software can compare them by similarity.

Episodic memory

Episodic memory is an agent's record of specific past events (what happened in a particular session, when, and in what order) so it can recall and learn from concrete experiences rather than only general facts.

Error budget

An error budget is the amount of unreliability a service is allowed under its SLO (the gap between the target and 100%). Teams spend it on deploys, experiments, and risk; once it runs out, they stop shipping and stabilize.
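
For a request-based SLO the arithmetic is short. The sketch below assumes the budget is counted in failed requests over the SLO window; the function name and figures are illustrative only.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int):
    """Return (allowed_failures, remaining_failures) for a request-based SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed = (1 - slo_target) * total_requests
    remaining = allowed - failed_requests
    return allowed, remaining


# A 99.9% target over one million requests allows about 1,000 failures.
allowed, remaining = error_budget(0.999, 1_000_000, failed_requests=250)
print(round(allowed), round(remaining))
```

A negative remaining value means the budget is overspent for the window.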

ETL

ETL (Extract, Transform, Load) is the pipeline pattern that pulls data from source systems, reshapes and cleans it, then writes it into a destination like a data warehouse. ELT swaps the order, loading raw first and transforming in-warehouse.

Eval

An eval is a structured test that measures how well an LLM or agent performs on a task, run over a dataset of cases with scoring, so you can track quality and catch regressions as prompts and models change.

Event-driven architecture

Event-driven architecture builds systems around events (facts that something happened) that are emitted and reacted to asynchronously, so services stay loosely coupled and can scale and evolve independently.

Eventual consistency

Eventual consistency is a guarantee that, if no new writes occur, all replicas of data will converge to the same value over time, trading immediate global agreement for higher availability and lower latency.

Exponential backoff

Exponential backoff is a retry strategy that waits progressively longer between attempts, doubling the delay each time, usually with random jitter, so a client recovers from transient failures without overwhelming an already-struggling service.
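
A minimal sketch of the pattern follows. It assumes every exception is worth retrying and uses the "full jitter" variant, where each wait is drawn at random between zero and the current cap; the function and parameter names are hypothetical.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying failures with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:  # assumption: all errors are transient
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

In practice you would retry only errors known to be transient, such as timeouts or HTTP 429 and 503 responses.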

FastMCP

FastMCP is a Python (and TypeScript) framework that lets you build MCP servers by decorating ordinary functions, handling the protocol's transport, schema generation, and lifecycle so you write tools instead of plumbing.

Feature flag

A feature flag is a runtime switch that turns a piece of functionality on or off without a redeploy, letting teams ship code dark, roll out gradually, target cohorts, and kill a bad change instantly.

Few-shot prompting

Few-shot prompting steers an LLM by including a handful of input-output examples in the prompt, letting the model infer the desired format and behavior at inference time without any fine-tuning.

Fine-tuning

Fine-tuning further trains a pretrained LLM on a curated dataset to specialize its behavior, style, or domain knowledge, baking patterns into the weights instead of supplying them in every prompt.

Function calling

Function calling is the model-API feature that lets a language model return a structured request to invoke a named function with JSON arguments, instead of plain text; it is the foundation tool calling builds on.

GraphQL

GraphQL is a query language and API style where the client specifies exactly the fields it wants in one request against a typed schema, avoiding the over- and under-fetching common with REST and exposing a single flexible endpoint.

GraphRAG

GraphRAG is retrieval-augmented generation that retrieves from a knowledge graph of entities and relationships, not just isolated text chunks, so the model can follow connections and answer questions that span many documents.

gRPC

gRPC is a high-performance RPC framework from Google that uses Protocol Buffers over HTTP/2 for compact, strongly-typed, bidirectional service-to-service calls, popular for internal microservice communication.

Guardrails

Guardrails are the checks and constraints placed around an LLM or agent (on inputs, outputs, and tool calls) to keep behavior safe, on-policy, and within bounds, independent of the model's own cooperation.

Hallucination

A hallucination is a confident but false or fabricated output from an LLM: an invented fact, citation, API, or tool argument that looks plausible but has no basis in the model's input or in reality.

Human in the loop

Human in the loop (HITL) inserts a person's approval, correction, or input into an automated AI workflow at key decision points, so high-stakes or uncertain actions get human judgment before they execute.

Hybrid search

Hybrid search combines semantic (vector) search with keyword (lexical) search and merges the results, capturing both meaning-based matches and exact terms like product names or error codes.
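
The definition does not say how the two result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the only method; `k=60` is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs into a single ranking.

    Each list contributes 1 / (k + rank) to a document's score, so items
    ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]    # hypothetical semantic-search results
keyword_hits = ["b", "d", "a"]   # hypothetical keyword-search results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

RRF needs only ranks, not raw scores, which sidesteps the problem that vector and keyword scores live on different scales.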

Idempotency

An operation is idempotent if running it many times has the same effect as running it once. It is what makes safe retries possible when a network call's outcome is uncertain, a core concern for agents.
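
A common way to make a non-idempotent action safe to retry is an idempotency key supplied by the caller. The service below is a hypothetical in-memory sketch of that idea, not a real payments API.

```python
class PaymentService:
    """Toy service that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> receipt already returned
        self.charges = []    # side effects that actually happened

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored receipt without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append(amount_cents)
        receipt = {"charge_number": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt
```

An agent that times out waiting for a response can resend the same key and be sure the customer is charged at most once.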

Inference

Inference is the act of running a trained model to produce outputs, the step where an LLM actually generates tokens for a prompt, as opposed to training. It is where ongoing cost, latency, and throughput live in production.

Jailbreak

A jailbreak is a crafted prompt that gets an LLM to bypass its safety training and produce content it was aligned to refuse, using role-play, obfuscation, or instruction-overriding tricks to defeat the model's guardrails.

JSON Schema

JSON Schema is a standard vocabulary for describing the structure of JSON data (types, required fields, and constraints) so it can be validated automatically; MCP uses it to define tool inputs and outputs.

JSON-RPC

JSON-RPC is a lightweight remote-procedure-call protocol that encodes requests and responses as JSON objects; the Model Context Protocol uses JSON-RPC 2.0 as its wire format.

Knowledge graph

A knowledge graph stores information as entities (nodes) and the relationships (edges) between them, letting an agent traverse connections, like which person owns which service, rather than just matching text.

Latency vs throughput

Latency is how long a single request takes; throughput is how many requests a system completes per unit time. In LLM serving the two trade off: batching raises throughput but can raise per-request latency, so you tune for one or the other.

LLM observability

LLM observability is the practice of capturing and inspecting what an AI application actually does at runtime (prompts, responses, tool calls, tokens, latency, and cost) so you can debug, monitor quality, and control spend.

LLM-as-judge

LLM-as-judge is using a language model to score or compare other models' outputs against criteria you define, automating evaluation that would otherwise need slow, expensive human grading.

Local MCP server

A local MCP server runs as a process on your own machine, usually launched by the host over the stdio transport, so it can touch local files, a local Git checkout, or databases on your network.

Long-term memory (agents)

Long-term memory is the durable store an AI agent writes facts and experiences to so they survive across sessions, retrieved back into context only when a later task needs them. It is the counterpart to the transient context window.

LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that freezes a model's weights and trains small low-rank adapter matrices instead, so you customize a model with a tiny fraction of the compute and storage of full fine-tuning.

Lost in the middle

Lost in the middle is the well-documented tendency of LLMs to recall information placed at the start or end of a long context far more reliably than information buried in the middle, even when all of it fits in the window.

MCP authorization

MCP authorization is the protocol's OAuth 2.1-based scheme for securing remote servers: the server is an OAuth resource server, the client obtains a tightly-scoped access token, and tokens are audience-bound via Resource Indicators.

MCP client

An MCP client is the component inside an AI application, such as Claude Code, Cursor, or VS Code, that connects to an MCP server, discovers its tools, and lets the model call them on the user's behalf.

MCP gateway

An MCP gateway is a proxy that sits between agents and many MCP servers, presenting one endpoint while it handles routing, authentication, access control, and observability for the servers behind it.

MCP host

An MCP host is the application a user actually interacts with, like Claude Desktop, Cursor, or an IDE, that embeds one or more MCP clients and lets the model use connected servers.

MCP Inspector

MCP Inspector is the official developer tool for testing MCP servers: it connects to a server, lists its tools, resources, and prompts, and lets you invoke them interactively to debug behavior.

MCP prompt

An MCP prompt is a reusable, parameterized message template an MCP server offers, typically surfaced as a slash command or menu item the user picks to kick off a structured task.

MCP registry

An MCP registry is a catalog of available MCP servers, with metadata like install commands and capabilities, that helps users and hosts discover, vet, and connect to servers without hunting across repos.

MCP resource

An MCP resource is read-only data an MCP server exposes by URI, like a file, a database row, or a document, that the host can load into the model's context without the model taking an action.

MCP roots

Roots are a Model Context Protocol primitive where the client tells the server which filesystem or URI boundaries it is allowed to operate within, scoping a server's access to a defined set of locations.

MCP sampling

Sampling is a Model Context Protocol feature that lets a server request a completion from the client's language model, so the server can use the model's reasoning without holding its own API key.

MCP server

An MCP server is a program that exposes tools, resources, and prompts to AI agents over the Model Context Protocol, giving a model a uniform way to read data or take actions in an external system.

MCP session

An MCP session is a single stateful connection between a client and server, from the initialize handshake to disconnect, over which negotiated capabilities, context, and requests persist.

MCP tool

An MCP tool is a named, schema-described action that an MCP server exposes for a model to call, like creating an issue or running a query; the model invokes it and the server runs the work.

mcp-remote

mcp-remote is a bridge utility that lets MCP hosts which only speak the local stdio transport connect to remote, OAuth-protected MCP servers, handling the HTTP transport and sign-in flow on their behalf.

mcpServers config

The mcpServers config is the JSON block, used by Claude Desktop, Cursor, Cline, and others, that registers MCP servers with a client by naming each one and giving its command, args, env, or URL.
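
The sketch below builds such a block in Python and prints it as JSON. The server names, the environment variable, and the URL are placeholders, and the exact keys each client supports vary, so treat this as the general shape rather than a copy-paste config.

```python
import json

config = {
    "mcpServers": {
        # Local server launched as a subprocess (placeholder package name).
        "example-local": {
            "command": "npx",
            "args": ["-y", "some-mcp-server"],
            "env": {"EXAMPLE_API_KEY": "<your-key>"},
        },
        # Remote server reached by URL (placeholder address).
        "example-remote": {
            "url": "https://mcp.example.com/mcp",
        },
    }
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

Check the client's own documentation for where this file lives and which fields it accepts.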

Memory store

A memory store is the durable backend where an AI agent's long-term memory actually lives, the database or service that persists facts and observations and serves the relevant ones back into context on demand.

Message queue

A message queue is a buffer that decouples producers from consumers, letting one part of a system hand off work asynchronously so the receiver can process it at its own pace, with retries and durability.

Mixture of experts

A mixture-of-experts (MoE) model splits its feed-forward layers into many specialized expert subnetworks and a router that activates only a few per token, so the model has huge total capacity but only runs a fraction of it on any given input.

Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an open standard that lets AI applications connect to external tools, data, and services through a uniform interface, so any compliant client can use any compliant server.

Model distillation

Model distillation trains a smaller, cheaper student model to mimic a larger teacher model's outputs, capturing much of the teacher's capability at a fraction of the inference cost and latency.

Multi-agent system

A multi-agent system is an AI setup where several agents, often specialized, work together on a task, dividing the work, passing results between each other, and ideally sharing memory so their understanding stays consistent.

npx

npx is the Node.js package runner that downloads and executes an npm package in one step, which is why many local MCP servers are launched with a command like npx -y some-mcp-server.

OAuth 2.1

OAuth 2.1 is a consolidation of the OAuth 2.0 authorization framework that folds in security best practices, making PKCE mandatory and removing insecure flows, and it is the authorization standard the MCP spec adopts for remote servers.

OAuth for MCP

OAuth for MCP is how remote MCP servers authorize users: the spec adopts OAuth 2.1 so each user signs in and grants scoped access, instead of pasting a long-lived secret into a config file.

Persistent memory

Persistent memory is information an AI agent stores durably so it survives across sessions, letting the agent recall earlier facts and decisions instead of losing everything when the conversation ends.

PKCE

PKCE (Proof Key for Code Exchange) is an OAuth extension (RFC 7636), mandatory in OAuth 2.1, that stops stolen authorization codes from being redeemed, by binding the code to a secret the client proves it knows at token exchange.
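
With the S256 method, the client sends the SHA-256 hash of a random verifier when it starts the flow and reveals the verifier itself only at token exchange. A minimal sketch of both sides, with hypothetical function names:

```python
import base64
import hashlib
import hmac
import secrets


def make_verifier() -> str:
    """Client: create a random code verifier (43 URL-safe characters)."""
    return secrets.token_urlsafe(32)


def challenge_for(verifier: str) -> str:
    """Client: derive the S256 code challenge sent with the authorization request."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def server_accepts(verifier: str, stored_challenge: str) -> bool:
    """Server: at token exchange, recompute the challenge and compare."""
    return hmac.compare_digest(challenge_for(verifier), stored_challenge)
```

An attacker who intercepts the authorization code has seen only the challenge, so it cannot produce the verifier the server asks for.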

Progress notification (MCP)

A progress notification is an MCP message a server sends during a long-running operation to report incremental progress, so the client can show status instead of waiting blindly for the final result.

Prompt caching

Prompt caching lets an LLM reuse the computed state of a repeated prompt prefix across calls, so a long, stable system prompt or document is processed once and replayed cheaply on subsequent requests.

Prompt chaining

Prompt chaining decomposes a task into a fixed sequence of LLM calls, where each step's output feeds the next, trading a single complex prompt for several focused ones that are easier to control and debug.

Prompt injection

Prompt injection is an attack where adversarial text hidden in tool output, web pages, or documents hijacks an LLM agent's instructions, causing it to ignore its system prompt and follow the attacker's commands instead.

Publish-subscribe (pub/sub)

Publish-subscribe is a messaging pattern where producers publish events to a topic and any number of independent subscribers receive them, decoupling senders from receivers so systems can fan out events cleanly.

Quantization

Quantization shrinks a model by storing its weights and activations at lower numeric precision, for example 8-bit or 4-bit integers instead of 16-bit floats, cutting memory and speeding up inference, often with only a small loss in quality.
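
To make the idea concrete, the sketch below applies symmetric 8-bit quantization to a short list of weights. It assumes a single scale for the whole list; production schemes usually work per channel or per block, and the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25]
quantized, scale = quantize_int8(weights)
print(quantized, dequantize(quantized, scale))
```

The recovered values differ slightly from the originals; that rounding error is the quality cost being traded for the smaller storage.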

RAG vs MCP

RAG and MCP solve different layers: RAG is a technique for retrieving relevant text and injecting it into a prompt, while MCP is a protocol for connecting models to tools and data sources, including RAG retrievers.

Rate limiting

Rate limiting caps how many requests or tokens a client may consume in a time window, protecting a service from overload and abuse. LLM and MCP APIs enforce it, and agents must handle it gracefully.
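
The simplest form is a fixed-window counter, sketched below with a hypothetical class name and an injectable clock so it can be exercised without waiting. Real services often use sliding windows or token buckets instead.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            # A new window has begun: reset the counter.
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # caller should back off and retry later
```

On the client side, a rejected request is the signal to wait and retry rather than to loop immediately.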

ReAct agent

A ReAct agent interleaves reasoning and acting: the model alternates between thinking out a next step and calling a tool, feeding each tool result back in, until it has enough information to answer.

Red-teaming

Red-teaming is the practice of deliberately attacking your own AI system, probing for jailbreaks, prompt injection, data leaks, and harmful outputs, to find failure modes before adversaries or real users do.

Reflexion

Reflexion is an agent technique where the model verbally critiques its own failed attempts and stores those reflections in memory, so later attempts at the same task improve without retraining the weights.

Remote MCP server

A remote MCP server runs as a hosted service at a URL and connects over Streamable HTTP, usually with OAuth, so multiple users and machines can share one always-on integration.

Reranking

Reranking is a second retrieval pass that reorders an initial set of candidate results by relevance using a more accurate model, so the best passages rise to the top before they reach the agent.

Resource indicator

A resource indicator (RFC 8707) is an OAuth `resource` parameter that names the exact API a token is meant for, binding the token's audience so it cannot be replayed against a different service. MCP requires it.

REST API

A REST API exposes resources as URLs manipulated with standard HTTP verbs (GET, POST, PUT, DELETE), typically returning JSON. It is the most common style of web API, and many MCP servers wrap one to give agents tool access.

Retrieval-augmented generation (RAG)

RAG is a technique that retrieves relevant passages from an external knowledge source and inserts them into the model's prompt, so the answer is grounded in your data rather than only the model's training.

Reverse ETL

Reverse ETL pushes modeled data from the warehouse back into operational tools like CRMs, ad platforms, and support apps, so the metrics analysts compute become usable by the teams who act on them.

Schema migration

A schema migration is a versioned, repeatable change to a database's structure (adding a column, creating an index, backfilling data) applied in order so every environment converges on the same schema.

Semantic memory

Semantic memory is an agent's store of general, timeless knowledge (facts, conventions, preferences, and how things work), abstracted away from when or how it was learned, so the agent knows things rather than just recalling events.

Semantic search

Semantic search finds results by meaning rather than exact keywords, comparing vector embeddings of the query and documents so it surfaces relevant matches even when the wording differs.

Shared agent memory

Shared agent memory is a memory store that multiple agents or teammates read from and write to in common, so knowledge one agent learns is instantly available to every other agent on the team.

Short-term memory (agents)

Short-term memory is the recent context an AI agent holds during the current session (the conversation so far and the latest tool results); it lives in the context window and is lost when the session ends.

SLI (Service Level Indicator)

A Service Level Indicator is a concrete metric of service health, such as the ratio of successful requests to total requests, that you measure directly and compare against an SLO target.

SLO (Service Level Objective)

A Service Level Objective is a target reliability threshold for a service, like 99.9% of requests succeeding over 30 days, that teams measure against and use to decide whether they can ship risk or must slow down.

SSE transport

SSE transport is the older MCP remote transport that paired Server-Sent Events for server-to-client streaming with HTTP POST for client requests; it has been superseded by Streamable HTTP.

stdio transport

The stdio transport runs an MCP server as a local subprocess and exchanges protocol messages over its standard input and output streams, the default way to run a local MCP server.

Streamable HTTP

Streamable HTTP is the MCP transport for remote servers, carrying protocol messages over HTTP with streaming responses; it superseded the older HTTP+SSE transport.

Structured output

Structured output is machine-readable data returned in a defined shape, such as JSON validated against a schema, so a program or agent can parse it reliably instead of scraping free-form text.

System prompt

A system prompt is the high-priority instruction set, placed before the conversation, that defines an LLM's role, rules, tone, and available tools, shaping every response in the session.

Temperature (LLM)

Temperature is a sampling parameter that controls randomness in an LLM's output: low values make responses focused and deterministic, high values make them more varied and creative.
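
Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below uses made-up logits and a hypothetical function name.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more variety when sampling
```

As the temperature approaches zero the distribution concentrates on the highest-scoring token, which is why low settings behave almost deterministically.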

Token

A token is the atomic unit an LLM reads and generates: a word, sub-word, or character fragment. Models measure context length, throughput, and pricing in tokens, not words or characters.

Tokenizer

A tokenizer converts raw text into the sequence of tokens an LLM consumes, and back again. Its vocabulary and splitting rules determine how many tokens a given string costs.

Tool calling

Tool calling is the pattern where a language model, given a set of described tools, decides to invoke one with structured arguments; the system runs it and feeds the result back into the conversation.

Tool poisoning

Tool poisoning is an indirect prompt-injection attack where malicious instructions are hidden in an MCP tool's metadata (its name or description) or in its runtime output, so an agent reads them as trusted input and acts on them.

Tracing

Tracing records the full path of a request as it moves through a system, capturing each step as a timed, nested span, so you can see exactly what happened, in what order, and where time or errors went.

uvx

uvx is the package runner from the uv Python toolchain that fetches and runs a Python tool in an ephemeral environment, making it the common way to launch Python-based local MCP servers.

Vector database

A vector database stores data as high-dimensional embeddings and finds items by similarity rather than exact match, making it the storage layer behind semantic search and retrieval-augmented generation.

Vibe coding

Vibe coding is building software by describing what you want to an AI coding agent in natural language and accepting its generated code, steering by intent and results rather than writing every line yourself.

Webhook

A webhook is a user-defined HTTP callback: instead of polling an API for changes, you register a URL and the service POSTs an event to it the moment something happens. Webhooks are the push-based backbone of integrations and automation.

Well-known URI

A well-known URI is a standardized path under /.well-known/ where a server publishes metadata for automatic discovery; MCP clients fetch these to learn a remote server's OAuth configuration.

Working memory

Working memory is the information an AI agent actively holds for the task in front of it (the current goal, recent steps, and tool results), kept live in the context window and discarded once the task is done.

Zod

Zod is a TypeScript-first schema validation library where you declare a schema once and get both runtime validation and a static type for free, widely used to define and enforce MCP tool inputs and LLM outputs.