A Serverless AI Agent system built on Claude Agent SDK, implementing stateful conversation persistence across stateless containers using S3+DynamoDB.
Exploratory Project | This project explores how to achieve stateful AI Agent sessions using FileSystem + Stateless Containers (AWS Lambda with Firecracker runtime as the foundation). It demonstrates how to maintain conversation persistence across stateless function invocations.
    Telegram User → Bot API → API Gateway → Producer Lambda → SQS FIFO Queue → Consumer Lambda
                                                   ↓                                  ↓
                                              Return 200                     agent-server Lambda
                                              immediately                             ↓
                                                     DynamoDB (Session mapping) + S3 (Session files) + Bedrock (Claude)
Core Design:
- Uses the Hybrid Sessions pattern recommended by Claude Agent SDK
- SQS FIFO Async Architecture: Producer returns 200 immediately to Telegram, Consumer processes requests asynchronously with message ordering guarantee
- Session Persistence: DynamoDB for mapping storage, S3 for conversation history, cross-request recovery support
- Multi-tenant Isolation: Client isolation based on Telegram chat_id + thread_id
- Forum Group Support: Topic-based conversation isolation with auto-precheck
- User Whitelist: Control private chat and group invitation permissions
- SubAgent Support: Configurable specialized Agents (e.g., AWS support) with example implementations
- Skills Support: Reusable skill modules with hello-world example
- MCP Integration: Support for HTTP and local command-based MCP servers (Node.js 20+)
- Security: Telegram Webhook secret token verification (HMAC)
- Auto Cleanup: 25-day TTL + S3 lifecycle management
- SQS FIFO Queue: Ordered async processing + auto retry + dead letter queue
- Quick Start: Provides example Skill/SubAgent/MCP configurations for adding other components
| Command | Description |
|---|---|
| `/newchat <message>` | Create a new Topic in a Forum group and start a conversation |
| `/debug` | Download current session files (conversation.jsonl, debug.txt, todos.json) |
| `/start` | Welcome message (private chat) |
| `/help` | Show help message |
├── agent-sdk-server/ # Agent Runtime (Docker Container)
│ ├── handler.py # Lambda Entry Point
│ ├── agent_session.py # SDK Wrapper
│ ├── session_store.py # Session Persistence
│ └── claude-config/ # Configuration Files
│ ├── agents.json # SubAgent Definitions
│ ├── mcp.json # MCP Server Configuration
│ ├── skills/ # Skills Definitions
│ │ └── hello-world/ # Example Skill
│ └── system_prompt.md # System Prompt
│
├── agent-sdk-client/ # Telegram Client (ZIP Deployment)
│ ├── handler.py # Producer: Webhook receiver, writes to SQS
│ ├── consumer.py # Consumer: SQS consumer, calls Agent
│ ├── config.py # Configuration management
│ ├── config.toml # Command configuration
│ └── security.py # Security utilities
│
├── docs/ # Documentation
│ └── anthropic-agent-sdk-official/ # SDK Official Docs Reference
│
├── template.yaml # SAM Deployment Template
└── samconfig.toml # SAM Configuration
- AWS CLI + SAM CLI
- Docker
- Amazon Bedrock access (Claude models)
- Telegram Bot Token
- Copy and modify configuration files:

      cp .env.example .env
      # Edit .env to fill in required environment variables

- Build and deploy:

      sam build
      sam deploy --guided

| Variable | Description |
|---|---|
| `SESSION_BUCKET` | S3 bucket name (auto-created) |
| `SESSION_TABLE` | DynamoDB table name (auto-created) |
| `BEDROCK_ACCESS_KEY_ID` | Bedrock access key |
| `BEDROCK_SECRET_ACCESS_KEY` | Bedrock secret key |
| `SDK_CLIENT_AUTH_TOKEN` | Internal authentication token |
| `TELEGRAM_BOT_TOKEN` | Telegram Bot Token |
| `TELEGRAM_WEBHOOK_SECRET` | (Optional) Webhook secret for security verification |
| `QUEUE_URL` | SQS queue URL (auto-created) |
- Runtime: Python 3.12 + Claude Agent SDK
- Computing: AWS Lambda (ARM64)
- Storage: S3 + DynamoDB
- Message Queue: AWS SQS (FIFO Queue + DLQ)
- AI: Claude via Amazon Bedrock
- Orchestration: AWS SAM
- Integration: Telegram Bot API + MCP
Problem Solved: Telegram Webhook times out and retries after ~27s, while Agent processing may take 30-70s, causing duplicate responses.
Solution:
- Producer Lambda receives Webhook, writes to SQS FIFO, returns 200 immediately (<1s)
- Consumer Lambda consumes from SQS, calls Agent Server, sends response to Telegram
- FIFO queue ensures message ordering within same session (MessageGroupId = chat_id:thread_id)
- A failing message is attempted up to 3 times in total, then moved to the dead letter queue (DLQ)
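
The Producer step can be sketched roughly as follows. This builds the arguments for an SQS `send_message` call from a Telegram update; the helper name is hypothetical, and using `update_id` as the deduplication ID is an assumption about how duplicate webhook deliveries might be suppressed.

```python
import json


def build_sqs_message(update: dict, queue_url: str) -> dict:
    """Build keyword arguments for an SQS FIFO send_message call."""
    message = update.get("message") or {}
    chat_id = message["chat"]["id"]
    # message_thread_id is only present for messages inside a Forum topic.
    thread_id = message.get("message_thread_id") or 0
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(update),
        # Messages sharing a group ID are delivered in order.
        "MessageGroupId": f"{chat_id}:{thread_id}",
        # Assumption: update_id identifies the update, so a webhook retry of the
        # same update is dropped within the FIFO deduplication window.
        "MessageDeduplicationId": str(update["update_id"]),
    }
```

With boto3 this would be used as `boto3.client("sqs").send_message(**build_sqs_message(update, queue_url))`, after which the handler returns 200 to Telegram.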
Queue Configuration:
- FifoQueue: true (ordered delivery per MessageGroupId)
- VisibilityTimeout: 900s (= Lambda timeout)
- maxReceiveCount: 3 (a message is delivered at most 3 times before it moves to the DLQ)
- DLQ Alarm: CloudWatch alarm triggers when messages enter DLQ
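
The project defines its queue in `template.yaml`; purely as an illustration, the same settings expressed as the string attribute map that the SQS API accepts might look like this. The function name and the ARN in the usage note are placeholders.

```python
import json


def fifo_queue_attributes(dlq_arn: str, max_receive_count: int = 3,
                          visibility_timeout_s: int = 900) -> dict[str, str]:
    """Return SQS queue attributes as the string-to-string map the API expects."""
    return {
        "FifoQueue": "true",
        "VisibilityTimeout": str(visibility_timeout_s),
        # RedrivePolicy is itself a JSON-encoded string.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }
```

For example, `fifo_queue_attributes("arn:aws:sqs:us-east-1:123456789012:example-dlq.fifo")` yields the three attributes listed above. Note that the dead letter queue of a FIFO queue must itself be a FIFO queue.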
Lifecycle:
- New message → Query DynamoDB mapping
- Mapping exists → Download `conversation.jsonl` from S3 → Restore session
- No mapping → Create new session → Save mapping to DynamoDB
- Processing done → Upload updates to S3
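
The lifecycle above can be sketched with in-memory stand-ins. In this toy version plain dicts play the roles of DynamoDB and S3, and the class name, method names, and object key layout (`<session_id>/conversation.jsonl`) are assumptions rather than the contents of `session_store.py`.

```python
import uuid


class SessionStore:
    """Toy session store; dicts stand in for DynamoDB and S3."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}    # session_key -> session_id (DynamoDB role)
        self.objects: dict[str, bytes] = {}  # object key -> file bytes (S3 role)

    def load(self, session_key: str) -> tuple[str, bytes]:
        """Return (session_id, history), creating a session if none is mapped."""
        session_id = self.mapping.get(session_key)
        if session_id is None:
            # No mapping: create a new session and record the mapping.
            session_id = uuid.uuid4().hex
            self.mapping[session_key] = session_id
            return session_id, b""
        # Mapping exists: fetch the stored history so the session can be restored.
        return session_id, self.objects.get(f"{session_id}/conversation.jsonl", b"")

    def save(self, session_id: str, history: bytes) -> None:
        """Store the updated history after processing finishes."""
        self.objects[f"{session_id}/conversation.jsonl"] = history
```

A real implementation would replace the two dicts with DynamoDB and S3 calls and write the downloaded file to the container's local filesystem before resuming the session.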
Persistent Files:
- `conversation.jsonl` - Conversation history (required for restoration)
- `debug.txt` - Debug logs
- `todos.json` - Task status
Edit `agent-sdk-client/config.toml`:

    [agent_commands]
    commands = ["/custom-skill", "/hello-world"]

    [local_commands]
    # Static response
    help = { type = "static", response = "Hello World" }
    # Handler function
    newchat = { type = "handler", handler = "newchat" }
    debug = { type = "handler", handler = "debug" }

    [security]
    user_whitelist = ["all"]  # or [123456789, 987654321]

Edit `agent-sdk-server/claude-config/agents.json`:

    {
      "agent-name": {
        "description": "Agent description",
        "prompt_file": "agents/prompt.md",
        "tools": ["specific tool name"],
        "model": "haiku"
      }
    }

Note: The `tools` field does not support wildcards; you must specify complete tool names.
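
Because wildcards in `tools` are not supported, it can help to catch them before deployment. The following is a hypothetical pre-deploy check, not part of the project; which fields are treated as required is an assumption based on the example above.

```python
import json


def validate_agents(raw: str) -> dict:
    """Parse an agents.json document and reject wildcard tool names."""
    agents = json.loads(raw)
    for name, spec in agents.items():
        # Assumption: these three fields are required for every SubAgent.
        for field in ("description", "prompt_file", "tools"):
            if field not in spec:
                raise ValueError(f"{name}: missing field {field!r}")
        for tool in spec["tools"]:
            if "*" in tool:
                raise ValueError(f"{name}: wildcard not allowed in tool {tool!r}")
    return agents
```

Running it against the file content, for example `validate_agents(Path("agents.json").read_text())`, raises `ValueError` on the first problem it finds.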
Create a new Skill in the agent-sdk-server/claude-config/skills/ directory:
- Create a folder: `skills/your-skill/`
- Create a `SKILL.md` file with YAML frontmatter and a Markdown description
- Claude Agent SDK will auto-discover and use these Skills
Example: skills/hello-world/SKILL.md
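
A small scaffolding helper can illustrate the folder layout described above. It is a sketch only: the helper is hypothetical, and the frontmatter fields `name` and `description` are an assumption, so check `skills/hello-world/SKILL.md` for the fields the project actually uses.

```python
from pathlib import Path


def scaffold_skill(skills_dir: Path, name: str, description: str) -> Path:
    """Create skills/<name>/SKILL.md with YAML frontmatter and a Markdown body."""
    skill_dir = skills_dir / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        "---\n"
        f"name: {name}\n"                # assumed frontmatter field
        f"description: {description}\n"  # assumed frontmatter field
        "---\n\n"
        f"# {name}\n\n"
        "Describe when and how the agent should use this skill.\n",
        encoding="utf-8",
    )
    return skill_file
```

For example, `scaffold_skill(Path("agent-sdk-server/claude-config/skills"), "your-skill", "Example skill")` creates the folder and file from the first two steps.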
Edit agent-sdk-server/claude-config/mcp.json, supporting two types:
- HTTP MCP: HTTP endpoint pointing to remote MCP servers
- Command-line MCP: Start local MCP servers via `command` and `args`
Examples include AWS knowledge base MCP servers. Refer to existing configurations to add more MCP servers.
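
To show the difference between the two server types, the sketch below classifies entries by shape. Only `command` and `args` are confirmed by the text above; the `url` and `type` keys, the server names, and the endpoint are assumptions, so treat the existing `mcp.json` as the authority on the real format.

```python
def mcp_server_kind(entry: dict) -> str:
    """Classify one MCP server entry as 'http' or 'command'."""
    if "url" in entry:  # assumed key for HTTP servers
        return "http"
    if "command" in entry:
        return "command"
    raise ValueError("entry needs either 'url' or 'command'")


# Hypothetical entries; names, endpoint, and package are placeholders.
EXAMPLE_SERVERS = {
    "remote-docs": {"type": "http", "url": "https://example.invalid/mcp"},
    "local-tool": {"command": "npx", "args": ["-y", "some-mcp-package"]},
}

kinds = {name: mcp_server_kind(entry) for name, entry in EXAMPLE_SERVERS.items()}
```

Command-line servers run inside the agent container, which is why the feature list mentions a Node.js 20+ requirement.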
For Telegram Forum groups:
- Enable Topics feature in group settings
- Add Bot to group (must be by whitelisted user)
- Promote Bot to admin with "Manage Topics" permission
- Use `/newchat <message>` to create new conversation topics
See docs/forum-group-security.md for details.
The project includes the following example components; follow these examples to add other components:
- SubAgent Example: `aws-support` Agent in `agents.json`
- Skill Example: `skills/hello-world/SKILL.md`
- MCP Example: AWS knowledge base and documentation MCP servers in `mcp.json`
- Multi-tenant TenantID isolation
MIT