Setting up DeepSeek R1 on AWS using Ollama and FastAPI

Overview of DeepSeek-R1
DeepSeek-R1 uses a Mixture of Experts (MoE) architecture with 671 billion parameters in total. The MoE design activates only 37 billion parameters per token, enabling efficient inference by routing each query to the most relevant expert "clusters." This lets the model specialize in different problem domains while maintaining overall efficiency. The full DeepSeek-R1 requires at least 800 GB of HBM in FP8 format for inference. In this post, we will deploy a distilled 7 billion parameter model on g4dn instances, which are powered by NVIDIA T4 GPUs. These instances are among the most cost-effective options for machine learning inference and small-scale training, and they support NVIDIA libraries such as CUDA and cuDNN.
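As a rough sanity check on that memory figure (a back-of-envelope sketch only, using the numbers quoted above): FP8 stores about one byte per parameter, so the weights of the full model alone already approach the quoted requirement before KV cache and activations are counted.
# Back-of-envelope: ~1 byte per parameter in FP8, so the full 671B-parameter model
# needs roughly 671 GB for weights alone; KV cache and activations push the total
# toward the ~800 GB of HBM mentioned above. The distilled 7B model is far smaller.
for params_in_billions in (671, 7):
    print(f"{params_in_billions}B params -> ~{params_in_billions} GB of FP8 weights")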

Why Not AWS Bedrock?
While AWS Bedrock provides an excellent managed solution for deploying LLMs, it has a significant limitation for users in Singapore and Hong Kong: it is not available in those regions. Despite Bedrock's ease of setup and tight integration with other AWS services, this regional restriction forces organizations in these locations to explore alternative deployment options.
The Case for Cloud Deployment
Running large language models locally requires substantial computational resources, particularly GPUs. Not everyone has access to graphics cards capable of running models like DeepSeek R1 efficiently. Cloud deployment offers several advantages:
- No upfront hardware investment required
- Scalable resources based on demand
- Access to powerful GPU instances (e.g., AWS g4dn or g5 instances)
- Better cost management through pay-as-you-go pricing
Overview of Ollama and the Need for Security
Ollama is an excellent tool for managing and running LLMs locally, offering:
- Simple model management and deployment
- Easy-to-use API interface
- Efficient model loading and serving

Source: https://github.com/ollama/ollama
However, Ollama wasn't designed with built-in security features for cloud deployment. When exposing an Ollama server to the internet, you need additional security layers. This is where FastAPI comes in, providing API key authentication, rate limiting, and similar controls in front of the model server; a small rate-limiting sketch follows.
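Rate limiting, for instance, is not something Ollama handles for you; one way to add it in the FastAPI layer is the third-party slowapi package. The snippet below is an illustrative sketch under that assumption, not the only option; the API key check itself appears in the app code later in this post.
# Illustrative rate limiting for the FastAPI layer using the third-party slowapi
# package (an assumption; any rate-limiting middleware would work just as well).
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # throttle per client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/ping")
@limiter.limit("10/minute")         # at most 10 requests per minute per client
async def ping(request: Request):   # slowapi requires the Request parameter
    return {"status": "ok"}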
Our Proposed Cloud Architecture

1. FastAPI Service
Deployed on AWS Fargate, it listens on port 8080 and handles incoming requests from users via the Application Load Balancer (ALB), which forwards HTTPS traffic from the public subnets.
2. Ollama Service
Running on an EC2 instance (g4dn.xlarge) and listening on port 11434, it processes requests from FastAPI, performing specific computational tasks.
3. Integration and Scaling
FastAPI communicates with Ollama over HTTP, while an Auto Scaling Group (ASG) maintains one to two Ollama instances based on demand. AWS Cloud Map provides service discovery, so the two services can find each other by DNS name (a minimal call sketch follows this list).
4. Security
The architecture secures backend services by placing them in private subnets and routing public traffic through the ALB, protecting them from direct internet exposure.
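To make the FastAPI-to-Ollama hop in step 3 concrete, here is a minimal sketch of the HTTP calls involved. The DNS name is a hypothetical Cloud Map entry and should be replaced with whatever your namespace actually resolves to; the two endpoints shown are Ollama's standard REST API.
# Minimal sketch of the FastAPI-to-Ollama HTTP integration over the private network.
# "http://ollama.local:11434" is a hypothetical Cloud Map DNS name (an assumption).
import requests

OLLAMA_ENDPOINT = "http://ollama.local:11434"

# List the models the Ollama service has pulled
tags = requests.get(f"{OLLAMA_ENDPOINT}/api/tags", timeout=10).json()
print([m["name"] for m in tags["models"]])

# Run a single non-streaming generation against the distilled model
resp = requests.post(
    f"{OLLAMA_ENDPOINT}/api/generate",
    json={"model": "deepseek-r1:7b", "prompt": "Say hello.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])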
Implementation Steps
1. Preparing Docker Files for Ollama and FastAPI
To prepare the Dockerfile for the Ollama service, the following configuration builds an image from the official Ollama base image. It copies an entrypoint script into the image and makes it executable. During the build it starts the Ollama server in the background, waits for it to be ready, and pulls the DeepSeek R1 model, then sets the entrypoint to the copied script so the server is fully operational when the container starts.
# Use official Ollama image
FROM ollama/ollama:latest
# Copy entrypoint script into the image
COPY entrypoint.sh /entrypoint.sh
# Make script executable
RUN chmod +x /entrypoint.sh
# Start the Ollama server in the background
RUN ollama serve & \
    echo "Waiting for Ollama server to be ready..." && \
    while [ "$(ollama list | grep 'NAME')" = "" ]; do sleep 1; done && \
    echo "Ollama server is ready. Pulling model..." && \
    ollama pull deepseek-r1:7b
# Set entrypoint
ENTRYPOINT ["/entrypoint.sh"]
Note: Similarly, you can prepare a Dockerfile for the FastAPI service by starting from a base image like python:3.11-slim, installing dependencies, and copying the application code into the image; a minimal sketch follows.
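Such a Dockerfile might look like the following; the requirements.txt file and the uvicorn start command are assumptions about your project layout rather than part of the original setup.
# Dockerfile for the FastAPI service - a minimal sketch
FROM python:3.11-slim
WORKDIR /app
# Install dependencies (requirements.txt is assumed to list fastapi, uvicorn,
# langchain-ollama, and anything else app.py imports)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the application code
COPY app.py .
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]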
2. Preparing the FastAPI App
This FastAPI application creates an API that allows users to interact with a language model. It loads configuration settings from environment variables and defines a request model for validation. The main functionality is provided through a /query endpoint, which accepts user input, verifies the API key for security, and processes the request using the language model. The application handles errors gracefully, returning appropriate HTTP status codes and messages when issues arise. The configuration names and the API key helper shown below are representative; adapt them to your own setup.
#app.py
import os
from fastapi import FastAPI, HTTPException, Depends, Header, status
from fastapi.responses import JSONResponse
from langchain_ollama import ChatOllama
from pydantic import BaseModel

# Configuration from environment variables (variable names here are assumed)
OLLAMA_ENDPOINT = os.getenv("OLLAMA_ENDPOINT", "http://localhost:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "deepseek-r1:7b")
API_KEY = os.getenv("API_KEY")
app = FastAPI()

# Request model used to validate incoming payloads
class QueryRequest(BaseModel):
    message: str
    temperature: float = 0.7

# Minimal API key check (a sketch; the original helper is not shown in this post)
def verify_api_key(authorization: str = Header(...)):
    if authorization != API_KEY:
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid API key")
    return authorization

def create_llm(temperature: float):
    return ChatOllama(base_url=OLLAMA_ENDPOINT, model=MODEL_NAME, temperature=temperature)

@app.get('/health')
def health():
    return {"status": "ok"}  # used by the ALB health check

@app.post('/query')
async def query(data: QueryRequest, authorization: str = Depends(verify_api_key)):
    try:
        llm = create_llm(data.temperature)
        response = llm.invoke(data.message)
        return JSONResponse(content={"response": response.content})
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == '__main__':
    import uvicorn
    uvicorn.run(app, host='0.0.0.0', port=8080)
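Before containerizing the app, you can exercise the /query contract locally with FastAPI's TestClient. This is a sketch: it assumes an Ollama server is reachable at OLLAMA_ENDPOINT (otherwise the call returns a 500), and the API key is a placeholder that must match whatever the app expects.
# Local smoke test for app.py using FastAPI's TestClient (needs httpx installed).
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)
r = client.post(
    "/query",
    json={"message": "What is 2 + 2?", "temperature": 0.2},
    headers={"Authorization": "your-api-key"},  # placeholder key
)
print(r.status_code, r.json())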
3. Infrastructure Setup with CDK
We'll use AWS CDK to provision our infrastructure, including:
EC2 instance with GPU support (g4dn.xlarge)
// lib/infra-stack.js
ollamaTaskDef.addContainer('OllamaContainer', {
  image: ecs.ContainerImage.fromAsset(path.join(__dirname, '../../ollama-service')),
  memoryReservationMiB: 4096,
  portMappings: [{ containerPort: 11434 }],
});

// Create Capacity Provider for GPU instances (g4dn.xlarge)
const autoScalingGroup = new autoscaling.AutoScalingGroup(this, 'OllamaASG', {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.G4DN, ec2.InstanceSize.XLARGE),
  machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
  minCapacity: 1,
  maxCapacity: 2,
});
Application Load Balancer for HTTPS termination
// Create Application Load Balancer
const zone = route53.HostedZone.fromLookup(this, 'HostedZone', { domainName });
const certificate = new acm.Certificate(this, 'Certificate', {
  domainName: <your-domain-name>,
  validation: acm.CertificateValidation.fromDns(zone),
});
const lb = new elbv2.ApplicationLoadBalancer(this, 'LB', {
  vpc,
  internetFacing: true,
  securityGroup: fastapiSG,
});
const listener = lb.addListener('HttpsListener', {
  port: 443,
  certificates: [certificate],
});
listener.addTargets('FastapiTarget', {
  port: 8080,
  targets: [fastapiService],
  healthCheck: { path: '/health', interval: cdk.Duration.seconds(30) },
});
new route53.ARecord(this, 'ARecord', {
  zone,
  recordName: <your-record-name>,
  target: route53.RecordTarget.fromAlias(new targets.LoadBalancerTarget(lb)),
});
Deploy AWS infrastructure with CDK
$ cdk deploy --context env="your-environment"
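Once the stack is up, a quick end-to-end check from any machine confirms that the ALB, the API key check, and the Ollama backend are wired together. The URL and key below are placeholders for your own values; this is a verification sketch, not part of the deployed code.
# End-to-end smoke test against the deployed HTTPS endpoint (placeholder URL and key).
import requests

API_URL = "https://<your-record-name>.<your-domain-name>/query"
API_KEY = "your-api-key"

resp = requests.post(
    API_URL,
    json={"message": "Explain mixture-of-experts in one sentence.", "temperature": 0.7},
    headers={"Authorization": API_KEY},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])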
Conclusion
By combining Ollama's model management capabilities with FastAPI's security features and AWS's infrastructure, you can create a robust, secure, and scalable DeepSeek R1 deployment suitable for production use in regions where AWS Bedrock isn't available.
Next Steps:
Please email us at support@alphamatch.ai to access the full code and receive our detailed implementation guide with step-by-step instructions for setting up your own secure DeepSeek R1 server.