Why Vector Databases Were Created

The rise of AI and machine learning has fundamentally changed how we process and understand data. Traditional databases excel at storing structured data like names, numbers, and categories, but they struggle with modern AI applications that need to understand meaning, context, and similarity rather than exact matches.

The AI Data Challenge

Modern AI systems convert unstructured data (text, images, audio) into vector embeddings – numerical representations that capture semantic meaning. For example, the sentences "How do I reset my password?" and "I forgot my login credentials" produce nearby embeddings even though they share almost no words.
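As a toy numeric sketch of that idea (the three-dimensional vectors below are hand-picked for illustration; a real embedding model produces hundreds or thousands of dimensions), related concepts end up close together under cosine similarity:

```python
import numpy as np

# Hand-picked toy "embeddings" -- a real model would produce 768+ dimensions
embeddings = {
    "cat": np.array([0.90, 0.10, 0.05]),
    "dog": np.array([0.85, 0.15, 0.05]),
    "car": np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["cat"], embeddings["dog"]))  # high: related concepts
print(cosine(embeddings["cat"], embeddings["car"]))  # low: unrelated concepts
```

A traditional equality match would treat all three keys as equally different; this geometry is what makes semantic search possible.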

Mathematical Foundation

Vector databases solve the Approximate Nearest Neighbor (ANN) problem: given a query vector, return the k stored vectors closest to it under a chosen distance metric, without comparing against every stored vector. Exhaustive search is exact but scales linearly with dataset size; ANN indexes trade a small amount of recall for orders-of-magnitude speedups.
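For intuition, here is a minimal brute-force version of that problem in NumPy. This exhaustive scan returns the exact answer that indexes like HNSW approximate, and its O(n·d) cost per query is precisely what stops it from scaling to millions of vectors:

```python
import numpy as np

def exact_knn(query, vectors, k=3):
    """Exhaustive k-nearest-neighbor search by cosine similarity.

    Compares the query against every stored vector (O(n*d) per query);
    ANN indexes trade a little recall to avoid this full scan.
    """
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                   # cosine similarity to every stored vector
    top = np.argsort(-sims)[:k]   # indices of the k most similar vectors
    return top, sims[top]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))           # 1,000 stored 64-dim vectors
ids, scores = exact_knn(data[42], data, k=3)
print(ids[0])  # the closest vector to data[42] is data[42] itself
```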

What is a Vector Database?

A vector database is a specialized database designed to store, index, and query high-dimensional vector embeddings efficiently. Unlike traditional databases that store discrete values, vector databases store continuous numerical representations of data that capture semantic meaning and relationships.

Core Components

  1. Vector Storage: Optimized storage for high-dimensional arrays (typically 128-4096 dimensions)
  2. Vector Index: Specialized data structures (HNSW, IVF, LSH) for fast similarity search
  3. Similarity Metrics: Distance functions (cosine, euclidean, dot product) to measure vector similarity
  4. Metadata Integration: Ability to combine vector search with traditional filtering
  5. Approximate Search: Trading perfect accuracy for speed using probabilistic algorithms
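The similarity metrics from point 3 behave differently, which a quick NumPy sketch makes concrete: cosine looks only at direction (so it ignores vector magnitude), euclidean distance penalizes magnitude differences, and dot product rewards them. This is why cosine is the usual default for embeddings:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

dot = float(a @ b)                        # rewards magnitude: 28.0
euclidean = float(np.linalg.norm(a - b))  # penalizes magnitude: ~3.742
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # direction only: 1.0

print(dot, round(euclidean, 3), cosine)
```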

Traditional Database vs Vector Database

A traditional query filters on exact predicates; a vector query ranks rows by similarity to a query vector. The contrast is easiest to see side by side:

Example: Understanding Vector Similarity

-- Traditional database query
SELECT title, content FROM articles
WHERE category = 'technology' AND publish_date > '2024-01-01'

-- Vector database query (conceptual)
SELECT title, content, similarity_score
FROM articles
WHERE vector_similarity(embedding, query_vector) > 0.8
ORDER BY similarity_score DESC
LIMIT 10
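The conceptual vector query above can be mirrored in plain NumPy, assuming the embeddings sit in a matrix: filter by a similarity threshold, order best-first, and cap the result count. A vector database runs the same logic against an index rather than the full matrix:

```python
import numpy as np

def vector_query(query_vector, embeddings, threshold=0.8, limit=10):
    """NumPy equivalent of the conceptual SQL above: rows whose cosine
    similarity to the query exceeds `threshold`, best-first, at most `limit`."""
    q = query_vector / np.linalg.norm(query_vector)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    keep = np.flatnonzero(scores > threshold)   # WHERE similarity > threshold
    keep = keep[np.argsort(-scores[keep])]      # ORDER BY similarity DESC
    return [(int(i), float(scores[i])) for i in keep[:limit]]  # LIMIT
```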

When to Use Vector Databases

Perfect for:

  • Semantic search over text, images, or audio
  • Recommendation systems and personalization
  • RAG pipelines that ground LLM answers in your own data
  • Duplicate and near-duplicate detection

Not ideal for:

  • Exact-match lookups and transactional (OLTP) workloads
  • Heavily relational data with complex joins
  • Small datasets where a brute-force scan is already fast enough


Overview

Amazon S3 Vectors integrates with OpenSearch to combine low-cost, large-scale vector storage with fast search. This guide covers why vector databases exist, how the two services fit together, and how to set up, query, and operate the stack, with integration points for Amazon Bedrock Knowledge Bases and SageMaker.

What’s New and Important

Amazon S3 Vectors is a new cloud object store that provides native support for storing and querying vectors at massive scale, offering up to 90% cost reduction compared to conventional approaches while seamlessly integrating with Amazon Bedrock Knowledge Bases, SageMaker, and OpenSearch for AI applications.

Amazon OpenSearch Serverless now supports half (0.5) OpenSearch Compute Units (OCUs) for indexing and search workloads, cutting the entry cost in half.

Architecture Options

Option 1: S3 Vectors + OpenSearch Serverless (Recommended)

S3 Vectors holds the full vector corpus at low cost, while OpenSearch Serverless indexes the vectors that need low-latency, feature-rich search.

Option 2: OpenSearch Serverless Only

All vectors live in the OpenSearch vector index: operationally simpler, but storage costs more at scale.

Prerequisites:

# Verify the AWS CLI is installed
aws --version

# Configure AWS credentials
aws configure

# Install required tools
pip install boto3 opensearch-py numpy

Part 1: Setting Up S3 Vectors

1.1 Create S3 Vectors Bucket

# S3 Vectors uses dedicated vector buckets and indexes via the
# `s3vectors` CLI namespace (requires a recent AWS CLI version;
# flag names may differ slightly across releases)
aws s3vectors create-vector-bucket \
    --vector-bucket-name my-vector-bucket-2024 \
    --region us-east-1

# Create a vector index inside the bucket
aws s3vectors create-index \
    --vector-bucket-name my-vector-bucket-2024 \
    --index-name my-vector-index \
    --dimension 768 \
    --data-type float32 \
    --distance-metric cosine

1.2 Upload Vector Data

# upload_vectors.py
import boto3
import numpy as np
import json

def upload_vectors_to_s3():
    s3_client = boto3.client('s3')
    bucket_name = 'my-vector-bucket-2024'

    # Sample vector data (768-dimensional vectors)
    vectors = [
        {
            "id": "doc_1",
            "vector": np.random.rand(768).tolist(),
            "metadata": {
                "title": "Document 1",
                "category": "technology",
                "content": "AI and machine learning concepts"
            }
        },
        {
            "id": "doc_2",
            "vector": np.random.rand(768).tolist(),
            "metadata": {
                "title": "Document 2",
                "category": "science",
                "content": "Physics and quantum mechanics"
            }
        }
    ]

    # Upload each vector document as a JSON object
    for vector_data in vectors:
        key = f"vectors/{vector_data['id']}.json"
       
        s3_client.put_object(
            Bucket=bucket_name,
            Key=key,
            Body=json.dumps(vector_data),
            ContentType='application/json'
        )
        print(f"Uploaded vector {vector_data['id']}")

if __name__ == "__main__":
    upload_vectors_to_s3()


Part 2: Setting Up OpenSearch Serverless

2.1 Create OpenSearch Serverless Collection

# Create security policy
aws opensearchserverless create-security-policy \
    --name vector-search-policy \
    --type encryption \
    --policy '{
        "Rules": [
            {
                "ResourceType": "collection",
                "Resource": ["collection/vector-search-*"]
            }
        ],
        "AWSOwnedKey": true
    }'

# Create network policy
aws opensearchserverless create-security-policy \
    --name vector-search-network-policy \
    --type network \
    --policy '[
        {
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": ["collection/vector-search-*"]
                },
                {
                    "ResourceType": "dashboard",
                    "Resource": ["collection/vector-search-*"]
                }
            ],
            "AllowFromPublic": true
        }
    ]'

# Create access policy
aws opensearchserverless create-access-policy \
    --name vector-search-access-policy \
    --type data \
    --policy '[
        {
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": ["collection/vector-search-*"],
                    "Permission": ["aoss:*"]
                },
                {
                    "ResourceType": "index",
                    "Resource": ["index/vector-search-*/*"],
                    "Permission": ["aoss:*"]
                }
            ],
            "Principal": ["arn:aws:iam::YOUR_ACCOUNT_ID:root"]
        }
    ]'

# Create the collection
aws opensearchserverless create-collection \
    --name vector-search-collection \
    --type VECTORSEARCH \
    --description "Vector search collection for embeddings"

2.2 Create Vector Index

# create_index.py
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

def create_vector_index():
    # Sign requests with SigV4 for the serverless ('aoss') service
    credentials = boto3.Session().get_credentials()
    awsauth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')

    # OpenSearch client
    client = OpenSearch(
        hosts=[{'host': 'your-collection-endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection
    )

    # Create index with vector mapping
    index_name = "vector-search-index"
    index_body = {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": 100
            }
        },
        "mappings": {
            "properties": {
                "vector": {
                    "type": "knn_vector",
                    "dimension": 768,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 24
                        }
                    }
                },
                "metadata": {
                    "properties": {
                        "title": {"type": "text"},
                        "category": {"type": "keyword"},
                        "content": {"type": "text"}
                    }
                }
            }
        }
    }

    response = client.indices.create(index=index_name, body=index_body)
    print(f"Index created: {response}")

if __name__ == "__main__":
    create_vector_index()

Part 3: Integrating S3 Vectors with OpenSearch

3.1 Set Up S3-OpenSearch Integration

# Conceptual: at the time of writing there is no single managed command that
# syncs an S3 bucket into an OpenSearch Serverless collection. The shape below
# illustrates the intent; in practice, use the import script in section 3.2
# or an OpenSearch Ingestion pipeline.
aws opensearchserverless create-vector-source \
    --collection-name vector-search-collection \
    --name s3-vector-source \
    --type S3 \
    --configuration '{
        "S3Configuration": {
            "BucketName": "my-vector-bucket-2024",
            "Prefix": "vectors/",
            "VectorField": "vector",
            "MetadataFields": ["metadata"]
        }
    }'

3.2 Import Vectors from S3

# import_vectors.py
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

def import_vectors_from_s3():
    # Initialize clients; sign requests with SigV4 for the serverless ('aoss') service
    s3_client = boto3.client('s3')
    credentials = boto3.Session().get_credentials()
    awsauth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')
   
    opensearch_client = OpenSearch(
        hosts=[{'host': 'your-collection-endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection
    )
   
    bucket_name = 'my-vector-bucket-2024'
    index_name = 'vector-search-index'
   
    # List objects in S3; paginate, since list_objects_v2 returns at most 1,000 keys per call
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix='vectors/'):
        for obj in page.get('Contents', []):
            key = obj['Key']

            # Download vector data
            obj_response = s3_client.get_object(Bucket=bucket_name, Key=key)
            vector_data = json.loads(obj_response['Body'].read())

            # Index in OpenSearch
            doc = {
                'vector': vector_data['vector'],
                'metadata': vector_data['metadata']
            }

            opensearch_client.index(
                index=index_name,
                id=vector_data['id'],
                body=doc
            )
            print(f"Indexed document: {vector_data['id']}")
   
    # OpenSearch Serverless makes documents searchable automatically,
    # so an explicit refresh is unnecessary (the _refresh API may not be supported)
    print("Import completed")

if __name__ == "__main__":
    import_vectors_from_s3()

Part 4: Querying and Search

4.1 Vector Similarity Search

# search_vectors.py
import numpy as np
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

def search_similar_vectors(query_vector, k=5):
    credentials = boto3.Session().get_credentials()
    awsauth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')
   
    client = OpenSearch(
        hosts=[{'host': 'your-collection-endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection
    )
   
    # Search query
    search_body = {
        "size": k,
        "query": {
            "knn": {
                "vector": {
                    "vector": query_vector,
                    "k": k
                }
            }
        },
        "_source": ["metadata"]
    }
   
    response = client.search(
        index="vector-search-index",
        body=search_body
    )
   
    results = []
    for hit in response['hits']['hits']:
        results.append({
            'id': hit['_id'],
            'score': hit['_score'],
            'metadata': hit['_source']['metadata']
        })
   
    return results

# Example usage
def main():
    # Generate a query vector (in practice, this would come from your embedding model)
    query_vector = np.random.rand(768).tolist()
   
    results = search_similar_vectors(query_vector, k=3)
   
    print("Search Results:")
    for result in results:
        print(f"ID: {result['id']}")
        print(f"Score: {result['score']}")
        print(f"Title: {result['metadata']['title']}")
        print(f"Category: {result['metadata']['category']}")
        print("---")

if __name__ == "__main__":
    main()

4.2 Hybrid Search (Vector + Text)

# hybrid_search.py
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

def hybrid_search(query_vector, text_query, k=5):
    credentials = boto3.Session().get_credentials()
    awsauth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')
   
    client = OpenSearch(
        hosts=[{'host': 'your-collection-endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
        http_auth=awsauth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection
    )
   
    # Hybrid search combining vector similarity and text matching
    search_body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    {
                        "knn": {
                            "vector": {
                                "vector": query_vector,
                                "k": k,
                                "boost": 1.0
                            }
                        }
                    },
                    {
                        "multi_match": {
                            "query": text_query,
                            "fields": ["metadata.title^2", "metadata.content"],
                            "boost": 0.5
                        }
                    }
                ]
            }
        },
        "_source": ["metadata"]
    }
   
    response = client.search(
        index="vector-search-index",
        body=search_body
    )
   
    return response['hits']['hits']

Part 5: Use Cases and Examples

5.1 Document Similarity Search

# Use case: find documents similar to a given one
def find_similar_documents(document_embedding):
    results = search_similar_vectors(document_embedding, k=10)
    # Keep only strong matches
    return [r for r in results if r['score'] > 0.8]

5.2 Recommendation System

# Use case: product recommendations
def get_product_recommendations(user_embedding, category_filter=None):
    search_body = {
        "size": 20,
        "query": {
            "bool": {
                "must": [
                    {
                        "knn": {
                            "vector": {
                                "vector": user_embedding,
                                "k": 20
                            }
                        }
                    }
                ]
            }
        }
    }

    if category_filter:
        search_body["query"]["bool"]["filter"] = [
            {"term": {"metadata.category": category_filter}}
        ]

    # Execute the search and return recommendations
    # (`client` is an OpenSearch client configured as in section 2.2)
    response = client.search(index="vector-search-index", body=search_body)
    return response['hits']['hits']

5.3 RAG (Retrieval-Augmented Generation)

# Use case: RAG for chatbots
def retrieve_context_for_rag(question_embedding, top_k=5):
    relevant_docs = search_similar_vectors(question_embedding, k=top_k)
   
    context = ""
    for doc in relevant_docs:
        context += f"{doc['metadata']['content']}\n\n"
   
    return context
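The retrieved context then gets stitched into the model prompt. A minimal, provider-agnostic sketch follows; the actual model invocation (e.g. a Bedrock call) depends on your setup and is omitted:

```python
def build_rag_prompt(question, context, max_context_chars=4000):
    """Assemble a grounded prompt from retrieved context.

    Character truncation is a crude guard against overflowing the model's
    context window; production systems usually count tokens instead.
    """
    context = context[:max_context_chars]
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )
```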

Part 6: Monitoring and Optimization

6.1 CloudWatch Metrics

# Monitor OpenSearch Serverless metrics
aws cloudwatch get-metric-statistics \
    --namespace AWS/AOSS \
    --metric-name SearchRate \
    --dimensions Name=ClientId,Value=your-account-id Name=CollectionName,Value=vector-search-collection \
    --start-time 2024-01-01T00:00:00Z \
    --end-time 2024-01-02T00:00:00Z \
    --period 3600 \
    --statistics Average

6.2 Performance Optimization

# Optimize vector indexing performance
def optimize_index_settings():
    # Note: shard and replica counts apply to managed OpenSearch domains;
    # OpenSearch Serverless manages capacity automatically
    index_settings = {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 512,  # Higher for better recall
            "refresh_interval": "30s",        # Less frequent refresh for bulk indexing
            "number_of_shards": 1,            # Start with 1, scale as needed
            "number_of_replicas": 0           # 0 for cost optimization
        }
    }
    return index_settings

Challenges and Solutions

Challenge 1: Cold Start Performance

Problem: the first queries after idle periods may be slower while caches and compute warm up.
Solution: issue a cheap warm-up query at startup:

# Implement warming queries
def warm_up_index():
    # Any valid vector works; the goal is just to exercise the index once
    dummy_vector = [0.1] * 768
    search_similar_vectors(dummy_vector, k=1)

Challenge 2: Vector Dimensionality

Problem: high-dimensional vectors consume more memory and slow down search.
Solution: use smaller embedding models where quality allows, apply dimensionality reduction (e.g. PCA), or enable vector quantization to shrink the index.

Challenge 3: Data Consistency

Problem: S3 and OpenSearch can drift out of sync when imports fail partway.
Solution: periodically compare counts and re-index on mismatch:

# Implement eventual consistency checks
def verify_sync_status():
    # get_s3_object_count, get_opensearch_doc_count and trigger_reindex are
    # application-specific helpers (e.g. a paginated ListObjectsV2 count
    # and an OpenSearch _count query)
    s3_count = get_s3_object_count()
    opensearch_count = get_opensearch_doc_count()

    if s3_count != opensearch_count:
        trigger_reindex()

Challenge 4: Cost Management

Problem: unexpected costs as OCU usage scales with load.
Solution: monitor OCU utilization and alert early:

# Implement cost monitoring
from datetime import datetime, timedelta
import boto3

def monitor_ocu_usage():
    cloudwatch = boto3.client('cloudwatch')

    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/AOSS',
        MetricName='SearchOCUUtilization',
        Dimensions=[
            {'Name': 'CollectionName', 'Value': 'vector-search-collection'}
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )

    # Alert if utilization > 80% (send_alert is an application-specific hook,
    # e.g. an SNS publish)
    for datapoint in response['Datapoints']:
        if datapoint['Average'] > 80:
            send_alert("High OCU utilization detected")

Best Practices

  1. Index Design:
    • Start with fewer dimensions if possible
    • Use appropriate distance metrics (cosine for normalized vectors)
    • Set optimal HNSW parameters based on your use case
  2. Cost Optimization:
    • Use S3 Vectors for large-scale storage
    • Monitor OCU utilization
    • Implement efficient batching for indexing
  3. Performance:
    • Use hybrid search for better relevance
    • Implement query result caching
    • Monitor query latency and adjust accordingly
  4. Security:
    • Use IAM policies for fine-grained access control
    • Enable encryption at rest and in transit
    • Regularly audit access patterns
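The "efficient batching" point under Cost Optimization can be sketched with the bulk API: instead of one HTTP round trip per document (as in the Part 3 import script), documents are indexed in chunks. The action generator below is plain Python; the commented call shows how it would be handed to opensearch-py's bulk helper with a client configured as in Part 2.

```python
def bulk_actions(docs, index_name="vector-search-index"):
    """Yield one bulk-API action per {'id', 'vector', 'metadata'} dict."""
    for d in docs:
        yield {
            "_index": index_name,
            "_id": d["id"],
            "_source": {"vector": d["vector"], "metadata": d["metadata"]},
        }

# With a configured client (see section 2.2):
# from opensearchpy import helpers
# success, errors = helpers.bulk(client, bulk_actions(docs), chunk_size=500)
```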

Conclusion

AWS S3 Vectors with OpenSearch Serverless provides a cost-effective, scalable solution for vector search, offering up to 90% cost reduction and sub-second query response times. This architecture is ideal for AI applications requiring large-scale vector storage and search capabilities.

The combination offers the best of both worlds: cost-effective storage in S3 Vectors and powerful search capabilities through OpenSearch Serverless, making it suitable for everything from recommendation systems to RAG applications.
