AWS S3 Vector Search End-to-End Guide

Why Vector Databases Were Created

The rise of AI and machine learning has fundamentally changed how we process and understand data. Traditional databases excel at storing structured data like names, numbers, and categories, but they struggle with the complexity of modern AI applications.

The AI Data Challenge

Modern AI systems convert unstructured data (text, images, audio) into vector embeddings, numerical representations that capture semantic meaning.

Mathematical Foundation

Vector databases solve the Approximate Nearest Neighbor (ANN) problem: given a query vector, find the stored vectors closest to it under a distance or similarity measure, trading a small amount of accuracy for much faster search than an exhaustive comparison.

What is a Vector Database?

A vector database is a specialized database designed to store, index, and query high-dimensional vector embeddings efficiently. Unlike traditional databases that store discrete values, vector databases store continuous numerical representations of data that capture semantic meaning and relationships.

Core Components

Traditional Database vs Vector Database

Example: Understanding Vector Similarity

When to Use Vector Databases

Perfect for:

Not ideal for:

Overview

Amazon S3 Vectors integrates with OpenSearch to provide flexible vector storage and search capabilities.

What's New and Important

Amazon S3 Vectors is a new cloud object store that provides native support for storing and querying vectors at massive scale, offering up to 90% cost reduction compared to conventional approaches while seamlessly integrating with Amazon Bedrock Knowledge Bases, SageMaker, and OpenSearch for AI applications.

Amazon OpenSearch Serverless now supports half (0.5) OpenSearch Compute Units (OCUs) for indexing and search workloads, cutting the entry cost in half.
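To make the vector similarity idea from this section concrete, here is a minimal sketch of cosine similarity and an exact top-k search in plain Python. The three-dimensional vectors and the names `toy_vectors` and `nearest_neighbors` are hypothetical and chosen only for illustration; real embeddings have hundreds of dimensions. Note that this is a brute-force search that scores every stored vector, which is the result an ANN index approximates at much lower cost.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, items, k=1):
    """Exact (brute-force) top-k search: score every stored vector."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in items.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical toy "embeddings" (assumption: made up for this example).
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

top = nearest_neighbors([0.9, 0.1, 0.0], toy_vectors, k=2)
for item_id, score in top:
    print(f"{item_id}: {score:.3f}")
```

Vectors that point in similar directions score close to 1, while unrelated ones score close to 0, which is why "cat" and "kitten" rank above "car" for this query.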
Architecture Options

Option 1: S3 Vectors + OpenSearch Serverless (Recommended)

Option 2: OpenSearch Serverless Only

Prerequisites:

Part 1: Setting Up S3 Vectors

1.1 Create S3 Vectors Bucket

1.2 Upload Vector Data

    # upload_vectors.py
    import json

    import boto3
    import numpy as np


    def upload_vectors_to_s3():
        s3_client = boto3.client('s3')
        bucket_name = 'my-vector-bucket-2024'

        # Sample vector data (768-dimensional vectors);
        # random values stand in for real embeddings
        vectors = [
            {
                "id": "doc_1",
                "vector": np.random.rand(768).tolist(),
                "metadata": {
                    "title": "Document 1",
                    "category": "technology",
                    "content": "AI and machine learning concepts"
                }
            },
            {
                "id": "doc_2",
                "vector": np.random.rand(768).tolist(),
                "metadata": {
                    "title": "Document 2",
                    "category": "science",
                    "content": "Physics and quantum mechanics"
                }
            }
        ]

        # Upload each vector as its own JSON object
        for vector_data in vectors:
            key = f"vectors/{vector_data['id']}.json"
            s3_client.put_object(
                Bucket=bucket_name,
                Key=key,
                Body=json.dumps(vector_data),
                ContentType='application/json'
            )
            print(f"Uploaded vector {vector_data['id']}")


    if __name__ == "__main__":
        upload_vectors_to_s3()

Part 2: Setting Up OpenSearch Serverless

2.1 Create OpenSearch Serverless Collection

    # Create encryption policy
    aws opensearchserverless create-security-policy \
      --name vector-search-policy \
      --type encryption \
      --policy '{
        "Rules": [
          {
            "ResourceType": "collection",
            "Resource": ["collection/vector-search-*"]
          }
        ],
        "AWSOwnedKey": true
      }'

    # Create network policy
    aws opensearchserverless create-security-policy \
      --name vector-search-network-policy \
      --type network \
      --policy '[
        {
          "Rules": [
            {
              "ResourceType": "collection",
              "Resource": ["collection/vector-search-*"]
            },
            {
              "ResourceType": "dashboard",
              "Resource": ["collection/vector-search-*"]
            }
          ],
          "AllowFromPublic": true
        }
      ]'

    # Create data access policy
    aws opensearchserverless create-access-policy \
      --name vector-search-access-policy \
      --type data \
      --policy '[
        {
          "Rules": [
            {
              "ResourceType": "collection",
              "Resource": ["collection/vector-search-*"],
              "Permission": ["aoss:*"]
            },
            {
              "ResourceType": "index",
              "Resource": ["index/vector-search-*/*"],
              "Permission": ["aoss:*"]
            }
          ],
          "Principal": ["arn:aws:iam::YOUR_ACCOUNT_ID:root"]
        }
      ]'

    # Create the collection
    aws opensearchserverless create-collection \
      --name vector-search-collection \
      --type VECTORSEARCH \
      --description "Vector search collection for embeddings"

2.2 Create Vector Index

    # create_index.py
    import boto3
    from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection


    def create_vector_index():
        # Sign requests with SigV4 for the OpenSearch Serverless service ('aoss')
        credentials = boto3.Session().get_credentials()
        awsauth = AWSV4SignerAuth(credentials, 'us-east-1', 'aoss')

        # OpenSearch client
        client = OpenSearch(
            hosts=[{'host': 'your-collection-endpoint.us-east-1.aoss.amazonaws.com', 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
        )

        # Create index with vector mapping
        index_name = "vector-search-index"
        index_body = {
            "settings": {
                "index": {
                    "knn": True,
                    "knn.algo_param.ef_search": 100
Introduction to Database DevOps

Many companies use automated processes (like pipelines) to manage their software code, deploy it, test it, and set up their computer systems. However, when it comes to working with databases (which store important data), they often don't use these same automated methods. Instead, they handle databases in a separate way, and this causes a lot of problems. It's now time to start using automation for databases too.

What is Database DevOps?

Database DevOps is a method that helps speed up and improve the way software is created and released. It focuses on making it easier for developers and operations teams to work together. When you want to create reliable products, it's essential to make sure that databases and software work well together. With DevOps, you can build and launch both the software and the database using the same setup.

We use DevOps techniques to handle database tasks. We make changes based on feedback from the stages where we deliver and develop applications. This helps ensure a smooth delivery process.

Database DevOps Features

Database DevOps products typically have the following features:

The Database Bottleneck (Source: Liquibase)

A 2019 State of Database Deployments in Application Delivery report found that for the second year in a row, database deployments are a bottleneck. 92% of respondents reported difficulty in accelerating database deployments. Since database changes follow a manual process, requests for database code reviews are often the last thing holding up a release. Developers understandably get frustrated because the code they wrote a few weeks ago is still in review. The whole database change process is just a blocker.

Now, teams no longer have to wait for DBAs to review the changes until the final phase. It's not only possible but necessary to do this earlier in the process and package all code together.

Top Database DevOps Challenges

Database DevOps, while incredibly beneficial, comes with its fair share of challenges.
Some of the top challenges in implementing Database DevOps include:

Successfully addressing these challenges involves a combination of technology, processes, and a cultural shift toward collaboration and automation between development and operations teams.

How can DevOps help in solving the above challenges?

DevOps practices can help address many of the challenges associated with Database DevOps by promoting collaboration, automation, and a systematic approach to managing database changes. Here's how DevOps can assist in solving the problems mentioned:

By combining DevOps practices with these tools and examples, organizations can enhance their Database DevOps capabilities, streamline database management, and achieve more efficient, secure, and reliable database operations.

Top Database DevOps Tools

Open-Source Database DevOps Tools:

Paid Database DevOps Tools:

These tools cater to different database systems, such as MySQL, PostgreSQL, Oracle, SQL Server, and more. The choice of tool depends on your specific database technology, project requirements, and budget. It's essential to evaluate each tool's features, compatibility, and community/support when selecting the right one for your Database DevOps needs.
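As a rough illustration of what "a systematic approach to managing database changes" can look like in practice, here is a minimal sketch of versioned schema migrations using Python's built-in `sqlite3` module. It is not how any particular tool works; the `MIGRATIONS` list, the `schema_version` table, and the `apply_migrations` function are all hypothetical names invented for this example. The idea is that each change is a numbered script kept in source control, and a pipeline step applies only the ones a given database has not seen yet.

```python
import sqlite3

# Hypothetical migrations: (version, SQL) pairs that would live in source
# control next to the application code.
MIGRATIONS = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customers ADD COLUMN email TEXT"),
]


def apply_migrations(conn, migrations):
    """Apply pending migrations in version order; return the versions applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # this database already has the change
        with conn:  # commits on success
            conn.execute(sql)
            # The version is recorded only if the statement above ran without error.
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn, MIGRATIONS)
second_run = apply_migrations(conn, MIGRATIONS)
print("first run applied:", first_run)
print("second run applied:", second_run)
```

Because applied versions are tracked in the database itself, running the same step again is a no-op, which is what lets the same migration set be replayed safely across development, test, and production environments.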