Use OpenAI to Vectorize Content and Store in Redis (python code within)

Whether you are working on a recommendation engine, a search engine, or any other application that involves understanding and comparing pieces of text, it’s likely you’ll need to convert that text into a numerical form. This process, known as vectorization, allows us to apply mathematical techniques to analyze and compare our text.

In this blog post, we will use the OpenAI API to generate vectors from the text content of any given web page. Then, we’ll store these vectors in a Redis database for fast retrieval and comparison later.

Setup

Before we start, make sure you have the following Python packages installed:

  • beautifulsoup4 for parsing HTML and extracting the content we want.
  • openai for generating vectors from text.
  • redis for connecting to our Redis database.
  • numpy for handling the vectors.
  • requests for fetching the web page.

You will also need to sign up for an OpenAI account and get an API key at OpenAI Platform.

Create an Index in Redis using redis-cli

FT.CREATE posts ON HASH PREFIX 1 "post:" SCHEMA url TEXT

This command creates an index called “posts” on all Redis hash objects with keys that start with “post:”. It adds one field to the index, “url”, which is a text field.

You would run this command in the redis-cli by connecting to your Redis server and then entering the command at the prompt.
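
The index above only covers the url field, so it supports text matching on URLs but not similarity search over embeddings. As a rough sketch, assuming a Redis server whose search module supports vector fields and assuming each embedding is stored as 1,536 float32 values (the size text-embedding-ada-002 returns), an index with an extra VECTOR field could be described like this. The helper name build_create_index_args and the choice of a FLAT index with cosine distance are illustrative assumptions, not part of the original setup:

```python
def build_create_index_args(index="posts", prefix="post:", dim=1536):
    """Return the FT.CREATE arguments for an index with a text and a vector field.

    Hypothetical helper: FLAT indexing and COSINE distance are assumptions
    chosen for illustration.
    """
    return [
        "FT.CREATE", index,
        "ON", "HASH",
        "PREFIX", "1", prefix,
        "SCHEMA",
        "url", "TEXT",
        # VECTOR <algorithm> <number of attribute arguments> <name/value pairs...>
        "embedding", "VECTOR", "FLAT", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


if __name__ == "__main__":
    # Print the command in a form that can be pasted into redis-cli
    print(" ".join(build_create_index_args()))
    # With redis-py, the same arguments could be sent with:
    #   conn.execute_command(*build_create_index_args())
```

If an index named posts already exists, FT.CREATE will refuse to create it again, so either drop the old index first or pass a different index name.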

Fetching the Content

We’ll use the requests library to fetch the web page and the Beautiful Soup library to parse it:

from bs4 import BeautifulSoup
import requests

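def fetch_content(url):
    # Download the page and raise an error on a 4xx/5xx response
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # find() returns None when the page has no matching <div>
    entry = soup.find('div', {'class': 'entry-content'})
    if entry is None:
        raise ValueError(f"No 'entry-content' div found at {url}")
    content = entry.text
    return content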

In this code, we’re fetching the web page at the given URL and parsing it with Beautiful Soup. Then we’re extracting the text within a <div> tag with the class ‘entry-content’, which is where many WordPress themes put the main content of a post. For other sites, adjust the selector to match the page’s markup.
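
Text extracted this way usually contains runs of newlines and spaces, and very long posts can exceed the embedding model’s input limit. A minimal sketch of a cleanup step follows. The helper name clean_text and the 20,000-character cap are assumptions for illustration, and a character cap is only a crude stand-in for counting tokens:

```python
def clean_text(text, max_chars=20000):
    """Collapse whitespace and cap the length of extracted page text.

    Hypothetical helper: max_chars is an arbitrary illustrative limit,
    not a value taken from the OpenAI documentation.
    """
    # split() with no arguments splits on any run of whitespace
    collapsed = " ".join(text.split())
    return collapsed[:max_chars]


if __name__ == "__main__":
    raw = "  Hello,\n\n   world!\t This is   a post.  "
    print(clean_text(raw))
```

The cleaned string could then be passed to the vectorizing step in place of the raw content.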

Vectorizing the Content

Next, we’ll use the OpenAI API to generate a vector from the content:

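import os

import numpy as np
import openai

# Read the API key from the environment
openai.api_key = os.getenv('OPENAI_API_KEY')

def generate_vector(text):
    # Note: openai.Embedding.create is the interface of openai versions before 1.0
    embedding = openai.Embedding.create(input=text, model="text-embedding-ada-002")
    vector = embedding["data"][0]["embedding"]
    # Serialize as float32 bytes so the vector can be stored in a Redis hash
    vector = np.array(vector).astype(np.float32).tobytes()
    return vector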

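Here, we’re sending our text to the OpenAI API and getting back a vector (1,536 numbers for text-embedding-ada-002), which we serialize to float32 bytes so it can be stored in Redis. This vector represents the semantic content of our text in a form that can be compared mathematically with other vectors.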
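
To make “compared mathematically” concrete, here is a small sketch using toy vectors in place of real embeddings. It shows that float32 bytes produced by tobytes() can be turned back into an array with np.frombuffer, and how cosine similarity scores two vectors. The function names are illustrative, not part of any library:

```python
import numpy as np


def to_bytes(values):
    """Serialize a list of floats as float32 bytes."""
    return np.array(values).astype(np.float32).tobytes()


def from_bytes(blob):
    """Turn float32 bytes back into a NumPy array."""
    return np.frombuffer(blob, dtype=np.float32)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


if __name__ == "__main__":
    # Toy vectors standing in for real embeddings
    v1 = from_bytes(to_bytes([1.0, 0.0, 0.0]))
    v2 = from_bytes(to_bytes([0.0, 1.0, 0.0]))
    print(cosine_similarity(v1, v1))  # identical vectors
    print(cosine_similarity(v1, v2))  # perpendicular vectors
```

Texts with similar meaning should produce embeddings whose cosine similarity is closer to 1 than texts on unrelated topics.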

Storing the Vector

Finally, we’ll store the vector in a Redis database:

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)

def store_vector(url, vector):
    post_hash = {
        "url": url,
        "embedding": vector
    }
    # hmset is deprecated in redis-py; hset with mapping= writes all fields at once
    conn.hset("post:" + url, mapping=post_hash)

In this code, we’re connecting to our Redis database and storing the vector as a hash. The key is the URL of the web page prefixed with “post:”, so the hash falls under the “posts” index and we can easily retrieve the vector later using the URL.
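
Reading the vector back is the reverse of storing it. The sketch below assumes a hash key of "post:" plus the URL, with the raw float32 bytes in the embedding field, and a connection created without decode_responses=True so that values come back as bytes. load_vector is a hypothetical helper name:

```python
import numpy as np


def load_vector(conn, url):
    """Fetch the embedding stored for a URL and decode it into a NumPy array.

    Returns None when no embedding is stored under that key.
    """
    blob = conn.hget("post:" + url, "embedding")
    if blob is None:
        return None
    # The bytes were written as float32, so they must be read as float32
    return np.frombuffer(blob, dtype=np.float32)


# Example usage, assuming `conn` is a redis.Redis connection:
#   vector = load_vector(conn, "https://example.com/some-blog-post")
#   print(vector.shape)
```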

Putting It All Together

Now that we have all the pieces, we can create a function that fetches a web page, generates a vector from its content, and stores the vector in Redis:

def vectorize_url(url):
    content = fetch_content(url)
    vector = generate_vector(content)
    store_vector(url, vector)

And that’s it! We can now vectorize the content of any web page and store the vector for later use. For example, we could call vectorize_url('https://example.com/some-blog-post') to vectorize a specific blog post.
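
To process several pages in one run, the call can be wrapped in a loop that keeps going when one page fails. This is only a sketch: vectorize_many is a hypothetical helper, and it takes the vectorizing function as a parameter so it can be used with vectorize_url as defined above:

```python
def vectorize_many(urls, vectorize):
    """Call vectorize(url) for each URL and collect any failures.

    Returns a list of (url, error message) pairs for the URLs that failed.
    """
    failures = []
    for url in urls:
        try:
            vectorize(url)
        except Exception as exc:  # keep going; report the problem at the end
            failures.append((url, str(exc)))
    return failures


# Example usage (placeholder URLs):
#   failed = vectorize_many(
#       ["https://example.com/post-1", "https://example.com/post-2"],
#       vectorize_url,
#   )
#   for url, error in failed:
#       print(f"Could not vectorize {url}: {error}")
```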

Final Code

Just change the link in the code to the page you want to vectorize. Create a new project folder, paste in this code, name the file something like anylinkredis.py, and save it to the new folder. Next, create a .env file in the same directory and add your OpenAI API key and Redis information.

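from bs4 import BeautifulSoup
from dotenv import load_dotenv  # provided by the python-dotenv package
import openai
import redis
import numpy as np
import os
import requests

# Load the variables defined in .env into the environment
load_dotenv()

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

redis_host = os.getenv('REDIS_HOST') or '127.0.0.1'
redis_port = int(os.getenv('REDIS_PORT') or 6379)
redis_password = os.getenv('REDIS_PASSWORD') or None  # empty means no password

# Connect to the Redis server using the settings above
conn = redis.Redis(host=redis_host, port=redis_port, password=redis_password)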

def vectorize_url(post_url):
    # Fetch the blog post
    r = requests.get(post_url)
    post_soup = BeautifulSoup(r.text, 'html.parser')
    post_text = post_soup.find('div', {'class': 'entry-content'}).text
    print(f"Fetched blog post from {post_url}")

    # Generate the vector
    embedding = openai.Embedding.create(
        input=post_text,
        model="text-embedding-ada-002"
    )
    vector = embedding["data"][0]["embedding"]
    vector = np.array(vector).astype(np.float32).tobytes()  # Serialize the vector
    print(f"Generated vector for {post_url}")
    
    # Store in Redis
    post_hash = {
        "url": post_url,
        "embedding": vector
    }
    # Write both fields in a single HSET call
    conn.hset("post:" + post_url, mapping=post_hash)
    print(f"Stored vector in Redis for {post_url}")

# Test the function with a specific URL
vectorize_url('https://innerinetcompany.com/2023/07/09/use-openai-to-vectorize-content-and-store-in-redis-python-code-within/')

Example for .env

OPENAI_API_KEY=your_openai_api_key
REDIS_HOST=127.0.0.1
REDIS_PORT=6379
REDIS_PASSWORD=
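
Values read from the environment are always strings (or missing when unset), so the port needs converting and an empty password should be treated as “no password”. A small sketch of that handling follows. redis_settings is a hypothetical helper, and the defaults are assumptions that mirror the example values above:

```python
import os


def redis_settings(env=None):
    """Build keyword arguments for redis.Redis from environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("REDIS_HOST") or "127.0.0.1",
        # Environment values are strings, so convert the port to an int
        "port": int(env.get("REDIS_PORT") or 6379),
        # An empty REDIS_PASSWORD= line means no password
        "password": env.get("REDIS_PASSWORD") or None,
    }


if __name__ == "__main__":
    example = {"REDIS_HOST": "127.0.0.1", "REDIS_PORT": "6379", "REDIS_PASSWORD": ""}
    print(redis_settings(example))
    # The result could then be passed on with: redis.Redis(**redis_settings())
```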

Conclusion

This short script combines OpenAI’s text embeddings with Redis’s fast data retrieval to store a numerical representation of any web page’s text. It is a starting point for recommendation engines, search engines, and other applications that need to understand and compare pieces of text.

Vectorize .txt files and store to Redis

Here is a simple example that vectorizes text content from .txt files using the OpenAI API and stores the result in a Redis database. Note that this won’t work for other file types (e.g., images, PDFs, Word documents), but it’s a start:

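from dotenv import load_dotenv  # provided by the python-dotenv package
import redis
import openai
import numpy as np
import os

# Load the variables defined in .env into the environment
load_dotenv()

# OpenAI API key
openai.api_key = os.getenv('OPENAI_API_KEY')

redis_host = os.getenv('REDIS_HOST') or '127.0.0.1'
redis_port = int(os.getenv('REDIS_PORT') or 6379)
redis_password = os.getenv('REDIS_PASSWORD') or None  # empty means no password

# Connect to the Redis server using the settings above
conn = redis.Redis(host=redis_host, port=redis_port, password=redis_password)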

def vectorize_file(file_path):
    # Open the file and read the content (UTF-8 is assumed)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

    # Generate the vector
    embedding = openai.Embedding.create(
        input=content,
        model="text-embedding-ada-002"
    )
    vector = embedding["data"][0]["embedding"]
    vector = np.array(vector).astype(np.float32).tobytes()  # Serialize the vector

    print("Vector has been generated.")

    # Store in Redis
    file_hash = {
        "path": file_path,
        "embedding": vector
    }
    for key, value in file_hash.items():
        conn.hset("file:" + file_path, key, value)

    print("Vector has been stored in Redis.")

# Test the function with a specific file
vectorize_file('/path/to/your/file.txt')

Replace /path/to/your/file.txt in the last line with the path to your own text file.

In this script, vectorize_file is a function that takes a file path as an argument. It opens the file, reads the content, generates a vector from the content using the OpenAI API, and stores the vector in a Redis database.

You can call vectorize_file with the path to any text file to vectorize its content and store the vector in Redis. For example, you could call vectorize_file('/path/to/your/file.txt') to vectorize a specific file.
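
To handle a whole folder of text files rather than one path at a time, the file list can be collected first. This is a minimal sketch. find_txt_files is a hypothetical helper, and it only looks at files directly inside the given directory, not in subfolders:

```python
from pathlib import Path


def find_txt_files(directory):
    """Return the .txt files directly inside a directory, sorted by name."""
    return sorted(str(path) for path in Path(directory).glob("*.txt"))


# Example usage with the function defined above (placeholder directory):
#   for file_path in find_txt_files("/path/to/your/folder"):
#       vectorize_file(file_path)
```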

Prompt engineered with OpenAI Code Interpreter.

Embeddings? Vectorize-to-Redis

Command for redis-cli in RedisInsight:

FT.SEARCH posts https

The above command searches my “posts” index for the term “https”, which matches the indexed url field. It finds 9 posts and returns each matching key together with its stored fields, including the embedding (output abbreviated):

1) "9"
2) "post:https://innerinetcompany.com/2018/01/23/first-blog-post/"
3) 1) "embedding"
   2) "the actual very long output of embeddings for the post"

Perform Redis Vector Queries

Using redis-cli in RedisInsight, the command format is:

FT.SEARCH yourIndexName "word or words to query"

FT.SEARCH posts "innerinetcompany.com"

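This is a full-text match on the indexed url field; with the index created earlier, the embeddings are stored but not compared. It outputs the number of results, then each matching key and its stored fields, including the embedding: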

1) "11"
2) "post:https://innerinetcompany.com/2018/01/23/first-blog-post/"
3) 1) "embedding"
   2) "\xe5\x02_<b\xa7\xf1\xb ...
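
A nearest-neighbour query over the embeddings themselves needs an index whose schema declares embedding as a VECTOR field, which the FT.CREATE command in this post does not do. Assuming such an index exists, with a float32 vector field named embedding, a KNN query could be assembled roughly like this. knn_query_args is a hypothetical helper, and score is an arbitrary alias for the returned distance:

```python
import numpy as np


def knn_query_args(query_vector, k=3, index="posts"):
    """Return FT.SEARCH arguments for a K-nearest-neighbours vector query.

    Assumes the index has a VECTOR field named "embedding" holding float32 data.
    """
    blob = np.asarray(query_vector, dtype=np.float32).tobytes()
    return [
        "FT.SEARCH", index,
        f"*=>[KNN {k} @embedding $vec AS score]",
        "RETURN", "2", "url", "score",
        "SORTBY", "score",            # smallest distance first
        "PARAMS", "2", "vec", blob,   # the query vector is passed as a parameter
        "DIALECT", "2",               # query parameters need dialect 2 or later
    ]


# Example usage, assuming `conn` is a redis.Redis connection and `values` is
# the list of floats returned by the embeddings API for the query text:
#   results = conn.execute_command(*knn_query_args(values, k=5))
```

The query text has to be embedded with the same model as the stored posts, otherwise the distances are not meaningful.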

GEOADD

In redis-cli, GEOADD adds geospatial items to a key (a sorted set, separate from the search index) in the order longitude, latitude, name:

GEOADD yourindex:locations 86.9250 27.9867 "Mount Everest"

or you can use a plain key name such as locations:

GEOADD locations 86.9250 27.9867 "Mount Everest"

Then look up the coordinates of a stored item with GEOPOS:

GEOPOS yourindex:locations "Mount Everest"

or

GEOPOS locations "Mount Everest"

The output is:

> GEOPOS locations "Mount Everest"
1) 1) "86.92500025033950806"
   2) "27.98669911504683938"
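
GEOADD takes longitude before latitude, which is easy to get backwards. Here is a small sketch of a helper that checks the coordinate ranges Redis accepts before building the command. geoadd_args is a hypothetical name, not a redis-py function:

```python
def geoadd_args(key, longitude, latitude, name):
    """Return GEOADD arguments after checking the coordinate ranges."""
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    # Redis limits latitude to roughly +/-85.05 degrees
    if not -85.05112878 <= latitude <= 85.05112878:
        raise ValueError(f"latitude out of range: {latitude}")
    return ["GEOADD", key, str(longitude), str(latitude), name]


if __name__ == "__main__":
    print(geoadd_args("locations", 86.9250, 27.9867, "Mount Everest"))
    # With redis-py, the arguments could be sent with:
    #   conn.execute_command(*geoadd_args("locations", 86.9250, 27.9867, "Mount Everest"))
```

The range check only catches swapped coordinates when the swapped value falls outside the latitude limit, as it does for Mount Everest.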

Learn more at Redis.io.

