We are Codebeez. We specialise in everything Python and help our clients build custom, scalable, robust, and maintainable solutions.

The scenario

Imagine you have a database with a large number of schemas, and each of these schemas has a large number of tables. One day your boss comes along and asks you to create a CRUD API for the entire database. What do you do? You don't want to write all these routes by hand; there is a lot of overlap, after all.

This was the exact premise of the latest (public) hackathon that we organised at our office: how can we generate an "as complete as possible" API given only a PostgreSQL database? This is a write-up of the different solutions the attendees came up with.

The setup

For the hackathon we used the 2014 Microsoft AdventureWorks dataset, which contains 68 tables and 20 views. We felt this was a good representation of a realistic database that you might want to expose via an API for internal or external consumption. You can find a convenient visual representation of the tables here. Everything ran in Docker, and we had access to all files used to create and populate the database.

We were also encouraged to look beyond what we were comfortable with and try methods such as AI code completion or generation. More on that later.

Different approaches

The different approaches can be roughly split into two main categories:

  1. Generate an API automatically
  2. Generate an API automagically

Let's look at them separately!

Automatically

When talking about automatically generating an API based on the PostgreSQL database in this context, I'm referring to using a tool or library that extracts (meta)data from the database and uses this to set up an API.

During the hackathon, three distinct tools were used by different people:

FastAPI Code Generator

With us mostly being a Python shop, this was one of the more obvious solutions. This library takes one or more OpenAPI files as input and generates both Pydantic models and a set of matching routes. The library is still in an experimental phase, but it performs well as a basis for further development.

Let's take a look at an example from their docs to get a better understanding of how it works. First, we start with an OpenAPI input file that looks like this (shortened for brevity, check here for the full file):

openapi: "3.0.0"
info:
  version: 1.0.0
  title: Swagger Petstore
  license:
    name: MIT
servers:
  - url: http://petstore.swagger.io/v1
paths:
  /pets:
    get:
      summary: List all pets
      operationId: listPets
      tags:
        - pets
      parameters:
        - name: limit
          in: query
          description: How many items to return at one time (max 100)
          required: false
          schema:
            type: integer
            format: int32
      responses:
        '200':
          description: A paged array of pets
          headers:
            x-next:
              description: A link to the next page of responses
              schema:
                type: string
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Pets"
        # shortened for brevity
    post:
      summary: Create a pet
      operationId: createPets
      tags:
        - pets
      responses:
        '201':
          description: Null response
      # shortened for brevity

  /pets/{petId}:
    get:
      summary: Info for a specific pet
      operationId: showPetById
      tags:
        - pets
      parameters:
        - name: petId
          in: path
          required: true
          description: The id of the pet to retrieve
          schema:
            type: string
      responses:
        '200':
          description: Expected response to a valid request
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Pets"
        # shortened for brevity
components:
  schemas:
    Pet:
      required:
        - id
        - name
      properties:
        id:
          type: integer
          format: int64
        name:
          type: string
        tag:
          type: string
    Pets:
      type: array
      description: list of pet
      items:
        $ref: "#/components/schemas/Pet"
    Error:
      required:
        - code
        - message
      properties:
        code:
          type: integer
          format: int32
        message:
          type: string

Now, when we feed this file to the library (via its fastapi-codegen command), it automatically generates the following files.

First, an app/main.py file containing the actual routes that can be called. As you can see, it only generates a valid skeleton; it's up to us to implement the business logic of fetching and storing data (a sketch of what that might look like follows after the generated files).

# generated by fastapi-codegen:
#   filename:  api.yaml
#   timestamp: 2020-06-14T10:45:22+00:00

from __future__ import annotations

from typing import Optional

from fastapi import FastAPI, Query

from .models import Pets

app = FastAPI(version="1.0.0", title="Swagger Petstore", license="{'name': 'MIT'}",)


@app.get('/pets', response_model=Pets)
def list_pets(limit: Optional[int] = None) -> Pets:
    """
    List all pets
    """
    pass


@app.post('/pets', response_model=None)
def create_pets() -> None:
    """
    Create a pet
    """
    pass


@app.get('/pets/{pet_id}', response_model=Pets)
def show_pet_by_id(pet_id: str = Query(..., alias='petId')) -> Pets:
    """
    Info for a specific pet
    """
    pass

The corresponding Pydantic models are also generated in the app/models.py file:

# generated by datamodel-codegen:
#   filename:  api.yaml
#   timestamp: 2020-06-14T10:45:22+00:00

from typing import List, Optional

from pydantic import BaseModel, Field


class Pet(BaseModel):
    id: int
    name: str
    tag: Optional[str] = None


class Pets(BaseModel):
    __root__: List[Pet] = Field(..., description='list of pet')


class Error(BaseModel):
    code: int
    message: str
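
To give an idea of what filling in that business logic could look like, here is a minimal sketch of the list_pets route backed by a PostgreSQL query, assuming it replaces the generated skeleton in app/main.py. The pets table, its columns, and the connection details are illustrative assumptions, not something the generator produces.

# Sketch: one possible implementation of the generated list_pets skeleton,
# assuming it lives in the generated app/main.py. The pets table, its columns
# and the connection details are illustrative assumptions.
import psycopg2

from .models import Pet, Pets

conn = psycopg2.connect(
    dbname="Adventureworks", user="postgres", password="postgres", host="db"
)


@app.get('/pets', response_model=Pets)
def list_pets(limit: Optional[int] = None) -> Pets:
    """
    List all pets
    """
    with conn.cursor() as cur:
        cur.execute("SELECT id, name, tag FROM pets LIMIT %s", (limit or 100,))
        rows = cur.fetchall()
    return Pets(__root__=[Pet(id=id_, name=name, tag=tag) for id_, name, tag in rows])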

Looking at this solution, we arrive at the following pros and cons:

Pros
  • Automatic generation of all specified routes.
  • Automatic generation of all corresponding Pydantic models.
  • A great amount of flexibility in business logic.
  • Based on OpenAPI files, which are easy to understand by developers and non-developers.

Cons
  • You need to generate an OpenAPI file yourself.
  • No business logic is generated, only boilerplate.
  • Only works for FastAPI.

Now, the downside of needing an OpenAPI file was cleverly solved during the hackathon by giving the provided CREATE TABLE SQL statements to ChatGPT 3.5 and asking it to translate them into a valid OpenAPI spec. A similar approach was taken for the business logic: ChatGPT 3.5 was asked to generate the CRUD logic given the FastAPI skeleton that the library had generated.

A good combination of automatically and automagically!
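
This translation step was done by hand in the ChatGPT interface during the hackathon, but if you wanted to script it, a call along the following lines could work. This is a minimal sketch assuming the openai Python SDK, an API key in the environment, and a hypothetical create_tables.sql file holding the DDL; the exact prompt used at the hackathon differed.

# Sketch: ask an LLM to translate CREATE TABLE statements into an OpenAPI spec.
# Assumes the openai SDK is installed and OPENAI_API_KEY is set; the file name
# create_tables.sql is hypothetical.
from pathlib import Path

from openai import OpenAI

client = OpenAI()
ddl = Path("create_tables.sql").read_text()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": (
                "Translate the following PostgreSQL CREATE TABLE statements into a "
                "valid OpenAPI 3.0 YAML spec with CRUD paths for each table:\n\n" + ddl
            ),
        }
    ],
)

Path("api.yaml").write_text(response.choices[0].message.content)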

Django Inspectdb

Another, similar solution is Django's inspectdb utility, which introspects an existing database and generates matching Django models, filling a role comparable to fastapi-code-generator but specifically for Django. Let's look at a quick example.

Let's say we have a database table called Pets that looks something like this:

Name   age   color   species
Fido   3     gold    dog
Bud    6     brown   dog
Max    2     red     cat

and we give it as input to the inspectdb utility (python manage.py inspectdb). The output would look like this:

# This is an auto-generated Django model module.
# You'll have to do the following manually to clean this up:
#   * Rearrange models' order
#   * Make sure each model has one field with primary_key=True
#   * Make sure each ForeignKey and OneToOneField has `on_delete` set to the desired behavior
#   * Remove `managed = False` lines if you wish to allow Django to create, modify, and delete the table
# Feel free to rename the models, but don't rename db_table values or field names.
from django.db import models


class Pets(models.Model):
    index = models.BigIntegerField(blank=True, null=True)
    name = models.CharField(blank=True, null=True)
    age = models.BigIntegerField(blank=True, null=True)
    color = models.CharField(blank=True, null=True)
    species = models.CharField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'pets'

With this output, some additional steps were taken to arrive at the full CRUD API:

  • Two more custom management commands had to be written that build generic serialisers (DTOs) and viewsets (CRUD) for each model (a sketch of such a pair is shown after this list).
  • The newly generated viewsets were connected to Django URLs.
  • The built-in Django REST framework schema generator was enabled based on the viewsets/serialisers, and the resulting schema was fed into a custom template view that loads a Swagger UI.
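
To give an idea of what those generated pieces look like, here is a minimal sketch of a serialiser/viewset pair and its URL registration for the Pets model above. The names PetsSerializer and PetsViewSet are hypothetical; the code actually generated during the hackathon may have looked different.

# Sketch of a generic Django REST framework serialiser + viewset for one model.
# PetsSerializer and PetsViewSet are hypothetical names for illustration.
from rest_framework import routers, serializers, viewsets

from .models import Pets


class PetsSerializer(serializers.ModelSerializer):
    class Meta:
        model = Pets
        fields = "__all__"


class PetsViewSet(viewsets.ModelViewSet):
    queryset = Pets.objects.all()
    serializer_class = PetsSerializer


# Wiring the viewset into the URL configuration gives the full set of CRUD routes.
router = routers.DefaultRouter()
router.register(r"pets", PetsViewSet)
urlpatterns = router.urls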

And with that, the API was done. Now, if we look at this solution, the pros and cons are similar to the previous one:

Pros
  • Automatic generation of models from the database.
  • A great amount of flexibility in business logic.
  • Based on the database directly, no intermediate files needed.

Cons
  • Multiple DB definitions needed in order to support multiple schemas.
  • Custom commands needed to go from models to a full API.
  • Only works for Django.

Some of these can again be sped up or entirely mitigated by having an LLM generate more of the boilerplate or business logic.

PostgREST

In this case, ChatGPT 4 was used as a glorified search engine: the context of the situation was explained, and it was asked how to tackle this problem efficiently. Its suggestion was the only tool used that does not generate a Python-based API: PostgREST.

This is a standalone web server that turns your PostgreSQL database directly into a RESTful API. The structural constraints and permissions in the database determine the API endpoints and operations.

So, in other words, it looks at all the tables in the database, together with their constraints and types, and immediately spins up a server with the CRUD endpoints!

Setting it up was quite easy because we were using Docker during the hackathon. Simply adding it to the docker-compose.yml file and pointing it to the PostgreSQL database was enough to get up and running:

server:  
    image: postgrest/postgrest  
    ports:  
        - "3000:3000"  
    links:  
        - db:db  
    environment:  
        PGRST_DB_URI: postgres://app_user:secret@db:5432/postgres  
        PGRST_DB_SCHEMA: public  
        PGRST_DB_ANON_ROLE: app_user 
        PGRST_SERVER_PROXY_URI: "http://127.0.0.1:3000"  
    depends_on:  
        - db

After this, we create a new role that PostgREST can use to interact with the database; this is also where you set restrictions on your API:

CREATE ROLE web_anon NOLOGIN;
GRANT USAGE ON SCHEMA public TO web_anon;

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO web_anon;

After this, we can start using the API!

curl -X GET http://localhost:3000/pets
[
    {
        "id": 1,
        "name": "Fido"
        ...
    },
    {
        "id": 2,
        "name": "Max"
        ...
    }
]
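
Beyond plain CRUD calls, PostgREST also exposes filtering and column selection through query parameters. Here is a quick sketch using the requests library; the pets table and its name, age and species columns are just the running example from above, not part of the AdventureWorks database.

# Sketch: querying a PostgREST endpoint with its filter syntax.
# The pets table and its species/name/age columns are illustrative.
import requests

response = requests.get(
    "http://localhost:3000/pets",
    params={"species": "eq.dog", "select": "name,age"},  # PostgREST filter syntax
)
response.raise_for_status()
print(response.json())  # e.g. [{"name": "Fido", "age": 3}, ...]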

Pros
  • Easy setup.
  • Completely based on the tables, constraints and types of the database.
  • No additional work required.

Cons
  • Not flexible in terms of business logic.

If you really only need a CRUD API, this was definitely the fastest and easiest way to get one. That being said, it is also the least flexible: if you need to transform data before saving it or after retrieving it from the database, this is not the tool for you.

Automagically

When we talk about automagically, we mean tools that are sufficiently advanced that they appear magical. Think AIs or one-click solutions. Now, some of the attendees took the AI help a step further and tried to have an LLM generate the whole API. The process came down to feeding data describing the tables into an LLM and asking it to generate what we need.

If we look at the steps:

  1. Retrieve the metadata from the database using a SQL query (a sketch of such a query follows the connection code below).

  2. Add the logic needed to connect to the database, since only we know the connection details:

import psycopg2
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Define your PostgreSQL database connection parameters
db_params = {
    "dbname": "Adventureworks",
    "user": "postgres",
    "password": "postgres",
    "host": "db",  # e.g., 'localhost' or your database server's IP address
    "port": "5432",  # default is 5432
}

# Connect to the PostgreSQL database
try:
    conn = psycopg2.connect(**db_params)
except psycopg2.Error as e:
    print("Error connecting to the database:", e)
    exit(1)
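
The metadata query itself was not part of the shared snippet, but a query against information_schema along these lines could produce the per-table description used in the next step. This is a sketch; the exact query and output format used during the hackathon differed.

# Sketch: pull per-column metadata from information_schema using the connection
# created above. The exact query used during the hackathon may have differed.
cur = conn.cursor()
cur.execute(
    """
    SELECT table_schema, table_name, column_name, data_type,
           character_maximum_length, numeric_precision, column_default, is_nullable
    FROM information_schema.columns
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
    ORDER BY table_schema, table_name, ordinal_position
    """
)
for schema, table, column, dtype, max_len, precision, default, nullable in cur.fetchall():
    print(
        f"Table '{schema}.{table}' - Column Name: {column}, Data Type: {dtype}, "
        f"Max Length: {max_len}, Numeric Precision: {precision}, "
        f"Default Value: {default}, Is Nullable: {nullable}"
    )
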
  3. Paste the output of the SQL query into VSCode and ask Copilot to generate the necessary code:
"""  
Metadata for table 'humanresources.shift':  
Column Name: shiftid, Data Type: integer, Max Length: None, Numeric Precision: 32, Default Value: nextval('humanresources.shift_shiftid_seq'::regclass), Is Nullable: NO  
Constraint Name: PK_EmployeeDepartmentHistory_BusinessEntityID_StartDate_Departm  
Column Name: shiftid, Data Type: integer, Max Length: None, Numeric Precision: 32, Default Value: nextval('humanresources.shift_shiftid_seq'::regclass), Is Nullable: NO  
Constraint Name: PK_Shift_ShiftID  
Column Name: shiftid, Data Type: integer, Max Length: None, Numeric Precision: 32, Default Value: nextval('humanresources.shift_shiftid_seq'::regclass), Is Nullable: NO  
Constraint Name: FK_EmployeeDepartmentHistory_Shift_ShiftID  
Column Name: name, Data Type: character varying, Max Length: 50, Numeric Precision: None, Default Value: None, Is Nullable: NO  
Column Name: starttime, Data Type: time without time zone, Max Length: None, Numeric Precision: None, Default Value: None, Is Nullable: NO  
Column Name: endtime, Data Type: time without time zone, Max Length: None, Numeric Precision: None, Default Value: None, Is Nullable: NO  
Column Name: modifieddate, Data Type: timestamp without time zone, Max Length: None, Numeric Precision: None, Default Value: now(), Is Nullable: NO  
"""

# please create a fastapi CRUD application with a pydantic model for this schema
  4. Repeat for all 64 tables.

Now, as you can imagine, this method had its pros and cons.

Pros
  • No tools needed besides Copilot.
  • No need to write the (boilerplate) code yourself.
  • Works completely within your IDE.

Cons
  • Doesn't scale well to a large number of tables.
  • Fine-tuning what you ask Copilot, and how you ask it, is required to get good output.

If you have a relatively small database and you don't mind going back and forth with Copilot a bit, this is a viable solution. Beyond that, however, it becomes difficult and requires quite a bit of tweaking.

Conclusion

So, what is the best way to do this? As with most things, it depends. There are plenty of tools available that will make the process of generating a CRUD API from an existing database quite easy. And it looks like the rise of AI will continue to simplify the process even further.

If my boss actually asked me to do this, and the database size wasn't trivial, I would opt for a combination of automatically and automagically, such as the fastapi-code-generator approach described above. Having an AI generate the predictable boilerplate and a dedicated tool handle the necessary code seems like the best solution with the most flexibility, at least at the time of writing.

Who knows what the future will bring!

Do you have any questions or are you interested in knowing how Codebeez can help you or your business? Don't hesitate to reach out!