Eden Data Language: introduction

Started by CultLeader, August 01, 2022, 05:35:15 PM



https://github.com/cultleader777/EdenDB

So, this is a brand new language that focuses on... Simply defining data. Like YAML, but not garbage: typesafe, no nulls or NaNs, with schemas, and blazing fast to query. It compiles and emits outputs in milliseconds and relentlessly checks for logical errors to disallow funny stuff in production.

Let's have a taste, shall we?

Say we have a table of servers:

TABLE server {
  hostname TEXT PRIMARY KEY,
  ram_mb INT,
}


They can have disks as children:

TABLE disks {
  disk_id TEXT PRIMARY KEY CHILD OF server,
  size_bytes INT,
}


Looks similar to SQL, so who needs another SQL dialect? Ok, let's see how we can define a server with disks, in three ways:

DATA server {
  my-precious-epyc1, 4096;
  my-precious-epyc2, 8192;
}


As you can see, this is much less verbose and straight to the point compared to an INSERT INTO ... statement.

There is no concept of an insert statement, because the data is immutable.

Let's add some disks, 1TB and 1.5TB, as children of our servers:

DATA disks(hostname, disk_id, size_bytes) {
  my-precious-epyc1, root-disk, 1000000000000;
  my-precious-epyc2, root-disk, 1500000000000;
}


The column order (hostname, disk_id, size_bytes) is explicit here, but, as we saw with server, the default tuple order is assumed if omitted.

Adding separate rows for every server in a separate place like this is rather verbose. Also, we redundantly repeat the hostname for every disk.

Could we have everything in one place, so that it would look structured?

We can, with the WITH statement. Let's define the same data in a more concise way:

DATA server {
  my-precious-epyc1, 4096 WITH disks {
    root-disk, 1000000000000;
  },
  my-precious-epyc2, 8192 WITH disks {
    root-disk, 1500000000000;
  },
}


And here's the magic. We can define child or foreign-key elements with the WITH keyword, and the key of the parent element travels down to its children. There can be many levels of children, and WITH statements can be nested arbitrarily deep. There's a bunch of logic to check for errors; you can read the test suite.

But long story short, this is not the garbage of `.yaml` where you can write anything you want, run tests and pray it will work. Every field is checked to be of the appropriate type, all fields must be defined, no null values are allowed, and so on. Consider this typesafety as strong as defining a struct in Rust or OCaml and knowing it will never have nulls or funny values.
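
To make the analogy concrete, the guarantee is like having this plain Rust struct (a rough sketch, not the actual generated code, which we'll see later): there is simply no slot where a null could hide:

struct Server {
    hostname: String, // always present, never null
    ram_mb: i64,      // always a real integer, never NaN or null
}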

Also, there's another variation of `WITH` where our elements begin to look just like traditional nested maps in programming languages. This statement is equivalent to the two above:

DATA STRUCT server [
  {
    hostname: my-precious-epyc1, ram_mb: 4096 WITH disks {
        disk_id: root-disk,
        size_bytes: 1000000000000,
    },
  },
  {
    hostname: my-precious-epyc2, ram_mb: 8192 WITH disks [{
        disk_id: root-disk,
        size_bytes: 1500000000000,
    }]
  }
]


So, we represented the same data in three ways and all this will end up as columns for us to analyze.

Aight, what can we do about our data correctness?

We can prove anything we want about our data with good ol' SQL (thanks, embedded SQLite).

Let's write a proof that we don't have disks smaller than 10GB.


PROOF "no disks exist less than 10 gigabytes" NONE EXIST OF disks {
  SELECT rowid
  FROM disks
  WHERE size_bytes < 10000000000
}


How do we prove we have nothing of a certain kind? By searching for it and finding none. This is important: if this proof fails, Eden data language returns the rowid of the offending rows, which can be printed out so the user can figure out where to fix their data.

Okay, an SQL proof might be a little overkill for this; we also have column checks (thanks, embedded Lua).

Here's how we could have defined the disks table to have the same thing checked with Lua:

TABLE disks {
  disk_id TEXT PRIMARY KEY CHILD OF server,
  size_bytes INT,

  CHECK { size_bytes >= 10000000000 }
}


Also, let's have a Lua computed column for the disk's size in megabytes:


TABLE disks {
  disk_id TEXT PRIMARY KEY CHILD OF server,
  size_bytes INT,
  size_mb INT GENERATED AS { size_bytes / 1000000 },

  CHECK { size_bytes >= 10000000000 }
}


The Lua snippet must return an integer, or the compiler yells at you. Computed columns are never allowed to return nulls either.

But wait, there's more!

Datalog is the most elegant query language I've ever seen in my life. But for Datalog to be of any practical use at all, we cannot have any queries that refer to non-existent facts or rules. I've found a decent Datalog implementation for Rust, https://crates.io/crates/asdi ; unfortunately, it does not support `NOT` statements or comparison operators like less than and greater than. So, I'll show two Datalog proofs against our data: one that works today (even though it is meaningless) and one that would be practical and will work as soon as asdi starts supporting the less-than operator.


PROOF "no disks exist less than 10 gigabytes, doesn't work today" NONE EXIST OF disks DATALOG {
  OUTPUT(Offender) :- t_disks__size_mb(Size, Offender), Size < 10000.
}

PROOF "just check that no disks with name 'doofus' exists, works" NONE EXIST OF disks DATALOG {
  OUTPUT(Offender) :- t_disks__disk_id("doofus", Offender).
}


As you can see, our computed column size_mb is available to query in Datalog.

But wait, there's more! EdenDB exports typesafe Rust or OCaml code to work with. Once everything is verified, we can generate typesafe code and analyze our data in a full programming language if we really want to perform some super advanced analysis.

Let's run this:

edendb example.edl --rust-output-directory .


On my computer this compilation literally took 6 milliseconds (thanks, Rust!). It should never even register in a profile of your build process, unlike most compilers these days...

Let's see the Rust source that got generated!


// Test db content
const DB_BYTES: &[u8] = include_bytes!("edb_data.bin");
lazy_static!{
    pub static ref DB: Database = Database::deserialize(DB_BYTES).unwrap();
}

// Table row pointer types
#[derive(Copy, Clone, Debug, serde::Deserialize)]
pub struct TableRowPointerServer(usize);

#[derive(Copy, Clone, Debug, serde::Deserialize)]
pub struct TableRowPointerDisks(usize);


// Table struct types
#[derive(Debug)]
pub struct TableRowServer {
    pub hostname: ::std::string::String,
    pub ram_mb: i64,
    pub children_disks: Vec<TableRowPointerDisks>,
}

#[derive(Debug)]
pub struct TableRowDisks {
    pub disk_id: ::std::string::String,
    pub size_bytes: i64,
    pub size_mb: i64,
    pub parent: TableRowPointerServer,
}


// Table definitions
pub struct TableDefinitionServer {
    rows: Vec<TableRowServer>,
    c_hostname: Vec<::std::string::String>,
    c_ram_mb: Vec<i64>,
    c_children_disks: Vec<Vec<TableRowPointerDisks>>,
}

pub struct TableDefinitionDisks {
    rows: Vec<TableRowDisks>,
    c_disk_id: Vec<::std::string::String>,
    c_size_bytes: Vec<i64>,
    c_size_mb: Vec<i64>,
    c_parent: Vec<TableRowPointerServer>,
}


// Database definition
pub struct Database {
    server: TableDefinitionServer,
    disks: TableDefinitionDisks,
}

// Database implementation
impl Database {
    pub fn server(&self) -> &TableDefinitionServer {
        &self.server
    }

    pub fn disks(&self) -> &TableDefinitionDisks {
        &self.disks
    }

    pub fn deserialize(compressed: &[u8]) -> Result<Database, Box<dyn ::std::error::Error>> {
        // boring deserialization stuff
        ...
    }
}

// Table definition implementations
impl TableDefinitionServer {
    pub fn len(&self) -> usize {
        self.rows.len()
    }

    pub fn rows_iter(&self) -> impl ::std::iter::Iterator<Item = TableRowPointerServer> {
        (0..self.rows.len()).map(|idx| {
            TableRowPointerServer(idx)
        })
    }

    pub fn row(&self, ptr: TableRowPointerServer) -> &TableRowServer {
        &self.rows[ptr.0]
    }

    pub fn c_hostname(&self, ptr: TableRowPointerServer) -> &::std::string::String {
        &self.c_hostname[ptr.0]
    }

    pub fn c_ram_mb(&self, ptr: TableRowPointerServer) -> i64 {
        self.c_ram_mb[ptr.0]
    }

    pub fn c_children_disks(&self, ptr: TableRowPointerServer) -> &[TableRowPointerDisks] {
        &self.c_children_disks[ptr.0]
    }

}

impl TableDefinitionDisks {
    pub fn len(&self) -> usize {
        self.rows.len()
    }

    pub fn rows_iter(&self) -> impl ::std::iter::Iterator<Item = TableRowPointerDisks> {
        (0..self.rows.len()).map(|idx| {
            TableRowPointerDisks(idx)
        })
    }

    pub fn row(&self, ptr: TableRowPointerDisks) -> &TableRowDisks {
        &self.rows[ptr.0]
    }

    pub fn c_disk_id(&self, ptr: TableRowPointerDisks) -> &::std::string::String {
        &self.c_disk_id[ptr.0]
    }

    pub fn c_size_bytes(&self, ptr: TableRowPointerDisks) -> i64 {
        self.c_size_bytes[ptr.0]
    }

    pub fn c_size_mb(&self, ptr: TableRowPointerDisks) -> i64 {
        self.c_size_mb[ptr.0]
    }

    pub fn c_parent(&self, ptr: TableRowPointerDisks) -> TableRowPointerServer {
        self.c_parent[ptr.0]
    }

}


To get rows from a table, we can only go through a TableRowPointer, a special type generated for every single table. This type is basically a wrapped usize and is a zero-cost abstraction in Rust. But if you were to use a raw usize, you'd have a chance of passing a disks index to the server table; with TableRowPointerDisks that is impossible. A TableRowPointer is just an index into the arrays of our column- or row-stored data.

There are only two ways to get a TableRowPointer for a table:
1. Do a seq scan through the entire table to collect valid pointers
2. Follow pointer columns in other tables. For instance, we can get the generated .parent from a disk row, or the children_disks column generated for the server table
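
Here's a quick sketch of both ways using the generated API above (total_disk_mb is a hypothetical helper, and it assumes the generated database module is in scope):

fn total_disk_mb(db: &database::Database) -> i64 {
    let mut total = 0;
    // way 1: a seq scan through the server table yields valid pointers
    for server in db.server().rows_iter() {
        // way 2: follow the generated children_disks pointer column
        for disk in db.server().c_children_disks(server) {
            // *disk is a TableRowPointerDisks; passing it to db.server()
            // would be a compile-time error, not a runtime surprise
            total += db.disks().c_size_mb(*disk);
        }
    }
    total
}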

The database is immutable and can never be mutated; in Rust, the user only ever gets immutable references to columns. Meaning, you can run thousands of proofs in parallel against this data without ever worrying about locks, consistency and so on. Once your database is compiled it is a done deal, and it will never change.
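
A minimal sketch of that, assuming the generated database module and lazy_static setup from above (the worker count and the check itself are made up for illustration):

use std::thread;

fn main() {
    // lazy_static hands out a shared &'static Database; all access is
    // read-only, so the threads need no locks whatsoever
    let workers: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                let db = &database::DB;
                // every worker scans the same columns concurrently
                db.disks()
                    .rows_iter()
                    .filter(|d| db.disks().c_size_bytes(*d) < 10_000_000_000)
                    .count()
            })
        })
        .collect();

    for w in workers {
        assert_eq!(w.join().unwrap(), 0, "proof failed");
    }
}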

So we don't have nasty garbage in our API like: does this row exist? I'd better return an Option! No, if you ever have a TableRowPointer value it will ALWAYS point to a valid row in one and only one table. You can clone that pointer for free, store it in a million other places, and it will always point to the same row in the same table.
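
A small sketch of what that buys us (again assuming the generated module; big_disks is a hypothetical helper): collect pointers now, use them much later, with no existence checks anywhere:

fn big_disks(db: &database::Database) -> Vec<database::TableRowPointerDisks> {
    // pointers are Copy: collecting them costs nothing and they stay valid forever
    db.disks()
        .rows_iter()
        .filter(|d| db.disks().c_size_bytes(*d) > 1_000_000_000_000)
        .collect()
}

// much later, in a different part of the program:
fn print_big_disks(db: &database::Database, disks: &[database::TableRowPointerDisks]) {
    for d in disks {
        // no Option, no unwrap: the pointer is valid by construction
        println!("{}", db.disks().c_disk_id(*d));
    }
}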

Of course, we should prefer the column API for performance; getting an entire table row is provided for convenience.

Deserialization assumes a certain order in the include_bytes!() binary data and does not maintain metadata about which tables were already read; we just assume that EdenDB wrote things in a certain order and we read them back in the same order. Also, the binary data for Rust is compressed with lz4 compression.
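
Purely as an illustration of that idea (this is not EdenDB's actual wire format; the sketch assumes the bincode and lz4_flex crates, and the column values are made up):

fn roundtrip() -> Result<(), Box<dyn std::error::Error>> {
    // write side: serialize the columns in a fixed order, then compress
    let mut raw = Vec::new();
    bincode::serialize_into(&mut raw, &vec!["epyc1".to_string(), "epyc2".to_string()])?;
    bincode::serialize_into(&mut raw, &vec![4096i64, 8192])?;
    let compressed = lz4_flex::compress_prepend_size(&raw);

    // read side: decompress and read everything back in exactly the same
    // order; no table of contents or per-table markers are needed
    let bytes = lz4_flex::decompress_size_prepended(&compressed)?;
    let mut cursor: &[u8] = &bytes;
    let hostnames: Vec<String> = bincode::deserialize_from(&mut cursor)?;
    let ram_mb: Vec<i64> = bincode::deserialize_from(&mut cursor)?;
    assert_eq!(hostnames.len(), ram_mb.len());
    Ok(())
}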

The OCaml API is analogous and gives the same guarantees: immutable data, a separate pointer type for each table, and so on.

Okay, let's do something practical with this: let's prove that no intel disk is ever next to some cheap crucial disk, to never humiliate them like that!

We'll change the schema a little bit and add what is effectively a disk manufacturer enum:


TABLE disk_manufacturer {
  model TEXT PRIMARY KEY,
}

DATA EXCLUSIVE disk_manufacturer {
  intel;
  crucial;
}


DATA EXCLUSIVE says that this data can only be defined once. Otherwise, we could keep filling up this table with many statements in different places; being defined exactly once is what effectively makes this table an enum.

Also, let's make our disks contain a make column referring to this data:


TABLE disks {
  disk_id TEXT PRIMARY KEY CHILD OF server,
  size_bytes INT,
  size_mb INT GENERATED AS { size_bytes / 1000000 },
  make REF disk_manufacturer,

  CHECK { size_bytes >= 10000000000 }
}


REF disk_manufacturer means this column has to refer by primary key to a single row in disk_manufacturer; you can never put an invalid disk manufacturer value in the disks table, or the compiler yells.

Now our data is redefined with disk manufacturer:

DATA STRUCT server [
  {
    hostname: my-precious-epyc1, ram_mb: 4096 WITH disks {
        disk_id: root-disk,
        size_bytes: 1000000000000,
        make: intel,
    },
  },
  {
    hostname: my-precious-epyc2, ram_mb: 8192 WITH disks [{
        disk_id: root-disk,
        size_bytes: 1500000000000,
        make: intel,
    },{
        disk_id: data-disk,
        size_bytes: 1200000000000,
        make: crucial,
    }]
  }
]


So, how would the Rust source look to prove that no intel disk is next to a cheap crucial disk? There ya go, the full main.rs file:


#[macro_use]
extern crate lazy_static;

#[allow(dead_code)]
mod database;

fn main() {
    let db = &database::DB;
    db.disks().rows_iter().filter(|i| {
        // we only care about intel disks
        db.disk_manufacturer().c_model(db.disks().c_make(*i)) == "intel"
    }).for_each(|intel_disk| {
        let parent = db.disks().c_parent(intel_disk);
        let children_disks = db.server().c_children_disks(parent);

        for pair in children_disks.windows(2) {
            match (
                db.disk_manufacturer().c_model(db.disks().c_make(pair[0])).as_str(),
                db.disk_manufacturer().c_model(db.disks().c_make(pair[1])).as_str(),
            ) {
                ("intel", "crucial") | ("crucial", "intel") => {
                    panic!("Are you kidding me bro?")
                }
                _ => {}
            }
        }
    });
}


Do you miss endless matching of Some/None options to check if some value exists by key in a table, or endless unwraps? Me neither. We just get all the values directly, because we know they will always be valid: table row pointers can only ever be valid.

The full example project can be found here.

So, basically, let's recap all the high-level tools EdenDB gives you for working with your data and building assumptions about it:

1. SQL proofs
2. Lua checks/computed columns
3. Datalog (so far meh)
4. Analyze data directly in Rust or OCaml if all else fails

Basically, four ways to analyze the same data from four different languages, and in the end all of it is statically typed and super fast/column-based.

Also, there are these features I didn't mention:
- include multiple files syntax
- include lua files
- unique constraints
- sqlite materialized views
- countless checks for correctness (see the error enum in the errors.rs file)

To see all the features of EdenDB, check out the tests; I usually try to test every feature thoroughly.

And more will be added as I see fit.

What more can I say? Having such foundations, it's time to build the most productive open source web development platform of all time 😜