3 - Datalog and data integration

Relational databases

pyDatalog can be mixed with SQLAlchemy to provide powerful query facilities on relational databases, as shown in this example.

The first step is to create a Base class that combines pyDatalog and SQLAlchemy capability, as follows :

from sqlalchemy.ext.declarative import declarative_base
# define a base class with SQLAlchemy and pyDatalog capabilities
Base = declarative_base(cls=pyDatalog.Mixin, metaclass=pyDatalog.sqlMetaMixin)

The second step is to associate a SQLAlchemy session to this Base class, with the appropriate configuration. pyDatalog will use this session to fetch data from the databases.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# create database in memory

engine = create_engine('sqlite:///:memory:', echo=False)
# open a session on a database, then associate it to the Base class
Session = sessionmaker(bind=engine)
session = Session()
Base.session = session

Classes that inherit from Base will now have both pyDatalog and SQLAlchemy capability. There are 2 ways to define the attributes of those classes : explicitly, or by inspecting the existing database tables. The first approach enables the creation of the tables in the relational database.

from sqlalchemy import Column, Integer, String, ForeignKey
class Employee(Base): # Employee inherits from the Base class
    __tablename__ = 'employee' # data are stored in the Employee table
    name = Column(String, primary_key=True)
    manager_name = Column(String, ForeignKey('employee.name'))
    salary = Column(Integer)
# now create the table
Base.metadata.create_all(engine)

The second approach can be used to reverse engineer an existing database, using __table_args__ :

class Employee(Base):
    __tablename__ = 'employee'
    __table_args__ = {'autoload':True} # autoload the model

In both cases, relations can be defined between tables.

The Employee class can now be defined with logic clauses and used in in-line queries, as explained in the previous tutorial. For example :

# who has a salary of 6300 ?
X = pyDatalog.Variable()
Employee.salary[X] == 6300
print(X) # prints [Employee: Mary]

pyDatalog uses the latest in-session data :

X[0].salary = 3000 # modify in-session data
Employee.salary[X] == 3000

print(X) # prints [Employee: Mary]

Different classes in a python program can inherit from different Base classes, each having a session on different databases : pyDatalog queries will join data from the relevant databases, performing multi-database queries effortlessly. SQLAlchemy will issue the SQL statements in the proper dialect for each database. Data changes to instances of the classes can be persisted using SQLAlchemy's transaction facilities. See SQLAlchemy's documentation and tutorial for more details.

Non-relational databases

Queries on non-relational databases can be implemented using predicate resolvers written in python. They can be mixed with queries on relational databases using conjunctive queries or logic clauses.

FoundationDB has a nice tutorial for their pyDatalog binding. The example below illustrates a binding to MongoDB. (see the source).

The first step in the example is to create records in a MongoDB database.

from pymongo import Connection
connection = Connection()
db = connection.Employees
profiles = db.profiles
profiles_to_insert = [{"name": "John", "diploma": "MSc."},
                     {"name": "Mary", "diploma": "EE"},
                     {"name": "Sam", "diploma": "MBA"}]
profiles.insert(profiles_to_insert)

For this to work, the mongoDB database must be started on the machine. See pymongo tutorial.

The second step is to add a predicate resolver for diploma to the Employee class used in the section above, as explained in the following tutorial.

from SQLAlchemy import Employee # import the SQLAlchemy example
def _pyD_diploma2(cls, employee, diploma):
    global profiles
    if employee.is_const():
        r = profiles.find_one({"name": employee.id.name})
        if r: yield (employee, r["diploma"])
        return
    raise AttributeError
# attach the resolver to the Employee class
Employee._pyD_diploma2 = classmethod(_pyD_diploma2) 

This resolver search profiles by name. A search by diploma would be resolved in a similar way, using find().

It then becomes easy to query both databases transparently:

from pyDatalog import pyDatalog
X, N, Diploma = pyDatalog.variables(3)
# Who has a salary of 6800 and a MSc. diploma
(Employee.salary[X]==6800) & (Employee.diploma[X]=="MSc.")
print(X) # prints [Employee: John]

Should diploma be later stored in the relational database, only the python resolver would need to be changed.