Dataiku: the name comes from "Data Haiku".
Under a project, there are:
Datasets
You can create datasets, as inputs / outputs, pointing to databases, cloud storage, files, etc.
Flow
A visual way of connecting datasets to transformations and other tasks.
To start editing the flow: +Add Dataset (database, cloud, folder, etc.), +Add Recipe (LLM, Python, visual recipes for data transformation).
When adding a Python recipe, it requires an input and an output: either select pre-created datasets / folders, or create them on the fly.
A Python recipe can be edited via a NOTEBOOK / Code Studio, but the notebook has a different format from the recipe's, so it needs to be converted back to a recipe for final execution. In notebook mode the output dataset is not persisted, only kept in memory; in recipe mode it is persisted… why design it like that? So weird.
Advanced settings of a recipe: you can select a Python environment for each recipe, and also a container (the compute resource) for each recipe.
Libraries / Git
Under the </> icon… another cryptic icon, go to Libraries, where you can import a git repository.
Once a git library is imported, a Python recipe within the project can reference it. E.g. if a Python file in the git repo is abc.py with a member called DOSOMETHING, then you can write "from abc import DOSOMETHING" (though beware that a name like abc shadows Python's standard-library module of the same name).
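As a sketch of what such a shared library file might look like (the path mylib/cleaning.py and the function normalize_prices are made-up names for illustration):

```python
# Hypothetical contents of mylib/cleaning.py inside the imported git repo.
import pandas as pd


def normalize_prices(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    """Scale a numeric column to the [0, 1] range; a toy example of reusable logic."""
    out = df.copy()
    span = out[col].max() - out[col].min()
    # Guard against a constant column to avoid division by zero.
    out[col] = 0.0 if span == 0 else (out[col] - out[col].min()) / span
    return out
```

A recipe in the project could then use it with "from mylib.cleaning import normalize_prices".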
Datasets
To reference a dataset within Python code, you need the dataiku library.
import dataiku
ecommerce_dataset = dataiku.Dataset("ecommerce_dataset_name")
df = ecommerce_dataset.get_dataframe()
To write to a folder
test = dataiku.Folder("folder_id_or_name")
test.upload_stream("test.png", stream)  # stream: any file-like object opened for reading
To write to a dataset
ecommerce_dataset = dataiku.Dataset("ecommerce_dataset_name")
ecommerce_dataset.write_with_schema(df)
Recipe examples
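A minimal end-to-end Python recipe combining the reads/writes above (the dataset names and the "total" column are hypothetical; since the dataiku import only resolves inside the platform, the transformation is kept as a pure function that also runs locally):

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Pure transformation: keeps the recipe logic testable outside Dataiku.
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out


try:
    import dataiku  # only importable inside a Dataiku recipe
except ImportError:
    dataiku = None

if dataiku is not None:
    # Read the recipe's input dataset, transform, and persist the output.
    src = dataiku.Dataset("ecommerce_dataset_name")
    dst = dataiku.Dataset("ecommerce_orders_enriched")  # hypothetical output dataset
    dst.write_with_schema(add_total(src.get_dataframe()))
```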
Webapps
This is the Dash app, under the </> menu item.
It is not part of a flow (why would it be? right), but a standalone web application.
You can reference the datasets built by the flows and surface them through the app.
It is a standard Dash app, but there seems to be only one source file for everything. If you need to split the source into multiple files, e.g. some for CSS, some for data access, some for callbacks… then you probably need to build those parts as a library and reference the library from the Dash app.
It seems to be accessible through the Dataiku API, but not exposed externally as a real webapp.
If you want to browse the webapps, you need to log into Dataiku.
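A sketch of what that single source file might contain (the "orders" dataset and its columns are hypothetical, and this assumes Dataiku's Dash webapps pre-create the `app` object so you only set its layout; the pure helper is the kind of code worth moving into a project library later):

```python
import pandas as pd


def top_products(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    # Pure helper: total revenue per product, highest first.
    return (
        df.groupby("product", as_index=False)["revenue"]
        .sum()
        .sort_values("revenue", ascending=False)
        .head(n)
    )


try:
    import dataiku
    from dash import dash_table, html
except ImportError:
    dataiku = None  # running outside Dataiku, e.g. for local tests

if dataiku is not None:
    # Surface a flow-built dataset through the webapp.
    summary = top_products(dataiku.Dataset("orders").get_dataframe())
    app.layout = html.Div(  # noqa: F821 -- `app` is provided by Dataiku
        [
            html.H3("Top products by revenue"),
            dash_table.DataTable(data=summary.to_dict("records")),
        ]
    )
```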
LLM / RAG
With a document dataset, simply append an LLM recipe to it, e.g. Classify Text, where you can tweak the built-in prompts for different tasks, e.g. sentiment analysis.
There is also a built-in Embed recipe for converting documents into embeddings and saving them in a knowledge bank. A knowledge bank is an embedding / vector store, e.g. a FAISS vector store (FAISS is Meta's open-source vector search library).
The LLM models are not hosted by Dataiku; you need to configure connections to your own LLM services.
Navigation between flow, project, recipe, etc. is very counter-intuitive… bad UI.
E.g. hovering over Dataiku's little bird icon changes it into a back arrow for navigating back to home… who the heck knows that trick? Just show a back button… the tool is for professional users.
Click the menu with the git-branch-like icon to navigate to the flow, datasets, etc.
Click the label name (the project name) on the top menu to navigate back to the project.
E.g. 2: the output of a recipe is a dataset by default. If you want to output files to a folder instead, there is a tiny bit of text at the bottom to switch to 'New Folder'… just list all output types in a dropdown.
The product manager / team should give the tool to a developer who has never used Dataiku before and watch how it goes. If a developer needs to read an extensive tutorial to be able to use the tool, it is already a failure.