5 STAR AI.IO TOOLS FOR YOUR BUSINESS

HELLO & WELCOME TO THE 5 STAR AI.IO TOOLS FOR YOUR BUSINESS

SHAP-E

Description:

Shap-E generates 3D objects conditioned on text or image prompts. It can produce a wide variety of shapes and designs from user input, which makes it particularly useful for designers, artists, and architects who need complex 3D models for their work. By streamlining the design and production of 3D objects, it can save a significant amount of time and effort, making it a valuable tool across many industries.

Note: This is a GitHub repository, meaning that it is code that someone created and made publicly available for anyone to use. These tools could require some knowledge of coding.

Pricing Model: GitHub

Tags: Generative Art


Generate Your First Professional AI Shap-E Project & Take Your Business to Another Level.


What Is OpenAI's Shap-E?

In May 2023, OpenAI researchers Heewoo Jun and Alex Nichol released a paper announcing Shap-E, the company's latest innovation. Shap-E is a new tool, trained on a massive dataset of paired 3D and text data, that can generate 3D models from text or images. It is similar to DALL-E, which creates 2D images from text, but Shap-E produces 3D assets.

Shap-E combines two components: 3D asset mapping and a conditional diffusion model. Mapping 3D assets means that Shap-E learns to associate text or images with corresponding 3D models from a large dataset of existing 3D objects. A conditional diffusion model is a generative model that starts from a noisy version of the target output and gradually refines it by removing noise and adding details.

By combining these two components, Shap-E can generate realistic and diverse 3D models that match the given text or image input and can be viewed from different angles and lighting conditions.

How You Can Use OpenAI's Shap-E

Shap-E has not been released as a polished, hosted product like some other OpenAI tools, but its model weights, inference code, and samples are available for download from the Shap-E GitHub page.

You can download the Shap-E code for free and install it using the Python pip command on your computer. You also need an NVIDIA GPU and a high-performance CPU, as Shap-E is very resource-intensive.
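As a quick, hedged sketch of that setup (the clone and install commands follow the project's GitHub README; the snippet then simply checks whether PyTorch can see a CUDA GPU):

```python
# Install (run in a shell, per the repository README):
#   git clone https://github.com/openai/shap-e.git
#   cd shap-e
#   pip install -e .

import torch

# Shap-E sampling is far faster on an NVIDIA GPU; this just reports what PyTorch sees.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")
```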

After installation, you can open the 3D models you generate in a viewer such as Microsoft Paint 3D. Likewise, you can convert them to STL files if you want to print them on a 3D printer.

You can also report issues and find solutions to issues already raised by others on the Shap-E GitHub page.

What You Can Do With OpenAI's Shap-E

Shap-E enables you to communicate complex ideas through visual representations. The potential applications of this technology are vast, especially since visuals typically have a far greater reach than text alone.


New AI to generate 3D Assets 🔥 from Text - SHAP-E from OpenAI



Transcript

PART 1 


[0:00] OpenAI has quietly released a new text-to-3D model, a 3D generation model that is available completely open source. It's called Shap-E, and it sits in the same family as DALL-E and Point-E. OpenAI already had a model called Point-E, which we have covered on the channel before, so if you want point cloud generation you can refer to that video. Now we have a new model called Shap-E, so the first question that came to my mind was: how is Shap-E different from the existing model? The paper states it very clearly: "We present Shap-E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produces a single output representation, Shap-E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields."

[0:53] If you are not familiar with neural radiance fields, you might know the word NeRF. It has been taking the internet by storm: people are making drone-style shots without a drone. You create a neural radiance field, set the camera path however you want, and make the video look like a drone shot even if you have no drone or robotic camera. So Shap-E can generate both textured meshes, which you can import into tools like Blender to create 3D objects, and NeRFs, neural radiance fields, which is quite interesting. When you compare it with Point-E, which is also from OpenAI, an explicit 3D generative model over point clouds, Shap-E converges faster and reaches comparable or better quality despite modeling a higher-dimensional, multi-representation output space. Which means if you were planning to use Point-E, or you are obsessed with Point-E, this is something you should definitely check out. Thanks to OpenAI for making this open source.

[1:58] Getting into the model itself, you can see some samples: a chair that looks like an avocado, an airplane that looks like a banana, a spaceship, a birthday cupcake, a chair that looks like a tree, a green boot, a penguin, a ube ice cream cone, a bowl of vegetables. These are some of the examples they've given, and all you have to provide is a simple text prompt, which generates the 3D object you see here. Right now it's just a GIF animation, but you can export the 3D object. It's very simple to use, and it's not just text-to-3D; it can also do image-to-3D, but that is not something we are going to cover in today's hands-on video.

[2:41] We're going to jump into the Google Colab notebook that I've prepared for you, thanks to the example notebooks OpenAI has published. It's not completely created by me, but I've turned it into a Colab notebook to make it easier for you to run. All you have to do is open the Colab notebook, which I'll link in the YouTube description below the like button, click "Run all", and you should be able to see the 3D images. I'll point out what you need to change to create slightly different objects.

[3:10] To start with, the first thing we need to do is clone the Shap-E repository, and once we have cloned it, enter it. The Shap-E repository is just this GitHub repository, MIT-licensed by OpenAI, described as "generate 3D objects conditioned on text or images". Once you have entered the Shap-E repository, you install everything, and at that point you are good to go. Now all you have to do is import the required items: import torch, which is quite important, and then import everything needed from the shap_e package, such as sample_latents from the diffusion sampling module, plus the helpers for creating pan cameras and the display widget. All of these are available here.
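As a rough sketch, the import cell looks something like the following (the module paths are taken from the shap-e example notebooks and may differ slightly between versions):

```python
import torch

# Sampling and rendering utilities from the shap-e repository
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images, gif_widget
```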

[3:57] Once that is done, you need to set up the device. If you have a GPU, like me using the free GPU in Google Colab, it will be set to CUDA; if you do not have a GPU, the device will be assigned as CPU. That is what you set in this line. Then you load the required models: the transmitter model and the main 300-million-parameter text model, and then you set up the diffusion.
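A sketch of the device setup and model loading described above (the model names follow the official text-to-3D example notebook, so treat this as an approximation rather than the exact notebook shown in the video):

```python
# Use the free Colab GPU if available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

xm = load_model("transmitter", device=device)           # decodes latents into NeRF/mesh renders
model = load_model("text300M", device=device)            # the 300M-parameter text-conditional model
diffusion = diffusion_from_config(load_config("diffusion"))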

[4:23] Now this is where you give the settings: the batch size, which is how many objects you want, and the guidance scale, which is very important because it controls how strongly the output follows the prompt versus how much creative freedom the model takes. Then the prompt itself; I've just given "a birthday cake". All these things go here, and then the latents get created. You can play around with the other parameters; I'm not going to get into their details, but you can always go to the GitHub repository and read what they do. It took me about one minute for this particular configuration: batch size 4, guidance scale 15, and the default settings from the OpenAI library's GitHub page. About one minute in a Google Colab notebook.
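The sampling cell he describes looks roughly like this; the parameter names and defaults below come from the official text-to-3D example notebook, so treat them as a sketch rather than the exact notebook used in the video:

```python
batch_size = 4          # how many objects to generate
guidance_scale = 15.0   # higher values follow the prompt more closely, at some cost to diversity
prompt = "a birthday cake"

latents = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=guidance_scale,
    model_kwargs=dict(texts=[prompt] * batch_size),
    progress=True,        # show a progress bar while sampling
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)
```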


PART 2 


[5:10] Once you have created the latents, you can render them in two formats: as STF, the mesh representation we discussed, or as NeRF, the neural radiance field. You also need to give the render size. As you increase the size, rendering takes longer, just like any video editing or rendering software where more frames or higher resolution means more rendering time, so change this value depending on the machine you have. I went ahead with 64 and got all the renders in 36 seconds: one, two, three, four renders of the birthday cake.
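A sketch of the rendering step, again following the example notebook; render_mode and size are the two knobs discussed above:

```python
from IPython.display import display

render_mode = "nerf"   # 'nerf' for neural radiance field renders, 'stf' for the textured-mesh renderer
size = 64              # render resolution; larger values take noticeably longer

cameras = create_pan_cameras(size, device)
for latent in latents:
    images = decode_latent_images(xm, latent, cameras, rendering_mode=render_mode)
    display(gif_widget(images))   # shows a rotating GIF of each object in the notebook
```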

[5:52] The next thing: for most people who work in 3D, your life is not going to stay inside a Google Colab notebook, so it is important to export the mesh, or whatever you have created from the latents, into a file format you can import into 3D software like Blender. One of the most popular formats is .ply, commonly used for point clouds. You can export each object as a .ply, and once you do, everything gets exported as object 0, 1, 2, 3. So the four 3D objects you created can be, and have been, exported as point clouds.
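The export step can be sketched as follows, using the mesh-decoding helper from the repository's notebook utilities; this writes one .ply (and optionally .obj) file per generated latent:

```python
from shap_e.util.notebooks import decode_latent_mesh

for i, latent in enumerate(latents):
    mesh = decode_latent_mesh(xm, latent).tri_mesh()
    with open(f"mesh_{i}.ply", "wb") as f:   # point-cloud/mesh format that Blender can import
        mesh.write_ply(f)
    with open(f"mesh_{i}.obj", "w") as f:    # .obj works too, if you prefer it
        mesh.write_obj(f)
```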

6:30

and I'm going to just quickly show you

6:32

how it looks when you import it inside

6:34

blender I have this is a very basic

6:37

input I have not done anything very you

6:39

know I've not even added colors I've not

6:40

did anything all you have to do is go to

6:43

the file go to import and select ply and

6:46

then all you have to do is import the

6:48

object after you have downloaded the

6:49

object from Google collab notebook so if

6:52

you do not know how to download it from

6:53

Google collab notebook it's very simple

6:55

go here right click this or click that

6:58

three dots and then click download that

7:00

will download the ply Point Cloud object

7:02

and go to blender and then from blender

7:05

you have to just import it and then

7:07

there are certain ways how you should

7:09

import it if you want the colors like

7:10

vertex colors but I just want to show

7:13

you that the 3D thing actually works

7:15

like you can see the 3D entire 3D

7:17

objects if you want to if you don't want

7:19

colors if you want to just give your own

7:21

colors you can do it with blender I'm

7:22

not a blender expert but I just wanted

7:24

to show you that this can be ultimately

7:26

exported or imported into blender so

7:28

overall this is an amazing piece of

7:30

technology imagine you have to create 3D

7:32

game assets there are a lot of websites

7:35

where people actually go buy 3D assets

7:37

like potion you know simple characters

7:40

and a lot of things and looks like you

7:42

know this is quite amazing to conclude

7:45

this demo I would like to quickly show

7:47

you something like in real time so that

7:49

we all know how this works I want to

7:51

actually you know let's create a sword

7:54

it's a very simple prompt a sword and I

7:58

I'm going to go with batch size one I

8:00

don't want like three objects I want it

8:01

to be faster so I've gone ahead with the

8:03

same guidance skill batch size one

8:05

guidance skill 15 and the prompt as a

8:06

sword and you can see how much time it

8:09

takes into real time this I'm not going

8:11

to edit this particular piece here and

8:13

you can also see certain uh you know

8:16

parameters here like for example the

8:17

progress equals true will help you

8:18

enable the progress bar so that you know

8:20

how much time it takes after this is

8:23

successfully done the next thing is we

8:24

are going to render it which which would

8:26

ideally take like half of the amount

8:28

that this has taken and you can see the

8:30

latents are created

8:32

um once you have the latents created all

8:34

you have to do is you have to render it

8:36

render it in some kind of format here we

8:38

are rendering it in the Nerf format

8:40

which should take to 30 to 60 seconds

8:42

I'm going to probably edit this part so

8:44

that you don't wait for it oh it's done

8:46

cool it's already done in 10 seconds you

8:49

have the sort so this is how you

8:51

generate a 3D object using artificial

8:54

intelligence just simply using a text

8:56

[8:56] We have not covered image-to-3D, which is also possible using the same Shap-E library. You can probably use my notebook, together with the example notebook they provide, and then do image-to-3D.
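For completeness, here is an image-to-3D sketch based on the repository's image-conditional example notebook; the "image300M" model name, the load_image helper, and the example image path are assumptions drawn from that notebook rather than from the video:

```python
from shap_e.models.download import load_model
from shap_e.util.image_util import load_image

image_model = load_model("image300M", device=device)   # image-conditional 300M model
image = load_image("example_data/corgi.png")            # any RGB image of the object you want in 3D

latents = sample_latents(
    batch_size=1,
    model=image_model,
    diffusion=diffusion,
    guidance_scale=3.0,   # the image-conditional notebook uses a lower guidance scale
    model_kwargs=dict(images=[image]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)
```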

9:11

[9:11] But even if you only want text-to-3D, this is an amazing model, quite an interesting entry into the 3D generation space, which lets you create both 3D meshes and NeRFs from the same latents. An airplane that looks like a banana: you can also explore the kind of creativity you have been using with Stable Diffusion and DALL-E. You don't have to generate only real-world objects; you can also make things that do not exist, like an avocado that looks like a chair, or an airplane that looks like a banana. It's really amazing what we can do with this. I would like to hear from you: what do you think about this 3D model? A lot of text generation models have been coming out, but I didn't want this amazing 3D model to lose attention; it comes from OpenAI as an open-source model, and it looks like they have been living up to their name as an open-source, or at least pro-open-source, company with these kinds of releases. Let me know in the comments what you think. Otherwise, all the links will be in the YouTube description so you can get started immediately right after this video. See you in another video. Happy prompting!


Shap·E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap·E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields.

Shap-E GitHub - https://github.com/openai/shap-e

Colab Text-to-3D - https://colab.research.google.com/dri...

openai/shap-e (GitHub)

Repository contents: samples/, shap_e/, .gitignore, LICENSE, README.md, model-card.md, samples.md, setup.py

Shap-E

This is the official code and model release for Shap-E: Generating Conditional 3D Implicit Functions.

Samples

Here are some highlighted samples from our text-conditional model. For random samples on selected prompts, see samples.md.



A chair that looks like an avocado · An airplane that looks like a banana · A spaceship
A birthday cupcake · A chair that looks like a tree · A green boot
A penguin · Ube ice cream cone · A bowl of vegetables

Usage

Install with `pip install -e .`.

To get started with examples, see the following notebooks:


Shap·E: Generating Conditional 3D Implicit Functions

Heewoo Jun∗ (heewoo@openai.com)        Alex Nichol∗ (alex@openai.com)

Abstract

We present Shap·E, a conditional generative model for 3D assets. Unlike recent work on 3D generative models which produce a single output representation, Shap·E directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. We train Shap·E in two stages: first, we train an encoder that deterministically maps 3D assets into the parameters of an implicit function; second, we train a conditional diffusion model on outputs of the encoder. When trained on a large dataset of paired 3D and text data, our resulting models are capable of generating complex and diverse 3D assets in a matter of seconds. When compared to Point·E, an explicit generative model over point clouds, Shap·E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space. We release model weights, inference code, and samples at https://github.com/openai/shap-e.

1 Introduction

With the recent explosion of generative image models [47, 40, 48, 17, 52, 53, 70, 15], there has been increasing interest in training similar generative models for other modalities such as audio [44, 10, 5, 31, 2, 25], video [24, 57, 23], and 3D assets [26, 28, 45, 41]. Most of these modalities lend themselves to natural, fixed-size tensor representations that can be directly generated, such as grids of pixels for images or arrays of samples for audio. However, it is less clear how to represent 3D assets in a way that is efficient to generate and easy to use in downstream applications.

Recently, implicit neural representations (INRs) have become popular for encoding 3D assets. To represent a 3D asset, INRs typically map 3D coordinates to location-specific information such as density and color. In general, INRs can be thought of as resolution independent, since they can be queried at arbitrary input points rather than encoding information in a fixed grid or sequence. Since they are end-to-end differentiable, INRs also enable various downstream applications such as style transfer [72] and differentiable shape editing [3]. In this work, we focus on two types of INRs for 3D representation:

  A Neural Radiance Field (NeRF) [38] is an INR which represents a 3D scene as a function mapping coordinates and viewing directions to densities and RGB colors. A NeRF can be rendered from arbitrary views by querying densities and colors along camera rays, and trained to match ground-truth renderings of a 3D scene.

  DMTet [56] and its extension GET3D [18] represent a textured 3D mesh as a function mapping coordinates to colors, signed distances, and vertex offsets. This INR can be used to construct 3D triangle meshes in a differentiable manner, and the resulting meshes can be rendered efficiently using differentiable rasterization libraries [32].

∗Equal contribution

Figure 1: Selected text-conditional meshes generated by Shap·E, for prompts such as "a bowl of food", "a penguin", "a voxelized dog", "a campfire", "a chair that looks like an avocado", "a dumpster", "a traffic cone", "a green coffee mug", "an airplane that looks like a banana", "a cruise ship", "a donut with pink icing", "a pumpkin", "a cheeseburger", "an elephant", "a light purple teddy bear", "a soap dispenser", "an astronaut", "a brown boot", "a lit candle", and "a goldfish". Each sample takes roughly 13 seconds to generate on a single NVIDIA V100 GPU, and does not require a separate text-to-image model.

Although INRs are flexible and expressive, the process of acquiring them for each sample in a dataset can be costly. Additionally, each INR may have many numerical parameters, potentially posing challenges when training downstream generative models. Some works approach these issues by using auto-encoders with an implicit decoder to obtain smaller latent representations that can be directly modeled with existing generative techniques [43, 34, 30]. Dupont et al. [12] present an alternative approach, where they use meta-learning to create a dataset of INRs that share most of their parameters, and then train diffusion models [58, 60, 22] or normalizing flows [51, 13] on the free parameters of these INRs. Chen and Wang [6] further suggest that gradient-based meta-learning might not be necessary at all, instead directly training a Transformer [64] encoder to produce NeRF parameters conditioned on multiple views of a 3D object.

We combine and scale up several of the above approaches to arrive at Shap·E, a conditional generative model for diverse and complex 3D implicit representations. First, we scale up the approach of Chen and Wang [6] by training a Transformer-based encoder to produce INR parameters for 3D assets. Next, similar to Dupont et al. [12], we train a diffusion model on outputs from the encoder. Unlike previous approaches, we produce INRs which represent both NeRFs and meshes simultaneously, allowing them to be rendered in multiple ways or imported into downstream 3D applications.

When trained on a dataset of several million 3D assets, our models are capable of producing diverse, recognizable samples conditioned on text prompts (Figure 1). Compared to Point·E [41], a recently proposed explicit 3D generative model, our models converge faster and obtain comparable or superior results while sharing the same model architecture, datasets, and conditioning mechanisms.

Surprisingly, we find that Shap·E and Point·E tend to share success and failure cases when conditioned on images, suggesting that very different choices of output representation can still lead to similar model behavior. However, we also observe some qualitative differences between the two models, especially when directly conditioning on text captions. Like Point·E, the sample quality of our models still falls short of optimization-based approaches for text-conditional 3D generation. However, it is orders of magnitude faster at inference time than these approaches, allowing for a potentially favorable trade-off.

We release our models, inference code, and samples at https://github.com/openai/shap-e.

2 Background

2.1    Neural Radiance Fields (NeRF)

Mildenhall et al. [38] introduce NeRF, a method for representing a 3D scene as an implicit function defined as

$F_\Theta : (x, d) \mapsto (c, \sigma)$

where x is a 3D spatial coordinate, d is a 3D viewing direction, c is an RGB color, and σ is a non-negative density value. For convenience, we split FΘ into separate functions, σ(x) and c(x,d).

To render a novel view of a scene, we treat the viewport as a grid of rays and render each ray by querying FΘ at points along the ray. More precisely, each pixel of the viewport is assigned a ray r(t) = o + td which extends from the camera origin o along a direction d. The ray can then be rendered to an RGB color by approximating the integral

$$\hat{C}(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \quad \text{where} \quad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$$

Mildenhall et al. [38] use quadrature to approximate this integral. In particular, they define a sequence of increasing values $t_i, i \in [1, N]$ and corresponding $\delta_i = t_{i+1} - t_i$. The integral is then approximated via a discrete sum

$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma(r(t_i))\,\delta_i)\right) c(r(t_i), d), \quad \text{where} \quad T_i = \exp\!\left(-\sum_{j<i} \sigma(r(t_j))\,\delta_j\right)$$

One remaining question is how to select the sequence of t0,...tN to achieve an accurate estimate.

This can be especially important for thin features, where a coarse sampling of points along the ray may completely miss a detail of the object. To address this problem, Mildenhall et al. [38] suggest a two-stage rendering procedure. In the first stage, timesteps ti are sampled along uniform intervals of a ray, giving a coarse estimate of the predicted color Cˆc. In computing this integral, they also compute weights proportional to the influence of each point along the ray:

$$w_i \propto T_i \left(1 - \exp(-\sigma(r(t_i))\,\delta_i)\right)$$

To sample timesteps for the fine rendering stage, Mildenhall et al. [38] use wi to define a piecewise-constant PDF along the ray. This allows a new set of ti to be sampled around points of high density in the scene. While Mildenhall et al. [38] use two separate NeRF models for the coarse and fine rendering stages, we instead share the parameters between the two stages but use separate output heads for the coarse and fine densities and colors.

For notational convenience in later sections, we additionally define the transmittance of a ray. Intuitively, this is the complement of the opacity or alpha value of a ray:

$$\hat{T}(r) = \exp\!\left(-\sum_{i=1}^{N} \sigma(r(t_i))\,\delta_i\right)$$
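To make the quadrature concrete, here is a small NumPy sketch (ours, not from the paper) that evaluates the discrete rendering sum, the per-point weights, and the ray transmittance defined above:

```python
import numpy as np

def render_ray(sigma, color, t):
    """Quadrature estimate of the NeRF rendering sum along one ray.

    sigma: (N,) non-negative densities at the sample points
    color: (N, 3) RGB colors at the sample points
    t:     (N + 1,) increasing sample positions, so delta_i = t[i+1] - t[i]
    Returns the estimated ray color C_hat(r) and the ray transmittance T_hat(r).
    """
    delta = np.diff(t)                                    # delta_i
    accum = np.concatenate([[0.0], np.cumsum(sigma * delta)])
    T = np.exp(-accum[:-1])                               # T_i = exp(-sum_{j<i} sigma_j * delta_j)
    weights = T * (1.0 - np.exp(-sigma * delta))          # w_i, also used for fine-stage sampling
    c_hat = (weights[:, None] * color).sum(axis=0)        # C_hat(r)
    t_hat = np.exp(-accum[-1])                            # transmittance of the whole ray
    return c_hat, t_hat
```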

2.2     Signed Distance Functions and Texture Fields (STF)

Throughout this paper, we use the abbreviation STF to refer to an implicit function which produces both signed distances and texture colors. This section gives some background on how these implicit functions can be used to construct meshes and produce renderings.

Signed distance functions (SDFs) are a classic way to represent a 3D shape as a scalar field. In particular, an SDF f maps a coordinate x to a scalar f(x) = d, such that |d| is the distance of x to the nearest point on the surface of the shape, and d < 0 if the point is outside of the shape. As a result of this definition, the level set f(x) = 0 defines the boundary of the shape, and sign(d) determines normal orientation along the boundary. Methods such as marching cubes [35] or marching tetrahedra [11] can be used to construct meshes from this level set.

Shen et al. [56] present DMTet, a generative model over 3D shapes that leverages SDFs. DMTet produces SDF values si and displacements vi for each vertex vi in a dense spatial grid. The SDF values are fed through a differentiable marching tetrahedra implementation to produce an initial mesh, and then the resulting vertices are offset using the additional vector vi. They also employ a subdivision procedure to efficiently obtain more detailed meshes, but we do not consider this in our work for the sake of simplicity.

Gao et al. [18] propose GET3D, which augments DMTet with additional texture information. In particular, they train a separate model to predict RGB colors c for each surface point p. This implicit model can be queried at surface points during rendering, or offline to construct an explicit texture. GET3D uses a differentiable rasterization library [32] to produce rendered images for generated meshes. This provides an avenue to train the implicit function end-to-end with only image-space gradients.

2.3  Diffusion Models

Our work leverages denoising diffusion [58, 60, 22] to model a high-dimensional continuous distribution. We employ the Gaussian diffusion setup of Ho et al. [22], which defines a diffusion process that begins at a data sample x0 and gradually applies Gaussian noise to arrive at increasingly noisy samples x1,x2,...,xT. Typically, the noising process is set up such that xT is almost indistinguishable from Gaussian noise. In practice, we never run the noising process sequentially, but instead “jump” directly to a noised version of a sample according to

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$ is random noise, and $\bar{\alpha}_t$ is a monotonically decreasing noise schedule such that $\bar{\alpha}_0 = 1$. Ho et al. [22] train a model $\epsilon_\theta(x_t, t)$ on a data distribution $q(x_0)$ by minimizing the objective

$$L_{\text{simple}} = E_{x_0, \epsilon, t} \left[\, \|\epsilon_\theta(x_t, t) - \epsilon\|^2 \,\right]$$

However, Ho et al. [22] also note an alternative but equivalent parameterization of the diffusion model, which we use in our work. In particular, we parameterize our model as $x_\theta(x_t, t)$ and train it to directly predict the denoised sample $x_0$ by minimizing

$$L_{x_0} = E_{x_0, \epsilon, t} \left[\, \|x_\theta(x_t, t) - x_0\|^2 \,\right]$$

To sample from a diffusion model, one starts at a random noise sample xT and gradually denoises it into samples xT−1,...,x0 to obtain a sample x0 from the approximated data distribution. While early work on these models focused on stochastic sampling processes [58, 60, 22], other works propose alternative sampling methods which often draw on the relationship between diffusion models and ordinary differential equations [59, 61]. In our work, we employ the Heun sampler proposed by Karras et al. [27], as we found it to produce high-quality samples with reasonable latency.

For conditional diffusion models, it is possible to improve sample quality at the cost of diversity using a guidance technique. Dhariwal and Nichol [9] first showed this effect using image-space gradients from a noise-aware classifier, and Ho and Salimans [21] later proposed classifier-free guidance to remove the need for a separate classifier. To utilize classifier-free guidance, we train our diffusion model to condition on some information y (e.g. a conditioning image or textual description), but randomly drop this signal during training to enable the model to make unconditional predictions. During sampling, we then adjust our model prediction as follows:

xˆθ(xt,t|y) = xθ(xt,t) + s · (xθ(xt,t|y) − xθ(xt,t))

where s is a guidance scale. When s = 0 or s = 1, this is equivalent to regular unconditional or conditional sampling, respectively. Setting s > 1 typically produces more coherent but less diverse samples. We employ this technique for all of our models, finding (as expected) that guidance is necessary to obtain the best results.
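A minimal sketch of that guidance rule in code (ours; it assumes a model that returns x0 predictions and accepts an optional conditioning input):

```python
def guided_x0(model, x_t, t, cond, guidance_scale):
    """Classifier-free guidance: blend conditional and unconditional x0 predictions.

    guidance_scale = 0 gives unconditional sampling, 1 gives plain conditional sampling,
    and values > 1 give more coherent but less diverse samples, as described above.
    """
    x0_uncond = model(x_t, t, cond=None)   # conditioning dropped, as during training
    x0_cond = model(x_t, t, cond=cond)     # conditioned prediction
    return x0_uncond + guidance_scale * (x0_cond - x0_uncond)
```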

2.4  Latent Diffusion

While diffusion can be applied to any distribution of vectors, it is often applied directly to signals such as pixels of images. However, it is also possible to use diffusion to generate samples in a continuous latent space.

Rombach et al. [52] propose Latent Diffusion Models (LDMs) as a two-stage generation technique for images. Under the LDM framework, they first train an encoder to produce latents z = E(x) and a decoder to produce reconstructions x˜ = D(z). The encoder and decoder are trained in tandem to minimize a perceptual loss between x˜ and x, as well as a patchwise discriminator loss on x˜. After these models are trained, a second diffusion model is trained directly on encoded dataset samples. In particular, each dataset example xi is encoded into a latent zi, and then zi is used as a training example for the diffusion model. To generate new samples, the diffusion model first generates a latent sample z, and then D(z) yields an image. In the original LDM setup, the latents z are lower-dimensional than the original images, and Rombach et al. [52] propose to either regularize z towards a normal distribution using a KL penalty, or to apply a vector quantization layer [63] to prevent z from being difficult to model.

Our work leverages the above approach, but makes several simplifications. First, we do not use a perceptual loss or a GAN-based objective for our reconstructions, but rather a simple L1 or L2 reconstruction loss. Additionally, instead of using KL regularization or vector quantization to bottleneck our latents, we clamp them to a fixed numerical range and add diffusion-style noise.
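A rough sketch of that bottleneck (ours; the clamp bound and noise level are illustrative placeholders, and the added Gaussian noise is a simplified stand-in for the diffusion-style noising the paper describes):

```python
import torch

def bottleneck_latents(z, bound=1.0, noise_level=0.1):
    """Clamp encoder latents to a fixed numerical range and add Gaussian noise,
    instead of using KL regularization or vector quantization."""
    z = z.clamp(-bound, bound)
    return z + noise_level * torch.randn_like(z)
```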

3 Related Work

An existing body of work aims to generate 3D models by training auto-encoders on explicit 3D representations and then training generative models in the resulting latent space. Achlioptas et al. [1] train an auto-encoder on point clouds, and experiment with both GANs [19] and GMMs [8] to model the resulting latent space. Yang et al. [69] likewise train a point cloud auto-encoder, but their decoder is itself a conditional generative model (i.e. a normalizing flow [51]) over individual points in the point cloud; they also employ normalizing flows to model the latent space. Luo and Hu [36] explore a similar technique, but use a diffusion model for the decoder instead of a normalizing flow. Zeng et al. [71] train a hierarchical auto-encoder, where the second stage encodes a point cloud of latent vectors instead of a single latent code; they employ diffusion models at both stages of the hierarchy. Sanghi et al. [55] train a two-stage vector quantized auto-encoder [63, 50] on voxel occupancy grids, and model the resulting latent sequences autoregressively. Unlike our work, these approaches all rely on explicit output representations which are often bound to a fixed resolution or lack the ability to fully express a 3D asset.

More similar to our own method, some prior works have explored 3D auto-encoders with implicit decoders. Fu et al. [16] encode grids of SDF samples into latents which are used to condition an implicit SDF model. Sanghi et al. [54] encode voxel grids into latents which are used to condition an implicit occupancy network. Liu et al. [34] train a voxel-based encoder and separate implicit occupancy and color decoders. Kosiorek et al. [30] encode rendered views of a scene into latent vectors of a VAE, and this latent vector is used to condition a NeRF. Most similar to our encoder setup, Chen and Wang [6] use a transformer-based architecture to directly produce the parameters of an MLP conditioned on rendered views. We extend this prior body of work with Shap·E, which produces more expressive implicit representations and is trained at a larger scale than most prior work.

While the above methods all train both encoders and decoders, other works aim to produce latent-conditional implicit 3D representations without a learned encoder. Park et al. [43] train what they call an “auto-decoder”, which uses a learned table of embedding vectors for each example in the dataset. In their case, they train an implicit SDF decoder that conditions on these per-sample latent vectors. Bautista et al. [4] use a similar strategy to learn per-scene latent codes to condition a NeRF decoder. Dupont et al. [12] employ meta-learning to encode dataset examples as implicit functions. In their setup, they “encode” an example into (a subset of) the parameters of an implicit function by taking gradient steps on a reconstruction objective. Concurrently to our work, Erkoç et al. [14] utilize diffusion to directly generate the implicit MLP weights; however, akin to [12], their method requires fitting NeRF parameters for each scene through gradient-based optimization. Wang et al. [66] pursue a related approach, jointly training separate NeRFs for every sample in a dataset, but share a subset of the parameters to ensure that all resulting models use an aligned representation space. These approaches have the advantage that they do not require an explicit input representation. However, they can be expensive to scale with increasing dataset size, as each new sample requires multiple gradient steps. Moreover, this scalability issue is likely more pronounced for methods that do not incorporate meta-learning.

Several methods for 3D generation use gradient-based optimization to produce individual samples, often in the form of an implicit function. DreamFields [26] optimizes the parameters of a NeRF to match a text prompt according to a CLIP-based [46] objective. DreamFusion [45] is a similar method with a different objective based on the output of a text-conditional image diffusion model. Lin et al. [33] extend DreamFusion by optimizing a mesh representation in a second stage, leveraging the fact that meshes can be rendered more efficiently at higher resolution. Wang et al. [65] propose a different approach for leveraging text-to-image diffusion models, using them to optimize a differentiable 3D voxel grid rather than an MLP-based NeRF. While most of these approaches optimize implicit functions, Khalid et al. [28] optimize the numerical parameters of a mesh itself, starting from a spherical mesh and gradually deforming it to match a text prompt. One common shortcoming of all of these approaches is that they require expensive optimization procedures, and a lot of work must be repeated for every sample that is generated. This is in contrast to direct generative models, which can potentially amortize this work by pre-training on a large dataset.

4 Method

In our method, we first train an encoder to produce implicit representations, and then train diffusion models on the latent representations produced by the encoder. Our method proceeds in two steps:

1. We train an encoder to produce the parameters of an implicit function given a dense explicit representation of a known 3D asset (Section 4.2). In particular, the encoder produces a latent representation of a 3D asset which is then linearly projected to obtain weights of a multi-layer perceptron (MLP).

2. We train a diffusion prior on a dataset of latents obtained by applying the encoder to our dataset (Section 4.3). This model is conditioned on either images or text descriptions.

We train all of our models on a large dataset of 3D assets with corresponding renderings, point clouds, and text captions (Section 4.1).

4.1 Dataset

For most of our experiments, we employ the same dataset of underlying 3D assets as Nichol et al. [41], allowing for fairer comparisons with their method. However, we slightly extend the original post-processing as follows:

  For computing point clouds, we render 60 views of each object instead of 20. We found that using only 20 views sometimes resulted in small cracks (due to blind spots) in the inferred point clouds.

  We produce point clouds of 16K points instead of 4K.

  When rendering views for training our encoder, we simplify the lighting and materials. In particular, all models are rendered with a fixed lighting configuration that only supports diffuse and ambient shading. This makes it easier to match the lighting setup with a differentiable renderer.

For our text-conditional model and the corresponding Point·E baseline, we employ an expanded dataset of underlying 3D assets and text captions. For this dataset, we collected roughly 1 million more 3D assets from high-quality data sources. Additionally, we gathered 120K captions from human labelers for high-quality subsets of our dataset. During training of our text-to-3D models, we randomly choose between human-provided labels and the original text captions when both are available.

4.2  3D Encoder

Our encoder architecture is visualized in Figure 2. We feed the encoder both point clouds and rendered views of a 3D asset, and it outputs the parameters of a multi-layer perceptron (MLP) that represents the asset as an implicit function. Both the point cloud and input views are processed via cross-attention, which is followed by a transformer backbone that produces latent representations as a sequence of vectors. Each vector in this sequence is then passed through a latent bottleneck and projection layer whose output is treated as a single row of the resulting MLP weight matrices. During training, the MLP is queried and the outputs are used in either an image reconstruction loss or a distillation loss. For more details, see Appendix A.1.

We pre-train our encoder using only a NeRF rendering objective (Section 4.2.1), as we found this to be more stable to optimize than mesh-based objectives. After NeRF pre-training, we add additional output heads for SDF and texture color predictions, and train these heads using a two-stage process (Section 4.2.2). We show reconstructions of 3D assets for various checkpoints of our encoder with both rendering methods in Figure 3.

4.2.1   Decoding with NeRF Rendering

We mostly follow the original NeRF formulation [38], except that we share the parameters between the coarse and fine models.[1] We randomly sample 4096 rays for each training example, and minimize an L1 loss[2] between the true color C(r) and the predicted color from the NeRF:

$$L_{\text{RGB}} = E_{r}\left[\, \|\hat{C}_c(r) - C(r)\|_1 + \|\hat{C}_f(r) - C(r)\|_1 \,\right]$$

We also add an additional loss on the transmittance of each ray. In particular, the integrated density of a ray gives transmittance estimates T̂c(r) and T̂f(r) for coarse and fine rendering, respectively. We use the alpha channel from the ground-truth renderings to obtain transmittance targets T(r), giving a second loss

$$L_T = E_{r}\left[\, \|\hat{T}_c(r) - T(r)\|_1 + \|\hat{T}_f(r) - T(r)\|_1 \,\right]$$

We then optimize the joint objective:

$$L_{\text{NeRF}} = L_{\text{RGB}} + L_T$$

Figure 2: An overview of our encoder architecture. The encoder ingests both 16K-point RGB point clouds and rendered RGBA images with augmented spatial coordinates for each foreground pixel. It outputs parameters of an MLP, which then acts as both a NeRF and a signed texture field (STF).

4.2.2   Decoding with STF Rendering

After NeRF-only pre-training, we add additional STF output heads to our MLPs which predict SDF values and texture colors. To construct a triangle mesh, we query the SDF at vertices along a regular 128³ grid and apply a differentiable implementation of Marching Cubes 33 [62]. We then query the texture color head at each vertex of the resulting mesh. We differentiably render the resulting textured mesh using PyTorch3D [49]. We always render with the same (diffuse) lighting configuration, which is identical to the lighting configuration used to preprocess our dataset.
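As a non-differentiable sketch of that mesh-extraction step (using scikit-image's marching cubes in place of the differentiable Marching Cubes 33 implementation the paper uses; `query_sdf` is a hypothetical stand-in for the MLP's SDF head):

```python
import numpy as np
from skimage import measure

def extract_mesh(query_sdf, resolution=128, bound=1.0):
    """Evaluate an SDF on a regular grid and extract its zero level set as a triangle mesh."""
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)   # (R, R, R, 3)
    sdf = query_sdf(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)
    # Map vertex indices back to world coordinates in [-bound, bound].
    verts = verts / (resolution - 1) * (2 * bound) - bound
    return verts, faces
```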

In preliminary experiments, we found that randomly-initialized STF output heads were unstable and difficult to train with a rendering-based objective. To alleviate this issue, we first distill approximations of the SDF and texture color into these output heads before directly training with differentiable rendering. In particular, we randomly sample input coordinates and obtain SDF distillation targets using the Point·E SDF regression model, and RGB targets using the color of the nearest neighbor in the asset’s RGB point cloud. During distillation training, we use a sum of distillation losses and the pre-training NeRF loss:

$$L_{\text{distill}} = L_{\text{NeRF}} + E_x\left[\, \|\text{SDF}_\theta(x) - \text{SDF}_{\text{regression}}(x)\|_1 + \|\text{RGB}_\theta(x) - \text{RGB}_{\text{NN}}(x)\|_1 \,\right]$$

[Figure 3 panel labels: Ground-truth; Pre-trained NeRF; Pre-trained STF (untrained); Distilled NeRF; Distilled STF; Fine-tuned NeRF; Fine-tuned STF]

Figure 3: 3D asset reconstructions from different rendering modes and checkpoints. Surprisingly, we find that randomly initialized STF heads still produce some elements of the original shape, likely because the previous layer activations are used for NeRF outputs. While distillation improves STF rendering results, it produces rough looking objects. Fine-tuning on both rendering methods yields the best reconstructions.

Once the STF output heads have been initialized to reasonable values via distillation, we fine-tune the encoder for both NeRF and STF rendering end-to-end. We found it unstable to use L1 loss for STF rendering, so we instead use L2 loss only for this rendering method. In particular, we optimize the following loss for STF rendering:

$$L_{\text{STF}} = \frac{1}{N \cdot s^2} \sum_{i=1}^{N} \left\| \text{Render}(\text{Mesh}_i) - \text{Image}_i \right\|_2^2$$

where N is the number of images, s is the image resolution, Meshi is the constructed mesh for sample i, Imagei is a target RGBA rendering for image i, and Render(x) renders a mesh using a differentiable renderer. We do not include a separate transmittance loss, since this is already captured by the alpha channel of the image.

For this final fine-tuning step, we optimize the summed objective:

LFT = LNeRF + LSTF

4.3  Latent Diffusion

For our generative models, we adopt the transformer-based diffusion architecture of Point·E, but replace point clouds with sequences of latent vectors. Our latents are sequences of shape 1024×1024, and we feed this into the transformer as a sequence of 1024 tokens where each token corresponds to a different row of the MLP weight matrices. As a result, our models are roughly compute equivalent to the base Point·E models (i.e. have the same context length and width) while generating samples in a much higher-dimensional space due to the increase of input and output channels.

We follow the same conditioning strategies as Point·E. For image-conditional generation, we prepend a 256-token CLIP embedding sequence to the Transformer context. For text-conditional generation, we prepend a single token containing the CLIP text embedding. To support classifier-free guidance, we randomly set the conditioning information to zero during training with probability 0.1.
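A schematic sketch (ours) of how the conditioning tokens are prepended to the 1024-token latent sequence for the two conditioning modes; the tensor names are illustrative only:

```python
import torch

def build_transformer_input(latent_tokens, clip_image_seq=None, clip_text_embed=None):
    """latent_tokens: (batch, 1024, width) noisy latent rows fed to the diffusion transformer.
    clip_image_seq: (batch, 256, width) CLIP embedding sequence for image conditioning, or None.
    clip_text_embed: (batch, width) single CLIP text embedding for text conditioning, or None."""
    if clip_image_seq is not None:
        return torch.cat([clip_image_seq, latent_tokens], dim=1)                # 256 + 1024 tokens
    if clip_text_embed is not None:
        return torch.cat([clip_text_embed[:, None, :], latent_tokens], dim=1)   # 1 + 1024 tokens
    return latent_tokens  # unconditional (conditioning is dropped with probability 0.1 in training)
```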

Unlike Point·E, we do not parameterize our diffusion model outputs as ε predictions. Instead, we directly predict x0, which is algebraically equivalent to predicting ε, but produced more coherent samples in early experiments. The same observation was made by Ramesh et al. [48], who opted to use x0 prediction when generating CLIP latent vectors with diffusion models.

5 Results

5.1  Encoder Evaluation

We track two render-based metrics throughout the encoder training process. First, we evaluate the peak signal-to-noise ratio (PSNR) between reconstructions and ground-truth rendered images. Additionally, to measure our encoder's ability to capture semantically relevant details of 3D assets, we encode meshes produced by the largest Point·E model and re-evaluate the CLIP R-Precision of the reconstructed NeRF and STF renders. Table 1 tracks these two metrics over the different stages of training. We find that distillation hurts NeRF reconstruction quality, but fine-tuning recovers (and slightly boosts) NeRF quality while drastically increasing the quality of STF renders.

Table 1: Evaluating the encoder after each stage of training. We evaluate PSNR between reconstructions and ground-truth renders, as well as CLIP R-Precision on reconstructions of samples from Point·E 1B (where the peak performance is roughly 46.8%).

| Stage | NeRF PSNR (dB) | STF PSNR (dB) | NeRF Point·E CLIP R-Precision | STF Point·E CLIP R-Precision |
|---|---|---|---|---|
| Pre-training (300K) | 33.2 | - | 44.3% | - |
| Pre-training (600K) | 34.5 | - | 45.2% | - |
| Distillation | 32.9 | 23.9 | 42.6% | 41.1% |
| Fine-tuning | 35.4 | 31.3 | 45.3% | 44.0% |

Figure 4: Evaluations throughout training for both Shap·E and Point·E. For each checkpoint for both models, we take the maximum value when sweeping over guidance scales {2.0, 3.0, 4.0, 5.0, 8.0, 10.0, 15.0}.

5.2   Comparison to Point·E

Our latent diffusion model shares the same architecture, training dataset, and conditioning modes as Point·E.[3] As a result, comparing to Point·E helps us isolate the effects of generating implicit neural representations rather than an explicit representation. We compare these methods throughout training on sample-based evaluation metrics in Figure 4. As done by Jain et al. [26] and various follow-up literature, we compute CLIP R-Precision [42] on a set of COCO validation prompts. We also evaluate CLIP score on these same prompts, as this metric is often used for measuring image generation quality [40]. We only train comparable 300 million parameter models, but we also plot evaluations for the largest (1 billion parameter) Point·E model for completeness.

In the text-conditional setting, we observe that Shap·E improves on both metrics over the comparable Point·E model. To rule out the possibility that this gap is due to perceptually small differences, we also show qualitative samples in Figure 5, finding that these models often produce qualitatively different samples for the same text prompts. We also observe that our text-conditional Shap·E begins to get worse on evaluations before the end of training. In Appendix B, we argue that this is likely due to overfitting to the text captions, and we use an early-stopped checkpoint for all figures and tables.

Figure 5: Examples of text prompts for which text-conditional Point·E and Shap·E consistently exhibit qualitatively different behavior: “a diamond ring”, “a traffic cone”, “a donut with pink icing”, “a corgi”, “a designer dress”, “a pair of shorts”, “a hypercube”. For each prompt, we show four random samples from both models, which were trained on the same dataset with the same base model size.

Unlike the text-conditional case, our image-conditional Shap·E and Point·E models reach roughly the same final evaluation performance, with a slight advantage for Shap·E in CLIP R-Precision and a slight disadvantage in CLIP score. To investigate this phenomenon more deeply, we inspected samples from both models. We initially expected to see qualitatively different behavior from the two models, since they produce samples in different representation spaces. However, we discovered that both models tend to share similar failure cases, as shown in Figure 6a. This suggests that the training data, model architecture, and conditioning images affect the resulting samples more than the chosen representation space.

However, we do still observe some qualitative differences between the two image-conditional models. For example, in the first row of Figure 6b, we find that Point·E sometimes ignores the small slits in the bench, whereas Shap·E attempts to model them. We hypothesize that this particular difference could occur because point clouds are a poor representation for thin features or gaps. Also, we observe in Table 1 that the 3D encoder slightly reduces CLIP R-Precision when applied to Point·E samples. Since Shap·E achieves comparable CLIP R-Precision as Point·E, we hypothesize that Shap·E must generate qualitatively different samples for some prompts which are not bottlenecked by the encoder. This further suggests that explicit and implicit modeling can still learn distinct features from the same data and model architecture.

5.3  Comparison to Other Methods

We compare Shap·E to a broader class of 3D generative techniques on the CLIP R-Precision metric in Table 2. As done by Nichol et al. [41], we include sampling latency in this table to highlight that the superior sample quality of optimization-based methods comes at a significant inference cost. We also note that Shap·E enjoys faster inference than Point·E because Shap·E does not require an additional upsampling diffusion model.

Figure 6: Randomly selected image-conditional samples from both Point·E and Shap·E for the same conditioning images. (a) Shared failure cases between image-conditional Shap·E and Point·E: in the first example, both models counter-intuitively infer an occluded handle on the mug; in the second, both models incorrectly interpret the proportions of the depicted animal. (b) Conditioning images for which both Shap·E and Point·E succeed.

6   Limitations and Future Work

While our text-conditional model can understand many single object prompts with simple attributes, it has a limited ability to compose concepts. In Figure 7, we find that this model struggles to bind multiple attributes to different objects, and fails to reliably produce the correct number of objects when asked for more than two. These failures are likely the result of limited paired training data, and could potentially be alleviated by gathering or generating larger annotated 3D datasets.

Table 2: Comparison of 3D generation techniques on the CLIP R-Precision metric on COCO evaluation prompts. Compute estimates and other methods' values are taken from Nichol et al. [41]. The best text-conditional results are obtained using our expanded dataset of 3D assets.

| Method | ViT-B/32 | ViT-L/14 | Latency |
|---|---|---|---|
| DreamFields | 78.6% | 82.9% | ~200 V100-hr |
| CLIP-Mesh | 67.8% | 74.5% | ~17 V100-min |
| DreamFusion | 75.1% | 79.7% | ~12 V100-hr |
| Point·E (300M, text-only) | 33.6% | 35.5% | 25 V100-sec |
| Shap·E (300M, text-only) | 37.8% | 40.9% | 13 V100-sec |
| Point·E (300M) | 40.3% | 45.6% | 1.2 V100-min |
| Point·E (1B) | 41.1% | 46.8% | 1.5 V100-min |
| Shap·E (300M) | 41.1% | 46.4% | 1.0 V100-min |
| Conditioning images | 69.6% | 86.6% | - |

Figure 7: Examples of text-conditional Shap·E samples for prompts which require counting and attribute binding: “a stool with a green seat and red legs”, “a red cube on top of a blue cube”, “two cupcakes”, “three cupcakes”, “four cupcakes”.

Additionally, while Shap·E can often produce recognizable 3D assets, the resulting samples often look rough or lack fine details. Notably, Figure 3 shows that the encoder itself sometimes loses detailed textures (e.g. the stripes on the cactus), indicating that improved encoders could potentially recover some of the lost generation quality.

For the best results, Shap·E could potentially be combined with optimization-based 3D generative techniques. For example, a NeRF or mesh produced by Shap·E could be used to initialize an optimization-based approach such as DreamFusion, potentially leading to faster convergence. Alternatively, image-based objectives could be used to guide the Shap·E sampling process, as we briefly explore in Appendix D.

7 Conclusion

We present Shap·E, a latent diffusion model over a space of 3D implicit functions that can be rendered as both NeRFs and textured meshes. We find that Shap·E matches or outperforms a similar explicit generative model given the same dataset, model architecture, and training compute. We also find that our pure text-conditional models can generate diverse, interesting objects without relying on images as an intermediate representation. These results highlight the potential of generating implicit representations, especially in domains like 3D where they can offer more flexibility than explicit representations.

8 Acknowledgments

Our thanks go to Prafulla Dhariwal, Joyce Lee, Jack Rae, and Mark Chen for helpful discussions, and to all contributors of ChatGPT, which provided valuable writing feedback.

References

[1]   Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. arXiv:1707.02392, 2017.

[2]   Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank. Musiclm: Generating music from text, 2023. URL https://arxiv.org/ abs/2301.11325.

[3]   Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui. Sine: Semantic-driven image-based nerf editing with prior-guided editing field. arXiv:2303.13277, 2023.

[4]   Miguel Angel Bautista, Pengsheng Guo, Samira Abnar, Walter Talbott, Alexander Toshev, Zhuoyuan Chen, Laurent Dinh, Shuangfei Zhai, Hanlin Goh, Daniel Ulbricht, Afshin Dehghan, and Josh Susskind. Gaudi: A neural architect for immersive 3d scene generation. arXiv:2207.13751, 2022.

[5]   Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. Audiolm: a language modeling approach to audio generation, 2022. URL https://arxiv.org/abs/ 2209.03143.

[6]   Yinbo Chen and Xiaolong Wang. Transformers as meta-learners for implicit neural representations, 2022. URL https://arxiv.org/abs/2208.02801.

[7]   Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models, 2022.

[8]   A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1): 1–38, 1977. ISSN 00359246. URL http://www.jstor.org/stable/2984875.

[9]   Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv:2105.05233, 2021.

[10] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv:2005.00341, 2020.

[11] Akio Doi and Akio Koide. An efficient method of triangulating equi-valued surfaces by using tetrahedral cells. IEICE Transactions on Information and Systems, 74:214–224, 1991.

[12] Emilien Dupont, Hyunjik Kim, S. M. Ali Eslami, Danilo Rezende, and Dan Rosenbaum. From data to functa: Your data point is a function and you can treat it like one. arXiv:2201.12204, 2022.

[13] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. arXiv:1906.04032, 2019.

[14] Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion, 2023.

[15] Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. arXiv:2210.15257, 2022.

[16] Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. Shapecrafter: A recursive text-conditioned 3d shape generation model. arXiv:2207.09446, 2022.

[17] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv:2203.13131, 2022.

[18] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv:2209.11163, 2022.

[19] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661, 2014.

[20] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv:1606.08415, 2016.

[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL https://openreview.net/forum?id=qw8AKxfYbI.

[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, 2020.

[23] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022.

[24] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv:2204.03458, 2022.

[25] Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, and Wei Han. Noise2music: Text-conditioned music generation with diffusion models, 2023. URL https://arxiv.org/abs/2302.03917.

[26] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. arXiv:2112.01455, 2021.

[27] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv:2206.00364, 2022.

[28] Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. Clip-mesh: Generating textured meshes from text using pretrained image-text models. arXiv:2203.13333, 2022.

[29] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

[30] Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Soňa Mokrá, and Danilo J Rezende. NeRF-VAE: A geometry aware 3D scene generative model. arXiv:2104.00587, April 2021.

[31] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. Audiogen: Textually guided audio generation, 2022. URL https://arxiv.org/abs/2209.15352.

[32] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. arXiv:2011.03277, 2020.

[33] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. arXiv:2211.10440, 2022.

[34] Zhengzhe Liu, Yi Wang, Xiaojuan Qi, and Chi-Wing Fu. Towards implicit text-guided 3d shape generation. arXiv:2203.14622, 2022.

[35] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. In Maureen C. Stone, editor, SIGGRAPH, pages 163–169. ACM, 1987. ISBN 0-89791-227-6. URL http://dblp.uni-trier.de/db/conf/siggraph/siggraph1987.html#LorensenC87.

[36] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. arXiv:2103.01458, 2021.

[37] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv:1710.03740, 2017.

[38] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv:2003.08934, 2020.

[39] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv:2102.09672, 2021.

[40] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021.

[41] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv:2212.08751, 2022.

[42] Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, and Anna Rohrbach. Benchmark for compositional text-to-image synthesis. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=bKBhQhPeKaF.

[43] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. arXiv:1901.05103, 2019.

[44] Christine Payne. Musenet. OpenAI blog, 2019. URL https://openai.com/blog/musenet.

[45] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv:2209.14988, 2022.

[46] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021.

[47] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021.

[48] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022.

[49] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.

[50] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446, 2019.

[51] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv:1505.05770, 2015.

[52] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, 2021.

[53] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022.

[54] Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. Clip-forge: Towards zero-shot text-to-shape generation. arXiv:2110.02624, 2021.

[55] Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, Amir Hosein Khasahmadi, Srinath Sridhar, and Daniel Ritchie. Textcraft: Zero-shot generation of high-fidelity and diverse shapes from text. arXiv:2211.01427, 2022.

[56] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis. arXiv:2111.04276, 2021.

[57] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. arXiv:2209.14792, 2022.

[58] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015.

[59] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020.

[60] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv:arXiv:1907.05600, 2020.

[61] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020.

[62] Evgueni Tcherniaev. Marching cubes 33: Construction of topologically correct isosurfaces. 01 1996.

[63] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937, 2017.

[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.

[65] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. arXiv:2212.00774, 2022.

[66] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, and Baining Guo. Rodin: A generative model for sculpting 3d digital avatars using diffusion, 2022. URL https://arxiv.org/abs/2212.06135.

[67] Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. arXiv:2210.04628, 2022.

[68] Mika Westerlund. The emergence of deepfake technology: A review. Technology Innovation Management Review, 9:40–53, November 2019. ISSN 1927-0321. doi: http://doi.org/10.22215/timreview/1282. URL timreview.ca/article/1282.

[69] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows. arXiv:1906.12320, 2019.

[70] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022.

[71] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv:2210.06978, 2022.

[72] Kai Zhang, Nick Kolkin, Sai Bi, Fujun Luan, Zexiang Xu, Eli Shechtman, and Noah Snavely. Arf: Artistic radiance fields. arXiv:2206.06360, 2022.

Algorithm 1 High-level pseudocode of our encoder architecture.

Inputs: point cloud p, multiview point cloud m, learned input embedding sequence h_l. Outputs: latent variable h and MLP parameters θ.

1: h ← Cat([PointConv(p), h_l])
2: h ← CrossAttend(h, Proj(p))
3: h ← CrossAttend(h, PatchEmb(m))
4: h ← Transformer(h)
5: h ← h[−len(h_l):]
6: h ← tanh(h)
7: h_0 ← DiffusionNoise(h)
8: θ ← Proj(h_0)
9: return h, θ

A   Hyperparameters

A.1  Encoder Architecture

To capture details of the input 3D asset, we feed our encoder two separate representations of a 3D model:

• Point clouds: For each 3D asset, we pre-compute an RGB point cloud with 16,384 points.

• Multiview point clouds: In addition to a point cloud, we render 20 views of each 3D asset from random camera angles at 256 × 256 resolution. We augment each foreground pixel with an (x, y, z) surface coordinate, giving an image of shape 256 × 256 × 7. We apply an 8 × 8 patch embedding to these augmented renderings, resulting in a sequence of 20,480 vectors representing a multiview point cloud.

Our encoder begins by using a point convolution layer to downsample the input point cloud into a set of 1K embeddings. This set of embeddings is concatenated with a learned input embedding hl to obtain a query sequence h. We then update h with a single cross-attention layer that references the input point cloud. Next, we update h again by cross-attending to the patch embedded multiview point cloud m. Next, we apply a transformer to h and take the 1K suffix tokens as latent vectors. We then apply a tanh(x) activation to these latents to clamp them to the range [−1,1]. At this stage, we have obtained the latent vector that we target with our diffusion models, but we do not yet have the parameters of an MLP.

After computing the sequence of latents, we apply Gaussian diffusion noise q(h_t) to the latents with probability 0.1. For this diffusion noise, we use the schedule ᾱ_t = 1 − t^5, which typically produces very little noise. After the noise and bottleneck layers, we project each latent vector to 256 dimensions and stack the resulting latents into four MLP weight matrices of size 256 × 256. Our full encoder architecture is described in Algorithm 1.
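For intuition, here is a minimal, runnable PyTorch sketch of the flow in Algorithm 1. It is not the released Shap-E implementation: the point convolution is replaced by naive subsampling plus a linear layer, the module sizes are illustrative assumptions, and the diffusion-noise step is a crude stand-in.

import torch
import torch.nn as nn

class LatentEncoderSketch(nn.Module):
    def __init__(self, d_model=256, n_latents=1024, n_heads=8):
        super().__init__()
        self.h_l = nn.Parameter(torch.randn(n_latents, d_model))   # learned input embeddings h_l
        self.point_proj = nn.Linear(6, d_model)                    # xyz + rgb -> d_model
        self.patch_proj = nn.Linear(8 * 8 * 7, d_model)            # flattened 8x8 patches of 7-channel views
        self.attn_pc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_mv = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=4)
        self.out_proj = nn.Linear(d_model, 256)                    # each latent becomes one row of MLP weights

    def forward(self, points, patches, noise_prob=0.1):
        # points: (B, 16384, 6) RGB point cloud; patches: (B, 20480, 448) multiview patch tokens
        B = points.shape[0]
        pc = self.point_proj(points)
        coarse = pc[:, ::16]                                       # crude stand-in for PointConv downsampling to 1K tokens
        h = torch.cat([coarse, self.h_l.expand(B, -1, -1)], dim=1) # step 1: Cat([PointConv(p), h_l])
        h = h + self.attn_pc(h, pc, pc)[0]                         # step 2: cross-attend to the input point cloud
        mv = self.patch_proj(patches)
        h = h + self.attn_mv(h, mv, mv)[0]                         # step 3: cross-attend to the multiview patches
        h = self.backbone(h)                                       # step 4: transformer
        h = torch.tanh(h[:, -self.h_l.shape[0]:])                  # steps 5-6: keep the 1K suffix tokens, clamp to [-1, 1]
        if self.training and torch.rand(()).item() < noise_prob:
            h = h + 0.01 * torch.randn_like(h)                     # step 7: simplified stand-in for q(h_t)
        theta = self.out_proj(h)                                   # step 8: project to 256 dims, later stacked into 256x256 matrices
        return h, theta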

A.2        Encoder Training

We pre-train our encoders for 600K iterations using Adam [29] with a learning rate of 10−4 and a batch size of 64. We perform STF distillation for 50K iterations with a learning rate of 10−5 and keep the batch size at 64. We query 32K random points on each 3D asset for STF distillation. We fine-tune on STF renders for 65K iterations with the same hyperparameters as for distillation. For each stage of training, we re-initialize the optimizer state. For pre-training, we use 16-bit precision with loss scaling [37], but we found full 32-bit precision necessary to stabilize fine-tuning.
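A rough sketch of that pre-training recipe follows, with random tensors and a dummy objective standing in for the real point-cloud/render pipeline and the NeRF/STF reconstruction losses; only the optimizer settings come from the text above, everything else is an assumption.

import torch

encoder = LatentEncoderSketch()                                          # the sketch above
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)                    # Adam [29], lr 1e-4
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())    # 16-bit training with loss scaling [37]

for step in range(600_000):                                              # pre-training length from A.2
    points = torch.randn(4, 16384, 6)                                    # stand-in batch (the paper uses batch size 64)
    patches = torch.randn(4, 20480, 8 * 8 * 7)
    with torch.autocast("cuda", enabled=torch.cuda.is_available()):
        latents, theta = encoder(points, patches)
        loss = theta.pow(2).mean()                                       # dummy loss; the real one renders the implicit function
    opt.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()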

A.3        Implicit Representations

Figure 8: NeRF reconstructions of noised latents using our diffusion schedule ᾱ_t = e^(−12t). The timestep t is linearly swept from 0 to 1 from left to right.

We represent our INRs as 6-layer MLPs where the first four layers are determined by the output of an encoder; the final two layers are shared across dataset examples. We do not use any biases in these models. Input coordinates are concatenated with sinusoidal embeddings following the work of Mildenhall et al. [38] and Watson et al. [67]. In particular, each coordinate dimension x is expanded as

[x, cos(2^0 x), sin(2^0 x), ..., cos(2^14 x), sin(2^14 x)]
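A small sketch of that expansion (15 frequencies, 2^0 through 2^14), just to make the shapes concrete:

import torch

def expand_coordinate(x: torch.Tensor, n_freqs: int = 15) -> torch.Tensor:
    # x: tensor of coordinate values; returns [..., 1 + 2 * n_freqs] features per coordinate dimension
    freqs = 2.0 ** torch.arange(n_freqs, dtype=x.dtype, device=x.device)   # 2^0 ... 2^14
    angles = x[..., None] * freqs
    return torch.cat([x[..., None], torch.cos(angles), torch.sin(angles)], dim=-1)

emb = expand_coordinate(torch.rand(4, 3))   # a batch of 4 (x, y, z) coordinates
print(emb.shape)                            # torch.Size([4, 3, 31]); flattened before entering the MLP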

Our MLPs use SiLU activations [20] between intermediate layers. The NeRF density and RGB heads are followed by sigmoid and ReLU activations, respectively. The SDF and texture color heads are followed by tanh and sigmoid activations, respectively.

Although we use direction independent lighting in all of our experiments, we found our encoders unstable to train unless we augmented their input coordinates with ray direction embeddings. Unlike typical NeRF models, our models’ density head can be influenced by the ray direction, potentially leading to view-inconsistent objects. To ensure view-consistency at test time, we always set the ray direction embeddings to zero. Despite the out-of-distribution inputs, this approach is effective, likely because the model learns to disregard the ray direction with sufficient training. It remains an open question why the ray direction is beneficial during initial pre-training, yet appears irrelevant in later stages of training.

A.4  Diffusion Models

When training our diffusion models, we employ the same hyperparameters as used for the 300M parameter Point·E models. The only difference is that we use larger input and output projections to accommodate 1024 feature channels (instead of 6).

For diffusion models, the choice of noise schedule ᾱ_t can often have a big impact on sample quality [39, 27, 7]. Intuitively, the relative scale between a sample and the noise injected at a particular timestep determines how much information is destroyed at that timestep, and we would like a noise schedule that gradually destroys semantic information in the signal. In early experiments, we tried the cosine [39] and linear [22] schedules, as well as a schedule which we found visually to destroy information gradually: ᾱ_t = e^(−12t) (see Figure 8). In these experiments, we found that the latter schedule performed better on evaluation metrics, and decided to use it for all future experiments.
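To see how gradual that schedule is, here is a quick sketch comparing ᾱ_t = e^(−12t) with the cosine schedule of [39]; this is for intuition only, not training code.

import math
import torch

t = torch.linspace(0.0, 1.0, 11)
alpha_bar_exp = torch.exp(-12.0 * t)                               # the schedule chosen above
alpha_bar_cos = torch.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2  # cosine schedule [39]
for ti, ae, ac in zip(t.tolist(), alpha_bar_exp.tolist(), alpha_bar_cos.tolist()):
    print(f"t={ti:.1f}  exp(-12t)={ae:.4f}  cosine={ac:.4f}")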

We use similar Heun sampling hyperparameters as Point·E, but found that setting s_churn = 0 was a better choice for Shap·E, whereas s_churn = 3 was better for Point·E. Additionally, we found that, while our image-conditional models tended to prefer the same guidance scale as Point·E, our text-conditional models could tolerate much higher guidance scales while still improving on evaluations (Figure 9). We find that our best text-conditional Point·E samples are obtained using a scale of 5.0, while the best Shap·E results use a scale of 20.0.
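These settings map directly onto the sample_latents call in the Colab notebook reproduced later on this page; swapping in the paper's preferred text-conditional scale could look like this (model, diffusion and sample_latents loaded exactly as in that notebook, so treat this as a variation on the document's own cell rather than new API):

latents = sample_latents(
    batch_size=1,
    model=model,                      # the text300M model
    diffusion=diffusion,
    guidance_scale=20.0,              # paper's best text-conditional scale (the notebook below uses 15.0)
    model_kwargs=dict(texts=["a red cube on top of a blue cube"]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,                  # Karras/Heun sampler
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,                        # s_churn = 0 works better for Shap-E; Point-E preferred 3
)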

A.5   Evaluation

When evaluating CLIP-based metrics, we render our models’ samples using NeRF at 128 × 128 resolution. We sample camera positions randomly around the z-axis, with a constant 30 degree elevation for all camera poses. We find that this works well in practice, since objects in our training dataset are usually oriented with the z-axis as the logical vertical direction.
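A sketch of that camera layout is below; the radius and view count are illustrative assumptions, and only the constant 30-degree elevation and the random azimuth around the z-axis come from the text.

import numpy as np

def eval_camera_origins(n_views: int = 8, radius: float = 4.0, elevation_deg: float = 30.0) -> np.ndarray:
    elev = np.deg2rad(elevation_deg)
    azimuths = np.random.uniform(0.0, 2.0 * np.pi, size=n_views)   # random angle around the z-axis
    x = radius * np.cos(elev) * np.cos(azimuths)
    y = radius * np.cos(elev) * np.sin(azimuths)
    z = np.full(n_views, radius * np.sin(elev))                    # constant elevation
    return np.stack([x, y, z], axis=1)                             # (n_views, 3) origins, looking at the object at the origin

print(eval_camera_origins(4))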

B   Overfitting in Text-Conditional Models

We observe that our text-conditional model begins to get worse on evaluations after roughly 600K iterations. We hypothesize that this is due to overfitting to the text captions in the dataset, since we did not observe this phenomenon in the image-conditional case. In Figure 10a, we observe that the training loss decreases faster than the validation loss, but that the validation loss itself never starts increasing. Why, then, does the model get worse on evaluations?

Figure 9: Evaluation sweep over guidance scale for text-conditional models. We find that Shap·E benefits from increasing guidance scale up to 20.0, whereas Point·E begins to saturate at lower guidance scales and then becomes worse.

Figure 10: Training and validation losses for our text-conditional model. (a) Train and validation loss averaged across all diffusion steps. (b) Train and validation loss averaged over the noisiest quarter of the diffusion steps. We find that this model overfits, and that the overfitting is stronger for the noisiest diffusion timesteps.

To more deeply explore this overfitting, we leverage the fact that the diffusion loss is actually a sum of many different loss terms at different noise levels. In Figure 10b, we plot the training and validation losses over only the noisiest quarter of the diffusion steps, finding that in this case overfitting is more pronounced and the validation loss indeed starts increasing at about 600K iterations. Intuitively, conditioning information is more likely to affect noisier timesteps since less information can be inferred from the noised sample xt. This supports the hypothesis that the overfitting is tied to the model’s understanding of the conditioning signal, although it may still be overfitting to other aspects of the data.
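The per-noise-level breakdown is easy to reproduce for any diffusion model by bucketing per-example losses by timestep; a sketch follows, where the tensors are hypothetical stand-ins for values logged during an evaluation pass.

import torch

def noisiest_quarter_loss(losses: torch.Tensor, timesteps: torch.Tensor, T: int = 1024) -> torch.Tensor:
    # Average the diffusion loss over only the noisiest 25% of timesteps.
    mask = timesteps >= (3 * T) // 4
    return losses[mask].mean()

losses = torch.rand(4096)                       # stand-in per-example losses
timesteps = torch.randint(0, 1024, (4096,))     # stand-in timesteps at which each loss was computed
print(noisiest_quarter_loss(losses, timesteps))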

C   Bias and Misuse

Biases present in our dataset are likely to impact the behavior of the models we develop. In Figure 11, we examine bias within our text-conditional model by providing it with ambiguous captions in which certain details, such as body shape or color, are left unspecified. We observe that the samples generated by the model exhibit common gender-role stereotypes in response to these ambiguous prompts.

Our models are not typically adept at producing photo-realistic samples or accurately following long and complex prompts, and this limitation comes with both benefits and drawbacks. On the positive side, it alleviates concerns regarding the potential use of our models to create convincing “DeepFakes” [68]. On the negative side, it raises potential risks when our models are used in conjunction with fabrication methods such as 3D printing to create tools and parts (e.g. Figure 12). In such scenarios, 3D objects generated by the model could be introduced into the real world without undergoing adequate validation or safety testing, and this could potentially be harmful when the produced samples do not adequately meet the desired prompt.

Figure 11: Examples where our text-conditional model likely exhibits biases from its dataset (prompts: “a doctor”, “a nurse”, “an engineer”).

Figure 12: Examples of generated 3D objects which could have adverse consequences if used in the real world without validation (prompt: “a 1/8" titanium drill bit”).

D   Guidance in Image Space

While our diffusion models operate in a latent space, we find that it is possible to guide them directly in image space. During sampling, we have some noised latent x_t and a corresponding model prediction x_0 = f(x_t). If we treat the model prediction as a latent vector and render it with NeRF to get an image I, we can compute the gradient of any image-based objective function L with respect to x_t (that is, ∇_{x_t} L(I)) by backpropagating through the NeRF renderer and the model prediction.

Given this gradient, we can then follow the classifier guidance setup of Dhariwal and Nichol [9] to update each diffusion step in the direction of a scaled gradient.
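A rough sketch of how such a guided step could be wired up is shown below; model_predict_x0, render_to_image, image_objective and ddpm_step are hypothetical stand-ins for the diffusion model's x0 prediction, a differentiable NeRF render, a DreamFusion-style image loss, and an ordinary DDPM update.

import torch

def guided_ddpm_step(x_t, t, s, model_predict_x0, render_to_image, image_objective, ddpm_step):
    x_t = x_t.detach().requires_grad_(True)
    x0 = model_predict_x0(x_t, t)                 # treat the prediction as a latent vector
    image = render_to_image(x0)                   # differentiable NeRF rendering of the latent
    loss = image_objective(image)                 # image-space objective L
    grad = torch.autograd.grad(loss, x_t)[0]      # dL/dx_t, back through the renderer and the model
    return ddpm_step(x_t, t).detach() - s * grad  # classifier-guidance-style update [9] with scale s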

To test this idea, we leverage DreamFusion [45] to obtain image-space gradients that incentivize rendered images to match a text prompt. Since DreamFusion requires a powerful text-to-image diffusion model, we use the 3 billion parameter GLIDE model [40]. We sample from our text-conditional Shap·E model using 1,024 stochastic DDPM steps. At each step, we use eight rendered views of the NeRF to obtain an estimate of the DreamFusion gradient. We then scale this gradient by a hyperparameter s before applying a guided sampling step. This process takes roughly 15 minutes on eight NVIDIA V100 GPUs.

In Figure 13, we explore what happens as we increase the DreamFusion guidance scale s while keeping the diffusion noise fixed. We observe in general that this text-conditional Shap·E model is not very good on its own with DDPM sampling, failing to capture the text prompts with s = 0. However, as we increase s, we find that the samples tend to approach something more closely matching the prompt. Notably, this is despite the fact that we do not use most of the tricks employed by DreamFusion, such as normals-based shading or grayscale rendering.

Figure 13: Using DreamFusion to guide our text-conditional Shap·E model. Columns show the prompts “a corgi”, “a corgi wearing a santa hat”, and “a red cube on top of a blue cube”; rows show guidance scales s = 0, 0.01, 0.05, 0.1, 0.5, and 1.0.



[1] We use different linear output heads to produce coarse and fine predictions.

[2] In preliminary scans, we found that L1 loss outperformed L2 loss on PSNR after an initial warmup period where L1 was worse.

[3] However, note that Shap·E depends on a separate encoder model, while Point·E depends on separate upsampler and SDF models. Only the base diffusion model architecture is the same.




SHAP-E - Text-to-3D

!git clone https://github.com/openai/shap-e

Cloning into 'shap-e'...

remote: Enumerating objects: 304, done.

remote: Counting objects: 100% (48/48), done.

remote: Compressing objects: 100% (37/37), done.

remote: Total 304 (delta 19), reused 23 (delta 11), pack-reused 256

Receiving objects: 100% (304/304), 11.71 MiB | 22.49 MiB/s, done.

Resolving deltas: 100% (19/19), done.




%cd shap-e

/content/shap-e

!pip install -e .


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

Obtaining file:///content/shap-e

  Preparing metadata (setup.py) ... done

Collecting clip@ git+https://github.com/openai/CLIP.git

  Cloning https://github.com/openai/CLIP.git to /tmp/pip-install-_otx4sh3/clip_ecf5c957d3ea41669a4651860e427321

  Running command git clone --filter=blob:none --quiet https://github.com/openai/CLIP.git /tmp/pip-install-_otx4sh3/clip_ecf5c957d3ea41669a4651860e427321

  Resolved https://github.com/openai/CLIP.git to commit a9b1bf5920416aaeaec965c25dd9e8f98c864f16

  Preparing metadata (setup.py) ... done

Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (3.12.0)

Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (8.4.0)

Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (2.0.0+cu118)

Collecting fire

  Downloading fire-0.5.0.tar.gz (88 kB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.3/88.3 kB 8.8 MB/s eta 0:00:00

  Preparing metadata (setup.py) ... done

Requirement already satisfied: humanize in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (4.6.0)

Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (2.27.1)

Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (4.65.0)

Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (3.7.1)

Requirement already satisfied: scikit-image in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (0.19.3)

Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (1.10.1)

Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from shap-e==0.0.0) (1.22.4)

Collecting blobfile

  Downloading blobfile-2.0.2-py3-none-any.whl (74 kB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 74.5/74.5 kB 10.1 MB/s eta 0:00:00

Collecting pycryptodomex~=3.8

  Downloading pycryptodomex-3.17-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 10.1 MB/s eta 0:00:00

Requirement already satisfied: lxml~=4.9 in /usr/local/lib/python3.10/dist-packages (from blobfile->shap-e==0.0.0) (4.9.2)

Requirement already satisfied: urllib3<3,>=1.25.3 in /usr/local/lib/python3.10/dist-packages (from blobfile->shap-e==0.0.0) (1.26.15)

Collecting ftfy

  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 7.7 MB/s eta 0:00:00

Requirement already satisfied: regex in /usr/local/lib/python3.10/dist-packages (from clip@ git+https://github.com/openai/CLIP.git->shap-e==0.0.0) (2022.10.31)

Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from clip@ git+https://github.com/openai/CLIP.git->shap-e==0.0.0) (0.15.1+cu118)

Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from fire->shap-e==0.0.0) (1.16.0)

Requirement already satisfied: termcolor in /usr/local/lib/python3.10/dist-packages (from fire->shap-e==0.0.0) (2.3.0)

Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (2.8.2)

Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (23.1)

Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (0.11.0)

Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (1.4.4)

Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (1.0.7)

Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (3.0.9)

Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->shap-e==0.0.0) (4.39.3)

Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.10/dist-packages (from requests->shap-e==0.0.0) (2.0.12)

Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->shap-e==0.0.0) (2022.12.7)

Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->shap-e==0.0.0) (3.4)

Requirement already satisfied: networkx>=2.2 in /usr/local/lib/python3.10/dist-packages (from scikit-image->shap-e==0.0.0) (3.1)

Requirement already satisfied: imageio>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image->shap-e==0.0.0) (2.25.1)

Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image->shap-e==0.0.0) (1.4.1)

Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.10/dist-packages (from scikit-image->shap-e==0.0.0) (2023.4.12)

Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch->shap-e==0.0.0) (4.5.0)

Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch->shap-e==0.0.0) (1.11.1)

Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch->shap-e==0.0.0) (2.0.0)

Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch->shap-e==0.0.0) (3.1.2)

Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch->shap-e==0.0.0) (16.0.2)

Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch->shap-e==0.0.0) (3.25.2)

Requirement already satisfied: wcwidth>=0.2.5 in /usr/local/lib/python3.10/dist-packages (from ftfy->clip@ git+https://github.com/openai/CLIP.git->shap-e==0.0.0) (0.2.6)

Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch->shap-e==0.0.0) (2.1.2)

Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch->shap-e==0.0.0) (1.3.0)

Building wheels for collected packages: clip, fire

  Building wheel for clip (setup.py) ... done

  Created wheel for clip: filename=clip-1.0-py3-none-any.whl size=1369398 sha256=f3e22e6c05776411029b81feaa8f66442b6439defe6abb3db450aaee26ff1ca6

  Stored in directory: /tmp/pip-ephem-wheel-cache-d861pv8v/wheels/da/2b/4c/d6691fa9597aac8bb85d2ac13b112deb897d5b50f5ad9a37e4

  Building wheel for fire (setup.py) ... done

  Created wheel for fire: filename=fire-0.5.0-py2.py3-none-any.whl size=116952 sha256=53aed103179e74f6f6e17ec68992bd34cb4b834de70b232b7935602e95b29ac6

  Stored in directory: /root/.cache/pip/wheels/90/d4/f7/9404e5db0116bd4d43e5666eaa3e70ab53723e1e3ea40c9a95

Successfully built clip fire

Installing collected packages: pycryptodomex, ftfy, fire, blobfile, clip, shap-e

  Running setup.py develop for shap-e

Successfully installed blobfile-2.0.2 clip-1.0 fire-0.5.0 ftfy-6.1.1 pycryptodomex-3.17 shap-e-0.0.0 




import torch


from shap_e.diffusion.sample import sample_latents

from shap_e.diffusion.gaussian_diffusion import diffusion_from_config

from shap_e.models.download import load_model, load_config

from shap_e.util.notebooks import create_pan_cameras, decode_latent_images, gif_widget



device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')




xm = load_model('transmitter', device=device)

model = load_model('text300M', device=device)

diffusion = diffusion_from_config(load_config('diffusion'))



batch_size = 1

guidance_scale = 15.0

prompt = "a sword"


latents = sample_latents(

    batch_size=batch_size,

    model=model,

    diffusion=diffusion,

    guidance_scale=guidance_scale,

    model_kwargs=dict(texts=[prompt] * batch_size),

    progress=True,

    clip_denoised=True,

    use_fp16=True,

    use_karras=True,

    karras_steps=64,

    sigma_min=1e-3,

    sigma_max=160,

    s_churn=0,

)



render_mode = 'nerf' # you can change this to 'stf'

size = 64 # this is the size of the renders; higher values take longer to render.


cameras = create_pan_cameras(size, device)

for i, latent in enumerate(latents):

    images = decode_latent_images(xm, latent, cameras, rendering_mode=render_mode)

    display(gif_widget(images))
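If you want to keep the turntable render instead of only displaying it in the notebook, the frames returned by decode_latent_images are (assuming the usual return type) PIL images, so they can be written out as an animated GIF:

# assuming `images` from the loop above is a list of PIL images
images[0].save(
    "example_render_0.gif",
    save_all=True,
    append_images=images[1:],
    duration=60,      # milliseconds per frame
    loop=0,           # loop forever
)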



# Example of saving the latents as meshes.

from shap_e.util.notebooks import decode_latent_mesh

for i, latent in enumerate(latents):

    with open(f'example_mesh_{i}.ply', 'wb') as f:

        decode_latent_mesh(xm, latent).tri_mesh().write_ply(f)

/content/shap-e/shap_e/models/stf/renderer.py:286: UserWarning: exception rendering with PyTorch3D: No module named 'pytorch3d'

  warnings.warn(f"exception rendering with PyTorch3D: {exc}")

/content/shap-e/shap_e/models/stf/renderer.py:287: UserWarning: falling back on native PyTorch renderer, which does not support full gradients

  warnings.warn(



https://colab.research.google.com/drive/1XvXBALiOwAT5-OaAD7AygqBXFqTijrVf?usp=sharing#scrollTo=7-fLWame0qJw

SHAP-E - Text-to-3D



Shap·E: The Revolutionary 3D Asset Generator by OpenAI



Taking Your Existing Business Further With HuggingGPT AI

HuggingFace Crash Course


HuggingFace Crash Course - Sentiment Analysis, Model Hub, Fine Tuning



What is Hugging Face - Crash Course (No Coding) | ML Products for Beginners



Getting Started with AI powered Q&A using Hugging Face Transformers | HuggingFace Tutorial



Hugging Face Agents


NEW Hugging Face Agents — First Look


Hugging Face has announced its take on Large Language Model (LLM) agents. It is similar to what we see in LangChain agents, Haystack agents, and ChatGPT plugins, but it is incredibly easy to get started with and comes with access to Hugging Face's huge hub of NLP models (taking inspiration from HuggingGPT).

00:00 Hugging Face Agents

00:38 Agents and Tools Explained

02:15 Current Agents Landscape

04:35 Taking Inspiration from HuggingGPT

05:45 Getting Started with Hugging Face Agents

08:28 Querying the Agents

10:58 Agent Prompt Template

11:46 Conversational Chatbot Agent in Hugging Face

16:44 Community Tools in Hugging Face


👋🏼 Socials:

Twitter: https://twitter.com/jamescalam

LinkedIn: https://www.linkedin.com/in/jamescalam/

Instagram: https://www.instagram.com/jamescalam/


These agents look great, with the integration into the HF hub, multi-modality, and an easy-to-use implementation, and I'm looking forward to doing more on HF agents in the near future.
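As a taste of what the video walks through, here is a minimal sketch of the Transformers Agents API. The class name and arguments are taken from the announcement-era examples, so treat them as assumptions and check the current documentation; the key string is a placeholder, and the linked notebook below goes into more depth.

from transformers import OpenAiAgent

# Build an agent backed by an OpenAI completion model (text-davinci-003 in the video).
agent = OpenAiAgent(model="text-davinci-003", api_key="sk-...")

# One-shot task: the agent picks a Hub tool (here an image generator) and runs it.
boat_image = agent.run("Generate an image of a boat in the water")

# Chat mode keeps the Python state between turns, so later requests can refer to earlier results.
agent.chat("Generate an image of a giraffe riding a skateboard")
agent.chat("Transform the giraffe in the image into an elephant")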

🔗 Link to notebook:

https://github.com/aurelio-labs/cookb...

🎙️ Support me on Patreon:

https://patreon.com/JamesBriggs

🤖 70% Discount on the NLP With Transformers in Python course:

https://bit.ly/3DFvvY5

👾 Discord:

https://discord.gg/c5QtDB9RAP


Transcript        

PART 1 

0:00

hugging face have just announced

0:01

something that I think is probably going

0:04

to be a very major thing in the future

0:07

of large language models and NLP and that

0:10

has their spin on agents for large

0:14

language models and Transformers in

0:16

general now there's quite a few reasons

0:18

as to why I think Hugging Face are in a

0:21

very good position to offer possibly one

0:24

of the best agents and Tool Frameworks

0:26

out there and I'm going to discuss those

0:29

first for those of you that haven't

0:31

heard of these things before are just

0:33

kind of not sure what they are let me

0:36

quickly explain what an agent and what a

0:37

tool is so we know what large language

Agents and Tools Explained

0:41

models are they are big Transformer

0:43

models that can basically answer

0:46

questions in natural language for us

0:48

based on some natural language input a

0:51

agent kind of takes this and takes it a

0:54

little bit further and expands these LLMs

0:57

out to basically allow them to have

1:00

multiple sets of reasoning and thought

1:03

so they can think to themselves and this

1:06

is ideal for when we want to integrate

1:09

what are called tools so what we can do

1:12

is we can tell an LM hey we want you to

1:15

answer a question if you can't answer by

1:18

yourself you can actually refer to some

1:20

other tools that we have given to you

1:22

and you might say something like if you

1:25

don't know about a particular topic you

1:27

can perform a Google search in order to

1:29

find out about that particular topic and

1:31

you would also explain you know how we

1:33

can do that and because the llm has this

1:35

multi-step thought process it can say

1:38

okay I've got this question I need to

1:42

use a Google Search tool and then it

1:44

will say how do I use that Google search

1:46

so it's going to provide some input to

1:49

Google search we would then go do a

1:51

Google search for it return some answers

1:53

and pass that back into the LM and now

1:56

all of a sudden it can answer a question

1:57

and you can do this for a ton of tools a

1:59

single databases knowledge base some

2:02

effective databases and so on python

2:04

interpreters you can do basically

2:07

anything you can program you can create

2:09

a tool out of it now obviously this is a

2:12

very powerful thing to be able to do now

2:14

at the moment by far the biggest library

Current Agents Landscape

2:17

for using agents is Lang chain there's

2:20

also haysak who have introduced agents

2:24

recently and and there's also actually

2:26

chat GPT plugins which are a form of

2:29

Agents as well or at least it's a form

2:31

of tools added to the agent which is

2:34

actually ChatGPT itself now Hugging

2:36

Face have also kind of jumped on the

2:39

bandwagon and I think their

2:41

implementation is actually uh very

2:43

interesting and particularly powerful

2:45

for a lot of different reasons a big

2:48

component of that is what Hugging Face

2:50

actually is so Hugging Face is

2:53

essentially almost like a huge community

2:56

and hub of all of these different

2:58

Transformer models diffusion models for

3:02

generating images data sets and just a

3:05

ton of anything you can think of in

3:07

machine learning Hugging Face actually

3:10

cover a lot of it and their version of

3:13

agents and tools are very interesting

3:15

and you know I haven't been all the way

3:17

through yet this is kind of my first

3:19

look but the agent itself is very simple

3:23

to use it also can be used as a

3:26

conversational agent so as a chatbot

3:27

where you have multi-steps in the

3:29

process and it also gives us access to

3:31

all of these models on hugging face

3:34

which I think is one of the coolest

3:36

things about it but and there's other

3:39

things as well but actually let me jump

3:40

into it and show you those rather than

3:43

just talking through them we'll just

3:45

have a quick look at their example here

3:46

so they have basically they didn't

3:49

really show you anything I'm going to go

3:50

through a code in the moment but you run

3:53

agent dot run capture in the following

3:55

image right and then they pass in this

3:57

image right through image here and then

4:00

the output is a beaver throwing in the

4:01

water then they also have this so agent

4:04

run read the following text out loud and

4:07

it will use a text-to-speech model to

4:09

actually do that you can also do this so

4:11

we have like some OCR reading this

4:15

document or this image of a document

4:17

then we say in this following document

4:19

as we asked the question and the output

4:22

is this Ballroom for you which is down

4:24

here at the bottom but I'm right now the

4:27

point here is that these agents are

4:29

using like a ton of different models

4:32

from the Transformers library and I

Taking Inspiration from HuggingGPT

4:35

think this takes pretty clear

4:37

inspiration from this paper called

4:39

HuggingGPT which essentially uses Chat

4:43

GPT with a ton of Hugging Face models to

4:47

do a load of cool things so it use the

4:50

same sort of approach where you have an

4:51

agent which was the chat GPT model and

4:54

if we come down to the first image here

4:56

this basically shows us how it works we

4:58

have a large language model ChatGPT in

5:00

this case or GPT-3.5 turbo I assume

5:03

and this is like your controller and

5:06

then we have all these what we would

5:08

call specialist models that can do

5:10

particular things that chat GPT is not

5:13

able to do like understand what is

5:16

within an image or caption an image and

5:20

chat GPT or your your large language

5:22

model is able to basically figure out

5:26

okay given a question which models do we

5:28

need to use what input do I need to pass

5:30

them and then uses the output from those

5:32

models to inform the next step of trying

5:35

to you know figure out what it needs to

5:37

do and then provide a very cool answer

5:41

that a normal large language model would

5:43

not be able to do by itself now let's

Getting Started with Hugging Face Agents

5:45

take a look at a code example of how we

5:48

can use this now initially I'm just

5:50

actually using one of the examples from

5:53

hooking phase and then we'll just go

5:55

through it's a few cells and then we'll

5:57

do something a little more interesting

5:58

so first thing we need to do here is

6:01

install a few things so Transformers

6:04

which is a library that contains the

6:06

agents because we're going to be using

6:08

image generation and models here so the

6:11

diffusion models we need to use Hugging

6:13

Face diffusers and we're also going to

6:16

use accelerate which I believe allows us

6:19

to run things faster now in reality and

6:21

I'm not sure if this is the case I think

6:24

we should hopefully be running on a GPU

6:27

here okay so here I've gone into my

6:30

runtime settings and just changed my

6:32

Hardware accelerator to GPU and then

6:34

what we do is we're just going to use

6:36

open AI here obviously you know for your

6:38

large language model Hugging Face makes it

6:40

very easy to use other open source

6:42

options so you can you can do that I'm

6:44

just using this because I know it's

6:46

going to work it's quick so here we are

6:49

and I'm going to use text-davinci-003

6:52

basically what I found generally

6:54

speaking is that text DaVinci zero zero

6:57

three is usually better at following

7:00

instructions for tools within agents

7:03

than GPT 3.5 turbo and that's also the

7:06

model that Hugging Face

7:08

are using in their examples so I'm not

7:10

100% sure if they support GPT-3.5 turbo

7:13

yeah I'll need to try at some point so

7:16

yeah all we do is from transformers.tools

7:19

import OpenAiAgent and there are other

7:22

agents as well I think there's the

7:23

hugging face agent is like HF agent

7:27

maybe but obviously we're just going to

7:29

use open AI on here you will need to add

7:32

in your API API key which you can get

7:35

from splatform

7:36

[Music]

7:38

platform.openai.com okay so we're

7:42

going to run this this is going to

7:43

initialize our agent and actually that's

7:45

all we need to do

7:47

and now it works which is very easy and

7:52

quick to sell so I think all we're doing

7:55

here is okay so we're downloading the

7:58

tool configuration and this is a very

8:00

interesting component of Hugging Face's

8:04

tools implementation which is that we

8:07

can download community contributed tools

8:10

and obviously Hugging Face's own tools

8:12

so in the next probably very soon next

8:15

few weeks we're probably going to see

8:17

some pretty insane tools appear from the

8:20

community which will be

8:22

fascinating to see that's one of the big

8:24

components as to why I think this is

8:27

going to be pretty major okay and

Querying the Agents

8:29

another reason I think this is going to

8:31

be pretty major is that we can do

8:33

multi-modal agents super easily so I

8:36

haven't done anything here right I just

8:38

initialize my agent and I said okay I

8:40

want to generate an image of a boat in

8:42

the water right and because Hugging

8:45

Face has they have a big diffusers

8:48

Library which contains loads of text to

8:51

image diffusion models and they

8:54

obviously have all the transform models

8:56

as well they've kind of integrated at

8:59

least a few of those into the default

9:03

agent so if I just say generate an image

9:06

of a bone to water what it's doing here

9:08

is this isn't how long it takes to

9:09

process this is actually downloading the

9:12

model okay so the the image generation

9:14

model this will only happen once okay


PART 2 

9:17

and I'll prove that by running it again

9:19

in a moment so that's going to download

9:23

it is okay here we go so we've got an

9:26

estimation from the agent I'm going to

9:28

use the following tool image generator

9:30

to generate an image according to the

9:32

prompt then it generates some code here

9:35

and

9:37

here we go here's here is our image of a

9:41

boat in the water Okay so

9:44

yeah that was super easy to do and let

9:47

me just run that again okay you'll see

9:50

that it doesn't take quite as long this

9:52

time

9:54

okay so it's generating the image

9:56

[Music]

9:57

and there we go right that was eight

10:00

seconds which you know considering it

10:01

also three seconds to generate the image

10:04

that's pretty good for an agent so that

10:08

is really cool and okay what we can also

10:12

do is okay I have a boat image here

10:14

and we can come down and do agent run

10:17

again and I can pass in this variable

10:20

okay so we use this backtick here and we

10:23

can pass in a variable which we then

10:25

enter its own actual variable here right

10:27

so we could we could also just do like

10:30

image okay and then that just means that

10:32

we need to replace image here okay so

10:35

let's run that I'm just going to ask you

10:37

to write a caption for this image again

10:40

it's going to need to download the

10:43

captioning model as you can see here

10:45

okay and then we get a boat floating in

10:49

the water with clouds in the background

10:50

all right let's run it again so we can

10:52

see how long that actually takes okay so

10:55

four seconds again very quick okay and

Agent Prompt Template

11:00

then so here I was just looking okay

11:02

what does that prompt template look like

11:04

when we're doing that run method you can

11:07

kind of see a little bit logic that is

11:08

going on in here here so I'm going to

11:10

ask you to perform tasks it includes all

11:12

the tools here and then it gives a few

11:14

examples and then ask it to figure out

11:16

what it needs to do next right this is

11:20

just yeah we don't need that it just

11:21

contains all the code for the agent

11:23

which is another nice thing that I like

11:25

about the hugging face implementation is

11:28

that the code is pretty readable

11:30

um so if something isn't quite working

11:32

the way you'd expect it to work you can

11:34

go into code and kind of figure out why

11:35

almost straight away which is not as

11:40

easy to do with other libraries at the

11:43

moment so that that's very nice and yeah

Conversational Chatbot Agent in Hugging Face

11:48

I mean it's super cool now let's have a

11:50

look at a conversational agent so

11:51

basically a chat bot right so I'm going

11:53

to say hey how are you okay we just get

11:56

this right hi there I'm doing well thank

11:57

you for asking cool I'm gonna ask you to

12:00

create an image of a draft riding a

12:02

skateboard and I I just made this up

12:04

very quickly before running this

12:07

I mean the results are not perfect but

12:09

they're they're intro they're funny

12:11

right so we're not running any special

12:14

diffusion models here so we'll get like

12:16

this weird two-headed giraffe but you

12:18

know let's stick with that and like I

12:20

said the resources are entertaining and

12:22

then they're not particularly impressive

12:24

from a image generation point of view

12:27

but it's just interesting to to see so

12:30

you're using a image generator model and

12:34

then you come down here and it's not

12:36

going to use a image generator model

12:39

it's going to use an image Transformer

12:42

model to modify the existing image and

12:45

this is something that is really cool as

12:48

well so okay first it needs to download

12:50

that in that model so let me explain

12:52

what is so cool here right so look you

12:55

can see that it's generating some code

12:57

right and this code is actually

13:00

referring to the image okay and the

13:03

image is generated by this code

13:05

beforehand so the python interpreter

13:08

that all of this is using is maintained

13:11

between chat interactions it's going to

13:14

write some code and then you can say oh

13:15

actually can you do something else and

13:17

it's it can still interact with that

13:18

coding it can still see that code and

13:21

it's going to write some more code based

13:23

on what's already done

13:25

which is not something that I have seen

13:28

done by default in other libraries like

13:32

use agents and tools so I mean that's

13:34

just a really cool thing that I I like

13:36

and it's just insane how easy it is to

13:39

get that working so cool yeah so now we

13:43

get this right so it's an elephant so

13:45

this image transform model I even use an

13:47

image transfer model before I didn't

13:49

actually know they were a thing but I

13:52

think what it does is identifies wearing

13:54

the image and the draft is which it's

13:57

done and then just try to modify that

13:59

part of the image so we get this kind of

14:02

weird I mean yeah I can see what it's

14:05

trying to do but it's interesting right

14:08

so okay cool and then this this didn't

14:11

work for me before I want to try it

14:13

again so could you give the elephant

14:14

shiny laser eyes last time I tried this

14:17

it made the elephant like made of gold

14:19

uh let's see what it does this time

14:21

maybe it just read could you maybe

14:24

elephant shiny I'm not sure

14:27

okay so it went from that again so now

14:30

we have like a I don't know what it is

14:33

it's like a gold giraffe I think

14:35

um and then okay we can caption the

14:37

image

14:38

so I'm very curious as to what it says

14:41

about this image okay and this caption

14:43

is a giraffe standing on the

14:46

skateboard before I'm pretty sure it

14:48

gave me a very similar very similar

14:52

output so I wonder if is it uh okay so

14:55

it's a modified image right so can you

14:57

caption the modified image

15:00

or what I'm gonna say is

15:03

I'm gonna copy this

15:06

and I'm gonna say sorry I meant the

15:09

modified image

15:13

okay okay a g graph okay so the code is

15:19

right so the caption is image captioner

15:21

modified image and then this is weird

15:23

I'm not sure why like maybe there's some

15:27

weird stuff going on the tokenizer here

15:29

but yeah here we get a giraffe on a skate

15:33

board okay fair enough and then I wanted

15:36

to test this a little more can you

15:38

search the internet some more of these

15:40

types of images so a Search tool is a

15:42

pretty typical tool that is included

15:45

within agents and I just wanted to see

15:47

if they include that by default so let's

15:50

try and we'll we'll see okay so

15:53

unfortunately no they don't seem to so

15:55

it refers to a text downloader tool okay

15:58

and that is apparently a thing so

16:00

downloads the text downloader model or

16:04

tool I'm not sure what it is exactly and

16:06

yeah it just download some tapes so it

16:09

doesn't work for everything yet but that

16:13

think is is already pretty cool the fact

16:15

that we're just referring to all these

16:16

models that we have this like python

16:20

interpreter just built in and just so

16:23

it's so easy to use I think is is really

16:26

interesting and yeah for sure we're

16:29

definitely going to do a lot more on

16:30

Transformer agents in the future but for

16:34

now yeah I just wanted to introduce the

16:37

the library to you or the the new

16:40

features to you and also just explore

16:42

them myself again like I said there's a

Community Tools in Hugging Face

16:46

massive Community aspect to this so that

16:48

is probably one of the biggest things

16:50

that I think Hugging Face agents has

16:52

going for it the fact that they will

16:55

have and I haven't I haven't seen if

16:57

there are

16:58

if we can actually find them on the Hugging

17:00

Face website but let me show you what it

17:03

looks like with just models so we can

17:05

come over here we have models right and

17:09

there's just tons of models on hugging

17:11

face right now imagine that they're

17:14

planning to do or are doing the same

17:16

thing with tools and it's not here yet I

17:18

don't I don't see any tools but clearly

17:21

the the code or the interface is already

17:23

there because we were downloading tools

17:25

here I I believe we're downloading tools

17:27

here so that is super interesting and

17:31

yeah I'm sure people are going to build

17:32

some insane tools very quickly

17:35

so yeah that will be pretty huge in my

17:39

opinion now I haven't seen how

17:41

customizable these agents are yet that's

17:43

something exploring very soon but I

17:46

would imagine you know who you can face

17:48

to make things pretty simple so I my

17:51

expectation is that it will be pretty

17:52

easy to to work through and figure all

17:54

that out

17:55

so yeah overall I'm I'm very excited to

17:59

see what they what they do with this I

18:01

think this will be a really cool feature

18:04

but for now I'm going to leave it there

18:06

so I hope this has been interesting and

18:08

useful thank you very much for watching

18:12

and I will see you again in the next one

18:14

bye

18:18

foreign


Shap-E



ALL 5 STAR AI.IO PAGE STUDY

How AI and IoT are Creating An Impact On Industries Today


HELLO AND WELCOME  TO THE 


5 STAR AI.IOT TOOLS FOR YOUR BUSINESS


OUR NEW WEBSITE IS ABOUT 5 STAR AI and IoT TOOLS on the net.

We provide you with the best Artificial Intelligence tools and services that can be used to create and improve BUSINESS websites AND CHANNELS.

This site includes tools for creating interactive visuals, animations, and videos, as well as tools for SEO, marketing, and web development. It also includes tools for creating and editing text, images, and audio. The website is intended to provide users with a comprehensive list of AI-based tools to help them create and improve their business.

https://studio.d-id.com/share?id=078f9242d5185a9494e00852e89e17f7&utm_source=copy

This website is a collection of Artificial Intelligence (AI) tools and services that can be used to create and improve websites. It includes tools for creating interactive visuals, animations, and videos, as well as tools for SEO, marketing, and web development. It also includes tools for creating and editing text, images, and audio. The website is intended to provide users with a comprehensive list of AI-based tools to help them create and improve their websites.




Hello and welcome to our new site that shares with you the most powerful web platforms and tools available on the web today

All platforms, websites and tools use artificial intelligence (AI) and have a 5-star rating

All platforms, websites and tools offer free and paid Pro plans

The platforms, websites and tools are the best for growing your business in 2022/3


A Guide for AI-Enhancing Your Existing Business Application



What is Artificial Intelligence and how does it work? What are the 3 types of AI?

The 3 types of AI are:

General AI: AI that can perform all of the intellectual tasks a human can. Currently, no form of AI can think abstractly or develop creative ideas in the same ways as humans.

Narrow AI: Narrow AI commonly includes visual recognition and natural language processing (NLP) technologies. It is a powerful tool for completing routine jobs based on common knowledge, such as playing music on demand via a voice-enabled device.

Broad AI: Broad AI typically relies on exclusive data sets associated with the business in question. It is generally considered the most useful AI category for a business. Business leaders will integrate a broad AI solution with a specific business process where enterprise-specific knowledge is required.

How can artificial intelligence be used in business?

AI is providing new ways for humans to engage with machines, transitioning personnel from pure digital experiences to human-like natural interactions. This is called cognitive engagement. AI is augmenting and improving how humans absorb and process information, often in real time. This is called cognitive insights and knowledge management. Beyond process automation, AI is facilitating knowledge-intensive business decisions, mimicking complex human intelligence. This is called cognitive automation.

What are the different artificial intelligence technologies in business?

Machine learning, deep learning, robotics, computer vision, cognitive computing, artificial general intelligence, natural language processing, and knowledge reasoning are some of the most common business applications of AI.

What is the difference between artificial intelligence, machine learning, and deep learning?

Artificial intelligence (AI) applies advanced analysis and logic-based techniques, including machine learning, to interpret events, support and automate decisions, and take actions. Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning unsupervised from data that is unstructured or unlabeled.

What are the current and future capabilities of artificial intelligence?

Current capabilities of AI include examples such as personal assistants (Siri, Alexa, Google Home), smart cars (Tesla), behavioral adaptation to improve the emotional intelligence of customer support representatives, using machine learning and predictive algorithms to improve the customer's experience, transactional AI like that of Amazon, personalized content recommendations (Netflix), voice control, and learning thermostats. Future capabilities of AI might include fully autonomous cars, precision farming, future air traffic controllers, future classrooms with ambient informatics, urban systems, smart cities and so on.

To know more about the scope of artificial intelligence in your business, please connect with our expert.
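As a concrete illustration of the difference between explicit programming and machine learning described above, here is a small Python sketch using scikit-learn; the loan-approval rule and the toy data are invented for the example.

# Illustrative contrast: a hand-written rule versus a model that learns a
# similar decision from labeled examples (toy data, scikit-learn assumed).
from sklearn.linear_model import LogisticRegression

# Explicit programming: the decision logic is written by a person.
def approve_by_rule(income, debt):
    return income > 50_000 and debt < 10_000

# Machine learning: the decision logic is inferred from past examples.
X = [[60_000, 5_000], [30_000, 12_000], [80_000, 2_000], [25_000, 9_000]]
y = [1, 0, 1, 0]  # 1 = approved, 0 = declined (made-up labels)
model = LogisticRegression(max_iter=1000).fit(X, y)

print(approve_by_rule(70_000, 3_000))    # hand-coded answer
print(model.predict([[70_000, 3_000]]))  # learned answer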


Glossary of Terms


Application Programming Interface (API):

An API, or application programming interface, is a set of rules and protocols that allows different software programs to communicate and exchange information with each other. It acts as a kind of intermediary, enabling different programs to interact and work together, even if they are not built using the same programming languages or technologies. APIs provide a way for different software programs to talk to each other and share data, helping to create a more interconnected and seamless user experience.
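For example, a few lines of Python with the requests library show one program asking another for data over an API; the URL below is a placeholder, not a real service.

# One program (this script) exchanging data with another program (a web
# service) through its API. The endpoint URL is a placeholder.
import requests

response = requests.get(
    "https://api.example.com/v1/products",   # hypothetical API endpoint
    params={"category": "3d-models"},        # query parameters
    headers={"Accept": "application/json"},  # ask for machine-readable JSON
    timeout=10,
)
response.raise_for_status()   # fail loudly on 4xx/5xx errors
print(response.json())        # the structured data the API returned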

Artificial Intelligence (AI):

The intelligence displayed by machines in performing tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and language understanding. AI is achieved by developing algorithms and systems that can process, analyze, and understand large amounts of data and make decisions based on that data.

Compute Unified Device Architecture (CUDA):

CUDA is a way that computers can work on really hard and big problems by breaking them down into smaller pieces and solving them all at the same time. It helps the computer work faster and better by using special parts inside it called GPUs. It's like when you have lots of friends help you do a puzzle - it goes much faster than if you try to do it all by yourself.

The term "CUDA" is a trademark of NVIDIA Corporation, which developed and popularized the technology.
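As a rough illustration from Python, the sketch below uses PyTorch (assumed installed) to run the same arithmetic over a million numbers at once on a CUDA-capable GPU, falling back to the CPU if no GPU is available.

# Many small calculations done at the same time ("in parallel") on a GPU
# via CUDA, using PyTorch; falls back to the CPU when no GPU is present.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(1_000_000, device=device)  # a million random numbers
y = torch.rand(1_000_000, device=device)

z = x * y + 2.0                 # applied to every element simultaneously
print(device, z.sum().item())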

Data Processing:

The process of preparing raw data for use in a machine learning model, including tasks such as cleaning, transforming, and normalizing the data.

Deep Learning (DL):

A subfield of machine learning that uses deep neural networks with many layers to learn complex patterns from data.

Feature Engineering:

The process of selecting and creating new features from the raw data that can be used to improve the performance of a machine learning model.
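A short pandas sketch (toy data, column names invented) shows both ideas at once: the data processing defined earlier and the feature engineering defined here.

# Data processing (cleaning, normalizing) and feature engineering (creating
# a new column) with pandas; the table and values are made up.
import pandas as pd

raw = pd.DataFrame({
    "price":    [100.0, None, 250.0, 400.0],
    "quantity": [2.0, 3.0, None, 1.0],
})

# Data processing: fill gaps and put price on a comparable scale.
clean = raw.fillna(raw.mean(numeric_only=True))
clean["price_scaled"] = (clean["price"] - clean["price"].mean()) / clean["price"].std()

# Feature engineering: derive a new feature a model could learn from.
clean["revenue"] = clean["price"] * clean["quantity"]
print(clean)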

Freemium:

You might see the term "Freemium" used often on this site. It simply means that the specific tool that you're looking at has both free and paid options. Typically there is very minimal, but unlimited, usage of the tool at a free tier with more access and features introduced in paid tiers.

Generative Art:

Generative art is a form of art that is created using a computer program or algorithm to generate visual or audio output. It often involves the use of randomness or mathematical rules to create unique, unpredictable, and sometimes chaotic results.
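For instance, a few lines of Python with the Pillow imaging library (assumed installed) generate a different abstract image on every run.

# Tiny generative-art sketch: random rules drawn with Pillow. Each run
# produces a different arrangement of circles.
import random
from PIL import Image, ImageDraw

canvas = Image.new("RGB", (400, 400), "white")
draw = ImageDraw.Draw(canvas)

for _ in range(60):  # sixty randomly placed, randomly colored circles
    x, y = random.randint(0, 400), random.randint(0, 400)
    r = random.randint(5, 60)
    color = tuple(random.randint(0, 255) for _ in range(3))
    draw.ellipse((x - r, y - r, x + r, y + r), outline=color, width=3)

canvas.save("generative_art.png")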

Generative Pre-trained Transformer (GPT):

GPT stands for Generative Pre-trained Transformer. It is a type of large language model developed by OpenAI.

GitHub:

GitHub is a platform for hosting and collaborating on software projects.


Google Colab:

Google Colab is an online platform that allows users to share and run Python scripts in the cloud.

Graphics Processing Unit (GPU):

A GPU, or graphics processing unit, is a special type of computer chip that is designed to handle the complex calculations needed to display images and video on a computer or other device. It's like the brain of your computer's graphics system, and it's really good at doing lots of math really fast. GPUs are used in many different types of devices, including computers, phones, and gaming consoles. They are especially useful for tasks that require a lot of processing power, like playing video games, rendering 3D graphics, or running machine learning algorithms.

Large Language Model (LLM):

A type of machine learning model that is trained on a very large amount of text data and is able to generate natural-sounding text.
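A small example with the transformers library (GPT-2 here stands in for much larger models, simply because it is light enough to run locally) shows the idea of generating natural-sounding text from a prompt.

# Generating text with a (small) language model via transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("3D modeling tools can help designers", max_new_tokens=30)
print(result[0]["generated_text"])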

Machine Learning (ML):

A method of teaching computers to learn from data, without being explicitly programmed.

Natural Language Processing (NLP):

A subfield of AI that focuses on teaching machines to understand, process, and generate human language.

Neural Networks:

A type of machine learning algorithm modeled on the structure and function of the brain.
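Below is a minimal PyTorch sketch (library assumed installed) of such a network: layers of learnable weights separated by a non-linearity.

# A small neural network: stacked layers of learnable weights with a
# non-linear activation in between (PyTorch assumed installed).
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(4, 16),  # 4 input features -> 16 hidden units
    nn.ReLU(),         # non-linearity
    nn.Linear(16, 1),  # 16 hidden units -> 1 output
)

batch = torch.rand(8, 4)     # 8 examples, 4 features each
print(model(batch).shape)    # torch.Size([8, 1])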

Neural Radiance Fields (NeRF):

Neural Radiance Fields are a type of deep learning model that can be used for a variety of tasks, including image generation, object detection, and segmentation. NeRFs are inspired by the idea of using a neural network to model the radiance of an image, which is a measure of the amount of light that is emitted or reflected by an object.

OpenAI:

OpenAI is a research institute focused on developing and promoting artificial intelligence technologies that are safe, transparent, and beneficial to society.

Overfitting:

A common problem in machine learning, in which the model performs well on the training data but poorly on new, unseen data. It occurs when the model is too complex and has learned too many details from the training data, so it doesn't generalize well.

Prompt:

A prompt is a piece of text that is used to prime a large language model and guide its generation.

Python:

Python is a popular, high-level programming language known for its simplicity, readability, and flexibility (many AI tools use it).

Reinforcement Learning:

A type of machine learning in which the model learns by trial and error, receiving rewards or punishments for its actions and adjusting its behavior accordingly.

Spatial Computing:

Spatial computing is the use of technology to add digital information and experiences to the physical world. This can include things like augmented reality, where digital information is added to what you see in the real world, or virtual reality, where you can fully immerse yourself in a digital environment. It has many different uses, such as in education, entertainment, and design, and can change how we interact with the world and with each other.

Stable Diffusion:

Stable Diffusion generates complex artistic images based on text prompts. It’s an open source image synthesis AI model available to everyone. Stable Diffusion can be installed locally using code found on GitHub or there are several online user interfaces that also leverage Stable Diffusion models.
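One way to run it locally is through the diffusers library, roughly as sketched below (diffusers and PyTorch assumed installed, a GPU strongly recommended; the model ID refers to the commonly used v1.5 checkpoint and may change over time).

# Generating an image from a text prompt with Stable Diffusion via the
# diffusers library; requires torch and ideally an NVIDIA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # widely used v1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse at sunset").images[0]
image.save("lighthouse.png")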

Supervised Learning:

A type of machine learning in which the training data is labeled and the model is trained to make predictions based on the relationships between the input data and the corresponding labels.

Unsupervised Learning:

A type of machine learning in which the training data is not labeled, and the model is trained to find patterns and relationships in the data on its own.
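The two kinds of learning defined above can be contrasted in a few lines of scikit-learn; the four data points are invented for the example.

# Supervised vs. unsupervised learning on four toy 2-D points (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [8, 8], [9, 8]]

# Supervised: labels are given, the model learns to map inputs to them.
y = [0, 0, 1, 1]
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(classifier.predict([[2, 1]]))   # -> [0]

# Unsupervised: no labels, the model finds the two clusters on its own.
clustering = KMeans(n_clusters=2, n_init=10).fit(X)
print(clustering.labels_)             # cluster assigned to each point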

Webhook:

A webhook is a way for one computer program to send a message or data to another program over the internet in real-time. It works by sending the message or data to a specific URL, which belongs to the other program. Webhooks are often used to automate processes and make it easier for different programs to communicate and work together. They are a useful tool for developers who want to build custom applications or create integrations between different software systems.
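As a small illustration, the Flask sketch below (library assumed installed, the /webhook path chosen arbitrarily) receives such real-time messages; another program simply POSTs JSON to its URL whenever an event happens.

# A minimal webhook receiver: another program POSTs JSON to this URL
# whenever an event happens, and this script reacts in real time (Flask).
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    event = request.get_json(silent=True) or {}
    print("Received event:", event)
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(port=5000)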



WELCOME TO THE

5 STAR AI.IO

TOOLS

FOR YOUR BUSINESS