by Eriane Austria
A project developed and showcased in Tyler Coleman's Creativity with AI class at UT Austin, with the guidance of Dr. Xuexin Wei at the Brain, Behavior, and Computation Lab at UT Austin.
Upon reading literature on, looking at, and creating AI-generated art, I noticed that comparisons of the various models often did not compare the same intended output using the same input. Thus, I wanted to evaluate quality in a more standardized manner and target differences in the models by comparing outputs generated from the same sources. In this project, I aimed to compare the quality of diffusion, example-based synthesis, and neural style transfer (NST) models in stylized AI-generated animation by 1) creating and evaluating three 10-second animations that merge a source exemplar (containing the artistic style) with an original video, and 2) proposing a computational neuroscience method for identifying to what extent one keyframe from each model's video maintains the source exemplar. Point 1 evaluates the quality of the methods, whereas Point 2 offers a potential way to measure and question the similarity between the source and the outputs of these models. My project thus enters the discourse on the integration of AI into creative pipelines and on the issue of ownership and AI. This site covers background on the models and documentation of my process in addition to the points above.
Below are the source exemplar and the original video I used with these models. The source exemplar was drawn from the ART500K visual arts dataset, which includes over 550,000 labeled visual arts images whose copyright terms under U.S. copyright law are based on the author's death. I chose Mary Cassatt's “The Cup of Tea,” from one of the ART500K sources, the Web Gallery of Art, as the source exemplar. For the original video, I chose a 10-second video of a cat that I recorded myself.
10-second cat video recorded by me
iPhone 10 (2023)
I evaluate the quality of each model's output based on a) how closely the output maintains the style from the source exemplar and b) how closely the output maintains the content of the video.
Each model's output had some quirks. The Deforum output sometimes morphed the cat into a teacup. EbSynth produced some distortions that were most noticeable when transitioning to a different keyframe bracket. NST introduced patterned noise into the content of the video. It is also important to note that my tweaks to each model's configuration influenced and constrained the quality of its output. All models were generally successful at maintaining the content of the original video, in which we can clearly identify a cat, floor, wall, and door; however, the style of each output is distinguishable.
Several researchers have constructed datasets of artwork for analysis tasks, like classification and retrieval, strictly for noncommercial purposes. Relatedly, many AI/ML tools train on and generate content from large datasets. Several AI/ML companies have come under fire and been sued for the nonconsensual use of, and profit from, artists' work in their models. This prompts ethical and legal concerns over ownership in AI-generated art. The latter part of this project proposes a method for assessing ownership using computational neuroscience methods, which could be applied, for example, to the stylized AI-generated animations above.
As Mao et al. (2017) argue, “the main difference between visual art and nature image is… [that visual art] contains the style” (p. 1183). Thus, if this method is applied to a keyframe from each of the videos, I designate the source exemplar as the reference for similarity, positioning this work in the context of artistic ownership and the concerns that arise around ownership and AI.
Images can be broken down into their semantic features, or visual content, like color, gradient, and facial features. Images can also be analyzed in layers, in which each layer encodes different information. As in the Gatys et al. (2016) paper, the feature correlations within a layer can be computed with the Gram matrix to obtain a representation of the style of the source exemplar, without its content.
The Gram matrix is defined as $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$, where $G^l_{ij}$ is the inner product between the vectorized feature maps $i$ and $j$ in layer $l$ (Gatys et al., 2016).
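To make this concrete, below is a minimal sketch in PyTorch of how the Gram matrix of one feature layer could be computed for the source exemplar. The choice of VGG layer, the preprocessing, and the filename are illustrative assumptions, not the exact settings of the models I used.

```python
# A minimal sketch (PyTorch) of the Gram-matrix style representation from
# Gatys et al. (2016). Layer choice, preprocessing, and filename are
# illustrative assumptions.
import torch
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def features_at(img, layer_idx=19):
    # Run the image through VGG19 and return the activations at one layer
    # (index 19 is conv4_1, one of the style layers used in Gatys et al.).
    x = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        for i, module in enumerate(vgg):
            x = module(x)
            if i == layer_idx:
                return x  # shape: (1, channels, H, W)

def gram_matrix(feat):
    # Flatten each feature map into a row of F (channels x H*W);
    # G_ij is then the inner product between feature maps i and j.
    _, c, h, w = feat.shape
    F = feat.reshape(c, h * w)
    return (F @ F.t()) / (c * h * w)  # normalization is a common convention

exemplar = Image.open("cup_of_tea.jpg").convert("RGB")  # hypothetical filename
G_style = gram_matrix(features_at(exemplar))            # style representation of the exemplar
```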
We could draw on equations like these to identify and map the semantic properties of each pixel of the source exemplar and of the output keyframe. The space in which an image's pixels (or their feature representations) live is called a manifold. Once the pixels of the source exemplar are mapped onto a manifold, we can treat that manifold as capturing a specific semantic property of the source exemplar, which we can then compare with the manifold of the output keyframe.
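As a hedged illustration of what "mapping onto a manifold" could look like in practice, the sketch below (reusing features_at from above) treats every spatial location's feature vector as a point and projects the exemplar's and a keyframe's point clouds into the same low-dimensional space. PCA, the layer choice, and the keyframe filename are my own assumptions for illustration.

```python
# Hedged sketch: embed the exemplar and one output keyframe as point clouds in
# a shared feature space, then project both into 2-D with PCA so their
# "manifolds" can be visualized and compared. Reuses features_at() from above;
# PCA and the keyframe filename are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def feature_points(img, layer_idx=19):
    feat = features_at(img, layer_idx)                  # (1, C, H, W)
    c = feat.shape[1]
    return feat.squeeze(0).reshape(c, -1).t().numpy()   # (H*W, C): one point per location

exemplar_pts = feature_points(exemplar)
keyframe = Image.open("nst_keyframe.png").convert("RGB")  # hypothetical keyframe
keyframe_pts = feature_points(keyframe)

pca = PCA(n_components=2).fit(exemplar_pts)   # axes defined by the exemplar's point cloud
exemplar_2d = pca.transform(exemplar_pts)
keyframe_2d = pca.transform(keyframe_pts)     # keyframe mapped into the same space
```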
Each pixel in an AI-generated work, such as one produced by a diffusion model, maps somewhere onto the manifolds of the model's training data. Whether the AI-generated pixel resembles the training data or departs from it depends on how far that pixel lies from the training data's pixels on the manifold.
In a similar vein, we could use this knowledge to evaluate how similar the AI-generated work is to an artist's work by determining how far a pixel with a specific semantic property in the AI output lies from the artist's manifold, or, on a broader scale, to what extent their manifolds overlap.
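One possible way to quantify that distance and overlap, continuing the sketch above: for each keyframe feature point, find its nearest neighbor among the exemplar's points, and use the typical spacing within the exemplar's own cloud as the scale for what counts as "on" the manifold. The nearest-neighbor formulation and the threshold are my assumptions, not an established standard.

```python
# A possible distance/overlap measure, continuing the sketch above. The
# nearest-neighbor formulation and the median-spacing threshold are my own
# assumptions, offered as one way to make "distance from the manifold" concrete.
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=2).fit(exemplar_pts)

# Typical spacing within the exemplar's own cloud (distance to the nearest
# *other* exemplar point) gives a scale for being "on" the manifold.
self_dists, _ = nn.kneighbors(exemplar_pts)
scale = np.median(self_dists[:, 1])

# Distance from each keyframe point to its closest exemplar point.
key_dists, _ = nn.kneighbors(keyframe_pts, n_neighbors=1)

mean_distance = key_dists.mean()
overlap_fraction = (key_dists[:, 0] < scale).mean()
print(f"mean distance to exemplar manifold: {mean_distance:.3f}")
print(f"fraction of keyframe points within the exemplar's typical spacing: {overlap_fraction:.1%}")
```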
Because of the nature of each model, the method I proposed has limitations for evaluating similarity. For example, since the diffusion model is trained on a large dataset of styles rather than only the source exemplar and video, the properties I examined from the diffusion model are not a 1:1 match with the source exemplar, nor is its output a 1:1 comparison with the other two models. Similarly, with EbSynth, I hand-painted the keyframes, so the style is not directly derived from the source exemplar; rather, it attempts to emulate the source exemplar.
Furthermore, because of scope, I was not able to analyze any AI-generated keyframes. However, it is my hope that this proposal helps generate ideas for how we evaluate AI work, or possibly even for reverse-engineering how the AI created the work.
The first part of this project evaluated the output of three models (diffusion, example-based synthesis, and neural style transfer), though there were limitations to the comparisons because the inputs could not be kept exactly the same given the nature of the tools. As mentioned in the introduction, I wanted to perform a more direct comparison of models by using the same input to find differences in the quality of the outputs and models, because few articles provided this kind of comparison in their discussions of AI-generated art models.
The second part of this project proposed a method to evaluate the similarity between AI-generated output and the source exemplar of an artist's work, which could be applied to the first part of this project. As many current AI/ML tools scrape the Internet, including artists' work, for their training data and profit from those tools, I offered a quantitative method to determine this similarity and to spark conversation on how we can shape copyright law and address violations in the future.
Overall, this project builds on existing research and discourse on ownership in the AI/ML/computer graphics space.