Dr.Derp’s guide to training and learned insights.

TLDR: The training data is poorly captioned and improperly cropped.

First, I would suggest checking out this blog post https://huggingface.co/blog/stable_diffusion if you are not yet familiar with how Stable Diffusion works, especially the “How does Stable Diffusion work?” section. I found the blog does a good job of explaining the ins and outs. But another thing to understand is that Stable Diffusion is not a “live” model. That is to say, once the model is “baked”, all of the data inside acts like a prism, and the front end, like Automatic1111’s WebUI, acts like a looking glass that peers through the prism at different angles. So while a lot can be done during generation to manipulate the encoded data, having a clean, properly curated model is key to good image generation.

I’ve been experimenting with and creating custom *.ckpt models since the moment Dreambooth came out, and I want to share with you everything that I have learned before I die from Vitamin D deficiency. But before I do that, let’s step back and consider some of the shortcomings of existing models so we can understand where our focus needs to be.

Current models have a tendency to generate disembodied body parts, mutations, extra limbs, mangled hands, and out-of-frame subjects; they are unresponsive to directions, offer a limited set of poses (not many dynamic ones) and a limited set of facial expressions without much control, and struggle with object color descriptions, among a slew of other issues. None of these issues arise because the technology is limited in scope or because there is not enough data in the data set. I strongly suspect that with good image curation and captioning, a generalized model could have been made for a small fraction of the price and compute used for the current SD model. Quality over quantity is my motto. Almost every issue described above is due to poor image captioning and data curation before any training has started. Simply put, the training data was bad.

Now, this is not the fault of the Stable Diffusion team. The team was pioneering a new technology and needed a proof of concept for a large generalized model. So they did what anyone would have done in their place: throw everything and the kitchen sink into the training data set, label the images via an automated captioning solution or an outsourced labeling process, and see what comes out. And guess what? What came out was pretty good, all things considered. This truly shows the robustness of the neural network and its ability to interpolate data. But we can do better. Let’s consider why the generated images come out so poorly in some of the cases mentioned above.

Mutations and extra limbs. Have a high five! Or seven, or hell, why not throw an extra hand into the mix! This happens due to poor captioning in the training data and the neural model’s amazing propensity to interpolate toward what it thinks you want to see. The poor captioning happens when you label images of the same subject without using descriptive enough captions for the respective images. In the left image, Maggie is standing in her backyard looking down with her hands by her side. In the right image, she is standing in a dark room looking to the side with her hands stretched out. In both cases, the labels assigned to these two images would most likely have been something like “Maggie”, “Maggie Simpson”, or “Simpsons” by the original labeling software or the labeling team.
Once the model is trained with these captions, during generation the latent diffusion model will interpolate between the two images and try to give you what it thinks is the best image that has the token “Maggie” in it, which will yield a monstrosity with a third hand sticking out of the chest. You can see that Bart has a third hand on his left side. This happens because the differences between the trained images of the subject (one image where Bart has his hands down and another where he is holding something up) were not labeled appropriately or with enough captions to specify the pose of the hands.

Going back to the Maggie examples: Maggie 1 should be labeled “Maggie standing outside in a backyard looking down with her hands by her side”, while the Maggie 2 image should have been labeled “Maggie standing in a dark interior room looking to the side with her hands stretched outward”. These two captions teach the neural network that while the subject is the same, the difference in tags means there is a significant difference between the images. And with just a tiny bit more training data labeled in a standardized way, the model picks up very quickly what the “hands” and “stretched out” tags mean. So in the future, when asked to generate Maggie, it will no longer blindly interpolate between the two images and will give you nice, coherent results regardless of Maggie’s pose.

Disembodied body parts happen for a similar reason. If you have the same caption for two different images of the same subject, you will force the neural network to interpolate between the two in odd ways. For example, say you have Bart playing tennis, and in one frame he is on his backswing and in another he is not. The difference in the hand position is too great for the model to interpolate properly, so sometimes it simply rips the arm off the body and disembodies it. This is much rarer with large body parts and more common with fingers and smaller objects; I’ll get to why later in the guide.

Out-of-frame subjects. Generated images of people with their torso smack dab in the middle but the head cut off, or with the subject all the way on the left of the canvas, are very common as well. This happens because of improper cropping during image data curation. Stable Diffusion was trained on 512x512 images, a 1:1 aspect ratio. That means the Stable Diffusion team had to crop almost every image they used for training, as most images are in either portrait or landscape aspect ratio. The cropping was done with a simple crop right down the middle of the image. As you can imagine, this cuts off the heads of quite a lot of people whose images are in portrait aspect ratio. The images where the subject is too far to the side of the canvas with the face cut off come from landscape images that were improperly cropped down the middle, leaving only half the face visible in the cropped training image.

This can be fixed by using a smart crop solution (AI crop), which tries to find the subject’s eyes and keep them as the focus of the crop. This solution is not always best, as there are instances where you want to crop further out than the “eye focus” mid crop; it’s best to crop so that more of the body is visible. But no automated solution exists that can pick up subject content that well, so there is another option: Up-cropping, where instead of cropping the image down to a square you pad it out to a square (adding white borders), sacrificing pixel density for cohesion.
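To make Up-cropping concrete, here is a minimal sketch of how you could do it, assuming Pillow is installed; the file names and the 512 pixel target are placeholders, not part of any official tool:

```python
# Minimal sketch of "Up-cropping": instead of center-cropping a portrait or
# landscape image down to a square, pad it out to a square with white borders
# and then resize to the training resolution. Paths and sizes are placeholders.
from PIL import Image

def up_crop(path_in: str, path_out: str, size: int = 512) -> None:
    img = Image.open(path_in).convert("RGB")
    side = max(img.width, img.height)

    # White square canvas with the original image pasted in the center.
    canvas = Image.new("RGB", (side, side), (255, 255, 255))
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))

    # Resize the padded square to the model's training resolution.
    canvas.resize((size, size), Image.LANCZOS).save(path_out)

up_crop("maggie_portrait.jpg", "maggie_512.png")
```

The white padding you add here is exactly where the occasional white borders in generated images come from, more on that below.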
With the Up-crop method you gain more cohesion during generation, because more of the scene is trained on and the subject fits into the aspect ratio the directors designed the shot with. But there are downsides. With highly stylized cartoon images where line art is important, you will lose accuracy in the line representation; basically, in The Simpsons’ case, the line art becomes thinner. This is a drawback only for cartoon and line-art-related training data. With photoreal subjects, there is no loss of color or texture stability. The second downside is that approximately 5% of the generated images will have a white border on the bottom, the top, or both (this was tested on a data set trained on 2500 Up-cropped images). One way to reduce white borders below that 5% is to introduce regular cropped images into the training data set; the same applies to Up-cropped images with vertical white borders on the sides. Or use a negative prompt like “letter box” or something similar, but that’s messier IMO. Another thing worth considering is the aspect ratio bucketing that is available in some repos.

Unresponsive to directions. Stable Diffusion doesn’t seem to know its left from right, huh? This is also due to improperly captioned data and the lack of established standard guidelines when labeling images for training. For your training data set, I would suggest first turning all of your subjects in the same direction, if there is a prevailing directionality in the subject. For example, with The Simpsons, every one of my subjects faces to their left. So the example image of Maggie would be flipped horizontally so that she faces her left, just like in the Maggie 1 example. This establishes directionality for your training model, and now when you specify something that needs directionality, everything works as intended. Here is an example where directionality was not established and you get all types of issues; see if it catches your eye what the problem is.... Welp, in this case none of the training data was properly curated for direction nor labeled appropriately.

Here is another example: when labeling your training images, it is important to specify which way the eyes are looking. Once all the images are flipped to the same direction, labeling the subject’s gaze becomes important, with something like “Bart standing in the kitchen looking left (right, up, down)”. I suspect the lack of established directionality is also partially to blame for the issues generated humans have with their eyes. Stable Diffusion is trying to interpolate between all of the different angles the eyes are looking at in the training data set, but without standardized captioning it simply interpolates between all of them, making a mess. It’s also important to caption other eye states, such as squinting, closed eyes, and bored eyes (only the top lid down, but not the bottom lid). Mouth shape is also important, so you want tags for all of the shapes you want to train on.

Color problems also arise because of improper data captioning: one labeler captioned a thing as blue while another captioned a similar thing as green, and then everything comes out blue-green....

The limited set of poses and the lack of dynamic poses are also due to the lack of properly captioned data for the unique poses. The human body can take on many different poses, and Stable Diffusion will have trouble interpolating between them. There is a lot of image data in the Stable Diffusion model with many dynamic poses, but it lacks appropriate captioning, so there is no way to summon those poses from the latent space reliably.
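Before moving on, here is a hypothetical sketch of how I like to think about standardized captions: every caption is assembled from the same ordered fields (subject, pose, gaze, eye state, hands, setting), so the same concept always gets the same tag. The field names and vocabulary below are only illustrative, not an official scheme:

```python
# Hypothetical sketch: build every caption from the same ordered fields so that
# identical concepts always get identical tags. Vocabulary is illustrative only.
from dataclasses import dataclass

GAZE = {"left", "right", "up", "down"}
EYES = {"open eyes", "closed eyes", "squinting", "bored eyes"}

@dataclass
class CaptionFields:
    subject: str   # e.g. "Maggie"
    pose: str      # standardized pose token, e.g. "standing", "Sukhasana"
    gaze: str      # which way the eyes are looking
    eyes: str      # eye state
    hands: str     # e.g. "hands by her side", "hands stretched outward"
    setting: str   # e.g. "outside in a backyard", "in a dark interior room"

    def caption(self) -> str:
        # Enforce the standardized vocabulary so labelers can't drift apart.
        assert self.gaze in GAZE and self.eyes in EYES
        return (f"{self.subject} {self.pose} {self.setting}, "
                f"looking {self.gaze}, {self.eyes}, {self.hands}")

# The two Maggie images from earlier, captioned so the model can tell them apart:
maggie_1 = CaptionFields("Maggie", "standing", "down", "open eyes",
                         "hands by her side", "outside in a backyard")
maggie_2 = CaptionFields("Maggie", "standing", "left", "open eyes",
                         "hands stretched outward", "in a dark interior room")
print(maggie_1.caption())
print(maggie_2.caption())
```

Run on the two Maggie images, this produces captions that share the “Maggie” and “standing” tokens but differ everywhere the images actually differ, which is exactly what you want the network to learn.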
The best way to fix the pose problem is to have standardized pose names for as many poses as you can come up with and tag them appropriately; or to segment the human body into legs, torso, head, expression, and hands and tag those; or a mix of both, which is what I prefer. For example, I tag cross-legged sitting as “Sukhasana”, so a caption will say something like “Lisa closed eyes Sukhasana, outside under a tree”. It’s a complex pose that can be tokenized under one word. Other important tags cover the angle at which you are looking at the subject (high angle, low angle, overhead, first person view, etc...) and the distance from the camera: full shot (the subject is visible head to toe), mid shot, closeup, subject turned away, etc... Basically, the more detailed the caption data, standardized by set rules that you adhere to, the more cohesive the trained model will come out.

Hands. Why they messed up, yo? This one is a two-fold problem IMO. One: a hand can take on more complex positions than anything else in the scene, but the labeling for all of those various positions was very limited or straight up non-existent. Without proper captions, the neural network is left on its own to come up with what it thinks you want to see; it interpolates and hopes for the best. The other issue, I suspect, is the inability of Stable Diffusion to properly parse small details in training images because latent diffusion operates in a low-dimensional space: that 512x512 image becomes 64x64 when encoded during training. I am not 100% sure about this, though; it’s only a suspicion. We might see a noticeable improvement in hand generation once 1024x1024 training is a thing. Crosses fingers....

Alright, so to recap real quick. We learned that it is important to label the training data with standardized tags that are detailed in nature, the most important being the pose of the subject, the camera angle (high shot, low shot), the view, and the zoom (face on, profile, full shot, etc...). Directionality needs to be established by facing all of the training images one way. Cropping must be performed either manually or by Up-cropping, in the case of data sets that don’t depend on line art.

That’s great and all, but you say “Ain’t nobody got time for that!”. Well, I agree. In order for us to go forward and make AAA models with massive training data and no headache captioning and cropping all of the images, we need an automated solution that can:
A. Smart crop the collected images appropriately, or use the aspect ratio bucketing feature available in some repos.
B. Flip (through the use of a vision sorting model) all of the images toward the same direction when working with live subjects (establish directionality).
C. Sort all collected images automatically with respect to different subjects, poses, expressions, camera views, etc.... and label them appropriately using a standardized labeling scheme.

I might just have a solution for these problems. https://teachablemachine.withgoogle.com/ is a Google machine learning interface that allows anyone to locally train a custom model (in under a minute) to discriminate between different subjects, poses, expressions, angles, and everything in between. Training the model is easy enough; Google makes the process very user friendly and efficient. Sorting the data is another matter. I spent a week looking for a way to implement the trained model and let it do the sorting on my hard drive with a given image data set.
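Roughly, what I was after is something like the sketch below. This is just my own illustration, assuming Teachable Machine’s Keras export (a keras_model.h5 plus a labels.txt) and that TensorFlow, NumPy, and Pillow are installed; all folder names are placeholders:

```python
# Rough sketch of an image sorter driven by a Teachable Machine image model.
# Assumes the Keras export (keras_model.h5 + labels.txt); paths are placeholders.
import os, shutil
import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

MODEL_DIR = "exported_model"    # folder with keras_model.h5 and labels.txt
SOURCE_DIR = "unsorted_images"  # images to classify
OUTPUT_DIR = "sorted_images"    # one sub-folder per class will be created

model = load_model(os.path.join(MODEL_DIR, "keras_model.h5"), compile=False)
with open(os.path.join(MODEL_DIR, "labels.txt")) as f:
    # Each labels.txt line looks like "0 class_name"; keep just the name.
    labels = [line.strip().split(" ", 1)[1] for line in f if line.strip()]

for name in os.listdir(SOURCE_DIR):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    img = Image.open(os.path.join(SOURCE_DIR, name)).convert("RGB")
    img = img.resize((224, 224))                               # model input size
    x = np.asarray(img, dtype=np.float32)[None] / 127.5 - 1.0  # scale to [-1, 1]

    predicted = labels[int(np.argmax(model.predict(x, verbose=0)))]
    dest = os.path.join(OUTPUT_DIR, predicted)
    os.makedirs(dest, exist_ok=True)
    shutil.move(os.path.join(SOURCE_DIR, name), os.path.join(dest, name))
```

The idea is simple: classify each image with your exported model and move it into its class folder.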
I finally found that Jonny from the “SD Training Labs” Discord server (the place where I usually hang out with the rest of the developers, including Automatic1111; shout out to all the peeps there) is a Python developer, and a Python script is exactly what I needed. He was able to come up with a script that does exactly that. Huge thanks, Jonny! It uses your custom trained model exported from Teachable Machine and sorts all of the images in a folder on your desktop into their respective “class” folders inside another folder: https://we.tl/t-Z3imfIbP0i. Jonny is making a GitHub repo for the script with more robust functionality incoming, so stay tuned. This opens up a more automated data sorting pipeline for large image data sets. With a custom trained model you can now sort image data sets by the trained class of your choosing: by subject, expression, pose, or directionality.... Teachable Machine is very versatile and seems to be able to learn all types of data segregation methods, for example sorting out all of the blurry images... etc.

Some tips on basic training for beginners. https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast-DreamBooth.ipynb is the colab I use the most, and it’s pretty straightforward. You will want to train for 200-300 steps per image in the training data set, so 75 images in the data set is about 20k steps. BUT this ratio is not linear: with larger image data sets, the steps per image are reduced considerably, to something like 50 steps per image when the data runs into the thousands of images. Always save your model every nth step. Check the consistency and quality of generated images using the Euler A sampler with 50 steps and a CFG of 7.5 (through rigorous testing I found it to be the most consistent sampler across many different criteria, like coherency, speed, detail of generation, stability, etc...). If your images look crispy, you overbaked the model with too many steps; fall back to an earlier nth saved model. But always train with DDIM if it gives you the option. Don’t use the text encoder option for cartoons; it’s fine for photoreal images though.

Some closing thoughts. Diffusion models have potential uses outside of just image generation. This is, after all, the most efficient compression algorithm I know of. This technology will undoubtedly disrupt a lot of industries: product design, photography, video, and many others that are not even aware this technology exists or have not considered its implications. One example is image compression. With this technology you can compress image data down to bytes, which has implications for data storage solutions and possibly even novel video codec compression techniques. Low bandwidth, high quality video, baby!

P.S. I am looking to break into the industry, so if you are an organization or a company in need of someone who can bring some insight to custom model making with latent diffusion, I am your man. Hi Emad! And I am broke as a joke, so I will drop a PayPal tip jar like a peasant and ask for redemption later. I just recently bought fire insurance and gotta get that RTX 4090 somehow... https://www.paypal.com/donate/?business=9B4V7D6BBUNPL&no_recurring=0&currency_code=USD If you want to discuss anything else with me, feel free to hit me up on Discord at DR.DERP#7202 or dmitriy.slatvinskiy@gmail.com. I have done quite a lot of experimentation and model building and have some ideas about text, video, a custom language library that encodes more data per tag for more cohesive generated images on the user side, etc....
Anyways, I am now going to go touch some grass for a bit. Here are some Barts that exist in the latent space of the imagination.