Using DeepFace to show that, when training individual people, using celebrity instance tokens leads to better trainings and that regularization is unnecessary
I’ve spent the last several days experimenting and there’s no doubt whatsoever that using celebrity instance tokens is far more effective than using rare tokens such as “sks” or “ohwx”. I did not use x/y grids of renders to judge this subjectively. Instead, I used DeepFace to automatically examine batches of renders and numerically charted the results. I got the idea from u/CeFurkan and one of his YouTube tutorials. DeepFace is available as a Python module (`pip install deepface`).
Here is a simple example of a DeepFace Python script:
    from deepface import DeepFace

    # Paths to the two face images being compared (replace with real files)
    img1_path = "path/to/img1.jpg"
    img2_path = "path/to/img2.jpg"

    # verify() returns a dictionary describing how similar the two faces are
    response = DeepFace.verify(img1_path=img1_path, img2_path=img2_path)
    distance = response["distance"]
In the above example, two images are compared and a dictionary is returned. The ‘distance’ element indicates how closely the people in the two images resemble each other. The lower the distance, the better the resemblance. There are different models you can use for testing.
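For example, a specific recognition model and distance metric can be selected through the `verify()` arguments. A minimal sketch (the file paths are placeholders):

    from deepface import DeepFace

    # Same comparison, but with an explicit recognition model and distance metric.
    # "Facenet512" and "cosine" are just examples of the options DeepFace supports.
    result = DeepFace.verify(
        img1_path="dataset/face_01.jpg",      # placeholder paths
        img2_path="renders/render_001.png",
        model_name="Facenet512",
        distance_metric="cosine",
    )
    print(result["distance"], result["threshold"], result["verified"])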
I also experimented with whether regularization with generated class images or with ground truth photographs was more effective. And I also wanted to find out whether captions were especially helpful or not. But I did not come to any solid conclusions about regularization or captions. For that I could use advice or feedback. I’ll briefly describe what I did.
​
**THE DATASET**
The subject of my experiment was Jess Bush, the actor who plays Nurse Chapel on *Star Trek: Strange New Worlds*. Because her fame is relatively recent, she is not present in the SD v1.5 model. But numerous photographs of her can be found on the web. For these reasons, she makes a good test subject. Using [starbyface.com](https://starbyface.com), I decided that she somewhat resembled Alexa Davalos, so I used “alexa davalos” when I wanted to use a celebrity name as the instance token. Just to be sure, I checked that “alexa davalos” rendered adequately in SD v1.5.
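As a side note, the same DeepFace comparison can quantify how close the look-alike already is before any training. A minimal sketch (the file names are hypothetical), comparing a plain SD v1.5 render of “alexa davalos” against one of the Jess close-ups:

    from deepface import DeepFace

    # Hypothetical files: an untrained SD v1.5 render of the look-alike celebrity
    # and one of the dataset close-ups of the actual subject.
    baseline = DeepFace.verify(
        img1_path="baseline/alexa_davalos_sd15.png",
        img2_path="dataset/jess_closeup_01.jpg",
    )
    print("pre-training distance:", baseline["distance"])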
[25 dataset images, 512 x 512 pixels](https://preview.redd.it/29kgybodthib1.jpg?width=1024&format=pjpg&auto=webp&s=e7d65a3c34ac6e6b3332bfd75d6ec56ed360ceed)
For this experiment I trained full Dreambooth models, not LoRAs. This was done for accuracy, not for practicality. I have a computer solely dedicated to SD work that has an A5000 video card with 24GB of VRAM. In practice, one should train individual people as LoRAs. That is especially true when training with SDXL.
​
**TRAINING PARAMETERS**
In all of the trainings in my experiment I used Kohya and SD v1.5 as the base model, with the same 25 dataset images, 25 repeats, and 6 epochs. I used BLIP to make caption text files and manually edited them appropriately. The rest of the parameters were typical for this kind of training.
It is worth noting that the trainings that lacked regularization were completed in half the steps. Should I have doubled the epochs for those trainings? I’m not sure.
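For reference, the step counts reported below follow directly from those numbers. A quick sketch of the arithmetic, assuming a batch size of 1 (an assumption on my part):

    images = 25
    repeats = 25
    steps_per_epoch = images * repeats       # 625 steps per epoch at batch size 1

    # Without regularization, the epoch 5 checkpoint:
    print(steps_per_epoch * 5)               # 3125

    # With regularization the step count doubles (as noted above),
    # so the epoch 6 checkpoint lands at:
    print(steps_per_epoch * 6 * 2)           # 7500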
​
**DEEPFACE**
Each training produced six checkpoints. With each checkpoint I generated 200 images in ComfyUI using the default workflow that is meant for SD v1.x. I used the prompt, “headshot photo of [instance token] woman”, and the negative prompt, “smile, text, watermark, illustration, painting frame, border, line drawing, 3d, anime, cartoon”. I used Euler at 30 steps.
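For anyone who prefers scripting the generation step instead of using ComfyUI, here is a minimal sketch using the diffusers library (not the ComfyUI workflow used here; the checkpoint path is a placeholder and the trained checkpoint would first need to be converted to diffusers format):

    import os
    import torch
    from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

    # Placeholder: a trained Dreambooth checkpoint converted to diffusers format.
    pipe = StableDiffusionPipeline.from_pretrained(
        "path/to/trained_checkpoint", torch_dtype=torch.float16
    ).to("cuda")
    pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

    prompt = "headshot photo of alexa davalos woman"   # the celebrity instance token runs
    negative = ("smile, text, watermark, illustration, painting frame, border, "
                "line drawing, 3d, anime, cartoon")

    # 200 renders per checkpoint, Euler at 30 steps, as described above.
    os.makedirs("renders", exist_ok=True)
    for i in range(200):
        image = pipe(prompt, negative_prompt=negative, num_inference_steps=30).images[0]
        image.save(f"renders/render_{i:03d}.png")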
Using DeepFace, I compared each generated image with seven of the dataset images that were close-ups of Jess’s face. This returned a “distance” score. The lower the score, the better the resemblance. I then averaged the seven scores and noted the result for each image. For each checkpoint I generated a histogram of the results.
If I’m not mistaken, the conventional wisdom concerning SD training is that you want to achieve resemblance in as few steps as possible in order to preserve flexibility. I decided that the earliest epoch to achieve a high population of generated images that scored **lower than 0.6** was the best epoch. I noticed that subsequent epochs did not improve and sometimes slightly declined after only a few epochs. This aligns with what people have found through typical x/y grid render comparisons. It is also worth noting that even in the best of trainings there was still a significant population of generated images that scored above that 0.6 threshold. I believe that as long as there are not many that score above 0.7, the checkpoint is still viable. But I admit that this is debatable. It is possible that with enough training most of the generated images could score below 0.6, but then there is the problem of inflexibility due to over-training.
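Here is a minimal sketch of how such a batch comparison, the threshold percentages, and the histogram could be scripted (the folder names are hypothetical):

    import glob
    import numpy as np
    import matplotlib.pyplot as plt
    from deepface import DeepFace

    refs = sorted(glob.glob("dataset/closeups/*.jpg"))      # the seven reference close-ups
    renders = sorted(glob.glob("renders/epoch_05/*.png"))   # the 200 generated images

    scores = []
    for render in renders:
        # Compare one render against every reference close-up and average the distances.
        distances = [
            DeepFace.verify(img1_path=render, img2_path=ref,
                            enforce_detection=False)["distance"]
            for ref in refs
        ]
        scores.append(sum(distances) / len(distances))

    scores = np.array(scores)
    print(f"average distance: {scores.mean():.5f}")
    print(f"% below 0.7: {100 * (scores < 0.7).mean():.2f}%")
    print(f"% below 0.6: {100 * (scores < 0.6).mean():.2f}%")

    # Histogram of average distances for this checkpoint.
    plt.hist(scores, bins=30)
    plt.xlabel("average DeepFace distance")
    plt.ylabel("number of renders")
    plt.savefig("epoch_05_histogram.png")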
​
**CAPTIONS**
To help with flexibility, captions are often used. But if you have a good dataset of images to begin with, you only need “[instance token] [class]” for captioning. This default captioning is built into Kohya and is used if you provide no captioning information in the file names or in corresponding caption text files. I believe that the dataset I used for Jess was sufficiently varied. Nevertheless, I think that captioning did help a little bit.
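For reference, the minimal setup described above would look roughly like this in Kohya’s folder convention (the folder name follows the `<repeats>_<instance token> <class>` pattern; the file names and the example caption are illustrative):

    25_alexa davalos woman/
        jess_01.jpg
        jess_01.txt    <- contains "alexa davalos woman" (or an edited BLIP caption
                          such as "alexa davalos woman, close-up, outdoors")
        jess_02.jpg
        ...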
​
**REGULARIZATION**
In the case of training one person, ***regularization is not necessary***. If I understand it correctly, regularization is used to prevent your subject from taking over the entire class in the model. If you train a full Dreambooth model that can render pictures of a person you have trained, you do not want that person rendered every time you use the model to render pictures of other people who are also in that same class. That is useful for training models containing multiple subjects of the same class. But if you are training a LoRA of an individual, regularization is irrelevant. And since training takes longer with SDXL, it makes even more sense not to use regularization when training one person. Training without regularization cuts training time in half.
There has been debate lately about whether using real photographs (a.k.a. ground truth) for regularization increases the quality of the training. I tested this using DeepFace and I found the results inconclusive. Resemblance is one thing; quality and realism are another. In my experiment, I used photographs obtained from [Unsplash.com](https://Unsplash.com) as well as several photographs I had collected elsewhere.
​
**THE RESULTS**
The first thing that should be acknowledged is that most of the checkpoints I selected as the best in each training can produce good renderings. Evaluating the renderings is a subjective task. This experiment focused on the numbers produced by the DeepFace comparisons.
After training variations of rare token, celebrity token, regularization, ground truth regularization, no regularization, with captioning, and without captioning, the training that achieved the best resemblance in the fewest number of steps was this one:
​
[celebrity token, no regularization, using captions](https://preview.redd.it/ptmvrpfgvhib1.png?width=654&format=png&auto=webp&s=c278c67f8c849ad5d91e8abc0a5b1c7dcf9f1ce7)
CELEBRITY TOKEN, NO REGULARIZATION, USING CAPTIONS
Best Checkpoint:….5
Steps:…………..3125
Average Distance:…0.60592
% Below 0.7:……..97.88%
% Below 0.6:……..47.09%
Here is one of the renders from this checkpoint that was used in this experiment:
​
[Distance Score: 0.62812](https://preview.redd.it/0gbw9z6ixhib1.png?width=512&format=png&auto=webp&s=1b3480cb7b09f98e7db516f6a2d2e1896764fe1f)
Towards the end of last year, the conventional wisdom was to use a unique instance token such as “ohwx”, use regularization, and use captions. Compare the above histogram with that method:
​
[“ohwx” token, regularization, using captions](https://preview.redd.it/m66hwy8rvhib1.png?width=654&format=png&auto=webp&s=531c3d50016fe21bbdac01a1d221d9790daa462e)
“OHWX” TOKEN, REGULARIZATION, USING CAPTIONS
Best Checkpoint:….6
Steps:…………..7500
Average Distance:…0.66239
% Below 0.7:……..78.28%
% Below 0.6:……..12.12%
[A recently published YouTube tutorial](https://youtu.be/N_zhQSx2Q3c) states that using a celebrity name for the instance token together with ground truth regularization and captioning is the best method. ***I disagree***. Here are the results of this experiment’s training using those options:
​
[celebrity token, ground truth regularization, using captions](https://preview.redd.it/apg7v0x8whib1.png?width=654&format=png&auto=webp&s=fbc02a6237e90883d104e945d87675d1cc548c48)
CELEBRITY TOKEN, GROUND TRUTH REGULARIZATION, USING CAPTIONS
Best Checkpoint:….6
Steps:…………..7500
Average Distance:…0.66239
% Below 0.7:……..91.33%
% Below 0.6:……..39.80%
The quality of this method of training is **good**. It renders images that appear comparable in quality to those from the training I chose as best. However, it took 7,500 steps, more than twice the number of steps of the checkpoint I selected as the best of the best training. I believe that the quality of this training might improve beyond six epochs. But the issue of flexibility lessens the usefulness of such checkpoints.
In all my training experiments, I found that captions improved training. The improvement was significant but not dramatic. It can be very helpful in certain cases.
​
**CONCLUSIONS**
There is no doubt that using a celebrity token vastly accelerates training and dramatically improves the quality of results.
Regularization is ***useless*** for training models of individual people. All it does is double training time and hinder quality. This is especially important for LoRA training when you consider the time it takes to train such models in SDXL.
I did this too and if I want to lower the strength to have more of the style come out I start to get Emma Watson instead of me 😂😭
How important are captions? I’ve made lots of models with dreambooth but never used captions for my dataset.
You need to post visual comparisons of a variety of prompts with and without regularization images, comparing different style types, full body, torso, and portrait shots to come to a real conclusion; charts and numbers are meaningless for this type of subjective testing.
Thanks for sharing this. Reading it, I wonder if there would be value in building this process into Kohya? After creating the LoRA checkpoints, it would run a script to generate the images and run DeepFace against the training images.
Just out of curiosity: do you have a picture of your results that resembles the actress you chose closely?
Also: do you realize that one of your images in the training set is not Jess Bush but the actress who played Tasha Yar in Star Trek: The Next Generation, Denise Crosby?
But very interesting.
So if I put a celeb name instead of a rare word, I’ll get a better LoRA?
Thank you. Finally some scientific data! I’m just getting started with making LoRAs. I watched several videos and there are a number of conflicting views regarding the best approach, but no data until now.
It would be good to see a short how-to video or maybe a series of screenshots showing your settings. For instance, I’m wondering if you scaled your input images to 512×512, or did you enable buckets? How many input images? Epochs, etc. All visible in some screenshots.
Did you use the standard 25_[instance token] [class] naming for the folder also, with the actress’s name inserted?
In my case the woman looked like the star and not like the woman I made the LoRA for. Without the celeb token, much better similarity.
So does that mean that when you tune your model or train a LoRA for a specific art style, mentioning the names of artists whose style is closest to your training dataset will increase the quality of the results?
Thanks for sharing your research – you’re reaching the same conclusions regarding regularization images: [https://blog.aboutme.be/2023/08/10/findings-impact-regularization-captions-sdxl-subject-lora/#conclusions](https://blog.aboutme.be/2023/08/10/findings-impact-regularization-captions-sdxl-subject-lora/#conclusions) – whereas I still relied on the token & used a more subjective evaluation of the results.
What about individuals who don’t resemble any celebrity?
How does this compare to using the celebrity name as the class token instead? And using [subject real name] for instance.
I think the point of regularization is to prevent your training data from dominating the entire model, when all women, dogs, and birds start looking the same. So in that testing, regularization would indeed work against it, but that doesn’t mean it’s bad.
I think this was already confirmed by AI antrepreneur youtuber. He has an insane 51 min video.
Thanks for the in-depth analysis. Seems quite logical when you think about it. Reg images for LoRAs make no sense when considering what they do. And with a known celebrity with similar looks, you would just change something that’s already known instead of adding a new token, which should require less training.
Works in Kohya as described here (00:10:22)
[https://www.youtube.com/watch?v=N_zhQSx2Q3c&t=622s](https://www.youtube.com/watch?v=N_zhQSx2Q3c&t=622s)
It is indeed what a Stability staff member said to u/Cefurkan in one of the posts he made dozens of days ago. I remember that comment very well. I could find it if I decided to search for it. (They said you should use known tokens; they work better than “ohwx”, etc.)
Meh, use Joe Penna’s Dreambooth repo anyway.
Thanks for this, great in-depth breakdown! This is basically exactly what I was doing for 1.5, I’ve seen a lot of people swear by regularization for XL but was waiting to test it myself, thanks for saving me compute!
Can someone explain what does it mean to “use a celebrity token”? Is it just the initialization vector? Or does it go into the prompt on every step of every epoch? Is it related to the “trigger words” that are listed in Civitai LoRA pages?
These are some interesting opinions on training. So what settings should I use?
I’ve been singing this song for almost a year: regularization is a by-the-book theoretical method that isn’t effective when fine-tuning large, complicated diffusion models, but people wouldn’t listen.
Thank you OP, you just described in a sensible manner what my conclusions have been of training SDXL LoRas on people. Use Celebrity tokens, no regularisation images, caption images for clothes and accessories (not facial expressions).
Question: Do you know if without regularization, is the flexibility of the model negatively affected? Say if you wanted a van gogh or pixar style version of the trained person.
Your results about celeb names are very much true, I can attest in my experience using them. In my results, I will note some things outside of likeness will bleed into the final model — generations look like they’re from a red carpet shoot, have a hollywood aesthetic to them, etc.
What would be the equivalent of training a lora of my white Ragdoll cat? Just captioning with “white Ragdoll cat” rather than “ragdolljackie cat”?
Is the logic here that the more “known” word means that the training finds a close approximation faster, rather than having to go a few steps of latent randomness first?
Hey there, so I’m the one who made the [recently published YouTube tutorial](https://youtu.be/N_zhQSx2Q3c). It took me more than 10 days of testing and training (and hundreds in GPU renting) to find the right parameters for SDXL LoRA training, which is why I “kinda” have to disagree “just a little bit” with the findings, and in a way it’s almost a matter of opinion at this point… Indeed, as I said in my tutorial, using a combination of a celebrity name that looks like the character you are trying to train + captions + regularization images made the best models in my testing (for the celebrity trick I just followed what u/mysteryguitarm told me, so thanks for that).
The problem here I suppose is regularization images, because I made tests with and without, and tbh I prefer models made WITH regularization images. I found that the models it created looked a bit more like the character and also sometimes followed the prompt a bit better, albeit the differences are very small, that’s true… And indeed, if you consider the fact that using reg images MULTIPLIES BY 2 the amount of final steps with only a small increase in quality, why even bother with them?
Well, that’s a very good point and in a way I agree. If I need to make a very quick LoRA and just make a good model, I won’t use reg images… it will just take twice as long for training… like who has time for that?? However, again, as I said, I personally saw the difference, and for the sake of the tutorial, to show people the best method I personally found that yielded the best results for me, it was: celebrity + captions + reg images, which is why I showed that in my video for people to follow.
And again, if you find that reg images don’t give you as much quality as you think they should and that the added training time is not worth it, then yeah, don’t use them, you’ll be fine; as long as you have a great dataset and the right training parameters you’ll get a great model. However, again, personally, in my opinion, and from what I tested, reg images increase the quality of the final model, even if just by a little bit. Again, is it worth it for you? It’s for you to decide.
I chose to use them personally unless I don’t want to wait…simple as that
I’m not fully versed on how ControlNet works, but since DeepFace can provide feedback to a model, could you use the distance value as a way of creating a reference-style ControlNet to generate images with similar faces?
Pardon my stupid question, but are “instance token” and “class token” Lora/DreamBooth specific terms?
I have been fiddling with embedding/hypernetwork training for the past few weeks, and didn’t encounter those terms anywhere.
When you are using a brand new token, there is no existing information to leverage, so training essentially starts at random. This means it takes more training epochs for the model to learn fundamentals like “new token is a human”, “new token is a female”, “new token is a blonde”, and so on. Intuitively, regularization would help with this initial phase of learning the fundamentals about the new token, because regularization smooths out or spreads out the weights more, allowing the model to establish better connections for the new token’s meaning.
It makes sense that using a celebrity’s name results in better training because the model already has the basic fundamental information about said celebrity.
Could you please share the dataset? I’d like to have a go.
Thank you! I’ve long suspected that “overwriting” celebrities was the most efficient face-learning method, and my recent experience is that this works especially well with SDXL LoRAs. One of the major advantages of this approach is that you don’t have to retrain the text encoder at all, because the celebrity token is already perfectly calibrated to being a specific, unique individual.
Fantastic write-up. Crazy you have an A5000! Very precise methodology. Keep it up.
From my understanding, it doesn’t make sense to me that you would use random regularization images, I used to have this debate with people when db first came out. It’s not logical. The images should come from the model, since you want it to retain prior knowledge FROM the model itself and not over-fit with your new information.
Yeah bc celebs have better data labeling duh
This is something I will test hopefully on my own images and compare
Sadly I still didn’t have time
DeepFace is very useful for sorting images by similarity to find the best images quickly, but it doesn’t consider subtle differences. So I believe quality should still be evaluated by human eyes.
Also, using ground truth reg images will always fine-tune your model better. That is how the model was initially trained. But it is a trade-off between time and quality.
One more mistake is experimenting with celebrities. You need to experiment with yourself to see real results.
I would be curious what distance scores you would get between your two test subjects before any training. I haven’t used Deep Face, but I know that in DLib 0.6 represents a pretty large distance between faces. You need close to 0.5 for a positive identity match. Looking at the Deep Face GitHub, I’m seeing distance values like 0.25 for the same identity. So I’m wondering whether the distance scores you’re getting after training mean “these people look a little similar,” which is where you started before training.
I heavily disagree with this and have made a response post here:
https://www.reddit.com/r/StableDiffusion/comments/15tji2w/no_you_do_not_want_to_use_celebrity_tokens_in/?
Just to provide my feedback on this: if you are training Asians, don’t use a celebrity. It will mess up your training massively.
The problem is that SD doesn’t know many Asian celebrities, and even when it does, for example Chang Chen, it gets confused very easily when you add other tokens beside words like Chen.
I wasted so much time following this “conclusion.”
The only takeaway should be that people only have time to test certain aspects of training, and you always have to find out for yourself.
To OP, have you tried to use the same methodology with other ethnic groups? The issue here is names. Chinese names, for example, have relatively few letters, which could cause confusion for the model.
Wouldn’t mixing the tokens in the prompt ( [A|B:x,y] ) achieve the same result without polluting the LoRA with vectors that aren’t from the subject?
Real question
I gave this a go this weekend, but it brought back the ‘identity bleed’ problems that have always plagued autoencoder deepfakes. Depending on how ingrained the existing celeb is, and how strong your data is, they tend to burst through the parasite identity at unexpected moments.
Testing on Clarifai celeb ident and uploading test images to Yandex image search (which does pure face recognition with no cheating), you might be surprised how hard it is to completely overwrite a really embedded host identity.
So if you overwrite someone huge like Margot Robbie, you’ll inherit all that pose and data goodness, but you may have trouble hiding the source. On the other hand, if you choose a less embedded celeb, you get less bleed-through but also less data.
So I think I’m not going to proceed with this, but it was interesting to try it. Entanglement is a pain in the neck, but it’s a thing.
PS: Additionally, ‘red carpet’ paparazzi material is over-represented in LAION for celebs such as Robbie, which means that your parasite model is likely to end up smiling for the reporters more than you might like. If you are going to do this, it would probably be best to use an actual model (i.e., a person) whose portfolio work outnumbers or at least equals their premiere red carpet presence.