I’ve spent the last several days experimenting and there’s no doubt whatsoever that using celebrity instance tokens is far more effective than using rare tokens such as “sks” or “ohwx”. I didn’t use x/y grids of renders to judge this subjectively. Instead, I used DeepFace to automatically examine batches of renders and charted the results numerically. I got the idea from u/CeFurkan and one of his YouTube tutorials. DeepFace is available as a Python module.

Here is a simple example of a DeepFace Python script:

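    from deepface import DeepFace

    # Replace these placeholders with the paths to your two image files
    img1_path = "path_to_img1_file"
    img2_path = "path_to_img2_file"

    # verify() compares the two faces and returns a dictionary
    response = DeepFace.verify(img1_path=img1_path, img2_path=img2_path)
    distance = response["distance"]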

In the above example, two images are compared and a dictionary is returned. The ‘distance’ element measures how closely the people in the two images resemble each other. The lower the distance, the better the resemblance. There are different models you can use for testing.

I also experimented with whether regularization with generated class images or with ground truth photos was more effective. And I also wanted to find out whether captions were especially helpful. But I didn’t come to any solid conclusions about regularization or captions. For that I could use advice or recommendations. I'll briefly describe what I did.



The subject of my experiment was Jess Bush, the actor who plays Nurse Chapel on *Star Trek: Strange New Worlds*. Because her fame is relatively recent, she is not present in the SD v1.5 model. But lots of photos of her can be found on the internet. For these reasons, she makes a good test subject. Using [starbyface.com](https://starbyface.com), I decided that she somewhat resembled Alexa Davalos, so I used “alexa davalos” when I wanted to use a celebrity name as the instance token. Just to make sure, I checked that “alexa davalos” rendered adequately in SD v1.5.

[25 dataset images, 512 x 512 pixels](https://preview.redd.it/29kgybodthib1.jpg?width=1024&format=pjpg&auto=webp&s=e7d65a3c34ac6e6b3332bfd75d6ec56ed360ceed)

For this experiment I trained full Dreambooth models, not LoRAs. This was done for accuracy, not for practicality. I have a computer dedicated solely to SD work that has an A5000 video card with 24GB VRAM. In practice, one should train individual people as LoRAs. This is especially true when training with SDXL.



In all the trainings in my experiment I used Kohya with SD v1.5 as the base model, the same 25 dataset images, 25 repeats, and 6 epochs. I used BLIP to make caption text files and manually edited them appropriately. The rest of the parameters were typical for this type of training.

It's worth noting that the trainings that lacked regularization were completed in half the steps. Should I have doubled the epochs for those trainings? I'm not sure.
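As a rough sketch of that step arithmetic, assuming a batch size of 1 (not stated above) and that regularization images double the steps per epoch. The helper name `total_steps` is hypothetical and is not part of Kohya:

```python
def total_steps(num_images, repeats, epochs, regularization, batch_size=1):
    """Estimate training steps; regularization is assumed to double the count."""
    steps_per_epoch = (num_images * repeats) // batch_size
    if regularization:
        steps_per_epoch *= 2
    return steps_per_epoch * epochs


# 25 images, 25 repeats, 6 epochs, as in this experiment
print(total_steps(25, 25, 6, regularization=False))  # 3750
print(total_steps(25, 25, 6, regularization=True))   # 7500
```

Under these assumptions, doubling the epochs of a training without regularization would bring it to the same step count as a regularized one.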



Each training produced six checkpoints. With each checkpoint I generated 200 images in ComfyUI using the default workflow that is meant for SD v1.x. I used the prompt, “headshot photo of [instance token] woman”, and the negative, “smile, text, watermark, illustration, painting frame, border, line drawing, 3d, anime, cartoon”. I used Euler at 30 steps.

Using DeepFace, I compared each generated image with seven of the dataset images that were close-ups of Jess’s face. Each comparison returned a “distance” score. The lower the score, the better the resemblance. I then averaged the seven scores and recorded the average for each image. For each checkpoint I generated a histogram of the results.
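The per-image scoring described above can be sketched roughly as follows. `DeepFace.verify` is the same call shown earlier; the helper names `deepface_distance` and `average_distance` and the file paths are hypothetical. The distance function is passed in as a parameter so the averaging logic can be tried without DeepFace installed.

```python
from statistics import mean


def deepface_distance(img1_path, img2_path):
    """Distance between two face images as reported by DeepFace.verify."""
    from deepface import DeepFace  # imported lazily so the rest runs without it
    return DeepFace.verify(img1_path=img1_path, img2_path=img2_path)["distance"]


def average_distance(render_path, reference_paths, distance_fn=deepface_distance):
    """Average the distance between one render and each reference image."""
    return mean(distance_fn(render_path, ref) for ref in reference_paths)


# Hypothetical usage: seven close-up dataset images as references
# references = [f"dataset/closeup_{i}.jpg" for i in range(1, 8)]
# score = average_distance("renders/checkpoint5_0001.png", references)
```

Running this over all 200 renders of a checkpoint gives the list of averaged scores that goes into that checkpoint's histogram.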

If I'm not mistaken, the conventional wisdom regarding SD training is that you want to achieve resemblance in as few steps as possible in order to maintain flexibility. I decided that the earliest epoch to achieve a high population of generated images that scored **lower than 0.6** was the best epoch. I noticed that subsequent epochs did not improve and sometimes slightly declined after only a few epochs. This aligns with what people have learned through conventional x/y grid render comparisons. It's also worth noting that even in the best of trainings there was still a significant population of generated images that were above that 0.6 threshold. I think that as long as there are not many that score above 0.7, the checkpoint is still viable. But I admit that this is debatable. It's possible that with enough training most of the generated images could score below 0.6, but then there is the issue of inflexibility due to over-training.
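A minimal sketch of how the per-checkpoint numbers (average distance and the share of renders under each threshold) could be computed from a list of averaged scores. The function name `summarize` and the sample scores are made up for illustration:

```python
from statistics import mean


def summarize(distances, thresholds=(0.7, 0.6)):
    """Return the average distance and the percentage of scores below each threshold."""
    summary = {"average": mean(distances)}
    for t in thresholds:
        below = sum(1 for d in distances if d < t)
        summary[f"below_{t}"] = 100.0 * below / len(distances)
    return summary


scores = [0.52, 0.58, 0.61, 0.66, 0.73]  # made-up example values, not real results
print(summarize(scores))
```

Comparing these summaries across the six checkpoints of a training is one way to spot the earliest epoch with a large share of renders below 0.6.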



To help with flexibility, captions are often used. But if you have a good dataset of images to begin with, you only need “[instance token] [class]” for captioning. This default captioning is built into Kohya and is used if you provide no captioning information in the file names or corresponding caption text files. I believe that the dataset I used for Jess was sufficiently varied. However, I think that captioning did help a little bit.



In the case of training one person, ***regularization is not necessary***. If I understand it correctly, regularization is used to prevent your subject from taking over the entire class in the model. If you train a full model with Dreambooth that can render pictures of a person you've trained, you don’t want that person rendered every time you use the model to render pictures of other people who are also in that same class. That is useful for training models containing multiple subjects of the same class. But if you are training a LoRA of your person, regularization is irrelevant. And since training takes longer with SDXL, it makes even more sense not to use regularization when training one person. Training without regularization cuts training time in half.

There has been debate of late about whether using real photos (a.k.a. ground truth) for regularization increases the quality of the training. I’ve tested this using DeepFace and I found the results inconclusive. Resemblance is one thing; quality and realism are another. In my experiment, I used photos obtained from [Unsplash.com](https://Unsplash.com) as well as several photos I had collected elsewhere.



The first thing that must be stated is that most of the checkpoints that I selected as the best in each training can produce good renderings. Comparing the renderings is a subjective task. This experiment focused on the numbers produced using DeepFace comparisons.

After training variations of rare token, celebrity token, regularization, ground truth regularization, no regularization, with captioning, and without captioning, the training that achieved the best resemblance in the fewest steps was this one:


[celebrity token, no regularization, using captions](https://preview.redd.it/ptmvrpfgvhib1.png?width=654&format=png&auto=webp&s=c278c67f8c849ad5d91e8abc0a5b1c7dcf9f1ce7)


    Best Checkpoint:....5
    Average Distance:...0.60592
    % Below 0.7:........97.88%
    % Below 0.6:........47.09%

Here is one of the renders from this checkpoint that was used in this experiment:


[Distance Score: 0.62812](https://preview.redd.it/0gbw9z6ixhib1.png?width=512&format=png&auto=webp&s=1b3480cb7b09f98e7db516f6a2d2e1896764fe1f)

Towards the end of last year, the conventional wisdom was to use a unique instance token such as “ohwx”, use regularization, and use captions. Compare the above histogram with that method:


[“ohwx” token, regularization, using captions](https://preview.redd.it/m66hwy8rvhib1.png?width=654&format=png&auto=webp&s=531c3d50016fe21bbdac01a1d221d9790daa462e)


    Best Checkpoint:....6
    Average Distance:...0.66239
    % Below 0.7:........78.28%
    % Below 0.6:........12.12%

[A recently published YouTube tutorial](https://youtu.be/N_zhQSx2Q3c) states that using a celebrity name for an instance token along with ground truth regularization and captioning is the very best method. ***I disagree***. Here are the results of this experiment’s training using those options:


[celebrity token, ground truth regularization, using captions](https://preview.redd.it/apg7v0x8whib1.png?width=654&format=png&auto=webp&s=fbc02a6237e90883d104e945d87675d1cc548c48)


    Best Checkpoint:....6
    Average Distance:...0.66239
    % Below 0.7:........91.33%
    % Below 0.6:........39.80%

The quality of this method of training is **good**. It renders images that appear similar in quality to the training that I chose as best. However, it took 7,500 steps, more than twice the number of steps of the checkpoint I chose as best in the best training. I believe that the quality of the training might improve beyond six epochs. But the issue of flexibility lessens the usefulness of such checkpoints.

In all my training experiments, I found that captions improved training. The improvement was significant but not dramatic. It can be very useful in certain cases.



There is no doubt that using a celebrity token vastly accelerates training and dramatically improves the quality of results.

Regularization is ***useless*** for training models of individual people. All it does is double training time and hinder quality. This is especially important for LoRA training when considering the time it takes to train such models in SDXL.

Posted on Reddit by u/FugueSegue