deeptagger
==========

This is an automatic image tagger/classifier written in C++,
primarily targeting various anime models.

Unfortunately, you will still need Python 3, as well as some luck, to prepare
the models, which is done by running download.sh. You will need about
20 gigabytes of space for this operation.

"WaifuDiffusion v1.4" models are officially distributed with ONNX model exports
that do not support symbolic batch sizes. The script attempts to fix this
by running custom exports.

You're invited to change things to suit your particular needs.

Getting it to work
------------------
To build the evaluator, install a C++ compiler, CMake, and development packages
of GraphicsMagick and ONNX Runtime.

Prebuilt ONNX Runtime can be most conveniently downloaded from
https://github.com/microsoft/onnxruntime/releases[GitHub releases].
Remember to also install CUDA packages, such as _nvidia-cudnn_ on Debian,
if you plan on using the GPU-enabled options.
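
As a sketch, fetching and unpacking a GPU-enabled release on x86-64 Linux
might look like the following (the exact archive name varies by version and
platform, so treat it as illustrative); the extracted directory is then what
to pass as ONNXRuntime_ROOT below:

 $ curl -LO https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-gpu-1.16.3.tgz
 $ tar xf onnxruntime-linux-x64-gpu-1.16.3.tgz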

 $ cmake -DONNXRuntime_ROOT=/path/to/onnxruntime -B build
 $ cmake --build build
 $ ./download.sh
 $ build/deeptagger models/deepdanbooru-v3-20211112-sgd-e28.model image.jpg

The project requires a POSIX-compatible system to build.

Options
-------
--batch 1::
	This program makes use of batches by decoding and preparing multiple images
	in parallel before sending them off to models.
	Batching requires appropriate models.
--cpu::
	Force CPU inference, which is usually extremely slow.
--debug::
	Increase verbosity.
--options "CUDAExecutionProvider;device_id=0"::
	Set various ONNX Runtime execution provider options;
	see the example after this list.
--pipe::
	Take input filenames from the standard input.
--threshold 0.1::
	Output weight threshold. Needs to be set higher on ML-Danbooru models.
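
For instance, a single image can be tagged with verbose output, a raised
threshold, and explicit execution provider options as follows (the model path
is only an example):

 $ build/deeptagger --debug --threshold 0.5 \
   --options "CUDAExecutionProvider;device_id=0" \
   models/ml_caformer_m36_dec-5-97527.model image.jpg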

Tagging galleries
-----------------
The appropriate invocation depends on your machine and the chosen model.
Unless you have a powerful machine, or use a fast model, it may take forever.

 $ find "$GALLERY/images" -type l \
   | build/deeptagger --pipe -b 16 -t 0.5 \
     models/ml_caformer_m36_dec-5-97527.model \
   | sed 's|[^\t]*/||' \
   | gallery tag "$GALLERY" caformer "ML-Danbooru CAFormer"
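
Without a capable GPU, forcing CPU inference with a lighter model such as
DeepDanbooru tends to be the more practical choice, as the CPU benchmarks
below suggest. A sketch of such a run, assuming the same gallery layout
(the "deepdanbooru" tag-space name and description are only examples):

 $ find "$GALLERY/images" -type l \
   | build/deeptagger --pipe --cpu -b 16 \
     models/deepdanbooru-v3-20211112-sgd-e28.model \
   | sed 's|[^\t]*/||' \
   | gallery tag "$GALLERY" deepdanbooru DeepDanbooru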

Model benchmarks (Linux)
------------------------
These were measured with ORT 1.16.3 on a machine with a GeForce RTX 4090 (24G)
and a Ryzen 9 7950X3D (32 threads), on a sample of 704 images;
the whole run took over eight hours. Times include model loading.

There is room for further performance tuning.

GPU inference
~~~~~~~~~~~~~
[cols="<,>,>", options=header]
|===
|Model|Batch size|Time
|WD v1.4 ViT v2 (batch)|16|19 s
|DeepDanbooru|16|21 s
|WD v1.4 SwinV2 v2 (batch)|16|21 s
|ML-Danbooru CAFormer dec-5-97527|16|25 s
|WD v1.4 ViT v2 (batch)|4|27 s
|WD v1.4 SwinV2 v2 (batch)|4|30 s
|DeepDanbooru|4|31 s
|ML-Danbooru TResNet-D 6-30000|16|31 s
|WD v1.4 MOAT v2 (batch)|16|31 s
|WD v1.4 ConvNeXT v2 (batch)|16|32 s
|ML-Danbooru CAFormer dec-5-97527|4|32 s
|WD v1.4 ConvNeXTV2 v2 (batch)|16|36 s
|ML-Danbooru TResNet-D 6-30000|4|39 s
|WD v1.4 ConvNeXT v2 (batch)|4|39 s
|WD v1.4 MOAT v2 (batch)|4|39 s
|WD v1.4 ConvNeXTV2 v2 (batch)|4|43 s
|WD v1.4 ViT v2|1|43 s
|WD v1.4 ViT v2 (batch)|1|43 s
|ML-Danbooru CAFormer dec-5-97527|1|52 s
|DeepDanbooru|1|53 s
|WD v1.4 MOAT v2|1|53 s
|WD v1.4 ConvNeXT v2|1|54 s
|WD v1.4 MOAT v2 (batch)|1|54 s
|WD v1.4 SwinV2 v2|1|54 s
|WD v1.4 SwinV2 v2 (batch)|1|54 s
|WD v1.4 ConvNeXT v2 (batch)|1|56 s
|WD v1.4 ConvNeXTV2 v2|1|56 s
|ML-Danbooru TResNet-D 6-30000|1|58 s
|WD v1.4 ConvNeXTV2 v2 (batch)|1|58 s
|===

CPU inference
~~~~~~~~~~~~~
[cols="<,>,>", options=header]
|===
|Model|Batch size|Time
|DeepDanbooru|16|45 s
|DeepDanbooru|4|54 s
|DeepDanbooru|1|88 s
|ML-Danbooru TResNet-D 6-30000|4|139 s
|ML-Danbooru TResNet-D 6-30000|16|162 s
|ML-Danbooru TResNet-D 6-30000|1|167 s
|WD v1.4 ConvNeXT v2|1|208 s
|WD v1.4 ConvNeXT v2 (batch)|4|226 s
|WD v1.4 ConvNeXT v2 (batch)|16|238 s
|WD v1.4 ConvNeXTV2 v2|1|245 s
|WD v1.4 ConvNeXTV2 v2 (batch)|4|268 s
|WD v1.4 ViT v2 (batch)|16|270 s
|ML-Danbooru CAFormer dec-5-97527|4|270 s
|WD v1.4 ConvNeXT v2 (batch)|1|272 s
|WD v1.4 SwinV2 v2 (batch)|4|277 s
|WD v1.4 ViT v2 (batch)|4|277 s
|WD v1.4 ConvNeXTV2 v2 (batch)|16|294 s
|WD v1.4 SwinV2 v2 (batch)|1|300 s
|WD v1.4 SwinV2 v2|1|302 s
|WD v1.4 SwinV2 v2 (batch)|16|305 s
|ML-Danbooru CAFormer dec-5-97527|16|305 s
|WD v1.4 MOAT v2 (batch)|4|307 s
|WD v1.4 ViT v2|1|308 s
|WD v1.4 ViT v2 (batch)|1|311 s
|WD v1.4 ConvNeXTV2 v2 (batch)|1|312 s
|WD v1.4 MOAT v2|1|332 s
|WD v1.4 MOAT v2 (batch)|16|335 s
|WD v1.4 MOAT v2 (batch)|1|339 s
|ML-Danbooru CAFormer dec-5-97527|1|352 s
|===

Model benchmarks (macOS)
------------------------
These were measured with ORT 1.16.3 on a MacBook Pro (M1 Pro, 16 GB RAM)
running macOS Ventura 13.6.2, on a sample of 179 images.
Times include model loading.

There was often significant memory pressure and swapping,
which may explain some of the anomalies. CoreML often makes things worse,
and generally consumes a lot more memory than pure CPU execution.

The kernel panic noted in the GPU table below was repeatable.

GPU inference
~~~~~~~~~~~~~
[cols="<2,>1,>1", options=header]
|===
|Model|Batch size|Time
|DeepDanbooru|1|24 s
|DeepDanbooru|8|31 s
|DeepDanbooru|4|33 s
|WD v1.4 SwinV2 v2 (batch)|4|71 s
|WD v1.4 SwinV2 v2 (batch)|1|76 s
|WD v1.4 ViT v2 (batch)|4|97 s
|WD v1.4 ViT v2 (batch)|8|97 s
|ML-Danbooru TResNet-D 6-30000|8|100 s
|ML-Danbooru TResNet-D 6-30000|4|101 s
|WD v1.4 ViT v2 (batch)|1|105 s
|ML-Danbooru TResNet-D 6-30000|1|125 s
|WD v1.4 ConvNeXT v2 (batch)|8|126 s
|WD v1.4 SwinV2 v2 (batch)|8|127 s
|WD v1.4 ConvNeXT v2 (batch)|4|128 s
|WD v1.4 ConvNeXTV2 v2 (batch)|8|132 s
|WD v1.4 ConvNeXTV2 v2 (batch)|4|133 s
|WD v1.4 ViT v2|1|146 s
|WD v1.4 ConvNeXT v2 (batch)|1|149 s
|WD v1.4 ConvNeXTV2 v2 (batch)|1|160 s
|WD v1.4 MOAT v2 (batch)|1|165 s
|WD v1.4 SwinV2 v2|1|166 s
|ML-Danbooru CAFormer dec-5-97527|1|263 s
|WD v1.4 ConvNeXT v2|1|273 s
|WD v1.4 MOAT v2|1|273 s
|WD v1.4 ConvNeXTV2 v2|1|340 s
|ML-Danbooru CAFormer dec-5-97527|4|445 s
|ML-Danbooru CAFormer dec-5-97527|8|1790 s
|WD v1.4 MOAT v2 (batch)|4|kernel panic
|===

CPU inference
~~~~~~~~~~~~~
[cols="<2,>1,>1", options=header]
|===
|Model|Batch size|Time
|DeepDanbooru|8|54 s
|DeepDanbooru|4|55 s
|DeepDanbooru|1|75 s
|WD v1.4 SwinV2 v2 (batch)|8|93 s
|WD v1.4 SwinV2 v2 (batch)|4|94 s
|ML-Danbooru TResNet-D 6-30000|8|97 s
|WD v1.4 SwinV2 v2 (batch)|1|98 s
|ML-Danbooru TResNet-D 6-30000|4|99 s
|WD v1.4 SwinV2 v2|1|99 s
|ML-Danbooru CAFormer dec-5-97527|4|110 s
|ML-Danbooru CAFormer dec-5-97527|8|110 s
|WD v1.4 ViT v2 (batch)|4|111 s
|WD v1.4 ViT v2 (batch)|8|111 s
|WD v1.4 ViT v2 (batch)|1|113 s
|WD v1.4 ViT v2|1|113 s
|ML-Danbooru TResNet-D 6-30000|1|118 s
|ML-Danbooru CAFormer dec-5-97527|1|122 s
|WD v1.4 ConvNeXT v2 (batch)|8|124 s
|WD v1.4 ConvNeXT v2 (batch)|4|125 s
|WD v1.4 ConvNeXTV2 v2 (batch)|8|129 s
|WD v1.4 ConvNeXT v2|1|130 s
|WD v1.4 ConvNeXTV2 v2 (batch)|4|131 s
|WD v1.4 MOAT v2 (batch)|8|134 s
|WD v1.4 ConvNeXTV2 v2|1|136 s
|WD v1.4 MOAT v2 (batch)|4|136 s
|WD v1.4 ConvNeXT v2 (batch)|1|146 s
|WD v1.4 MOAT v2 (batch)|1|156 s
|WD v1.4 MOAT v2|1|156 s
|WD v1.4 ConvNeXTV2 v2 (batch)|1|157 s
|===

Comparison with WDMassTagger
----------------------------
Measured using CUDA, on the same Linux computer as above, on a sample of
6352 images. We're a bit slower, depending on the model; times are given as
minutes:seconds, and the ratio expresses our speed relative to WDMassTagger.
Batch sizes of 16 and 32 give practically equivalent results for both programs.

[cols="<,>,>,>", options="header,autowidth"]
|===
|Model|WDMassTagger|deeptagger (batch)|Ratio
|wd-v1-4-convnext-tagger-v2 |1:18 |1:55 |68 %
|wd-v1-4-convnextv2-tagger-v2 |1:20 |2:10 |62 %
|wd-v1-4-moat-tagger-v2 |1:22 |1:52 |73 %
|wd-v1-4-swinv2-tagger-v2 |1:28 |1:34 |94 %
|wd-v1-4-vit-tagger-v2 |1:16 |1:22 |93 %
|===