Add some benchmarks and information

2024-01-18 18:16:18 +01:00
parent 8df76dbaab
commit 77e988365d
2 changed files with 163 additions and 7 deletions
--- a/deeptagger/README.adoc
+++ b/deeptagger/README.adoc
@@ -2,24 +2,129 @@ deeptagger
 ==========
 This is an automatic image tagger/classifier written in C++,
-without using any Python, and primarily targets various anime models.
+primarily targeting various anime models.
-Unfortunately, you will still need Python and some luck to prepare the models,
+Unfortunately, you will still need Python 3, as well as some luck, to prepare
-achieved by running download.sh.  You will need about 20 gigabytes of space.
+the models, which is done by running download.sh.  You will need about 20 gigabytes
 of space for this operation.
-Very little effort is made to make this work on non-Unix systems.
+"WaifuDiffusion v1.4" models are officially distributed with ONNX model exports
 that do not support symbolic batch sizes.  The script attempts to fix this
 by running custom exports.
-Getting this to work
+You're invited to change things to suit your particular needs.
--------------------
+
 Getting it to work
 ------------------
 To build the evaluator, install a C++ compiler, CMake, and development packages
 of GraphicsMagick and ONNX Runtime.
 Prebuilt ONNX Runtime can be most conveniently downloaded from
 https://github.com/microsoft/onnxruntime/releases[GitHub releases].
-Remember to install CUDA packages, such as _nvidia-cudnn_ on Debian,
+Remember to also install CUDA packages, such as _nvidia-cudnn_ on Debian,
 if you plan on using the GPU-enabled options.
 $ cmake -DONNXRuntime_ROOT=/path/to/onnxruntime -B build
 $ cmake --build build
 $ ./download.sh
 $ build/deeptagger models/deepdanbooru-v3-20211112-sgd-e28.model image.jpg
 Very little effort is made to make the project compatible with non-POSIX
 systems.
 Options
 -------
 --batch 1::
 	Batch size.  The program decodes and prepares multiple images in parallel
 	before sending them to the model as a single batch.
 	Batching requires models that accept batched input.
 --cpu::
 	Force CPU inference, which is usually extremely slow.
 --debug::
 	Increase verbosity.
 --options "CUDAExecutionProvider;device_id=0"::
 	Set various ONNX Runtime execution provider options.
 --pipe::
 	Take input filenames from the standard input.
 --threshold 0.1::
 	Output weight threshold.  Needs to be set very high on ML-Danbooru models.
 Model benchmarks
 ----------------
 These were measured on a machine with a GeForce RTX 4090 (24G)
 and a Ryzen 9 7950X3D (32 threads), on a sample of 704 images.
 The complete benchmark run took over eight hours.
 There is room for further performance tuning.
 GPU inference
 ~~~~~~~~~~~~~
 [cols="<,>,>", options=header]
 |===
 |Model|Batch size|Time
 |ML-Danbooru Caformer dec-5-97527|16|OOM
 |WD v1.4 ViT v2 (batch)|16|19 s
 |DeepDanbooru|16|21 s
 |WD v1.4 SwinV2 v2 (batch)|16|21 s
 |WD v1.4 ViT v2 (batch)|4|27 s
 |WD v1.4 SwinV2 v2 (batch)|4|30 s
 |DeepDanbooru|4|31 s
 |ML-Danbooru TResNet-D 6-30000|16|31 s
 |WD v1.4 MOAT v2 (batch)|16|31 s
 |WD v1.4 ConvNeXT v2 (batch)|16|32 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|16|36 s
 |ML-Danbooru TResNet-D 6-30000|4|39 s
 |WD v1.4 ConvNeXT v2 (batch)|4|39 s
 |WD v1.4 MOAT v2 (batch)|4|39 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|4|43 s
 |WD v1.4 ViT v2|1|43 s
 |WD v1.4 ViT v2 (batch)|1|43 s
 |ML-Danbooru Caformer dec-5-97527|4|48 s
 |DeepDanbooru|1|53 s
 |WD v1.4 MOAT v2|1|53 s
 |WD v1.4 ConvNeXT v2|1|54 s
 |WD v1.4 MOAT v2 (batch)|1|54 s
 |WD v1.4 SwinV2 v2|1|54 s
 |WD v1.4 SwinV2 v2 (batch)|1|54 s
 |WD v1.4 ConvNeXT v2 (batch)|1|56 s
 |WD v1.4 ConvNeXTV2 v2|1|56 s
 |ML-Danbooru TResNet-D 6-30000|1|58 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|1|58 s
 |ML-Danbooru Caformer dec-5-97527|1|73 s
 |===
 CPU inference
 ~~~~~~~~~~~~~
 [cols="<,>,>", options=header]
 |===
 |Model|Batch size|Time
 |DeepDanbooru|16|45 s
 |DeepDanbooru|4|54 s
 |DeepDanbooru|1|88 s
 |ML-Danbooru TResNet-D 6-30000|4|139 s
 |ML-Danbooru TResNet-D 6-30000|16|162 s
 |ML-Danbooru TResNet-D 6-30000|1|167 s
 |WD v1.4 ConvNeXT v2|1|208 s
 |WD v1.4 ConvNeXT v2 (batch)|4|226 s
 |WD v1.4 ConvNeXT v2 (batch)|16|238 s
 |WD v1.4 ConvNeXTV2 v2|1|245 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|4|268 s
 |WD v1.4 ViT v2 (batch)|16|270 s
 |WD v1.4 ConvNeXT v2 (batch)|1|272 s
 |WD v1.4 SwinV2 v2 (batch)|4|277 s
 |WD v1.4 ViT v2 (batch)|4|277 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|16|294 s
 |WD v1.4 SwinV2 v2 (batch)|1|300 s
 |WD v1.4 SwinV2 v2|1|302 s
 |WD v1.4 SwinV2 v2 (batch)|16|305 s
 |WD v1.4 MOAT v2 (batch)|4|307 s
 |WD v1.4 ViT v2|1|308 s
 |WD v1.4 ViT v2 (batch)|1|311 s
 |WD v1.4 ConvNeXTV2 v2 (batch)|1|312 s
 |WD v1.4 MOAT v2|1|332 s
 |WD v1.4 MOAT v2 (batch)|16|335 s
 |WD v1.4 MOAT v2 (batch)|1|339 s
 |ML-Danbooru Caformer dec-5-97527|4|637 s
 |ML-Danbooru Caformer dec-5-97527|16|689 s
 |ML-Danbooru Caformer dec-5-97527|1|829 s
 |===
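
As a rough way to read the tables above, the times can be converted to throughput. The sketch below assumes each reported time covers the whole 704-image sample; the `throughput` helper is hypothetical and not part of the repository.

```shell
# Hypothetical helper: images per second for a given sample size and time.
# $1 = number of images, $2 = wall time in seconds
throughput() {
	LC_ALL=C awk -v n="$1" -v t="$2" 'BEGIN { printf "%.1f\n", n / t }'
}

# Fastest GPU row: WD v1.4 ViT v2 (batch), batch size 16, 19 s
throughput 704 19
# Slowest CPU row: ML-Danbooru Caformer dec-5-97527, batch size 1, 829 s
throughput 704 829
```

Under that assumption, the fastest GPU configuration handles about 37 images per second, while the slowest CPU configuration stays below one image per second.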
--- a/deeptagger/bench-interpret.sh
+++ b/deeptagger/bench-interpret.sh
@@ -0,0 +1,51 @@
 #!/bin/sh -e
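 # Reduce the benchmark log to the fastest time per (model, device, batch size).
 # Input is tab-separated: name, model path, CPU flag, batch size, time.
 # Runs of the same configuration are expected on adjacent lines.
 parse() {
 	awk 'BEGIN {
 		OFS = FS = "\t"
 	} {
 		name = $1
 		path = $2
 		cpu = $3 != ""
 		batch = $4
 		time = $5
 		# Mark models loaded from a "/batch-" path,
 		# and skip batch sizes above 1 for other WD models.
 		if (path ~ "/batch-")
 			name = name " (batch)"
 		else if (name ~ /^WD / && batch > 1)
 			next
 	} {
 		# Track the minimum time within each group of adjacent lines.
 		group = name FS cpu FS batch
 		if (lastgroup != group) {
 			if (lastgroup)
 				print lastgroup, mintime
 			lastgroup = group
 			mintime = time
 		} else {
 			if (mintime > time)
 				mintime = time
 		}
 	} END {
 		# Avoid printing an empty row when the log has no usable lines.
 		if (lastgroup)
 			print lastgroup, mintime
 	}' "${BENCH_LOG:-bench.out}"
 }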
 cat <<END
 GPU inference
 ~~~~~~~~~~~~~
 [cols="<,>,>", options=header]
 |===
 |Model|Batch size|Time
 $(parse | awk -F'\t' 'BEGIN { OFS = "|" }
 	!$2 { print "", $1, $3, $4 " s" }' | sort -t'|' -nk4)
 |===
 CPU inference
 ~~~~~~~~~~~~~
 [cols="<,>,>", options=header]
 |===
 |Model|Batch size|Time
 $(parse | awk -F'\t' 'BEGIN { OFS = "|" }
 	$2 { print "", $1, $3, $4 " s" }' | sort -t'|' -nk4)
 |===
 END
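
For reference, the `parse` function above appears to expect a tab-separated log with five fields (name, model path, a CPU flag that is non-empty for CPU runs, batch size, and time in seconds), with repeated runs of the same configuration on adjacent lines. The following is a minimal sketch of the minimum-per-group step on made-up data: the model name, path, flag value, and timings are invented for illustration, and the WD-specific filtering is left out.

```shell
# Hypothetical sample log: name, path, CPU flag, batch size, seconds.
# All values are invented for illustration only.
sample() {
	printf 'ModelA\tmodels/a.model\t\t1\t55\n'
	printf 'ModelA\tmodels/a.model\t\t1\t53\n'
	printf 'ModelA\tmodels/a.model\tcpu\t1\t90\n'
	printf 'ModelA\tmodels/a.model\tcpu\t1\t88\n'
}

# Keep the fastest run of each group of adjacent lines
# sharing the same name, CPU flag, and batch size.
fastest() {
	awk 'BEGIN {
		OFS = FS = "\t"
	} {
		group = $1 FS ($3 != "") FS $4
		time = $5 + 0
		if (group != lastgroup) {
			if (seen)
				print lastgroup, mintime
			lastgroup = group
			mintime = time
			seen = 1
		} else if (time < mintime) {
			mintime = time
		}
	} END {
		if (seen)
			print lastgroup, mintime
	}'
}

sample | fastest
```

With this sample, the output has two tab-separated lines: the GPU group (flag 0) with 53 and the CPU group (flag 1) with 88. Because only adjacent lines are compared, a log in which runs of one configuration are interleaved with others would need to be sorted first.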