Add some benchmarks and information

2024-01-18 18:16:18 +01:00
parent 8df76dbaab
commit 77e988365d
2 changed files with 163 additions and 7 deletions
--- a/deeptagger/README.adoc
+++ b/deeptagger/README.adoc
@@ -2,24 +2,129 @@ deeptagger
 ==========

 This is an automatic image tagger/classifier written in C++,
-without using any Python, and primarily targets various anime models.
+primarily targeting various anime models.

-Unfortunately, you will still need Python and some luck to prepare the models,
-achieved by running download.sh.  You will need about 20 gigabytes of space.
+Unfortunately, you will still need Python 3, as well as some luck, to prepare
+the models, which is done by running download.sh.  This operation needs about
+20 gigabytes of disk space.

-Very little effort is made to make this work on non-Unix systems.
+"WaifuDiffusion v1.4" models are officially distributed with ONNX model exports
+that do not support symbolic batch sizes.  The script attempts to fix this
+by running custom exports.

-Getting this to work
--------------------
+You're invited to change things to suit your particular needs.
+
+Getting it to work
+------------------
 To build the evaluator, install a C++ compiler, CMake, and development packages
 of GraphicsMagick and ONNX Runtime.

 Prebuilt ONNX Runtime can be most conveniently downloaded from
 https://github.com/microsoft/onnxruntime/releases[GitHub releases].
-Remember to install CUDA packages, such as _nvidia-cudnn_ on Debian,
+Remember to also install CUDA packages, such as _nvidia-cudnn_ on Debian,
 if you plan on using the GPU-enabled options.

 $ cmake -DONNXRuntime_ROOT=/path/to/onnxruntime -B build
 $ cmake --build build
 $ ./download.sh
 $ build/deeptagger models/deepdanbooru-v3-20211112-sgd-e28.model image.jpg
+
+Very little effort is made to make the project compatible with non-POSIX
+systems.
+
+Options
+-------
+--batch 1::
+	This program makes use of batches by decoding and preparing multiple images
+	in parallel before sending them off to models.
+	Batching requires models that support batch sizes above 1.
+--cpu::
+	Force CPU inference, which is usually extremely slow.
+--debug::
+	Increase verbosity.
+--options "CUDAExecutionProvider;device_id=0"::
+	Set various ONNX Runtime execution provider options.
+--pipe::
+	Take input filenames from the standard input.
+--threshold 0.1::
+	Output weight threshold.  Needs to be set very high on ML-Danbooru models.
+
+Model benchmarks
+----------------
+These were measured on a machine with a GeForce RTX 4090 (24G)
+and a Ryzen 9 7950X3D (32 threads), on a sample of 704 images.
+The whole benchmark run took over eight hours.
+
+There is room for further performance tuning.
+
+GPU inference
+~~~~~~~~~~~~~
+[cols="<,>,>", options=header]
+|===
+|Model|Batch size|Time
+|ML-Danbooru Caformer dec-5-97527|16|OOM
+|WD v1.4 ViT v2 (batch)|16|19 s
+|DeepDanbooru|16|21 s
+|WD v1.4 SwinV2 v2 (batch)|16|21 s
+|WD v1.4 ViT v2 (batch)|4|27 s
+|WD v1.4 SwinV2 v2 (batch)|4|30 s
+|DeepDanbooru|4|31 s
+|ML-Danbooru TResNet-D 6-30000|16|31 s
+|WD v1.4 MOAT v2 (batch)|16|31 s
+|WD v1.4 ConvNeXT v2 (batch)|16|32 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|16|36 s
+|ML-Danbooru TResNet-D 6-30000|4|39 s
+|WD v1.4 ConvNeXT v2 (batch)|4|39 s
+|WD v1.4 MOAT v2 (batch)|4|39 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|4|43 s
+|WD v1.4 ViT v2|1|43 s
+|WD v1.4 ViT v2 (batch)|1|43 s
+|ML-Danbooru Caformer dec-5-97527|4|48 s
+|DeepDanbooru|1|53 s
+|WD v1.4 MOAT v2|1|53 s
+|WD v1.4 ConvNeXT v2|1|54 s
+|WD v1.4 MOAT v2 (batch)|1|54 s
+|WD v1.4 SwinV2 v2|1|54 s
+|WD v1.4 SwinV2 v2 (batch)|1|54 s
+|WD v1.4 ConvNeXT v2 (batch)|1|56 s
+|WD v1.4 ConvNeXTV2 v2|1|56 s
+|ML-Danbooru TResNet-D 6-30000|1|58 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|1|58 s
+|ML-Danbooru Caformer dec-5-97527|1|73 s
+|===
+
+CPU inference
+~~~~~~~~~~~~~
+[cols="<,>,>", options=header]
+|===
+|Model|Batch size|Time
+|DeepDanbooru|16|45 s
+|DeepDanbooru|4|54 s
+|DeepDanbooru|1|88 s
+|ML-Danbooru TResNet-D 6-30000|4|139 s
+|ML-Danbooru TResNet-D 6-30000|16|162 s
+|ML-Danbooru TResNet-D 6-30000|1|167 s
+|WD v1.4 ConvNeXT v2|1|208 s
+|WD v1.4 ConvNeXT v2 (batch)|4|226 s
+|WD v1.4 ConvNeXT v2 (batch)|16|238 s
+|WD v1.4 ConvNeXTV2 v2|1|245 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|4|268 s
+|WD v1.4 ViT v2 (batch)|16|270 s
+|WD v1.4 ConvNeXT v2 (batch)|1|272 s
+|WD v1.4 SwinV2 v2 (batch)|4|277 s
+|WD v1.4 ViT v2 (batch)|4|277 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|16|294 s
+|WD v1.4 SwinV2 v2 (batch)|1|300 s
+|WD v1.4 SwinV2 v2|1|302 s
+|WD v1.4 SwinV2 v2 (batch)|16|305 s
+|WD v1.4 MOAT v2 (batch)|4|307 s
+|WD v1.4 ViT v2|1|308 s
+|WD v1.4 ViT v2 (batch)|1|311 s
+|WD v1.4 ConvNeXTV2 v2 (batch)|1|312 s
+|WD v1.4 MOAT v2|1|332 s
+|WD v1.4 MOAT v2 (batch)|16|335 s
+|WD v1.4 MOAT v2 (batch)|1|339 s
+|ML-Danbooru Caformer dec-5-97527|4|637 s
+|ML-Danbooru Caformer dec-5-97527|16|689 s
+|ML-Danbooru Caformer dec-5-97527|1|829 s
+|===
--- a/deeptagger/bench-interpret.sh
+++ b/deeptagger/bench-interpret.sh
@@ -0,0 +1,51 @@
+#!/bin/sh -e
+parse() {
+	awk 'BEGIN {
+		OFS = FS = "\t"
+	} {
+		name = $1
+		path = $2
+		cpu = $3 != ""
+		batch = $4
+		time = $5
+
+		if (path ~ "/batch-")
+			name = name " (batch)"
+		else if (name ~ /^WD / && batch > 1)
+			next
+	} {
+		group = name FS cpu FS batch
+		if (lastgroup != group) {
+			if (lastgroup)
+				print lastgroup, mintime
+
+			lastgroup = group
+			mintime = time
+		} else {
+			if (mintime > time)
+				mintime = time
+		}
+	} END {
+		if (lastgroup) print lastgroup, mintime
+	}' "${BENCH_LOG:-bench.out}"
+}
+
+cat <<END
+GPU inference
+~~~~~~~~~~~~~
+[cols="<,>,>", options=header]
+|===
+|Model|Batch size|Time
+$(parse | awk -F'\t' 'BEGIN { OFS = "|" }
+	!$2 { print "", $1, $3, $4 " s" }' | sort -t'|' -nk4)
+|===
+
+CPU inference
+~~~~~~~~~~~~~
+[cols="<,>,>", options=header]
+|===
+|Model|Batch size|Time
+$(parse | awk -F'\t' 'BEGIN { OFS = "|" }
+	$2 { print "", $1, $3, $4 " s" }' | sort -t'|' -nk4)
+|===
+END
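
A minimal sketch of the reduction that parse() in bench-interpret.sh appears to perform: for each run of adjacent log lines sharing a model name, CPU flag, and batch size, only the fastest time is kept. It assumes a tab-separated log with name, path, CPU flag, batch size, and time columns, as read by the script. The `sample` and `min_per_group` helpers, the `models/a.model` path, and all sample values are hypothetical and are not real measurements.

```shell
#!/bin/sh -e
# Hypothetical sample log: name, path, CPU flag, batch size, time in seconds.
# The rows are made up for illustration only.
sample() {
	printf 'DeepDanbooru\tmodels/a.model\t\t1\t55\n'
	printf 'DeepDanbooru\tmodels/a.model\t\t1\t53\n'
	printf 'DeepDanbooru\tmodels/a.model\t--cpu\t1\t88\n'
	printf 'DeepDanbooru\tmodels/a.model\t--cpu\t1\t90\n'
}

# Keep the minimum time of each consecutive (name, cpu, batch) group.
# Like the original script, this relies on repeated runs being adjacent.
min_per_group() {
	awk 'BEGIN { OFS = FS = "\t" } {
		group = $1 FS ($3 != "") FS $4
		if (group != lastgroup) {
			if (NR > 1)
				print lastgroup, mintime
			lastgroup = group
			mintime = $5
		} else if ($5 + 0 < mintime + 0) {
			mintime = $5
		}
	} END {
		if (NR)
			print lastgroup, mintime
	}'
}

sample | min_per_group
```

The second output column is 0 for GPU runs and 1 for CPU runs, which is what the two table-generating awk filters in the script select on.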