z-vector-search — RAG-Powered Semantic Search for z/OS

24,565

IBM z/OS Messages Indexed

C++17

Pure Native Code

84 MB

Embedding Model Size

<0.5s

Per-Message Enrichment

The Problem

Console messages shouldn't require tribal knowledge

Too many messages, not enough context. Operators sift through hundreds of console messages per hour — ABENDs, RACF violations, CICS abends — with no easy way to know which ones matter.
Knowledge is scattered. The answer lives across IBM manuals, runbooks, ticket histories, and people's heads. What if you could just ask?
Data can't leave the LPAR. For air-gapped, regulated workloads, shipping logs to a cloud LLM is a non-starter. Everything must run on z/OS.

z/OS Operator Console — SYS1

N SYS1 17:30:42 STC00010 $HASP373 PAYROLL STARTED

N SYS1 17:30:45 STC00123 IEF450I PAYROLL - ABEND=S0C7

N SYS1 17:30:45 STC00001 IEA404W REAL STORAGE SHORTAGE

N SYS1 17:30:46 STC00200 ICH408I USER(BATCH1) ACCESS REVOKED

N SYS1 17:30:48 STC00080 DFH1501I CICS TRANSACTION COMPLETED

847 messages in the last hour. Which ones matter?

How It Works

The RAG Pipeline

Every query flows through a pipeline that turns raw text into semantic understanding. Everything runs locally on z/OS — no external API calls, no cloud dependencies.

📄

Ingest

Console messages, documents, or IBM manuals

✂️

Chunk

256 tokens per chunk, 64-token overlap

🧠

Embed

Nomic Embed Text v1.5 via llama.cpp

📐

Normalize

L2 normalization for cosine distance

💾

Store

SQLite + sqlite-vec — single .db file

🔀

Classify & Search

Keyword, semantic, or hybrid — auto-detected

🏆

Rank & Return

Reciprocal Rank Fusion → top-K results
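The Chunk step above can be sketched as a sliding window over token IDs. This is a minimal illustration, assuming the tokenizer has already produced a flat list of integer tokens; the function name chunk_tokens and its signature are hypothetical and not taken from the z-vector-search sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Split a token sequence into fixed-size windows that overlap their neighbours.
// The defaults mirror the 256/64 figures quoted above; the name and signature
// are hypothetical.
std::vector<std::vector<int>> chunk_tokens(const std::vector<int>& tokens,
                                           std::size_t chunk_size = 256,
                                           std::size_t overlap = 64) {
    assert(chunk_size > overlap);  // otherwise the window would never advance
    using diff_t = std::vector<int>::difference_type;
    const std::size_t stride = chunk_size - overlap;  // 192 new tokens per chunk
    std::vector<std::vector<int>> chunks;
    for (std::size_t start = 0; start < tokens.size(); start += stride) {
        const std::size_t end = std::min(start + chunk_size, tokens.size());
        chunks.emplace_back(tokens.begin() + static_cast<diff_t>(start),
                            tokens.begin() + static_cast<diff_t>(end));
        if (end == tokens.size()) break;  // the last window reached the end
    }
    return chunks;
}
```

With these defaults each chunk repeats the final 64 tokens of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.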

Under the Hood

Architecture

Pure C++17 with vendored dependencies. No Python runtime, no Java, no external services beyond what zopen provides.

🧬

Embedding Engine

Nomic Embed Text v1.5, an encoder-only model purpose-built for turning text into vectors. Quantized to Q4_K_M at just 84 MB. Uses MEAN pooling and document/query prefixes for optimal retrieval quality.

llama.cpp · Nomic v1.5 · Q4_K_M
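To make the two ideas in this card concrete, here is a small sketch of task prefixes and MEAN pooling. The prefixes "search_document: " and "search_query: " are the ones Nomic documents for this model. The helper names are hypothetical, and in practice llama.cpp performs the pooling internally when mean pooling is selected, so this only shows the arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Prepend the task prefix that Nomic Embed Text v1.5 expects.
// (Helper name is hypothetical.)
std::string with_prefix(const std::string& text, bool is_query) {
    return (is_query ? "search_query: " : "search_document: ") + text;
}

// MEAN pooling: average the per-token vectors into a single embedding.
std::vector<float> mean_pool(const std::vector<std::vector<float>>& token_vecs) {
    assert(!token_vecs.empty());
    std::vector<float> pooled(token_vecs.front().size(), 0.0f);
    for (const auto& v : token_vecs) {
        assert(v.size() == pooled.size());  // all token vectors share one width
        for (std::size_t i = 0; i < v.size(); ++i) pooled[i] += v[i];
    }
    for (float& x : pooled) x /= static_cast<float>(token_vecs.size());
    return pooled;
}
```

Indexing and querying must use the matching prefix on each side, otherwise stored and query vectors land in slightly different regions of the embedding space.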

🗄️

Vector Store

SQLite extended with sqlite-vec for KNN similarity search. No database server, no network dependencies — just a single .db file with text chunks, embeddings, and structured metadata side by side.

SQLite · sqlite-vec · Cosine Distance
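The reason for L2-normalizing before storage is that cosine distance between unit vectors reduces to 1 minus a dot product. The sketch below models that arithmetic with a brute-force K-nearest-neighbour scan. It is not the sqlite-vec API (sqlite-vec runs the search inside SQLite), and all names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Scale v to unit length in place (the zero vector is left unchanged).
void l2_normalize(Vec& v) {
    double sum_sq = 0.0;
    for (float x : v) sum_sq += static_cast<double>(x) * x;
    if (sum_sq == 0.0) return;
    const float inv = static_cast<float>(1.0 / std::sqrt(sum_sq));
    for (float& x : v) x *= inv;
}

// For unit-length inputs, cosine distance is 1 - dot(a, b).
float cosine_distance_unit(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float dot = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) dot += a[i] * b[i];
    return 1.0f - dot;
}

// Brute-force KNN: returns (row index, distance) pairs, closest first.
std::vector<std::pair<std::size_t, float>> knn(const std::vector<Vec>& rows,
                                               const Vec& query, std::size_t k) {
    std::vector<std::pair<std::size_t, float>> scored;
    for (std::size_t i = 0; i < rows.size(); ++i)
        scored.emplace_back(i, cosine_distance_unit(rows[i], query));
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end(),
                      [](const auto& a, const auto& b) { return a.second < b.second; });
    scored.resize(k);
    return scored;
}
```

A distance near 0 means the vectors point the same way; the 0.12 and 0.31 values in the z-console output are on this scale.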

🔀

Hybrid Search

Auto-classifies queries as keyword, semantic, or hybrid. Exact message IDs use SQL LIKE; natural language uses vector similarity; mixed queries merge both with Reciprocal Rank Fusion (RRF).

RRF k=60 · Auto-classify · KNN
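Reciprocal Rank Fusion scores each result as the sum of 1 / (k + rank) over every ranked list it appears in, with rank starting at 1. The following is a minimal sketch using the k=60 constant mentioned above; the function name rrf_fuse and the use of plain string IDs are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Fuse several ranked ID lists into one list ordered by RRF score.
// score(id) = sum over lists of 1 / (k + rank), rank starting at 1.
std::vector<std::pair<std::string, double>> rrf_fuse(
    const std::vector<std::vector<std::string>>& rankings, double k = 60.0) {
    assert(k > 0.0);
    std::unordered_map<std::string, double> score;
    for (const auto& list : rankings)
        for (std::size_t i = 0; i < list.size(); ++i)
            score[list[i]] += 1.0 / (k + static_cast<double>(i + 1));

    std::vector<std::pair<std::string, double>> fused(score.begin(), score.end());
    std::sort(fused.begin(), fused.end(), [](const auto& a, const auto& b) {
        if (a.second != b.second) return a.second > b.second;  // higher score first
        return a.first < b.first;                              // stable tie-break by ID
    });
    return fused;
}
```

Because only ranks enter the formula, the keyword and vector result lists can be merged without having to put LIKE matches and cosine distances on a common scale.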

⚡

SIMD Acceleration

Custom s390x VXE intrinsics for vector math and quantized matrix-vector multiplies. Each 128-bit vector register holds 4 floats, and the loops handle two registers (8 floats) per iteration, turning scalar hot paths into vectorized operations on z15+.

VXE · s390x · vec_mule/vec_mulo

CLI Suite

The Tools

A suite of command-line tools that work together. Index once, query repeatedly. All tools use ~/.z-vector-search/ as their default location, so no configuration is needed.

📥

z-index

Index documents into the persistent vector store. Supports incremental indexing — only new or modified files are re-encoded.

🔍

z-query

Search the store with natural language or structured queries. Auto-detects keyword, semantic, or hybrid mode.

🖥️

z-console

Enrich z/OS console messages with IBM documentation and operational history. Reads live SYSLOG via pcon.

⚙️

z-setup

One-time setup: downloads the embedding model from Hugging Face and unpacks the IBM messages knowledge base.
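The auto-detection that z-query performs could look roughly like the sketch below. The message-ID pattern and the three-way rule are assumptions made for illustration, not the project's actual heuristics: a query made only of message-ID-like tokens is treated as keyword, one with none as semantic, and a mix as hybrid.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <sstream>
#include <string>

enum class QueryMode { Keyword, Semantic, Hybrid };

// Assumed shape of a message ID: 3-4 uppercase letters, 3-5 digits and an
// optional type letter, e.g. ICH408I, IEF450I, DFH1501I.
bool looks_like_message_id(const std::string& token) {
    assert(!token.empty());  // callers pass whitespace-delimited tokens
    static const std::regex id_re("[A-Z]{3,4}[0-9]{3,5}[A-Z]?");
    return std::regex_match(token, id_re);
}

// Count ID-like tokens versus ordinary words and pick a search mode.
QueryMode classify_query(const std::string& query) {
    std::istringstream in(query);
    std::string token;
    std::size_t ids = 0;
    std::size_t words = 0;
    while (in >> token) {
        if (looks_like_message_id(token)) {
            ++ids;
        } else {
            ++words;
        }
    }
    if (ids > 0 && words == 0) return QueryMode::Keyword;
    if (ids > 0) return QueryMode::Hybrid;
    return QueryMode::Semantic;  // also the fallback for an empty query
}
```

Under this rule "ICH408I" is a keyword query, "why did IEF450I appear" is hybrid, and "what does abend S0C4 mean" is semantic because S0C4 is an abend code rather than a message ID.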

In Action

z-console Demo

Feed z-console a RACF violation message and watch it instantly return the IBM documentation and your operational history.

z-console — enriched output

$ z-console "ICH408I USER(BATCH1) GROUP(PROD) LOGON/JOB INITIATION - ACCESS REVOKED"

Parsed 1 message, 1 interesting, 1 unique ID to look up.

━━━ ICH408I (severity: E) ━━━
  ICH408I USER(BATCH1) GROUP(PROD)
    LOGON/JOB INITIATION - ACCESS REVOKED

  IBM Documentation (keyword match):
     ICH408I — A RACF-defined user has been revoked. The user's access
     authority has been removed, typically because consecutive incorrect
     password attempts exceeded the SETROPTS PASSWORD limit.
     System Action: The logon or job is rejected.
     Operator Response: Contact the security administrator to reinstate
     access via ALTUSER userid RESUME.
     (distance: 0.12)

  Operational History (semantic match):
     [2026-04-03 14:22] ICH408I,ICH409I, BATCH1 revoked on SYS1,
     resolved by security team reset. Related: RACF password policy
     change ticket INC-4421.
     (distance: 0.31)

Performance

Making llama.cpp Faster on z/OS

The s390x backend was running pure scalar code through every hot path. ~900 lines of new VXE intrinsics changed that.

On x86, llama.cpp vectorizes everything with AVX2/AVX-512. On ARM, it uses NEON. On z/OS? Nothing. The quantized matrix-vector multiplies and elementwise float helpers that dominate every forward pass were all scalar.

IBM Z processors from z13 onwards include the Vector Facility for z/Architecture, a 128-bit SIMD instruction set; the vector enhancements (VXE) that add single-precision float support arrived with z14. The new s390x implementations process 8 floats per loop iteration (two 4-float vectors) using vec_xl/vec_add/vec_xst.

The core trick for Q4_K quantized multiply: vec_mule/vec_mulo widen int8 → int16 cleanly, then a second pair horizontally reduces into int32 — the whole sequence retires in a handful of cycles on z15.
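The even/odd widening multiply is easier to follow in a portable scalar model. The sketch below emulates the lane behaviour of vec_mule (even-indexed lanes) and vec_mulo (odd-indexed lanes) with plain arrays so it runs on any platform. It is not the actual kernel: the helper names are hypothetical, and multiplying by a vector of ones in the second stage is an assumption about what the "second pair" does (a real Q4_K kernel may fold block scales in at that point).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using I16x8 = std::array<std::int16_t, 8>;  // eight int16 lanes (128 bits)
using I32x4 = std::array<std::int32_t, 4>;  // four int32 lanes (128 bits)

// Model of vec_mule / vec_mulo on 16 int8 lanes: multiply the even-indexed
// (odd = false) or odd-indexed (odd = true) lanes, widening to int16.
I16x8 mul_widen8(const std::int8_t* a, const std::int8_t* b, bool odd) {
    I16x8 r{};
    for (std::size_t i = 0; i < 8; ++i) {
        const std::size_t lane = 2 * i + (odd ? 1 : 0);
        r[i] = static_cast<std::int16_t>(a[lane] * b[lane]);  // |product| <= 16384
    }
    return r;
}

// The same even/odd split one level up: int16 lanes widen to int32.
I32x4 mul_widen16(const I16x8& a, const I16x8& b, bool odd) {
    I32x4 r{};
    for (std::size_t i = 0; i < 4; ++i) {
        const std::size_t lane = 2 * i + (odd ? 1 : 0);
        r[i] = static_cast<std::int32_t>(a[lane]) * b[lane];
    }
    return r;
}

// int8 dot product, 16 lanes per step, accumulated in four int32 lanes.
std::int32_t dot_i8(const std::int8_t* a, const std::int8_t* b, std::size_t n) {
    assert(n % 16 == 0);  // one 128-bit register holds 16 int8 lanes
    I16x8 ones;
    ones.fill(1);
    I32x4 acc{};
    for (std::size_t base = 0; base < n; base += 16) {
        for (bool odd8 : {false, true}) {
            const I16x8 p = mul_widen8(a + base, b + base, odd8);
            for (bool odd16 : {false, true}) {
                const I32x4 w = mul_widen16(p, ones, odd16);
                for (std::size_t i = 0; i < 4; ++i) acc[i] += w[i];
            }
        }
    }
    return acc[0] + acc[1] + acc[2] + acc[3];  // final horizontal sum
}
```

Keeping the even and odd products in separate registers means no int8 product is ever truncated: each intermediate fits comfortably in int16 before the second stage widens it to int32.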

Metrics output (JSON)

$ z-query --metrics "what does abend S0C4 mean"
{
  "mode": "semantic",
  "model_load_ms": 2341.5,
  "embed_ms": 287.3,
  "search_ms": 42.1,
  "total_ms": 2812.4,
  "results": 5,
  "store_chunks": 34102
}

⚡

Vector Helpers Vectorized

add, sub, mul, scale, mad — 8 floats per iteration via vec_xl/vec_xst

🧮

Q4_K × Q8_K GEMV

Brand new ggml_gemv_q4_K_8x4_q8_K with VXE intrinsics for the hot path

📊

Q8_K Row Quantization

__builtin_s390_vfisb for round-and-convert in a single instruction

🔧

CMake Integration

OS390 build path: -fzvector -m64 -march=z15 with optional MASSV linkage

🤖

AI-Assisted Development

IBM Bob helped navigate VXE intrinsics and cross-check quantized format details

Quick Start

Get Started in 5 Minutes

You'll need the zopen package manager set up on your z/OS system. The zopen QuickStart Guide takes about five minutes.

1

Install via zopen

The simplest way — pulls in llama.cpp and all tools in one shot.

$ zopen install z-vector-search

2

Run setup

Downloads the embedding model (~84 MB) and unpacks the IBM messages knowledge base (~160 MB).

$ z-setup

3

Query the knowledge base

Search 24,565 IBM z/OS messages with natural language — out of the box.

$ z-query "what does abend S0C4 mean"

4

Enrich console messages

Feed z-console a single message or read your live SYSLOG for instant context.

$ z-console --pcon -l

Build from source (alternative)

# 1. Install llama.cpp via zopen
zopen install llamacpp

# 2. Build z-vector-search
cmake -B build -DLLAMA_ROOT=$ZOPEN_PKGINSTALL/llamacpp
cmake --build build

# 3. Run setup and start searching
z-setup
z-query "what does abend S0C4 mean"

Semantic Search for the Mainframe