About Astar Labs

The honest version

Astar Labs is one person, a GPU rental bill, and a lot of curiosity.

When AI started taking over the world, I did the opposite of what most people did: I stepped back. Not because I wasn't interested, but because I was interested in the wrong thing by most people's standards: not what AI could do for me, but how it actually worked. The tools got shinier, the benchmarks got bigger, and I kept thinking the same thing: I want to build one myself.

So I did.

TFM, short for Transformer Foundation Model, is my attempt to understand AI from the ground up. It won't beat GPT-4. It won't beat anything, frankly. It's a small model, trained on a single rented GPU, by someone doing this entirely in their spare time. But every weight in it, every line of code behind it, is something I built and understand. That matters more to me than benchmarks.

If you're here to use it, welcome. If you're here to poke around the GitHub and see how the sausage is made, even better.

Astar Labs

The name is a hand-me-down. Before TFM, there was Astar Technologies, a project where I built self-landing model rockets. When I pivoted to AI, I kept "Astar" and appended "Labs", because, honestly, it sounds appropriately AI-ey. The full internal joke is "AI Research Lab Astar" -> Astar Labs. I wish there were a deeper story, but sometimes there isn't.

The models

TFM-1.5 | last-gen creative specialist

Parameter Count: 152M | Vocabulary Size: 16k | Context Window: 4k

TFM-1.5 is the original. It was pretrained on TinyStories, a dataset of short, simple narratives, and fine-tuned on UltraChat conversations. The result is a model that isn't particularly smart in the traditional sense, but has a genuine flair for writing. It understands narrative structure, it has a feel for language, and it can surprise you. Think of it less as an assistant, and more as a creative collaborator.

TFM-1.6 | knowledgeable generalist

Parameter Count: 167M | Vocabulary Size: 24k | Context Window: 4k

TFM-1.6 is where things get more serious. Pretrained on the full SlimPajama-6B dataset and fine-tuned on OpenOrca, it has significantly broader knowledge and better instruction-following than its predecessor. The larger vocabulary size (24k vs 1.5's 16k) accounts for most of the parameter difference, and gives it a more expressive token space to work with. It's still a small model, but it's the best one I've built so far.

Under the hood

Both models share the same core architecture: a decoder-only Transformer, built in PyTorch with custom implementations of:

  • SwiGLU activation - a gated linear unit variant that consistently outperforms ReLU/GELU in practice, popularised by PaLM and LLaMA.
  • Multi-Head Attention - implemented from scratch rather than relying on torch.nn.MultiheadAttention, to better integrate other custom components.
  • RoPE (Rotary Positional Embeddings) - a relative positional encoding scheme that generalises better across sequence lengths than learned absolute embeddings.
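As a rough sketch of the first bullet, a SwiGLU feed-forward block looks like this in PyTorch. The dimensions and layer names here are illustrative placeholders, not TFM's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU(x @ W_gate) * (x @ W_up), projected back down.
    d_hidden is commonly ~8/3 * d_model in SwiGLU FFNs; both values here are made up."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated "gate" path multiplicatively gates the linear "up" path.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

block = SwiGLU(d_model=512, d_hidden=1376)
out = block(torch.randn(2, 16, 512))  # (batch, sequence, d_model)
print(out.shape)  # torch.Size([2, 16, 512])
```

The gating is the whole trick: instead of a single activated projection, one branch decides how much of the other gets through, which tends to train better than a plain ReLU/GELU MLP at the same parameter budget.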

The tokenizer is a BPE tokenizer built with the tokenizers library, using NFD normalization, Metaspace and Punctuation pre-tokenizers, and a Metaspace decoder. It's trained per-model, which is why vocabulary sizes differ between 1.5 and 1.6.
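Wired up with the tokenizers library, the pipeline described above might look roughly like this. The vocabulary size and toy corpus are placeholders, not TFM's actual training setup:

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# NFD normalization, Metaspace + Punctuation pre-tokenizers, Metaspace decoder.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFD()
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [pre_tokenizers.Metaspace(), pre_tokenizers.Punctuation()]
)
tokenizer.decoder = decoders.Metaspace()

# Trained per-model on that model's own corpus (a one-line toy corpus here).
trainer = BpeTrainer(vocab_size=16000, special_tokens=["[UNK]"])
corpus = ["Once upon a time, a small model learned to write."]
tokenizer.train_from_iterator(corpus, trainer=trainer)

ids = tokenizer.encode("a small model").ids
print(tokenizer.decode(ids))
```

Metaspace replaces spaces with a visible marker before BPE runs, and the matching decoder turns the markers back into spaces, so encode/decode round-trips cleanly.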

Training runs on a single rented RTX 5090, chosen for its balance of compute power, VRAM, and rental cost. Local inference runs on MPS (Apple Silicon), where TFM reaches a comfortable ~80 tokens per second in around 1.5 GB of memory, lean enough to run on consumer hardware without breaking a sweat.
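Backend selection for local inference is the usual PyTorch dance; a minimal sketch (not TFM's actual loading code):

```python
import torch

# Prefer CUDA, then Apple's MPS backend, then fall back to CPU.
# Note: on Apple Silicon, MPS uses unified memory shared with the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Any tensor or model moved to `device` runs on the selected backend.
x = torch.randn(1, 8, device=device)
print(x.device.type)
```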

What's next?

TFM-1.6 is currently finishing pretraining and will move into fine-tuning soon. Once it's out, the focus shifts to a few things I'm genuinely excited about:

  • Longer context windows - a 4k context window is fine for conversation, but it's limiting for anything more ambitious, like web search grounding, tool calling, document Q&A, etc. Pushing that boundary is the next meaningful step.
  • BitNet - binary-weight networks are a fascinating direction for making small models even more efficient. If they hold up at this scale, they could dramatically reduce inference cost and memory footprint.
  • Mamba-3 - State Space Models offer a fundamentally different approach to sequence modelling: linear complexity in sequence length instead of attention's quadratic cost. Whether they'll eventually replace Transformers is an open question, but they're worth understanding firsthand.

The goal is the same as it's always been: build it myself, understand it properly, and share it with whoever's curious enough to show up.


TFM is free to use. It is not affiliated with any company, research institution, or commercial entity. It's just a project. A very personal one.