[January 2026] AI & Machine Learning Monthly Newsletter 🤖
![AI & Machine Learning Monthly Newsletter banner](https://i2.wp.com/links.fastx.co.uk/wp-content/uploads/2026/03/AI___ML_Monthly__1___1___1_-1.webp-1.webp?ssl=1)
73rd issue! If you missed them, you can read the previous issues of my monthly AI & Machine Learning newsletter here.
Hey everyone!
Daniel here, I'm a machine learning engineer who teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my blog as well as make videos on the topic on YouTube.
Since there's a lot going on, I've taken care to keep things to the point.
Here's what you might have missed in January 2026 as an AI & Machine Learning Engineer… let's get you caught up!
My work
Hello everyone! What a great first month of the year for the world of ML and AI.
I'll start off by sharing some of my recent work and then get into some of my favourite resources from the past month.
In January, I released three new YouTube videos, all focused on using Hugging Face Transformers for different tasks with open-source models:
- Learn to fine-tune an LLM (Gemma-3-270M) — A step-by-step guide to full fine-tuning an LLM using Hugging Face tools, see code.
- Learn to fine-tune a VLM (SmolVLM2) — Fine-tuning a Vision Language Model (VLM) for custom image understanding tasks, see code.
- Build a multimodal RAG system with NVIDIA Nemotron VL embedding models — Combine text and image retrieval for more powerful RAG applications, see code.
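To give a flavour of what full fine-tuning involves before the training loop even starts, here's a minimal sketch of the data-prep step: turning raw (instruction, response) pairs into single training strings. The field names here are illustrative, and in practice you'd let the tokenizer's `apply_chat_template` method from Hugging Face Transformers produce the exact template rather than hand-writing turn markers.

```python
# Minimal data-prep sketch for full fine-tuning (illustrative only; real
# workflows use tokenizer.apply_chat_template from Hugging Face Transformers).
def format_example(instruction: str, response: str) -> str:
    """Join one (instruction, response) pair into a single training string
    using Gemma-style turn markers."""
    return (
        "<start_of_turn>user\n" + instruction + "<end_of_turn>\n"
        "<start_of_turn>model\n" + response + "<end_of_turn>\n"
    )

# Hypothetical toy dataset with made-up field names.
raw_data = [
    {"instruction": "What is 2 + 2?", "response": "4"},
    {"instruction": "Name a fruit.", "response": "Apple"},
]

train_texts = [format_example(d["instruction"], d["response"]) for d in raw_data]
print(len(train_texts))  # 2
```

Each resulting string would then be tokenized and fed to the trainer as a single sequence, with the loss typically masked on the user turn.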
From the Internet
- Anthropic releases a study on how AI coding assistance comes with pros and cons. In a controlled study, developers using AI assistance scored 17% lower on comprehension tests than those who coded manually, equivalent to nearly two letter grades (e.g. A grade comprehension → C grade comprehension). The researchers identified six distinct interaction patterns, finding that AI Delegation (full reliance on code generation) led to the worst outcomes while Conceptual Inquiry (using AI as a teacher rather than a coder) preserved learning. The key insight: how you use AI determines whether you learn or lose skills.
Anthropic's study design for seeing how AI tools affect developer skill, knowledge and speed.
Example input/output pairs for a training dataset to customize Qwen-Image-Edit for turning real-life photos into an isometric style. Source: Andy Coenen's blog.
- Case Study: GPT-OSS-20B vs BERT on Consumer Hardware for Multi-Label Text Classification by Ben Toussaint. Ben puts his RTX 4090 GPU to the test to see if he can beat GPT-OSS-20B with a fine-tuned classification model. Not only did he find that encoder models such as mDeBERTa-v3-base can perform on par with or better than a fine-tuned GPT-OSS-20B, he was also able to train the model in 90 seconds and achieve 174x faster inference (235 samples per second for the mDeBERTa-v3 model vs 1.35 samples per second for the GPT-OSS-20B model). If you'd like to see what it's like to train your own text classification model, see the text classification tutorial on learnhuggingface.com.
- The rise of tabular foundation models. A thought-provoking piece on how foundation models are starting to come to tabular data, potentially changing how we approach traditional ML problems. The article explores what this might mean for practitioners who've relied on XGBoost and random forests. One cool capability that tabular foundation models seem to inherit (similar to LLMs) is in-context learning (e.g. feed the model a few samples of your target data and it can adapt itself to your task).
- Reminder: Data is the foundation of Language Models by Cameron Wolfe. I revisited this article after starting to fine-tune LLMs and VLMs of my own. It's a deep dive into how training data shapes LLM capabilities. It notably references the LIMA paper, where roughly 1,000 aligned and high-quality samples were enough to steer an LLM into producing great outputs. Because LLMs have already seen so much data, if you're looking to fine-tune your own models, starting with just 100-1,000 quality samples can get excellent results.
- VLM from scratch series. A hands-on, beginner-friendly introduction to Vision Language Models. Note: the content appears to be authored with Claude assistance, so watch for rough edges, but from the first couple of minutes reading the code, the interactive format looks accessible for learning. See the first notebook to get started.
- Blog post: Use agents or be left behind by Tim Dettmers. Tim Dettmers (who also helped create QLoRA and bitsandbytes, and wrote one of the best guides to buying GPUs for deep learning) writes about how AI agents have helped his writing process and also helped him try new experiments he might not otherwise have done. Drawing from his background in factory automation, he introduces process optimization concepts to evaluate when agents actually help. I especially liked the parts where he mentions where AI agents didn't help him in various workflows, the main one being emails.
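The in-context learning idea mentioned above for tabular foundation models can be sketched in a few lines: serialize a handful of labeled rows into the context, then append the query row for the model to complete. The column names and rows below are made up for illustration, and real tabular foundation models typically consume feature arrays directly rather than text.

```python
# Sketch of the in-context-learning pattern for tabular data: labeled
# example rows go into the context, and the model fills in the "?" for
# the query row. All data here is hypothetical.
def build_tabular_prompt(examples, query):
    lines = ["age,income,defaulted"]  # header row
    for row in examples:
        lines.append(f"{row['age']},{row['income']},{row['defaulted']}")
    lines.append(f"{query['age']},{query['income']},?")  # query row to complete
    return "\n".join(lines)

examples = [
    {"age": 25, "income": 40000, "defaulted": 0},
    {"age": 47, "income": 82000, "defaulted": 0},
    {"age": 31, "income": 18000, "defaulted": 1},
]
prompt = build_tabular_prompt(examples, {"age": 29, "income": 21000})
print(prompt.splitlines()[-1])  # 29,21000,?
```

The appeal is that no gradient updates are needed: swapping in different example rows retargets the model to a new task at inference time.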
Hugging Face Updates
From the Internet (continued)
An Agent Skill is a markdown file which gives an Agent a set of instructions, docs, executable code and more as a potential tool to use when exploring a problem area.
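A minimal skill could look something like the following. The SKILL.md layout with `name`/`description` frontmatter mirrors Anthropic's published Agent Skills format, but the skill itself is a made-up example:

```markdown
---
name: csv-report
description: Summarise a CSV file into a short markdown report. Use when the user asks for a quick overview of tabular data.
---

# CSV Report Skill

1. Load the CSV with the bundled `scripts/summarise.py` helper.
2. Report row/column counts, column types, and any missing values.
3. Keep the final summary under 200 words.
```

The agent reads the `description` to decide when the skill applies, then pulls the full instructions (and any bundled scripts) into context only when needed.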
Daniel’s Open-source AI of the Month
OCR models continue to get better and better
The OCR space continues to heat up with several powerful new releases:
- LightOnOCR-2, an efficient 1B VLM for OCR — Continues to push the boundaries of efficient OCR with impressive speed benchmarks, reaching 5.71 pages per second on a single NVIDIA H100 80GB GPU. Best results on the OlmOCR-Bench benchmark, even outperforming the Mistral OCR 3 API. In the paper they state that the architecture is a Mistral Small 3.1 vision encoder (400M parameters) + Qwen3 language model (600M parameters), which they then fine-tune on a custom dataset for text extraction purposes. Cool idea: can you take just the vision encoder from a larger model and use it as the base for a smaller model? For example, take the vision encoder from Kimi-K2.5 (a very large model) and use it for downstream tasks?
Demo of LightOnOCR-2 running on the Attention Is All You Need paper.
- PaddleOCR-VL-1.5 — An updated version of the popular PaddleOCR series with improved vision-language capabilities and support for 109 languages. It also now supports cross-page table merging as well as cross-page paragraph heading recognition, which helps with long-document parsing. It achieves 94.5% on OmniDocBench v1.5 with just 0.9B parameters.
- DeepSeek OCR 2 introduces DeepEncoder V2, which fundamentally changes how AI "sees" documents. Instead of the traditional raster scanning approach, it uses "Visual Causal Flow" to read documents in logical order just like humans do. The 3B parameter model achieves 91.09% on OmniDocBench v1.5, with notably improved handling of complex layouts, tables, and multi-column documents. The breakthrough: replacing CLIP with Qwen2-0.5B as the vision encoder enables semantic reasoning about reading order.
Open-source VLMs and LLMs
- Kimi K2.5 — Moonshot AI releases their most powerful open-source model yet: a 1 trillion parameter MoE model with 32 billion active parameters. Built through continual pretraining on roughly 15 trillion mixed visual and text tokens, K2.5 excels at visual coding (generating code from UI designs and video workflows) and introduces "Agent Swarm", the ability to self-direct up to 100 AI sub-agents working in parallel. The model claims to outperform GPT-5.2, Claude 4.5 and Gemini 3 Pro on several benchmarks (the results are mixed in terms of which model is best on a given benchmark, but Kimi K2.5 is no slouch on any of them). Blog post.
Kimi K2.5 performance on various benchmarks compared to frontier models such as Gemini 3 Pro, Claude 4.5 Opus and OpenAI's GPT 5.2.
- SimpleSeg for VLM-based segmentation — A straightforward approach to adding segmentation capabilities to VLMs. I really like their simple data annotation pipeline: go from image → box labels → segmentation labels → turn into an outline → train the VLM to reproduce the outline.
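As a toy illustration of the "segmentation label → outline" step, here's a sketch that keeps only the foreground cells of a binary mask that touch the background. Real pipelines trace an ordered polygon (e.g. with OpenCV's `findContours`); this just shows the idea on a plain Python grid.

```python
# Toy "mask to outline" step: a foreground cell belongs to the outline if
# any 4-neighbour is background (or lies outside the image bounds).
def mask_to_outline(mask):
    h, w = len(mask), len(mask[0])
    outline = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue  # background cell, skip
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    outline.add((x, y))  # touches background: boundary cell
                    break
    return outline

# A 3x3 square of foreground inside a 5x5 image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(len(mask_to_outline(mask)))  # 8: every border cell of the 3x3 square
```

The outline points (or a polygon fitted to them) are what the VLM is then trained to emit as text, which is what makes the approach so simple.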
Example of Youtu-VL detecting coordinates in a natural image with text-based prompting. Youtu-VL is able to output coordinates for bounding boxes which can be plotted on an image.
- LlamaBarn — A macOS menu bar app for running local LLMs with llama.cpp. Great for quick access to AI assistance without leaving your workflow.
- RADIOv4 — NVIDIA combines SigLIP2, DINOv3, and SAM3 into the same feature space. Models are available under a commercially friendly license.
- Arcee AI releases open-weight large models — Another entrant in the open-weight space. See the podcast discussion.
- GLM-4.7-Flash (30B total parameters, 3B active parameters) — Z.ai's latest efficient model release is best in class for the 30B MoE (Mixture of Experts) space.
- iFlyBot-VLM for spatial reasoning — A VLM specifically tuned for spatial understanding tasks. It is able to ground itself on 2D and 3D detections.
- Flux.2 Klein 4B and 9B — Black Forest Labs releases smaller, faster image generation models. The smaller 4B model is available under the Apache 2.0 license. See the blog post for more.
- Google releases a series of translation models with TranslateGemma — Ranging from 4B to 12B and 27B parameters, fine-tuned for translation tasks across 55 different languages. The TranslateGemma models retain the multimodal capabilities of the original Gemma 3 models. Blog post.
Speech and Audio
- NVIDIA Magpie for Speech Generation — A 357M parameter multilingual TTS model.
- Qwen3-TTS model series — The Qwen team enters the text-to-speech space with 600M parameter and 1.7B parameter models, each capable of creating custom voices, cloning voices and generating custom speakers.
- VibeVoice ASR — Microsoft's ASR model supporting 60-minute speech transcription with diarization, timestamping, and transcription in a single pass.
Embeddings and Retrieval
Medical and Specialized Models
- MedGemma 1.5 4B — Google's next-generation medical text and image interpretation model, now with medical speech-to-text via MedASR (a speech-to-text model focused on the medical domain). There's a Kaggle competition to explore medical applications with the new MedGemma-1.5 models as well.
Computer Vision
- RF-DETR and segmentation upgrades — Roboflow extends their excellent RF-DETR detection models with segmentation, with new checkpoints ranging from Nano to 2XLarge in size. The models outperform YOLO26 models on the COCO benchmark at comparable inference speeds.
- Zcore — Find the most efficient and best dataset to use from a corpus of data. Zero-shot labeling for large-scale image datasets based on how much each sample will add to the training process.
- YOLO26 — The latest iteration of YOLO improves detection throughput with a new NMS-free inference setup.
Papers
Vision encoder. All Ministral 3 models use a 410M parameter ViT as a vision encoder for image understanding that is copied from Mistral Small 3.1 Base and kept frozen, with the same architecture described in Pixtral [Agrawal et al., 2024]. We discard the pretrained projection layer from the ViT to the language model's space and train a new projection for every model.
- Google introduces GIST, the next stage in smart sampling. GIST (Greedy Independent Set Thresholding) is a novel algorithm that provides provable guarantees for selecting high-quality data subsets that maximize both diversity and utility. When training on massive datasets, you want a representative subset that isn't redundant; GIST solves this by balancing the diversity-utility tradeoff with mathematical guarantees. Particularly useful for pre-training workflows at scale.
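To make the diversity-utility tradeoff concrete, here's a simplified greedy sketch: repeatedly pick the highest-utility point that sits at least some threshold away from everything already selected. This is a stand-in for the idea only, not Google's actual GIST algorithm (which comes with formal approximation guarantees).

```python
# Simplified diversity-aware greedy selection (illustrative, not GIST itself):
# walk points in descending utility, keep one only if it is far enough
# (>= threshold) from every point already chosen.
def greedy_diverse_subset(points, utilities, k, threshold):
    order = sorted(range(len(points)), key=lambda i: -utilities[i])
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(abs(points[i] - points[j]) >= threshold for j in chosen):
            chosen.append(i)
    return chosen

# Toy 1D "embeddings": points 0 and 1 are near-duplicates, as are 2 and 3.
points = [0.0, 0.1, 2.0, 2.05, 5.0]
utilities = [0.9, 0.8, 0.95, 0.2, 0.5]
print(greedy_diverse_subset(points, utilities, k=3, threshold=0.5))  # [2, 0, 4]
```

Note how the near-duplicates (indices 1 and 3) are skipped even though index 1 has high utility, which is exactly the redundancy the diversity term is meant to remove.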
Releases
- Google releases Agentic Vision with Gemini 3 Flash. Instead of just parsing images in a single pass, Gemini 3 Flash can now use a Think, Act, Observe loop with code execution. The model can zoom in, inspect, and manipulate images step by step to ground answers in visual evidence. This delivers a consistent 5-10% quality boost across vision benchmarks by replacing probabilistic guessing with verifiable, step-by-step inference. Developer docs.
Demo of how Gemini 3 Flash uses agentic steps to break down a vision problem.
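As a toy picture of what such a loop looks like, here's a sketch where an "agent" repeatedly zooms into the quadrant of an image grid with the most content until the region is small enough to answer from. This is purely illustrative and not how Gemini's actual tooling works.

```python
# Toy Think–Act–Observe loop: Act = crop into quadrants, Observe = score
# each quadrant by content, Think = keep the most promising one and repeat.
def zoom_loop(grid, min_size=2):
    steps = 0
    while len(grid) > min_size:
        h = len(grid) // 2
        w = len(grid[0]) // 2
        quads = [
            [row[:w] for row in grid[:h]],  # top-left
            [row[w:] for row in grid[:h]],  # top-right
            [row[:w] for row in grid[h:]],  # bottom-left
            [row[w:] for row in grid[h:]],  # bottom-right
        ]
        # Keep the quadrant with the largest pixel sum and zoom again.
        grid = max(quads, key=lambda q: sum(map(sum, q)))
        steps += 1
    return grid, steps

image = [[0] * 8 for _ in range(8)]
image[6][6] = 9  # the only "evidence" is in the bottom-right corner
region, steps = zoom_loop(image)
print(steps)  # 2 zooms: 8x8 -> 4x4 -> 2x2
```

The real system grounds each step with code execution over actual pixels, but the control flow (inspect, narrow down, answer) is the same shape.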
Videos
- AI is able to extract books verbatim — A concerning demonstration of how AI models can reproduce copyrighted content from their training data. Several flagship models including Gemini, Claude and GPT were able to reproduce Harry Potter (and other copyrighted materials) with upwards of 90% accuracy. So it seems it's okay for big companies to siphon the internet for training data, but when it comes to using their outputs to train other models, it's "against the terms of service". A little pot calling the kettle black, no?
See you next month!
What a huge month for the ML world in January!
As always, let me know if there's anything you think should be included in a future post.
Liked something here? Share it with someone.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I'm also an instructor with Zero To Mastery Academy, teaching people Machine Learning & AI in the most efficient way possible. You can see a couple of our courses below or check out all Zero To Mastery courses.

