[October 2025] AI & Machine Learning Monthly Newsletter 💻🤖
![AI & Machine Learning Monthly Newsletter](https://i2.wp.com/links.fastx.co.uk/wp-content/uploads/2026/03/AI___ML_Monthly__1___1___1_-1.webp-1.webp?ssl=1)
Seventieth issue! For those who missed them, you can read the previous issues of my monthly AI & Machine Learning newsletter here.
Hey there, Daniel here.
I’m an AI & Machine Learning Engineer who also teaches the following beginner-friendly machine learning courses:
I also write regularly about machine learning on my own blog, as well as make videos on the topic on YouTube.
Since there’s a lot going on, the utmost care has been taken to keep things to the point.
Enough about me! You’re here for this month’s AI & Machine Learning Monthly Newsletter.
Typically a 500ish (+/-1,000ish, usually +) word post detailing some of the most interesting things on machine learning I’ve found in the last month.
Here’s what you might have missed in October 2025 as an AI & Machine Learning Engineer… let’s get you caught up!
My work
- My next course is NOW live!!! I’m very excited for this launch, as Hugging Face is the modern homepage for AI and this course combines the best of Machine Learning and Hugging Face. In my own AI work, I use the platform daily. And the course is designed to teach you how to do the same in a project-focused way.
- A note on AI/ML Monthly for November 2025 (next month): I’m going to be skipping next month’s AI/ML Monthly issue as I’m getting married on November 25th 💍. So no matter how much I love reading and writing about AI/ML, I’m going to spend a few weeks hanging out with my beautiful new wife and away from a keyboard. I’ll see you all for the final edition of 2025!
From the Web
- A textbook on LLM training: Ever wonder what goes into training a cutting-edge large language model? It’s just internet text and a Transformer architecture, right? Well… turns out there’s quite a bit more. To try to capture all of it, from small experiments to model architecture to how to create a dataset to how to maintain a cluster of 384 H100 GPUs, Hugging Face researchers and engineers wrote a 200+ page guide called The Smol Training Playbook: The Secrets to Building World-Class LLMs. Inside they document almost everything that led them to building SmolLM3.
- After 5 years of development, the Hugging Face Hub (`huggingface_hub`) library hits v1.0 with a number of upgrades and a handful of breaking changes.
- Hugging Face upgrades streaming datasets with 100x fewer requests, 10x faster data resolution, 2x samples/sec and 0 worker crashes at 256 concurrent workers. Streaming datasets lets you use a dataset without downloading it. These updates were enough to outperform local SSDs when training on 64x H100 GPUs with 256 workers. Get started streaming datasets with a few lines of code:
```python
from datasets import load_dataset

# Stream the dataset rather than downloading it in full
dataset = load_dataset("HuggingFaceM4/FineVisionMax",
                       split="train",
                       streaming=True)

# Print the first sample
print(next(iter(dataset)))
```
- Exo share how they connected an NVIDIA DGX Spark and a Mac Studio M3 Ultra to improve LLM latency. The NVIDIA DGX Spark has 4x the compute but the Mac Studio has 3x the memory bandwidth. Compute power helps with TTFT (time-to-first-token), the delay from sending a prompt to seeing the first response token, a stage known as prefill. Memory bandwidth helps with TPS (tokens-per-second), the speed at which tokens appear one by one after the first, a stage often called decoding. Combining the NVIDIA DGX Spark’s prefill capability with the M3 Ultra’s decoding capability resulted in a 2.8x speedup over a baseline M3 Ultra using Llama-3.1-8b-Instruct.
Combining the memory bandwidth of the M3 Ultra in the Mac Studio with the compute power of the NVIDIA DGX Spark results in a best-of-both-worlds scenario for LLM inference. Thanks to asynchronous data transfers, the transfer time between the two devices is negligible in the final speedup results. Images from the Exo blog.
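As a rough back-of-the-envelope sketch of why splitting prefill and decode across devices helps (the throughput numbers below are made up for illustration, they are not Exo’s measurements):

```python
def generation_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total latency = prefill time (compute-bound) + decode time (bandwidth-bound)."""
    ttft = prompt_tokens / prefill_tps   # time-to-first-token
    decode = output_tokens / decode_tps  # remaining tokens stream out one by one
    return ttft + decode

# Hypothetical throughputs: the DGX Spark prefills ~4x faster,
# while decode speed is set by whichever device does the decoding.
prompt, output = 4000, 500
mac_only = generation_latency(prompt, output, prefill_tps=1000, decode_tps=50)
combined = generation_latency(prompt, output, prefill_tps=4000, decode_tps=50)
print(f"M3 Ultra alone: {mac_only:.1f}s, combined: {combined:.1f}s")
```

With these made-up numbers, most of the win comes from cutting prefill time on long prompts, which matches the intuition that compute dominates TTFT while memory bandwidth dominates TPS.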
They break a modern VLM (HuggingFaceTB/SmolVLM-256M-Instruct) into 5 main components:
- Processor (prepares and aligns raw text and image inputs).
- Vision module (converts pixel data into high-dimensional patch embeddings).
- Connector (compresses and projects visual features into the same embedding space as text tokens).
- Input merger (replaces placeholder tokens with visual embeddings to form a unified multimodal sequence).
- Decoder (generates context-aware text by attending to both visual and textual information).
Example of a VLM architecture which splits an image into patches, encodes them with a vision encoder, compresses the visual features and then joins them with language input tokens feeding into a decoder LLM. Source: Visualizing how VLMs work blog post.
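The input-merger step above can be sketched in a few lines of toy Python: placeholder image tokens in the text sequence get swapped for the embeddings produced by the connector. The token names and embedding values here are invented for illustration, not SmolVLM’s actual vocabulary.

```python
def merge_inputs(tokens, image_embeddings, placeholder="<image>"):
    """Replace placeholder tokens with visual embeddings to form one sequence."""
    merged, img_iter = [], iter(image_embeddings)
    for tok in tokens:
        if tok == placeholder:
            merged.append(next(img_iter))   # inject a visual embedding
        else:
            merged.append(("text", tok))    # keep the text token as-is
    return merged

tokens = ["Describe", "<image>", "<image>", "please"]
patch_embeds = [("img", [0.1, 0.9]), ("img", [0.4, 0.2])]
sequence = merge_inputs(tokens, patch_embeds)
print(sequence)
```

The decoder then attends over this unified multimodal sequence, treating visual embeddings and text embeddings identically.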
Basketball player detection, recognition, segmentation and tracking using a combination of open-source computer vision models. Source: Roboflow blog.
LoRA is a form of PEFT (Parameter-Efficient Fine-Tuning), a technique where a smaller number of model parameters are trained to adapt a base model to a specific task rather than training the full model.
Results from several different runs of LoRA settings versus full fine-tuning (orange line). At lower ranks, LoRA underperforms full fine-tuning, but as the ranks get higher, results get closer and closer to full fine-tuning. Source: Thinking Machines blog.
It’s often much more compute efficient than full fine-tuning and, as Thinking Machines found, it can equal full fine-tuning under two main conditions:
- LoRA is applied to all layers of the network, especially the MLP/MoE layers which house most of the parameters.
- LoRA works well when not capacity constrained, i.e., the number of trainable parameters exceeds the amount of information to be learned, which can be estimated via dataset size.
LoRAs are what’s used in Apple’s Adapters framework, a framework which lets you create lightweight Adapters to improve the performance of Apple’s on-device Foundation models for a specific task.
Something that stood out to me was the use of the Hugging Face `peft` library:
In our experiments, we used the standard parametrization used in the Hugging Face `peft` library.
It’s cool to see how much surface area different Hugging Face libraries are starting to cover.
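As a quick sketch of why LoRA is so parameter-efficient: a rank-r LoRA update to a d×k weight matrix W trains only the two small factors B (d×r) and A (r×k), so r·(d+k) parameters instead of d·k. The layer sizes below are illustrative, not from the Thinking Machines experiments:

```python
def lora_trainable_params(d, k, r):
    """Trainable parameters for a rank-r LoRA update on a d x k weight matrix."""
    return r * (d + k)   # B is d x r, A is r x k

d, k, r = 4096, 4096, 16          # made-up transformer-layer-ish sizes
full = d * k                      # full fine-tuning trains every weight
lora = lora_trainable_params(d, k, r)
print(f"full: {full:,} params, LoRA r={r}: {lora:,} params "
      f"({100 * lora / full:.2f}% of full)")
```

This is also why the "apply LoRA to all layers, especially MLP/MoE" condition matters: the big matrices are where a rank-r update buys the most capacity per trainable parameter.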
Daniel’s Open-source AI of the month
- Roboflow releases RF-DETR Seg, a state-of-the-art segmentation model which achieves 170 FPS on an NVIDIA T4 GPU. This is a follow-up to the recently released RF-DETR models for detection, which are also best-in-class for their size.
- ModernVBERT is a 250M parameter vision-language retriever (it’s designed to retrieve similar documents based on visual/text inputs). It’s based on the Ettin architecture (mentioned in ML Monthly September 2025), now with an added vision component. It performs on par with models 10x the size except it’s much faster. All model checkpoints, datasets and training recipes are available under the MIT license.
- Rex-Omni is a VLM (Vision Language Model, or MLLM, Multimodal Large Language Model) which turns object detection, object referring, visual prompting, pointing, OCR and more into a next-token prediction problem. Rex-Omni is fine-tuned from Qwen2.5-VL-3B, combines 10 computer vision tasks into one model and is able to perform on par with specialist computer vision models. The model can handle text-based or visual-based inputs for several kinds of detection tasks. Something of note is the two-stage training pipeline: the first stage is SFT (Supervised Fine-tuning) to get the model used to predicting detection coordinates as next tokens, the second is reinforcement learning via GRPO to get the model to adhere to geometrically correct outputs (e.g. the model is rewarded for predicting the right number of boxes). Try the demo on your own images.
Rex-Omni is able to detect boxes, points, keypoints and many other vision-based outputs due to its training data and training schedule. You can use natural language inputs such as “bagel” or “coffee”, or even vision-based inputs such as existing bounding boxes of target objects to detect.
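"Detection as next-token prediction" means the model emits coordinates as plain text which you then parse back into boxes. A minimal sketch of that parsing step (the `<box>(x1,y1),(x2,y2)</box>` format here is invented for illustration; Rex-Omni’s actual token scheme differs):

```python
import re

def parse_boxes(generated_text):
    """Parse (x1, y1, x2, y2) boxes out of a model's generated text."""
    pattern = r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    return [tuple(map(int, m)) for m in re.findall(pattern, generated_text)]

# A hypothetical generation for the prompt "detect the bagel and the coffee"
output = "bagel<box>(12,34),(56,78)</box> coffee<box>(5,5),(40,60)</box>"
print(parse_boxes(output))  # [(12, 34, 56, 78), (5, 5, 40, 60)]
```

The GRPO stage described above essentially rewards generations whose parsed boxes are geometrically sensible, rather than just plausible-looking text.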
A paradigm shift for OCR
Over the past month we’ve continued to be blessed with high-quality open-source OCR models.
A few trends have appeared:
- Use synthetic data to create artificial documents, which lets a model learn from perfect ground truth.
- Build layout detection and recognition into the same pipeline (for example, where do headers go and what’s the reading order?)
- Take an existing VLM (vision-language model) and fine-tune it directly for OCR-like tasks.
All of the following have either all of, or at least one of, the above qualities:
olmOCR-2 uses real documents to extract layout and content and then turns these into HTML renderings for training an OCR model. olmOCR-2 also open-sources model weights, training data, training code, inference code and comes with an open license. Source: olmOCR-2 paper.
- LightOnOCR-1B focuses on end-to-end pipelines and speed. The model is capable of handling 5.71 pages per second on a single H100 GPU, which equals roughly 493,000 pages per day for less than ~$0.01 per 1,000 pages.
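That pages-per-day figure checks out with simple arithmetic:

```python
pages_per_second = 5.71                   # reported single-H100 throughput
seconds_per_day = 24 * 60 * 60            # 86,400 seconds
pages_per_day = pages_per_second * seconds_per_day
print(f"{pages_per_day:,.0f} pages/day")  # 493,344 pages/day, i.e. ~493,000
```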
- PaddleOCR-VL-0.9B combines a NaViT (native resolution ViT) vision encoder with a lightweight ERNIE-4.5-0.3B language decoder. The model supports 109 different languages.
Architecture of the PaddleOCR-VL-0.9B model, which combines a Vision Encoder (400M parameters), an MLP connector and an LLM decoder (300M parameters). Source: PaddleOCR-VL paper.
- Chandra is a 9B parameter OCR model which can output HTML, JSON and markdown. It supports form reconstruction with checkboxes and also supports 40+ languages. On olmOCR-Bench, Chandra is on par with olmOCR-2.
- Nanonets-OCR2-3B is an OCR model capable of extracting LaTeX formulas, extracting images (with description text) between tags, signature detection, watermark detection, flow charts (as mermaid code) and more. For a very comprehensive extraction, or if your documents are image, checkbox and flow chart heavy, you might want to try Nanonets-OCR2.
- DeepSeek-OCR is a powerful OCR model but also a very efficient one. The paper explores using vision as compression. As in, how good are the results you can get when you continually lower the vision token usage? The DeepSeek researchers found that even with a compression ratio of 10x (10x fewer vision tokens compared to pure text tokens), you can get a 97% recovery rate. And with a compression rate of 20x, the OCR accuracy is still ~60%. Sam Witteveen has a great video breakdown of the paper.
The DeepSeek-OCR paper found you can get ~97% precision with 10x fewer vision tokens than text tokens. This allows DeepSeek-OCR to get excellent results despite using fewer tokens than other models. Source: DeepSeek-OCR paper.
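To make the compression ratio concrete (the per-page token count below is hypothetical; only the 10x/97% and 20x/~60% figures come from the paper):

```python
def compression_ratio(text_tokens, vision_tokens):
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

text_tokens = 1000  # hypothetical token count for one page of text
print(compression_ratio(text_tokens, 100))  # 10.0 -> ~97% recovery per the paper
print(compression_ratio(text_tokens, 50))   # 20.0 -> ~60% accuracy per the paper
```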
With all of these OCR models coming out, naturally, you might ask: which OCR model should you use?
Based on benchmarks alone, each performs quite well on various tasks.
However, as always, if you have a specific task in mind, it’s best to try out these models on your own datasets and see which is best for your use case.
Open-source VLMs
A fleet of open-source VLMs hit the market over the past month. Many of them are starting to include their open-source data.
- Apriel-1.5-15B-Thinker is a multimodal reasoning model which has undergone text-only supervised fine-tuning (no reinforcement learning). It performs on par with Gemini 2.5 Flash on the Artificial Analysis Intelligence Index. They start with the Pixtral-12B-Base-2409 model and upscale it to 15B parameters via depth scaling. See the paper for more.
- The Qwen3-VL family expands to 2B, 4B, 8B and 32B Instruct and Thinking variants.
- LLaVA-OneVision-1.5 is a collection of open-weight, open-data (a mix of ImageNet-21k, LAION-CN, DataComp-1B, COYO-700M, SA-1B and more) and open-training VLMs. The 8B model is on par with or better than Qwen2.5-VL-7B (a very strong open-source VLM). Get the code on GitHub, read the paper, try the demo. The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
- Bee-8B-SFT and Bee-8B-RL are VLMs trained on open data (though be sure to check the licenses, as some of it is non-commercial) and are competitive with models such as Qwen2.5-VL-7B.
- ByteDance update the Sa2VA series of models (these models combine SAM2 with a VLM to enable segmentation and object detection with language) with InternVL3, Qwen2.5-VL and Qwen3-VL backbones.
- SANSA enables adapting the SAM2 model to perform few-shot segmentation thanks to an adaptation module (a small number of parameters). This lets you give a reference image with a segmentation mask and then have the same segmentation mask labelled in a new image (e.g. give a single image with one mask of a cat and have the cat in the next image labelled with a mask). See the demo notebook to try it on your own images.
Small LLMs are getting higher
- IBM Granite release Granite 4.0 (3B, 7B, 32B parameters) and Granite 4.0 Nano language models (1B and 350M parameters). All models are under the Apache 2.0 license. Models in the Nano series perform on the Pareto frontier for their size, outperforming Gemma3-270M (for the Granite 4.0 350M model) as well as Qwen3-1.7B (for the Granite 4.0 1B model). See the blog post for more.
Granite 4.0 Nano models perform the best across aggregated benchmarks for their size. Source: Granite 4.0 Nano blog post.
- Facebook release MobileLLM-Pro (non-commercial license), a 1B parameter model with INT4 quantization (1.3% performance loss from the base model) which allows it to run fast on small devices and even CPUs.
A few cool issues
- Open Code is an open-source version of Claude Code allowing you to bring any model or provider right into the terminal or your editor of choice.
- handy.computer is a free and open-source speech-to-text app which can run speech-to-text in any text field. Under the hood it runs the Whisper models by OpenAI. The code is fully available so if you need to extend it for your own preferences, you can.
- Emu-3.5 is an open-source image generation/editing model on par with Gemini 2.5 Flash Image (Nano Banana).
- OpenAI release `gpt-oss-safeguard-120b` and `gpt-oss-safeguard-20b` for safety-focused text classification (e.g. given a policy, they can classify whether an input is safe to use in your system or not). A workflow here could be to use these models to label a corpus of samples and then fine-tune a smaller text classification model like Ettin to repeat the task at scale.
- RICE-ViT (Region-Aware Cluster Discrimination) is a vision encoder which bakes region-level and OCR information into its weights. For example, the model captures object and OCR semantics in the same representation. When used as a vision encoder for VLM training, the RICE-ViT models perform favourably against other vision encoders such as SigLIP2. See the paper for more.
Research
- The SAM 3 (Segment Anything 3) paper gets posted to OpenReview in time for ICLR 2026. It’s titled SAM 3: Segment Anything with Concepts. Now, it’s not 100% confirmed this is Meta’s upgrade to SAM 2 (since the paper is still in review) but it definitely fits the criteria of being the follow-up. The new version of SAM will allow segmentation with text-based concepts, e.g. “dogs” or “cars”, and the model will segment objects in the image related to those concepts. Some highlights of the paper for me were the data engine and the improvements from using humans in the loop (a very common trend among large foundation models). Once the model weights are released (if they are), I’ll be sure to include them in a future ML Monthly issue.
SAM 3 allows using text-based or visual-based input prompts to a model to obtain segmentation masks. In training, the data is automatically labelled and then verified by either AI or human verifiers in a loop fashion. Source: SAM 3 paper.
Talks
See you next month!
What a big month for the ML world in October!
As always, let me know if there’s anything you think should be included in a future post.
In the meantime, keep learning, keep creating, keep dancing.
See you next month,
Daniel
By the way, I’m also an instructor with Zero To Mastery Academy teaching people Machine Learning & AI in the most efficient way possible. You can see a few of our courses below or check out all Zero To Mastery courses.

