December 8, 2025
Misraj Team
Baseer is Misraj's vision-language model for converting Arabic document images and PDFs into structured Markdown while preserving layout. Trained on 500k pages, it outperforms leading open source and commercial systems on Arabic document OCR.
At Misraj we consider high-quality Arabic document understanding a core requirement for the next generation of Arabic AI systems. A large portion of valuable Arabic content still lives in scanned books, old PDFs, and document images that are difficult to search, analyze, or reuse. The task is even harder for Arabic, with its cursive script, right-to-left direction, diacritics, and diverse fonts and page layouts.
Baseer is our response to this challenge. It is a vision-language model fine-tuned specifically for Arabic document OCR, with the goal of converting document images into clean, well-structured Markdown that preserves both text and layout. Baseer is built on top of the Qwen2.5-VL-3B-Instruct model and adapted to Arabic documents through a targeted fine-tuning strategy on a large, carefully designed dataset.
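To make the input and output concrete, here is a minimal inference sketch using the Hugging Face Transformers API for the Qwen2.5-VL family that Baseer is built on. The checkpoint name and prompt wording are illustrative placeholders, not the published release.

```python
# Minimal inference sketch for a Qwen2.5-VL-family OCR model.
# The checkpoint name and prompt below are illustrative placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "misraj/baseer"  # hypothetical checkpoint name

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("scanned_page.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this Arabic document page to Markdown, preserving the layout."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=2048)

# Strip the prompt tokens before decoding the generated Markdown.
generated = output_ids[:, inputs["input_ids"].shape[1]:]
markdown = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(markdown)
```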
The problem we wanted to solve
Most OCR solutions treat document understanding as a plain text extraction problem. They ignore headings, columns, tables, footers, and side content. Many struggle with right-to-left layouts, ligatures, and diacritics, and most are optimized for Latin scripts. As a result, they fail to capture the true structure of real Arabic documents and often produce text that is hard to use in downstream applications.
We wanted a model that sees the entire page, understands the layout, reads the text accurately, and outputs a representation that is immediately usable for training, indexing, and intelligent document workflows.
What we built
Baseer is a decoder-focused vision-language model for Arabic document-to-Markdown OCR. After evaluating several open-source vision-language models, we selected Qwen2.5-VL-3B-Instruct as the base model because it showed the most robust behavior on Arabic in our qualitative tests. During fine-tuning we freeze the vision encoder and update only the language decoder, so the model retains strong visual capabilities while specializing its language component for Arabic document structure and typography.
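A minimal sketch of that selective fine-tuning setup, assuming the public Qwen2.5-VL implementation in Hugging Face Transformers; the actual trainer, data loader, and hyperparameters are omitted.

```python
# Freeze the vision encoder and leave the language decoder trainable.
# Parameter names follow the public Qwen2.5-VL implementation in
# transformers, where vision-tower parameters contain "visual".
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
)

frozen, trainable = 0, 0
for name, param in model.named_parameters():
    if "visual" in name:          # vision encoder stays frozen
        param.requires_grad = False
        frozen += param.numel()
    else:                         # language decoder is fine-tuned
        trainable += param.numel()

print(f"frozen: {frozen / 1e6:.0f}M params, trainable: {trainable / 1e6:.0f}M params")
```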
Our data and training strategy
To train Baseer we constructed a hybrid dataset of 500,000 image-text pairs. Of these, 300,000 pairs come from a synthetic pipeline that starts from high-quality Arabic Markdown documents. These documents are converted to HTML, then to Word, then to PDF, and finally rendered as high-resolution page images. Along this pipeline we randomize fonts, page sizes, colors, column layouts, margins, and other formatting parameters, and apply rich augmentations that simulate real-world noise such as blur, aging, shadows, and perspective distortion.
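As an illustration, the degradation stage of such a pipeline might look like the sketch below, written with the albumentations library. The transform choices and probabilities are illustrative, not the ones used to build the actual dataset, and the Markdown-to-HTML-to-Word-to-PDF rendering steps are not shown.

```python
# A plausible page-degradation stage for the synthetic pipeline: rendered
# pages pass through randomized noise that mimics scans and photos.
import albumentations as A
import cv2

degrade = A.Compose([
    A.GaussianBlur(blur_limit=(3, 5), p=0.3),   # soft focus / scanner blur
    A.RandomBrightnessContrast(p=0.4),          # uneven lighting and aging
    A.RandomShadow(p=0.3),                      # shadows falling across the page
    A.Perspective(scale=(0.02, 0.06), p=0.3),   # camera-style perspective distortion
    A.GaussNoise(p=0.3),                        # sensor and paper grain
    A.ImageCompression(p=0.3),                  # re-compression artifacts
])

page = cv2.imread("rendered_page.png")          # output of the rendering pipeline
noisy = degrade(image=page)["image"]
cv2.imwrite("rendered_page_noisy.png", noisy)
```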
The remaining 200,000 pairs come from real-world Arabic books, magazines, educational materials, and scientific papers. We select pages with complex layouts that include tables, figures, multi-column text, and marginal notes. We obtain initial transcriptions using a strong multimodal model, then have human experts validate and correct representative subsets to ensure both textual and structural fidelity. All target texts are formatted as Markdown, with tables in HTML and special tags for elements such as page numbers, watermarks, and embedded images.
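To give a sense of the target format, a single training target might look roughly like the example below. The exact tag names are not specified in this post, so the ones shown here are invented placeholders.

```python
# Rough shape of one training target: a Markdown body, tables kept as HTML,
# and placeholder tags for non-text elements. Tag names are invented for
# illustration and may differ from those used in the released data.
target_markdown = """\
<page_number>12</page_number>

## العنوان الرئيسي

فقرة نصية قصيرة تمثل متن الصفحة كما يظهر في المستند الأصلي.

<table>
  <tr><th>البند</th><th>القيمة</th></tr>
  <tr><td>المجموع</td><td>150</td></tr>
</table>

<image>شكل 1: مخطط توضيحي</image>
"""

sample = {"image_path": "pages/000123.png", "target": target_markdown}
```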
In parallel we created the Misraj DocOCR benchmark, which consists of 400 diverse Arabic document images with expert-verified ground truth. We also corrected and improved the KITAB PDF-to-Markdown benchmark by removing hallucinations and fixing alignment issues, and we released this corrected version for the community.
Our results compared to other systems
We evaluated Baseer on Misraj DocOCR and on the corrected KITAB benchmark and compared it against a range of open-source and commercial systems, including Gemini 2.5 Pro, Azure AI Document Intelligence, Dots OCR, Qari, Qwen VL, and Gemma VL. The evaluation covered word error rate, character error rate, BLEU, ChrF, and two structure-aware metrics, TEDS and MARS, which measure how well a model preserves the document layout.
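For reference, the text-level metrics can be computed with standard libraries as in the sketch below, assuming jiwer and sacrebleu are available; TEDS and MARS require structure-aware comparison of predicted and reference layouts and are not reproduced here.

```python
# Text-level OCR metrics: WER and CER with jiwer, BLEU and ChrF with
# sacrebleu. Structure-aware metrics (TEDS, MARS) are not shown.
import jiwer
import sacrebleu

references = ["النص المرجعي للصفحة"]     # expert-verified ground truth
predictions = ["النص الناتج من النموذج"]  # model output, same order

wer = jiwer.wer(references, predictions)
cer = jiwer.cer(references, predictions)
bleu = sacrebleu.corpus_bleu(predictions, [references]).score
chrf = sacrebleu.corpus_chrf(predictions, [references]).score

print(f"WER={wer:.3f}  CER={cer:.3f}  BLEU={bleu:.1f}  ChrF={chrf:.1f}")
```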
Comparison of Baseer with other models on the Misraj DocOCR benchmark:
| Model | Word Error Rate | Character Error Rate | BLEU | ChrF | TEDS (structure accuracy) | MARS (layout score) | Short note |
|---|---|---|---|---|---|---|---|
| Baseer (Misraj) | Best | Excellent | High | High | Best | Best | Strong balance between text accuracy and layout understanding and best overall scores |
| Gemini 2.5 Pro | Good | Good | Excellent | Excellent | Medium | Good | Very strong linguistic ability yet weaker structural understanding on Arabic pages |
| Azure Document Intelligence | Very good | Good | Good | Good | Medium | Medium | Stable generic OCR but loses structure on complex Arabic layouts |
| Dots OCR | Medium | Medium | Medium | Medium | Poor | Poor | Not sufficiently optimized for Arabic script |
| Qari OCR | Medium | Good | Low | Medium | Poor | Poor | Arabic focused but struggles with multi column pages and tables |
| Qwen VL | Good | Good | Good | Good | Medium | Medium | Solid general vision language model but behind Baseer on tables and structural fidelity |
| Gemma VL | Weak | Weak | Low | Low | Poor | Poor | Does not handle Arabic and right to left layouts reliably |
These results show that Baseer achieves the lowest word error rate and the highest structural scores, outperforming both open-source and commercial systems in understanding Arabic document layout while maintaining competitive lexical quality.
Why Baseer matters to Misraj
For us at Misraj, Baseer is a foundational component of the Arabic AI stack. It enables applications that require reliable access to the content and structure of Arabic documents, such as intelligent archives, legal and governmental automation, educational content transformation, and advanced document-aware search. It also demonstrates that domain-specific adaptation of a strong multimodal base model, with the right data and benchmarks, can outperform much larger closed models.
Where Baseer can be used
Baseer is designed to integrate into many products and platforms, including:
Large-scale digitization of Arabic archives and libraries
Smart intake and analysis of government and legal forms and contracts
High-quality indexing and search over scanned Arabic PDFs and reports
Tools that help researchers extract tables, figures, and references from books and papers
Generating realistic training data for future Arabic language and vision-language models
Conclusion
Baseer shows that it is possible to reach state-of-the-art performance for Arabic document OCR by combining a capable base model with a carefully engineered Arabic dataset, a dedicated benchmark, and a focused fine-tuning strategy. At Misraj we see Baseer and Misraj DocOCR as key building blocks for future Arabic document understanding systems and for a new generation of Arabic-centric AI applications.
Research paper link:
Contact us to discover how Misraj's technologies can transform the way your organization works.