December 2, 2025
Misraj Team
Wasm is Misraj’s first Arabic framework for processing Common Crawl into a structured multimodal web dataset that preserves document layout and aligns text with images. It provides a flexible foundation for training both language models and vision-language models.
At Misraj, we believe that the next generation of Arabic language and vision models requires high-quality data that reflects how Arabic actually appears on the web, both in text and in visual structure. To serve this goal, we developed Wasm, the first Arabic framework that processes Common Crawl data into a multimodal dataset where text and images are preserved in a coherent, document-level structure that is directly usable for training both LLMs and vision-language models.
The core problem we set out to address is that most existing pipelines reduce web pages to flat text. When structure, headings, section boundaries, and image relationships are discarded, models lose access to the real layout of documents. For multimodal and document understanding systems, this structure is not cosmetic but essential. It tells the model what is a title, what is a caption, which images belong to which paragraphs, and how information flows through a page.
Wasm introduces an Arabic-centric pipeline inspired by OBELICS, but significantly adapted to the characteristics of Arabic web content. The pipeline begins by filtering Common Crawl snapshots for pages that contain Arabic, then standardizes the HTML while aggressively removing boilerplate elements such as navigation menus, ads, and unrelated widgets. At the same time, it preserves meaningful structure, including headings, lists, sections, and the positions of images relative to text.
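As a rough illustration of this stage, the sketch below keeps a page only if a meaningful share of its characters are Arabic and strips common boilerplate nodes while leaving headings, lists, paragraphs, and images in place. It is a minimal sketch assuming BeautifulSoup; the tag names, class-hint tokens, and the 30 percent Arabic-character threshold are illustrative assumptions, not Wasm's actual rules.

```python
import re
from bs4 import BeautifulSoup

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
BOILERPLATE_TAGS = ["nav", "aside", "footer", "script", "style", "form", "iframe"]
BOILERPLATE_HINTS = {"menu", "ad", "ads", "banner", "sidebar", "share", "comment", "widget"}

def is_arabic_page(html: str, min_ratio: float = 0.3) -> bool:
    """Keep a page only if a meaningful share of its characters are Arabic."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    if not text:
        return False
    return len(ARABIC_RE.findall(text)) / len(text) >= min_ratio

def looks_like_boilerplate(tag) -> bool:
    """Heuristic: class tokens such as 'menu' or 'ads' usually mark non-content widgets."""
    tokens = re.split(r"[\s_-]+", " ".join(tag.get("class", [])).lower())
    return any(tok in BOILERPLATE_HINTS for tok in tokens)

def clean_html(html: str) -> BeautifulSoup:
    """Drop boilerplate nodes while keeping headings, lists, paragraphs, and images in order."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()
    for tag in soup.find_all(attrs={"class": True}):
        if not tag.decomposed and looks_like_boilerplate(tag):  # skip nodes already removed with an ancestor
            tag.decompose()
    return soup
```

A production pipeline would typically operate on WARC records from a Common Crawl snapshot rather than raw HTML strings, but the basic shape of the stage is the same.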
For quality control, we rely on an Arabic-specific perplexity model based on KenLM, trained on carefully curated Modern Standard Arabic and dialectal data. This allows us to distinguish coherent human-written Arabic from low-quality or machine-generated text while still retaining legitimate linguistic variation. On top of that, we apply fine-grained deduplication at the HTML node level using a sequence alignment strategy similar to Needleman–Wunsch. This avoids discarding entire pages just because they share repeated footers or boilerplate, and instead removes only the redundant fragments.
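To make the node-level deduplication idea concrete, here is a toy sketch in the spirit of that alignment step: two pages are treated as sequences of node fingerprints, globally aligned, and nodes that align to identical content (such as a shared footer) are flagged as redundant while unique content survives. The scoring values and the fingerprint function are illustrative assumptions, not Wasm's actual parameters, and the KenLM perplexity filter is a separate component not shown here.

```python
import hashlib

def fingerprint(node_text: str) -> str:
    """Stable fingerprint for an HTML node's whitespace-normalized text."""
    return hashlib.md5(" ".join(node_text.split()).encode("utf-8")).hexdigest()

def align_shared_nodes(a, b, match=2, mismatch=-1, gap=-1):
    """Globally align two node-fingerprint sequences (Needleman-Wunsch style)
    and return the indices in `b` that align to identical nodes in `a`."""
    n, m = len(a), len(b)
    # Dynamic-programming score matrix for global alignment.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back and collect positions in `b` matched to equal nodes in `a`.
    shared, i, j = set(), n, m
    while i > 0 and j > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            if a[i - 1] == b[j - 1]:
                shared.add(j - 1)
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return shared

# Example: shared header/footer nodes are flagged, unique article bodies are kept.
page_a = [fingerprint(t) for t in ["Site header", "Article one body", "Site footer"]]
page_b = [fingerprint(t) for t in ["Site header", "Article two body", "Site footer"]]
print(align_shared_nodes(page_a, page_b))  # -> {0, 2}
```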
The resulting dataset keeps the natural order of text and images, with a representation that can be rendered as structured Markdown, as text-image pairs, or as fully interleaved sequences. This flexibility makes Wasm suitable for pre-training language models and vision-language models, and for building benchmarks that require realistic Arabic documents rather than synthetic examples.
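To illustrate what those views can look like, the sketch below renders a single hypothetical document record, stored as an ordered list of text and image blocks, into structured Markdown, an interleaved sequence, and simple text-image pairs. The record schema and field names are made up for the example and are not Wasm's published format.

```python
# A hypothetical document record: blocks in reading order, images kept in place.
doc = [
    {"type": "heading", "level": 2, "text": "عنوان المقال"},            # "Article title"
    {"type": "text", "text": "الفقرة الأولى من المقال."},               # "First paragraph"
    {"type": "image", "url": "https://example.com/figure1.jpg", "alt": "وصف الصورة"},
    {"type": "text", "text": "فقرة تشرح الصورة أعلاه."},                # "Paragraph about the image"
]

def to_markdown(blocks):
    """Structured Markdown view: headings, paragraphs, and inline images in order."""
    lines = []
    for b in blocks:
        if b["type"] == "heading":
            lines.append("#" * b["level"] + " " + b["text"])
        elif b["type"] == "text":
            lines.append(b["text"])
        elif b["type"] == "image":
            lines.append(f"![{b.get('alt', '')}]({b['url']})")
    return "\n\n".join(lines)

def to_interleaved(blocks):
    """Interleaved view: text spans and image placeholders in reading order."""
    return [
        {"image": b["url"]} if b["type"] == "image" else {"text": b["text"]}
        for b in blocks
    ]

def to_text_image_pairs(blocks):
    """Pair each image with the nearest preceding text block as a simple caption proxy."""
    pairs, last_text = [], ""
    for b in blocks:
        if b["type"] in ("heading", "text"):
            last_text = b["text"]
        elif b["type"] == "image" and last_text:
            pairs.append((last_text, b["url"]))
    return pairs

print(to_markdown(doc))
print(to_interleaved(doc))
print(to_text_image_pairs(doc))
```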
We have released the core pipeline together with a sample of the processed data to support reproducible research and to encourage the community to build upon this resource. Parts of Wasm are already in production inside Misraj, powering models such as Baseer that specialize in understanding and extracting content from Arabic documents.
For us at Misraj, Wasm is more than a data pipeline. It is a foundational layer for Arabic AI: a way to capture the richness, diversity, and structure of Arabic web content so that future models can learn not only the words, but also how information is actually presented to real users.
Research paper link: