December 2, 2025
Misraj Team
Wasm is Misraj’s first Arabic framework for processing Common Crawl into a structured multimodal web dataset that preserves document layout and aligns text with images. It provides a flexible foundation for training both language models and vision-language models.
At Misraj, we believe that the next generation of Arabic language and vision models requires high-quality data that reflects how Arabic actually appears on the web, both in text and in visual structure. To serve this goal, we developed Wasm, the first Arabic framework that processes Common Crawl data into a multimodal dataset where text and images are preserved in a coherent, document-level structure that is directly usable for training both LLMs and vision-language models.
The core problem we set out to address is that most existing pipelines reduce web pages to flat text. When structure, headings, section boundaries, and image relationships are discarded, models lose access to the real layout of documents. For multimodal and document understanding systems, this structure is not cosmetic but essential. It tells the model what is a title, what is a caption, which images belong to which paragraphs, and how information flows through a page.
Wasm introduces an Arabic-centric pipeline inspired by OBELICS, but significantly adapted to the characteristics of Arabic web content. The pipeline begins by filtering Common Crawl snapshots for pages that contain Arabic, then standardizes the HTML while aggressively removing boilerplate elements such as navigation menus, ads, and unrelated widgets. At the same time, it preserves meaningful structure, including headings, lists, sections, and the positions of images relative to text.
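As a rough illustration of this stage, the sketch below keeps a page only if a meaningful share of its characters are Arabic and strips common boilerplate nodes while leaving headings, lists, paragraphs, and images in place. It is a minimal sketch assuming BeautifulSoup; the tag names, class-hint tokens, and the 30 percent Arabic-character threshold are illustrative assumptions, not Wasm's actual rules.

```python
import re
from bs4 import BeautifulSoup

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
BOILERPLATE_TAGS = ["nav", "aside", "footer", "script", "style", "form", "iframe"]
BOILERPLATE_HINTS = {"menu", "ad", "ads", "banner", "sidebar", "share", "comment", "widget"}

def is_arabic_page(html: str, min_ratio: float = 0.3) -> bool:
    """Keep a page only if a meaningful share of its characters are Arabic."""
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    if not text:
        return False
    return len(ARABIC_RE.findall(text)) / len(text) >= min_ratio

def looks_like_boilerplate(tag) -> bool:
    """Heuristic: class tokens such as 'menu' or 'ads' usually mark non-content widgets."""
    tokens = re.split(r"[\s_-]+", " ".join(tag.get("class", [])).lower())
    return any(tok in BOILERPLATE_HINTS for tok in tokens)

def clean_html(html: str) -> BeautifulSoup:
    """Drop boilerplate nodes while keeping headings, lists, paragraphs, and images in order."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(BOILERPLATE_TAGS):
        tag.decompose()
    for tag in soup.find_all(attrs={"class": True}):
        if not tag.decomposed and looks_like_boilerplate(tag):  # skip nodes already removed with an ancestor
            tag.decompose()
    return soup
```

A production pipeline would typically operate on WARC records from a Common Crawl snapshot rather than raw HTML strings, but the basic shape of the stage is the same.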
For quality control, we rely on an Arabic-specific perplexity model based on KenLM, trained on carefully curated Modern Standard Arabic and dialectal data. This allows us to distinguish coherent human-written Arabic from low-quality or machine-generated text while still retaining legitimate linguistic variation. On top of that, we apply fine-grained deduplication at the HTML node level using a sequence alignment strategy similar to Needleman–Wunsch. This avoids discarding entire pages just because they share repeated footers or boilerplate, and instead removes only the redundant fragments.
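To make the node-level deduplication idea concrete, here is a toy sketch in the spirit of that alignment step: two pages are treated as sequences of node fingerprints, globally aligned, and nodes that align to identical content (such as a shared footer) are flagged as redundant while unique content survives. The scoring values and the fingerprint function are illustrative assumptions, not Wasm's actual parameters, and the KenLM perplexity filter is a separate component not shown here.

```python
import hashlib

def fingerprint(node_text: str) -> str:
    """Stable fingerprint for an HTML node's whitespace-normalized text."""
    return hashlib.md5(" ".join(node_text.split()).encode("utf-8")).hexdigest()

def align_shared_nodes(a, b, match=2, mismatch=-1, gap=-1):
    """Globally align two node-fingerprint sequences (Needleman-Wunsch style)
    and return the indices in `b` that align to identical nodes in `a`."""
    n, m = len(a), len(b)
    # Dynamic-programming score matrix for global alignment.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back and collect positions in `b` matched to equal nodes in `a`.
    shared, i, j = set(), n, m
    while i > 0 and j > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            if a[i - 1] == b[j - 1]:
                shared.add(j - 1)
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return shared

# Example: shared header/footer nodes are flagged, unique article bodies are kept.
page_a = [fingerprint(t) for t in ["Site header", "Article one body", "Site footer"]]
page_b = [fingerprint(t) for t in ["Site header", "Article two body", "Site footer"]]
print(align_shared_nodes(page_a, page_b))  # -> {0, 2}
```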
The resulting dataset keeps the natural order of text and images, with a representation that can be rendered as structured Markdown, as text-image pairs, or as fully interleaved sequences. This flexibility makes Wasm suitable for pre-training language models and vision-language models, and for building benchmarks that require realistic Arabic documents rather than synthetic examples.
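To illustrate what those views can look like, the sketch below renders a single hypothetical document record, stored as an ordered list of text and image blocks, into structured Markdown, an interleaved sequence, and simple text-image pairs. The record schema and field names are made up for the example and are not Wasm's published format.

```python
# A hypothetical document record: blocks in reading order, images kept in place.
doc = [
    {"type": "heading", "level": 2, "text": "عنوان المقال"},            # "Article title"
    {"type": "text", "text": "الفقرة الأولى من المقال."},               # "First paragraph"
    {"type": "image", "url": "https://example.com/figure1.jpg", "alt": "وصف الصورة"},
    {"type": "text", "text": "فقرة تشرح الصورة أعلاه."},                # "Paragraph about the image"
]

def to_markdown(blocks):
    """Structured Markdown view: headings, paragraphs, and inline images in order."""
    lines = []
    for b in blocks:
        if b["type"] == "heading":
            lines.append("#" * b["level"] + " " + b["text"])
        elif b["type"] == "text":
            lines.append(b["text"])
        elif b["type"] == "image":
            lines.append(f"![{b.get('alt', '')}]({b['url']})")
    return "\n\n".join(lines)

def to_interleaved(blocks):
    """Interleaved view: text spans and image placeholders in reading order."""
    return [
        {"image": b["url"]} if b["type"] == "image" else {"text": b["text"]}
        for b in blocks
    ]

def to_text_image_pairs(blocks):
    """Pair each image with the nearest preceding text block as a simple caption proxy."""
    pairs, last_text = [], ""
    for b in blocks:
        if b["type"] in ("heading", "text"):
            last_text = b["text"]
        elif b["type"] == "image" and last_text:
            pairs.append((last_text, b["url"]))
    return pairs

print(to_markdown(doc))
print(to_interleaved(doc))
print(to_text_image_pairs(doc))
```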
We have released the core pipeline together with a sample of the processed data to support reproducible research and to encourage the community to build upon this resource. Parts of Wasm are already in production inside Misraj, powering models such as Baseer that specialize in understanding and extracting content from Arabic documents.
For us at Misraj, Wasm is more than a data pipeline. It is a foundational layer for Arabic AI: a way to capture the richness, diversity, and structure of Arabic web content so that future models can learn not only the words, but also how information is actually presented to real users.
Research paper link: