Wednesday, 17 March, 2021 UTC


Introduction

At some point in your software development path, you'll have to convert files from one format to another.
DOCX, the format used by Microsoft Word, is one of the most common document formats, and sometimes we'd like to convert Word documents into HTML.
This can easily be achieved via the Mammoth package. It's an easy, efficient, and fast library used to convert DOCX files to HTML. In this article, we'll learn how to use Mammoth in Python to convert DOCX to HTML.

Installing Mammoth

As a good practice, remember to have your virtual environment ready and activated before the installation:
$ python3 -m venv myenv
$ . myenv/bin/activate
Let's then install Mammoth with pip:
$ pip3 install mammoth
This tutorial uses Mammoth version 1.4.15. Here's a sample document you can use throughout this tutorial. If you have a document to convert, make sure that it's a .docx file!
Now that you're ready to go, let's get started with extracting the text and writing that as HTML.

Extract the Raw Text of a DOCX File

Preserving the formatting while converting to HTML is one of the best features of Mammoth. However, if you just need the text of the DOCX file, you'll be pleasantly surprised at how few lines of code are needed.
You can use the extract_raw_text() method to retrieve it:
import mammoth

input_filename = "file-sample_100kB.docx"  # Path to your .docx file

with open(input_filename, "rb") as docx_file:
    result = mammoth.extract_raw_text(docx_file)

text = result.value  # The raw text
with open("output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(text)
Note that this method does not return a valid HTML document. It only returns the text on the page, which is why we save it with the .txt extension. If you need to keep the layout or formatting, you'll want to extract the HTML contents instead.

Convert DOCX to HTML with Custom Style Mapping

By default, Mammoth converts your document into HTML but it does not give you a valid HTML page. While web browsers can display the content, it is missing an <html> tag to encapsulate the document, and a <body> tag to contain the document. How you choose to integrate its output is up to you. Let's say you're using a web framework that has templates. You'd likely define a template to display a Word Document and load Mammoth's output inside the template's body.
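As a rough illustration of that idea, the sketch below wraps an HTML fragment in a minimal page using only the standard library. The names wrap_fragment, PAGE_TEMPLATE, title, and head_extra are hypothetical and not part of Mammoth; the fragment argument stands in for the result.value string that convert_to_html() produces.

```python
from html import escape
from string import Template

# Hypothetical page skeleton; a real project would likely use its framework's templates.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>$title</title>
$head_extra
</head>
<body>
$body
</body>
</html>
""")


def wrap_fragment(fragment, title="Converted document", head_extra=""):
    """Return a complete HTML page containing the given HTML fragment.

    `fragment` is assumed to be trusted HTML (for example Mammoth's output)
    and is inserted as-is. Only the title is escaped.
    """
    return PAGE_TEMPLATE.substitute(
        title=escape(title),
        head_extra=head_extra,
        body=fragment,
    )


page = wrap_fragment("<h1>Hello</h1><p>Converted text</p>", title="Demo & test")
```

The head_extra parameter is one possible place for a <style> block or a stylesheet <link>, so the styling stays in the page head rather than being concatenated in front of the fragment.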
Mammoth is flexible not only in how you use its output but also in how you create it. In particular, we have a lot of options when we want to style the HTML we produce. We map styles by matching each DOCX formatting rule to the equivalent (or closest available) HTML element, optionally with CSS classes attached.
To see what styles your DOCX file has, you have two options:
  1. You can open your DOCX file with MS Word and check the Styles toolbar.
  2. You can dig into the XML files by opening your DOCX file with an archive manager, navigating to /word/styles.xml, and locating your styles there.
The second option can be used by those who don't have access to MS Word or an alternative word processor that can interpret and display the styles.
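The second option can also be scripted, since a DOCX file is a ZIP archive. The sketch below reads word/styles.xml with the standard library and lists each style's type and name. list_style_names is a hypothetical helper, not a Mammoth function, and the demo runs on a tiny in-memory archive that contains only a styles file, so it is not a complete DOCX.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/styles.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_style_names(docx_file):
    """Return (type, name) pairs for the styles defined in a DOCX file.

    `docx_file` can be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))

    styles = []
    for style in root.iter("{%s}style" % W_NS):
        name = style.find("{%s}name" % W_NS)
        if name is not None:
            styles.append((style.get("{%s}type" % W_NS), name.get("{%s}val" % W_NS)))
    return styles


# Demo on an in-memory archive that contains only word/styles.xml
# (not a complete DOCX, just enough to exercise the function).
sample_xml = (
    '<w:styles xmlns:w="%s">'
    '<w:style w:type="paragraph" w:styleId="Heading1"><w:name w:val="Heading 1"/></w:style>'
    '<w:style w:type="character" w:styleId="Strong"><w:name w:val="Strong"/></w:style>'
    '</w:styles>' % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/styles.xml", sample_xml)
buffer.seek(0)

found_styles = list_style_names(buffer)
```

With a real document you would pass its path instead, for example list_style_names("file-sample_100kB.docx").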
Mammoth already has some of the most common style maps covered by default. For instance, the Heading1 docx style is mapped to the <h1> HTML element, bold is mapped to the <strong> HTML element, etc.
We can also use Mammoth to customize the document's styles while mapping them. For example, if you want to change all bold occurrences in the DOCX file to italic in the HTML, you can do this:
import mammoth

custom_styles = "b => i"

with open(input_filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=custom_styles)

html = result.value  # The generated HTML fragment
with open("output.html", "w", encoding="utf-8") as html_file:
    html_file.write(html)
In the custom_styles variable, the style on the left comes from the DOCX file, while the one on the right is the HTML element it should be converted to.
Let's say we want to omit the bold formatting altogether (the text itself is kept). We can leave the mapping target blank:
custom_styles = "b => "
Sometimes the document we're porting has many styles to retain. It quickly becomes impractical to create a variable for every style we want to map. Luckily, we can use a triple-quoted, multi-line string to map as many styles as we want in one go, one mapping per line:
custom_styles = """ b => del
                    u => em
                    p[style-name='Heading 1'] => i"""
You may have noticed that the last mapping is a bit different from the others. When mapping styles, we can use square brackets [] with a condition inside them so that only a subset of elements is styled that way.
In our example, p[style-name='Heading 1'] selects paragraphs that have the style name Heading 1. We can also use p[style-name^='Heading'] to select each paragraph whose style name starts with Heading.
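When the list of mappings grows, it can be easier to keep the rules as data and join them into the style map string. The sketch below is one way to do that. build_style_map and the rules list are hypothetical names for this example, and the h2 target is only an illustration.

```python
def build_style_map(rules):
    """Join (document matcher, HTML path) pairs into a style map string.

    Produces one mapping per line in the `matcher => path` syntax.
    """
    return "\n".join("{} => {}".format(matcher, path) for matcher, path in rules)


rules = [
    ("b", "del"),
    ("u", "em"),
    # ^= selects every paragraph whose style name starts with "Heading"
    ("p[style-name^='Heading']", "h2"),
]
custom_styles = build_style_map(rules)
```

The resulting string can be passed to convert_to_html() through the style_map argument, exactly like a hand-written one.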
Style mapping also allows us to map styles to custom CSS classes. By doing so, we can shape the style of the HTML as we like. Let's do an example where we define our basic custom CSS in a triple-quoted string like this:
custom_css = """
    <style>
    .red {
        color: red;
    }
    .underline {
        text-decoration: underline;
    }
    ul li {
        list-style-type: circle;
    }
    table, th, td {
        border: 1px solid black;
    }
    </style>
    """
Now we can update our mapping to reference the CSS classes we've defined in the <style> block:
custom_styles = """ b => b.red
                    u => em.red
                    p[style-name='Heading 1'] => h1.red.underline"""
Now all we need to do is merge the CSS and the HTML together, where html holds the result.value returned by convert_to_html():
edited_html = custom_css + html
If your DOCX file has any of those elements, you will be able to see the results.
Now that we know how to map styles, let's use a more well-known CSS framework (along with the JS) to give our HTML a better look and practice a more likely real-life scenario.

Mapping Styles With Bootstrap (or Any Other UI Framework)

Just like we did with the custom_css, we need to ensure that the CSS is loaded with the HTML. We need to add the Bootstrap file URI or CDN to our HTML:
bootstrap_css = '<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta2/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BmbxuPwQa2lc/FVzBcNJ7UAyJxM6wuqIj61tLrc4wSX0szH/Ev+nYRRuWlolflfl" crossorigin="anonymous">'
bootstrap_js = '<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta2/dist/js/bootstrap.bundle.min.js" integrity="sha384-b5kHyXgcpbZJO/tY9Ul7kGkf1S0CWuKcCD38l8YkeH8z8QjE0GmW1gYU5S9FOnJ0" crossorigin="anonymous"></script>'
We'll also slightly tweak our custom_styles to match our new CSS classes:
custom_styles = """ b => b.mark
                    u => u.initialism
                    p[style-name='Heading 1'] => h1.card
                    table => table.table.table-hover
                    """
In the first line, we're mapping the bold DOCX style to the b HTML element with the class mark, which is the Bootstrap class equivalent of the HTML <mark> tag, used for highlighting part of the text.
In the second line, we're adding the initialism class to the u HTML element, which slightly decreases the font size and transforms the text to uppercase.
In the third line, we're selecting all paragraphs that have the style name Heading 1 and converting them to h1 HTML elements with the Bootstrap class card, which sets multiple style properties such as background color, position, and border for the element.
In the last line, we're converting all tables in our DOCX file to the table HTML element with Bootstrap's table class to give them a new look. We're also making rows highlight on hover by adding the Bootstrap class table-hover.
Like before, we use dot-notation to map multiple classes to the same HTML element, even though the styles come from another source.
Finally, add the Bootstrap CDNs to our HTML:
edited_html = bootstrap_css + html + bootstrap_js
Our HTML is now ready to be shared, with a polished look and feel! Here's the full code for reference:
import mammoth

input_filename = "file-sample_100kB.docx"

custom_styles = """ b => b.mark
                    u => u.initialism
                    p[style-name='Heading 1'] => h1.card
                    table => table.table.table-hover
                    """


bootstrap_css = '<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta2/dist/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BmbxuPwQa2lc/FVzBcNJ7UAyJxM6wuqIj61tLrc4wSX0szH/Ev+nYRRuWlolflfl" crossorigin="anonymous">'
bootstrap_js = '<script src="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta2/dist/js/bootstrap.bundle.min.js" integrity="sha384-b5kHyXgcpbZJO/tY9Ul7kGkf1S0CWuKcCD38l8YkeH8z8QjE0GmW1gYU5S9FOnJ0" crossorigin="anonymous"></script>'


with open(input_filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map = custom_styles)
    html = result.value 

edited_html = bootstrap_css + html + bootstrap_js

output_filename = "output.html"
with open(output_filename, "w", encoding="utf-8") as f:
    f.write(edited_html)  # write() is the right call for a single string
One more point to note: in a real-life scenario, you probably would not add the Bootstrap CSS directly to the HTML content as we did here. Instead, you would load or inject the HTML content into a prepared HTML page that already includes the necessary CSS and JS bundles.
So far you've seen how much flexibility we have to style our output. Mammoth also allows us to modify the content we're converting. Let's take a look at that now.

Dealing With Images We Don't Want Shared

Let's say we'd like to keep the images in our DOCX file out of the converted HTML. convert_to_html() accepts a convert_image argument, which is an image handler function. The handler is called for each image and returns a list of the elements that should be added to the HTML document for it.
Naturally, if we supply a handler that returns an empty list, the images are omitted from the converted page:
def ignore_image(image):
    return []
Now, let's pass that function as a parameter into the convert_to_html() method:
with open(input_filename, "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=custom_styles, convert_image=ignore_image)
    html = result.value
    with open('output.html', 'w') as html_file:
        html_file.write(html)  # Write the converted HTML, now without images
That's it! Mammoth will ignore all the images when generating an HTML file.
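If the goal is to keep the images but store them as separate files rather than drop them, Mammoth's README describes wrapping a handler with mammoth.images.img_element(). In that form the handler receives an image object exposing open() and content_type, and returns a dict of attributes for the <img> element. The sketch below follows that shape. make_image_saver, save_image, and FakeImage are hypothetical names, and FakeImage is only a stand-in so the example runs without a DOCX file.

```python
import io
import os
import tempfile
from contextlib import contextmanager


def make_image_saver(output_dir):
    """Build a handler that writes each image to `output_dir` and returns <img> attributes."""
    counter = {"next": 1}

    def save_image(image):
        # content_type looks like "image/png"; use the part after "/" as the extension
        extension = image.content_type.partition("/")[2] or "bin"
        filename = "image-{}.{}".format(counter["next"], extension)
        counter["next"] += 1
        with image.open() as image_bytes:
            with open(os.path.join(output_dir, filename), "wb") as target:
                target.write(image_bytes.read())
        return {"src": filename}

    return save_image


# Stand-in for the image object passed to a handler, used only for this demo.
class FakeImage:
    content_type = "image/png"

    @contextmanager
    def open(self):
        yield io.BytesIO(b"not really a PNG")


demo_dir = tempfile.mkdtemp()
handler = make_image_saver(demo_dir)
attributes = handler(FakeImage())
```

With Mammoth installed, the handler would be passed as convert_image=mammoth.images.img_element(make_image_saver("imgs")), assuming the imgs folder already exists.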
We've been using Mammoth programmatically with Python so far. Mammoth is also a CLI tool, which gives us another interface for DOCX to HTML conversions. Let's see how that works in the next section.

Convert DOCX to HTML Using Command Line Tool

File conversion with Mammoth, using the CLI, typically looks like this:
$ mammoth path/to/input_filename.docx path/to/output.html
If you want to separate the images from the HTML, you can specify an output folder:
$ mammoth file-sample_100kB.docx --output-dir=imgs
We can also add custom styles as we did in Python. You need to first create a custom style file:
$ touch my-custom-styles
Then we'll add our custom styles to it. The syntax is the same as before:
b => b.red
u => em.red
p[style-name='Heading 1'] => h1.red.underline
Now we can generate our HTML file with custom style:
$ mammoth file-sample_100kB.docx output.html --style-map=my-custom-styles
And you're done! Your document has been converted with the defined custom styles.
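The same CLI calls can be assembled from Python when several files need converting. The sketch below only builds the argument list, using the positional arguments and the --style-map option shown above. build_mammoth_command is a hypothetical helper, and placing the output next to the input with an .html suffix is an assumption of this example.

```python
from pathlib import Path


def build_mammoth_command(docx_path, style_map_path=None):
    """Return the argument list that converts one DOCX file to an .html file beside it."""
    docx_path = Path(docx_path)
    command = ["mammoth", str(docx_path), str(docx_path.with_suffix(".html"))]
    if style_map_path is not None:
        command.append("--style-map={}".format(style_map_path))
    return command


command = build_mammoth_command("file-sample_100kB.docx", "my-custom-styles")

# To actually run it (requires the mammoth CLI on your PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.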

Conclusion

Converting files between formats is a common task when working with web technologies. Converting DOCX files into well-known, easy-to-manipulate HTML lets us restructure the data as much as we need. With Mammoth, we've learned how to extract the text from a DOCX file and how to convert it to HTML.
When converting to HTML we can style the output with CSS rules we create or ones that come with common UI frameworks. We can also omit data we don't want to be available in the HTML. Lastly, we've seen how to use the Mammoth CLI as an alternative option for file conversion.
You can find a sample docx file along with the full code of the tutorial on this GitHub repository.