Scenario: Need to convert a webpage into a Word document for a user guide or to repurpose content? Capturing an entire webpage, especially content that requires scrolling, can be challenging.
I ran into several problems when trying to capture an entire webpage as a PDF, and the approach below is what worked for me.
This tutorial demonstrates two Python-based methods for converting webpages to Word: one that extracts plain text and another that preserves images and basic styling. Before the scripts, here are the common alternatives and their limitations:
1. Using the Browser’s Built-in Print to PDF, then Converting to Word
The simplest option, but it sometimes cuts off content.
2. Using Browser Extensions (e.g., GoFullPage and FireShot)
Good for full-page capture, but the output is limited to PNG and PDF, often at low resolution.
3. Using Online Converters
Convenient for quick conversions, but like the browser’s built-in print function, they might not always capture the full page or preserve complex formatting.
4. Using a Screenshot Tool
Tools like Snagit (paid), Greenshot (free), and ShareX (free) can scroll to capture the entire webpage as an image. You can then use an image editor or a PDF creator to convert the image or PDF to Word.
Overall, image resolution tends to be poor with these tool-based captures.
Capturing a full webpage to a Word document with good quality can be tricky, as Word isn’t designed for perfect web page replication. Instead, we use a Python script to capture the full webpage on a local PC.
Prerequisites
- Operating System (Windows or Linux): this tutorial uses Windows
- Install Python: Download and install Python from https://www.python.org/downloads/.
- Visual Studio Code (Optional) – You can run the script via terminal
Step 1: Install Python and Required Libraries
pip install requests
pip install beautifulsoup4
pip install html2docx
pip install python-docx
pip install pypandoc
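Before moving on, it can help to confirm that the installs succeeded. The snippet below is a minimal sketch, not part of the original workflow: the helper name `missing_modules` is hypothetical, and the list covers only the modules that the scripts in this tutorial import. Note that two import names differ from their pip package names.

```python
import importlib.util

# Import names differ from the pip package names in two cases:
# beautifulsoup4 is imported as "bs4" and python-docx as "docx".
REQUIRED_MODULES = ["requests", "bs4", "docx", "pypandoc"]


def missing_modules(module_names):
    """Return the module names that Python cannot locate (hypothetical helper)."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If any module is reported missing, rerun the matching `pip install` command in the same Python environment that you use to run the scripts.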
Step 2: Python Script (Webpage to Word Text)
import requests
from bs4 import BeautifulSoup
from docx import Document

def webpage_to_docx(url, output_filename):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        soup = BeautifulSoup(response.content, "html.parser")
        document = Document()
        # Extract text content and add to document
        for paragraph in soup.find_all("p"):  # You can target other tags like h1, h2, etc.
            document.add_paragraph(paragraph.text)
        document.save(output_filename)
        print(f"Webpage saved to {output_filename}")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage
url = "https://www.buildingtheitguy.com/index.php/how-to-install-kali-linux-on-virtualbox-in-five-steps/linux/"  # Replace with the URL you want
output_filename = r"C:\To\Path\webpage_text.docx"  # Use 'r' for raw string in Windows paths
webpage_to_docx(url, output_filename)
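The script above saves to a hard-coded path, and saving will likely fail if that folder does not exist yet. As a rough sketch, the output path could instead be derived from the URL and the folder created on demand. The helper names `docx_name_from_url` and `prepare_output_path` are hypothetical and not part of the original script.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def docx_name_from_url(url):
    """Build a .docx file name from the last non-empty segment of the URL path."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    stem = segments[-1] if segments else "webpage"
    # Replace characters that may be unsafe in file names with underscores.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", stem)
    return f"{stem}.docx"


def prepare_output_path(url, folder):
    """Create the output folder if it is missing and return the full file path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / docx_name_from_url(url)
```

For example, `webpage_to_docx(url, str(prepare_output_path(url, r"C:\To\Path")))` would write the document into `C:\To\Path`, creating that folder first if needed.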
Step 3: Python Script (Webpage to Word with Images and Styles)
To capture the full webpage with good-resolution images and basic styling, we use Pandoc through the pypandoc library.
1. Install Pandoc:
You need to download and install Pandoc from the official website:
- Pandoc website: https://pandoc.org/installing.html
Follow the instructions for your operating system (Windows, macOS, or Linux).
- Windows: Download the installer (.msi file) and run it.
2. Add Pandoc to your PATH (if necessary):
After installing Pandoc, you might need to add it to your system’s PATH environment variable. This allows your operating system to find the Pandoc executable from the command line.
- Windows:
- Search for “environment variables” in the Windows search bar.
- Click on “Edit the system environment variables.”
- Click on “Environment Variables…”
- In the “System variables” section, find the “Path” variable and click “Edit…”.
- Click “New” and add the path to your Pandoc installation directory. This is usually C:\Program Files\Pandoc or C:\Users\<YourUserName>\AppData\Local\Pandoc.
- Click “OK” on all dialogs to save the changes. You may need to restart your terminal or command prompt for the changes to take effect.
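To check that the PATH change took effect, you can ask Python to locate the executable. This is a small sketch using only the standard library; the function names `find_pandoc` and `pandoc_version` are hypothetical helpers, not part of pypandoc.

```python
import shutil
import subprocess


def find_pandoc():
    """Return the full path of the pandoc executable, or None if it is not on PATH."""
    return shutil.which("pandoc")


def pandoc_version(executable):
    """Return the first line printed by `pandoc --version`."""
    result = subprocess.run([executable, "--version"],
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()[0]


if __name__ == "__main__":
    path = find_pandoc()
    if path is None:
        print("pandoc was not found on PATH; recheck the steps above and reopen the terminal.")
    else:
        print(f"Found {path}: {pandoc_version(path)}")
```

Run it from a freshly opened terminal, since terminals that were already open may still hold the old PATH value.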
import pypandoc
import requests
import os

def webpage_to_docx_pandoc(url, output_filename):
    try:
        response = requests.get(url)
        response.raise_for_status()
        # Save the HTML to a temporary file so Pandoc can read it
        with open("temp.html", "w", encoding="utf-8") as f:
            f.write(response.text)
        # Convert the saved HTML file to a Word document
        pypandoc.convert_file("temp.html", "docx", outputfile=output_filename)
        os.remove("temp.html")
        print(f"Webpage saved to {output_filename}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage (install pypandoc and Pandoc first)
url = "https://www.buildingtheitguy.com/index.php/how-to-install-kali-linux-on-virtualbox-in-five-steps/linux/"
output_filename = r"C:\To\Path\fullwebpage_withimage.docx"  # Use 'r' for raw string in Windows paths
webpage_to_docx_pandoc(url, output_filename)
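One thing to watch for: once the HTML is saved to a local temp.html, images referenced with relative paths (such as /images/pic.png) may no longer resolve, assuming Pandoc looks them up relative to the local file rather than the original site. A possible workaround, sketched below with a hypothetical helper `absolutize_links`, is to rewrite relative src and href values into absolute URLs before writing the file. It relies on a rough regular expression rather than a full HTML parser, so treat it as a starting point.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href="..." with either quote style.
# This is a rough pattern, not a full HTML parser.
_LINK_ATTR = re.compile(r'\b(src|href)=(["\'])(.*?)\2', re.IGNORECASE)


def absolutize_links(html, base_url):
    """Rewrite relative src/href values so they point at the original site."""
    def _replace(match):
        attr, quote, value = match.groups()
        # urljoin leaves values that are already absolute URLs unchanged.
        return f"{attr}={quote}{urljoin(base_url, value)}{quote}"
    return _LINK_ATTR.sub(_replace, html)
```

In the script above, this would mean writing `absolutize_links(response.text, url)` to temp.html instead of `response.text`.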
Final – Input & Output