In the previous assignment, you converted markdown documents to HTML, simulating a common workflow in static site generation and documentation systems. This assignment tackles the inverse problem: converting HTML back to markdown and monitoring changes over time.
Why does this matter? Organizations often need to track changes to external web content. This may include competitors' product pages, regulatory documents, news articles, or technical documentation. Manual monitoring is tedious, error-prone, and does not scale. By building an automated system that downloads, converts, and detects changes in web pages, you'll create a practical tool for content monitoring and archival.
This assignment integrates several scripting concepts we have covered:
CSV processing to manage a list of websites to monitor
Automation with cron to schedule regular downloads
File compression and archiving to efficiently store historical snapshots
Difference detection and log parsing to identify meaningful changes
Python and command-line tools working together in a cohesive pipeline
Your solution will mirror real-world systems used for compliance monitoring, competitive intelligence, and detecting changes in production environments.
You will create two scripts: (1) html2md, to download HTML pages and convert them to markdown format, and (2) diffcheck, to compare archived snapshots to detect changes. You will also provide a cron specification to automate daily execution.
Please read the entire assignment before you jump into the implementation!
Create a script called html2md that
Accepts two command-line arguments: ./html2md <csv_file_path> <output_dir>
csv_file_path: a CSV file containing information about the websites to monitor
output_dir: a directory path where archived snapshots of the websites will be stored
Reads and processes the CSV file:
The CSV uses pipe (|) as the separator
Each line has exactly three fields: title|url|download_date. An example is shown below:
Python Documentation|https://en.wikipedia.org/wiki/Python_(programming_language)|2024-03-15
Machine Learning|https://en.wikipedia.org/wiki/Machine_learning|2024-03-16
Data Science|https://en.wikipedia.org/wiki/Data_science|2025-12-02
Christmas|https://en.wikipedia.org/wiki/Christmas|2025-12-25
Web Crawler|file:///Users/johndoe/local/path/webcrawler_wikipedia_localdownload.html|2025-01-01
Gregorian Calendar|https://en.wikipedia.org/wiki/Gregorian_calendar|2025-01-01
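If you implement html2md in Python, the built-in csv module reads pipe-separated files directly. The sketch below shows one possible shape; the function and variable names are illustrative, not required:

import csv
import sys

def read_sites(csv_path):
    """Yield (title, url, download_date) tuples from the pipe-separated CSV."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='|'):
            if len(row) != 3:
                print(f"Error: malformed CSV line: {row}", file=sys.stderr)
                sys.exit(1)
            yield row[0], row[1], row[2]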
Implements the following date-driven download logic:
Download if download_date <= today's date
Skip if the download date is specified for the future, i.e., download_date > today's date
For example, if today is November 19, 2025 and the download_date is Dec 25, 2025, the web page must not be downloaded; but if the download_date is January 1, 2025, the web page must be downloaded.
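A minimal way to express this check in Python, assuming the dates are given as YYYY-MM-DD strings (a sketch, not a required approach):

from datetime import datetime, date

def should_download(download_date_str):
    """True when download_date is today or earlier."""
    download_date = datetime.strptime(download_date_str.strip(), '%Y-%m-%d').date()
    return download_date <= date.today()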
Downloads and converts HTML to Markdown:
This assignment limits the conversion scope to English Wikipedia article pages only (en.wikipedia.org), and even within that, only the main article title and body text.
Identifying Main Content: Wikipedia's main article content is contained within the <div> element with a specific class. Focus your parsing on this element and its children.
Required Conversions: Your conversion to markdown must preserve the following elements within the article:
Headings (h1 through h6), using the top-level markdown heading (#) for the main page title
Paragraphs and text content, preserving bold, italics, and other standard inline text formatting elements.
Links, preserving URLs "as-is" (relative, absolute, or anchor links). Links without href attributes should be rendered as plain text.
Lists (ordered and unordered), preserving nesting up to three levels deep.
Code blocks (if present), including inline as well as multiline blocks.
Blockquotes
Nested elements (e.g., bold text within links, links within list items), preserving both the outer and inner formatting. For example: [**bold text**](url).
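To make the nesting requirement concrete, here is one possible recursive helper for inline content; render_inline is a hypothetical name, the sketch handles only a few tags, and you are free to structure this differently:

def render_inline(node):
    """Recursively convert an inline BeautifulSoup node to markdown text."""
    if isinstance(node, str):                       # NavigableString is a str subclass
        return str(node)
    inner = ''.join(render_inline(child) for child in node.children)
    if node.name in ('b', 'strong'):
        return f"**{inner}**"
    if node.name in ('i', 'em'):
        return f"*{inner}*"
    if node.name == 'code':
        return f"`{inner}`"
    if node.name == 'a':
        href = node.get('href')
        return f"[{inner}]({href})" if href else inner   # links without href become plain text
    return inner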
Elements to Strip/Ignore: You should completely ignore or strip the following elements:
Anything outside the main article.
Within the article (even if they appear inside the content element you identified):
Tables (including infoboxes, data tables, navigation tables)
Images and their captions/thumbnails (<img>, <figure>)
Math equations and mathematical notation beyond basic arithmetic (<math>, MathML)
Hatnotes: italicized disambiguation/notice boxes that appear at the top of an article or section
Reference superscripts, e.g., [1], [2], [citation needed].
Edit links (these typically appear as [ edit ] next to section or subsection headings)
Navigation boxes: Templates at the bottom (class navbox)
Maintenance templates: Warning banners about article quality, fundraising banners
"Main article" hatnotes: Links like "Main article: Topic" within sections
External links section: The "External links" section at the end
Categories: Category tags at the very bottom
Table of Contents: The auto-generated TOC box
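With an HTML parser such as BeautifulSoup, one convenient pattern is to delete these unwanted subtrees before converting what remains. The class names in the sketch below are commonly seen on Wikipedia pages, but you should verify them against the HTML you actually download; this is only a sketch:

def strip_unwanted(content):
    """Remove elements that should not appear in the markdown output (in place)."""
    for tag in content.find_all(['table', 'figure', 'img', 'math', 'sup']):
        tag.decompose()
    for tag in content.find_all('span', class_='mw-editsection'):   # [ edit ] links
        tag.decompose()
    for tag in content.find_all(class_=['navbox', 'hatnote']):
        tag.decompose()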
Edge Cases and Assumptions:
Empty sections: Skip sections with no content.
Unicode/special characters: Preserve Unicode characters as-is. Use UTF-8 encoding for all files.
Malformed HTML: You may assume the input HTML is well-formed. If parsing fails catastrophically, output an error message and skip that URL.
Whitespace: Use standard markdown conventions (blank line between paragraphs, blank line after headings). Minor whitespace differences will not affect grading.
Non-Wikipedia URLs: This assignment is designed for Wikipedia article pages. Behavior with non-Wikipedia URLs is undefined and will not be tested.
Required approach: use an HTML parsing library (like beautifulsoup4, lxml, or Python's built-in html.parser) and write your own HTML-to-markdown conversion logic.
Prohibited tools: html2text, markdownify, pandoc, or any other library that automatically converts HTML to markdown. Understanding HTML structure and implementing the conversion logic yourself is a key learning objective of this assignment; bypassing this core requirement will lead to a zero on this assignment.
Systematically names the Markdown files: include the title field from the CSV, sanitized for filesystem compatibility. The final filename must be limited to lowercase alphanumeric characters, with words separated by single underscores; for example, Python_(programming_language) becomes python_programming_language.md. Remove characters that are invalid in filenames (such as /, \, :, *, ?, ", <, >).
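One way to perform this sanitization is with Python's re module; the sketch below is one possibility, and any equivalent logic is acceptable:

import re

def sanitize_title(title):
    """Lowercase the title and keep only alphanumerics separated by single underscores."""
    cleaned = re.sub(r'[^a-z0-9]+', '_', title.lower())   # collapse runs of other characters
    return cleaned.strip('_') + '.md'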
Creates a compressed archive:
After all downloads complete, compress the converted files (i.e., all the .md files) into a single .tar.gz archive.
Name the archive with the current date and timestamp in the format YYYY-MM-DD_HH-mm-ss. This closely follows the ISO 8601 format, with three small modifications: (i) there is no timezone requirement, (ii) the date and time are separated by a single underscore, and (iii) the hours, minutes, and seconds are hyphen-separated instead of colons. The archive name could, for instance, be 2025-11-19_18-33-59.tar.gz.
Store the above archive in output_dir (if this directory doesn't exist, your script should create it).
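Python's tarfile and datetime modules cover both the archiving and the timestamped name. The sketch below assumes the converted .md files sit in a temporary directory tmp_dir (a hypothetical variable); it is one possible approach, not a mandated one:

import os
import tarfile
from datetime import datetime

def create_archive(tmp_dir, output_dir):
    """Bundle every .md file in tmp_dir into a timestamped .tar.gz inside output_dir."""
    os.makedirs(output_dir, exist_ok=True)        # create output_dir if it does not exist
    stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
    archive_path = os.path.join(output_dir, f'{stamp}.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as tar:
        for name in os.listdir(tmp_dir):
            if name.endswith('.md'):
                tar.add(os.path.join(tmp_dir, name), arcname=name)
    return archive_path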
Handles errors gracefully:
If a web page is unreachable (e.g., error 404, 429, 504), print an error message and continue with the remaining URLs; the error message must include the HTTP status code.
If CSV parsing fails, print a clear error message and exit.
As is standard practice, all error messages must be printed to stderr.
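If you fetch pages with urllib, HTTP failures arrive as exceptions that carry the status code, which makes the "report to stderr and continue" pattern straightforward. A sketch (the function name fetch is illustrative):

import sys
import urllib.request
import urllib.error

def fetch(url):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as e:
        print(f"Error: {url} returned HTTP {e.code}", file=sys.stderr)
    except urllib.error.URLError as e:
        print(f"Error: could not reach {url}: {e.reason}", file=sys.stderr)
    return None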
Since you are focusing on Wikipedia articles, the HTML structure is predictable. Pay attention to the id and class attributes of the <div> tags; this will help you build a consistent strategy. Here is an example approach using BeautifulSoup (it is untested and provided just to give you an idea; there is no mandate to follow this format or style):
from bs4 import BeautifulSoup

def html_to_markdown(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # find the main content area (this is specific to Wikipedia article pages)
    content = soup.find('div', {'class': 'mw-parser-output'})

    markdown_lines = []

    # process the page title
    title = soup.find('h1')
    if title:
        markdown_lines.append(f"# {title.get_text().strip()}\n")

    # process each element in the content area
    for element in content.find_all(['h2', 'h3', 'p', 'ul', 'ol']):
        # convert based on tag type
        # ... your conversion logic here
        pass

    return '\n'.join(markdown_lines)
Your html2md script can be written in bash or Python (or any language with a #! shebang).
The script must be world-executable.
You may create additional helper scripts, but a user should only need to run ./html2md input.csv output_dir
Keep Markdown files in a temporary location during processing, then delete them. This "cleanup" is an essential component of scripting. There are many scenarios where you may have to create temporary files, which are removed once the work is done.
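In Python, tempfile.TemporaryDirectory takes care of this cleanup for you; a short sketch:

import tempfile

# everything written inside tmp_dir is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    # ... write the .md files into tmp_dir and build the archive here ...
    pass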
Your solution must maintain a way to identify the correspondence between a markdown file and the URL from which it was downloaded and converted. This is essential for the second part of this assignment. There are many ways to achieve this. Some possible options include:
Filenames include a unique correspondence with the URL, e.g., by appending the first 8 characters of the md5 hash of the URL, which will generate filenames like python_documentation_5d41402a.md.
An additional metadata file stored inside each archive (not the most recommended approach).
URL comments or frontmatter in markdown files.
The design choice is yours, but your approach must work with the second part of this assignment.
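If you go with the hash-suffix option, the filename construction could look like the sketch below; sanitize_title is the hypothetical helper sketched earlier, and this is only one of the acceptable designs:

import hashlib

def snapshot_filename(title, url):
    """Combine the sanitized title with a short, stable hash of the URL."""
    url_hash = hashlib.md5(url.encode('utf-8')).hexdigest()[:8]
    base = sanitize_title(title)[:-3]             # drop the '.md' added by sanitize_title
    return f"{base}_{url_hash}.md"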
Create a script called diffcheck that has the following properties:
Accepts two command-line arguments: ./diffcheck N output_dir, where N is an integer representing the number of days to look back, and output_dir is the same output directory that was passed to html2md.
Compares two archives: (i) the most recent archive from N days ago, and (ii) the most recent archive from today. Both archives should be in the same output_dir.
Detects content changes: extract and compare the corresponding markdown files from the two archives, and determine which (if any) web pages have modified their content. Whitespace-only changes do not count as content modifications.
Output results clearly:
If there are no changes, print (to stdout): "No changes in any web page content in the last N days."
If changes are detected, then print the following information (again, to stdout), with {...} replaced by the actual values:
The following web pages have been modified in the last {N} days:
- {Title} ({url})
- {Title} ({url})
...
Handles missing archives:
If archives from N days ago do not exist in output_dir, print (to stderr): "Error: no archive from N days ago was found."
If there are no archives created today, print (again, to stderr): "Error: no archives were created today (you can run html2md to create one)."
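Because the archive names embed the creation date, both locating the two archives and detecting a missing one reduce to a filename lookup. A sketch, assuming the naming convention from Part 1 (the function name latest_archive is illustrative):

import glob
import os
from datetime import date, timedelta

def latest_archive(output_dir, days_ago):
    """Return the newest .tar.gz created (today - days_ago) days ago, or None if absent."""
    target = (date.today() - timedelta(days=days_ago)).strftime('%Y-%m-%d')
    matches = sorted(glob.glob(os.path.join(output_dir, f'{target}_*.tar.gz')))
    return matches[-1] if matches else None       # HH-MM-SS sorts correctly as text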
The script must be world-executable, so that the user can run it as ./diffcheck N output_dir.
You will need to extract the two archives to temporary locations for comparison. These extracted files must be cleaned up after the comparison is completed.
Consider using diff, cmp, or Python's difflib for comparison; a sketch appears after these notes. This is merely a suggestion, and you may choose another method to identify content modifications.
Use your mapping strategy (from Part 1) to link files back to the URLs for output.
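For a yes/no change decision, the comparison itself can be as simple as normalizing whitespace and comparing the two texts. The sketch below assumes that collapsing all runs of whitespace is an acceptable normalization; you may prefer difflib or an external diff instead:

import re

def files_differ(path_a, path_b):
    """True if the two markdown files differ beyond whitespace."""
    def normalized(path):
        with open(path, encoding='utf-8') as f:
            return re.sub(r'\s+', ' ', f.read()).strip()
    return normalized(path_a) != normalized(path_b)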
Create a file called cronjob.txt containing a cron specification that
Runs html2md every day at 3:30 am; and
Redirects all output to a log file: /home/johndoe/ise337/hw3/downloads.log
This file must be written in the exact format required by a genuine crontab. You can check its correctness as follows (this is essentially a test procedure for this part of the assignment):
Use the crontab command with the filename as an argument to replace your current crontab with the contents of the file:
crontab cronjob.txt
Important: This command will replace any existing cron jobs in your crontab with the contents of cronjob.txt. So, if you have any existing cron jobs, back them up before carrying out this step.
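As a reminder, each crontab line consists of five time fields (minute, hour, day of month, month, day of week) followed by the command to run. The line below is only a generic illustration with a made-up schedule and hypothetical paths, not the entry this assignment asks for:

0 6 * * 1 /home/someuser/scripts/some_script arg1 arg2 >> /home/someuser/logs/some.log 2>&1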
Your script will be tested with a sample input.csv containing a few (< 10) Wikipedia article URLs (or HTML files containing Wikipedia articles locally downloaded).
IMPORTANT NOTE: You should test almost all of your assignment using locally downloaded HTML files. If you hit Wikipedia servers too often, you may get blocked. Please respect Wikipedia's robots.txt rules! Getting blocked by Wikipedia for ignoring this rule will not be considered a valid reason for any grading exceptions. For example, instead of repeatedly testing your assignment with https://en.wikipedia.org/wiki/Web_crawler, download the page only once as a single HTML file, and use this file's local URI in your input.csv (e.g., file:///Users/rbanerjee/Downloads/web_crawler.html).
When you are finally testing your entire pipeline, use a mix of file:// and https:// URLs in your input.csv. You can handle both using Python's urllib library, as sketched below.
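urllib.request.urlopen accepts both schemes through the same call, so a single code path can serve local snapshots and live pages. A short sketch (the local file path is hypothetical):

import urllib.request

# the same call works for a local file URI and a live web page
for url in ('file:///tmp/web_crawler.html',
            'https://en.wikipedia.org/wiki/Web_crawler'):
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode('utf-8', errors='replace')
        print(url, len(html))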
Submit a single zip file named firstname_lastname.zip containing:
html2md, your main download-and-convert executable script;
diffcheck, your change detection executable script;
cronjob.txt, your cron specification;
any supporting scripts or programs that are needed to run your main scripts; and
a requirements.txt or environment.yml, so that grading can be done in the environment you specify (without this, the grader is forced to guess which dependencies are needed, and no grading error reports will be accepted for such submissions).
DO NOT INCLUDE (i) any HTML or Markdown files, (ii) any compressed .tar.gz archives, or (iii) __pycache__ or other temporary files.
Due: Friday, December 5, 11:59 pm, on Brightspace.