Terminology: A markup language is a text-encoding system which specifies the structure and formatting of a document and potentially the relationships among its parts. HTML, XML, LaTeX, and Markdown are popular examples. If a markup language (1) has simple, unobtrusive syntax; (2) is designed to be easy to write using any generic text editor; and (3) is easy to read in its raw form, it is categorised as a lightweight markup language (LML). An LML is often used in applications where it may be necessary to read the raw document as well as the final rendered output, or where it is prudent to spend little time visually rendering the text.
In the previous assignment, you wrote a README.md file to provide documentation for your shell script. The .md extension denotes a Markdown file; Markdown is a popular LML. In this programming assignment, you will write a Python program that converts a simplified Markdown file into valid HTML, and demonstrate automated testing using the pytest framework.
Before you start, create a conda environment for this assignment, and create its environment.yml file. Your assignment will be graded by creating the same environment based on the specifications in your environment.yml.
Input: A text file (input.md) containing Markdown-formatted content.
Output: A text file (output.html) containing HTML markup.
The rest of this section describes the core Markdown features to support.
#, ##, ###, ####, #####, ###### → <h1>, <h2>, <h3>, <h4>, <h5>, <h6>. Alternatively, on the line below the text, add two or more = characters (a single = is not enough) for heading level 1, or two or more - characters (again, a single - is not enough) for heading level 2. For example, both of the following Markdown forms will become <h1>Heading</h1> in HTML:
# Heading
Heading
======
Note the single space before "Heading". This space is what designates the # symbol as a Markdown symbol instead of a normal text character.
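The heading rules above can be sketched as follows. This is only one possible approach, not a reference solution: it assumes the input is processed line by line, that an underline of = or - applies to the non-empty line directly above it, and that any line not matching a heading rule is passed through unchanged.

```python
import re


def convert_headings(text: str) -> str:
    """Convert '# ...' style and underlined (== / --) headings, line by line."""
    lines = text.split("\n")
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # Look ahead one line for a possible == or -- underline.
        underline = lines[i + 1].strip() if i + 1 < len(lines) else ""
        hashes = re.fullmatch(r"(#{1,6}) (.+)", line)
        if hashes:
            # One to six # characters followed by a single space.
            level = len(hashes.group(1))
            out.append(f"<h{level}>{hashes.group(2).strip()}</h{level}>")
        elif line.strip() and re.fullmatch(r"={2,}", underline):
            out.append(f"<h1>{line.strip()}</h1>")
            i += 1  # skip the underline itself
        elif line.strip() and re.fullmatch(r"-{2,}", underline):
            out.append(f"<h2>{line.strip()}</h2>")
            i += 1  # skip the underline itself
        else:
            out.append(line)  # not a heading: leave untouched
        i += 1
    return "\n".join(out)
```

Because the pattern requires a space after the # characters, a line such as #NoSpace is left as ordinary text, and seven or more # characters do not form a heading.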
To create paragraphs, use a blank line to separate one or more lines of text. Within a paragraph, each newline character will be converted into a <br> tag.
This is a single line.
It is the first paragraph in this example.

And this is the second paragraph, because of the empty line above.
The corresponding HTML will be
<p>This is a single line.<br>It is the first paragraph in this example.</p>
<p>And this is the second paragraph, because of the empty line above.</p>
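A minimal sketch of the paragraph rule is given below. It assumes that the text passed in contains only paragraph content (headings and lists are handled elsewhere), and that a line containing only whitespace also counts as a blank line; both are assumptions of this sketch rather than requirements stated above.

```python
import re


def convert_paragraph(text: str) -> str:
    """Wrap blank-line-separated blocks in <p>, joining inner lines with <br>."""
    # One or more blank (or whitespace-only) lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if lines:
            paragraphs.append("<p>" + "<br>".join(lines) + "</p>")
    return "\n".join(paragraphs)
```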
Text emphasis can be rendered as bold or italic or both.
To render text in bold, add two asterisks or underscores before and after a word or phrase. To render the middle of a word as bold, add two asterisks without spaces around the letters. For example:
all that glitters is **not** gold becomes all that glitters is <strong>not</strong> gold
all that glitters is __not__ gold becomes all that glitters is <strong>not</strong> gold
all that g**litters** is not g**o**ld becomes all that g<strong>litters</strong> is not g<strong>o</strong>ld
all that g__litters__ is not g__o__ld stays as it is, with no special HTML meaning given to the surrounding two underscores.
Similarly, to render a word or phrase in italics, surround it with a single asterisk or underscore before and after. To italicize the middle of a word for emphasis, add one asterisk without spaces around the letters.
Text can be bold and italicized at the same time:
This text is ***really important***.
This text is ___really important___.
This text is really im***port***ant.
This text is really im___port___ant.
will become
This text is <em><strong>really important</strong></em>.
This text is <em><strong>really important</strong></em>.
This text is really im<em><strong>port</strong></em>ant.
This text is really im___port___ant.
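One way to capture the difference between asterisks and underscores is sketched below. It is a simplified approach, not the only valid one: it assumes emphasis never spans more than one line, processes the longest markers first so that *** is not mistaken for ** followed by *, and treats underscores as markers only when they are not attached to a letter, digit, or another underscore. For combined bold and italic, this sketch closes the tags in the reverse order of opening, so the fragment is well-formed HTML.

```python
import re

# Marker width -> (opening tags, closing tags).
_TAGS = {
    3: ("<em><strong>", "</strong></em>"),
    2: ("<strong>", "</strong>"),
    1: ("<em>", "</em>"),
}


def convert_emphasis(text: str) -> str:
    """Convert *, **, *** (anywhere) and _, __, ___ (word boundaries only)."""
    for width in (3, 2, 1):  # longest markers first
        open_tag, close_tag = _TAGS[width]
        repl = rf"{open_tag}\1{close_tag}"
        star = r"\*" * width
        # Asterisks may appear inside a word; the content must not start
        # or end with whitespace.
        text = re.sub(rf"{star}(?=\S)(.+?)(?<=\S){star}", repl, text)
        under = "_" * width
        # Underscores only count when not glued to a word character.
        text = re.sub(rf"(?<!\w){under}(?=\S)(.+?)(?<=\S){under}(?!\w)", repl, text)
    return text
```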
To create an ordered list, add line items with numbers followed by periods. The numbers don’t have to be in numerical order, but the list should start with the number one. For example, both the lists below will get converted to the same HTML code:
1. First item
2. Second item
3. Third item
and
1. First item
1. Second item
1. Third item
Both the above texts will be converted to HTML as
<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>
Similarly, for unordered lists, add dashes (-), asterisks (*), or plus signs (+) in front of line items. The single space (after 1. or - or * or +) is important for proper conversion to HTML. Without that space, these characters lose their special meaning as list item indicators.
While nested lists are possible in full-fledged Markdown, handling them is not within the scope of this assignment.
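Since ordered and unordered lists differ only in the item marker and the wrapping tag, both can share one helper. The sketch below is a hedged illustration: the helper name _convert_list is hypothetical, consecutive matching lines are assumed to form one list, and the rule that an ordered list should start with the number one is not enforced here.

```python
import re


def _convert_list(text: str, item_pattern: str, tag: str) -> str:
    """Hypothetical shared helper: wrap runs of matching lines in <tag>."""
    item_re = re.compile(item_pattern)
    out = []
    in_list = False
    for line in text.split("\n"):
        match = item_re.fullmatch(line)
        if match:
            if not in_list:  # first item of a new list
                out.append(f"<{tag}>")
                in_list = True
            out.append(f"<li>{match.group(1)}</li>")
        else:
            if in_list:  # a non-item line ends the current list
                out.append(f"</{tag}>")
                in_list = False
            out.append(line)
    if in_list:  # close a list that runs to the end of the text
        out.append(f"</{tag}>")
    return "\n".join(out)


def convert_ordered_list(text: str) -> str:
    # A number, a period, then the mandatory single space.
    return _convert_list(text, r"\d+\. (.+)", "ol")


def convert_unordered_list(text: str) -> str:
    # A dash, asterisk, or plus sign, then the mandatory single space.
    return _convert_list(text, r"[-*+] (.+)", "ul")
```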
A word or phrase surrounded by single or double backticks (` or ``, just like single or double quotes used for Python strings) is marked as code in HTML. For example, I prefer `pytest` over `unittest`. will become I prefer <code>pytest</code> over <code>unittest</code>.
Longer multi-line code blocks are possible in Markdown, but beyond the scope of this assignment.
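Inline code can be sketched with a single substitution. The version below assumes that the closing delimiter must have the same number of backticks as the opening one, and that a code span never crosses a line break; neither detail is spelled out above.

```python
import re


def convert_code(text: str) -> str:
    """Convert `code` and ``code`` spans into <code>...</code>."""
    # Group 1 captures the opening backticks; \1 demands the same closer.
    return re.sub(r"(`{1,2})(.+?)\1", r"<code>\2</code>", text)
```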
Hyperlinked text is expressed in Markdown as [text](url), which gets converted to <a href="url">text</a>. For the purpose of this assignment, you may assume that URLs are properly encoded (e.g., spaces will be represented as %20).
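The link rule can be sketched in the same style. This version assumes that the link text contains no ] character and, in line with the encoding assumption above, that the URL contains no spaces or closing parentheses.

```python
import re


def convert_link(text: str) -> str:
    """Convert [text](url) into <a href="url">text</a>."""
    return re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)", r'<a href="\2">\1</a>', text)
```

For instance, See [the docs](https://example.com/a%20b). would become See <a href="https://example.com/a%20b">the docs</a>. (the URL is a placeholder).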
There are many other Markdown features, like linking images, embedding HTML code within Markdown text, and cross-referencing to other parts of the same document. Any such feature not listed among the six items of this section is beyond the scope of this assignment.
You are required to write a main script called md2html.py. This script must handle command-line input and output, as follows:
python md2html.py input.md output.html
python md2html.py filename.md
If no output HTML file is specified, the output file must take the input file's name with the .html extension. In the second example, your script must therefore create a file called filename.html, to which the HTML equivalent of the input Markdown file will be written.
You must use only the Python standard library. The re module is allowed for pattern matching, but no external libraries may be used for conversion to HTML.
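The two invocation forms can be handled with the standard library alone. The sketch below is one possible arrangement: the names resolve_paths and main are hypothetical, UTF-8 encoding is an assumption, and the conversion function is passed in as a parameter so that the example stands on its own.

```python
from pathlib import Path


def resolve_paths(argv: list[str]) -> tuple[Path, Path]:
    """Return (input path, output path) from a sys.argv-style list."""
    if len(argv) not in (2, 3):
        raise SystemExit("usage: python md2html.py input.md [output.html]")
    src = Path(argv[1])
    # With no explicit output, reuse the input name with an .html suffix.
    dst = Path(argv[2]) if len(argv) == 3 else src.with_suffix(".html")
    return src, dst


def main(argv: list[str], convert) -> None:
    """Read the Markdown file, convert it, and write the HTML file."""
    src, dst = resolve_paths(argv)
    markdown = src.read_text(encoding="utf-8")
    dst.write_text(convert(markdown), encoding="utf-8")
```

Inside md2html.py, this could be called as main(sys.argv, convert) under the usual if __name__ == "__main__": guard.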
It is highly recommended that you write modular code. For example, start by thinking about simple one-line Markdown text with no nested formatting tags, and only then write the more complex methods. If you do this, you will realize that you can reuse the simpler methods from your own code. Specifically, your code must have the following methods (the method names clearly indicate what they do):
convert_emphasis(text: str) -> str
convert_paragraph(text: str) -> str
convert_headings(text: str) -> str
convert_ordered_list(text: str) -> str
convert_unordered_list(text: str) -> str
convert_code(text: str) -> str
convert_link(text: str) -> str
convert(text: str) -> str # this is the main function, which takes Markdown text and converts it to HTML
Recall that pytest requires a specific directory structure. You must create this structure, and write your test file called test_md2html.py.
Your tests must verify that each of the 8 methods specified above accurately converts its Markdown feature to HTML.
In total, your tests must include at least 3-4 normal and 3-4 edge cases.
Your tests must run automatically via pytest.
For example:
from md2html import convert_headings

def test_convert_headings():
    assert convert_headings("# Sample heading") == "<h1>Sample heading</h1>"
    ...  # and so on.
You must make use of fixtures and parametrization for repeated patterns.
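As an illustration of fixtures and parametrization, the sketch below tests heading conversion for all six levels with one test function. The convert_headings defined here is only a stand-in so that the example runs on its own; in test_md2html.py it would be imported from md2html instead. The fixture name heading_text and the chosen edge cases are likewise just examples.

```python
import re

import pytest


# Stand-in so this sketch is self-contained; in test_md2html.py use
# `from md2html import convert_headings` instead.
def convert_headings(text: str) -> str:
    match = re.fullmatch(r"(#{1,6}) (.+)", text)
    if not match:
        return text
    level = len(match.group(1))
    return f"<h{level}>{match.group(2)}</h{level}>"


@pytest.fixture
def heading_text() -> str:
    """Shared heading text reused by several tests."""
    return "Sample heading"


@pytest.mark.parametrize("level", [1, 2, 3, 4, 5, 6])
def test_heading_levels(level, heading_text):
    # Normal cases: one test body, six generated test cases.
    markdown = "#" * level + " " + heading_text
    assert convert_headings(markdown) == f"<h{level}>{heading_text}</h{level}>"


@pytest.mark.parametrize("markdown", ["#No space", "####### Seven hashes", ""])
def test_non_headings_unchanged(markdown):
    # Edge cases: none of these inputs is a valid heading.
    assert convert_headings(markdown) == markdown
```

Running pytest in the project directory collects both functions and reports each parameter value as a separate test case.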
Your submission must be on Brightspace. It must be a single .zip file containing the following:
md2html.py
test_md2html.py
environment.yml
A short README.md (one page max) explaining (i) any known limitations of your implementation, and (ii) how to run your tests using pytest.
Friday November 7, 11:59 pm on Brightspace.