Dependency Management: HtmlUnit
Posted by Uncle Bob on 02/11/2007
If you are planning on building an API, please, please, think about dependency management. Don’t make me know more about your world view than necessary. Consider what happened to me as I explored HtmlUnit…
I’m using HtmlUnit to parse and interpret HTML web pages. I’ve been very impressed with this library so far. And I appreciate the hard work and dedication of people who give their software away for free. So, although this blog is a complaint, it should not be misconstrued as anything more than constructive criticism. Besides, what I am complaining about here is so universal that it really wouldn’t matter whose software I chose to scrutinize. The HtmlUnit authors just got lucky in this case.
What I want to do with HtmlUnit is quite simple. Given a string containing HTML, I’d like to query that HTML for certain tags and attributes. For example, I’d like to do this:
HtmlPage page = HTMLParser.parse(htmlString);
HtmlElement html = page.getDocumentElement();
HtmlElement listForm = html.getHtmlElementById("list_form");
assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));
Sweet, simple, uncomplicated. Just create the DOM from an HTML String, and then query that DOM. Unfortunately, HtmlUnit does not appear to be that simple. What you have to do instead looks like this:
StringWebResponse stringWebResponse = new StringWebResponse(htmlString);
WebClient webClient = new WebClient();
webClient.setJavaScriptEnabled(false);
HtmlPage page = HTMLParser.parse(stringWebResponse, new TopLevelWindow("", webClient));
HtmlElement html = page.getDocumentElement();
HtmlElement listForm = html.getHtmlElementById("list_form");
assertEquals("/Library/books/manage.do", listForm.getAttributeValue("Action"));
The extra stuff in here is apparently due to the fact that the authors wanted to be able to simulate browsers, frames, and javascript. I think their goal was laudable. However, I wish they had done this without forcing those frames, browsers, and script engines down my throat.
Given my simple needs, why do I care about WebClient and Window? Why do I have to turn off the javascript engine? It may seem a small thing, but it bothers me nonetheless. It’s the principle of the matter that gets under my skin. The pragmatic programmers called it The Principle of Least Surprise. I call it, simply, dependency management. Don’t make people depend on more than they need.
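To make the principle concrete, here is a minimal sketch of an API that offers a narrow entry point next to a wide one. Everything in it is hypothetical: `BrowserSimulator`, `ParsedPage`, and `SimpleHtml` are stand-ins invented for illustration, not HtmlUnit classes, and the regex scanner is a toy that only handles double-quoted attributes. It is not a real HTML parser. The point is only the shape: the simple caller never sees the heavyweight collaborator.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a heavyweight collaborator such as a browser
// simulator. It is NOT HtmlUnit's WebClient; it only marks where one would go.
final class BrowserSimulator {
    private boolean scriptingEnabled = true;

    void setScriptingEnabled(boolean enabled) { this.scriptingEnabled = enabled; }
    boolean isScriptingEnabled() { return scriptingEnabled; }
}

// Hypothetical parsed page: element id -> (lower-cased attribute name -> value).
final class ParsedPage {
    private final Map<String, Map<String, String>> attributesById;

    ParsedPage(Map<String, Map<String, String>> attributesById) {
        this.attributesById = attributesById;
    }

    // Returns the attribute value, or "" when the element or attribute is absent.
    String attributeOf(String elementId, String attributeName) {
        Map<String, String> attributes = attributesById.get(elementId);
        if (attributes == null) {
            return "";
        }
        return attributes.getOrDefault(attributeName.toLowerCase(Locale.ROOT), "");
    }
}

final class SimpleHtml {
    // Toy patterns: opening tags, and name="value" pairs inside them.
    private static final Pattern TAG = Pattern.compile("<\\w+([^>]*)>");
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    private SimpleHtml() {}

    // Narrow entry point: callers who only have a string depend on nothing else.
    // The library, not the caller, chooses the defaults (scripting off).
    static ParsedPage parse(String html) {
        BrowserSimulator browser = new BrowserSimulator();
        browser.setScriptingEnabled(false);
        return parse(html, browser);
    }

    // Wide entry point: still available to callers who need browser simulation.
    // This toy scanner ignores the browser; a real library would consult it.
    static ParsedPage parse(String html, BrowserSimulator browser) {
        Map<String, Map<String, String>> byId = new HashMap<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            Map<String, String> attributes = new HashMap<>();
            Matcher attribute = ATTRIBUTE.matcher(tag.group(1));
            while (attribute.find()) {
                attributes.put(attribute.group(1).toLowerCase(Locale.ROOT),
                        attribute.group(2));
            }
            String id = attributes.get("id");
            if (id != null) {
                byId.put(id, attributes);
            }
        }
        return new ParsedPage(byId);
    }
}
```

With a shape like this, the query I wanted collapses to `SimpleHtml.parse(htmlString).attributeOf("list_form", "Action")`, and anyone who does need frames or scripting can still reach for the two-argument overload.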
The cost, to me, was an hour of rooting around in the documentation, example code, and my own trial-and-error experiments. (The benefit to me was another blog topic ;-) That cost may not seem great; but it must be paid again and again by everyone who wants to use the package in a way that doesn’t quite fit the authors’ world view.
There may, in fact, be a simpler way to do what I want to do with HtmlUnit. If there is, I haven’t been able to find it, and I’d be grateful if anyone out there, including the authors, could guide me in the right direction.
Comments
Paul King about 6 hours later:
HtmlUnit is streamlined for accessing sites (perhaps the String case is not so well handled). Here is the normal thing you would do – coded in Groovy:
import com.gargoylesoftware.htmlunit.WebClient
def webClient = new WebClient()
def page = webClient.getPage(some_url)
def listForm = page.getFormByName('list_form')
assert '/Library/books/manage.do' == listForm.getAttributeValue("Action")
dtolbert about 1 year later:
I can’t thank you enough, you saved me a couple hours of bumbling around with HtmlUnit. I’ve run into quite an issue involving a Javascript routine that returns a bit of JSON that I can play a bit with to decode into Html. I then wanted to take that Html and create an HtmlPage out of it, which I would then in turn parse.
I think I was on the right path. What I believe I was doing wrong was using my existing WebClient object to create the HtmlPage with a StringWebResponse.
I can’t give enough praise to the HtmlUnit library. It truly is a gem and “just works” in most cases.
Fletch over 3 years later:
Thanks for posting this. It saved me a lot of trouble. Some things are a pain in the ass with HtmlUnit, but generally it’s fantastic for web testing and automation.
uselectit.com over 3 years later:
I think HtmlUnit has been very successful so far, and internet users all over the world are grateful to the developers who give away, for free, software they have worked hard on over the years. The WebClient, the Window, and the JavaScript engine cause problems for some. Everything has some disadvantages, and this software may have them too, but the point to note is how many people benefit from it. It definitely deserves admiration.
Chris over 4 years later:
Seems like your approach won’t work anymore. I’m doing the following:
StringWebResponse stringWebResponse = new StringWebResponse(htmlString, new URL("http://fakeurl"));
page = HTMLParser.parseHtml(stringWebResponse, webClient.getCurrentWindow());
carhartt over 4 years later:
Thank you for the post. The version of a module dependency should be selected according to the following rule: the lowest version providing the functionality required by the module (or bundle). By required functionality we basically mean the provided API.
ames over 4 years later:
Dependency Management: HtmlUnit, another masterpiece of a blog post. Keep it up.
Bilal Shahid over 4 years later:
I have been using HtmlUnit for some time now, but after reading this post, I think I might switch to HttpUnit.
filtration d'eau over 4 years later:
Can I write this in PHP format? It will take less time