A project recently required that I grab a section of the current page and feed it into another utility as XML. Normally I would have just passed the utility the URL of the current page and asked it to throw its own GET, but in this case I needed to preserve changes made to the page through DOM scripting.

Since the output of innerHTML is, for all intents and purposes, useless as input to a standard XML parser across browsers, I wrote this little script. Its sole purpose is to pick up where innerHTML currently fails and return as close as possible to a “proper” representation of the current structure of the requested element.

Through the eye of a browser

Since innerHTML is a non-standard property, each browser treats it differently. This has wreaked havoc on a number of my XPath/XSL-style parsers, since the data that comes back is often not valid XHTML, even if the initial data was.

Firefox


  • HTML entities are encoded
  • Self-closing tags lose their closing slash (<br /> comes back as <br>)

IE


  • HTML entities are encoded
  • Self-closing tags lose their closing slash (<br /> comes back as <br>)
  • All tag names are forced to uppercase
  • A number of attribute values are no longer wrapped in double or single quotes

Safari


  • HTML entities are encoded
  • Self-closing tags lose their closing slash (<br /> comes back as <br>)
  • All tag names are forced to uppercase
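
All three sets of quirks above come from asking the browser to serialize the markup. A script like the one described here can avoid them by walking the DOM tree and building the string itself. The following is only a minimal sketch of that idea, not the actual source of parseNode: the names serializeNode, escapeXml and VOID_ELEMENTS are made up for illustration, and the attr.specified check is an assumption about how to skip the unset attributes that older IE versions list.

```javascript
// Hypothetical sketch of a DOM-walking serializer; NOT the parseNode source.
// Elements that XHTML writes in self-closing form.
const VOID_ELEMENTS = ["area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"];

// Escape the characters that are significant in XML text and attribute values.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Walk a DOM node (or any object exposing nodeType, nodeName, attributes,
// childNodes and nodeValue) and build an XHTML string from the tree itself.
function serializeNode(node) {
  if (node.nodeType === 3) {            // text node
    return escapeXml(node.nodeValue);
  }
  if (node.nodeType !== 1) {            // ignore comments and other node types
    return "";
  }
  const name = node.nodeName.toLowerCase();   // always emit lowercase tag names
  let out = "<" + name;
  const attrs = node.attributes || [];
  for (let i = 0; i < attrs.length; i++) {
    // Assumption: skip attributes explicitly flagged as not specified (old IE).
    if (attrs[i].specified === false) continue;
    // Always quote attribute values.
    out += " " + attrs[i].name.toLowerCase() + '="' + escapeXml(attrs[i].value) + '"';
  }
  const children = node.childNodes || [];
  if (children.length === 0 && VOID_ELEMENTS.indexOf(name) !== -1) {
    return out + " />";                 // keep the self-closing form
  }
  out += ">";
  for (let i = 0; i < children.length; i++) {
    out += serializeNode(children[i]);
  }
  return out + "</" + name + ">";
}
```

In a browser this could be called as serializeNode(document.getElementById("dom_xhtml_text")). The sketch leaves out comments, CDATA sections, namespaces and non-ASCII handling, all of which a production version would need to consider.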

Sample content and demo

Toolkit

Like innerHTML, except not as broken

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Use it like:

var target_node = document.getElementById("dom_xhtml_text");
var page_XHTML = parseNode(target_node); // XHTML string for the node and its children



5 Comments

  1. Posted February 1, 2008 at 4:42 pm | Permalink

    Thanks for this! Been looking for something like this for a while. However, is there any way to remove the surrounding div tags, like innerHTML would do? I’m looking for innerHTML “content” but with proper HTML tags (lowercase and self-closing with /).

    Thanks!

  2. Posted February 2, 2008 at 2:20 pm | Permalink

    Sure, grab the latest version - DOM -> XHTML string parser - v0.2

    and call it like:

        parseNode(target_node, true); // The second param is for an "omit_outer_node" flag

    Should work how you need it now. Shoot me a link once you get your project up and running.

  3. Posted September 18, 2008 at 6:36 pm | Permalink

    Hi James, I just stumbled across this after some serious headaches caused by this problem with innerHTML and needed to give you a huge thanks as this immediately fixed the issue! I’ve been using jQuery with AJAX to build out a form on the fly and then save it to the server and the innerHTML (which jQuery.html() uses) was getting messed up along the way. I filed a bug report over there and pointed them towards this post for a fix :) Thanks again for posting up this solution! Eric

  4. TJ Maciak
    Posted September 29, 2008 at 4:13 pm | Permalink

    Hey this is really cool and works pretty well. I was having major problems/headaches with IE returning uppercase tags and unquoted one-name attributes like id=oneName instead of id="oneName" when I was using .innerHTML to obtain the contents of a div so that I could send it on to iText as XHTML to make a PDF. The one problem I have encountered is that your code does something strange with the &nbsp; tag when using it in IE (6 & 7). I ended up solving this strangeness by doing a pre-parse on the code that I send to your parseNode function, and that fixed the problem using this syntax:

        document.getElementById(divID).innerHTML = document.getElementById(divID).innerHTML.replaceAll("&nbsp;", ' ');

    If I didn’t do the preparse then the xhtml reader would tell me that the string I was sending to it (the text returning from parseNode) was invalid with the following output: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.

    Other than that this will help a lot of people out who have to deal with IE (thankfully Firefox handles the innerHTML beautifully sending xhtml compliant lower case tags)! If you have further questions about the problem I ran into feel free to contact me via email and I will be happy to help test it out more. Thanks again for offering this up as it is a time & headache saver.

    TJ

  5. TJ Maciak
    Posted September 29, 2008 at 4:14 pm | Permalink

    good thing I proof read. The tag I was talking about is & nbsp; (put together - otherwise your comments render it as a space - as it should!) :)

    TJ
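
The non-breaking-space problem reported in comments 4 and 5 might also be handled on the output side rather than by rewriting innerHTML first. The error message suggests the receiving parser was handed a byte that was not valid UTF-8, though that cause is an assumption and cannot be confirmed from the thread. Under that assumption, one option is to emit every non-ASCII character as a numeric character reference, so the resulting string is plain ASCII whatever encoding it travels in. The helper name escapeNonAscii is hypothetical and is not part of parseNode.

```javascript
// Hypothetical helper; not part of parseNode.
// Replace every non-ASCII character with a numeric character reference,
// e.g. a non-breaking space (U+00A0) becomes "&#160;".
function escapeNonAscii(text) {
  // The "u" flag makes each match a whole code point, so characters outside
  // the Basic Multilingual Plane are emitted as a single reference.
  return String(text).replace(/[^\x00-\x7F]/gu, function (ch) {
    return "&#" + ch.codePointAt(0) + ";";
  });
}
```

Applied to the serialized string just before it is sent to the XML consumer, this leaves existing ASCII markup and entities untouched.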
