Google Answers: Handling HTML entities in Java's HTMLEditorKit


Q: Handling HTML entities in Java's HTMLEditorKit ( No Answer, 4 Comments )

Question

Subject: Handling HTML entities in Java's HTMLEditorKit
Category: Computers > Programming
Asked by: vespasian-ga
List Price: $10.00

Posted: 26 Apr 2003 14:02 PDT
Expires: 07 May 2003 23:51 PDT
Question ID: 195868

Hi, I've written a simple HTML parser by overriding the methods in HTMLEditorKit.ParserCallback. For now, I'm simply having it print out the HTML as an HTTP response, to see if it works. It works OK, except that for some reason, the handleText() method doesn't recognize Unicode entities. It takes entities like "•" and outputs a "?" instead. (I'm not talking about console output, but rather the output web page). I have tried various solutions, like changing the Content-Type of the response header to include "charset=UTF-8". Nothing worked. I'd be happy to just output the entities in their original form (e.g., "•") but Java's HTML parser insists on mapping them to actual Unicode characters. Thanks in advance for any help!
Clarification of Question by vespasian-ga on 26 Apr 2003 14:12 PDT Actually, it's not just Unicode entities, but also others, like "&nbsp;". They're all getting discarded by handleText(). Is there any way I can just tell the parser, basically, "hands off my entities"?
Clarification of Question by vespasian-ga on 27 Apr 2003 10:31 PDT Eadfrith -- I tried your code and it produced the same results as mine. Here is the relevant part of my code:

public byte[] markDeadLinks(String host, byte[] response){
  ByteArrayInputStream responseStream = new ByteArrayInputStream(response);
  BufferedInputStream bufResponse = new BufferedInputStream(responseStream);
  responseHeader = skipHeader(bufResponse);
  Parser callback = new Parser(host);
  // HTMLParse is a class that extends HTMLEditorKit
  HTMLEditorKit.Parser parse = new HTMLParse().getParser();
  try{
    parse.parse(new InputStreamReader(bufResponse), callback, true);
  } catch (Exception e) {}

Saman007uk20: My problem is that the parser tries to resolve the entities before it even passes the text to me. So I can't "escape" the &'s because I never see them. Thanks for your replies!
Clarification of Question by vespasian-ga on 27 Apr 2003 14:41 PDT Eadfrith: thanks for your response. I tried your solution, and it now correctly handles textual entities (e.g., "&nbsp;"). It still does not handle numeric entities (e.g., ""). I added some diagnostic output, and it seems getEntity(int) never gets called (although the second version of getEntity() does). Anyway, your answer has helped me make some progress in this problem, so please feel free to claim the reward. Thanks.
Clarification of Question by vespasian-ga on 27 Apr 2003 23:39 PDT Wow, you're not a researcher? Well, I'm even more grateful for your helpful replies then. You should be a researcher! It would be nice to see your expertise rewarded. I'd be honored if you would claim the reward for this question as your first one. :) Anyway, I used your first solution, and it works great (the flaw that you pointed out doesn't seem to be a hindrance -- I tested it, and these entities get resolved by the browser). Thanks again for your kind help!

Answer

There is no answer at this time.

Comments

Subject: Re: Handling HTML entities in Java's HTMLEditorKit
From: eadfrith-ga on 26 Apr 2003 19:10 PDT

How are you invoking the parse operation? It sounds like your problem
is that the DTD is not being set on the parser. The parser uses the
DTD to resolve entities found in the source, so with no DTD the
entities can't be resolved. This could happen if you're creating your
own Parser (a DocumentParser maybe?).

Anyway, this code will create a parser with a HTML DTD set. Try it and
see if it fixes your problem.

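import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// ParserDelegator extends HTMLEditorKit.Parser and installs the default HTML DTD
HTMLEditorKit.Parser p = new ParserDelegator();
p.parse(reader, yourParserCallback, true);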
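
For anyone who wants to try this outside a servlet, here is a minimal self-contained sketch. The class name DelegatorDemo, the helper extractText, and the sample markup are made up for illustration; only ParserDelegator and HTMLEditorKit.ParserCallback come from the JDK. It collects whatever the parser hands to handleText() and prints each character's code point, so you can see whether an entity arrived resolved or untouched.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class, not part of the original poster's code.
class DelegatorDemo {

    /** Parses the HTML and returns all text passed to handleText(), concatenated. */
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        Reader reader = new StringReader(html);
        // true = ignore any charset declaration in the document
        new ParserDelegator().parse(reader, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = extractText("<html><body><p>caf&eacute;</p></body></html>");
        // Print code points rather than characters to sidestep console encoding issues.
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Printing code points instead of the characters themselves helps separate "the parser resolved the entity" from "the output encoding mangled the character", which are easy to confuse when all you see is a "?".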


Cheers,

Eadfrith

Subject: Re: Handling HTML entities in Java's HTMLEditorKit
From: saman007uk20-ga on 27 Apr 2003 01:52 PDT

I suggest you add a backslash before the &, so it would read:
"\&nbsp;". This will tell Java to ignore & as a special character
and just print it out.

Regards,

Saman007uk20

Subject: Re: Handling HTML entities in Java's HTMLEditorKit
From: eadfrith-ga on 27 Apr 2003 12:30 PDT

Vespasian,

So, the problem wasn't that a DTD wasn't being set. I dug a little
deeper: as it parses the HTML, the parser (in fact a DocumentParser)
uses its DTD to resolve entities, via the two forms of the getEntity()
method in the DTD class. It seems that if these methods return null,
indicating that they don't recognise the entity, then the parser
outputs the entity unchanged, which is what you want. So, if we could
plug in our own DTD that returned null when asked for an entity then
we'd be OK. The problem is that the whole HTML parser API is very
unfriendly and it's pretty tricky to change the DTD that it uses.

I've hacked together the code below which attempts to replace the
standard DTD with our own that overrides the getEntity method to
return null.

import java.io.ByteArrayInputStream;
import java.io.BufferedInputStream;
import java.io.InputStreamReader;

import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.Entity;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.DocumentParser;

import javax.swing.text.html.HTMLEditorKit.ParserCallback;

/**
 * The DTDEntityFilter class extends DTD but returns null when 
 * asked to resolve an entity.
 */
public class DTDEntityFilter extends DTD
{
  // Singleton instance
  private static DTD c_instance;
  
  // Use lazy instantiation to create singleton
  public static DTD getInstance()
  {
    if(null == c_instance)
    {
      DTD dtd = new DTDEntityFilter();
      c_instance = ParserDelegatorExt.createDTD(dtd, "html32");
    }
    return c_instance;
  }

  /**
   * We have to extend the ParserDelegator class in order to access 
   * the createDTD method, which is protected.
   */
  static class ParserDelegatorExt extends ParserDelegator
  {
    public static DTD createDTD(DTD dtd, String name)
    {
      return ParserDelegator.createDTD(dtd, name);
    }
  }

  public DTDEntityFilter()
  {
    super("html32");
  }

  public Entity getEntity(int ch)
  {
    return null;
  }

  public Entity getEntity(String name)
  {
    return null;
  }

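  public static void main(String[] args) 
  {
    // Sample input for illustration; substitute the real response bytes.
    byte[] response = "<p>one&nbsp;two</p>".getBytes();

    ByteArrayInputStream responseStream = new ByteArrayInputStream(response);
    BufferedInputStream bufResponse = new BufferedInputStream(responseStream);
    
    // Placeholder: substitute your own ParserCallback subclass here.
    ParserCallback callback = new YourParserCallbackClass(); 

    // Hand the parser our entity-filtering DTD instead of the default one.
    DocumentParser parser = new DocumentParser(DTDEntityFilter.getInstance());
    
    try
    {
      parser.parse(new InputStreamReader(bufResponse), callback, true);
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
  }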
}

Let me know if this works.

Cheers,

Eadfrith

Subject: Re: Handling HTML entities in Java's HTMLEditorKit
From: eadfrith-ga on 27 Apr 2003 18:55 PDT

Vespasian,

I'm not a Google researcher, so this one is on the house :-)

Too bad that we couldn't solve the numeric entity problem. Sun's
implementation of the HTML parser is pretty lousy. As you've
discovered, it never calls getEntity(int). Instead, it just resolves
the entities by replacing them with a single character whose Unicode
value is that specified by the entity. This is buried in a private
method in the Parser implementation, so we can't alter the standard
behaviour. It would have been far better if they had called
getEntity(int) and put the same code in the default implementation of
this method in the DTD class. This would have let us modify the
behaviour, as we have with textual entities.

Anyway, I think you now have two options:

1. Have your handleText method re-encode the entities. Presumably your
current handleText method does something simple like this:

public void handleText(char[] data, int pos)
{
  responseWriter.write(data);
} 

you could do something like this instead:

  public void handleText(char[] data, int pos)
  {
    for (int i=0; i<data.length; i++) 
    {
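      if(data[i] > 127)
      {
        // Cast to int: concatenating a char would append the character
        // itself rather than its numeric value.
        responseWriter.write("&#" + (int) data[i] + ";");
      }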
      else
      {
        responseWriter.write(data, i, 1);
      }
    }
  }

The problem with this approach is that if characters that have textual
entities were instead encoded using numeric entities then you'd miss
them. For example, if the HTML source used &#060; instead of &lt; then
you'd incorrectly pass it through unencoded. I don't think there's
anything to be done about this.
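
The re-encoding step in option 1 can also be pulled out into a small helper so it can be tested on its own. This is only a sketch: NumericEntityEncoder and encode are made-up names, and walking the array by code point (so that a surrogate pair becomes one reference instead of two) is an addition that the loop above does not attempt.

```java
// Hypothetical helper, not part of the original code in this thread.
final class NumericEntityEncoder {

    /** Returns the text with every non-ASCII code point written as a decimal &#N; reference. */
    static String encode(char[] data) {
        StringBuilder out = new StringBuilder(data.length);
        int i = 0;
        while (i < data.length) {
            // Reads one code point, combining a valid surrogate pair if present.
            int cp = Character.codePointAt(data, i);
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Inside handleText() this would be used as responseWriter.write(NumericEntityEncoder.encode(data)). It has the same limitation described above: ASCII characters such as < that arrived as numeric entities still pass through unencoded.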

2. The other approach is to reconsider saman007uk20's solution. You
could do this by wrapping the input stream in a filter and escaping all
'&' characters before they get read by the parser. You could use
BufferedReader to get started. Let me know if you want to explore this
solution.
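
A rough sketch of option 2 follows, with some loud assumptions. AmpersandEscaper and its methods are made-up names. Instead of a true streaming filter it simply reads the whole input into memory and rewrites it, which is simpler but not suitable for very large pages. The idea is that every "&" becomes "&amp;", so that when the standard parser resolves "&amp;" back to "&", the text reaching handleText() should contain the original entity spelled out (for example "&nbsp;").

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of the original code in this thread.
final class AmpersandEscaper {

    /** Returns a copy of the text in which each '&' is replaced by "&amp;". */
    static String escape(String html) {
        return html.replace("&", "&amp;");
    }

    /** Reads all of the input and returns a Reader over the escaped copy. */
    static Reader wrap(Reader in) throws IOException {
        StringBuilder buf = new StringBuilder();
        char[] chunk = new char[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        return new StringReader(escape(buf.toString()));
    }
}
```

It would be used as parser.parse(AmpersandEscaper.wrap(new InputStreamReader(bufResponse)), callback, true). One caveat: this escapes every "&" in the document, including those inside attribute values such as URLs and inside scripts, so whether the result is acceptable depends on what the callback does with those parts.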

Cheers,

Eadfrith
