Hello,
I would like to fetch a web page and then parse it.
It cannot be parsed directly (e.g. with an XSLT processor) because it is not well-formed (it is plain HTML). So it first has to be converted to XHTML, and only then can it be parsed.
After reading the page, I tried to convert it to XHTML with JTidy, but without success. My code is as follows:
Tidy tidy = new Tidy();
tidy.setMakeClean( true ); // no clutter
tidy.setXmlTags( true ); // treat the input as XML
URL url;
url = new URL( "the HTML page" );
Reader inputStream = new InputStreamReader( url.openStream() );
BufferedReader in = new BufferedReader( inputStream );
for ( String s; ( s = in.readLine() ) != null; ){
    FileOutputStream out= new FileOutputStream(s);
    ByteArrayInputStream is = new ByteArrayInputStream(s.getBytes("UTF-8"));
    Document doc = tidy.parseDOM( is, out);
}
And the trace shows the following messages:
org.w3c.tidy.DOMDocumentImpl@e746a2
org.w3c.tidy.DOMDocumentImpl@1ce56f8
line 1 column 5 - Warning: replacing illegal character code 131
line 1 column 5 - Warning: replacing illegal character code 131
org.w3c.tidy.DOMDocumentImpl@1afbbe3
org.w3c.tidy.DOMDocumentImpl@584e97
org.w3c.tidy.DOMDocumentImpl@18fc7ca
org.w3c.tidy.DOMDocumentImpl@85bf5f
org.w3c.tidy.DOMDocumentImpl@d733ca
org.w3c.tidy.DOMDocumentImpl@891d76
org.w3c.tidy.DOMDocumentImpl@1ed4d06
org.w3c.tidy.DOMDocumentImpl@5bece2
org.w3c.tidy.DOMDocumentImpl@11cf4e5
org.w3c.tidy.DOMDocumentImpl@121e5a
Characters codes for the Microsoft Windows fonts in the range
128 - 159 may not be recognized on other platforms. You are
instead recommended to use named entities, e.g. &trade; rather
than Windows character code 153 (0x2122 in Unicode). Note that
as of February 1998 few browsers support the new entities."
line 1 column 1 - Warning: unexpected </head>
org.w3c.tidy.DOMDocumentImpl@3abc87
org.w3c.tidy.DOMDocumentImpl@2f5dda
org.w3c.tidy.DOMDocumentImpl@1bad2e8
org.w3c.tidy.DOMDocumentImpl@6c8255
org.w3c.tidy.DOMDocumentImpl@1e0bf98
org.w3c.tidy.DOMDocumentImpl@42bb13
line 1 column 22 - Warning: replacing illegal character code 131
line 1 column 159 - Warning: unexpected </a> in <img>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
org.w3c.tidy.DOMDocumentImpl@107e4bc
Characters codes for the Microsoft Windows fonts in the range
128 - 159 may not be recognized on other platforms. You are
instead recommended to use named entities, e.g. &trade; rather
than Windows character code 153 (0x2122 in Unicode). Note that
as of February 1998 few browsers support the new entities."
line 1 column 1 - Warning: unexpected </div>
org.w3c.tidy.DOMDocumentImpl@139f953
org.w3c.tidy.DOMDocumentImpl@11fb8c6
org.w3c.tidy.DOMDocumentImpl@19bd1ca
org.w3c.tidy.DOMDocumentImpl@ea58e3
org.w3c.tidy.DOMDocumentImpl@171ccb0
org.w3c.tidy.DOMDocumentImpl@35378d
org.w3c.tidy.DOMDocumentImpl@1d23632
line 1 column 1 - Warning: replacing illegal character code 131
line 1 column 1 - Warning: replacing illegal character code 131
org.w3c.tidy.DOMDocumentImpl@1e42d5a
org.w3c.tidy.DOMDocumentImpl@190c5c0
org.w3c.tidy.DOMDocumentImpl@1a6c214
org.w3c.tidy.DOMDocumentImpl@10fba26
Characters codes for the Microsoft Windows fonts in the range
128 - 159 may not be recognized on other platforms. You are
instead recommended to use named entities, e.g. &trade; rather
than Windows character code 153 (0x2122 in Unicode). Note that
as of February 1998 few browsers support the new entities."
line 1 column 201 - Warning: unexpected </a> in <img>
line 1 column 205 - Warning: unexpected </noscript> in <img>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
...
I would be very grateful for any help!
Best regards,
Hama
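
A note on the approach, offered as a sketch rather than a definitive fix: the loop above hands JTidy one line at a time, so every call sees only a fragment of the page (hence warnings such as "unexpected </head>"), and setXmlTags( true ) tells Tidy to treat the HTML as XML. A likely remedy is to call tidy.setXHTML( true ) instead and to pass the whole stream once, e.g. tidy.parseDOM( url.openStream(), out ), with out pointing at a single output file. Once well-formed XHTML exists, the XSLT step needs only the JDK. The class name XhtmlTransformSketch, the stylesheet LINKS_XSLT and the sample markup below are made up for illustration, and the stylesheet assumes the tidied output declares the XHTML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XhtmlTransformSketch {

    // Hypothetical stylesheet: emits the href of every XHTML <a> element,
    // one per line. The prefix "x" is bound to the XHTML namespace.
    static final String LINKS_XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:x='http://www.w3.org/1999/xhtml'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//x:a'>"
        + "<xsl:value-of select='@href'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to a complete, well-formed XHTML document
    // and returns the transformation result as a string.
    static String transform(String xhtml, String xslt) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter result = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                              new StreamResult(result));
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML that a single whole-document Tidy run would write.
        String xhtml =
              "<html xmlns='http://www.w3.org/1999/xhtml'><head><title>t</title></head>"
            + "<body><p><a href='a.html'>A</a> <a href='b.html'>B</a></p></body></html>";
        System.out.print(transform(xhtml, LINKS_XSLT));
    }
}
```

Running main prints the two href values from the sample document. If the tidied file carries a DOCTYPE, the XML parser may try to load the referenced DTD, so that is worth checking when the transformation appears to hang or fails offline.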