Bug #3522

Encoding bug with lxml

Added by Jean-Philippe Dutreve 9 months ago. Updated 8 months ago.

Status:Resolved Start:2016-09-02
Priority:Normal Due date:
Assigned to:- % Done:100%

Category:Core / Browser2 Spent time: -
Target version:1.2
Module:weboob Branch:

Description

lxml does not recognize some encoding names (latin-1, ISO8859_1) because they are not spelled the way it expects:

  File "/usr/local/lib/python2.7/site-packages/weboob-1.2-py2.7.egg/weboob/browser/pages.py", line 560, in build_doc
    parser = html.HTMLParser(encoding=self.encoding)
  File "/usr/local/lib/python2.7/site-packages/lxml-3.6.0-py2.7-linux-x86_64.egg/lxml/html/__init__.py", line 1887, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "src/lxml/parser.pxi", line 1631, in lxml.etree.HTMLParser.__init__ (src/lxml/lxml.etree.c:114397)
  File "src/lxml/parser.pxi", line 795, in lxml.etree._BaseParser.__init__ (src/lxml/lxml.etree.c:106144)
LookupError: unknown encoding: 'ISO8859_1'

Here's some code to fix it:

vi /usr/local/lib/python2.7/site-packages/weboob-1.2-py2.7.egg/weboob/browser/pages.py:HTMLpage

def build_doc(self, content):
    if self.encoding == 'latin-1':
        self.encoding = 'latin1'
    if self.encoding == 'ISO8859_1':
        self.encoding = 'ISO8859-1'
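
Both rejected names are valid aliases for the same codec in Python's own codec registry, so the LookupError comes from the name lookup inside lxml rather than from Python. A minimal check using only the standard library (the variable names are illustrative and not part of weboob):

```python
import codecs

# Names taken from the tracebacks in this report.
rejected_by_lxml = ['latin-1', 'ISO8859_1']

# codecs.lookup() resolves an alias to a CodecInfo object;
# .name is Python's canonical name for that codec.
canonical = {name: codecs.lookup(name).name for name in rejected_by_lxml}

# True when both aliases resolve to the same Python codec.
same_codec = len(set(canonical.values())) == 1
print(canonical, same_codec)
```

This only shows what Python accepts. Which spellings lxml accepts is decided by the library underneath it, so the replacement names in the patch above should be checked against the installed lxml.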

Related issues

related to weboob - Bug #2602: Module AmericanExpress - Error "unknown encoding: 'ISO88... Resolved 2016-04-03

Associated revisions

Revision 024b71b8e15f11ebbf899788afacde6ebb485351
Added by Laurent Bachelier 9 months ago

browser: Fix some invalid encoding names

fixes #3522
fixes #2602

Thanks to Jean-Philippe Dutreve

Revision 4101ca4fc698d10594a44934b9fd032b35d63761
Added by Laurent Bachelier 8 months ago

browser: Really fix invalid encoding names

024b71b8e15f11ebbf899788afacde6ebb485351 but in the proper place
fixes #3522
fixes #2602

History

Updated by Laurent Bachelier 9 months ago

I've made a tentative fix based on yours (main difference is that I don't change self.encoding)

https://git.symlink.me/?p=laurentb/weboob.git;a=commitdiff;h=7ad66b95a0fae1c7983fcc4896d0a21348c0f3c7
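
The idea of translating the name without changing self.encoding could look roughly like the sketch below. It is not the content of the linked commit: `LXML_ENCODING_ALIASES`, `lxml_encoding_name` and `PageSketch` are hypothetical names, and the two replacement spellings are simply the ones from the patch in the description.

```python
# Hypothetical alias table: only the two names from this report,
# mapped to the spellings used in the reporter's patch.
LXML_ENCODING_ALIASES = {
    'latin-1': 'latin1',
    'ISO8859_1': 'ISO8859-1',
}


def lxml_encoding_name(encoding):
    """Return the spelling to pass to the parser, or the input unchanged."""
    return LXML_ENCODING_ALIASES.get(encoding, encoding)


class PageSketch(object):
    """Stand-in for a page class; not the real weboob HTMLPage."""

    def __init__(self, encoding):
        self.encoding = encoding

    def parser_kwargs(self):
        # Translate only at the point of use; self.encoding is left as-is.
        return {'encoding': lxml_encoding_name(self.encoding)}


page = PageSketch('ISO8859_1')
print(page.parser_kwargs())  # {'encoding': 'ISO8859-1'}
print(page.encoding)         # ISO8859_1
```

In build_doc the result would be passed on as `html.HTMLParser(**page.parser_kwargs())`, so code that later reads self.encoding still sees the original value.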

Updated by Nicolas ROULLEAU 9 months ago

Hi,

I got exactly the same logs as @Jean-Philippe Dutreve, except that the unknown encoding is latin-1. Whichever fix I tried (JP's or Laurent's), I still got the error!

  File "/usr/local/lib/python2.7/dist-packages/weboob-1.2-py2.7.egg/weboob/browser/pages.py", line 613, in build_doc
    parser = html.HTMLParser(encoding=self.encoding)
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1681, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "parser.pxi", line 1640, in lxml.etree.HTMLParser.__init__ (src/lxml/lxml.etree.c:104782)
  File "parser.pxi", line 802, in lxml.etree._BaseParser.__init__ (src/lxml/lxml.etree.c:96957)
LookupError: unknown encoding: 'latin-1'

Any ideas?

Nicolas

Updated by Laurent Bachelier 8 months ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100
