Bug #1743

[citibank] Use plain PhantomJS for login.

Added by Oleg Plakhotniuk about 2 years ago. Updated about 2 years ago.

Status: Resolved
Start: 2015-02-27
Priority: Normal
Due date:
Assigned to: Oleg Plakhotniuk
% Done: 100%
Category: Modules
Spent time: -
Target version: 1.1
Module: citibank
Branch:

Description

No more Firefox. No more Selenium. Hooray!


Related issues

Related to weboob - Bug #1740: [citibank] Use Selenium only for login (Resolved, 2015-02-26)

History

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from In progress to To merge
  • % Done changed from 50 to 100

Done in branch issue1743

Updated by Jean-Philippe Dutreve about 2 years ago

Well done!

But CoffeeScript support was removed in PhantomJS >= 2.0.

So you could add a comment noting that:

- PhantomJS < 2.0 is required (or, better, integrate that into the weboob build dependencies)
- Selenium is not needed because you don't want to parse the page content (just read the resulting cookies). Otherwise Selenium would be required.

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from To merge to In progress
  • % Done changed from 100 to 50

> But CoffeeScript support was removed in PhantomJS >= 2.0.

Whoops, didn't know that. I'll rewrite it in JavaScript then.

Updated by Oleg Plakhotniuk about 2 years ago

Crap, now setTimeout stopped working... It seems to be caused by a bug in PhantomJS.

Updated by Oleg Plakhotniuk about 2 years ago

Well, PhantomJS turned out to be an even bigger pain in the ass than Firefox + Selenium.
After using that brainy thing for a little longer, I got rid of standalone browsers altogether.
Now I only use the V8 JavaScript engine to run some dynamically generated code from the login page.

I'll update my branch once I'm done with testing.
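For readers who want a picture of the general idea, here is a minimal sketch of pulling inline scripts out of a login page and handing them to a JavaScript engine. It is not the code from branch issue1743. The names `InlineScriptCollector`, `extract_inline_scripts`, `compute_login_token` and the `evaluate_js` callable are all hypothetical; `evaluate_js` stands in for whatever engine binding is actually used, and the assumption that the relevant code lives in inline `<script>` tags is mine.

```python
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collect the bodies of inline <script> tags (those without a src attribute)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_inline_script = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # External scripts (with src) carry no inline code, so skip them.
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several chunks; keep them all.
        if self._in_inline_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_inline_script:
            self.scripts.append("".join(self._chunks))
            self._in_inline_script = False


def extract_inline_scripts(html):
    """Return the source of every inline script in document order."""
    parser = InlineScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts


def compute_login_token(html, evaluate_js):
    """Feed the page's inline scripts to `evaluate_js` and return its result.

    `evaluate_js` is a hypothetical callable (source string -> value);
    it is a placeholder, not a real library API.
    """
    source = "\n".join(extract_inline_scripts(html))
    return evaluate_js(source)
```

Keeping the engine behind a plain callable means the HTML handling can be tested without any JavaScript runtime installed, and the engine can be swapped later.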

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from In progress to To merge
  • % Done changed from 50 to 100

OK, we're good to go. Done in branch issue1743

Updated by Oleg Plakhotniuk about 2 years ago

I'll write down some notes for the future, while memories are still fresh in my head. Everything said below is about Citibank website only.

I was running this module on a 64-bit Arch Linux box with 512MB RAM, 4GB swap, and a 1GHz CPU, inside a Docker container. All software versions are the latest available as of 2014-11-11.

Python Requests:
  • (cons) Blocker: I couldn't figure out a good way to handle the dynamically generated, obfuscated JavaScript parts used in the login process.
  • (pros) Low memory; fast.
Firefox + Selenium:
  • (cons) The Firefox process wasn't dying properly even after the Python process exited. I had to kill it with "kill -9" externally.
  • (cons) Memory-hungry; slow; leaves zombie processes behind; needs an X server to run; largest amount of code.
  • (pros) Easiest way to scrape a JavaScript-intensive website. Got it up and running in a day.
PhantomJS + Selenium:
  • (cons) Blocker: Sometimes PhantomJS gets stuck in the middle of scraping.
  • (cons) Blocker: Cannot download files.
  • (pros) Less memory-hungry than Firefox; doesn't need an X server; no loose processes; no zombies.
  • (pros) It's still pretty easy to scrape websites with rich clients.
PhantomJS:
  • (cons) Blocker: Timers and event callbacks sporadically stop working, at least when scraping the Citibank website.
  • (pros) No dependencies on Selenium. A bit faster.
V8 + Python Requests (current solution):
  • (cons) Requires more brainwork than any of the above.
  • (pros) Low memory; fast; smallest amount of code.
There were 3 working solutions:
  • Firefox + Selenium for full scraping (#1642);
  • Firefox + Selenium for login, Python Requests for the rest of scraping (#1740);
  • V8 for login, Python Requests for the rest of scraping (#1743).
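The "one tool for login, plain HTTP for the rest" split used by the last two solutions comes down to handing session cookies from the login step to the HTTP client. A minimal sketch of that handoff, assuming the login step yields cookies as dicts with `name`, `value`, `domain`, `path`, `secure` and `expiry` keys (the shape Selenium WebDriver's `get_cookies()` returns); the helper name `cookies_to_jar` is hypothetical and this is not the module's actual code:

```python
from http.cookiejar import Cookie, CookieJar


def cookies_to_jar(cookie_dicts):
    """Build a standard CookieJar from a list of cookie dicts.

    The dict keys are an assumption about what the login step produces.
    """
    jar = CookieJar()
    for c in cookie_dicts:
        domain = c.get("domain", "")
        expiry = c.get("expiry")
        jar.set_cookie(Cookie(
            version=0,
            name=c["name"],
            value=c["value"],
            port=None,
            port_specified=False,
            domain=domain,
            domain_specified=bool(domain),
            domain_initial_dot=domain.startswith("."),
            path=c.get("path", "/"),
            path_specified=True,
            secure=bool(c.get("secure", False)),
            expires=expiry,
            discard=expiry is None,  # session cookie when no expiry is given
            comment=None,
            comment_url=None,
            rest={},
        ))
    return jar
```

The resulting jar is a regular `http.cookiejar.CookieJar`, so it can be attached to `urllib.request.HTTPCookieProcessor` or copied into the cookie store of another HTTP client for the rest of the scraping.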

Updated by Laurent Bachelier about 2 years ago

Your last solution sounds pretty interesting! Moreover, there might be even lighter JavaScript interpreters. Perhaps it will grow into a more general solution usable by other modules :)

Updated by Oleg Plakhotniuk about 2 years ago

Thanks, Laurent! Yeah, when we have more use cases we can generalize it into something reusable.

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from To merge to In progress
  • % Done changed from 100 to 50

Hold on, the website connection times out every once in a while, so I need to tweak the retries and waiting time.
It's a couple of lines, so I'll just add it to this patch.
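As an illustration of what "retries and waiting time" can look like, here is a minimal retry helper with a growing delay between attempts. It is a sketch, not the change that went into the patch; the name `call_with_retries`, its parameters and the default values are all hypothetical.

```python
import time


def call_with_retries(operation, attempts=3, initial_delay=1.0, backoff=2.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call `operation()` until it succeeds or `attempts` is exhausted.

    Waits `initial_delay` seconds after the first failure and multiplies the
    wait by `backoff` after each further failure. Exceptions not listed in
    `retry_on` propagate immediately.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff
```

Passing `sleep` as a parameter makes the waiting behaviour easy to test without actually pausing.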

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from In progress to To merge
  • % Done changed from 50 to 100

Done. I also rebased the patch onto recent master.

Updated by Oleg Plakhotniuk about 2 years ago

  • Status changed from To merge to Resolved

Updated by Romain Bignon about 2 years ago

  • Target version set to 1.1
