Project via GitHub

Project description

Main task

The assignment was to write a parser for a Spanish government website that stores information about different types of companies. I had to cover two categories: agricultural and industrial production. The parser also needed to check that each production record contained the mandatory fields, which in my case turned out to be name and email. All collected data then had to be saved to a .csv file.
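To illustrate the mandatory-field check and the CSV export, here is a minimal sketch, not the project's actual code. It assumes each company is represented as a dictionary. The field names `name` and `email` come from the description above, while the function names and the `phone` column in the usage note are hypothetical.

```python
import csv
import io

# Only "name" and "email" are confirmed as mandatory by the description;
# the tuple can be extended with other fields.
MANDATORY_FIELDS = ("name", "email")


def has_mandatory_fields(record, mandatory=MANDATORY_FIELDS):
    """Return True if every mandatory field is present and not blank."""
    return all(str(record.get(field) or "").strip() for field in mandatory)


def write_csv(records, out, fieldnames, mandatory=MANDATORY_FIELDS):
    """Write the records that pass the check to `out`; return how many were kept."""
    # extrasaction="ignore" drops keys that are not listed in fieldnames;
    # missing keys are written as empty cells (the DictWriter default).
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    kept = 0
    for record in records:
        if has_mandatory_fields(record, mandatory):
            writer.writerow(record)
            kept += 1
    return kept
```

With a real file, the call would look something like `write_csv(records, open("companies.csv", "w", newline="", encoding="utf-8"), ["name", "email", "phone"])`, where the file name and the column list are placeholders.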

Problems and their solutions

Choosing the type of production on the site requires passing a captcha. I did not want to spend the client's money on a subscription to a captcha-solving service, especially since the parser ended up being used only twice. The problem could have been solved with Selenium, but I solved it with session cookies instead. The client simply opened the site in his browser, solved the captcha there, and copied his session ID; my code then automatically substituted that ID into the session cookies of the parser's own session.
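The cookie substitution can be sketched roughly as follows, using only the Python standard library. This is an assumption-laden illustration: the cookie name `JSESSIONID` and the function name are hypothetical, since the write-up does not say which cookie the site uses or which HTTP library the parser relied on.

```python
import urllib.request


def make_session_opener(session_id, cookie_name="JSESSIONID"):
    """Build a URL opener that sends the browser's session ID with every request.

    `cookie_name` is a placeholder; the real name has to be read from the
    browser's developer tools for the target site.
    """
    session_id = session_id.strip()
    if not session_id or ";" in session_id:
        raise ValueError("session ID must be a single non-empty cookie value")
    opener = urllib.request.build_opener()
    # Headers in `addheaders` are attached to every request made by this opener.
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0"),
        ("Cookie", f"{cookie_name}={session_id}"),
    ]
    return opener
```

Pages would then be fetched with `opener.open(url)`. Because the server sees the session that already passed the captcha, no captcha solver is needed for as long as that session stays valid.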

Meeting the technical specification

Because the captcha is handled through cookies, the user can choose absolutely any type of production, not only the two named in the specification. To check the mandatory fields, I wrote a separate script with a config in which other mandatory fields can be specified in addition to those required by the specification. I compiled the parser into an .exe file, which was the most convenient form for the customer, and wrote a UI to make it easy to use. I also added caching, so that if something goes wrong during parsing, it can be resumed from where it stopped last time.
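The resume-after-failure behaviour can be sketched as a simple checkpoint file. This is a minimal sketch under assumptions, not the original implementation: it supposes each item to parse has a stable ID, and the JSON format, the function names, and the `fetch` callback are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of item IDs already processed, or an empty set."""
    try:
        with open(path, encoding="utf-8") as fh:
            return set(json.load(fh))
    except FileNotFoundError:
        return set()


def save_checkpoint(path, done):
    """Write the checkpoint via a temporary file, then replace the old one.

    os.replace swaps the file in one step, so a crash during the write
    cannot leave a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def parse_all(item_ids, fetch, checkpoint_path):
    """Call fetch(item_id) for every ID not yet recorded in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    results = []
    for item_id in item_ids:
        if item_id in done:
            continue  # already handled in a previous run
        results.append(fetch(item_id))
        done.add(item_id)
        save_checkpoint(checkpoint_path, done)
    return results
```

In this sketch only the progress is persisted; a real parser would also need to store each result as it arrives (for example by appending to the .csv file), otherwise the results gathered before a failure are lost.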

Note

I tried to reuse the parser some time later, but the site had changed a lot. It now has a separate API for accessing the data, which did not exist when the parser was written, and additional SSL certificate protection has been added.