PHP, Perl data extractor

Guest, 19 years on the service
21.10.2006

We're using regular expressions to pull specific information from HTML pages. The information is always the same; the HTML varies.

We are data mining. For example, you will go to PriceGrabber.com and use regular expressions to pull specific information. We have over 40 websites that we need to do this for. Each one should take you 1-2 hours.

Example

-----------

We will be extracting from the HTML of pages like this:

http://www99.shopping.com/xFS?KW=sony+vaio

This particular page is handled by 3 regular expressions:

1. /^[\n\W\w]+?(([\n\W\w]+?)\n\n)/i

The replacement is $1. This clips off the beginning of the page and leaves the rest of the page intact, starting with the results.

2. /(([\n\W\w]+?)\n\n)[ \n]+ (?:)[\n\W\w]+$/i

The replacement is $1. This clips off the end of the page after the results and leaves nothing but the results to comb through.

3. /[\n\W\w]+?\n *(.+?)\n[\n\W\w]+?[\n ]+(?:)*[\n ]+(.+?)\n[\n\W\w]+?

[\n ]+(.+?)\n[\n\W\w]+?

[\n ]+/i

This last expression is replaced by $1æ$2æ$3æ$4æ$5ææ, where æ marks the end of each item detail and ææ marks the end of the product once all information is gathered.
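The three steps above (clip the head, clip the tail, rewrite each product) can be sketched roughly as follows. This is Python rather than PHP or Perl, purely for illustration. The sample markup, the `<!-- results -->` markers and all three patterns are made-up assumptions, not the expressions used for shopping.com.

```python
import re

# Hypothetical page; real markup differs for every site.
SAMPLE_HTML = """<html><body><div id="nav">menu</div>
<!-- results -->
<div class="item"><a href="/p/1"><img src="/i/1.jpg"></a>
<h3>Sony VAIO A</h3><p>15 inch laptop</p><span class="price">$999.00</span></div>
<div class="item"><a href="/p/2"><img src="/i/2.jpg"></a>
<h3>Sony VAIO B</h3><p>17 inch laptop</p><span class="price">$1,299.00</span></div>
<!-- /results -->
<div id="footer">footer</div></body></html>"""

FLAGS = re.I | re.S  # case-insensitive, and "." also matches newlines

# Step 1: drop everything before the results, keep group 1.
HEAD = re.compile(r"^.*?<!-- results -->\n(.*)$", FLAGS)
# Step 2: drop everything after the results, keep group 1.
TAIL = re.compile(r"^(.*?)<!-- /results -->.*$", FLAGS)
# Step 3: one match per product, with five capture groups:
# URL, image URL, name, description, price.
ITEM = re.compile(
    r'<div class="item"><a href="(.*?)"><img src="(.*?)"></a>\s*'
    r'<h3>(.*?)</h3><p>(.*?)</p><span class="price">(.*?)</span></div>\s*',
    FLAGS,
)
# Replacement template in the format described above.
TEMPLATE = r"\1æ\2æ\3æ\4æ\5ææ"


def extract(html):
    """Run the three substitutions in order and return the delimited string."""
    body = HEAD.sub(r"\1", html, count=1)
    body = TAIL.sub(r"\1", body, count=1)
    return ITEM.sub(TEMPLATE, body)


if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

In PHP the same sequence would map onto three `preg_replace` calls, and in Perl onto three `s///` substitutions.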

The information is as follows, in this exact order:

1. product URL

2. product image URL

3. product name

4. description (if any)

5. item price
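A minimal sketch of reading that delimited output back into records, again in Python for illustration. The field names are hypothetical labels for the five items listed above. One detail worth noting: because the description may be empty, an empty fourth field also produces ææ, so splitting on ææ alone would be ambiguous. The sketch instead splits on single æ and takes fixed groups of six tokens (five fields plus the empty token created by the doubled terminator). It assumes æ never occurs inside a field.

```python
# Hypothetical field labels, in the order given in the posting.
FIELDS = ("url", "image_url", "name", "description", "price")
SEP = "æ"


def parse_products(blob):
    """Turn 'f1æf2æf3æf4æf5ææ...' into a list of dicts, one per product."""
    tokens = blob.split(SEP)
    step = len(FIELDS) + 1  # five fields + one empty token from the 'ææ' terminator
    products = []
    for start in range(0, len(tokens) - step + 1, step):
        chunk = tokens[start:start + step]
        if chunk[-1] != "":
            raise ValueError(
                "record %d is not terminated by a double separator" % (start // step)
            )
        # zip stops after the five field names, so the empty token is dropped.
        products.append(dict(zip(FIELDS, chunk)))
    return products


if __name__ == "__main__":
    sample = "/p/1æ/i/1.jpgæSony VAIO Aæ15 inch laptopæ$999.00ææ"
    for product in parse_products(sample):
        print(product["name"], product["price"])
```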

Assume that each site you get data from uses a different layout for a given keyword search. Most websites can be handled with a single set of 3 expressions; others require 2 or more sets.
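One way to organize several expression sets per site is a lookup table that is tried in order until a set produces results. The sketch below is only an assumption about how this could be arranged: the site name, the markup and the patterns are invented, and each item pattern captures a single field to keep the example short (a real set would capture all five).

```python
import re
from collections import namedtuple

# One "set" = head-clipping, tail-clipping and item expressions.
ExpressionSet = namedtuple("ExpressionSet", "head tail item")


def compile_set(head, tail, item):
    flags = re.I | re.S
    return ExpressionSet(
        re.compile(head, flags), re.compile(tail, flags), re.compile(item, flags)
    )


# Hypothetical site with two layouts, so it needs two sets.
SITE_RULES = {
    "example-shop.test": [
        compile_set(r'^.*?<ul id="results">', r"</ul>.*$", r"<li>(.*?)</li>"),
        compile_set(r'^.*?<table class="grid">', r"</table>.*$", r"<td>(.*?)</td>"),
    ],
}


def extract_items(site, html):
    """Try each expression set for the site; return the first non-empty result."""
    for rules in SITE_RULES.get(site, []):
        if not rules.head.search(html):
            continue  # this layout is not present on the page
        body = rules.head.sub("", html, count=1)
        body = rules.tail.sub("", body, count=1)
        items = rules.item.findall(body)
        if items:
            return items
    return []


if __name__ == "__main__":
    page = '<html><table class="grid"><tr><td>C</td></tr></table></html>'
    print(extract_items("example-shop.test", page))
```

Keeping the patterns in data rather than in code means a new site, or a new layout on an existing site, only needs another entry in the table.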

We will need to data mine from over 40 different comparison shopping sites.