Help with a script to scrape an HTML table to CSV

Hi guys, I know this is completely off-topic for NethServer, but we have quite a few decent coders in our community. So I thought maybe one of you can help this non-coder create a script that scrapes an HTML table from a website daily. The URL of the website is always the same.
The table has a clear table definition inside a div.
The table header is always the same:

<div class="text">
<table class="table table-striped" cellspacing="0">
<thead>
<tr>
<th>Column one
<th>Column two
<th>Column three
<th>Column four
</tr>
</thead>
The table contents are updated every day and can be the same as the previous day's; new rows can also be added.
The table body is between
<table>
<tbody>
</tbody>
</table>

tags

The table is not the only content on the webpage.

Ultimately I would like to run the script as a cron job on my NethServer machine and store the output as a CSV file with a timestamp as its name.

Any pointers and help are very much appreciated…

@robb

In Perl, this could be a simple one-liner…

Flow:

  • Download the HTML
  • Parse the HTML with Perl and map the table rows into CSV fields
  • Save as CSV
  • Output some feedback to a log or by mail…

@stephdl is probably more adept than me in this… :slight_smile:

My 2 cents
Andy

2 Likes

Perhaps this can help:

curl "http://www.webpagewithtableinit.com/" 2>/dev/null | grep -i -e '</\?TABLE\|</\?TD\|</\?TR\|</\?TH' | sed 's/^[\ \t]*//g' | tr -d '\n' | sed 's/<\/TR[^>]*>/\n/Ig'  | sed 's/<\/\?\(TABLE\|TR\)[^>]*>//Ig' | sed 's/^<T[DH][^>]*>\|<\/\?T[DH][^>]*>$//Ig' | sed 's/<\/T[DH][^>]*><T[DH][^>]*>/,/Ig'

Information on how it works can be found at the following link:

1 Like

Thanks so far.
I was fiddling with LibreOffice Calc and I can download the table:
Go to the Sheet menu,
click “Link to External Data”,
type in the URL of the webpage containing the table,
at the import options, just hit Enter.
The next window is “Select available tables/ranges”;
select the first table, HTML_1, and hit Enter.
The table is downloaded and imported into LibreOffice Calc.

BUT, this is not going to be very useful when I need it done on a daily basis at a set time…

@m.traeumner where is the output of the curl command going? Should I just add a -o {timestamp}.csv or something like that?

Yes, I would do so, or just redirect the end of the pipeline to a file. In the example the output only goes to the terminal, which doesn’t make much sense for your case.

Hi

Here’s an example…

hope it helps
Andy

1 Like

Another good example:

:slight_smile:

When I change the URL to the one I need, I don’t get any output at all (not even to the terminal).
Could it be that the sed/awk foo needs tweaking to match the exact format of the website?

Hi @robb,
there was one comment on the post:

Cool idea, but doesn’t work for a lot of HTML files. Mine have the data inside the table on its own line (i.e. <td>\ncell value\n</td> ), and this script ends up stripping out that data because it removes every line that doesn’t have a table tag in it.

Perhaps this is the problem. I think you would have to analyse your HTML page and adjust the sed commands accordingly.

I think it’s not the HTML format, it’s the table format.

That might be it, since the data sits on its own lines inside the table cells:

<table class="table table-striped" cellspacing="0">
<thead>
<tr>
<th>column1</th>
<th>column2</th>
<th>column3</th>
<th>column4</th>
</tr>
</thead>
<tbody>
<tr>
<td>
data column1<br/> additional data column1
</td>
<td>
data column 2
</td>
<td>
data column 3
</td>
<td class="text-center">
data column 4
</td>
</tr>

I could use some help changing the script, since I am not that handy with ‘sed foo’.
Also, ideally the <br/> in column 1 would be stripped out of the CSV.

I don’t know sed either, but I’ve found something interesting.
You could try to save the HTML file with curl -o table.html, then run sed on it and redirect its output to a new file:

sed 's/one/two/g' inputfile > outputfile

The s command does the substitution: it searches for a string (here one) and replaces it with another (here two).
The trailing g makes the replacement global, so every occurrence on a line is replaced, not just the first.

Here is a reference:

Speaking from my experience, scraping HTML pages with sed or similar tools can bring only pain :smiley:.

You should use a better tool; almost any programming language has its own HTML parser.
Take a look at:

2 Likes

So should I create a vhost or something to run such a PHP script? It’s all like magic to me, but I’ll have a small dive into PHP then.

You do not need to create a virtual host: you can use PHP from the command line.

For example, put this in a file named test.php:

#!/usr/bin/php
<?php

// Load the remote site
$url = "https://www.nethserver.org";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$res = curl_exec($ch);
curl_close($ch);

// Parse the content
$dom = new DomDocument();
@$dom->loadHTML($res);

// Navigate the tree
$li = $dom->getElementById('menu-item-112');
$children = $li->getElementsByTagName('span');
foreach ($children as $child) {
    echo $child->nodeValue."\n";
}

Then, make the file executable and launch it:

chmod a+x test.php
./test.php

The output should be Learn More, the text of a menu item from the NethServer site.
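
For your table the same pattern should work; you just navigate to the <table> instead of the menu item. Here is an untested sketch, assuming the target table is the first (or only) <table> on the page:

// Grab the first table on the page and print its rows comma-separated
$table = $dom->getElementsByTagName('table')->item(0);
foreach ($table->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->getElementsByTagName('td') as $cell) {
        // nodeValue returns only the text, so tags like <br/> disappear by themselves;
        // collapsing whitespace cleans up cells whose data sits on its own lines
        $cells[] = trim(preg_replace('/\s+/', ' ', $cell->nodeValue));
    }
    if ($cells) {
        echo implode(',', $cells) . "\n";
    }
}

The header row contains <th> cells, so it is skipped here. For real CSV output (quoting, commas inside the data) fputcsv() is the safer choice.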

1 Like

How do I send the output to a new file with a timestamped name? Something like output{timestamp}.csv (the output will already be delimited in some way, since it is already a table on the webpage).

I didn’t try it, so there may still be mistakes, but you can try this:

You can get the date and time with:

// Get Date and Time as a String and write it to $date
$date=date("Y-m-d-H-i");

Saving the file should look like this:


// Create the filename (timestamped) and an empty data variable
$filename = "output" . $date . ".csv";
$data = "";
// Create the (still empty) file
file_put_contents($filename, $data);
// Navigate the tree
$li = $dom->getElementById('menu-item-112');
$children = $li->getElementsByTagName('span');
foreach ($children as $child) {
    // echo $child->nodeValue."\n";
    // Append this value to the file without overwriting the old content
    $data = $child->nodeValue . "\n";
    file_put_contents($filename, $data, FILE_APPEND);
}

$data should hold the content of the file, which you should get from @giacomo’s script, I think.

@robb Did you try it?

I’ve been following the discussion on scraping HTML tables to CSV and found the suggestions here very insightful. After considering various approaches, I’m leaning towards using PHP for this task, as Giacomo suggested. PHP’s built-in DOM parsing capabilities seem to offer a more structured and maintainable approach compared to traditional command-line utilities. This method also seems to handle HTML’s inherent complexities better. I’m planning to experiment with this approach and will adjust the parsing logic to accommodate the specific format of my target table. Thanks, everyone, for the valuable input!
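
As a starting point, here is a rough, untested sketch based on @giacomo’s example, tying the pieces together. The URL is a placeholder, it assumes the target table is the only <table> element on the page, and it needs the php-cli, php-curl and dom extensions:

#!/usr/bin/php
<?php
// Untested sketch: download the page, parse the only <table> on it
// and write its rows to a timestamped CSV file.

$url = "https://www.example.com/page-with-table";   // placeholder URL

// Download the page
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML
$dom = new DomDocument();
@$dom->loadHTML($html);

// Find the table
$table = $dom->getElementsByTagName('table')->item(0);
if ($table === null) {
    fwrite(STDERR, "No table found\n");
    exit(1);
}

// Open a timestamped CSV file, named output-<Y-m-d-H-i>.csv
$filename = "output-" . date("Y-m-d-H-i") . ".csv";
$fh = fopen($filename, "w");

// Walk the rows: header cells (<th>) and data cells (<td>)
foreach ($table->getElementsByTagName('tr') as $row) {
    $cells = [];
    foreach ($row->childNodes as $node) {
        if (in_array(strtolower($node->nodeName), ['th', 'td'], true)) {
            // nodeValue drops tags such as <br/>; collapse the leftover whitespace
            $cells[] = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
        }
    }
    if ($cells) {
        fputcsv($fh, $cells);   // fputcsv takes care of quoting and separators
    }
}
fclose($fh);

If that ends up as, say, /usr/local/bin/scrape_table.php (a made-up path), a daily crontab entry like 0 6 * * * /usr/local/bin/scrape_table.php should cover running it at a set time.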

Just to add to this, this tool could be cool as well for checking website changes.

It’s open source and self-hostable.

You can watch for changes and trigger a fetch for the data that has been updated.