Scraping websites
scraping stuff
See Web crawler
facebook videos
Embedded Facebook videos may be downloaded by pasting the video link into a third-party downloader such as:
https://www.fbdown.net/download.php
httrack
This is a neat tool: it can scrape a website and also follow and download off-site hyperlinked MIME data.
- Run httrack or webhttrack to scrape a website
- use same address or same domain
- use level=3
- supply:
http://localhost:81/mediawiki/index.php/HPC_Report
- on completion, drill down into the localhost_81 directory and delete (see the command sketch after this list):
- all ri: files
- all Special: files
- all Arising: files
- then search for unwanted .pdf files and delete them
find localhost_81 -type f -name '*.pdf' -print
- then test all the Navigation Links
- test the Index and contained links to ensure there is no Arising Proprietary or Confidential Information.
- when archiving do not include the hts-cache
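A rough command-line version of the cleanup above (a sketch only - the directory name depends on the scrape, e.g. localhost_81, and the exact file names depend on how httrack encodes the ':' in page titles):
# remove the unwanted namespace pages pulled down with the mirror
find localhost_81 -type f \( -name 'ri:*' -o -name 'Special:*' -o -name 'Arising:*' \) -print -delete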
TO DO
- make a new category for Arising stuff and move pages to that category.
- make a separate LAMP containing the report and data
You could try these commands:
httrack -O hpc-report2 -*p3 -B -a --continue https://hpc.arising.com.au/mediawiki/index.php/HPC_Report -*?title=* -*/images/thumb/* -*/ri\:* -*/Special:*
httrack -O hpc-report3 -*p3 -B -a https://hpc.arising.com.au/mediawiki/index.php/HPC_Report -*?title=* -*/images/thumb/* -*/ri\:* -*/Special:* -*/File:* -r4
UWC
UWC is a Universal Wiki Converter that can output Confluence markup to assist with importing from various wiki and plain-file formats into a Confluence wiki. It is developed in Java, and I have trialled it to convert this mediawiki (currently a Windows 10 deployment).
- migration notes https://migrations.atlassian.net/wiki/spaces/UWC/overview?mode=global
- documentation https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015876/UWC+Current+Documentation
- git repository https://bitbucket.org/appfusions/universal-wiki-converter/src/master/
The development notes are much better than the regular documentation. One suggestion is to turn off the automatic Confluence upload.
UWC uses a MySQL driver to contact the default mediawiki database. The MySQL client may be installed on Debian (for testing) via:
apt-get install mysql-client
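A quick connectivity test before running UWC (a sketch - wikihost, wikiuser and wikidb are placeholders for your deployment, and the page table may carry a $wgDBprefix):
# confirm the client can reach the mediawiki database UWC will read from
mysql -h wikihost -P 3306 -u wikiuser -p -D wikidb -e 'SELECT COUNT(*) AS pages FROM page;'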
Working Notes
UWC is executed through the following stages:
- configuring
- export
- conversion (and Confluence upload)
Configuring is performed by editing the properties files to set up the export settings for the source wiki you wish to export.
Exporting is then performed from the command line: run_wcs.sh -e <path-to-export.properties>. This brings up a UI anyway (which is a little strange), but you can also execute the exporter directly out of uwc.jar.
Exporting extracts the wiki markup from the wiki database into variously named pages.
The page naming rules are only scantily described in the documentation; the exported pages are partitioned between directories named either for well-known namespaces:
- Pages
- Users
or for the numeric IDs of namespaces that have been created locally, e.g.
- 100
- 102
- 103
(For the Arising HPC report mediawiki only some Pages will be of project relevance, since namespace 100 holds How Tos, and 102 & 103 are ri: namespaces.)
The mediawiki images are obtained from the static web-server content via the file system (this is named attachments in the UI). This path must be specified for the conversion process, and UWC must be run where the static files are available on the file system (they could be mounted via samba, as sketched below), because uwc is unable to obtain files directly across the network.
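A sketch of mounting the static image files over samba so uwc can see them locally (the share name, mount point and username are hypothetical):
sudo mkdir -p /mnt/wiki-images
sudo mount -t cifs //wikihost/mediawiki-images /mnt/wiki-images -o username=youruser,ro
# then point the UWC attachments setting at /mnt/wiki-images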
Next, the pages or directory to be converted must be supplied in the pages dialog via the Add button; the Convert button then becomes available and can be pressed to start the conversion.
If you go to the extended properties you may disable the upload to Confluence while you are testing.
You can also set the property engine-saves-to-disk=true to have UWC create an output directory (from where it is run) and place all the Confluence mark-up files into that directory under the same names as the exported page names.
If upload is enabled, then uwc uses the Confluence endpoint http://hostname/rpc/xmlrpc to upload the pages.
hints
- The exported files are in mediawiki format - you can also obtain these by editing each page and copying the markup into an appropriately named .txt file, then directing uwc at those pages to perform the conversion and save/upload as required.
- Confluence markup output files can also be written to disk by setting the property as above (and can be manually uploaded via a Confluence Edit instead of using xmlrpc to upload).
- Map your attachments to the file-system that is running uwc so images etc are available.
xmldump
You can also xmldump from the mediawiki into XML format and then run php [1] over it to convert it into a collection of pages that uwc can understand.
Moving a mediawiki
These maintenance scripts have values derived from
/www/mediawiki/includes/DefaultSettings.php
- ensure that $wgDBserver is set to the correct host
- ensure that $wgDBport is set to the correct value (e.g. 3306 for the R: wiki and 3406 for the H: wiki);
The Administrator credentials are sourced from
/www/mediawiki/AdminSettings.php
- set the $wgDBadminuser = 'root' ; // for old wikis
- set the $wgDBadminpassword = ; // for old wikis
The new /mnt/repos/www/hpc.arising.com.au/mediawiki/includes/ has new credentials:
- $wgDBname = "hpcmediawikidb"
- $wgDBuser = "hpcuser";
- $wgDBpassword = "m3d1@p@ss";
Then /mnt/repos/www/hpc.arising.com.au/mediawiki/includes/DefaultSettings.php
- $wgDBport = 5432;
- $wgDBname = 'my_wiki';
- $wgDBuser = 'wikiuser';
- $wgDBpassword = ;
- $wgDBadminuser = null;
- $wgDBadminpassword = null; /* use defaults */
There are massive schema changes between mediawiki versions; XML import and export only work if you are mirroring an existing version.
- manually https://www.mediawiki.org/wiki/Manual:Moving_a_wiki
- export
- import
- bots (that may be converted)
Scripting Techniques
These scripted procedures have been the most useful for migrating from a really old mediawiki to the latest version.
export images
- cd to the www/mediawiki directory
- ensure you have setup the database credentials for access by the maintenance scripts
- generate a list of images to upload
php.exe maintenance/dumpUploads.php | sed 's/^/cp /' | sed 's/\\/\//g' | sed 's/$/ backup-dir/' > db-copy.sh
- alternatively build the copy script from the images directory itself (if you don't trust the database; omit the archive, thumb and deleted directories)
find images -type f -not -path '*/archive/*' -not -path '*/thumb/*' -not -path '*/deleted/*' -print | sed 's/^/cp /' | sed 's/\\/\//g' | sed 's/$/ backup-dir/' > dir-copy.sh
- edit the script and include
#!/bin/bash
- execute the script
- tar the target directory
tar cvf tarball.tar backup-dir
- Move the tarball to the target machine hosting the mediawiki of interest
scp tarball.tar user@host:/tmp/
- extract the files
tar xvf tarball.tar
import images
- ensure that you have your mediawiki administrative access setup correctly (see above)
- run the import script
sudo php maintenance/importImages.php backup-directory
export markup text
- export the markup text and revisions, without user accounts, images, edit logs, deleted revisions, etc.
php maintenance/dumpBackup.php --full > dump.xml
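If only the latest revision of each page is wanted, dumpBackup.php also accepts --current, which gives a much smaller dump:
php maintenance/dumpBackup.php --current > dump-current.xml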
import markup text
php maintenance/importDump.php dump.xml
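After the import it is usual to rebuild the derived data with the bundled maintenance scripts (importDump.php itself prints a reminder about this):
# refresh recent changes and the site statistics tables
php maintenance/rebuildrecentchanges.php
php maintenance/initSiteStats.php --update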
import via java HTML2Mediawiki
The fidelity of this technique seems poor; it has been included because I tried it first, and in case I want to start a project to migrate mediawikis.
import via Html2Wiki
(this does a poor job when sourcing pages from the mediawiki; extraneous special pages and markup are imported.)
There is a package, available under Debian, Windows et al., which can be used to import HTML into mediawiki.
Steps:
- install Html2Wiki extension
- edit /etc/php/7.3/apache2/php.ini and
- increase post_max_size = 512M
- increase upload_max_filesize = 512M
- set file_uploads = On
- Add extensions
extension=mysql.so
extension=gd.so
- update /etc/apache2/apache2.conf
- include AllowEncodedSlashes On (restart Apache afterwards - see the note after this list)
- install Nuke extension so you can undo imports
- install tidy
sudo apt install tidy
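After the php.ini and apache2.conf edits above, restart Apache so the new limits and AllowEncodedSlashes take effect:
sudo systemctl restart apache2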
clone mediawiki via SQL and filesystem copy
You can create a fresh database, import the SQL and copy the mediawiki to obtain a clone.
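A sketch of producing the.sql and the filesystem copy on the source host (the database name and paths are the ones used elsewhere in these notes; user@newhost is a placeholder):
# dump the source database; the binary charset avoids mangling text blobs from old wikis
mysqldump -u root -p --default-character-set=binary hpcmediawikidb > the.sql
# copy the mediawiki tree (LocalSettings.php, extensions, images) to the new host
rsync -a /www/mediawiki/ user@newhost:/mnt/repos/www/hpc.arising.com.au/mediawiki/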
import database
Import via SQL and copies of images directory
- connect to mysql mariadb daemon e.g.
mysql -u root [-P 3306 -h 127.0.0.1]
- create a database and install privileges. Take note that you use the correct host
CREATE DATABASE hpcmediawikidb;
CREATE USER 'hpcmediawikiuser'@'localhost' IDENTIFIED BY 'some@p@ss';
GRANT ALL PRIVILEGES ON hpcmediawikidb.* TO 'hpcmediawikiuser'@'127.0.0.1' IDENTIFIED BY 'some@p@ss';
exit;
- now import
mysql -u hpcmediawikiuser -p'some@p@ss' -P 3306 -h 127.0.0.1 -D hpcmediawikidb < the.sql
Confluence
- downloads https://www.atlassian.com/software/confluence/download-archives
- https://www.atlassian.com/try
- https://confluence.atlassian.com/doc/installing-a-confluence-trial-838416249.html#
- uwc uses Confluence xmlrpc https://developer.atlassian.com/server/confluence/confluence-xml-rpc-and-soap-apis/ to transfer markup
Mirroring wikipedia
- https://www.pmwiki.org/wiki/Cookbook/ExportHTML?from=Cookbook.PmWiki2HTML-usingWGET
- https://github.com/pirate/wikipedia-mirror
- https://www.mediawiki.org/wiki/Manual:Grabbers
Saving Video
The VLC media player can be used to save a DVD/video to mp4 format.
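A command-line sketch of the same thing (the disc path, codec choices and output name are illustrative):
# transcode the DVD to h264/mp4a and write it to saved.mp4, then exit
cvlc dvd:///dev/dvd --sout '#transcode{vcodec=h264,acodec=mp4a,ab=192}:standard{access=file,mux=mp4,dst=saved.mp4}' vlc://quit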
TO DO
- find the code that performs the Confluence upload and see how it works
- ensure it can be run independently of conversions.
references
- universal wiki converter (it's written in Java and does work for Confluence 5.5.7 - despite the documentation)
- https://blog.valiantys.com/en/confluence-en/universal-wiki-converter-transfer-wiki-content-confluence/
- https://bitbucket.org/appfusions/universal-wiki-converter/src/master/
- https://migrations.atlassian.net/wiki
- documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/overview?mode=global
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015876/UWC+Current+Documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015861/UWC+Command+Line+Interface
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015891/UWC+Developer+Documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015849/UWC+Mediawiki+Notes

