Scraping websites
scraping stuff
See Web crawler
facebook videos
Embedded Facebook videos may be downloaded by pasting the video link into a third-party downloader such as:
https://www.fbdown.net/download.php
httrack
This is a neat tool: it can scrape a website and also follow and download off-site hyperlinked MIME data.
- Run httrack or webhttrack to scrape a website
- use same address or same domain
- use level=3
- supply:
http://localhost:81/mediawiki/index.php/HPC_Report
- on completion, drill down into the localhost_81 directory and delete (see the command sketch after this list):
- all ri: files
- all Special: files
- all Arising: files
- then search for unwanted .pdf files and delete them
find localhost_81 -type f -name '*.pdf' -print
- then test all the Navigation Links
- test the Index and contained links to ensure there is no Arising Proprietary or Confidential Information.
- when archiving do not include the hts-cache
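A rough command-line version of the cleanup above (a sketch only - the directory name depends on the scrape, e.g. localhost_81, and the exact file names depend on how httrack encodes the ':' in page titles):
# remove the unwanted namespace pages pulled down with the mirror
find localhost_81 -type f \( -name 'ri:*' -o -name 'Special:*' -o -name 'Arising:*' \) -print -delete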
TO DO
- make a new category for Arising stuff and move pages to that category.
- make a separate LAMP containing the report and data
You could try these commands:
httrack -O hpc-report2 -*p3 -B -a --continue https://hpc.arising.com.au/mediawiki/index.php/HPC_Report -*?title=* -*/images/thumb/* -*/ri\:* -*/Special:*
httrack -O hpc-report3 -*p3 -B -a https://hpc.arising.com.au/mediawiki/index.php/HPC_Report -*?title=* -*/images/thumb/* -*/ri\:* -*/Special:* -*/File:* -r4
UWC
UWC is a Universal Wiki Converter that can output Confluence markup to assist with importing from various wiki and plain-file formats into a Confluence wiki. It is developed in Java, and I have trialled it to convert this mediawiki (currently a Windows 10 deployment).
- migration notes https://migrations.atlassian.net/wiki/spaces/UWC/overview?mode=global
- documentation https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015876/UWC+Current+Documentation
- git repository https://bitbucket.org/appfusions/universal-wiki-converter/src/master/
The development notes are much better than the regular documentation. One suggestion is to turn off the automatic Confluence upload.
UWC uses a MySQL driver to contact the default mediawiki database. The MySQL client may be installed on Debian (for testing) via:
apt-get install mysql-client
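A quick connectivity test before running UWC (a sketch - wikihost, wikiuser and wikidb are placeholders for your deployment, and the page table may carry a $wgDBprefix):
# confirm the client can reach the mediawiki database UWC will read from
mysql -h wikihost -P 3306 -u wikiuser -p -D wikidb -e 'SELECT COUNT(*) AS pages FROM page;'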
Working Notes
UWC is executed through the following stages:
- configuring
- export
- conversion (and Confluence upload)
Configuring is performed by editing the properties files to set up the export settings for the source wiki you wish to export.
Exporting is then performed from the command line: run_wcs.sh -e <path-to-export.properties>. This brings up a UI anyway (which is a little strange), but you can also execute the exporter directly out of uwc.jar.
Exporting extracts the wiki markup from the wiki database into variously named pages.
The page naming rules are only scantily described in the documentation; the exported pages are partitioned between directories named either for well-known namespaces:
- Pages
- Users
or for the numeric IDs of namespaces that have been created locally, e.g.
- 100
- 102
- 103
(For the Arising HPC report mediawiki only some Pages will be of project relevance, since namespace 100 holds How Tos, and 102 & 103 are ri: namespaces.)
The mediawiki images are obtained from the static web-server content via the file system (this is named attachments in the UI). This path must be specified for the conversion process, and UWC must be run where the static files are available on the file system (they could be mounted via samba, as sketched below), because uwc is unable to obtain files directly across the network.
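A sketch of mounting the static image files over samba so uwc can see them locally (the share name, mount point and username are hypothetical):
sudo mkdir -p /mnt/wiki-images
sudo mount -t cifs //wikihost/mediawiki-images /mnt/wiki-images -o username=youruser,ro
# then point the UWC attachments setting at /mnt/wiki-images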
Next, the pages or directory to be converted must be supplied in the pages dialog via the Add button; the Convert button then becomes available and can be pressed to start the conversion.
If you go to the extended properties you may disable the upload to Confluence while you are testing.
You can also set the property engine-saves-to-disk=true to have UWC create an output directory (from where it is run) and place all the Confluence mark-up files into that directory under the same names as the exported page names.
If upload is enabled, then uwc uses the Confluence endpoint http://hostname/rpc/xmlrpc to upload the pages.
hints
- The exported files are in mediawiki format - you can also obtain these by editing each page and copying the markup into an appropriately named .txt file, then directing uwc at those pages to perform the conversion and save/upload as required.
- Confluence markup output files can also be written to disk by setting the property as above (and can be manually uploaded via a Confluence Edit instead of using xmlrpc to upload).
- Map your attachments to the file-system that is running uwc so images etc are available.
xmldump
You can also xmldump from the mediawiki into XML format and then run php [1] over it to convert it into a collection of pages that uwc can understand.
Moving a mediawiki
These maintenance scripts have values derived from
/www/mediawiki/includes/DefaultSettings.php
- ensure that $wgDBserver is set to the correct host
- ensure that $wgDBport is set to the correct value (e.g. 3306 for the R: wiki and 3406 for the H: wiki);
The Administrator credentials are sourced from
/www/mediawiki/AdminSettings.php
- set the $wgDBadminuser = 'root' ; // for old wikis
- set the $wgDBadminpassword = ; // for old wikis
The new /mnt/repos/www/hpc.arising.com.au/mediawiki/includes/ has new credentials:
- $wgDBname = "hpcmediawikidb"
- $wgDBuser = "hpcuser";
- $wgDBpassword = "m3d1@p@ss";
Then /mnt/repos/www/hpc.arising.com.au/mediawiki/includes/DefaultSettings.php
- $wgDBport = 5432;
- $wgDBname = 'my_wiki';
- $wgDBuser = 'wikiuser';
- $wgDBpassword = ;
- $wgDBadminuser = null;
- $wgDBadminpassword = null; /* use defaults */
There are massive schema changes between mediawiki versions; XML import and export only work if you are mirroring an existing version.
- manually https://www.mediawiki.org/wiki/Manual:Moving_a_wiki
- export
- import
- bots (that may be converted)
Scripting Techniques
These scripted procedures have been the most useful for migrating from a really old mediawiki to the latest version.
export images
- cd to the www/mediawiki directory
- ensure you have setup the database credentials for access by the maintenance scripts
- generate a list of images to upload
php.exe maintenance/dumpUploads.php | sed 's/^/cp /' | sed 's/\\/\//g' | sed 's/$/ backup-dir/' > db-copy.sh
- alternatively build the copy script from the images directory itself (if you don't trust the database; omit the archive, thumb and deleted directories)
find images -type f -not -path '*/archive/*' -not -path '*/thumb/*' -not -path '*/deleted/*' -print | sed 's/^/cp /' | sed 's/\\/\//g' | sed 's/$/ backup-dir/' > dir-copy.sh
- edit the script and include
#!/bin/bash
- execute the script
- tar the target directory
tar cvf tarball.tar backup-dir
- Move the tarball to the target machine hosting the mediawiki of interest
scp tarball.tar user@host:/tmp/
- extract the files
tar xvf tarball.tar
import images
- ensure that you have your mediawiki administrative access setup correctly (see above)
- run the import script
sudo php maintenance/importImages.php backup-directory
export markup text
- export the markup text and revisions, without user accounts, images, edit logs, deleted revisions, etc.
php maintenance/dumpBackup.php --full > dump.xml
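If only the latest revision of each page is wanted, dumpBackup.php also accepts --current, which gives a much smaller dump:
php maintenance/dumpBackup.php --current > dump-current.xml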
import markup text
php maintenance/importDump.php dump.xml
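After the import it is usual to rebuild the derived data with the bundled maintenance scripts (importDump.php itself prints a reminder about this):
# refresh recent changes and the site statistics tables
php maintenance/rebuildrecentchanges.php
php maintenance/initSiteStats.php --update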
import via java HTML2Mediawiki
The fidelity of this technique seems poor; it has been included because I tried it first, and in case I want to start a project to migrate mediawikis.
import via Html2Wiki
(this does a poor job when sourcing pages from the mediawiki; extraneous special pages and markup are imported.)
There is a package, available under Debian, Windows et al., which can be used to import HTML into mediawiki.
Steps:
- install Html2Wiki extension
- edit /etc/php/7.3/apache2/php.ini and
- increase post_max_size = 512M
- increase upload_max_filesize = 512M
- set file_uploads = On
- Add extensions
extension=mysql.so
extension=gd.so
- update /etc/apache2/apache2.conf
- include AllowEncodedSlashes On (restart Apache afterwards - see the note after this list)
- install Nuke extension so you can undo imports
- install tidy
sudo apt install tidy
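After the php.ini and apache2.conf edits above, restart Apache so the new limits and AllowEncodedSlashes take effect:
sudo systemctl restart apache2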
clone mediawiki via SQL and filesystem copy
You can create a fresh database, import the SQL and copy the mediawiki to obtain a clone.
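A sketch of producing the.sql and the filesystem copy on the source host (the database name and paths are the ones used elsewhere in these notes; user@newhost is a placeholder):
# dump the source database; the binary charset avoids mangling text blobs from old wikis
mysqldump -u root -p --default-character-set=binary hpcmediawikidb > the.sql
# copy the mediawiki tree (LocalSettings.php, extensions, images) to the new host
rsync -a /www/mediawiki/ user@newhost:/mnt/repos/www/hpc.arising.com.au/mediawiki/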
import database
Import via SQL and copies of images directory
- connect to mysql mariadb daemon e.g.
mysql -u root [-P 3306 -h 127.0.0.1]
- create a database and install privileges. Take note that you use the correct host
CREATE DATABASE hpcmediawikidb;
CREATE USER 'hpcmediawikiuser'@'localhost' IDENTIFIED BY 'some@p@ss';
GRANT ALL PRIVILEGES ON hpcmediawikidb.* TO 'hpcmediawikiuser'@'127.0.0.1' IDENTIFIED BY 'some@p@ss';
exit;
- now import
mysql -u hpcmediawikiuser -p'some@p@ss' -P 3306 -h 127.0.0.1 -D hpcmediawikidb < the.sql
Confluence
- downloads https://www.atlassian.com/software/confluence/download-archives
- https://www.atlassian.com/try
- https://confluence.atlassian.com/doc/installing-a-confluence-trial-838416249.html#
- uwc uses Confluence xmlrpc https://developer.atlassian.com/server/confluence/confluence-xml-rpc-and-soap-apis/ to transfer markup
Mirroring wikipedia
- https://www.pmwiki.org/wiki/Cookbook/ExportHTML?from=Cookbook.PmWiki2HTML-usingWGET
- https://github.com/pirate/wikipedia-mirror
- https://www.mediawiki.org/wiki/Manual:Grabbers
Saving Video
The VLC media player can be used to save a DVD/video to mp4 format.
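A command-line sketch of the same thing (the disc path, codec choices and output name are illustrative):
# transcode the DVD to h264/mp4a and write it to saved.mp4, then exit
cvlc dvd:///dev/dvd --sout '#transcode{vcodec=h264,acodec=mp4a,ab=192}:standard{access=file,mux=mp4,dst=saved.mp4}' vlc://quit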
TO DO
- find the code that performs the Confluence upload and see how it works
- ensure it can be run independently of conversions.
references
- universal wiki converter (it's written in Java and does work for Confluence 5.5.7 - despite the documentation)
- https://blog.valiantys.com/en/confluence-en/universal-wiki-converter-transfer-wiki-content-confluence/
- https://bitbucket.org/appfusions/universal-wiki-converter/src/master/
- https://migrations.atlassian.net/wiki
- documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/overview?mode=global
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015876/UWC+Current+Documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015861/UWC+Command+Line+Interface
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015891/UWC+Developer+Documentation
- https://migrations.atlassian.net/wiki/spaces/UWC/pages/1015849/UWC+Mediawiki+Notes

