Idiot-Proof Methods of Datahoarding
A lot of datahoarding resources assume that you're intimately familiar with command prompts, webcrawlers, scripts, and more. But what if... you're stupid? What if you're extremely unfamiliar with those sorts of things, looking to have your hand held through the entire process? What if your name is Azure and you have an account called bedes on Dreamwidth.org? (Wait, that's getting too specific.)
Well, these resources are for you! Idiot-proof resources I've gathered for archiving stuff online! As tested and approved by an actual idiot!
(Quick disclaimer: I call myself "stupid" and "an idiot" lightheartedly, and as a morally neutral descriptor. Being stupid about certain things isn't bad! Which is why I'm making this post!)
Cobalt.tools: this is a great all-in-one resource for saving videos, audio, photos and gifs, with a large list of websites it can rip from. This list includes, but is not limited to, Facebook, Instagram, YouTube, Tumblr, Dailymotion, Pinterest, Reddit, and Twitch. It's open-source, easy to use, privacy-focused, ad-free, fast, and, frankly, It Just Works. No installation is required to use this tool.
Spotdownloader: a tool for ripping from Spotify with the highest possible quality! Very easy to use, and supports downloading a single song, a playlist, or an album, with all the important metadata about the song(s) intact. (If downloading multiple songs, it puts them all in a convenient ZIP file.) Ad-free, and even has a userscript alternative. No installation is needed.
DiscordChatExporter: used to export any Discord message history to a file! It can export to HTML (dark/light), TXT, CSV and JSON, and supports Discord's form of markdown, embeds, attachments, and emojis. You have to install it via GitHub, and it has an excellent optional user interface that makes the process easy to follow!
Archive.org: known as the quintessential internet archive. I find it difficult to navigate what's already there sometimes, but, in terms of uploading your stuff, it's pretty easy! You do need an account to archive files, but archive.org is definitely the most useful for archiving webpages, which can be done without an account. There is also an Internet Archive extension, which you can use to save webpages quickly and easily, without needing to leave the page or open a new tab. You can also easily archive webpages via GhostArchive and Archive.Today, since backups are always necessary! No installations or downloads are required for any of these (except the Internet Archive extension, which requires you to add the extension).
Imgbrd-Grabber: a customizable tool for bulk-downloading images from imageboards, including (but not limited to) danbooru, safebooru, ArtStation, DeviantArt, Newgrounds and Pixiv, as well as many boorus specializing in pornographic content. This one does need to be installed, and the main reason it counts as idiot-proof is the extremely thorough instructions provided for the installation and usage process.
Hydrus Network: you're gonna need a way to sort through all those new images you just downloaded, huh? Hydrus Network allows you to do just that -- making a locally-hosted booru of all of your art, with tags! It also supports bulk-downloading, like Imgbrd-Grabber, and shares a lot of the same supported sources, but it has two notable unique features: it can grab from Tumblr, and it can 'subscribe' to any gallery, re-running the search every few days to keep up with new results. Also like Imgbrd-Grabber, it needs to be installed, and it's mainly here thanks to its very thorough instructions, which walk the reader through everything.
AO3 Downloader: a life-saver for any person who has thought, "God, I wish I could download all of my bookmarks, but that would take sooo long to do individually." Another GitHub download which is saved by its thorough instructions!
tumblr-utils: a fantastic method for backing up your Tumblr account. It's quite a pain to download and set up, but (say it with me, everyone!) it's saved by a Google Doc of extremely thorough instructions (skip to the "Tumblr-Utils" section... or read all of it! it's a great doc that goes over all sorts of different options for backing up your blog -- this is just the one I prefer). Once you get through setting it all up the first time, though, it works like a dream, with very simple commands, which are explained in the doc in layman's terms. These commands can save video and audio locally, fall back to the Internet Archive, save your likes, make an index of tags, and support ✨ incremental backups ✨! (Meaning that it continues from the last backup, instead of downloading the entire blog from scratch each time.)
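For the curious, here is roughly what "incremental backup" means under the hood. This is a minimal Python sketch of the general idea only, not how tumblr-utils actually does it: the function name `incremental_backup`, the `saved_ids.json` state file, and the post shape (a dict with "id" and "body") are all made up for illustration.

```python
import json
from pathlib import Path


def incremental_backup(posts, backup_dir):
    """Save only the posts that earlier runs have not saved yet.

    `posts` is a list of dicts with "id" and "body" keys (a hypothetical
    shape for this example). Returns the IDs saved on this run.
    """
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    state_file = backup_dir / "saved_ids.json"  # hypothetical state file

    # Load the IDs saved by earlier runs; a first run starts empty.
    saved = set(json.loads(state_file.read_text())) if state_file.exists() else set()

    newly_saved = []
    for post in posts:
        if post["id"] in saved:
            continue  # already backed up, so skip it
        (backup_dir / f"{post['id']}.txt").write_text(post["body"], encoding="utf-8")
        saved.add(post["id"])
        newly_saved.append(post["id"])

    # Remember what we have so the next run can pick up from here.
    state_file.write_text(json.dumps(sorted(saved)))
    return newly_saved
```

Running it twice on the same folder only writes the posts that appeared since the first run, which is why later backups finish so much faster than the first one.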
If you know of any data-hoarding / archival resources that weren't mentioned here, and you think even a total Python-illiterate doofus could get working, link them in the comments below! (Also, please include if it involves or requires any downloads, just because I think that's useful info.)
no subject
I will post a word of caution: any tool or website that requires access to your credentials is best used with an alternative account, and of course you should follow good password management practices (i.e. don't use the same password in multiple places).
I have specialized in a few different kinds of archiving:
EBooks
EBooks are best stored as an .epub; it's a compact file type and is supported natively by most e-readers. This is one of the options for downloading fics on AO3 as well. I recommend Calibre for organizing all of those .epubs. There are several plugins available for Calibre that will also strip DRM from books purchased from Amazon or Kobo; as this veers a bit into piracy, I won't provide any links here, but DuckDuckGo can point you in the right direction if you're so inclined. Amazon DRM is harder to strip, but Kobo is easy.
Podcasts
My ride or die tool here is podcast-dl. It works with any RSS feed for a podcast, so if you have access to one from Patreon or a free feed, you can paste it in. Pretty much any podcast on an app or Spotify will have an RSS feed *somewhere*, so hunt around for one. This is a command line tool, but it has great documentation. It even has options for autofilling with the correct metadata, so you don't have to do anything after the fact!
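To show why an RSS feed is all a downloader needs: each episode in a podcast feed normally carries an `<enclosure>` tag whose `url` points at the audio file. The sketch below is a rough Python illustration using only the standard library; it is not how podcast-dl is implemented, and the function name `episode_audio_urls` and the sample feed (with its example.com URLs) are invented for the example.

```python
import xml.etree.ElementTree as ET


def episode_audio_urls(rss_text):
    """Return (title, audio_url) pairs for every episode in an RSS feed."""
    root = ET.fromstring(rss_text)
    episodes = []
    for item in root.iter("item"):  # one <item> per episode
        title = item.findtext("title", default="(untitled)")
        enclosure = item.find("enclosure")  # holds the audio file link
        if enclosure is not None and enclosure.get("url"):
            episodes.append((title, enclosure.get("url")))
    return episodes


# A tiny made-up feed, just to show the shape of the data.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
  </channel>
</rss>"""

for title, url in episode_audio_urls(SAMPLE_FEED):
    print(title, url)
```

A real tool then downloads each of those URLs and fills in the metadata; this only covers the "find the audio links" step.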
Audiobooks
Amazon will pull audiobooks from your library even if you've purchased them; this has happened with copies of 1984 and histories of the Taliban I have in my library (what can I say, I love collecting history that people want to bury). To strip DRM from these, I recommend Libation - this is also a management tool for updating tags and sorting everything on your machine.
YouTube
YouTube downloaders are a dime a dozen, and it's a bit of a game of cat and mouse with Google to get these to work. Currently, I think this is the most up-to-date one? youtube-dl - most YouTube downloaders are command line and require some Python/Docker knowledge, unfortunately.
Specific Topics
Maybe you have a special interest you want to preserve data for that doesn't fit into one of the above categories. For example, maybe you're interested in climate data gathered by universities and/or governments. Anything more specialized will most likely require scripting knowledge and familiarity with how APIs work, but the great news is there are most likely people who also want to archive that sort of data! Poke around forums, Reddit, and GitHub to see if people have made tools for whatever it is you're looking to use.
Data Management
You've got all this data, now where are you going to store it? You have a few options, depending on the amount of data you have, what it is, and your comfort level with the possibility of others having access to it.
* The Internet Archive is currently dealing with lawsuits about copyright infringement, and has suffered from DDoS attacks. Depending on what it is you want to archive, especially if you want consistent access to it, this may not be the best place for it. Totally throw stuff on here for webpages though!
* General cloud storage (ex. Google Drive, OneDrive) is good for ease of access and quickly transferring across devices. It will be tied to your account, and there's a pretty good chance it will be scanned for AI training/data gathering purposes. I wouldn't store anything sensitive or containing PII. But if you want to move podcasts/fanfics/fanart quickly over, this is a great way to do it. Remember: There Is No Cloud, it's just someone else's computer.
* The next question is HDD or SSD (hard disk drive or solid state drive). HDDs are cheaper, but SSDs are much faster and are getting cheaper by the minute. If you go the hardware route, I would get on it sooner rather than later; due to Political Reasons, especially in the USA, I fully expect computer hardware prices to get much more expensive in the near future. If you do this, I'd have a sorting system per drive and label them as such, otherwise you'll spend so much time wondering if X image was on Y drive and plugging/unplugging stuff to find it. Not that I've... done.. that.... before.........
* You could, of course, set up a personal server. This is beyond me, though, and requires a pretty hefty financial investment. If you do this, get Unraid set up and learn how to install dockerized containers of any applications you like, and go the full self-hosted route. A word of advice here is to avoid the subreddits specializing in self-hosted stuff, as a lot of people tend to get pretty weird/cagey about this topic or assume you'll have oodles of tech knowledge already.
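For the "which drive was that on?" problem, one option is to keep a searchable index of every drive's contents somewhere that is always plugged in. Below is a minimal Python sketch of that idea using only the standard library; the function names `index_drive` and `find_file`, the drive label, and the CSV layout are all assumptions made for illustration, not an existing tool.

```python
import csv
import os


def index_drive(drive_root, drive_label, index_csv):
    """Append one CSV row per file under drive_root: label, relative path, size.

    Keep index_csv somewhere other than the drive being indexed.
    Returns the number of files recorded.
    """
    count = 0
    with open(index_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for folder, _subfolders, filenames in os.walk(drive_root):
            for name in filenames:
                full_path = os.path.join(folder, name)
                rel_path = os.path.relpath(full_path, drive_root)
                writer.writerow([drive_label, rel_path, os.path.getsize(full_path)])
                count += 1
    return count


def find_file(index_csv, search_text):
    """Return (drive label, path) for indexed files whose path contains search_text."""
    with open(index_csv, newline="", encoding="utf-8") as f:
        return [(label, path) for label, path, _size in csv.reader(f)
                if search_text.lower() in path.lower()]
```

You would run `index_drive` once per drive while it is plugged in (with a label matching the sticker on the drive), then use `find_file` later to look things up without plugging anything in. Re-indexing the same drive appends duplicate rows, so start a fresh index file when a drive's contents change a lot.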