tuxbox: hacking my computers: athena-link

Note: This post (and the script it contains) has been updated as of December 14, 2010. (v1.4.0) The script can also be downloaded from my server here.

Also, Posterous has done a lot of work on solving this problem since I wrote my script. You can see their latest solutions here.

Recently, I switched from TwitPic to Posterous as my method of posting phone pictures (and now video) to the Internet. But since I switched, I didn't want to have my data history split in two, so I decided to write a script to download each of my TwitPic images with their associated text and date, and upload them to Posterous with the same information.

Initially, I wanted to make one long post with all of the images, each with its text below it. However, the Posterous API offers no way to refer to a specific image in your body text, so I went with individual posts.

Along the way, I became familiar with yet another Linux command: curl.

I love that Posterous has an API that (once you figure out curl) is pretty easy to use. TwitPic, on the other hand, has absolutely zero support for exporting anything. The fact that they're so non-user-centric and outdated was a driving force in my switching. The only reason I hadn't already switched to img.ly was that img.ly has a bug that prevents images sent from my phone from being posted, since my phone sends them without a file extension. I worked with their tech support for a while, but they didn't fix it. I got a new phone, but it was also a Samsung, and it did the same thing with images. Oh, well. Posterous is better.

Anyway, here is the script:

First run it with just the first two arguments, and it will download all of your TwitPic data, including thumbnail images. Once you're satisfied, supply your Posterous User ID, Password, and Site ID. (If you don't know your Site ID, run the script with your Posterous User ID, Password, and no Site ID, and it will query your Posterous site info as long as your Posterous credentials are valid.)

Note: if you want to run this from Windows, you should install Cygwin (with, at a minimum, curl, wget, and sed) and run it from there.

./twitpic-to-posterous.sh [twitpic-id] [working-dir] [posterous-id] [posterous-password] [posterous-site-id] [skip-number]
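The usage line includes a sixth argument, skip-number, that the instructions above don't cover. Going by the script's own messages, it is the number of images already uploaded, so that a later run can pick up where the daily Posterous API limit (P_API_LIMIT, 50 by default) cut the previous one off. As a rough sketch, the helper below prints the command to run on each day for a given number of images. The name print_daily_commands is made up for illustration, and the upper-case arguments are placeholders for your own values:

```shell
#!/bin/sh
# Sketch only: print one upload command per day of a multi-day migration.
# P_API_LIMIT mirrors the script's default; everything else is a placeholder.
P_API_LIMIT=50

print_daily_commands() {
  total=$1   # total number of TwitPic images to upload
  skip=0     # images already uploaded on earlier days
  day=1
  while [ "$skip" -lt "$total" ]; do
    echo "day $day: ./twitpic-to-posterous.sh TWITPIC_ID WORKING_DIR POSTEROUS_ID POSTEROUS_PW SITE_ID $skip"
    skip=$((skip + P_API_LIMIT))
    day=$((day + 1))
  done
}

# 120 images need three runs, with skip numbers 0, 50, and 100.
print_daily_commands 120
```

The script also prints the next skip number itself when it hits the limit, so this is mainly useful for planning how many days a large archive will take.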

#!/bin/sh

# Copyright 2010 Tim "burndive" of http://burndive.blogspot.com/ and http://tuxbox.blogspot.com/
# This software is licensed under the Creative Commons GNU GPL version 2.0 or later.
# License information: http://creativecommons.org/licenses/GPL/2.0/

# This script was obtained from here:
# http://tuxbox.blogspot.com/2010/03/twitpic-to-posterous-export-script.html

# %M is minutes (%m would repeat the month)
RUN_DATE=`date +%F--%H-%M-%S`
SCRIPT_VERSION_STRING="v1.4.0"

TP_NAME=$1
WORKING_DIR=$2
P_ID=$3
P_PW=$4
P_SITE_ID=$5
UPLOAD_SKIP=$6

# Comma separated list of tags to apply to your posts
P_TAGS="twitpic"
# Whether or not to auto-post from Posterous
P_AUTOPOST=0
# Whether or not the Posterous posts are marked private
P_PRIVATE=0

# This is the default limit of the number of posts that can be uploaded per day
P_API_LIMIT=50

DOWNLOAD_FULL=1
DOWNLOAD_SCALED=0
DOWNLOAD_THUMB=0
PREFIX=twitpic-$TP_NAME
HTML_OUT=$PREFIX-all-$RUN_DATE.html
UPLOAD_OUT=posterous-upload-$P_SITE_ID-$RUN_DATE.xml

if [ -z "$TP_NAME" ]; then
  echo "You must supply a TP_NAME."
  exit
fi
if [ ! -d "$WORKING_DIR" ]; then
  echo "You must supply a WORKING_DIR."
  exit
fi
if [ -z "$UPLOAD_SKIP" ]; then
  UPLOAD_SKIP=0
fi
UPLOAD_SKIP_DIGITS=`echo $UPLOAD_SKIP | sed -e 's/[^0-9]//g'`
if [ "$UPLOAD_SKIP" != "$UPLOAD_SKIP_DIGITS" ]; then
  echo "Invalid UPLOAD_SKIP: $UPLOAD_SKIP"
  exit
fi

cd "$WORKING_DIR" || exit 1

if [ -f "$HTML_OUT" ]; then
  rm -v $HTML_OUT
fi

# If Posterous username and password were supplied, but not site ID, query the server and exit.
P_SITE_INFO_FILE=posterous-$P_SITE_ID.out
if [ ! -z "$P_ID" ] && [ ! -z "$P_PW" ] && [ -z "$P_SITE_ID" ]; then
  echo "Getting Posterous account info..."
  curl -u "$P_ID:$P_PW" "http://posterous.com/api/getsites" -o $P_SITE_INFO_FILE
  SITE_ID_RET=`grep "<id>$P_SITE_ID</id>" $P_SITE_INFO_FILE`
  if [ -z "$SITE_ID_RET" ]; then
    echo "Please supply your Posterous Site ID as the fifth argument."
    echo "Here is the response from the Posterous server.  If you entered correct credentials, you should see your Site ID(s):"
    cat $P_SITE_INFO_FILE | tee -a $UPLOAD_OUT
    exit
  fi
fi

# Confirm that we have a valid Posterous Site ID
if [ ! -z "$P_SITE_ID" ]; then
  echo "Getting Posterous account info..."
  curl -u "$P_ID:$P_PW" "http://posterous.com/api/getsites" -o $P_SITE_INFO_FILE
  SITE_ID_RET=`grep "<id>$P_SITE_ID</id>" $P_SITE_INFO_FILE`
  if [ -z "$SITE_ID_RET" ]; then
    echo "Make sure that you have supplied a valid Posterous Site ID as the fifth parameter.  If you don't know your Site ID, leave it out, and this script will query the server."
    echo "Here is the response from the Posterous server.  If you entered correct credentials, you should see your site ID(s):"
    cat $P_SITE_INFO_FILE | tee -a $UPLOAD_OUT
    exit
  fi
fi

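# Define the log file before the paging loop, which already writes to it
LOG_FILE=$PREFIX-log-$RUN_DATE.txt

MORE=1
PAGE=1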
while [ $MORE -ne 0 ]; do
  echo PAGE: $PAGE
  FILENAME=$PREFIX-page-$PAGE.html
  if [ ! -s $FILENAME ]; then
    # Quote the URL so the shell does not treat "?" as a glob character
    wget "http://twitpic.com/photos/${TP_NAME}?page=$PAGE" -O "$FILENAME"
    if [ ! -s "$FILENAME" ]; then
      echo "ERROR: could not get $FILENAME" | tee -a $LOG_FILE
      sleep 5
    fi
  fi
  if [ -z "`grep "More photos &gt;" $FILENAME`" ]; then
    MORE=0
  else
    PAGE=`expr $PAGE + 1`
  fi
done

ALL_IDS=`cat $PREFIX-page-* | grep -Eo "<a href=\"/[a-zA-Z0-9]+\">" | grep -Eo "/[a-zA-Z0-9]+" | grep -Eo "[a-zA-Z0-9]+" | sort -r | xargs`

# For Testing
#ALL_IDS="1kdjc"

COUNT=0
LOG_FILE=$PREFIX-log-$RUN_DATE.txt

echo $ALL_IDS | tee -a $LOG_FILE

for ID in $ALL_IDS; do
  COUNT=`expr $COUNT + 1`
  echo $ID: $COUNT | tee -a $LOG_FILE

  echo "Processing $ID..."
  FULL_HTML=$PREFIX-$ID-full.html
  if [ ! -s "$FULL_HTML" ]; then
    wget http://twitpic.com/$ID/full -O $FULL_HTML
    if [ ! -s "$FULL_HTML" ]; then
      echo "ERROR: could not get FULL_HTML for $ID" | tee -a $LOG_FILE
      sleep 5
    fi
  fi
  TEXT=`grep "<img src=" $FULL_HTML | tail -n1 | grep -oE "alt=\"[^\"]*\"" | sed \
        -e 's/^alt="//'\
        -e 's/"$//'\
        -e "s/&#039;/'/g"\
        -e 's/&quot;/"/g'\
        `
  if [ "$TEXT" = "" ]; then
    TEXT="Untitled"
  fi
  echo "TEXT: $TEXT" | tee -a $LOG_FILE
  # Recognize hashtags and username references in the tweet
  TEXT_RICH=`echo "$TEXT" | sed \
        -e 's/\B\@\([0-9A-Za-z_]\+\)/\@<a href="http:\/\/twitter.com\/\1">\1<\/a>/g' \
        -e 's/\#\([0-9A-Za-z_-]*[A-Za-z_-]\+[0-9A-Za-z_-]*\)/<a href="http:\/\/twitter.com\/search\?q\=%23\1">\#\1<\/a>/g' \
        `
  echo "TEXT_RICH: $TEXT_RICH" | tee -a $LOG_FILE

  # Convert hashtags into post tags
  P_TAGS_POST=$P_TAGS`echo "$TEXT" | sed \
        -e 's/\#\([^A-Za-z_-]\)*\B//g' \
        -e 's/^[^\#]*$//g' \
        -e 's/[^\#]*\(\#\([0-9A-Za-z_-]*[A-Za-z_-]\+[0-9A-Za-z_-]*\)\)[^\#]*\(\#[0-9]*\B\)*/,\2/g' \
        `
  # Uncomment if you don't want hashtags converted into post tags
  #P_TAGS_POST=$P_TAGS

  # Add custom tags from a file (optional).  The file is formatted like this:
  # ,tag1,tag2,tag3
  TAGS_FILE=$PREFIX-$ID-tags-extra.txt
  if [ -s "$TAGS_FILE" ]; then
    P_TAGS_POST=$P_TAGS_POST`cat $TAGS_FILE`
  fi
  echo "P_TAGS_POST: $P_TAGS_POST" | tee -a $LOG_FILE

  TEXT_FILE=$PREFIX-$ID-text.txt
  if [ ! -s $TEXT_FILE ]; then
    echo "$TEXT" > $TEXT_FILE
  fi
  FULL_URL=`grep "<img src=" $FULL_HTML | grep -Eo "src=\"[^\"]*\"" | grep -Eo "http://[^\"]*"`
  echo "FULL_URL: $FULL_URL" | tee -a $LOG_FILE

  SCALED_HTML=$PREFIX-$ID-scaled.html
  if [ ! -s "$SCALED_HTML" ]; then
    wget http://twitpic.com/$ID -O $SCALED_HTML
    if [ ! -s "$SCALED_HTML" ]; then
      echo "ERROR: could not get SCALED_HTML for $ID" | tee -a $LOG_FILE
      sleep 5
    fi
  fi
  SCALED_URL=`grep "id=\"photo-display\"" $SCALED_HTML | grep -Eo "http://[^\"]*" | head -n1`
  echo "SCALED_URL: $SCALED_URL" | tee -a $LOG_FILE
  POST_DATE=`grep -Eo "Posted on [a-zA-Z0-9 ,]*" $SCALED_HTML | sed -e 's/Posted on //'`
  echo "POST_DATE: $POST_DATE" | tee -a $LOG_FILE

  THUMB_URL=`cat $PREFIX-page-* | grep -E "<a href=\"/$ID\">" | grep -Eo "src=\"[^\"]*\"" | head -n1 | sed -e 's/src=\"//' -e 's/\"$//'`
  echo "THUMB_URL: $THUMB_URL" | tee -a $LOG_FILE

  EXT=`echo "$FULL_URL" | grep -Eo "[a-zA-Z0-9]+\.[a-zA-Z0-9]+\?" | head -n1 | grep -Eo "\.[a-zA-Z0-9]+"`
  if [ -z "$EXT" ]; then
    EXT=`echo "$FULL_URL" | grep -Eo "\.[a-zA-Z0-9]+$"`
  fi
  echo "EXT: $EXT"
  if [ "$DOWNLOAD_FULL" -eq 1 ]; then
    FULL_FILE="$PREFIX-$ID-full$EXT"
    if [ ! -s $FULL_FILE ]; then
      wget "$FULL_URL" -O $FULL_FILE
      if [ ! -s "$FULL_FILE" ]; then
        echo "ERROR: could not get FULL_URL for $ID: $FULL_URL" | tee -a $LOG_FILE
        sleep 5
      fi
    fi
  fi
  if [ "$DOWNLOAD_SCALED" -eq 1 ]; then
    SCALED_FILE=$PREFIX-$ID-scaled$EXT
    if [ ! -s $SCALED_FILE ]; then
      wget "$SCALED_URL" -O $SCALED_FILE
      if [ ! -s "$SCALED_FILE" ]; then
        echo "ERROR: could not get SCALED_URL for $ID: $SCALED_URL" | tee -a $LOG_FILE
        sleep 5
      fi
    fi
  fi
  if [ "$DOWNLOAD_THUMB" -eq 1 ]; then
    THUMB_FILE=$PREFIX-$ID-thumb$EXT
    if [ ! -s $THUMB_FILE ]; then
      wget "$THUMB_URL" -O $THUMB_FILE
      if [ ! -s "$THUMB_FILE" ]; then
        echo "ERROR: could not get THUMB_URL for $ID: $THUMB_URL" | tee -a $LOG_FILE
        sleep 5
      fi
    fi
  fi

  BODY_TEXT="$TEXT_RICH <p>[<a href=http://twitpic.com/$ID>Twitpic</a>]</p>"

  # Format the post date correctly
  YEAR=`echo "$POST_DATE" | sed -e 's/[A-Z][a-z]* [0-9]*, //'`
  DAY=`echo "$POST_DATE" | sed -e 's/[A-Z][a-z]* //' -e 's/, [0-9]*//'`
  MONTH=`echo "$POST_DATE" | sed -e 's/ [0-9]*, [0-9]*//' | sed \
    -e 's/January/01/' \
    -e 's/February/02/' \
    -e 's/March/03/' \
    -e 's/April/04/' \
    -e 's/May/05/' \
    -e 's/June/06/' \
    -e 's/July/07/' \
    -e 's/August/08/' \
    -e 's/September/09/' \
    -e 's/October/10/' \
    -e 's/November/11/' \
    -e 's/December/12/' \
    `
  # Adjust the time to local midnight when west of GMT
  HOURS_LOC=`date | grep -Eo " [0-9]{2}:" | sed -e 's/://' -e 's/ //'`
  HOURS_UTC=`date -u | grep -Eo " [0-9]{2}:" | sed -e 's/://' -e 's/ //'`
  HOURS_OFF=`expr $HOURS_UTC - $HOURS_LOC + 7`
  echo "HOURS_LOC: $HOURS_LOC"
  echo "HOURS_UTC: $HOURS_UTC"
  echo "HOURS_OFF: $HOURS_OFF"
  if [ "$HOURS_OFF" -lt 0 ]; then
    # We're east of GMT, do not adjust
    HOURS_OFF=0
  fi
  if [ "$HOURS_OFF" -lt 10 ]; then
    HOURS_OFF=0$HOURS_OFF
  fi
  if [ "$DAY" != "" ] && [ "$DAY" -lt 10 ]; then
    DAY=0$DAY
  fi
  DATE_FORMATTED="$YEAR-$MONTH-$DAY-$HOURS_OFF:00"
  echo "DATE_FORMATTED: $DATE_FORMATTED" | tee -a $LOG_FILE

  echo "<p><img src='$FULL_FILE' alt='$TEXT' title='$TEXT' /></p>" >> $HTML_OUT
  echo "$BODY_TEXT" >> $HTML_OUT
  echo "  Post date: $DATE_FORMATTED; Count: $COUNT" >> $HTML_OUT

  # Upload this Twitpic data to Posterous
  if [ ! -z "$P_SITE_ID" ]; then

    # First make sure we're under the API upload limit
    if [ "$COUNT" -le "$UPLOAD_SKIP" ]; then
      echo Skipping upload...
      continue
    fi
    if [ "$COUNT" -gt "`expr $UPLOAD_SKIP + $P_API_LIMIT`" ]; then
      echo "Skipping upload due to daily Posterous API upload limit of $P_API_LIMIT."
      echo "To resume uploading where we left off today, supply UPLOAD_SKIP parameter of `expr $UPLOAD_SKIP + $P_API_LIMIT`."
      continue
    fi

    P_OUT_FILE="posterous-$P_SITE_ID-$ID.out"
    if [ -s "$P_OUT_FILE" ]; then
      rm "$P_OUT_FILE"
    fi
    echo "Uploading Twitpic image..."
    curl -u "$P_ID:$P_PW" "http://posterous.com/api/newpost" -o "$P_OUT_FILE" \
      -F "site_id=$P_SITE_ID" \
      -F "title=$TEXT" \
      -F "autopost=$P_AUTOPOST" \
      -F "private=$P_PRIVATE" \
      -F "date=$DATE_FORMATTED" \
      -F "tags=$P_TAGS_POST" \
      -F "source=burndive's Twitpic-to-Posterous script $SCRIPT_VERSION_STRING" \
      -F "sourceLink=http://tuxbox.blogspot.com/2010/03/twitpic-to-posterous-export-script.html" \
      -F "body=$BODY_TEXT" \
      -F "media=@$FULL_FILE"
    cat $P_OUT_FILE  | tee -a $UPLOAD_OUT
  fi
done
echo Done.

This software is licensed under the CC-GNU GPL version 2.0 or later.
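The date handling in the script turns TwitPic's "Posted on March 17, 2010" text into a year-month-day string using a chain of sed substitutions. The same conversion can be sketched as a standalone function, which is easier to try out on its own. The name to_iso_date is hypothetical and not part of the script, and this covers only the date portion, not the hour offset the script appends:

```shell
#!/bin/sh
# Sketch only: convert "Month D, YYYY" into "YYYY-MM-DD".
# to_iso_date is an illustrative helper, not part of the original script.
to_iso_date() {
  # Drop the comma, then split the remaining text into three fields.
  set -- $(echo "$1" | tr -d ',')
  month_name=$1
  day=${2#0}   # strip one leading zero so printf does not read it as octal
  year=$3
  case "$month_name" in
    January)   month=01 ;;
    February)  month=02 ;;
    March)     month=03 ;;
    April)     month=04 ;;
    May)       month=05 ;;
    June)      month=06 ;;
    July)      month=07 ;;
    August)    month=08 ;;
    September) month=09 ;;
    October)   month=10 ;;
    November)  month=11 ;;
    December)  month=12 ;;
    *) echo "unrecognized month: $month_name" >&2; return 1 ;;
  esac
  printf '%s-%s-%02d\n' "$year" "$month" "$day"
}

to_iso_date "March 17, 2010"
to_iso_date "June 5, 2009"
```

Unlike the sed chain, this version fails loudly on input it does not recognize, which helps if TwitPic's page layout changes and the "Posted on" text is no longer found.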

PS: If you use my code, I appreciate comments to let me know, and any feedback you may have, especially if it's not working right for you, but also just to say thanks.

For convenience, you can download this script from my server.
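The hashtag-to-tag conversion is the densest part of the script, so here is a simplified sketch of the same idea that can be run on its own. It is not the script's sed expression: it uses grep -o to pull out the hashtags, and it follows the script's rule that a purely numeric hashtag such as #2 is not treated as a tag. The name extract_tags and the sample caption are made up for illustration:

```shell
#!/bin/sh
# Sketch only: append a caption's hashtags to a base tag list.
# extract_tags is an illustrative helper, not part of the original script.
extract_tags() {
  base=$1   # tags applied to every post, e.g. "twitpic"
  text=$2   # the caption text
  tags=$(printf '%s\n' "$text" \
    | grep -oE '#[0-9A-Za-z_-]+' \
    | grep -E '[A-Za-z_-]' \
    | sed -e 's/^#//' \
    | tr '\n' ',' \
    | sed -e 's/,$//')
  if [ -n "$tags" ]; then
    echo "$base,$tags"
  else
    echo "$base"
  fi
}

extract_tags "twitpic" "Lunch at the #fair with #kids, take #2"
```

For the sample caption this yields the tag list twitpic,fair,kids, which is the comma-separated form the script passes to Posterous in the tags field.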

So what happens when your DNS name changes more often than your IP address? The whole point of DNS is that your IP address can change without your having to update your links (also, there's something about being human-friendly, but who needs that?). Sadly, in my case, athena has had a string of DNS subdomains that haven't lasted quite as long as I had hoped.

My problem is that I link from my blog to pages (mostly containing pictures, audio, and PDF documents) that I would like to be accessible on a permanent basis. Recently, I lost athena.sexypenguins.com, and so I've moved to athena.goddns.net. At some point, I'm just going to have to purchase my own domain. Thus far I've resisted out of (mostly) inertia, but now that I've registered my first domain and found that it's not so bad to be parked, I'm more inclined to plunk down the money. The question is, what should I choose?

In the meantime, I've decided to create an "athena-link" label on my blogs that will at least keep track of which posts have links to athena, so that I can update them whenever I have to change the DNS name, which it appears will happen at least one more time. For now (and for the first time in a while), all of my links are up to date.

Another problem with my current setup is that FreeDNS's policies dictate that I need to have Google's ability to access my server manually enabled every time I switch to a new domain. This wouldn't be that big of a deal, except that I've started hosting feeds on my site, and Google Reader uses Google's DNS. Also, Google just bought out FeedBurner, and one of my feeds uses that as a proxy.
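The "athena-link" label only identifies which posts need attention; the update itself is a hostname substitution. As a minimal sketch, assuming the post HTML is available as text on standard input (for example, from a local copy of the post), something like the following would swap the old host for the new one. The function name rewrite_host and the sample path are made up for illustration; the two hostnames are the ones mentioned above:

```shell
#!/bin/sh
# Sketch only: rewrite links from the old athena hostname to the new one.
OLD_HOST="athena.sexypenguins.com"
NEW_HOST="athena.goddns.net"

rewrite_host() {
  # Escape the dots so they match literally instead of "any character".
  old_re=$(printf '%s' "$OLD_HOST" | sed -e 's/\./\\./g')
  # Only touch the hostname when it follows "//", i.e. inside a URL.
  sed -e "s|//$old_re|//$NEW_HOST|g"
}

# The path below is a hypothetical example.
echo '<a href="http://athena.sexypenguins.com/pics/cat.jpg">cat</a>' | rewrite_host
```

Keeping the old and new names in variables means the same snippet can be reused the next time the subdomain changes.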

Wednesday, December 08, 2010

Twitpic-to-Posterous Script: Another Update

Wednesday, March 17, 2010

TwitPic to Posterous Export Script

Friday, June 05, 2009

Qwest Woes

Thursday, May 08, 2008

Baidu MP3

Wednesday, April 16, 2008

More DNS Woes