Note: This post (and the script it contains) has been updated as of December 14, 2010. (v1.4.0) The script can also be downloaded from my server here.
Also, Posterous has done a lot of work on solving this problem since I wrote my script. You can see their latest solutions here.
Recently, I switched from TwitPic to Posterous as my method of posting phone pictures (and now video) to the Internet. I didn't want my data history split in two, though, so I wrote a script to download each of my TwitPic images with its associated text and date, and upload them to Posterous with the same information.
Initially, I wanted to make one long post with all of the images and their text below each one. However, the Posterous API provides no way to refer to a specific image from the body text, so I went with individual posts instead.
Along the way, I became familiar with yet another Linux command: curl.
I love that Posterous has an API that (once you figure out curl) is pretty easy to use. TwitPic, on the other hand, has absolutely zero support for exporting anything. Their lack of user focus and outdated feature set were a driving force in my switching. The only reason I hadn't already switched to img.ly was a bug that prevents images sent from my phone from being posted, since my phone sends them without a file extension. I worked with their tech support for a while, but they never fixed it. I got a new phone, but it was also a Samsung, and it did the same thing with images. Oh, well. Posterous is better.
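For the curious, the whole integration rests on two curl features: HTTP Basic authentication (-u user:password) and multipart form fields (-F name=value, where @file attaches a file). Here is a rough sketch of the shape of the two Posterous API calls the script makes. The commands are echoed rather than executed, since running them for real would require a live account, and the user, password, Site ID, and filename below are all placeholders:

```shell
# Shape of the two Posterous API calls used by the script below.
# Echoed, not run: the real calls need live credentials, and all the
# values here (user, password, site_id, filename) are placeholders.
echo curl -u "user:password" "http://posterous.com/api/getsites"
echo curl -u "user:password" "http://posterous.com/api/newpost" \
    -F "site_id=12345" \
    -F "title=My picture" \
    -F "media=@picture.jpg"
```

In the script itself, each response is also saved to a file with -o so it can be grepped for the Site ID afterward.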
Anyway, here is the script:
First run it with just the first two arguments, and it will download all of your TwitPic data, including thumbnail images. Once you're satisfied, supply your Posterous User ID, Password, and Site ID. (If you don't know your Site ID, run the script with your Posterous User ID and Password but no Site ID, and it will query your Posterous site info, as long as your credentials are valid.)
Note: if you want to run this from Windows, install Cygwin (with, at a minimum, curl, wget, and sed) and run the script from there.
./twitpic-to-posterous.sh [twitpic-id] [working-dir] [posterous-id] [posterous-password] [posterous-site-id] [skip-number]
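One gotcha: the script checks that the working directory exists and exits otherwise, so create it before the first run. A hypothetical staged run might look like the comments below (the TwitPic name, directory, credentials, and Site ID are all placeholders):

```shell
# The working directory must already exist; the script exits with
# "You must supply a WORKING_DIR." if it is missing.
mkdir -p ./twitpic-backup

# Stage 1: download-only pass (TwitPic name and directory, no credentials):
#   ./twitpic-to-posterous.sh burndive ./twitpic-backup
# Stage 2: add credentials but omit the Site ID to have the script list yours:
#   ./twitpic-to-posterous.sh burndive ./twitpic-backup me@example.com secret
# Stage 3: full upload; after hitting the 50-post daily API limit, resume
# the next day by passing skip-number=50:
#   ./twitpic-to-posterous.sh burndive ./twitpic-backup me@example.com secret 12345 50
```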
#!/bin/sh
# Copyright 2010 Tim "burndive" of http://burndive.blogspot.com/ and http://tuxbox.blogspot.com/
# This software is licensed under the Creative Commons GNU GPL version 2.0 or later.
# License information: http://creativecommons.org/licenses/GPL/2.0/
# This script was obtained from here:
# http://tuxbox.blogspot.com/2010/03/twitpic-to-posterous-export-script.html
RUN_DATE=`date +%F--%H-%M-%S`
SCRIPT_VERSION_STRING="v1.4.0"
TP_NAME=$1
WORKING_DIR=$2
P_ID=$3
P_PW=$4
P_SITE_ID=$5
UPLOAD_SKIP=$6
# Comma separated list of tags to apply to your posts
P_TAGS="twitpic"
# Whether or not to auto-post from Posterous
P_AUTOPOST=0
# Whether or not the Posterous posts are marked private
P_PRIVATE=0
# This is the default limit of the number of posts that can be uploaded per day
P_API_LIMIT=50
DOWNLOAD_FULL=1
DOWNLOAD_SCALED=0
DOWNLOAD_THUMB=0
PREFIX=twitpic-$TP_NAME
HTML_OUT=$PREFIX-all-$RUN_DATE.html
UPLOAD_OUT=posterous-upload-$P_SITE_ID-$RUN_DATE.xml
if [ -z "$TP_NAME" ]; then
echo "You must supply a TP_NAME."
exit
fi
if [ ! -d "$WORKING_DIR" ]; then
echo "You must supply a WORKING_DIR."
exit
fi
if [ -z "$UPLOAD_SKIP" ]; then
UPLOAD_SKIP=0
fi
UPLOAD_SKIP_DIGITS=`echo $UPLOAD_SKIP | sed -e 's/[^0-9]//g'`
if [ "$UPLOAD_SKIP" != "$UPLOAD_SKIP_DIGITS" ]; then
echo "Invalid UPLOAD_SKIP: $UPLOAD_SKIP"
exit
fi
cd "$WORKING_DIR"
if [ -f "$HTML_OUT" ]; then
rm -v $HTML_OUT
fi
# If Posterous username and password were supplied, but not site ID, query the server and exit.
P_SITE_INFO_FILE=posterous-$P_SITE_ID.out
if [ ! -z "$P_ID" ] && [ ! -z "$P_PW" ] && [ -z "$P_SITE_ID" ]; then
echo "Getting Posterous account info..."
curl -u "$P_ID:$P_PW" "http://posterous.com/api/getsites" -o $P_SITE_INFO_FILE
SITE_ID_RET=`grep "<id>$P_SITE_ID</id>" $P_SITE_INFO_FILE`
if [ -z "$SITE_ID_RET" ]; then
echo "Please supply your Posterous Site ID as the fifth argument."
echo "Here is the response from the Posterous server. If you entered correct credentials, you should see your Site ID(s):"
cat $P_SITE_INFO_FILE | tee -a $UPLOAD_OUT
exit
fi
fi
# Confirm that we have a valid Posterous Site ID
if [ ! -z "$P_SITE_ID" ]; then
echo "Getting Posterous account info..."
curl -u "$P_ID:$P_PW" "http://posterous.com/api/getsites" -o $P_SITE_INFO_FILE
SITE_ID_RET=`grep "<id>$P_SITE_ID</id>" $P_SITE_INFO_FILE`
if [ -z "$SITE_ID_RET" ]; then
echo "Make sure that you have supplied a valid Posterous Site ID as the fifth parameter. If you don't know your Site ID, leave it out, and this script will query the server."
echo "Here is the response from the Posterous server. If you entered correct credentials, you should see your site ID(s):"
cat $P_SITE_INFO_FILE | tee -a $UPLOAD_OUT
exit
fi
fi
# Define the log file up front so the page-download loop below can log errors
LOG_FILE=$PREFIX-log-$RUN_DATE.txt
MORE=1
PAGE=1
while [ $MORE -ne 0 ]; do
echo PAGE: $PAGE
FILENAME=$PREFIX-page-$PAGE.html
if [ ! -s $FILENAME ]; then
wget http://twitpic.com/photos/${TP_NAME}?page=$PAGE -O $FILENAME
if [ ! -s "$FILENAME" ]; then
echo "ERROR: could not get $FILENAME" | tee -a $LOG_FILE
sleep 5
fi
fi
if [ -z "`grep "More photos >" $FILENAME`" ]; then
MORE=0
else
PAGE=`expr $PAGE + 1`
fi
done
ALL_IDS=`cat $PREFIX-page-* | grep -Eo "<a href=\"/[a-zA-Z0-9]+\">" | grep -Eo "/[a-zA-Z0-9]+" | grep -Eo "[a-zA-Z0-9]+" | sort -r | xargs`
# For Testing
#ALL_IDS="1kdjc"
COUNT=0
LOG_FILE=$PREFIX-log-$RUN_DATE.txt
echo $ALL_IDS | tee -a $LOG_FILE
for ID in $ALL_IDS; do
COUNT=`expr $COUNT + 1`
echo $ID: $COUNT | tee -a $LOG_FILE
echo "Processing $ID..."
FULL_HTML=$PREFIX-$ID-full.html
if [ ! -s "$FULL_HTML" ]; then
wget http://twitpic.com/$ID/full -O $FULL_HTML
if [ ! -s "$FULL_HTML" ]; then
echo "ERROR: could not get FULL_HTML for $ID" | tee -a $LOG_FILE
sleep 5
fi
fi
# Extract the caption from the image's alt text, decoding the HTML
# entities TwitPic uses for quote characters
TEXT=`grep "<img src=" $FULL_HTML | tail -n1 | grep -oE "alt=\"[^\"]*\"" | sed \
-e 's/^alt="//'\
-e 's/"$//'\
-e "s/&#039;/'/g"\
-e 's/&quot;/"/g'\
`
if [ "$TEXT" = "" ]; then
TEXT="Untitled"
fi
echo "TEXT: $TEXT" | tee -a $LOG_FILE
# Recognize hashtags and username references in the tweet
TEXT_RICH=`echo "$TEXT" | sed \
-e 's/\B\@\([0-9A-Za-z_]\+\)/\@<a href="http:\/\/twitter.com\/\1">\1<\/a>/g' \
-e 's/\#\([0-9A-Za-z_-]*[A-Za-z_-]\+[0-9A-Za-z_-]*\)/<a href="http:\/\/twitter.com\/search\?q\=%23\1">\#\1<\/a>/g' \
`
echo "TEXT_RICH: $TEXT_RICH" | tee -a $LOG_FILE
# Convert hashtags into post tags
P_TAGS_POST=$P_TAGS`echo "$TEXT" | sed \
-e 's/\#\([^A-Za-z_-]\)*\B//g' \
-e 's/^[^\#]*$//g' \
-e 's/[^\#]*\(\#\([0-9A-Za-z_-]*[A-Za-z_-]\+[0-9A-Za-z_-]*\)\)[^\#]*\(\#[0-9]*\B\)*/,\2/g' \
`
# Uncomment if you don't want hashtags converted into post tags
#P_TAGS_POST=$P_TAGS
# Add custom tags from a file (optional). The file is formatted like this:
# ,tag1,tag2,tag3
TAGS_FILE=$PREFIX-$ID-tags-extra.txt
if [ -s "$TAGS_FILE" ]; then
P_TAGS_POST=$P_TAGS_POST`cat $TAGS_FILE`
fi
echo "P_TAGS_POST: $P_TAGS_POST" | tee -a $LOG_FILE
TEXT_FILE=$PREFIX-$ID-text.txt
if [ ! -s $TEXT_FILE ]; then
echo "$TEXT" > $TEXT_FILE
fi
FULL_URL=`grep "<img src=" $FULL_HTML | grep -Eo "src=\"[^\"]*\"" | grep -Eo "http://[^\"]*"`
echo "FULL_URL: $FULL_URL" | tee -a $LOG_FILE
SCALED_HTML=$PREFIX-$ID-scaled.html
if [ ! -s "$SCALED_HTML" ]; then
wget http://twitpic.com/$ID -O $SCALED_HTML
if [ ! -s "$SCALED_HTML" ]; then
echo "ERROR: could not get SCALED_HTML for $ID" | tee -a $LOG_FILE
sleep 5
fi
fi
SCALED_URL=`grep "id=\"photo-display\"" $SCALED_HTML | grep -Eo "http://[^\"]*" | head -n1`
echo "SCALED_URL: $SCALED_URL" | tee -a $LOG_FILE
POST_DATE=`grep -Eo "Posted on [a-zA-Z0-9 ,]*" $SCALED_HTML | sed -e 's/Posted on //'`
echo "POST_DATE: $POST_DATE" | tee -a $LOG_FILE
THUMB_URL=`cat $PREFIX-page-* | grep -E "<a href=\"/$ID\">" | grep -Eo "src=\"[^\"]*\"" | head -n1 | sed -e 's/src=\"//' -e 's/\"$//'`
echo "THUMB_URL: $THUMB_URL" | tee -a $LOG_FILE
EXT=`echo "$FULL_URL" | grep -Eo "[a-zA-Z0-9]+\.[a-zA-Z0-9]+\?" | head -n1 | grep -Eo "\.[a-zA-Z0-9]+"`
if [ -z "$EXT" ]; then
EXT=`echo "$FULL_URL" | grep -Eo "\.[a-zA-Z0-9]+$"`
fi
echo "EXT: $EXT"
if [ "$DOWNLOAD_FULL" -eq 1 ]; then
FULL_FILE="$PREFIX-$ID-full$EXT"
if [ ! -s $FULL_FILE ]; then
wget "$FULL_URL" -O $FULL_FILE
if [ ! -s "$FULL_FILE" ]; then
echo "ERROR: could not get FULL_URL for $ID: $FULL_URL" | tee -a $LOG_FILE
sleep 5
fi
fi
fi
if [ "$DOWNLOAD_SCALED" -eq 1 ]; then
SCALED_FILE=$PREFIX-$ID-scaled$EXT
if [ ! -s $SCALED_FILE ]; then
wget "$SCALED_URL" -O $SCALED_FILE
if [ ! -s "$SCALED_FILE" ]; then
echo "ERROR: could not get SCALED_URL for $ID: $SCALED_URL" | tee -a $LOG_FILE
sleep 5
fi
fi
fi
if [ "$DOWNLOAD_THUMB" -eq 1 ]; then
THUMB_FILE=$PREFIX-$ID-thumb$EXT
if [ ! -s $THUMB_FILE ]; then
wget "$THUMB_URL" -O $THUMB_FILE
if [ ! -s "$THUMB_FILE" ]; then
echo "ERROR: could not get THUMB_URL for $ID: $THUMB_URL" | tee -a $LOG_FILE
sleep 5
fi
fi
fi
BODY_TEXT="$TEXT_RICH <p>[<a href=http://twitpic.com/$ID>Twitpic</a>]</p>"
# Format the post date correctly
YEAR=`echo "$POST_DATE" | sed -e 's/[A-Z][a-z]* [0-9]*, //'`
DAY=`echo "$POST_DATE" | sed -e 's/[A-Z][a-z]* //' -e 's/, [0-9]*//'`
MONTH=`echo "$POST_DATE" | sed -e 's/ [0-9]*, [0-9]*//' | sed \
-e 's/January/01/' \
-e 's/February/02/' \
-e 's/March/03/' \
-e 's/April/04/' \
-e 's/May/05/' \
-e 's/June/06/' \
-e 's/July/07/' \
-e 's/August/08/' \
-e 's/September/09/' \
-e 's/October/10/' \
-e 's/November/11/' \
-e 's/December/12/' \
`
# Adjust the time to local midnight when west of GMT
HOURS_LOC=`date | grep -Eo " [0-9]{2}:" | sed -e 's/://' -e 's/ //'`
HOURS_UTC=`date -u | grep -Eo " [0-9]{2}:" | sed -e 's/://' -e 's/ //'`
HOURS_OFF=`expr $HOURS_UTC - $HOURS_LOC + 7`
echo "HOURS_LOC: $HOURS_LOC"
echo "HOURS_UTC: $HOURS_UTC"
echo "HOURS_OFF: $HOURS_OFF"
if [ "$HOURS_OFF" -lt 0 ]; then
# We're east of GMT, do not adjust
HOURS_OFF=0
fi
if [ "$HOURS_OFF" -lt 10 ]; then
HOURS_OFF=0$HOURS_OFF
fi
if [ "$DAY" != "" ] && [ "$DAY" -lt 10 ]; then
DAY=0$DAY
fi
DATE_FORMATTED="$YEAR-$MONTH-$DAY-$HOURS_OFF:00"
echo "DATE_FORMATTED: $DATE_FORMATTED" | tee -a $LOG_FILE
echo "<p><img src='$FULL_FILE' alt='$TEXT' title='$TEXT' /></p>" >> $HTML_OUT
echo "$BODY_TEXT" >> $HTML_OUT
echo " Post date: $DATE_FORMATTED; Count: $COUNT" >> $HTML_OUT
# Upload this Twitpic data to Posterous
if [ ! -z "$P_SITE_ID" ]; then
# First make sure we're under the API upload limit
if [ "$COUNT" -le "$UPLOAD_SKIP" ]; then
echo Skipping upload...
continue
fi
if [ "$COUNT" -gt "`expr $UPLOAD_SKIP + $P_API_LIMIT`" ]; then
echo "Skipping upload due to daily Posterous API upload limit of $P_API_LIMIT."
echo "To resume uploading where we left off today, supply UPLOAD_SKIP parameter of `expr $UPLOAD_SKIP + $P_API_LIMIT`."
continue
fi
P_OUT_FILE="posterous-$P_SITE_ID-$ID.out"
if [ -s "$P_OUT_FILE" ]; then
rm "$P_OUT_FILE"
fi
echo "Uploading Twitpic image..."
curl -u "$P_ID:$P_PW" "http://posterous.com/api/newpost" -o "$P_OUT_FILE" \
-F "site_id=$P_SITE_ID" \
-F "title=$TEXT" \
-F "autopost=$P_AUTOPOST" \
-F "private=$P_PRIVATE" \
-F "date=$DATE_FORMATTED" \
-F "tags=$P_TAGS_POST" \
-F "source=burndive's Twitpic-to-Posterous script $SCRIPT_VERSION_STRING" \
-F "sourceLink=http://tuxbox.blogspot.com/2010/03/twitpic-to-posterous-export-script.html" \
-F "body=$BODY_TEXT" \
-F "media=@$FULL_FILE"
cat $P_OUT_FILE | tee -a $UPLOAD_OUT
fi
done
echo Done.
This software is licensed under the CC-GNU GPL version 2.0 or later.
PS: If you use my code, I'd appreciate a comment letting me know, along with any feedback you have, especially if it isn't working right for you, but also just to say thanks.
For convenience, you can download this script from my server.
