Thursday, March 21, 2013

Predictive Analysis For Product

If you are involved in Product, Itamar Rosenn has a short talk you must listen to. In about 10 minutes, he explains how his data science team works with user data (OSEMN) and how that work affects product direction. Data-driven product is a topic I will write more about in the coming weeks (my post Focus On Relationships, Skip Personas is a good intro to why I advocate this approach); in the meantime, here are some notes from his talk.

Using R

Itamar first explains, on a technical level, how his team processes this large data set. At the time, he mentions, data analysis was done in a distributed framework: the data clusters were analyzed with a MapReduce implementation via Hadoop. When doing initial analysis, they live in 'Hadoop land', using a query language on top of Hadoop, or Python. (I speculate they even wrote their own query language.)

When the data is ready for early-stage exploration, they reduce it to fewer than 1 million observations, export it into R, and spend 15-20% of their time there. They chose R for roughly these reasons:

  • Flexibility: it offers an easy way to trim and work with the data so it can be explored.
  • Graphics package: great for communicating findings.

Again, R is about exploring the data and communicating the findings. For full analysis, they use custom Hadoop processes.
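To make the "reduce to under 1 million observations" step concrete, here is a minimal sketch of one way to draw a fixed-size random subset from a data set too large to hold in memory. This is my own illustration, not something from the talk: the talk does not say how the reduction was done, reservoir sampling is simply one common choice, and I wrote it in Python rather than R to keep it dependency-free. The `stream` of user rows is made up.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of at most k rows from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Hypothetical stream standing in for the output of a cluster job.
stream = ({"user_id": n, "sessions": n % 7} for n in range(10_000))
subset = reservoir_sample(stream, k=1_000)
print(len(subset))
```

The appeal of this approach is that it needs only one pass over the data and memory for `k` rows, so the subset can be written out and loaded into an exploration tool afterwards.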

Describing and modeling user behavior

In 2008 Facebook observed a drop in user retention. To understand and predict what distinguished users who remained with Facebook from those who did not, they used R to explore and help model these behaviors and characteristics.

To begin, they picked 2 cohorts of new users; each cohort was about 300,000 users (the number of users they got in a week). For each user they built a rich data set of all their characteristics: age, sex, school information (if provided), how many friends they had, how many people reached out to them, how many times they were communicating with each other, whether they uploaded photos, how many photos they looked at...

They then defined a 'new user period' of two weeks and pulled data from that time. Their goal was to set up 2 predictive tasks for these new users:
  • Would they continue to be an active user after 3 months?
  • What would their level of activity be if they did become an active user? (Level of activity is defined as number of sessions, frequency of posting photos, and writing on someone's wall.)
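As a rough illustration of how the two-week window and the two prediction targets fit together, here is a small Python sketch that turns one user's session dates into a predictor and two labels. The exact definitions are my assumptions, not from the talk: I treat "active after 3 months" as "at least one session in a 30-day window starting 90 days after signup", and I use the session count as the only activity measure. The function name `build_row` and the sample dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

NEW_USER_PERIOD = timedelta(days=14)   # the two-week window from the notes
FOLLOW_UP_START = timedelta(days=90)   # "after 3 months"; exact cutoff is an assumption
FOLLOW_UP_LENGTH = timedelta(days=30)  # assumed length of the observation window


def build_row(signup: date, session_days: List[date]) -> Dict[str, object]:
    """Summarise one user: early behavior as predictor, later behavior as targets."""
    early_end = signup + NEW_USER_PERIOD
    late_start = signup + FOLLOW_UP_START
    late_end = late_start + FOLLOW_UP_LENGTH
    early = [d for d in session_days if signup <= d < early_end]
    late = [d for d in session_days if late_start <= d < late_end]
    return {
        "early_sessions": len(early),  # predictor from the new user period
        "is_active": len(late) > 0,    # task 1: classification target
        "late_sessions": len(late),    # task 2: activity-level target
    }


# Made-up user: two visits in the first two weeks, one visit three months later.
signup = date(2008, 1, 1)
sessions = [date(2008, 1, 2), date(2008, 1, 10), date(2008, 2, 1), date(2008, 4, 5)]
row = build_row(signup, sessions)
print(row)
```

Keeping the predictor window and the target window strictly separate matters here: anything measured after the new user period would leak information about the outcome into the model.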

The 1st step was classification: given these data sets in R, they ran 2 classification procedures:

  • Recursive partitioning
  • Logistic regression

Recursive partitioning is preferred because it gives consumers of the data (the rest of the team and company) a more digestible decision framework for understanding why users stay or don't stay. Logistic regression helps the data scientists understand how variables interact with each other.
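To show why recursive partitioning produces such a digestible result, here is a minimal Python sketch of the idea: repeatedly pick the yes/no question that best separates users who stayed from users who left, then repeat within each branch. This is my own toy version using Gini impurity on boolean features, not the team's R code, and the eight users and the feature names `returned` and `gave_gender` are invented for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, bool]


def gini(labels: List[bool]) -> float:
    """Gini impurity of a set of stayed/left labels (0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows: List[Row], labels: List[bool], features: List[str]) -> Optional[str]:
    """Return the feature whose yes/no split most reduces weighted Gini impurity."""
    best, best_score = None, gini(labels)
    for f in features:
        yes = [y for r, y in zip(rows, labels) if r[f]]
        no = [y for r, y in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # this feature does not separate the rows at all
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        if score < best_score - 1e-12:
            best, best_score = f, score
    return best


def grow(rows: List[Row], labels: List[bool], features: List[str], depth: int = 2):
    """Return a leaf (share of users who stayed) or (feature, no_branch, yes_branch)."""
    f = best_split(rows, labels, features) if depth > 0 else None
    if f is None:
        return sum(labels) / len(labels)
    rest = [g for g in features if g != f]

    def branch(keep: bool):
        idx = [i for i, r in enumerate(rows) if r[f] == keep]
        return grow([rows[i] for i in idx], [labels[i] for i in idx], rest, depth - 1)

    return (f, branch(False), branch(True))


# Invented toy data: (returned within new user period, gave gender) -> stayed?
raw = [(True, True, True), (True, True, True), (True, False, True), (True, False, False),
       (False, True, False), (False, True, False), (False, False, False), (False, False, False)]
rows = [{"returned": a, "gave_gender": b} for a, b, _ in raw]
labels = [c for _, _, c in raw]
tree = grow(rows, labels, ["returned", "gave_gender"])
print(tree)
```

The resulting tree reads as a short chain of questions ("did they return? did they give their gender?") with a retention rate at each leaf, which is the kind of output a non-specialist can act on.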

Findings: Will a user be active after 3 months

Two findings stood out (when they took these out and ran recursive partitioning again, nothing else stood out):
  • Whether a user came onto the site more than once within the new user period
  • Whether they supplied their gender
They found that if people came onto the site and gave some basic information (gender in this study), they were more likely to stay on the site. Logistic regression showed that when other people reached out to new users, retention was high.

From this they had some actionable metrics. They found that tossing a lot of trends or a bunch of interesting concepts at new users did not help with user acquisition; instead, they needed to communicate a value proposition that speaks to them in a very efficient way.

For the 2nd task, predicting how much people will interact once they are users, they first played with a linear model, but moved to lasso and LARS (least angle regression). LARS worked because they had a huge array of variables (about the user and how they interact with the site), and naively putting all of those into a linear regression didn't work. LARS acts as a form of model selection, finding a terse, economical regression that gives good results. They ran LARS for 2-3 weeks.
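To illustrate what a "terse, economical regression" means in practice, here is a small Python sketch of the lasso, which penalizes the absolute size of the coefficients so that most of them end up at exactly zero. Note the substitution: this uses cyclic coordinate descent rather than the LARS algorithm the team ran, since it is much shorter to write out, and it is my own toy example. The data is synthetic: 10 variables of which only the first two actually drive the outcome.

```python
import random
from typing import List


def soft_threshold(z: float, t: float) -> float:
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso(X: List[List[float]], y: List[float], lam: float, sweeps: int = 100) -> List[float]:
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that ignores column j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - w[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new
    return w


# Synthetic data (an assumption for illustration): only variables 0 and 1 matter.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(200)]
y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.1) for r in X]
w = lasso(X, y, lam=0.1)
print([round(c, 2) for c in w])
```

An ordinary least-squares fit would assign a small nonzero weight to every one of the 10 variables; the penalty is what lets the model ignore the irrelevant ones, at the cost of shrinking the real coefficients slightly.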

Findings: How active are users after 3 months

They found that the most highly predictive independent variables for the activity level of a user fell into three general classes:

  • Direct communication: if they are reached out to by a large group of people, very frequently, they are more likely to stay.
  • Related to platform: if they used 3rd-party apps frequently, they would stay on; if they used them a whole lot, they would drop off.
  • Personal: if they gave a lot of information about themselves, that was highly predictive.

That sums up my notes and take on what Itamar talked about. It's a lot of great information and I look forward to writing and exploring it more in the coming weeks.