Social media and webmail are so popular and so easy to navigate today that even your grandmother could use them. And if she is like billions of other people in the world, she already has a social network profile (2.4 billion) or an email account (3.1 billion).
With almost everyone—from small children to great-grandparents—using some form of social media or email on a daily basis, online data is growing at an exponential rate, and this has clearly affected the electronic discovery (eDiscovery) and digital forensics process.
Take a look at the statistics below:
1 in 5 minutes online is spent on social networks
6.6 hours a month are spent per user on Facebook
3 million blogs are started every month
72 hours of video are uploaded to YouTube every minute
400 million tweets are sent every day
4.5 million photos are uploaded to Flickr every minute
And here is some even more staggering data:
3.1 billion email accounts
901 million active users on Facebook
54 million WordPress sites
161 million members on LinkedIn
64 million blogs on Tumblr
140 million active users on Twitter
2.4 billion social networking accounts worldwide
(Sources for these statistics: comScore, Technorati, YouTube, The Radicati Group, Inc., Twitter, LinkedIn, and WordPress)
It only makes sense that some of this online information could be essential to a lawsuit or an investigation.
The Evolution of eDiscovery
While the young industry of eDiscovery and digital forensics has been around long enough to develop standards and best practices for handling multiple types of digital files on various media, the inclusion of data from social networking sites (SNS) and webmail platforms is relatively new.
The challenge with SNS and webmail, where each platform has its own rules or no rules at all, is collecting the data in a way that fully preserves the digital information while maintaining its initial context and meaning.
Not all email is alike, and that’s especially true when it comes to webmail. Different programs and systems output email in different formats, so the same metadata does not look the same from one system to the next. It is nearly impossible to effectively cull a mountain of duplicative emails if they were generated using different webmail or internal mail programs.
My company, DSi, was recently involved with a project for which we collected emails from 80 different accounts—approximately 500,000 email messages from both internal mail programs and multiple webmail applications. The emails were collected during the eDiscovery process of a case, and needed to be culled and searched by attorneys to determine what would be pertinent to the lawsuit.
Many of the emails were in EML, a standard format used by multiple email programs. We also collected from other platforms like Lotus Notes and Exchange, which store mail in their own unique formats. Additional emails were collected from multiple Internet Message Access Protocol (IMAP) and Post Office Protocol (POP3) accounts, including Gmail, Hotmail (or Live Mail), Yahoo, Apple’s Me.com, and various others.
We quickly realized during the collection process that there is no one tool that can accurately collect all of the various email formats. Multiple methods and applications need to be employed to accommodate the myriad platforms and file types and still maintain data integrity.
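One common way to show that data integrity survived a multi-tool collection is to record a cryptographic hash of every item at the moment it is collected and to recheck it later. The sketch below illustrates that general idea only; it is not the method used on the project described here, and the file names and sample messages are hypothetical.

```python
import hashlib


def fingerprint(raw: bytes) -> str:
    """Return the SHA-256 hex digest of an item's raw bytes."""
    return hashlib.sha256(raw).hexdigest()


def build_manifest(items: dict) -> dict:
    """Map each collected item name to the digest recorded at collection time."""
    return {name: fingerprint(raw) for name, raw in items.items()}


def find_altered(items: dict, manifest: dict) -> list:
    """Return the names of items whose current digest no longer matches the manifest."""
    return sorted(
        name for name, raw in items.items()
        if manifest.get(name) != fingerprint(raw)
    )


# Hypothetical sample: two collected messages held as raw bytes.
collected = {
    "msg-0001.eml": b"From: a@example.com\r\nSubject: Hi\r\n\r\nBody\r\n",
    "msg-0002.eml": b"From: b@example.com\r\nSubject: Re: Hi\r\n\r\nReply\r\n",
}
manifest = build_manifest(collected)

# Simulate a downstream tool that silently rewrote line endings in one message.
later = dict(collected)
later["msg-0002.eml"] = later["msg-0002.eml"].replace(b"\r\n", b"\n")

print(find_altered(collected, manifest))  # []
print(find_altered(later, manifest))      # ['msg-0002.eml']
```

Hashing the raw bytes, rather than a parsed view of the message, means that even a change as small as converted line endings is detected, which is the kind of alteration a conversion tool can introduce without warning.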
There was also the big issue of handling deduplication, a common way to reduce the amount of data on the front end of an eDiscovery project. Even though email has been around for decades, and in fact pre-dates the Internet, there is no one application that can properly deduplicate across multiple email platforms.
Each platform can structure its data differently. One mail provider may not expose the same fields as another; where the fields do match, they may be named differently, or the metadata may be stored in a different way.
After employing a combination of standard and proprietary methods to collect data from the multiple webmail accounts, DSi’s engineering team designed a solution to perform the actions we needed to compare and deduplicate. This included analyzing each field from every platform to determine the differences and creating a standardization algorithm to run on all data prior to the deduplication process. We wrote custom code to parse the data, create hash values, and store the information so that it could be processed through a standard electronic discovery platform. We also conducted numerous quality checks throughout to confirm our process was accurate, effective, and defensible.
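The general shape of a standardize-then-hash approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DSi's actual code: the field aliases in FIELD_ALIASES, the choice of sender, subject, sent time, and body as the identifying fields, and the sample records are all hypothetical.

```python
import hashlib
from datetime import timezone
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical mapping from platform-specific field names to common names.
FIELD_ALIASES = {
    "from": "sender", "sender": "sender",
    "subject": "subject", "title": "subject",
    "date": "sent", "sent": "sent",
    "body": "body", "text": "body",
}


def normalize(record: dict) -> dict:
    """Rename fields to common names and put their values in a canonical form."""
    common = {}
    for name, value in record.items():
        target = FIELD_ALIASES.get(name.lower())
        if target is not None:
            common[target] = value
    sent = parsedate_to_datetime(common["sent"])
    if sent.tzinfo is None:  # assumption: treat zone-less dates as UTC
        sent = sent.replace(tzinfo=timezone.utc)
    return {
        "sender": parseaddr(common.get("sender", ""))[1].lower(),
        "subject": " ".join(common.get("subject", "").split()),
        "sent": sent.astimezone(timezone.utc).isoformat(),
        "body": " ".join(common.get("body", "").split()),
    }


def dedup_key(record: dict) -> str:
    """Hash the normalized fields so equivalent messages share one key."""
    fields = normalize(record)
    joined = "\x1f".join(fields[k] for k in ("sender", "subject", "sent", "body"))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(records: list) -> list:
    """Keep the first record seen for each dedup key."""
    seen, unique = set(), []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: the first two are the same message from two platforms.
records = [
    {"From": "Alice <Alice@Example.com>", "Subject": "Q3  report",
     "Date": "Mon, 05 Mar 2012 10:15:00 -0500", "Body": "See attached.\r\n"},
    {"sender": "alice@example.com", "title": "Q3 report",
     "sent": "Mon, 05 Mar 2012 15:15:00 +0000", "text": "See attached."},
    {"From": "bob@example.com", "Subject": "Lunch?",
     "Date": "Tue, 06 Mar 2012 12:00:00 +0000", "Body": "Noon?"},
]
unique = deduplicate(records)
```

Here the first two records differ in field names, address formatting, time zone, and whitespace, yet they normalize to the same values and produce the same hash, so only two records survive. In practice, which fields define a duplicate is a judgment call that has to be documented and defended.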
Social Networking Sites:
Facebook, Twitter, LinkedIn, Google
Each social media platform is different, with its own code and variations. They run on their own hardware and software platforms, and some, such as Facebook, have even developed custom technology to run their sites. Because of that, each requires its own method of forensically collecting data. Additionally, collection processes need to keep up with the constantly changing code bases of these social media giants.
Facebook was the first platform to create a simple way to download a user’s information. It includes all posts, messages, and chat conversations as well as photos and videos that the user has shared. There is also the option of an “expanded archive” that includes additional historic information such as IP addresses used during logins. Facebook data is provided in an HTML format that can be viewed on a computer.
If the user downloads his own Facebook information, an expert in forensic collections should be on hand to ensure everything is handled correctly and that a specific protocol is followed.
Collecting from Twitter has been more labor-intensive than collecting from Facebook because there was no built-in method for gathering all of a user’s tweets. However, Twitter recently started rolling out a new feature that allows users to download their entire Twitter archive with the click of a button, including the ability to filter the output by month or search via keywords, phrases, hashtags, and usernames. Until the rollout is complete, there are other methods, either writing custom code or using an emerging platform, to grab all available tweets, including contacts, lists, retweets, geographic places, links, and accounts that are following the user or followed by the user.
For LinkedIn, our experience has shown that the most effective way to gather data is by writing custom code. Because of the way information has been stored and structured on the site, LinkedIn has been the most disjointed system of all the major social media networks and thus the most difficult one from which to collect data. Ongoing LinkedIn upgrades will make collection easier.
Google has made great strides in making collection easy for its applications, such as Google Docs, Gmail, and chats. Its recently launched eDiscovery tool, Google Vault, makes it easy to manage information governance, archive emails and chats, perform eDiscovery searches, export and audit data, and place legal holds, all of which greatly simplifies the collection of Google information.
Other cloud-based documents and calendars can be collected through an existing application or by writing code to fit specific requirements. After that information is collected, it can be converted to formats that can be opened in common programs, such as Microsoft Office.
Proceed with Caution
There are many ways to gather data from social networks and webmail. However, not every collection method is acceptable, and digital forensics and eDiscovery companies must have proper authorization from the service provider. Each platform’s terms of service should be reviewed carefully to determine whether the agreement will be violated, either by the manner in which collection happens or because of the information that is gathered.
In the future, all webmail applications and social platforms will follow the lead of Facebook and Google by establishing methods within the applications to collect, search, and view archived records. Similarly, look for the eDiscovery industry to place an emphasis on learning and understanding the best practices that are involved in webmail and social media collection.
However, we are not yet at that point. Before collecting any webmail or social media, it is important to conduct an in-depth vetting process with the companies involved to learn about their procedures, protocols, and quality control standards.