Beyond Coding and Webscraping

This post was originally written as an IPython notebook, a link to which I planned to include in this blog post. Then I had the better idea of just using the notebook itself. It turns out converting it to a markdown file was extremely simple. Very cool.

I’m a part of the first cohort of Beyond Coding, a program designed to teach the soft skills required to forge a career in technology. The first class of the six-class sequence was held at Trello. Part of the three-hour session involved Elizabeth Hall, VP of People at Trello, speaking about how to give a 30-second elevator pitch, among other job hunting tips. One of her tips was to look at the companies in the portfolios of venture capital firms as potential employers. Manually gathering the names of these companies seems…unwise. This is a job for web scraping.

from bs4 import BeautifulSoup as bs
import requests

sparkCapitalLink = 'http://sparkcapital.com/portfolio/'

response = requests.get(sparkCapitalLink)
soup = bs(response.text, 'html.parser')

Inspecting the elements shows the company names are stored under the div tag with the company class.

companies = soup.findAll('div',{'class':"company"})
companies[0]

<div class="company">
<a class="smaj" href="http://sparkcapital.com/portfolio/1stdibs/">
<img alt="1stDibs" src="http://sparkcapital.com/wp-content/uploads/2012/11/1stDibs-Logo-e1382974973289.png"/>
</a>
</div>

So it’s as simple as accessing the value of ‘alt’. Nice.

sparkCapitalPortfolio = []
for company in companies:
    company_name = company.find('img')['alt']
    sparkCapitalPortfolio.append(company_name)
    print(company_name)

print("\n")
print("There are {} companies in Spark Capital's portfolio".format(len(companies)))

1stDibs
5min Media
Academia.edu
Adap.tv
Admeld
Affirm
Andela
Aviary
Behalf
Benu
BloomNation
Bond Street
Boxee
Coach.me
Coin
Colu
Convert Media
Cover
Covestor
Crowdrise
Cybereason
DIY
Drop Messages
eShares
eToro
Everything But The House
ExFM
Foursquare
Freight Farms
Frontier Strategy Group
FundersClub
gdgt
Get Your Guide
Goji
Hello Alfred
Hey
IEX
Intune
IP Wireless
iWireless
Jana
Kateeva
KickApps
Kik
Kitchensurfing
Kumu
Lexity
Linkwell
Mark43
Menara Networks
Next New Networks
Nimble Commerce
Oculus VR
OMGPOP
OneRiot
Orchard
Panjo
peerTransfer
Picturelife
Pixie
Plaid
Postmates
Priceonomics
Privlo
Proletariat
Qriously
Quantopian
RunKeeper
SendMe
Sensr.net
Sift Science
Signpost
Sincerely
Skillshare
Slack
Socratic
Sourcepoint
Splash
Stack Exchange
Storefront
Storenvy
Super
Superpedestrian
Svpply
Talkspace
Thalmic Labs
thePlatform
Timehop
Trello
Triggit
Tumblr
Twitter
Upworthy
Verivue
VivaReal
Warby Parker
Wayfair
Wealthfront
Work Market

There are 99 companies in Spark Capital's portfolio

That’s a lot of companies. Still, at least the task is easier now. I’m going to do the same for another VC firm that Ms. Hall mentioned.
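As an aside, the fetch-and-extract steps above generalize to any page with the same layout. Here's a sketch that separates the parsing from the fetching so the parsing can be tested on a saved snippet; the function names `extract_company_names` and `get_portfolio`, the explicit `html.parser` choice, and the timeout are my own additions, not part of the original notebook.

```python
from bs4 import BeautifulSoup as bs
import requests

def extract_company_names(html):
    """Parse portfolio HTML, returning the alt text of each company logo."""
    soup = bs(html, 'html.parser')
    return [div.find('img')['alt']
            for div in soup.findAll('div', {'class': 'company'})
            if div.find('img') is not None]

def get_portfolio(url):
    """Fetch a portfolio page and extract its company names."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on a 404/500 instead of parsing an error page
    return extract_company_names(response.text)
```

Keeping `extract_company_names` pure means it can be exercised on an HTML string without hitting the network.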

indexVenturesLink = 'https://indexventures.com/companies#category:65'

response2 = requests.get(indexVenturesLink)
soup2 = bs(response2.text, 'html.parser')

An inspection shows this to be either less complicated or more complicated. I can’t tell. Let’s forge on.

companies2 = soup2.findAll('img')
companies2[0]

<img src="/sites/all/themes/indexventures_theme/images/logo.png"/>

So maybe more complicated? The company names are in the image tags, but so are other unwanted elements. Hopefully only the companies have the ‘alt’ attribute.

indexVenturePortfolio = []

for company in companies2:
    try:
        company_name = company['alt']
        indexVenturePortfolio.append(company_name)
        print(company_name)
    except KeyError:
        print('Not a company')

Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
Not a company
1stdibs: HQ: New York, USA
1stdibs
8tracks: HQ: San Francisco, USA
8tracks
Abe's Market: HQ: Chicago, USA
Abe's Market: Born In: Jerusalem, Israel
Acutus Medical: HQ: San Diego, USA
Acutus Medical
Adallom: HQ: Menlo Park, USA
Adallom: Born In: Tel Aviv, Israel
Addex: HQ: Geneva, Switzerland
Addex
Adyen: HQ: Amsterdam, Netherlands
Adyen
Adzuna: HQ: London, UK
Adzuna
Aegerion (NASDAQ: AEGR): HQ: Jersey City, USA
Aegerion (NASDAQ: AEGR)
AlertMe (LON:CNA): HQ: London, UK
AlertMe (LON:CNA)
Algolia: HQ: Paris, France
Algolia
Alkemics: HQ: Paris, France
Alkemics
Anki: HQ: San Francisco, USA
Anki
Ariad (NASDAQ: ARIA): HQ: Boston, USA
Ariad (NASDAQ: ARIA)
Arista (NYSE:ANET): HQ: Santa Clara, USA
Arista (NYSE:ANET)
ArtBinder: HQ: New York, USA
ArtBinder
asos (LSE:ASC): HQ: London, UK
asos (LSE:ASC)
Assistly (Salesforce): HQ: San Francisco, USA
Assistly (Salesforce)
Astley Clarke: HQ: London, UK
Astley Clarke
Autobutler: HQ: Copenhagen, Denmark
Autobutler
Auxmoney: HQ: Dusseldorf, Germany
Auxmoney
B-Hive (NYSE:VMW): HQ: San Carlos, USA
B-Hive (NYSE:VMW): Born In: Tel Aviv, Israel
Base: HQ: Palo Alto, USA
Base: Born In: Krakow, Poland
Betfair (LSE:BET): HQ: London, UK
Betfair (LSE:BET)
Big Health: HQ: London, UK
Big Health
Big Switch: HQ: Palo Alto, USA
Big Switch
BioXell (SIX:COPN): HQ: Milan, Italy
BioXell (SIX:COPN)
BitPay: HQ: Atlanta, USA
BitPay
BlaBlaCar: HQ: Paris, France
BlaBlaCar
Blaze: HQ: London, UK
Blaze

Almost got it. I just need to get rid of the addresses and keep the unique items. In general, the odd-indexed items hold just the company name, though that isn’t always the case. Still, that will go a long way.

newindexVenturePortfolio = []

for index, item in enumerate(indexVenturePortfolio):
    if index%2!=0:
        newindexVenturePortfolio.append(item)
        
newindexVenturePortfolio

['1stdibs',
 '8tracks',
 "Abe's Market: Born In: Jerusalem, Israel",
 'Acutus Medical',
 'Adallom: Born In: Tel Aviv, Israel',
 'Addex',
 'Adyen',
 'Adzuna',
 'Aegerion (NASDAQ: AEGR)',
 'AlertMe (LON:CNA)',
 'Algolia',
 'Alkemics',
 'Anki',
 'Ariad (NASDAQ: ARIA)',
 'Arista (NYSE:ANET)',
 'ArtBinder',
 'asos (LSE:ASC)',
 'Assistly (Salesforce)',
 'Astley Clarke',
 'Autobutler',
 'Auxmoney',
 'B-Hive (NYSE:VMW): Born In: Tel Aviv, Israel',
 'Base: Born In: Krakow, Poland',
 'Betfair (LSE:BET)',
 'Big Health',
 'Big Switch',
 'BioXell (SIX:COPN)',
 'BitPay',
 'BlaBlaCar',
 'Blaze']
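Incidentally, the enumerate-and-modulo loop above is equivalent to Python's extended slicing: `indexVenturePortfolio[1::2]` takes every second element starting at index 1, i.e. exactly the odd-indexed items. A toy sketch on made-up data:

```python
# Toy version of the scraped alt-text list: name-with-address at even
# indices, bare name at odd indices.
pairs = ['1stdibs: HQ: New York, USA', '1stdibs',
         '8tracks: HQ: San Francisco, USA', '8tracks']

# [1::2] starts at index 1 and steps by 2 -- the odd-indexed items.
odd_items = pairs[1::2]
print(odd_items)  # ['1stdibs', '8tracks']
```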

The next step is to split on the colon and the opening parenthesis, keeping only the first piece of each split.

cleanedindexVenturePortfolio = [company.split(':')[0] for company in newindexVenturePortfolio]
cleanedindexVenturePortfolio = [company.split('(')[0] for company in cleanedindexVenturePortfolio]
cleanedindexVenturePortfolio

['1stdibs',
 '8tracks',
 "Abe's Market",
 'Acutus Medical',
 'Adallom',
 'Addex',
 'Adyen',
 'Adzuna',
 'Aegerion ',
 'AlertMe ',
 'Algolia',
 'Alkemics',
 'Anki',
 'Ariad ',
 'Arista ',
 'ArtBinder',
 'asos ',
 'Assistly ',
 'Astley Clarke',
 'Autobutler',
 'Auxmoney',
 'B-Hive ',
 'Base',
 'Betfair ',
 'Big Health',
 'Big Switch',
 'BioXell ',
 'BitPay',
 'BlaBlaCar',
 'Blaze']
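One wrinkle: names that carried a ticker symbol keep a trailing space after the split (e.g. 'Aegerion '). Tacking a `strip()` onto the same comprehension would tidy them up; a sketch on sample data:

```python
raw = ['Aegerion (NASDAQ: AEGR)', "Abe's Market: Born In: Jerusalem, Israel", 'Adyen']

# Split off the ticker/location suffixes, then strip leftover whitespace.
cleaned = [name.split(':')[0].split('(')[0].strip() for name in raw]
print(cleaned)  # ['Aegerion', "Abe's Market", 'Adyen']
```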

Close enough, though a few names keep a trailing space. And now to merge both lists.

fullcompanyList = sparkCapitalPortfolio + cleanedindexVenturePortfolio

fullcompanyList

['1stDibs',
 '5min Media',
 'Academia.edu',
 'Adap.tv',
 'Admeld',
 'Affirm',
 'Andela',
 'Aviary',
 'Behalf',
 'Benu',
 'BloomNation',
 'Bond Street',
 'Boxee',
 'Coach.me',
 'Coin',
 'Colu',
 'Convert Media',
 'Cover',
 'Covestor',
 'Crowdrise',
 'Cybereason',
 'DIY',
 'Drop Messages',
 'eShares',
 'eToro',
 'Everything But The House',
 'ExFM',
 'Foursquare',
 'Freight Farms',
 'Frontier Strategy Group',
 'FundersClub',
 'gdgt',
 'Get Your Guide',
 'Goji',
 'Hello Alfred',
 'Hey',
 'IEX',
 'Intune',
 'IP Wireless',
 'iWireless',
 'Jana',
 'Kateeva',
 'KickApps',
 'Kik',
 'Kitchensurfing',
 'Kumu',
 'Lexity',
 'Linkwell',
 'Mark43',
 'Menara Networks',
 'Next New Networks',
 'Nimble Commerce',
 'Oculus VR',
 'OMGPOP',
 'OneRiot',
 'Orchard',
 'Panjo',
 'peerTransfer',
 'Picturelife',
 'Pixie',
 'Plaid',
 'Postmates',
 'Priceonomics',
 'Privlo',
 'Proletariat',
 'Qriously',
 'Quantopian',
 'RunKeeper',
 'SendMe',
 'Sensr.net',
 'Sift Science',
 'Signpost',
 'Sincerely',
 'Skillshare',
 'Slack',
 'Socratic',
 'Sourcepoint',
 'Splash',
 'Stack Exchange',
 'Storefront',
 'Storenvy',
 'Super',
 'Superpedestrian',
 'Svpply',
 'Talkspace',
 'Thalmic Labs',
 'thePlatform',
 'Timehop',
 'Trello',
 'Triggit',
 'Tumblr',
 'Twitter',
 'Upworthy',
 'Verivue',
 'VivaReal',
 'Warby Parker',
 'Wayfair',
 'Wealthfront',
 'Work Market',
 '1stdibs',
 '8tracks',
 "Abe's Market",
 'Acutus Medical',
 'Adallom',
 'Addex',
 'Adyen',
 'Adzuna',
 'Aegerion ',
 'AlertMe ',
 'Algolia',
 'Alkemics',
 'Anki',
 'Ariad ',
 'Arista ',
 'ArtBinder',
 'asos ',
 'Assistly ',
 'Astley Clarke',
 'Autobutler',
 'Auxmoney',
 'B-Hive ',
 'Base',
 'Betfair ',
 'Big Health',
 'Big Switch',
 'BioXell ',
 'BitPay',
 'BlaBlaCar',
 'Blaze']
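Both firms backed 1stdibs, so the merged list carries a case-variant duplicate ('1stDibs' and '1stdibs'). An order-preserving, case-insensitive dedupe would shrink the list before saving it; a sketch (the `dedupe` helper is my own addition):

```python
def dedupe(names):
    """Drop repeats, comparing case-insensitively and ignoring stray whitespace."""
    seen = set()
    unique = []
    for name in names:
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(name)  # keep the first spelling encountered
    return unique

print(dedupe(['1stDibs', '1stdibs', 'Slack']))  # ['1stDibs', 'Slack']
```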

Finally, output the results to a text file to complete a *day’s work.

*: Estimate should not be quoted for accuracy.

with open('companies.txt','w') as cFile:
    for company in fullcompanyList:
        cFile.write(company)
        cFile.write('\n')
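Equivalently, the loop could be a single `join` (a sketch on a stand-in list):

```python
fullcompanyList = ['1stDibs', 'Slack', 'Trello']  # stand-in for the scraped list

# join() builds the whole file contents in one string; the trailing '\n'
# keeps the final line newline-terminated, just like the loop version.
with open('companies.txt', 'w') as cFile:
    cFile.write('\n'.join(fullcompanyList) + '\n')
```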

The job search continues in a much cleaner fashion thanks to the power of web scraping.

Written on July 6, 2015