Getting interesting business data: Form 5500 and SOS API

Sample Code

This is the first post in my “Getting interesting business data” series! There are a lot of really great sources of business data out there and I want to show how to get them and use them.

In order to request business data you’ll need a Cobalt Intelligence API key. In order to get contact information (email addresses or phone numbers) you’ll need a Datafinder API key. Once you have these, you’ll need to rename the .sample.env file to .env and replace the dummy values with the actual API keys.

Form 5500

Form 5500 is an official form that companies are legally required to file if they have any kind of tax-advantaged pension or (I think) healthcare plan. Because these plans get tax benefits, which ultimately come from taxpayers, the filings are public. This lets us search for all of these companies and see how many participants each plan has (a rough proxy for the number of employees) and how much the plan holds in assets, which together give an estimate of the size of the business.

form 5500 search example

As you can see in the picture, it’s very easy to search by keyword. I typically search for a word that I know will be in the name of the business and change my filter to “Plan Name” or “Plan Sponsor”.

The great thing is that you can export this data to a CSV. There is also a PDF for each filing that includes a phone number, but I don’t go into that in this post. It would be simple enough, though, to parse the PDF for the phone number.
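Since the export carries participant and asset figures, it is easy to narrow the list by business size before doing anything else. Here is a minimal sketch, assuming the rows have already been parsed into objects and that the `Participants` and `Assets` columns hold numbers formatted as text (possibly with `$` and commas). The helper names `toNumber` and `filterBySize` are made up for illustration.

```typescript
// One parsed row of the export. Only the two columns used here are named;
// the text format of their values is an assumption.
interface PlanRow {
    Participants?: string;
    Assets?: string;
    [column: string]: string | undefined;
}

// Strips currency symbols and thousands separators, e.g. "$1,500,000" becomes 1500000.
// Returns NaN when nothing numeric is left.
function toNumber(value: string | undefined): number {
    const cleaned = (value ?? '').replace(/[^0-9.-]/g, '');
    return cleaned === '' ? NaN : Number(cleaned);
}

// Keeps rows that meet both thresholds. Rows with missing or unparseable
// values are dropped, because any comparison with NaN is false.
function filterBySize(rows: PlanRow[], minParticipants: number, minAssets: number): PlanRow[] {
    return rows.filter(row =>
        toNumber(row.Participants) >= minParticipants &&
        toNumber(row.Assets) >= minAssets
    );
}
```

For example, `filterBySize(rows, 50, 1000000)` would keep only plans with at least 50 participants and at least a million in assets.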

Because a form is filed for each year of the plan, there are typically a lot of duplicates. I used this function to clean them out. It takes the CSV, keeps one row per business (matched on EIN), and saves the result as another CSV.

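// The default import of csvtojson assumes esModuleInterop is enabled.
import csvtojson from 'csvtojson';
import * as json2csv from 'json2csv';
import * as fs from 'fs';

// Reads `${fileName}.csv`, keeps the first row seen for each EIN,
// and writes the result to `${fileName}, cleaned.csv`.
async function removeDuplicates(fileName: string) {
    const uncleanBusinesses = await csvtojson().fromFile(`${fileName}.csv`);
    const businesses: any[] = [];
    // EINs that have already been kept, so each duplicate check is a constant-time lookup.
    const seenEins = new Set<string>();

    for (const business of uncleanBusinesses) {
        if (!seenEins.has(business.EIN)) {
            seenEins.add(business.EIN);
            businesses.push(business);
        }
    }
    console.log('Total', businesses.length);

    const csv = json2csv.parse(businesses);
    fs.writeFileSync(`${fileName}, cleaned.csv`, csv);
}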
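The function above keeps whichever filing happens to come first in the export. If you would rather keep the most recent filing for each business, a variation along these lines could work. This is only a sketch: the year column name (`'Plan Year'` in the usage note) is an assumption about the export, and `keepLatestPerEin` is a made-up helper name.

```typescript
// One parsed row of the export. EIN is the column the post deduplicates on.
interface FilingRow {
    EIN: string;
    [column: string]: string;
}

// Keeps, for each EIN, the row with the largest value in `yearColumn`.
// If a year is missing or not numeric, the first row seen for that EIN wins.
function keepLatestPerEin(rows: FilingRow[], yearColumn: string): FilingRow[] {
    const latest = new Map<string, FilingRow>();

    for (const row of rows) {
        const current = latest.get(row.EIN);
        if (!current || Number(row[yearColumn]) > Number(current[yearColumn])) {
            latest.set(row.EIN, row);
        }
    }

    // A Map iterates in the order keys were first inserted, so businesses
    // stay in the order they first appeared in the export.
    return Array.from(latest.values());
}
```

Usage would look like `keepLatestPerEin(uncleanBusinesses, 'Plan Year')`, with the column name adjusted to whatever the export actually calls it.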

SOS API

Now for my baby. The Secretary of State API was built for just this purpose. I was running into datasets that had business data but didn’t have owner information. Fortunately, for many businesses this information is public, and the SOS API makes it easy to get. You just make a simple request to the API with the state and the business name and it returns the relevant data. It even comes with a JavaScript SDK!

The SOS API currently supports all states except New Jersey, which doesn’t have any public data.

Once I have the data cleaned, this is the function I use to get the owner details using the SOS API SDK (three acronyms in a row is good, right?).

async function getOwnerDetails(fileName: string) {
    const sosApi = new SosApi(process.env.cobaltIntApiKey);
    const cleanBusinesses = await csvtojson().fromFile(`${fileName}, cleaned.csv`);

    const businesses: any[] = [];
    const promises: any[] = [];

    for (let i = 0; i < cleanBusinesses.length; i++) {
        const cleanBusiness = cleanBusinesses[i];
        console.log('Searching for', cleanBusiness['Sponsor Name'], cleanBusiness.State);

        promises.push(sosApi.getBusinessDetails(cleanBusiness['Sponsor Name'], cleanBusiness.State).then(businessResults => {
            const business = businessResults[0];
            console.log('business', business);

            if (business?.agentName) {
                businesses.push({
                    ...business,
                    link: cleanBusiness.Link,
                    assets: cleanBusiness.Assets,
                    employees: cleanBusiness.Participants
                });
            }
        })
            .catch(e => {
                console.log('Error happened', e);
            }));

        // if (i % 70 === 0) {
        //     await timeout(45000);
        // }
    }

    await Promise.all(promises);

    const csv = json2csv.parse(businesses);
    fs.writeFileSync(`${fileName}, from API.csv`, csv);
}

Because the SOS API is built for a fair amount of load, you can hit it pretty heavily. I’m pushing all of these requests to the API at once, without waiting for each one to complete. The SDK handles any retries and returns the data as it finds it.

There is some limit to how hard you can hit it, which is why you see that commented out section there. It looks like this:

        // if (i % 70 === 0) {
        //     await timeout(45000);
        // }

What this will do is pause for 45 seconds after every 70 requests. (As written, the check is also true when i is 0, so it pauses once right after the first request as well.) The API will start to throw errors if you burst it too heavily, so this kind of thing will protect against it. In the example I used here, “cruise”, I only had 69 businesses, so I just commented it out for my purposes.
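The commented-out snippet calls a `timeout` function that isn’t shown in this post. Here is a minimal sketch of what such a helper and the batching around it might look like. Everything in it (`timeout`, `shouldPause`, `runThrottled`, the `Outcome` type) is a hypothetical name used for illustration, not part of the SOS API SDK.

```typescript
// Hypothetical helper: resolves after `ms` milliseconds.
function timeout(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// True when a pause is due after starting request number `started` (counted from 1):
// after every full batch, unless there is nothing left to send.
function shouldPause(started: number, batchSize: number, total: number): boolean {
    return started % batchSize === 0 && started < total;
}

// Result of one request: either the value or the error it failed with.
type Outcome<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Starts `task` for every item without waiting for earlier ones to finish,
// pausing `pauseMs` after each batch of `batchSize` requests has been started.
async function runThrottled<T, R>(
    items: T[],
    batchSize: number,
    pauseMs: number,
    task: (item: T) => Promise<R>
): Promise<Outcome<R>[]> {
    const pending: Promise<Outcome<R>>[] = [];

    for (let i = 0; i < items.length; i++) {
        // Attach both handlers right away so a failure that arrives
        // during a pause is never left as an unhandled rejection.
        pending.push(task(items[i]).then(
            (value): Outcome<R> => ({ ok: true, value }),
            (error): Outcome<R> => ({ ok: false, error })
        ));

        if (shouldPause(i + 1, batchSize, items.length)) {
            await timeout(pauseMs);
        }
    }

    return Promise.all(pending);
}
```

With the function from this post, a call might look like `runThrottled(cleanBusinesses, 70, 45000, b => sosApi.getBusinessDetails(b['Sponsor Name'], b.State))`. Counting from one means the first pause comes after the 70th request rather than after the first.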

Getting contact information

For contact information I use Datafinder. It returns the best contact information I’ve been able to find at a fairly reasonable price. Their API is really easy to use. I take the owner information that I found in the previous step and pass it along with location data (city, state) to Datafinder. Here’s the function I use for getting an email:

import axios, { AxiosResponse } from 'axios';

// Looks up an email address for a person: first by name and state,
// then again with the city added if the first lookup finds nothing.
async function getEmail(name: string, state: string, city: string): Promise<string | undefined> {
    let dataFinderResponse: AxiosResponse;
    let email: string | undefined;

    // Encode the values so spaces or characters like & in a name can't break the query string.
    const stateUrl = `https://api.datafinder.com/v2/qdf.php?service=email&k2=${process.env.dataFinderApiKey}&d_fullname=${encodeURIComponent(name)}&d_state=${encodeURIComponent(state)}`;

    try {
        dataFinderResponse = await axios.get(stateUrl);
    }
    catch (e: any) {
        console.log('Error from datafinder', e.response, name);
        return;
    }

    if (!dataFinderResponse.data?.datafinder?.results) {
        console.log('Data finder without results. Trying with the city', dataFinderResponse.data?.datafinder);
    }
    else {
        console.log('Data finder response', dataFinderResponse.data.datafinder.results[0]);
        email = dataFinderResponse.data.datafinder.results[0]?.EmailAddr;
    }

    if (!email) {
        // Second attempt: same query, narrowed down by city.
        try {
            dataFinderResponse = await axios.get(`${stateUrl}&d_city=${encodeURIComponent(city)}`);
        }
        catch (e: any) {
            console.log('Error from datafinder', e.response, name);
            return;
        }

        if (!dataFinderResponse.data?.datafinder?.results) {
            console.log('Data finder without results even when trying with city.', dataFinderResponse.data?.datafinder);
        }
        else {
            console.log('Data finder response', dataFinderResponse.data.datafinder.results[0]);
            email = dataFinderResponse.data.datafinder.results[0]?.EmailAddr;
        }
    }

    return email;
}

It tries to find an email at the state level first. If it doesn’t find one, it moves to the city level.
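The function that calls `getEmail` for each business isn’t shown in this post. A rough sketch of such a loop might look like the following: skip the lookup for rows that already have a usable email, skip rows without an agent name or state, and look up the rest one at a time. The column names `state`, `city`, and `email` are assumptions (only `agentName` appears in the code above), as are the helper names. The lookup function is passed in as a parameter so the loop can be tried without calling the real API.

```typescript
// One row of the "from API" CSV. Column names other than agentName are assumptions.
interface BusinessRow {
    agentName?: string;
    state?: string;
    city?: string;
    email?: string;
}

// Same shape as getEmail: resolves to an email address, or undefined if none was found.
type EmailLookup = (name: string, state: string, city: string) => Promise<string | undefined>;

// True when the row already has an email that isn't blank or a placeholder like "N/A".
function hasUsableEmail(row: BusinessRow): boolean {
    const email = (row.email ?? '').trim();
    return email !== '' && email.toLowerCase() !== 'n/a';
}

// Returns only the rows that end up with an email address.
async function getEmails(rows: BusinessRow[], lookup: EmailLookup): Promise<BusinessRow[]> {
    const withEmails: BusinessRow[] = [];

    for (const row of rows) {
        if (hasUsableEmail(row)) {
            // Nothing to look up; keep the row as it is (an assumption about the desired output).
            withEmails.push(row);
            continue;
        }
        if (!row.agentName || !row.state) {
            // The lookup needs at least a name and a state.
            continue;
        }

        // One request at a time, so the contact API isn't hit with a burst.
        const email = await lookup(row.agentName, row.state, row.city ?? '');
        if (email) {
            withEmails.push({ ...row, email });
        }
    }

    return withEmails;
}
```

Calling `getEmails(businesses, getEmail)` would wire it to the function above. Running the lookups sequentially is slower than firing them all at once, but it is gentler on the API.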

And….BAM. We just used some cool technology to get some contact information for a specific industry of businesses.

The end.



Transcript:

Hello there. My name’s Jordan Hansen. I’m from Cobalt Intelligence. And today I’m talking about the Secretary of State API. I talk about the API a lot. Today I’m going to take a different approach. I think there are some really cool, really great data sources out there on the internet, and I want to show you the power of them.

If you combine them with things like the API or other opportunities, you really can automate collection of a lot of this data: get owner information, get contact information. For this you’ll need two API keys. The first is a Cobalt Intelligence API key. You just go over here. The second is the Datafinder API key.

Now this is extra credit. I’m going to go over and show how to get email addresses, and I’m using the Datafinder API key. You don’t have to have this. It’s not required for the sample, but if you want it, you can get it and you can do that as well. So Datafinder, I use them for skip tracing. Um, I don’t know if it’s really skip tracing, but they have great contact information.

Um, and it’s the best one I’ve found for email addresses and phone numbers for companies. So here we go. Um, Form 5500 is what we’re talking about today. Form 5500 is a form that companies are required to file by law if they have a tax-advantaged pension plan. And I think that includes some health plans.

I’m not exactly sure what’s included, but Form 5500 is what it’s called, and it’s public because of the Freedom of Information Act. You’re essentially getting a benefit from the government, so this information is public. Now this does interesting things like get you some companies. You can search by things like business name, and that’s what we’re going to be doing today.

And you can get, uh, how many employees they have, or how many participants are in the plan, which is probably close to the employee count. Um, and then how much in assets they have. So you can kind of gauge the size of the business pretty well. And I’m going to show you how we can leverage that into some other data.

So today I’m going to search for cruise. Um, let’s say I wanted to market to, like, sometimes people have specific things for affected businesses. So cruise lines, for example. Let’s go over here to cruise lines. The cool thing is we go over here, cruise. We can see all these things here. We can download one of these.

Um, and this shows you, it’s a PDF of the form they filed. It’s kind of the cool thing. It shows who the plan administrator is, which is typically the business. Sometimes they have a third-party administrator. And then they have the phone number here. Now you can get this phone number out of this by using some kind of PDF parser. Not going to do that today.

You can do it and it wouldn’t be too difficult. But today we’re just going to work on, uh, email addresses. That’s what we’re looking for. So let’s go over here. We’re going to export to CSV. And this is the cool part about this. We’re going to come over here and say, right. I actually did some samples here before, so we’ll get rid of this.

And then we’re going to put it over here in this package right here. The export, there we go. Now we have it in there. Now, as you can see, there are a lot of duplicates, and this is because they have to file a form every year. And so we have one from 2011, 2010, 2013. So the first function we’re going to do is called removeDuplicates.

And I’m using a bunch of different things here to handle, um, how this all works. We’re using things like removeDuplicates, and that’s the first thing we’re going to do with this file. We’re going to come over here. Um, we have the unclean businesses in here and we’re just going to pass in the file name, which we named cruise right there.

Oh no, we did not name it. Let’s rename it to cruise.csv. And then we’re going to parse it with csvtojson, and it’s going to turn it into JSON, and then we’ll loop through it and make sure there are no duplicates. So we’re going to say, hey, find the business, um, in the businesses where we’re pushing these into.

So we’re pushing them into this. Find them, and if it already exists, don’t push it. So essentially we’re cleaning out all of the duplicates. So that’s what we’re going to try. So we’re going to npm start this. It runs the index file, and that was quick: 69 businesses. Out of that list of 491, 69 businesses.

That’s pretty cool. So now we’re going to go ahead and do the cool part. We’re going to come over here, getOwnerDetails. And now we’re going to use the API, the Cobalt Intelligence Secretary of State API. What this does is, every business has to register with a state. Um, roughly. Let’s say that’s not a conclusive rule, but, um, when they get to any size or are an LLC, they have to register with their state.

Now this is public data. So every state has a secretary of state. Um, and I built an API. We have an API, Cobalt Intelligence does, that will allow you to get this business information. So what you do is you import this SDK, if you’re using JavaScript. If you’re not using JavaScript, you’ve got to do it yourself manually.

And that’s fine. I have a bunch of videos showing how to do it. So you go find those and you can use those. So we come over here, initialize it, passing your API key, and then you get your SOS API right here. And then from there, um, we’re going to open up the clean businesses. These are the ones where we removed the duplicates.

We come over here and we say, okay, these are the cleaned ones, right? That’s what we did over here, in this removeDuplicates function right here. And then we’re going to go, okay, let’s loop through this. So we’re going to loop through our businesses. Um, I’m going to start this right now because it’s going to take a little bit. Give me a start, and it’s going to get the owner details.

Now it’s going to take a few minutes. It’s starting to go right in there and bam, bam, bam, bam. It’s going to go through now. It’s going to handle these things. Sometimes it’s going to find the business. Other times, well, we’ve got businesses coming in. Okay. So what this does, don’t be distracted. Oh, you’re going to be distracted.

I am too. Okay. We’re going to come through here. It’s going to loop through all the clean businesses, so 69 in this case, and it’s going to say searching for whatever, and it’s going to pass in the clean business’s sponsor name, which is the business name, I’m assuming, most of the time, and the state. And then the result is going to come back.

So it’s going to come over here. It’s going to say, okay, we found the results, and it could be an array of businesses in case there are multiple businesses with that same name. Um, that typically should only happen if there are multiple states, but just in case, we grab the first one. And then we’re going to check to see, is there an agent name? Now, the agent name I’m assuming to be the owner here.

It’s going to go through here and it’s going to say, okay, there’s an agent, so let’s push it in here. We’ll keep the link, the assets and the participants. That’s funny. Cruising associates. It’s actually Robert F crews, but they do tax and finances. So not even cruise related.

That’s funny. Okay. And then if there’s an error, it’s going to come over here. Now I am doing this asynchronously, as in, I just dumped, I think, 70 at the API all at once. And it can handle a pretty heavy load at once. It does queuing on the state, so it doesn’t scrape them too hard. And so it’s good.

Interesting. Okay. So it’s going to come over here, and it goes pretty fast. And, um, so it’s handling 72. We’re getting close to being done, almost. And you can tell that because it’s retrying, um, because of the queue system. It goes over there if it takes a little bit longer, or if it’s scraping places like Delaware. It looks like we may be almost done.

In fact, there’s only one here retrying anymore. I hope that thing doesn’t get stuck. Sometimes it gets stuck. Okay. Retry attempt number eight. I wonder what. It could be a similar state. Anyway, it’s going to come through here, and you can’t dump like thousands of requests on the API at once. Bam. We didn’t find it.

Alternative business crowns can connection. Okay. So bam, now we’re done. We just ran through about 70 businesses really quickly. We went through here, we had the promises, we pushed them into there, and then we handled the response in the promise when it was finished. And then we went right here and waited for all of them to come back.

And then we put them in this from API file. Now let’s look at what it looks like. Let’s see how we did. So from those 69, we got 39 businesses with agent names. Some of them are inactive, which is possible because it would have been past ones. But anyway, the cool thing is we want the agent name, right?

Agent name right here, right there. What is that column? So now we have their names, right? Nobody jumps out for this one, but now we have a bunch. So this is the first step, and you can do this with any category, right? Come over here and search for whatever you want. And I say industry, but it’s going to be a keyword search.

You just type in the name of the business and go through it. But now the next part is going to be getting emails. I’ll show you how to do that. Await getEmails. Um, and I call it path. I want to call it file name. I think over here I use file name. There we go. I’ll pass in cruise. And now we’re going to go ahead and get emails for this.

So let’s go over and look at this function. Now we’ve gotten the business owner information. Now let’s go get emails. So we’ll come over here. We’re going to say, okay, we’re going to open up that file, file name from API. That’s what we saved it as right here in this function. See, right there.

Okay. Now we’re going to loop through this. We’re going to say, okay, let’s loop through the businesses. We go through here and we’re going to say, okay, if there’s already an email address. For some states I can get email addresses already, so you don’t have to use an API for those: Washington, Virginia, New Hampshire.

I can get emails for those already. So if there is an email and it does not equal, whatever, N/A or something else, we can skip those, uh, because we’re not going to search for those. Now we go through and we make sure we have a state and agent name. If they don’t, we skip it, because we need those to be able to get them.

And then we call over to this function I’ve created called getEmail. Um, you pass your agent name, your state, your city, and it makes it so we can go through it. So let’s look at that function, right down here. So I’m going to use Axios here, which is just an HTTP request library. I come over here and say, okay, we’re going to make a request.

We go to Datafinder, we pass in our API key. We pass in the name and the state. Um, and then it will return the data. Now, if it didn’t find anything, I go ahead and go in more depth. I do it first without the city, I think. I just go more general, and then I add the city to try to get more depth. So if I couldn’t find the email with just the state, I niche it down and go to the city and do the same thing we did.

And if we don’t get anything, we can say, oh, we’re sad, we didn’t get anything. Now Datafinder says they get about a 38% hit rate. I’ve found that to be pretty close. So how many do we have? Um, I don’t know, 39 I think is what we said. So we’ll see how many we get there.

Pretty fast. Let’s run this and we’re going to call getEmails. That’ll go through and look through that list. And again, I’m going to include the code with all this, so you can access the code and go through it. Um, but this is a way to get emails. Bam, can’t find anything. Oh, we found one. Sweet.

So now we’re coming through here, and look, they’re pretty quick. So that’s three seconds, three seconds, sometimes, you know, one. We have 39. So we’re talking about, I dunno. I’m not doing this asynchronously, right? So I’m not slamming it. Um, but still it’s pretty fast. The responses, there we go. Bam, bam, bam.

Almost done. Now we’re done. Now we go over here and see, um, how many we ended up with. I think we’ll go over here, cruise with emails, and 19. So now we’ve filtered that list down. I mean, it looked impressive with 491, but you can do this with a bunch of different industries. And now we’ve found 19 businesses out of that list.

We filtered out the duplicates, which left 69. And then we filtered down to the ones we can get agent information for, which was like 30-something. And then we came up with about 20, or 19, with email addresses. So that’s pretty good. Now we can do whatever we want with those. Maybe we reach out to them and say, hey, we want to be friends.

We like people, they’re cruises, and we’re best friends now. Maybe that’s what you want to do. That’s fine. So again, this is how you go through and get email addresses. Um, you can get contact information, Datafinder can do contact information as well, and, um, get business information, whatever you want. You don’t have to use this, but these are the cool tools you can use with the API and the form.

And that is it. Voilà.
