What are Hedge Funds Buying? Extracting 13F data with Python.


Part I — How to retrieve the stock holdings of Hedge Funds automatically using Python and store these in an SQL database.


Introduction

Hedge funds and other investment managers with over $100m in Assets under Management (AUM) are required to file a quarterly document in the US disclosing the securities that they hold. These filings, known as 13Fs, are keenly watched by markets to understand what top hedge funds are thinking and how they are allocating their money.

Unfortunately for most investors, simply mimicking the investments you see in a 13F is unlikely to yield much more than the market average, and could do significantly worse. There are several reasons for this. Firstly, the form isn't filed until 45 days after the end of the quarter, meaning plenty of positions could have changed in the intervening time. Any anticipated price movements could also have already occurred by this point.

Secondly, even when you have the holdings data it is sometimes hard to know what an asset manager's net position is. If a fund reports a long position in the shares, a put option, and a call option, it could be net long, net short, or more likely have a limited pay-off in a given circumstance (e.g. a collar, providing returns if the price moves between 2% and 10%). There could also be unlisted positions, such as short positions in a convertible bond, that would require longs in the underlying security whilst the manager maintains a net short exposure, or an investor could be partially hedged via swaps or CDS.

Thirdly, the 13F does not require short positions to be listed, so again it can give an incomplete or potentially misleading view of the asset manager's net exposure to a given security.

Furthermore, the analysis is point-in-time, and without knowing the holding periods for various assets it is difficult to tell whether a fund is trading in and out of a stock on a day-by-day basis, or whether these are long-term conviction purchases.

Finally, the infrequency of the reporting means that piggybacking on these trades is difficult. You might not learn that a manager has exited a position until up to a quarter of a year after it actually occurred.

That being said, 13Fs are very useful for understanding thematically what the market is looking at. You can also see where Hedge Funds are clustering their trades (Big Tech) and get a sense of flows into and out of sectors. Depending on the investor you can also sometimes infer some characteristics of the company. For example, Elliott Management are activist investors who may be proposing changes to the corporate structure. Berkshire Hathaway is likely to hold their investment for many years. AQR are likely to have identified some fundamental value factors at play in their long positions.

Getting the holdings with Python

We’re going to be using Python to get the 13F, primarily because Python is a useful tool for the analysis we will be doing in Part II.

Fortunately, there are already many tools for Python that make sourcing this data very simple, so let's not reinvent the wheel and instead start by loading these packages.

from sec_edgar_downloader import Downloader

This package is called SEC EDGAR Downloader. EDGAR is the SEC's regulatory filing system to which the 13F is submitted.
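If you don't already have them, all of the packages used in this article are available from PyPI. (Library APIs evolve over time, so if a call below doesn't match your installed version, check the package's documentation.)

pip install sec-edgar-downloader beautifulsoup4 lxml pandas sqlalchemy pymysql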

We’ll use this package to form the core of our get_files() function:

import os
import pandas as pd

def get_files(filing, CIK):
    # Download the filings to the local drive, then parse each one
    save_path = 'C:\\13F\\'
    dl = Downloader(save_path)
    dl.get(filing, CIK)
    CIK = CIK.lstrip("0")  # files are stored under the CIK without leading zeros
    files = os.listdir("C:\\13F\\sec_edgar_filings\\" + CIK + "\\" + filing)
    data = [parse(file, CIK) for file in sorted(files)]
    try:
        return pd.concat(data)
    except ValueError:
        print("All Values are None")
        return None

Here we save the 13F files to our local drive in the form of .txt documents. We pass two parameters: firstly the name of the filing, "13F-HR" (the HR stands for Holdings Report), and secondly a CIK, which is a Central Index Key identifying the asset manager.

Then we use the SEC EDGAR Downloader to fetch the relevant 13F text files. Once this is done, we can call our next function, parse(), using a list comprehension. This line:

data = [parse(file, CIK) for file in sorted(files)]

sorts all files by name (which also sorts chronologically based on the naming convention) and passes the file to our next function.
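Putting this together, a single call for one fund (using the Renaissance Technologies CIK that appears later in this article) looks like:

df = get_files('13F-HR', '1037389')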

Parsing the Data

To parse the data we are going to make use of a great tool called BeautifulSoup.

from bs4 import BeautifulSoup

This will allow us to parse the XML format of the 13F filings. The XML schema is customised for the SEC document, but Beautiful Soup (here using the lxml parser) handles the files without complaint.

import datetime

def parse(file, CIK):
    xml = open('C:\\13F\\sec_edgar_filings\\' + CIK + '\\13F-HR\\' + file).read()
    soup = BeautifulSoup(xml, 'lxml')
    try:
        date = datetime.datetime.strptime(soup.find('signaturedate').contents[0], '%m-%d-%Y')
        name = soup.find('name').contents[0]
    except:
        return
    cols = ['nameOfIssuer', 'cusip', 'value', 'sshPrnamt', 'sshPrnamtType', 'putCall']
    data = []
    print("Processing " + name + " (" + CIK + ") for date " + str(date))
    # Each holding sits in an infoTable tag (sometimes namespaced as ns1:infotable)
    for infotable in soup.find_all(['ns1:infotable', 'infotable']):
        row = []
        for col in cols:
            d = infotable.find([col.lower(), 'ns1:' + col.lower()])
            row.append(d.text.strip() if d else 'NaN')
        row.append(date)
        row.append(CIK)
        row.append(name)
        data.append(row)
    df = pd.DataFrame(data)
    cols.append('date')
    cols.append('fund_cik')
    cols.append('fund')
    try:
        df.columns = cols
        return df
    except:
        return

Here we are doing several things to parse our data, extract the content that we want, and get it all into a neat DataFrame. Firstly, we open the file and parse it with Beautiful Soup. This gives us a variable, soup, which we can use to traverse the document.

The first thing we extract is the date of the document, via a field called signaturedate, which we convert into a datetime object. We also extract the name of the asset manager through the appropriately named name field.

If we can't extract the name and date we simply terminate the function. The signaturedate field was introduced in 2013, so if you are looking for holdings data before this, remove the (overly wide) try/except block that I have put in and change your date parsing. There is a text field at the top of the document that you can use, outside of Beautiful Soup.
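A minimal sketch of that approach, assuming the filing's text header contains the standard CONFORMED PERIOD OF REPORT line (verify this against a sample of your older files before relying on it), might look like:

import re
import datetime

def get_period_from_header(raw_text):
    # EDGAR full-submission headers typically contain a line like:
    #   CONFORMED PERIOD OF REPORT: 20121231
    # (assumed layout; check a real filing first)
    match = re.search(r'CONFORMED PERIOD OF REPORT:\s*(\d{8})', raw_text)
    if match:
        return datetime.datetime.strptime(match.group(1), '%Y%m%d')
    return None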

The actual data we are looking for is held in a series of infoTable tags, so we use Beautiful Soup to find all of those and loop through them. On each pass of the loop we capture the variables of interest; here I am taking:

cols = ['nameOfIssuer', 'cusip', 'value', 'sshPrnamt', 'sshPrnamtType', 'putCall']
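For reference, an abbreviated infoTable entry looks roughly like the following. The values are illustrative, putCall only appears for option positions, and some filings prefix every tag with an ns1: namespace, which is why the code searches for both forms:

<infoTable>
  <nameOfIssuer>APPLE INC</nameOfIssuer>
  <titleOfClass>COM</titleOfClass>
  <cusip>037833100</cusip>
  <value>12345</value>
  <shrsOrPrnAmt>
    <sshPrnamt>100000</sshPrnamt>
    <sshPrnamtType>SH</sshPrnamtType>
  </shrsOrPrnAmt>
  <investmentDiscretion>SOLE</investmentDiscretion>
</infoTable>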

As we loop through and add each row to our result, I am also adding the date, fund name, and the CIK to each row so that we can use these in our lookup table later.

Finally, a DataFrame is returned which contains all the rows and columns we have selected. This is returned to our previous function which concatenates all the results into a single DataFrame and returns this.

Storing the Data

Now that we have extracted the data that we want it would be a good idea to store it somewhere. I’ve decided to store it in an SQL database but a NoSQL solution might be preferable for your purposes.

from sqlalchemy import create_engine

form = '13F-HR'
ciks = ['1037389']
i = 1

# Get files from Edgar
for cik in ciks:
    # Provide feedback
    print("----- " + str(i) + " of " + str(len(ciks)) + " -----")
    i += 1

    # Download files
    df = get_files(form, cik)

    if df is not None:
        # Perform cleaning
        df['value'] = df['value'].astype(float)
        df['sshPrnamt'] = df['sshPrnamt'].astype(int)
        df['date'] = df['date'].apply(get_qend)

        # Upload to database
        engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                               .format(user="USERNAME",
                                       pw="PASSWORD",
                                       db="DATABASE"))
        try:
            df.to_sql('holdings', con=engine, if_exists='append',
                      chunksize=1000, index=False)
            print("Uploaded holdings to sql")
        except:
            pass

Here is the main section of the script, which calls our get_files() function, cleans the data a little, and then uploads it to our SQL database.

We declare our information of interest in a variable called form, which in this case is 13F-HR, and a list of CIKs for the funds we wish to investigate. Here I have used 1037389 for Renaissance Technologies, as they seem like pretty smart guys.

We loop through the list, printing a little feedback to the console on how far we have got, in case you decide to add several CIKs and want to check on progress. You can find CIKs here: https://www.sec.gov/edgar/searchedgar/cik.htm

We convert the two numeric data fields into a float and an int respectively and apply a custom function called get_qend to our date. This function takes the date and returns the end date of the prior quarter, which is the reference date for the 13F.

There are a few date functions used for various purposes, which I've included below for your reference.

import calendar
import datetime
from datetime import timedelta


# Get prior quarter
def get_prior_quarter(q):
    prior = q - 1
    if prior == 0:
        prior = 4
    return prior


# Get current quarter (1-4) for a given date
def get_quarter(date):
    return (date.month - 1) // 3 + 1


# Get prior weekday for a given date
def prev_weekday(date):
    date -= timedelta(days=1)
    while date.weekday() > 4:
        date -= timedelta(days=1)
    return date


# Get prior quarter end for a given date
def get_qend(date):
    quarter = get_prior_quarter(get_quarter(date))
    month = quarter * 3
    year = date.year
    # A Q1 filing refers to the Q4 period of the previous calendar year
    if quarter == 4:
        year -= 1
    day = calendar.monthrange(year, month)[1]
    quarter_end = datetime.date(year, month, day)
    if quarter_end.weekday() > 4:
        return prev_weekday(quarter_end)
    return quarter_end
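As a quick sanity check, a filing signed in May 2020 should map back to the March quarter end, and one signed in February 2020 should wrap back to December 2019:

print(get_qend(datetime.date(2020, 5, 15)))  # 2020-03-31
print(get_qend(datetime.date(2020, 2, 14)))  # 2019-12-31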

Finally, we use SQLAlchemy to connect to our database using create_engine. You will need to pass your own username, password, and database name to this connection string.

Then we convert our DataFrame to SQL using SQLAlchemy's to_sql and upload the data to our database. In this case, I am using index=False to prevent pandas from writing its numeric index as a column, as I am using a composite primary key across several fields to uniquely identify the data.
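For completeness, here is a minimal sketch of what creating that table might look like. The column sizes and the choice of key fields are my own assumptions rather than a prescribed schema, so adjust them to your data:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://USERNAME:PASSWORD@localhost/DATABASE")

# Hypothetical schema: one row per security, per fund, per quarter
ddl = """
CREATE TABLE IF NOT EXISTS holdings (
    nameOfIssuer  VARCHAR(200),
    cusip         VARCHAR(9),
    value         DOUBLE,
    sshPrnamt     BIGINT,
    sshPrnamtType VARCHAR(10),
    putCall       VARCHAR(10),
    date          DATE,
    fund_cik      VARCHAR(10),
    fund          VARCHAR(200),
    PRIMARY KEY (cusip, putCall, sshPrnamtType, date, fund_cik)
)
"""
with engine.connect() as conn:
    conn.execute(text(ddl))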

Once you have run this your 13F data should be safely stored in your database, ready for analysis. I’ll show you some of the analysis you can do in Part II.

Conclusion

Whilst the 13F isn’t going to give you the portfolio of your dreams, it can provide some interesting insights into thematic allocations of leading hedge funds.

We’ve shown how to automatically retrieve the data for any number of Hedge Funds, clean the data and extract the fields that you are interested in, and then store this in a relational database.

In Part II I will show some of the analyses that you can do on these holdings and apply some data science techniques to see if we can find any interesting characteristics of these Hedge Funds.

You can follow me on Twitter @mgrint or read more of my articles at https://grint.tech. Email me at matt@grint.tech.
