Every developer and organization that uses open source dependencies should understand what open source licenses are and what risks they take on when they build and distribute software with those dependencies.
In this blog post, we’ll dive into some of the most interesting trends and statistics about open source licenses. To gather them, we scanned over 1,000,000 open source packages, split 50:50 between poorly maintained and well-maintained packages so that the sample reflects an averagely maintained package. These packages were filtered using the open source Scorecard project (by the OpenSSF). The results were the following:
What is the most popular open source license these days?
MIT, big time. This license is very permissive. GitHub’s choosealicense.com describes it as “A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications and larger works may be distributed under different terms and without source code”.
The top 4 most popular open source licenses found were:
- MIT – 66%
- Apache2 – 13%
- GPL 3.0 – 5%
- GPL 2.0 – 4%
The rest of the popularity results can be seen in this chart:
Diving deeper into the statistics and usage of the MIT license, we can see that this license is being used in over 6M repositories, 31M code instances, and 191,000 packages, according to GitHub.
In addition, we checked the language distribution of MIT-licensed projects and found that JavaScript was the most popular language using the MIT license, followed by Python, HTML, and Ruby, in that order:
How many open source packages contained a legal risk?
Checkmarx defines a package with legal risk as one rated high or medium risk. Checkmarx SCA assigns every open source package it scans several risk scores (copyright risk score, patent risk score, and so on), which together determine the package’s level of risk.
Out of all the scanned packages, 23,622 contained a legal risk. That means that ~2.5% of the scanned open source packages carry a legal risk that needs to be addressed.
According to GitHub’s State of the Octoverse report, the average project now has 203 open source dependencies. At a rate of ~2.5%, that means a typical project pulls in about 5 open source packages with legal risk attached to them.
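The estimate above is simple arithmetic, and a short Python sketch makes it easy to re-run with other numbers. One assumption is baked in: the total is taken as exactly 1,000,000 scanned packages, whereas the scan covered "over 1,000,000", so the computed share lands slightly below the rounded ~2.5% quoted above.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Assumption: exactly 1,000,000 packages were scanned (the post says "over 1,000,000").
SCANNED_PACKAGES = 1_000_000
RISKY_PACKAGES = 23_622   # packages rated high or medium legal risk
AVG_DEPENDENCIES = 203    # average open source dependencies per project

# Share of scanned packages that carry a legal risk.
risky_share = RISKY_PACKAGES / SCANNED_PACKAGES

# Expected number of risky packages among a project's dependencies,
# assuming dependencies are drawn uniformly from the scanned population.
expected_risky_deps = AVG_DEPENDENCIES * risky_share

print(f"Share of packages with legal risk: {risky_share:.2%}")
print(f"Expected risky dependencies per project: {expected_risky_deps:.1f}")
```

Under that assumption the expected count still rounds to 5 risky dependencies per project.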
We wanted to check which package ecosystems had the most packages with high legal risk. The results were striking: 94% of the risky packages came from Maven (Java), while only 0-1% came from Pip (Python):
Moreover, if we compare the number of risky Maven dependencies to the total number of Maven dependencies fetched in this research, we find that 19% of the Maven dependencies are considered high risk. That means that roughly 1 out of 5 Maven open source dependencies you use might be legally risky.
So, Maven open source dependencies might carry high legal risk. How do you know?
We wanted to dig a little deeper into Maven’s results and check which source most often determined the license type. Setting aside popular sources such as GitHub and Mvnrepository, which can be scraped to get the license of each open source package (with a certain level of trust, of course), we used the pom.xml and the jar file of each Maven package. These files contain useful data about the package, including the license it uses.
The results were the following:
We can see that in 62% of the Maven instances the license was derived from the pom.xml file, while only 38% came from the jar file.
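To illustrate what reading a license from a pom.xml can look like, here is a minimal Python sketch. It is not the scanner's actual implementation: the sample POM, the `com.example` coordinates, and the `declared_licenses` helper are all hypothetical. It relies only on the standard `<licenses>` element of the Maven POM 4.0.0 format.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Maven POM 4.0.0 files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical sample POM, trimmed to the parts relevant for licensing.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>demo-lib</artifactId>
  <version>1.0.0</version>
  <licenses>
    <license>
      <name>Apache License, Version 2.0</name>
      <url>https://www.apache.org/licenses/LICENSE-2.0.txt</url>
    </license>
  </licenses>
</project>
"""


def declared_licenses(pom_xml: str) -> list[str]:
    """Return the license names declared in a POM document (hypothetical helper)."""
    root = ET.fromstring(pom_xml)
    names = []
    for lic in root.findall("m:licenses/m:license", POM_NS):
        name = lic.findtext("m:name", default="", namespaces=POM_NS).strip()
        if name:  # skip <license> entries that have no usable name
            names.append(name)
    return names


print(declared_licenses(SAMPLE_POM))
```

A POM can also inherit its license from a parent POM and declare no `<licenses>` element of its own. In that case this sketch returns an empty list, which is one reason a second source such as the jar file is useful.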
Comparing the risks of licenses in the copyleft family
In the previous blog post, we discussed copyleft vs. permissive licenses, and why copyleft ones are considered riskier to use.
In short, copyleft is a property of a license that means the package is free to use, but it is forbidden to make it proprietary. A copyleft license is also described as viral, since any work containing a copyleft-licensed package must retain this property as well. We use 3 different options to mark the copyleft value of each license:
- Full – Full copyleft license
- Partial – Copyleft applies to modifications only
- No – Not a copyleft license
The results were the following:
We can see that 95% of packages do not use a copyleft license, 4% use a partial copyleft license, and only 1% use a full copyleft license. These results make sense, since the two most popular licenses in use were MIT and Apache2, which are permissive licenses.
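The three copyleft values (Full, Partial, No) can be modeled as a small lookup, sketched here in Python. The mapping of specific licenses to values is an illustrative assumption based on how these licenses are commonly described. It is not the scanner's actual classification data, and the `copyleft_value` helper is hypothetical.

```python
from enum import Enum
from typing import Optional


class Copyleft(Enum):
    FULL = "Full"        # full copyleft license
    PARTIAL = "Partial"  # copyleft applies to modifications only
    NO = "No"            # not a copyleft license


# Illustrative mapping keyed by SPDX identifier (an assumption, not scanner data).
COPYLEFT_BY_LICENSE = {
    "GPL-3.0-only": Copyleft.FULL,
    "GPL-2.0-only": Copyleft.FULL,
    "MPL-2.0": Copyleft.PARTIAL,
    "MIT": Copyleft.NO,
    "Apache-2.0": Copyleft.NO,
}


def copyleft_value(spdx_id: str) -> Optional[Copyleft]:
    """Return the copyleft value for a license, or None if it is not in the table."""
    return COPYLEFT_BY_LICENSE.get(spdx_id)


print(copyleft_value("MIT").value)           # No
print(copyleft_value("GPL-3.0-only").value)  # Full
```

Returning `None` for unknown licenses, rather than defaulting to "No", keeps an unclassified license from being silently treated as permissive.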
Copyright risk score distribution:
One of the components scored in order to determine the level of legal risk of an open source package is the copyright risk score. It is a number between 1 and 7, where 1 is the most permissive score and 7 is the most restrictive:
1 – Licensee may use code without restriction.
2 – Anyone who distributes the code must retain any attributions included in the original distribution.
3 – Anyone who distributes the code must provide certain notices, attributions, and/or licensing terms in documentation with the software.
4 – Anyone who distributes a modification of the code may be required to make the source code for the modification publicly available at no charge.
5 – Anyone who distributes a modification of the code or a product that is based on or contains part of the code may be required to make publicly available the source code for the product or modification, subject to an exception for software that dynamically links to the original code.
6 – Anyone who distributes a modification of the code or a product that is based on or contains part of the code may be required to make publicly available the source code for the product or modification.
7 – Anyone who develops a product that is based on or contains part of the code, or who modifies the code, may be required to make publicly available the source code for that product or modification if s/he (a) distributes the software or (b) enables others to use the software via hosted or web services.
License risk is derived from the copyright risk score of the license. Levels 1-3 are considered low risk, levels 4-5 medium risk, and levels 6-7 high risk. The result was the following:
As we can see, 93% of the packages were low risk, while 4% were medium risk and 3% were high risk.
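The mapping from copyright risk score to license risk level described above can be sketched in a few lines of Python. The function name and the string labels are assumptions made for illustration. Only the thresholds (1-3 low, 4-5 medium, 6 and up high) come from the text.

```python
def risk_level(copyright_risk_score: int) -> str:
    """Map a copyright risk score (1-7) to a license risk level (hypothetical helper)."""
    if not 1 <= copyright_risk_score <= 7:
        raise ValueError("copyright risk score must be between 1 and 7")
    if copyright_risk_score <= 3:
        return "low"
    if copyright_risk_score <= 5:
        return "medium"
    return "high"


# Print the risk level for every valid score.
for score in range(1, 8):
    print(score, risk_level(score))
```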
Final thoughts:
MIT and Apache2 together account for almost 80% of the license types found among 1,000,000 open source packages. This suggests that these two permissive licenses are here to stay, and it may point to a pattern in which both organizations and developers prefer open source packages with a more permissive type of license.
From the statistics we gathered, we learned that Maven packages can be very risky to use and tend to carry the riskiest licenses in their dependencies. Be extra careful when using them.
One thing is sure: the open source community is trying to make open source components easy to adopt and comply with. Organizations and developers using such components, in turn, have to be aware of every component they use, verify that they meet its license requirements, and create a compliant environment for their software.