Welcome to Source code ECOsystem Linked Data (SeCold)

Basic facts

  • SeCold is the first online Linked Data repository of source code facts! 
  • The current version is SeCold 2.0 (published around May 2012)
  • The first version was SeCold V. 001 (published on January 20, 2011)
    • Minor update in September 2011
  • This is a research project of the Ambient Software Evolution Group (Concordia University)

What types of facts?

Any type of implicit and explicit fact you can find in software repositories such as:

  1. Source code files
  2. Tokens
  3. AST Nodes
  4. Code Clones
  5. Bugs
  6. Commits
  7. Authors
  8. Licenses
  9. ...
We extract information from source code, bug/issue and versioning systems. All pieces of information are inter-connected explicitly.
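
To illustrate what "inter-connected explicitly" can mean in practice, here is a minimal sketch in which facts are stored as subject-predicate-object triples and a question is answered by following links. The identifiers and predicate names (`modifies`, `authoredBy`, and so on) are hypothetical and are not taken from the SeCold ontology.

```python
from collections import defaultdict

# Hypothetical identifiers and predicate names, for illustration only.
TRIPLES = [
    ("file:Foo.java", "containsToken", "token:Foo.java#12"),
    ("commit:r42", "modifies", "file:Foo.java"),
    ("commit:r42", "authoredBy", "author:alice"),
    ("bug:101", "fixedBy", "commit:r42"),
    ("file:Foo.java", "licensedUnder", "license:Apache-2.0"),
]


def build_index(triples):
    """Index triples by subject and by object so links can be followed both ways."""
    outgoing = defaultdict(list)
    incoming = defaultdict(list)
    for subject, predicate, obj in triples:
        outgoing[subject].append((predicate, obj))
        incoming[obj].append((predicate, subject))
    return outgoing, incoming


def authors_of(file_id, triples):
    """Follow the path: file <- modifies - commit - authoredBy -> author."""
    outgoing, incoming = build_index(triples)
    authors = set()
    for predicate, commit in incoming[file_id]:
        if predicate != "modifies":
            continue
        for commit_predicate, target in outgoing[commit]:
            if commit_predicate == "authoredBy":
                authors.add(target)
    return sorted(authors)


if __name__ == "__main__":
    print(authors_of("file:Foo.java", TRIPLES))
```

Because every fact is a link between two identifiers, a question that spans the source code and the versioning system becomes a short graph traversal rather than a join across separate tools.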

Basic introduction:

The objective of the Ambient Software Engineering Group is to develop novel techniques and approaches that help define ambient software engineering as an extension to collaborative software engineering, addressing the inter-communication among software engineers and their tools as well as seamless knowledge integration in a global context. The group focuses in particular on the use of Semantic Web technologies as one of the enabling technologies to achieve this goal.

The SECOLD project is a Linked Data approach designed to support interoperability and sharing of open datasets by allowing on-the-fly inter-linking of data using the basic layers of the Semantic Web and the HTTP protocol. In our research, we focus on providing a Uniform Resource Locator (URL) generation schema and a supporting ontological representation for the inter-linking of data extracted from source code ecosystems. As a result, we created the Source code ECOsystem Linked Data (SECOLD) framework, which adheres to the Linked Data publication standard. The framework not only provides source code and facts that both humans and machines can browse or query, but will also assist the research community at large in sharing and using a standardized source code representation.


What is Linked Data about?

From DBpedia.org: "Linked Data is a method to publish data on the Web and to interlink data between different data sources. Linked Data can be accessed using Semantic Web browsers, just as traditional Web documents are accessed using HTML browsers. However, instead of following document links between HTML pages, Semantic Web browsers enable surfers to navigate between different data sources by following RDF links. RDF links can also be followed by robots or Semantic Web search engines in order to crawl the Semantic Web. See Linked Data – The Story so far and How to publish Linked Data on the Web for more information about Linked Data"
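
As a rough sketch of how a program, rather than a browser, might access Linked Data, the snippet below prepares an HTTP request that asks for an RDF representation of a resource through the `Accept` header. The media type and the example URI are assumptions chosen for illustration; a given server may offer other RDF serializations.

```python
from urllib.request import Request, urlopen


def rdf_request(uri, media_type="application/rdf+xml"):
    """Prepare an HTTP GET that asks the server for RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri, timeout=10):
    """Dereference the URI and return the raw RDF document as text."""
    with urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the request; call fetch_rdf() to actually go over the network.
    request = rdf_request("http://dbpedia.org/resource/Linked_data")
    print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Following an RDF link then amounts to repeating the same request for any URI found in the returned document.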

To see how other domains are using Linked Data try the following site:


SECOLD connection to LOD Cloud

 SeCold CKAN record: http://ckan.net/dataset/secold


How does SECOLD work?

  • Source Code: It crawls open source code available on the Internet. Then it applies source code analysis techniques at different levels (e.g. syntax) to the source code and extracts facts. It assigns a dereferenceable unique ID (i.e. a URI) to each extracted resource. All facts are saved into a triple store. The result is publicly available as a Linked Data endpoint.
  • Version Control: It connects to public version control repositories. Then it extracts version control information such as developers' contributions and commits. The result is connected to the Source Code facts.

Vision?

Provide a URL for each source code resource at any level (e.g. token, variable, and method call statement).
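
The idea of one identifier per resource, combined with the extract-then-store pipeline described under "How does SECOLD work?", can be sketched as follows. This is a toy illustration under stated assumptions: the base URI, the path layout, and the two predicates are hypothetical and do not reproduce SECOLD's actual ID schema or ontology, and the "analysis" is a simple whitespace split rather than a real parser.

```python
from urllib.parse import quote

BASE = "http://example.org/secold/"   # hypothetical base, not SECOLD's real ID schema
PART_OF = BASE + "vocab/partOf"       # hypothetical predicate
TEXT = BASE + "vocab/text"            # hypothetical predicate


def mint_uri(*segments):
    """Build a hierarchical URI, percent-encoding each path segment."""
    return BASE + "/".join(quote(str(s), safe="") for s in segments)


def extract_facts(project, file_name, source):
    """Toy analysis: one resource per whitespace-separated token."""
    file_uri = mint_uri(project, file_name)
    facts = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for position, token in enumerate(line.split(), start=1):
            token_uri = mint_uri(project, file_name, f"L{line_no}", f"T{position}")
            facts.append((token_uri, PART_OF, file_uri, False))  # object is a URI
            facts.append((token_uri, TEXT, token, True))         # object is a literal
    return facts


def to_ntriples(facts):
    """Serialize facts as N-Triples lines, ready to load into a triple store."""
    lines = []
    for subject, predicate, obj, is_literal in facts:
        if is_literal:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        else:
            rendered = f"<{obj}>"
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_ntriples(extract_facts("demo", "Foo.java", "int x;")))
```

Encoding the project, file, line, and token position in the path keeps each identifier unique and lets a reader guess the parent resource by trimming the URI.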

What is it for?

It is a multi-purpose project. It is mostly useful for (1) the software research community, (2) software documentation and traceability, and (3) the future of software development.

How much data?

The first release (SECOLD V. 001) has 1.5 billion triples. The data was extracted from ~18,000 open source projects [UCI Dataset]. The overall code size is 1,500,000 files, including ~400,000,000 lines of code.

How to use it?

  • URL generation schema: go to the ID Schema page from the left menu.
  • Vocabulary set (the ontology): go to the Ontology page from the left menu.
  • Data conversion: go to the Public Services page from the left menu.
  • The repository: go to (1) Online Access, (2) Query Endpoint, or (3) Download, from the left menu.
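
For the Query Endpoint option, a query is typically sent as the `query` parameter of an HTTP GET, as defined by the SPARQL Protocol. The sketch below only builds such a request URL. The endpoint address is a placeholder (the real one is listed on the Query Endpoint page), and the query is a generic one that does not rely on any SECOLD vocabulary.

```python
from urllib.parse import urlencode

# Placeholder address; substitute the one given on the Query Endpoint page.
ENDPOINT = "http://example.org/sparql"

# Generic query: fetch any ten triples from the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
""".strip()


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query as a GET URL using the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(sparql_get_url(ENDPOINT, QUERY))
```

The resulting URL can be opened with any HTTP client; the response format depends on the `Accept` header the client sends and on what the endpoint supports.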


SECOLD is programming-language independent. It can publish facts from any source code or version control system. Nevertheless, the first release contains only Java code and SVN data. We are planning to cover C#, C++, CVS and Git in the second release.



Our thanks to FRANZ.COM for allowing us to use an unlimited version of the AllegroGraph triple store for the SECOLD project.


Follow us on http://twitter.com/secold