Microsoft Graph API
Microsoft Graph API
Download files from sharepoint site
1. Enable Graph API
On Azure Portal -> Application Registration
Create a new registration, this gives client_id, tenant_id, etc
Under the Certificates&Secrets of the application, create a new client secret
save the secret, this together with the client id, tenant id will be used for connecting to the api
Under the API permission, add permissions
There are a bunch of permissions available within the add permission section.
Choose Microsoft Graph.
There two types of permission to choose, delegated and application permissions.
The application permission once added, needs to be 'Grant admin consent'. It's global and without user context.
The application will have the permission to access the corresponding resources.
The delegated permission needs a user context. Once added to the application registration, it still requires the end user to log in to be able to access the corresponding resources under the user's own context.
For example, for listing all teams channels, the application permission allows listing all channels within the organization.
The delegated permission allows only listing the channels that the user has access to.
Choose the permissions to be granted.
e.g. Sites.read.all, User.read.all, etc.
The graph api can connect to a bunch of microsoft products, sharepoint, onedrive, outlook, ad groups, etc.
each of them has it's only permission, so grant the permissions required for the application registration.
2. Authenticate with Microsoft Authority
This connects to the Microsoft authority https://login.microsoftonline.com and requests for a Bearer token.
With the bearer token, user can further access the graph API
For application permission, it only needs to authenticate with the client id, client secret generated in the App Registration.
For delegated permission, it needs to authenticate with the username, password / interactive login / multi-factor login
2.1. For application permission
import requests
# Microsoft Azure AD app (client) credentials
client_id = '<your_client_id>'
client_secret = '<your_client_secret>'
tenant_id = '<your_tenant_id>'
# Scopes required by the application (application permissions)
scopes = ['https://graph.microsoft.com/.default']
# Get an access token using client credentials flow
token_url = f'https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token'
payload = {
'client_id': client_id,
'client_secret': client_secret,
'scope': ' '.join(scopes),
'grant_type': 'client_credentials'
}
response = requests.post(token_url, data=payload)
# Obtain the access token
access_token = response.json()['access_token']
2.2. Delegated permission
There are three ways to log in, ie. username password, interactive (pops up a login page), multi-factor login (pops up a login page and with authenticator code).
For interactive and multi-factor login, the MSAL library needs to be installed. pip install msal.
Client_id and tenant_id are required for connecting to the right app registration, but client_secret is not needed, instead the user would need to log in.
import requests
import msal
# Microsoft Azure AD app (client) credentials
client_id = 'xxx'
tenant_id = 'xxx'
# Authority for the tenant
authority = f'https://login.microsoftonline.com/{tenant_id}'
# Scopes required by the application (delegated permissions)
# Ensure that your Azure AD application is configured with the appropriate delegated permissions
# (such as User.Read and Sites.Read.All) in the Azure portal
scopes = ['User.Read', 'Sites.Read.All']
############ interactive log in ############
# Initialize the MSAL public client application
app = msal.PublicClientApplication(
client_id,
authority=authority
)
# Initialize the login flow
result = app.acquire_token_interactive(scopes)
############ username password log in ############
# Initialize the MSAL confidential client application
app = msal.ConfidentialClientApplication(
client_id,
authority=authority
)
# Acquire token using username/password
result = app.acquire_token_by_username_password(username='<user_email>', password='<user_password>', scopes=scopes)
############ multi-factor log in ############
# Initialize the MSAL public client application
app = msal.PublicClientApplication(
client_id,
authority=authority
)
# Initialize the login flow with device code
flow = app.initiate_device_flow(scopes=scopes)
if 'user_code' not in flow:
print('Failed to create device flow:', flow.get("error_description"))
exit()
print('Please authenticate using the code:', flow['user_code'])
print('Visit', flow['verification_uri'], 'and enter the code to authenticate.')
result = app.acquire_token_by_device_flow(flow)
# Obtain the access token
access_token = result['access_token']
3. Call Graph API
Here uses an example to list all the sharepoint sites.
Simply use the graph api endpoint (can be found on Microsoft Graph API reference)
headers = {
'Authorization': 'Bearer ' + access_token, #the token from the previous step
'Content-Type': 'application/json'
}
# Microsoft Graph API endpoint
graph_endpoint = 'https://graph.microsoft.com/v1.0/sites'
# Make a GET request to list all sites
response = requests.get(graph_endpoint, headers=headers)
print(response.text)
4. Download file from sharepoint list
There are multiple steps to achieve this.
Make sure enough permissions have been granted. Some sharepoint sites are actually a teams channel so teams and channels permissions are also needed.
Typical permissions: Sites.Read.All, TeamSettings.Read.All, Directory.Read.All, Group.Read.All, Channel.ReadBasic.All, Files.read.ALL
Firstly list all the sites, so as to find the id of the sharepoint site you are after.
url = "https://graph.microsoft.com/v1.0/sites/"
r = requests.get(
url = url,
headers={'Authorization': 'Bearer %s' % token},
)
for i in r.json()['value']:
print(i['displayName'] + ' --- ' + i['id'])
The ID field looks like "myorg.sharepoint.com,4444555566-xxxx-yyyy-zzzz-8888ac9e7777,123456789-xxxx-yyyy-zzzz-111199c44444"
The second element in the delimited ID string is the site id.
Secondly list all the drives associated with the sharepoint site.
site_id = '4444555566-xxxx-yyyy-zzzz-8888ac9e7777'
url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives"
r = requests.get(
url = url,
headers={'Authorization': 'Bearer %s' % token}, #the bearer token from previous step
)
for i in r.json()['value']:
print(i['name'] + ' --- ' + i['id'])
Once have the drive id, query the children items within the drive.
Note the root item is used in the api url.
site_id = '4444555566-xxxx-yyyy-zzzz-8888ac9e7777'
drive_id = 'b!cXXXXXXXXXXXXXXXXX'
url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives/{drive_id}/root/children"
r = requests.get(
url = url,
headers={'Authorization': 'Bearer %s' % token},
)
for i in r.json()['value']:
print(i['name'] + ' --- ' + i['id'])
If the item is a folder, then check the children files of the folder with folder(item) id.
Note the URL path goes from drive to items and to children.
site_id = '4444555566-xxxx-yyyy-zzzz-8888ac9e7777'
drive_id = 'b!cXXXXXXXXXXXXXXXXX'
item_id = '01xxxxxxxxxxxxxxxx'
url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives/{drive_id}/items/{item_id}/children"
r = requests.get(
url = url,
headers={'Authorization': 'Bearer %s' % token},
)
Again from the response, get the children files and ids.
Pick the id of the file to be downloaded.
Finally with the file(item) id, download the content as binary string.
Note the API URL can go from the drive directly to the child file item, without going through the folder item first.
The 'content' in the URL refers to the binary content of the file.
site_id = '4444555566-xxxx-yyyy-zzzz-8888ac9e7777'
drive_id = 'b!cXXXXXXXXXXXXXXXXX'
child_item_id = '01xxxxxxxxxxxxxxxx' #the file item id
url = f"https://graph.microsoft.com/v1.0/sites/{site_id}/drives/{drive_id}/items/{child_item_id}/content"
r = requests.get(
url = url,
headers={'Authorization': 'Bearer %s' % token},
)
The response carries the binary string as r.content.
Now you can write the binary string into a file, or read it into pandas
with open('c:/temp/test_download_sharepoint_file.xlsx', 'wb') as f:
f.write(r.content)
import pandas as pd
dfs = pd.read_excel(r.content, sheet_name = None) #read all sheets