Mid-level Fusion for End-to-End Temporal Activity Detection in Untrimmed Video