Deep learning algorithms have achieved revolutionary successes in many applications, including computer vision, natural language processing, and biomedical informatics, by leveraging huge amounts of labeled training data to produce powerful distributed representations. Rather than training standard end-to-end supervised models, learning adaptive models that understand different input contexts is one step closer to building self-aware machine reasoning systems. In this talk, I will discuss learning adaptive deep representations of texts, videos, and their interactions. First, I will describe how context-dependent representation learning makes a question answering system self-aware, enabling it to handle ambiguous questions. Then I will show how to generate adaptive spatiotemporal representations for translating videos into natural language descriptions. Finally, I will present our recent work on generating videos from text by capturing semantically meaningful, context-dependent representations.