
I Cold-Emailed Sony's Research Director. They Called Back. Here's What Happened.

From Principal AI Consultant to internship interview. A technical deep-dive into multimodal AI research, adaptive fusion, and why clarity matters more than titles.

Pavan Kumar
AI Consultant & Platform Engineer

It started with a cold email.

I wasn’t job hunting. I was already working as a Principal AI Consultant, building crypto trading platforms with AI agents, leading product strategy at VibeTensor, and consulting for multiple startups. The last thing I needed was an internship.

But Sony Research India’s philosophy resonated with me. Kando, they call it. The deep emotional connection between technology and the user. That blend of innovation and human experience. So I emailed the Director of Sony Research India directly.

Two days later, my phone rang.

The Unexpected Call

“We’re hiring for a multimodal research intern position,” the HR voice said. “The team wants to talk to you.”

I hesitated. An internship? I was past that stage in my career. But curiosity won. What would a technical discussion with Sony’s multimodal AI team look like?

I said yes.

The Panel: Three Perspectives on Multimodal AI

The interview wasn’t what I expected.

Three researchers joined the call:

  • Interviewer 1 – Researcher focusing on multimodal approaches for the entertainment domain
  • Interviewer 2 – Research scientist specializing in computer vision and content analysis
  • Interviewer 3 – NLP and neural machine translation expert

No softballs. No “tell me about yourself” warmup. They wanted to know what I knew about multimodal fusion, and they wanted specifics.

The Technical Gauntlet Begins

Interviewer 1 started: “When we fuse modalities at any stage in a model, what is our goal? What’s the intuition behind fusion?”

I responded: “The goal is to combine complementary information from different modalities to form a richer, more discriminative representation than any single modality can provide.”

“So we’re assuming there’s always complementary information?” he pushed back.

This is where it gets interesting.

“Not necessarily,” I said. “Complementary is ideal, but in practice, modalities can also be redundant or even conflicting.”

“Then what happens?” Interviewer 1 asked. “What other tasks happen in the pipeline besides fusion?”

The Pipeline: Beyond Simple Concatenation

I walked them through my understanding:

Step 1: Modality-specific preprocessing

Each modality needs tailored feature extraction. CNNs for visual frames, Wav2Vec for audio, transformers for text. Each encoder operates independently to produce comparable embeddings.
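
To make Step 1 concrete, here's a minimal PyTorch sketch with simplified stand-ins for the real backbones (a small CNN in place of a full visual model, a GRU in place of Wav2Vec, an embedding layer in place of a text transformer). The point is only that each encoder works independently and projects its modality into embeddings of the same size:

```python
# Modality-specific encoders: each maps raw input to a fixed-size embedding
# so the modalities become comparable downstream. Simplified stand-ins only.
import torch
import torch.nn as nn

EMBED_DIM = 256  # common latent dimensionality (chosen for illustration)

class VisualEncoder(nn.Module):          # stand-in for a CNN over video frames
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(32, EMBED_DIM)

    def forward(self, frames):                      # (B, 3, H, W)
        return self.proj(self.backbone(frames))     # (B, EMBED_DIM)

class AudioEncoder(nn.Module):           # stand-in for Wav2Vec-style features
    def __init__(self):
        super().__init__()
        self.backbone = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
        self.proj = nn.Linear(128, EMBED_DIM)

    def forward(self, audio_feats):                 # (B, T, 64), e.g. mel features
        _, hidden = self.backbone(audio_feats)
        return self.proj(hidden[-1])                # (B, EMBED_DIM)

class TextEncoder(nn.Module):            # stand-in for a transformer text encoder
    def __init__(self, vocab_size=30000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.proj = nn.Linear(128, EMBED_DIM)

    def forward(self, token_ids):                   # (B, L)
        return self.proj(self.embed(token_ids).mean(dim=1))  # (B, EMBED_DIM)

# Each modality ends up as a (batch, 256) tensor, ready for alignment and fusion.
v = VisualEncoder()(torch.randn(2, 3, 64, 64))
a = AudioEncoder()(torch.randn(2, 100, 64))
t = TextEncoder()(torch.randint(0, 30000, (2, 20)))
```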

Step 2: Alignment

This is critical. Alignment isn’t just about dimensions. It includes:

  • Dimensional alignment – Mapping embeddings to a common latent space
  • Temporal alignment – Synchronizing modalities that evolve over time
  • Semantic alignment – Ensuring meaningful correspondence between modalities

Interviewer 1 pressed: “Give me an example of alignment. How can it be done?”

“Cross-modal attention or temporal synchronization,” I replied. “In audio-visual emotion recognition, speech signals and facial expressions evolve over time but not always in sync. Temporal alignment ensures features from both modalities correspond to the same moment.”

“But if you’re doing cross-attention, you’re almost bringing them together, right? Is that alignment or fusion?” he challenged.

I paused. “Alignment happens implicitly in cross-attention through learned associations. But yes, there’s overlap between alignment and fusion in modern architectures.”
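
To show where that overlap lives in code, here's a small illustrative sketch (my own, in PyTorch): audio features query visual features, the attention weights do the aligning, and the weighted sum is already doing part of the fusing:

```python
# Cross-modal attention: audio queries attend over visual keys/values.
# The attention weights act as a soft, learned alignment between the two
# streams; the attended output mixes visual context into the audio stream,
# which is why alignment and fusion blur together in these architectures.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_seq, visual_seq):
        # audio_seq: (B, T_audio, dim), visual_seq: (B, T_visual, dim)
        attended, weights = self.attn(query=audio_seq, key=visual_seq, value=visual_seq)
        # weights: (B, T_audio, T_visual) -- which visual timesteps each audio
        # timestep aligned to; attended: audio enriched with visual context.
        return self.norm(audio_seq + attended), weights

layer = CrossModalAttention()
fused, align_weights = layer(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
```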

My Research Background: Publications and Patents

Interviewer 2 took over, asking about my research portfolio.

“Can you tell us about your publications and patents? Especially the patent on emotion recognition and analysis system for mental health assessment.”

I walked them through my work:

“I have an Indian Patent published in The Patent Office Journal for a multimodal emotion recognition system that integrates facial, voice, physiological, and behavioral data using attention-based fusion. It was validated through a hospital study.”

I continued: “The system addresses a gap in existing emotion recognition patents. Most don’t consider all these modalities together. This patent focuses on classifying emotional states with real-world clinical validation.”

“What about your other publications?” he asked.

“My recent work includes federated learning for privacy-preserving healthcare collaboration, accepted at IC3SE 2025, where I’m the first author. I also published work on artistic style transfer using GANs with a Pix2Pix implementation at IC3SE 2024, and earlier, during college, I worked on facial feature extraction and emotional analysis using ML techniques, published in IJARESM. That early study eventually evolved into supporting research for my patent.”

From GANs to Diffusion Models

“You worked on artistic style transfer using GANs. Have you explored diffusion models?”

“Yes, experimentally,” I said. “After completing the GAN-based work, I implemented a small-scale DDPM to compare generative stability.”

“What’s DDIM and how is it different from DDPM?” he asked.

“DDIM is the Denoising Diffusion Implicit Model. It reformulates DDPM’s reverse process as non-Markovian, which allows deterministic sampling in far fewer steps, compared to DDPM’s slower, stochastic step-by-step denoising.”
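
To make that difference concrete, here's a compact sketch of one reverse step in each, using the standard update rules from the DDPM and DDIM papers. Here `eps_model`, `betas`, and `alpha_bars` are placeholders for a trained noise-prediction network and its noise schedule:

```python
# One reverse-diffusion step: DDPM (stochastic) vs DDIM with eta = 0 (deterministic).
# alpha_bars[t] is the cumulative product of (1 - beta) up to step t.
import torch

def ddpm_step(x_t, t, eps_model, betas, alpha_bars):
    beta_t = betas[t]
    eps = eps_model(x_t, t)
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(1.0 - beta_t)
    if t > 0:
        return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)  # fresh noise every step
    return mean

def ddim_step(x_t, t, t_prev, eps_model, alpha_bars):
    eps = eps_model(x_t, t)
    # predict x_0 directly, then jump to t_prev -- t_prev can skip many steps
    x0_pred = (x_t - torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alpha_bars[t])
    return torch.sqrt(alpha_bars[t_prev]) * x0_pred + torch.sqrt(1.0 - alpha_bars[t_prev]) * eps
```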

“These are generative models,” Interviewer 2 continued. “For emotion recognition, which type of model would you prefer?”

“I’d prefer multimodal discriminative models rather than purely generative ones,” I said. “For classification tasks like emotion recognition, discriminative models are more direct. Generative models are useful for augmentation or handling missing modalities, but not as the primary classifier.”

Emotion Recognition Modalities

“In multimodal emotion recognition, what kind of modalities would you use as inputs?” Interviewer 2 asked.

“I would use multiple modalities,” I explained. “Facial expressions through vision transformers, vocal patterns through audio encoders, and potentially behavioral cues. For physiological signals, you could also incorporate heart rate variability (HRV) from wearable devices and skin conductance for detecting emotional arousal.”

The Weightage Question: Equal or Adaptive?

Interviewer 2 posed a practical scenario: “You have three modalities: text, audio, and images. The task is emotion recognition from a video scene. Which modality is more important?”

“It depends on the context of the scene,” I answered. “Generally, audio and visual modalities carry the strongest emotion signals, while text provides semantic grounding.”

“So video and audio have more weightage than text?” he pressed.

“Yes, typically.”

“How would you fuse them? Equal weightage or adaptive?”

This was the crux. “I’d use an adaptive weighting mechanism rather than fixed equal weights. Each modality’s contribution changes depending on its actual reliability in a given context. The weights should be learned, not fixed.”
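
A minimal sketch of what I mean by learned weights (illustrative PyTorch, not any specific production architecture): a gating layer scores each modality embedding per sample, and the fused vector is the softmax-weighted sum, so a noisy or unreliable modality gets down-weighted automatically:

```python
# Adaptive (gated) fusion: per-sample modality weights instead of fixed equal weights.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(dim, 1)  # one reliability score per modality embedding

    def forward(self, embeddings):
        # embeddings: list of (B, dim) tensors, e.g. [text, audio, visual]
        stacked = torch.stack(embeddings, dim=1)                        # (B, M, dim)
        weights = torch.softmax(self.gate(stacked).squeeze(-1), dim=1)  # (B, M)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)            # (B, dim)
        return fused, weights

fusion = AdaptiveFusion()
text, audio, visual = torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256)
fused, w = fusion([text, audio, visual])  # w shows how much each modality contributed
```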

Precision, Recall, and the F1 Question

Interviewer 2 shifted to fundamentals. “What is precision and recall? What’s the physical significance?”

I explained: “Precision measures how many predicted positives are actually positive. High precision means fewer false positives. In emotion recognition, if the model predicts anger, high precision ensures it’s truly anger, not misclassified neutral or happy.”

“And recall?”

“Recall measures how many actual positive samples the model correctly identifies. High recall means the model misses fewer true positives. In emotion recognition, it means correctly detecting most instances of a particular emotion, even if some neutral samples are misclassified.”

“Why is F1 score required?” Interviewer 2 asked. “Why not rely only on precision and recall?”

“The F1 score is the harmonic mean of precision and recall, so it’s especially useful when data is imbalanced or when both false positives and false negatives are costly. Precision alone tells how accurate the positive predictions are. Recall tells how many actual positives we captured. But when one is high and the other is low, it’s hard to judge overall performance from either alone. F1 balances both.”
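
A toy example with made-up numbers shows why: suppose a test set has 12 true "anger" clips, and the model flags 10 clips as anger, 6 of them correctly:

```python
# Toy precision/recall/F1 calculation (illustrative numbers only).
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = "anger", 0 = "not anger": 12 true anger clips, 8 non-anger clips.
y_true = [1] * 12 + [0] * 8
# Model predictions: 6 true positives, 6 missed anger clips, 4 false alarms.
y_pred = [1] * 6 + [0] * 6 + [1] * 4 + [0] * 4

print(precision_score(y_true, y_pred))  # 6 / (6 + 4) = 0.60
print(recall_score(y_true, y_pred))     # 6 / (6 + 6) = 0.50
print(f1_score(y_true, y_pred))         # harmonic mean ~= 0.55
```

Precision looks respectable, recall quietly reveals that half the anger clips were missed, and F1 drags the headline number down to reflect both.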

The CLIP Question: Where I Hit a Wall

Interviewer 1 returned with the toughest question of the interview.

“Have you come across the CLIP paper? Do you know how multimodal features are brought together in large-scale models like CLIP, GPT, or DeepSeek?”

“I’ve studied CLIP,” I said. “In large multimodal models, features from different modalities are brought together through a contrastive embedding space.”

“How is CLIP different from traditional multimodal models? How is it different from the emotion recognition approach you described earlier?”

I explained: “Traditional models like emotion recognition networks use supervised fusion. They combine features from face, voice, etc., and train end-to-end for a specific task like emotion classification. The fusion is explicit and task-dependent.”

“And CLIP?”

“CLIP uses a contrastive pretraining objective on large-scale image-text pairs. It doesn’t fuse modalities directly for classification. Instead, it learns separate encoders for vision and text and aligns them in a shared embedding space using cosine similarity.”
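
The contrastive objective behind that alignment is compact enough to sketch. This is a simplified version of the symmetric loss described in the CLIP paper (the real model learns the temperature; I fix it here for brevity):

```python
# CLIP-style symmetric contrastive loss over a batch of matched image-text pairs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (B, dim) outputs of two *separate* encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(image_emb.size(0))            # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)          # image -> correct caption
    loss_t2i = F.cross_entropy(logits.t(), targets)      # caption -> correct image
    return (loss_i2t + loss_t2i) / 2
```

Matched pairs get pulled together in the shared space while every other pair in the batch is pushed apart; no emotion labels, class lists, or task heads are involved.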

“Once you establish that representation, what happens next?” Interviewer 1 pressed. “You have to utilize it somewhere. Why bring embeddings to that space? What’s the advantage?”

I started: “Mapping all modalities into a shared embedding space enables semantic alignment and generalizes across modalities.”

“But why?” he interrupted. “Earlier we were able to do that too. Why is this model able to achieve something different by doing this? What’s the advantage or shortcoming? How is it different?”

I paused. I knew the surface-level answer, but not the nuanced distinction he was looking for.

“I’m not fully confident in that specific distinction,” I admitted.

“That’s okay,” Interviewer 1 said. “No big deal.”

The Availability Question

The interview wrapped with practicalities.

“What are you doing now? Do you have an active contract somewhere?” Interviewer 1 asked.

“I’m a freelancer, not full-time anywhere. If clients need anything, I work on weekends, not weekdays, and it’s mostly development work, not research.”

“Since we’re an R&D organization, even for interns, you can only have one contractual role at a time. That’s the law in India. If you’re offered this internship, would you wrap up freelance activities?”

“Yes, definitely. I can give full commitment,” I said.

There was a pause.

“So you’ve completed your master’s in mathematics from NIT Patna, and you’re done with academics?” Interviewer 1 confirmed.

“Yes.”

The Honest Reality

The interview ended cordially.

“We’ll share our feedback with HR,” Interviewer 1 said. “This is a multistage assessment process. There’s a possibility of one or two more rounds. HR will let you know if you’re through to the next stage.”

I thanked them. They thanked me. The call ended.

But I knew the truth.

I wasn’t a good fit. Not because I lacked technical knowledge, but because the role itself didn’t align with where I was in my career. I was looking for a consultant or research scientist position, not an internship. The title mattered less than the substance of the work, but the structure of the role mattered a lot.

They were looking for someone to dedicate full-time to a learning-focused internship program. I was looking for a research collaboration where I could contribute immediately while continuing to explore other opportunities.

Neither of us was wrong. We just weren’t aligned.

What I Learned

1. Cold Emails Work (Sometimes)

I cold-emailed the Director of Sony Research India. They called back. That alone was worth it. The worst they can say is no. The best they can say is “Let’s talk.”

2. Know Your “Why”

I went into that interview without clarity on what I wanted. I was curious, but not committed. When Interviewer 1 asked about my availability and contractual obligations, I realized I hadn’t thought through whether this role actually fit my goals.

Curiosity is great. But clarity is better.

3. Depth Beats Breadth in Technical Interviews

They didn’t care about my resume bullet points. They cared about whether I could explain:

  • Why fusion works
  • How alignment differs from fusion
  • What makes modern multimodal models different from older approaches

Surface-level knowledge gets exposed fast.

4. Admitting “I Don’t Know” is Acceptable

When Interviewer 1 asked about CLIP’s specific advantage over traditional multimodal fusion, I didn’t have a nuanced answer. I said so. He moved on.

Nobody expects you to know everything. They expect you to know the difference between what you know and what you don’t.

What I should have said about CLIP:

Looking back, I now realize the key distinction I missed. CLIP’s advantage isn’t just about shared embedding spaces. Traditional multimodal models also create shared representations. The real breakthrough is:

  1. Zero-shot transfer: CLIP can classify images into classes it never saw during training by computing similarity with text descriptions. Traditional models require task-specific training for every new class.

  2. Compositionality: CLIP understands novel combinations like “a robot playing piano” without seeing that exact example, because it learns compositional semantics from natural language.

  3. Scale and supervision: CLIP was trained on 400M image-text pairs from the internet, using natural language as supervision rather than manually labeled datasets. This is fundamentally different from traditional emotion recognition models, which need carefully annotated emotion labels.

  4. Task-agnostic flexibility: CLIP doesn’t learn a specific task. It learns general visual-semantic alignment, making it adaptable to any vision-language task through prompting at inference time.

That’s the nuanced answer I wish I’d given. Sometimes the best learning happens after the interview.
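
To ground the zero-shot point above, this is roughly what inference looks like with the open-source CLIP checkpoint on Hugging Face. No emotion classifier is trained; the prompts, labels, and image path are just my illustration:

```python
# Zero-shot labelling with CLIP: compare one image embedding against text-prompt
# embeddings and read off the most similar prompt. No task-specific training.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["happy", "sad", "angry", "neutral"]
prompts = [f"a photo of a person looking {label}" for label in labels]
image = Image.open("scene_frame.jpg")  # hypothetical frame from a video scene

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

CLIP is far from a great emotion recognizer, especially for subtle expressions, but the mechanism is the point: swap the prompts and you have a different classifier, with zero retraining.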

5. Fit Matters More Than Skill

I could answer most of their technical questions. I had relevant research experience. I’d worked on multimodal systems, federated learning, and GANs. But none of that mattered if the role itself wasn’t a fit.

An internship is a learning-focused, full-time commitment. A consultant role is project-based, flexible, and contribution-focused. I wanted the latter. They offered the former.

Skill gets you in the room. Fit determines if you stay.

The Numbers

  • Interview duration: 41 minutes
  • Number of interviewers: 3
  • Technical questions asked: 39
  • Non-technical/availability questions: 6
  • Total questions: 45
  • Questions I answered confidently: 44
  • Questions I admitted partial uncertainty on: 1 (CLIP’s specific advantage over traditional fusion)
  • Follow-up rounds scheduled: 0 (mutual understanding of misfit)

Key realization: Sometimes the best outcome of an interview is clarity, not an offer.

Should You Cold-Email Research Directors?

Yes, if you:

  • Have relevant research or publications in the domain
  • Understand the organization’s philosophy and can articulate why it resonates with you
  • Are genuinely interested in the role (not just the brand)
  • Are prepared for a technical deep-dive, not a resume review

Skip it if you:

  • Haven’t researched the organization’s work
  • Aren’t clear on what you want from the opportunity
  • Expect the conversation to be exploratory without substance
  • Aren’t prepared to commit if they say yes

Your Next Steps (Do This Now)

If you’re preparing for a multimodal AI interview:

This week:

  1. Study CLIP, ALIGN, and modern vision-language models (10-15 hours)
  2. Understand the difference between task-specific fusion and contrastive pretraining
  3. Practice explaining alignment (dimensional, temporal, semantic) with examples

Next week:

  1. Implement a small multimodal project (emotion recognition, image-text retrieval)
  2. Be ready to explain your design choices (Why this fusion strategy? Why these encoders?)
  3. Prepare questions that show you understand the research domain

Within a month:

You’ll walk into technical interviews with confidence, not just competence.

One Last Thing

If you’ve ever hesitated to send a cold email because you thought “they won’t reply,” send it anyway.

I emailed Sony Research India’s Director not expecting a response. Two days later, I was on a call with three researchers discussing cross-modal attention and semantic alignment.

The interview didn’t lead to an offer. But it led to clarity. And sometimes, that’s more valuable.

The best technical conversations come from taking the first step. This was me taking mine. Your turn.

Found this helpful? Share it with someone preparing for AI research interviews. They’ll thank you later.
